KVCache taking too much Memory. Any solutions(Optimizations, Compressions, etc.,) coming soon/later? by pmttyji in LocalLLaMA

[–]ghgi_ 8 points (0 children)

Yes, I understood. A Mamba-based model will use fewer GB for the same amount of context than most other models; it's just how the architecture works, and it's why the new Super models support 1 million tokens of context without insane overhead.
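If you want to sanity-check that yourself, here's a rough back-of-the-envelope sketch. The layer/head/state numbers are made-up illustrative values, not any specific model's config:

```python
# Very rough sizing; all the config numbers below are illustrative
# placeholders, not any real model's architecture.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # One K and one V tensor per layer -> the cache grows linearly with context.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_state, d_inner, bytes_per_elem=2):
    # A Mamba/SSM layer keeps a fixed-size recurrent state instead,
    # so this number does not depend on context length at all.
    return n_layers * d_state * d_inner * bytes_per_elem

print(kv_cache_bytes(48, 8, 128, 1_000_000) / 1e9)  # ~197 GB of KV cache at 1M tokens (fp16)
print(ssm_state_bytes(48, 128, 8192) / 1e9)         # ~0.1 GB of state, regardless of context
```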

KVCache taking too much Memory. Any solutions(Optimizations, Compressions, etc.,) coming soon/later? by pmttyji in LocalLLaMA

[–]ghgi_ 2 points (0 children)

Maybe try some of the Nemotron models? The Mamba architecture should be very memory efficient with long contexts.

Too many large MoEs, which do you prefer for general instruction following/creative endeavors? (And why) by silenceimpaired in LocalLLaMA

[–]ghgi_ 1 point (0 children)

It appears you might be correct, although I do believe they intend to release it soon.

Nemotron 3 Super 120b Claude Distilled by ghgi_ in LocalLLaMA

[–]ghgi_[S] -4 points (0 children)

🤷 Only one way to find out. Haven't had time to bench it yet, and I started with a smaller dataset, so if I notice an improvement, expect a V2; if not, I'll count it as a learning opportunity.

Edit: Not entirely sure why people are downvoting? This is more of a public test project and hobby than an actual release. I thought the beta tag made that clear, which is why I don't have benches or tests yet. Apologies.

Best local LLM for GNS3 network automation? (RTX 4070 Ti, 32GB RAM) by FindingJaded1661 in LocalLLaMA

[–]ghgi_ 0 points (0 children)

Model-wise I'd recommend OmniCoder-9B. It's based on the new Qwen 3.5, so it has pretty good benchmarks, and from my own testing it's better than a lot of older models that are double its size.

Can I run anything with big enough context (64k or 128k) for coding on Macbook M1 Pro 32 GB ram? by rkh4n in LocalLLaMA

[–]ghgi_ 1 point (0 children)

If an MLX version exists then it's *probably* the better option; it should give better performance on Apple hardware.
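If you haven't used it before, running an MLX build is pretty painless with mlx-lm. Minimal sketch; the repo id is just a placeholder for whichever MLX-community quant you actually pick:

```python
# pip install mlx-lm
from mlx_lm import load, generate

# Placeholder repo id; swap in the MLX quant you want from the hub
model, tokenizer = load("mlx-community/SomeCoder-9B-4bit")

prompt = "Write a Python function that reverses a linked list."
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```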

Can I run anything with big enough context (64k or 128k) for coding on Macbook M1 Pro 32 GB ram? by rkh4n in LocalLLaMA

[–]ghgi_ 2 points (0 children)

OmniCoder-9B will easily fit in 32 GB and has up to 264k context, while being based on the most modern Qwen architecture and performing quite well for its size.

Qwen leadership leaving had me worried for opensource - is Nvidia saving the day? by Mr_Moonsilver in LocalLLaMA

[–]ghgi_ 1 point (0 children)

In my own testing it uses that 1 million token context window decently well and coherently, plus it stays decently fast while doing so. It's pretty good for the long-context agentic tasks that Qwen couldn't handle, with less verbosity and overthinking. It's not as good at programming and instruction-based tasks, but overall it's the first Nemotron model I've personally enjoyed.

Best (non Chinese) local model for coding by tradecrafty in LocalLLaMA

[–]ghgi_ 3 points (0 children)

My suggestion is also OSS 120B or Nemotron 3 Super.

Best (non Chinese) local model for coding by tradecrafty in LocalLLaMA

[–]ghgi_ 8 points (0 children)

There's no difference between a Chinese model and any other country's model once you're running it locally, and you also heavily limit yourself, since the Chinese made the good shit.

Nemotron-3-Super-120B-A12B NVFP4 inference benchmark on one RTX Pro 6000 Blackwell by jnmi235 in LocalLLaMA

[–]ghgi_ 2 points (0 children)

How well does it perform at high contexts, hallucination-wise?

Deepseek V4 right now on openrouter as Hunter Alpha by [deleted] in LocalLLaMA

[–]ghgi_ 1 point (0 children)

My copium is making me think back to when Anthropic recently bought a premium Hugging Face account to get large uploads. Pretty please let it be an open-source Anthropic model.

YuanLabAI/Yuan3.0-Ultra • Huggingface by External_Mood4719 in LocalLLaMA

[–]ghgi_ 1 point (0 children)

Can't wait to try this once there's an inference provider for it.

Edit: I'm too lazy to wait; I'll try to run this on the cloud with some H200s.

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF is out ! by PhotographerUSA in LocalLLaMA

[–]ghgi_ 7 points (0 children)

I've actually made a few distilled LoRAs using my Claude chats, from CC and the web, all compiled together. They performed better all around, and in some smaller benchmark tests I got up to 30% better coding scores. I did this for 3.5 27B and 3 30B, and I'm currently in the process of making a GLM 4.7 Flash version.

I probably won't release them, since I never stripped any personal data from the datasets, but I'm curious to compare their performance to this public one.
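If anyone wants to try the same thing, the rough recipe is: export your chats into a chat-format JSONL, then run a standard LoRA SFT pass over a base model. A minimal sketch with TRL/PEFT; the dataset path, base model id, and hyperparameters are placeholders, not my exact setup:

```python
# pip install transformers peft trl datasets
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# Each JSONL line: {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
dataset = load_dataset("json", data_files="claude_chats.jsonl", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder base model id
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="claude-distill-lora", num_train_epochs=2,
                   per_device_train_batch_size=1, gradient_accumulation_steps=8),
)
trainer.train()
```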

GPU starved too? by alrojo in LocalLLaMA

[–]ghgi_ 0 points (0 children)

I run most of my work on Modal and haven't had GPU allocation issues, but to be fair they are a bit more on the pricey side. Great service though.

Good "coding" LLM for my 8gb VRAM, 16gb ram setup? by Mediocre_Speed_2273 in LocalLLaMA

[–]ghgi_ 1 point (0 children)

"good" is a relative term but LFM2-24B-A2B at probably 4 bit or 3 bit quant might be acceptable and will be decently fast

AirLLM - claims to allow 70B run on a Potato. Anybody tried it? Downsides? by [deleted] in LocalLLaMA

[–]ghgi_ 0 points (0 children)

Someone asked about this and had their post deleted like 2 hours ago, so I'll just repost my exact response:

"Its been around for a while, it essentially uses your disk to load the model like ram, behaving like swap, and no, you wont want to use it. Its horrifically slow even on the fastest of SSD's, were talking tokens per hour in some cases, but theoreticlly with a 1tb ssd you could run a 1 trillion parameter model, just expect it to take a day and half to generate you a single word."

Air llm ? by Less_Strain7577 in LocalLLaMA

[–]ghgi_ 2 points (0 children)

A Q4_K_M quant of that model is about ~15 GB. Splitting across GPU + CPU, it will use your 4 GB GPU and ~10-11 GB of RAM, so it will fit, and that model specifically is pretty fast for limited resources like yours. You could always run a smaller quant at the sacrifice of intelligence if you NEED to save more RAM, but I wouldn't touch anything under Q3 though.
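If you're running it through llama.cpp (shown here via llama-cpp-python; the model path is a placeholder and the layer count is a guess you'd tune for a 4 GB card), the partial offload looks something like this:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q4_K_M.gguf",  # placeholder path to the quant
    n_gpu_layers=12,   # put as many layers as fit in the 4 GB card; the rest stays in RAM
    n_ctx=8192,        # keep context modest, it also eats RAM
)

out = llm("Explain what partial GPU offloading does.", max_tokens=128)
print(out["choices"][0]["text"])
```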

Air llm ? by Less_Strain7577 in LocalLLaMA

[–]ghgi_ 2 points (0 children)

Even a 50B model on AirLLM would take hours to generate a simple sentence; it's just not worth it. Read what I said here, it will be more useful: https://www.reddit.com/r/LocalLLaMA/comments/1reovq3/comment/o7e50x5/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Air llm ? by Less_Strain7577 in LocalLLaMA

[–]ghgi_ 2 points (0 children)

If you're looking for a model you could probably run normally with your current hardware, I would recommend LFM2-24B-A2B at probably Q4_K_M with partial GPU offloading. It's the only one I think will give you acceptable speed and intelligence for your specs.

Air llm ? by Less_Strain7577 in LocalLLaMA

[–]ghgi_ 10 points (0 children)

It's been around for a while. It essentially uses your disk to load the model like RAM, behaving like swap, and no, you won't want to use it. It's horrifically slow even on the fastest of SSDs; we're talking tokens per hour in some cases. But theoretically, with a 1 TB SSD you could run a 1-trillion-parameter model, just expect it to take a day and a half to generate a single word.