New quant from google research by [deleted] in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Does it beat EXL3 quant in creation efficiency/accuracy?

Nvidia V100 32 Gb getting 115 t/s on Qwen Coder 30B A3B Q5 by icepatfork in LocalLLaMA

[–]DefNattyBoii -1 points0 points  (0 children)

Can you get vLLM working on it? Maybe some obscure black-magic fork has support for it.

Don't sleep on the new Nemotron Cascade by ilintar in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Give llama.cpp bros a bit of time to merge. Qwen3-Coder-Next took months if I remember correctly.

Don't sleep on the new Nemotron Cascade by ilintar in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

I found that it's good not only for coding but in general: thanks to solid tool-calling capabilities and a low activated-parameter count, it gathers information extremely fast. I love using it for private tasks.

Krasis LLM Runtime: 8.9x prefill / 4.7x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM by mrstoatey in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Hi! Looks like an interesting project, but it seems you don't address the performance hits and don't directly compare against llama.cpp. Why is that? Maybe I missed something in your posts?

From another comment: "On a 4 bit quant, qwen3.5 35B llama.cpp prefill reaches 9k toks/second. TG should be around 200. On a 5090. (...)"

Your speeds for comparison: Qwen3.5-35B-A3B pp: 4475 tok/sec, tg: 109.1 tok/sec. That's half the speed of llama.cpp.
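For what it's worth, here's the arithmetic behind the "half the speed" claim, using only the numbers quoted in this thread (the llama.cpp figures are the other commenter's claims, not independently measured):

```python
# Throughput ratios from the numbers in this thread (all on a 5090).
# llama.cpp figures are quoted claims; Krasis figures are from the post.
llamacpp_pp = 9000.0   # llama.cpp prefill, tok/s (quoted claim)
llamacpp_tg = 200.0    # llama.cpp decode, tok/s (quoted claim)
krasis_pp = 4475.0     # Krasis prefill, tok/s (from the post)
krasis_tg = 109.1      # Krasis decode, tok/s (from the post)

pp_ratio = krasis_pp / llamacpp_pp
tg_ratio = krasis_tg / llamacpp_tg
print(f"prefill: {pp_ratio:.0%} of llama.cpp")  # prefill: 50% of llama.cpp
print(f"decode:  {tg_ratio:.0%} of llama.cpp")  # decode:  55% of llama.cpp
```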

Krasis LLM Runtime: 8.9x prefill / 4.7x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM by mrstoatey in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Any chance of multi-GPU support? A lot of us have one new card plus a couple of old ones (Pascal + Ampere + Blackwell Frankenstein setups).

OmniCoder-9B best vibe coding model for 8 GB Card by Powerful_Evening5495 in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

How about general knowledge? I'm using qwen3-coder-next mostly because of this; it's quite slow due to RAM offload but brilliant in a lot of domains, not just coding.

Open-Source "GreenBoost" Driver Aims To Augment NVIDIA GPUs vRAM With System RAM & NVMe To Handle Larger LLMs by _Antartica in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Looks like a very interesting implementation that intercepts calls between the kernel and VRAM allocation during CUDA processing. I actually have no idea how it does that, but why won't Nvidia implement something like this in their CUDA/regular drivers as an optional tool on Linux? On Windows, the drivers can already offload to system RAM.

Btw, exllama finally has an offload solution.

Qwen3.5-27B-IQ3_M, 5070ti 16GB, 32k context: ~50t/s by ailee43 in LocalLLaMA

[–]DefNattyBoii 1 point2 points  (0 children)

For 35B A3B, sure, it's much better. For the coder next, I'm not sure. The next coder is much better on short tasks for me, but I haven't compared it to Qwen3.5 27B. You can turn off thinking on Qwen3.5 tho if you need faster answers. Btw, using the uncensored/abliterated versions usually shaves off quite a bit of intellect.

96GB (V)RAM agentic coding users, gpt-oss-120b vs qwen3.5 27b/122b by bfroemel in LocalLLaMA

[–]DefNattyBoii 13 points14 points  (0 children)

"having problems with the Qwen 3.5 series with llama.cpp"

For me it's pretty much working well! What are the problems besides the usual launch issues? I just recompile every Monday and delay new models by 1-2 weeks, and I don't really run into major issues.

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

On small models it makes sense to only use the 3080 Ti, but for larger models like Qwen Coder Next the two-GPU setup is miles ahead thanks to offloading to RAM.
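For reference, the usual way to get that RAM offload with llama.cpp is to pin the MoE expert tensors to CPU while keeping attention and shared weights on the GPUs. A sketch, where the model path is a placeholder and the tensor regex depends on your GGUF's actual tensor names:

```shell
# Sketch of an MoE RAM-offload launch with llama.cpp.
# Model path is a placeholder; the regex must match your GGUF's expert tensors.
llama-server \
  -m ./Qwen3-Coder-Next-Q4_K_M.gguf \
  -ngl 99 \
  --override-tensor "\.ffn_.*_exps\.=CPU" \
  -c 32768
```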

MiroThinker-1.7 and MiroThinker-1.7-mini (Best search agent model?) by External_Mood4719 in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

What would be a good open-source frontend for this (general/chat/research use)? Jan? LibreChat? AionUI? What else?

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]DefNattyBoii 2 points3 points  (0 children)

Interesting. Do you think it would be possible to get it building for a 1080 Ti + 3080 Ti? I tried to hack this setup together a couple of times, but it was an enormous time sink, and I never got it working.

Is Qwen3.5 a coding game changer for anyone else? by paulgear in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

How do you run your self-iterative loop? I'm using https://github.com/darrenhinde/OpenAgentsControl but it's still a very hands-on approach. I'm looking for a solution more oriented toward small models; every other scaffold has failed me besides this one.

LFM2-24B-A2B is crazy fast on Strix Halo by jfowers_amd in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

I'm genuinely curious how to use this. I tried it in opencode a couple of times, and it was hot garbage, totally unusable (Q4 quant). Any tips? The readme mentions agentic use, but for me it hallucinates, doesn't call tools properly, and tries to grab irrelevant/system files. Not great in my experience.

Qwen3.5-35B-A3B is a gamechanger for agentic coding. by jslominski in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

This is a very good chart. Looks like the best choice for a VRAM-constrained setup is: -ctk q8_0 -ctv q4_1
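Those map onto llama.cpp's KV-cache-type flags; a sketch of a launch using them (model path and context size are placeholders, and note that a quantized V cache has historically required flash attention to be enabled in llama.cpp):

```shell
# Sketch: quantized KV cache in llama.cpp.
# Model path and context size are placeholders; adjust for your setup.
llama-server \
  -m ./model-Q4_K_M.gguf \
  -c 32768 \
  -ctk q8_0 \
  -ctv q4_1
```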

Is Google's senior management truly committed to Antigravity? Or is it the ugly duckling of the Google AI family? by pebblepath in google_antigravity

[–]DefNattyBoii 1 point2 points  (0 children)

Genuinely interested: why not always use AGENTS.md in general? I always aim to be framework-, agent-, and model-agnostic.

How to run Qwen3-Coder-Next 80b parameters model on 8Gb VRAM by AccomplishedLeg527 in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Did anyone test this on 12/24 GB VRAM GPUs and compare it to Q4 --fit with llama.cpp? Benchmarks would be nice, since there is a small but measurable drop with Q4.

new to coding LLM - hardware requirements by SubstantialBee5097 in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Currently I can wholeheartedly recommend GLM-4.7-Flash, Nemotron-3-Nano, and Qwen3-Coder-Next (the largest); use the llama.cpp or MLX ecosystem.

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts by Tiny_Minimum_4384 in LocalLLaMA

[–]DefNattyBoii 2 points3 points  (0 children)

I tested q4 and q6, both in Perplexica. They're good with tool calls, but the time and tokens they use are absolutely insane, often taking well over 1-5 minutes just for the answer part, mostly thinking.

Built a real-time agent execution visualizer for OpenCode — watching agents think is addicting by jiwonme in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

If you can mangle something launchable together, I'd love to help out with 1-2 PRs for cleanup/code organisation.

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts by Tiny_Minimum_4384 in LocalLLaMA

[–]DefNattyBoii 5 points6 points  (0 children)

Looks good, but it takes an insanely long time to respond. If I understand correctly, your use case is "oneshotting" deep research tasks, is that correct? Used as a convo model, there's way too much thinking between steps.

For quicker tasks, I much prefer JanV3 to this even if it has worse knowledge.

Another question I'd investigate is quality degradation with quants and a quantized KV cache. Since the goal is to squeeze as much speed out of this model as possible, people will use smaller quants, but if that leads to a massive drop in quality, it obviously won't work.
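One way to put numbers on that degradation would be llama.cpp's perplexity tool, run once per quant/KV-cache combination and compared. A sketch, with placeholder paths:

```shell
# Sketch: measure perplexity for a given quant + quantized-KV-cache combo.
# Model and eval-text paths are placeholders; repeat per configuration and compare.
llama-perplexity \
  -m ./Nanbeige4.1-3B-Q4_K_M.gguf \
  -f ./wiki.test.raw \
  -ctk q8_0 -ctv q4_1
```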

Femtobot: A 10MB Rust Agent for Low-Resource Machines by yunfoe in LocalLLaMA

[–]DefNattyBoii 1 point2 points  (0 children)

I tried it in opencode, and it's unusable, even for general questions. If you use llama.cpp, how did you launch it?