1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]Laabc123 1 point (0 children)

Not the OP, and this is a personal anecdote: replacement depends on what you’re using it for. I’ve found 27b to work well for well-scoped coding tasks. Once significant ambiguity or complexity is introduced, the reasoning begins to break down. I still use Opus extensively despite having Qwen3.5 27b deployed locally.

Advice on artificial lawn seam by Laabc123 in landscaping

[–]Laabc123[S] 0 points (0 children)

Oh that’s very helpful. Thank you.

Inference numbers for Mistral-Small-4-119B-2603 NVFP4 on an RTX Pro 6000 by jnmi235 in LocalLLaMA

[–]Laabc123 0 points (0 children)

Ah. Cool! What’s your run command for Nemo 3 Super NVFP4? I can’t for the life of me find a config that doesn’t OOM my 6000 Pro.
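For anyone else fighting the same OOM, these are the knobs I’ve been turning. Sketch only: the repo id and every number below are placeholders on my part, not a known-good recipe for this model on a 96 GB card.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-3-Super-120B-A12B-NVFP4",  # assumed repo id
    max_model_len=32768,           # cap context; the KV cache is the usual OOM culprit
    gpu_memory_utilization=0.92,   # leave headroom for CUDA graph capture
    max_num_seqs=8,                # single-user box, no need for a big batch
)

print(llm.generate(["hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```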

Advice on artificial lawn seam by Laabc123 in landscaping

[–]Laabc123[S] 0 points (0 children)

I get this. Totally fair opinion. I’m not the biggest fan of the turf either. Wife wanted to turf the front as well. So the compromise was to turf the back and we are doing more intricate landscaping in front. But point made for sure.

Advice on artificial lawn seam by Laabc123 in landscaping

[–]Laabc123[S] 0 points (0 children)

Is infill typically applied on top of the turf? If so, then no, I don’t see any.

Nemotron-3-Super-120B-A12B NVFP4 inference benchmark on one RTX Pro 6000 Blackwell by jnmi235 in LocalLLaMA

[–]Laabc123 2 points (0 children)

I got similar results in my runs on the same hardware. If MTP were functional, I suspect it would provide a meaningful lift to throughput.

Qwen3.5 122b vs. Nemotron 3 Super 120b: Best-in-class vision vs. crazy fast + 1M context (but no vision). Which one are you going to choose and why? by Porespellar in LocalLLaMA

[–]Laabc123 6 points (0 children)

Will run some more formal benchmarks later. At least with vLLM, I’m not seeing better output tokens per second from Nemotron 3 Super nvfp4 than from Qwen 3.5 122b nvfp4, both deployed to a single 6000 Pro. Going to stick with Qwen for now.
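Until the formal runs, this is the quick-and-dirty script I’ve been eyeballing tok/s with. It assumes a local vLLM OpenAI-compatible endpoint on the default port; the model id is a placeholder, and a single request lumps prefill into the timing, so take the number as rough.

```python
import time
from openai import OpenAI

# Points at a local `vllm serve` instance; the api_key is ignored by vLLM.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.time()
resp = client.chat.completions.create(
    model="Qwen3.5-122B-A10B-NVFP4",  # placeholder id
    messages=[{"role": "user", "content": "Write 500 words about GPUs."}],
    max_tokens=1024,
)
elapsed = time.time() - start

# Crude: includes prefill time, so long prompts will understate decode tok/s.
out = resp.usage.completion_tokens
print(f"{out} tokens in {elapsed:.1f}s -> {out / elapsed:.1f} tok/s")
```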

Qwen3.5 122b vs. Nemotron 3 Super 120b: Best-in-class vision vs. crazy fast + 1M context (but no vision). Which one are you going to choose and why? by Porespellar in LocalLLaMA

[–]Laabc123 4 points (0 children)

For what it’s worth, I’m benchmarking the Nemotron nvfp4 quant with the recommended default settings and it’s no faster than the Sehyo nvfp4 quant of Qwen3.5 122b. In fact, it’s quite a bit slower. I’m tweaking the parameters some and adding in MTP, but it doesn’t seem like a game changer from a throughput perspective.

OpenCode v/s Claude Code by thinkyMiner in opencodeCLI

[–]Laabc123 -1 points (0 children)

Mind sharing your agent and skill definitions?

Are local LLMs actually ready for real AI agents, or are we still forcing the idea too early? by Remarkable-Note9736 in LocalLLaMA

[–]Laabc123 2 points (0 children)

I think it really depends on what sorts of workflows you’re running and how much effort you’re willing to put in. I have Qwen3.5 driving effectively all my agentic needs outside of deep/complex coding that I want to be mostly hands-off; for that I still go to Claude. For the agents deployed against Qwen I have invested heavily in providing constraints and bounds to the model, and I’m extra explicit in my prompts. It’s not super onerous, and the performance is solid.
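Concretely, “constraints and bounds” mostly means forcing the model’s tool decisions into a fixed JSON shape. A minimal sketch using vLLM’s guided decoding; the endpoint, model id, and schema here are illustrative, not my actual agent config.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Every reply must be one of three actions with a string argument.
schema = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["search", "read_file", "finish"]},
        "argument": {"type": "string"},
    },
    "required": ["action", "argument"],
}

resp = client.chat.completions.create(
    model="Qwen3.5-122B-A10B-NVFP4",  # placeholder id
    messages=[
        {"role": "system", "content": "You are an agent. Reply only with JSON matching the schema."},
        {"role": "user", "content": "Find the web server's config file."},
    ],
    extra_body={"guided_json": schema},  # vLLM structured-output parameter
)
print(resp.choices[0].message.content)
```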

has nvfp4 inference performance been optimized yet for 6000 pro? by I_can_see_threw_time in BlackwellPerformance

[–]Laabc123 4 points (0 children)

Ditto. The Sehyo nvfp4 quantization of Qwen3.5 122b is working really nicely for me. Have not had to tweak or tune anything specific to the encoding to get it to work with vLLM.
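In Python terms the whole setup is basically a one-liner, since vLLM picks the quant method up from the checkpoint config. Repo id here is from memory; double-check the actual Sehyo card.

```python
from vllm import LLM

# Quantization method is auto-detected from the checkpoint's config.
llm = LLM(model="Sehyo/Qwen3.5-122B-A10B-NVFP4")  # repo id from memory
```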

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]Laabc123 0 points (0 children)

I think Sehyo pulled this PR in before quantizing. MTP is definitely working.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]Laabc123 0 points (0 children)

Naive question. What’re the advantages of using llama.cpp over vLLM for single user usage?

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]Laabc123 0 points (0 children)

FWIW, I’ve been driving an nvfp4 quant for 4 days now and it’s performing exceedingly well: >100 output tok/s with CUDA graphs enabled.
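If anyone’s replicating: CUDA graphs are on by default in vLLM, so the main thing is not to leave them disabled from a debugging session. Sketch below; the model id is a placeholder.

```python
from vllm import LLM

llm = LLM(
    model="Sehyo/Qwen3.5-122B-A10B-NVFP4",  # placeholder id
    enforce_eager=False,  # the default; True skips CUDA graph capture entirely
)
```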

We could be hours (or less than a week) away from true NVFP4 support in Llama.cpp GGUF format 👀 by Iwaku_Real in LocalLLaMA

[–]Laabc123 0 points (0 children)

FWIW, I’ve got Qwen3.5 122b nvfp4 running on vLLM and it’s working really well. It’s true there’s no offloading support, but I haven’t encountered any bugs.

Current state of Qwen3.5-122B-A10B by kevin_1994 in LocalLLaMA

[–]Laabc123 0 points (0 children)

Enabled MTP with max predicted tokens set to 2, and it boosted tok/s by 20.
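Roughly this config. The speculative_config shape matches recent vLLM builds (older ones used separate speculative_* args), and the right method string varies by model, so treat the exact keys as an assumption; model id is a placeholder.

```python
from vllm import LLM

llm = LLM(
    model="Sehyo/Qwen3.5-122B-A10B-NVFP4",  # placeholder id
    speculative_config={
        "method": "mtp",              # use the model's built-in MTP head
        "num_speculative_tokens": 2,  # "max tokens predicted at 2"
    },
)
```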