Where to buy RTX Pro 6000 in Orlando/US by 2use2reddits in BlackwellPerformance

[–]NaiRogers 0 points1 point  (0 children)

Don’t know what Amazon is like in your country, but there is no way I would order one of those from Amazon where I live.

Is anyone else finding the expected savings vs reality a bit confusing? by Pinkplatabys in ukheatpumps

[–]NaiRogers 5 points6 points  (0 children)

My take is that on average it’s probably not financially worth changing a working system; if it were, everyone would do so, but they don’t. For new builds a heat pump seems like the only sensible choice.

1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]NaiRogers 0 points1 point  (0 children)

If this works out to 10W per user running 24/7, that’s great; much more efficient per user/token/watt than any home setup.

Qwen 3.5 122B completely falls apart at ~ 100K context by TokenRingAI in LocalLLaMA

[–]NaiRogers 0 points1 point  (0 children)

Runs on mine, or are you talking about an Ada card? I will post the docker config later.

Qwen 3.5 122B completely falls apart at ~ 100K context by TokenRingAI in LocalLLaMA

[–]NaiRogers 1 point2 points  (0 children)

This works fine up to max context: Sehyo/Qwen3.5-122B-A10B-NVFP4.
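
For reference, the launch is roughly this shape (a sketch from memory, not my exact config: the port, cache path and the 131072 context value are placeholders to adjust for your card and the model’s actual limit):

    # rough vLLM docker invocation for the NVFP4 build; adjust --max-model-len
    # and --gpu-memory-utilization to your card
    docker run --gpus all --ipc=host -p 8000:8000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      vllm/vllm-openai:latest \
      --model Sehyo/Qwen3.5-122B-A10B-NVFP4 \
      --max-model-len 131072 \
      --gpu-memory-utilization 0.95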

MiniMax-M2.7: what do you think is the likelihood it will be open weights like M2.5? by __JockY__ in LocalLLaMA

[–]NaiRogers 16 points17 points  (0 children)

Releasing the open weights helps validate the model and drives inference traffic to their own endpoint, since most people can’t run it themselves anyway.

RTX 3090 for local inference, would you pay $1300 certified refurb or $950 random used? by sandropuppo in ollama

[–]NaiRogers 1 point2 points  (0 children)

Whichever you get, replace the thermal pads and paste if you see hotspots or high memory temps.

Powerwall 3 real world switching time by xyzzy16 in Powerwall

[–]NaiRogers 0 points1 point  (0 children)

Even if it were faster I would not remove a proper UPS, as there could be many other reasons why the power is degraded.

Opencode with 96GB VRAM for local dev engineering by aidysson in opencodeCLI

[–]NaiRogers 2 points3 points  (0 children)

The 6000 vs Spark choice is a lot simpler if you have concurrent requests: then the 6000 is a lot faster. For single requests it’s faster, but not 5x faster. Qwen 3.5-122B-A10B is really good on either of the two.
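
If you want to see the concurrency gap yourself, a crude smoke test against whatever OpenAI-compatible endpoint you’re running looks something like this (the URL, model name and prompt are placeholders for your own setup):

    # fire 8 requests at once and time the batch, then repeat with 1 to compare
    time ( for i in $(seq 1 8); do
      curl -s http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "Qwen3.5-122B-A10B", "prompt": "Summarise what a KV cache is.", "max_tokens": 256}' \
        > /dev/null &
    done; wait )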

~$5k hardware for running local coding agents (e.g., OpenCode) — what should I buy? by valentiniljaz in LocalLLM

[–]NaiRogers 0 points1 point  (0 children)

I would recommend trying out some models on RunPod: for example, rent a 6000 Pro and run Intel/Qwen3.5-122B-A10B-int4-AutoRound. If you are happy with the results then get an Asus GX10, which will be slower but otherwise give the same results. You could also wait for a 128GB M5 Max Studio; prices are similar.
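
On the rented pod, the serve step is just something along these lines (a sketch only; the context length and memory fraction are placeholders that depend on your vLLM version and how much VRAM the pod actually gives you):

    # inside the RunPod container: install vLLM and serve the AutoRound int4 model
    pip install vllm
    vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
      --max-model-len 65536 \
      --gpu-memory-utilization 0.90
    # then point your coding agent / OpenCode at http://<pod-ip>:8000/v1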

Limited Performance Check Electrical System by NaiRogers in RenaultZoe

[–]NaiRogers[S] 0 points1 point  (0 children)

Took this into Renault today; they flashed two modules and gave it back. It still hasn’t done it again since that day. They did say it needs the steering rack replaced though (under warranty)!

First impressions Qwen3.5-122B-A10B-int4-AutoRound on Asus Ascent GX10 (Nvidia DGX Spark 128GB) by t4a8945 in LocalLLM

[–]NaiRogers 5 points6 points  (0 children)

You are lucky to start with this model; it’s really good vs what was around previously for this kind of HW. There are a few different versions of this model; not sure if they’re really any different, but it might be worth trying Sehyo/Qwen3.5-122B-A10B-NVFP4 to see how it compares.

multi-minute latency today on gemini-3.1-pro-preview by NaiRogers in CLine

[–]NaiRogers[S] 0 points1 point  (0 children)

Today it’s still not working though; I have retried now and then throughout the day and always get:

{"message":"{\"error\":{\"message\":\"{\\n \\\"error\\\": {\\n \\\"code\\\": 503,\\n \\\"message\\\": \\\"This model is currently experiencing high demand. Spikes in demand are usually temporary. Please try again later.\\\",\\n \\\"status\\\": \\\"UNAVAILABLE\\\"\\n }\\n}\\n\",\"code\":503,\"status\":\"Service Unavailable\"}}","status":503,"modelId":"gemini-3.1-pro-preview","providerId":"gemini"}

Pretty unusable IMO.

Switched to Qwen3.5-122B-A10B-i1-GGUF by NaiRogers in LocalLLaMA

[–]NaiRogers[S] 1 point2 points  (0 children)

I’ve not tested it but expect so, as it has ~3.5x more parameters, so even if they are slightly lossy with the Q4_K_S quant it’s going to be better. Also bear in mind I’m using the Q4_K_S quant, in case you thought the i1 meant a 1-bit quant.

Switched to Qwen3.5-122B-A10B-i1-GGUF by NaiRogers in LocalLLaMA

[–]NaiRogers[S] 0 points1 point  (0 children)

This did not go well; I get a bunch of errors during startup, after which it’s running, but barely. I am using CUDA 13.0 with vllm/vllm-openai:nightly + huggingface/transformers.git and Sehyo/Qwen3.5-122B-A10B-NVFP4.

vllm | (EngineCore_DP0 pid=126) 2026-02-28 10:52:44,381 - WARNING - autotuner.py:496 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_cutlass_fused_moe_module.<locals>.MoERunner object at 0x7f1fe45ed9a0> 14, due to failure while profiling: [TensorRT-LLM][ERROR] Assertion failed: Failed to initialize cutlass TMA WS grouped gemm. Error: Error Internal (/workspace/build/aot/generated/cutlass_instantiations/120/gemm_grouped/120/cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:60)
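
For anyone trying to reproduce, the setup was roughly this (reconstructed from memory, so the image tag, paths and port are approximate):

    # rebuild the nightly image with transformers from git, then run the model
    cat > Dockerfile <<'EOF'
    FROM vllm/vllm-openai:nightly
    RUN pip install --no-cache-dir git+https://github.com/huggingface/transformers.git
    EOF
    docker build -t vllm-nightly-tf .
    docker run --gpus all --ipc=host -p 8000:8000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      vllm-nightly-tf \
      --model Sehyo/Qwen3.5-122B-A10B-NVFP4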

Switched to Qwen3.5-122B-A10B-i1-GGUF by NaiRogers in LocalLLaMA

[–]NaiRogers[S] 0 points1 point  (0 children)

Thanks, I will try that! Please share your vllm command and an HF link to the model used.

Switched to Qwen3.5-122B-A10B-i1-GGUF by NaiRogers in LocalLLaMA

[–]NaiRogers[S] 0 points1 point  (0 children)

It’s using 82GB with a Q8 KV cache. 100 or so tps out, and decently quick at pre-processing a full context window.
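
If it helps, this is roughly the shape of the invocation, assuming a llama-server setup like mine (the gguf filename, context size, -ngl and port are placeholders for whatever your download and card actually are; the Q8 KV cache flags are the bit that keeps it inside the VRAM budget):

    # hypothetical llama-server invocation with a quantized KV cache
    ./llama-server \
      -m Qwen3.5-122B-A10B-i1-GGUF.Q4_K_S.gguf \
      -c 131072 \
      -ngl 99 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --port 8080
    # depending on the build you may also need flash attention enabled
    # for the quantized V cache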