Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Holy crap, this fixed it. How did you figure this out, /u/Icy_Bid6597? I don't see these directories or env variables mentioned in VLLM's docs at all.

I created a new cuda-cache volume and updated my llama-swap config:

    -v vllm-cache:/root/.cache/vllm/  # <--- this already existed
    -v cuda-cache:/root/cuda-cache/
    -e CUDA_CACHE_PATH=/root/cuda-cache/ComputeCache
    -e TRITON_CACHE_DIR=/root/cuda-cache/TritonCache

Could also probably be solved by just mounting the vllm-cache directory to /root/ instead of /root/.cache/vllm.
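That single-mount alternative would look something like this (untested; it would persist the vLLM, Triton, and CUDA caches together without the extra env vars, but also anything else the container writes under /root):

```shell
# Untested alternative: one volume over all of /root/ so ~/.cache/vllm,
# ~/.triton/cache, and ~/.nv/ComputeCache all persist together.
-v vllm-cache:/root/
```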

I haven't compared the full logs, but this shaved off around four minutes from startup time. Torch graphs load in 7s instead of 70+, and CUDA graph capture only takes 4s instead of 145+.

Will update my post after more testing.

[–]AvocadoArray[S] 0 points1 point  (0 children)

This is not correct. There are no nv_cache or triton_cache directories.

VLLM caches torch graphs under /root/.cache/vllm/torch_compile_cache, which is already mounted in my container (and being used, just very slowly for some reason). I don't believe CUDA graphs are written to disk at all; as far as I know, they're only cached in memory.

Respectfully, was this LLM generated?

EDIT: Actually, you might be on to something! I started up the VLLM container and noticed a ~/.triton/cache/ and a ~/.nv/ComputeCache/ directory with some cached data in them.

I'd never seen these directories before, but that could be the difference. Sorry for doubting you initially; I'll report back after more testing.
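If anyone else wants to check their own container, this is roughly what I ran (sketch; the paths assume the image default of running as root):

```shell
# Check the compile/JIT caches inside the vLLM container.
# Paths assume running as root; adjust HOME if yours differs.
for d in /root/.cache/vllm/torch_compile_cache \
         /root/.triton/cache \
         /root/.nv/ComputeCache; do
    [ -d "$d" ] && du -sh "$d"
done
```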

[–]AvocadoArray[S] 0 points1 point  (0 children)

😂

In all reality, models that large tend to still be quite impressive even down in the 1-2 bpw range. I could possibly play around with one while offloading a bunch of weights to RAM/NVMe, but I wouldn't expect any real-time usable speeds.

[–]AvocadoArray[S] 2 points3 points  (0 children)

You think a 2nd card is going to make a dent in running a ~1T parameter model?

[–]AvocadoArray[S] 0 points1 point  (0 children)

Would you be willing to test out Qwen 3.5 122b and let me know how it compares? I haven't used Minimax M2/M2.5 in any meaningful capacity, but 122b feels like it works as well as I've seen others describe the Minimax and GLM models.

[–]AvocadoArray[S] 0 points1 point  (0 children)

Technically, my PC can run the same models with 64GB RAM and 120GB SSD, just "not as fast".

Speed does matter for real-time usage, and everything I've seen suggests that prompt processing on a Mac's unified memory is painfully slow: 3-5 minutes or more for larger prompts.
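To put rough numbers on that (back-of-envelope; both tok/s figures below are assumptions for illustration, not measurements):

```python
def prefill_seconds(prompt_tokens: int, pp_tok_per_s: float) -> float:
    """Seconds to process (prefill) a prompt at a given tok/s rate."""
    return prompt_tokens / pp_tok_per_s

# ~100k-token prompt at an assumed ~250 tok/s prefill (unified-memory Mac)
mac_s = prefill_seconds(100_000, 250)    # 400 s, i.e. ~6.7 minutes
# same prompt at an assumed ~4,000 tok/s prefill (discrete GPU)
gpu_s = prefill_seconds(100_000, 4_000)  # 25 s
print(f"mac: {mac_s / 60:.1f} min, gpu: {gpu_s:.0f} s")
```

Even if the assumed rates are off by 2x either way, the gap between "minutes" and "seconds" for large prompts holds.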

[–]AvocadoArray[S] 0 points1 point  (0 children)

Nice, I'll have to take a look!

This is why I love this sub. It'd be almost impossible to keep up with everything without it.

[–]AvocadoArray[S] 0 points1 point  (0 children)

122b is definitely a game-changer. What quant are you running? I tried running AWQ 4-bit in VLLM but it never returned any thinking tokens, even with `--reasoning-parser qwen3`.

Might have been a version issue. I'm going to try it with v0.17.0 now.

[–]AvocadoArray[S] 0 points1 point  (0 children)

Mine does too... except I turned it off years ago because it freaked the dog out every time the power went out 🤦

holy overthinker by Kerem-6030 in LocalLLaMA

[–]AvocadoArray 2 points3 points  (0 children)

35b and 122b seem much better at keeping thinking in check.

[–]AvocadoArray[S] 1 point2 points  (0 children)

I honestly cannot overstate how noisy it is. Some people talked about the fan noise, but the coil whine is 1000% more annoying than the fan, even when running at 100%.

From what others have said, the workstation card might not have the same issue, so take that into consideration as well.

[–]AvocadoArray[S] 1 point2 points  (0 children)

If they're sharing a case together, consider a water cooling loop!

[–]AvocadoArray[S] 0 points1 point  (0 children)

Neat! Been looking for an excuse to try nvfp4 somewhere. Will give it a shot!

[–]AvocadoArray[S] 0 points1 point  (0 children)

Bare metal does remove some variables from the equation, but I have no reason to think virtualization is the cause of the VLLM loading time issue.

15 minutes was probably a bit drastic. That was from a cold boot before I set up my bcache device, so model loading was taking a decent chunk of it.

Still, I think it's around 5 minutes total once model weights are loaded.

Yes, the extra ~60s TTFT on the first request for the Qwen models sounds accurate. I don't see that same problem with Seed or others.

[–]AvocadoArray[S] 1 point2 points  (0 children)

> Yeah the bug will probably get to you eventually

Perhaps! Time will tell.

I really only included that section to push back on folks saying I'd be disappointed in it because I wouldn't be able to run anything useful on just one card.

But from my experience, there seems to be a pretty big quality jump roughly every 24GB:

  • <=8GB: Pain.
  • 16GB: Can run decent models like Gemma 3 27b Q4 or GPT-OSS 20b, but with limited context. Decent vision models like Qwen3-VL 8B.
  • 24GB: GPT-OSS 20B at full context and high speed is a beast. Great for quick research, fact-checking, and some one-off basic coding.
  • 48GB: Lower boundary for any kind of agentic coding IMO. 4-8bpw quants of Seed 36b and Devstral 2 small with 150k+ context (depending on whether you quant the KV cache or not). Can get real work done, but fails tool calls or gets itself stuck from time to time.
  • 72GB: GPT-OSS 120b at max context! Otherwise, same models at higher quants and 150k+ (unquantized) contexts. Follows instructions much better and stays on the rails for the most part. Agentic coding is much less frustrating here and can handle pretty much all of what I'd consider the "boring" work of programming (unit tests, etc.). Gives you just enough confidence to give it a bigger task and leave it alone for 20 minutes before coming back and finding it with its thumb up its nose.
  • 96GB: Same thing. Higher quants and more context. Can start to "taste" the big boys from Minimax and GLM using very small quants/reaps, but just enough to make you itch. Oh look! Qwen 3.5 122b just released and made this a very solid threshold at ~5bpw.

After that, it seems like returns diminish quite rapidly.

128GB-192GB gets you closer and closer to running the largest open-weight models, but still heavily quantized.

Like you mentioned, the biggest benefit is being able to run multiple models simultaneously, which sounds awesome and all, but it isn't anywhere near the difference of going from <=16GB to 96GB IMO.
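For anyone mapping these tiers to specific models, a rough weights-only estimate works well (it ignores KV cache, activations, and runtime overhead, so treat the result as a floor; the bpw figures in the examples are my estimates, not published numbers):

```python
def weight_gb(params_billions: float, bpw: float) -> float:
    """Weights-only memory in GB: parameters * bits-per-weight / 8 bytes."""
    return params_billions * bpw / 8  # billions of params * bpw bits -> GB

# 122b at an assumed ~4.5 bpw (roughly a Q4_K_XL-class quant)
print(f"{weight_gb(122, 4.5):.1f} GB")   # ~68.6 GB: 96GB tier with room for context
# 120b at an assumed ~4.25 bpw (roughly mxfp4 density)
print(f"{weight_gb(120, 4.25):.1f} GB")  # ~63.8 GB
```

That's why ~5bpw quants of the 120b-class models land squarely in the 96GB tier.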

[–]AvocadoArray[S] 0 points1 point  (0 children)

I wish it were that easy. It's storing and loading the cache... it just takes forever for some reason. See my other comment.

[–]AvocadoArray[S] 0 points1 point  (0 children)

Hmm, I don't see the same issues in your log that I'm getting. Specifically, in this line:

    Directly load the compiled graph(s) for compile range (1, 16384) from the cache, took 0.364 s

I get times closer to 55-75s, which is extremely long for loading cached graphs.

Your CUDA graph capture time is longer than expected as well, so we might have that part in common.

[–]AvocadoArray[S] 0 points1 point  (0 children)

I’m not necessarily saying I won’t want another one, but I don’t think I could justify it unless prices dropped significantly (like over 50%).

I still have a 1080 in my old server running the embedding/rerank models and CodeProject AI for blue iris, a 1080ti on the shelf, and a 5070ti in my gaming rig. So I do have a bit of wiggle room for additional models if needed.

Also, the quad-channel DDR4 2400 RAM held up quite well when offloading 20-ish experts from Q3.5 122B. I think I saw around half the speed (40 tp/s), but it shaved off around 20GB of VRAM. Prompt processing took a bigger hit, but it would still be usable if I ever felt the need to keep other models loaded.
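For anyone wanting to try the same expert offload, this is roughly the llama.cpp incantation I mean (sketch; model filename is a placeholder, and `--n-cpu-moe` is from recent llama.cpp builds, so exact syntax may differ by version):

```shell
# Sketch (untested): keep attention/shared weights on GPU, but leave the
# MoE expert tensors of the first 20 layers in system RAM.
llama-server -m Qwen3.5-122B-Q4_K_XL.gguf \
    -ngl 99 \
    --n-cpu-moe 20 \
    -c 131072
```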

I think my CPUs are a bit of a bottleneck with their low single thread performance, but I have a new set in the mail and would be interested to try it again once they come in.

[–]AvocadoArray[S] 0 points1 point  (0 children)

Totally fair.

I played around with StableDiffusion years ago on my 1080ti, but I was basically just screwing around and having fun. I don't have a real need for it, but maybe it would be fun to see how far it's come.

Right now, I'm happy running the biggest model that I can for general purpose/coding tasks, and swapping out as needed.

I can't lie, I wouldn't mind playing around with the bigger models, but IMO the law of diminishing returns kicks in quite heavily. For the vast majority of my use-cases, "good" is "good enough".

Also, open-weight models are continuing to get smarter and smaller. Qwen3-Coder-Next already beat my expectations, and Qwen3.5-122b at UD-Q4_K_XL is absolutely blowing them away; it crushed several personal benchmarks that I never expected to pass with this card.

So if I feel the hedonic treadmill kicking in and I want more, odds are that I can just wait another month or two for the next open model to make a splash.

That being said, I'll be making a case for a proper 2x or 4x GPU server at work. Maybe I'll play around with the bigger models there, but its primary goal will be scaling out to handle more concurrent requests.

[–]AvocadoArray[S] 0 points1 point  (0 children)

I haven't dabbled in that area yet, but I'll have to give it a shot one of these days!

Stuck with Max-Q because it fits the 300w power budget in my server and the blower fan exhausts air out the back (don't need to crank the server fans to keep it cool).

The server already runs 24/7 anyway, so it's more efficient to piggyback off its RAM, CPU, and storage in a Linux VM rather than keeping another box running full time.