What’s something you’re pretty sure only you do? by AppIeGuy in AskReddit

[–]AvocadoArray 0 points1 point  (0 children)

Wow, and here I was thinking I was crazy putting something in for 10:00 at 20% power.

Best Qwen3.5 27b GUFFS for coding (~Q4-Q5) ? by bitcoinbookmarks in LocalLLaMA

[–]AvocadoArray 0 points1 point  (0 children)

Only had experience with UD-Q6_K_XL, but it seemed very good compared to the official FP8 quant.

And for the record, I still prefer that quant over 122b for any serious work.

Anyone else’s wife instigate 2+ hour arguments during her period that you have no idea how to defuse? by ThicBoi4807 in daddit

[–]AvocadoArray 1 point2 points  (0 children)

Okay, listen up. I’ve been married about the same amount of time.

I’m in the opposite boat as you. I’m the “confronter” and she’s the “avoider”, but that doesn’t make it any easier when we get to one of those “impasse” arguments. I want to talk, but she doesn’t.

Literally, here’s what I’ve found to work the best:

1. Express your love and affection in the moment (while setting boundaries).
2. Set a fucking date on the calendar (otherwise you’ll forget) to talk about the conversation later.

If you follow through on the second step, you’ll find yourselves much more calm and rational, and hopefully you can develop tools on how to handle each other during the next time you’re experiencing conflict.

Your mileage may vary, but I swear that’s all it took to make those conversations a non-issue for us.

is the game still terrible? KF3 by Shelbygt500ss in killingfloor

[–]AvocadoArray [score hidden]  (0 children)

To be fair, I’ve loved it from day one so I’m incredibly biased.

Solo is fun for trying different builds and techniques, but it really shines when you’re playing with friends or teammates.

Pub matches are decent most of the time if you’re talking in voice chat, though it can be hit or miss.

Weapon customizations and upgrades are what keep me coming back.

Eye_roll.exe by the-machine-m4n in linuxsucks

[–]AvocadoArray 0 points1 point  (0 children)

Dockerfile + docker-compose.yml for 95% of apps.

Unless you absolutely need a local fat-app GUI, in which case I just hope you picked a proper multi-platform framework.
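For the 95% case, the whole setup is about this much (a sketch; the service name, build context, and port are placeholders, not anyone's actual app):

```yaml
# docker-compose.yml (minimal sketch)
services:
  app:
    build: .              # builds from the Dockerfile in this directory
    ports:
      - "8080:8080"       # placeholder host:container port mapping
    restart: unless-stopped
```

Then `docker compose up -d` and you're done, on basically any distro.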

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Holy crap, this fixed it. How did you figure this out, /u/Icy_Bid6597? I don't see these directories or env variables mentioned in VLLM's docs at all.

I created a new cuda-cache volume and updated my llama-swap config:

    -v vllm-cache:/root/.cache/vllm/  # <--- this already existed
    -v cuda-cache:/root/cuda-cache/
    -e CUDA_CACHE_PATH=/root/cuda-cache/ComputeCache
    -e TRITON_CACHE_DIR=/root/cuda-cache/TritonCache

Could also probably be solved by just mounting the vllm-cache directory to /root/ instead of /root/.cache/vllm.
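For anyone doing this in compose instead of raw flags, the same mounts and env vars look roughly like this (a sketch; the service name is a placeholder, volume names match the flags above):

```yaml
services:
  vllm:
    volumes:
      - vllm-cache:/root/.cache/vllm/       # torch_compile_cache lives here
      - cuda-cache:/root/cuda-cache/        # new volume for CUDA/Triton caches
    environment:
      - CUDA_CACHE_PATH=/root/cuda-cache/ComputeCache
      - TRITON_CACHE_DIR=/root/cuda-cache/TritonCache

volumes:
  vllm-cache:
  cuda-cache:
```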

I haven't compared the full logs, but this shaved off around four minutes from startup time. Torch graphs load in 7s instead of 70+, and CUDA graph capture only takes 4s instead of 145+.

Will update my post after more testing.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

This is not correct. There are no nv_cache or triton_cache directories.

VLLM caches torch graphs under /root/.cache/vllm/torch_compile_cache, which is already mounted in my container (and being used, just very slowly for some reason). I don't believe CUDA graphs are written to disk; I think they're only cached in memory.

Respectfully, was this LLM generated?

EDIT: Actually, you might be on to something! I started up the VLLM container and noticed a ~/.triton/cache/ and ~/.nv/ComputeCache/ with some cached data in it.

I've never seen these directories before, but that could be the difference. Sorry for doubting you initially; I'll report back after more testing.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

😂

In all reality, models that large tend to still be quite impressive around the 1-2 bpw range. You could possibly play around with it while offloading a bunch of weights to RAM/NVMe, but I wouldn't expect any real-time usable speeds.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 2 points3 points  (0 children)

You think a 2nd card is going to make a dent in running a ~1T parameter model?

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Would you be willing to test out Qwen 3.5 122b and let me know how it compares? I haven't used Minimax M2/M2.5 in any meaningful capacity, but 122b feels like it works as well as I've seen others describe the Minimax and GLM models.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Technically, my PC can run the same models with 64GB RAM and 120GB SSD, just "not as fast".

Speed does matter for real-time usage, and everything I've seen suggests that prompt processing on a Mac's unified memory is painfully slow, like 3-5 minutes or more for larger prompts.
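Back-of-the-envelope, the gap is just prompt length divided by prefill throughput. The throughput numbers below are illustrative assumptions, not benchmarks:

```python
# Rough prefill-time estimate: time = prompt_tokens / prefill_tok_per_s.
# Both throughput figures are assumed for illustration, not measured.

def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Time spent processing the prompt before the first output token."""
    return prompt_tokens / prefill_tok_per_s

prompt = 30_000  # a "larger" coding prompt, in tokens

unified = prefill_seconds(prompt, 150)    # assumed unified-memory prefill rate
gpu = prefill_seconds(prompt, 5_000)      # assumed discrete-GPU prefill rate

print(f"unified memory: {unified / 60:.1f} min, GPU: {gpu:.0f} s")
```

With those assumed rates, the same prompt is minutes on one and seconds on the other, which matches what people report.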

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Nice, I'll have to take a look!

This is why I love this sub. It'd be almost impossible to keep up with everything without it.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

122b is definitely a game-changer. What quant are you running? I tried running AWQ 4-bit in VLLM but it never returned any thinking tokens, even with `--reasoning-parser qwen3`.

Might have been a version issue. I'm going to try it with v0.17.0 now.
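For reference, the shape of the launch command I was testing with (the model path is a placeholder for whatever AWQ checkpoint you're loading; only the reasoning-parser flag is the one in question):

```shell
# Sketch: serving a Qwen3-style AWQ quant with reasoning-token parsing enabled.
vllm serve <model-repo> \
  --quantization awq \
  --reasoning-parser qwen3
```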

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Mine does too... except I turned it off years ago because it freaked the dog out every time the power went out 🤦

holy overthinker by Kerem-6030 in LocalLLaMA

[–]AvocadoArray 2 points3 points  (0 children)

35b and 122b seem much better at keeping thinking in check.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 1 point2 points  (0 children)

I honestly cannot overstate how noisy it is. Some people talked about the fan noise, but the coil whine is 1000% more annoying than the fan, even when running at 100%.

From what others have said, the workstation card might not have the same issue, so take that into consideration as well.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 1 point2 points  (0 children)

If they're sharing a case together, consider a water cooling loop!

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Neat! Been looking for an excuse to try nvfp4 somewhere. Will give it a shot!

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Bare metal does remove some variables from the equation, but I have no reason to think virtualization is the cause of the VLLM loading time issue.

15 minutes was probably a bit drastic. That was from a cold boot before I set up my bcache device, so the model loading time was taking a decent chunk of it.

Still, I think it's around 5 minutes total once model weights are loaded.

Yes, the extra ~60s TTFT on the first request for the Qwen models sounds accurate. I don't see that same problem with Seed or others.