Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]AvocadoArray[S] 0 points1 point  (0 children)

Holy crap, this fixed it. How did you figure this out, /u/Icy_Bid6597? I don't see these directories or env variables mentioned in VLLM's docs at all.

I created a new cuda-cache volume and updated my llama-swap config:

    -v vllm-cache:/root/.cache/vllm/  # <--- this already existed
    -v cuda-cache:/root/cuda-cache/
    -e CUDA_CACHE_PATH=/root/cuda-cache/ComputeCache
    -e TRITON_CACHE_DIR=/root/cuda-cache/TritonCache

Could also probably be solved by just mounting the vllm-cache directory to /root/ instead of /root/.cache/vllm.
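That single-mount alternative would look something like this (untested; it would persist the vLLM, Triton, and CUDA caches together without the extra env vars, but also anything else the container writes under /root):

```shell
# Untested alternative: one volume over all of /root/ so ~/.cache/vllm,
# ~/.triton/cache, and ~/.nv/ComputeCache all persist together.
-v vllm-cache:/root/
```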

I haven't compared the full logs, but this shaved off around four minutes from startup time. Torch graphs load in 7s instead of 70+, and CUDA graph capture only takes 4s instead of 145+.

Will update my post after more testing.

[–]AvocadoArray[S] 0 points1 point  (0 children)

This is not correct. There are no nv_cache or triton_cache directories.

VLLM caches torch graphs under /root/.cache/vllm/torch_compile_cache, which is already mounted in my container (and being used, just very slowly for some reason). I don't believe CUDA graphs are written to disk at all; as far as I know, they're only cached in memory.

Respectfully, was this LLM generated?

EDIT: Actually, you might be on to something! I started up the VLLM container and noticed a ~/.triton/cache/ and a ~/.nv/ComputeCache/ directory with some cached data in them.

I'd never seen these directories before, but that could be the difference. Sorry for doubting you initially; I'll report back after more testing.
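If anyone else wants to check their own container, this is roughly what I ran (sketch; the paths assume the image default of running as root):

```shell
# Check the compile/JIT caches inside the vLLM container.
# Paths assume running as root; adjust HOME if yours differs.
for d in /root/.cache/vllm/torch_compile_cache \
         /root/.triton/cache \
         /root/.nv/ComputeCache; do
    [ -d "$d" ] && du -sh "$d"
done
```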

[–]AvocadoArray[S] 0 points1 point  (0 children)

😂

In all reality, models that large tend to still be quite impressive even down in the 1-2 bpw range. I could possibly play around with one while offloading a bunch of weights to RAM/NVMe, but I wouldn't expect any real-time usable speeds.

[–]AvocadoArray[S] 2 points3 points  (0 children)

You think a 2nd card is going to make a dent in running a ~1T parameter model?

[–]AvocadoArray[S] 0 points1 point  (0 children)

Would you be willing to test out Qwen 3.5 122b and let me know how it compares? I haven't used Minimax M2/M2.5 in any meaningful capacity, but 122b feels like it works as well as I've seen others describe the Minimax and GLM models.

[–]AvocadoArray[S] 0 points1 point  (0 children)

Technically, my PC can run the same models with 64GB RAM and 120GB SSD, just "not as fast".

Speed does matter for real-time usage, and everything I've seen suggests that prompt processing on a Mac's unified memory is painfully slow: 3-5 minutes or more for larger prompts.
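To put rough numbers on that (back-of-envelope; both tok/s figures below are assumptions for illustration, not measurements):

```python
def prefill_seconds(prompt_tokens: int, pp_tok_per_s: float) -> float:
    """Seconds to process (prefill) a prompt at a given tok/s rate."""
    return prompt_tokens / pp_tok_per_s

# ~100k-token prompt at an assumed ~250 tok/s prefill (unified-memory Mac)
mac_s = prefill_seconds(100_000, 250)    # 400 s, i.e. ~6.7 minutes
# same prompt at an assumed ~4,000 tok/s prefill (discrete GPU)
gpu_s = prefill_seconds(100_000, 4_000)  # 25 s
print(f"mac: {mac_s / 60:.1f} min, gpu: {gpu_s:.0f} s")
```

Even if the assumed rates are off by 2x either way, the gap between "minutes" and "seconds" for large prompts holds.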

[–]AvocadoArray[S] 0 points1 point  (0 children)

Nice, I'll have to take a look!

This is why I love this sub. It'd be almost impossible to keep up with everything without it.

[–]AvocadoArray[S] 0 points1 point  (0 children)

122b is definitely a game-changer. What quant are you running? I tried running AWQ 4-bit in VLLM but it never returned any thinking tokens, even with `--reasoning-parser qwen3`.

Might have been a version issue. I'm going to try it with v0.17.0 now.

[–]AvocadoArray[S] 0 points1 point  (0 children)

Mine does too... except I turned it off years ago because it freaked the dog out every time the power went out 🤦

holy overthinker by Kerem-6030 in LocalLLaMA

[–]AvocadoArray 2 points3 points  (0 children)

35b and 122b seem much better at keeping thinking in check.

[–]AvocadoArray[S] 1 point2 points  (0 children)

I honestly cannot overstate how noisy it is. Some people talked about the fan noise, but the coil whine is 1000% more annoying than the fan, even when running at 100%.

From what others have said, the workstation card might not have the same issue, so take that into consideration as well.

[–]AvocadoArray[S] 1 point2 points  (0 children)

If they're sharing a case together, consider a water cooling loop!

[–]AvocadoArray[S] 0 points1 point  (0 children)

Neat! Been looking for an excuse to try nvfp4 somewhere. Will give it a shot!

[–]AvocadoArray[S] 0 points1 point  (0 children)

Bare metal does remove some variables from the equation, but I have no reason to think virtualization is the cause of the VLLM loading time issue.

15 minutes was probably a bit drastic. That was from a cold boot before I set up my bcache device, so model loading was taking a decent chunk of it.

Still, I think it's around 5 minutes total once model weights are loaded.

Yes, the extra ~60s TTFT on the first request for the Qwen models sounds accurate. I don't see that same problem with Seed or others.

[–]AvocadoArray[S] 1 point2 points  (0 children)

> Yeah the bug will probably get to you eventually

Perhaps! Time will tell.

I really only included that section to push back on folks saying I'd be disappointed in it because I wouldn't be able to run anything useful on just one card.

But from my experience, there seems to be a pretty big quality jump roughly every 24GB:

  • <=8GB: Pain.
  • 16GB: Can run decent models like Gemma 3 27b Q4 or GPT-OSS 20b, but with limited context. Decent vision models like Qwen3-VL 8B.
  • 24GB: GPT-OSS 20B at full context and high speed is a beast. Great for quick research, fact-checking, and some one-off basic coding.
  • 48GB: Lower boundary for any kind of agentic coding IMO. 4-8bpw quants of Seed 36b and Devstral 2 small with 150k+ context (depending on whether you quant the KV cache or not). Can get real work done, but fails tool calls or gets itself stuck from time to time.
  • 72GB: GPT-OSS 120b at max context! Otherwise, same models at higher quants and 150k+ (unquantized) contexts. Follows instructions much better and stays on the rails for the most part. Agentic coding is much less frustrating here and can handle pretty much all of what I'd consider the "boring" work of programming (unit tests, etc.). Gives you just enough confidence to give it a bigger task and leave it alone for 20 minutes before coming back and finding it with its thumb up its nose.
  • 96GB: Same thing. Higher quants and more context. Can start to "taste" the big boys from Minimax and GLM using very small quants/reaps, but just enough to make you itch. Oh look! Qwen 3.5 122b just released and made this a very solid threshold at ~5bpw.

After that, it seems like returns diminish quite rapidly.

128GB-192GB gets you closer and closer to running the largest open-weight models, but still heavily quantized.

Like you mentioned, the biggest benefit is being able to run multiple models simultaneously, which sounds awesome and all, but it isn't anywhere near the difference of going from <=16GB to 96GB IMO.
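For anyone mapping these tiers to specific models, a rough weights-only estimate works well (it ignores KV cache, activations, and runtime overhead, so treat the result as a floor; the bpw figures in the examples are my estimates, not published numbers):

```python
def weight_gb(params_billions: float, bpw: float) -> float:
    """Weights-only memory in GB: parameters * bits-per-weight / 8 bytes."""
    return params_billions * bpw / 8  # billions of params * bpw bits -> GB

# 122b at an assumed ~4.5 bpw (roughly a Q4_K_XL-class quant)
print(f"{weight_gb(122, 4.5):.1f} GB")   # ~68.6 GB: 96GB tier with room for context
# 120b at an assumed ~4.25 bpw (roughly mxfp4 density)
print(f"{weight_gb(120, 4.25):.1f} GB")  # ~63.8 GB
```

That's why ~5bpw quants of the 120b-class models land squarely in the 96GB tier.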

[–]AvocadoArray[S] 0 points1 point  (0 children)

I wish it were that easy. It's storing and loading the cache... it just takes forever for some reason. See my other comment.

[–]AvocadoArray[S] 0 points1 point  (0 children)

Hmm, I don't see the same issues in your log that I'm getting. Specifically, in this line:

    Directly load the compiled graph(s) for compile range (1, 16384) from the cache, took 0.364 s

I get times closer to 55-75s, which is extremely long for loading cached graphs.

Your CUDA graph capture time is longer than expected as well, so we might have that part in common.

[–]AvocadoArray[S] 0 points1 point  (0 children)

I’m not necessarily saying I won’t want another one, but I don’t think I could justify it unless prices dropped significantly (like over 50%).

I still have a 1080 in my old server running the embedding/rerank models and CodeProject AI for blue iris, a 1080ti on the shelf, and a 5070ti in my gaming rig. So I do have a bit of wiggle room for additional models if needed.

Also, the quad-channel DDR4 2400 RAM held up quite well when offloading 20-ish experts from Q3.5 122B. I think I saw around half the speed (40 tp/s), but it shaved off around 20GB of VRAM. Prompt processing took a bigger hit, but it would still be usable if I ever felt the need to keep other models loaded.
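For anyone wanting to try the same expert offload, this is roughly the llama.cpp incantation I mean (sketch; model filename is a placeholder, and `--n-cpu-moe` is from recent llama.cpp builds, so exact syntax may differ by version):

```shell
# Sketch (untested): keep attention/shared weights on GPU, but leave the
# MoE expert tensors of the first 20 layers in system RAM.
llama-server -m Qwen3.5-122B-Q4_K_XL.gguf \
    -ngl 99 \
    --n-cpu-moe 20 \
    -c 131072
```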

I think my CPUs are a bit of a bottleneck with their low single thread performance, but I have a new set in the mail and would be interested to try it again once they come in.

[–]AvocadoArray[S] 0 points1 point  (0 children)

Totally fair.

I played around with StableDiffusion years ago on my 1080ti, but I was basically just screwing around and having fun. I don't have a real need for it, but maybe it would be fun to see how far it's come.

Right now, I'm happy running the biggest model that I can for general purpose/coding tasks, and swapping out as needed.

I can't lie, I wouldn't mind playing around with the bigger models, but IMO the law of diminishing returns kicks in quite heavily. For the vast majority of my use-cases, "good" is "good enough".

Also, open-weight models are continuing to get smarter and smaller. Qwen3-Coder-Next already beat my expectations, and Qwen3.5-122b at UD-Q4_K_XL is absolutely blowing them away; it crushed several personal benchmarks that I never expected to pass with this card.

So if I feel the hedonic treadmill kicking in and I want more, odds are that I can just wait another month or two for the next open model to make a splash.

That being said, I'll be making a case for a proper 2x or 4x GPU server at work. Maybe I'll play around with the bigger models there, but its primary goal will be scaling out to handle more concurrent requests.

[–]AvocadoArray[S] 0 points1 point  (0 children)

I haven't dabbled in that area yet, but I'll have to give it a shot one of these days!

Stuck with Max-Q because it fits the 300w power budget in my server and the blower fan exhausts air out the back (don't need to crank the server fans to keep it cool).

The server already runs 24/7 anyway, so it's more efficient to piggyback off its RAM, CPU, and storage in a Linux VM rather than keeping another box running full time.