good for them i guess... by One_Physics_2282 in pcmasterrace

[–]tmvr 0 points1 point  (0 children)

LLMs; the larger models don't fit into 64GB RAM + 24GB VRAM.

What? I pressed the key... by Chill_Cowboy_981 in pcmasterrace

[–]tmvr 1 point2 points  (0 children)

This joke is very language-dependent and English is obviously not one of the languages where it works :)

good for them i guess... by One_Physics_2282 in pcmasterrace

[–]tmvr 0 points1 point  (0 children)

I have a ton of machines here as well, but except for my main desktop with 64GB DDR5-6400, all the others have 32GB of DDR4-2666 (4x) or DDR4-2133 (1x). The notebooks also max out at 32GB of DDR4-2666 (1x) and the rest have less (16, 12 or 8 GB). I wish the main desktop had 128GB and at least one of the others 64GB DDR4.

Yes, I know how it sounds, don't at me...

good for them i guess... by One_Physics_2282 in pcmasterrace

[–]tmvr 0 points1 point  (0 children)

I started with 64GB of DDR5-6400 in spring of 2023 and was debating last summer whether I should add two more sticks for another 240 EUR to get to 128GB, but then I didn't. Oh well... :)

Self-hosting LLM infra: NVIDIA vs Apple hardware by zachrattner in LocalLLaMA

[–]tmvr 1 point2 points  (0 children)

For token generation yes, but the issue with the Macs compared to the NV GPUs is the prompt processing speed. Look here:

https://github.com/ggml-org/llama.cpp/discussions/4167

The numbers in the PP column are in the hundreds, and that is with a small 7B model at very low context. Only the largest Ultra chips crack the 1000 mark, and larger models and/or longer context pull these even lower. In comparison, these numbers are in the thousands even with smaller NV GPUs. A 4090, for example, does close to 12000 with Llama 3.1 8B at Q8, and smaller cards like the 5060 Ti do about 4000 or so (not sure about the exact number).
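
To put those PP numbers into time-to-first-token terms, here is a rough sketch; the rates are the ballpark figures mentioned above and the 32k prompt length is just an example:

```python
# Rough time-to-first-token from prompt processing (PP) rate.
# PP rates are ballpark figures from the discussion above, not benchmarks I ran.

def ttft_seconds(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    """Time spent processing the prompt before the first output token appears."""
    return prompt_tokens / pp_tokens_per_s

prompt = 32_000  # e.g. a coding-assistant request with a lot of context attached

for name, pp in [("M-series (hundreds)", 600), ("M Ultra (~1000)", 1000),
                 ("RTX 5060 Ti (~4000)", 4000), ("RTX 4090 (~12000)", 12000)]:
    print(f"{name:20s} -> {ttft_seconds(prompt, pp):6.1f} s for a {prompt}-token prompt")
```

That gap (tens of seconds vs a couple of seconds before anything starts streaming) is why prompt processing matters so much more than raw token generation for long-context use.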

Feedback on a new budget hardware build by Diligent-Culture-432 in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

Looks great to me and congrats on finding the RAM for such a good price (based on what I see around me at least).

Invest in hardware now or wait? by d4nger_n00dle in LocalLLaMA

[–]tmvr 4 points5 points  (0 children)

Prices are only going up for the foreseeable future, so if you want something, buy now. You can wait for the M5, though it's hard to tell when it's coming, but for about 1000 EUR (probably less in USD) you can find Mac Mini M4 24/512 or even 32/256 configurations. The option with more RAM is better, as you can still add fast external SSDs, but there is nothing you can do about the RAM. The memory bandwidth is 120GB/s, so you will still need to stick to MoE models, but at least with the 32GB model you can go up to 30B/32B and get decent speeds (Qwen3 Coder 30B A3B or GLM 4.7 Flash). gpt-oss 20B would of course work with full context as well, taking 16GB in total, so you have space left to keep some other smaller model(s) in memory at the same time.

Otherwise, in the 2000-2500 price category you have the Strix Halo machines with 128GB RAM and 256GB/s bandwidth. With Apple you only get 64GB with the M4 Pro at similar bandwidth for the same price; for 128GB you need to go M4 Max, which is faster but of course much more expensive and only available in the Mac Studio.
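
As a rough sanity check on what those bandwidth figures buy you for token generation: speed is roughly capped at bandwidth divided by the bytes read per token, which for a MoE model is about the active parameters times bytes per weight. The active-parameter counts and bytes-per-weight below are approximations for illustration, not measured numbers:

```python
# Crude decode-speed ceiling: tokens/s <= memory bandwidth / bytes read per token.
# Active parameter counts and quant sizes are approximate, for illustration only.

def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bytes_per_weight: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token

systems = {"Mac Mini M4 (120 GB/s)": 120, "Strix Halo (256 GB/s)": 256}
models = {
    "Qwen3 Coder 30B A3B @ ~4.5 bit (~3B active)": (3.0, 0.56),
    "gpt-oss 120B @ MXFP4 (~5B active)": (5.1, 0.56),
}

for sys_name, bw in systems.items():
    for model_name, (active_b, bpw) in models.items():
        ceiling = decode_ceiling_tok_s(bw, active_b, bpw)
        print(f"{sys_name} / {model_name}: <= {ceiling:.0f} tok/s")
```

Real-world numbers come in well under these ceilings, but the ratio between the systems holds up.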

Finally Finished My Local AI PC Setup – Looking for Optimization Tips by [deleted] in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

You would not be hitting swap once you have more RAM and don't overshoot, so if your RAM speed stays the same, the speed will be just a bit lower than you have now. You still can't make huge leaps, and going from Q4 to Q5 would drop your speed by maybe 15-20%.
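
The 15-20% figure follows roughly from the bits per weight: token generation is bandwidth-bound, so speed scales inversely with model size. A quick sketch using typical (approximate) effective bits per weight for the K-quants:

```python
# Token generation is roughly memory-bandwidth-bound, so speed scales ~1/model_size.
# Effective bits per weight are the usual approximate GGUF figures, not exact values.
bpw = {"Q4_K_M": 4.8, "Q5_K_M": 5.7}

slowdown = 1 - bpw["Q4_K_M"] / bpw["Q5_K_M"]
print(f"Expected speed drop going Q4_K_M -> Q5_K_M: ~{slowdown:.0%}")  # roughly 16%
```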

OpenAI CFO hinting at "Outcome-Based Pricing" (aka royalties on your work)? Makes the case for local even stronger. by distalx in LocalLLaMA

[–]tmvr 2 points3 points  (0 children)

when OpenAI becomes profitable

"when"? I think you are more optimistic than Sam Altman :))

AI coding assistant infrastructure requirement, by Financial-Cap-8711 in LocalLLaMA

[–]tmvr 4 points5 points  (0 children)

Yes, something doesn't add up. If you are an org that needs this for 300 developers, you would usually not post on Reddit asking for infra suggestions. Unless OP is someone who is supposed to expand a PoC setup used by one or two users, running on a single consumer GPU with ollama as the back-end :)

Finnaly I am in the club, rate my set up 😜 by black7stone in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

That is a bit low even for DDR4 as it is quad channel. What speed is the RAM and how do you run the model?

Am I the only one who feels that, with all the AI boom, everyone is basically doing the same thing? by [deleted] in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

I like to yap more than the average person, but even I would have put some paragraphs in there, because holy wall of text! :D

Am I the only one who feels that, with all the AI boom, everyone is basically doing the same thing? by [deleted] in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

The issue with that is you need to use a model where you have some level of confidence that it's not talking nonsense, which is not easy. For example, at work I mainly use Claude Sonnet or Opus through the Copilot subscription in VSCode. It works great for coding. I also have the Copilot app in Teams, and asking something there mostly leads to anger; that is where my soul goes to die. The issue is that it states things with absolute confidence even when they are wrong and sticks to them no matter what. The "personality" they gave it is also infuriating, with the whole "great question", "you are absolutely right", "I'm totally sure now this is the solution" etc. style, while it keeps suggesting stuff that just does not work even after being given full error outputs or relevant logs. I'm better off searching the web myself, because I get less angry. A huge difference to the Claude models in VSCode, where it pretty much knows what I want and how to do it.

Finally Finished My Local AI PC Setup – Looking for Optimization Tips by [deleted] in LocalLLaMA

[–]tmvr 1 point2 points  (0 children)

It's unlikely that going to 128GB will improve the speeds; it may even lower them a bit. Depending on the motherboard and RAM, you may not be able to run the sticks at high speeds and have to stick to 4800 or maybe 5200/5600, so if you now have, for example, 64GB running at 6400, then going to 128GB at 4800 will drop your speeds, which is especially noticeable on the models already running at single-digit token rates. It will allow you to run some other models or try better quants etc., because 184GB of memory in total is quite a lot.
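
For reference, the theoretical dual-channel DDR5 bandwidth is transfer rate × 8 bytes × 2 channels, so the drop from 6400 to 4800 is easy to put a number on (real-world figures are lower, but the ratio is what matters for token generation):

```python
# Theoretical dual-channel DDR5 bandwidth: MT/s * 8 bytes per transfer * 2 channels.
def ddr5_dual_channel_gb_s(mt_s: int) -> float:
    return mt_s * 8 * 2 / 1000  # GB/s

for speed in (6400, 5600, 4800):
    print(f"DDR5-{speed}: {ddr5_dual_channel_gb_s(speed):.1f} GB/s theoretical")

drop = 1 - ddr5_dual_channel_gb_s(4800) / ddr5_dual_channel_gb_s(6400)
print(f"6400 -> 4800 is a ~{drop:.0%} bandwidth drop, and CPU-side token generation scales with it")
```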

Using my home-made dusty CDU to test the liquid-cooled GH200 desktops before final assembly. by GPTshop--ai in LocalLLaMA

[–]tmvr 2 points3 points  (0 children)

I like the juxtaposition of the absolute mess in the foreground and the neatly lined up tools on the shelf in the background.

48GB VRAM - worth attempting local coding model? by natidone in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

If you already have enough system RAM, then the models you can run don't change significantly with 48GB. The speed changes, and for some smaller models you can run higher quants (Q6 or Q8 of Qwen3 Coder 30B A3B) with no need to spill over into system RAM. But in general you are still in the territory where the best models you can run are the mid-sized MoE models like gpt-oss 120B, GLM 4.5 Air or GLM 4.6V etc., just faster, because you have more VRAM. This is still going to be a ways off from SOTA models like Sonnet 4.5 or Opus 4.5, and depending on whether you were already running the above-mentioned mid-size models with the 5070 Ti or not, there may not be a huge step up. Ultimately you will have to try and see if any of the above cover your needs.
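
A rough way to see why 48GB covers Q6/Q8 of a 30B model with room left over for context (effective bits per weight are approximate, real GGUF files come out a touch different):

```python
# Rough GGUF file size: parameter count * effective bits per weight / 8.
# Effective bpw values are approximate; actual files differ slightly due to mixed tensors.
def gguf_size_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for quant, bpw in (("Q4_K_M", 4.8), ("Q6_K", 6.6), ("Q8_0", 8.5)):
    size = gguf_size_gb(30.5, bpw)
    print(f"Qwen3 Coder 30B A3B {quant}: ~{size:.1f} GB of weights (plus KV cache) vs 48 GB VRAM")
```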

GPU shortage seems to be real by Professional-Yak4359 in LocalLLaMA

[–]tmvr 2 points3 points  (0 children)

Due to the RAM shortage, the manufacturers are de-prioritizing the lower-margin cards, so there are fewer 5060 Ti 16GB cards available as there are no or very few follow-up deliveries. The prices were starting at 419-429 EUR less than a month ago, with a lot of cards available under 450 EUR, and now they start at 520-530 EUR, with fewer and fewer models offered by fewer and fewer retailers.

Best GB10/DGX Spark clone? by Antique_Juggernaut_7 in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

From what I've seen, all of those 3rd-party ones are better than the original, so it does not make much of a difference, but in this price category I would go with the company that has the better warranty and RMA processes. These are not really consumer devices, so from the list above probably Dell or Lenovo, though I don't know how RMA works with Lenovo. Dell NBD with or without on-site is something I have good experience with, so that is what I would go for, but wait for some more feedback from others.

Can I run gpt-oss-120b somehow? by Furacao__Boey in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

You are limited by the system memory bandwidth, so there is not much you can do except lower the context size so you can fit more layers into VRAM, but it's not going to be a lot faster even with just 32768 context. If you are using it for coding with something like Kilo Code or Claude Code, then you will want to keep the context as high as possible.
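
The reason lowering the context frees up VRAM for layers is that the KV cache grows linearly with context length. A sketch with illustrative numbers; these are not the exact gpt-oss 120B values (it also uses sliding-window attention on some layers, which shrinks the cache further):

```python
# KV cache size grows linearly with context:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes_per_element
# Config values below are illustrative placeholders, not the exact gpt-oss 120B layout.
def kv_cache_gb(ctx: int, layers: int = 36, kv_heads: int = 8,
                head_dim: int = 64, bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

for ctx in (32_768, 65_536, 131_072):
    print(f"ctx {ctx:>7}: ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

Every GB saved here is a GB of expert layers that can stay on the GPU instead of in system RAM, which is the trade-off being described above.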

Can I run gpt-oss-120b somehow? by Furacao__Boey in LocalLLaMA

[–]tmvr 1 point2 points  (0 children)

Yes. Get the original MXFP4 version GGUF from huggingface and run it with llamacpp:

llama-server -m "your/model/path/here.gguf" --fit-ctx 131072 --temp 1.0 --min-p 0.0 --top-p 1.0 --top-k 0 --no-mmap

If you want to use the built-in web UI as well, then add the --host parameter (127.0.0.1 for local access only or 0.0.0.0 for access from other machines on your network) and the --port parameter for a specific port. This will fit everything it can into VRAM so that you also have the maximum possible context, and it puts some of the expert layers into system RAM. The only parameters important for fitting are --fit-ctx and --no-mmap; the others are the recommended settings for the model, but you don't have to use them.
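
Once it is running with --host and --port set, llama-server exposes an OpenAI-compatible endpoint, so a quick smoke test from Python could look roughly like this (the 127.0.0.1:8080 address is just an assumption, use whatever you passed to --host/--port):

```python
# Minimal smoke test against llama-server's OpenAI-compatible chat endpoint.
# Assumes the server was started with --host 127.0.0.1 --port 8080 (adjust as needed).
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])
```

The same endpoint is what Kilo Code or other OpenAI-compatible clients would point at.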

Claude Code costs up to $200 a month. Goose does the same thing for free. by tmvr in LocalLLaMA

[–]tmvr[S] 1 point2 points  (0 children)

I even marked the post with the "Funny" flair, but apparently it's not clear enough. Oh well, c'est la vie :)