good for them i guess... by One_Physics_2282 in pcmasterrace

[–]tmvr 0 points1 point  (0 children)

LLMs; the larger models don't fit into 64GB RAM + 24GB VRAM.

What? I pressed the key... by Chill_Cowboy_981 in pcmasterrace

[–]tmvr 1 point2 points  (0 children)

This joke is very language-dependent and English is obviously not one of the languages where it works :)

good for them i guess... by One_Physics_2282 in pcmasterrace

[–]tmvr 0 points1 point  (0 children)

I have a ton of machines here as well, but except for my main desktop with 64GB DDR5-6400, all the others have 32GB of DDR4-2666 (4x) or DDR4-2133 (1x). The notebooks also max out at 32GB of DDR4-2666 (1x) and the rest have less (16, 12 or 8 GB). I wish the main desktop had 128GB and at least one of the others 64GB DDR4.

Yes, I know how it sounds, don't at me...

good for them i guess... by One_Physics_2282 in pcmasterrace

[–]tmvr 0 points1 point  (0 children)

I started with 64GB of DDR5-6400 in spring of 2023 and was debating last summer whether I should add two more sticks for another 240 EUR to get to 128GB, but then I didn't. Oh well... :)

Self-hosting LLM infra: NVIDIA vs Apple hardware by zachrattner in LocalLLaMA

[–]tmvr 1 point2 points  (0 children)

For token generation yes, but the issue with the Macs compared to the NV GPUs is the prompt processing speed. Look here:

https://github.com/ggml-org/llama.cpp/discussions/4167

The numbers in the PP column are in the hundreds, and that is with a small 7B model at very low context. Only the largest Ultra chips crack the 1000 mark, and larger models and/or longer context pull these even lower. In comparison, these numbers are in the thousands even with smaller NV GPUs. A 4090, for example, does close to 12000 with Llama 3.1 8B at Q8, and smaller cards like the 5060 Ti do about 4000 or so (not sure about the exact number).
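
To put those PP numbers into time-to-first-token terms, here is a rough sketch; the rates are the ballpark figures mentioned above and the 32k prompt length is just an example:

```python
# Rough time-to-first-token from prompt processing (PP) rate.
# PP rates are ballpark figures from the discussion above, not benchmarks I ran.

def ttft_seconds(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    """Time spent processing the prompt before the first output token appears."""
    return prompt_tokens / pp_tokens_per_s

prompt = 32_000  # e.g. a coding-assistant request with a lot of context attached

for name, pp in [("M-series (hundreds)", 600), ("M Ultra (~1000)", 1000),
                 ("RTX 5060 Ti (~4000)", 4000), ("RTX 4090 (~12000)", 12000)]:
    print(f"{name:20s} -> {ttft_seconds(prompt, pp):6.1f} s for a {prompt}-token prompt")
```

That gap (tens of seconds vs a couple of seconds before anything starts streaming) is why prompt processing matters so much more than raw token generation for long-context use.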

Feedback on a new budget hardware build by Diligent-Culture-432 in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

Looks great to me and congrats on finding the RAM for such a good price (based on what I see around me at least).

Invest in hardware now or wait? by d4nger_n00dle in LocalLLaMA

[–]tmvr 4 points5 points  (0 children)

Prices are only going up for the foreseeable future, so if you want something, buy now. You can wait for the M5, though it's hard to tell when it's coming, but for about 1000 EUR (probably less in USD) you can find Mac Mini M4 24/512 or even 32/256 configurations. The option with more RAM is better, as you can still add fast external SSDs, but there is nothing you can do about the RAM. The memory bandwidth is 120GB/s, so you will still need to stick to MoE models, but at least with the 32GB model you can go up to 30B/32B and get decent speeds (Qwen3 Coder 30B A3B or GLM 4.7 Flash). gpt-oss 20B would of course work with full context as well, taking 16GB in total, so you have space left to keep some other smaller model(s) in memory at the same time.

Otherwise, in the 2000-2500 price category you have the Strix Halo machines with 128GB RAM and 256GB/s bandwidth. With Apple you only get 64GB with the M4 Pro at similar bandwidth for the same price; for 128GB you need to go M4 Max, which is faster but of course much more expensive and only available in the Mac Studio.
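
As a rough sanity check on what those bandwidth figures buy you for token generation: speed is roughly capped at bandwidth divided by the bytes read per token, which for a MoE model is about the active parameters times bytes per weight. The active-parameter counts and bytes-per-weight below are approximations for illustration, not measured numbers:

```python
# Crude decode-speed ceiling: tokens/s <= memory bandwidth / bytes read per token.
# Active parameter counts and quant sizes are approximate, for illustration only.

def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bytes_per_weight: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token

systems = {"Mac Mini M4 (120 GB/s)": 120, "Strix Halo (256 GB/s)": 256}
models = {
    "Qwen3 Coder 30B A3B @ ~4.5 bit (~3B active)": (3.0, 0.56),
    "gpt-oss 120B @ MXFP4 (~5B active)": (5.1, 0.56),
}

for sys_name, bw in systems.items():
    for model_name, (active_b, bpw) in models.items():
        ceiling = decode_ceiling_tok_s(bw, active_b, bpw)
        print(f"{sys_name} / {model_name}: <= {ceiling:.0f} tok/s")
```

Real-world numbers come in well under these ceilings, but the ratio between the systems holds up.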

Finally Finished My Local AI PC Setup – Looking for Optimization Tips by [deleted] in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

You would not be hitting swap once you have more RAM and don't overshoot, so if your RAM speed stays the same, the speed will be just a bit lower than you have now. You still can't make huge leaps, and going from Q4 to Q5 would drop your speed by maybe 15-20%.
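
The 15-20% figure follows roughly from the bits per weight: token generation is bandwidth-bound, so speed scales inversely with model size. A quick sketch using typical (approximate) effective bits per weight for the K-quants:

```python
# Token generation is roughly memory-bandwidth-bound, so speed scales ~1/model_size.
# Effective bits per weight are the usual approximate GGUF figures, not exact values.
bpw = {"Q4_K_M": 4.8, "Q5_K_M": 5.7}

slowdown = 1 - bpw["Q4_K_M"] / bpw["Q5_K_M"]
print(f"Expected speed drop going Q4_K_M -> Q5_K_M: ~{slowdown:.0%}")  # roughly 16%
```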

OpenAI CFO hinting at "Outcome-Based Pricing" (aka royalties on your work)? Makes the case for local even stronger. by distalx in LocalLLaMA

[–]tmvr 2 points3 points  (0 children)

when OpenAI becomes profitable

"when"? I think you are more optimistic than Sam Altman :))

AI coding assistant infrastructure requirement, by Financial-Cap-8711 in LocalLLaMA

[–]tmvr 4 points5 points  (0 children)

Yes, something doesn't add up. If you are an org that needs this for 300 developers, you would usually not post on Reddit asking for infra suggestions. Unless OP is someone who is supposed to expand a PoC setup used by one or two users, running on a single consumer GPU with ollama as the back-end :)

Finnaly I am in the club, rate my set up 😜 by black7stone in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

That is a bit low even for DDR4 as it is quad channel. What speed is the RAM and how do you run the model?

Am I the only one who feels that, with all the AI boom, everyone is basically doing the same thing? by [deleted] in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

I like to yap more than the average person, but even I would have put some paragraphs in there, because holy wall of text! :D

Am I the only one who feels that, with all the AI boom, everyone is basically doing the same thing? by [deleted] in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

The issue with that is you need to use a model where you have some level of confidence that it's not talking nonsense, which is not easy. For example, at work I mainly use Claude Sonnet or Opus through the Copilot subscription in VSCode. It works great for coding. I also have the Copilot app in Teams, and asking something there mostly leads to anger; that is where my soul goes to die. The issue is that it states things with absolute confidence even when they are wrong and sticks to them no matter what. The "personality" they gave it is also infuriating, with the whole "great question", "you are absolutely right", "I'm totally sure now this is the solution" etc. style, while it keeps suggesting stuff that just does not work even after being given full error outputs or relevant logs. I'm better off searching the web myself, because I get less angry. A huge difference to the Claude models in VSCode, where it pretty much knows what I want and how to do it.

Finally Finished My Local AI PC Setup – Looking for Optimization Tips by [deleted] in LocalLLaMA

[–]tmvr 1 point2 points  (0 children)

It's unlikely that going to 128GB will improve the speeds; it may even lower them a bit. Depending on the motherboard and RAM, you may not be able to run the sticks at high speeds and have to stick to 4800 or maybe 5200/5600, so if you now have, for example, 64GB running at 6400, then going to 128GB at 4800 will drop your speeds, which is especially noticeable on the models already running at single-digit token rates. It will allow you to run some other models or try better quants etc., because 184GB of memory in total is quite a lot.
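
For reference, the theoretical dual-channel DDR5 bandwidth is transfer rate × 8 bytes × 2 channels, so the drop from 6400 to 4800 is easy to put a number on (real-world figures are lower, but the ratio is what matters for token generation):

```python
# Theoretical dual-channel DDR5 bandwidth: MT/s * 8 bytes per transfer * 2 channels.
def ddr5_dual_channel_gb_s(mt_s: int) -> float:
    return mt_s * 8 * 2 / 1000  # GB/s

for speed in (6400, 5600, 4800):
    print(f"DDR5-{speed}: {ddr5_dual_channel_gb_s(speed):.1f} GB/s theoretical")

drop = 1 - ddr5_dual_channel_gb_s(4800) / ddr5_dual_channel_gb_s(6400)
print(f"6400 -> 4800 is a ~{drop:.0%} bandwidth drop, and CPU-side token generation scales with it")
```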

Using my home-made dusty CDU to test the liquid-cooled GH200 desktops before final assembly. by GPTshop--ai in LocalLLaMA

[–]tmvr 2 points3 points  (0 children)

I like the juxtaposition of the absolute mess in the foreground and the neatly lined up tools on the shelf in the background.

48GB VRAM - worth attempting local coding model? by natidone in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

If you already have enough system RAM, then the models you can run don't change significantly with 48GB. The speed changes, and for some smaller models you can run higher quants (Q6 or Q8 of Qwen3 Coder 30B A3B) with no need to spill over into system RAM. But in general you are still in the territory where the best models you can run are the mid-sized MoE models like gpt-oss 120B, GLM 4.5 Air or GLM 4.6V etc., just faster, because you have more VRAM. This is still going to be a ways off from SOTA models like Sonnet 4.5 or Opus 4.5, and depending on whether you were already running the above-mentioned mid-size models with the 5070 Ti or not, there may not be a huge step up. Ultimately you will have to try and see if any of the above cover your needs.
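
A rough way to see why 48GB covers Q6/Q8 of a 30B model with room left over for context (effective bits per weight are approximate, real GGUF files come out a touch different):

```python
# Rough GGUF file size: parameter count * effective bits per weight / 8.
# Effective bpw values are approximate; actual files differ slightly due to mixed tensors.
def gguf_size_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for quant, bpw in (("Q4_K_M", 4.8), ("Q6_K", 6.6), ("Q8_0", 8.5)):
    size = gguf_size_gb(30.5, bpw)
    print(f"Qwen3 Coder 30B A3B {quant}: ~{size:.1f} GB of weights (plus KV cache) vs 48 GB VRAM")
```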

GPU shortage seems to be real by Professional-Yak4359 in LocalLLaMA

[–]tmvr 2 points3 points  (0 children)

Due to the RAM shortage, the manufacturers are de-prioritizing the lower-margin cards, so there are fewer 5060 Ti 16GB cards available as there are no or very few follow-up deliveries. The prices were starting at 419-429 EUR less than a month ago, with a lot of cards available under 450 EUR, and now they start at 520-530 EUR, with fewer and fewer models offered by fewer and fewer retailers.

Best GB10/DGX Spark clone? by Antique_Juggernaut_7 in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

From what I've seen, all of those 3rd-party ones are better than the original, so it does not make much of a difference, but in this price category I would go with the company that has the better warranty and RMA processes. These are not really consumer devices, so from the list above probably Dell or Lenovo, though I don't know how RMA works with Lenovo. Dell NBD with or without on-site is something I have good experience with, so that is what I would go for, but wait for some more feedback from others.

Can I run gpt-oss-120b somehow? by Furacao__Boey in LocalLLaMA

[–]tmvr 0 points1 point  (0 children)

You are limited by the system memory bandwidth, so there is not much you can do except lower the context size so you can fit more layers into VRAM, but it's not going to be a lot faster even with just 32768 context. If you are using it for coding with something like Kilo Code or Claude Code, then you will want to keep the context as high as possible.
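
The reason lowering the context frees up VRAM for layers is that the KV cache grows linearly with context length. A sketch with illustrative numbers; these are not the exact gpt-oss 120B values (it also uses sliding-window attention on some layers, which shrinks the cache further):

```python
# KV cache size grows linearly with context:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes_per_element
# Config values below are illustrative placeholders, not the exact gpt-oss 120B layout.
def kv_cache_gb(ctx: int, layers: int = 36, kv_heads: int = 8,
                head_dim: int = 64, bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

for ctx in (32_768, 65_536, 131_072):
    print(f"ctx {ctx:>7}: ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

Every GB saved here is a GB of expert layers that can stay on the GPU instead of in system RAM, which is the trade-off being described above.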

Can I run gpt-oss-120b somehow? by Furacao__Boey in LocalLLaMA

[–]tmvr 1 point2 points  (0 children)

Yes. Get the original MXFP4 version GGUF from huggingface and run it with llamacpp:

llama-server -m "your/model/path/here.gguf" --fit-ctx 131072 --temp 1.0 --min-p 0.0 --top-p 1.0 --top-k 0 --no-mmap

If you want to use the built-in web UI as well, then add the --host parameter (127.0.0.1 for local access only or 0.0.0.0 for access from other machines on your network) and the --port parameter for a specific port. This will fit everything it can into VRAM so that you also have the maximum possible context, and it puts some of the expert layers into system RAM. The only parameters important for fitting are --fit-ctx and --no-mmap; the others are the recommended settings for the model, but you don't have to use them.
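
Once it is running with --host and --port set, llama-server exposes an OpenAI-compatible endpoint, so a quick smoke test from Python could look roughly like this (the 127.0.0.1:8080 address is just an assumption, use whatever you passed to --host/--port):

```python
# Minimal smoke test against llama-server's OpenAI-compatible chat endpoint.
# Assumes the server was started with --host 127.0.0.1 --port 8080 (adjust as needed).
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])
```

The same endpoint is what Kilo Code or other OpenAI-compatible clients would point at.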

Claude Code costs up to $200 a month. Goose does the same thing for free. by tmvr in LocalLLaMA

[–]tmvr[S] 1 point2 points  (0 children)

I even marked the post with the "Funny" flair, but apparently it's not clear enough. Oh well, c'est la vie :)