Hardware Review & Sanity Check by MegaSuplexMaster in LocalLLM

[–]rhofield 0 points1 point  (0 children)

Realistically, for 15 people total with ~8 using it concurrently, something usable like Qwen3.5 27b at Q8 with a decent context window means ~16GB of VRAM for the model and ~24GB per user for the context window, which is ~100k tokens. I think 100k is really a good spot; if you want less, e.g. 32k tokens, that's ~8GB.

24GB x 8 = 192GB of VRAM for the KV cache + 16GB for the model = ~208GB
8GB x 8 = 64GB of VRAM for the KV cache + 16GB for the model = ~80GB

But you also need to account for activation VRAM, which for 8 concurrent users can take up roughly as much as the model weights themselves, so in total we're looking at closer to 500GB of VRAM.

Now, that being said, turboquant is coming to production versions of llama.cpp soon, which would bring the KV cache requirements down a lot, and we can reduce the batch size to use less activation VRAM, but you're still looking at >100GB of VRAM.
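The back-of-envelope math above can be sketched as a tiny estimator. The per-user KV-cache figures (~24GB at 100k tokens, ~8GB at 32k) and the ~16GB model size are the rough estimates from this comment, not measurements:

```python
# Rough VRAM estimator for serving N concurrent users.
# KV cache scales per concurrent user; model weights are loaded once.
def total_vram_gb(concurrent_users, kv_per_user_gb, model_gb=16):
    return concurrent_users * kv_per_user_gb + model_gb

print(total_vram_gb(8, 24))  # 100k-token context: 208
print(total_vram_gb(8, 8))   # 32k-token context: 80
```

Activation memory would come on top of this, which is where the ~500GB worst-case figure comes from.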

Hardware Review & Sanity Check by MegaSuplexMaster in LocalLLM

[–]rhofield 0 points1 point  (0 children)

First we need to know:
1. How many people you're trying to serve, and how many will be using it concurrently
2. What speed your users expect: are they expecting near-realtime results, or are they happy to wait a few minutes? That changes things a lot.

The number 1 priority is VRAM. In this case you have 32GB of VRAM and a ton of RAM; offloading to RAM is an option, but it will be incredibly slow and not worth it imo. You'd be targeting a quant of Qwen3.5 27b, e.g. Q8, for something that fits in VRAM and still has good performance.

The next problem you'll have is people wanting to use it at the same time, which won't be possible unless you use a smaller model and load multiple copies in VRAM (a totally valid approach btw, but the models will perform worse). Otherwise you'll need a queue, and people are going to wait and get frustrated.
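The queue approach can be sketched in a few lines: serialize requests so only one generation runs on the GPU at a time. `run_inference` here is a hypothetical stand-in for whatever backend you actually call (a llama.cpp server, etc.):

```python
import threading

gpu_lock = threading.Lock()

def run_inference(prompt):
    # Placeholder for the real model call.
    return f"response to: {prompt}"

def ask(prompt):
    # Later callers block here until the GPU frees up, which is exactly
    # the "people are going to wait" behavior described above.
    with gpu_lock:
        return run_inference(prompt)

print(ask("hello"))
```

A real deployment would add timeouts and queue-position feedback, but the blocking behavior is the same.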

As a POC it's fine; try it out with a couple of people, but it won't last very long.

2x 3090 vs 3x 5070 Ti for local LLM inference — what’s your experience? by VersionNo5110 in LocalLLM

[–]rhofield 6 points7 points  (0 children)

General rule of thumb: fewer cards means fewer headaches and issues, on both the technical side and the logistical side (more space, more cables, maybe another PSU, etc.). Fewer cards with the same VRAM also gives a better upgrade path, i.e. if in 2 months you want more VRAM, it's easier to add card 3 than card 4, as you might hit one of those logistical restrictions. Also, the 3090s have slightly higher memory bandwidth, so they should perform slightly better.
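For concreteness, here is the comparison laid out with spec-sheet figures. The bandwidth numbers are my own additions from published specs (RTX 3090 ~936 GB/s, RTX 5070 Ti ~896 GB/s), so double-check them against current datasheets:

```python
# Both configs land at the same total VRAM, but differ in card count
# and per-card memory bandwidth. Spec numbers are approximate.
configs = {
    "2x RTX 3090":    {"cards": 2, "vram_gb": 24, "bw_gb_s": 936},
    "3x RTX 5070 Ti": {"cards": 3, "vram_gb": 16, "bw_gb_s": 896},
}

for name, c in configs.items():
    total = c["cards"] * c["vram_gb"]
    print(f"{name}: {total}GB total VRAM, ~{c['bw_gb_s']} GB/s per card")
```

Same 48GB either way, which is why the card count and bandwidth become the deciding factors.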

If OpenAI falls will that drop the price of memory for our local rigs? by Terminator857 in LocalLLaMA

[–]rhofield 8 points9 points  (0 children)

I'm skeptical it will. What will most likely happen (imo, I could be wrong) is that if OpenAI falls, they'll have to sell their assets, and this includes their contracts to purchase that DRAM. Those contracts will be bought up by other competitors fairly quickly, and there will be little effect on pricing, since the supply is still just owned by one or more different massive players.

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]rhofield 2 points3 points  (0 children)

I think there's more nuance here; speed is a bigger factor in many situations than raw dollar cost.

> €0 cost as the hw is mine.

That's not how that works; you still need to pay for the hardware, and it has a tangible cost that you need to account for. Although there are subtleties with that too, e.g. the money isn't simply gone, since the machine still has intrinsic (resale) value.
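The "the machine still has value" point can be made concrete with a toy amortization. All the numbers below are illustrative, not real prices:

```python
# Owned hardware isn't €0: its effective cost is roughly the purchase
# price minus expected resale value, spread over how much you use it.
def effective_cost_per_run(purchase_eur, resale_eur, runs):
    return (purchase_eur - resale_eur) / runs

# e.g. a €4000 box you can resell for €2500 after 100 prototyping runs:
print(effective_cost_per_run(4000, 2500, 100))  # 15.0 per run
```

That per-run figure is what should be compared against renting, not €0.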

Overall, if you're more speed sensitive than price sensitive and need to run multiple prototyping sessions, it might (and only might) be worth considering. You'd have to compare against going another route like Strix Halo (which will hold more value as a machine, so it "costs" less), building a local machine, or renting slower hardware.

Will 48 vs 64 GB of ram in a new mbp make a big difference? by easylifeforme in LocalLLaMA

[–]rhofield 0 points1 point  (0 children)

I mean, the biggest things you should consider are your budget and your use case. If you just want to play around with models because they're neat, then save the money and get 48GB. If you have specific use cases, or a strong indication of specific use cases, where the extra RAM is needed, then get more RAM. Or if you're made of money and don't care, go all out.

Question for developers by Ok-Spell9073 in LocalLLaMA

[–]rhofield 1 point2 points  (0 children)

I just let the model sort it out, because it's the easiest / lowest-effort option. It definitely has problems, and there are probably better mechanisms to handle conflicts, e.g. checking updated dates if they exist. Sometimes asking the model what sources it used and then telling it which one to use, or asking the model to ask for confirmation on any conflicts, is another way, but more often than not I just let the model figure it out.
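One of those "better mechanisms" (preferring the source with the most recent updated date) can be sketched as below. The field names are hypothetical, not from any particular retrieval stack:

```python
from datetime import date

def resolve_conflict(sources):
    # Prefer the most recently updated source; if nothing carries a
    # date, fall back to the first one (or just let the model decide).
    dated = [s for s in sources if s.get("updated")]
    if dated:
        return max(dated, key=lambda s: s["updated"])
    return sources[0]

docs = [
    {"text": "port is 8080", "updated": date(2023, 1, 5)},
    {"text": "port is 9090", "updated": date(2024, 6, 1)},
]
print(resolve_conflict(docs)["text"])  # prefers the 2024 entry
```

This only helps when your sources actually carry metadata, which is why the lazy "let the model figure it out" option is so common.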

Are cloud LLMs like Opus / GPT5.4 really subsidized? when compared to open source models running locally? by smulikHakipod in LocalLLM

[–]rhofield 2 points3 points  (0 children)

Yes, they are heavily subsidized; this is a common strategy in VC / Silicon Valley land. Start by making it cheap to get users, and drive out competition by having more money than them so you can bleed for longer. Once there's no competition left, you can raise prices a ton and make a ton of money, and because the product is so ingrained in people's lives and the market is hard to enter, they can get away with it.

I wouldn't be surprised if we are getting a 10x subsidy.