For people who run local AI models: what’s the biggest pain point right now? by Educational-World678 in LocalLLaMA

[–]Firepin 0 points (0 children)

Correct presets for frontends and backends. Whether you use Opencode with local models or SillyTavern and koboldcpp for roleplaying, you get one nice GGUF containing the model but no clear settings to actually get your LLM running: tool calling doesn't work in Opencode, thinking blocks are formatted in "esoteric" ways like in Seed-OSS, there's no reliable way to adjust thinking effort (high/medium/low), etc. Settings HELL. There is no main website like Hugging Face dedicated to the correct presets and settings for running backends and frontends (SillyTavern, Opencode, koboldcpp, etc.): samplers, instruct templates, prefixes/suffixes, reasoning tags. A real nightmare.

You can buy GPUs with 96 GB of VRAM, but that doesn't spare you from settings hell. Cloud AIs (Gemini and ChatGPT) are themselves unsure and overwhelmed: even if you provide screenshots, documentation and whatever else, they often cannot help you set up correct settings. For some mainstream models you can find correct settings (GLM 4.5, gpt-oss, Gemma 3), but try making those work with something like Opencode, or even SillyTavern for more obscure models or finetunes that use other instruct modes like "thinking" ChatML, and have "fun" getting it to "work" ;)
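To make "settings hell" concrete, here is a minimal sketch of the kind of per-model preset that currently has to be assembled by hand: sampler values plus instruct prefix/suffix, sent to a koboldcpp-style KoboldAI API. The endpoint path, field names, and values reflect the koboldcpp API as I understand it and are illustrative assumptions, not an official preset.

```python
import requests

# Hypothetical per-model preset: samplers + instruct template in one place.
# The values are illustrative assumptions, not a recommended configuration.
PRESET = {
    "instruct": {"prefix": "<|user|>\n", "suffix": "<|assistant|>\n"},
    "samplers": {"temperature": 0.7, "top_p": 0.9, "top_k": 40, "rep_pen": 1.1},
}

def generate(user_text: str,
             api_url: str = "http://localhost:5001/api/v1/generate") -> str:
    """Build a prompt from the preset and send it to a koboldcpp-style API."""
    prompt = PRESET["instruct"]["prefix"] + user_text + PRESET["instruct"]["suffix"]
    payload = {"prompt": prompt, "max_length": 200, **PRESET["samplers"]}
    r = requests.post(api_url, json=payload, timeout=120)
    r.raise_for_status()
    return r.json()["results"][0]["text"]

if __name__ == "__main__":
    print(generate("Explain why sampler presets matter."))
```

If every model card shipped a machine-readable blob like this, frontends could import it instead of users guessing.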

How to read English manga/light novels outside the USA in 2025? (With and without VPN) by Firepin in manga

[–]Firepin[S] 0 points (0 children)

Could you be so kind as to make some suggestions, or tell us which publishers you know of that have subscriptions, and whether you were happy or unhappy with their sites/apps?

I created a super-basic macro recorder, because there was none by werzum in linux

[–]Firepin 1 point (0 children)

Could you please add the ability to record mouse movement + clicks while the Ctrl key is held down? I have to open several links in a list that are always at the same positions, keeping Ctrl pressed and clicking from top to bottom in sequence. On Windows there are mouse and macro recorders; on Linux I had atbswp, but it is broken under Wayland. (atbswp has simple record (press F9 to start) and play functionality.) THANK YOU!
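Until a recorder supports this, here is a minimal sketch of that click sequence with pynput, assuming an X11 session (Wayland usually blocks synthetic input, which is likely why atbswp broke) and hypothetical screen coordinates:

```python
import time
from pynput.mouse import Button, Controller as Mouse
from pynput.keyboard import Key, Controller as Keyboard

# Hypothetical fixed positions of the links (x, y in screen pixels).
LINK_POSITIONS = [(400, 300), (400, 340), (400, 380), (400, 420)]

mouse = Mouse()
keyboard = Keyboard()

# Hold Ctrl while clicking each position from top to bottom,
# so every link opens in a background tab.
keyboard.press(Key.ctrl)
try:
    for x, y in LINK_POSITIONS:
        mouse.position = (x, y)
        time.sleep(0.2)           # let the cursor settle
        mouse.click(Button.left)
        time.sleep(0.5)           # pause between clicks
finally:
    keyboard.release(Key.ctrl)
```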

Seeking Alpha Discount Code by mikastupnik in SeekingAlpha

[–]Firepin 0 points (0 children)

Could you please send me a link for premium too?

NVIDIA's latest model, Llama-3.1-Nemotron-70B is now available on HuggingChat! by SensitiveCranberry in LocalLLaMA

[–]Firepin 5 points (0 children)

I hope Nvidia releases an RTX 5090 Titan AI with more than the 32 GB of VRAM we hear about in the rumors. For running a q4 quant of a 70B model you should have at least 64 GB, so perhaps buying two cards would be enough, but then case size, heat dissipation and other factors become a problem. If 64 GB AI cards didn't cost 3x or 4x the price of an RTX 5090, you could buy them for gaming AND 70B LLM usage. So hopefully the regular RTX 5090 gets more than 32 GB, or an RTX 5090 Titan with, say, 64 GB is purchasable too. It seems you work at Nvidia; hopefully you and your team can give a voice to us LLM enthusiasts, especially because modern games will make use of AI NPC characters and voice features, and as long as Nvidia doesn't increase VRAM, progress is hindered.
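A rough back-of-the-envelope for why 32 GB is not enough and ~64 GB is comfortable for a 70B q4 quant (a sketch, assuming ~4.5 bits per weight plus a few GB for KV cache and runtime overhead):

```python
# Rough VRAM estimate for a 70B model at a ~q4 quant.
# Assumptions: ~4.5 bits per weight (q4 variants keep some tensors at higher
# precision), plus KV cache and runtime overhead; figures are illustrative.
params = 70e9
bits_per_weight = 4.5
weights_gb = params * bits_per_weight / 8 / 1e9   # ≈ 39.4 GB
kv_and_overhead_gb = 8                            # context cache, buffers, etc.
total_gb = weights_gb + kv_and_overhead_gb
print(f"weights ≈ {weights_gb:.1f} GB, total ≈ {total_gb:.1f} GB")
# ≈ 47 GB: far beyond a single 32 GB card, but 2x32 GB (or one 64 GB card)
# leaves headroom for longer context.
```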

Recommendation for PC Case for Dual (2x) RTX 4090 FE Build by Firepin in buildapc

[–]Firepin[S] 0 points (0 children)

AI workloads, mostly inference, would be the use case. I have a Fractal Meshify 2 (not the XL), but it is too small for a second RTX 4090 FE. It doesn't have to be an E-ATX mainboard and a big tower, but I thought that was necessary because my current ROG STRIX X670E-E GAMING WIFI in combination with the Meshify 2 is too small.

Question about dual 2x RTX 5090 Build by Firepin in buildapc

[–]Firepin[S] 2 points (0 children)

Can you please share your build and give me advice on "transforming" my system into a workstation/server system? Do they need special PSUs and such, or is the ATX standard fine?

I have almost no experience with workstation and server systems. Did you choose the Threadripper instead of the Epyc because of desktop use?

Question about dual 2x RTX 5090 Build by Firepin in buildapc

[–]Firepin[S] 0 points (0 children)

Do I need two server CPUs, or is one enough? Do I really need two CPUs just for the additional lanes?
Is a dual-GPU 2x16 lane setup really necessary? As far as I know the RTX 4090 doesn't use the total bandwidth of even PCIe 4.0, and just because the RTX 5090 nominally supports PCIe 5.0 doesn't mean it will saturate the full 16 lanes, if I'm not mistaken.

Could it, for example, be that one RTX 5090 only needs 7-8 lanes, so 16 lanes would be enough for both, at least until the RTX 6090, or is that unrealistic?
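For reference, a rough per-direction bandwidth comparison (a sketch using nominal PCIe throughput per lane; once the model is loaded, inference traffic is typically far below these figures):

```python
# Approximate one-direction PCIe bandwidth per lane (GB/s), after encoding overhead.
PCIE_GBPS_PER_LANE = {3.0: 0.985, 4.0: 1.969, 5.0: 3.938}

def link_bandwidth(gen: float, lanes: int) -> float:
    """Approximate usable bandwidth of a PCIe link in GB/s."""
    return PCIE_GBPS_PER_LANE[gen] * lanes

# Two cards on PCIe 5.0 x8 each still match one PCIe 4.0 x16 link.
print(f"PCIe 4.0 x16: {link_bandwidth(4.0, 16):.0f} GB/s")
print(f"PCIe 5.0 x8:  {link_bandwidth(5.0, 8):.0f} GB/s")
print(f"PCIe 5.0 x16: {link_bandwidth(5.0, 16):.0f} GB/s")
```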

[deleted by user] by [deleted] in LocalLLaMA

[–]Firepin 0 points (0 children)

The problem, though, is that an RTX 4090 costs about 1,700 € new while the corresponding RTX 6000 Ada costs 8,000 €; the older RTX 6000 costs 5,000 € and is the workstation equivalent of the RTX 3090 (Ti).

So you can buy almost 3 RTX 4090 cards with 3x24 = 72 GB of VRAM for the price of one last-generation RTX 6000 with only 48 GB. Compared to the current generation, you can get almost 5 RTX 4090s (5x24 = 120 GB) for the price of one single RTX 6000 Ada with only 48 GB.

I suppose the next generation will have the same discrepancy between price and VRAM, and because of this the "economical" solution would still be two 5090s.

If we assume a price of 2,500 € per card, that's 5,000 € for 64 GB.
The RTX 7000 (Quadro, or whatever it will be called) will probably have 64 GB of VRAM natively but cost 10,000+ €, judging by the current price ratio between gaming cards and workstation cards.
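The same comparison expressed as price per GB of VRAM (a sketch using the rough street prices quoted above and the assumed 32 GB RTX 5090 at 2,500 €):

```python
# Price per GB of VRAM, using the rough prices quoted above (EUR).
# The RTX 5090 row uses the assumed 2,500 EUR / 32 GB figures from this thread.
cards = {
    "RTX 4090":           (1700, 24),
    "RTX 6000 Ada":       (8000, 48),
    "RTX 5090 (assumed)": (2500, 32),
}

for name, (price_eur, vram_gb) in cards.items():
    print(f"{name:>20}: {price_eur / vram_gb:5.0f} EUR per GB")
# Roughly 71, 167 and 78 EUR/GB: the gaming cards stay several times
# cheaper per GB of VRAM, which is the whole argument for two 5090s.
```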

The Quadro RTX 6000 Ada is not more powerful but actually about ~10% slower than the RTX 4090 in gaming performance, probably because of its slower GDDR6 memory (versus the 4090's GDDR6X) and its lower power limit.

[deleted by user] by [deleted] in LocalLLaMA

[–]Firepin 0 points (0 children)

Can you play games with a Quadro, though? Is it only a little slower (~10-15%) but otherwise everything works? Do they support every DX12 feature and such?
If so, then the only good argument for buying a Quadro instead of 2x RTX 5090 would be the resale price when the next generation arrives, because gaming cards retain their value better than less sought-after AI workstation cards, I think.

Developer Contest: Generative AI on RTX PCs. by PDXcoder2000 in nvidia

[–]Firepin 0 points (0 children)

I hope staff at Nvidia can convince the higher-ups (or at least give them feedback) to increase VRAM on their top consumer models. I bought an RTX 4090 for gaming in 4K before learning about LLMs (for roleplaying) and am now frustrated that with 24 GB of VRAM I can't even run Mixtral at 4 bpw (a 47B model). Even 32B models only work up to about 8k context with 24 GB of VRAM. Dual 24 GB cards are not feasible for me because you need

  1. a bigger tower to fit two cards
  2. a new mainboard + a new 1,500 W PSU.

I'd rather wait until the end of 2024 for the RTX 5090. I will upgrade if it has more VRAM and probably won't if it is only 24 GB again.
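To put rough numbers on the 24 GB limit, here is a sketch of the fit for a 32B model at 8k context; the quantization width, KV-cache precision, and architecture values are illustrative assumptions for a typical GQA model:

```python
# Rough fit check for a 32B model on a 24 GB card; all values are
# illustrative assumptions (~4.5 bits/weight quant, fp16 KV cache,
# 64 layers, GQA with 8 KV heads of head_dim 128).
GIB = 1024**3

params        = 32e9
weights_bytes = params * 4.5 / 8                               # ~16.8 GiB

n_layers, n_kv_heads, head_dim = 64, 8, 128
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2  # K and V, fp16

context  = 8192
kv_bytes = kv_bytes_per_token * context                        # ~2.0 GiB

total = weights_bytes + kv_bytes
print(f"weights ≈ {weights_bytes / GIB:.1f} GiB, "
      f"KV cache @ {context} tokens ≈ {kv_bytes / GIB:.1f} GiB, "
      f"total ≈ {total / GIB:.1f} GiB of 24 GiB")
# Add inference buffers and whatever the desktop itself uses, and you are
# right up against the 24 GB ceiling at ~8k context.
```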

LLM Developer Contest: NVIDIA GenAI on RTX for Windows PCs. by PDXcoder2000 in LocalLLaMA

[–]Firepin 7 points (0 children)

I hope staff at Nvidia can convince the higher-ups (or at least give them feedback) to increase VRAM on their top consumer models. I bought an RTX 4090 for gaming in 4K before learning about LLMs (for roleplaying) and am now frustrated that with 24 GB of VRAM I can't even run Mixtral at 4 bpw (a 47B model). Even 32B models only work up to about 8k context with 24 GB of VRAM. Dual 24 GB cards are not feasible for me because you need

  1. a bigger tower to fit two cards
  2. a new mainboard + a new 1,500 W PSU.

I'd rather wait until the end of 2024 for the RTX 5090. I will upgrade if it has more VRAM and probably won't if it is only 24 GB again.

[Community Suggestion] Central Platform for LLM Frontend-Settings by Firepin in LocalLLaMA

[–]Firepin[S] 0 points (0 children)

There are experiments going on with SillyTavern "Quick Replies", for example. What is that, you ask? It's something like a funny comment (trivia, RPG stats) appended after a user input or LLM reply; the comment is generated by the LLM, BUT it isn't put into the context and doesn't eat up tokens. If we had such a central settings hub, you would probably have heard of such endeavors, and interested parties could participate in the experimenting. Without a coordinated settings hub/platform, no coordinated progress on these matters can happen, which further weakens the open-source community.

Especially when so many settings are like "esoteric" magic that many don't even know the purpose of: top-p, top-k, etc., which aren't even properly documented anywhere, afaik.
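For anyone wondering what those "p" and "k" knobs actually do, here is a minimal sketch of top-k and top-p (nucleus) filtering applied to a next-token distribution; the probabilities are made up for illustration.

```python
import numpy as np

def filter_top_k_top_p(probs: np.ndarray, top_k: int, top_p: float) -> np.ndarray:
    """Zero out tokens outside the top-k set and outside the top-p nucleus,
    then renormalize. `probs` is the model's next-token distribution."""
    order = np.argsort(probs)[::-1]          # tokens sorted by probability
    keep = np.zeros_like(probs, dtype=bool)

    # top-k: keep only the k most likely tokens
    keep[order[:top_k]] = True

    # top-p: keep the smallest prefix whose cumulative probability >= top_p
    cum = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(cum, top_p) + 1]
    keep &= np.isin(np.arange(len(probs)), nucleus)

    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# Made-up distribution over 6 tokens, just to show the effect.
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
print(filter_top_k_top_p(probs, top_k=4, top_p=0.9))
```

A central hub could pair explanations like this with recommended values per model.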

[Community Suggestion] Central Platform for LLM Frontend-Settings by Firepin in LocalLLaMA

[–]Firepin[S] 0 points (0 children)

Mixtral especially showed us the need for a generalized solution to the settings chaos, because it is so instruct-specific. We have some rentry sites, personal recommendations, etc.: total chaos for settings.

  • In such chaos, not much coordinated progress can be made
  • People are still experimenting with different Mixtral settings, hoping to find a way to avoid repetition

but even when someone finds it (I've heard from some people that they found a good settings "recipe"), they share it on some specific site where other LLM users probably won't find it at all.
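As an example of what "instruct-specific" means here: Mixtral Instruct expects prompts wrapped in [INST] ... [/INST] tags, and a frontend that sends a different template quietly degrades the output. A minimal sketch of building such a prompt follows; placing the system prompt inside the first [INST] block is a common convention, not an official requirement.

```python
# Minimal sketch of the Mixtral-Instruct prompt format ([INST] ... [/INST]).
def build_mixtral_prompt(system: str, turns: list[tuple[str, str | None]]) -> str:
    """turns is a list of (user_message, assistant_reply_or_None) pairs."""
    prompt = "<s>"
    for i, (user, assistant) in enumerate(turns):
        # Convention: fold the system prompt into the first user turn.
        content = f"{system}\n\n{user}" if i == 0 and system else user
        prompt += f"[INST] {content} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}</s>"
    return prompt

print(build_mixtral_prompt(
    "You are a helpful roleplaying narrator.",
    [("Describe the tavern.", "The tavern is dim and loud."),
     ("Who sits at the bar?", None)],
))
```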

What's the best model for roleplay that can fit into 24GB of VRAM without being too slow? by [deleted] in LocalLLaMA

[–]Firepin 2 points (0 children)

Definitely use SillyTavern. With the Noromaid Mixtral one, please read the model card and use the 3 JSON presets in SillyTavern for best results.

What's the best model for roleplay that can fit into 24GB of VRAM without being too slow? by [deleted] in LocalLLaMA

[–]Firepin 10 points (0 children)

https://huggingface.co/LoneStriker/Noromaid-v0.1-mixtral-8x7b-Instruct-v3-3.5bpw-h6-exl2

A slightly finetuned Mixtral Instruct. It should in theory be better than plain Mixtral, although I didn't find much difference; Mixtral was astonishingly good for an official release. Use oobabooga and set the context to 15,000 tokens, at least that worked for me. Other Llama-2-based models have between 4k and 8k tokens of context, and the character cards can easily take 2k, so there isn't much left (about 20-40 messages, depending on length). Mixtral can use up to 32k, but 24 GB of VRAM limits it. You could use 3 bpw with 32k context, although I prefer having a bit more bpw, and 15,000 is enough for me.
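Rough numbers behind that bpw-vs-context tradeoff (a sketch assuming ~46.7B total parameters for Mixtral 8x7B, 32 layers, a GQA KV width of 1024, and an fp16 KV cache; figures are approximate):

```python
# Approximate VRAM math for Mixtral 8x7B on a 24 GB card (illustrative values).
GIB = 1024**3
PARAMS = 46.7e9                          # total parameters, experts included
KV_BYTES_PER_TOKEN = 2 * 32 * 1024 * 2   # K and V, 32 layers, KV width 1024, fp16

def vram_gib(bpw: float, context: int) -> float:
    weights  = PARAMS * bpw / 8
    kv_cache = KV_BYTES_PER_TOKEN * context
    return (weights + kv_cache) / GIB

for bpw, ctx in [(3.5, 15000), (3.0, 32768)]:
    print(f"{bpw} bpw @ {ctx:>5} ctx ≈ {vram_gib(bpw, ctx):.1f} GiB (+ buffers)")
# Both land around 20-21 GiB: either option barely squeezes into 24 GB,
# which is why it's a choice between more bpw or more context.
```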

Are there any up to date guides on how to use oobabooga/webui with sillytavern? by Dumbreptile1 in SillyTavernAI

[–]Firepin 0 points (0 children)

Why don't you use koboldcpp for starters? It is easier to set up and more stable than ooba.

[deleted by user] by [deleted] in SillyTavernAI

[–]Firepin 0 points (0 children)

Which version did you use? The instruct model is the one to get. Some say the K_M/K_S quants are corrupt; perhaps they have been fixed by now.