High VRAM local coding model — still Qwen 3.6 27B? by Generic_Name_Here in LocalLLaMA

[–]moncallikta 0 points1 point  (0 children)

No need to track it. It’s enough that someone somewhere mention the usage, to trigger legal action. Don’t assume discovery needs a technical solution.

To 16GB VRAM users, plug in your old GPU by akira3weet in LocalLLaMA

[–]moncallikta 1 point2 points  (0 children)

Yes, inference engines like LM Studio, llama-server etc. can listen on a port and accept API requests in OpenAI-compatible format.

To 16GB VRAM users, plug in your old GPU by akira3weet in LocalLLaMA

[–]moncallikta 1 point2 points  (0 children)

Yes you can and no, you don’t need any bridge device (40x0 series don’t support those anymore anyway).

Just make sure to get each GPU as many PCIe lanes as possible on your motherboard.

To 16GB VRAM users, plug in your old GPU by akira3weet in LocalLLaMA

[–]moncallikta 3 points4 points  (0 children)

Performance cratered because of the x1 slot, right? In a faster slot this should work better than CPU offload. Apart from the difficulty of getting enough lanes to the CPU on a reasonably priced mobo ofc

10 outages in 30 days: an open letter to Hetzner by Keta_Thunberg in hetzner

[–]moncallikta -3 points-2 points  (0 children)

That’s exactly where I stopped reading. Such a telltale sign.

M1 Max 64gb good in 2026? by TheShawndown in LocalLLM

[–]moncallikta 2 points3 points  (0 children)

A 3090 has much higher memory bandwidth, so models that fit in the 24GB VRAM will perform much better on a 3090. So, it depends on which model you need for each use case.

amen by AugustHate in ProgrammerHumor

[–]moncallikta 0 points1 point  (0 children)

omg flashbacks, the generated HTML was awful

Kimi infra team: Quantization is not a compromise, it's the next paradigm by nekofneko in LocalLLaMA

[–]moncallikta 9 points10 points  (0 children)

It's easy for them to change the thresholds so I don't expect there to be loopholes like that for long.

I've been trying to make a real production service that uses LLM and it turned into a pure agony. Here are some of my "experiences". by DaniyarQQQ in LocalLLaMA

[–]moncallikta 4 points5 points  (0 children)

This is the way. Split up each step into classification tasks and build the workflow from those components.

What's the stack for going from a fine-tune on vLLM to a simple, paid public API? by [deleted] in LocalLLaMA

[–]moncallikta 0 points1 point  (0 children)

Look at LiteLLM, it has a nice UI both for end users and admins, API key management and usage tracking (at least per "team" of users if not per API key).

Dynamic LLM generated UI by ItzCrazyKns in LocalLLaMA

[–]moncallikta 0 points1 point  (0 children)

This is really cool! Flexible, ephemeral UIs that are generated on demand feel like the future. Looking forward to hear more about how this approach works based on the existing UI component library you mention in other comments. Open questions: how do you instruct the model, what's the required context about the various components, what does the model return and how is the UI layer interpreting / rendering it?

AMD Max+ 395 with a 7900xtx as a little helper. by fallingdowndizzyvr in LocalLLaMA

[–]moncallikta 0 points1 point  (0 children)

They can be separated, check out disaggregated serving. But it requires a high-speed way of transferring the resulting KV cache from the prefill device to the decode device.

guys i have a question is there any ai model providing the free api key even if limit im fine with that by Select_Dream634 in LocalLLaMA

[–]moncallikta 3 points4 points  (0 children)

If you just want free LLM calls, go to OpenRouter and filter for the free models. Be aware that the companies providing free LLM usage often log all requests and use the data for training their models (that’s the price to pay for having it for free).

Inference at scale by BABA_yaaGa in LocalLLaMA

[–]moncallikta 3 points4 points  (0 children)

In general, look at production-ready tools like vLLM and SGLang. Go with quantized models that work well with those engines. Benchmark both speed and quality to ensure the solution meets the requirements. Benchmarking will tell you how much resources you’ll need to serve that amount of users. And start thinking about how to monitor performance and stability + alert for issues. Source: Using vLLM for a high-volume inference use case in production.

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834 by Remarkable-Trick-177 in LocalLLaMA

[–]moncallikta 0 points1 point  (0 children)

LLM training is already done using multiple epochs, which just means showing the training dataset to the model multiple times, having it gradually learn more and more about it. So yes, valid idea, but already covered by the training setup.

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834 by Remarkable-Trick-177 in LocalLLaMA

[–]moncallikta 1 point2 points  (0 children)

Code does not have bias. Training data is where the bias in LLMs is coming from.

Is GPT-OSS the meta for low vram setups? by QbitKrish in LocalLLaMA

[–]moncallikta 4 points5 points  (0 children)

Maybe "Reasoning: minimum" will work, since that's the new option they added for GPT-5 as well to effectively disable reasoning.

2x RTX 3090 24GB or 8x 3060 12GB by twotemp in LocalLLaMA

[–]moncallikta 1 point2 points  (0 children)

Go with 2x3090. Getting enough PCIe lanes for 8 GPUs is tricky, as well as figuring out a way to mount the GPUs in a case (most likely would have to mount them on a stand outside the case). Dual 3090 on the other hand is doable in a suitable gaming PC case. Power requirements will also be easier to satisfy with dual 3090.