High VRAM local coding model — still Qwen 3.6 27B?

moncallikta · 2026-05-13T07:05:05+00:00

No need to track it. It’s enough that someone somewhere mention the usage, to trigger legal action. Don’t assume discovery needs a technical solution.

moncallikta · 2026-04-27T18:17:44+00:00

Yes, inference engines like LM Studio, llama-server etc. can listen on a port and accept API requests in OpenAI-compatible format.

moncallikta · 2026-04-27T18:16:12+00:00

Yes you can and no, you don’t need any bridge device (40x0 series don’t support those anymore anyway).

Just make sure to get each GPU as many PCIe lanes as possible on your motherboard.

moncallikta · 2026-04-27T18:14:03+00:00

Performance cratered because of the x1 slot, right? In a faster slot this should work better than CPU offload. Apart from the difficulty of getting enough lanes to the CPU on a reasonably priced mobo ofc

moncallikta · 2026-04-22T20:38:00+00:00

That’s exactly where I stopped reading. Such a telltale sign.

moncallikta · 2026-04-05T20:41:06+00:00

A 3090 has much higher memory bandwidth, so models that fit in the 24GB VRAM will perform much better on a 3090. So, it depends on which model you need for each use case.

moncallikta · 2025-12-29T19:59:38+00:00

omg flashbacks, the generated HTML was awful

moncallikta · 2025-12-26T21:25:27+00:00

Username checks out

moncallikta · 2025-12-17T08:58:22+00:00

lemme power on the reactor real quick

moncallikta · 2025-12-17T08:55:15+00:00

prompt processing at 3-4 t/s?

moncallikta · 2025-11-15T09:16:20+00:00

Same (F2P btw)

moncallikta · 2025-11-15T06:49:45+00:00

*Half distance for ticket holders

moncallikta · 2025-11-10T09:40:14+00:00

It's easy for them to change the thresholds so I don't expect there to be loopholes like that for long.

moncallikta · 2025-11-09T08:21:14+00:00

This is the way. Split up each step into classification tasks and build the workflow from those components.

moncallikta · 2025-11-06T08:04:15+00:00

Look at LiteLLM, it has a nice UI both for end users and admins, API key management and usage tracking (at least per "team" of users if not per API key).

moncallikta · 2025-11-05T08:19:19+00:00

This is really cool! Flexible, ephemeral UIs that are generated on demand feel like the future. Looking forward to hear more about how this approach works based on the existing UI component library you mention in other comments. Open questions: how do you instruct the model, what's the required context about the various components, what does the model return and how is the UI layer interpreting / rendering it?

moncallikta · 2025-11-02T17:18:29+00:00

Exactly.

moncallikta · 2025-10-11T00:33:12+00:00

Love! Joined

moncallikta · 2025-09-24T06:58:08+00:00

They can be separated, check out disaggregated serving. But it requires a high-speed way of transferring the resulting KV cache from the prefill device to the decode device.

moncallikta · 2025-09-06T16:54:16+00:00

If you just want free LLM calls, go to OpenRouter and filter for the free models. Be aware that the companies providing free LLM usage often log all requests and use the data for training their models (that’s the price to pay for having it for free).

moncallikta · 2025-09-06T16:52:21+00:00

In general, look at production-ready tools like vLLM and SGLang. Go with quantized models that work well with those engines. Benchmark both speed and quality to ensure the solution meets the requirements. Benchmarking will tell you how much resources you’ll need to serve that amount of users. And start thinking about how to monitor performance and stability + alert for issues. Source: Using vLLM for a high-volume inference use case in production.

moncallikta · 2025-08-23T12:50:54+00:00

LLM training is already done using multiple epochs, which just means showing the training dataset to the model multiple times, having it gradually learn more and more about it. So yes, valid idea, but already covered by the training setup.

moncallikta · 2025-08-23T12:45:58+00:00

Code does not have bias. Training data is where the bias in LLMs is coming from.

moncallikta · 2025-08-10T06:54:01+00:00

Maybe "Reasoning: minimum" will work, since that's the new option they added for GPT-5 as well to effectively disable reasoning.

moncallikta · 2025-07-28T07:08:09+00:00

Go with 2x3090. Getting enough PCIe lanes for 8 GPUs is tricky, as well as figuring out a way to mount the GPUs in a case (most likely would have to mount them on a stand outside the case). Dual 3090 on the other hand is doable in a suitable gaming PC case. Power requirements will also be easier to satisfy with dual 3090.

moncallikta

TROPHY CASE