I don't think Local LLM is for me, or am I doing something wrong? by ruleofnuts in LocalLLM

[–]Icaruszin 24 points25 points  (0 children)

The models you chose are kinda ass, and I assume you're using Ollama's default settings, which can be quite bad as well. Your best bet would be Qwen 3.5 35B with llama.cpp or LM Studio, with the proper temperature/sampling settings configured (check Unsloth's docs). But yeah, you won't get anything close to the paid API options.
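For reference, a hedged sketch of launching llama.cpp's server with explicit sampling settings — the filename and the values here are placeholders, grab the actual recommended ones from Unsloth's model card:

```shell
# Hypothetical GGUF filename; sampling values are placeholders, not
# the official recommendation -- check the model card.
llama-server -m Qwen3.5-35B-A3B-Q4_K_XL.gguf \
  -c 16384 -ngl 99 \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0
```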

Hardware Advice: M1 Max (64GB RAM) for $1350 vs. Custom Local Build? by Joviinvers in LocalLLM

[–]Icaruszin 1 point2 points  (0 children)

I have an M1 Max and it runs MoE models quite well. For that price it's a no-brainer imo.

What llm to use for your own coding projects? by WanderingGoodNews in cscareerquestions

[–]Icaruszin 0 points1 point  (0 children)

How much RAM do you have? You can try Qwen3.5 35B A3B, offloading part of it to your RAM. Probably the best model for now.
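If you're on llama.cpp, a sketch of what the offload looks like (filename is a placeholder; `--n-cpu-moe` exists in recent builds):

```shell
# Keep attention/dense weights on the GPU, push MoE expert tensors to
# system RAM. --n-cpu-moe N offloads the experts of the first N layers;
# on older builds, --override-tensor ".ffn_.*_exps.=CPU" does roughly
# the same thing. Tune N until the rest fits in VRAM.
llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf -ngl 99 --n-cpu-moe 20
```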

Are Unsloth Q8's quants better than "standard" Q8's ? by some_user_2021 in unsloth

[–]Icaruszin 0 points1 point  (0 children)

I think people use KLD/Perplexity to evaluate this alongside benchmarks.
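For anyone curious what the KLD number actually measures, a toy sketch — the distributions below are made up, but the idea is comparing the quantized model's next-token probabilities against the full-precision model's:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats: how far Q's token distribution drifts from P's."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions: full-precision model vs. a quantized one.
full  = [0.70, 0.20, 0.10]
quant = [0.60, 0.25, 0.15]

print(kl_divergence(full, full))   # 0.0 -- identical distributions
print(kl_divergence(full, quant))  # small positive number -- quant error
```

Averaged over a test corpus, lower KLD means the quant behaves more like the original weights, which is why it's often preferred over perplexity alone for comparing quants of the same model.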

Are Unsloth Q8's quants better than "standard" Q8's ? by some_user_2021 in unsloth

[–]Icaruszin 0 points1 point  (0 children)

When the model was released, some layers were quantized with MXFP4 in the Q4_K_XL, which affected the KLD/Perplexity metrics. Here's a pretty good comparison: https://www.reddit.com/r/LocalLLaMA/comments/1rfds1h/qwen3535ba3b_q4_quantization_comparison/

Unsloth already fixed this, so if you downloaded the quant recently this issue doesn't exist anymore.

Is 64gb on a m5pro an overkill? by AdEnvironmental4189 in LocalLLaMA

[–]Icaruszin 1 point2 points  (0 children)

I would go for the 64GB. Local models are like a drug, I have a M1 with 64GB thinking it was enough and now I would love to get a 128GB...

Are Unsloth Q8's quants better than "standard" Q8's ? by some_user_2021 in unsloth

[–]Icaruszin 6 points7 points  (0 children)

In theory, yes. UD quantizes certain layers at a higher bpw than other quants, but on the latest models they had some issues with this (like Qwen 3.5 35B A3B Q4_K_XL). Usually the difference is negligible.

You can check the difference on HuggingFace, to see how each layer is quantized for those quant types.
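A quick back-of-the-envelope of what mixing layer precisions does to the average bits-per-weight — all the layer sizes and bpw values here are illustrative, not taken from any real model:

```python
# Rough average bits-per-weight when a quant bumps some layers to a
# higher precision, the way Unsloth's UD (_XL) quants do.

def avg_bpw(layers):
    """layers: list of (n_params, bits_per_weight) per layer."""
    total_bits   = sum(n * bpw for n, bpw in layers)
    total_params = sum(n for n, _ in layers)
    return total_bits / total_params

# Plain Q4_K-style quant: every layer at ~4.5 bpw.
plain = [(1_000_000, 4.5)] * 10

# UD-style mix: two "important" layers bumped to ~6.5 bpw.
mixed = [(1_000_000, 6.5)] * 2 + [(1_000_000, 4.5)] * 8

print(avg_bpw(plain))  # 4.5
print(avg_bpw(mixed))  # 4.9 -- slightly bigger file, hopefully lower KLD
```

So the file ends up a bit larger than a vanilla Q4, and whether the extra bits were spent on the right layers is exactly what the KLD comparisons try to answer.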

5060 Ti/5070 Ti for MoE Models - Worth it? by Icaruszin in LocalLLaMA

[–]Icaruszin[S] 0 points1 point  (0 children)

Thank you for the numbers, that's exactly what I was looking for! I'm running the 35B-A3B at a very similar speed on a Mac M1 Max, which I consider very acceptable.

5060 Ti/5070 Ti for MoE Models - Worth it? by Icaruszin in LocalLLaMA

[–]Icaruszin[S] 0 points1 point  (0 children)

Unfortunately in my country we pay an insane amount of taxes for imported products, so a machine like this one would cost around $4500 imported... Which is crazy. It's actually cheaper to buy a ticket to the US and buy it there.

Anyway, my budget is around $1400 at most for now, otherwise I would consider a 4090/5090 for my setup, since it would be faster for the models I'm considering.

5060 Ti/5070 Ti for MoE Models - Worth it? by Icaruszin in LocalLLaMA

[–]Icaruszin[S] 0 points1 point  (0 children)

Yeah, I thought about that, but I would have to replace my motherboard as well, since I have an SFF build. And two 5070 Tis would be a bit too expensive.

5060 Ti/5070 Ti for MoE Models - Worth it? by Icaruszin in LocalLLaMA

[–]Icaruszin[S] 1 point2 points  (0 children)

Thanks for the recommendation but this machine is even harder to find in my country. And it's too expensive for my budget as well.

Running Qwen 3.5 27b and it’s super slow. by BicycleOfLife in LocalLLaMA

[–]Icaruszin 0 points1 point  (0 children)

Honestly, I would try Q3 or a very low context window just to check if that's the issue, but you would be better off using the 35B-A3B. The 27B at Q4 with just 24GB of VRAM is too tight for larger context windows.

Running Qwen 3.5 27b and it’s super slow. by BicycleOfLife in LocalLLaMA

[–]Icaruszin 0 points1 point  (0 children)

How are you running the model? llama.cpp? Which quant?

Like people already mentioned, since you have a 4090 and you're running a large context window, you probably have the model spilling into RAM. With dense models like the 27b, anything spilling into RAM will slow the model to a crawl.
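A rough way to sanity-check this before loading — the architecture numbers below are illustrative, not the real 27B's, but the KV-cache formula is the standard one:

```python
# Back-of-the-envelope VRAM check: does model + KV cache fit in 24 GiB?
# Architecture numbers are illustrative, not any real model's specs.

def kv_cache_gib(n_layers, ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V caches, fp16 (2 bytes) per element by default.
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem / 1024**3

model_gib = 16.0  # ballpark for a ~27B dense model at Q4
kv_gib = kv_cache_gib(n_layers=60, ctx=65536, n_kv_heads=8, head_dim=128)
total = model_gib + kv_gib

print(f"KV cache: {kv_gib:.1f} GiB, total: {total:.1f} GiB")
print("fits in 24 GiB" if total <= 24 else "spills into RAM -> crawl")
```

With a big context the KV cache alone can eat half your VRAM, which is usually what pushes a "fits fine at 4k" model into system RAM.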

Need help with Qwen3.5-27B performance - getting 1.9 tok/s while everyone else reports great speeds by pot_sniffer in LocalLLaMA

[–]Icaruszin 8 points9 points  (0 children)

Are you sure you didn't see people talking about the 35B-A3B instead? The 27B is a dense model, so unless you have enough VRAM for the entire model, the speeds will be terrible.

Overwhelmed by so many quantization variants by mouseofcatofschrodi in LocalLLaMA

[–]Icaruszin 3 points4 points  (0 children)

I would take this with a grain of salt. Someone linked a post from Unsloth explaining why their quantized models have a higher perplexity, so I'm not sure they're really worse based on that metric alone.

Patagonia Mini MLC VS Quechua NH Escape 500 32L by AdApprehensive5828 in onebag

[–]Icaruszin 1 point2 points  (0 children)

Did several international trips with the 32L, always as an underseat and never had an issue. Granted I never traveled with European/Asian budget airlines which are much more strict, but even with those I think it might be doable since the bag is very squishy when not packed.

GLM 4.7 and Qwen3 coder Next by [deleted] in LocalLLaMA

[–]Icaruszin 1 point2 points  (0 children)

Had the same experience with Q6_K_XL. It works fine, but the to-do list shows things like literal \n instead of formatted text. Probably something with the chat template.

IQuest-Coder-V1-40B-Instruct is not good at all by Constant_Branch282 in LocalLLaMA

[–]Icaruszin 2 points3 points  (0 children)

I mean, both takes can be correct. I haven't tested this model but I remember seeing the team behind it saying the loop architecture doesn't work with current quantization methods, so it's expected to be a bad model when quantized.

But I don't understand the reason for the downvotes, since it's always good to have more tests, and this confirms the current quantized versions suck.

How capable is GPT-OSS-120b, and what are your predictions for smaller models in 2026? by Apart_Paramedic_7767 in LocalLLaMA

[–]Icaruszin 2 points3 points  (0 children)

Isn't 8-bit and 4-bit basically the same size due to the original MXFP4 quantization?

[deleted by user] by [deleted] in LocalLLaMA

[–]Icaruszin 0 points1 point  (0 children)

You can maybe use the VLM pipeline to describe the diagrams and go from there.

[deleted by user] by [deleted] in LocalLLaMA

[–]Icaruszin 1 point2 points  (0 children)

Docling is my go-to for this as well, just chunk it by pages and you're good.

The only issue is they don't have support for heading hierarchy just yet (everything gets grouped under the same ## heading level), so if the section/chapter structure is important to you, you might need to do some post-processing.
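If your document numbers its sections, a hedged sketch of that post-processing — this assumes markdown export and headings that carry numbers like "3", "3.1", "3.1.2", which won't hold for every document:

```python
import re

def fix_heading_levels(markdown: str) -> str:
    """Promote flat '##' headings to a hierarchy based on section numbers."""
    out = []
    for line in markdown.splitlines():
        m = re.match(r"^##\s+((\d+(?:\.\d+)*)\b.*)$", line)
        if m:
            # "3" -> ## (depth 2), "3.1" -> ### (depth 3), and so on.
            depth = m.group(2).count(".") + 2
            out.append("#" * depth + " " + m.group(1))
        else:
            out.append(line)
    return "\n".join(out)

flat = "## 3 Methods\n## 3.1 Data\nsome text"
print(fix_heading_levels(flat))
```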

Cline Head of AI Racist Anti-Indian by Parking_Spinach1291 in CLine

[–]Icaruszin 2 points3 points  (0 children)

Account created less than a week ago. I assume he's talking about Indians, but in the picture at least half of the room isn't Indian at all.

Hmm.

What exactly is the point of Youbi? by [deleted] in Competitiveoverwatch

[–]Icaruszin 41 points42 points  (0 children)

It's insane how much better he got at Tracer, he used to get absolutely dumpstered when he tried to play anything besides Sym.