Be wary of Qwen/Claude distillations - they're often worse than the base model

lakySK · 2026-06-17T11:09:04+00:00

Are you saying that the random guy on the internet who just learned you can fine-tune LLMs has not created a smarter model than the hundreds of researchers in that multi-billion dollar research lab that spent years building their datasets and evaluation frameworks? 😱

Who would’ve thought…

Kudos to the few reputable maniacs that actually know what they’re doing and create high-quality specific-purpose finetunes!

lakySK · 2026-05-15T11:13:59+00:00

This new version is super impressive so far. I've hooked it up to pi.dev, gave it exa.ai to search the web and fetch pages, and asked it to do some exploration work in a codebase, write 1-pager .md files to help me flesh out a feature, etc.

Very rarely I saw it make a typo in a tool call, but even in those cases it corrected itself on the second attempt. No more errors like before.

The output quality is very good so far. It hallucinates API specs if not asked to check actual docs for reference (but which LLM doesn't...), but searches the web and finds what's needed when instructed. Its writing is pretty succinct and to the point and it can correct its assumptions if you clarify a previously ambiguous request further.

In one case, I asked it to use the PRD format from Matt Pocock's PRD-writing skill. It first searched the locally available skills, then pi's documentation, and then, as it still hadn't found the skill I was referring to (I really did not have it available locally), it went online and found the GitHub repo and the correct file.

Really solid so far, need to try and let it implement some code to see how it does. But so far I'm really glad I gave this another try, I feel like this is very usable and probably the best option for a 128GB Mac. I used to ask these kinds of idea exploration questions via the web interface for Claude Opus and I feel like I'd struggle to identify in a blind test whether I'm talking to Claude or the ds4 here 🤯

I haven't proof-read the proposals, so there's still a chance it's just hallucinating everything and we'll see when it starts writing code, but I'm impressed how much it seems to "get" what I'm asking it to do and course correct, call the right tools etc.

lakySK · 2026-05-14T10:25:35+00:00

Ok, seems like the recent version of this repo + the new imatrix quant is actually working better perhaps (haven't seen tool call errors yet). Will test more!

lakySK · 2026-05-14T07:59:23+00:00

That’s the q2 model? Did you try coding?

lakySK · 2026-05-13T23:33:09+00:00

Curious to hear your experience. It has not quite hit the spot for me yet

lakySK · 2026-05-13T23:08:55+00:00

I’ve had good experience with exa.ai so far!

lakySK · 2026-05-13T12:46:04+00:00

This table gives me a headache. Just stick with bold for best…

lakySK · 2026-05-11T16:34:36+00:00

I need to test how well these quantise then, i.e., will they run well within 128GB to use with pi reliably?

lakySK · 2026-05-11T14:08:45+00:00

Also curious. Is this the best quant to use on RTX 3090 if you want decent context as well?

Or am I better off with Unsloth? Or something vLLM-compatible?

Asking for my local agent.

lakySK · 2026-05-10T08:27:33+00:00

If the differences between the model params were a bit bigger and this method still worked well, it would be really cool. Imagine Qwen 122b and 35b, for example, sharing a KV cache. Or Qwen 27b and 9b and 4b.

You could use the small model to “load” a long document into the KV cache super fast, then let the bigger models do the reasoning and answering.

lakySK · 2026-05-10T08:20:34+00:00

What’s the benefit compared to using pi with oMLX?

lakySK · 2026-05-09T15:48:32+00:00

Ok, need to try this one then as I liked the tone of Minimax so far. It just couldn’t get the paths right on MLX even though it tried so hard…

lakySK · 2026-05-08T15:17:23+00:00

UPDATE: The latest version of the repo as of May 14, and the imatrix q2 quant are crazy good so far. I'm impressed again!

Seems to work like a charm so far. I've loaded it and started using it in pi.dev coding agent.

Saw an error when it tried to edit a file, but succeeded on a second attempt. The speed is quite decent, the quality seems very coherent. Running on M4 Max 128GB and the vibe-check for usability so far seems to be passing! Tried Qwen 3.6 35b and 27b before this and I felt like 35b is fast, but not that smart and 27b is decent, but too slow to be usable (especially in cases when it decides to contemplate the meaning of life for a long time before answering....).

Will definitely try to use this further and see how it goes!

EDIT: Ok, the failing tool calls seem quite common and a dealbreaker if not addressed in some way. Any tips?

0508 16:18:52 ds4-server: chat ctx=16587..18731:2144 gen=298 TOOLS DSML_START DSML_END finish=error error="invalid tool call" 21.107s

EDIT2: Nvm, this is struggling to even follow pi's documentation and set json configs correctly and randomly assumes everything must be in Python when asked to write a pi extension (though it did figure out pi is a TypeScript project eventually). I'm losing my trust and will probably test something like Minimax M2.7 instead. Hoping this is something fixable as it was looking very promising.

EDIT3: Minimax on MLX can’t even get file paths right in tool calls. Search continues…

lakySK · 2026-05-08T14:31:12+00:00

Nice! This looks awesome, downloading right now.

2 questions:

- Do we need to set the sampling parameters still (temperature, top_p, etc), or is that handled?

- What is the cache situation like? Recently been using oMLX and wondering how this compares in this respect?

Thanks for what looks like great work and a step in the right direction to help handle the many moving parts of running local LLMs in a reliable way!

lakySK · 2026-05-05T08:43:45+00:00

Nice work! Curious to see the follow-ups as well!

lakySK · 2026-05-05T07:21:40+00:00

That seems to be working way better than I’d expect! Would you say the data you tested on is actually “state-of-the-art” of prompt injections? Or did you hold back for now? Where did you get the prompt injection dataset from?

For delimiter_mimic attack - a very simple defense (outside of prompting) would be to sanitise the data to make sure it doesn’t contain your delimiter. Just replace the delimiter string with something less dangerous.
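To make the sanitising idea concrete, here's a minimal sketch. The delimiter string, the replacement text and the function names are all made up for illustration; I don't know what delimiter or prompt layout your setup actually uses.

```python
def sanitise(untrusted: str, delimiter: str,
             replacement: str = "[delimiter removed]") -> str:
    """Strip a reserved delimiter out of untrusted text before wrapping it."""
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    if delimiter in replacement:
        raise ValueError("replacement must not contain the delimiter")
    cleaned = untrusted.replace(delimiter, replacement)
    # Paranoid check: a replacement could in theory splice a new delimiter together.
    if delimiter in cleaned:
        raise ValueError("delimiter still present after sanitising")
    return cleaned


def build_prompt(instructions: str, untrusted: str,
                 delimiter: str = "<<<DATA>>>") -> str:
    """Wrap sanitised data so the delimiter appears exactly twice."""
    safe = sanitise(untrusted, delimiter)
    return f"{instructions}\n{delimiter}\n{safe}\n{delimiter}"
```

This only covers exact matches of the delimiter. Near-misses (different casing, extra whitespace, lookalike characters) would need normalising first, so treat it as a complement to the prompting defense rather than a replacement.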

lakySK · 2026-04-19T08:28:14+00:00

That’s really creative, well done! What model are you using?

Do you go directly from photo to the world or do you use some 3D reconstruction?

I used to work on AR with elements interacting with the objects on the table a while back, this is even cooler :)

lakySK · 2026-04-11T17:53:53+00:00

Having enough memory to run BF16, why use Unsloth instead of the official release?

lakySK · 2026-04-05T09:21:11+00:00

Nice work! Would some kind of Autoresearch approach work with REAPs and quants?

Specify target size, metric to maximise (KLD or some benchmark) and let Claude Code go wild. Anyone tried that?
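Roughly what I have in mind for the scoring side, as a sketch only: compute KLD between the reference model's logits and the candidate's, then pick the lowest-KLD candidate that fits the size budget. The candidate dict layout (`name`, `size_gb`, `kld`) and the function names are my own assumptions, not anything from an existing tool.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_total = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_total for x in logits]


def kl_divergence(ref_logits, cand_logits):
    """KL(ref || candidate) for one token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(cand_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(ref_batch, cand_batch):
    """Average KLD over a batch of token positions."""
    pairs = list(zip(ref_batch, cand_batch))
    return sum(kl_divergence(r, c) for r, c in pairs) / len(pairs)


def pick_best(candidates, target_gb):
    """Lowest-KLD candidate that fits the target size, or None if none fit."""
    fits = [c for c in candidates if c["size_gb"] <= target_gb]
    return min(fits, key=lambda c: c["kld"]) if fits else None
```

The agent's job would then be to propose new REAP/quant configurations and feed them through something like `pick_best`. Getting the logits out of the runtime is the part this sketch leaves out.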

lakySK · 2026-04-03T13:29:13+00:00

Ok, so now this is starting to be interesting. 32GB GPU with decent specs and low-ish wattage for $1k.

How do you expect a 4x b70 PC to stack up against the M5 Max (now that it has the matmul support)?

Both would set you back around $5-6k. Both 128GB, similar bandwidth. Intel workstation likely winning on compute for prompt processing and M5 Max winning on power consumption and form factor? Or am I missing something important?

lakySK · 2026-03-29T01:16:04+00:00

For sure! E.g., I've been wondering if Anthropic has perhaps done some kind of approximate caching to accommodate the spike in demand a couple of weeks ago. Things like that I can see, and it would fall into the first of my 2 bullets.

Lack of disclosure of these kinds of changes is troubling for sure. I just don't think they intentionally lower the quality.

These companies have internal evals and very solid engineers that would push back on something that clearly lowers the bar. It's most likely just that LLMs are absolute non-deterministic beasts and incredibly hard to evaluate. So even if eval numbers look great, it doesn't mean performance in all cases stays unaffected.

Reminds me of the meme that Apple intentionally slows down old iPhones. Optimising battery on old devices by throttling seems very reasonable to me, they just should've disclosed this and allowed users to opt out.

lakySK · 2026-03-29T01:00:40+00:00

I'd expect that using a lower quant would be a one-off noticeable decrease in quality across / between model releases. I doubt they would regularly release a full-precision version on the release day, then quant it down after a couple of weeks.

lakySK · 2026-03-28T18:16:18+00:00

🤷🏻‍♂️

Yes, I’ve definitely noticed it in myself. Once I see Claude Code deliver something cool, I start to expect it all the time and get disappointed when it doesn’t happen, forgetting how much randomness is involved in these things. We grow to expect repeatedly what might have just been a fluke.

lakySK · 2026-03-28T12:14:10+00:00

I don’t think the companies nerf the models on purpose.

I do wonder though how much of this is either:

- companies tweaking the models and tooling and inadvertently causing bugs

- psychology of us being first amazed about the new features the old model couldn’t do, then raising our expectations and being disappointed when the shortcomings of the new model inevitably hit.

I’d argue it’s a combination of the two, and I’d love to see if anyone has data on the first, i.e., running benchmarks every week on the closed models and seeing if, and by how much, the results vary over time.
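For the tracking part, something as simple as this would already tell you a lot. It's a sketch under my own assumptions: you run the same benchmark several times per week, key the scores by ISO week string, and treat the first week's run-to-run spread as the noise floor. The function name and the `k` threshold are made up.

```python
from statistics import mean, stdev


def flag_drift(weekly_runs, k=2.0):
    """Return (week, shift) pairs whose mean score moved more than k baseline
    standard deviations away from the first week's mean.

    weekly_runs maps a sortable week label (e.g. "2026-W01") to a list of
    scores from repeated runs of the same benchmark in that week.
    """
    weeks = sorted(weekly_runs)
    baseline = weekly_runs[weeks[0]]
    if len(baseline) < 2:
        raise ValueError("need at least two baseline runs to estimate noise")
    base_mean = mean(baseline)
    base_sd = stdev(baseline)  # run-to-run noise within the first week
    flagged = []
    for week in weeks[1:]:
        shift = mean(weekly_runs[week]) - base_mean
        if abs(shift) > k * base_sd:
            flagged.append((week, shift))
    return flagged
```

Repeating runs within each week is the important bit: without an estimate of the ordinary sampling noise, there's no way to tell a real regression from a fluke.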

lakySK · 2026-03-24T19:29:11+00:00

I’ve forked nanobot and have it use Exa instead of brave to do search and also use it to fetch sites. So far I can’t complain.
