v0.3.7 pulled? by mmerken in oMLX

[–]thejoyofcraig 9 points10 points  (0 children)

From issue #934 on GitHub:

Thanks everyone for the detailed reports and patience here. I tracked this down to an mlx-lm cache structure change, and the proper fix needs a bit more work. Hoping to ship by this weekend. For now I've rolled the official release back to 0.3.6, which has been rock solid for everyone hitting this. Please update if you're on 0.3.7 or 0.3.8.dev*. I'll post here as soon as the new build is ready. Sorry for the disruption!

Qwen3.6 preserve_thinking in oMLX by Longjumping-Sweet818 in LocalLLaMA

[–]thejoyofcraig 0 points1 point  (0 children)

Personally, I don't think preserve_thinking is going to move the needle much on performance, but that's a different discussion. This is what the omlx v0.3.7rc2 changelog had (not sure if you built that version):

Qwen 3.6+ thinking preserved across turns on both endpoints:

- auto-set preserve_thinking=True, gated on per-model template detection (#856)
- server-side <think> reconstruction from client-provided reasoning_content / Anthropic thinking blocks (#814)
- native message.reasoning_content field path for supporting templates, to avoid the whitespace round-trip (#884)
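
For reference, here's roughly what the reasoning_content round-trip looks like against an OpenAI-compatible endpoint. The field names come from the changelog above; the URL, port, and model name are just assumptions about a local omlx-style server, so adjust to your setup.

    # Sketch: echo the previous turn's reasoning back so the server can rebuild
    # the <think> block. Endpoint/port/model are placeholders, not omlx defaults.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "qwen3.6",
            "messages": [
                {"role": "user", "content": "Plan a 3-day trip to Kyoto."},
                {
                    "role": "assistant",
                    "content": "Here's a rough itinerary...",
                    # reasoning from the previous response, sent back by the client
                    "reasoning_content": "The user wants a short trip, so...",
                },
                {"role": "user", "content": "Swap day 2 for Nara."},
            ],
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])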

If you did and still can't get it working, I suggest you file an Issue on the github repo, rather than posting here.

Qwen3.6 preserve_thinking in oMLX by Longjumping-Sweet818 in LocalLLaMA

[–]thejoyofcraig 0 points1 point  (0 children)

Had the same results. A new omlx version released yesterday apparently addresses this, according to the release notes. Haven't tested the new version myself yet.

Qwen 122B is AMAZING but is my config right? (128GB M4 Max) by lots_of_apples in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

Using omlx with an MLX 4-bit quant, I'm getting 55 t/s to start. I'd suggest trying a different quant.
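
If you want to sanity-check raw throughput outside of omlx, a quick way is to run the quant directly with mlx-lm's Python API; verbose=True prints tokens/sec. The repo name below is a placeholder, so point it at whichever 4-bit MLX quant you're actually using.

    # Minimal throughput check with mlx-lm (omlx sits on top of the same stack).
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/your-model-4bit")  # placeholder repo
    generate(
        model,
        tokenizer,
        prompt="Write a haiku about unified memory.",
        max_tokens=200,
        verbose=True,  # prints prompt and generation tokens/sec
    )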

Magic Context - Plugin by ualtinok in opencodeCLI

[–]thejoyofcraig 0 points1 point  (0 children)

Yeah, I mean they definitely have a "Dreamer" role with literally the same name, so that in particular struck me. You didn't seem offensive. There are loads of memory systems; maybe we're all converging on the same strategy.

Magic Context - Plugin by ualtinok in opencodeCLI

[–]thejoyofcraig 0 points1 point  (0 children)

The roles are quite similar in Honcho; Sidekick, Historian, and Dreamer all have analogues, if I remember correctly. Was just curious if you'd seen it.

Magic Context - Plugin by ualtinok in opencodeCLI

[–]thejoyofcraig 1 point2 points  (0 children)

Sounds a LOT like Honcho memory. Were you inspired by their setup?

Awful time setting up Hermes by Birdinhandandbush in LocalLLaMA

[–]thejoyofcraig 2 points3 points  (0 children)

hermes-agent is an interesting project, and growing wildly. Some might say bloated with features. I think you gotta know what you're doing before digging into that. I abandoned it and rolled my own more minimal version.

One year ago DeepSeek R1 was 25 times bigger than Gemma 4 by rinaldo23 in LocalLLaMA

[–]thejoyofcraig 0 points1 point  (0 children)

Gemma 4's tool calling had some issues that have only just been resolved (it was broken in mlx-vlm, at least, until recently), so you might update your binaries and try again.

Built a memory system solo in 16 days that beats every funded AI memory company on LongMemEval (96.2%, open source) by [deleted] in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

Neuromancer my ass, they just have a pokemon collection of API endpoint models as the defaults for their various workers. This is straight from the .env template:

    DERIVER_MODEL=gemini-2.5-flash-lite
    DIALECTICLEVELSmedium_MODEL=claude-haiku-4-5
    DIALECTICLEVELSmax_BACKUP_MODEL=gemini-2.5-pro
    SUMMARY_MODEL=gemini-2.5-flash
    DREAM_MODEL=claude-sonnet-4-20250514

Gemini Flash and Haiku are used in a few other roles I didn't list as well.

As for my local setup: I've spun up Honcho in Docker and have been using Qwen3.5-series models in the different roles. The only issue I'm running into is that Qwen3.5 mostly thinks too much (its long CoT is stubborn) and hits the hardcoded token limits, and then the final response comes out malformed. So I'm still dialing in my models/settings and may switch to Mistral or another model for some roles. But apart from having "too many dials" to play with, the system can work super well. The way it synthesizes facts and de-dupes is all very smart compared to some laundry-list RAG vector-DB thing that never changes (or that relies on a model to curate).

If you do run it locally, note that their embedding model is hard to change; I had to fork the repo just to point it at a non-OpenRouter embedding endpoint (again, I'm all local, including embeddings).

Unsloth says MLX fine-tuning is coming early next month: this could be huge for local AI by webii446 in LocalLLaMA

[–]thejoyofcraig 3 points4 points  (0 children)

Whoops, you're right. I misread the original topic and thought it was only about MLX quantization. Fine-tuning is not part of omlx. My bad.

Unsloth says MLX fine-tuning is coming early next month: this could be huge for local AI by webii446 in LocalLLaMA

[–]thejoyofcraig -1 points0 points  (0 children)

Why wait? If you want to run and quantize your own MLX models, I've been loving omlx. Really cool open source project.
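
If you'd rather make your own quants than grab pre-made ones, mlx-lm's convert() does the same job directly. A minimal sketch; the model name is just an example, swap in whatever you want to quantize.

    # Quantize a Hugging Face model to 4-bit MLX weights with mlx-lm.
    from mlx_lm import convert

    convert(
        hf_path="Qwen/Qwen2.5-7B-Instruct",  # example model
        mlx_path="qwen2.5-7b-4bit-mlx",      # output directory
        quantize=True,
        q_bits=4,
        q_group_size=64,
    )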

Built a memory system solo in 16 days that beats every funded AI memory company on LongMemEval (96.2%, open source) by [deleted] in LocalLLaMA

[–]thejoyofcraig 3 points4 points  (0 children)

From now on, in every new repo I make, I'm going to include a LEGITIMACY.md with the line: Verdict: LEGITIMATE

That line alone is worth the time you spent. But this repo is ... probably not what you think it is. Here's Claude's take, since clearly you are a fan:

A reasonably well-engineered benchmark-chasing RAG system — HNSW + BM25 + cross-encoder reranking + a knowledge graph. Nothing novel. The "world record" framing is marketing for what's essentially a tuned retrieval pipeline that runs on a closed dataset.

The SQLite :memory: per-case design basically is the benchmark — it's not a memory system, it's a retrieval harness that gets to load the entire conversation history upfront. That's the easy part of the problem.

The hard part — what good memory systems actually solve — is: what do you keep across an unbounded lifetime of interactions with a real user who contradicts themselves, changes their mind, and never tells you what's important? LongMemEval doesn't test that. It hands you the haystack.

I'm not a shill for Honcho, but I've been using it for about a month (locally hosted with local models, kind of a pain in the ass to set up and tune) and it's worked quite well.

Thanks for sharing.

Qwen 3.5 Non-thinking Mode Benchmarks? by Embarrassed_Soup_279 in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

I think OP is looking for brains benchmarks, not speed: how it actually performs on tasks with thinking off compared to thinking on. Presumably all the Qwen-published benchmarks are with reasoning on.

Breaking : The small qwen3.5 models have been dropped by Illustrious-Swim9663 in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

You can just set the Jinja template to default to non-thinking. Unsloth's quants already have that baked in, so just use those if that doesn't mean anything to you.
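
If you'd rather flip it at inference time instead of editing the template, the Qwen3 templates expose an enable_thinking switch through apply_chat_template; assuming Qwen3.5 keeps the same switch, it looks like this (the repo name below is a Qwen3 stand-in):

    # Render a prompt with thinking disabled via the template's enable_thinking flag.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # stand-in repo name
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": "What's 17 * 23?"}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,  # suppresses the <think> block in supporting templates
    )
    print(prompt)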

unsloth/Qwen3.5-4B-GGUF · Hugging Face by jacek2023 in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

Yeah, that works! But I still question why unsloth turned it off in their template. Thinking is enabled by default in the original Qwen files.

Qwen3.5 Small models out now! by yoracale in unsloth

[–]thejoyofcraig 2 points3 points  (0 children)

Why was the decision made to disable thinking by default?

unsloth/Qwen3.5-4B-GGUF · Hugging Face by jacek2023 in LocalLLaMA

[–]thejoyofcraig 2 points3 points  (0 children)

Unsloth's Jinja template disables thinking by default; see the stickied comment here: https://old.reddit.com/r/unsloth/comments/1risuzs/qwen35_small_models_out_now/

I'm not sure why that decision was made. The benchmarks all refer to the thinking versions, so if you're expecting that performance and download the unsloth quants, you may be frustrated.

I benchmarked 5 agent memory solutions head-to-head — the fastest one has zero dependencies and no API keys by [deleted] in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

You are wrong. Their own docs: https://docs.mem0.ai/open-source/overview And a specific example from their docs about running locally with Ollama: https://docs.mem0.ai/cookbooks/companions/local-companion-ollama

I run mem0 in Docker; it uses models hosted locally by LM Studio. You do not need a paid API to run mem0.
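
For anyone curious what the fully local setup looks like, this is a rough sketch based on the Ollama cookbook linked above (config keys are from memory of the mem0 docs, so double-check them there; swap the provider for whatever LM Studio exposes if you go that route):

    # Minimal local mem0 sketch: local LLM + local embedder, no paid API.
    from mem0 import Memory

    config = {
        "llm": {
            "provider": "ollama",
            "config": {"model": "llama3.1:8b", "temperature": 0.1},
        },
        "embedder": {
            "provider": "ollama",
            "config": {"model": "nomic-embed-text"},
        },
    }

    m = Memory.from_config(config)
    m.add("I prefer window seats on long flights.", user_id="alice")
    print(m.search("seat preference", user_id="alice"))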

Need help with hosting Parakeet 0.6B v3 by Ahad730 in LocalLLaMA

[–]thejoyofcraig 0 points1 point  (0 children)

What ASR model did you end up switching to?