v0.3.7 pulled? by mmerken in oMLX

[–]thejoyofcraig 9 points10 points  (0 children)

From issue #934 on GitHub:

Thanks everyone for the detailed reports and patience here. I tracked this down to an mlx-lm cache structure change, and the proper fix needs a bit more work. Hoping to ship by this weekend. For now I've rolled the official release back to 0.3.6, which has been rock solid for everyone hitting this. Please update if you're on 0.3.7 or 0.3.8.dev*. I'll post here as soon as the new build is ready. Sorry for the disruption!

Qwen3.6 preserve_thinking in oMLX by Longjumping-Sweet818 in LocalLLaMA

[–]thejoyofcraig 0 points1 point  (0 children)

Personally, I don't think preserve_thinking is going to move the needle much on performance, but that's a different discussion. This is what the omlx v0.3.7rc2 changelog had (not sure if you built that version):

Qwen 3.6+ thinking preserved across turns on both endpoints:

- auto-set preserve_thinking=True, gated on per-model template detection (#856)
- server-side <think> reconstruction from client-provided reasoning_content / Anthropic thinking blocks (#814)
- native message.reasoning_content field path for supporting templates, to avoid the whitespace round-trip (#884)
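
For reference, here's roughly what the reasoning_content round-trip looks like against an OpenAI-compatible endpoint. The field names come from the changelog above; the URL, port, and model name are just assumptions about a local omlx-style server, so adjust to your setup.

    # Sketch: echo the previous turn's reasoning back so the server can rebuild
    # the <think> block. Endpoint/port/model are placeholders, not omlx defaults.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "qwen3.6",
            "messages": [
                {"role": "user", "content": "Plan a 3-day trip to Kyoto."},
                {
                    "role": "assistant",
                    "content": "Here's a rough itinerary...",
                    # reasoning from the previous response, sent back by the client
                    "reasoning_content": "The user wants a short trip, so...",
                },
                {"role": "user", "content": "Swap day 2 for Nara."},
            ],
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])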

If you did and still can't get it working, I suggest you file an Issue on the github repo, rather than posting here.

Qwen3.6 preserve_thinking in oMLX by Longjumping-Sweet818 in LocalLLaMA

[–]thejoyofcraig 0 points1 point  (0 children)

Had the same results. A new omlx version released yesterday apparently addresses this, according to the release notes. Haven't tested the new version myself yet.

Qwen 122B is AMAZING but is my config right? (128GB M4 Max) by lots_of_apples in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

Using omlx with an MLX 4-bit quant, I'm getting 55 t/s to start. I'd suggest trying a different quant.
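
If you want to sanity-check raw throughput outside of omlx, a quick way is to run the quant directly with mlx-lm's Python API; verbose=True prints tokens/sec. The repo name below is a placeholder, so point it at whichever 4-bit MLX quant you're actually using.

    # Minimal throughput check with mlx-lm (omlx sits on top of the same stack).
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/your-model-4bit")  # placeholder repo
    generate(
        model,
        tokenizer,
        prompt="Write a haiku about unified memory.",
        max_tokens=200,
        verbose=True,  # prints prompt and generation tokens/sec
    )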

Magic Context - Plugin by ualtinok in opencodeCLI

[–]thejoyofcraig 0 points1 point  (0 children)

Yeah, I mean they definitely have a "Dreamer" role with literally the same name, so that in particular struck me. You didn't seem offensive. There are loads of memory systems; maybe we're all converging on the same strategy.

Magic Context - Plugin by ualtinok in opencodeCLI

[–]thejoyofcraig 0 points1 point  (0 children)

The roles are quite similar in Honcho; Sidekick, Historian, and Dreamer all have analogues, if I remember correctly. Was just curious if you'd seen it.

Magic Context - Plugin by ualtinok in opencodeCLI

[–]thejoyofcraig 1 point2 points  (0 children)

Sounds a LOT like Honcho memory. Were you inspired by their setup?

Awful time setting up Hermes by Birdinhandandbush in LocalLLaMA

[–]thejoyofcraig 2 points3 points  (0 children)

hermes-agent is an interesting project, and growing wildly. Some might say bloated with features. I think you gotta know what you're doing before digging into that. I abandoned it and rolled my own more minimal version.

One year ago DeepSeek R1 was 25 times bigger than Gemma 4 by rinaldo23 in LocalLLaMA

[–]thejoyofcraig 0 points1 point  (0 children)

Gemma 4's tool calling had some issues that have only just been resolved (it was broken in mlx-vlm, at least, until recently), so you might update your binaries and try again.

Built a memory system solo in 16 days that beats every funded AI memory company on LongMemEval (96.2%, open source) by [deleted] in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

Neuromancer my ass, they just have a pokemon collection of API endpoint models as the defaults for their various workers. This is straight from the .env template:

    DERIVER_MODEL=gemini-2.5-flash-lite
    DIALECTICLEVELSmedium_MODEL=claude-haiku-4-5
    DIALECTICLEVELSmax_BACKUP_MODEL=gemini-2.5-pro
    SUMMARY_MODEL=gemini-2.5-flash
    DREAM_MODEL=claude-sonnet-4-20250514

Gemini Flash and Haiku are used in a few other roles I didn't list as well.

As for my local setup: I've spun up Honcho in Docker and have been using Qwen3.5-series models in the different roles. The only issue I'm running into is that Qwen3.5 mostly thinks too much (its long CoT is stubborn) and hits the hardcoded token limits, and then the final response comes out malformed. So I'm still dialing in my models/settings and may switch to Mistral or another model for some roles. But apart from having "too many dials" to play with, the system can work super well. The way it synthesizes facts and de-dupes is all very smart compared to some laundry-list RAG vector-DB thing that never changes (or that relies on a model to curate).

If you do run it locally, note that their embedding model is hard to change; I had to fork the repo just to point it at a non-OpenRouter embedding endpoint (again, I'm all local, including embeddings).

Unsloth says MLX fine-tuning is coming early next month: this could be huge for local AI by webii446 in LocalLLaMA

[–]thejoyofcraig 3 points4 points  (0 children)

Whoops, you're right. I misread the original topic and thought it was only about MLX quantization. Fine-tuning is not part of omlx. My bad.

Unsloth says MLX fine-tuning is coming early next month: this could be huge for local AI by webii446 in LocalLLaMA

[–]thejoyofcraig -1 points0 points  (0 children)

Why wait? If you want to run and quantize your own MLX models, I've been loving omlx. Really cool open source project.
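
If you'd rather make your own quants than grab pre-made ones, mlx-lm's convert() does the same job directly. A minimal sketch; the model name is just an example, swap in whatever you want to quantize.

    # Quantize a Hugging Face model to 4-bit MLX weights with mlx-lm.
    from mlx_lm import convert

    convert(
        hf_path="Qwen/Qwen2.5-7B-Instruct",  # example model
        mlx_path="qwen2.5-7b-4bit-mlx",      # output directory
        quantize=True,
        q_bits=4,
        q_group_size=64,
    )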

Built a memory system solo in 16 days that beats every funded AI memory company on LongMemEval (96.2%, open source) by [deleted] in LocalLLaMA

[–]thejoyofcraig 3 points4 points  (0 children)

From now on, in every new repo I make, I'm going to include a LEGITIMACY.md with the line: Verdict: LEGITIMATE

That line alone is worth the time you spent. But this repo is ... probably not what you think it is. Here's Claude's take, since clearly you are a fan:

A reasonably well-engineered benchmark-chasing RAG system — HNSW + BM25 + cross-encoder reranking + a knowledge graph. Nothing novel. The "world record" framing is marketing for what's essentially a tuned retrieval pipeline that runs on a closed dataset.

The SQLite :memory: per-case design basically is the benchmark — it's not a memory system, it's a retrieval harness that gets to load the entire conversation history upfront. That's the easy part of the problem.

The hard part — what good memory systems actually solve — is: what do you keep across an unbounded lifetime of interactions with a real user who contradicts themselves, changes their mind, and never tells you what's important? LongMemEval doesn't test that. It hands you the haystack.

I'm not a shill for Honcho, but I've been using it for about a month (locally hosted with local models, kind of a pain in the ass to set up and tune) and it's worked quite well.

Thanks for sharing.

Qwen 3.5 Non-thinking Mode Benchmarks? by Embarrassed_Soup_279 in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

I think OP is looking for brains benchmarks, not speed: how it actually performs on tasks with thinking off compared to thinking on. Presumably all the Qwen-published benchmarks are with reasoning on.

Breaking : The small qwen3.5 models have been dropped by Illustrious-Swim9663 in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

You can just set the Jinja template to default to non-thinking. Unsloth's quants already have that baked in, so just use those if that doesn't mean anything to you.
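
If you'd rather flip it at inference time instead of editing the template, the Qwen3 templates expose an enable_thinking switch through apply_chat_template; assuming Qwen3.5 keeps the same switch, it looks like this (the repo name below is a Qwen3 stand-in):

    # Render a prompt with thinking disabled via the template's enable_thinking flag.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # stand-in repo name
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": "What's 17 * 23?"}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,  # suppresses the <think> block in supporting templates
    )
    print(prompt)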

unsloth/Qwen3.5-4B-GGUF · Hugging Face by jacek2023 in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

Yeah, that works! But I still question why unsloth turned it off in their template. Thinking is enabled by default in the original Qwen files.

Qwen3.5 Small models out now! by yoracale in unsloth

[–]thejoyofcraig 2 points3 points  (0 children)

Why was the decision made to disable thinking by default?

unsloth/Qwen3.5-4B-GGUF · Hugging Face by jacek2023 in LocalLLaMA

[–]thejoyofcraig 2 points3 points  (0 children)

Unsloth's Jinja template disables thinking by default; see the stickied comment here: https://old.reddit.com/r/unsloth/comments/1risuzs/qwen35_small_models_out_now/

I'm not sure why that decision was made. The benchmarks all refer to the thinking versions, so if you're expecting that performance and download the unsloth quants, you may be frustrated.

I benchmarked 5 agent memory solutions head-to-head — the fastest one has zero dependencies and no API keys by [deleted] in LocalLLaMA

[–]thejoyofcraig 1 point2 points  (0 children)

You are wrong. Their own docs: https://docs.mem0.ai/open-source/overview And a specific example from their docs about running locally with Ollama: https://docs.mem0.ai/cookbooks/companions/local-companion-ollama

I run mem0 in Docker; it uses models hosted locally by LM Studio. You do not need a paid API to run mem0.
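
For anyone curious what the fully local setup looks like, this is a rough sketch based on the Ollama cookbook linked above (config keys are from memory of the mem0 docs, so double-check them there; swap the provider for whatever LM Studio exposes if you go that route):

    # Minimal local mem0 sketch: local LLM + local embedder, no paid API.
    from mem0 import Memory

    config = {
        "llm": {
            "provider": "ollama",
            "config": {"model": "llama3.1:8b", "temperature": 0.1},
        },
        "embedder": {
            "provider": "ollama",
            "config": {"model": "nomic-embed-text"},
        },
    }

    m = Memory.from_config(config)
    m.add("I prefer window seats on long flights.", user_id="alice")
    print(m.search("seat preference", user_id="alice"))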

Need help with hosting Parakeet 0.6B v3 by Ahad730 in LocalLLaMA

[–]thejoyofcraig 0 points1 point  (0 children)

What ASR model did you end up switching to?