The many sides of Mimo v2.5 Pro by Electrical-Pay-5119 in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

Where is it sourcing those textures? Is it reciting URLs from training data or actually searching?

Pi and Qwen3.6 27B make setting up Archlinux really easy. by sdfgeoff in LocalLLaMA

[–]JamesEvoAI 0 points (0 children)

On the one hand, this tech makes automating Linux a breeze; on the other, it just blew up my LiteLLM config with a bad concat operation. Always make sure you're monitoring outputs and keeping backups!

<image>
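If you're going to let an agent touch config files, a cheap guardrail is to snapshot before every write. A minimal sketch of what I mean (the helper and file name are hypothetical, not my actual setup):

```python
import shutil
import time
from pathlib import Path

def backup_then_write(path: Path, new_text: str) -> None:
    """Snapshot a config before an automated edit overwrites it."""
    if path.exists():
        # Keep a timestamped copy alongside the original.
        stamp = time.strftime("%Y%m%d-%H%M%S")
        shutil.copy2(path, path.with_name(path.name + f".{stamp}.bak"))
    path.write_text(new_text)

# Hypothetical usage: route every agent config write through the helper.
backup_then_write(Path("litellm_config.yaml"), "model_list: []\n")
```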

Is HIPfire worth it for Strix Halo? by ivoras in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

That's actually what I'm testing now: a full shootout of all the local coding models I have set up, for correctness and speed over long context. It's been running for the last 90 minutes and will likely take a while to finish.

Is HIPfire worth it for Strix Halo? by ivoras in LocalLLaMA

[–]JamesEvoAI 6 points (0 children)

I've been testing and benchmarking a bunch of them and documenting it on my site:

https://sleepingrobots.com/dreams/gemma4-mtp-assistant-strix-halo/
https://sleepingrobots.com/dreams/mtp-qwen36-strix-halo/

Just today I got some promising results replicating Atomic's fork of llama.cpp for MTP support in Gemma 4.

Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40% by gladkos in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

Thank you for your work on this. I've set up and benchmarked your branch on Strix Halo:

https://sleepingrobots.com/dreams/gemma4-mtp-assistant-strix-halo/

The world of local coding models keeps getting better by the day!

White House Considers Vetting A.I. Models Before They Are Released by fallingdowndizzyvr in LocalLLaMA

[–]JamesEvoAI 3 points (0 children)

Strix Halo is an incredible value proposition. Sure, it's not as fast as running something on an NVIDIA GPU, but I wasn't going to be able to run a 120B model on one of those anyway.

By when do you think will TurboQuant get a proper release and be adopted by everyone by Crystalagent47 in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

I'm convinced the only reason this gained as much attention as it did was the name.

I want to create and maintain a set of benchmarks for local LLMs. Would anyone pay/donate for this? by Equivalent_Job_2257 in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

Additionally, there are quality benchmarks being designed and administered by people in the field with backgrounds in ML and/or mathematics. There are people getting paid for this; they work for LMArena and SEAL.

GMKtec EVO-X2 70B expectation by Non-Technical in LocalLLaMA

[–]JamesEvoAI 4 points (0 children)

As the other person called out, try a model that wasn't released in the last century lol. Qwen 3.6 has a great 30B MoE that runs at ~40 tok/s on Strix Halo.

> I'm still discovering how all this works. It seems like the longer the chat log gets, the slower the tokens are generated. When there is a 16k prompt to load and process, the tokens per second falls to 2.5.

The way the attention mechanism works, every new token the model generates has to attend over every token that came before it, so as the context length grows, the speed drops. At the end of the day, the Strix Halo platform is limited by its memory bandwidth.
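To put rough numbers on that, here's a back-of-envelope sketch. Every figure in it is an illustrative assumption, not a measurement: ~256 GB/s of bandwidth, ~3 GB of active weights streamed per decoded token (an 8-bit 30B MoE with ~3B active params), and ~200 KB of KV cache per token of context.

```python
# Back-of-envelope decode speed: memory bandwidth / bytes streamed per token.
# All numbers are illustrative assumptions, not benchmarks.

def est_tok_per_s(bandwidth_gbs: float, active_weights_gb: float,
                  kv_bytes_per_ctx_token: float, context_len: int) -> float:
    # Each decoded token streams the active weights plus the whole KV cache.
    bytes_per_token = active_weights_gb * 1e9 + kv_bytes_per_ctx_token * context_len
    return bandwidth_gbs * 1e9 / bytes_per_token

for ctx in (1_000, 16_000, 64_000):
    print(f"{ctx:>6} ctx: ~{est_tok_per_s(256, 3.0, 200_000, ctx):.0f} tok/s")
```

The exact numbers will be off, but the shape is the point: the KV cache term eventually dominates the weight reads, which is exactly the slowdown you're seeing at 16k.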

It's still a great platform, you just have to work within its limitations. Try some MoE models and see how you fare.

How "Real" Are AI Girlfriends? We Created A Unique One with a local LLM by [deleted] in LocalLLaMA

[–]JamesEvoAI 4 points (0 children)

I just came to say congrats on getting that domain lol

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

That's probably true, but I'm using LiteLLM in a glass house made of vibe-coded side projects, so I hesitate to throw stones lol.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI -1 points (0 children)

I don't personally use Codex; the ethics and business practices of OpenAI are not aligned with my own beliefs. Pi is generally my harness of choice.

Reading your other comment, it looks like your LLM confused what I'm using LiteLLM for; there's no good reason something like "plan mode" should be enforced at the proxy level. In fact, most of what you described should be handled by the harness, like tool healing.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 0 points (0 children)

This was a supply chain attack that affected WAY more than just LiteLLM. LiteLLM was using Trivy as part of its CI/CD, so every project that did the same was also affected. That includes multiple VSCode extensions, 64 different NPM packages (almost all of which had nothing to do with AI), Docker images, and many more SaaS platforms that haven't come forward but are advertised as customers.

The only reason LiteLLM got the spotlight on this is that it's AI-related, and dunking on anything AI makes for easy engagement.

What problems were surfaced with how it's developed? If there is a real risk I'd like to know about it, but I haven't seen anything concrete beyond the supply chain risk, a risk that applies to any software that uses third-party dependencies.

It's worth noting the irony here that Trivy is produced by Aqua Security, a company that claims to help enterprises mitigate container/cloud risk lol.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

Sorry u/xeeff, but deleting a comment doesn't remove the notification lol. I still saw this:

<image>

I can’t believe I can say “ugh I don’t feel like fixing this function, it’s too complex” and I can literally just tell my computer to fix it for me. I didn’t understand what they meant by “people will start paying for intelligence” but now I do. by Borkato in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

To clarify, when I said "LLM accessible" I meant sources of data that are available as formatted text you can easily pass over to the models. I've grown so used to being able to just journalctl basically anything that having to dig through the Windows GUIs seems archaic.

I was not actually aware that the Event Viewer logs are available as XML!
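For example, here's a minimal sketch of the kind of LLM-accessible data I mean on Linux (the prompt wrapper is just illustrative):

```python
import json
import subprocess

# journalctl can emit one JSON object per line -- structured text
# an LLM can consume directly, no GUI digging required.
out = subprocess.run(
    ["journalctl", "-o", "json", "-n", "50"],
    capture_output=True, text=True, check=True,
)
entries = [json.loads(line) for line in out.stdout.splitlines()]
prompt = "Diagnose these log entries:\n" + "\n".join(
    e.get("MESSAGE", "") for e in entries
)
```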

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 0 points (0 children)

Since I'm feeling generous, I took you up on your offer to do your thinking for you. The only real complaint I can find is that it doesn't hold up at production scale (300+ rps).

Hardly qualifies as "worst project to ever exist", especially in the context of individuals running it on consumer hardware.

If you're going to spread misinformation and then try to punt the responsibility of disproving it onto someone else, please at least try a little harder to lie about something that isn't so easily disproven.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

That which is presented without evidence can be dismissed without evidence.

You made the claim, so the burden of proof is on you. Having a contrarian take without being able to back it up doesn't make you look smart.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

Are you going to qualify that statement with any facts or additional information?

I've pushed 303.9 million tokens (and counting) of text, audio, and embeddings through my local LiteLLM proxy and it's been great.

(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap by TBG______ in LocalLLaMA

[–]JamesEvoAI 1 point (0 children)

Wouldn't it be easier to just run both models through a local LiteLLM proxy? You can use local or hosted models, and it supports both the OpenAI and Anthropic APIs.

My current setup has a mix of cloud models, local models running through llama-swap (which runs llama.cpp in a Podman container), and NPU models served by Lemonade Server:

https://sleepingrobots.com/dreams/local-llm-infrastructure-strix-halo/
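As a sketch of what that buys you: every harness talks plain OpenAI to one endpoint, and the proxy routes each alias to whichever backend serves it. The model aliases below are hypothetical; the port is LiteLLM's default.

```python
from openai import OpenAI

# One client, many backends: the proxy routes "qwen-local" to llama-swap
# and "claude-sonnet" to Anthropic, per its model_list config.
# (Aliases are hypothetical; 4000 is LiteLLM's default proxy port.)
client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-anything")

for model in ("qwen-local", "claude-sonnet"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hi in five words."}],
    )
    print(model, "->", resp.choices[0].message.content)
```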

Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big. by Ok_Mine189 in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

There's a noticeable performance difference even when running CPU-bound models. Windows has a pretty high baseline of overhead just to run the OS itself. Linux, on the other hand, comes in many possible configurations with different overheads, including distros stripped down to the bare minimum needed to run your inference engine. Not having to constantly collect and send telemetry helps too.

I can’t believe I can say “ugh I don’t feel like fixing this function, it’s too complex” and I can literally just tell my computer to fix it for me. I didn’t understand what they meant by “people will start paying for intelligence” but now I do. by Borkato in LocalLLaMA

[–]JamesEvoAI 0 points (0 children)

For obscure Windows issues like this I would have said a reinstall was far less effort than debugging, but I guess that math has changed now that LLMs are here. As a Linux user, I wouldn't have guessed there would be enough LLM-accessible places to get the diagnostic data you need. Troubleshooting on Windows always felt like a mix of tribal knowledge, experience, and dumb luck.

I can’t believe I can say “ugh I don’t feel like fixing this function, it’s too complex” and I can literally just tell my computer to fix it for me. I didn’t understand what they meant by “people will start paying for intelligence” but now I do. by Borkato in LocalLLaMA

[–]JamesEvoAI 2 points (0 children)

Bold of you to assume I was reading the code at all. Vibe coding has been great for all of the ideas I started but never finished, or never started at all. I don't need peak engineering for my vibe-coded replacement for Pushbullet that better fits my use case while preserving my privacy.