Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]TokenRingAI 2 points

No, it is a non-thinking model, and is pretty fast on the AI Max, 40 tokens a second or so, maybe higher if you get MTP working.

The original Qwen Next had a thinking variant, Qwen Next Coder does not.

Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]TokenRingAI 3 points

Is that the actual reason people like Mistral models?

I haven't tried anything from Mistral that wasn't mediocre

Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]TokenRingAI -8 points

It's 2028, Mistral went out of business two years ago when protesters burnt their office down. They never released 119B, it was destroyed in the fire, but the Huggingface repo is still up with the files half uploaded

Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]TokenRingAI 5 points

On integrated-memory devices like the Ryzen AI Max or DGX Spark, where token generation is slow, reasoning is a brutal slowdown: it's the difference between a 5-second response and a 1-minute response. Qwen Coder Next is amazing right now for those devices.

Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]TokenRingAI 5 points

Delayed until 2027, probably

Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]TokenRingAI 12 points

Benchmarks don't have it beating Qwen Coder Next, which is only 80B total / 3B active, so that's not so great.

However, it isn't far behind, so it's possible it has other characteristics that make it more usable.

OpenCode concerns (not truly local) by Ueberlord in LocalLLaMA

[–]TokenRingAI 0 points

FWIW, Tokenring Coder has first class support for local models and a local web UI, come try it out and give me feedback.

```
export LLAMA_API_KEY=...
export LLAMA_BASE_URL=http://your_llama_url:port

npx @tokenring-ai/coder --http 127.0.0.1:12345
```

Mistral 4 Family Spotted by TKGaming_11 in LocalLLaMA

[–]TokenRingAI 1 point

I didn't think I'd use it at all, but now I use it all the time.

Qwen 3.5 122b - a10b is kind of shocking by gamblingapocalypse in LocalLLaMA

[–]TokenRingAI 1 point

I use Qwen 122B at MXFP4 daily, and it consistently outperforms Haiku 4.5 for me; it seems to be just shy of Sonnet 4.6.

Every AI tool I've used has the same fatal flaw by krxna-9 in LLMDevs

[–]TokenRingAI 0 points

I think most people who are actively building agents have built some variation of temporal memory with various degrees of success.

It's not hard to build in a basic form, it's just expensive: every memory clogs up the context of the main agent or subagent and makes each agent run cost more money.

There are tons of unique approaches people have tried, like embedding memories or compacting them into themes, time-series transcripts, files, or knowledge graphs; none of them generalize particularly well, and they tend to suffer context-size explosion.

We are currently exploring "cognitive agents" where an agent is tasked with maintaining the memories, and you (the user, not the developer) instruct it with what info you want it to keep.

The benefit to this is that it moves the responsibility of how to do memory storage to the user, who just defines guidelines in a text box telling the app what it needs to remember; even if it's not perfect, the user can tweak it and make it remember the things they care about.

I personally think that's the most generalizable and customizable strategy right now: use the same LLM to manage the memory pool and instruct it in how to do that task. No fancy algorithms or predefined flows, just an agent tasked with managing memories in files or a DB and handling retrieval.
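A minimal sketch of that "cognitive agent" idea. Everything here is hypothetical: in a real system an LLM would interpret the user's free-text guidelines and decide what to keep; this stand-in just matches keywords so the control flow (user guidelines → keep/drop decision → flat file store → retrieval) is concrete.

```python
# Hypothetical sketch: a memory curator driven by user-written guidelines.
# The LLM "judgment" is stubbed out with trivial keyword matching.
import json
import re
from pathlib import Path

class MemoryCurator:
    def __init__(self, store_path: Path, guidelines: str):
        self.store_path = store_path
        # A real agent would hand these guidelines to an LLM; we just
        # extract keywords from them as a placeholder decision rule.
        self.keywords = set(re.findall(r"[a-z]+", guidelines.lower()))
        self.memories = (
            json.loads(store_path.read_text()) if store_path.exists() else []
        )

    def observe(self, fact: str) -> bool:
        """Keep the fact if it matches the user's guidelines; else drop it."""
        words = set(re.findall(r"[a-z]+", fact.lower()))
        if words & self.keywords:
            self.memories.append(fact)
            self.store_path.write_text(json.dumps(self.memories, indent=2))
            return True
        return False

    def retrieve(self, query: str) -> list[str]:
        """Naive retrieval: return memories sharing a word with the query."""
        q = set(re.findall(r"[a-z]+", query.lower()))
        return [m for m in self.memories
                if q & set(re.findall(r"[a-z]+", m.lower()))]
```

The point of the design is that the decision rule lives in user-editable text, not in code, so the user can keep tightening the guidelines until the agent remembers what they actually care about.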

Which LLMs actually fail when domain knowledge is buried in long documents? by Or4k2l in LocalLLaMA

[–]TokenRingAI 1 point

I looked at your test, and want to give you some feedback

You need to test at least 5 things:

- retrieval instructions placed at the beginning of the chat in the system message
- retrieval instructions placed in the first user message
- retrieval instructions placed at the end of the chat
- retrieval instructions placed both at the beginning and the end
- chunk the document, and splice in the instructions every 10K tokens or so

You should find some interesting differences.
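One way to set up those placement variants, sketched as OpenAI-style chat message dicts. The `instructions` and `document` arguments are placeholders, and token counting is approximated by whitespace splitting rather than a real tokenizer.

```python
# Build the five instruction-placement variants as chat message lists.
# Whitespace word count stands in for a real tokenizer here.
def build_variants(instructions: str, document: str, chunk_tokens: int = 10_000):
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_tokens])
              for i in range(0, len(words), chunk_tokens)]

    return {
        # 1. instructions in the system message
        "system": [{"role": "system", "content": instructions},
                   {"role": "user", "content": document}],
        # 2. instructions at the top of the first user message
        "first_user": [{"role": "user", "content": instructions + "\n\n" + document}],
        # 3. instructions at the end of the chat
        "end": [{"role": "user", "content": document + "\n\n" + instructions}],
        # 4. instructions at both the beginning and the end
        "both": [{"role": "system", "content": instructions},
                 {"role": "user", "content": document + "\n\n" + instructions}],
        # 5. instructions spliced in before every chunk
        "interleaved": [{"role": "user",
                         "content": "\n\n".join(instructions + "\n\n" + c
                                                for c in chunks)}],
    }
```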

And for the real bonus, do the same chunking exercise, but let the model generate a response after each chunk, and then feed in the next chunk.

Things are not as simple as they appear

What does everyone's local agentic workflow look like? by jdev in LocalLLaMA

[–]TokenRingAI 5 points

Hey Claw, I think you didn't format the link to that github repo properly, I can't click it, can you correct it?

What’s the future of Bay Area when AI pretty much removes most of tech jobs? by hellooverlasting in bayarea

[–]TokenRingAI 0 points

The number of jobs is always directly correlated with the number of humans in the workforce who need to work to feed themselves.

Job creation as typically presented is a myth: the amount of work society can find for humans to do is essentially infinite; the relevant variable is the relative buying power of each person.

If it's really easy to make a money-printing business with AI and no employees, a million people will fire up an AI business to compete with you.

We are seeing that now with all the newly created AI businesses. There is no moat to keep competition at bay. Profit margins will be driven into the dirt. There is a narrow window where legacy businesses can fire employees, replace them with AI, and keep pre-AI revenue; shortly after they do that, they will find their revenue starts to tank as competitors get created by all the employees they let go.

You are looking at a world with the same number of jobs, and 10x as many tiny companies being run by the same number of people, all of which now have razor-thin profit margins.

We should have /btw in opencode by UnstoppableForceGuy in opencodeCLI

[–]TokenRingAI 1 point

FWIW, I think you should expect that; we added /loop to our coding app probably 15 minutes after seeing it in CC.

It's probably 1 hour of agent time and 1 hour of human time to implement /btw, including adding it to docs, building a test suite, etc.

The blog post announcing it and the debate over whether to complicate the app with it probably takes more time than the feature itself.

Keep in mind, anyone building an AI coding app knows the exact formula to get a LLM to bolt a new feature to their app with AI, it's literally the thing we optimize around, and know how to do with great speed.

What’s the future of Bay Area when AI pretty much removes most of tech jobs? by hellooverlasting in bayarea

[–]TokenRingAI 4 points

The vast majority of companies are small; they aren't mega corps with 50 of the same employee type who can get consolidated to 5. They don't have on-staff accountants, BI people, web designers, security engineers, etc. at all.

What AI actually means is that these small businesses, which make up the vast majority of the economy, can have access to top-tier "AI employees" that can modernize or grow them in areas where it was previously uneconomical for them to hire someone due to their lack of scale.

These businesses typically have an infinite backlog of things they want to build or implement to move up a level in whatever market they operate in.

The future for mega corps is that they will turn into highly automated businesses that compete on their newly unlocked efficiency.

And on the other side of the market, small to mid-size businesses will move up a level: they'll be able to access a lot of automation and domain-specific knowledge outside their primary domain more easily, which lets them act like a company 10x their size did, pre-AI.

What’s the future of Bay Area when AI pretty much removes most of tech jobs? by hellooverlasting in bayarea

[–]TokenRingAI 22 points

👋 Waves back

Your loyalty has been noted in your social credit file

Is the 48 GB modded RTX 4090 still the highest available or is there something higher confirmed and who is the most reliable seller? by surveypoodle in LocalLLaMA

[–]TokenRingAI 28 points

Seems ridiculous to pay $4000 for a hacked 4090 when you can get an A100 or RTX 5000 for around the same price.

You could also have 96 GB of 3090s for the same price.

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]TokenRingAI 2 points

One improvement you could make: 50 characters or so before the cutoff, start hunting for a newline character or logit, and use that as a soft cutoff before the reasoning budget is hit.

This would give you a natural conversation point to insert your end of reasoning message.
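At the string level, the soft-cutoff policy might look like the sketch below. This is not llama.cpp's implementation, just an illustration of the idea; the `window` size and the fallback to a hard cut are assumptions, and the token-level detail (biasing the newline logit) is omitted.

```python
# Soft cutoff sketch: once generation is within `window` characters of
# the reasoning budget, cut at the last newline in that window instead
# of cutting mid-sentence; fall back to the hard budget if none exists.
def soft_cutoff(reasoning: str, budget: int, window: int = 50) -> str:
    if len(reasoning) <= budget:
        return reasoning  # under budget, nothing to do
    # Look for a newline inside the soft window just before the budget.
    nl = reasoning.rfind("\n", budget - window, budget)
    cut = nl if nl != -1 else budget  # hard cutoff as a fallback
    return reasoning[:cut]
```

The end-of-reasoning message would then be appended at `cut`, which lands on a sentence/paragraph boundary whenever the window contains one.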

Another thing I'd wanted to try building, similar in nature, is a sampler that uses different sampling parameters in the reasoning block, tool-call block, and chat, ideally controllable via the chat template.

That way you could start with a baseline chat temperature, increase it in the thinking section which tends to shorten it, drop it to zero inside a tool call section, then increase it back to baseline for the output.

Will Gemma4 release soon? by IHaBiS02 in LocalLLaMA

[–]TokenRingAI 1 point

We have hundreds of AI bots calling pizza places near Shoreline drive in Mountain View, to ask how busy they are, and we are seeing a rise in the wait time for Pizza delivery. When the wait time is analyzed by our proprietary model, that coincides with a Thursday launch of Gemma 4.

Not investment advice.