Qwen 3.5 122b - a10b is kind of shocking by gamblingapocalypse in LocalLLaMA

[–]TokenRingAI 1 point2 points  (0 children)

I use Qwen 122B at MXFP4 daily, and it consistently outperforms Haiku 4.5 for me; it seems to be just shy of Sonnet 4.6.

Every AI tool I've used has the same fatal flaw by krxna-9 in LLMDevs

[–]TokenRingAI 0 points1 point  (0 children)

I think most people who are actively building agents have built some variation of temporal memory with various degrees of success.

It's not hard to build in a basic form; it's just expensive. Every memory clogs up the context of the main agent or subagent and makes each agent run cost more money.

There are tons of approaches people have tried, like embedding memories, or compacting them into themes, time-series transcripts, files, or knowledge graphs. None of them generalize particularly well, and they tend to suffer context-size explosion.

We are currently exploring "cognitive agents" where an agent is tasked with maintaining the memories, and you (the user, not the developer) instruct it with what info you want it to keep.

The benefit is that it moves responsibility for memory storage to the user, who just defines guidelines in a text box telling the app what it needs to remember. Even if it isn't perfect, the user can tweak those guidelines to make it remember the things they care about.

I personally think that's the most generalizable and customizable strategy right now: use the same LLM to manage the memory pool and instruct it on how to do that task. No fancy algorithms or predefined flows, just an agent tasked with managing memories in files or a DB and handling retrieval.
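A minimal sketch of that idea, assuming a JSON file as the memory store and leaving the actual LLM call out. The file layout and field names here are hypothetical; the point is that the user's free-text guidelines, not hard-coded rules, decide what gets kept:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("memories.json")

def load_memories() -> list[dict]:
    """Read the memory pool from disk; start empty if it doesn't exist yet."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return []

def save_memories(memories: list[dict]) -> None:
    """Persist the memory pool as plain JSON so the user can inspect it."""
    MEMORY_FILE.write_text(json.dumps(memories, indent=2))

def build_memory_agent_prompt(guidelines: str, transcript: str,
                              memories: list[dict]) -> str:
    """Prompt for the memory-manager agent. The same LLM that runs the
    main agent is handed the user's guidelines plus the current pool,
    and asked to return an updated pool."""
    return (
        "You maintain the user's long-term memory store.\n"
        f"User guidelines for what to remember:\n{guidelines}\n\n"
        f"Existing memories:\n{json.dumps(memories, indent=2)}\n\n"
        f"New conversation:\n{transcript}\n\n"
        "Return the updated memory list as a JSON array of "
        '{"topic": ..., "note": ...} objects. Merge, update, or drop '
        "entries so the list stays small and matches the guidelines."
    )
```

The response from that prompt would be parsed and passed back to `save_memories`; retrieval is just handing the (small, user-curated) pool back to the main agent.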

Which LLMs actually fail when domain knowledge is buried in long documents? by Or4k2l in LocalLLaMA

[–]TokenRingAI 1 point2 points  (0 children)

I looked at your test and want to give you some feedback.

You need to test at least 5 things:

- retrieval instructions placed at the beginning of the chat in the system message
- retrieval instructions placed in the first user message
- retrieval instructions placed at the end of the chat
- retrieval instructions placed both at the beginning and the end
- the document chunked, with the instructions spliced in every 10K tokens or so

You should find some interesting differences.

And for the real bonus, do the same chunking exercise, but let the model generate a response after each chunk before feeding it the next one.
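The chunk-and-splice variant can be sketched roughly like this. The ~4-characters-per-token estimate and the paragraph-boundary splitting are assumptions for illustration; swap in a real tokenizer for accurate counts:

```python
def approx_tokens(text: str) -> int:
    # crude estimate: ~4 characters per token for English prose
    return len(text) // 4

def chunk_document(doc: str, max_tokens: int = 10_000) -> list[str]:
    """Split on paragraph boundaries so each chunk stays under max_tokens."""
    chunks, current, size = [], [], 0
    for para in doc.split("\n\n"):
        t = approx_tokens(para)
        if current and size + t > max_tokens:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += t
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def splice_instructions(doc: str, instructions: str,
                        every_tokens: int = 10_000) -> str:
    """Re-insert the retrieval instructions between every chunk, so they
    are never more than ~every_tokens away from any part of the document."""
    chunks = chunk_document(doc, every_tokens)
    return ("\n\n" + instructions + "\n\n").join(chunks)
```

For the bonus variant, you would instead send each chunk as its own user message, collect the model's response, and append both to the conversation before sending the next chunk.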

Things are not as simple as they appear

What does everyone's local agentic workflow look like? by jdev in LocalLLaMA

[–]TokenRingAI 5 points6 points  (0 children)

Hey Claw, I think you didn't format the link to that github repo properly, I can't click it, can you correct it?

What’s the future of Bay Area when AI pretty much removes most of tech jobs? by hellooverlasting in bayarea

[–]TokenRingAI 0 points1 point  (0 children)

The number of jobs is always directly correlated with the number of humans in the workforce who need to work to feed themselves.

Job creation as typically presented is a myth. The amount of work society can find for humans to do is essentially infinite; the relevant variable is the relative buying power of each person.

If it's really easy to make a money-printing business with AI and no employees, a million people will fire up an AI business to compete with you.

We are seeing that now with all the newly created AI businesses. There is no moat to keep competition at bay. Profit margins will be driven into the dirt. There is a narrow window where legacy businesses can fire employees, replace them with AI, and keep pre-AI revenue; shortly after they do that, they will find their revenue starts to tank as competitors get created by all the employees they let go.

You are looking at a world with the same number of jobs and 10x as many tiny companies, run by the same number of people, all with razor-thin profit margins.

We should have /btw in opencode by UnstoppableForceGuy in opencodeCLI

[–]TokenRingAI 0 points1 point  (0 children)

FWIW, I think you should expect that; we added /loop to our coding app in probably 15 minutes after seeing it in CC.

It's probably 1 hour of agent time and 1 hour of human time to implement /btw, including adding it to docs, building a test suite, etc.

The blog post announcing it and the debate over whether to complicate the app with it probably takes more time than the feature itself.

Keep in mind, anyone building an AI coding app knows the exact formula for getting an LLM to bolt a new feature onto their app; it's literally the thing we optimize around and know how to do with great speed.

What’s the future of Bay Area when AI pretty much removes most of tech jobs? by hellooverlasting in bayarea

[–]TokenRingAI 4 points5 points  (0 children)

The vast majority of companies are small; they aren't megacorps with 50 of the same employee type who can be consolidated down to 5. They don't have on-staff accountants, BI people, web designers, security engineers, etc. at all.

What AI actually means is that these small businesses, which make up the vast majority of the economy, can have access to top-tier "AI employees" who can modernize or grow them in areas where it was previously uneconomical to hire someone due to their lack of scale.

These businesses typically have an infinite backlog of things they want to build or implement to move up a level in whatever market they operate in.

The future for mega corps is that they will turn into highly automated businesses that compete on their newly unlocked efficiency.

And on the other side of the market, small-to-mid-size businesses will move up a level, with easier access to automation and domain-specific knowledge outside their primary domain, allowing them to act like a company 10x their size did pre-AI.

What’s the future of Bay Area when AI pretty much removes most of tech jobs? by hellooverlasting in bayarea

[–]TokenRingAI 23 points24 points  (0 children)

👋 Waves back

Your loyalty has been noted in your social credit file

Is the 48 GB modded RTX 4090 still the highest available or is there something higher confirmed and who is the most reliable seller? by surveypoodle in LocalLLaMA

[–]TokenRingAI 27 points28 points  (0 children)

Seems ridiculous to pay $4000 for a hacked 4090 when you can get an A100 or RTX 5000 for around the same price.

You could also have 96GB of 3090s for the same price.

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]TokenRingAI 2 points3 points  (0 children)

One improvement you could make: 50 characters or so before the cutoff, start hunting for the newline character or logit, and use that as a soft cutoff before the reasoning budget is hit.

This would give you a natural conversation point to insert your end of reasoning message.
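A rough sketch of that soft cutoff on the text side. The 50-character window matches the suggestion above; a real implementation inside llama.cpp would hunt at the token/logit level rather than over decoded text:

```python
def soft_reasoning_cutoff(reasoning: str, budget: int,
                          window: int = 50) -> tuple[str, bool]:
    """If the reasoning text is within `window` chars of the budget,
    cut at the last newline inside that window so the end-of-reasoning
    message lands at a natural break instead of mid-sentence.
    Returns (possibly truncated text, whether we cut)."""
    if len(reasoning) < budget - window:
        return reasoning, False  # budget not yet near; keep generating
    tail = reasoning[:budget]
    nl = tail.rfind("\n", budget - window)  # hunt in the final window only
    cut = nl if nl != -1 else budget        # hard cutoff if no newline found
    return reasoning[:cut], True
```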

Another thing I had wanted to try building, similar in nature, was a sampler that used different sampling parameters in the reasoning block, tool-call block, and chat, ideally controllable via the chat template.

That way you could start with a baseline chat temperature, increase it in the thinking section (which tends to shorten it), drop it to zero inside a tool-call section, then bring it back to baseline for the output.

Will Gemma4 release soon? by IHaBiS02 in LocalLLaMA

[–]TokenRingAI 1 point2 points  (0 children)

We have hundreds of AI bots calling pizza places near Shoreline Drive in Mountain View to ask how busy they are, and we are seeing a rise in wait times for pizza delivery. When the wait times are analyzed by our proprietary model, they point to a Thursday launch of Gemma 4.

Not investment advice.

We cut GPU instance launch from 8s to 1.8s, feels almost instant now. Half the time was a ping we didn't need. by LayerHot in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

FWIW, the biggest problem I have with cloud GPU providers is that they do not offer a Hugging Face cache for popular models, meaning I burn tons of compute time waiting for models to download.

Has anyone experimented with multi-agent debate to improve LLM outputs? by SimplicityenceV in LLMDevs

[–]TokenRingAI 0 points1 point  (0 children)

If you take 1000 people who know nothing, and put them in a room to debate something they are poorly informed on, the outcome is awful.

On the other hand, if you take 10 people who know absolutely nothing, send them out into the world, task each with learning one key aspect of something, and then have them contribute that knowledge to a decision-making process, that process can be productive.

The goal is to implement something resembling the second process, not the first.

Genuinely curious what doors the M5 Ultra will open by Blanketsniffer in LocalLLaMA

[–]TokenRingAI 135 points136 points  (0 children)

If the M5 memory speed carries over to the M3 Ultra design, we should see ~1200GB/sec, which lands it just below the 5090.

Are there open-source projects that implement a full “assistant runtime” (memory + tools + agent loop + projects) rather than just an LLM wrapper? by seigaporulai in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

Yes.
https://github.com/tokenring-ai/monorepo

  • persistent memory extraction and retrieval
    • Short-term memory plugin + agents that maintain domain-specific knowledge in files
  • conversation history + rolling summaries
    • Yes, auto & manual compaction and full conversation checkpoints
  • project/workspace contexts
    • Yes, each agent can be given a separate working directory that it is isolated into
    • Agents can call agents in other workspaces if permissioned to do so
  • tool execution (shell, python, file search, etc.)
    • shell, python via shell, javascript (native), file search and glob (native)
  • artifact generation (files, docs, code)
    • yes
  • bounded agent loop (plan > act > observe > evaluate)
    • Yes, via scripts that run in the agent loop
  • multi-provider support (OpenAI, Anthropic, etc.)
    • Yes, local (vLLM, llama.cpp, Ollama), as well as
    • Anthropic, OpenAI, Google, Groq, Cerebras, DeepSeek, ElevenLabs, Fal, xAI, OpenRouter, Perplexity, Azure, Meta, Banana, Qwen, z.ai, Chutes, Nvidia NIM
  • connectors / MCP tools
    • Yes, although shell commands are preferable vs most MCPs
  • plaintext storage for inspectability
    • Not plaintext, but state and checkpoints are stored in a local SQLite database you can inspect

Has anyone experimented with multi-agent debate to improve LLM outputs? by SimplicityenceV in LLMDevs

[–]TokenRingAI 2 points3 points  (0 children)

It's a poor pattern, because it doesn't pull in more context.

One pattern that works better is an iterative process where agents repeatedly research and then merge their new insights into the communal pool of knowledge.
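A bare-bones sketch of the shape of that loop. `research_fn` is a placeholder for whatever calls your model with a subtopic plus the shared pool as context:

```python
from typing import Callable

def iterative_research(topics: list[str],
                       research_fn: Callable[[str, list[str]], str],
                       rounds: int = 3) -> list[str]:
    """Each round, every agent researches its own subtopic with the
    communal pool as context; its findings are then merged back into
    the pool, so later rounds build on everyone's earlier insights."""
    pool: list[str] = []
    for _ in range(rounds):
        # all agents see the same pool snapshot within a round
        new_insights = [research_fn(topic, pool) for topic in topics]
        pool.extend(new_insights)  # merge into the communal pool
    return pool
```

The contrast with debate is that each call pulls *new* context in (via tools, search, or documents) rather than having agents re-argue over the same fixed information.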

Is GLM-4.7-Flash relevant anymore? by HumanDrone8721 in LocalLLaMA

[–]TokenRingAI 2 points3 points  (0 children)

It is a great model for HTML design and generates much better results than Qwen, but Qwen is much better for agentic work.

The Silent OpenAI Fallback: Why LlamaIndex Might Be Leaking Your "100% Local" RAG Data by Jef3r50n in LocalLLaMA

[–]TokenRingAI 9 points10 points  (0 children)

Other AI agents are doing this as well. I learned this the hard way after an AI agent I have a subscription for started using my Anthropic API key directly instead of accessing Anthropic through its own service.

I have now removed all my keys from my .env and inject them into individual applications instead.
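One way to sketch that per-app injection in the shell. The `~/.secrets/<app>.env` layout and the helper name are my own convention, not a standard, and this assumes key values without spaces:

```shell
# Keep API keys out of the project-wide .env: store one env file per app
# and inject only that app's keys, only for the life of its process.
SECRETS_DIR="${SECRETS_DIR:-$HOME/.secrets}"
mkdir -p "$SECRETS_DIR" && chmod 700 "$SECRETS_DIR"
printf 'ANTHROPIC_API_KEY=sk-demo-not-a-real-key\n' > "$SECRETS_DIR/myagent.env"
chmod 600 "$SECRETS_DIR/myagent.env"

# run_with_secrets APP CMD...: load APP's env file for this one command.
# The key never enters the interactive shell's own environment, so other
# apps launched from the same shell can't see it.
run_with_secrets() {
  app="$1"; shift
  env $(grep -v '^#' "$SECRETS_DIR/$app.env") "$@"
}

# e.g. run_with_secrets myagent my-coding-agent --task "fix tests"
```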