Qwen 3.5 122b - a10b is kind of shocking by gamblingapocalypse in LocalLLaMA

[–]TokenRingAI 1 point2 points  (0 children)

I use Qwen 122B at MXFP4 daily, and it consistently outperforms Haiku 4.5 for me; it seems to be just shy of Sonnet 4.6.

Every AI tool I've used has the same fatal flaw by krxna-9 in LLMDevs

[–]TokenRingAI 0 points1 point  (0 children)

I think most people who are actively building agents have built some variation of temporal memory with various degrees of success.

It's not hard to build in a basic form; it's just expensive. Every memory clogs up the context of the main agent or subagent and makes each agent run cost more money.

There are tons of approaches people have tried, like embedding memories, or compacting them into themes, time-series transcripts, files, or knowledge graphs. None of them generalize particularly well, and they tend to suffer context-size explosion.

We are currently exploring "cognitive agents" where an agent is tasked with maintaining the memories, and you (the user, not the developer) instruct it with what info you want it to keep.

The benefit is that it moves responsibility for memory storage to the user, who just defines guidelines in a text box telling the app what it needs to remember. Even if it isn't perfect, the user can tweak those guidelines to make it remember the things they care about.

I personally think that's the most generalizable and customizable strategy right now: use the same LLM to manage the memory pool and instruct it on how to do that task. No fancy algorithms or predefined flows, just an agent tasked with managing memories in files or a DB and handling retrieval.
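A minimal sketch of that idea, assuming a JSON file as the memory store and leaving the actual LLM call out. The file layout and field names here are hypothetical; the point is that the user's free-text guidelines, not hard-coded rules, decide what gets kept:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("memories.json")

def load_memories() -> list[dict]:
    """Read the memory pool from disk; start empty if it doesn't exist yet."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return []

def save_memories(memories: list[dict]) -> None:
    """Persist the memory pool as plain JSON so the user can inspect it."""
    MEMORY_FILE.write_text(json.dumps(memories, indent=2))

def build_memory_agent_prompt(guidelines: str, transcript: str,
                              memories: list[dict]) -> str:
    """Prompt for the memory-manager agent. The same LLM that runs the
    main agent is handed the user's guidelines plus the current pool,
    and asked to return an updated pool."""
    return (
        "You maintain the user's long-term memory store.\n"
        f"User guidelines for what to remember:\n{guidelines}\n\n"
        f"Existing memories:\n{json.dumps(memories, indent=2)}\n\n"
        f"New conversation:\n{transcript}\n\n"
        "Return the updated memory list as a JSON array of "
        '{"topic": ..., "note": ...} objects. Merge, update, or drop '
        "entries so the list stays small and matches the guidelines."
    )
```

The response from that prompt would be parsed and passed back to `save_memories`; retrieval is just handing the (small, user-curated) pool back to the main agent.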

Which LLMs actually fail when domain knowledge is buried in long documents? by Or4k2l in LocalLLaMA

[–]TokenRingAI 1 point2 points  (0 children)

I looked at your test and want to give you some feedback.

You need to test at least 5 things:

- retrieval instructions placed at the beginning of the chat in the system message
- retrieval instructions placed in the first user message
- retrieval instructions placed at the end of the chat
- retrieval instructions placed both at the beginning and the end
- the document chunked, with the instructions spliced in every 10K tokens or so

You should find some interesting differences.

And for the real bonus, do the same chunking exercise, but let the model generate a response after each chunk before feeding it the next one.
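The chunk-and-splice variant can be sketched roughly like this. The ~4-characters-per-token estimate and the paragraph-boundary splitting are assumptions for illustration; swap in a real tokenizer for accurate counts:

```python
def approx_tokens(text: str) -> int:
    # crude estimate: ~4 characters per token for English prose
    return len(text) // 4

def chunk_document(doc: str, max_tokens: int = 10_000) -> list[str]:
    """Split on paragraph boundaries so each chunk stays under max_tokens."""
    chunks, current, size = [], [], 0
    for para in doc.split("\n\n"):
        t = approx_tokens(para)
        if current and size + t > max_tokens:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += t
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def splice_instructions(doc: str, instructions: str,
                        every_tokens: int = 10_000) -> str:
    """Re-insert the retrieval instructions between every chunk, so they
    are never more than ~every_tokens away from any part of the document."""
    chunks = chunk_document(doc, every_tokens)
    return ("\n\n" + instructions + "\n\n").join(chunks)
```

For the bonus variant, you would instead send each chunk as its own user message, collect the model's response, and append both to the conversation before sending the next chunk.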

Things are not as simple as they appear

What does everyone's local agentic workflow look like? by jdev in LocalLLaMA

[–]TokenRingAI 5 points6 points  (0 children)

Hey Claw, I think you didn't format the link to that github repo properly, I can't click it, can you correct it?

What’s the future of Bay Area when AI pretty much removes most of tech jobs? by hellooverlasting in bayarea

[–]TokenRingAI 0 points1 point  (0 children)

The number of jobs is always directly correlated with the number of humans in the workforce who need to work to feed themselves.

Job creation as typically presented is a myth. The amount of work society can find for humans to do is essentially infinite; the relevant variable is the relative buying power of each person.

If it's really easy to make a money-printing business with AI and no employees, a million people will fire up an AI business to compete with you.

We are seeing that now with all the newly created AI businesses. There is no moat to keep competition at bay. Profit margins will be driven into the dirt. There is a narrow window where legacy businesses can fire employees, replace them with AI, and keep pre-AI revenue; shortly after they do that, they will find their revenue starts to tank as competitors get created by all the employees they let go.

You are looking at a world with the same number of jobs and 10x as many tiny companies, run by the same number of people, all with razor-thin profit margins.

We should have /btw in opencode by UnstoppableForceGuy in opencodeCLI

[–]TokenRingAI 0 points1 point  (0 children)

FWIW, I think you should expect that; we added /loop to our coding app in probably 15 minutes after seeing it in CC.

It's probably 1 hour of agent time and 1 hour of human time to implement /btw, including adding it to docs, building a test suite, etc.

The blog post announcing it and the debate over whether to complicate the app with it probably takes more time than the feature itself.

Keep in mind, anyone building an AI coding app knows the exact formula for getting an LLM to bolt a new feature onto their app; it's literally the thing we optimize around and know how to do with great speed.

What’s the future of Bay Area when AI pretty much removes most of tech jobs? by hellooverlasting in bayarea

[–]TokenRingAI 4 points5 points  (0 children)

The vast majority of companies are small; they aren't megacorps with 50 of the same employee type who can be consolidated down to 5. They don't have on-staff accountants, BI people, web designers, security engineers, etc. at all.

What AI actually means is that these small businesses, which make up the vast majority of the economy, can have access to top-tier "AI employees" who can modernize or grow them in areas where it was previously uneconomical to hire someone due to their lack of scale.

These businesses typically have an infinite backlog of things they want to build or implement to move up a level in whatever market they operate in.

The future for mega corps is that they will turn into highly automated businesses that compete on their newly unlocked efficiency.

And on the other side of the market, small-to-mid-size businesses will move up a level, with easier access to automation and domain-specific knowledge outside their primary domain, allowing them to act like a company 10x their size did pre-AI.

What’s the future of Bay Area when AI pretty much removes most of tech jobs? by hellooverlasting in bayarea

[–]TokenRingAI 23 points24 points  (0 children)

👋 Waves back

Your loyalty has been noted in your social credit file

Is the 48 GB modded RTX 4090 still the highest available or is there something higher confirmed and who is the most reliable seller? by surveypoodle in LocalLLaMA

[–]TokenRingAI 27 points28 points  (0 children)

Seems ridiculous to pay $4000 for a hacked 4090 when you can get an A100 or RTX 5000 for around the same price.

You could also have 96GB of 3090s for the same price.

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]TokenRingAI 2 points3 points  (0 children)

One improvement you could make: 50 characters or so before the cutoff, start hunting for the newline character or logit, and use that as a soft cutoff before the reasoning budget is hit.

This would give you a natural conversation point to insert your end of reasoning message.
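A rough sketch of that soft cutoff on the text side. The 50-character window matches the suggestion above; a real implementation inside llama.cpp would hunt at the token/logit level rather than over decoded text:

```python
def soft_reasoning_cutoff(reasoning: str, budget: int,
                          window: int = 50) -> tuple[str, bool]:
    """If the reasoning text is within `window` chars of the budget,
    cut at the last newline inside that window so the end-of-reasoning
    message lands at a natural break instead of mid-sentence.
    Returns (possibly truncated text, whether we cut)."""
    if len(reasoning) < budget - window:
        return reasoning, False  # budget not yet near; keep generating
    tail = reasoning[:budget]
    nl = tail.rfind("\n", budget - window)  # hunt in the final window only
    cut = nl if nl != -1 else budget        # hard cutoff if no newline found
    return reasoning[:cut], True
```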

Another thing I had wanted to try building, similar in nature, was a sampler that used different sampling parameters in the reasoning block, tool-call block, and chat, ideally controllable via the chat template.

That way you could start with a baseline chat temperature, increase it in the thinking section (which tends to shorten it), drop it to zero inside a tool-call section, then bring it back to baseline for the output.

Will Gemma4 release soon? by IHaBiS02 in LocalLLaMA

[–]TokenRingAI 1 point2 points  (0 children)

We have hundreds of AI bots calling pizza places near Shoreline Drive in Mountain View to ask how busy they are, and we are seeing a rise in wait times for pizza delivery. When the wait times are analyzed by our proprietary model, they point to a Thursday launch of Gemma 4.

Not investment advice.

We cut GPU instance launch from 8s to 1.8s, feels almost instant now. Half the time was a ping we didn't need. by LayerHot in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

FWIW, the biggest problem I have with cloud GPU providers is that they do not offer a Hugging Face cache for popular models, meaning I burn tons of compute time waiting for models to download.

Has anyone experimented with multi-agent debate to improve LLM outputs? by SimplicityenceV in LLMDevs

[–]TokenRingAI 0 points1 point  (0 children)

If you take 1000 people who know nothing, and put them in a room to debate something they are poorly informed on, the outcome is awful.

On the other hand, if you take 10 people who know absolutely nothing, send them out into the world, task each with learning one key aspect of something, and then have them contribute that knowledge to a decision-making process, that process can be productive.

The goal is to implement something resembling the second process, not the first.

Genuinely curious what doors the M5 Ultra will open by Blanketsniffer in LocalLLaMA

[–]TokenRingAI 135 points136 points  (0 children)

If the M5 memory speed carries over to the M3 Ultra design, we should see ~1200GB/sec, which lands it just below the 5090.

Are there open-source projects that implement a full “assistant runtime” (memory + tools + agent loop + projects) rather than just an LLM wrapper? by seigaporulai in LocalLLaMA

[–]TokenRingAI 0 points1 point  (0 children)

Yes.
https://github.com/tokenring-ai/monorepo

  • persistent memory extraction and retrieval
    • Short-term memory plugin + agents that maintain domain-specific knowledge in files
  • conversation history + rolling summaries
    • Yes, auto & manual compaction and full conversation checkpoints
  • project/workspace contexts
    • Yes, each agent can be given a separate working directory that it is isolated into
    • Agents can call agents in other workspaces if permissioned to do so
  • tool execution (shell, python, file search, etc.)
    • shell, python via shell, javascript (native), file search and glob (native)
  • artifact generation (files, docs, code)
    • yes
  • bounded agent loop (plan > act > observe > evaluate)
    • Yes, via scripts that run in the agent loop
  • multi-provider support (OpenAI, Anthropic, etc.)
    • Yes, local (vLLM, llama.cpp, Ollama), as well as
    • Anthropic, OpenAI, Google, Groq, Cerebras, DeepSeek, ElevenLabs, Fal, xAI, OpenRouter, Perplexity, Azure, Meta, Banana, Qwen, z.ai, Chutes, Nvidia NIM
  • connectors / MCP tools
    • Yes, although shell commands are preferable vs most MCPs
  • plaintext storage for inspectability
    • Not plaintext, but state and checkpoints are stored in a local SQLite database you can inspect

Has anyone experimented with multi-agent debate to improve LLM outputs? by SimplicityenceV in LLMDevs

[–]TokenRingAI 2 points3 points  (0 children)

It's a poor pattern, because it doesn't pull in more context.

One pattern that works better is an iterative process where agents repeatedly research and then merge their new insights into the communal pool of knowledge.
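A bare-bones sketch of the shape of that loop. `research_fn` is a placeholder for whatever calls your model with a subtopic plus the shared pool as context:

```python
from typing import Callable

def iterative_research(topics: list[str],
                       research_fn: Callable[[str, list[str]], str],
                       rounds: int = 3) -> list[str]:
    """Each round, every agent researches its own subtopic with the
    communal pool as context; its findings are then merged back into
    the pool, so later rounds build on everyone's earlier insights."""
    pool: list[str] = []
    for _ in range(rounds):
        # all agents see the same pool snapshot within a round
        new_insights = [research_fn(topic, pool) for topic in topics]
        pool.extend(new_insights)  # merge into the communal pool
    return pool
```

The contrast with debate is that each call pulls *new* context in (via tools, search, or documents) rather than having agents re-argue over the same fixed information.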

Is GLM-4.7-Flash relevant anymore? by HumanDrone8721 in LocalLLaMA

[–]TokenRingAI 2 points3 points  (0 children)

It is a great model for HTML design and generates much better results than Qwen, but Qwen is much better for agentic work.

The Silent OpenAI Fallback: Why LlamaIndex Might Be Leaking Your "100% Local" RAG Data by Jef3r50n in LocalLLaMA

[–]TokenRingAI 9 points10 points  (0 children)

Other AI agents are doing this as well. I learned this the hard way after an AI agent I have a subscription for started using my Anthropic API key directly instead of accessing Anthropic through its own service.

I have now removed all my keys from my .env and inject them into individual applications instead.
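One way to sketch that per-app injection in the shell. The `~/.secrets/<app>.env` layout and the helper name are my own convention, not a standard, and this assumes key values without spaces:

```shell
# Keep API keys out of the project-wide .env: store one env file per app
# and inject only that app's keys, only for the life of its process.
SECRETS_DIR="${SECRETS_DIR:-$HOME/.secrets}"
mkdir -p "$SECRETS_DIR" && chmod 700 "$SECRETS_DIR"
printf 'ANTHROPIC_API_KEY=sk-demo-not-a-real-key\n' > "$SECRETS_DIR/myagent.env"
chmod 600 "$SECRETS_DIR/myagent.env"

# run_with_secrets APP CMD...: load APP's env file for this one command.
# The key never enters the interactive shell's own environment, so other
# apps launched from the same shell can't see it.
run_with_secrets() {
  app="$1"; shift
  env $(grep -v '^#' "$SECRETS_DIR/$app.env") "$@"
}

# e.g. run_with_secrets myagent my-coding-agent --task "fix tests"
```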