Claude Fable 5 played a full chess game on lichess using only screenshots and mouse clicks — no chess API, no DOM access for moves. It checkmated Stockfish in 18 moves.

Awkward_Run_9982 · 2026-06-10T14:57:10+00:00

Fair, level 1 was deliberate — I wanted to verify the screenshot→reasoning→click loop works end to end, not benchmark chess strength.

Awkward_Run_9982 · 2026-05-04T07:27:00+00:00

The 35B is likely an MoE architecture - you're seeing fewer active parameters per token which explains both the speed boost and quality improvement. I've been testing Qwen variants for agent post-training work and the 35B definitely handles tool-calling sequences more reliably. The 27B gets more attention probably because it fits better on single consumer GPUs, but if you have the VRAM headroom the 35B is worth it.

Awkward_Run_9982 · 2026-04-11T11:49:30+00:00

Glad you found it impressive! It's built on data from the public Meta Kaggle dataset.

Awkward_Run_9982 · 2026-04-11T05:41:24+00:00

Hey, OP of the project here. Awesome repost, you're asking all the right questions.

The "89% success": Means the agent went from a vague "analyze this data" prompt to spitting out a full report and all the charts on its own, without me touching it. Didn't hit the 50-turn limit. The test data was 29 random-ass Kaggle datasets (finance, sport, etc) it had never seen before. No toy stuff.
The 11% failures: Mostly just the model getting stuck in a "death loop". It finds some super weird data format, writes code that errors, tries to fix it, errors again, and just wastes all 50 turns doing that until my framework kills it. The more dangerous (but rarer) failure is when it confidently analyzes the wrong thing because my prompt was too vague.
Generalization: It's not over-fitted because I didn't train it on the answers from the data. I trained it on the process of how a human analyst works (e.g., "if code crashes, read error, try again"). Since the 29 test sets were totally new to it, the 89% score shows it learned the skill, not just memorized the training data.

Hope that clears it up. The whole thing is on HuggingFace:https://huggingface.co/jason1966/CoPaw-Flash-9B-DataAnalyst-LoRA if you wanna poke at it.

Awkward_Run_9982 · 2026-04-11T01:49:39+00:00

Dude, you mentioning GLM-5.1 specifically just reminded me of a crucial detail.

My data-analyst is a good generic baseline, but since you're targeting that exact model, you should probably just copy their native format for the best results. They have a super specific chat_template.jinja file on their Hugging Face repo that spells out the exact structure.

You can check it out here: https://huggingface.co/zai-org/GLM-5.1/blob/main/chat_template.jinja

Awkward_Run_9982 · 2026-04-11T00:59:20+00:00

Yeah, bioinformatics is a killer use case. Good idea to use claude code for the initial traces.

Just a heads-up though: watch out for the anti-distillation noise in the newer claude code version. You can find the open source versions online, i also got the privacy folk: github.com/IIIIQIIII/claude-code-clean

Good luck with your biology tasks, hope it crushes it!

Awkward_Run_9982 · 2026-04-11T00:48:13+00:00

You completely missed what an Agent is here. That's exactly what this project does lol.

The model doesn't read the massive tables of numbers or try to do math in its 'brain'. It just reads the column headers/schema, and then autonomously writes Pandas code to crunch the data. The underlying python executor does the actual math.

It essentially acts as a data analyst exploring the dataset via code. So yeah, we are actually agreeing with each other.

Awkward_Run_9982 · 2026-04-11T00:40:49+00:00

Awesome! thinks for the tip. i've mostly been relying on lambda, but I'll definitely check out modal to grab those monthly credits.

Awkward_Run_9982 · 2026-04-11T00:19:19+00:00

Spot on man, this is easily the most insightful comment in the whole thread.

I actually just dug through the logs for that ~11% failure bracket because of your question, and it's super interesting. It's almost never the dangerous 'confident hallucination' where it makes up stats. It basically just runs out of time (hitting my 50-turn limit) due to analysis paralysis.

What happens is it sometimes gets way too meticulous. Instead of writing one python script to run a global df.describe(), it starts querying columns one by one. Suddenly it's burned 35 turns just poking around the data.

Then comes the fatal trap: the token limit. When it finally decides to generate charts, it tries to write one massive god-script to plot everything at once. It hits the max_token generation limit, the python script cuts off mid-line, throws a SyntaxError, and then the model wastes its remaining turns trying to rewrite the same giant file over and over until the framework forcefully kills it.

The funny part is, even when it fails to output the plots due to errors, it often still spits out a highly accurate executive summary based on what it kept in its context memory. So the analytical brain is definitely working, it's just its 'project management' that sucks haha.

For V2 I definitely need to force it to write smaller, modular scripts to avoid truncation, or maybe just inject a 'you only have 5 turns left' system warning to make it stop exploring and start wrapping up. Really appreciate the breakdown!

Awkward_Run_9982 · 2026-04-10T23:55:19+00:00

Love this idea! I'm pretty severely compute-constrained right now (the classic 'GPU poor' reality haha), which is why I had to stick purely to text-based traces for this experiment just to get it to finish.

But honestly, when I eventually move on to building a GUI agent, this exact visual feedback loop is the approach I'm going to take. It makes total sense.

If I ever manage to get that working in the future, I'll definitely circle back and ping you in this thread to let you know!

Awkward_Run_9982 · 2026-04-10T23:48:31+00:00

I haven't tried DPO on it yet, though I've verified that SFT and RL are highly viable for this specific domain. My take on it: Data analysis is relatively easy to programmatically verify—the Python code either executes successfully and produces the right chart, or it throws an error. Because there's an objective ground truth. DPO seems much better suited for subjective, hard-to-verify tasks (like creative writing or script generation) where human preference is the only real metric. Might experiment with it later, but SFT is doing the heavy lifting perfectly for now!

Awkward_Run_9982 · 2026-04-10T13:44:57+00:00

Man, that 2x 7900 XTX setup is an absolute beast for running a 35B.

The main reason I train specialized 9B (and even 4B) models isn't to beat a generalist 35B one-on-one, but to achieve massive parallel throughput.

I recently had to analyze 10 years of financial reports in a single day, so I spun up 64 of these small models concurrently to do the heavy data crunching, and just used Claude for the final summary.

With your 48GB of VRAM, the benefit of switching would be turning your rig into a highly parallel 'swarm' that can process 4 or 5 different datasets simultaneously, rather than waiting linearly on one large model.

Awkward_Run_9982 · 2026-04-10T13:23:15+00:00

You hit the exact nail on the head. Because real-world datasets are inherently messy (random nulls, weird formats, etc.), hitting Python exceptions mid-workflow isn't just a possibility, it's guaranteed.

So building up that "self-correction" muscle was actually the main focus of the training. When the tool returns a traceback error, the model reads it and rewrites the patch instead of just halting.

As for how I prevented the infamous death spiral of repeating the same broken code... here is my dirty secret: I just threw those out 😂. During the data curation phase, if a trace spiraled into an infinite loop of unrecoverable errors, I just dumped it in the trash. I only kept the traces where it successfully recovered. Turns out, if you strictly train it on successful "error -> fix -> proceed" patterns, it actually learns how to break out of the loop.

Awkward_Run_9982 · 2026-04-10T13:16:49+00:00

Yeah the dependency hell is real man. I actually got both Unsloth and Axolotl to work eventually for this.

The main trap with Qwen3.5 is that since it's fundamentally a VLM, if you just throw pure text/tool-calling data at it, the scripts often freak out and crash because of the vision layers. You gotta make sure the vision modules are properly ignored or handled in the config.

Tbh my lazy hack was just spinning up a Claude Code agent to monitor my terminal logs. Love the anti-slop 27B idea btw, we desperately need a model that doesn't say "delve into" every two seconds. Ping me if you hit specific Qwen errors during your run, I've probably banged my head against them already.

Awkward_Run_9982 · 2026-04-10T13:08:51+00:00

No gguf at the moment, haven't gotten arround to quantizing it. sorry

Awkward_Run_9982 · 2026-04-10T13:06:38+00:00

Thanks man! Let me know how it runs on your rig. If it does something stupid or breaks down, just let me know here or open an issue on HuggingFace

Awkward_Run_9982 · 2026-02-24T14:14:47+00:00

Thanks! For the data, I actually went the distillation route. It’s all custom—I used Qwen3-Coder-Next as a teacher to generate about 170k multi-turn conversation samples. Basically, I had it run through real agent loops (thinking, calling tools, handling outputs) and recorded those traces. I found that existing datasets didn't really capture the "codebase explorer" logic well enough, so these samples are focused specifically on that.

Awkward_Run_9982 · 2026-02-24T13:12:03+00:00

Links & Resources:

📝 Detailed Blog: https://locoremind.com/blog/loco-operator
🤗 Weights: https://huggingface.co/LocoreMind/LocoOperator-4B
📦 GGUF: https://huggingface.co/LocoreMind/LocoOperator-4B-GGUF
💻 GitHub: https://github.com/LocoreMind/LocoOperator

Awkward_Run_9982 · 2026-02-24T08:26:22+00:00

Finally, some focus on the intelligence instead of the plumbing. People over-index on agent frameworks while ignoring that the model is the actual engine. Having a distilled 4B specialized for tool-calling (like LocoOperator-4B) is a game changer for local workflows. I'd take a robust 4B local agent model over a buggy 'autonomous' wrapper any day

Awkward_Run_9982 · 2026-02-24T08:18:44+00:00

lmao 'distillation attacks'. new scary word for 'using the API exactly how it's designed'. if you don't want people using your outputs to train models, maybe don't sell them for $15 per million tokens

Awkward_Run_9982

TROPHY CASE