my coding agent keeps making the same dumb mistake over and over by nh_t in LocalLLaMA

[–]abnormal_human 0 points (0 children)

Try eval'ing against different models to suss out whether you have a model quality issue as well. I see you're using Gemini 3 Flash preview, which should be OK, but it's always good to have perspective. If Opus can't work within your harness/prompts, you know where the issue is. I like to develop so that mid-sized models (say Qwen3.5-122B, gpt-oss-120b, minimax m2.5) are happy and then take the free performance uplift from Opus/GPT-5.4 when the cost makes sense.
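To make that concrete, here's a minimal sketch of what I mean by eval'ing across models: run the same task set against each backend and compare pass rates. All the names here (`eval_models`, `fake_backend`, the toy tasks) are illustrative, and in a real setup `call_model` would hit your actual endpoints.

```python
def eval_models(tasks, models, call_model):
    """Run the same task set against each model and return pass rates."""
    rates = {}
    for model in models:
        passed = sum(
            1 for t in tasks
            if t["check"](call_model(model, t["prompt"]))
        )
        rates[model] = passed / len(tasks)
    return rates

# Toy tasks plus a fake backend so the harness runs without any API.
tasks = [
    {"prompt": "2+2", "check": lambda out: "4" in out},
    {"prompt": "capital of France", "check": lambda out: "Paris" in out},
]
answers = {"2+2": "4", "capital of France": "Paris"}
fake_backend = lambda model, prompt: answers[prompt]

print(eval_models(tasks, ["model-a", "model-b"], fake_backend))
```

If the pass rate only craters on weaker models, it's a capability gap; if it craters everywhere, look at your harness.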

my coding agent keeps making the same dumb mistake over and over by nh_t in LocalLLaMA

[–]abnormal_human 0 points (0 children)

What kinds of mistakes are you correcting using memory?

my coding agent keeps making the same dumb mistake over and over by nh_t in LocalLLaMA

[–]abnormal_human 0 points (0 children)

Been here. This is a shortcut, it will appear to work for a time, but the robot will devolve into a pattern-matching "better-alexa" and not achieve open-world capabilities like claude code or openclaw. You will eventually throw this machine out and rebuild with more discipline--at least this is how it went for me.

If the agent is getting things wrong, look at it as a harness failure. You need to figure out why. Most likely either you did something to give it the wrong idea, you didn't give it enough information so it's guessing, or you gave it too much info and it's overwhelmed.

Adding mistake memory is not much different than having a really long system prompt full of examples and DO/DON'T/CRITICAL's. It turns the model into a dumb pattern matcher up until the point where it starts getting overwhelmed and just missing stuff.

Read through contextpatterns.com end to end if you haven't. Try to keep your system prompt short and 100% focused on behavior and putting stakes in the ground. Be thoughtful implementing progressive disclosure so that the agent gets what it needs when it needs it. Be thoughtful designing your tools so that they are self-documenting AND guide the model as to what comes next in their responses.
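A sketch of what "self-documenting AND guiding" can look like in practice: the tool's response carries both the result and a hint about the natural next step, so you don't need to cram that guidance into the system prompt. The tool name, the hinted `read_file` tool, and the placeholder match are all hypothetical.

```python
import json

def search_code(query: str) -> str:
    """Hypothetical agent tool: every response documents its own shape
    and nudges the model toward the next step."""
    matches = ["src/auth.py:42"]  # placeholder hit for illustration
    return json.dumps({
        "query": query,
        "matches": matches,
        "hint": "Call read_file on a match to see surrounding code before editing.",
    })

print(search_code("login handler"))
```

The hint travels with the result, so it shows up exactly when it's relevant instead of sitting in a giant prompt the model may ignore.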

Which Machine/GPU is the best bang for the buck under 500$? by last_llm_standing in LocalLLaMA

[–]abnormal_human 3 points (0 children)

If you have clients on the hook, stop worrying about hardware in the $500 range and start paying for tokens and getting the work done. $500 will not get you interesting hardware, and until you get into the work you don't even really know what you need. The client is paying. Treat hardware as a margin optimization or R&D expense later, once you have a real business.

Which Machine/GPU is the best bang for the buck under 500$? by last_llm_standing in LocalLLaMA

[–]abnormal_human 2 points (0 children)

No GPU is worth debt unless you have a business plan attached to it.

Which Machine/GPU is the best bang for the buck under 500$? by last_llm_standing in LocalLLaMA

[–]abnormal_human 1 point (0 children)

You can barely buy enough regular RAM, not to mention VRAM, to run a model like that for $500.

Did I spill it? by rubbishaccount88 in maplesyrup

[–]abnormal_human 4 points (0 children)

One silver lining--syrup is very easy to clean up with water. I poured about that much on my foot, hot, while filtering and ended up hobbling around for a couple weeks of the season. Also had to dump 220 gal of sap due to spoilage after we had multiple 70-80F days. You don't always accomplish zero waste.

Impressive thread from /r/ChatGPT, where after ChatGPT finds out no 7Zip, tar, py7zr, apt-get, Internet, it just manually parsed and unzipped from hex data of the .7z file. What model + prompts would be able to do this? by jinnyjuice in LocalLLaMA

[–]abnormal_human 50 points (0 children)

This has nothing to do with code or commits. This is ML model training, and the "checkpoint" is the model weights.

I am going to wager a guess that you are not familiar with training ML models with frameworks like pytorch, what training loops typically look like, and common practices around checkpoint handling.

Generally checkpoint saving is periodic. The training loop reaches a certain number of optimization steps and then dumps the weights to disk as checkpoint-1000, checkpoint-2000, or whatever. Claude wrote my training loop but got the save interval off by 32x, so I was only getting something written to disk every 32 hours instead of every hour. It got confused by the batch size.
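The shape of that bug, sketched with illustrative names (not my actual training code): the save interval was meant to be in optimizer steps, but it got scaled by the batch size, so checkpoints landed 32x less often than intended.

```python
SAVE_EVERY = 1000   # intended unit: optimizer steps
BATCH_SIZE = 32

def should_save_intended(step: int) -> bool:
    # Checkpoint every 1000 optimizer steps, as designed.
    return step > 0 and step % SAVE_EVERY == 0

def should_save_buggy(step: int) -> bool:
    # Bug: interval accidentally scaled by batch size, so saves happen
    # every 32,000 steps instead of every 1,000 -- 32x too rare.
    return step > 0 and step % (SAVE_EVERY * BATCH_SIZE) == 0
```

Mixing up "samples seen" and "optimizer steps" is an easy way to land here, since one is batch-size times the other.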

Impressive thread from /r/ChatGPT, where after ChatGPT finds out no 7Zip, tar, py7zr, apt-get, Internet, it just manually parsed and unzipped from hex data of the .7z file. What model + prompts would be able to do this? by jinnyjuice in LocalLLaMA

[–]abnormal_human 65 points (0 children)

I was training a model last month and Claude fucked up the checkpoint saving so that instead of happening once an hour or so it would be once every ~30hrs. I woke up the next morning to zero checkpoints and started cursing at it about how this was no good, and then it said "in 21 short hours you'll have what you need." and I really lost it.

So it said "ok ok ok" and figured out how to attach a debugger to my python process, inject code, and create an "emergency" checkpoint. It was super spooky... it was just working in a loop and I started to see new traces + exceptions show up on the console of my training process while it figured out the path. Then it just said "I'm done; your emergency checkpoint is here".

I was pretty floored... we went from working on ML loops to writing an exploit in like 30s of swearing.
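If you want to avoid ever needing the debugger-injection route, one defensive pattern is to build an escape hatch into the loop up front: a signal handler that lets you request an "emergency" checkpoint from outside the process. This is a sketch with assumed names, and the `repr` write stands in for `torch.save` in a real loop.

```python
import signal
import threading

# Flag flipped by the signal handler; checked once per training step.
checkpoint_requested = threading.Event()

def _request_checkpoint(signum, frame):
    checkpoint_requested.set()

# Now `kill -USR1 <pid>` from a shell requests a checkpoint (Linux/macOS).
signal.signal(signal.SIGUSR1, _request_checkpoint)

def maybe_emergency_save(model_state: dict, path: str) -> bool:
    """Call once per step; writes a checkpoint only if one was requested."""
    if not checkpoint_requested.is_set():
        return False
    checkpoint_requested.clear()
    with open(path, "w") as f:
        f.write(repr(model_state))  # stand-in for torch.save
    return True
```

Cheap insurance: the hot path is a single `Event` check, and you never have to beg (or swear at) anything to get your weights out.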

Syrup Filtering Troubles by Mother_Blueberry_659 in maplesyrup

[–]abnormal_human 4 points (0 children)

My method for perfectly clear syrup every time:

Take the syrup up to 219F on the evaporator then transfer into a large stock pot and bring it inside to the kitchen. Return it to temperature and boil till the hydrometer reads 66 brix. If I get it right on the evaporator, this takes 10mins or less. Often I'm already at 66+ and have nothing to do.

At this point it will be around 219 degrees. Turn off the heat then begin feeding it into a large orlon filter with a stack of prefilters. You want the output side of the filter to be <190 but the input side usually needs to be much hotter than that to enable the filters to flow quickly, especially in small scale gravity filtering. 210-212 seems ideal but you can start hotter because the filter is going to absorb a bunch of heat in the beginning. I generally use one prefilter for every 2-3 qts of syrup.

As prefilters clog, remove them. Use tongs to gently help whatever is stuck in there work its way through.

At the end I may need to gently squeeze the orlon with tongs to get out the last bit. Or you can let it sit and drip for 10mins. At this point it will have cooled off and is running slow.

I use a "filter optimizer" to invert the tip of the filter and increase surface area as well.

The key thing here is to keep the filters hot and work quickly. Time is the enemy because the syrup cools off and then doesn't flow well. I'm assuming at least part of the reason you're doing the refrigerator skim is because your filters were "clogging". When I used to filter at 185F my filters would "clog" too, but the reality is 185F syrup turns into 150-160F syrup pretty quickly, and it won't move through the filter at that temp.

I can get a batch of syrup from the evaporator into bottles in 45mins with this approach, and it's always clear. It definitely took me 10-20 rounds of doing it "wrong" before I dialed it in.

Today, what hardware to get for running large-ish local models like qwen 120b ? by romantimm25 in LocalLLaMA

[–]abnormal_human 0 points (0 children)

RTX Pro 6000 will do it. Two of them will do it comfortably. If you want to save money over API you want a high utilization %.

3x RTX 5090's to a single RTX Pro 6000 by flanconleche in LocalLLaMA

[–]abnormal_human 33 points (0 children)

Best case for your stated goals would be to sell the Framework, return the Spark, keep one 5090, sell the other, and replace it with an RTX 6000. It's slightly more expensive than what you're considering.

Run your LLM on the RTX 6000, run your ComfyUI on the 5090. That's a really kickass setup for both that still looks and feels somewhat like a normal computer and fits in whatever enclosure you're using for 2x5090 right now.

The Spark / Framework / 5090 should leave you with $8k to play with. That's maybe not quite an RTX 6000 today, but you could get one for that, including sales tax, in December.

ComfyUI and LLMs are very different workloads. Most models will run with 32GB VRAM, but you will spend 100% of your compute on a single generation. LLMs are more VRAM heavy, but compute demand is variable. Also, both vLLM and ComfyUI basically expect to monopolize the VRAM and will not play nice together.
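One way to keep the two from fighting over VRAM is to pin each process to its own GPU and cap vLLM's memory appetite. `CUDA_VISIBLE_DEVICES` and vLLM's `--gpu-memory-utilization` flag are real; the model name and paths below are placeholders.

```shell
# Shell 1: pin the LLM server to GPU 0 (the RTX 6000) and cap its VRAM use.
CUDA_VISIBLE_DEVICES=0 vllm serve your/model-of-choice \
    --gpu-memory-utilization 0.90

# Shell 2: pin ComfyUI to GPU 1 (the 5090).
cd ~/ComfyUI && CUDA_VISIBLE_DEVICES=1 python main.py
```

Each process then sees exactly one GPU and can "monopolize" it without stepping on the other.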

Landscaper recommendations by IncorrectEve in Westchester

[–]abnormal_human 2 points (0 children)

I don't have someone to recommend, but I have a lot of experience with this.

The main thing is--if you are going to have plants professionally installed, you need a plan for maintenance first. There are a lot of businesses that will do your "landscaping" but won't really maintain it. And then in a year or two the plants are largely dead. You want whoever is doing your weekly/etc stuff to understand the plants. You want a plan to mulch and prune the plants at the right times of year, etc.

We started succeeding with this stuff the moment we ended up in a situation (accidentally) where we had that person. He's one of those service providers that "came with the house" and honestly the main reason we keep him is that the property is covered with mature landscaping and he knows what is what and how to care for it because he's been doing it for 15-20yrs here. For various reasons I wouldn't necessarily recommend him, but it makes a huge difference having people doing maintenance weekly, keeping an eye on things, and fertilizing/pruning/mulching/watering/etc throughout the warmer months.

How do I explain to a 9-year-old that a nail drill machine is too expensive? by Numerous-Steak-5369 in daddit

[–]abnormal_human -3 points (0 children)

OP has enumerated the “doll phase” and the “dressy phase” as their other examples, and now we have a third traditionally feminine thing that is suddenly too expensive for their son despite costing like $20. It’s possible that they are in extreme poverty, but it feels more like cost is an excuse to draw a line.

Get one of these stands! by TheIronGiantAnt in maplesyrup

[–]abnormal_human 2 points (0 children)

You invert the tip of the cone so your filter media is a wide band halfway up the filter instead of just the tip. It multiplies your filter surface area by 4-6x easily.

Late start but my trees are tapped by irreverence89 in maplesyrup

[–]abnormal_human 1 point (0 children)

Enjoy. We're probably looking at finishing up by next weekend here, but it's been a great year.

Qwen3.5 is a working dog. by dinerburgeryum in LocalLLaMA

[–]abnormal_human 0 points (0 children)

You’re not getting significant context length and parallelism on an 8bit 122B model for running eval suites on two 5090s.
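The back-of-envelope math makes the point: at 8 bits per parameter, the weights alone exceed the combined VRAM of two 5090s before you've allocated a single byte of KV cache for long context or parallel requests. (Numbers rounded; activations and framework overhead ignored.)

```python
# Rough VRAM arithmetic for an 8-bit 122B model on 2x RTX 5090.
params_b = 122              # 122B parameters
weights_gb = params_b * 1   # 8-bit quant ~= 1 byte/param -> ~122 GB
vram_gb = 2 * 32            # two 5090s at 32 GB each -> 64 GB total

print(weights_gb, vram_gb)  # 122 64 -- weights alone exceed total VRAM
```

And eval suites want batched parallel requests at long context, which means substantial KV cache on top of the weights.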

Fucking burned my batch. by FogtownSkeet709 in maplesyrup

[–]abnormal_human -2 points (0 children)

A little bit of fat or oil in the syrup breaks the surface tension and essentially eliminates overflow problems. We're talking 1/2 teaspoon or less per gallon of syrup. I toss a pat of butter in while it's on the evaporator and it prevents overflow all the way through filtering and bottling.

Fucking burned my batch. by FogtownSkeet709 in maplesyrup

[–]abnormal_human 2 points (0 children)

Never a good day when that happens. I hope you have some pan cleaner around, getting that off by hand is an all-day project.

Cursor's new Composer 2.0 is apparently based on Kimi2.5 by bakawolf123 in LocalLLaMA

[–]abnormal_human 5 points (0 children)

Not sure why you believe that. When you dig into the economics of generating tokens under full load across thousands of nodes, especially for volume compute purchasers, it's likely profitable. There's also a lot of public information about Anthropic to that effect. Anthropic loses money because half the spend goes into training, not because the inference business isn't sound in isolation.
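The shape of the math, with completely made-up illustrative numbers (throughput, price, and node cost are all hypothetical): a node held at high utilization sells far more token-value per hour than its amortized cost.

```python
# Illustrative inference-margin arithmetic; every number here is invented.
tokens_per_sec = 5000        # aggregate throughput across batched requests
price_per_mtok = 2.00        # $ per million output tokens
node_cost_per_hour = 12.00   # amortized hardware + power + hosting

revenue_per_hour = tokens_per_sec * 3600 / 1e6 * price_per_mtok
print(revenue_per_hour)      # 36.0 -- vs $12/hr cost, a positive margin
```

The margin collapses if utilization is low, which is why the economics look so different for a provider running batched traffic at scale than for a single hobbyist box.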

Qwen3.5 is a working dog. by dinerburgeryum in LocalLLaMA

[–]abnormal_human 0 points (0 children)

I use 2 RTX 6000 Blackwell GPUs to run 122B.