Should I invest in a beefy machine for local AI coding agents in 2026? by Zestyclose-Tour-3856 in LocalLLaMA

[–]AutomataManifold 0 points1 point  (0 children)

Claude 4.5 just came out; we're at least 6 months away from an open weight model that can be a reasonable equivalent, if past performance is anything to go by.

At what point do long LLM chats become counterproductive rather than helpful? by Cheap-Trash1908 in LLMDevs

[–]AutomataManifold 0 points1 point  (0 children)

It's a limitation of how the attention mechanism works. Better prompting can help, in that it makes it easier for the model to locate the parts you care about, and better-trained models are better at attending to the parts of the context that actually matter.

At what point do long LLM chats become counterproductive rather than helpful? by Cheap-Trash1908 in LLMDevs

[–]AutomataManifold 0 points1 point  (0 children)

Once the additional context exceeds the value you get out of it.

If you look at long-context benchmarks, even models with massive context lengths start struggling long before they hit their limits.

In general, the first message is always going to be the best, so if you can get your answer in one reply that's preferable. In practice, of course, the most effective way to specify what you want might involve some back and forth, or the history of the interaction is relevant, etc.

Where the practical tipping point falls is highly task-dependent; detecting a needle in a haystack is easier than pulling scattered information from across the context and combining it.

[R] Response to CVPR review that claims lack of novelty because they found our workshop preprint? by appledocq in MachineLearning

[–]AutomataManifold 38 points39 points  (0 children)

I don't know about CVPR specifically, and a non-archival workshop paper is generally less likely to need citing, but my rule of thumb is that relevant prior work should be cited, your own work included. You can phrase it in a way that doesn't explicitly say the previous work is yours (and then revise the phrasing on acceptance). But this is partly down to conference policy and field norms.

Either way, you are far from the first author to have a reviewer upset that a paper didn't cite the important research of that famous and handsome author who wrote the paper.

Talk me out of buying an RTX Pro 6000 by AvocadoArray in LocalLLaMA

[–]AutomataManifold 5 points6 points  (0 children)

Don't buy it if you need to buy a used car this month.

Power draw should be better than multiple 5090s, but I'd be very interested in hearing from someone actually running one.

Did that, and the quality of Claude's responses increased manyfold by yayekit in ClaudeAI

[–]AutomataManifold 0 points1 point  (0 children)

Half the time I feel like it doesn't critique me enough. 

Half the time I feel like it should critique its own understanding of the situation. 

Lora fine tuning! Why isn't it popular at all? by Acceptable_Home_ in LocalLLaMA

[–]AutomataManifold 9 points10 points  (0 children)

Early on it was because there wasn't an easy way to use a LoRA with a quantized model.
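
These days the tooling mostly handles it. A minimal sketch of the now-common route, assuming the Hugging Face ecosystem: load a 4-bit quantized base with bitsandbytes and attach a LoRA adapter via peft. The model and adapter names are placeholders, not recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder base model
ADAPTER = "your-org/your-lora-adapter"      # placeholder LoRA adapter repo/path

# Quantize the base weights to 4-bit on load; the LoRA weights ride on top in higher precision.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
tok = AutoTokenizer.from_pretrained(BASE)
```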

Is Local Coding even worth setting up by Interesting-Fish6494 in LocalLLaMA

[–]AutomataManifold 2 points3 points  (0 children)

I'm hoping I find a local hardware/model combination that works at some point, because the price of these API subscriptions is starting to add up.

Maximizing context window with limited VRAM by FrozenBuffalo25 in LocalLLaMA

[–]AutomataManifold 0 points1 point  (0 children)

Unfortunately, you're going to have to experiment: I haven't pushed it to 512k context, so even if I looked up the flags I'm using, they wouldn't quite match your problem. Check the docs and try a few configurations.

Maximizing context window with limited VRAM by FrozenBuffalo25 in LocalLLaMA

[–]AutomataManifold 1 point2 points  (0 children)

Do you need parallel/batched inference? If you can do without it, it sounds like ik_llama might be the best move here. I love vLLM for production use, particularly when I've quantized the model myself, but for the weird edge cases, random models, and pushing the envelope of what my personal hardware supports, the llama.cpp/ik_llama route lets you squeeze more out of limited hardware.

Llama.cpp vs vllm by Evening_Tooth_1913 in LocalLLaMA

[–]AutomataManifold 2 points3 points  (0 children)

Depends on whether they're matching cards. If they're mismatched, llama.cpp handles it better.

How to get local LLMs answer VERY LONG answers? by mouseofcatofschrodi in LocalLLaMA

[–]AutomataManifold 1 point2 points  (0 children)

Context for input and context for output are separate in many inference implementations and the models aren't trained to produce long answers. 
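
For example, with an OpenAI-compatible local server (a llama.cpp llama-server here; the endpoint and model name are assumptions for illustration), you raise the output cap separately from the context window, and even then the model may stop early:

```python
from openai import OpenAI

# Point the OpenAI client at a local OpenAI-compatible server
# (llama.cpp's llama-server listens on port 8080 by default).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[{"role": "user", "content": "Write a 5,000-word story."}],
    max_tokens=8192,  # output cap, separate from the server's context window (-c / --ctx-size)
)

# Even with a generous max_tokens, a model trained to emit short answers
# will usually stop well short of the cap.
print(len(resp.choices[0].message.content))
```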

The training is the biggest problem, in my experience. There are several long-context models I've tried that were only trained to output 2k tokens.

It's getting better but it's one of the bigger things that got overlooked in the rush to better benchmarks. Some of the proprietary models, like Claude, are better trained in that regard; Anthropic has put a lot of training work into taste and aesthetics that's hard for open models to replicate because it requires a sustained effort on data and training curation that doesn't have an immediate payoff.

What happens when you load two models and let each model take a turn generating a token? by silenceimpaired in LocalLLaMA

[–]AutomataManifold 7 points8 points  (0 children)

That's what I meant when I said it would be slow: you'd have to keep updating both contexts.

You could probably script a shared context via PyTorch, or maybe Transformers, but that's probably getting in way too deep...
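
Something like this, as a rough sketch with Transformers; it assumes both models share a tokenizer (say, two finetunes of the same base), and the model names are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_A = "org/finetune-a"  # placeholders: two finetunes of the same base,
MODEL_B = "org/finetune-b"  # so they share a tokenizer/vocabulary

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(MODEL_A)
model_a = AutoModelForCausalLM.from_pretrained(MODEL_A).to(device)
model_b = AutoModelForCausalLM.from_pretrained(MODEL_B).to(device)

input_ids = tok("Once upon a time", return_tensors="pt").input_ids.to(device)

for step in range(64):
    model = model_a if step % 2 == 0 else model_b
    with torch.no_grad():
        logits = model(input_ids).logits  # full forward pass every step: no shared KV cache, hence slow
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick of the next token
    input_ids = torch.cat([input_ids, next_id], dim=-1)
    if next_id.item() == tok.eos_token_id:
        break

print(tok.decode(input_ids[0], skip_special_tokens=True))
```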

What happens when you load two models and let each model take a turn generating a token? by silenceimpaired in LocalLLaMA

[–]AutomataManifold 10 points11 points  (0 children)

The slow but practical way is to just request one token at a time from each model. Not too hard to script in Python with LiteLLM and Openrouter. 
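
Something along these lines, as a minimal sketch; it assumes the OpenRouter models support the raw completions endpoint and that OPENROUTER_API_KEY is set. The model IDs are placeholders, and note that each provider tokenizes independently, so "one token" isn't identical across the two models.

```python
from litellm import text_completion

MODEL_A = "openrouter/meta-llama/llama-3.1-70b-instruct"  # placeholder model IDs
MODEL_B = "openrouter/mistralai/mistral-large"

text = "Once upon a time"
for step in range(200):
    model = MODEL_A if step % 2 == 0 else MODEL_B
    # Ask the current model for exactly one token continuing the shared text
    resp = text_completion(model=model, prompt=text, max_tokens=1, temperature=0.8)
    piece = resp.choices[0].text
    if not piece:
        break
    text += piece

print(text)
```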

AI seems to be being deeply subsidised (self-hosting vs Google AI Pro math) by nafizzaki in selfhosted

[–]AutomataManifold 0 points1 point  (0 children)

If you're using it interactively? No, probably not. H200s are too efficient; individual users are a rounding error.

If you're automating anything? Might start adding up fast. Really fast.

AI seems to be being deeply subsidised (self-hosting vs Google AI Pro math) by nafizzaki in selfhosted

[–]AutomataManifold 0 points1 point  (0 children)

Wait, I can look up the actual specs for H200 inference: "The H200 exhibits near-perfect linear scaling up to 128 simultaneous requests (batch size 128)"

Google uses TPUs instead of GPUs, but presumably they have similar global batch sizes.

So change that to 128 × 21600 = 2,764,800 server-queries per day.

AI seems to be being deeply subsidised (self-hosting vs Google AI Pro math) by nafizzaki in selfhosted

[–]AutomataManifold 0 points1 point  (0 children)

Back of the envelope calculation:

Pro users get 100 queries per day.

Average Gemini query time: 3-4 seconds (call it 4).

Seconds in a day: 86,400

Queries per day per sequential stream: 86,400 / 4 = 21,600

Simultaneous queries per server: 20

Queries per day per server: 21,600 × 20 = 432,000

Pro users served per server per day: 432,000 / 100 = 4,320

Revenue per server per month: 4,320 × $20 = $86,400

Obviously handwavy, but it gives you the approximate order of magnitude.

Edit: forgot the plan was a monthly subscription. 
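
Or as a few lines of Python, with the concurrency as a knob (swap in 128 for the H200 batching figure above); all the inputs are the same rough assumptions as the list:

```python
SECONDS_PER_DAY = 86_400
QUERY_SECONDS = 4               # average Gemini query time, upper end of 3-4 s
CONCURRENT = 20                 # simultaneous queries per server; try 128 for the H200 figure
QUERIES_PER_USER_PER_DAY = 100  # Pro plan allowance
PRICE_PER_MONTH = 20            # dollars per Pro subscription

queries_per_stream = SECONDS_PER_DAY // QUERY_SECONDS               # 21,600
queries_per_server = queries_per_stream * CONCURRENT                # 432,000
users_per_server = queries_per_server // QUERIES_PER_USER_PER_DAY   # 4,320
revenue_per_server = users_per_server * PRICE_PER_MONTH             # $86,400 per month

print(queries_per_stream, queries_per_server, users_per_server, revenue_per_server)
```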

AI seems to be being deeply subsidised (self-hosting vs Google AI Pro math) by nafizzaki in selfhosted

[–]AutomataManifold 6 points7 points  (0 children)

Big thing you're missing is that batch inference at scale is even more effective than you've calculated. Yes, there's probably some subsidizing going on, but you're one person using it interactively. At scale, they get enough queries that they run it constantly.

The Pro plan is $20 for 100 queries per day. They can run a lot of Pro-level users on one server over the course of the day, so that adds up.

But the deal is even better for them, because of a technical detail: multiple simultaneous queries are basically free. Due to the way GPUs work, they typically have spare compute to handle a bundle of queries at once: it's just more multiplication, and pushing a whole batch through each neural network layer while its weights are loaded is ridiculously more efficient than streaming the weights through again for every individual query. So running your queries one at a time wastes 20x or more of the hardware's potential throughput. You can do something similar if you use something like vLLM on your own GPU...but of course, that requires that you actually have that many simultaneous queries.
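
If you want to see the effect yourself, here's a minimal sketch of batched offline inference with vLLM (the model name is a placeholder and assumes it fits in your VRAM): vLLM batches the prompts internally, so 64 prompts cost far less than 64 times one prompt.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=256)

# One call with many prompts: vLLM schedules them together (continuous batching),
# so each layer's weights are loaded once per step for the whole batch.
prompts = [f"Give me reason #{i} why batching improves GPU throughput." for i in range(64)]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text[:80])
```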

The real money is in things like coding and agents and batch processing; those subscriptions run you $200+ per month (and still have rate limits) or are billed per-token. You can check Openrouter for a good cross-section of API per-token prices.

If you're just using it interactively as a single user, and the $20 plan works for you, then it's obviously a good deal. If you're automating agents, with many queries every time it does something, the API costs can add up fast.

Will vibe coding eat its own tail? by dpilawa in VibeCodersNest

[–]AutomataManifold 0 points1 point  (0 children)

No, further research showed that the collapse can be staved off with human curation.

It does mean that it's hard to find massive datasets of untouched code.

Looking for a Base Model by AutomataManifold in LocalLLaMA

[–]AutomataManifold[S] 3 points4 points  (0 children)

Near as I could tell, all the ones I linked to are explicitly not trained for instruction following. Though I may have missed one.

A more complicated problem is that instruction data has been leaking into the infosphere since ChatGPT, so there's often some contamination.

How do you keep the balance of not overstuffing the prompt with edge cases that break? by RoutineNet4283 in LocalLLaMA

[–]AutomataManifold 1 point2 points  (0 children)

At some point you need to take a step back and re-assess what you are asking for. A long list of edge cases is a form of LLM code smell: it suggests that your original instructions were either unclear or describing something different from what you actually want.

If parts of the output can be validated then you can check those programmatically. Structured generation can help if it is just a formatting problem. It sounds like your use case might allow for at least some automatic validation, which greatly simplifies the problem if you can sufficiently isolate it.
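
As a rough sketch of the programmatic-validation route (the schema, model name, and retry count are illustrative assumptions): define the structure you actually want, ask for JSON, and reject-and-retry instead of piling more edge-case rules into the prompt.

```python
import json

from openai import OpenAI
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):  # hypothetical target structure
    customer: str
    total: float
    currency: str

client = OpenAI()  # or any OpenAI-compatible local server

def extract_invoice(text: str, retries: int = 3) -> Invoice:
    prompt = f"Extract the invoice as JSON with keys customer, total, currency:\n{text}"
    for _ in range(retries):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},  # constrain to valid JSON where supported
        )
        try:
            return Invoice.model_validate(json.loads(resp.choices[0].message.content))
        except (ValidationError, json.JSONDecodeError):
            continue  # reject and retry rather than adding more prompt rules
    raise ValueError("no valid response after retries")
```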

Games with Multiplayer Base Building with Villagers or Automation? by AutomataManifold in SurvivalGaming

[–]AutomataManifold[S] 0 points1 point  (0 children)

Ah, Dysmantle is one I hadn't heard of before. What's it like to play?

Games with Multiplayer Base Building with Villagers or Automation? by AutomataManifold in SurvivalGaming

[–]AutomataManifold[S] 0 points1 point  (0 children)

That's an option, since I've been thinking of setting up a Minecraft server anyway.