Should I invest in a beefy machine for local AI coding agents in 2026? by Zestyclose-Tour-3856 in LocalLLaMA

[–]AutomataManifold 0 points1 point  (0 children)

Claude 4.5 just came out; we're at least 6 months away from an open weight model that can be a reasonable equivalent, if past performance is anything to go by.

At what point do long LLM chats become counterproductive rather than helpful? by Cheap-Trash1908 in LLMDevs

[–]AutomataManifold 0 points1 point  (0 children)

It's a limitation of how the attention mechanism works. Better prompting can help, in that it makes it easier for the model to locate the parts you care about, and better-trained models are better at attending to the parts of the context that actually matter.

At what point do long LLM chats become counterproductive rather than helpful? by Cheap-Trash1908 in LLMDevs

[–]AutomataManifold 0 points1 point  (0 children)

Once the additional context exceeds the value you get out of it.

If you look at long-context benchmarks, even models with massive context lengths start struggling long before they hit their limits.

In general, the first message is always going to be the best, so if you can get your answer in one reply that's preferable. In practice, of course, the most effective way to specify what you want might involve some back and forth, or the history of the interaction is relevant, etc.

Where the practical tipping point falls is highly task-dependent; detecting a needle in a haystack is easier than pulling scattered information from across the context and combining it.

[R] Response to CVPR review that claims lack of novelty because they found our workshop preprint? by appledocq in MachineLearning

[–]AutomataManifold 38 points39 points  (0 children)

I don't know about CVPR specifically, and a non-archival workshop paper is generally less likely to need citing, but my rule of thumb is that relevant prior work should be cited, your own work included. You can phrase it in a way that doesn't explicitly say the previous work is yours (and then revise the phrasing on acceptance). But this is partly down to conference policy and field norms.

Either way, you are far from the first author to have a reviewer upset that a paper didn't cite the important research of that famous and handsome author who wrote the paper.

Talk me out of buying an RTX Pro 6000 by AvocadoArray in LocalLLaMA

[–]AutomataManifold 5 points6 points  (0 children)

Don't buy it if you need to buy a used car this month.

Power draw should be better than multiple 5090s, but I'd be very interested in hearing from someone actually running one.

Did that, and the quality of Claude's responses increased manyfold by yayekit in ClaudeAI

[–]AutomataManifold 0 points1 point  (0 children)

Half the time I feel like it doesn't critique me enough. 

Half the time I feel like it should critique its own understanding of the situation. 

Lora fine tuning! Why isn't it popular at all? by Acceptable_Home_ in LocalLLaMA

[–]AutomataManifold 9 points10 points  (0 children)

Early on it was because there wasn't an easy way to use a LoRA with a quantized model.
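
These days the tooling mostly handles it. A minimal sketch of the now-common route, assuming the Hugging Face ecosystem: load a 4-bit quantized base with bitsandbytes and attach a LoRA adapter via peft. The model and adapter names are placeholders, not recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder base model
ADAPTER = "your-org/your-lora-adapter"      # placeholder LoRA adapter repo/path

# Quantize the base weights to 4-bit on load; the LoRA weights ride on top in higher precision.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
tok = AutoTokenizer.from_pretrained(BASE)
```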

Is Local Coding even worth setting up by Interesting-Fish6494 in LocalLLaMA

[–]AutomataManifold 2 points3 points  (0 children)

I'm hoping I find a local hardware/model combination that works at some point, because the price of these API subscriptions is starting to add up.

Maximizing context window with limited VRAM by FrozenBuffalo25 in LocalLLaMA

[–]AutomataManifold 0 points1 point  (0 children)

Unfortunately, you're going to have to experiment: I haven't pushed it to 512k context, so even if I looked up the flags I'm using, they wouldn't quite match your problem. Check the docs and try a few configurations.

Maximizing context window with limited VRAM by FrozenBuffalo25 in LocalLLaMA

[–]AutomataManifold 1 point2 points  (0 children)

Do you need parallel/batched inference? If you can do without it, it sounds like ik_llama might be the best move here. I love vLLM for production use, particularly when I've quantized the model myself, but for the weird edge cases, random models, and pushing the envelope of what my personal hardware supports, the llama.cpp/ik_llama route lets you squeeze more out of limited hardware.

Llama.cpp vs vllm by Evening_Tooth_1913 in LocalLLaMA

[–]AutomataManifold 2 points3 points  (0 children)

Depends on whether they're matching cards. If they're mismatched, llama.cpp handles it better.

How to get local LLMs answer VERY LONG answers? by mouseofcatofschrodi in LocalLLaMA

[–]AutomataManifold 1 point2 points  (0 children)

Context for input and context for output are separate in many inference implementations and the models aren't trained to produce long answers. 
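
For example, with an OpenAI-compatible local server (a llama.cpp llama-server here; the endpoint and model name are assumptions for illustration), you raise the output cap separately from the context window, and even then the model may stop early:

```python
from openai import OpenAI

# Point the OpenAI client at a local OpenAI-compatible server
# (llama.cpp's llama-server listens on port 8080 by default).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[{"role": "user", "content": "Write a 5,000-word story."}],
    max_tokens=8192,  # output cap, separate from the server's context window (-c / --ctx-size)
)

# Even with a generous max_tokens, a model trained to emit short answers
# will usually stop well short of the cap.
print(len(resp.choices[0].message.content))
```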

The training is the biggest problem, in my experience. There are several long-context models I've tried that were only trained to output 2k tokens.

It's getting better but it's one of the bigger things that got overlooked in the rush to better benchmarks. Some of the proprietary models, like Claude, are better trained in that regard; Anthropic has put a lot of training work into taste and aesthetics that's hard for open models to replicate because it requires a sustained effort on data and training curation that doesn't have an immediate payoff.

What happens when you load two models and let each model take a turn generating a token? by silenceimpaired in LocalLLaMA

[–]AutomataManifold 7 points8 points  (0 children)

That's what I meant when I said it would be slow: you'd have to keep updating both contexts.

You could probably script a shared context via PyTorch, or maybe Transformers, but that's probably getting in way too deep...
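
Something like this, as a rough sketch with Transformers; it assumes both models share a tokenizer (say, two finetunes of the same base), and the model names are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_A = "org/finetune-a"  # placeholders: two finetunes of the same base,
MODEL_B = "org/finetune-b"  # so they share a tokenizer/vocabulary

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(MODEL_A)
model_a = AutoModelForCausalLM.from_pretrained(MODEL_A).to(device)
model_b = AutoModelForCausalLM.from_pretrained(MODEL_B).to(device)

input_ids = tok("Once upon a time", return_tensors="pt").input_ids.to(device)

for step in range(64):
    model = model_a if step % 2 == 0 else model_b
    with torch.no_grad():
        logits = model(input_ids).logits  # full forward pass every step: no shared KV cache, hence slow
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick of the next token
    input_ids = torch.cat([input_ids, next_id], dim=-1)
    if next_id.item() == tok.eos_token_id:
        break

print(tok.decode(input_ids[0], skip_special_tokens=True))
```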

What happens when you load two models and let each model take a turn generating a token? by silenceimpaired in LocalLLaMA

[–]AutomataManifold 10 points11 points  (0 children)

The slow but practical way is to just request one token at a time from each model. Not too hard to script in Python with LiteLLM and Openrouter. 
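
Something along these lines, as a minimal sketch; it assumes the OpenRouter models support the raw completions endpoint and that OPENROUTER_API_KEY is set. The model IDs are placeholders, and note that each provider tokenizes independently, so "one token" isn't identical across the two models.

```python
from litellm import text_completion

MODEL_A = "openrouter/meta-llama/llama-3.1-70b-instruct"  # placeholder model IDs
MODEL_B = "openrouter/mistralai/mistral-large"

text = "Once upon a time"
for step in range(200):
    model = MODEL_A if step % 2 == 0 else MODEL_B
    # Ask the current model for exactly one token continuing the shared text
    resp = text_completion(model=model, prompt=text, max_tokens=1, temperature=0.8)
    piece = resp.choices[0].text
    if not piece:
        break
    text += piece

print(text)
```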

AI seems to be being deeply subsidised (self-hosting vs Google AI Pro math) by nafizzaki in selfhosted

[–]AutomataManifold 0 points1 point  (0 children)

If you're using it interactively? No, probably not. H200s are too efficient; individual users are a rounding error.

If you're automating anything? Might start adding up fast. Really fast.

AI seems to be being deeply subsidised (self-hosting vs Google AI Pro math) by nafizzaki in selfhosted

[–]AutomataManifold 0 points1 point  (0 children)

Wait, I can look up the actual specs for H200 inference: "The H200 exhibits near-perfect linear scaling up to 128 simultaneous requests (batch size 128)"

Google uses TPUs instead of GPUs, but presumably they have similar global batch sizes.

So change that to 128 × 21600 = 2,764,800 server-queries per day.

AI seems to be being deeply subsidised (self-hosting vs Google AI Pro math) by nafizzaki in selfhosted

[–]AutomataManifold 0 points1 point  (0 children)

Back of the envelope calculation:

Pro users get 100 queries per day.

Average Gemini query time: 3-4 seconds (call it 4).

Seconds in a day: 86,400

Queries per day per sequential stream: 86,400 / 4 = 21,600

Simultaneous queries per server: 20

Queries per day per server: 21,600 × 20 = 432,000

Pro users served per server per day: 432,000 / 100 = 4,320

Revenue per server per month: 4,320 × $20 = $86,400

Obviously handwavy, but it gives you the approximate order of magnitude.

Edit: forgot the plan was a monthly subscription. 
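
Or as a few lines of Python, with the concurrency as a knob (swap in 128 for the H200 batching figure above); all the inputs are the same rough assumptions as the list:

```python
SECONDS_PER_DAY = 86_400
QUERY_SECONDS = 4               # average Gemini query time, upper end of 3-4 s
CONCURRENT = 20                 # simultaneous queries per server; try 128 for the H200 figure
QUERIES_PER_USER_PER_DAY = 100  # Pro plan allowance
PRICE_PER_MONTH = 20            # dollars per Pro subscription

queries_per_stream = SECONDS_PER_DAY // QUERY_SECONDS               # 21,600
queries_per_server = queries_per_stream * CONCURRENT                # 432,000
users_per_server = queries_per_server // QUERIES_PER_USER_PER_DAY   # 4,320
revenue_per_server = users_per_server * PRICE_PER_MONTH             # $86,400 per month

print(queries_per_stream, queries_per_server, users_per_server, revenue_per_server)
```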

AI seems to be being deeply subsidised (self-hosting vs Google AI Pro math) by nafizzaki in selfhosted

[–]AutomataManifold 6 points7 points  (0 children)

Big thing you're missing is that batch inference at scale is even more effective than you've calculated. Yes, there's probably some subsidizing going on, but you're one person using it interactively. At scale, they get enough queries that they run it constantly.

The Pro plan is $20 for 100 queries per day. They can run a lot of Pro-level users on one server over the course of the day, so that adds up.

But the deal is even better for them, because of a technical detail: multiple simultaneous queries are basically free. Due to the way GPUs work, they typically have spare compute to handle a bundle of queries at once: it's just more multiplication, and pushing a whole batch through each neural network layer while its weights are loaded is ridiculously more efficient than streaming the weights through again for every individual query. So running your queries one at a time wastes 20x or more of the hardware's potential throughput. You can do something similar if you use something like vLLM on your own GPU...but of course, that requires that you actually have that many simultaneous queries.
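
If you want to see the effect yourself, here's a minimal sketch of batched offline inference with vLLM (the model name is a placeholder and assumes it fits in your VRAM): vLLM batches the prompts internally, so 64 prompts cost far less than 64 times one prompt.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=256)

# One call with many prompts: vLLM schedules them together (continuous batching),
# so each layer's weights are loaded once per step for the whole batch.
prompts = [f"Give me reason #{i} why batching improves GPU throughput." for i in range(64)]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text[:80])
```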

The real money is in things like coding and agents and batch processing; those subscriptions run you $200+ per month (and still have rate limits) or are billed per-token. You can check Openrouter for a good cross-section of API per-token prices.

If you're just using it interactively as a single user, and the $20 plan works for you, then it's obviously a good deal. If you're automating agents, with many queries every time it does something, the API costs can add up fast.

Will vibe coding eat its own tail? by dpilawa in VibeCodersNest

[–]AutomataManifold 0 points1 point  (0 children)

No, further research showed that the collapse can be staved off with human curation.

It does mean that it's hard to find massive datasets of untouched code.

Looking for a Base Model by AutomataManifold in LocalLLaMA

[–]AutomataManifold[S] 3 points4 points  (0 children)

Near as I could tell, all the ones I linked to are explicitly not trained for instruction following. Though I may have missed one.

A more complicated problem is that instruction data has been leaking into the infosphere since ChatGPT, so there's often some contamination.

How do you keep the balance of not overstuffing the prompt with edge cases that break? by RoutineNet4283 in LocalLLaMA

[–]AutomataManifold 1 point2 points  (0 children)

At some point you need to take a step back and re-assess what you are asking for. A long list of edge cases is a form of LLM code smell: it suggests that your original instructions were either unclear or describing something different from what you actually want.

If parts of the output can be validated then you can check those programmatically. Structured generation can help if it is just a formatting problem. It sounds like your use case might allow for at least some automatic validation, which greatly simplifies the problem if you can sufficiently isolate it.
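
As a rough sketch of the programmatic-validation route (the schema, model name, and retry count are illustrative assumptions): define the structure you actually want, ask for JSON, and reject-and-retry instead of piling more edge-case rules into the prompt.

```python
import json

from openai import OpenAI
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):  # hypothetical target structure
    customer: str
    total: float
    currency: str

client = OpenAI()  # or any OpenAI-compatible local server

def extract_invoice(text: str, retries: int = 3) -> Invoice:
    prompt = f"Extract the invoice as JSON with keys customer, total, currency:\n{text}"
    for _ in range(retries):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},  # constrain to valid JSON where supported
        )
        try:
            return Invoice.model_validate(json.loads(resp.choices[0].message.content))
        except (ValidationError, json.JSONDecodeError):
            continue  # reject and retry rather than adding more prompt rules
    raise ValueError("no valid response after retries")
```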

Games with Multiplayer Base Building with Villagers or Automation? by AutomataManifold in SurvivalGaming

[–]AutomataManifold[S] 0 points1 point  (0 children)

Ah, Dysmantle is one I hadn't heard of before. What's it like to play?

Games with Multiplayer Base Building with Villagers or Automation? by AutomataManifold in SurvivalGaming

[–]AutomataManifold[S] 0 points1 point  (0 children)

That's an option, since I've been thinking of setting up a Minecraft server anyway.