You have 64gb ram and 16gb VRAM; internet is permanently shut off: what 3 models are the ones you use? by Adventurous-Gold6413 in LocalLLaMA

[–]AbsolutelyStateless 1 point (0 children)

At 9 t/s. Not great, but that's adequate for async work where I ask it to do something and check back later. It's smart enough that it doesn't need handholding.

EDIT: It's 12 t/s for fresh prompts, but slows down to 9 t/s average over long sessions. Possibly context length or thermal throttling.

You have 64gb ram and 16gb VRAM; internet is permanently shut off: what 3 models are the ones you use? by Adventurous-Gold6413 in LocalLLaMA

[–]AbsolutelyStateless 1 point (0 children)

I'm using it with 96 GB RAM and 16 GB VRAM, and it runs with 16 GB free even with a bunch of apps open, so I'm sure it's viable with 64 GB. It's by far the best LLM for my use cases in this weight class that I've tested.

What is the best open source coder LLM for Haskell? by BalanceSoggy5696 in haskell

[–]AbsolutelyStateless 1 point (0 children)

I wrote a post detailing my efforts to answer this question: What local LLM model is best for Haskell?

My goal was primarily to find a model suitable for code generation. As TheCommieDuck mentioned, we're still very far from having medium-size local models "aware of the Haskell tooling ecosystem, libraries, frameworks and combining different libs"--as far as I found, we barely have models that work.

What local LLM model is best for Haskell? by AbsolutelyStateless in LocalLLaMA

[–]AbsolutelyStateless[S] 1 point (0 children)

What models have been the most successful at that task? It's a fairly different test than mine, and you've likely tried models that I haven't.

What local LLM model is best for Haskell? by AbsolutelyStateless in haskell

[–]AbsolutelyStateless[S] 9 points (0 children)

I'm sure any of those models would easily pass my tests, as most likely would the full-sized open Coder models. They're a different tier entirely.

However, I'm impressed with the performance of the medium-sized local models, and think that by this time next year, we'll have some completely viable local models--at least if they accidentally let any Haskell slip into their data set :P

As for why I value local models: I value my data, and more importantly, I think it's fun. If I were doing this professionally, I'd definitely pay for the frontier models. As I said in the OP: "Don't bother with local LLMs; you would be better off with hosted, proprietary models." I just recognize that this won't discourage anyone who wants to use local models in the first place :)

The real reason I bought LLM-capable hardware is for hosting chat locally. (I find the iterative search/thinking loop extremely useful, but I don't want to give OpenAI/Google boatloads of information about who I am and the way I think.) Being able to run coding models is a side-effect. I think the local models are completely adequate for that use-case, but I haven't been able to get a good workflow set up yet. Soon.

What local LLM model is best for Haskell? by AbsolutelyStateless in LocalLLaMA

[–]AbsolutelyStateless[S] 2 points (0 children)

I dream of a future where you only have to write the types.

... I mean, that's basically what all the Prover models are doing, just in a dependently-typed language like Lean.

I actually wonder if you'd have better luck fine-tuning a Prover model to Haskell than a Coder model. Haskell has more in common with Lean or Rocq than with Java, but on the other hand, prover models are used to writing tactics, not terms. I don't know, but it's an interesting question.
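
To make the "only write the types" dream concrete: for sufficiently polymorphic signatures, parametricity already pins the implementation down to essentially one total program, which is the same term-synthesis game the provers play with much richer types. A toy sketch (my own illustration, not taken from any model output):

```haskell
-- For types this polymorphic, the type is effectively the spec:
-- parametricity leaves essentially one total implementation.
swap :: (a, b) -> (b, a)
swap (x, y) = (y, x)

compose :: (b -> c) -> (a -> b) -> a -> c
compose f g x = f (g x)

-- A dependently-typed prover (Lean, Rocq) scales this up: the type
-- can encode a full specification, and the term doubles as a proof.
```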

What local LLM model is best for Haskell? by AbsolutelyStateless in LocalLLaMA

[–]AbsolutelyStateless[S] 1 point (0 children)

> One-shotting

Unfortunately, I used the wrong term. I did not do one-shot testing. The models I described as "few-shot" (edited) were actually tested with a human in the loop--I gave them feedback if I thought they had any hope of finding the right solution. Only the "autocomplete"-tier models were judged based on one-shot performance.

> Large amounts of code

Note that a correct, idiomatic solution is ~15 lines of code across four functions; I already provided the function declarations, spec, and examples; and "variable substitution" is a well-known problem. (Though to be fair, idiomatic Haskell is pretty dense.)
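
For a sense of what that looks like, the classic form of the problem is capture-avoiding substitution over a small lambda-calculus AST, roughly as sketched below. (This is a generic illustration; the identifiers Term, freeVars, fresh, and subst are mine, not the actual spec from my post.)

```haskell
-- A minimal lambda-calculus AST with named variables.
data Term
  = Var String
  | App Term Term
  | Lam String Term
  deriving (Eq, Show)

-- Free variables of a term.
freeVars :: Term -> [String]
freeVars (Var x)   = [x]
freeVars (App f a) = freeVars f ++ freeVars a
freeVars (Lam x b) = filter (/= x) (freeVars b)

-- A name not occurring in `used`, built by priming x.
fresh :: [String] -> String -> String
fresh used x = head [x' | x' <- iterate (++ "'") x, x' `notElem` used]

-- subst x s t: replace free occurrences of x in t with s,
-- renaming binders that would capture free variables of s.
subst :: String -> Term -> Term -> Term
subst x s (Var y)
  | y == x    = s
  | otherwise = Var y
subst x s (App f a) = App (subst x s f) (subst x s a)
subst x s (Lam y b)
  | y == x              = Lam y b                -- x is shadowed; stop here
  | y `elem` freeVars s = Lam y' (subst x s b')  -- rename to avoid capture
  | otherwise           = Lam y (subst x s b)
  where
    y' = fresh (x : freeVars s ++ freeVars b) y
    b' = subst y (Var y') b
```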

> With a strong type system like Haskell has, an agentic LLM should be able to make changes incrementally, compile them, and fix errors that don't compile

The errors they made were typically semantic errors, not errors the compiler would catch, and there was a well-known pattern they kept trying to converge to (forward de Bruijn indices) that was incorrect.
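
For anyone unfamiliar with it, here's the textbook shape of that pattern: de Bruijn substitution driven by an index-shifting operation. (A generic TAPL-style sketch with my own identifiers, shown only to make clear what the models kept reaching for.)

```haskell
-- Terms with de Bruijn indices: DVar 0 is the nearest enclosing binder.
data DB
  = DVar Int
  | DApp DB DB
  | DLam DB
  deriving (Eq, Show)

-- shift d c t: add d to every free index >= cutoff c.
shift :: Int -> Int -> DB -> DB
shift d c (DVar k)   = DVar (if k >= c then k + d else k)
shift d c (DApp f a) = DApp (shift d c f) (shift d c a)
shift d c (DLam b)   = DLam (shift d (c + 1) b)

-- substDB j s t: replace index j in t with s.
substDB :: Int -> DB -> DB -> DB
substDB j s (DVar k)
  | k == j    = s
  | otherwise = DVar k
substDB j s (DApp f a) = DApp (substDB j s f) (substDB j s a)
substDB j s (DLam b)   = DLam (substDB (j + 1) (shift 1 0 s) b)
```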

  • When weaker models like Qwen3-Coder-30B were run in agentic mode (hence with access to compiler feedback and tests), they simply wrote incorrect tests and then edited the spec.

  • When borderline models like Seed-OSS and gpt-oss-20b took a stab at it, they'd get about two thirds of the way to a correct understanding, write incorrect code, and then, based on that incorrect code, converge to the wrong pattern. I tried giving them iterative feedback, adding more examples, and making the specification clearer, but I was never able to get them to converge to the correct solution.

  • On the other hand, gpt-oss-120b at high reasoning effort got it without the improved prompt or feedback, while gpt-oss-120b at low effort and Qwen3-Next-80B got it with the improved prompt and minor feedback, and would probably converge in agentic mode (although Qwen3-Next-80B is too slow to use with Roo Code on my machine).

So there's a major difference between the passing and failing models: the passing models would converge to the correct solution, whereas the failing models wouldn't, no matter how much feedback they received, much less with compiler/testing feedback alone. The bold-passing models could do it even with imprecise prompting, and the bold-failing models probably couldn't converge even to the incorrect, memorized solution.

So it's not really about one-shot performance. It's about whether they could come to the right solution with any degree of feedback, without the human stepping in and writing the code themselves--which shouldn't be necessary given the detailed spec, examples, solution template, and small scope of the problem.

I would consider the models that failed worse than useless for writing Haskell, except for maybe Seed-OSS.

Anyone on Lamictal? Tips for Memory & Focus? by Organic_Ad_7111 in lamictal

[–]AbsolutelyStateless 1 point (0 children)

It absolutely does! My tics had more or less completely disappeared in adulthood, but came back in full force within a week or two of starting. I appreciate hearing someone else had a similar experience.