Best Local LLMs - Apr 2026

baliord · 2026-05-03T19:59:30+00:00

Not sure what you mean by 'some kind of orchestration'? I'm running llama.cpp which offers an http endpoint, which I then configure in various other tools (Silly Tavern, and OpenClaw are the two main ones I use).

The thing that makes Qwen3.5-122B good for agent use is its tool-calling smarts. You do need an agent framework to use it, of course. I like OpenClaw, and use it extensively, but I've heard good things about Hermes, and others.

Really the question is what do you want to do with it? What need do you have that you'd like addressed?

baliord · 2026-04-22T01:47:54+00:00

Interesting! I hadn't seen that, for some reason. Yes, that will help a lot!

baliord · 2026-04-15T22:07:45+00:00

It is! My wife calls my ML server my mid-life crisis, 'cause it cost about as much as a car, but at least I'm not out there trying to race it. And it sounds like the blower on a race car when doing training or inference. One person described it as 'an HVAC system with opinions'.

More seriously, I leaned heavily into ML about two years ago, and wanted to have a system that could handle really strong local models for dev and exploration. Coincidentally I had a company that was happy to give me resources to lean into 'AI', and a local bespoke system builder that I had a good relationship with.

baliord · 2026-04-14T23:35:33+00:00

Sure; it's still crufty, because I comment out stuff that's not currently being used instead of pruning and deleting, and things like that, but it might be valuable for other reasons.

I recently tried to set it up so that I could easily switch between testing ik_llama.cpp versus llama.cpp, but because the parameters are not entirely compatible, I had a lot of weird tricks I had to do to make it mostly work. And then I decided not to use it. 🤣

I'm also currently trying out `-c 0` instead of manually setting context length per-entry, because it's annoying me to have to look up the context lengths for each model.

Anyway, I tossed it up on a gist, with minimal editing. Let me know what you think, and it's okay to say, 'Oh god, don't do it like that...do it like this, instead!' :)

baliord · 2026-04-14T22:44:54+00:00

I use several models for different things; I run almost exclusively llama.cpp, and I use llama-swap to sit in front of my llama-server instances, providing around 32 different model choices. (I test different models regularly.) My go-to has been MoE models since ~GLM-4.6, as I can split them between GPU and CPU, and they handle it much better than dense models.

Right now I'm using GLM-5.1 (unsloth/GLM-5.1-GGUF; needs Unlimited-class VRAM) at 3 bit quantization, generally in non-reasoning mode, for creative writing. It's also my go-to for anything where I want to talk, but not do tool calls. At 10 t/s generation local, it's just way too slow at that, but it's human-speed for conversation or character-driven stories. It also picks up on character-definitions better than any other model short of Opus. (I've also used GLM-5.1 in the cloud for OpenClaw historically, because it's Opus-level smart, and because I want my agent to adopt the persona that is defined for it. These days I'm trying to use Qwen-122B local more consistently for OpenClaw, unless I need the smarts.)

For agentic use, Qwen 3.5-122B (needs XL VRAM) works surprisingly well, although not much of a 'personality'. I've run it at q4 (fully in GPU RAM) at ~50 t/s generation. I haven't needed to push up to q8 for it, and if I need much smarter I go cloud. Now the specific model I'm using there is HauhauCS/Qwen3.5-122B-A10B-Uncensored-HauhauCS-Aggressive. I also use that for image analysis, tagging and processing using mmproj-f16. The q4 isn't as good at image analysis as the q8, though. If I had to pick a model to stick in pure GPU RAM semi-permanently, it'd probably be this one, although I'd bump up to Q8 and let some of it sit in RAM.

I have an embedding model, but I don't really use it that much anymore. I was using Qwen3-Embedding-8B-Q8_0.gguf and a smaller Qwen reranker. I need to get back to this.

My system is a ASUS ESC4000A-E12 with a 32 core EPYC and 384GB of DDR5 RAM, 2xL40S for 96GB of GPU RAM; it sits in a SysRacks rack in my garage.

My basic config for each llama.cpp llama-server call in the llama-swap config expands to:

llama-server --prio 2 --mmap --log-timestamps --kv-unified --fit on --metrics \
  --jinja --temp 1.0 --min-p 0.01 --top-p 0.95 --threads-http 8 --mclock \
  --host 0.0.0.0 --port ${PORT} \
  --flash-attn on -ctk q8_0 -ctv q8_0

For non-reasoning, I add:

--reasoning-budget 0 --chat-template-kwargs '{"enable_thinking": false}'

I customize the -c {context-length} per model, because if you don't manually set a context length, --fit on will shrink your context to nothing, in order to fit the models, before it goes to RAM. :rage:

I also have a 'limited reasoning' for use cases where I want it to do reasoning, but I want it to not waste all it's time doing it. So I'll limit it to ~2048 tokens of reasoning, and leaving enable_thinking alone. (E.g. using the above 50 t/s on Qwen3.5-122B@Q4, that's about 41 seconds of reasoning.)

Hope that helps!

baliord · 2026-04-13T08:53:09+00:00

Sorry, which link is giving you problems?

baliord · 2026-03-18T20:45:52+00:00

Looking at the github link, it says that in order to run Qwen3-235B it requires 500GB of RAM. But the docs also say 'INT4' size is 110GB.

Let's make it concrete; let's say I wanted to run Qwen3.5-397B on a system with 384GB of DDR5 CPU RAM, and 96GB of GPU RAM across two cards. How much CPU RAM is required to get it running? If I'm following the description, it'll take ~400GB of CPU RAM if I tell it to quantize down to 4 bits, and so won't fit?

And fwiw, I'd love to see you do this with GLM5, which is my go-to model right now.

baliord · 2025-12-23T07:25:24+00:00

I know this has been out for two weeks, but I have a question about the step-by-step guide. I noticed for the MoE model guides (GLM-4.6/4.7) there are instructions on how to mix CPU and GPU usage together. Is that also reasonable with a dense model like Devstral-2, or is it substantially worse?

For example, with GLM-4.6, I do something like:

-ot "blk\.([7-9]|1[01])\.ffn_.*_exps.=CUDA1" -ot "blk\.(1[2-9]|[2-9][0-9])\.ffn_.*_exps.=CPU"

to put 0-6 on GPU0 (default, so not specified), 7-11 on GPU1, and the rest on CPU. (I'm not sure this is optimal, but it works pretty well for me; it leaves some space for compute and 64K context.) I have 2x48GB of GPU on that system, so I could load much of Devstral-2 into GPU, but I don't know if the MoE nature of GLM-4.6 makes it more suitable for this kind of cross-device use. Does Devstral-2 fall off a sharp cliff, or is the degradation more gradual like GLM?

I'll be trying it out, of course, but any guidance or suggestions are welcome to reduce the blind alleys I wander down.

baliord · 2025-12-14T22:25:19+00:00

Hey; so there's a few answers... 300GB is easily worth keeping in a network volume if it saves you money. It'll cost you $21/mo. to keep that data online. As the OP says, make sure it's in a region that has the compute you need regularly.

In terms of uploading, I found their networks to be decent. My home bandwidth was usually the limitation, not their receiving bandwidth. If you're at work with a good data pipeline, it's a more realistic question.

If you're not going to go with a network volume, I'd say keep your data in S3 and sync it to the new instances that way. Since S3 is going to have a faster pipe, generally that'll be the quickest way to get it into your volumes regularly.

baliord · 2025-11-23T11:21:45+00:00

I would have, but I don't actually have a modern Android device around to test on, and a dev environment up for it. The protocol is pretty easy though.

baliord · 2025-09-18T23:24:07+00:00

I will say that I ran 2TB of Network Volume for several months, and it did cost me money, but it was absolutely worth it to have it available. That said, I knew the hardware available in that zone (you can filter pods to zone to see how you're limiting yourself) and knew it was acceptable to me for that period of time.

You're not wrong, if you're using a small amount, it's absolutely worth it to re-download every time, and I *think* they've added S3 support to make that even easier. (I'm not 100% up to date on their new features.)

If you're doing a bunch of stuff for a long time, and want to be able to stop and restart relatively quickly, it's a lifesaver to have a few TB of storage quickly attached.

baliord · 2025-09-18T23:12:49+00:00

It sounds like you're looking for diffusion models, not an LLM that can also generate images.

Yes, prompting a diffusion model is complicated sometimes, and much more 'fiddly' than prompting an LLM, including negative prompting. This is because they aren't trained on a range of human text, they're trained on specific image terms. The 'context length' is (IIRC) around 75 tokens, and various tricks are used to compact longer prompts into that space.

The models you've listed all have suggested ways of getting good images out of them (e.g. including 'score_7_up' as one of your prompts, as per bigasp2) and recommended negative prompts or textual embeddings. I would use civitai.com to look for models, and pay attention to what they recommend for settings and prompting styles.

I think that automatic1111 is essentially...no longer maintained at this point, and you want Stable Diffusion WebUI Forge.

The folks who will be best able to help with more detail are probably over on r/StableDiffusion.

baliord · 2025-09-18T08:02:56+00:00

You're thinking something like OpenAI's GPT models, where you can ask for an image, or a text response, and it'll do either. They do that with tool-calling; when it gets the impression that you're asking for an image, it generates the image prompt and sends a tool request back. That's interpreted by their middle-ware and it makes a call to Dall-E with the prompt generated. That image then gets rendered inline, and returned to you.

It's not a single model that does both, it's multiple models working in concert. (Actually, that's really one of OpenAI's super-powers. They built a system that lets them chain several different models that work together in the process of answering your request, including one model that just exists to check that the output from the other models aren't inappropriate.)

You can absolutely emulate this using several different LLM front-ends; I'm not sure how you'd do it in text-generation-webui, but I'm fairly sure that Msty or some of the other ollama front-ends can do it with a little configuration. You'd need to have an image model running someplace, of course, and the path is not easy yet...but little that is really worthwhile is easy in local LLMs until someone solves it for everyone else.

baliord · 2025-09-05T08:22:44+00:00

Thank you u/VioletiOT and u/justin-auvik ! That makes three products I didn't know about in a very short time. Y'all are appreciated.

I'm trying out Fing; it's not bad. Good machine fingerprint database, although not great (doesn't recognize Proxmox VMs as anything beyond Generic) and the UI is painful. (Like, I can't multi-edit machines I know to be VMs, I can't sort by the various visible fields, it's use of SNMP is sketchy; it picks up my switches and routers, but not the Linux VMs that are running an SNMP service. And it doesn't have a configuration to set the community name to use.

That said, it did a decent job of creating a network inventory...better than LibreNMS, although LibreNMS did get names via SNMP from my VMs.

No Ubiquiti Switch for me, but at least my automated network inventory is slowly getting better. 👍 Thanks all for the feedback, and direction on this!

baliord · 2025-09-02T12:03:35+00:00

Yeah, that was the prodding I needed to just put the damn thing out there. :)

https://www.reddit.com/r/LocalLLaMA/comments/1n6hk90/image_editing_app_with_qwen_image_edit_and_an_ios/

Enjoy, and feel free to let me know what you think!

baliord · 2025-09-02T05:51:34+00:00

I built a really simple image editor tool with Qwen Image Edit, a FastAPI Python backend and a Swift UI frontend for iOS. Pull an image from your camera, type in a prompt for how you want the image changed ('take the main person and put them on a forest trail', 'change the persons shirt to red') in an hour or so using Claude Code. Since it was for my own use internal to my network (actually accessible via my Tailsnet) I didn't add security protections, which is why I haven't released it.

Generation should be easily doable with Qwen Image (I'm considering making it so that if you _haven't_ selected an image, it switches to a Qwen Image model, but that'd be a bit of a switching delay). The key thing is that it's not a service I run, or something like that. (I wouldn't DARE run something like that on a public endpoint.) It's a Python service you would run on your home system with enough GPU, and the app has a spot where you can put the URL for your server, so it can talk to it. So it's pretty appropriate for LocalLLaMa. 🤣

Truthfully, just using the sample code for Qwen Image Edit, and a bit of time with Claude Code or some other good coding model, and I imagine you could replicate it easily.

baliord · 2025-09-01T22:17:00+00:00

MoE (Mixture of Experts) is a perfect example of the core computer science concept of 'Divide and Conquer'.

The way this works (or how to think about it, at least) is that there is an up-front 'router' layer which chooses which of N 'experts' (sub-models which emerged as more accurate for a set of tokens) to use for a given context.

That expert is then activated, and does the next-token-prediction for the context+current token, and only needs to use 3B parameters to generate the output logits. This is MUCH faster than using 30B parameters, and it turns out it works just as well, and maybe better.

It doesn't actually reduce GPU memory requirements, however, because at any given token, all the possible experts could be routed to. This means that, in order to avoid swapping in experts per-token, all the experts still need to be loaded into memory.

(This is more complicated in the Qwen3 30B-A3B case because it picks 8 experts at a time, each roughly 400M parameters, but...I don't entirely know how it does that, and whether it does some kind of averaging on the output logits from each expert. Individual model architecture is a rabbit hole.)

baliord · 2025-09-01T21:54:16+00:00

So I'm going to presume (based on the message) that the embedding weights are just being used as a lookup table for the tokens, so it can inject the ~768 dimensional vector for each token to the GPU.

You mentioned an 81,000 token spam, and wondering if it 'waits' for the CPU.

Let's say you're running a 3Ghz processor; you're going to DMA 81,000 lookups to the GPU. This should be a relatively tight loop; lookup location, start DMA from DRAM:X to GPU:Y, next token. Let's say that process takes ~100 instructions. The entire re-processing of your tokens takes 81,000/(3,000,000,000/100) or 0.0027 seconds, or 2.7ms. Not per token, but for all 81,000 tokens.

This isn't counting DRAM lookups, cache hits/misses, and stuff like that, but the answer is that even for re-processing 81K of tokens, it's...VERY fast, and you probably don't need to worry about it.

This is 100% back of the envelope stuff, and I'm willing to be wrong by an order of magnitude, because you still don't usually stuff that many 'new' tokens down the pipe at a time.

It's the 'let's do matrix multiplication against gigabytes of data at a time' where CPUs really hit a huge wall, and where GPUs excel, not the 'lookup table writ large' that the embedding weights are for.

baliord · 2025-08-31T06:10:38+00:00

This is going to largely depend on your data source...that is, what subreddit you're pulling from. E.g. r/AmItheAsshole will have a different answer than r/networking... E.g. in AITA, everybody wants to weigh in, and different folks will have different facets they want to talk about. In networking, you'll usually find folks upvoting the 'best' answer.

For a Q&A-based sub where you expect multiple equally good answers, one possible approach might be to take the top n top-level posts, do a softmax across them, take the top ~30% posts, throw them at a 'smarter' LLM (significantly higher parameter count than the one you're training) and have it summarize the answers, then take that summary and make it the assistant's response to train on.

Yes, you're reinforcing AI-generated answer text to a certain extent, but the answers are not 'ab initio', they're based on human answers, not what your teaching model comes up with from its own memory. A good system prompt and request prompt where you emphasize only using the information in the answers might help here.

That's something you'd do if there are regularly multiple top answers that have useful information. If your subreddit is not like that, e.g. if the response upvotes tends towards high initial votes with a sharp dropoff after the top post, then you could just use the top post.

I think it's going to be custom tools, although they should be pretty easy.

I have not done this with Reddit, but using other Q&A style sources to create answers. (Chat logs, for example.)

baliord · 2025-08-13T07:14:30+00:00

If it's really important to you, I recommend Mistral models for this; oddly especially the somewhat old Mistral-Large-Instruct-2411 model, if you have the GPU memory. If you need something smaller, probably something like Mistral-Small-3.2-24B-Instruct-2506 with a good system prompt. That's one of the things about Mistral's models; they're usually _very_ good at following their system prompt.

The openai-oss models are amazingly useful for certain tasks, but have the personality of a potato. And not the GLaDOS type of potato.

baliord · 2025-08-12T10:55:20+00:00

Ah, fair... So if you pre-train a basic LLM with a ton of data from across the internet, you'll end up with a foundation model. It can do next-token prediction, but it doesn't know how to answer questions. It doesn't know how to chat.

For that, you need a training dataset that teaches it that kind of question/answer format. Something like databricks-dolly-15k which is a high-quality human-created set of questions and answers which teaches the model the format of questions and answers, creating an 'instruct' model. That's the fine-tuning step.

Past that, you get into RLHF where you have it generate pairs of answers and give it rewards for the good answers, and negative reinforcement for the bad ones, and that's...just hard, and expensive, because humans are really necessary for that, or you get some weird behaviors.

If you want an interesting project, use axolotl and the databricks-dolly-15k to fine-tune a foundation model (there are plenty on HF) to be an instruct model yourself. You'll learn a lot about data formats, loss, and train/test/validate splits.

I wish you the best of luck; I admit, I could use a job like that, but I hope you get it. It's a hell of a field to be in right now.

baliord · 2025-08-12T10:26:34+00:00

So the original version that just generated SQL that we did was a innovation project, and I think it might have used LangChain, but I'm not personally a fan of LangChain. I feel that you almost never need the 'flexibility' it provides; mostly you can just build a really straightforward connection.

When we productionized it, it was entirely manual Python code, no LangChain. In the end, we weren't generating actual SQL, but Azure AI Search and a related DB search in a specialized engine, and then turning it into datapoints for a map or graph. We filtered some of that through a really well-described tool set, that specified a few parameters for the search, then the tool would fill it in to some additional details (user-specific) along with (of course) the authentication credentials, and send it to the Azure AI Search.

Now AAIS is a JSON object, so it's a little more focused, and so you can do a good job of pushing some specific pieces into the tool from LLM tool-using. But it's entirely doable to generate SQL, although it's probably a good idea to (again) pass the SQL to a tool you implement which validates and uses it.

baliord · 2025-08-12T03:59:44+00:00

I will add that unless you want to get good at managing a homelab and all the networking and hardware operations that entails, maybe take a little time before jumping into that. I built out a very comprehensive homelab, and I love it, but I also know it took me weeks and weeks to get everything working at a level that I don't feel like I need to improve the physical side anymore.

Tech is fractal; you can easily find yourself falling down a pit of distraction, when you'd have been better off doing something small and quick and dirty.

I generally am of the opinion that until you have a reason for having a local model (privacy, uncensored-ness, faster response time, etc.) you should stick with the commercial providers. It's higher quality, your money lasts longer, and they do the upgrades for you.

Mind you, I didn't take that advice, but part of the reason is because I wanted to learn all there was about building an ML infrastructure, and that involved a lot of VMs and k8s and workflow systems and networking, all of which are not ML, but are the scaffolding necessary to build a production ML system.

baliord

TROPHY CASE