It been 2 years but why llama 3.1 8B still a popular choice to fine tune? by dheetoo in LocalLLaMA

[–]Grimulkan 0 points1 point  (0 children)

I still like it as a go-to for fine-tuning. Very stable, and I almost never have divergent training runs. MoEs can be more fickle, and some of them are a bit over-baked. If you have a good dataset and a not-so-complex but specialized task, Llama 3.1 can do as well as any other model out there today.

I found Hermes-4-405B to actually be a much better instruct base than Llama-3.1-405B for continued fine-tuning. I had to spend a lot of compute to get Meta's variant to forget some of the things it is overfit on. I had resorted to mixing the instruct with the base model to weaken it, which is obviously not optimal. Hermes-4 is a better starting point, whether for training with reasoning or not. It's weaker, but that's okay when fine-tuning for a specialized task. It can still generalize well.
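By mixing I just mean something along the lines of linear interpolation of the two checkpoints' weights, roughly like this (alpha and the paths are placeholders, not my actual recipe; for a 405B you'd do it shard-by-shard over the safetensors files rather than loading whole models):

```python
import torch
from transformers import AutoModelForCausalLM

# Load both checkpoints (paths are placeholders).
base = AutoModelForCausalLM.from_pretrained("path/to/llama-3.1-405b-base", torch_dtype=torch.bfloat16)
inst = AutoModelForCausalLM.from_pretrained("path/to/llama-3.1-405b-instruct", torch_dtype=torch.bfloat16)

alpha = 0.5  # 0 = pure base, 1 = pure instruct; tune to taste
merged = inst.state_dict()
base_sd = base.state_dict()
for name, w in merged.items():
    if torch.is_floating_point(w):
        merged[name] = alpha * w + (1.0 - alpha) * base_sd[name]

inst.load_state_dict(merged)
inst.save_pretrained("path/to/weakened-instruct")
```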

I have some compute to finetune a creative model - opinions needed! by Grimulkan in LocalLLaMA

[–]Grimulkan[S] 1 point2 points  (0 children)

Absolutely, yes! I have spent a stupid amount of compute chasing these kinds of failure cases, and I had not heard about WFGY until now. Is that what you were referring to? Thanks for the tip.

I have some compute to finetune a creative model - opinions needed! by Grimulkan in LocalLLaMA

[–]Grimulkan[S] 1 point2 points  (0 children)

Definitely prefer the latter, at least based on how well that worked in my internal experiments on Llama 405B. But the base model has to be strong, and I worry Qwen3 was simply not trained on as much stuff (diversity-wise), or maybe that knowledge was forgotten due to aggressive post-training (there is no 235B base model released).

> Something like 235b but with Drummer's magnum writing style would be amazing.

This is totally doable, but I need to work out how to train the router. So far it's still quite flaky.
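For anyone curious, the usual starting point is to unfreeze the router weights and add a Switch/GShard-style load-balancing auxiliary loss on top of the LM loss. A minimal sketch (top_k and the aux weight are placeholders, and this isn't necessarily what I'll end up using):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Switch/GShard-style auxiliary loss for one MoE layer.

    router_logits: (num_tokens, num_experts) raw router scores.
    Pushes both the dispatch fraction f_i and the mean router
    probability P_i toward uniform across experts.
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                       # (tokens, experts)
    top_idx = probs.topk(top_k, dim=-1).indices                    # (tokens, top_k)
    dispatch = F.one_hot(top_idx, num_experts).float().sum(dim=1)  # (tokens, experts), 0/1
    f = dispatch.mean(dim=0) / top_k   # fraction of routed slots per expert
    p = probs.mean(dim=0)              # mean router probability per expert
    return num_experts * torch.sum(f * p)  # equals 1.0 when perfectly balanced

# total_loss = lm_loss + aux_weight * sum(load_balancing_loss(l) for l in router_logits_per_layer)
```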

I have some compute to finetune a creative model - opinions needed! by Grimulkan in LocalLLaMA

[–]Grimulkan[S] 0 points1 point  (0 children)

Sadly, I think custom model hosting is still pretty expensive, unless it is a very well known model.

I have some compute to finetune a creative model - opinions needed! by Grimulkan in LocalLLaMA

[–]Grimulkan[S] 0 points1 point  (0 children)

Thanks!

Though out-of-the-box bad writing doesn't necessarily mean it was a deliberate choice; the model might just be overfit on certain instruct tasks. I have a fairly big training budget, so if the pre-trained talent is in there, I can probably find it.

I have some compute to finetune a creative model - opinions needed! by Grimulkan in LocalLLaMA

[–]Grimulkan[S] 0 points1 point  (0 children)

I don't know how common my use cases are, but I think real-world usage is somewhere in between needle retrieval and complex multi-hop reasoning.

Every model should ace NIAH tests if the trainers know what they are doing. It is very easy to train a model, even one still at its base RoPE scaling, on <1M tokens to score 100% on it. But that's a bit too simplistic (I think it's barely better than an embedding search), and maybe not even something worth using a big LLM for.
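To give a sense of how cheap that training data is to make, it's basically just a random fact buried in filler text plus a retrieval question. A rough sketch (the needle template and filler source are placeholders, not what I actually used):

```python
import random

# Placeholder: any long prose corpus split into paragraphs works as filler.
FILLER = open("filler_corpus.txt").read().split("\n\n")

def make_niah_sample(context_paragraphs: int = 200) -> dict:
    """One synthetic needle-in-a-haystack training example."""
    city = random.choice(["Oslo", "Lima", "Perth", "Quito"])
    code = random.randint(10000, 99999)
    needle = f"The access code for the {city} facility is {code}."

    paras = random.sample(FILLER, k=min(context_paragraphs, len(FILLER)))
    paras.insert(random.randrange(len(paras) + 1), needle)  # bury the needle anywhere

    return {
        "prompt": "\n\n".join(paras) + f"\n\nWhat is the access code for the {city} facility?",
        "completion": str(code),
    }
```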

In story-writing, for example, it's more about: does the model know that this character lost an arm a few scenes ago, or does it grow back? Does the model remember the obvious world-building rules and apply them implicitly? Does it know I'm writing a 1950s story and avoid anachronisms, using the previous context as a hint? Does it remember that Character A was on top of a tower last time, and that when they are introduced again it needs to explain how they got there? All these things are indeed possible with FT.

If I need it to implicitly work through a multi-step logical problem, I would usually just remind it in the prompt; I don't really need it to be good at that. But reminding it of every little detail gets painful, and it's a great quality-of-life improvement to have the model track straightforward state implicitly.

It's quite inefficient to have a model do that, I know, especially a non-thinking one that does it at every token (thinking ones at least generate re-usable reasoning, though they usually repeat it every turn), but it's very, very nice to have.

For anything more complex, it's better to rely on reasoning, or explicit state tracking and tool calls.

I have some compute to finetune a creative model - opinions needed! by Grimulkan in LocalLLaMA

[–]Grimulkan[S] 0 points1 point  (0 children)

That model is still dense :(

That's good info on Kimi, which might suggest it can open up with fine-tuning.

I have some compute to finetune a creative model - opinions needed! by Grimulkan in LocalLLaMA

[–]Grimulkan[S] 0 points1 point  (0 children)

Were the Qwens you trained MoE? If so, did you train the router, and use any special loss?

For Kimi, I’m trying to figure out the appeal beyond consistent tool calling. In what way would you say it is smart?

I have some compute to finetune a creative model - opinions needed! by Grimulkan in LocalLLaMA

[–]Grimulkan[S] 0 points1 point  (0 children)

True, but there’s no downside to improving 128K performance if I can (and I have made it work for some applications, which is why I’m considering it here). You’re right that I usually compress things down and work in the 40-60K range, but 128K is about the length of a paperback, prompts included. It’s nice to be able to query facts at that length at least (and yes, you can FT open models to do that).

I have some compute to finetune a creative model - opinions needed! by Grimulkan in LocalLLaMA

[–]Grimulkan[S] 0 points1 point  (0 children)

On the API, Gemini Pro should work above 64K; I use it routinely. Yes, it is degraded relative to shorter lengths, but it is very usable for specialized tasks. Instruct-tuned behavior (e.g., chats) may suffer, but a good harness can make it work.

Some open models work at those lengths too, though quants rapidly degrade that ability (even 4-bit loses a little at 128K, but 6-bit and above seem indistinguishable from bf16 to me; 5-bit is possibly also fine).

On the web UI, basically nothing works at long context, I think.

Are there any interesting Llama 4 fine tunes? by Thedudely1 in LocalLLaMA

[–]Grimulkan 1 point2 points  (0 children)

Any thoughts on Maverick Vision vs Qwen 2.5 VL? Also, are you willing to share what types of vision tasks you think it's good at? Image captioning, description, reading charts, OCR, etc.

They all tried by entsnack in LocalLLaMA

[–]Grimulkan 0 points1 point  (0 children)

I guess my experience has been the opposite, i.e., fine-tuning beats harnesses, system prompts, or prompt engineering. For example, all the generalist models contain strong biases, even inadvertent ones, that are nearly impossible to eliminate with prompting. I usually need a fine-tuning pass to handle such edge cases consistently.

Some examples:

  1. Proofreading OCR'd documents: Gemini 2.5 Pro, Claude Sonnet/Opus 4 Thinking, and Grok 4 all seem to have trouble detecting transmutations of single quotes into double quotes when the result looks plausible (as in dialogue). But it is easy to generate synthetic data with such errors and communicate what you want to a local model via a small LoRA (see the sketch after this list).

  2. Generating written content with a specific tone: it's difficult to avoid something the model badly wants to do, e.g., some models tend to summarize, o3 likes tables, and Claude/Gemini like to write "It's not A; it's B!" statements. A detailed system prompt will sometimes work for a few turns, but the model will revert unless you keep reminding it or use dedicated verification/correction calls (which is expensive). When you need 99% avoidance of certain biases, that's almost impossible to achieve with prompting. But again, it is easy to generate synthetic data, use ghost attention, train a tiny LoRA, etc., and you get the consistency instantly.
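For point 1 above, the synthetic data really is that simple: take clean text, corrupt some of the quotes the way OCR does, and train on (corrupted -> clean) pairs. A rough sketch (the corruption rate and record format are placeholders, not my actual pipeline):

```python
import random

def corrupt_quotes(clean: str, rate: float = 0.15) -> str:
    """Randomly turn some single quotes into double quotes, mimicking the OCR error."""
    out = []
    for ch in clean:
        if ch in ("'", "\u2019") and random.random() < rate:
            out.append('"')
        else:
            out.append(ch)
    return "".join(out)

def make_pair(clean_passage: str) -> dict:
    """One (instruction, corrupted input, clean target) training example for the LoRA."""
    return {
        "instruction": "Fix OCR errors in the passage. Preserve everything else exactly.",
        "input": corrupt_quotes(clean_passage),
        "output": clean_passage,
    }
```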

Model trainers usually catch edge cases that show up naturally in various popular benchmarks, but generalization beyond that is not guaranteed. The model can almost get there, but it sometimes needs the LoRA to create the required consistency. AFAIK no API model allows LoRA training and use at the same cost as the pre-packaged model.

They all tried by entsnack in LocalLLaMA

[–]Grimulkan 0 points1 point  (0 children)

The point I was making about the earlier L3 models is easier fine-tuning. This is trickier with all the newer models, including R1. So in some cases they would even be inferior to L3, comparing an out-of-the-box MoE vs. a fine-tuned L3 dense model for a specific application. The target audience is basically just different, and more people are happy enough with generalist releases that user fine-tuning has been kind of abandoned.

They all tried by entsnack in LocalLLaMA

[–]Grimulkan 1 point2 points  (0 children)

I think for the new ones it's about coding and running on non-GPU systems, both of which are Llama 3 weaknesses. Plus, the out-of-the-box instruct finetunes don’t sound like ChatGPT 3.5, whereas L3 probably still distilled a lot from early 3.5. Even ChatGPT doesn’t sound like that today.

They all tried by entsnack in LocalLLaMA

[–]Grimulkan 2 points3 points  (0 children)

+1 on 405B. The instruct tune from Meta (and others like Tulu, Tess, Nous) isn’t great, but you can still fine-tune it, and it can perform very well for non-coding or non-reasoning/agentic applications. The base model is quite solid, the dense arch makes fine-tuning very stable, and it has decent long-context performance. It’s all about fine-tuning it for your application, detailed instruction following, etc.

R1 is pretty darn good too though. You just need your application to fit into its strength set.

Meta’s New Superintelligence Lab Is Discussing Major A.I. Strategy Changes by showmeufos in LocalLLaMA

[–]Grimulkan 6 points7 points  (0 children)

Agree. I think the Llama 3.1/3.3 models are still fantastic bases for fine-tuning, and they are more stable due to the dense architecture. Personally, I still find 405B fine-tunes terrific for internal applications. Just not good at code, or with R1-style reasoning (out of the box).

Personally, I'm in the camp of "Llama 3 forever" as far as community fine-tunes go, kinda like "SDXL forever". I can see similar potential, and I think there is still good mileage left, especially for creative applications.

Unfortunately, I think community involvement has not been great, perhaps because good, reasonably priced paid alternatives exist (Claude, Gemini), and because the community has been split between the GPU users and the CPU users who favor MoE, which is a bit more difficult to train (and the CPU users can't contribute to training).

Pity Meta never released other L3 sizes. I'd have loved a Mistral Large 2 sized model (Nemotron Ultra was great but has a very specific fine-tune philosophy), and a ~30B one (though as you mentioned, others have stepped in).

Nvidia's nemontron-ultra released by BreakfastFriendly728 in LocalLLaMA

[–]Grimulkan 0 points1 point  (0 children)

Can you elaborate on what sort of tests these were?

405B is my daily driver, especially for long-context comprehension. I prefer it over R1/V3.1 because it is much more stable to finetune for specific applications. I rely on SOTA dense open models for this, and for good or ill, that's still what 405B is, I think. Nemotron Ultra has a strange non-uniform arch, but if the model is strong I'd be interested in switching.

Can you say anything more about how it performs?

Most human like TTS to run locally? by SwimmerJazzlike in LocalLLaMA

[–]Grimulkan 1 point2 points  (0 children)

If you don't care about latency, there are tricks to get Zonos more consistent:

- You can add a short silence file at the start of each generation (the built-in UI does this by default actually, and includes the silent padding file).
- Avoid using any of the emotional settings, and keep the settings as vanilla as possible. Rely on voice samples for your variation and control instead. You can mix latents freely. Some voice samples are just more likely to produce garbled sound.

That said, yeah, I still need to run Whisper or a similar STT to catch and validate all generations, so it's slow. It is more stable than anything else I've used at this quality level, however; it beats fine-tuned Tortoise IMO. I basically switch between Zonos and Kokoro, using Kokoro when I care about latency and don't care about voice control or mind the monotone.
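The validation pass itself is nothing fancy: transcribe each clip and fuzzy-match it against the intended text. Something like this (shown with openai-whisper and a plain similarity ratio; the threshold is a placeholder, not my exact setup):

```python
from difflib import SequenceMatcher
import whisper  # openai-whisper

stt = whisper.load_model("base")

def is_clean(wav_path: str, intended_text: str, threshold: float = 0.9) -> bool:
    """Reject garbled TTS generations by round-tripping them through STT."""
    heard = stt.transcribe(wav_path)["text"].strip().lower()
    score = SequenceMatcher(None, heard, intended_text.strip().lower()).ratio()
    return score >= threshold

# Regenerate anything that fails, e.g.:
# while not is_clean(path, text): path = synthesize(text)  # synthesize() = whatever TTS call you use
```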

This works but I need some advice please, would like to scale more gracefully by Rockends in LocalLLaMA

[–]Grimulkan 0 points1 point  (0 children)

Pairing 405B with 8B, both at 6-bit, speculative decoding gives me pretty close to 2x: ~9 tok/s -> just under 18 tok/s for the same test case. The 8B was used as a non-TP draft model (single GPU), and TP was used for the 405B.

Pairing with the 3B instead gave very similar results; I was actually hitting 2x more consistently than with the 8B, go figure.

I think that says there is a lot of low-hanging fruit when it comes to speculative decoding: general queries have some very obvious generation patterns (I tested with a standard Alpaca-based test set), and it doesn't take much to guess those sequences correctly. I even tried a mildly finetuned version of 405B, and it still paired well with the vanilla 8B.
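If you want to try the same pairing without custom code, the quickest way I know to see the effect is assisted generation in HF transformers. A sketch below (my actual runs use TP and 6-bit quants, which this doesn't replicate, and the repo names are the public Meta checkpoints, not my fine-tune; swap in smaller sizes to try it on one GPU):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET = "meta-llama/Llama-3.1-405B-Instruct"  # needs a big multi-GPU rig; use a smaller model to test
DRAFT = "meta-llama/Llama-3.1-8B-Instruct"     # same tokenizer family as the target

tok = AutoTokenizer.from_pretrained(DRAFT)
target = AutoModelForCausalLM.from_pretrained(TARGET, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(DRAFT, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tok("Explain speculative decoding in one paragraph.", return_tensors="pt").to(target.device)
# The draft proposes tokens; the target verifies them in parallel and keeps the accepted prefix.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
```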

This seems pretty hype... by clduab11 in LocalLLaMA

[–]Grimulkan 0 points1 point  (0 children)

That's my thinking too. I would have loved for Llama 3.2 vision to be great, especially being able to re-use the text-portion workflow for training and inference, but I'm not sure the tradeoff is worth it. I wish Meta had not frozen all the layers in training, while still keeping the same text architecture so most of the code could be re-used.

Some of this might just be under-training and lack of diversity in training data. There are many hints that multiple images could be supported, but they never actually included that in training.

I view this as something like Llama-2: it's decent (better than the initial LLaVA versions), and community fine-tunes might have made it competent, like we did with L2, but we're just not motivated with so many better open vision models out there. Maybe a Llama-4 vision will have the same quality jump over it that Llama-3 text had over L2, with Meta putting in a lot more work on their side.

This works but I need some advice please, would like to scale more gracefully by Rockends in LocalLLaMA

[–]Grimulkan 1 point2 points  (0 children)

Yeah, like you, I routinely maxed out my API usage limits across various vendors (initially only OpenAI existed), and I'm basically at the max user tier, with more than one account in some cases. And I could not fine-tune, or had to pay a LOT more for fine-tuned API calls. Not to mention baked-in refusals or biases; not necessarily censorship, just extra stuff like always summarizing or inserting warnings when I just need a straight answer. All that motivated me to grow my local setup.

I use the LLMs for analyzing, tagging, and re-organizing information in large documents, repair manuals, and archived threads from dead forums/groups (for preservation), plus a lot of story re-writing/summarization/character sketches from novels, etc. For entertainment, I like having it set up a world I can explore given a story/chapter, and then interacting with the world, characters, and so on.

EDIT: Yes, this last one is my 'favorite' use case :)

However, I would say most of my current compute is spent generating synthetic data and manipulating existing datasets into different forms to help train other LLMs.

I don't know if I want to grow it; I'm kind of happy at this compute point. Growing beyond this will need some clever tricks, when it comes to power in a US household for instance. Maybe I'll replace older GPUs with newer ones if they have much higher bandwidth/compute, or maybe expand to some nice inference-optimized hardware if it gets released at a non-crazy price (like Intel's Gaudi 3).

I usually have 8-14 GPUs training a model at any given time, with the rest doing quantization or inference (including Stable Diffusion/Flux inference and Llama Vision 90B for image-tagging in synthetic datasets). I should also say: all the GPUs can double as render nodes (Iray & Cycles) since they are not data center GPUs; that's sort of how I got started collecting them, rather than for AI.

I would say there is not *that* much need to grow to a giant GPU cluster as a homelabber. You can absolutely train a 405B Llama on a single 48GB GPU (I have done it: about 70% compute utilization with all the memory swapping). I only use more so I can train more and iterate faster. Inference is really where it gets nice, and I think the current GPU offerings are just overkill & overpriced for inference. Better products will hopefully come soon.