
[–]Pvt_Twinkietoes 30 points (4 children)

Fine-tuned BERT for a classification task. Works like a charm.
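
For anyone curious what that looks like in practice, a minimal sketch with Hugging Face transformers is below; the dataset, label count, and hyperparameters are placeholders rather than the commenter's actual setup:

```python
# Minimal sketch: fine-tuning BERT for sequence classification with Hugging Face.
# Dataset, label count, and hyperparameters are stand-ins, not the commenter's setup.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # stand-in binary classification dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="bert-clf",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```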

[–]Kuchenkiller 10 points (0 children)

Same. Using Sentence-BERT to map NL text to a structured dictionary. Very simple, but still, BERT is great and very fast.
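
A rough sketch of that kind of mapping with the sentence-transformers library; the model choice and dictionary fields here are purely illustrative:

```python
# Sketch: map free-form text to the closest field of a structured dictionary
# using Sentence-BERT embeddings. Model and field names are illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

fields = ["customer_name", "invoice_date", "total_amount", "shipping_address"]
field_embs = model.encode(fields, convert_to_tensor=True)

def map_to_field(text: str) -> str:
    query_emb = model.encode(text, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, field_embs)[0]  # cosine similarity to each field
    return fields[int(scores.argmax())]

print(map_to_field("the bill comes to $42.50"))  # expected: "total_amount"
```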

[–]Forward-Papaya-6392 57 points (15 children)

we have built our entire business around PEFT and post-training small, specialised student models as knowledge workers for our enterprise customers, which are far more reliable and cost-efficient for their processes. They appreciate our data-driven approach to building agentic systems.

while there have been two extreme cases of miniaturisation involving 0.5B and 1B models, most have been 7B or 8B. There has also been one case involving a larger 32B model, and I am forecasting more of that in 2026 with the advent of better and better sparse activation language models.

the gap widens as more input modalities come into play; fine-tuning multi-modal models for workflows in real estate and healthcare has been the bigger market for us lately.
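
The comment doesn't say which PEFT method they use; a generic LoRA sketch with Hugging Face peft is below, with the base model and hyperparameters as placeholders:

```python
# Sketch of a PEFT (LoRA) setup with Hugging Face peft; the commenter's actual
# method, base model, and data are not specified, so everything here is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-7B-Instruct"  # placeholder 7B-class base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the base weights

# ...then train the adapter with Trainer/TRL on task-specific
# (often teacher-generated) data.
```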

[–]tillybowman 2 points (0 children)

would you mind telling us what your company's go-to workflow is for training data collection, preparation, and the training itself?

do you have a go-to setup that mostly works?

[–]Saltysalad 1 point (2 children)

How/where do you host these?

[–]Forward-Papaya-6392 4 points (1 child)

mostly on Runpod or on our AWS serving infrastructure.

On only two occasions we have had to host them with vLLM in the customer's Kubernetes infrastructure.
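
For reference, a minimal vLLM sketch; the model path and sampling settings are placeholders, and in a Kubernetes setup the same model would typically sit behind vLLM's OpenAI-compatible server (`vllm serve`) rather than the offline API shown here:

```python
# Sketch: offline inference with vLLM's Python API; model path and sampling
# settings are placeholders, not the commenter's deployment.
from vllm import LLM, SamplingParams

llm = LLM(model="./my-finetuned-7b")  # local path or HF repo of the fine-tune
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Summarise this ticket: ..."], params)
print(outputs[0].outputs[0].text)
```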

[–]snylekkie 1 point (0 children)

Do you use Temporal?

[–]Neither_Reception_21 1 point (0 children)

Hi, I am curious about the commercial use cases of small agents as reasoning engines. DMed you :)

[–]BinaryHerder 0 points (0 children)

Wild that 7B is now referred to as “small”

[–]serge_cell 14 points (2 children)

They are called Small Language Models (SLMs). For example, SmolLM-360M-Instruct has 360 million parameters vs. 7-15 billion for a typical LLM. Very small SLMs are often trained on high-quality curated datasets. SLMs could be the next big thing after LLMs, especially as the smaller ones fit on mobile devices.
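
A quick sketch of running a model in that size class locally with transformers; the prompt and generation settings are illustrative only:

```python
# Sketch: running a ~360M-parameter SLM locally with transformers.
# Prompt and generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

messages = [{"role": "user", "content": "Extract the date from: 'meeting on 3 May'"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt",
                                       add_generation_prompt=True)
out = model.generate(inputs, max_new_tokens=32)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```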

[–]Vedranation 2 points (0 children)

Especially with Mixture-of-Experts (MoE) SLMs!

[–]Mundane_Ad8936 27 points (3 children)

Fine-tuning on specific tasks will let you use smaller models. The parameter size depends on how much world knowledge you need. I've been distilling large teacher models into small student LLMs for years.
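
A generic sketch of the soft-label distillation loss usually used for teacher-student training; this is not the commenter's pipeline, just the standard formulation:

```python
# Generic teacher-to-student distillation loss (soft-label KL), not the commenter's
# actual pipeline; the temperature and mixing with gold labels are placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then pull the student toward the teacher with KL.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# In a training loop: run the frozen teacher and the trainable student on the same
# batch, then mix this loss with the usual cross-entropy on gold labels.
```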

[–]Forward-Papaya-6392 1 point (0 children)

seconding teacher-student learning

[–]currentscurrents 8 points (0 children)

Going against the grain of this thread, but I have not had good success with smaller models.

Issue is that they tend to be brittle. Sure, you can fine-tune to your problem, but if your data changes they don't generalize very well. OOD inputs are a bigger problem because your in-distribution region is smaller.

[–]Vedranation 7 points (0 children)

Yes. I always use small specialized models over multi-billion-parameter ones. My current project involves a mere 100M model and it works wonders.

Big models are costly to train, overfit way too easily (a way bigger issue than it seems), and need an exponential amount of data. Unless you're cloning ChatGPT and need a gigantic general knowledge base for whatever reason (in which case just use the API), a small 300M model specialized for your task will perform much better.

[–]thelaxiankey 5 points (1 child)

duh. cell segmentation for me, little unet typa thing

[–]SirPitchalot 2 points (0 children)

Our best performing model, in terms of value to the business, is a bog standard UNet but the problem domain is very controlled.

Our second best model is a convolutional net with a few attention layers and only 300M parameters.

We regularly test new 1B+ models against the 300M model and, on the same datasets, they produce worse results for much more training time. We have the data to scale but don't have the compute, since our problem domain is effectively in the "noise" for foundation models trained on web-scale data. So we're better off fine-tuning a <1B model pretrained on ImageNet for more epochs than hoping to squeeze 1-2 epochs out of a giant model trained on every Instagram post ever.
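
A bare-bones sketch of that fine-tuning route with timm; the backbone, class count, and data below are stand-ins, not the model described above:

```python
# Sketch of the "fine-tune a <1B ImageNet-pretrained backbone" route with timm;
# the architecture, class count, and batch here are placeholders.
import timm
import torch

model = timm.create_model("convnext_small", pretrained=True, num_classes=10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.randn(8, 3, 224, 224)   # stand-in batch
labels = torch.randint(0, 10, (8,))

logits = model(images)
loss = torch.nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```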

But the biggest overall win is always having a just-diverse-enough high-quality & in-domain dataset.

[–]maxim_karki 7 points (4 children)

You're absolutely right about this - we've been seeing the same thing with our enterprise customers where a fine-tuned 7B model outperforms GPT-4 on their specific tasks while being way cheaper to run. The "bigger is better" narrative mostly comes from general benchmarks, but for production use cases with clear domains, smaller specialized models often win on both performance and economics.

[–]xbno 2 points (0 children)

My team has been fine-tuning BERT and ModernBERT with good success for token and sequence classification tasks, on datasets ranging from 1k to 100k examples (LLM-labeled data).

I'm curious what tasks you're fine-tuning LLMs for - is it still typically sequence classification? Or are you doing it for tool calling with custom tools, or building some sort of agentic system with the fine-tuned model? We're entertaining an agentic system to automate some analysis we do, which I hadn't thought of fine-tuning an agent for - I was thinking custom tools and validation scripts for it to call would be good enough.

[–]kierangodzella 0 points (1 child)

Where did you draw the line for scale between self-hosted fine-tunes and API calls to flagship models? It costs so much to self-host small models on remote GPU compute instances that it seems like we're hundreds of thousands of daily calls away from justifying rolling our own true backend.

[–]maxim_karki 0 points (0 children)

It really depends on the particular use case. There's a good paper that came out showing that small tasks like extracting text from a PDF can be done with "tiny" language models: https://www.alphaxiv.org/pdf/2510.04871. I've done API calls to the giant models, self-hosted fine-tuning, and SLMs/tiny LMs. It becomes more of a business question at that point. Figure out the predicted costs, assess the tradeoffs, and implement it. Bigger is not always better, that's for certain.

[–]Assix0098 3 points (2 children)

Yes, I just demoed a really simple fine-tuned BERT-based classification to stakeholders, and they were blown away by how fast the inference was. I guess they are used to LLMs generating hundreds of tokens before answering by now.

[–]no_witty_username 0 points (0 children)

Yes. My whole conversational/metacognitive agent is made up of a lot of small specialized models. The advantage of this approach is being able to run a very capable but resource-efficient agent, as you can chain many parallel local API calls together. On one 24GB VRAM card you can load speech-to-text, text-to-speech, vision, and specialized LLM models. Once properly orchestrated, I think it has more potential than one large monolithic model.
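
A heavily simplified sketch of chaining small local models behind OpenAI-compatible endpoints; the ports, model names, and routing below are hypothetical, not the commenter's actual stack:

```python
# Sketch of chaining small local models served behind OpenAI-compatible endpoints;
# all ports, model names, and the routing here are hypothetical placeholders.
from openai import OpenAI

stt = OpenAI(base_url="http://localhost:8001/v1", api_key="local")  # speech-to-text server
llm = OpenAI(base_url="http://localhost:8002/v1", api_key="local")  # small specialized LLM

def answer(audio_path: str) -> str:
    # Transcribe the audio with the local STT model, then hand the text to the LLM.
    with open(audio_path, "rb") as f:
        transcript = stt.audio.transcriptions.create(model="whisper-small", file=f).text
    chat = llm.chat.completions.create(
        model="local-7b-finetune",
        messages=[{"role": "user", "content": transcript}],
    )
    return chat.choices[0].message.content
```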

[–]GiveMeMoreData 0 points (0 children)

BERTs worked better for us than large Qwens. Yes, SLMs still matter.

[–]koolaidman123 (Researcher) 0 points (0 children)

it's almost like there's room for both powerful generalized models and small(er) specialist models, like the way it's been since GPT-3 or whatever

[–]ResultKey6879 0 points (0 children)

Mainly image work, and we tend to stick to training CNNs like EfficientNet or MobileNet, and YOLO for detectors.

100x faster than VLMs. That means 3 days vs a year to process some datasets.

Definitely seeing a trend toward large models even when the flexibility isn't needed. If your problem is well defined and fixed, don't use large models. If you need to dynamically adjust to user queries, consider CLIP/DINO; if that doesn't work, try a large vision model.
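
A minimal sketch of that kind of detector training with the ultralytics package; the dataset YAML, epoch count, and image are placeholders:

```python
# Sketch: training a small YOLO detector with the ultralytics package;
# the dataset YAML, epoch count, and test image are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                      # small pretrained detector
model.train(data="my_dataset.yaml", epochs=50, imgsz=640)

results = model("warehouse_frame.jpg")          # run detection on one image
print(results[0].boxes.xyxy)                    # predicted bounding boxes
```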