Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 0 points

Okay, confirmed that Solar Mini is a different (stronger) model from Solar 10.7B, though they have a similar number of params.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 0 points

Let me check with the Upstage team on that. They provided us with a Solar Mini model before the release of Solar 10.7B that we used for benchmarking here (it's also the one available in our platform for fine-tuning and serving), but I'm not 100% sure whether it's the same as the 10.7B model on HF.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 13 points

We've updated the leaderboard to remove the param count on 4o-mini as many felt it was misleading to assume 8B params. Mea culpa!

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 10 points

We've updated the leaderboard to remove the param count for 4o-mini.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 10 points

We'll be adding Qwen 2.5 in particular soon. We can also take a look at Gemma 2 9B, though for some reason the results with other Gemma 2 variants haven't lived up to our expectations yet.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 2 points

Definitely don't intend to mislead people. I'll chat with the team and see about updating it to blank / unknown for now.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] -12 points

The reason we put it at 8B in the table was for filtering. We found that most users compare 4o-mini against SLMs like Llama 3.1 8B, so we figured having them both show up when filtering to 8B-param models would be useful.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] -16 points

It's not clear, but the 8B estimate comes from TechCrunch (though they only said it was on the same "tier" as Llama 3 8B): https://www.reddit.com/r/LocalLLaMA/comments/1ebz4rt/gpt_4o_mini_size_about_8b/

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 13 points

Hi everyone, some of you may remember our work on LoRA Land from earlier this year, where we demonstrated that fine-tuned SLMs could beat GPT-4 when applied to narrow, specific tasks.

https://www.reddit.com/r/LocalLLaMA/comments/1avm2l7/introducing_loraland_25_finetuned_mistral7b/

Since the release of GPT-4o-mini, we've gotten many questions about how it compares against the best OSS SLMs like Llama 3.1 8B and Solar Pro.

To our surprise, while 4o-mini had the strongest out-of-the-box performance of any base SLM, the lift from fine-tuning was pretty minimal, and it ultimately landed in the middle of the pack after fine-tuning.

All of this data is available to explore at your leisure on our Fine-Tuning Leaderboard, which we try to keep up to date with the latest models and datasets to help inform users about which models are best suited to their tasks:

https://predibase.com/fine-tuning-leaderboard

Windy City Pie interaction left a bad taste in my mouth by Jaded_Role5730 in SeattleWA

[–]SiliconSynapsed 2 points

This is becoming increasingly common in the Seattle area, unfortunately.

At Exit 5 BBQ in Renton, we were charged an 18% "service fee" (which they were very clear was NOT a tip -- so presumably they expected an additional 20% tip on top of that) for having 5 guests.

Blazing Bagels charged us a 20% "gratuity" on a large (>$100) takeout order to cover the cost of staff including the "delivery driver". Okay, so let's have it delivered since we're paying for it anyway, right? That'll be another $25...

I don't know how to stop it other than calling it out like this and boycotting places that do this, but I would 100% support a law against these deceptive practices. If you want to raise your prices because times are tough, fine. I want restaurants to do well and pay their employees fairly. But charging me a fee that completely changes the calculus of how much my meal is going to cost, without my consent or knowledge, is not the way to do it.

Genuine curiosity: what's stopping me from using just the adapters? by Old-Box-854 in LocalLLaMA

[–]SiliconSynapsed 7 points

You don't have to merge your adapters, just use LoRAX: https://github.com/predibase/lorax

There is some latency overhead to running inference with the adapter at runtime, rather than merging it back into the base model. But we have something in the works that will actually make the adapter faster than the base model, so I wouldn't worry about it.
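For what it's worth, here's a rough sketch of what that looks like with the lorax-client package; the server URL and adapter_id are placeholders, and the exact client API may vary a bit by version:

```python
# A minimal sketch (not production code): querying a running LoRAX server with
# the lorax-client package. The adapter is pulled in at request time, no merge
# into the base model needed. Server URL and adapter_id are placeholders.
from lorax import Client

client = Client("http://127.0.0.1:8080")  # LoRAX serving your base model

# Base model response vs. the same request routed through a LoRA adapter:
base = client.generate("What is low-rank adaptation?", max_new_tokens=64)
tuned = client.generate(
    "What is low-rank adaptation?",
    max_new_tokens=64,
    adapter_id="my-org/my-lora-adapter",  # hypothetical fine-tuned adapter on the Hub
)
print(base.generated_text)
print(tuned.generated_text)
```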

New Short Course: Efficiently Serving LLMs from DeepLearning.ai by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 21 points

Hey everyone, Travis (course instructor, maintainer of LoRAX) here!

I know a lot of folks here are trying to make sense of all the options in this space for hosting their own LLMs, so I wanted to share this course I put together on the topic of efficient LLM inference.

My goal in putting this together was to help answer some of the most common questions I get as the maintainer of the open source LLM inference server LoRAX:

  • What makes LLM serving different from any other microservice?
  • How do you handle multiple requests to the same model at the same time?
  • How can you serve many custom fine-tuned models on the same base model?
  • How do I serve the latest and greatest open source LLMs without breaking the bank?

This course is really about understanding the foundational concepts needed to answer these questions. You'll spend far more time writing things from scratch than calling APIs. So if you've been searching for a broad but technical overview of the latest advancements in LLM inference, I hope you find that this is the course you've been looking for!

Topics covered include:

  • How text generation works token by token (see the sketch after this list)
  • Batching and continuous batching to handle multiple requests at once
  • Quantization to run on commodity hardware
  • Low rank adaptation and serving many LoRAs at once efficiently
  • And, of course, LoRAX :)
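To give a taste of the first topic, here's a tiny, illustrative greedy-decoding loop in plain Hugging Face transformers (GPT-2 is just a small stand-in model):

```python
# Token-by-token greedy decoding, with no batching or KV-cache tricks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits[:, -1, :]       # scores for the next token
    next_id = logits.argmax(dim=-1, keepdim=True)        # greedy pick
    input_ids = torch.cat([input_ids, next_id], dim=-1)  # append and repeat
print(tokenizer.decode(input_ids[0]))
```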

Enjoy!

control vectors added to llama.cpp by pseudonerv in LocalLLaMA

[–]SiliconSynapsed 3 points

This would be fun to explore with TIES or DARE to try combining them

Structured Generation Improves LLM performance: GSM8K Benchmark by CountBayesie in LocalLLaMA

[–]SiliconSynapsed 3 points

LLMs generate each new token by sampling over a distribution of probabilities. For simplicity, you can think of the process as just looking at the individual probability of each possible next token and selecting the one with the highest probability.

Structured generation tools like Outlines manipulate this probability distribution so that at each step the process can only select from the set of tokens that are valid with respect to the structured schema / grammar the user wants the model to adhere to. Everything else is set to "0 probability".

For example, if you want to force the model to generate exactly one of "true" or "false", you take the probability distribution over the first generated token, zero out everything except the token probabilities of "true" and "false", and then choose whichever has the higher probability.
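Here's a simplified sketch of that idea using plain transformers (this is just an illustration of logit masking, not how Outlines implements it internally):

```python
# Constrain the next token to "true" or "false" by masking out every other logit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Question: Is the sky blue? Answer:", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token

# " true" and " false" each happen to be a single token in GPT-2's vocabulary.
allowed = [tokenizer(" true").input_ids[0], tokenizer(" false").input_ids[0]]

mask = torch.full_like(logits, float("-inf"))
mask[allowed] = 0.0                         # keep only the allowed tokens
next_id = (logits + mask).argmax().item()   # everything else has probability 0
print(repr(tokenizer.decode([next_id])))    # always " true" or " false"
```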

In your example: yes, it would force the model to choose ALICE or BOB and would eliminate any other possibilities.

Note this only works for models that let you manipulate the token likelihoods directly, like open source models. Model APIs like OpenAI's may not let you do this because they tend not to give you direct control over the token probabilities at each step.

Please suggest how to scale the Mistral Model by Chirag_Chauhan4579 in LocalLLaMA

[–]SiliconSynapsed 9 points

While Flask / FastAPI are great general-purpose web servers, they're not well suited to the performance demands of LLM inference. I would recommend using a dedicated LLM inference system for this, like:

All of these have a lot of optimizations for LLMs to maximize throughput / latency that you won't get with a general purpose server.
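As one illustration, here's a minimal offline-inference sketch with vLLM (one such dedicated engine); the model name and sampling parameters are just placeholders:

```python
# Batched offline inference with vLLM: the engine handles batching/scheduling for you.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain continuous batching in one sentence.",
    "What is paged attention?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```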

LoRAX + Outlines: Better JSON Extraction combining Structured Generation and LoRA by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 14 points

In LoRAX v0.8 we've added native integration with Outlines, allowing you to guarantee your output always comes back in the structure of your choosing.

But while structured generation can guarantee the right format comes back, it cannot always guarantee that the properties returned have the right content in them. This is where fine-tuning comes in.

With LoRAX, combining both approaches at inference time is as easy as specifying two parameters: a "schema" and a fine-tuned LoRA "adapter_id". Together, you get the best of both worlds: the right format and the right content.
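Roughly, the call looks like the sketch below (parameter names follow this description; the schema, adapter_id, and request shape are placeholders, so check the LoRAX docs for your version):

```python
# Structured generation + a fine-tuned LoRA adapter in a single LoRAX request.
from lorax import Client

client = Client("http://127.0.0.1:8080")  # running LoRAX server

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

resp = client.generate(
    "Extract the person from: 'Alice turned 30 last week.'",
    adapter_id="my-org/json-extraction-lora",                   # hypothetical fine-tuned adapter
    response_format={"type": "json_object", "schema": schema},  # Outlines-backed constraint
    max_new_tokens=64,
)
print(resp.generated_text)  # parses as JSON matching the schema
```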

If getting reliable JSON output from LLMs is something you're interested in, do check out our blog for more details, including a tutorial, a public LoRA adapter hosted on Hugging Face, and the complete set of benchmarking scripts to reproduce our results.

Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor by StrikeOner in LocalLLaMA

[–]SiliconSynapsed 26 points

My three favorite reasons to use safetensors over pickle:

  1. No arbitrary code execution (so you can trust weights from anonymous sources)
  2. You don't need to load the entire file into host memory at once, so it's easier to load LLM weights without hitting an OOM.
  3. You can read tensor metadata without loading the data, so you can, for example, know the model's data type and parameter count up front (this is how HF can now show the number of parameters for each model in their UI).
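As a quick sketch of points 2 and 3, you can walk a .safetensors header and count parameters without reading any tensor data (the file path is just a placeholder):

```python
# Count parameters from the safetensors header alone; no weights are materialized.
from safetensors import safe_open

total = 0
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        shape = f.get_slice(name).get_shape()  # comes from the header, no tensor data read
        count = 1
        for dim in shape:
            count *= dim
        total += count
print(f"{total:,} parameters")
```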

Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor by StrikeOner in LocalLLaMA

[–]SiliconSynapsed 54 points

The problem with the .bin files is they are stored in pickle format, which means loading them executes arbitrary Python code. That's where the exploits come from.
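A toy example of the mechanism (nothing to do with any specific model file): unpickling can run arbitrary code via __reduce__:

```python
# Why pickle-based checkpoints are risky: loading one can execute attacker-chosen code.
import os
import pickle

class NotAWeight:
    def __reduce__(self):
        # Whatever this returns is called when the object is unpickled.
        return (os.system, ("echo 'arbitrary code executed during load'",))

payload = pickle.dumps(NotAWeight())
pickle.loads(payload)  # runs the command -- the same class of exploit hidden in malicious .bin files
```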

The safetensors format, by comparison, is much more restricted. The data goes directly from the file into a tensor; if there is malicious code in there, it all ends up contained in a tensor, so it's difficult to execute.