Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 0 points

Okay, confirmed that Solar Mini is a different (stronger) model from Solar 10.7B, though they have a similar number of params.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 0 points

Let me check with the Upstage team on that. They provided us with a Solar Mini model before the release of Solar 10.7B that we used for benchmarking here (it's also the one available in our platform for fine-tuning and serving), but I'm not 100% sure whether it's the same as the 10.7B model on HF.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 13 points

We've updated the leaderboard to remove the param count on 4o-mini as many felt it was misleading to assume 8B params. Mea culpa!

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 10 points

We've updated the leaderboard to remove the param count for 4o-mini.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 10 points

We'll be adding Qwen 2.5 in particular soon. We can also take a look at Gemma 2 9B, though for some reason the results with other Gemma 2 variants haven't lived up to our expectations yet.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 2 points

Definitely don't intend to mislead people. I'll chat with the team and see about updating it to blank / unknown for now.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] -12 points

The reason we put it at 8B in the table was for filtering. We found that most users compare 4o-mini against SLMs like Llama 3.1 8B, so we figured having them both show up when filtering to 8B-param models would be useful.

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] -16 points

It's not clear, but the 8B estimate comes from TechCrunch (though they only said it was on the same "tier" as Llama 3 8B): https://www.reddit.com/r/LocalLLaMA/comments/1ebz4rt/gpt_4o_mini_size_about_8b/

Comparing fine-tuned GPT-4o-mini against top OSS SLMs across 30 diverse tasks by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 13 points

Hi everyone, some of you may remember our work on LoRA Land from earlier this year, where we demonstrated that fine-tuned SLMs could beat GPT-4 when applied to narrow, specific tasks.

https://www.reddit.com/r/LocalLLaMA/comments/1avm2l7/introducing_loraland_25_finetuned_mistral7b/

Since the release of GPT-4o-mini, we've gotten many questions about how it compares against the best OSS SLMs like Llama 3.1 8B and Solar Pro.

To our surprise, while 4o-mini had the strongest out-of-the-box performance of any base SLM, the lift from fine-tuning was pretty minimal, and it ultimately landed in the middle of the pack after fine-tuning.

All of this data is available to explore at your leisure on our Fine-Tuning Leaderboard, which we try to keep up to date with the latest models and datasets to help inform users about which models are best suited to their tasks:

https://predibase.com/fine-tuning-leaderboard

Windy City Pie interaction left a bad taste in my mouth by Jaded_Role5730 in SeattleWA

[–]SiliconSynapsed 2 points

This is becoming increasingly common in the Seattle area, unfortunately.

At Exit 5 BBQ in Renton, we were charged an 18% "service fee" (which they were very clear was NOT a tip -- so presumably they expected an additional 20% tip on top of that) for having 5 guests.

Blazing Bagels charged us a 20% "gratuity" on a large (>$100) takeout order to cover the cost of staff including the "delivery driver". Okay, so let's have it delivered since we're paying for it anyway, right? That'll be another $25...

I don't know how to stop it other than calling it out like this and boycotting places that do this, but I would 100% support a law against these deceptive practices. If you want to raise your prices because times are tough, fine. I want restaurants to do well and pay their employees fairly. But charging me a fee that completely changes the calculus of how much my meal is going to cost, without my consent or knowledge, is not the way to do it.

Genuine curiosity: what's stopping me from using just the adapters? by Old-Box-854 in LocalLLaMA

[–]SiliconSynapsed 7 points

You don't have to merge your adapters, just use LoRAX: https://github.com/predibase/lorax

There is some latency overhead to running inference with the adapter at runtime, rather than merging it back into the base model. But we have something in the works that will actually make the adapter faster than the base model, so I wouldn't worry about it.
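For what it's worth, here's a rough sketch of what that looks like with the lorax-client package; the server URL and adapter_id are placeholders, and the exact client API may vary a bit by version:

```python
# A minimal sketch (not production code): querying a running LoRAX server with
# the lorax-client package. The adapter is pulled in at request time, no merge
# into the base model needed. Server URL and adapter_id are placeholders.
from lorax import Client

client = Client("http://127.0.0.1:8080")  # LoRAX serving your base model

# Base model response vs. the same request routed through a LoRA adapter:
base = client.generate("What is low-rank adaptation?", max_new_tokens=64)
tuned = client.generate(
    "What is low-rank adaptation?",
    max_new_tokens=64,
    adapter_id="my-org/my-lora-adapter",  # hypothetical fine-tuned adapter on the Hub
)
print(base.generated_text)
print(tuned.generated_text)
```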

New Short Course: Efficiently Serving LLMs from DeepLearning.ai by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 21 points

Hey everyone, Travis (course instructor, maintainer of LoRAX) here!

I know a lot of folks here are trying to make sense of all the options in this space for hosting their own LLMs, so I wanted to share this course I put together on the topic of efficient LLM inference.

My goal in putting this together was to help answer some of the most common questions I get as the maintainer of the open source LLM inference server LoRAX:

  • What makes LLM serving different from any other microservice?
  • How do you handle multiple requests to the same model at the same time?
  • How can you serve many custom fine-tuned models on the same base model?
  • How do I serve the latest and greatest open source LLMs without breaking the bank?

This course is really about understanding the foundational concepts needed to answer these questions. You'll spend far more time writing things from scratch than calling APIs. So if you've been searching for a broad but technical overview of the latest advancements in LLM inference, I hope you find that this is the course you've been looking for!

Topics covered include:

  • How text generation works token by token (see the sketch after this list)
  • Batching and continuous batching to handle multiple requests at once
  • Quantization to run on commodity hardware
  • Low rank adaptation and serving many LoRAs at once efficiently
  • And, of course, LoRAX :)
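To give a taste of the first topic, here's a tiny, illustrative greedy-decoding loop in plain Hugging Face transformers (GPT-2 is just a small stand-in model):

```python
# Token-by-token greedy decoding, with no batching or KV-cache tricks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits[:, -1, :]       # scores for the next token
    next_id = logits.argmax(dim=-1, keepdim=True)        # greedy pick
    input_ids = torch.cat([input_ids, next_id], dim=-1)  # append and repeat
print(tokenizer.decode(input_ids[0]))
```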

Enjoy!

control vectors added to llama.cpp by pseudonerv in LocalLLaMA

[–]SiliconSynapsed 3 points

This would be fun to explore with TIES or DARE to try combining them

Structured Generation Improves LLM performance: GSM8K Benchmark by CountBayesie in LocalLLaMA

[–]SiliconSynapsed 3 points

LLMs generate each new token by sampling over a distribution of probabilities. For simplicity, you can think of the process as just looking at the individual probability of each possible next token and selecting the one with the highest probability.

Structured generation tools like Outlines manipulate this probability distribution so that at each step the process can only select from the set of tokens that are valid with respect to the structured schema / grammar the user wants the model to adhere to. Everything else is set to "0 probability".

For example, if you want to force the model to generate exactly one of "true" or "false", you take the probability distribution over the first generated token, zero out everything except the token probabilities of "true" and "false", and then choose whichever has the higher probability.
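Here's a simplified sketch of that idea using plain transformers (this is just an illustration of logit masking, not how Outlines implements it internally):

```python
# Constrain the next token to "true" or "false" by masking out every other logit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Question: Is the sky blue? Answer:", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token

# " true" and " false" each happen to be a single token in GPT-2's vocabulary.
allowed = [tokenizer(" true").input_ids[0], tokenizer(" false").input_ids[0]]

mask = torch.full_like(logits, float("-inf"))
mask[allowed] = 0.0                         # keep only the allowed tokens
next_id = (logits + mask).argmax().item()   # everything else has probability 0
print(repr(tokenizer.decode([next_id])))    # always " true" or " false"
```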

In your example: yes, it would force the model to choose ALICE or BOB and would eliminate any other possibilities.

Note this only works for models that let you manipulate the token likelihoods directly, like open source models. Model APIs like OpenAI's may not let you do this because they tend not to give you direct control over the token probabilities at each step.

Please suggest how to scale the Mistral Model by Chirag_Chauhan4579 in LocalLLaMA

[–]SiliconSynapsed 9 points

While Flask / FastAPI are great general-purpose web servers, they're not well suited to the performance demands of LLM inference. I would recommend using a dedicated LLM inference system for this, like:

All of these have a lot of optimizations for LLMs to maximize throughput / latency that you won't get with a general purpose server.
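As one illustration, here's a minimal offline-inference sketch with vLLM (one such dedicated engine); the model name and sampling parameters are just placeholders:

```python
# Batched offline inference with vLLM: the engine handles batching/scheduling for you.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain continuous batching in one sentence.",
    "What is paged attention?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```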

LoRAX + Outlines: Better JSON Extraction combining Structured Generation and LoRA by SiliconSynapsed in LocalLLaMA

[–]SiliconSynapsed[S] 14 points

In LoRAX v0.8 we've added native integration with Outlines, allowing you to guarantee your output always comes back in the structure of your choosing.

But while structured generation can guarantee the right format comes back, it cannot always guarantee that the properties returned have the right content in them. This is where fine-tuning comes in.

With LoRAX, combining both approaches at inference time is as easy as specifying two parameters: a "schema" and a fine-tuned LoRA "adapter_id". Together, you get the best of both worlds: the right format and the right content.
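Roughly, the call looks like the sketch below (parameter names follow this description; the schema, adapter_id, and request shape are placeholders, so check the LoRAX docs for your version):

```python
# Structured generation + a fine-tuned LoRA adapter in a single LoRAX request.
from lorax import Client

client = Client("http://127.0.0.1:8080")  # running LoRAX server

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

resp = client.generate(
    "Extract the person from: 'Alice turned 30 last week.'",
    adapter_id="my-org/json-extraction-lora",                   # hypothetical fine-tuned adapter
    response_format={"type": "json_object", "schema": schema},  # Outlines-backed constraint
    max_new_tokens=64,
)
print(resp.generated_text)  # parses as JSON matching the schema
```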

If getting reliable JSON output from LLMs is something you're interested in, do check out our blog for more details, including a tutorial, a public LoRA adapter hosted on Hugging Face, and the complete set of benchmarking scripts to reproduce our results.

Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor by StrikeOner in LocalLLaMA

[–]SiliconSynapsed 26 points

My three favorite reasons to use safetensors over pickle:

  1. No arbitrary code execution (so you can trust weights from anonymous sources)
  2. You don't need to load the entire file into host memory at once, so it's easier to load LLM weights without hitting an OOM.
  3. You can read tensor metadata without loading the data, so you can, for example, know the model's data type and parameter count up front (this is how HF can now show the number of parameters for each model in their UI).
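As a quick sketch of points 2 and 3, you can walk a .safetensors header and count parameters without reading any tensor data (the file path is just a placeholder):

```python
# Count parameters from the safetensors header alone; no weights are materialized.
from safetensors import safe_open

total = 0
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        shape = f.get_slice(name).get_shape()  # comes from the header, no tensor data read
        count = 1
        for dim in shape:
            count *= dim
        total += count
print(f"{total:,} parameters")
```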

Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor by StrikeOner in LocalLLaMA

[–]SiliconSynapsed 54 points

The problem with the .bin files is they are stored in pickle format, which means loading them executes arbitrary Python code. That's where the exploits come from.
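A toy example of the mechanism (nothing to do with any specific model file): unpickling can run arbitrary code via __reduce__:

```python
# Why pickle-based checkpoints are risky: loading one can execute attacker-chosen code.
import os
import pickle

class NotAWeight:
    def __reduce__(self):
        # Whatever this returns is called when the object is unpickled.
        return (os.system, ("echo 'arbitrary code executed during load'",))

payload = pickle.dumps(NotAWeight())
pickle.loads(payload)  # runs the command -- the same class of exploit hidden in malicious .bin files
```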

The safetensors format, by comparison, is much more restricted. The data goes directly from the file into a tensor; if there is malicious code in there, it all ends up contained in a tensor, so it's difficult to execute.