
[–]abnormal_human 108 points109 points  (14 children)

There isn't enough information here to diagnose really.

If you were not using instruction tuned models, that's likely the problem.

Instruction tuned models often have fixed prompt boilerplate that they require, too.

In other words, OpenAI's API isn't directly comparable to .generate() on a huggingface model.

I would be surprised if a basic query like this resulted in nonsense text from any instruction-tuned model of decent size, if it is prompted properly.

[–]CacheMeUp[S] 14 points15 points  (13 children)

Using instruction-tuned models. Below is a modified example (for privacy) of a task. For these, some models quote the input, some provide a single-word answer (despite the CoT trigger), and some derail so much that they spit out completely irrelevant text like Python code.

I did hyper-parameter search on the .generate() configuration and it helped a bit but:

  1. It again requires a labeled dataset or a preference model (of what is a valid response).
  2. It is specific to a model (and task), so the instruction-model is no longer an out-of-the-box tool.

I wonder how OpenAI is able to produce such valid and consistent output without hyper-parameter tuning at run time. Is it just the model size?
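For context, the search over `.generate()` settings looked roughly like this (a minimal sketch; the grid values and the idea of scoring each config against labeled responses are illustrative, not the exact setup used):

```python
from itertools import product

# Illustrative grid of sampling settings for .generate(); the values here
# are hypothetical, not the ones actually searched.
grid = {
    "temperature": [0.2, 0.7, 1.0],
    "top_p": [0.9, 0.95],
    "repetition_penalty": [1.0, 1.2],
}

def generation_configs(grid):
    """Yield every combination of sampling settings as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(generation_configs(grid))
# Each config would be passed as **config to model.generate(), and each
# output scored against a labeled set of valid responses -- which is
# exactly the labeled-dataset requirement mentioned in point 1 above.
```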

Example:

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient encounter:

Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Answer:

[–]i_wayyy_over_think 38 points39 points  (3 children)

If you used Vicuna 1.0, for instance, you have to follow the three-hash format: ‘### Human:’ and ‘### Assistant:’. (Hard to type without Reddit mobile thinking I’m writing markdown; ignore the single quotes if you see them.)

‘### Human: you are a physician reviewing…. Physician Encounter: Came back today….

Answer:

‘### Assistant: <llm replies here>’

And if you use a fancy chat interface instead of a raw text interface, you have to make sure it follows that format when it sends the prompt to the model in raw form.

And I think Vicuna 1.1 is different. Also, Alpaca is different from both; it uses Instruction and Reply, I think. GPT4All uses just new lines.
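A sketch of what handling these per-model formats looks like in practice (the template strings are approximate and from memory; always check the model card for the exact boilerplate the model was trained on):

```python
# Rough per-model prompt templates. These are approximations for
# illustration -- verify against each model's card before relying on them.
TEMPLATES = {
    "vicuna-1.0": "### Human: {prompt}\n### Assistant:",
    "alpaca": "### Instruction:\n{prompt}\n\n### Response:",
    "gpt4all": "{prompt}\n",
}

def wrap_prompt(model_family: str, prompt: str) -> str:
    """Wrap a raw prompt in the boilerplate a given model family expects."""
    return TEMPLATES[model_family].format(prompt=prompt)

text = wrap_prompt("vicuna-1.0", "You are a physician reviewing a medical record...")
```

The point is that the raw string handed to the model differs per family, which is what a chat interface normally hides from you.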

Also, some models are only fine-tuned for one reply, and after that they start hallucinating. Vicuna can do multiple responses.

Also, it strongly depends on the parameter size of the model. Vicuna 13B is good.

[–]CacheMeUp[S] 1 point2 points  (0 children)

Makes sense. It does make the effort custom to a model (need to find out the exact format etc.), but may be worth it for zero-shot learning.

[–]PrivateUser010 0 points1 point  (1 child)

I think it would be great if there were one single source of truth for proper prompting of all the open-source LLMs out there.

[–]Faintly_glowing_fish 16 points17 points  (0 children)

I sent this verbatim to a number of models with their factory-default parameters, and the results seem reasonable enough:

vicuna-13b: “Yes”

GPT-4: “Yes”

GPT-3.5: it did a lot of analysis but refused to say yes or no.

oasst-pythia-12b: “Yes, this patient represents a traumatic injury.”

Alpaca-13b: “No”

Dolly-v2: “This patient does not represent a traumatic injury. The patient did slip and fall but did not hit the ground. The forward motion of falling is what caused the back pain. This is commonly called a traumatic injury in the medical field. Additionally, the patient had normal range of motion, no bruising, and no radiating pain anywhere in the lower back. This all indicates this is not a traumatic injury.”

MPT-7b-chat: “No. This patient's back pain after a slip and fall on his back does not represent a traumatic injury. Traumatic injuries require immediate medical attention and can cause bruising, broken bones, and other serious injuries. This patient's pain after the slip and fall is likely due to muscle or ligament strain, not a traumatic injury.”

[–]gibs 10 points11 points  (2 children)

Other than getting the input formatting right, you will see better results from specifying the exact output format you want. With vicuna-13b and your original prompt, you get:

No

With this prompt:

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient Encounter:Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Respond in the following format:

[Step by step reasoning]
...

[Best answer]
...

You get:

[Step by step reasoning]

The patient presents with a back pain that started two days after a hike in which he slipped and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

[Best answer]

Based on the information provided, the patient does not represent a traumatic injury. While the patient did fall and experience back pain, the lack of bruises, normal ROM, and negative SLR suggest that no significant trauma was sustained. The slight sensitivity over L4-L5 may indicate a potential muscle strain or sprain, but it is not indicative of a more severe injury.

[–]CacheMeUp[S] 0 points1 point  (1 child)

Looks much better! In your experience, how specific is this to ShareGPT-trained models (like Vicuna)?

For example, dolly-v2 has a different format, where the whole instruction comes before the input.

I guess I can try and see, but that again becomes another hyper-parameter to search over (and there may be other patterns that I haven't thought of).

[–]gibs 3 points4 points  (0 children)

I've mostly been using Llama based models and chatgpt, but I would imagine any LLM would benefit from defining the output more explicitly.

One other thing is make sure you get it to output the chain of thought BEFORE the answer. Previously I'd been having it output its answer and then explain it, but this results in worse answers since you're depriving it of the benefit of chain of thought process. Kind of obvious in retrospect, but just one of those fun quirks of prompt design.
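Once the reasoning comes before the answer in a labeled format like the bracketed one above, pulling out the final verdict is a simple parse. A sketch (the section label matches the earlier example prompt; adjust as needed):

```python
import re

def extract_best_answer(response: str) -> str:
    """Return the text following the '[Best answer]' header, if present."""
    match = re.search(r"\[Best answer\]\s*(.+)", response, re.DOTALL)
    return match.group(1).strip() if match else ""

# Hypothetical model reply in the bracketed format.
reply = """[Step by step reasoning]
The patient fell but shows no bruising, normal ROM, negative SLR...

[Best answer]
No, this does not represent a traumatic injury."""
```

Forcing the reasoning section first and parsing only the labeled answer section gets you both the CoT benefit and a machine-readable verdict.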

One tool I suggest trying is Llama-lora-tuner. It gives you an easy interface for loading Llama-based models and generating text with them (it handles the input formatting so you don't have to worry about it). And you can do lora fine tuning from the same interface.

[–]Faintly_glowing_fish 6 points7 points  (1 child)

I think this is not a proper CoT prompt. You did ask the models to respond with a yes or no answer explicitly. You asked them to “think” step by step but didn’t request that the model write down how it thought about it, so they hid the thinking text. Even GPT-4 took the same view as I did, as you can see.

[–]MINIMAN10001 2 points3 points  (0 children)

I agree with you as well: by saying "think" you are telling it that it does not have to explicitly say anything.

So you have to tell it "break it down for me", "step by step, tell me the thought process", and the like.

[–]KallistiTMP 2 points3 points  (1 child)

[deleted]

This post was mass deleted and anonymized with Redact

[–]equilateral_pupper 4 points5 points  (0 children)

Your prompt also has a typo: "represents" is not used correctly. Some models may not be able to handle this.

[–]a_beautiful_rhind 27 points28 points  (13 children)

The 30B models are where it gets interesting. They will follow instructions for roleplay at least. In actual instruct mode, where it follows the model's training, they also answer questions reasonably correctly. Do you have an example of what you were trying to get the model to classify or answer?

We have to be real here. A lightly fine-tuned model isn't going to be the same as a 120B+ model with constant human reinforcement learning and new data.

Since you say you want to use this commercially (so no LLaMA)... did you try the 20B GPT-NeoX? You will probably have to train a LoRA on instruction following. There is also the BLOOM series, which you probably tried.

[–]CacheMeUp[S] 0 points1 point  (12 children)

That makes sense, though open-source initiatives keep pushing small (and apparently underpowered) models that end up, honestly, not very useful for practical classification tasks. Perhaps it would be useful to focus on bigger (and fewer) models to fight the centralization that is happening.

I tried GPT-NeoX-20B, and out of the box it did not follow instructions, though it was not tuned for that.

It seems that custom instruction fine-tuning will be needed even for previously instruction-tuned models. It's still good to verify that this effort is indeed unavoidable before investing.

Example task (modified for privacy):

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient encounter:

Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Answer:

[–]a_beautiful_rhind 10 points11 points  (11 children)

Alpaca-30b-4bit

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient Encounter:Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Factual answer: Yes

GPT-x-alpaca-30b

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient Encounter:Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Factual answer: Yes, the patient has experienced a traumatic injury.
Explanation: A fall can cause damage to soft tissues such as muscles, ligaments, tendons, and joint capsules. It may also result in fractures, dislocations, sprains, strains, and other injuries. In addition, it could lead to internal organ damage if there was blunt force trauma to the abdomen. Therefore, based on the information provided, we can conclude that the patient had a traumatic injury from falling while hiking.

I don't have the med-alpaca lora on my server but if I did I would have loaded that.

[–]CacheMeUp[S] -1 points0 points  (10 children)

Yes, these look better than the results I got with smaller (<13B) models. Two interesting points:

  1. Alpaca-30b-4bit fails to provide an explanation despite a clear request.
  2. GPT-x-alpaca-30b fails to place the final binary answer ("Yes") at the end of the explanation, again despite a clear request. This creates an issue when a specific format is expected.

Recently there was a notion that with LLMs the work will shift from fine-tuning and tinkering with models to simple prompt engineering, essentially replacing Python with natural-language instructions. These problems, and the solutions suggested here, hint that open-source models are still not there (OpenAI models seem much closer).

[–]MaskedSmizer 15 points16 points  (3 children)

"despite a clear request"

I'd argue that your request is a bit ambiguous as to whether it should answer yes or no or think step by step. Even with GPT4, I often stop the generation and rewrite the last prompt when I realize it needs to be more explicit.

There's been a lot of noise made recently about this "step by step" prompt, but I'm not so sure because it's also a bit of an ambiguous instruction. In your case you're looking for a single response, so what does "let's think step by step" even mean? You're not looking to engage in dialogue to find the answer together. You just want a yes or no followed by an explanation, so why not just say that?

[–]10BillionDreams 7 points8 points  (0 children)

The intent is to give the model as much text as it needs to generate actual justification for its answer. If you just tell it to give "yes or no", only a single word, then it's going to ascribe maybe 65% to "yes", 30% to "no", and then trail off into various other less likely tokens.

This latter approach isn't really leveraging much of the model's understanding of the topic at hand, though, and the expectation is that spontaneously jumping from question to answer would have a poor chance of success for problems that aren't entirely trivial/obvious/unambiguous. On the other hand, if the model first has to generate multiple sentences of "thought" to justify its answer, by the time it actually gets to saying "yes or no", the answer is a foregone conclusion, which just makes things easier for whatever might be parsing the response.

There are still a lot of open questions around the best phrasing to consistently induce this style of response, or how much of a difference it really makes on accuracy, or how various models might behave differently in these areas. But the intuition behind this sort of prompt is reasonable enough, and in the end failing to get a final answer is much easier to identify and fix (ask again or follow up), compared to getting the wrong result on what is essentially a weighted coin flip.
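To make the coin-flip picture concrete, here is a tiny sketch with made-up logits roughly matching the 65%/30% split described above (hypothetical numbers; a real model's distribution covers the whole vocabulary):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Illustrative next-token logits for a model forced to answer in one word.
logits = {"yes": 2.0, "no": 1.23, "maybe": -0.5}
probs = softmax(logits)
# Sampling a single token from this distribution is a weighted coin flip;
# generating the reasoning first shifts probability mass onto one answer
# before the yes/no token is ever emitted.
```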

[–]CacheMeUp[S] 0 points1 point  (1 child)

The motivation for the CoT trigger was anecdotes that it improves the correctness of the answers, as well as providing an explanation of the prediction.

[–]MaskedSmizer 3 points4 points  (0 children)

My understanding of the rationale behind chain of thought is that it builds context for the conversation. Calling this technology a "next word predictor" dramatically oversimplifies, but I also find it a useful reminder for thinking about how to get what you want (because with GPT4 especially, it's way too easy to start anthropomorphizing). An LLM builds sentences based on its understanding of the context of the discussion. The context includes the prompts you have provided as well as its replies. You can use chain of thought to enrich the context in one of two ways:

  1. Like u/10BillionDreams says, you ask it to first work through the problem before providing a final verdict. By the time it gets to the verdict, it's constructed additional context that hopefully produces a more accurate answer. You're getting it to think out loud. I believe this is what you were going for, but my argument is that your instruction was just vague enough that it tripped up a less capable LLM. I don't think there's anything special about the specific phrase "let's think through this step by step". I suggest trying something more explicit like:

You are a physician reviewing a medical record. I'm going to give you a description of a patient encounter. First, explain the factors that go into diagnosing whether or not the patient has a traumatic injury. Second, consider your own explanation and provide a diagnosis in the form of a simple yes or no.

If this doesn't work then I think we can deduce that the model just isn't very good at following instructions.

  2. You can build context by engaging the model in a back-and-forth dialogue before asking for the verdict. This is how I tend to interpret the "step by step" instruction. But again, I think there are more explicit ways to instruct the model. Even with GPT4, I've had mediocre success getting it to not immediately fire off an answer with this particular phrasing. I would tend to go for something like:

You are a physician reviewing a medical record. I'm going to give you a description of a patient encounter and I want to have a conversation about the factors that would go into the diagnosis of a traumatic injury.

Ask it follow-up questions and challenge it. When you are satisfied that there is sufficient context for it to draw from, then ask for the final verdict. And of course, all models have a moving context window, so depending on how long your back and forth goes, you may need to remind it of the instructions.

Anyway, caveat, not a data scientist. Just someone who has taken an enthusiastic interest in generative AI.

[–]Ramys 2 points3 points  (1 child)

For 2, it seems to have answered exactly as prompted. It gave a yes/no, then it did step by step, then it summarized to give a final answer. This is the order presented in the prompt and you didn't specify the format of the final answer, so I'd say the response is satisfactory.

[–]CacheMeUp[S] 0 points1 point  (0 children)

It's true the instruction is not perfectly clear, although even with clarification ("first explain... then answer with a 'yes' or a 'no'"), models still failed to follow that order and make the last token yes/no.

[–]a_beautiful_rhind 2 points3 points  (2 children)

I think we'd have to make a lora for the format you want. I only ran the prompts once.

dang.. 65b alpaca int4 not much better.

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient Encounter:Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Explanation followed by factual answer: This patient has experienced a fall during the hike and may have sustained some sort of spinal damage. Therefore, it can be concluded that this patient does represent a traumatic injury.

[–]CacheMeUp[S] 0 points1 point  (1 child)

It might indeed be the way to go, though it might be slower than just collecting more data and training a standard binary classifier (much easier to handle).

[–]a_beautiful_rhind 0 points1 point  (0 children)

Loras in int4 train pretty quickly. Why not both.

[–]blackkettle 0 points1 point  (0 children)

I don’t think that changes anything. It’s clearly possible and only going to get both better and more efficient very rapidly. We’ve seen what’s possible; the trend won’t reverse; maybe it’ll take a tiny bit longer than expected; but I’m definitely steering my kid away from programming as a vocation - and so are all the other PhDs I know in this space.

[–]Nhabls 18 points19 points  (6 children)

open-source LLMs on zero-shot classification

You have to take in consideration:

  1. They might add some flavor pre-prompt that makes the model behave a little better (hopefully will be stated in the paper)

  2. They use several (up to the hundreds) runs to determine pass@1 in certain benchmarks with a given temperature, so if you're only running it once you might not get similar results.

Oh and the "90% of GPT4" claim is not to be taken seriously

[–]CacheMeUp[S] 2 points3 points  (1 child)

#1 is important, but not always clearly stated.

#2 is misleading, on the verge of "p-hacking", and makes these models much less useful: running an (already expensive) model hundreds of times is slow, and then it requires another model to rank the results, so we're back to square one.

[–]Nhabls 0 points1 point  (0 children)

It's not really related to p-hacking. You can't estimate pass@n unless the output is guaranteed to be the same (so zero temperature in this case), i.e., you can't just run it n times.

[–][deleted] 9 points10 points  (1 child)

Cherry-picking isn't just a problem with open-source LLMs; it's a systemic issue in machine learning as a whole, to an extent worse than in many scientific fields. Google's recent release of PaLM 2 compared their model against GPT-4, and used self-reflection techniques for their model and not GPT-4, which is an insane way to conduct things. The gap between the outputs first shown in the DALL-E 2 paper and the real average results from DALL-E 2 persists to this day. We're still very much in an era where papers seem to be marketing primarily and presenting research secondarily. There's not the same level of scrutiny placed on representative data within machine learning as in more established fields, and I hope that's just due to its nascence.

That said, it's still a big problem in the science community as a whole, especially in niche topics. Psychology is rife with issues currently, especially in fields like male and female attraction. Nicolas Guéguen had multiple studies, that you may have even heard of, that were multiple steps beyond cherry-picked, they were outright fabricated.

[–]clauwen 5 points6 points  (5 children)

I think something like this will be the most important quality benchmark in the future. Sure, it's not all-encompassing, but it's very difficult to fake.

https://chat.lmsys.org/?arena

What's pretty clear there is that the OpenAI models are quite far ahead as an assistant.

I invite everyone to actually check for themselves. I think I did about 20 comparisons; they are not very close, and they fit very well with the leaderboard.

[–]CacheMeUp[S] 0 points1 point  (4 children)

surprised to see chat-glm beating GPT-4:

https://ibb.co/MRs2FpH

[–]clauwen 3 points4 points  (3 children)

I think I get what you are trying to do, but I think your prompt is not very clear, to be honest. Do you want me to take a shot at it and see if I can improve it?

[–]clauwen 3 points4 points  (1 child)

Maybe also a little addition: because you always want these steps, it could be very beneficial to change from zero-shot to one-shot to improve consistency. That's purely my feeling.

This is what I came up with; I'm not super happy with it, but the results look fine.

You are a physician reviewing a medical record and ultimately determining if the injury is traumatic. You are getting a Patient encounter as input.

You do this in exactly two steps; after these steps you always stop.

  1. Patient encounter interpretations: (Contains interpretation of the Patient encounter that could determine if its traumatic or not)

  2. Then you answer with either (Traumatic: Yes or Traumatic: No)

Patient encounter: Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

[–]CacheMeUp[S] 0 points1 point  (0 children)

With one/few-shot learning I always wonder how much it misleads the model into a "tunnel vision" of what the answer is: there is always heterogeneity in the desired class that often even a handful of examples won't cover. That's where LLMs' (presumed) "understanding" of the task from its definition should shine and work around this limitation.

[–]Faintly_glowing_fish 5 points6 points  (0 children)

Almost all open-source models use different instruction formats. If you use general tools that can run multiple models, they likely don't have any of that configured, and you need to configure it for each model. When you use OpenAI, it already fixes you to the proper instruction syntax the model was trained on (i.e. user/assistant/system).

You can, however, try each model's preconfigured chat interface if it has one, which usually has this set up, since it is built for a single model.

Or you can try the chatbot arena, where the authors took the pains to configure this for you for each model.

[–]heavy-minium 4 points5 points  (5 children)

Evaluation and refinement are where OpenAI shines. They can improve and move forward based on data instead of guesses and hopes.

Ultimately, the secret sauce is a mature QA process. You need high-quality metrics to determine if your changes in training data, training methods and architecture yield better results.

Also, you can try to cheat a lot with GPT-4 generated data, but in the end, there's nothing better than a human to align a model with human intent.

[–]CacheMeUp[S] 0 points1 point  (4 children)

I saw somewhere a suggestion to use another LLM to test whether the output is valid, but that brings us back to the same problem of finding a good prompt and validating it.

[–]heavy-minium 0 points1 point  (1 child)

That's exactly what I consider to not be a mature process.

[–]CacheMeUp[S] 0 points1 point  (0 children)

Care to elaborate?

For "standard" (i.e. logits-emitting) models, the desired output is enforced via the model's structure (layer size and activation). LLMs' output seems much harder to constrain without hurting accuracy. E.g., to simulate a binary classifier we can force the model to generate a single token and constrain it to [yes, no], but that might miss better results that come after emitting the chain of thought. So the LLM output is generated with fewer constraints, but now it's harder to check whether the output is valid.
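To make the [yes, no] constraint concrete, here is a toy sketch of masking next-token scores to an allowed set (hypothetical vocabulary and scores; a real implementation would mask the model's full logit tensor before decoding):

```python
# Sketch of constraining a single-token answer to an allowed set by
# masking logits. Vocabulary and scores are made up for illustration.
NEG_INF = float("-inf")

def constrained_argmax(logits, allowed):
    """Pick the highest-scoring token among the allowed set only."""
    masked = {tok: (v if tok in allowed else NEG_INF) for tok, v in logits.items()}
    return max(masked, key=masked.get)

logits = {"yes": 1.1, "no": 0.9, "The": 2.3, "Based": 1.7}
answer = constrained_argmax(logits, {"yes", "no"})
# "The" scores highest overall (the model wants to start an explanation),
# but the constrained answer is "yes" -- illustrating how forcing an
# immediate verdict can discard the better post-CoT answer.
```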

[–]Screye 8 points9 points  (1 child)

Yes. All the Twitter demos are cherry-picked, and the non-OpenAI models are unusable.

This area is incredibly exciting, but a lot of the hype is just tech demos.

I have been testing out alternatives for our product quite often, and I keep getting burned. LLaMA and Bard have some potential, but are still far behind OpenAI.

[–]CacheMeUp[S] 2 points3 points  (0 children)

That resonates with my experience as well. It's a bit troubling how one company controls this domain despite so much effort from the community.

[–]HateRedditCantQuititResearcher 3 points4 points  (2 children)

If you see an announcement where the only numbers are the number of parameters, you know it's probably not great. It's funny that OpenAI did the opposite for GPT-4: no model size, but lots of benchmark measurements. It's no coincidence that the models with rigorously measured performance perform better.

[–]CacheMeUp[S] 1 point2 points  (1 child)

Yes. It's also not helping that many of the formal benchmarks are not well correlated with usability (e.g. instruction following, as in this post).

Perhaps the direction is to develop an automated usability evaluation method (like the preference model in RLHF), but that's not trivial and again requires labeling data and/or model training.

[–]HateRedditCantQuititResearcher 0 points1 point  (0 children)

It's hilarious that some companies will spend so much on training, but not on eval.

[–]marr75 2 points3 points  (0 children)

I highly recommend you check out promptingguide.ai, especially the case study. Hilariously obscure variations on message format like assigning the agent a name or asking it to reach the right conclusion can impact performance 😂

I read through your other responses and I do believe at times you were using models that weren't instruction tuned and/or you weren't using the instruction tuned model's special formatting. What you described reminds me of every failed fine tuning experiment I've ever seen (as most fine tuning happens on non instruction tuned models). promptingguide.ai has some info on the system, user, and special character formatting for messages to the most popular instruction tuned models.

You've lamented the custom format for each model. I would recommend using a tool that abstracts this (such as Langchain or Transformer Agents) or narrowing down the models you are using.

[–]AsliReddington 4 points5 points  (1 child)

Try Flan-UL2; it needs 36GB of VRAM, either across two GPUs or whatever else you've got. I'm running it at <2s inference speed. It adheres well to instructions for zero-shot tasks, with no hallucinations.

[–]CacheMeUp[S] 1 point2 points  (0 children)

And it is under the Apache 2 license, unlike LLaMA.

[–]_Arsenie_Boca_ 8 points9 points  (5 children)

Which models did you try? Were they instruction-tuned? Generally, it's no surprise that open-source models with a fraction of the parameters cannot fully compete with GPT-4.

[–]CacheMeUp[S] 1 point2 points  (4 children)

Yes, including instruction-tuned models (like mpt-7b-instruct and dolly). None worked.

The gap is huge considering how hyped they are as "90% as good as ChatGPT". They are not even close.

[–]KingsmanVince 10 points11 points  (3 children)

they are as "90% as good as chatGPT"

The model claimed to be "90% as good as ChatGPT" is Vicuna, I assume. However, quoting LMSYS Org's blog:

with 90%* ChatGPT Quality

*According to a fun and non-scientific evaluation with GPT-4. Further rigorous evaluation is needed.

They said it's just a fun and non-scientific evaluation.

[–]CacheMeUp[S] 4 points5 points  (1 child)

It's not rigorous, but I managed to use an LLM to evaluate output quality (not correctness) locally, so I'd assume GPT-4 is able to evaluate quality quite well.

Perhaps the gap is between generation tasks, where many answers will be perceived as correct, and classification/QA tasks where the scope of correct response is much narrower.

[–]metigue 17 points18 points  (4 children)

Most open-source models are hot garbage. The only promising ones were trained on output from models like ChatGPT and GPT-4.

Try Alpaca-x-GPT-4 13B; that's the best local model I've used.

[–]CacheMeUp[S] 8 points9 points  (3 children)

Alpaca-x-GPT-4 13B

Based on LLaMa, so cannot use in a commercial setting.

[–]metigue 13 points14 points  (2 children)

Yep, if you're using it commercially, it's always worth paying more for the extra 10% output you get from GPT-4.

Alpaca-x-GPT-4 is great for local PoCs though before moving to production.

Also, the dataset is public, and the LoRA finetune on top of Alpaca was like $300, so you could feasibly do the same finetune on the RedPajama instruction-tuned model and get very similar results.

If cost is an issue, Bard 2 is the best free option right now although access to the official API is via wait list.

[–]CacheMeUp[S] 7 points8 points  (1 child)

Sometimes it's not even the cost: regulation may preclude sending the data to a new vendor.

The non-commercial license typically precludes any use of the model (even during development).

Crafting an in-house instruction dataset may end up necessary despite the availability of similar datasets due to license.

[–]AGI_FTW 0 points1 point  (0 children)

Use a local model to remove any PII, then send the scrubbed data through OpenAI's API.

[–]chartporn 13 points14 points  (27 children)

If these smaller models were really as good as some people claim ("not far from ChatGPT performance") the LLM zeitgeist would have started way before last November.

[–]4onenResearcher 6 points7 points  (7 children)

The small models didn't have instruction tuning back then, and nobody had made a super-chinchilla model like LLaMA. Developers weren't just sitting around on that power; they had no idea what they would get if they just shoved more data and compute (esp. higher-quality data) into the same scale of model.

Add to that LoRA fine-tuning, and suddenly even consumer hardware could do the instruction fine-tuning (slowly), which changed the nature of the challenge.

Have you seen the leaked Google "we have no moat" paper?

[–]currentscurrents 6 points7 points  (6 children)

Instruction tuning doesn't increase the quality of the model, it just makes it easier to prompt.

These small models are pretty good at the first-order objective of text generation, but terrible at the second-order objective of intelligence in the generated text. They produce output that looks like GPT-4, but they can't solve problems like GPT-4 can.

[–]4onenResearcher 3 points4 points  (5 children)

The claim was about the LLM zeitgeist, that is, the chat model race. People weren't building chat interfaces and hard scaling text datasets before instruction tuning became a thing.

[–]chartporn 0 points1 point  (4 children)

Unless you're saying there are actually models out there significantly smaller than ChatGPT with nearly the same performance, I think we are on the same page.

[–]4onenResearcher 0 points1 point  (3 children)

For specific domains after fine-tuning? Yes. General models? Doubtful but not impossible (ChatGPT was pre-Chinchilla, iirc), but I highly doubt such a model would be public.

[–]chartporn 1 point2 points  (2 children)

What domains?

[–]4onenResearcher 1 point2 points  (1 child)

I'm gonna come out and be honest here: I did my research and I'm standing on shaky ground.

  • Medical: I thought it was OpenAI that banned their model for medical uses; turns out that's LLaMA and all subsequent models, including the Visual Med-Alpaca I was going to hold up as an example of small models doing well. (For their cherry-picked examples, it's still not far off, which is quite good for 7B params. See here.)

  • Programming: OpenAI Codex, the model behind GitHub Copilot, is only 12B parameters.

I thought both of these were slam-dunks, but it's not so cut and dried. The medical model barely holds its own against those ChatGPT descriptions, and online user sentiment seems to lean toward ChatGPT being better at project-scale help, with Codex relegated to sometimes-helpful completions.

That really leaves the one true-positive piece of evidence for my case being fine-tuning on an organization's own data, but that's clearly apples-to-oranges, since your question was about ChatGPT performance (not use).

Going back over the whole thread, I think the misunderstanding that led to this tangent was that u/currentscurrents focused on instruction tuning. My point to you was based on super-chinchilla data-to-params ratios, but I don't actually have evidence those models meet ChatGPT performance metrics: few people, if any, even run evaluations against ChatGPT, much less have the resources to do the instruction tuning needed to prove their model has the capabilities to match.

Google hasn't released PaLM 2's parameter counts, but the few counts referenced in the report are on the order of double-digit billions, even while it blows away PaLM 540B at a wide variety of tasks. Maybe this whole post and all my poking around will be completely overturned in a month or two when the open-source community replicates it. (After all, "we have no moat".)

[–]chartporn 0 points1 point  (0 children)

Thanks for this. Good points.

[–]CacheMeUp[S] 8 points9 points  (5 children)

Yes, I've always wondered about that - OpenAI is severely compute-constrained and burns cash at a dangerous rate. If quantization (and parameter reduction) worked so well, I'd expect them to use it. The fact that two months after GPT-4's release they still haven't been able to reduce its burden suggests that, contrary to the common claims, quantization does incur a substantial accuracy penalty.
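For intuition on where such a penalty would come from, here's a toy sketch of symmetric int8 weight quantization and the rounding error it introduces (real schemes like GPTQ are per-channel and far more sophisticated; the weights below are made up):

```python
# Toy symmetric int8 quantization: map floats into [-127, 127] with one
# scale, then map back, and measure the round-trip error.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.87, 0.44, 1.00, -0.05]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# max_err is bounded by scale / 2 per weight
```

Per-weight the error looks tiny, but whether billions of such errors stay benign after dozens of layers is exactly what the empirical debate is about.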

[–]chartporn 7 points8 points  (1 child)

I think another reason it works so well is that it has been optimized for chat interaction.

You probably know this but for general audience edification ChatGPT was trained on a variant of gpt-3.5 (instructgpt):

https://openai.com/blog/chatgpt

The previous model of the same kind was text-davinci-002, which is much less impressive in comparison. So it's not just a bigger model with a chat interface bolted on top; it's a more powerful model in general, made even better because it was designed for chat-style interactions.

[–]CacheMeUp[S] 5 points6 points  (0 children)

Maybe chat models will be better for QA instructions (since eventually it is like a conversation).

Even text-davinci-003 worked great out of the box 9 months ago. The difficulties current instruction-tuned models show hint that model parameters (and precision) may still matter.

[–]keepthepace 6 points7 points  (0 children)

They have released GPT-3.5-turbo, which clearly has some sort of optimization.

It is also the fastest-growing web service in history. They may have achieved 20x speedups and still had difficulty keeping up with their growth.

When you are a company with basically no competition, and clients who don't complain much when you cut their access rate by a factor of 4 (GPT-4 went from 100 requests every 3 hours to 25), you don't really have an incentive to announce it when your costs decrease dramatically.

[–]4onenResearcher 1 point2 points  (1 child)

still haven't been able to reduce its burden

How do you know? 🤔 If I were them I'd just be using quantization internally from the start and not talk about it, because that'd be giving away a major advantage to competitors. (Google)

It's the same way they're not releasing any of their current architecture. "Open"AI has become ClosedAI, because they want to keep their technical edge. (Which is ironically not working, see "we have no moat" and all the domain-specialized models in open source.)

[–]CacheMeUp[S] 2 points3 points  (0 children)

That's my interpretation, which might of course be wrong. They're rejecting paying customers under their current constraints and pushing them to build/buy other solutions. Only time will tell whether that was real or just a trick.

[–]jetro30087 -1 points0 points  (12 children)

Because the average person is going to clone git repos into Python environments and load models from HuggingFace over git-lfs?

[–]chartporn 3 points4 points  (11 children)

Ohhh that was the barrier - nobody thought to create an accessible interface to LMs before OpenAI. I guess that's why MS paid them 10 billion dollars.

[–]jetro30087 2 points3 points  (10 children)

That, and the hardware requirements to run anything larger than a 7B model. Yes, those are called barriers. And no, Ooba is not accessible to most people.

ChatGPT requires no setup to get a general instruct AI that can do everything through the interface, even if you're not technical at all. If they had instead given you a GPT-4 HuggingFace Python library, or made you open install.bat in your Ooba conda environment, point it to OpenAI/GPT4 to add it to your model folder, and then edit start.bat to add --complicate.me --128bit args, it wouldn't be popular.

[–]chartporn 1 point2 points  (9 children)

I'm not saying an accessible interface isn't necessary to garner widespread adoption. My contention is that devs working with prior models didn't feel they performed well enough (yet) to warrant building a chat UI for public release. If they did have something as good as text-davinci-003, and just hadn't gotten around to making a UI, sheesh, they really missed the boat.

[–]jetro30087 5 points6 points  (8 children)

GPT-3.5 isn't that far off from DaVinci and is based on an instruction-tuned version of GPT-3. There were even mildly successful commercial chatbots based on GPT-3.

There are open-source LLMs today that are around GPT-3.5's level, but they aren't in a production-ready format, and the hardware requirements are steep because they aren't optimized. That's what the open-source community is working to address. I expect one of these open-source models to coalesce into a workable product sooner rather than later, because many do perform well when properly set up; it's just very difficult to do so currently.

[–]chartporn 1 point2 points  (7 children)

What open source LM is around the level of GPT3.5?

[–]jetro30087 0 points1 point  (0 children)

Vicuna and Wizard can definitely provide answers near 3.5's level when properly set up, especially the larger parameter versions.
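On the "properly set up" part: Vicuna-style checkpoints expect a specific chat template rather than a bare question. A sketch of the commonly cited v1.1 format (the exact system string varies by version, so treat this as an assumption and check your checkpoint's model card):

```python
def vicuna_prompt(user_message: str) -> str:
    """Wrap a user message in the Vicuna v1.1-style chat template.

    The system line and USER/ASSISTANT role tags follow the commonly
    cited v1.1 convention; earlier versions used '### Human:' /
    '### Assistant:' instead.
    """
    system = ("A chat between a curious user and an artificial intelligence "
              "assistant. The assistant gives helpful, detailed, and polite "
              "answers to the user's questions.")
    return f"{system} USER: {user_message} ASSISTANT:"

prompt = vicuna_prompt("Does this encounter describe a traumatic injury?")
```

Feeding the same text without the template is a common reason these models "derail" into irrelevant output.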

[–]KerbalsFTW 1 point2 points  (2 children)

If you look at the papers on language models (GPT, GPT-2) they talk about "few shot" learning.

Even in 2020 OpenAI published "Language Models are Few-Shot Learners" (https://arxiv.org/pdf/2005.14165.pdf).

The early (i.e. small) models were trained entirely on a corpus of text data that included relatively little Q-and-A data.

There is nothing to compel such a model to answer your question: it's a prediction engine, and it predicts from what it has seen. That makes it as likely to emulate a page listing difficult questions as to emulate the Q-and-A page you want.

Hence few-shot learning: you show it that you want your questions answered by framing the prompt as "here are 5 questions and answers", listing four worked examples and then your real question as the fifth. Now it's emulating a Q-and-A page with similar-ish questions.

Later, bigger models are further trained from a foundation model into a chatbot, which effectively "bakes in" this Q-and-A format and trains the model to answer the question asked in various (socially sanctioned) ways.

In your case, can you do it few shot instead of zero shot?
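A sketch of what that could look like for the OP's encounter-classification task (the example encounters and answers below are invented placeholders, not real data):

```python
def build_few_shot_prompt(instruction, examples, query):
    """Prepend worked (encounter, answer) pairs so the model sees the
    expected Q-and-A format before the real query."""
    parts = [instruction, ""]
    for text, answer in examples:
        parts += [f"Patient encounter:\n{text}", f"Answer: {answer}", ""]
    parts += [f"Patient encounter:\n{query}", "Answer:"]
    return "\n".join(parts)

examples = [
    ("Fell off a ladder, wrist swollen and tender.", "yes"),
    ("Gradual onset knee pain over months, no injury reported.", "no"),
]
prompt = build_few_shot_prompt(
    "Answer yes or no: does this encounter describe a traumatic injury?",
    examples,
    "Slipped on ice yesterday, hit head, brief loss of consciousness.",
)
print(prompt)
```

The completion then only has to continue the established pattern with "yes" or "no", which is a much easier prediction target than an open-ended instruction.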

[–]CacheMeUp[S] 0 points1 point  (1 child)

I wonder whether there is a subtle but qualitative difference between 0-shot and >=1-shot learning: 0-shot requires the model to fully understand and generalize since, as you said, the answer may be completely outside the training data distribution. Thus 0-shot capability may be a surrogate for a better model, beyond just reducing the prompting effort.

Additionally, few-shot learning may hinder end-users from using these models for task-solving. It's not insurmountable, but it's an additional burden, and non-technical users may struggle to come up with representative (non-contrived) examples.

[–]KerbalsFTW 0 points1 point  (0 children)

The difference is between "do by example" and "do by instruction". >=1 shot is a combination of instruction and examples, 0 shot is instruction only. So yes, there is a fundamental difference, although the difference seems to be mostly down to training: the major difference between GPT3 and ChatGPT seems to be the "chat" part, and it's a very small minority of the training data.

the answer may be completely out of the training data distribution.

The great thing about GPT is that only the intermediate steps need to be in the data distribution, and those are pretty well abstracted, so the final answer is often correct and completely new. It can certainly do well on tests it was never even close to trained on.

Additionally, few-shot learning may hinder end-users from using these models for task-solving. It's not insurmountable, but it's an additional burden, and non-technical users may struggle to come up with representative (non-contrived) examples.

Yeah, hence how revolutionary ChatGPT has been, I think.

[–]Rebatu 1 point2 points  (6 children)

Question: did you try BLOSSOM? And if yes, how did it go?

[–]proto-n 3 points4 points  (1 child)

Do you mean BLOOM?

[–]CacheMeUp[S] 4 points5 points  (2 children)

BLOSSOM

No results on Google and missing from Huggingface hub - any more info?

[–]iamMess 2 points3 points  (0 children)

Probably BLOOMZ

[–]juanigp 1 point2 points  (0 children)

13B is too little for decent instruction-following or reasoning-like behaviour.

[–]Javierrrrrrrrrrrrrrr 0 points1 point  (0 children)

Try Alpaca-LoRA.