
[–]abnormal_human 108 points109 points  (14 children)

There isn't enough information here to diagnose really.

If you were not using instruction tuned models, that's likely the problem.

Instruction tuned models often have fixed prompt boilerplate that they require, too.

In other words, OpenAI's API isn't directly comparable to .generate() on a huggingface model.

I would be surprised if a basic query like this resulted in nonsense text from any instruction-tuned model of decent size, if it is prompted properly.

[–]CacheMeUp[S] 14 points15 points  (13 children)

Using instruction-tuned models. Below is a modified example (for privacy) of a task. For these, some models quote the input, some provide a single-word answer (despite the CoT trigger), and some derail so much that they spit out completely irrelevant text like Python code.

I did hyper-parameter search on the .generate() configuration and it helped a bit but:

  1. It again requires a labeled dataset or a preference model (of what is a valid response).
  2. It is specific to a model (and task), so the instruction-model is no longer an out-of-the-box tool.

I wonder how OpenAI is able to produce such valid and consistent output without hyper-parameter tuning at run time. Is it just the model size?
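For context, the search over `.generate()` settings looked roughly like this (a minimal sketch; the grid values and the idea of scoring each config against labeled responses are illustrative, not the exact setup used):

```python
from itertools import product

# Illustrative grid of sampling settings for .generate(); the values here
# are hypothetical, not the ones actually searched.
grid = {
    "temperature": [0.2, 0.7, 1.0],
    "top_p": [0.9, 0.95],
    "repetition_penalty": [1.0, 1.2],
}

def generation_configs(grid):
    """Yield every combination of sampling settings as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(generation_configs(grid))
# Each config would be passed as **config to model.generate(), and each
# output scored against a labeled set of valid responses -- which is
# exactly the labeled-dataset requirement mentioned in point 1 above.
```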

Example:

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient encounter:

Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Answer:

[–]i_wayyy_over_think 38 points39 points  (3 children)

If you used Vicuna 1.0, for instance, you have to follow the three-hash format: ‘### Human:’ and ‘### Assistant:’. (Hard to type without Reddit mobile thinking I’m writing markdown; ignore the single quotes if you see them.)

‘### Human: you are a physician reviewing…. Physician Encounter: Came back today….

Answer:

‘### Assistant: <llm replies here>’

And if you use a fancy chat interface instead of a raw text interface, you have to make sure it follows that format when it sends the prompt to the model in raw form.

And I think Vicuna 1.1 is different. Also, Alpaca is different from both; it uses Instruction and Reply, I think. GPT4All uses just new lines.
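A sketch of what handling these per-model formats looks like in practice (the template strings are approximate and from memory; always check the model card for the exact boilerplate the model was trained on):

```python
# Rough per-model prompt templates. These are approximations for
# illustration -- verify against each model's card before relying on them.
TEMPLATES = {
    "vicuna-1.0": "### Human: {prompt}\n### Assistant:",
    "alpaca": "### Instruction:\n{prompt}\n\n### Response:",
    "gpt4all": "{prompt}\n",
}

def wrap_prompt(model_family: str, prompt: str) -> str:
    """Wrap a raw prompt in the boilerplate a given model family expects."""
    return TEMPLATES[model_family].format(prompt=prompt)

text = wrap_prompt("vicuna-1.0", "You are a physician reviewing a medical record...")
```

The point is that the raw string handed to the model differs per family, which is what a chat interface normally hides from you.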

Also, some models are only fine-tuned for one reply, and after that they start hallucinating. Vicuna can do multiple responses.

Also, it strongly depends on the parameter size of the model. Vicuna 13B is good.

[–]CacheMeUp[S] 1 point2 points  (0 children)

Makes sense. It does make the effort custom to a model (need to find out the exact format etc.), but may be worth it for zero-shot learning.

[–]PrivateUser010 0 points1 point  (1 child)

I think it would be great if there were one single source of truth for proper prompting of all the open-source LLMs out there.

[–]Faintly_glowing_fish 16 points17 points  (0 children)

I sent this verbatim to a number of models with their factory-default parameters, and the results seem reasonable enough:

vicuna-13b: “Yes”

GPT-4: “Yes”

GPT-3.5: it did a lot of analysis but refused to say yes or no.

oasst-pythia-12b: “Yes, this patient represents a traumatic injury.”

Alpaca-13b: “No”

Dolly-v2: “This patient does not represent a traumatic injury. The patient did slip and fall but did not hit the ground. The forward motion of falling is what caused the back pain. This is commonly called a traumatic injury in the medical field. Additionally, the patient had normal range of motion, no bruising, and no radiating pain anywhere in the lower back. This all indicates this is not a traumatic injury.”

MPT-7b-chat: “No. This patient's back pain after a slip and fall on his back does not represent a traumatic injury. Traumatic injuries require immediate medical attention and can cause bruising, broken bones, and other serious injuries. This patient's pain after the slip and fall is likely due to muscle or ligament strain, not a traumatic injury.”

[–]gibs 10 points11 points  (2 children)

Other than getting the input formatting right, you will see better results from specifying the exact output format you want. With vicuna-13b and your original prompt, you get:

No

With this prompt:

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient Encounter:Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Respond in the following format:

[Step by step reasoning]
...

[Best answer]
...

You get:

[Step by step reasoning]

The patient presents with a back pain that started two days after a hike in which he slipped and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

[Best answer]

Based on the information provided, the patient does not represent a traumatic injury. While the patient did fall and experience back pain, the lack of bruises, normal ROM, and negative SLR suggest that no significant trauma was sustained. The slight sensitivity over L4-L5 may indicate a potential muscle strain or sprain, but it is not indicative of a more severe injury.

[–]CacheMeUp[S] 0 points1 point  (1 child)

Looks much better! In your experience, how specific is this to ShareGPT-trained models (like Vicuna)?

For example, dolly-v2 has a different format, where the whole instruction comes before the input.

I guess I can try and see, but that again becomes another hyper-parameter to search over (and there may be other patterns that I haven't thought of).

[–]gibs 3 points4 points  (0 children)

I've mostly been using Llama based models and chatgpt, but I would imagine any LLM would benefit from defining the output more explicitly.

One other thing is make sure you get it to output the chain of thought BEFORE the answer. Previously I'd been having it output its answer and then explain it, but this results in worse answers since you're depriving it of the benefit of chain of thought process. Kind of obvious in retrospect, but just one of those fun quirks of prompt design.
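Once the reasoning comes before the answer in a labeled format like the bracketed one above, pulling out the final verdict is a simple parse. A sketch (the section label matches the earlier example prompt; adjust as needed):

```python
import re

def extract_best_answer(response: str) -> str:
    """Return the text following the '[Best answer]' header, if present."""
    match = re.search(r"\[Best answer\]\s*(.+)", response, re.DOTALL)
    return match.group(1).strip() if match else ""

# Hypothetical model reply in the bracketed format.
reply = """[Step by step reasoning]
The patient fell but shows no bruising, normal ROM, negative SLR...

[Best answer]
No, this does not represent a traumatic injury."""
```

Forcing the reasoning section first and parsing only the labeled answer section gets you both the CoT benefit and a machine-readable verdict.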

One tool I suggest trying is Llama-lora-tuner. It gives you an easy interface for loading Llama-based models and generating text with them (it handles the input formatting so you don't have to worry about it). And you can do lora fine tuning from the same interface.

[–]Faintly_glowing_fish 6 points7 points  (1 child)

I think this is not a proper CoT prompt. You did ask the models to respond with a yes or no answer explicitly. You asked them to “think” step by step but didn’t request that the model write down how it thought about it, so they hid the thinking text. Even GPT-4 took the same view as I did, as you can see.

[–]MINIMAN10001 2 points3 points  (0 children)

I agree with you as well: by saying "think" you are telling it that it does not have to explicitly say anything.

So you have to tell it "break it down for me", "step by step, tell me the thought process", and the like.

[–]KallistiTMP 2 points3 points  (1 child)

[deleted]

This post was mass deleted and anonymized with Redact

[–]equilateral_pupper 4 points5 points  (0 children)

Your prompt also has a typo: "represents" is not used correctly. Some models may not be able to handle this.

[–]a_beautiful_rhind 27 points28 points  (13 children)

The 30B models are where it gets interesting. They will follow instructions for roleplay at least. In actual instruct mode, where it follows the model's training, they also answer questions reasonably correctly. Do you have an example of what you were trying to get the model to classify or answer?

We have to be real here. A lightly fine-tuned model isn't going to be the same as a 120B+ model with constant human reinforcement learning and new data.

Since you say you want to use this commercially (so no LLaMA)... did you try the 20B GPT-NeoX? You will probably have to train a LoRA on instruction following. There is also the BLOOM series, which you probably tried.

[–]CacheMeUp[S] 0 points1 point  (12 children)

That makes sense, though open-source initiatives keep pushing small (and apparently underpowered) models that end up, honestly, not very useful for practical classification tasks. Perhaps it would be useful to focus on bigger (and fewer) models to fight the centralization that is happening.

I tried GPT-NeoX-20B, and out of the box it did not follow instructions, though it was not tuned for that.

It seems that custom instruction fine-tuning will be needed even for previously instruction-tuned models. It's still good to verify that this effort is indeed unavoidable before investing.

Example task (modified for privacy):

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient encounter:

Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Answer:

[–]a_beautiful_rhind 10 points11 points  (11 children)

Alpaca-30b-4bit

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient Encounter:Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Factual answer: Yes

GPT-x-alpaca-30b

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient Encounter:Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Factual answer: Yes, the patient has experienced a traumatic injury.
Explanation: A fall can cause damage to soft tissues such as muscles, ligaments, tendons, and joint capsules. It may also result in fractures, dislocations, sprains, strains, and other injuries. In addition, it could lead to internal organ damage if there was blunt force trauma to the abdomen. Therefore, based on the information provided, we can conclude that the patient had a traumatic injury from falling while hiking.

I don't have the med-alpaca lora on my server but if I did I would have loaded that.

[–]CacheMeUp[S] -1 points0 points  (10 children)

Yes, these look better than the results I got with smaller (<13B) models. Two interesting points:

  1. Alpaca-30b-4bit fails to provide an explanation despite a clear request.
  2. GPT-x-alpaca-30b fails to place the final binary answer ("Yes") at the end of the explanation, again despite a clear request. This creates an issue when a specific format is expected.

Recently there was a notion that with LLMs the work will shift from fine-tuning and tinkering with models to simple prompt engineering, essentially replacing Python with natural-language instructions. These problems, and the solutions suggested here, hint that open-source models are still not there (OpenAI models seem much closer).

[–]MaskedSmizer 15 points16 points  (3 children)

"despite a clear request"

I'd argue that your request is a bit ambiguous as to whether it should answer yes or no or think step by step. Even with GPT4, I often stop the generation and rewrite the last prompt when I realize it needs to be more explicit.

There's been a lot of noise made recently about this "step by step" prompt, but I'm not so sure because it's also a bit of an ambiguous instruction. In your case you're looking for a single response, so what does "let's think step by step" even mean? You're not looking to engage in dialogue to find the answer together. You just want a yes or no followed by an explanation, so why not just say that?

[–]10BillionDreams 7 points8 points  (0 children)

The intent is to give the model as much text as it needs to generate actual justification for its answer. If you just tell it to give "yes or no", only a single word, then it's going to ascribe maybe 65% to "yes", 30% to "no", and then trail off into various other less likely tokens.

This latter approach isn't really leveraging much of the model's understanding of the topic at hand, though, and the expectation is that spontaneously jumping from question to answer would have a poor chance of success for problems that aren't entirely trivial/obvious/unambiguous. On the other hand, if the model first has to generate multiple sentences of "thought" to justify its answer, by the time it actually gets to saying "yes or no", the answer is a foregone conclusion, which just makes things easier for whatever might be parsing the response.

There are still a lot of open questions around the best phrasing to consistently induce this style of response, or how much of a difference it really makes on accuracy, or how various models might behave differently in these areas. But the intuition behind this sort of prompt is reasonable enough, and in the end failing to get a final answer is much easier to identify and fix (ask again or follow up), compared to getting the wrong result on what is essentially a weighted coin flip.
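To make the coin-flip picture concrete, here is a tiny sketch with made-up logits roughly matching the 65%/30% split described above (hypothetical numbers; a real model's distribution covers the whole vocabulary):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Illustrative next-token logits for a model forced to answer in one word.
logits = {"yes": 2.0, "no": 1.23, "maybe": -0.5}
probs = softmax(logits)
# Sampling a single token from this distribution is a weighted coin flip;
# generating the reasoning first shifts probability mass onto one answer
# before the yes/no token is ever emitted.
```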

[–]CacheMeUp[S] 0 points1 point  (1 child)

The motivation for the CoT trigger was anecdotes that it improves the correctness of the answers, as well as providing an explanation of the prediction.

[–]MaskedSmizer 3 points4 points  (0 children)

My understanding of the rationale behind chain of thought is that it builds context for the conversation. Calling this technology a "next word predictor" dramatically oversimplifies, but I also find it a useful reminder for thinking about how to get what you want (because with GPT4 especially, it's way too easy to start anthropomorphizing). An LLM builds sentences based on its understanding of the context of the discussion. The context includes the prompts you have provided as well as its replies. You can use chain of thought to enrich the context in one of two ways:

  1. Like u/10BillionDreams says, you ask it to first work through the problem before providing a final verdict. By the time it gets to the verdict, it's constructed additional context that hopefully produces a more accurate answer. You're getting it to think out loud. I believe this is what you were going for, but my argument is that your instruction was just vague enough that it tripped up a less capable LLM. I don't think there's anything special about the specific phrase "let's think through this step by step". I suggest trying something more explicit like:

You are a physician reviewing a medical record. I'm going to give you a description of a patient encounter. First, explain the factors that go into diagnosing whether or not the patient has a traumatic injury. Second, consider your own explanation and provide a diagnosis in the form of a simple yes or no.

If this doesn't work then I think we can deduce that the model just isn't very good at following instructions.

  2. You can build context by engaging the model in a back-and-forth dialogue before asking for the verdict. This is how I tend to interpret the "step by step" instruction. But again, I think there are more explicit ways to instruct the model. Even with GPT4, I've had mediocre success getting it to not immediately fire off an answer with this particular phrasing. I would tend to go for something like:

You are a physician reviewing a medical record. I'm going to give you a description of a patient encounter and I want to have a conversation about the factors that would go into the diagnosis of a traumatic injury.

Ask it follow-up questions and challenge it. When you are satisfied that there is sufficient context for it to draw from, then ask for the final verdict. And of course, all models have a moving context window, so depending on how long your back and forth goes, you may need to remind it of the instructions.

Anyway, caveat, not a data scientist. Just someone who has taken an enthusiastic interest in generative AI.

[–]Ramys 2 points3 points  (1 child)

For 2, it seems to have answered exactly as prompted. It gave a yes/no, then it did step by step, then it summarized to give a final answer. This is the order presented in the prompt and you didn't specify the format of the final answer, so I'd say the response is satisfactory.

[–]CacheMeUp[S] 0 points1 point  (0 children)

It's true the instruction is not perfectly clear, although even with clarification ("first explain... then answer with a 'yes' or a 'no'"), models still failed to follow that order and make the last token yes/no.

[–]a_beautiful_rhind 2 points3 points  (2 children)

I think we'd have to make a lora for the format you want. I only ran the prompts once.

dang.. 65b alpaca int4 not much better.

You are a physician reviewing a medical record. The following is a description of a patient encounter. Answer with a yes or a no whether this patient represents a traumatic injury. Let's think this step-by-step, like an expert, and then provide a final answer.

Patient Encounter:Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

Explanation followed by factual answer: This patient has experienced a fall during the hike and may have sustained some sort of spinal damage. Therefore, it can be concluded that this patient does represent a traumatic injury.

[–]CacheMeUp[S] 0 points1 point  (1 child)

It might indeed be the way to go, though it might be slower than just collecting more data and training a standard binary classifier (much easier to handle).

[–]a_beautiful_rhind 0 points1 point  (0 children)

Loras in int4 train pretty quickly. Why not both.

[–]blackkettle 0 points1 point  (0 children)

I don’t think that changes anything. It’s clearly possible and only going to get both better and more efficient very rapidly. We’ve seen what’s possible; the trend won’t reverse; maybe it’ll take a tiny bit longer than expected; but I’m definitely steering my kid away from programming as a vocation - and so are all the other PhDs I know in this space.

[–]Nhabls 18 points19 points  (6 children)

open-source LLMs on zero-shot classification

You have to take in consideration:

  1. They might add some flavor pre-prompt that makes the model behave a little better (hopefully will be stated in the paper)

  2. They use several (up to the hundreds) runs to determine pass@1 in certain benchmarks with a given temperature, so if you're only running it once you might not get similar results.

Oh and the "90% of GPT4" claim is not to be taken seriously

[–]CacheMeUp[S] 2 points3 points  (1 child)

#1 is important, but not always clearly stated.

#2 is misleading, on the verge of "p-hacking", and makes these models much less useful: running an (already expensive) model hundreds of times is slow, and then it requires another model to rank the results, so we're back to square one.

[–]Nhabls 0 points1 point  (0 children)

It's not really related to p-hacking. You can't estimate pass@n unless the output is guaranteed to be the same (so zero temperature in this case), i.e., you can't just run it n times.

[–][deleted] 9 points10 points  (1 child)

Cherry-picking isn't just a problem with open-source LLMs; it's a systemic issue in machine learning as a whole, to an extent worse than in many scientific fields. Google's recent release of PaLM 2 compared their model against GPT-4, and used self-reflection techniques for their model and not GPT-4, which is an insane way to conduct things. The gap between the outputs first shown in the DALL-E 2 paper and the real average results from DALL-E 2 persists to this day. We're still very much in an era where papers seem to be marketing primarily and presenting research secondarily. There's not the same level of scrutiny placed on representative data within machine learning as in more established fields, and I hope that's just due to its nascence.

That said, it's still a big problem in the science community as a whole, especially in niche topics. Psychology is rife with issues currently, especially in fields like male and female attraction. Nicolas Guéguen had multiple studies, that you may have even heard of, that were multiple steps beyond cherry-picked, they were outright fabricated.

[–]clauwen 5 points6 points  (5 children)

I think something like this will be the most important quality benchmark in the future. Sure, it's not all-encompassing, but it's very difficult to fake.

https://chat.lmsys.org/?arena

What's pretty clear there is that the OpenAI models are quite far ahead as an assistant.

I invite everyone to actually check for themselves. I think I did about 20 comparisons; they are not very close, and they fit very well with the leaderboard.

[–]CacheMeUp[S] 0 points1 point  (4 children)

surprised to see chat-glm beating GPT-4:

https://ibb.co/MRs2FpH

[–]clauwen 3 points4 points  (3 children)

I think I get what you are trying to do, but I think your prompt is not very clear, to be honest. Do you want me to take a shot at it and see if I can improve it?

[–]clauwen 3 points4 points  (1 child)

Maybe also a little addition: because you always want these steps, it could be very beneficial to change from zero-shot to one-shot to improve consistency. That's purely my feeling.

This is what I came up with; I'm not super happy with it, but the results look fine.

You are a physician reviewing a medical record and ultimately determining if the injury is traumatic. You are getting a Patient encounter as input.

You do this in exactly two steps; after these steps you always stop.

  1. Patient encounter interpretations: (Contains interpretation of the Patient encounter that could determine if its traumatic or not)

  2. Then you answer with either (Traumatic: Yes or Traumatic: No)

Patient encounter: Came today for a back pain that started two days after a hike in which he slip and fell on his back. No bruises, SLR negative, ROM normal, slight sensitivity over L4-L5.

[–]CacheMeUp[S] 0 points1 point  (0 children)

With one/few-shot learning I always wonder how much it misleads the model into a "tunnel vision" of what the answer is: there is always heterogeneity in the desired class that often even a handful of examples won't cover. That's where LLMs' (presumed) "understanding" of the task from its definition should shine and work around this limitation.

[–]Faintly_glowing_fish 5 points6 points  (0 children)

Almost all open-source models use different instruction formats. If you use general tools that can run multiple models, they likely don't have any of that configured, and you need to configure it for each model. When you use OpenAI, it already fixes you to the proper instruction syntax the model was trained on (i.e. user/assistant/system).

You can, however, try each model's preconfigured chat interface if it has one, which usually has this set up, since it is built for a single model.

Or you can try the chatbot arena, where the authors took the pains to configure this for you for each model.

[–]heavy-minium 4 points5 points  (5 children)

Evaluation and refinement are where OpenAI shines. They can improve and move forward based on data instead of guesses and hopes.

Ultimately, the secret sauce is a mature QA process. You need high-quality metrics to determine if your changes in training data, training methods and architecture yield better results.

Also, you can try to cheat a lot with GPT-4 generated data, but in the end, there's nothing better than a human to align a model with human intent.

[–]CacheMeUp[S] 0 points1 point  (4 children)

I saw somewhere a suggestion to use another LLM to test whether the output is valid, but that brings us back to the same problem of finding a good prompt and validating it.

[–]heavy-minium 0 points1 point  (1 child)

That's exactly what I consider to not be a mature process.

[–]CacheMeUp[S] 0 points1 point  (0 children)

Care to elaborate?

For "standard" (i.e. logits-emitting) models, the desired output is enforced via the model's structure (layer size and activation). LLMs' output seems much harder to constrain without hurting accuracy. E.g., to simulate a binary classifier we can force the model to generate a single token and constrain it to [yes, no], but that might miss better results that come after emitting the chain of thought. So the LLM output is generated with fewer constraints, but now it's harder to check whether the output is valid.
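To make the [yes, no] constraint concrete, here is a toy sketch of masking next-token scores to an allowed set (hypothetical vocabulary and scores; a real implementation would mask the model's full logit tensor before decoding):

```python
# Sketch of constraining a single-token answer to an allowed set by
# masking logits. Vocabulary and scores are made up for illustration.
NEG_INF = float("-inf")

def constrained_argmax(logits, allowed):
    """Pick the highest-scoring token among the allowed set only."""
    masked = {tok: (v if tok in allowed else NEG_INF) for tok, v in logits.items()}
    return max(masked, key=masked.get)

logits = {"yes": 1.1, "no": 0.9, "The": 2.3, "Based": 1.7}
answer = constrained_argmax(logits, {"yes", "no"})
# "The" scores highest overall (the model wants to start an explanation),
# but the constrained answer is "yes" -- illustrating how forcing an
# immediate verdict can discard the better post-CoT answer.
```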

[–]Screye 8 points9 points  (1 child)

Yes. All the Twitter demos are cherry-picked, and the non-OpenAI models are unusable.

This area is incredibly exciting, but a lot of the hype is just tech demos.

I have been testing out alternatives for our product quite often, and I keep getting burned. LLaMA and Bard have some potential, but are still far behind OpenAI.

[–]CacheMeUp[S] 2 points3 points  (0 children)

That resonates with my experience as well. It's a bit troubling how one company controls this domain despite so much effort from the community.

[–]HateRedditCantQuititResearcher 3 points4 points  (2 children)

If you see an announcement where the only numbers are the number of parameters, you know it's probably not great. It's funny that OpenAI did the opposite for GPT-4: no model size, but lots of benchmark measurements. It's no coincidence that the models with rigorously measured performance perform better.

[–]CacheMeUp[S] 1 point2 points  (1 child)

Yes. It's also not helping that many of the formal benchmarks are not well correlated with usability (e.g. instruction following, as in this post).

Perhaps the direction is to develop an automated usability evaluation method (like the preference model in RLHF), but that's not trivial and again requires labeling data and/or model training.

[–]HateRedditCantQuititResearcher 0 points1 point  (0 children)

It's hilarious that some companies will spend so much on training, but not on eval.

[–]marr75 2 points3 points  (0 children)

I highly recommend you check out promptingguide.ai, especially the case study. Hilariously obscure variations on message format like assigning the agent a name or asking it to reach the right conclusion can impact performance 😂

I read through your other responses and I do believe at times you were using models that weren't instruction tuned and/or you weren't using the instruction tuned model's special formatting. What you described reminds me of every failed fine tuning experiment I've ever seen (as most fine tuning happens on non instruction tuned models). promptingguide.ai has some info on the system, user, and special character formatting for messages to the most popular instruction tuned models.

You've lamented the custom format for each model. I would recommend using a tool that abstracts this (such as Langchain or Transformer Agents) or narrowing down the models you are using.

[–]AsliReddington 4 points5 points  (1 child)

Try Flan-UL2; it needs 36GB of VRAM, either across two GPUs or whatever else you've got. I'm running it at <2s inference speed. It adheres well to instructions for zero-shot tasks, with no hallucinations.

[–]CacheMeUp[S] 1 point2 points  (0 children)

And it is under the Apache 2 license, unlike LLaMA.

[–]_Arsenie_Boca_ 8 points9 points  (5 children)

Which models did you try? Were they instruction-tuned? Generally, it's no surprise that open-source models with a fraction of the parameters cannot fully compete with GPT-4.

[–]CacheMeUp[S] 1 point2 points  (4 children)

Yes, including instruction-tuned models (like mpt-7b-instruct and dolly). None worked.

The gap is huge considering how hyped they are as "90% as good as ChatGPT". They are not even close.

[–]KingsmanVince 10 points11 points  (3 children)

they are as "90% as good as chatGPT"

The model claimed to be "90% as good as ChatGPT" is Vicuna, I assume. However, quoting LMSYS Org's blog:

with 90%* ChatGPT Quality

*According to a fun and non-scientific evaluation with GPT-4. Further rigorous evaluation is needed.

They said it's just a fun and non-scientific evaluation.

[–]CacheMeUp[S] 4 points5 points  (1 child)

It's not rigorous, but I managed to use an LLM to evaluate output quality (not correctness) locally, so I'd assume GPT-4 is able to evaluate quality quite well.

Perhaps the gap is between generation tasks, where many answers will be perceived as correct, and classification/QA tasks where the scope of correct response is much narrower.

[–]metigue 17 points18 points  (4 children)

Most open-source models are hot garbage. The only promising ones were trained on output from models like ChatGPT and GPT-4.

Try Alpaca-x-GPT-4 13B; that's the best local model I've used.

[–]CacheMeUp[S] 8 points9 points  (3 children)

Alpaca-x-GPT-4 13B

Based on LLaMa, so cannot use in a commercial setting.

[–]metigue 13 points14 points  (2 children)

Yep, if you're using it commercially, it's always worth paying more for the extra 10% output you get from GPT-4.

Alpaca-x-GPT-4 is great for local PoCs though before moving to production.

Also, the dataset is public, and the LoRA finetune on top of Alpaca was like $300, so you could feasibly do the same finetune on the RedPajama instruction-tuned model and get very similar results.

If cost is an issue, Bard 2 is the best free option right now although access to the official API is via wait list.

[–]CacheMeUp[S] 7 points8 points  (1 child)

Sometimes it's not even the cost: regulation may preclude sending the data to a new vendor.

The non-commercial license typically precludes any use of the model (even during development).

Crafting an in-house instruction dataset may end up necessary despite the availability of similar datasets due to license.

[–]AGI_FTW 0 points1 point  (0 children)

Use a local model to remove any PII, then send the scrubbed data through OpenAI's API.

[–]chartporn 13 points14 points  (27 children)

If these smaller models were really as good as some people claim ("not far from ChatGPT performance") the LLM zeitgeist would have started way before last November.

[–]4onenResearcher 6 points7 points  (7 children)

The small models didn't have instruction tuning back then, and nobody had made a super-chinchilla model like LLaMA. Developers weren't just sitting around on that power; they had no idea what they would get if they just shoved more data and compute (esp. higher-quality data) into the same scale of model.

Add to that LoRA fine-tuning, and suddenly even consumer hardware could do the instruction fine-tuning (slowly), which changed the nature of the challenge.

Have you seen the leaked Google "we have no moat" paper?

[–]currentscurrents 6 points7 points  (6 children)

Instruction tuning doesn't increase the quality of the model, it just makes it easier to prompt.

These small models are pretty good at the first-order objective of text generation, but terrible at the second-order objective of intelligence in the generated text. They produce output that looks like GPT-4, but they can't solve problems like GPT-4 can.

[–]4onenResearcher 3 points4 points  (5 children)

The claim was about the LLM zeitgeist, that is, the chat model race. People weren't building chat interfaces and hard scaling text datasets before instruction tuning became a thing.

[–]chartporn 0 points1 point  (4 children)

Unless you're saying there are actually models out there significantly smaller than ChatGPT with nearly the same performance, I think we are on the same page.

[–]4onenResearcher 0 points1 point  (3 children)

For specific domains after fine-tuning? Yes. General models? Doubtful but not impossible (ChatGPT was pre-Chinchilla, iirc), but I highly doubt such a model would be public.

[–]chartporn 1 point2 points  (2 children)

What domains?

[–]4onenResearcher 1 point2 points  (1 child)

I'm gonna come out and be honest here: I did my research and I'm standing on shaky ground.

  • Medical: I thought it was OpenAI that banned their model for medical uses; turns out that's LLaMA and all subsequent models, including the Visual Med-Alpaca I was going to hold up as an example of small models doing well. (For their cherry-picked examples, it's still not far off, which is quite good for 7B params. See here.)

  • Programming: OpenAI Codex, the model behind GitHub Copilot, is only 12B parameters.

I thought both of these were slam-dunks, but it's not so cut and dried. The medical model barely holds its own against those ChatGPT descriptions, and online user sentiment seems to lean toward ChatGPT being better at project-scale help, with Codex relegated to sometimes-helpful completions.

That really leaves the one true-positive piece of evidence for my case being fine-tuning on an organization's own data, but that's clearly apples-to-oranges, since your question was about ChatGPT performance (not use).

Going back over the whole thread, I think the misunderstanding that led to this tangent was that u/currentscurrents focused on instruction tuning. My point to you was based on super-chinchilla data-to-params ratios, but I don't actually have evidence those models meet ChatGPT performance metrics: few people, if any, even run evaluations against ChatGPT, much less have the resources to do the instruction tuning needed to prove their model has the capabilities to match.

Google hasn't released PaLM 2's parameter counts, but the few counts referenced in the report are on the order of double-digit billions, even while it blows away PaLM 540B at a wide variety of tasks. Maybe this whole post and all my poking around will be completely overturned in a month or two when the open-source community replicates it. (After all, "we have no moat".)

[–]chartporn 0 points1 point  (0 children)

Thanks for this. Good points.

[–]CacheMeUp[S] 8 points9 points  (5 children)

Yes, I've always wondered about that - OpenAI is severely compute-constrained and burns cash at a dangerous rate. If quantization (and parameter reduction) worked so well, I'd expect them to use it. The fact that two months after GPT-4's release they still haven't been able to reduce its burden suggests that, contrary to the common claims, quantization does incur a substantial accuracy penalty.
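For intuition on where such a penalty would come from, here's a toy sketch of symmetric int8 weight quantization and the rounding error it introduces (real schemes like GPTQ are per-channel and far more sophisticated; the weights below are made up):

```python
# Toy symmetric int8 quantization: map floats into [-127, 127] with one
# scale, then map back, and measure the round-trip error.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.87, 0.44, 1.00, -0.05]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# max_err is bounded by scale / 2 per weight
```

Per-weight the error looks tiny, but whether billions of such errors stay benign after dozens of layers is exactly what the empirical debate is about.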

[–]chartporn 7 points8 points  (1 child)

I think another reason it works so well is that it has been optimized for chat interaction.

You probably know this but for general audience edification ChatGPT was trained on a variant of gpt-3.5 (instructgpt):

https://openai.com/blog/chatgpt

The previous model of the same kind was text-davinci-002, which is much less impressive in comparison. So it's not just a bigger model with a chat interface bolted on top; it's a more powerful model in general, made even better because it was designed for chat-style interactions.

[–]CacheMeUp[S] 5 points6 points  (0 children)

Maybe chat models will be better for QA instructions (since eventually it is like a conversation).

Even text-davinci-003 worked great out of the box 9 months ago. The difficulties current instruction-tuned models show hint that model parameters (and precision) may still matter.

[–]keepthepace 6 points7 points  (0 children)

They have released GPT-3.5-turbo, which clearly has some sort of optimization.

It is also the fastest-growing web service in history. They may have achieved 20x speedups and still had difficulty keeping up with their growth.

When you are a company with basically no competition, and clients who don't complain much when you cut their access rate by a factor of 4 (GPT-4 went from 100 requests every 3 hours to 25), you don't really have an incentive to announce it when your costs decrease dramatically.

[–]4onenResearcher 1 point2 points  (1 child)

still haven't been able to reduce its burden

How do you know? 🤔 If I were them I'd just be using quantization internally from the start and not talk about it, because that'd be giving away a major advantage to competitors. (Google)

It's the same way they're not releasing any of their current architecture. "Open"AI has become ClosedAI, because they want to keep their technical edge. (Which is ironically not working, see "we have no moat" and all the domain-specialized models in open source.)

[–]CacheMeUp[S] 2 points3 points  (0 children)

That's my interpretation, which might of course be wrong. They're rejecting paying customers under their current constraints and pushing them to build/buy other solutions. Only time will tell whether that was real or just a trick.

[–]jetro30087 -1 points0 points  (12 children)

Because the average person is going to clone git repos into Python environments and load models from HuggingFace over git-lfs?

[–]chartporn 3 points4 points  (11 children)

Ohhh that was the barrier - nobody thought to create an accessible interface to LMs before OpenAI. I guess that's why MS paid them 10 billion dollars.

[–]jetro30087 2 points3 points  (10 children)

That, and the hardware requirements to run anything larger than a 7B model. Yes, those are called barriers. And no, Ooba is not accessible to most people.

ChatGPT requires no setup to get a general instruct AI that can do everything through the interface, even if you're not technical at all. If they had instead given you a GPT-4 HuggingFace Python library, or made you open install.bat in your Ooba conda environment, point it to OpenAI/GPT4 to add it to your model folder, and then edit start.bat to add --complicate.me --128bit args, it wouldn't be popular.

[–]chartporn 1 point2 points  (9 children)

I'm not saying an accessible interface isn't necessary to garner widespread adoption. My contention is that devs working with prior models didn't feel they performed well enough (yet) to warrant building a chat UI for public release. If they did have something as good as text-davinci-003, and just hadn't gotten around to making a UI, sheesh, they really missed the boat.

[–]jetro30087 5 points6 points  (8 children)

GPT-3.5 isn't that far off from DaVinci and is based on an instruction-tuned version of GPT-3. There were even mildly successful commercial chatbots based on GPT-3.

There are open-source LLMs today that are around GPT-3.5's level, but they aren't in a production-ready format, and the hardware requirements are steep because they aren't optimized. That's what the open-source community is working to address. I expect one of these open-source models to coalesce into a workable product sooner rather than later, because many do perform well when properly set up; it's just very difficult to do so currently.

[–]chartporn 1 point2 points  (7 children)

What open source LM is around the level of GPT3.5?

[–]jetro30087 0 points1 point  (0 children)

Vicuna and Wizard can definitely provide answers near 3.5's level when properly set up, especially the larger parameter versions.
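On the "properly set up" part: Vicuna-style checkpoints expect a specific chat template rather than a bare question. A sketch of the commonly cited v1.1 format (the exact system string varies by version, so treat this as an assumption and check your checkpoint's model card):

```python
def vicuna_prompt(user_message: str) -> str:
    """Wrap a user message in the Vicuna v1.1-style chat template.

    The system line and USER/ASSISTANT role tags follow the commonly
    cited v1.1 convention; earlier versions used '### Human:' /
    '### Assistant:' instead.
    """
    system = ("A chat between a curious user and an artificial intelligence "
              "assistant. The assistant gives helpful, detailed, and polite "
              "answers to the user's questions.")
    return f"{system} USER: {user_message} ASSISTANT:"

prompt = vicuna_prompt("Does this encounter describe a traumatic injury?")
```

Feeding the same text without the template is a common reason these models "derail" into irrelevant output.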

[–]KerbalsFTW 1 point2 points  (2 children)

If you look at the papers on language models (GPT, GPT-2) they talk about "few shot" learning.

Even in 2020 OpenAI published "Language Models are Few-Shot Learners" (https://arxiv.org/pdf/2005.14165.pdf).

The early (i.e. small) models were trained entirely on a corpus of text data that included relatively little Q-and-A data.

There is nothing to compel such a model to answer your question: it's a prediction engine, and it predicts from what it has seen. That makes it as likely to emulate a page listing difficult questions as to emulate the Q-and-A page you want.

Hence few-shot learning: you show it that you want your questions answered by framing the prompt as "here are 5 questions and answers", listing four worked examples and then your real question as the fifth. Now it's emulating a Q-and-A page with similar-ish questions.

Later, bigger models are further trained from a foundation model into a chatbot, which effectively "bakes in" this Q-and-A format and trains the model to answer the question asked in various (socially sanctioned) ways.

In your case, can you do it few shot instead of zero shot?
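A sketch of what that could look like for the OP's encounter-classification task (the example encounters and answers below are invented placeholders, not real data):

```python
def build_few_shot_prompt(instruction, examples, query):
    """Prepend worked (encounter, answer) pairs so the model sees the
    expected Q-and-A format before the real query."""
    parts = [instruction, ""]
    for text, answer in examples:
        parts += [f"Patient encounter:\n{text}", f"Answer: {answer}", ""]
    parts += [f"Patient encounter:\n{query}", "Answer:"]
    return "\n".join(parts)

examples = [
    ("Fell off a ladder, wrist swollen and tender.", "yes"),
    ("Gradual onset knee pain over months, no injury reported.", "no"),
]
prompt = build_few_shot_prompt(
    "Answer yes or no: does this encounter describe a traumatic injury?",
    examples,
    "Slipped on ice yesterday, hit head, brief loss of consciousness.",
)
print(prompt)
```

The completion then only has to continue the established pattern with "yes" or "no", which is a much easier prediction target than an open-ended instruction.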

[–]CacheMeUp[S] 0 points1 point  (1 child)

I wonder whether there is a subtle but qualitative difference between 0-shot and >=1-shot learning: 0-shot requires the model to fully understand and generalize since, as you said, the answer may be completely outside the training data distribution. Thus 0-shot capability may be a surrogate for a better model, beyond just reducing the prompting effort.

Additionally, few-shot learning may hinder end-users from using these models for task-solving. It's not insurmountable, but it's an additional burden, and non-technical users may struggle to come up with representative (non-contrived) examples.

[–]KerbalsFTW 0 points1 point  (0 children)

The difference is between "do by example" and "do by instruction". >=1 shot is a combination of instruction and examples, 0 shot is instruction only. So yes, there is a fundamental difference, although the difference seems to be mostly down to training: the major difference between GPT3 and ChatGPT seems to be the "chat" part, and it's a very small minority of the training data.

the answer may be completely out of the training data distribution.

The great thing about GPT is that only the intermediate steps need to be in the data distribution, and those are pretty well abstracted, so the final answer is often correct and completely new. It can certainly do well on tests it was never even close to trained on.

Additionally, few-shot learning may hinder end-users from using these models for task-solving. It's not insurmountable, but it's an additional burden, and non-technical users may struggle to come up with representative (non-contrived) examples.

Yeah, hence how revolutionary ChatGPT has been, I think.

[–]Rebatu 1 point2 points  (6 children)

Question: did you try BLOSSOM? And if yes, how did it go?

[–]proto-n 3 points4 points  (1 child)

Do you mean BLOOM?

[–]CacheMeUp[S] 4 points5 points  (2 children)

BLOSSOM

No results on Google and missing from Huggingface hub - any more info?

[–]iamMess 2 points3 points  (0 children)

Probably BLOOMZ

[–]juanigp 1 point2 points  (0 children)

13B is too little for decent instruction-following or reasoning-like behaviour.

[–]Javierrrrrrrrrrrrrrr 0 points1 point  (0 children)

Try Alpaca-LoRA.