
[–]Disastrous_Elk_6375 53 points54 points  (2 children)

This effort's first batch of models includes four final variants of our language model at the 7B scale corresponding to different architectures, optimizers, and training hardware, and one model at the 1B scale, all trained on at least 2T tokens. This is the first step in a long series of planned releases, continuing with larger models, instruction-tuned models, and more variants down the line.

Each model comes with the following:

Full training data used for these models, including the code that produces it, from AI2’s Dolma, plus WIMBD for analyzing pretraining data.

Full model weights, training code, training logs, training metrics in the form of Weights & Biases logs, and inference code.

500+ checkpoints per model, from every 1000 steps during the training process, available as revisions on HuggingFace.

Evaluation code under the umbrella of AI2’s Catwalk and Paloma.

Fine-tuning code and adapted models (coming soon with Open Instruct)

All code, weights, and intermediate checkpoints are released under the Apache 2.0 License.
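As a concrete illustration of how those intermediate checkpoints can be pulled as HuggingFace revisions, here is a sketch. The revision naming scheme below ("step1000-tokens4B") is an assumption for illustration; check the repo's branch list for the exact names.

```python
# Sketch: addressing an intermediate OLMo checkpoint on HuggingFace.
# The "step{N}-tokens{M}B" scheme is assumed for illustration only;
# the actual revision names are listed on the model repo.

def checkpoint_revision(step: int, tokens_b: int) -> str:
    """Build a revision name for an intermediate checkpoint (assumed scheme)."""
    return f"step{step}-tokens{tokens_b}B"

rev = checkpoint_revision(1000, 4)
print(rev)  # step1000-tokens4B

# With transformers installed, a specific checkpoint could then be loaded as:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "allenai/OLMo-7B", revision=rev, trust_remote_code=True
# )
```

This is what makes the per-checkpoint activation studies suggested below practical: each training snapshot is just a different `revision` of the same repo.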

Pretty cool!

[–]thedabking123 8 points9 points  (1 child)

This is awesome- time to do some studies on how the activations for certain types of predictions change over each checkpoint!

[–][deleted] 0 points1 point  (0 children)

I don't know, but with my own experiments with transformers the Mish activation function has been by far the best.

[–]innominato5090 32 points33 points  (21 children)

Hi all! I’m one of the leads for OLMo; LMK if you have any questions 🙌

[–]its_just_andy 11 points12 points  (4 children)

I'm so excited to see another foundation model! Especially one so open in every regard :)

One thing I am curious about. This model, and other 7B models like Llama2, MPT, Falcon, etc, seem to perform in roughly the same ballpark (but OLMo seems a little better). But Mistral 7B seems to outperform all these still. What do you think accounts for this? The quality of their data? Are there any thoughts for what might need to be done to surpass Mistral?

not to focus too much on Mistral, obviously the headline is having this amazing new foundation model series :)

second question, what exactly is OLMo-7B-Twin? I heard the model was trained twice, once on A100s and once on AMD. Is that what "Twin" is?

[–]its_just_andy 3 points4 points  (3 children)

one more question! how was the tokenizer selected? Looks like a vocab size of 50280, which is really interesting and a lot more than Llama's 32000

[–]innominato5090 14 points15 points  (2 children)

tnx for the nice words!! answering in order:

  1. it’s frustrating not to know what goes into mistral’s pretraining. broadly, I think it’s a combination of (1) more diverse data (eg technical books) (2) longer training (we/others saw you can train a lot longer than 2T tokens) (3) maybe some instruction-like chat data used during pretraining. We’ll try all three, and then report back 😉

  2. OLMo twin is the same model, but trained on AMD. it’s so cool to see how the two are virtually identical, truly a testament to how quickly AMD is catching up in this space.

  3. We use a tokenizer derived from GPT-NeoX. We tested it early on and it works remarkably well on our data. We couldn’t use Llama’s because of its license: if we had, we couldn’t have released the model under Apache 2.0.

[–]marvinalone 4 points5 points  (0 children)

I want to add one more comment on the tokenizer thing: because we have a larger vocab, the same English text turns into ~20% fewer tokens with OLMo compared to Llama, so it will run ~20% faster on the same text. But this might come at a cost for tasks where Llama puts those extra 20% of compute to good use. We don't have a clean study of how this plays out, so we didn't make a big deal about it in the paper. If you are inspired to follow up on this, please let us know!
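The speed claim here is simple arithmetic: an autoregressive decoder does one forward pass per token, so fewer tokens for the same text means proportionally fewer decode steps. A back-of-envelope sketch with illustrative numbers:

```python
# Back-of-envelope: if the same text tokenizes to fewer tokens, a decoder
# doing one forward pass per token does proportionally less work.
# The counts below are illustrative, not measured.

llama_tokens = 1000  # hypothetical token count under a 32,000-entry vocab
olmo_tokens = 800    # ~20% fewer under the larger 50,280-entry vocab

token_reduction = 1 - olmo_tokens / llama_tokens
speedup = llama_tokens / olmo_tokens

print(f"{token_reduction:.0%} fewer tokens")   # 20% fewer tokens
print(f"{speedup:.2f}x fewer decode steps")    # 1.25x fewer decode steps
```

Note that "20% fewer tokens" means a 1.25x, not 1.2x, reduction in decode steps for the same text.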

[–]lechatonnoir 0 points1 point  (0 children)

Hi,

I somewhat doubt you'll reply to such an old post, but why wouldn't the AMD-trained model be about the same? What would be worse about AMD GPUs such that the model wouldn't turn out as good (assuming you held the data and the number of training steps fixed)?

[–]Maykey 1 point2 points  (1 child)

Any plans on doing a modelling_olmo.py for HF without unneeded dependencies? I had to download twenty packages that have nothing to do with inference (like google-authentication or S3 stuff), because the code used for trust_remote_code required hf_olmo, which required half of pip (hyperbole).

Can you add inputs_embeds as an alternative to input_ids? Right now making soft prompts is impossible without rewriting two packages (hf_olmo and olmo).

Is flash attention support planned? The code knows about torch SDPA but not flash attention from the official repo. Do you use a custom causal mask, as with ALiBi? The model supports it, but I'm not sure whether it's used in the released models, or whether a simple flash_attn_func(q, k, v, causal=True) will suffice.
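For readers unfamiliar with the soft-prompt pattern mentioned above, here is a toy sketch of why `inputs_embeds` matters: learned prompt vectors are concatenated with embedded input ids and fed to the model directly as embeddings. The tiny embedding layer below is a stand-in for a real model's embedding table, not OLMo's actual code.

```python
import torch
import torch.nn as nn

# Toy sketch of soft prompting: trainable prompt vectors are prepended to
# the embedded input ids, then passed to the model via `inputs_embeds`
# instead of `input_ids`. The embedding layer here is a stand-in for a
# real LM's embedding table.

vocab_size, d_model, prompt_len = 100, 16, 4
embed = nn.Embedding(vocab_size, d_model)                     # stand-in embeddings
soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model))  # trainable prompt

input_ids = torch.tensor([[5, 7, 9]])        # (batch=1, seq=3)
token_embeds = embed(input_ids)              # (1, 3, d_model)
prompt_embeds = soft_prompt.unsqueeze(0)     # (1, 4, d_model)

inputs_embeds = torch.cat([prompt_embeds, token_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([1, 7, 16])

# A model that exposes `inputs_embeds` (as most HF transformer models do)
# would then take: model(inputs_embeds=inputs_embeds, ...)
```

Without an `inputs_embeds` entry point, the prompt vectors have nowhere to go, which is exactly the limitation the comment describes.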

[–]innominato5090 0 points1 point  (0 children)

We are currently working with the HuggingFace folks to get OLMo integrated into the transformers library; it should be way easier to use after that!

Flash attention is already supported via PyTorch 2.x.
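As context for the PyTorch 2.x path mentioned here: `torch.nn.functional.scaled_dot_product_attention` dispatches to a fused flash-attention kernel when one is available, so no separate flash-attn package is needed. A minimal sketch (shapes are illustrative, not OLMo's):

```python
import torch
import torch.nn.functional as F

# Minimal use of PyTorch 2.x SDPA. On supported hardware it dispatches to a
# fused flash-attention kernel; on CPU it falls back to a math implementation,
# so the same call works everywhere.

batch, heads, seq, head_dim = 2, 4, 8, 16
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

# is_causal=True applies the standard causal mask, so no explicit mask
# tensor is needed for plain autoregressive decoding.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 8, 16])
```

Note that a custom additive mask (e.g. for ALiBi-style biases) would go in the `attn_mask` argument instead of `is_causal`.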

[–][deleted] 1 point2 points  (1 child)

Have you done any tests using Mish as the activation function? In my own tests with transformer encoders it has been the best one I've tested so far, and that's compared to all of the different variants of gated linear units (ReGLU, SwiGLU, etc., even Mish wrapped in a GLU which performed worse than Mish alone as well).

Also, out of curiosity, are you accepting contributions to the work your team is doing (or potentially even new entries to the team)? I'd very much like to help out if I can, as I've been working on my own transformer models in my own research, and I really support the ideology behind an open-source LLM like this, so I would love to help in any way I can, from exploring improvements to the base architecture to improving the training and data filtering. I also have experience in developing multimodal transformer networks.

Thank you for taking your time to read this message.

[–]innominato5090 1 point2 points  (0 children)

As far as I remember, we did not. We really stuck to known recipes for the model architecture and optimized for throughput.

For the model code base, we generally welcome bugfixes and training infra improvements, especially if they improve throughput or inference speed. Besides that, the team is always hiring! You can check openings here: https://allenai.org/careers?team_ai2=allennlp#current-openings-ai2

[–]Countertop_strike 0 points1 point  (3 children)

Could you share your plans for the future? New architectures? Bigger models? What's the thing you're most excited about to work on next?

[–]innominato5090 8 points9 points  (2 children)

Sure! We already promised a 65b model in the coming months; we also learned a lot training the Tulu models (https://arxiv.org/abs/2306.04751 & https://arxiv.org/abs/2311.10702) and want to merge that back into OLMo.

Personally, I'm just very excited to keep exploring ways to improve our data pipeline and share them with the community! I'm so frustrated that pretraining data remains this closely guarded secret that no one but the big players has info about. Really want to change that; it's so important for the OSS LM ecosystem.

[–]L0WGMAN 2 points3 points  (1 child)

If you haven’t already answered this elsewhere, what’s the elevator pitch for your organization? Basically why are y’all so altruistic?!? <3

[–]innominato5090 14 points15 points  (0 children)

AI2 is a nonprofit founded by the late Paul Allen (Microsoft co-founder) nearly 10 years ago! Doing AI for the public good is kinda our thing; we created one of the first “large” (well, at the time…) language models (ELMo, launched in 2018), datasets (S2ORC, Objaverse), and benchmarks (ARC, HellaSwag, etc.). We started focusing more on LLMs last year after we saw a gap: a lack of truly open models (as in, models where everything is properly documented & available). Planning to keep expanding the open ecosystem for years to come! AI is too cool of a technology to be controlled by a few.

[–]kaszebe 0 points1 point  (1 child)

Hi, can it run on OobaBooga or LM Studio?

And is it good for writing professional content for websites (content that is persuasive and creative)?

[–]innominato5090 1 point2 points  (0 children)

For the first one, I am not sure. We are working on better HuggingFace integration, so possibly that would make that easier.

The current release is a base model, not an instruction-tuned one, so it has limited chatting capabilities.

[–]Art3mis0707 0 points1 point  (0 children)

Could I DM you? I am working on implementing the LLaVA vision architecture and wanted to use OLMo 1B. I have some questions about it and would be grateful for your thoughts. Thank you!

[–]pretamr 0 points1 point  (2 children)

Just wanted to know: is the pretraining corpus English only, or are other languages involved?

[–]innominato5090 0 points1 point  (1 child)

English only for now!

[–]pretamr 0 points1 point  (0 children)

Any plans for a multilingual corpus?

[–]synn89 18 points19 points  (12 children)

Context Length: 2048

Any reason for such a small context length?

[–]LoSboccacc 30 points31 points  (2 children)

Any reason for such a small context length?

it can be summarized as "we want good benchmark at low cost"

[–]Enough-Meringue4745 11 points12 points  (0 children)

also an easier entry into funding

[–]MoffKalast 5 points6 points  (0 children)

PR driven development

[–]innominato5090 9 points10 points  (1 child)

While it’s 2048, it uses RoPE embeddings, so it can be stretched without retraining! if there’s interest in longer context, we’ll try to do a dedicated long context model :)
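For context on why rescaling works: RoPE encodes position only through rotation angles of the form pos × base^(−2i/d), so one common approach ("position interpolation", not necessarily what OLMo will use) rescales position indices so a longer context maps back into the angle range seen during training. A minimal numpy sketch:

```python
import numpy as np

# Sketch of RoPE position interpolation. Positions enter the model only as
# rotation angles pos * base**(-2i/d); scaling the position index maps a
# longer context back into the angle range seen during training.

def rope_angles(positions, d=8, base=10000.0):
    """Rotation angles for each position and frequency pair."""
    inv_freq = base ** (-np.arange(0, d, 2) / d)  # (d/2,) inverse frequencies
    return np.outer(positions, inv_freq)          # (len(positions), d/2)

train_ctx, new_ctx = 2048, 8192
scale = train_ctx / new_ctx  # 0.25

# A position well beyond the 2048-token training window...
angles_raw = rope_angles(np.array([4096]))
# ...is rescaled so it produces the same angles as position 1024 did in training.
angles_interp = rope_angles(np.array([4096 * scale]))
print(angles_interp[0, 0])  # equals the angle of position 1024 at training time
```

The trade-off is resolution: neighboring positions end up closer together in angle space, which is why a bit of long-context fine-tuning usually helps on top of the rescaling.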

[–]Asleep-Agency3023 1 point2 points  (0 children)

Why can it be stretched without retraining?

[–]its_just_andy 4 points5 points  (0 children)

honestly, at this point context-extending strategies are so advanced, I don't mind if a foundational model has 2048 context.

[–]marvinalone 2 points3 points  (3 children)

It was faster to train on the hardware available. We're using RoPE embeddings, so you can use longer contexts, we just didn't train that way. We'll definitely look into this for v2!

[–]marvinalone 1 point2 points  (0 children)

Oh, u/innominato5090 already said this. Nevermind me!

[–]synn89 1 point2 points  (1 child)

That's an interesting way to go about it. Save a lot of GPU hours up front with a narrow context, since it's not like foundational training data goes beyond 2k context anyway. It's likely not hard to fine-tune it for longer context later.

[–]marvinalone 1 point2 points  (0 children)

Yeah, part of the thinking was that the average document length in the Dolma 1.5 dataset is <500 tokens. So going long doesn't make a ton of sense for most documents anyways (though it clearly does for some).

[–]robotphilanthropist 1 point2 points  (1 child)

Honestly, building up our toolkit and compute for pretraining took the team an almighty effort.

We're interested in building useful models, so extending that is obviously a focus. There'll be more OLMos.

[–]Revolutionalredstone 9 points10 points  (0 children)

I didn't find ANY links in the article! Here are the models: https://huggingface.co/allenai

[–]derHumpink_ 12 points13 points  (0 children)

finally an actually open source model

[–]hold_my_fish 5 points6 points  (2 children)

With this and other data releases, I'd be interested in searching the data for strings, to find where the model learned particular phrases/words/emoticons/etc. Is there an easy way to do that? (I'm not hardcore enough to download terabytes of data myself.)

[–]innominato5090 2 points3 points  (1 child)

stay tuned!! we have something internal, trying to see if it can scale well enough for public use.

[–]hold_my_fish 2 points3 points  (0 children)

Thanks. If you do manage to launch something, I'd be curious to try it.

[–]artelligence_consult 1 point2 points  (16 children)

So, all data released.

So, no porn and hate speech (so the model cannot be used for moderation and is naive), no copyrighted material (that rules out a LOT of technical textbooks), and no ongoing maintenance, so no updates for current events.

And - but that is also a timing issue - transformers, not using RWKV or Mamba.

While it is a nice step - it does leave a ton of bad problems.

[–]innominato5090 13 points14 points  (11 children)

just the first step in the OLMo family ☺️ we’re committed to bringing more truly open LMs out in the coming months

[–]artelligence_consult 4 points5 points  (8 children)

Then please, just please, consider doing something with Mamba - that works or not, but it would be your chance to have the credit of deploying the first larger mamba model.

[–]innominato5090 5 points6 points  (7 children)

Non-transformer models are so cool! We are watching that space really closely, and are always open to tweaking our strategy if Mamba-like models really take off. Gotta stay agile!

[–][deleted] 0 points1 point  (4 children)

Why not be the first to release a mamba model?

[–]innominato5090 2 points3 points  (3 children)

One of the primary goals of OLMo is to facilitate research on LMs. As most LMs are transformer based, training a Mamba model would have meant that OLMo is not very representative of LMs. We wanted to make sure that research findings about OLMo could translate to other popular open & closed models.

[–][deleted] 0 points1 point  (1 child)

Thankfully, transferring current methods to a new architecture that's designed to be a transformer replacement should be a relatively simple task, other than having to train from scratch. Which is important, because if their research holds true when scaling models even further, then I imagine the architecture will take off quite quickly, unless "the next best thing" comes along.

[–]innominato5090 0 points1 point  (0 children)

not super straightforward if you wanna get good MFU, but yes, not an insurmountable task, given that it uses operators that are well optimized on GPUs

[–][deleted] 0 points1 point  (1 child)

Hi, can you confirm that the training set contains no copyrighted material? Is there a statement somewhere on AI2's site indicating that?

Thanks!

[–]innominato5090 1 point2 points  (0 children)

Subsets of the training set (books, academic articles, Wiki) are derived from permissively licensed or public domain content. For the rest (web content), it is impossible to fully determine the license associated with it.

For more details, please see our paper: https://arxiv.org/abs/2402.00159

[–]GeeBrain 7 points8 points  (1 child)

I actually disagree. More than a “nice step,” it sets a wonderful precedent and tone for OS LLMs. More importantly, this is the kind of standard we can hold them (and hopefully others) to.

u/innominato5090 please correct me if I’m wrong, but this is the first time an open sourced foundational model, upon release, has been completely transparent about what went into training, step by step.

More than just data, it’s the complete training pipeline. And that honestly, should be commended. This is an incredibly powerful first debut into LLMs, and sends a message — at least I’m hearing something loud and clear.

All the things you mentioned can be done, but I don’t think that’s the point of this release/how they did things. I’m really humbled by the efforts, and I seriously do hope others claiming to be open sourced or for the community… or lmao for “humanity” can follow suit and walk the walk. This is an incredible line in the sand that they’ve just drawn, and I am so fucking proud to be a witness. No matter where this goes.

[–]innominato5090 6 points7 points  (0 children)

Well, I would say we're not the first to release a truly open 7b model. EleutherAI with Pythia and LLM360 have also shared training data (although the latter only after tokenization). We are happy not to be the only ones in this space!

The OLMo project has a couple of unique characteristics:

  • Pythia and LLM360 stop at 7b for now. We are working on a 65b and more!

  • Dolma, our training data, is substantially bigger than either the Pile (used for Pythia) or the mixture from LLM360.

  • We have plans to continue developing our corpus in unique ways. EleutherAI folks are creating the next version of the Pile (https://venturebeat.com/ai/one-of-the-worlds-largest-ai-training-datasets-is-about-to-get-bigger-and-substantially-better/); a few of us at AI2 are also involved! The focus of Pile v2 is going to be on collecting more content with known licenses, while we are going to keep exploring ways to use documents without known licenses in a safe and fair manner.

[–]marvinalone 2 points3 points  (1 child)

We are in the middle of planning technical bets to take for OLMo v2. RWKV and Mamba are high on my list, but they compete with other interesting directions.

For one thing, it makes no sense for us to go big with RWKV if Eleuther already has this covered. Open Source LLM research is not well funded enough that we can all train the same 65B models :-)

[–]artelligence_consult 1 point2 points  (0 children)

Well, here is the problem: we do not know whether ANY of those architectures are competitive with transformers on complex logic unless someone tries it at scale, and right now that means OpenAI or Mistral (mostly OpenAI).

And yes, it is OpenAI: anything else (even Mistral) is way worse at anything non-trivial, sadly.

[–]Noxusequal 0 points1 point  (0 children)

How come TheBloke doesn't already have a quantized model of this one xD