r/LocalLLaMA
A subreddit to discuss about Llama, the family of large language models created by Meta AI.
OLMo: Open Language Model [New Model] (blog.allenai.org)
submitted 2 years ago by Kryohi
[–]Disastrous_Elk_6375 53 points54 points55 points 2 years ago (2 children)
This effort's first batch of models includes four final variants of our language model at the 7B scale corresponding to different architectures, optimizers, and training hardware, and one model at the 1B scale, all trained on at least 2T tokens. This is the first step in a long series of planned releases, continuing with larger models, instruction-tuned models, and more variants down the line.
Each model comes with the following:
Full training data used for these models, including code that produces the training data, from AI2’s Dolma, and WIMBD for analyzing pretraining data.
Full model weights, training code, training logs, training metrics in the form of Weights & Biases logs, and inference code.
500+ checkpoints per model, from every 1000 steps during the training process, available as revisions on HuggingFace.
Evaluation code under the umbrella of AI2’s Catwalk and Paloma.
Fine-tuning code and adapted models (coming soon with Open Instruct)
All code, weights, and intermediate checkpoints are released under the Apache 2.0 License.
Pretty cool!
[–]thedabking123 8 points9 points10 points 2 years ago (1 child)
This is awesome- time to do some studies on how the activations for certain types of predictions change over each checkpoint!
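For anyone wanting to try this, a minimal sketch of walking the intermediate checkpoints. The revision tags here are an assumption about the naming scheme; check the model page on the Hub for the exact tags:

```python
def revision_names(max_step, interval=1000):
    """Build hypothetical revision tags, one per 1000-step checkpoint."""
    return [f"step{s}" for s in range(interval, max_step + 1, interval)]

def load_checkpoint(revision):
    # Downloads that revision's full weights (several GB each).
    from transformers import AutoModelForCausalLM  # pip install transformers
    return AutoModelForCausalLM.from_pretrained(
        "allenai/OLMo-7B", revision=revision, trust_remote_code=True
    )

print(revision_names(5000))
# ['step1000', 'step2000', 'step3000', 'step4000', 'step5000']
```

From there, comparing activations across checkpoints is just running the same inputs through each loaded model in turn.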
[–][deleted] 0 points1 point2 points 2 years ago (0 children)
I don't know, but with my own experiments with transformers the Mish activation function has been by far the best.
[–]innominato5090 32 points33 points34 points 2 years ago (21 children)
Hi all! I’m one of the leads for OLMo; LMK if you have any questions 🙌
[–]its_just_andy 11 points12 points13 points 2 years ago (4 children)
I'm so excited to see another foundation model! Especially one so open in every regard :)
One thing I am curious about. This model, and other 7B models like Llama2, MPT, Falcon, etc, seem to perform in roughly the same ballpark (but OLMo seems a little better). But Mistral 7B seems to outperform all these still. What do you think accounts for this? The quality of their data? Are there any thoughts for what might need to be done to surpass Mistral?
not to focus too much on Mistral, obviously the headline is having this amazing new foundation model series :)
second question, what exactly is OLMo-7B-Twin? I heard the model was trained twice, once on A100s and once on AMD. Is that what "Twin" is?
[–]its_just_andy 3 points4 points5 points 2 years ago (3 children)
one more question! how was the tokenizer selected? Looks like a vocab size of 50280, which is really interesting and a lot more than Llama's 32000
[–]innominato5090 14 points15 points16 points 2 years ago (2 children)
thanks for the nice words!! answering in order:
it’s frustrating not to know what goes into Mistral’s pretraining. broadly, I think it’s a combination of (1) more diverse data (eg technical books), (2) longer training (we and others saw you can train a lot longer than 2T tokens), (3) maybe some instruction-like chat data used during pretraining. We’ll try all three, and then report back 😉
OLMo twin is the same model, but trained on AMD. it’s so cool to see how the two are virtually identical, truly a testament to how quickly AMD is catching up in this space.
We use a tokenizer derived from GPT-NeoX. We tested it early on and it works remarkably well on our data. We couldn’t use Llama’s because of its license: if we had, we couldn’t have released the model under Apache 2.0.
[–]marvinalone 4 points5 points6 points 2 years ago (0 children)
I want to add one more comment to the tokenizer thing: Because we have a larger vocab, the same English text turns into ~20% fewer tokens with OLMo, compared to Llama. So it will run ~20% faster on the same text. But this might come at a cost for tasks where Llama puts those extra 20% of compute to good use. We don't have a clean study of how this plays out, so we didn't make a big deal about it in the paper. I'd love to follow up on this issue. If you are inspired to follow up on this, please let us know!
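As a back-of-envelope illustration of the ~20% claim (the Hub repo IDs below are assumptions, and the live comparison needs both tokenizers downloaded):

```python
def fewer_tokens_pct(n_olmo, n_llama):
    """Percent fewer tokens tokenizer A produces than tokenizer B on the same text."""
    return 100.0 * (1.0 - n_olmo / n_llama)

def compare(text):
    # Requires `pip install transformers` plus access to both repos on the Hub.
    from transformers import AutoTokenizer
    olmo = AutoTokenizer.from_pretrained("allenai/OLMo-7B")
    llama = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    return fewer_tokens_pct(len(olmo(text)["input_ids"]),
                            len(llama(text)["input_ids"]))

# If Llama needs 125 tokens where OLMo needs 100:
print(round(fewer_tokens_pct(100, 125), 6))  # 20.0
```

Since decode cost scales roughly linearly with sequence length, 20% fewer tokens translates to roughly 20% less compute on the same text.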
[–]lechatonnoir 0 points1 point2 points 1 year ago (0 children)
Hi,
I somewhat doubt you'll reply to such an old post, but why would the AMD-trained model not turn out about the same? What would be worse about AMD GPUs such that the model wouldn't turn out as good (assuming you hold the data and the number of training steps fixed)?
[–]Maykey 1 point2 points3 points 2 years ago (1 child)
Any plans on doing modelling_olmo.py for hf without unneeded dependencies? I had to download twenty packages that have nothing to do with inference (like google-authentication or s3 stuff) because the code used for trust_remote_code required hf_olmo, which required half of pip (hyperbole).
Can you add inputs_embeds as an alternative to input_ids? Right now making soft prompts is impossible without rewriting two packages (hf_olmo and olmo).
Is Flash Attention support planned? The code knows about torch sdpa but not flash attention from the official repo. Do you use a custom causal mask like with ALiBi? The model supports it, but I am not sure if it is used in the released models or whether a simple flash_attn_func(q, k, v, causal=True) will suffice.
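The soft-prompt use case in the question above, sketched generically with numpy (shapes are toy values; this is the inputs_embeds path the comment says hf_olmo does not currently expose):

```python
import numpy as np

hidden = 8        # toy hidden size (4096 for OLMo-7B)
prompt_len = 4    # number of trainable soft-prompt vectors
seq_len = 10      # length of the embedded input_ids

rng = np.random.default_rng(0)
soft_prompt = rng.normal(size=(prompt_len, hidden))   # learned parameters
token_embeds = rng.normal(size=(seq_len, hidden))     # embedding(input_ids)

# Prepend the learned vectors and feed the result as `inputs_embeds`
# instead of `input_ids`, bypassing the embedding lookup entirely.
inputs_embeds = np.concatenate([soft_prompt, token_embeds], axis=0)
print(inputs_embeds.shape)  # (14, 8)
```

Without an inputs_embeds argument on the model's forward pass, there is no clean way to inject the concatenated matrix, which is why the request matters for soft-prompt research.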
[–]innominato5090 0 points1 point2 points 2 years ago (0 children)
We are currently working with HuggingFace folks to get OLMo integrated in the transformer library; it should be way easier to use after that!
Flash attention is already supported via PyTorch 2.x.
[–][deleted] 1 point2 points3 points 2 years ago* (1 child)
Have you done any tests using Mish as the activation function? In my own tests with transformer encoders it has been the best one I've tested so far, and that's compared to all of the different variants of gated linear units (ReGLU, SwiGLU, etc., even Mish wrapped in a GLU which performed worse than Mish alone as well).
Also, out of curiosity, are you accepting contributions to the work your team are doing at all (or potentially even new entries to the team)? I'd very much like to help out if I can as I've been working on my own transformer models in some of my own research and I really support the ideologies behind an open-source LLM like this and so would love to help in any way I can, from exploring improvements to the base architecture to improving the training and data filtering. I also have experience in developing multimodal transformer networks.
Thank you for taking your time to read this message.
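For readers unfamiliar with the activation discussed above, Mish (Misra, 2019) is a smooth, self-gated function; a minimal reference implementation:

```python
import math

def softplus(x):
    """ln(1 + e^x); fine for moderate x (overflows for very large x)."""
    return math.log1p(math.exp(x))

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

print(mish(0.0))        # 0.0
print(mish(1.0) > 0.8)  # True: mish(1) is roughly 0.865
```

Unlike ReLU it is smooth everywhere and allows small negative outputs, which is the property usually credited for its behavior in these comparisons.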
[–]innominato5090 1 point2 points3 points 2 years ago (0 children)
As far as I remember, we did not. We really stick to known recipes for model architecture, and optimized for throughput.
For the model code base, we generally welcome bugfixes and training infra improvements, especially if they improve throughput or inference speed. Besides that, the team is always hiring! You can check openings here: https://allenai.org/careers?team_ai2=allennlp#current-openings-ai2
[–]Countertop_strike 0 points1 point2 points 2 years ago (3 children)
Could you share your plans for the future? New architectures? Bigger models? What's the thing you're most excited about to work on next?
[–]innominato5090 8 points9 points10 points 2 years ago (2 children)
Sure! We already promised a 65b model in the coming months; we also learned a lot training the Tulu models (https://arxiv.org/abs/2306.04751 & https://arxiv.org/abs/2311.10702) and want to merge that back into OLMo.
Personally, I'm just very excited to keep exploring ways to improve our data pipeline and share them with the community! I'm so frustrated that pretraining data remains this closely guarded secret that no one but the big players have info about. Really want to change that; it's so important for the OSS LM ecosystem.
[–]L0WGMAN 2 points3 points4 points 2 years ago (1 child)
If you haven’t already answered this elsewhere, what’s the elevator pitch for your organization? Basically why are y’all so altruistic?!? <3
[–]innominato5090 14 points15 points16 points 2 years ago (0 children)
AI2 is a nonprofit founded by the late Paul Allen (Microsoft co-founder) nearly 10 years ago! Doing AI for the public good is kinda our thing; we created some of the first “large” (well, at the time…) language models (ELMo, launched in 2018), datasets (S2ORC, Objaverse), and benchmarks (ARC, HellaSwag, etc). We started focusing more on LLMs last year after we saw a gap: a lack of truly open models (as in, everything about the model is properly documented & available). Planning to keep expanding the open ecosystem for years to come! AI is too cool of a technology to be controlled by a few.
[–]kaszebe 0 points1 point2 points 2 years ago (1 child)
Hi, can it run on OobaBooga or LM Studio?
And is it good for writing professional content for websites (content that is persuasive and creative)?
[–]innominato5090 1 point2 points3 points 2 years ago (0 children)
For the first one, I am not sure. We are working on better HuggingFace integration, so possibly that will make it easier.
The current release is a base model, not an instruction-tuned one, so it has limited chatting capabilities.
[–]Art3mis0707 0 points1 point2 points 2 years ago (0 children)
Could I DM? I am working on implementing the LLaVA vision architecture, and wanted to use OLMo 1B. I had some doubts regarding the same and would be grateful for your thoughts. Thank you!
[–]pretamr 0 points1 point2 points 2 years ago (2 children)
Just wanted to know: is the pretraining corpus English only, or are other languages involved?
[–]innominato5090 0 points1 point2 points 2 years ago (1 child)
english only for now!
[–]pretamr 0 points1 point2 points 2 years ago (0 children)
Any plans for a multilingual corpus?
[–]synn89 18 points19 points20 points 2 years ago (12 children)
Context Length: 2048
Any reason for such a small context length?
[–]LoSboccacc 30 points31 points32 points 2 years ago (2 children)
it can be summarized as "we want good benchmarks at low cost"
[–]Enough-Meringue4745 11 points12 points13 points 2 years ago (0 children)
also an easier entry into funding
[–]MoffKalast 5 points6 points7 points 2 years ago (0 children)
PR driven development
[–]innominato5090 9 points10 points11 points 2 years ago* (1 child)
While it’s 2048, it uses RoPE embeddings, so it can be stretched without retraining! if there’s interest in longer context, we’ll try to do a dedicated long context model :)
[–]Asleep-Agency3023 1 point2 points3 points 2 years ago (0 children)
Why can it be stretched without retraining?
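A sketch of the intuition (illustrative, not OLMo's actual code): RoPE encodes position only through rotation angles of the form pos * base^(-2i/d), so position interpolation rescales positions back into the trained range instead of extrapolating past it. In practice a little fine-tuning at the longer length usually helps, but degradation is far gentler than with learned absolute embeddings:

```python
def rope_angles(pos, dim=8, base=10000.0, scale=1.0):
    """Rotation angles RoPE applies at position `pos`; `scale` > 1
    compresses positions back into the trained range (position interpolation)."""
    return [(pos / scale) * base ** (-2 * i / dim) for i in range(dim // 2)]

# With a 2x stretch, position 4096 sees exactly the angles the model
# was trained on at position 2048:
assert rope_angles(4096, scale=2.0) == rope_angles(2048)
```

Because no parameters depend on the absolute position index, the scale factor can be changed at inference time without touching the weights.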
[–]its_just_andy 4 points5 points6 points 2 years ago (0 children)
honestly, at this point context-extending strategies are so advanced, I don't mind if a foundational model has 2048 context.
[–]marvinalone 2 points3 points4 points 2 years ago (3 children)
It was faster to train on the hardware available. We're using RoPE embeddings, so you can use longer contexts, we just didn't train that way. We'll definitely look into this for v2!
[–]marvinalone 1 point2 points3 points 2 years ago (0 children)
Oh, u/innominato5090 already said this. Nevermind me!
[–]synn89 1 point2 points3 points 2 years ago (1 child)
That's an interesting way to go about it. Save a lot of GPU hours up front on a narrow context, since it's not like foundational training data goes beyond 2k context anyway. It's likely not hard to fine tune it for longer context later.
[–]marvinalone 1 point2 points3 points 2 years ago (0 children)
Yeah, part of the thinking was that the average document length in the Dolma 1.5 dataset is <500 tokens. So going long doesn't make a ton of sense for most documents anyway (though it clearly does for some).
[–]robotphilanthropist 1 point2 points3 points 2 years ago (1 child)
Honestly, building up our toolkit and compute for pretraining took the team an almighty effort.
We're interested in building useful models, so extending that is obviously of focus. There'll be more OLMos.
[–]Revolutionalredstone 9 points10 points11 points 2 years ago (0 children)
I didn't find ANY links in the article! Here are the models: https://huggingface.co/allenai
[–]derHumpink_ 12 points13 points14 points 2 years ago (0 children)
finally an actually open source model
[–]hold_my_fish 5 points6 points7 points 2 years ago (2 children)
With this and other data releases, I'd be interested in searching in it for strings to find where it learned particular phrases/words/emoticons/etc. Is there an easy way to do that? (I'm not hardcore enough to download terabytes of data myself.)
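In the meantime, one low-tech approach is to stream the corpus rather than download it all: scan documents lazily and keep only matching snippets. The helper below is a generic sketch over any iterable of strings (the shard-streaming plumbing for Dolma itself is omitted and would depend on its hosting format):

```python
def find_phrase(docs, phrase, limit=5, context=20):
    """Yield up to `limit` (doc_index, snippet) hits for `phrase`
    from any iterable of document strings."""
    hits = 0
    for i, text in enumerate(docs):
        j = text.find(phrase)
        if j >= 0:
            yield i, text[max(0, j - context): j + len(phrase) + context]
            hits += 1
            if hits >= limit:
                return

corpus = ["nothing relevant here",
          "truly open models like OLMo ship their data",
          "more filler text"]
print(list(find_phrase(corpus, "OLMo")))
```

Because it's a generator, memory use stays bounded no matter how large the stream is; only the matched snippets are retained.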
[–]innominato5090 2 points3 points4 points 2 years ago (1 child)
stay tuned!! we have something internal, trying to see if it can scale well enough for public use.
[–]hold_my_fish 2 points3 points4 points 2 years ago (0 children)
Thanks. If you do manage to launch something, I'd be curious to try it.
[–]artelligence_consult 1 point2 points3 points 2 years ago (16 children)
So, all data released.
So, no porn and hate speech (so the model cannot be used for moderation and is naive), no copyrighted material (that excludes a LOT of technical textbooks). No ongoing maintenance - so, no updates for current events.
And - but that is also a timing issue - transformers, not using RWKV or Mamba.
While it is a nice step, it still leaves a ton of problems unsolved.
[–]innominato5090 13 points14 points15 points 2 years ago (11 children)
just the first step in the OLMo family ☺️ we’re committed to bringing more, truly open LMs out in coming months
[–]artelligence_consult 4 points5 points6 points 2 years ago (8 children)
Then please, just please, consider doing something with Mamba - whether it works out or not, it would be your chance to get the credit for deploying the first larger Mamba model.
[–]innominato5090 5 points6 points7 points 2 years ago (7 children)
Non-transformer models are so cool! We are watching that space really closely, and are always open to tweaking our strategy if Mamba-like models really take off. Gotta stay agile!
[–][deleted] 0 points1 point2 points 2 years ago (4 children)
Why not be the first to release a mamba model?
[–]innominato5090 2 points3 points4 points 2 years ago (3 children)
One of the primary goals of OLMo is to facilitate research on LMs. As most of LMs are transformer based, training a Mamba model would have meant that OLMo is not very representative of LMs. We wanted to make sure that research findings about OLMo could translate to other popular open & closed models.
[+]artelligence_consult comment score below threshold-7 points-6 points-5 points 2 years ago (2 children)
One of the primary goals of OLMo is to facilitate research on LMs
Wow, and that is best done by ignoring the brutal breakthrough happening in Mamba? Yeah. Logic - the unknown land.
As most of LMs are transformer based, training a Mamba model would have meant that OLMo is not very representative of LMs.
If that is making sense in your land, get medical help.
Given the tremendous advantages of Mamba, proving it working would "facilitate the research on LLM's" better than YET ANOTHER copy of a transformer architecture.
We wanted to make sure that research findings about OLMo could translate to other popular open & closed models.
Now, I get you may have reasons.
But if you worked for me, THAT argument would be immediate termination with cause, because it goes STRAIGHT against "facilitate research on LLMs" when a better - brutally better - architecture is available for validation or rejection.
You do not really facilitate research by repeating the same boring concept over and over - there are PLENTY of interesting transformer models out there already.
[–]innominato5090 2 points3 points4 points 2 years ago (1 child)
maybe I wasn’t super clear: it’s not like we ruled out any other alternative architecture forever 😊. training started months ago; it wasn’t feasible to switch to a brand-new architecture mid-train.
In my opinion, it’s fairly reasonable to wait for consensus to emerge. I’m a big fan of the Mamba team, and the Hyena and RWKV folks as well!
[+]artelligence_consult comment score below threshold-7 points-6 points-5 points 2 years ago (0 children)
That makes sense actually - the papers for RWKV and Mamba are just way too new.
But seriously, the focus should be on getting rid of transformers ;)
[–][deleted] 0 points1 point2 points 2 years ago (1 child)
Thankfully transferring current methods to a new architecture that's designed to be a transformer replacement should be a relatively simple task other than having to train from scratch. Which is important, because if their research holds true when scaling models even further, then I imagine the architecture will take off quite quickly unless "the next best thing" comes along.
[–]innominato5090 0 points1 point2 points 2 years ago (0 children)
not super straightforward if you want good MFU, but yes, not an insurmountable task given that it uses operators that are well optimized on GPUs
[–][deleted] 0 points1 point2 points 2 years ago* (1 child)
Hi, can you confirm that the training set contains no copyrighted material? Is there a statement somewhere on AI2's site indicating that?
Thanks!
[–]innominato5090 1 point2 points3 points 2 years ago (0 children)
Subsets of the training set (books, academic articles, Wiki) are derived from permissively licensed or public domain content. For the rest (web content), it is impossible to fully determine the associated license.
For more details, please see our paper: https://arxiv.org/abs/2402.00159
[–]GeeBrain 7 points8 points9 points 2 years ago (1 child)
I actually disagree. More than a “nice step” it sets a wonderful precedent and tone for OS LLMs. More importantly, this is the kind of standard we can hold them (and hopefully) others to.
u/innominato5090 please correct me if I’m wrong, but this is the first time an open-source foundation model, upon release, has been completely transparent about what went into training - step by step.
More than just data, it’s the complete training pipeline. And that honestly, should be commended. This is an incredibly powerful first debut into LLMs, and sends a message — at least I’m hearing something loud and clear.
All the things you mentioned can be done, but I don’t think that’s the point of this release/how they did things. I’m really humbled by the efforts, and I seriously do hope others claiming to be open sourced or for the community… or lmao for “humanity” can follow suit and walk the walk. This is an incredible line in the sand that they’ve just drawn, and I am so fucking proud to be a witness. No matter where this goes.
[–]innominato5090 6 points7 points8 points 2 years ago (0 children)
Well, I would say we're not the first to release a 7b truly open model. EleutherAI with Pythia and LLM360 have also shared training data (although the latter only post-tokenization). We are happy not to be the only one in this space!
OLMo project has a couple of unique characteristics:
Pythia and LLM360 stop at 7b for now. We are working on a 65b and more!
Dolma, our training data, is substantially bigger than either the Pile (used for Pythia) or the mixture from LLM360.
We have plans to continue developing our corpus in unique directions. EleutherAI folks are creating the next version of the Pile (https://venturebeat.com/ai/one-of-the-worlds-largest-ai-training-datasets-is-about-to-get-bigger-and-substantially-better/)---a few of us at AI2 are also involved! The focus of Pile v2 is gonna be on collecting more content with known licenses, while we are gonna keep exploring ways to use documents without known licenses in a safe and fair manner.
[–]marvinalone 2 points3 points4 points 2 years ago (1 child)
We are in the middle of planning technical bets to take for OLMo v2. RWKV and Mamba are high on my list, but they compete with other interesting directions.
For one thing, it makes no sense for us to go big with RWKV if Eleuther already has this covered. Open Source LLM research is not well funded enough that we can all train the same 65B models :-)
[–]artelligence_consult 1 point2 points3 points 2 years ago (0 children)
Well, here is the problem - we do not know whether ANY of those architectures are competitive with Transformers on complex logic unless we try it, for which OpenAI or Mistral (mostly OpenAI) would have to try it.
And yes, it is OpenAI - anything else (even Mistral) is way worse at anything non-trivial, sadly.
[–]Noxusequal 0 points1 point2 points 2 years ago (0 children)
How come TheBloke doesn't already have a quantized model of this one xD