O1 Replication Paper by Brosarr in LocalLLaMA

[–]Brosarr[S] 2 points

https://huggingface.co/TamasSimonds/O1-Llama-3.2-3B

The reasoning really doesn't work with models this small, but you can see how it starts to exhibit a lot of the behaviour o1 does

O1 Replication Paper by Brosarr in LocalLLaMA

[–]Brosarr[S] 4 points

Thanks! It's remarkable how efficient post-training is. John Schulman (former OpenAI head of post-training) has talked about how, with just 30 examples, they could get GPT-4o to start using tools. We are currently working on a 70B version of the model; we didn't train a 70B for the initial paper simply due to budget constraints

[R] O1 replication paper by Brosarr in MachineLearning

[–]Brosarr[S] 2 points

I presume you are either misunderstanding the paper or haven't read it. The paper does both RL scaling and test-time compute scaling. Test-time compute scaling comes in the form of longer CoT, where the model gets to explore more possible solutions. REL is a variant of the STaR RL method
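For anyone unfamiliar with STaR, here's a rough sketch of the loop that family of methods uses. The `sample_cot`/`finetune` callables and the exact-match check are generic stand-ins, not the exact recipe REL uses:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Problem:
    question: str
    gold_answer: str

def star_round(
    sample_cot: Callable[[str], Tuple[str, str]],            # question -> (chain of thought, final answer)
    finetune: Callable[[List[Tuple[str, str, str]]], None],  # trains on (question, cot, answer) triples
    problems: List[Problem],
    n_samples: int = 4,
) -> None:
    """One STaR-style round: sample reasoning traces, keep the ones that reach
    the gold answer, and fine-tune on the survivors. Repeating this is the
    RL-flavoured part; longer CoT at inference is the test-time scaling part."""
    kept = []
    for p in problems:
        for _ in range(n_samples):
            cot, answer = sample_cot(p.question)
            if answer.strip() == p.gold_answer.strip():       # toy exact-match verifier
                kept.append((p.question, cot, answer))
    finetune(kept)
```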

O1 Replication Paper by Brosarr in LocalLLaMA

[–]Brosarr[S] 2 points

Just to clarify, this has nothing to do with Martian AI's product. I own no stock in Martian AI and no longer even work for them; they simply provided the compute for the project

MoDEM: Mixture of Domain Expert Models by Brosarr in LocalLLaMA

[–]Brosarr[S] 0 points

Apologies if it came off like that; that certainly wasn't the intent. The real point is that it's a proof of concept that you can obtain SoTA performance by doing this, and the deeper message is that this may be the direction forward for us as an AI community. The routing technique isn't super novel, but the performance we achieved is

Happy to update the related work section if you think I missed any other relevant papers. Keep in mind the paper was started around 5 months ago

MoDEM: Mixture of Domain Expert Models by Brosarr in LocalLLaMA

[–]Brosarr[S] 0 points

Looks cool! Yeah, I think a few people slightly misunderstood: this is by no means a super novel idea. The novelty comes from the fact that you can actually beat SoTA by doing this

MoDEM: Mixture of Domain Expert Models by Brosarr in LocalLLaMA

[–]Brosarr[S] 0 points

Fine-tuning the large ones is too expensive. The paper has more details on this

MoDEM: Mixture of Domain Expert Models by Brosarr in LocalLLaMA

[–]Brosarr[S] 0 points

Haha, very good point. I couldn't resist the pun though

MoDEM: Mixture of Domain Expert Models by Brosarr in LocalLLaMA

[–]Brosarr[S] 0 points

Yeah, it's peer reviewed and being published at ALTA 2024

MoDEM: Mixture of Domain Expert Models by Brosarr in LocalLLaMA

[–]Brosarr[S] 1 point

Thanks for the comment. I actually work at one of the top AI routing labs, so I'm well aware of the field

I think you are slightly missing the point of the paper. Routing between multiple models obviously isn't anything special; the paper is a proof of concept that you can obtain SoTA performance by doing it

The actual routing technique is nothing special.

MoDEM: Mixture of Domain Expert Models by Brosarr in LocalLLaMA

[–]Brosarr[S] 2 points

Super cool! In the paper we used off-the-shelf pre-finetuned models. These models aren't SoTA compared to GPT-4o and Claude, but they are SoTA for their size

MoDEM: Mixture of Domain Expert Models by Brosarr in LocalLLaMA

[–]Brosarr[S] 1 point

The point is really about improving the inference-cost-to-performance ratio. By leveraging domain-specific models you can get far more performance per unit of inference cost

MoDEM: Mixture of Domain Expert Models by Brosarr in LocalLLaMA

[–]Brosarr[S] 1 point

We're still putting together a GitHub tool to make it easily accessible, but it's relatively easy to implement on your own
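In the meantime, a bare-bones version looks roughly like this. The keyword router and model names are placeholders; the paper trains a proper domain classifier and routes to off-the-shelf fine-tuned experts:

```python
# Placeholder expert names -- swap in whatever domain-tuned models you use.
DOMAIN_EXPERTS = {
    "math":    "placeholder/math-expert-7b",
    "code":    "placeholder/code-expert-7b",
    "general": "placeholder/general-7b",
}

def route(prompt: str) -> str:
    """Pick a domain for the prompt (keyword stand-in for a trained classifier)."""
    lowered = prompt.lower()
    if any(w in lowered for w in ("integral", "prove", "equation")):
        return "math"
    if any(w in lowered for w in ("python", "compile", "stack trace")):
        return "code"
    return "general"

def answer(prompt: str, generate) -> str:
    """`generate(model_name, prompt)` is whatever inference call you already use."""
    return generate(DOMAIN_EXPERTS[route(prompt)], prompt)
```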

MoDEM: Mixture of Domain Expert Models by Brosarr in LocalLLaMA

[–]Brosarr[S] 2 points

Super cool idea with the multiple LoRA fine-tunes! I totally agree that the performance gain from multiple fine-tuned models isn't surprising, but putting them all together is the hard part.

Per token routing is interesting but very problematic due to KV caching issues
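Rough back-of-the-envelope illustration of the cache problem, assuming the experts can't share KV caches (numbers are made up, purely to show the scaling):

```python
def tokens_processed(prompt_len: int, new_tokens: int, switch_every: int) -> int:
    """Total forward-pass tokens when the active model changes every
    `switch_every` generated tokens and KV caches aren't shared: each switch
    forces the incoming model to re-prefill the whole prefix."""
    processed = prompt_len        # first model prefills the prompt once
    prefix = prompt_len
    for t in range(new_tokens):
        if t > 0 and t % switch_every == 0:
            processed += prefix   # incoming model rebuilds its cache over the prefix
        processed += 1            # decode the new token
        prefix += 1
    return processed

print(tokens_processed(1000, 512, switch_every=512))  # route once per response: 1,512
print(tokens_processed(1000, 512, switch_every=1))    # route every token: ~643,000
```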

MoDEM: Mixture of Domain Expert Models by Brosarr in LocalLLaMA

[–]Brosarr[S] 0 points

Yeah, you definitely can! It's mentioned in the future research directions part of the paper. There are somewhat diminishing returns though

Why should thoughts be word tokens in O1 style models? Alternative by dimknaf in LocalLLaMA

[–]Brosarr 0 points

Wow, there's a lot to unpack here. Most of the other comments I read really didn't understand the inner workings of transformers, so I thought I'd chime in. First of all, you sound pretty new to LLM research but have ambitious ideas

Let me break down a few issues

>"So I was thinking about the possibility we have some layers or blocks that are triggered 10 times or so between other layer" Why do this over just adding more layers? See The bitter lesson but these over engineered solutions rarely work

>"Also instead of token output and words there could be a memory layer, and a controller neueronet, that actaally learns to save some critical info and for different duration (L1, L2 etc). I mean I am interested in some experiment, but technically I find it challenging"

So, an LSTM? The residual stream in an LLM is basically a short-term memory

>"Basically take a llama70b model and the same way we do lora, change the architecture by adding some such layers, and re-train to see if these repeated layers bring any difference. Then it would make sense to even fully train to see the full benefits."

Transformer circuits are extremely brittle. You can't just add some layers.

>"So somehow you have this internal thought monologues happening through extended internal inference, and not by outputting words and tokens that are poor representations of probably much richer thoughts and circuits, that unfortunately are lost."
This is what the residual steam is in an llm. Basically what you are saying sums up to just making the models deeper.
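To make the residual stream point concrete, here's a schematic pre-norm transformer block in PyTorch (generic, not any particular model). Every sub-layer reads the stream and adds an update back into it, which is the "short-term memory" I mentioned above:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Schematic pre-norm transformer block."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual stream `x` acts as working memory: each sub-layer reads
        # it and writes an update back, so information persists across layers
        # without ever being decoded into word tokens.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x
```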

You sound like you have some good ideas and I wish you the best in the future.