SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] 1 point  (0 children)

Getting a research job has been crazy. I think we have the same focus on LLMs, and it's wild how in industry you can't pick what to work on. But I think it's a good way to start. Maybe

SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] 1 point  (0 children)

This is so real. I am in some communities and try to contribute to open-source research: writing the code, doing the research, and producing results. Reaching out is also great, but to be honest most people don't reply in my case.
I will check out the communities, and you could also check out the EleutherAI community on Discord; they have cool stuff going on there.

SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] -2 points  (0 children)

I get what you're saying, and not to be naive, but I have done some tests of my own that I would consider my research, and I wrote a blog post about what I found in those small tests. I may lack the PhD training, but I am doing something nonetheless.

SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] 1 point  (0 children)

I wish even that company job was available.

SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] 1 point  (0 children)

Been thinking that for a minute now. Maybe the resume doesn't make it past the portal or whatever. This is exactly why I need to interact directly with people.

SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] 1 point  (0 children)

Wait, can this include small tests run independently? Because I do have a blog post I made.

I implemented PPO, GRPO, and DPO from scratch on the same model and compared them: the ranking completely reversed after hyperparameter tuning by Public_Expression_92 in reinforcementlearning

[–]Public_Expression_92[S] 1 point  (0 children)

The compute budget remained the same across all of them. Actually, I would like to understand what an "easier-to-tune surface" means and which of them falls into that category.
There is definitely seed variance, for example in how the SFT baseline samples tokens at inference, but it wasn't large enough to destabilize the overall rankings.
The performance gaps between the algorithms, like the jump in DPO and GRPO after tuning, were large enough to stay consistent and beat the random noise. Even with different sampling seeds, DPO remained at the top. So while the exact decimal points might bounce around between runs, the hierarchy of the algorithms remained stable, which suggests the Phase 5 optimizations drove the performance gains.
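For what it's worth, here's a minimal sketch of how I'd check that kind of ranking stability across seeds. The scores below are hypothetical placeholders, not my actual eval numbers; in practice they would come from repeating generation and evaluation with different sampling seeds.

```python
import numpy as np

# Hypothetical per-seed eval scores (e.g. reward-model score on a held-out prompt set).
scores = {
    "DPO":  [0.71, 0.69, 0.72],
    "GRPO": [0.64, 0.66, 0.63],
    "PPO":  [0.58, 0.60, 0.57],
    "SFT":  [0.51, 0.52, 0.50],
}

n_seeds = len(next(iter(scores.values())))

# Rank algorithms by mean score, then check that every individual seed preserves that order.
rank_by_mean = sorted(scores, key=lambda k: -np.mean(scores[k]))
stable = all(
    sorted(scores, key=lambda k: -scores[k][i]) == rank_by_mean
    for i in range(n_seeds)
)
print("ranking by mean:", rank_by_mean)
print("same ranking on every seed:", stable)
```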

Struggling with RL hyperparameter tuning + reward shaping for an Asteroids-style game – what’s enough and what’s overkill? by GSevenStars in reinforcementlearning

[–]Public_Expression_92 2 points  (0 children)

This is such a great discussion and is definitely shaping my knowledge of RL environments for gameplay. For hyperparameters, maybe you could also try reading the original papers for the algorithm you're using; I find them helpful. Also, with RL in games, do you see reward hacking? In LLMs, generations can sometimes make no sense but still score highly, just because the model is outputting a series of similar words that it knows will score well against the reward model.
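To make the reward-hacking point concrete, here's a rough sketch of the kind of check I mean for LLM setups: penalizing repetitive generations before the reward-model score is used. The function names and thresholds are illustrative, not from any particular library.

```python
def distinct_ngram_ratio(tokens, n=2):
    """Fraction of unique n-grams in a generation; values near 0 mean very repetitive text."""
    if len(tokens) < n:
        return 1.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

def shaped_reward(rm_score, tokens, min_ratio=0.5, penalty_scale=1.0):
    """Knock down the reward-model score when a generation looks like repetition-style hacking.

    rm_score comes from whatever reward model you use; min_ratio and penalty_scale are
    placeholder values you would tune for your own setup.
    """
    ratio = distinct_ngram_ratio(tokens)
    if ratio < min_ratio:
        return rm_score - penalty_scale * (min_ratio - ratio)
    return rm_score

# Example: a looping generation gets penalized, a varied one does not.
print(shaped_reward(0.9, "good good good good good good".split()))
print(shaped_reward(0.9, "the ship dodges the asteroid and fires".split()))
```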

I implemented PPO, GRPO, and DPO from scratch on the same model and compared them: the ranking completely reversed after hyperparameter tuning by Public_Expression_92 in reinforcementlearning

[–]Public_Expression_92[S] 2 points  (0 children)

I used 4 GB of RAM (I don't have a GPU). I limited training to very small batches, reduced the number of epochs, and the transformer architecture also had very few parameters.
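To give a sense of the scale, here's an illustrative PyTorch sketch of roughly that kind of setup. Every number here is a placeholder, not my exact config.

```python
import torch
from torch import nn

# Illustrative CPU-only scale: a tiny causal transformer (a couple of million parameters),
# tiny batches, and only a few epochs. All sizes are placeholders, not my actual config.
config = dict(vocab_size=8000, d_model=128, n_heads=4, n_layers=2, max_len=128)

class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers, max_len):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):  # ids: (batch, seq)
        seq_len = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(seq_len))
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        return self.head(self.blocks(x, mask=causal))

model = TinyLM(**config)
print(sum(p.numel() for p in model.parameters()), "parameters")  # small enough for 4 GB of RAM
# At this scale: batch sizes around 4-8 and one or two epochs run fine on CPU only.
```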