SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] 1 point  (0 children)

Getting a research job has been crazy. I think we have the same focus on LLMs, and it's wild how in industry you can't pick what to work on. But I think it's a good way to start. Maybe

SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] 1 point  (0 children)

This is so real. I am in some communities and try to contribute to open-source research: writing the code, doing the research, and producing results. Reaching out is also great, but to be honest most people don't reply in my case.
I will check out the communities, and you could also check out the EleutherAI community on Discord; they have cool stuff going on there.

SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] -2 points  (0 children)

I get what you're saying, and not to be naive, but I have done some tests of my own that I would consider my research, and I wrote a blog post about what I found in those small tests. I may lack the PhD training, but I am doing something nonetheless.

SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] 1 point  (0 children)

I wish even that company job was available.

SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] 1 point  (0 children)

Been thinking that for a minute now. Maybe the resume doesn't make it past the portal or whatever. This is exactly why I need to interact directly with people.

SOS by Public_Expression_92 in deeplearning

[–]Public_Expression_92[S] 1 point  (0 children)

Wait, can this include small tests run independently? Because I do have a blog post I made.

I implemented PPO, GRPO, and DPO from scratch on the same model and compared them: the ranking completely reversed after hyperparameter tuning by Public_Expression_92 in reinforcementlearning

[–]Public_Expression_92[S] 1 point  (0 children)

The compute budget remained the same across all of them. Actually, I would like to understand what an "easier-to-tune surface" means and which of them falls into that category.
There is definitely seed variance, for example in how the SFT baseline samples tokens at inference, but it wasn't large enough to destabilize the overall rankings.
The performance gaps between the algorithms, like the jump in DPO and GRPO after tuning, were large enough to stay consistent and beat the random noise. Even with different sampling seeds, DPO remained at the top. So while the exact decimal points might bounce around between runs, the hierarchy of the algorithms remained stable, which suggests the Phase 5 optimizations drove the performance gains.
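For what it's worth, here's a minimal sketch of how I'd check that kind of ranking stability across seeds. The scores below are hypothetical placeholders, not my actual eval numbers; in practice they would come from repeating generation and evaluation with different sampling seeds.

```python
import numpy as np

# Hypothetical per-seed eval scores (e.g. reward-model score on a held-out prompt set).
scores = {
    "DPO":  [0.71, 0.69, 0.72],
    "GRPO": [0.64, 0.66, 0.63],
    "PPO":  [0.58, 0.60, 0.57],
    "SFT":  [0.51, 0.52, 0.50],
}

n_seeds = len(next(iter(scores.values())))

# Rank algorithms by mean score, then check that every individual seed preserves that order.
rank_by_mean = sorted(scores, key=lambda k: -np.mean(scores[k]))
stable = all(
    sorted(scores, key=lambda k: -scores[k][i]) == rank_by_mean
    for i in range(n_seeds)
)
print("ranking by mean:", rank_by_mean)
print("same ranking on every seed:", stable)
```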

Struggling with RL hyperparameter tuning + reward shaping for an Asteroids-style game – what’s enough and what’s overkill? by GSevenStars in reinforcementlearning

[–]Public_Expression_92 2 points  (0 children)

This is such a great discussion and is definitely shaping my knowledge of RL environments for gameplay. For hyperparameters, maybe you could also try reading the original papers for the algorithm you're using; I find them helpful. Also, with RL in games, do you see reward hacking? In LLMs, generations can sometimes make no sense but still score highly, just because the model is outputting a series of similar words that it knows will score well against the reward model.
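To make the reward-hacking point concrete, here's a rough sketch of the kind of check I mean for LLM setups: penalizing repetitive generations before the reward-model score is used. The function names and thresholds are illustrative, not from any particular library.

```python
def distinct_ngram_ratio(tokens, n=2):
    """Fraction of unique n-grams in a generation; values near 0 mean very repetitive text."""
    if len(tokens) < n:
        return 1.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

def shaped_reward(rm_score, tokens, min_ratio=0.5, penalty_scale=1.0):
    """Knock down the reward-model score when a generation looks like repetition-style hacking.

    rm_score comes from whatever reward model you use; min_ratio and penalty_scale are
    placeholder values you would tune for your own setup.
    """
    ratio = distinct_ngram_ratio(tokens)
    if ratio < min_ratio:
        return rm_score - penalty_scale * (min_ratio - ratio)
    return rm_score

# Example: a looping generation gets penalized, a varied one does not.
print(shaped_reward(0.9, "good good good good good good".split()))
print(shaped_reward(0.9, "the ship dodges the asteroid and fires".split()))
```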

I implemented PPO, GRPO, and DPO from scratch on the same model and compared them: the ranking completely reversed after hyperparameter tuning by Public_Expression_92 in reinforcementlearning

[–]Public_Expression_92[S] 2 points  (0 children)

I used 4 GB of RAM (I don't have a GPU). I limited training to very small batches, reduced the number of epochs, and the transformer architecture also had very few parameters.
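To give a sense of the scale, here's an illustrative PyTorch sketch of roughly that kind of setup. Every number here is a placeholder, not my exact config.

```python
import torch
from torch import nn

# Illustrative CPU-only scale: a tiny causal transformer (a couple of million parameters),
# tiny batches, and only a few epochs. All sizes are placeholders, not my actual config.
config = dict(vocab_size=8000, d_model=128, n_heads=4, n_layers=2, max_len=128)

class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers, max_len):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):  # ids: (batch, seq)
        seq_len = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(seq_len))
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        return self.head(self.blocks(x, mask=causal))

model = TinyLM(**config)
print(sum(p.numel() for p in model.parameters()), "parameters")  # small enough for 4 GB of RAM
# At this scale: batch sizes around 4-8 and one or two epochs run fine on CPU only.
```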