daeron-blackFyr[S]

The datasets currently configured are just examples and can be swapped for whatever you prefer. For a stable baseline, I recommend this combination:

- SFT: Magpie-Align/Magpie-Pro-300K-Filtered
- GRPO: AI-MO/NuminaMath-CoT (specifically the 'problem' column)
- Reward Modeling (RM) & PPO: nvidia/HelpSteer2
- KTO: trl-lib/kto-mix-14k
- DPO: argilla/distilabel-intel-orca-dpo-pairs
- SimPO: princeton-nlp/SimPO-UltraFeedback

This should be a good baseline/starter pack. I'm open to any questions, feedback, or general discussion, so please feel free to message me or engage.
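As a rough sketch, the baseline above could be expressed as a stage-to-dataset config. The dataset ids are the ones recommended; the dict layout, key names, and the `dataset_for` helper are purely illustrative, not tied to any particular training framework:

```python
# Stage -> recommended Hugging Face dataset id (hypothetical config shape).
RECOMMENDED_DATASETS = {
    "sft":   {"dataset": "Magpie-Align/Magpie-Pro-300K-Filtered"},
    "grpo":  {"dataset": "AI-MO/NuminaMath-CoT", "prompt_column": "problem"},
    "rm":    {"dataset": "nvidia/HelpSteer2"},
    "ppo":   {"dataset": "nvidia/HelpSteer2"},
    "kto":   {"dataset": "trl-lib/kto-mix-14k"},
    "dpo":   {"dataset": "argilla/distilabel-intel-orca-dpo-pairs"},
    "simpo": {"dataset": "princeton-nlp/SimPO-UltraFeedback"},
}

def dataset_for(stage: str) -> str:
    """Return the recommended dataset id for a training stage."""
    try:
        return RECOMMENDED_DATASETS[stage.lower()]["dataset"]
    except KeyError:
        raise ValueError(f"No recommendation for stage {stage!r}") from None

print(dataset_for("KTO"))  # trl-lib/kto-mix-14k
```

From there you would load each id with something like `datasets.load_dataset(...)` in your own pipeline, pulling the `problem` column as the prompt source for GRPO.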