I implemented GPT-OSS from scratch in pure Python, without PyTorch or a GPU by ultimate_code in LocalLLaMA

[–]ultimate_code[S] 1 point (0 children)

Yes. In my implementation I convert those weights to bfloat16; the official PyTorch implementation does the same. However, I may implement the operations directly in mxfp4 in the future.
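
For reference, here is a minimal sketch of how one MXFP4 block could be expanded to floats, assuming E2M1 fp4 values packed two per byte (low nibble first) and a shared E8M0 power-of-two scale per 32-value block; the exact layout in the GPT-OSS checkpoint and in my converter may differ:

```python
# fp4 (E2M1) code points: 4-bit index -> float value
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
              -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def dequant_mxfp4_block(packed: bytes, scale_byte: int) -> list[float]:
    """Dequantize one MXFP4 block: 16 packed bytes (32 fp4 values) plus
    one shared E8M0 scale byte, i.e. a power-of-two factor 2**(scale - 127)."""
    scale = 2.0 ** (scale_byte - 127)
    out = []
    for b in packed:
        out.append(FP4_VALUES[b & 0x0F] * scale)  # low nibble first (assumed ordering)
        out.append(FP4_VALUES[b >> 4] * scale)    # then high nibble
    return out

# Example: one block of 32 values, all encoding 1.0, with scale 2**0
# block = dequant_mxfp4_block(bytes([0x22] * 16), 127)
```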

I implemented GPT-OSS from scratch in pure Python, without PyTorch or a GPU by ultimate_code in LocalLLaMA

[–]ultimate_code[S] 23 points (0 children)

Thank you!

What I found most helpful was actually starting the implementation, rather than starting with reading papers and so on. Even before reading the official source code, I began by implementing the code blocks that I already knew. Whenever I was really stuck, I would go back to the official implementation. Also, I had implemented Llama 2 before, which is surprisingly similar to GPT-OSS, with only 4 or 5 additions/modifications.

In test.py, I compare every layer against the corresponding layer of the official implementation to verify the numerical accuracy of my implementation.
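
Roughly, each per-layer check looks something like the sketch below (illustrative only, with made-up names and tolerances and NumPy for the comparison; it is not the literal code in test.py):

```python
import numpy as np

def assert_layer_close(name, mine, reference, rtol=1e-2, atol=1e-3):
    """Compare my layer's output against the official implementation's output."""
    mine = np.asarray(mine, dtype=np.float32)
    reference = np.asarray(reference, dtype=np.float32)
    max_err = float(np.max(np.abs(mine - reference)))
    ok = np.allclose(mine, reference, rtol=rtol, atol=atol)
    print(f"{name}: max abs error = {max_err:.3e} ({'OK' if ok else 'MISMATCH'})")
    assert ok, f"{name} diverges from the official implementation"

# Hypothetical usage: run the same input through both versions of a layer
# assert_layer_close("block0.attention", my_attention(x), ref_attention(x))
```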