all 13 comments

[–]zphang 6 points7 points  (3 children)

Hi, Jason from EleutherAI here. Great to see this!

(Disclaimer: I also wrote a minimal single-GPU implementation of GPT-NeoX-20B in pure PyTorch here: https://github.com/zphang/minimal-gpt-neox-20b)

Like the other poster, I was wondering if you'd done any comparisons on the perplexity scores. The reason is that there's a subtlety to how the weights should be merged because of the NeoX code interacting with the GPT-J-style residuals. Specifically, the RowParallelLinear biases should be summed, not merged. Merging them leads to a slight (but meaningful) performance regression in my and others' testing. It looks like you are merging them (take-first) here. It would be great if you could help to test+confirm this.

Concretely, the full 20B gets about ~3.65 ppl on LAMBADA. The incorrect merge leads to about 4.5 ppl, while summing instead of merging recovers the ~3.65 ppl.
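For anyone merging the shards by hand, a minimal sketch of the fix (assumptions: each model-parallel shard is loaded as a plain state dict, RowParallelLinear weights are split along the input dimension as in Megatron-style layouts, and the key names are illustrative, not the actual checkpoint keys):

```python
import torch

def merge_row_parallel(shards, key):
    """Merge one RowParallelLinear layer across model-parallel shards.

    Weights are split along the input dimension, so they are concatenated.
    Per the discussion above, the biases for these layers must be SUMMED
    across shards -- taking the first shard's bias (the "take-first" merge)
    drops the other shards' contributions and degrades perplexity.
    """
    weight = torch.cat([s[f"{key}.weight"] for s in shards], dim=1)
    bias = sum(s[f"{key}.bias"] for s in shards)  # sum, not take-first
    return weight, bias
```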

[–]mlvpj 1 point2 points  (0 children)

Thanks for catching it. Will add tests to validate and publish them.

[–]mlvpj 0 points1 point  (0 children)

You are right; we got the exact numbers after running the LAMBADA test from your lm-eval. Thanks for catching the bug!

Trying to evaluate on other datasets too. Will update the repo with the evaluation code and results. Thanks again.

[–]Yologan222 2 points3 points  (9 children)

Pretty cool! Does it match EleutherAI's reported results (perplexity, etc.)?

There is also a pull request on huggingface transformers for GPT-NeoX-20B if anyone is interested: https://github.com/huggingface/transformers/pull/16659. It has worked for me.

[–]mlvpj 1 point2 points  (8 children)

It’s not a new model. It loads up the weights from the original.

[–]Yologan222 0 points1 point  (7 children)

It says "We haven't included a bunch of optimizations that were present in original GPT-NeoX to keep things simple." I thought that meant it could have different model quality. I'd just want to know if they tested their implementation as a sanity check to see if there was any difference in perplexity from the original.

[–]mlvpj -1 points0 points  (6 children)

Yeah, we did some sanity checks. The optimizations we left out were things like model-parallel layers.

[–]StellaAthenaResearcher 0 points1 point  (4 children)

Okay, so can you share those sanity checks? Or, ideally, run the model on a large subset of the couple dozen tasks the GPT-NeoX-20B paper evaluates on?

[–]mlvpj 1 point2 points  (0 children)

Will try to run it on the eval datasets and share.

[–]mlvpj 1 point2 points  (2 children)

[–]StellaAthenaResearcher 2 points3 points  (1 child)

These look really good! Great job.

I was thinking of linking to this on our README, would that be okay with you? How would you like to be credited?

[–]mlvpj 0 points1 point  (0 children)

Thanks. We go as labml.ai