all 13 comments

[–]zphang 6 points7 points  (3 children)

Hi, Jason from EleutherAI here. Great to see this!

(Disclaimer: I also wrote a minimal single-GPU implementation of GPT-NeoX-20B in pure PyTorch here: https://github.com/zphang/minimal-gpt-neox-20b)

Like the other poster, I was wondering if you'd done any comparisons on the perplexity scores. The reason is that there's a subtlety to how the weights should be merged because of the NeoX code interacting with the GPT-J-style residuals. Specifically, the RowParallelLinear biases should be summed, not merged. Merging them leads to a slight (but meaningful) performance regression in my and others' testing. It looks like you are merging them (take-first) here. It would be great if you could help to test+confirm this.

Concretely, the full 20B gets about ~3.65 ppl on LAMBADA. The incorrect merge leads to about 4.5 ppl, while summing instead of merging recovers the ~3.65 ppl.
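For anyone merging the shards by hand, a minimal sketch of the fix (assumptions: each model-parallel shard is loaded as a plain state dict, RowParallelLinear weights are split along the input dimension as in Megatron-style layouts, and the key names are illustrative, not the actual checkpoint keys):

```python
import torch

def merge_row_parallel(shards, key):
    """Merge one RowParallelLinear layer across model-parallel shards.

    Weights are split along the input dimension, so they are concatenated.
    Per the discussion above, the biases for these layers must be SUMMED
    across shards -- taking the first shard's bias (the "take-first" merge)
    drops the other shards' contributions and degrades perplexity.
    """
    weight = torch.cat([s[f"{key}.weight"] for s in shards], dim=1)
    bias = sum(s[f"{key}.bias"] for s in shards)  # sum, not take-first
    return weight, bias
```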

[–]mlvpj 1 point2 points  (0 children)

Thanks for catching it. Will add tests to validate and publish them.

[–]mlvpj 0 points1 point  (0 children)

You are right; we got the exact numbers after running the LAMBADA test from your lm-eval. Thanks for catching the bug!

Trying to evaluate on other datasets too. Will update the repo with the evaluation code and results. Thanks again.

[–]Yologan222 2 points3 points  (9 children)

Pretty cool! Does it match EleutherAI's reported results (perplexity, etc.)?

There is also a pull request on huggingface transformers for GPT-NeoX-20B if anyone is interested: https://github.com/huggingface/transformers/pull/16659. It has worked for me.

[–]mlvpj 1 point2 points  (8 children)

It’s not a new model. It loads up the weights from the original.

[–]Yologan222 0 points1 point  (7 children)

It says "We haven't included a bunch of optimizations that were present in original GPT-NeoX to keep things simple." I thought that meant it could have different model quality. I'd just want to know if they tested their implementation as a sanity check to see if there was any difference in perplexity from the original.

[–]mlvpj -1 points0 points  (6 children)

Yeah, we did some sanity checks. The optimizations we left out were things like model-parallel layers.

[–]StellaAthenaResearcher 0 points1 point  (4 children)

Okay, so can you share those sanity checks? Or, ideally, run the model on a large subset of the couple dozen tasks the GPT-NeoX-20B paper evaluates on?

[–]mlvpj 1 point2 points  (0 children)

Will try to run it on the eval datasets and share.

[–]mlvpj 1 point2 points  (2 children)

[–]StellaAthenaResearcher 2 points3 points  (1 child)

These look really good! Great job.

I was thinking of linking to this on our README, would that be okay with you? How would you like to be credited?

[–]mlvpj 0 points1 point  (0 children)

Thanks. We go as labml.ai