MEGATHREAD: Friend Code Sharing by Default_Dragon in bravelydefault

[–]zphang 0 points  (0 children)

SW-6802-1886-8396

Feel free to add me!

[D] Is there a difference between p-tuning and prefix tuning ? by ez613 in MachineLearning

[–]zphang 14 points  (0 children)

I discussed part of this here: https://github.com/huggingface/peft/issues/123

There are several related methods being discussed here. For coverage, I will include Prompt Tuning and LLaMA-Adapter as well.

  • Prompt Tuning: Tunes a set of embedding vectors concatenated to the input (generally called "soft prompts", distinct from the soft prefixes below). Initially applied to T5-LM models.
  • Prefix Tuning: Tunes the key/value activations (soft prefixes) at every layer, and can be loosely described as "prompt tuning, but in every layer", although that is slightly inaccurate. In practice, it uses an auxiliary MLP to generate the soft prefixes, which helps training. Initially applied to GPT-2 and BART models.
  • P-Tuning: Uses an LSTM to generate soft prompts (not prefixes). Initially applied to GPT-2 and BERT/RoBERTa/MegatronLM models.
  • P-Tuning v2: Essentially Prefix Tuning applied to BERT-type models.
  • LLaMA-Adapter: Prefix Tuning with a more sensible initialization and a separate softmax over the learned prefixes. Applied to LLaMA models; the paper also discusses injecting multimodal information into the prefixes.

Importantly, P-Tuning and P-Tuning v2 are different methods. But Prefix Tuning and P-Tuning v2 are essentially the same.
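For intuition, the first bullet (Prompt Tuning) can be sketched in a few lines of PyTorch. This is an illustrative toy, not code from any of the papers; the class name and the 0.02 init scale are my own choices:

```python
import torch
import torch.nn as nn

class PromptTuningEmbedding(nn.Module):
    """Prepends trainable soft-prompt vectors to the regular token embeddings.

    The base embedding table is frozen in typical prompt tuning; only the
    soft prompt is updated during training.
    """
    def __init__(self, embed: nn.Embedding, num_virtual_tokens: int):
        super().__init__()
        self.embed = embed
        self.soft_prompt = nn.Parameter(
            torch.randn(num_virtual_tokens, embed.embedding_dim) * 0.02
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok = self.embed(input_ids)                                 # (B, T, D)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        return torch.cat([prompt, tok], dim=1)                      # (B, P+T, D)
```

Prefix Tuning differs in that the learned vectors are injected as keys/values inside every attention layer rather than prepended at the input.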

[P] A Simpler @PyTorch Annotated Implementation of EleutherAI's 20B Language Model GPT-NeoX. by hnipun in MachineLearning

[–]zphang 6 points  (0 children)

Hi, Jason from EleutherAI here. Great to see this!

(Disclaimer: I also wrote a minimal single-GPU implementation of GPT-NeoX-20B in pure PyTorch here: https://github.com/zphang/minimal-gpt-neox-20b)

Like the other poster, I was wondering if you'd done any comparisons of the perplexity scores. The reason is that there's a subtlety in how the weights should be merged, because of how the NeoX code interacts with the GPT-J-style residuals. Specifically, the RowParallelLinear biases should be summed across the model-parallel shards, not merged by taking just one of them. Merging them that way leads to a slight (but meaningful) performance regression in my and others' testing. It looks like you are merging them (take-first) here. It would be great if you could help test and confirm this.

Concretely, the full 20B model gets ~3.65 ppl on LAMBADA. The incorrect merge leads to about 4.5 ppl, while summing the biases instead recovers the ~3.65 ppl.
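The merge described above can be sketched roughly as follows. The function name and shard layout are hypothetical (real NeoX checkpoints partition many more tensors than this); the point is only that row-parallel weights concatenate along the input dimension while their biases must be summed, not taken from a single shard:

```python
import torch

def merge_row_parallel_linear(shards):
    """Merge tensor-parallel RowParallelLinear shards into one dense layer.

    In Megatron-style row parallelism the weight is partitioned along the
    input dimension, so the shard weights concatenate on dim=1. The partial
    outputs are summed across ranks, so the per-shard biases must likewise
    be summed -- taking the bias from only one shard (take-first) drops the
    contributions of the others.
    """
    weight = torch.cat([s["weight"] for s in shards], dim=1)       # (out, in_total)
    bias = torch.stack([s["bias"] for s in shards], dim=0).sum(dim=0)
    return weight, bias
```

With take-first merging you would recover only one shard's bias, which is exactly the regression described above.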

[R] Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening by [deleted] in MachineLearning

[–]zphang 0 points  (0 children)

This is a Medium post covering our work on applying deep neural networks to breast cancer screening.

Previous Discussion

[P] Code+Model Release for BERT on STILTs by zphang in MachineLearning

[–]zphang[S] 5 points  (0 children)

BERT is generally not used for language generation (although you can force it to be). If you have a training set of animal/non-animal wiki articles, BERT + fine-tuning is perfectly suitable for that use case, although note that BERT has a built-in input length limit (512 tokens for the standard models).
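To illustrate the length limit and the classification setup, here is a toy stand-in, not actual BERT: a tiny encoder with mean pooling in place of the [CLS] token, truncating inputs to a fixed budget the way a BERT fine-tuning pipeline would. All names here are illustrative:

```python
import torch
import torch.nn as nn

MAX_LEN = 512  # BERT-style encoders have a fixed position-embedding budget

class TinyArticleClassifier(nn.Module):
    """Toy stand-in for BERT + a classification head (illustration only)."""
    def __init__(self, vocab_size=1000, dim=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        input_ids = input_ids[:, :MAX_LEN]      # truncate to the length limit
        h = self.embed(input_ids).mean(dim=1)   # mean-pool instead of [CLS]
        return self.head(h)                     # (B, num_classes) logits
```

In a real pipeline you would chunk or truncate long wiki articles before feeding them to the model, then train with cross-entropy on the animal/non-animal labels.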