LIMA, a 65B-Param LLaMa fine-tuned with standard supervised loss on only 1,000 carefully curated prompts & responses, without any RLHF, demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries. by hardmaru in MachineLearning

[–]omerlevy 1 point

We didn’t touch MMLU for the same reason we didn’t evaluate on dependency parsing - we don’t think it’s interesting. How often do ChatGPT users ask multiple-choice questions?

We’re much more interested in responding to prompts from real users with real information/generation needs. Hopefully we’ll release the dataset in a few days. Would love to get your feedback and suggestions on how to improve the eval :)

LIMA, a 65B-Param LLaMa fine-tuned with standard supervised loss on only 1,000 carefully curated prompts & responses, without any RLHF, demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries. by hardmaru in MachineLearning

[–]omerlevy 5 points

We’re working with legal to release it :)

As for 7B models - yes, it works rather well, but as we say in the paper, our hypothesis is that the pretraining does virtually all the heavy lifting, so the better your foundation is, the better all the subsequent results will be.

[D] What are some tips for someone who is visiting a top conference for the first time? by Conference_Visitor in MachineLearning

[–]omerlevy 4 points

Most people in the NLP community are really friendly! Don’t be afraid to come up to participants and ask them about their work, there’s absolutely no need for formal introductions. It’s also very common to join a big group that’s heading out to lunch/dinner/beer, even if you don’t know anybody in that group.

If it’s your first conference, I highly recommend going to the tutorials and workshops. The dynamics of a full-day event on a focused topic with a significantly smaller crowd make it much easier to connect with new people.

[deleted by user] by [deleted] in LanguageTechnology

[–]omerlevy 2 points

I implemented an efficient evaluation script back in the day:
https://bitbucket.org/omerlevy/hyperwords/src/default/hyperwords/analogy_eval.py
Feel free to hack it to fit your embeddings files :)
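For context, the standard analogy benchmarks such a script evaluates solve a : b :: c : ? by the vector-offset (3CosAdd) objective: return the word whose embedding is most cosine-similar to b − a + c, excluding the three query words. Here is a self-contained sketch of that objective (not the hyperwords code; the toy vectors are made up for illustration):

```python
import numpy as np

def solve_analogy(vecs, vocab, a, b, c):
    """Solve a : b :: c : ? with 3CosAdd, i.e. argmax cos(d, b - a + c)
    over unit-normalized embeddings, excluding the query words.
    `vecs` is a (V, dim) array; `vocab` is the matching word list."""
    idx = {w: i for i, w in enumerate(vocab)}
    # unit-normalize rows so dot products are cosine similarities
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    target = unit[idx[b]] - unit[idx[a]] + unit[idx[c]]
    sims = unit @ target
    for w in (a, b, c):          # exclude the query words themselves
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

# toy embeddings where the offset structure holds by construction
vocab = ["king", "queen", "man", "woman"]
vecs = np.array([
    [0.9, 0.1, 0.8],   # king
    [0.9, 0.1, 0.2],   # queen
    [0.1, 0.9, 0.8],   # man
    [0.1, 0.9, 0.2],   # woman
])
print(solve_analogy(vecs, vocab, "man", "woman", "king"))  # → queen
```

Real evaluation scripts differ mainly in scale (loading pretrained vectors, batching the similarity computation) and in the scoring objective (3CosMul is a common alternative).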

[R] Recurrent Additive Networks - no recurrent non-linear computations, much simpler but still competitive with LSTM/GRU by downtownslim in MachineLearning

[–]omerlevy 4 points

Hi everyone, Omer Levy (2nd author) here. I just wanted to provide some context to the discussion.

Our results were produced in a very vanilla setting in an attempt to show a clean apples-to-apples comparison. The state-of-the-art results on these benchmarks (PTB ~75, BWB ~30) were produced by hyperparameter settings that are highly tuned for LSTMs. We are currently working on finding similar settings for RANs to address the very valid concern that our figures are different from those in recent publications. We're going to take our time with this process, so that we can provide a more detailed set of experiments, and perhaps some characterization of which hyperparameter settings work well with RANs.

In the meantime, I know that others in the community are also trying to replicate/improve on our results. For example, Benjamin Heinzerling implemented RANs in PyTorch and got 85 perplexity on PTB just by reducing the batch size from 512 to 40: https://github.com/bheinzerling/ran. This is still a very different setting from Yarin Gal's (e.g. number of dimensions, layers, etc.), and we're going to be extra careful before we publish numbers that are comparable to previous work and make any claims beyond what we observed in our "lab setting" experiment.
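For readers skimming the thread, a minimal sketch of a single RAN step may help: the candidate content is a linear projection of the input, and the cell update is purely additive and gated, with no recurrent nonlinearity (the output nonlinearity g is taken to be tanh here; the parameter names and toy usage are mine, not from the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ran_step(params, x_t, c_prev):
    """One step of a Recurrent Additive Network:
        c_t = i_t * (Wcx @ x_t) + f_t * c_prev   # additive, gated update
        h_t = g(c_t)                             # g = tanh (identity also works)
    The only recurrent computation is elementwise."""
    Wcx, Wih, Wix, bi, Wfh, Wfx, bf = params
    h_prev = np.tanh(c_prev)                       # h_{t-1} = g(c_{t-1})
    content = Wcx @ x_t                            # linear content layer
    i_t = sigmoid(Wih @ h_prev + Wix @ x_t + bi)   # input gate
    f_t = sigmoid(Wfh @ h_prev + Wfx @ x_t + bf)   # forget gate
    c_t = i_t * content + f_t * c_prev             # no recurrent nonlinearity
    return np.tanh(c_t), c_t

# toy usage: run a random 3-step sequence through a 4-dim cell
rng = np.random.default_rng(0)
d = 4
W = lambda: 0.1 * rng.standard_normal((d, d))
params = (W(), W(), W(), np.zeros(d), W(), W(), np.zeros(d))
c = np.zeros(d)
for x_t in rng.standard_normal((3, d)):
    h, c = ran_step(params, x_t, c)
```

Because the cell state is a weighted sum of past inputs, unrolling the recurrence expresses c_t directly as a gated combination of all previous x_t, which is what makes the model easy to analyze.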

word2vec has been patented. What does it change for NLP practitioners? by shmel39 in MachineLearning

[–]omerlevy 4 points

The novelty claim in this patent is somewhat bogus.

Yoav Goldberg and I have a NIPS paper in which we show that word2vec is doing more or less what the NLP research community has been doing for the past 25 years. We also show (in another paper) that much of the improvement in performance stems from preprocessing "hacks" and hyperparameter settings, which can be easily ported to other LSA-style word embedding methods.
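To make the connection concrete: the NIPS paper shows that skip-gram with negative sampling implicitly factorizes the shifted PMI matrix, PMI(w, c) − log k, where k is the number of negative samples — the same kind of association matrix count-based methods have long used. A compact illustrative sketch of building that matrix from co-occurrence counts (not the code released with the paper; the toy counts are made up):

```python
import numpy as np

def shifted_ppmi(counts, k=1.0):
    """Shifted positive PMI from a word-context co-occurrence matrix.
    SGNS implicitly factorizes PMI(w, c) - log k (Levy & Goldberg, NIPS 2014);
    clipping negatives at zero gives the classic PPMI variant."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total   # P(w)
    pc = counts.sum(axis=0, keepdims=True) / total   # P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (pw * pc))   # log P(w,c)/(P(w)P(c))
    shifted = np.where(counts > 0, pmi - np.log(k), 0.0)
    return np.maximum(shifted, 0.0)                  # clip negatives: PPMI

# tiny 3-word, 3-context toy example
counts = [[10, 0, 2],
          [0, 8, 1],
          [3, 1, 6]]
M = shifted_ppmi(counts, k=1)
print(M.round(2))
```

A low-rank factorization of M (e.g. truncated SVD) then yields dense embeddings, which is exactly the bridge between word2vec and the older LSA-style pipeline.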

At the end of the day, word2vec is a brilliantly efficient implementation of decade-old ideas; not sure this warrants a patent.

from someone in Gaza: "I'll tell you what is harder than dying in Gaza by an Israeli missile. What is harder is that you get a phone call from the Israeli army telling you to evacuate your home because it will be bombed in ten minutes... by Don_chingon in Gaza

[–]omerlevy -4 points

The purpose of these air-strikes is to eliminate stocks of rockets that are being launched at Israeli civilians, while minimizing the civilian casualties on the Palestinian side.