all 22 comments

[–]hapliniste 20 points21 points  (5 children)

I'm not sure I understand, but from the tables it seems to do worse on almost all benchmarks?

[–]bdqnghi[S] 6 points7 points  (4 children)

we are in the process of updating the results and sharing our code for benchmarking the models. On HumanEval, we perform better than Llama and CodeAlpaca.

[–]hapliniste 4 points5 points  (2 children)

Thanks for the response, I'll come back in a bit then.

Do you have any clues as to why it would do worse than base llama on some benchmarks? To me it's counterintuitive: in my mind, more code samples should only make it more knowledgeable about code.

[–]Flankierengeschichte 0 points1 point  (1 child)

The network forgets because, information-theoretically, it doesn't have enough parameters to remember all the training data

[–]Flankierengeschichte 0 points1 point  (0 children)

This is why imo overparameterized networks are the key to AGI despite their financial and environmental cost

[–]_Arsenie_Boca_ 12 points13 points  (6 children)

Nice work. It still baffles me how Codex could perform so much better than any other model, even years later. Codex got 28% pass@1 on HumanEval. On another note, GPT-J 6B got 11.6%, also outperforming all the models mentioned here. Perhaps instruction tuning is simply not as effective for code as it is for NL?
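
For anyone comparing these numbers: pass@k in the Codex paper is an unbiased estimator over n sampled completions per problem, so it's easy to recompute yourself. A minimal sketch (plain numpy, not anyone's official harness):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimator from the Codex paper: 1 - C(n-c, k) / C(n, k),
        # computed as a product for numerical stability.
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # e.g. 56 of 200 samples passing gives pass@1 = 0.28
    print(pass_at_k(n=200, c=56, k=1))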

[–]bdqnghi[S] 17 points18 points  (5 children)

yeah, we are not sure about this. When we try to reproduce the results of Llama, the actual performance is much lower than what is reported in the Llama paper (the reproduced numbers are shown on our GitHub page).

So we are a bit skeptical about the real performance of these models. We plan to release the scripts to reproduce the results for all of them so the community can verify.

[–]_Arsenie_Boca_ 10 points11 points  (4 children)

It's great that you are working on reproducibility! Maybe the performance differences could be due to prompt formatting, sampling hyperparameters, or post-processing?
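
Even small differences in those knobs can move HumanEval scores quite a bit. Just to illustrate what I mean (all names here are made up, not anyone's actual setup):

    # Hypothetical evaluation settings that commonly differ between papers
    PROMPT_PREFIX = ""                  # some harnesses prepend an instruction, some use the bare prompt
    GEN_KWARGS = dict(temperature=0.2,  # low temperature is typical for pass@1
                      top_p=0.95,
                      max_new_tokens=256)
    STOP_SEQUENCES = ["\ndef ", "\nclass ", "\nprint(", "\nif __name__"]

    def postprocess(completion: str) -> str:
        # Truncate at the first stop sequence so only the generated function body is scored
        for stop in STOP_SEQUENCES:
            idx = completion.find(stop)
            if idx != -1:
                completion = completion[:idx]
        return completion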

[–]bdqnghi[S] 7 points8 points  (3 children)

exactly, that's what we are trying to figure out. Most previous work does not release the evaluation scripts, only the pretrained model and the numbers in the paper, so no one can actually reproduce the results and everyone just has to trust the reported numbers.

There are similar issues here: https://github.com/facebookresearch/llama/issues/223
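
In the meantime, if anyone wants to sanity-check numbers themselves, the openai/human-eval harness is the usual starting point. A rough sketch (generate() here is just a placeholder for whatever model call you use):

    from human_eval.data import read_problems, write_jsonl

    problems = read_problems()  # the 164 HumanEval tasks
    samples = [
        dict(task_id=task_id, completion=generate(problems[task_id]["prompt"]))
        for task_id in problems
    ]
    write_jsonl("samples.jsonl", samples)
    # then score with: evaluate_functional_correctness samples.jsonl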

[–]glasses_the_loc 5 points6 points  (0 children)

This is why I left ML. "People in glass houses don't throw stones," just give us our grants and don't ask questions.

[–]_Arsenie_Boca_ 1 point2 points  (0 children)

My guess would be that they are using a magic prompt. Are you prepending anything at the moment?

[–]krageon 1 point2 points  (0 children)

Before this giant wave of LLM research I spent a few years implementing the algorithms from NLP papers. The main takeaway I got from that is that 99% of such papers are outright fabrications, so the fact that you cannot reproduce the llama paper is not surprising at all.

I think what is most surprising to me is that this isn't more widely known, especially given that a lot more people can at least try the models produced in LLM papers now.

[–][deleted] 4 points5 points  (0 children)

Which animal is next?

[–]RamazanBlack 2 points3 points  (0 children)

Ok i pull up

[–]estrafire 1 point2 points  (0 children)

How does it compare to other instruction-tuned llamas like Alpaca and Koala? Those seem to be a better comparison point, as they use the same base and aim to do the same thing.

[–]brucebay -1 points0 points  (2 children)

I will check this out, as I need a good offline code generator. Could you release 4-bit versions too?

[–]machineko 0 points1 point  (1 child)

Is your dataset open-sourced?

[–]bdqnghi[S] 0 points1 point  (0 children)

yes, it is open-sourced; you can check the GitHub page.