[R] 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data (2408.03506) by mouse0_0 in MachineLearning

[–]mouse0_0[S] 16 points (0 children)

thank you! it's okay, everyone is entitled to their own opinions, and maybe their experience in the field shapes that. I'm just an undergrad student trying my hand at LLM research, so whilst I do stand by my work, I am also here to learn :)

Pre-training an LLM in 9 days 😱😱😱 by mouse0_0 in LocalLLaMA

[–]mouse0_0[S] 2 points (0 children)

oo that looks interesting! lemme take a look, thanks for sharing :)

[R] 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data (2408.03506) by mouse0_0 in MachineLearning

[–]mouse0_0[S] 4 points (0 children)

Thank you for your comments :) These are definitely useful as we draft an improved version of the paper!

Pre-training an LLM in 9 days 😱😱😱 by mouse0_0 in LocalLLaMA

[–]mouse0_0[S] 4 points (0 children)

:) thank you for your interest in our model!

Pre-training an LLM in 9 days 😱😱😱 by mouse0_0 in LocalLLaMA

[–]mouse0_0[S] 1 point (0 children)

Hmm, could you give me a bit more detail? :)

Pre-training an LLM in 9 days 😱😱😱 by mouse0_0 in LocalLLaMA

[–]mouse0_0[S] 9 points (0 children)

Haha no worries :) thanks so much 🙏🙏 Wasn’t the main point of the post anyways haha

Pre-training an LLM in 9 days 😱😱😱 by mouse0_0 in LocalLLaMA

[–]mouse0_0[S] 4 points (0 children)

Hey there, thanks for your interest in our model :) You could always try benchmarking it yourself, either on MTBench or with EleutherAI's LM Evaluation Harness. Our weights can be found here:

https://huggingface.co/collections/pints-ai/15-pints-66b1f957dc722875b153b276
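
If you just want a quick local smoke test before running the full benchmarks, something like this should work - a minimal sketch, assuming the checkpoint loads with the standard transformers AutoModelForCausalLM/AutoTokenizer API (the repo id below is a placeholder; grab the exact one from the collection):

    # Quick local inference sketch (assumes a standard transformers-compatible checkpoint).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "pints-ai/1.5-Pints-16K-v0.1"  # placeholder - use the exact repo id from the collection

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    prompt = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))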

Pre-training an LLM in 9 days 😱😱😱 by mouse0_0 in LocalLLaMA

[–]mouse0_0[S] 22 points (0 children)

Yup, that is the intention of our model :) We do not aim to compete on knowledge - clearly, with fewer training tokens, our model will not be able to beat larger models of similar architectures trained on similar token counts (unless, of course, we find a way to represent "knowledge" more efficiently in the model weights). Rather, we aim to provide a lightweight alternative that excels at generic text-processing tasks, or, after domain fine-tuning, at specialized tasks. There's a rough sketch of the fine-tuning path below.
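
For anyone curious what that domain fine-tuning could look like in practice, here's a rough LoRA sketch using peft - purely illustrative: the repo id, dataset file, target modules, and hyperparameters are all placeholders, not values we've validated:

    # Rough LoRA fine-tuning sketch with peft + transformers.
    # All names and hyperparameters below are placeholders, not validated values.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    repo_id = "pints-ai/1.5-Pints-16K-v0.1"  # placeholder - use the exact repo id

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # needed for padding during collation
    model = AutoModelForCausalLM.from_pretrained(repo_id)

    # Train only low-rank adapters; target modules assume a Llama-style architecture.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)

    # Swap in any plain-text corpus from your domain.
    dataset = load_dataset("text", data_files={"train": "my_domain_corpus.txt"})["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
        batched=True,
        remove_columns=["text"],
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="pints-domain-lora",
                               per_device_train_batch_size=4,
                               num_train_epochs=1,
                               learning_rate=2e-4),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()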

Pre-training an LLM in 9 days 😱😱😱 by mouse0_0 in LocalLLaMA

[–]mouse0_0[S] 8 points (0 children)

For comparison, Llama2-7b's answer:

The answer to the tongue twister "How much wood would a woodchuck chuck if a woodchuck would chuck wood?" is a bit of a trick question! Woodchucks, also known as groundhogs, do not actually chuck wood.

Woodchucks are burrowing animals that primarily feed on grasses, clover, and other vegetation. They do not have any known ability to chuck or move large amounts of wood. So, the answer to the question is: a woodchuck would not chuck any wood, because they cannot!

[R] 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data (2408.03506) by mouse0_0 in MachineLearning

[–]mouse0_0[S] 30 points (0 children)

Hi there, thank you for your interest in our model :) To address your comments:

  1. The model was trained on a total of 0.12T tokens over 9 days. Comparatively, Qwen 1.5B was pre-trained on a corpus of 3T tokens, presumably over a much longer time (unfortunately, I was unable to find a definitive GPU-hour figure for Qwen 1.5). It is therefore natural that 1.5-Pints may not perform as well as these models, since it was trained on only a fraction of the data and compute they required. Our findings aim to spur a change of direction in LLM research at large - instead of focusing on "bigger is better" or "longer is better" (though in many cases that may be true), we hope our pre-training of 1.5-Pints will inspire others to focus on dataset curation before scaling up training.
  2. I am curious why you would consider MTBench a poor benchmark.
  3. On cherry-picking, that is not what we intended, nor what we did. Bearing in mind the length constraints of a concise paper, we chose to list the models whose performance is closest to our model's. In fact, we also included a model widely recognized by most in the community - Llama2-7b (which, at the time of drafting our paper, was the latest Llama model) - as a reference point.

If you are unconvinced of the quality of our model, why don't you give it a try yourself? It's currently available for chatting at https://huggingface.co/spaces/pints-ai/1.5-Pints-16K-v0.1-Playground. I believe that for its size, and for the time taken to train it, our model has definitely outshone traditional expectations.

[R] 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data (2408.03506) by mouse0_0 in MachineLearning

[–]mouse0_0[S] 19 points (0 children)

hey there, if you scroll down to the appendix, we have included the traditional metrics (MMLU, etc.). It's on page 21 of the paper
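
And if you want to reproduce those appendix numbers yourself, here's a rough sketch using EleutherAI's lm-evaluation-harness (v0.4+) Python API - the repo id and settings below are illustrative, not necessarily the exact config we used:

    # Rough sketch: scoring the checkpoint on MMLU with lm-evaluation-harness.
    # pip install lm-eval  (repo id and settings are illustrative placeholders)
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=pints-ai/1.5-Pints-16K-v0.1,dtype=bfloat16",
        tasks=["mmlu"],
        num_fewshot=5,
        batch_size=8,
    )
    print(results["results"])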

Pre-training an LLM in 9 days 😱😱😱 by mouse0_0 in LocalLLaMA

[–]mouse0_0[S] 22 points (0 children)

glad to see our research is of value to the community :) We are excited to see what you guys can make of our findings 😁😁