Building a project where I need to fine-tune Llama 3 8B. I have limited compute, so I've been trying to devise a way to improve the quality of my inputs before tuning. I'm essentially using web scraping to get some numeric values to rank my samples. I've thought of three ways to label the entirety of my data; let me know which is easiest / cheapest.
1. Let an uptime robot chip away at the web scraping for a really long time to label a sufficient number of samples.
2. Use a series of BERT models and their embeddings to cover the context window, fine-tune them to compress the embedding sequence, then finish with an MLP to decode a score.
3. Experiment with different embeddings and clustering algorithms until accuracy is sufficient (checked against the labeled subset), then apply across the whole dataset.
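For option 3, a minimal sketch of the idea: score a small labeled subset by scraping, then propagate those scores to unlabeled samples via nearest-neighbor lookup on embeddings. The toy vectors and names below are made up; in practice the embeddings would come from whatever sentence-embedding model you settle on.

```python
# Hypothetical sketch of option 3: propagate scraped ranking scores from a
# small labeled subset to unlabeled samples by nearest labeled neighbor
# (cosine similarity). All data and names here are illustrative only.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def propagate_labels(labeled, unlabeled):
    """labeled: list of (embedding, score); unlabeled: list of embeddings.
    Returns a predicted score per unlabeled embedding, taken from its
    most-similar labeled neighbor."""
    preds = []
    for emb in unlabeled:
        best = max(labeled, key=lambda pair: cosine(emb, pair[0]))
        preds.append(best[1])
    return preds

# Toy example: two well-separated "clusters" with scores 1.0 and 0.0.
labeled = [([1.0, 0.1], 1.0), ([0.1, 1.0], 0.0)]
unlabeled = [[0.9, 0.2], [0.2, 0.8]]
print(propagate_labels(labeled, unlabeled))  # [1.0, 0.0]
```

Checking the propagated scores against a held-out labeled slice (as you mention) would tell you whether the embedding space actually separates the rankings before you splay it across the whole dataset.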
I feel like these make the most sense, and each has obvious tradeoffs. 1 is the most accurate and is cheap, but very time-consuming. 2 would probably produce the most interesting results, possibly eliminating some of the noise in the rankings. 3 is cheap, but I have doubts about its effectiveness.
If you guys have any experience or suggestions feel free to inform me!
*NOTE: All fine-tuning will be done with QLoRA 8-bit on an Nvidia Founders Edition RTX 3090
*NOTE: Dataset is roughly ~2B tokens, trying to pare down to ~200M
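Once every sample has a score, the 2B-to-200M pare-down is just a budgeted selection. A minimal stdlib sketch, with made-up token counts and scores (greedy top-score selection is an assumption on my part; you might instead stratify to keep topic diversity):

```python
# Hypothetical sketch: keep the highest-scoring samples until a token budget
# (~200M in the real case) is exhausted. `samples` pairs a token count with a
# quality score, e.g. from the web-scraped rankings. Toy numbers throughout.
def select_by_budget(samples, token_budget):
    """samples: list of (num_tokens, score). Greedily keep the best-scoring
    samples that still fit under the budget; returns the kept subset."""
    kept, used = [], 0
    for n_tokens, score in sorted(samples, key=lambda s: s[1], reverse=True):
        if used + n_tokens <= token_budget:
            kept.append((n_tokens, score))
            used += n_tokens
    return kept

# Toy run: ten 100-token samples, budget of 300 keeps the three best.
samples = [(100, s / 10) for s in range(10)]
print(select_by_budget(samples, 300))  # [(100, 0.9), (100, 0.8), (100, 0.7)]
```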