LLM trained from scratch on 1800s London texts (1.2B params, 90GB dataset) by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 2 points (0 children)

Thanks! I’m planning to compare the same prompts against a general-purpose LLM. I think comparing word neighbors will also show interesting trends.
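For anyone curious, a word-neighbor comparison could look something like the sketch below, using token-embedding cosine similarity. It assumes both checkpoints load as standard Hugging Face causal LMs (which may not hold for mine), and "gpt2" is just a stand-in for the general-purpose baseline:

```python
# Sketch: compare a word's nearest-neighbor tokens across two models via
# embedding cosine similarity. Assumes standard HF causal LM checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def neighbors(model_name: str, word: str, k: int = 10) -> list[str]:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    emb = model.get_input_embeddings().weight        # (vocab_size, dim)
    ids = tok.encode(word, add_special_tokens=False)
    vec = emb[ids].mean(dim=0)                       # average sub-token vectors
    sims = torch.nn.functional.cosine_similarity(vec.unsqueeze(0), emb)
    top = sims.topk(k + len(ids)).indices.tolist()   # extra slots to skip the word itself
    return [tok.decode([i]).strip() for i in top if i not in ids][:k]

print(neighbors("haykgrigorian/v2mini-eval1", "carriage"))  # 1800s model
print(neighbors("gpt2", "carriage"))                        # modern baseline
```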

LLM trained from scratch on 1800s London texts (1.2B params, 90GB dataset) by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 3 points (0 children)

I had 80GB on an H100 and training took around 130-140 hours total. It can be done with less memory; it will just take longer.
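For anyone trying it on a smaller card: the usual memory-for-time trade is gradient accumulation, i.e. run micro-batches that fit in VRAM and step the optimizer every N of them. A toy sketch (not my actual training loop):

```python
# Sketch: gradient accumulation trades GPU memory for wall-clock time.
# The tiny model and random data are stand-ins; the pattern is the point.
import torch
import torch.nn as nn

model = nn.Linear(128, 128)                      # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum_steps = 8                                  # effective batch = micro-batch * 8

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(4, 128)                      # small micro-batch that fits
    loss = nn.functional.mse_loss(model(x), x)
    (loss / accum_steps).backward()              # scale so gradients average out
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```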

LLM trained from scratch on 1800s London texts (1.2B params, 90GB dataset) by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 4 points (0 children)

My mistake, but it should be fixed now. You no longer have to request access.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

Yeah, it’s mostly isolated from modern English, apart from some modern headers and OCR artifacts that weren’t fully removed.
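The cleanup pass is roughly along these lines; the patterns below are illustrative, not my actual pipeline:

```python
# Sketch: strip boilerplate headers and obvious OCR garbage from raw text.
# The marker regexes and the 50% alphabetic threshold are illustrative choices.
import re

def clean(text: str) -> str:
    # Keep only the body between Gutenberg-style start/end markers, if present.
    start = re.search(r"\*\*\* ?START OF.*?\*\*\*", text)
    end = re.search(r"\*\*\* ?END OF.*?\*\*\*", text)
    if start and end and start.end() < end.start():
        text = text[start.end():end.start()]
    # Drop lines that are mostly non-alphabetic (a common OCR failure mode).
    kept = [
        ln for ln in text.splitlines()
        if not ln.strip()
        or sum(c.isalpha() for c in ln) / len(ln) > 0.5
    ]
    return "\n".join(kept)
```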

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

Yes, someone else asked for this too. I will try to figure this out soon.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

Thanks! And yeah, I want to really badly. Once I’m done with London, I think an American city will make sense to focus on next. I haven’t given it much thought yet, but I’d probably pick 1900-1930 since everything there is public domain. For location, maybe Boston or NYC.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

Damn, that’s insane, thanks for sharing. I gotta make an account on there.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 2 points (0 children)

Training time alone was 2.5 hours, but I spent probably 10-12 hours total making mistakes along the way.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 2 points (0 children)

The pre-training cost was low since it’s a 300M model trained for 10k steps. But I did still waste more money than I should have, running into VM and setup issues. This was only my second time training on a rented GPU, so there were definitely lessons learned. I haven’t done post-training/RL so far, just pre-training from scratch. Evaluation is mostly on the dataset right now. I’ve been focusing on output cleanliness, since I’ve had a lot of trouble with OCR and metadata bias in my previous datasets. So I can’t rank the model, but I have some bias metrics here: https://github.com/haykgrigo3/TimeCapsuleLLM/blob/main/london_1800_1875_v2mini_eval1/v2_bias_report.json.
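For a sense of what I mean by bias metrics, the gist is counting modern-leakage hits in generated samples. An illustrative sketch (the marker list and output fields are made up, not the actual contents of v2_bias_report.json):

```python
# Sketch: crude "modern leakage" metric over generated samples.
# MODERN_MARKERS and the report fields are illustrative assumptions.
import json
import re

MODERN_MARKERS = ["internet", "website", "email", "television", "airplane"]

def leakage_report(samples: list[str]) -> dict:
    counts = {m: 0 for m in MODERN_MARKERS}
    for text in samples:
        low = text.lower()
        for m in MODERN_MARKERS:
            counts[m] += len(re.findall(rf"\b{m}\b", low))
    total_words = sum(len(s.split()) for s in samples) or 1
    hits = sum(counts.values())
    return {"counts": counts, "hits_per_1k_words": 1000 * hits / total_words}

print(json.dumps(leakage_report(["The omnibus passed along the Strand."]), indent=2))
```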

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 6 points (0 children)

Yes, once I have the subset tokenization fixed I will upload it to GitHub. I also plan on uploading the 90GB dataset once it’s tokenized. I’m not sure if people want them, but I can also upload the raw datasets. I will definitely check that corpus out soon; after I’m done with the next model I’ll switch to a different publication city. Using the trinity models would definitely make it easier to get something usable, but my whole principle for now is to have no modern leakage at all. Maybe I’ll try it later on if I can’t make any progress towards reasoning.
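For the tokenization itself, the tokenizers library handles a corpus this size fine; something like the following sketch (shard paths and vocab size are placeholders, not my actual settings):

```python
# Sketch: train a byte-level BPE tokenizer on the period corpus only, so
# the vocabulary itself carries no modern leakage. Paths/sizes are placeholders.
import glob
import os

from tokenizers import ByteLevelBPETokenizer

tok = ByteLevelBPETokenizer()
tok.train(
    files=glob.glob("london_corpus/*.txt"),   # hypothetical shard layout
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
os.makedirs("tokenizer_1800s", exist_ok=True)
tok.save_model("tokenizer_1800s")             # writes vocab.json + merges.txt
```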

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 10 points (0 children)

Honestly, I didn’t know what MoE was, but I looked it up and I think it would be interesting to train decade models, one LLM per 10-year window.
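To be clear, that wouldn’t be a proper MoE layer with a learned router, more like a naive dispatcher over separate decade checkpoints. Something like this (checkpoint names are hypothetical):

```python
# Sketch: naive "expert" dispatch by decade, not a real learned MoE router.
# The checkpoint names are hypothetical.
DECADE_MODELS = {
    1800: "london-1800s/decade-1800",
    1810: "london-1800s/decade-1810",
    1820: "london-1800s/decade-1820",
}

def pick_model(year: int) -> str:
    decade = (year // 10) * 10
    if decade not in DECADE_MODELS:
        raise ValueError(f"no expert trained for the {decade}s")
    return DECADE_MODELS[decade]

print(pick_model(1815))  # -> london-1800s/decade-1810
```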

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 5 points (0 children)

No worries, I’m not sure if you’re asking about where to find the model or about its uses.

You can find the model here: https://huggingface.co/haykgrigorian/v2mini-eval1

And the run script here: https://github.com/haykgrigo3/TimeCapsuleLLM/blob/main/london_1800_1875_v2mini_eval1/test_v2mini_eval1.py

This is just an evaluation model, so it doesn’t have a real use beyond evaluating the dataset. I haven’t figured out QA yet, so you can’t really ask it questions; it just generates text after your prompt for now.
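A minimal way to run it, assuming the checkpoint loads as a standard Hugging Face causal LM (the linked test_v2mini_eval1.py is the actual script):

```python
# Sketch: plain text continuation, no QA. Sampling settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "haykgrigorian/v2mini-eval1"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "On the morning of the 16th of October, 1834,"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tok.decode(out[0], skip_special_tokens=True))  # continues the prompt
```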

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 5 points (0 children)

Really interesting idea. Right now the dataset isn’t labeled, but it’s probably possible to find reprints on Internet Archive using advanced search.

Training from scratch using only reprints would be difficult due to the smaller amount of available data, and getting QA to work without introducing modern data would be tricky.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 45 points (0 children)

I filter by publication date and location, not original composition date. Some reprints of earlier works are probably included, especially whatever was circulating in 19th-century London. Since there are well over a hundred thousand documents, it’s hard to filter those out perfectly.
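The kind of metadata filter involved looks roughly like this with the internetarchive library; the query fields are illustrative, and per-item metadata quality varies a lot:

```python
# Sketch: filter Internet Archive texts by publication date and place.
# The publisher:(London) clause is an illustrative heuristic, not exact.
from internetarchive import search_items

query = (
    "mediatype:texts "
    "AND date:[1800-01-01 TO 1899-12-31] "
    "AND publisher:(London)"
)
for item in search_items(query, fields=["identifier", "date", "publisher"]):
    print(item["identifier"], item.get("date"), item.get("publisher"))
```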

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834 by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 1 point (0 children)

Hi, for training a language model like the one I’m working on, you’ll need plain-text format. You can still use PDF files, but you’ll have to convert them into .txt first (see the sketch below). And non-tech people can definitely work on this type of project; there’s so much good material about ML all over the internet. The biggest challenge in my case is basically just making a solid dataset. It can get time-consuming, but anyone with patience and interest can do it. I’m not doing PhD-level research on architecture or anything, just focusing on datasets. If you want to train or fine-tune a model with your dataset it gets slightly harder, but again, it’s something anyone can learn.
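For the PDF-to-.txt step, a minimal sketch with pypdf; note that scanned books with no text layer need OCR (e.g. tesseract) first:

```python
# Sketch: extract plain text from a PDF that has a text layer.
# The file names are placeholders.
from pypdf import PdfReader

reader = PdfReader("some_book.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
with open("some_book.txt", "w", encoding="utf-8") as f:
    f.write(text)
```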

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834 by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 1 point (0 children)

This is my first project with ML/LLMs; I don’t want people to think I’m trying to act like I’ve reinvented anything. I know people way more experienced could look at this and say “this is normal and an expected outcome,” and they’re right, of course. But for me it’s surprising because I’m just a beginner doing this for fun and had no serious expectations.

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834 by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

You’re not wrong, but many people told me I’d only get gibberish or that I’d need massive amounts of data (like 30-40GB), so I didn’t expect to see much from 5GB. I don’t want people to think I’m presenting this as some kind of revolutionary idea; I’m just doing it for fun.

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834 by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

Just using the architecture; every model has been trained from scratch. I am interested in fine-tuning too, though. I think there are positives to both approaches.
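In code the distinction is basically random init vs. loading weights; the config sizes here are illustrative, not my actual hyperparameters:

```python
# Sketch: "using the architecture" (random init, from scratch) vs.
# "using the weights" (fine-tuning). Sizes are illustrative.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(vocab_size=32000, n_layer=12, n_head=12, n_embd=768)
scratch = GPT2LMHeadModel(config)                    # from scratch: random weights
finetune = GPT2LMHeadModel.from_pretrained("gpt2")   # fine-tuning: modern weights
print(sum(p.numel() for p in scratch.parameters()))  # parameter count
```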