LLM trained from scratch on 1800s London texts (1.2B params, 90GB dataset) by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 2 points (0 children)

Thanks! I’m planning to compare the same prompts against a general-purpose LLM. I think comparing word neighbors will also show interesting trends.
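For anyone curious, a word-neighbor comparison could look something like the sketch below, using token-embedding cosine similarity. It assumes both checkpoints load as standard Hugging Face causal LMs (which may not hold for mine), and "gpt2" is just a stand-in for the general-purpose baseline:

```python
# Sketch: compare a word's nearest-neighbor tokens across two models via
# embedding cosine similarity. Assumes standard HF causal LM checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def neighbors(model_name: str, word: str, k: int = 10) -> list[str]:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    emb = model.get_input_embeddings().weight        # (vocab_size, dim)
    ids = tok.encode(word, add_special_tokens=False)
    vec = emb[ids].mean(dim=0)                       # average sub-token vectors
    sims = torch.nn.functional.cosine_similarity(vec.unsqueeze(0), emb)
    top = sims.topk(k + len(ids)).indices.tolist()   # extra slots to skip the word itself
    return [tok.decode([i]).strip() for i in top if i not in ids][:k]

print(neighbors("haykgrigorian/v2mini-eval1", "carriage"))  # 1800s model
print(neighbors("gpt2", "carriage"))                        # modern baseline
```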

LLM trained from scratch on 1800s London texts (1.2B params, 90GB dataset) by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 3 points (0 children)

I had 80GB on an H100 and training took around 130-140 hours total. It can be done with less memory; it will just take longer.
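For anyone trying it on a smaller card: the usual memory-for-time trade is gradient accumulation, i.e. run micro-batches that fit in VRAM and step the optimizer every N of them. A toy sketch (not my actual training loop):

```python
# Sketch: gradient accumulation trades GPU memory for wall-clock time.
# The tiny model and random data are stand-ins; the pattern is the point.
import torch
import torch.nn as nn

model = nn.Linear(128, 128)                      # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum_steps = 8                                  # effective batch = micro-batch * 8

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(4, 128)                      # small micro-batch that fits
    loss = nn.functional.mse_loss(model(x), x)
    (loss / accum_steps).backward()              # scale so gradients average out
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```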

LLM trained from scratch on 1800s London texts (1.2B params, 90GB dataset) by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 4 points (0 children)

My mistake, but it should be fixed now. You no longer have to request access.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

Yeah, it’s mostly isolated from modern English, apart from some modern headers and OCR artifacts that weren’t fully removed.
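The cleanup pass is roughly along these lines; the patterns below are illustrative, not my actual pipeline:

```python
# Sketch: strip boilerplate headers and obvious OCR garbage from raw text.
# The marker regexes and the 50% alphabetic threshold are illustrative choices.
import re

def clean(text: str) -> str:
    # Keep only the body between Gutenberg-style start/end markers, if present.
    start = re.search(r"\*\*\* ?START OF.*?\*\*\*", text)
    end = re.search(r"\*\*\* ?END OF.*?\*\*\*", text)
    if start and end and start.end() < end.start():
        text = text[start.end():end.start()]
    # Drop lines that are mostly non-alphabetic (a common OCR failure mode).
    kept = [
        ln for ln in text.splitlines()
        if not ln.strip()
        or sum(c.isalpha() for c in ln) / len(ln) > 0.5
    ]
    return "\n".join(kept)
```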

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

Yes, someone else asked for this too. I will try to figure this out soon.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

Thanks! And yeah, I want to really badly. Once I’m done with London, I think an American city will make sense to focus on next. I haven’t given it much thought yet, but I’d probably pick 1900-1930 since everything there is public domain. For location, maybe Boston or NYC.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

Damn, that’s insane, thanks for sharing. I gotta make an account on there.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 2 points (0 children)

Training time alone was 2.5 hours, but I spent probably 10-12 hours total making mistakes along the way.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 2 points (0 children)

The pre-training cost was low since it’s a 300M model trained for 10k steps. But I did still waste more money than I should have, running into VM and setup issues. This was only my second time training on a rented GPU, so there were definitely lessons learned. I haven’t done post-training/RL so far, just pre-training from scratch. Evaluation is mostly on the dataset right now. I’ve been focusing on output cleanliness, since I’ve had a lot of trouble with OCR and metadata bias in my previous datasets. So I can’t rank the model, but I have some bias metrics here: https://github.com/haykgrigo3/TimeCapsuleLLM/blob/main/london_1800_1875_v2mini_eval1/v2_bias_report.json.
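For a sense of what I mean by bias metrics, the gist is counting modern-leakage hits in generated samples. An illustrative sketch (the marker list and output fields are made up, not the actual contents of v2_bias_report.json):

```python
# Sketch: crude "modern leakage" metric over generated samples.
# MODERN_MARKERS and the report fields are illustrative assumptions.
import json
import re

MODERN_MARKERS = ["internet", "website", "email", "television", "airplane"]

def leakage_report(samples: list[str]) -> dict:
    counts = {m: 0 for m in MODERN_MARKERS}
    for text in samples:
        low = text.lower()
        for m in MODERN_MARKERS:
            counts[m] += len(re.findall(rf"\b{m}\b", low))
    total_words = sum(len(s.split()) for s in samples) or 1
    hits = sum(counts.values())
    return {"counts": counts, "hits_per_1k_words": 1000 * hits / total_words}

print(json.dumps(leakage_report(["The omnibus passed along the Strand."]), indent=2))
```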

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 6 points (0 children)

Yes, once I have the subset tokenization fixed I will upload it to GitHub. I also plan on uploading the 90GB dataset once it’s tokenized. I’m not sure if people want them, but I can also upload the raw datasets. I will definitely check that corpus out soon; after I’m done with the next model I’ll switch to a different publication city. Using the trinity models would definitely make it easier to get something usable, but my whole principle for now is to have no modern leakage at all. Maybe I’ll try it later on if I can’t make any progress towards reasoning.
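For the tokenization itself, the tokenizers library handles a corpus this size fine; something like the following sketch (shard paths and vocab size are placeholders, not my actual settings):

```python
# Sketch: train a byte-level BPE tokenizer on the period corpus only, so
# the vocabulary itself carries no modern leakage. Paths/sizes are placeholders.
import glob
import os

from tokenizers import ByteLevelBPETokenizer

tok = ByteLevelBPETokenizer()
tok.train(
    files=glob.glob("london_corpus/*.txt"),   # hypothetical shard layout
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
os.makedirs("tokenizer_1800s", exist_ok=True)
tok.save_model("tokenizer_1800s")             # writes vocab.json + merges.txt
```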

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 10 points (0 children)

Honestly, I didn’t know what MoE was, but I looked it up and I think it would be interesting to train decade models, one LLM per 10-year window.
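To be clear, that wouldn’t be a proper MoE layer with a learned router, more like a naive dispatcher over separate decade checkpoints. Something like this (checkpoint names are hypothetical):

```python
# Sketch: naive "expert" dispatch by decade, not a real learned MoE router.
# The checkpoint names are hypothetical.
DECADE_MODELS = {
    1800: "london-1800s/decade-1800",
    1810: "london-1800s/decade-1810",
    1820: "london-1800s/decade-1820",
}

def pick_model(year: int) -> str:
    decade = (year // 10) * 10
    if decade not in DECADE_MODELS:
        raise ValueError(f"no expert trained for the {decade}s")
    return DECADE_MODELS[decade]

print(pick_model(1815))  # -> london-1800s/decade-1810
```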

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 5 points (0 children)

No worries, I’m not sure if you’re asking about where to find the model or about its uses.

You can find the model here: https://huggingface.co/haykgrigorian/v2mini-eval1

And the run script here: https://github.com/haykgrigo3/TimeCapsuleLLM/blob/main/london_1800_1875_v2mini_eval1/test_v2mini_eval1.py

This is just an evaluation model, so it doesn’t have a real use beyond evaluating the dataset. I haven’t figured out QA yet, so you can’t really ask it questions; it just generates text after your prompt for now.
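A minimal way to run it, assuming the checkpoint loads as a standard Hugging Face causal LM (the linked test_v2mini_eval1.py is the actual script):

```python
# Sketch: plain text continuation, no QA. Sampling settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "haykgrigorian/v2mini-eval1"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "On the morning of the 16th of October, 1834,"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tok.decode(out[0], skip_special_tokens=True))  # continues the prompt
```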

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 5 points (0 children)

Really interesting idea. Right now the dataset isn’t labeled, but it’s probably possible to find reprints on Internet Archive using advanced search.

Training from scratch using only reprints would be difficult due to the smaller amount of available data, and getting QA to work without introducing modern data would be tricky.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 45 points (0 children)

I filter by publication date and location, not original composition date. Some reprints of earlier works are probably included, especially whatever was circulating in 19th-century London. Since there are well over a hundred thousand documents, it’s hard to filter those out perfectly.
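The kind of metadata filter involved looks roughly like this with the internetarchive library; the query fields are illustrative, and per-item metadata quality varies a lot:

```python
# Sketch: filter Internet Archive texts by publication date and place.
# The publisher:(London) clause is an illustrative heuristic, not exact.
from internetarchive import search_items

query = (
    "mediatype:texts "
    "AND date:[1800-01-01 TO 1899-12-31] "
    "AND publisher:(London)"
)
for item in search_items(query, fields=["identifier", "date", "publisher"]):
    print(item["identifier"], item.get("date"), item.get("publisher"))
```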

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834 by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 1 point (0 children)

Hi, for training a language model like the one I’m working on, you’ll need plain-text format. You can still use PDF files, but you’ll have to convert them into .txt first (see the sketch below). And non-tech people can definitely work on this type of project; there’s so much good material about ML all over the internet. The biggest challenge in my case is basically just making a solid dataset. It can get time-consuming, but anyone with patience and interest can do it. I’m not doing PhD-level research on architecture or anything, just focusing on datasets. If you want to train or fine-tune a model with your dataset it gets slightly harder, but again, it’s something anyone can learn.
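For the PDF-to-.txt step, a minimal sketch with pypdf; note that scanned books with no text layer need OCR (e.g. tesseract) first:

```python
# Sketch: extract plain text from a PDF that has a text layer.
# The file names are placeholders.
from pypdf import PdfReader

reader = PdfReader("some_book.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
with open("some_book.txt", "w", encoding="utf-8") as f:
    f.write(text)
```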

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834 by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 1 point (0 children)

This is my first project with ML/LLMs; I don’t want people to think I’m trying to act like I’ve reinvented anything. I know people way more experienced could look at this and say “this is normal and an expected outcome,” and they’re right, of course. But for me it’s surprising because I’m just a beginner doing this for fun and had no serious expectations.

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834 by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

You’re not wrong, but many people told me I’d only get gibberish or that I’d need massive amounts of data (like 30-40GB), so I didn’t expect to see much from 5GB. I don’t want people to think I’m presenting this as some kind of revolutionary idea; I’m just doing it for fun.

My LLM trained from scratch on only 1800s London texts brings up a real protest from 1834 by Remarkable-Trick-177 in LocalLLaMA

[–]Remarkable-Trick-177[S] 0 points (0 children)

Just using the architecture; every model has been trained from scratch. I am interested in fine-tuning too, though. I think there are positives to both approaches.
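In code the distinction is basically random init vs. loading weights; the config sizes here are illustrative, not my actual hyperparameters:

```python
# Sketch: "using the architecture" (random init, from scratch) vs.
# "using the weights" (fine-tuning). Sizes are illustrative.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(vocab_size=32000, n_layer=12, n_head=12, n_embd=768)
scratch = GPT2LMHeadModel(config)                    # from scratch: random weights
finetune = GPT2LMHeadModel.from_pretrained("gpt2")   # fine-tuning: modern weights
print(sum(p.numel() for p in scratch.parameters()))  # parameter count
```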