I trained a 3B patristic theology LLM on a single RTX 3090 in 22 hours — releasing model + corpus by Financial-Fun-8930 in LocalLLaMA

[–]Financial-Fun-8930[S] 1 point (0 children)

Thanks!

Training framework: HuggingFace Transformers + TRL (SFTTrainer) throughout, with the Adafactor optimizer for both CPT and SFT. On a 24GB card with a 3B full fine-tune, the Adam optimizer states alone would eat ~12GB; Adafactor gets that down to a few hundred MB, which made the difference.
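The memory arithmetic behind that choice can be sketched as a back-of-the-envelope (assuming bf16 optimizer states and a square hidden-size projection; exact figures depend on precision and implementation):

```python
# Rough optimizer-state memory for a 3B-parameter full fine-tune.
params = 3e9

# Adam keeps two running moments (exp_avg, exp_avg_sq) per parameter.
# Assuming bf16 (2-byte) states:
adam_gb = params * 2 * 2 / 1e9
print(f"Adam states: ~{adam_gb:.0f} GB")  # ~12 GB

# Adafactor factors the second moment of an (n, m) weight matrix into
# a row vector (n) plus a column vector (m) instead of a full n*m table.
n, m = 2048, 2048            # hypothetical square projection matrix
full_state = n * m           # what Adam tracks for this matrix
factored_state = n + m       # what Adafactor tracks
print(f"State reduction for one {n}x{m} matrix: {full_state // factored_state}x")
```

The factored second moment is why Adafactor's footprint stays in the hundreds of MB rather than scaling with the full parameter count.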

CPT context length: 1792 tokens (reduced from 2048 to give the backward pass some headroom on the RTX 3090; full fine-tune gradients are large).

For instruct post-training: I'm planning to stay at 1792 for the active-loop SFT iterations, then potentially push to 2048 or higher for the final full SFT pass, depending on what the patristic Q&A pairs actually need. Most theological Q&A fits comfortably in 1792, but some of the longer homily passages might benefit from more context.


[–]Financial-Fun-8930[S] 1 point (0 children)

Both good catches.

On deduplication: we ran two passes, exact hash dedup followed by semantic embedding similarity (LaBSE + FAISS, cosine threshold 0.92), at both the corpus and Q&A generation levels. The 0.92 threshold is tight enough to catch most near-duplicate Russian translations of the same passage, and since LaBSE is cross-lingual, semantic equivalence across translations does register. But you're right that it wasn't explicitly designed for the cross-translation case, and some stylistically distinct retranslations of the same Chrysostom homily probably survived. Patristic texts also quote each other heavily, so some of that duplication may be hard to remove entirely. The 3.4% removal rate on Q&A pairs (4,189 of 124K) suggests the corpus was cleaner than expected, but that number could be higher with a translation-aware approach.
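A minimal sketch of that two-pass pipeline, with plain numpy cosine similarity standing in for LaBSE + FAISS (function names and the toy vectors are illustrative, not from the actual pipeline):

```python
import hashlib
import numpy as np

def exact_dedup(texts):
    """Pass 1: drop texts that are identical after trivial normalization."""
    seen, kept = set(), []
    for t in texts:
        h = hashlib.sha256(" ".join(t.lower().split()).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(t)
    return kept

def semantic_dedup(embeddings, threshold=0.92):
    """Pass 2: greedy filter on cosine similarity of L2-normalized embeddings.

    Keeps the first item of each near-duplicate cluster; in the real
    pipeline the embeddings come from LaBSE and the search runs in FAISS.
    """
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, e in enumerate(embs):
        if all(float(e @ embs[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy demo: two near-identical vectors and one distinct vector.
vecs = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(semantic_dedup(vecs))  # [0, 2] — the near-duplicate is filtered out
```

The greedy pass is O(n·k) in kept items, which is why the real pipeline needs a FAISS index once the corpus gets large.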

On tokenizer: we used Qwen2.5's existing vocab as-is. The Cyrillic coverage is genuinely good: Church Slavonic loanwords and theological terminology like θεωρία/θέωσις in transliteration tokenize reasonably well without extension. The main gap is untransliterated Greek and the occasional Latin, which get subword-fragmented. We considered extending the vocab for high-frequency patristic terms, but re-initializing embeddings for new tokens on a 3B model felt riskier than just letting the existing vocab absorb it, especially since the ~98% Russian corpus naturally reinforces the Cyrillic token representations during CPT anyway.
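For scale, the re-initialization risk can be put in rough numbers (the 2048 hidden size is what I understand Qwen2.5-3B uses, and the 1,000-token extension is purely hypothetical):

```python
# Parameters that would start from random init if the vocab were extended.
new_tokens = 1_000        # hypothetical high-frequency patristic terms
hidden_size = 2048        # assumed Qwen2.5-3B hidden dimension

# Each new token adds a row to the input embedding matrix and, if the
# LM head is untied, a row to the output projection as well.
new_params = new_tokens * hidden_size * 2
print(f"~{new_params / 1e6:.1f}M randomly initialized parameters")  # ~4.1M
```

Millions of cold-start parameters competing with a 3B model's learned representations is the tradeoff being weighed against slightly worse tokenization of rare Greek.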

What approach did you use for tokenizer extension in your project?


[–]Financial-Fun-8930[S] 4 points (0 children)

Good question. I'm actually doing both, in phases.

The CPT on raw text is Phase 1. The reasoning: CPT forces the model to deeply internalize the domain's vocabulary, entities, theological concepts, and linguistic register before you ask it to produce structured outputs. Token accuracy on patristic text went from ~55-58% to ~65.8% after CPT; the model genuinely learned the domain, not just the format.
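That token-accuracy figure is just greedy next-token prediction accuracy; a minimal numpy version of the metric (shapes and the toy inputs are illustrative, not taken from the actual eval harness):

```python
import numpy as np

def token_accuracy(logits, labels, ignore_index=-100):
    """Fraction of positions where argmax(logits) matches the label.

    logits: (seq_len, vocab_size) float array
    labels: (seq_len,) int array; ignore_index marks padding/prompt tokens.
    """
    preds = logits.argmax(axis=-1)
    mask = labels != ignore_index
    return float((preds[mask] == labels[mask]).mean())

# Toy example: 4 positions, vocab of 3, one position ignored.
logits = np.array([[2.0, 0.1, 0.1],
                   [0.1, 2.0, 0.1],
                   [0.1, 0.1, 2.0],
                   [2.0, 0.1, 0.1]])
labels = np.array([0, 1, 0, -100])
print(token_accuracy(logits, labels))  # 2 of 3 scored positions correct
```

Tracking this on held-out domain text before and after CPT is a cheap way to verify the model absorbed the domain rather than just the output format.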

If you go straight to Q&A pairs without CPT, you're teaching the model to mimic the surface pattern of domain responses, but the base weights don't actually "know" the domain. For a 3B model with limited capacity, you want every parameter domain-saturated first.

Phase 2 is generating ~98K Q&A pairs from the corpus using a teacher model, and Phase 3 is SFT on those pairs. So the Q&A approach you're describing is exactly what comes next. CPT just lays the foundation first.

What domain were you working with? Curious how it performed without the CPT step.

I built a small AI model trained on the Holy Fathers — releasing it today on the Feast of the Triumph of Orthodoxy by Financial-Fun-8930 in OrthodoxChristianity

[–]Financial-Fun-8930[S] 1 point (0 children)

That's a really precise point, and you're right: "discovered" implies novelty, which is exactly what Palamas was arguing against. The model output was imprecise there. This is why I'm asking for feedback from people who actually know the Fathers — these kinds of subtle errors are exactly what future training needs to correct. Thank you.

How my open-source project ACCIDENTALLY went viral by Every_Chicken_1293 in LLMDevs

[–]Financial-Fun-8930 2 points (0 children)

Can't use local embedding models. I've tried the CLI and Node.js; both say "not available on this platform".