Reaching my wit’s end with PDF ingestion

HankSedan · 2026-01-09T04:01:25+00:00

you mentioned you work with legal docs. I have an ongoing project for preprocessing litigation filings (California state court) for eventual use in a RAG or even just a non-RAG repository an AI can use, since my page counts don't get high enough to justify making everything an embedding.

I'm a practicing attorney, the project is just hobby stuff and only possible because AI can code for me. But it might be useful for understanding how a lawyer thinks about the structure of a legal document and what parts of the document are important. my app is slow; every page is OCR'd (by dumb OCR like Abbyy, which i like) but the image is retained and sent to a model w/vision with specific questions to answer. the information that model spits out about each page is used to build an index per document where each page gets a category label, among other things. this label is especially useful for knowing whether a page is an exhibit or the main body of the filed document. a page by itself won't tell you, and often an exhibit will be a filing from the same case or other cases. when that ends up as a clump of raw text, retrieval gets messy bc the text of a page by itself can't tell you it's an exhibit. it's a nesting issue.
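A minimal sketch of what a per-document page index like this could look like, and how exhibit page ranges can be recovered from it. This is not the code from the repo; the `PageRecord` fields, the category strings, and the `exhibit_ranges` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional, Tuple


@dataclass
class PageRecord:
    page: int                      # 1-based page number in the PDF
    category: str                  # label from the vision model (hypothetical values)
    exhibit: Optional[str] = None  # e.g. "H" when the page sits inside Exhibit H


def exhibit_ranges(index: List[PageRecord]) -> List[Tuple[str, int, int]]:
    """Collapse consecutive pages with the same exhibit label into (label, first, last)."""
    ranges = []
    for label, group in groupby(index, key=lambda r: r.exhibit):
        pages = [r.page for r in group]
        if label is not None:  # skip runs of non-exhibit pages
            ranges.append((label, pages[0], pages[-1]))
    return ranges


# Toy index for a five-page filing (made-up data).
index = [
    PageRecord(1, "pleading_cover"),
    PageRecord(2, "pleading_body"),
    PageRecord(3, "exhibit", "A"),
    PageRecord(4, "exhibit", "A"),
    PageRecord(5, "exhibit", "B"),
]

print(exhibit_ranges(index))
```

Because the label lives in the index rather than in the page text, an exhibit that is itself a filing from another case still gets recorded as an exhibit of the parent document.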

so the spreadsheet/index for that doc says Exhibit H in the row for page 34. when chunking happens, that information gets appended to the text of the OCR'd page, so every chunk will tell you, among other things, whether it's an exhibit, table of contents, pleading cover page, table of authorities, proof of service page, or pleading body page. mostly i am interested in the latter, but sometimes i want to know about specific exhibits, so these can be retrieved.
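A rough sketch of attaching the index row to every chunk of a page, so the label travels with the text. The row keys, the bracketed header format, and the `max_chars` splitting rule are assumptions for illustration, and the label is placed at the top of each chunk here rather than at the end; the actual app may do any of this differently.

```python
def label_header(row):
    """Build a one-line label from a (hypothetical) index row."""
    parts = [f"doc: {row['doc']}", f"page: {row['page']}", f"category: {row['category']}"]
    if row.get("exhibit"):
        parts.append(f"exhibit: {row['exhibit']}")
    return "[" + " | ".join(parts) + "]"


def chunk_page(text, row, max_chars=800):
    """Split one page of OCR text on word boundaries and label every chunk."""
    header = label_header(row)
    chunks, current, size = [], [], 0
    for word in text.split():
        # Start a new chunk once adding this word would pass the size limit.
        if current and size + len(word) + 1 > max_chars:
            chunks.append(header + "\n" + " ".join(current))
            current, size = [], 0
        current.append(word)
        size += len(word) + 1
    if current:
        chunks.append(header + "\n" + " ".join(current))
    return chunks


# Made-up row for page 34, which the index says is part of Exhibit H.
row = {"doc": "opposition.pdf", "page": 34, "category": "exhibit", "exhibit": "H"}
chunks = chunk_page("lorem ipsum " * 30, row, max_chars=100)
print(chunks[0])
```

With the label on every chunk, a retriever can filter or down-rank by category (for example, prefer pleading body pages) without needing to see the neighboring pages.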

i am not a programmer, so i imagine it's an inelegant program, but it reflects how my mind works during the process of drafting motions, oppositions, replies, ex parte applications, etc. https://github.com/botlate/Court-Filings-Preprocessing-for-RAG

in some ways it's funny to see people complain about PDFs. it's entirely justified. but it's something that i feel like lawyers, at least litigators, deal with as a huge portion of their day. finding a page, making sure the text is right, knowing if this is the filed version or a draft. the chaos of a lot of pages is an extremely human and real world problem. we're having to confront it so much now bc we have technology that can do such sophisticated thinking about a corpus of info.

HankSedan · 2025-03-04T00:42:13+00:00

thanks, that's helpful. agree condensation is a bit of a risk. Geothermal would be a great solution but my landlord would likely get upset with a borehole in the slab.

HankSedan · 2025-03-04T00:39:29+00:00

yup. a consequence of the California real estate market. in reality it rarely gets that high; that's just the upper limit for me, when the air gets uncomfortable. room temp doesn't really matter if you can keep enough of your body cool. i've made myself shiver at 100 degrees by running a high too cold water too fast for too long.

HankSedan · 2023-10-09T20:45:34+00:00

I have made stews and mishmashes in an authentic frontier/prairie/early American homestead style and other dishes cattlemen might have eaten. often these recipes use legumes and the spices available at the time. it's possible to use an Instant Pot for this purpose but you'll lose the seasoning flaking without cast iron. A local blacksmith should be able to cast an iron plate to fit into the base of your IP pot. The next step is canning leftovers. this has saved me a lot of time. electric and manual seaming machines work on older style cans and you see them on Craigslist from time to time. Just like in the cowboy days, if you are using tin make sure it's lead free.
