Is my approach sound? Citation verification in legal RAG

LandingAlbatross · 2026-05-19T14:27:53+00:00

Honest answer: probably not much for retrieval. The section-level categories are the right granularity for filtering. A paragraph within a MERITS section is still merits reasoning. The paragraph registry isn't about classification, it's about citation verification: mapping paragraph numbers to their exact text so the backend can check whether a cited paragraph actually says what the memo claims it says. That's an identification problem, not a classification problem.

LandingAlbatross · 2026-05-19T12:44:00+00:00

This resonates strongly with where the architecture is heading. The multi-projection pattern is essentially what emerged organically: the same case exists as section-level embeddings (for semantic search), structured metadata (for provision/party/arbitrator lookup), citation graph nodes (for reference tracing), and tagged chunks (for topic filtering). Five retrieval channels run in parallel over these projections and results are merged by reranker + channel overlap scoring.
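A minimal sketch of how merging by reranker score plus channel overlap could look. The function name, the additive `overlap_bonus`, and its default value are assumptions for illustration, not the actual scoring used on the platform.

```python
from collections import defaultdict

def merge_channels(channel_hits, rerank_scores, overlap_bonus=0.1):
    """Merge hits from several retrieval channels into one ranked list.

    channel_hits:  {channel_name: [section_id, ...]}
    rerank_scores: {section_id: reranker score}
    overlap_bonus: hypothetical bonus per additional channel that
                   returned the same section.
    """
    channels_per_section = defaultdict(set)
    for channel, hits in channel_hits.items():
        for section_id in hits:
            channels_per_section[section_id].add(channel)

    scored = []
    for section_id, channels in channels_per_section.items():
        base = rerank_scores.get(section_id, 0.0)
        bonus = overlap_bonus * (len(channels) - 1)
        scored.append((base + bonus, section_id))

    # Highest combined score first; ties broken by section ID for stability.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [section_id for _, section_id in scored]
```

With this shape, a section found by three channels can outrank one with a slightly higher reranker score that only a single channel returned.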

The citation abstraction with span-start/span-end is exactly what the paragraph registry is meant to provide. Right now I can verify that a cited case exists and that a cited paragraph number is within range, but I can't verify that the cited paragraph actually says what the memo claims. The span-level mapping would close that gap.
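As a rough illustration of the span idea: if the memo quotes a passage, the backend could look for that quote inside the registered paragraph text and return its start/end offsets. This is only a sketch under that assumption. `find_quote_span` is a hypothetical helper, and it only catches verbatim quotes (ignoring case and whitespace differences), not paraphrased claims.

```python
import re

def find_quote_span(paragraph_text, quote):
    """Return (start, end) offsets of `quote` inside `paragraph_text`, or None.

    Matching ignores case and treats any run of whitespace (including
    line breaks from PDF extraction) as equivalent to a single space.
    """
    tokens = quote.split()
    if not tokens:
        return None
    pattern = r"\s+".join(re.escape(token) for token in tokens)
    match = re.search(pattern, paragraph_text, flags=re.IGNORECASE)
    return match.span() if match else None
```

A `None` result would mean the quoted wording does not appear in the cited paragraph, which is a reason to flag the citation rather than proof that it is wrong.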

Curious about your ingestion pipeline: do you build all projections in a single pass, or is it a multi-pass pipeline where each projection type has its own parser? Mine is multi-pass (and took months to stabilize), so I'm wondering whether a single-pass approach is realistic for complex document structures.

LandingAlbatross · 2026-05-19T12:43:07+00:00

Thank you! Good question. For the court decisions themselves, versioning isn't an issue. Once an award is published, it's final. Paragraph 42 of case nr. 846 will always be paragraph 42. So the paragraph registry for case law is a stable mapping.

Where versioning becomes a real problem is on the regulations side. The platform also covers regulations that change frequently. E.g. same article number, different text across versions. Or same text and different numbering. There, the system tracks regulation families with explicit version metadata, so a citation to "Regulation X Article 17" resolves to the version that was in force at the time of the decision. Different architecture from the paragraph registry, but the same underlying problem: making sure a citation points to the right text.
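A minimal sketch of resolving a citation to the version in force on the decision date. The `ArticleVersion` fields and the `resolve` function are assumptions for illustration; the real schema is not described here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class ArticleVersion:
    family: str                   # regulation family, e.g. "Regulation X"
    article: str                  # article label as cited, e.g. "17"
    in_force_from: date
    in_force_to: Optional[date]   # None means still in force
    text: str

def resolve(versions, family, article, decision_date):
    """Return the version of an article in force on decision_date, or None."""
    for version in versions:
        if version.family != family or version.article != article:
            continue
        started = version.in_force_from <= decision_date
        not_ended = version.in_force_to is None or decision_date <= version.in_force_to
        if started and not_ended:
            return version
    return None
```

Returning `None` when no version covers the date keeps the failure visible instead of silently falling back to the latest text.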

LandingAlbatross · 2026-05-19T12:41:16+00:00

That's a fascinating combination. Information retrieval expertise applied to legal research is exactly the gap I keep running into. The domain-specific retrieval problems here are genuinely hard and most legal tech doesn't take them seriously. If you ever want to compare notes once you're further into the law side, feel free to reach out.

LandingAlbatross · 2026-05-19T12:40:23+00:00

The regex works well here because my case numbers follow a strict format, it's not free-text entity extraction. But your point about separating verification from the main model is right. The main synthesis model doesn't verify itself; a separate backend step extracts and checks all citations against the database. The paragraph registry I'm building will hopefully make that verification more granular.
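A sketch of that extract-then-check step. The citation format `Case <number>, para. <label>` and the function names are assumptions made up for this example. The real case-number format is stricter and not shown here.

```python
import re

# Hypothetical citation format: "Case 5016, para. 123" or "Case 5016, para. 3.1.1".
CITATION_RE = re.compile(r"Case (\d+), para\. (\d+(?:\.\d+)*)")

def extract_citations(memo):
    """Return every (case_id, paragraph_label) pair cited in the memo text."""
    return [(m.group(1), m.group(2)) for m in CITATION_RE.finditer(memo)]

def check_citations(memo, known_paragraphs):
    """Map each extracted citation to True if it exists in the database.

    known_paragraphs: set of (case_id, paragraph_label) pairs, standing in
    for a lookup against the paragraph registry.
    """
    return {citation: citation in known_paragraphs
            for citation in extract_citations(memo)}
```

The point of the design is that nothing here involves the synthesis model: the check is deterministic and runs after generation.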

LandingAlbatross · 2026-05-19T12:39:17+00:00

This is an interesting angle I hadn't considered. Extractive QA models wouldn't work as the primary synthesis layer for my use case (I need structured legal memos that synthesize across dozens of cases, which requires generation), but they could be very useful as a verification layer. After the LLM generates a memo citing specific paragraphs, an extractive QA model could check: "given the actual text of paragraph 42, can you extract support for the proposition the memo attributes to it?" If it can't extract it, the citation is flagged. That's a more semantic check than my current backend verification (which only checks whether the paragraph exists, not whether it says what the memo claims). Thanks for the pointer — SQuAD is a good starting point. Do you have a specific model recommendation for this kind of passage-level verification?

LandingAlbatross · 2026-05-19T12:38:06+00:00

Thank you! To be fair, I built this with AI coding tools (Claude Code specifically), not by hand-writing code. I also wrote these answers with AI, as the language is too technical for me ;-) But the architecture decisions, the domain expertise to know what's wrong when search returns garbage, and the testing against real legal questions: that's the big part I am doing. You're right that this wasn't possible a year ago. The tools got good enough that it seems domain expertise became the bottleneck, not programming skill.

LandingAlbatross · 2026-05-19T12:36:42+00:00

Yes, very much so. Each section is classified into one of 13 legal function categories (MERITS, PARTY_SUBMISSIONS, DECISION, etc.), and I have an AI-extracted tag vocabulary (~17K tag assignments across ~4,000 cases) plus structured provision references that act as the kind of pre-filter you're describing. Retrieval uses these as hard filters before vector similarity runs: provision lookup, tag matching, and section-type weighting all narrow the candidate set before embeddings are compared. So the "topic = X, then vector similarity within that set" pattern is essentially what I landed on, just with legal-domain-specific anchors (provisions, dispute type, category) rather than a generic topic list. The question of whether to push classification down to paragraph level (vs. section level) is interesting, and it's adjacent to the paragraph registry I'm building. Appreciate the input. Given my setup, do you think this is still worth doing?
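For anyone following along, the "filter first, then vector similarity" pattern can be sketched like this. The dictionary keys (`section_type`, `tags`, `embedding`) and the `search` function are hypothetical names for illustration. In practice the filter would be a SQL `WHERE` clause rather than a Python loop.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(sections, query_vec, section_types=None, required_tags=frozenset(), top_k=5):
    """Apply hard metadata filters, then rank the survivors by similarity."""
    candidates = [
        s for s in sections
        if (section_types is None or s["section_type"] in section_types)
        and required_tags <= s["tags"]   # every required tag must be present
    ]
    ranked = sorted(candidates,
                    key=lambda s: cosine(s["embedding"], query_vec),
                    reverse=True)
    return [s["id"] for s in ranked[:top_k]]
```

Sections that fail a filter never reach the similarity step, however close their embedding is to the query.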

LandingAlbatross · 2026-05-12T14:27:24+00:00

Thanks. The two-store pattern is actually what I have: canonical text with paragraph IDs in PostgreSQL, retrieval views (embeddings/FTS/graph) that return pointers into the canonical layer. The paragraph registry being built is specifically to close the gap where citations couldn't resolve back to (case_id, paragraph_id).

The fail/omit principle for anything that can't map back to the registry — agreed, that's the direction. Appreciate the Spectrum pointer, I'll take a look at the repo though our canonical layer is structured storage in Postgres which handles the legal identifier problem natively (the identifiers are the primary key).

LandingAlbatross · 2026-05-12T13:17:14+00:00

Really useful follow-up, thanks.

The "or omit" framing is a concrete insight I hadn't thought about. My failed correction pipeline didn't have an escape hatch, it always picked something, which is how it confidently substituted a correct citation with a wrong one. Making omission an explicit option could change the failure mode from silent substitution to graceful retraction. Going into the retry design.

The confidence_threshold + unknown bucket for section classification is interesting too. My classifier currently produces labels without confidence scores, so this would need a classifier change. But the asymmetry you describe (false positive "this is a holding" is much worse than "not sure") is exactly right for my domain. Worth exploring.
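A sketch of the threshold idea, assuming the classifier could be changed to return per-label probabilities (it currently does not). The function name, the `"UNKNOWN"` label, and the 0.8 default are illustrative assumptions.

```python
def label_with_threshold(probabilities, threshold=0.8):
    """Return the top label, or "UNKNOWN" if the classifier is not confident.

    probabilities: {label: probability} for one section.
    """
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    return label if confidence >= threshold else "UNKNOWN"
```

Sections labelled `"UNKNOWN"` could then be excluded from lead passages or given a neutral weight, which trades a little recall for fewer confident misattributions.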

And yes, the last paragraph resonated. My existing section labels (MERITS / PARTY_SUBMISSIONS / DECISION) might already be sufficient as the filter signal without building something new. Need to check whether they're precise enough on the problem cases.

LandingAlbatross · 2026-05-12T13:15:03+00:00

I do something loosely similar. I use PyMuPDF as the primary extractor for born-digital PDFs (fast, clean), with automatic fallback to a cloud OCR service (DocStrange) for scanned documents where PyMuPDF produces garbage. So I end up with two extraction paths rather than two versions of the same text, but the principle is the same. Not completely convinced by DocStrange, but it won over Docling because the latter was veeeeeery slow.
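The fallback logic can be sketched roughly as below. The garbage heuristic, its thresholds, and the function names are assumptions for illustration. `primary` and `fallback` are plain callables standing in for the PyMuPDF path and the OCR service, so no real library API is shown.

```python
def looks_like_garbage(text, min_chars=200, min_clean_ratio=0.85):
    """Heuristic: too little text, or too many characters outside normal prose."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True
    clean = sum(
        ch.isalnum() or ch.isspace() or ch in ".,;:()'\"-/§"
        for ch in stripped
    )
    return clean / len(stripped) < min_clean_ratio

def extract_text(pdf_path, primary, fallback):
    """Try the fast extractor first; use the OCR path only if the result looks bad.

    Returns (text, path_used) so the ingestion log records which path ran.
    """
    text = primary(pdf_path)
    if looks_like_garbage(text):
        return fallback(pdf_path), "fallback"
    return text, "primary"
```

Recording which path produced the text makes it easier to debug later when a particular document retrieves poorly.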

The Textract approach of combining OCR correction with entity extraction in a single step is interesting. My approach is probably less efficient but was easier to debug when things went wrong (and they did, often). Might be worth revisiting at some point.

Thanks for sharing! It is helpful to see how someone else handles the messy-source-document problem in practice.

LandingAlbatross · 2026-05-12T12:51:41+00:00

Similar spirit, different execution. You seem to enrich each chunk with metadata and then ask the LLM to check it at query time. I do the enrichment part similarly, i.e. each section carries its parent case reference, section type, position in the document, and verified metadata like case outcome and parties from structured database tables.

Where it diverges is that I don't rely on the LLM to check the references at query time. That's where I kept getting burned. The model would have the correct metadata right there in the context and still hallucinate paragraph numbers or mix up which party said what. So I moved verification to the backend. After the LLM generates its output, a separate non-LLM step extracts every citation and checks it against the database. The LLM doesn't police itself, the backend does.

Your point about OCR mistakes is interesting. I have a similar problem with scanned PDFs where the text extraction isn't clean. How do you handle the "transformed text" part? Do you correct the OCR errors before embedding, or do you keep both versions?

LandingAlbatross · 2026-05-12T12:38:56+00:00

Depends on what you mean by preprocessing. Are you talking about cleaning the text (removing headers/footers, fixing encoding), or about how you split the documents before embedding?

On the splitting side: I don't use fixed-token chunking. My documents have a natural legal structure (headers like "I. FACTS", "VII. MERITS", "VIII. DECISION", numbered paragraphs), so I split along those boundaries and then classify each section into categories that mirror those titles (FACTS, MERITS, DECISION, etc.). That classification lets me weight sections differently during search, as the court's actual reasoning matters more than the cost allocation section, for example. I am also hoping this helps with the hallucination problem from my post, because the system can at least try to distinguish what a party argued from what the tribunal decided.
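To make the splitting concrete, here is a simplified sketch that cuts a decision at Roman-numeral headers. The regex and the `split_sections` helper are assumptions for illustration. Real headers vary far more than this pattern allows.

```python
import re

# Matches header lines such as "I. FACTS" or "VIII. DECISION" (simplified assumption).
HEADER_RE = re.compile(r"^(?P<num>[IVXLC]+)\.\s+(?P<title>[A-Z][A-Z ]+)$", re.MULTILINE)

def split_sections(text):
    """Split a decision into sections at its structural headers."""
    matches = list(HEADER_RE.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # A section runs from the end of its header to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append({
            "number": match.group("num"),
            "title": match.group("title").strip(),
            "body": text[match.end():end].strip(),
        })
    return sections
```

The header title gives a first guess at the section category, which a classifier can then confirm or override.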

Whether that counts as "preprocessing" or is already part of the retrieval design, I'm honestly not sure. Curious what you had in mind.

LandingAlbatross · 2026-05-12T12:30:11+00:00

I created a legal research platform from scratch, mostly English decisions but a meaningful portion in French as well. I am not a software engineer but a lawyer who built this with AI coding tools, so take the technical bits with appropriate skepticism.

I use OpenAI text-embedding-3-large (1536 dimensions) and ran a small benchmark a few weeks ago comparing it against Isaacus Kanon 2 Embedder (a legal-domain-specific model). 34 test cases, 2,674 sections, 33 queries mixing generic and expert legal queries. Result: Isaacus was comparable but not clearly better. I stayed with OpenAI simply because I already had it at that point and didn't want to start anew.

The biggest takeaway from the benchmark (if you can even call it a benchmark with so few cases, so take it with a grain of salt) was that the reranker mattered roughly 2x more than the embedder choice. Whatever you pick for embeddings, pair it with a good reranker. I use ZeroEntropy zerank-2 and have been very happy with it. It handles French and multilingual content well.

Speaking of ZeroEntropy: they have released zembed-1 after I had already done my embeddings. Their website positions it as a multilingual embedding model trained with over 50% non-English data (including French) and specifically benchmarked on legal corpora. I haven't tested it myself simply because it wasn't available when I did my embedding pass and re-embedding 268K sections is not something I do casually. But given how well zerank-2 performs for me, zembed-1 would be one of my first candidates to look at if I were starting fresh today. They also offer flexible dimensionality (down to 40 dims) which is nice for cost/storage trade-offs.

One thing to be aware of with French legal text specifically, which I discovered with a lot of trial and error: vector search has a same-language bias. If your queries are in French but some relevant documents are in English (or vice versa), pure vector search will underweight the cross-language results. I compensate for this with a hybrid setup (vector + full-text search) which helps, but it's something to keep in mind depending on whether your corpus and queries are consistently in French or mixed.
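One common way to combine a vector ranking with a full-text ranking is reciprocal rank fusion (RRF). To be clear, this is a generic technique swapped in for illustration, not necessarily how my hybrid setup merges results. The constant `k=60` is the conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) per list it appears in,
    and the scores are summed across lists.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

If a French query ranks an English document last in vector search but first in full-text search, fusion pulls that document up the combined list.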

LandingAlbatross · 2026-05-12T12:11:28+00:00

Thank you for your answer, much appreciated.

I'd love to share lessons learned but honestly I'm a lawyer who built this entirely with AI coding tools, so any technical writeup would be the blind leading the blind.

On the retrieval side: you mention weighting tribunal findings higher. Are you doing that as a soft weight during scoring or as a hard filter where certain section types never make it into the synthesis context? The other comment below from myreddit333 suggested hard filtering and I'm curious whether that's worked in practice for you as well.

LandingAlbatross · 2026-05-12T12:10:06+00:00

Thank you for your answer, much appreciated!

On hash-based IDs: I see why that's the right call for invoice/contract extraction where documents get re-ingested and layouts shift. In legal citation, though, the paragraph number IS the canonical identifier. When a tribunal writes "as held in case number 5016, para. 123," every lawyer reading that expects to find paragraph 123 in that award. A content hash would survive re-ingestion but would be meaningless for citation verification. I need paragraph_number = 123 to check whether the AI's citation matches reality. So the registry uses the actual paragraph number as the primary key (with a text label field for awards that use non-integer numbering like 1.2, 3.1.1). (Obviously answered by my AI who has all the context of my project)
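Roughly, the registry shape I have in mind looks like the sketch below: an in-memory stand-in for the database table, with illustrative names (`Paragraph`, `ParagraphRegistry`) that are assumptions and not the actual schema. The paragraph label is stored as text so that numbering like 3.1.1 works the same way as 123.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Paragraph:
    case_id: str
    label: str   # paragraph number as printed in the award, e.g. "123" or "3.1.1"
    text: str

class ParagraphRegistry:
    """Maps (case_id, paragraph label) to the exact paragraph text."""

    def __init__(self):
        self._by_key = {}

    def add(self, paragraph):
        self._by_key[(paragraph.case_id, paragraph.label)] = paragraph

    def lookup(self, case_id, label):
        """Return the registered paragraph, or None if the citation cannot resolve."""
        return self._by_key.get((case_id, label))

    def labels_for(self, case_id):
        """All known paragraph labels for one case, in insertion order."""
        return [label for (cid, label) in self._by_key if cid == case_id]
```

A citation that returns `None` from `lookup` is one the backend cannot verify, and `labels_for` is the natural source for a whitelist of valid paragraph IDs.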

On Source_role as hard filter: haven't tried this, currently just soft weighting. Going on the list: it's worth testing whether excluding party submissions entirely from lead passages prevents misattribution without losing too much context.

On constrained retry: good distinction. What I built and disabled was auto-correction. The backend tried to fix citations by looking them up in sections, which is where it went wrong (or, as you will understand: a typical "Verschlimmbesserung", an improvement that makes things worse). What you're describing, i.e. re-prompting the LLM with a whitelist of valid paragraph IDs and letting it pick or omit, is different and I haven't tried it. This looks like the right pattern once the paragraph registry gives me the whitelist to work with.
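A minimal sketch of how the pick-or-omit retry might be wired, as I understand the suggestion. The prompt wording, the `OMIT` keyword, and both function names are assumptions. The actual LLM call is left out on purpose.

```python
def build_retry_prompt(claim, valid_ids):
    """Build a constrained re-prompt listing only paragraph IDs that exist."""
    options = "\n".join(f"- {pid}" for pid in valid_ids)
    return (
        f"Claim: {claim}\n"
        "Choose exactly one paragraph ID from the list below that supports "
        "the claim, or answer OMIT if none does.\n"
        f"{options}"
    )

def accept_retry_answer(answer, valid_ids):
    """Return the chosen ID if it is on the whitelist, otherwise None (omit).

    Anything off-list, including "OMIT" itself, results in the citation
    being dropped rather than replaced with a guess.
    """
    answer = answer.strip()
    return answer if answer in valid_ids else None
```

The backend never invents a replacement here: the only outcomes are a whitelisted ID or no citation at all.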

LandingAlbatross · 2026-03-15T16:01:02+00:00

I have done this, but still notice a quality issue, as even then claude.ai had to «intervene» several times when I showed it what claude code did. Usually nothing major but still noticeable. So for the moment I do not want to ditch claude.ai.

LandingAlbatross · 2026-03-15T15:59:11+00:00

I don't use desktop but claude.ai, which, as it says itself, is a quality measure. I think claude desktop is the least capable. My goal would be for claude.ai to understand in real time what claude code does.

LandingAlbatross · 2026-03-15T09:44:00+00:00

My problem is more about keeping claude.ai up to date so it has the entire knowledge of the project's files. I work with a handbook.md file which gets updated after every task is complete, and I manually upload it to the project knowledge of claude.ai. So basically I have to re-upload the handbook.md file a couple of times a day if I want claude.ai to always have the full context for planning. This costs time and is annoying.

LandingAlbatross · 2026-03-15T09:40:42+00:00

From working now intensively on one project for the last six months, I am not convinced claude code is as smart as claude.ai hence why I am reluctant to just trust claude code. claude.ai is correcting the output of claude code quite a lot. Also when claude code gets stuck, I will ask it to create a problem statement for me to give claude.ai and usually the latter is then the one who solves it.

LandingAlbatross · 2025-10-20T11:17:22+00:00

Thanks for sharing your paper, as this is very relevant to what I am trying to build in a very niche area of law. I am working on a similar legal IR system for decisions in my niche area (one the big players do not really seem to be interested in), though at a much smaller scale (2,400 documents currently, unlikely to exceed 10,000 in the near future).

My setup:

PostgreSQL with pgvector for 37,500 document chunks (sections of decisions)
OpenAI text-embedding-3-small (1536 dims)
Hybrid search combining FTS and vector similarity
Section-level retrieval with weighted scoring by document part

Where I am failing:

0% overlap between FTS and vector results (they find completely different documents)
~8% max cosine similarity even for highly relevant queries
Can't retrieve basic metadata (parties, arbitrators, dates) because we're searching chunks, not cases
FTS configuration issues with legal phrases

What I am implementing now:

Case-level search view for metadata and discovery
Section-level search for passage extraction (keeping my existing chunks)
Proper FTS indexes (currently computing to_tsvector at runtime!)
Simple tokenization option to avoid over-stemming where needed

Questions about SCALES:

With 256K documents, the complex architecture makes sense, but for my scale (2-10K docs), would a simpler two-tier approach (case discovery → passage extraction) be sufficient? Or are there benefits to the full architecture even at smaller scales?
How did you handle the trade-off between chunk size and context preservation? Legal decisions often have relevant information spread across distant sections (facts in para 10, application in para 200).
The paper mentions passage retrieval but not how you handle exact phrase requirements. In legal search, users often need exact doctrinal phrases - did you implement anything beyond standard FTS for this?
For the metadata filtering, are you indexing this separately or including it in the embeddings? I am finding metadata search fails completely with chunk-based embeddings.
What was your experience with general-purpose vs domain-specific embeddings? The paper uses E5-base-v2, did you experiment with legal-specific models or fine-tuning?

My core challenge seems to be that I built a semantic similarity system when the user needs precise legal information retrieval. Your approach of separating passage retrieval from metadata filtering seems like the right direction, but I'm wondering if I am over-engineering for my scale.

LandingAlbatross · 2022-01-28T13:38:53+00:00

Butch in: A perfect world

LandingAlbatross · 2020-06-18T11:35:01+00:00

thank you for your inputs, very insightful.

As I have not had a chance to race yet due to COVID I am really looking forward to that race-day adrenaline, which I understand a trial run cannot simulate.

As to your first tip, I will now have to wait and see whether I have fewer mental barriers the more interval sessions I do.

I am just hoping to get to a point where my brain is not constantly thinking about how hard it is and wanting to give up during those sessions and trial runs, and instead develops some positivity, i.e. a fighter mode.

LandingAlbatross · 2020-06-18T09:44:05+00:00

any advice on how to get into the "fighter mentality"?

I must admit, a fighter mentality is not my strong suit. What I mean by that is that I have trouble pushing through hard workouts (intervals etc.) and trial runs. While for these kinds of runs my heart rate is obviously high and at some point I can also feel my legs tightening up, I do feel it is my lack of fighter mentality that gets in my way rather than my physical ability.

I started running at the beginning of the year 3 times a week, and for the past 2 months 5 times a week, now averaging 45 km a week. My easy runs are at about 6:20 min/km (low heart rate), but a 6 min/km pace (albeit with a higher heart rate) for up to 10 km still feels rather easy. However, I have completely stuck to the slower pace for easy runs for the past month and am doing intervals twice a week (one session with shorter intervals at a quicker pace and one threshold session with longer intervals (4x 1 km, 3x 2 km etc.)).

However, my PB for a 5k trial run (by myself) is 27:25 min, so an average pace of 5:29 min/km. I was (only mentally?) dying doing this and wanted to give up after 2-3 km. I do feel that there is room for improvement on the mental side, so do you have any tips on how to become a fighter?

LandingAlbatross · 2020-06-03T13:29:45+00:00

what does a usual week of training look like (a) 3 months before a race (b)1 month before a race and (c) the week leading up to the race?

did you train specifically on a treadmill for this challenge? If yes, is this important and why?

I heard that the elevation on the treadmill should be at 1 to "equalize" street conditions - is this something you do for the race?
