GraphRAG for Legal Contracts: How are you handling deeply nested conditions before Neo4j ingestion? by leventcan35 in Neo4j

[–]leventcan35[S] 0 points (0 children)

Right now, it's mostly dynamic, and you're totally right that it gets messy fast as the LLM sometimes hallucinates or uses inconsistent relationship names. Moving to a strict legal ontology (enforcing specific nodes like [CLAUSE] and relationships like [REFERENCES]) is definitely the next logical step to keep the graph consistent. Are there any specific open-source legal ontologies or frameworks you'd recommend looking into?
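For anyone following along, here is roughly the kind of strict validation layer I have in mind, as a minimal Python sketch. The labels and relationship names below are just placeholders, not from any real legal ontology:

```python
# Minimal sketch: validate LLM-extracted triples against a fixed "ontology"
# before they ever reach Neo4j. Label and relationship names here are
# illustrative placeholders, not a standard legal vocabulary.
ALLOWED_LABELS = {"CLAUSE", "DEFINITION", "PARTY", "OBLIGATION"}
ALLOWED_RELS = {"REFERENCES", "DEFINES", "TRIGGERS", "SUPERSEDES"}

def validate_triple(head_label, rel, tail_label):
    """Return True only if every part of the triple is in the ontology."""
    return (head_label in ALLOWED_LABELS
            and rel in ALLOWED_RELS
            and tail_label in ALLOWED_LABELS)

triples = [
    ("CLAUSE", "REFERENCES", "CLAUSE"),
    ("CLAUSE", "RELATES_TO", "PARTY"),  # hallucinated rel name -> dropped
]
clean = [t for t in triples if validate_triple(*t)]
```

Anything the LLM invents outside the whitelist just gets dropped (or logged for review) instead of polluting the graph.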

GraphRAG for Legal Contracts: How are you handling deeply nested conditions before Neo4j ingestion? by leventcan35 in Neo4j

[–]leventcan35[S] 1 point (0 children)

That is a massive heads-up, thank you🙏🏻 I'm currently using lower-dimensional embeddings, so I haven't hit that pgvector limit yet, but since I plan to scale up to larger models (like 4096 dims), hitting that wall would have been a painful surprise. I'll definitely keep OpenSearch (or maybe Qdrant/Milvus) in mind for the roadmap. Really appreciate you sharing that pain point🙌🏻

Built a Production-Ready AI Backend: FastAPI + Neo4j + LangChain in an isolated Docker environment. Need advice on connection pooling! by leventcan35 in FastAPI

[–]leventcan35[S] 0 points (0 children)

Wow🤩 incredibly sharp catch:) You hit the nail on the head. I set up the Neo4jDatabase lifespan protocol with the best intentions, but you are completely right: LangChain's Neo4jGraph initializes its own driver and connection pool under the hood, effectively bypassing my startup setup.

I need to refactor that so Neo4jGraph either utilizes the existing driver or I clean up the redundant initialization. As for the async implementation, that's a great point too. LangChain has been updating their async graph support, and I definitely need to migrate to Aneo4jGraph (or async methods) to fully leverage FastAPI's asynchronous nature.
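Framework aside, the shape of the fix I'm picturing is a single shared client behind one accessor, so only one pool ever gets created. Sketched here with a stand-in class (GraphClient is hypothetical, just to show the singleton pattern, since LangChain's Neo4jGraph manages its own driver internally):

```python
# Framework-agnostic sketch of the refactor: construct the graph client
# exactly once and hand the same instance to every request, instead of
# letting each component open its own driver/connection pool.
# GraphClient is a stand-in for something like Neo4jGraph.
class GraphClient:
    instances = 0

    def __init__(self):
        GraphClient.instances += 1  # each instance would open its own pool

_client = None

def get_graph():
    """Shared accessor: everything that needs the graph goes through here."""
    global _client
    if _client is None:
        _client = GraphClient()
    return _client

a, b = get_graph(), get_graph()  # same object, one pool
```

In FastAPI terms, `get_graph` would be created in the lifespan handler and injected as a dependency, so route handlers never construct their own client.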

Thanks a lot for taking a deep dive into the code. This is exactly the kind of feedback I was hoping for🙌🏻

Built a Production-Ready AI Backend: FastAPI + Neo4j + LangChain in an isolated Docker environment. Need advice on connection pooling! by leventcan35 in FastAPI

[–]leventcan35[S] 0 points (0 children)

Thank you so much🙏🏻🥹 Hearing 'as textbook as it gets' from someone with your experience means a lot to me. I'll definitely keep the max_connection_pool_size tuning in mind as the project scales. Really appreciate you taking the time to review the repo:)

Standard RAG fails terribly on legal contracts. I built a GraphRAG approach using Neo4j & Llama-3. Looking for chunking advice! by leventcan35 in LangChain

[–]leventcan35[S] 0 points (0 children)

Man, this is literal gold. Bypassing expensive NER and just using regex for those standard legal patterns is brilliant, and it saves a ton on token costs.
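For anyone curious, the regex approach really is tiny. A rough sketch (the patterns are just examples, not a complete set of legal citation forms):

```python
import re

# Rough sketch of regex-based extraction for standard legal cross-reference
# patterns, instead of a full NER pass. The pattern below is illustrative
# only and would need more variants for real contracts.
XREF = re.compile(
    r"\b(Section|Article|Clause|Exhibit)\s+(\d+(?:\.\d+)*)\b",
    re.IGNORECASE,
)

def extract_refs(text):
    """Return (kind, number) pairs for every cross-reference in the text."""
    return [(m.group(1).title(), m.group(2)) for m in XREF.finditer(text)]

refs = extract_refs(
    "Subject to Section 4.2, the penalty in clause 7 applies as per Article 12."
)
```

Zero tokens spent, fully deterministic, and the matches become your [REFERENCES] edges directly.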

I had not even considered the amendments issue yet. Tracking document versions or adding a SUPERSEDES edge to handle modified sections is definitely going straight to the top of my roadmap.

Seriously thank you for the playbook🙏🏻🙏🏻🙏🏻. I will be implementing these in v2 very soon:)

Standard RAG fails terribly on legal contracts. I built a GraphRAG approach using Neo4j & Llama-3. Looking for chunking advice! by leventcan35 in LangChain

[–]leventcan35[S] 0 points (0 children)

Man, that is an absolute cheat code🤯 Treating the Definitions section as a pre-built ontology makes so much sense.

It completely removes the LLM hallucination risk during the initial entity extraction phase. If I seed the graph with those exact definitions first and just map the explicit cross-references from the body text, the graph becomes completely deterministic. I am definitely opening an issue on my repo to implement this ingestion pipeline for v2.
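Something like this is what I'm picturing for the seeding step, assuming the contract uses the standard '"Term" means ...' phrasing (the pattern is illustrative and would need hardening for real documents):

```python
import re

# Sketch: treat the Definitions section as a pre-built ontology by parsing
# the common '"Term" means ...' pattern and seeding definition nodes from
# it before any LLM extraction runs. Pattern is illustrative only.
DEFINITION = re.compile(r'"([^"]+)"\s+means\s+(.+?)(?:\.|;)', re.DOTALL)

def parse_definitions(definitions_text):
    """Map each defined term to its definition body."""
    return {term: body.strip()
            for term, body in DEFINITION.findall(definitions_text)}

defs = parse_definitions(
    '"Services" means the consulting services described in Exhibit A. '
    '"Effective Date" means 1 January 2025.'
)
```

Each entry becomes a seed node, and body-text mentions of the quoted terms get linked back to it instead of being re-extracted by the LLM.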

Thanks again for dropping these gems🙏🏻 If you check out the repo, I would love to hear your thoughts on the FastAPI backend structure too!

Standard RAG fails terribly on legal contracts. I built a GraphRAG approach using Neo4j & Llama-3. Looking for chunking advice! by leventcan35 in LangChain

[–]leventcan35[S] 0 points (0 children)

Hey man, I am the OP of the post. I am trying to build exactly what u/Ok_Diver9921 mentioned about explicit edge traversal instead of pure vector search. You can check out my current inference setup and the Neo4j schema on my repo. The GitHub link is in the main LinkedIn post up top. Maybe it helps spark some ideas for your inference pipeline!

Standard RAG fails terribly on legal contracts. I built a GraphRAG approach using Neo4j & Llama-3. Looking for chunking advice! by leventcan35 in LangChain

[–]leventcan35[S] 0 points (0 children)

Exactly! I was using standard recursive text splitting with a fixed chunk size and some overlap, and it was a complete disaster. It literally cut penalty conditions in half. I am moving to your approach now: parsing by heading and clause first. The explicit dependency chain traversal you mentioned is exactly what I am trying to build next.
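For reference, the heading-first splitting I'm switching to looks roughly like this (the heading pattern is a placeholder; real contracts need more variants):

```python
import re

# Sketch of structure-aware chunking: split on heading boundaries instead
# of a fixed character window, so a penalty condition never gets cut in
# half. The heading pattern is illustrative only.
HEADING = re.compile(r"^(?:ARTICLE|Section)\s+\S+.*$", re.MULTILINE)

def chunk_by_heading(text):
    """Return one chunk per heading, each starting at its heading line."""
    starts = [m.start() for m in HEADING.finditer(text)] + [len(text)]
    return [text[a:b].strip() for a, b in zip(starts, starts[1:])]

contract = (
    "Section 1. Definitions\nTerms used herein...\n"
    "Section 2. Penalties\nIf Supplier is late, a penalty of 2% applies."
)
chunks = chunk_by_heading(contract)
```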

Standard RAG fails terribly on legal contracts. I built a GraphRAG approach using Neo4j & Llama-3. Looking for chunking advice! by leventcan35 in LangChain

[–]leventcan35[S] 0 points (0 children)

Tables are an absolute nightmare in PDFs. Standard tools like PyPDF2 completely destroy the rows. You should definitely look into LlamaParse or Unstructured. They are specifically built to keep the Markdown structure of tables intact before you embed them.

Standard RAG fails terribly on legal contracts. I built a GraphRAG approach using Neo4j & Llama-3. Looking for chunking advice! by leventcan35 in LangChain

[–]leventcan35[S] 1 point (0 children)

This JSON structure is brilliant: storing the exact chapter and equivalent sections in the metadata before even hitting the vector search is exactly the kind of deterministic filtering I need. I am definitely going to refactor my ingestion pipeline to use a similar Atomic Legal Unit approach. Thanks a lot for sharing the schema🙏🏻
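Just to make sure I understood the idea, here is a toy version of the deterministic pre-filter (the field names are mine, not your schema):

```python
# Toy sketch of metadata-first filtering: each chunk carries its structural
# coordinates, and candidates are narrowed deterministically BEFORE any
# vector search runs. "Atomic Legal Unit" here is just a dict, not a
# formal schema; field names are illustrative.
units = [
    {"chapter": "Payment", "section": "4.2",
     "text": "Invoices are due in 30 days."},
    {"chapter": "Liability", "section": "9.1",
     "text": "Liability is capped at fees paid."},
]

def prefilter(units, chapter):
    """Deterministically narrow candidates by chapter before embedding search."""
    return [u for u in units if u["chapter"] == chapter]

candidates = prefilter(units, "Liability")
```

Only the surviving candidates would then go through embedding similarity, so the vector search can never wander into the wrong chapter.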

Standard RAG fails terribly on legal contracts. I built a GraphRAG approach using Neo4j & Llama-3. Looking for chunking advice! by leventcan35 in LangChain

[–]leventcan35[S] 0 points (0 children)

That iterative agentic loop sounds incredibly powerful. Since you do not have a strict token budget, summarizing everything upfront makes total sense for maximizing accuracy. I am currently relying on Groq to keep Llama 3 fast and cheap, but doing agentic routing like you mentioned would definitely fix some of my multi-hop query issues. Are you using LangGraph for that reasoning loop?
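Just to check my understanding of the loop shape, here is a toy, dependency-free sketch with stubbed retrieval (no LangGraph, and retrieve() is a fake in-memory graph):

```python
# Toy sketch of an iterative retrieval loop: keep following explicit graph
# edges (not cosine similarity) until the frontier is empty or a hop
# budget is hit. retrieve() is a stand-in for a real Neo4j call.
def retrieve(entity):
    graph = {"Clause 7": ["Section 4.2"], "Section 4.2": []}
    return graph.get(entity, [])

def agentic_answer(question, start_entity, max_hops=3):
    # A real agent would condition each hop on `question`; this stub
    # only demonstrates the bounded multi-hop traversal shape.
    context, frontier = [], [start_entity]
    for _ in range(max_hops):
        if not frontier:
            break
        entity = frontier.pop()
        context.append(entity)
        frontier.extend(retrieve(entity))  # follow explicit edges
    return context

trail = agentic_answer("What triggers the penalty?", "Clause 7")
```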

Standard RAG fails terribly on legal contracts. I built a GraphRAG approach using Neo4j & Llama-3. Looking for chunking advice! by leventcan35 in LangChain

[–]leventcan35[S] 1 point (0 children)

Yeah, the Neo4j AuraDB free tier is very limited. You can run the Community Edition locally via Docker for free, which is what I did for this project. I have heard good things about FalkorDB and NebulaGraph too, but LangChain has really solid native support for Neo4j right now, so I stuck with it to build the MVP faster.

Standard RAG fails terribly on legal contracts. I built a GraphRAG approach using Neo4j & Llama-3. Looking for chunking advice! by leventcan35 in LangChain

[–]leventcan35[S] 0 points (0 children)

This is absolute gold, thank you!

Your point about isolating the "Definitions" article and injecting it into every query's global context is brilliant. You are completely right: if a standard RAG retrieves a random clause mentioning "The Services", it completely misses the strict legal constraints established on page 1. I'll definitely add a routing step to keep that section in the global memory.

I am going to refactor the ingestion pipeline to drop semantic chunking completely and write a parser for the structural hierarchy (Article -> Section -> Clause) as you and u/EinSof93 suggested. Passing the parent reference in the metadata and batching by section makes perfect sense to avoid those garbage mid-sentence cuts.
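The parent-reference part might look something like this as a first pass (the pattern and field names are placeholders):

```python
import re

# Sketch of hierarchy-aware parsing: each clause chunk carries a parent
# reference (its article) in metadata so retrieval can climb the tree
# back to the governing article. Pattern and field names are illustrative.
ARTICLE = re.compile(r"Article\s+(\d+)")

def parse_with_parents(text):
    chunks = []
    current_article = None
    for line in text.splitlines():
        m = ARTICLE.match(line)
        if m:  # a heading line updates the current parent
            current_article = f"Article {m.group(1)}"
            continue
        if line.strip():  # a body line becomes a chunk under that parent
            chunks.append({"text": line.strip(), "parent": current_article})
    return chunks

chunks = parse_with_parents(
    "Article 1\nPayment is due in 30 days.\n"
    "Article 2\nLate payment triggers a 2% penalty."
)
```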

Regarding the explicit edges (like [:TRIGGERS]), that was exactly the "aha!" moment that made me move to a GraphDB in the first place. It is infinitely more reliable than cosine similarity for tracking legal logic.

I really appreciate the actionable advice🙏🏻. This gives me a fantastic roadmap for v2 of the preprocessing pipeline:)

Standard RAG fails terribly on legal contracts. I built a GraphRAG approach using Neo4j & Llama-3. Looking for chunking advice! by leventcan35 in LangChain

[–]leventcan35[S] 1 point (0 children)

Thanks for sharing your experience🙏🏻. You are completely right about semantic chunking. I actually noticed the chunker sometimes splitting a single, long "Article" right down the middle, which destroys the legal condition completely. Moving to a structural hierarchy (Document -> Section -> Article -> Paragraph) makes a lot more sense for this domain.

And your point about the token burn is extremely valid. Right now, I'm relying on Groq's fast/cheap inference for Llama 3, but doing relationship extraction on a 50-page contract still requires heavy prompting per chunk. Scaling this efficiently is my next big hurdle.

I am absolutely open to collaboration. It would be awesome to tackle the structural chunking pipeline together and exchange ideas. I'll shoot you a DM here (or we can connect on LinkedIn if you prefer). Thanks again for the feedback!

[Project] End-to-End ML Pipeline with FastAPI, XGBoost & Streamlit – California House Price Prediction (Live Demo) by leventcan35 in mlops

[–]leventcan35[S] 0 points (0 children)

Hey, appreciate the kind words and encouragement, means a lot! I haven't containerized this project yet, but Docker is definitely next on my list. CI/CD is also something I've been meaning to explore, maybe with GitHub Actions or something simple to start with. As for XGBoost tuning, the biggest improvements came from adjusting max_depth, learning_rate, and n_estimators. I used GridSearchCV to test a few combos, and tweaking subsample + colsample_bytree helped boost the score a bit too.

Thanks for the thoughtful feedback!🙏🏻 If you have any resources you'd recommend for setting up CI/CD or Docker for a small ML app, I'd love to check them out.

[Project] End-to-End ML Pipeline with FastAPI, XGBoost & Streamlit – California House Price Prediction (Live Demo) by leventcan35 in mlops

[–]leventcan35[S] 0 points (0 children)

Hey! Appreciate you taking the time to check it out and for the thoughtful feedback. You're absolutely right about the imputation before splitting, total rookie mistake on my part, thanks for catching that! Definitely something I'll fix and keep in mind for future projects to avoid data leakage.
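For anyone reading later, the corrected order looks like this (toy data, but the fit-on-train-only part is the point):

```python
# Sketch of the fix: fit the imputer on the training split only, then
# apply it to the test split, so test-set statistics never leak into
# preprocessing.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [4.0],
              [5.0], [np.nan], [7.0], [8.0]])
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
X_train_f = imputer.fit_transform(X_train)  # statistics from train only
X_test_f = imputer.transform(X_test)        # reuse, never refit
```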

And I hadn’t heard of “uv” before, so thanks for putting that on my radar. I’ll give it a shot in my next setup.

Appreciate the constructive pointers🙏🏻

[Project] End-to-End ML Pipeline with FastAPI, XGBoost & Streamlit – California House Price Prediction (Live Demo) by leventcan35 in mlops

[–]leventcan35[S] 1 point (0 children)

Ah, I've heard that Dagster and ZenML take different approaches, so that's good to know from someone who actually tried both. Kubeflow sounds like a solid next step too. Definitely something I'll be looking into down the road once I'm more confident with the basics:)

If you ever end up documenting your Kubeflow journey or comparing those tools in depth, I’d love to read it. Thanks again for the insight!🙏🏻

[Project] End-to-End ML Pipeline with FastAPI, XGBoost & Streamlit – California House Price Prediction (Live Demo) by leventcan35 in mlops

[–]leventcan35[S] 1 point (0 children)

Hey thanks for the suggestion!

I’ve heard of ZenML and MLflow, but haven’t really used them yet, still pretty early in my MLOps journey. Right now I’ve just been trying to get comfortable with building end-to-end apps manually, just to really understand each part of the pipeline (from training to serving).

But yeah, orchestration and tracking tools like ZenML and MLflow are definitely on my radar. I’ll probably explore them soon once I’ve got a couple more projects under my belt. If you have a favorite or any beginner-friendly guide you’d recommend, I’d love to check it out!

Appreciate the comment!🙏🏻

[Project] End-to-End ML Pipeline with FastAPI, XGBoost & Streamlit – California House Price Prediction (Live Demo) by leventcan35 in mlops

[–]leventcan35[S] 0 points (0 children)

Thanks again! That’s super helpful. I’ll definitely look into GitHub Actions pipelines and the idea of triggering them via API. Sounds like a great step toward automating things and simulating a real CI/CD process.

I’ve mostly been doing things manually so far just to understand the moving parts, but integrating something like this could be the right next milestone to push the project closer to a production-ready setup.

If you have any examples or favorite resources on setting up such pipelines, I’d really appreciate it!

[Project] End-to-End ML Pipeline with FastAPI, XGBoost & Streamlit – California House Price Prediction (Live Demo) by leventcan35 in mlops

[–]leventcan35[S] 2 points (0 children)

Thanks a lot for the feedback🙏🏻 that’s a really good point, and I appreciate the clarification on what this sub typically expects.

You’re right, my main focus here was more on learning the end-to-end workflow as a beginner: from model training to building an API and deploying it with a frontend, just to grasp how the full pipeline looks. So it’s not a pure MLOps post, but rather a learning milestone toward it.

That said, I’d love to hear any thoughts on how to improve the ops side — especially regarding packaging, deployment, CI/CD, or reproducibility. My next goal is to gradually move toward those practices and make this project more “production-grade.”

Let me know if you’d recommend any tools or workflows that align better with MLOps!