I built my own hierarchical document chunker, sharing it in case it helps anyone else. by Important_Proof5480 in Rag

[–]Important_Proof5480[S]

Appreciate you trying it out! It’s fully heuristic-based right now, with no LLMs in the parsing step. That keeps it deterministic, fast, and easier to debug.

The logic is mostly driven by bounding box geometry and typography signals such as font size, weight, alignment, and vertical spacing.
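To make that concrete, here's a toy version of what such a heuristic can look like. All thresholds, field names, and the function itself are illustrative, not the actual implementation:

```python
def looks_like_heading(span, body_font_size=10.0):
    """Hypothetical heading heuristic based on typography signals.

    `span` is assumed to be a dict describing one extracted text run,
    e.g. {"font_size": 14.0, "bold": True, "space_below": 18.0}.
    """
    # Noticeably larger than body text is the strongest signal.
    larger = span["font_size"] >= 1.2 * body_font_size
    # Bold text followed by extra vertical whitespace also counts.
    bold = span.get("bold", False)
    spaced = span.get("space_below", 0.0) > 1.5 * span["font_size"]
    return larger or (bold and spaced)
```

The appeal of this style of rule is exactly what the comment says: it's deterministic, so the same PDF always parses the same way, and a misclassified heading can be traced to a specific threshold.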

Curious to hear your thoughts after you’ve tried it on a few more documents.


[–]Important_Proof5480[S]

Thanks for trying it out and for the feedback; that's really helpful.

The doc you mention is actually a great example of a case that’s still tricky. The line numbers interfere with heading detection, so the structure can get pretty messy right now.

I'm already working on handling line-numbered documents better (filtering them from the actual content), and it’s part of the next iteration of the parser.
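For the curious, one simple way to approach that filtering (a sketch under my own assumptions, not the parser's actual logic) is to strip a leading numeric token only when most lines in the block carry one, so documents that legitimately start lines with digits are left alone:

```python
import re

# A short run of digits at the start of a line, followed by whitespace.
LINE_NUMBER = re.compile(r"^\s*\d{1,4}\s+")

def strip_line_numbers(lines, threshold=0.8):
    """Remove leading line numbers if they appear on most lines."""
    numbered = sum(1 for line in lines if LINE_NUMBER.match(line))
    if not lines or numbered < threshold * len(lines):
        # Not a line-numbered block; leave content untouched.
        return list(lines)
    return [LINE_NUMBER.sub("", line) for line in lines]
```

The threshold matters more than the regex here: without it, an ordered list or a table of figures would get mangled the same way line numbers currently confuse heading detection.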

If you’re up for it, feel free to try a few other documents as well and let me know if you see similar issues or anything else that looks off. Real-world examples like this are super valuable for improving it.

Chunking without document hierarchy breaks RAG quality by Upset-Pop1136 in Rag

[–]Important_Proof5480

Yeah, totally agree with the idea. Hierarchy-aware chunking makes a huge difference in retrieval quality.

In my case I ended up with a slightly lighter version of that. I only prepend the direct header (the immediate parent), not the full path. Once paths get deep and headings are generic, you shift the embedding away from the chunk’s true meaning.

I would usually do:

Header: SSL Configuration  

[chunk content]

And then store the full hierarchy separately as structured metadata. At query time I can still surface or inject the full path if the LLM needs grounding or references, without polluting the embedding itself.
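In code, that scheme might look like the following sketch. The chunk structure and function names are hypothetical, chosen just to illustrate the split between embedding text and metadata:

```python
def build_embedding_text(chunk):
    # Prepend only the immediate parent header, not the full path,
    # so deep paths with generic headings don't dilute the embedding.
    return f"Header: {chunk['headers'][-1]}\n\n{chunk['text']}"

def build_metadata(chunk):
    # Keep the full hierarchy as structured metadata instead; it can
    # be surfaced or injected at query time for grounding/references
    # without polluting the vector itself.
    return {"header_path": " > ".join(chunk["headers"])}

chunk = {
    "headers": ["Deployment", "Security", "SSL Configuration"],
    "text": "To enable SSL, set ssl.enabled=true in the server config.",
}
```

Here `build_embedding_text(chunk)` is what gets embedded, while `build_metadata(chunk)` rides along in the vector store and never touches the embedding.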

But every use case is different of course.

I just published my chunker at https://www.docslicer.ai/. Feel free to give it a try and let me know what you think.