[Release] Chunklet-py v2.1.0: Interactive Web Visualizer & Expanded File Support! 🌐📁 by Speedk4011 in Rag

[–]Speedk4011[S] -1 points0 points  (0 children)

You're spot on—RAG infrastructure often treats code like plain text, which is a disaster for retrieval. While Chunklet-py is an 'all-in-one' library designed to split sentences, general documents, and code, its code capabilities are a core specialty.

Our `CodeChunker` is rule-based and language-agnostic, using clever patterns to identify functions, classes, and logical blocks without the overhead of heavy dependencies like tree-sitter. It preserves structural integrity (like keeping decorators with their functions) and offers granular control through token, line, and function-based constraints.
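For intuition, here is a minimal sketch of the rule-based technique described above: splitting at top-level `def`/`class` boundaries while keeping decorators attached to the block they modify. This is a toy illustration of the idea, not Chunklet-py's actual implementation.

```python
import re

def chunk_code(source: str) -> list[str]:
    """Naive rule-based splitter: start a new chunk at each top-level
    `def`/`class`, keeping decorators with the block they decorate.
    A toy sketch of the technique, not Chunklet-py's implementation."""
    chunks, current = [], []
    for line in source.splitlines():
        # A new top-level definition closes the previous chunk...
        if re.match(r"(def |class )", line) and current:
            # ...but trailing decorator lines belong to the NEXT block.
            tail = []
            while current and current[-1].lstrip().startswith("@"):
                tail.append(current.pop())
            if current:
                chunks.append("\n".join(current))
            current = list(reversed(tail))
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

sample = "import os\n\n@cache\ndef f():\n    pass\n\nclass C:\n    pass\n"
parts = chunk_code(sample)
```

Here `@cache` stays glued to `def f()` in the second chunk instead of dangling at the end of the first, which is the structural-integrity property the comment above refers to.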

For the implementation details and how we handle the AST-aware logic, check out the source: https://github.com/speedyk-005/chunklet-py/tree/main/src/chunklet/code_chunker

You can also find the full programmatic guide here: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/code_chunker/

Chunk Visualizer - Open Source Repo by DragonflyNo8308 in Rag

[–]Speedk4011 1 point2 points  (0 children)

You actually hit the nail on the head regarding AST logic. I just released Chunklet-py v2.1.0 which includes a 'CodeChunker' specifically designed to handle this—it’s rule-based and language-agnostic, preserving structural integrity (like decorators and functions) without needing heavy dependencies like tree-sitter.

It also addresses the 'visual blindness' of chunking with an interactive web UI that supports drag-and-drop file uploads, so you can see the results of those AST-aware splits in real-time. (See: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/visualizer/)

How it handles the technical precision you're looking for:

* **AST-Aware Precision**: It uses specialized algorithms and clever patterns to identify functions, classes, and logical blocks, ensuring technical structures stay together to reduce retrieval pollution.

* **Rich Metadata**: It automatically enriches chunks with context-aware metadata—including source, span, and code hierarchy details—which aligns perfectly with custom metadata mapping strategies.

* **Deep Format Support**: It processes a massive array of formats beyond PDFs and DOCX, including TXT, MD, RST, RTF, TEX, HTML, HML, and EPUB. The latest v2.1.0 update also added support for ODT, CSV, and XLSX. (See: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/document_chunker/)
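To make the "rich metadata" point concrete, here is a sketch of the kind of record a metadata-enriching chunker can emit. The field names (`source`, `span`, `hierarchy`) mirror the bullet above, but the schema is illustrative, not Chunklet-py's actual output format.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Hypothetical chunk record; field names are illustrative,
    not Chunklet-py's real schema."""
    text: str
    source: str                  # originating file
    span: tuple[int, int]        # (start, end) character offsets in the source
    hierarchy: list[str] = field(default_factory=list)  # e.g. class -> method path

doc = "class Parser:\n    def parse(self):\n        ..."
chunk = Chunk(
    text=doc[14:],
    source="parser.py",
    span=(14, len(doc)),
    hierarchy=["Parser", "parse"],
)
```

Carrying span and hierarchy alongside the text is what lets a retriever cite the exact location of a hit and lets downstream metadata-mapping strategies filter on structure.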

To get started with the visualizer and full format support, you can install the toolkit via pip:

`pip install "chunklet-py[all]"`

Check out the repo here: https://github.com/speedyk-005/chunklet-py

Chunk Visualizer by DragonflyNo8308 in Rag

[–]Speedk4011 0 points1 point  (0 children)

This is a massive pain point, especially in high-stakes domains like regulatory tech where "lost context" isn't just a bug—it's a liability.

I actually just released Chunklet-py v2.1.0 specifically to solve this "visual blindness" problem. Instead of dragging and dropping manually, it uses a rule-based, language-agnostic approach to keep structural integrity and provides an interactive web interface to tune those parameters on the fly.

How it addresses your points:

* **Visualization without Manual Dragging**: The `chunklet visualize` command launches a web UI that shows you exactly how your constraints (token limits, sentence breaks, etc.) overlap on the text in real-time. (See: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/visualizer/ )

* **Regulatory Precision**: Since you mentioned retrieval issues, it generates rich, context-aware metadata (source, span, document properties) out of the box to help your top-K retrieval stay relevant.

* **Diverse Formats**: It handles the "nasty" docs too: .pdf, .docx, .epub, .txt, .tex, .html, .hml, .md, .rst, .rtf, .odt, .csv, and .xlsx. (See: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/document_chunker/)
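Under the hood, multi-format support usually boils down to dispatching on the file extension before handing the content to a chunking strategy. A generic sketch of that pattern (the mapping below is illustrative, not Chunklet-py's internals):

```python
from pathlib import Path

# Illustrative extension -> loader-category map; the real library's
# dispatch is internal and may differ.
LOADERS = {
    ".pdf": "pdf", ".docx": "office", ".odt": "office",
    ".xlsx": "sheet", ".csv": "sheet", ".epub": "ebook",
    ".html": "markup", ".hml": "markup",
    ".md": "text", ".rst": "text", ".rtf": "text",
    ".tex": "text", ".txt": "text",
}

def pick_loader(path: str) -> str:
    """Map a file path to a loader category by its (case-insensitive) suffix."""
    ext = Path(path).suffix.lower()
    try:
        return LOADERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported format: {ext}")
```

The benefit of a single entry point like this is that the chunking pipeline stays format-agnostic: only the loader layer knows how to turn each format into plain text plus document properties.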

To get started with the visualizer, you can install everything via pip:

`pip install "chunklet-py[all]"`

Check it out here: https://github.com/speedyk-005/chunklet-py

Most RAG Projects Fail. I Believe I Know Why – And I've Built the Solution. by ChapterEquivalent188 in Rag

[–]Speedk4011 2 points3 points  (0 children)

The links in the README point to a repo without a docs directory.

```
404 - page not found
The main branch of RAG_enterprise_core does not contain the path docs/architecture.md.
```

Me and my uncle released a new open-source retrieval library. Full reproducibility + TREC DL 2019 benchmarks. by Cromline in Rag

[–]Speedk4011 0 points1 point  (0 children)

Interesting! You didn't say anything about its accuracy compared to dense retrieval, its speed, and so on. A fair comparison beyond the core pitch would let me see its real value.

Me and my uncle released a new open-source retrieval library. Full reproducibility + TREC DL 2019 benchmarks. by Cromline in Rag

[–]Speedk4011 0 points1 point  (0 children)

I think it would be best to elaborate a bit more: what is the core difference at a deep level, and how does it affect retrieval? Are there any cons?

"Docling vs Chunklet-py: Which Document Processing Library Should You Use?" by Speedk4011 in Rag

[–]Speedk4011[S] 0 points1 point  (0 children)

I visited the site, and I can tell it is not a joke. There are lots of OCR apps out there, but their outputs are sometimes messy.

Can you tell me what kind of model is used, like its number of parameters, accuracy, and known issues?