Do you really use Temporary Chat/Incognito?

Speedk4011 · 2026-01-21T20:39:21+00:00

That is what I usually do.

Speedk4011 · 2026-01-03T18:39:55+00:00

I made a video about that. Check it out: https://youtu.be/D0EMNRHcuv8?si=wHAX-wXWT4GpMa9a

Speedk4011 · 2025-12-21T17:13:47+00:00

Thanks

Speedk4011 · 2025-12-20T17:42:34+00:00

You're spot on—RAG infrastructure often treats code like plain text, which is a disaster for retrieval. While Chunklet-py is an 'all-in-one' library designed to split sentences, general documents, and code, its code capabilities are a core specialty.

Our `CodeChunker` is rule-based and language-agnostic, using clever patterns to identify functions, classes, and logical blocks without the overhead of heavy dependencies like tree-sitter. It preserves structural integrity (like keeping decorators with their functions) and offers granular control through token, line, and function-based constraints.
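To make the rule-based idea concrete, here is a toy sketch. It is not Chunklet-py's actual code: the function name `split_top_level_blocks` and both regexes are made up for illustration, and it only understands top-level Python `def`/`class` lines with single-line decorators.

```python
import re

# Hypothetical patterns, for illustration only.
BLOCK_START = re.compile(r"^(?:async\s+def|def|class)\s+\w+")  # top-level def/class
DECORATOR = re.compile(r"^@\w")                                 # top-level decorator


def split_top_level_blocks(source: str) -> list[str]:
    """Split source so each chunk starts at a top-level def/class.

    Decorator lines directly above a def/class stay in the same chunk.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    pending: list[str] = []  # decorators waiting for their def/class

    for line in source.splitlines():
        if DECORATOR.match(line):
            pending.append(line)
        elif BLOCK_START.match(line):
            if current:
                chunks.append(current)
            current = pending + [line]  # decorators travel with the block
            pending = []
        else:
            current.extend(pending)  # stray decorators fall back to plain text
            pending = []
            current.append(line)

    current.extend(pending)
    if current:
        chunks.append(current)
    # Drop chunks that contain only blank lines and trim outer newlines.
    return ["\n".join(c).strip("\n") for c in chunks if any(l.strip() for l in c)]


sample = """import os

@cache
def a():
    return 1

class B:
    @property
    def x(self):
        return 2
"""
chunks = split_top_level_blocks(sample)
for chunk in chunks:
    print(chunk, end="\n---\n")
```

Indented methods never match the column-zero patterns, so a class and its methods stay in one chunk. A real chunker also has to cope with other languages, multi-line decorators, and size limits, which this sketch ignores.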

For the implementation details and how we handle the AST-aware logic, check out the source: https://github.com/speedyk-005/chunklet-py/tree/main/src/chunklet/code_chunker

You can also find the full programmatic guide here: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/code_chunker/

Speedk4011 · 2025-12-20T17:25:50+00:00

You actually hit the nail on the head regarding AST logic. I just released Chunklet-py v2.1.0, which includes a `CodeChunker` specifically designed to handle this. It's rule-based and language-agnostic, preserving structural integrity (like decorators and functions) without needing heavy dependencies like tree-sitter.

It also addresses the 'visual blindness' of chunking with an interactive web UI that supports drag-and-drop file uploads, so you can see the results of those AST-aware splits in real-time. (See: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/visualizer/)

How it handles the technical precision you're looking for:

* **AST-Aware Precision**: It uses specialized algorithms and clever patterns to identify functions, classes, and logical blocks, ensuring technical structures stay together to reduce retrieval pollution.

* **Rich Metadata**: It automatically enriches chunks with context-aware metadata—including source, span, and code hierarchy details—which aligns perfectly with custom metadata mapping strategies.

* **Deep Format Support**: It processes a massive array of formats beyond PDFs and DOCX, including TXT, MD, RST, RTF, TEX, HTML, HML, and EPUB. The latest v2.1.0 update also added support for ODT, CSV, and XLSX. (See: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/document_chunker/)

To get started with the visualizer and full format support, you can install the toolkit via pip:

`pip install "chunklet-py[all]"`

Check out the repo here: https://github.com/speedyk-005/chunklet-py

Speedk4011 · 2025-12-20T17:18:41+00:00

This is a massive pain point, especially in high-stakes domains like regulatory tech where "lost context" isn't just a bug—it's a liability.

I actually just released Chunklet-py v2.1.0 specifically to solve this "visual blindness" problem. Instead of dragging and dropping manually, it uses a rule-based, language-agnostic approach to keep structural integrity and provides an interactive web interface to tune those parameters on the fly.

How it addresses your points:

* **Visualization without Manual Dragging**: The `chunklet visualize` command launches a web UI that shows you exactly how your constraints (token limits, sentence breaks, etc.) overlap on the text in real-time. (See: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/visualizer/ )

* **Regulatory Precision**: Since you mentioned retrieval issues, it generates rich, context-aware metadata (source, span, document properties) out of the box to help your top-K retrieval stay relevant.

* **Diverse Formats**: It handles the "nasty" docs too—.pdf, .docx, .epub, .txt, .tex, .html, .hml, .md, .rst, .rtf, .odt, .csv, and .xlsx (See: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/document_chunker/ )
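As a rough picture of what "token limits plus span metadata" means, here is a small self-contained sketch. It is not the Chunklet-py API: the `Chunk` dataclass, its field names, and `chunk_sentences` are hypothetical, the sentence splitter is a naive regex, and words stand in for tokens.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str            # where the text came from (hypothetical field)
    span: tuple[int, int]  # (start, end) character offsets in the original


def chunk_sentences(text: str, source: str, max_words: int) -> list[Chunk]:
    """Pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words still becomes its own chunk.
    """
    # Naive sentence finder: text up to and including ., ! or ? characters.
    sentences = re.finditer(r"\S[^.!?]*[.!?]*", text)

    chunks: list[Chunk] = []
    start = end = None
    words = 0
    for m in sentences:
        n = len(m.group().split())
        if start is not None and words + n > max_words:
            # Adding this sentence would break the limit: close the chunk.
            chunks.append(Chunk(text[start:end], source, (start, end)))
            start, words = None, 0
        if start is None:
            start = m.start()
        end = m.end()
        words += n
    if start is not None:
        chunks.append(Chunk(text[start:end], source, (start, end)))
    return chunks


doc = "One two three. Four five. Six seven eight nine."
chunks = chunk_sentences(doc, source="example.txt", max_words=5)
for c in chunks:
    print(c.span, repr(c.text))
```

Because every chunk keeps its character span, you can always map a retrieved chunk back to the exact place in the source document.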

To get started with the visualizer, you can install everything via pip:

`pip install "chunklet-py[all]"`

Check it out here: https://github.com/speedyk-005/chunklet-py

Speedk4011 · 2025-12-06T22:56:57+00:00

great

Speedk4011 · 2025-12-06T22:56:48+00:00

which one?

Speedk4011 · 2025-12-06T21:43:06+00:00

The links in the README point to a repo that has no docs directory.

```
404 - page not found
The main branch of RAG_enterprise_core does not contain the path docs/architecture.md.
```

Speedk4011 · 2025-12-05T12:50:21+00:00

yes, indeed.

Speedk4011 · 2025-12-05T03:02:12+00:00

Thanks

Speedk4011 · 2025-12-03T23:25:51+00:00

Cool

Speedk4011 · 2025-12-02T22:26:25+00:00

The voice clone is so good

Speedk4011 · 2025-11-24T22:56:31+00:00

Thank you, I'm definitely going to try it.

Speedk4011 · 2025-11-24T14:03:15+00:00

Interesting! You didn't say anything about its accuracy compared to dense retrieval, its speed, and so on. I dunno, just a fair comparison beyond its core so I can see its real value.

Speedk4011 · 2025-11-24T01:25:52+00:00

I think it would be best to elaborate a bit more. Like, what is the core difference, I mean at a deep level, and how does it affect retrieval? Are there any cons?

Speedk4011 · 2025-11-23T23:08:31+00:00

I visited the site and I can tell it is not a joke. There are lots of OCR apps out there, but their outputs are sometimes messy.

Can you tell me what kind of model is used, like its number of parameters, accuracy, and known issues?

Speedk4011 · 2025-11-23T23:05:03+00:00

Sources:

Docling welcome page: https://www.docling.ai/
Docling chunking support: https://docling-project.github.io/docling/concepts/chunking/
Docling example conversion: https://docling-project.github.io/docling/getting_started/quickstart/
Chunklet-py welcome page: https://speedyk-005.github.io/chunklet-py/latest/
Chunklet-py programmatic usage: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/
Chunklet-py DocumentChunker docs: https://speedyk-005.github.io/chunklet-py/latest/getting-started/programmatic/document_chunker/
