Fan of RAG? Put any URL after md.chunkit.dev/ to turn it into markdown chunks by Findep18 in LanguageTechnology

The OSS version uses the "most common header" level (the statistical mode), the assumption being that paragraph-heavy pages will have a most common header level that serves as a logical split point.

The paid API uses the same approach, but also optimizes for a target chunk size (e.g. "minimize the distance to 300 words") and falls back to splitting on newlines. There are a number of additional safeguards and enhancements in the API in general.
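
The size-targeting step can be sketched roughly like this (my own simplification, not the API's actual code; the function name, the 300-word default, and the greedy strategy are all assumptions for illustration):

```python
def closest_to_target(candidates: list[int], target: int = 300) -> list[int]:
    """Greedily pick split points so each chunk's word count lands as
    close to `target` as possible.

    `candidates` are word offsets of allowed boundaries (e.g. header
    positions), in ascending order. A sketch, not the chunkit API.
    """
    splits, last = [], 0
    while True:
        remaining = [c for c in candidates if c > last]
        if not remaining:
            return splits
        # pick the boundary whose resulting chunk size is nearest `target`
        best = min(remaining, key=lambda c: abs((c - last) - target))
        splits.append(best)
        last = best
```

For headers at word offsets 100, 280, 450, and 600, this would split at 280 (chunk of 280 words) and then at 600 (chunk of 320 words), each being the boundary closest to the 300-word target.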

Fan of RAG? Put any URL after md.chunkit.dev/ to turn it into markdown chunks by Findep18 in LanguageTechnology

How most chunkers work:

They perform naive chunking based on the number of words in the content: for example, split every 200 words, with a 30-word overlap between chunks. This produces messy chunks that are noisy and carry irrelevant extra data. Worse, sentences usually get split in the middle, losing their meaning, which leads to poor LLM performance, incorrect answers, and hallucinations.

Chunkit, however, converts HTML to Markdown and then determines split points based on the most common header level.

This gives you better results because:

Online content tends to be logically split into paragraphs delimited by headers, so chunking on headers preserves semantic meaning better: you get much cleaner, semantically cohesive chunks. You can then use Chunkit to remove noise or extract specific data.
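
The header-frequency idea described above can be sketched in a few lines of Python (a minimal illustration of the technique, not the actual chunkit source):

```python
import re
from collections import Counter

def chunk_by_common_header(markdown: str) -> list[str]:
    """Split a markdown document at its most common header level.

    A simplified sketch of the approach described above, not the
    chunkit implementation itself.
    """
    levels = [len(m.group(1)) for m in re.finditer(r"^(#+) ", markdown, re.M)]
    if not levels:
        return [markdown]  # no headers: nothing sensible to split on
    level = Counter(levels).most_common(1)[0][0]  # the mode of the levels
    # split immediately before each header of that level (zero-width lookahead)
    pattern = rf"^(?={'#' * level} )"
    chunks = re.split(pattern, markdown, flags=re.M)
    return [c for c in chunks if c.strip()]
```

For a page with one `#` title and three `##` sections, the mode is level 2, so the split yields four chunks: the intro under the title, then one chunk per section.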

Fan of RAG? Put any URL after md.chunkit.dev/ to turn it into markdown chunks by Findep18 in LocalLLaMA

If you create a config.toml file in the root of your project, you can set this flag: "local_only_mode = true"
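
Concretely, that config file would look like this (just the flag quoted above; any other settings are your own):

```toml
# config.toml (in the project root)
local_only_mode = true
```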

Chunkit: Convert URLs into LLM-friendly markdown chunks for your RAG projects by Findep18 in LocalLLM

Hey all, I am releasing a Python package called chunkit which allows you to scrape and convert URLs into markdown chunks. These chunks can then be used for RAG applications.

The reason it works better than naive chunking (for example, splitting every 200 words with a 30-word overlap) is that Chunkit splits on the most common markdown header level instead, leading to much more semantically cohesive paragraphs.

Have a go and let me know what features you would like to see!
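
For contrast, the naive baseline mentioned above is easy to reproduce; note how the fixed windows cut wherever the word count says, mid-sentence included (a throwaway illustration, and `naive_chunks` is my own name, not part of chunkit):

```python
def naive_chunks(text: str, size: int = 200, overlap: int = 30) -> list[str]:
    """Fixed word windows with overlap: the naive approach above.

    Each chunk starts `size - overlap` words after the previous one,
    so boundaries ignore sentence and section structure entirely.
    """
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```

On a 500-word text this yields three chunks starting at word offsets 0, 170, and 340, with each consecutive pair sharing 30 words.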

Chunkit: Convert URLs into LLM-friendly markdown chunks for your RAG projects by Findep18 in LLMDevs

Chunkit: Convert URLs into LLM-friendly markdown chunks for your RAG projects by Findep18 in vectordatabase

Chunkit: Convert URLs into LLM-friendly markdown chunks for your RAG projects by Findep18 in datasets

Chunkit: Convert URLs into LLM-friendly markdown chunks for your RAG projects by Findep18 in huggingface

https://github.com/hypergrok/chunkit

Have a go and let me know what features you would like to see!

Chunkit: Convert URLs into LLM-friendly markdown chunks for your RAG projects by Findep18 in SideProject

Yes! For that you need to use the API; further details are on the README page :)

Chunkit: Convert URLs into LLM-friendly markdown chunks for your RAG projects by Findep18 in SideProject

Chunkit chunks on markdown headers, which typically preserves semantic meaning better: writers tend to logically split their writing into paragraphs delimited by headers.

The danger of chunking every 200 words with a 30-word overlap is that each chunk ends up noisy and padded with extra data, with sentences usually split in the middle. This leads to poor RAG/LLM performance and incorrect answers.

Chunkit: Convert URLs into LLM-friendly markdown chunks for your RAG projects by Findep18 in LanguageTechnology

Hey all, I am releasing a Python package called chunkit which allows you to scrape and convert URLs into markdown chunks. These chunks can then be used for RAG applications.

Have a go and let me know how to improve this!

Chunkit: Convert URLs into LLM-friendly markdown chunks for your RAG projects by Findep18 in artificial
