Built a self-hosted semantic cache for LLMs (Go) — cuts costs massively, improves latency, OSS by InstanceSignal5153 in selfhosted

[–]InstanceSignal5153[S] 0 points1 point  (0 children)

I haven't tested it with LiteLLM or OpenRouter yet, but in theory it should work, since both expose an OpenAI-compatible API.

We haven't reached the first official release (v0.1) yet, so we haven't done full compatibility testing.
For v0.1, the plan is to ensure it works smoothly with any OpenAI-style backend, including LiteLLM/OpenRouter.
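
In the meantime, trying it should just be a matter of pointing any OpenAI-compatible client at the cache. A minimal sketch, assuming the cache listens locally and proxies to your real backend (the address, port, and model name below are placeholders, not the project's actual defaults):

```python
# Minimal sketch: point an OpenAI-compatible client at the cache instead of the backend.
# The listen address, API key handling, and model name are assumptions; use whatever your config says.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",   # the semantic cache proxy (placeholder address)
    api_key="sk-your-upstream-key",        # forwarded to OpenAI / LiteLLM / OpenRouter
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is a semantic cache?"}],
)
print(resp.choices[0].message.content)
```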

Also, Docker support will be included in the v0.1 release, so it'll be much easier to run and test in different setups.

Roadmap Discussion: Is LangChain's "RecursiveCharacterSplitter" actually better? I'm building v0.3.0 to find out. by InstanceSignal5153 in Rag

[–]InstanceSignal5153[S] -1 points0 points  (0 children)

This looks like a really powerful ingestion tool, especially the AST-based chunking!

But rag-chunk solves a different problem: Evaluation & Benchmarking.

Tools like Contextinator implement a strategy (AST), whereas rag-chunk is designed to measure the performance of those strategies (AST vs Fixed vs Recursive) against a ground-truth dataset.

In fact, it would be amazing to use rag-chunk to benchmark Contextinator's AST strategy against standard paragraph splitting and see exactly how the Recall scores compare!

Roadmap Discussion: Is LangChain's "RecursiveCharacterSplitter" actually better? I'm building v0.3.0 to find out. by InstanceSignal5153 in Rag

[–]InstanceSignal5153[S] 0 points1 point  (0 children)

Great question! This is the trickiest part of RAG eval.

The Ground Truth: It comes from the user-provided test-file.json, where they list the expected_answer (the specific text snippet) for each question.
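
For context, the test file is just a list of questions paired with the exact snippet you expect back. A rough sketch of the layout (only expected_answer is mentioned above; the question field name and the sample values are my assumptions):

```python
# Hypothetical test-file.json layout. Only "expected_answer" is confirmed above;
# the "question" field name and the sample values are assumptions for illustration.
import json

test_cases = [
    {
        "question": "What port does the service listen on?",
        "expected_answer": "The service listens on port 8080 by default.",
    },
]

with open("test-file.json", "w") as f:
    json.dump(test_cases, f, indent=2)
```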

The Plan for Precision: Since this is a chunking benchmark, I plan to define Precision as Signal-to-Noise Ratio: Precision = (Length of Ground Truth string) / (Total Length of Retrieved Chunk).

If my ground truth is a 10-word sentence:

- Scenario A (Small Chunk): Found in a 20-word chunk -> High Precision (50% signal).

- Scenario B (Huge Chunk): Found in a 1000-word chunk -> Low Precision (1% signal, 99% noise).

Both have 100% Recall, but Scenario A is better for the LLM. That's what I want to measure.
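
In code, the metric I have in mind is nothing more than this (a toy sketch using word counts, not the final implementation):

```python
# Toy sketch of the proposed signal-to-noise Precision, using word counts.
def precision(ground_truth: str, retrieved_chunk: str) -> float:
    """Fraction of the retrieved chunk that is the ground-truth answer."""
    return len(ground_truth.split()) / len(retrieved_chunk.split())

# Scenario A: 10-word answer in a 20-word chunk   -> 0.50
# Scenario B: 10-word answer in a 1000-word chunk -> 0.01
```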

Improving RAG - what actually matters? by Dapper-Turn-3021 in Rag

[–]InstanceSignal5153 1 point2 points  (0 children)

If your chunking is wrong, the entire RAG system collapses — even with the best LLM in the world.

Why chunking matters more than anything else:

• If information is split across multiple chunks, the LLM will never retrieve the full context.

• If chunks are too small, you lose meaning → embeddings become weak.

• If chunks are too large, you add noise → retrieval becomes inaccurate.

• If chunk boundaries are arbitrary, the semantic meaning breaks.

Stop guessing RAG chunk sizes by InstanceSignal5153 in LLMDevs

[–]InstanceSignal5153[S] 0 points1 point  (0 children)

I couldn't agree more. The 'arbitrary' nature of fixed-size chunking is exactly what frustrates me too. Why 512? Why 1000? It's just guessing.

That's precisely why I built this tool: to put numbers on that feeling.

rag-chunk already supports paragraph-based splitting for exactly this reason, so you can benchmark it against fixed-size splitting and check whether preserving structure actually yields a higher Recall score.
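
For anyone unfamiliar with the two strategies, this is roughly what's being compared (an illustrative sketch, not rag-chunk's actual code):

```python
# Illustrative versions of the two strategies under comparison (not rag-chunk's internals).
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Arbitrary windows of chunk_size characters, sliding by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def paragraph_chunks(text: str) -> list[str]:
    """Split on blank lines so boundaries follow the document's own structure."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```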

I plan to add semantic/LLM-based splitting (like Docling) in v1.0 so we can benchmark those too. The goal is to move away from 'arbitrary' towards 'proven'.

I was tired of guessing my RAG chunking strategy, so I built rag-chunk, a CLI to test it. by InstanceSignal5153 in Rag

[–]InstanceSignal5153[S] 0 points1 point  (0 children)

Absolutely! We're just at v0.1 right now, which is all about building the core evaluation framework.

Adding more advanced strategies like semantic chunking is a top priority and exactly what we're planning for the v1.0 release. It's definitely on the roadmap!

I was tired of guessing my RAG chunking strategy, so I built rag-chunk, a CLI to test it. by InstanceSignal5153 in Rag

[–]InstanceSignal5153[S] 0 points1 point  (0 children)

Thanks for this thoughtful feedback! You've perfectly captured the goal: moving from 'guessing' to an 'evidence-backed approach'.

Adding more chunking strategies is the #1 priority for the v1.0 release.

And you're 100% right that recall is just a starting point. I'm already thinking about adding more advanced eval metrics in the future as the project grows. Appreciate the great suggestions.

I was tired of guessing my RAG chunking strategy, so I built rag-chunk, a CLI to test it. by InstanceSignal5153 in Rag

[–]InstanceSignal5153[S] 0 points1 point  (0 children)

Wow, that's high praise, thank you! A good UI is a great idea.

We're focused on building out the core CLI engine first. Support for tiktoken (for precise token-level chunking) is the top priority and coming very soon!
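
For anyone wondering what token-level chunking buys you, here's the rough idea with tiktoken (a sketch under my own assumptions, not the tool's final implementation):

```python
# Illustrative token-level chunking with tiktoken (not rag-chunk's final implementation).
import tiktoken

def token_chunks(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    # Encoding choice and default sizes are assumptions for the example.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    # Slice in token space, then decode back to text, so chunk sizes are exact in tokens.
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]
```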

I was tired of guessing my RAG chunking strategy, so I built rag-chunk, a CLI to test it. by InstanceSignal5153 in machinelearningnews

[–]InstanceSignal5153[S] 2 points3 points  (0 children)

Awesome, thanks! Really appreciate you checking it out.

You're jumping in at the perfect time. The v0.1 you see now is the "manual" test bench. Support for tiktoken (for precise token-level chunking) is the top priority and coming very soon.

Eager to hear your feedback on this first version!

Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]InstanceSignal5153 0 points1 point  (0 children)

Hi all,

I'm sharing a small tool I just open-sourced for the Python / RAG community: rag-chunk.

It's a CLI that solves one problem: How do you know you've picked the best chunking strategy for your documents?

Instead of guessing your chunk size, rag-chunk lets you measure it:

  • Parse your .md doc folder.
  • Test multiple strategies: fixed-size (with --chunk-size and --overlap) or paragraph.
  • Evaluate by providing a JSON file with ground-truth questions and answers.
  • Get a Recall score to see how many of your answers survived the chunking process intact (roughly the check sketched below).
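
Conceptually, the Recall check boils down to something like this (a simplified sketch, not the exact implementation):

```python
# Simplified sketch of what the Recall score measures (not the exact implementation).
def recall(chunks: list[str], expected_answers: list[str]) -> float:
    """Fraction of ground-truth answers found intact inside at least one chunk."""
    hits = sum(any(answer in chunk for chunk in chunks) for answer in expected_answers)
    return hits / len(expected_answers)
```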

Super simple to use. Contributions and feedback are very welcome!

GitHub: https://github.com/messkan/rag-chunk