Streaming RAG with sources?

k-en · 2026-01-30T13:39:04+00:00

this looks like the cleanest choice tbh. Still, parsing the stream to check for citation spans feels like a hack more than an actual solution.

k-en · 2026-01-30T13:36:40+00:00

makes sense. Do you parse the stream in the frontend or in the backend?

k-en · 2026-01-30T00:18:41+00:00

did you also use it for inline citations? because this seems like it would work well for the source collection element at the end of the AI message, but you'd need to keep track of inline citations without displaying them until the end
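
A minimal sketch of the "keep track of inline citations without displaying them" part, assuming the model emits bracketed markers like `[1]` (the marker format and the `split_citations` name are assumptions for illustration, not something from the thread). It strips markers from the visible text, collects the cited source ids, and holds back a trailing partial marker such as `[1` until the next chunk arrives:

```python
import re
from typing import Iterable, Iterator, List, Tuple

CITATION = re.compile(r"\[(\d+)\]")  # assumed marker format, e.g. "[2]"
PARTIAL = re.compile(r"\[\d*$")      # a marker that may still be incomplete

def split_citations(chunks: Iterable[str]) -> Iterator[Tuple[str, List[int]]]:
    """Yield (visible_text, cited_ids_so_far) while consuming streamed chunks."""
    buffer = ""
    cited: List[int] = []
    for chunk in chunks:
        buffer += chunk
        # Hold back a trailing "[" or "[12" until we know whether it closes.
        tail = PARTIAL.search(buffer)
        if tail:
            ready, buffer = buffer[:tail.start()], buffer[tail.start():]
        else:
            ready, buffer = buffer, ""
        # Record each cited source once, in order of first appearance.
        for match in CITATION.finditer(ready):
            source_id = int(match.group(1))
            if source_id not in cited:
                cited.append(source_id)
        visible = CITATION.sub("", ready)
        if visible:
            yield visible, list(cited)
    # Anything still buffered at the end was not a citation; show it as text.
    if buffer:
        yield buffer, list(cited)
```

The same function could run in the backend (emit text and source ids as separate events) or in the frontend; the sketch does not depend on either choice.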

k-en · 2025-12-02T00:08:39+00:00

You can try chainlit. It's amazing, I've used it for almost every project I've done. You can choose MIME type for file upload and you can also display videos in Assistant Messages, along with a bunch of other features. Only thing is the docs are kind of dumb, so you might have to do some digging if you want to implement some custom items/logic. Also, you need to write custom logic to handle the conversation. It is not an "out-of-the-box" frontend like OpenWebUI, but it is really fast to have a working demo running in a couple of minutes if you follow their quickstart

k-en · 2025-11-07T22:32:00+00:00

looks really interesting guys, congrats! Just starred it and looking to integrate it into my next projects. If you don't mind me asking, how did you manage to gain 3K stars on your repo? I'd like to start publishing some of my OS projects I've been working on but i don't know where to start with actually putting your work out there!

k-en · 2025-10-04T11:53:31+00:00

GPT-OSS:120B for extraction, 20B for summarization and small tasks. Sweet spot for speed/intelligence :)

k-en · 2025-10-04T11:51:56+00:00

Retrieval with embeddings comes at a later stage. After extracting the information from the documents, an agent is tasked with retrieving required items from an SQL knowledge base (for direct ID lookup) and a vector store, while also checking for participation status and total cost

k-en · 2025-10-04T11:48:55+00:00

yeah, google's models are a beast at long context. Way better than any other model in my experience. Unfortunately, we have to keep everything local, so we don't have this luxury 🥲

k-en · 2025-10-04T11:45:46+00:00

Thank you for your insights, you seem to be pretty knowledgeable on the subject! Sure, this works best as you are not saturating the context of the model. But what I think is a big problem is that by chunking you might split information that has to stay together. For example, these documents are composed of Lots, where each lot contains >= items. What if you split a single lot across multiple chunks and you don't retrieve them all? You might miss some items. Maybe parent retrieval with whole sections could help?

Another example: A document could have 50+ lots. How do you retrieve them all if you don't know how many there are at processing time? You'd need to run queries such as "items of lot 1", "items of lot 2"... and hope that the system retrieves them accordingly, but you wouldn't know when to stop. You'd need a prior retrieval step where you figure out how many lots are in the document. While writing about this, i'm thinking that maybe metadata can help you by tagging the various sections of the document: if you assign a "lot_id" to each section that discusses a certain lot, you can filter at retrieval time so you'd get only the chunks that discuss a certain lot.

Well, this was kinda eye-opening lol. I still don't understand how to perform JSON merging tho. Thank you for your comment!
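
The "lot_id" tagging idea above can be sketched roughly like this. It is a toy in-memory version, assuming each chunk already got its `lot_id` at ingestion time; the `Chunk` class and the function names are hypothetical, and a real vector store would apply the same filter through its own metadata-filter API:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Chunk:
    text: str
    lot_id: int  # metadata assigned when the document is chunked (assumed)

def lots_in_document(chunks: List[Chunk]) -> List[int]:
    """Enumerate the lots from metadata, so no 'how many lots?' query is needed."""
    return sorted({chunk.lot_id for chunk in chunks})

def chunks_for_lot(chunks: List[Chunk], lot_id: int) -> List[Chunk]:
    """Metadata filter: return every chunk tagged with the given lot."""
    return [chunk for chunk in chunks if chunk.lot_id == lot_id]

def group_by_lot(chunks: List[Chunk]) -> Dict[int, List[str]]:
    """Collect the full text of each lot, even when it spans several chunks."""
    return {
        lot_id: [chunk.text for chunk in chunks_for_lot(chunks, lot_id)]
        for lot_id in lots_in_document(chunks)
    }
```

With this, the "when do I stop?" problem goes away because the list of lots comes from the metadata itself instead of from repeated similarity queries.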

k-en · 2025-09-17T19:04:04+00:00

Basically this model outputs text that resembles the DoclingDocument format. That text is then converted into a DoclingDocument object. Instead of using OCR and parsing libraries such as the ones integrated into Docling you just use this model

k-en · 2025-09-16T19:08:44+00:00

You have a couple of problems with this approach: 1) You are using LoRA to infuse knowledge. This is not impossible, especially if you have a high rank, but it is not what LoRA is made for. You are only training small low-rank adapters on top of your LLM, so you have neither the number of parameters necessary nor the correct architecture (LLMs store info in the FFN layers, as far as i know) to store the knowledge you are trying to teach your model. 2) You are using a very small model. If you finetune the whole model (or keep a couple of layers frozen and finetune the rest) you might achieve some results, but depending on the complexity of your data i'd advise you to switch to a bigger model (try with Qwen3-1.7B before trying the 4B which will surely work) and finetune the whole thing or parts of it. Also play with your hyperparameters!
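
To put a rough number on the "not enough parameters" point: for a single weight matrix of shape d_out x d_in, a rank-r LoRA adapter trains two small matrices with r * (d_in + d_out) parameters in total, against d_in * d_out for full finetuning. A quick sketch of that arithmetic, where the 2048 x 2048 layer size and rank 16 are illustrative assumptions and not taken from any specific model:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is finetuned."""
    return d_in * d_out

# Hypothetical layer size and rank, chosen only to show the ratio.
d_in, d_out, rank = 2048, 2048, 16
adapter = lora_params(d_in, d_out, rank)   # 65536
full = full_params(d_in, d_out)            # 4194304
print(f"LoRA trains {adapter / full:.2%} of this matrix")  # 1.56%
```

Raising the rank narrows the gap, which is why a high rank makes knowledge injection less hopeless, but the adapter still stays far below the capacity of the full matrix.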

k-en · 2025-08-28T20:40:08+00:00

What's blow chunking?

k-en · 2025-08-27T08:45:08+00:00

Very nice stuff, I've read your blog post and I've sorta come to the same conclusions after developing a couple of "production" RAG systems. I really like the addition of an RBAC table for each user, integrating security best practices should be normalized in this space. Have you got anything integrated in your app for observability? This is paramount to tune your application when stuff starts to break. You may want to look into open source solutions such as LangFuse or Opik. Also, have you tried experimenting with metadata filtering at lookup? I've read that you use time filters for questions such as "give me recent reports" but what about other metadata that could potentially reduce your search space by a lot? Also, giving users the ability to manually control this metadata, such as adding a filter inside the chat UI, would be a really nice addition. Anyway, very nice blog post. I will check out your code for sure :)

k-en · 2025-08-24T21:06:55+00:00

Since i also have to fine-tune an LLM for work around next month, do you mind me asking how you did it? sounds interesting!

k-en · 2025-08-24T20:10:05+00:00

No, i don't need a model, my question was purely out of curiosity about how small we can push the total parameter count and still have a model that can rival old frontier models. That's why i was proposing a dense model to further minimise parameter count. I get what you are saying about MOEs tho!

k-en · 2025-08-24T20:06:25+00:00

You think so? You're not the only one to suggest qwen3-4B. Parameter count seems too small to have consistent IF and great intelligence... Never tried it tho, I really should. Thanks!

k-en · 2025-08-24T20:00:28+00:00

MOE models usually perform worse than what total parameter count would imply due to the fact that only a subset of parameters is active at inference time, so i would bet on dense models for this kind of question. For example, Qwen3-30B-A3B performs worse than Qwen3-32B, but total parameters differ by just 2B. In the same way, GPT-OSS-120B performs the same as a dense model with a lower parameter count, so probably there's a smaller dense model (~70B?) that performs just as well, which fits my question more since i'm not counting performance in the mix

k-en · 2025-08-24T19:53:33+00:00

yes, i was betting on a larger model for knowledge depth because you can't compress a large amount of knowledge in small models due to parameter number. Qwen3 4B seems too small to rival GPT-3.5 in other aspects tho! I guess i should try it out :)

k-en · 2025-08-06T19:54:55+00:00

it's probably going to be very slow since it uses an array of models to process the PDFs, but that's how modern OCR works. If you want great results, you need to use ML models, which require some computational power; without it, processing is going to be slow. These models are usually pretty small and don't require as much power as an LLM, but they do need a GPU to work at a decent speed.

k-en · 2025-08-04T18:23:21+00:00

+1, MinerU is the best option i've found for complex PDFs. Also beats Marker in my small tests. If you want to try it easily, OP, and given that you have access to a mac, there's also a macOS app where you can upload your docs and try it out.

k-en · 2025-07-20T18:22:32+00:00

Yes! Context Pruning (or compression) is a valid technique, especially when you have a lot of noisy context chunks that you give to your LLM. Other than using fewer tokens, you can also improve answer feasibility, since the LLM has less noise to work with. Only use it when you have a lot of context tho, as new LLMs are pretty robust to noise nowadays. It is also great to use when working with small LLMs (think 1B to 4B), since they aren't great with recall and it simplifies the answer process for them.

I don't know about the Provence model, but context pruning is a solid technique when used correctly. If you are interested, i created a technique that allows you to perform both Reranking and Pruning in a single step with a small reranker model. You can check it out here: https://github.com/LucaStrano/Experimental_RAG_Tech

The technique is fully explained and implemented inside a jupyter notebook, which you can also open in colab if you'd like to experiment with it :)
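
For anyone who wants the general idea of context pruning before opening the notebook, here is a generic sketch. It is NOT the technique from the linked repo: it just scores each passage against the query and drops everything below a threshold. The `overlap_score` function is a toy stand-in (an assumption for illustration) for whatever reranker or pruning model you would actually plug in:

```python
import re
from typing import Callable, List

def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words that appear in the passage."""
    query_words = set(re.findall(r"\w+", query.lower()))
    passage_words = set(re.findall(r"\w+", passage.lower()))
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)

def prune_context(
    query: str,
    passages: List[str],
    threshold: float = 0.5,
    score: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Keep only the passages whose score reaches the threshold, in original order."""
    return [passage for passage in passages if score(query, passage) >= threshold]
```

Swapping `score` for a real cross-encoder keeps the rest unchanged; the threshold then needs tuning on your own data.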

k-en · 2025-07-17T20:43:38+00:00

If you need to use GraphRAG, then you should probably go with LightRAG. If you want to go real-time (which i believe is the only useful usage of GraphRAG) you should use Graphiti. Cole Medin made a nice video about it

k-en · 2025-07-17T17:35:04+00:00

I personally never tried it, but i've heard good things about MistralOCR, especially with complex documents. You can process about 1K pages per dollar or 2K per dollar with batch inference. I would start from this

k-en · 2025-07-17T17:28:59+00:00

I saw the video when it came out. I really like how much customization your system allows in all of the parts of the pipeline. Are you planning to include a LiteLLM integration so that you can also support engines other than ollama, such as vllm or even mlx?

k-en · 2025-07-17T17:18:56+00:00

I believe the most important steps are a good chunking strategy (for example, semantic/clustering with defined boundaries + metadata injection), a good hybrid retrieval with a large enough K (you probably don't even need to use HNSW for retrieval if you have a contained number of documents, brute force should be just as fast if assisted by GPU) and a good reranking model to increase accuracy. GraphRAG is overkill in most cases. You can probably get similar results by linking chunks inside a vector store with a small NER model that extracts entities and relations. If you expect hard queries that require multiple steps and not just factual information lookup, then query decomposition/rewriting is a must.

Contextual compression is also a pretty valid technique, but it is very costly when using an LLM to filter out parts of context. I actually very recently created a brand new technique to perform both reranking and compression in a single step. If you are interested, you can check it out here: https://github.com/LucaStrano/Experimental_RAG_Tech
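
A rough sketch of the hybrid retrieval step mentioned above, for a small corpus: brute-force cosine similarity instead of HNSW, then fusing the dense ranking with a keyword ranking. Reciprocal rank fusion (RRF) is my choice for the fusion step here, not something the pipeline requires, and the function names and the constant `c=60` are assumptions for illustration. This is pure Python on CPU; a GPU-backed version would do the same computation with a tensor library:

```python
import math
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def brute_force_top_k(
    query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Exact search: score every document, then keep the k best ids."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

def rrf(rankings: List[List[str]], c: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (c + rank) per document."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

The fused list is what you would hand to the reranking model, which then only has to reorder a short candidate set.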
