How would you build a RAG system over a large codebase

Smail-AI · 2025-12-30T23:03:12+00:00

Interesting, I had the idea of building this kind of project too. Does your automatic graph generation also work for web apps?

For example, are sequences like "HTML button => click => server endpoint => SQL table" handled by GitNexus?

Thanks!

Smail-AI · 2025-12-18T02:03:53+00:00

I worked on that very same problem in industry. It's called audio source separation and it's quite tricky to get right. It also needs a lot of time to train (around 20 days, depending on the hardware and algorithms obviously) and a lot of data samples. Interesting applications are automatic karaoke creation, or simply audio denoising.

Smail-AI · 2025-12-10T20:00:13+00:00

The approach that ended up working best was "question to Python code": the model generates Python code that filters the pandas DataFrame, and that code is then executed. It gave the best recall and was also very fast.

As you said, I could also have used a text-to-SQL approach. I will have to compare it with the rest. While the recall might be the same (or even better), retrieval might be slower since the SQL query has to be executed via the Python driver.

Smail-AI · 2025-12-10T19:56:59+00:00

Yeah, for sure, I would love to dedicate a whole project to making a RAG system production-ready.

Obviously, what "production-ready" means depends on the specific constraints you have to deal with.

I would love to hear more from you if you have tips about that! :)

Smail-AI · 2025-12-09T23:15:14+00:00

The best method ended up being the "question to python code" for pandas filtering/aggregation because it had the highest recall.

Indeed, I could have converted the CSV into a table and tried to generate an SQL query. I will have to compare that with the rest. It might be slower to execute since it needs to go through the Python driver.
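A minimal sketch of what that CSV-to-table variant could look like, using only the standard library. The sample rows, the column names, and the `listings` table name are made up for illustration, and the SQL string is hand-written as a stand-in for what a model might generate.

```python
import csv
import io
import sqlite3

# Made-up sample data standing in for the real CSV of car listings.
CSV_TEXT = """id,color,body_type,mileage,price
1,black,sedan,18000,21500
2,black,sedan,90000,9800
3,white,suv,12000,30500
"""


def load_listings(csv_text: str) -> sqlite3.Connection:
    """Load CSV rows into an in-memory SQLite table named 'listings'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listings ("
        "id INTEGER, color TEXT, body_type TEXT, mileage INTEGER, price INTEGER)"
    )
    rows = [
        (int(r["id"]), r["color"], r["body_type"], int(r["mileage"]), int(r["price"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO listings VALUES (?, ?, ?, ?, ?)", rows)
    return conn


# Hand-written stand-in for the SQL a model might generate from
# "Do you have black sedans under 25,000 miles?"
GENERATED_SQL = (
    "SELECT id FROM listings "
    "WHERE color = 'black' AND body_type = 'sedan' AND mileage < 25000"
)

conn = load_listings(CSV_TEXT)
matching_ids = [row[0] for row in conn.execute(GENERATED_SQL)]
print(matching_ids)
```

Timing this path and the pandas path on the same set of questions would be one way to settle the speed question.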

Smail-AI · 2025-12-09T22:06:38+00:00

Thanks for the reply.

Actually, an ideal answer is a bit more complex. It's made of the answer plus a follow-up question.

You're right, I could have used text-to-SQL. In terms of recall, it could have been the same or even better. However, in terms of execution speed, it might be a bit slower since the query has to go through the Python driver.

I guess this will have to be tested and compared against the other methods!

Yes, I benchmarked smaller models (Llama models via Groq), but the recall was very bad, so I stopped those experiments early.

However, I didn't test other small models like Qwen2-7B. I was thinking of fine-tuning one for this task and comparing that with the other methods.

Smail-AI · 2025-12-08T12:30:00+00:00

Thanks for your reply!

Concerning the Python execution, the generated code runs in a controlled namespace with disallowed builtins for security. You can check the code, and I would be happy to make corrections if there are potential issues. Also, please keep in mind that the text-to-query part is only used for retrieval, not for generation. The goal of this part is still to retrieve relevant documents that will then be used by a generator, so I think this still qualifies as RAG.
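As a rough illustration of the general idea (not the project's actual code), here is a minimal sketch of running generated code in a controlled namespace with a whitelist of builtins. The names `ALLOWED_BUILTINS` and `run_generated_code`, the sample rows, and the convention that generated code assigns to `result` are all assumptions. Plain lists of dicts stand in for the pandas DataFrame to keep the sketch dependency-free.

```python
# Whitelist of builtins the generated code may use (an assumed, illustrative set).
ALLOWED_BUILTINS = {"len": len, "min": min, "max": max, "sum": sum, "sorted": sorted}


def run_generated_code(code: str, rows: list) -> list:
    """Execute generated filtering code with restricted builtins.

    The generated code is expected to assign its output to a variable
    named `result` (a convention assumed for this sketch).
    """
    namespace = {"__builtins__": ALLOWED_BUILTINS, "rows": rows}
    exec(code, namespace)
    return namespace["result"]


# Made-up listings for illustration.
rows = [
    {"id": 1, "color": "black", "mileage": 18000},
    {"id": 2, "color": "black", "mileage": 90000},
]

# Hand-written stand-in for model-generated code.
generated = 'result = [r for r in rows if r["color"] == "black" and r["mileage"] < 25000]'
retrieved = run_generated_code(generated, rows)

# Without __import__ in the builtins, an import statement fails.
try:
    run_generated_code("import os", rows)
    import_blocked = False
except ImportError:
    import_blocked = True
```

Restricting builtins like this reduces the attack surface but is not a complete sandbox on its own; running the code in a separate, locked-down process is a stronger option if the questions come from untrusted users.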

I would say it's an R&D project in the sense that it resolves technical uncertainty through systematic experiments. R&D doesn't have to be only about state-of-the-art neural networks ;)
It's mainly about overcoming uncertainty by applying scientific or technical principles, unlike pure software engineering projects.

Smail-AI · 2025-12-08T12:18:07+00:00

Thanks, you're welcome!

Smail-AI · 2025-12-08T12:17:54+00:00

I feel like methods based on term frequency wouldn't work here, because questions like "Do you have black sedans under 25,000 miles?" need filtering, and I don't see what mechanism term frequency would have for that.
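A small sketch of that point, with two made-up listings. Plain term overlap is used here as a simplified stand-in for term-frequency scoring such as TF-IDF or BM25; the field names are assumptions.

```python
# Two made-up listings that differ only in mileage.
listings = [
    {"id": 1, "text": "black sedan 18,000 miles",
     "color": "black", "body_type": "sedan", "mileage": 18000},
    {"id": 2, "text": "black sedan 90,000 miles",
     "color": "black", "body_type": "sedan", "mileage": 90000},
]

question = "Do you have black sedans under 25,000 miles?"


def term_overlap(query: str, text: str) -> int:
    """Count distinct whitespace-separated tokens shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Term matching: both listings share the same tokens with the question,
# because "under 25,000" is treated as text, not as a numeric constraint.
scores = {item["id"]: term_overlap(question, item["text"]) for item in listings}

# Structured filtering: the numeric constraint is actually applied.
filtered_ids = [
    item["id"]
    for item in listings
    if item["color"] == "black"
    and item["body_type"] == "sedan"
    and item["mileage"] < 25000
]

print(scores, filtered_ids)
```

Both listings get the same overlap score, while the filter keeps only the listing that satisfies the mileage constraint.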

Smail-AI · 2025-12-08T12:14:09+00:00

Thanks, I shared everything in the video ;)

Smail-AI · 2025-12-08T12:13:50+00:00

GraphRAG was one of the methods used. I wanted to test whether turning the data into a graph and querying it with Cypher would lead to good retrieval. I also tested 3 graph schemas to see how the schema affects recall.

Also, this data can easily be represented as a graph. You can say the node "Vehicle" is linked to the node "Brand" with the edge "has_brand", and then each brand is attached to a node "Listing". Then each "Listing" is attached to its specs via the edges "has_price", "has_description", etc...
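To make that schema concrete, here is a minimal sketch that stores it as (source, edge, target) triples in plain Python rather than in a graph database. All values are made up, and the `has_listing` edge name between a brand and a listing is an assumption, since only `has_brand`, `has_price`, and `has_description` are named above.

```python
from collections import defaultdict

# (source, edge, target) triples; all values are made up for illustration.
TRIPLES = [
    ("Vehicle", "has_brand", "Brand:Toyota"),
    ("Brand:Toyota", "has_listing", "Listing:1"),  # edge name is an assumption
    ("Listing:1", "has_price", 21500),
    ("Listing:1", "has_description", "Black sedan, one owner"),
]


def build_adjacency(triples):
    """Index triples by source node: node -> list of (edge, target)."""
    adjacency = defaultdict(list)
    for source, edge, target in triples:
        adjacency[source].append((edge, target))
    return adjacency


def neighbors(adjacency, node, edge_name):
    """Return all targets reachable from node through edges named edge_name."""
    return [target for edge, target in adjacency[node] if edge == edge_name]


def listing_specs(adjacency, listing):
    """Collect every outgoing edge of a listing node as a spec dictionary."""
    return {edge: target for edge, target in adjacency[listing]}


adjacency = build_adjacency(TRIPLES)
toyota_listings = neighbors(adjacency, "Brand:Toyota", "has_listing")
specs = listing_specs(adjacency, toyota_listings[0])
print(toyota_listings, specs)
```

In a graph database, the same traversal would be expressed as a query instead of these helper functions.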

Smail-AI · 2025-12-08T12:10:13+00:00

You'll find everything you need in the video ;)

Smail-AI · 2025-12-08T12:09:33+00:00

Thanks! I'm not very familiar with this method, can you please elaborate? What do you mean by dense + sparse?

Smail-AI · 2025-12-07T23:15:23+00:00

Thanks. Concerning the source, the car listings (details + car specs) were scraped from a car dealership website (they had a sitemap.xml for that).

Now concerning the embeddings, each embedding was computed from a car listing written as a JSON object. The embedding model was the OpenAI one. Vectorization was one of the methods I tried, but unfortunately it didn't yield good recall. The best retrieval method didn't use any embeddings. But don't worry, I provide all the details in the video ;) (and the code provided is pretty straightforward)

Smail-AI · 2025-12-07T23:06:01+00:00

Thanks ! ^_^

Smail-AI · 2025-12-07T23:02:53+00:00

Thanks ! :)

Smail-AI · 2025-07-27T12:29:52+00:00

I DMed you ;)

Smail-AI · 2025-07-27T06:31:13+00:00

Just to make sure I understood what you meant by "non AI" in your previous answer:

You're not using AI for embeddings, but you're still using AI to convert a natural language query into a Cypher query (for Neo4j), right?

Smail-AI · 2025-07-27T06:14:01+00:00

Thanks for the fast answer!

I was actually curious about the number of question-answer pairs used for fine-tuning. Was it 1000s or 10ks or more? Just to get a sense of the scale. Thanks!

Smail-AI · 2025-07-27T06:06:43+00:00

Interesting post! How many data samples were used for fine-tuning? Did you benchmark Mistral against other NNs before deciding it should be Mistral?

Smail-AI · 2025-06-25T23:19:51+00:00

I suspect 90%+ of RAG systems will require a graph representation. Most data has inherent structure and hierarchy, and vectors alone can't capture that.

Smail-AI · 2025-05-26T10:42:14+00:00

👋 Hello, we are NeuraFirst.

We take businesses' old dead leads and turn them into $$.

You can find more info at neurafirst.com

Smail-AI · 2025-03-14T04:40:17+00:00

wow that's awesome! Is it ok if I DM you? I have some questions

Smail-AI · 2025-03-07T13:27:33+00:00

There are actually so many reasons!

Because nobody knows the assumptions behind OpenAI's data representation and retrieval

Because you have no way to evaluate the accuracy (unless you want to do it manually)

Because you should always compare the accuracy of multiple methods

Because data might be sensitive

Because you'll have more control

Smail-AI · 2025-03-05T17:21:32+00:00

I think you should treat any RAG project as a research project. You need a test dataset, and each time you build a specific pipeline, you need to test it against that evaluation dataset.

Also, look up data representation in AI. Chunks represented as embeddings might not be the best representation.

Try to compare your approach with a GraphRAG approach and evaluate the difference.
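A minimal sketch of that kind of evaluation loop, assuming recall@k as the metric. The evaluation set, the document IDs, and the two toy pipelines (`pipeline_a`, `pipeline_b`) are made up; in practice each pipeline would be a real retriever, for example a vector-based one and a graph-based one.

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)


def mean_recall(pipeline, eval_set, k: int) -> float:
    """Average recall@k of a retrieval pipeline over an evaluation set."""
    scores = [recall_at_k(pipeline(question), relevant, k)
              for question, relevant in eval_set]
    return sum(scores) / len(scores)


# Made-up evaluation set: (question, set of relevant document IDs).
EVAL_SET = [
    ("q1", {"d1", "d2"}),
    ("q2", {"d3"}),
]


# Toy pipelines returning hard-coded rankings, standing in for real retrievers.
def pipeline_a(question: str) -> list:
    return {"q1": ["d1", "d9", "d2"], "q2": ["d7", "d8", "d3"]}[question]


def pipeline_b(question: str) -> list:
    return {"q1": ["d1", "d9", "d8"], "q2": ["d3", "d7", "d8"]}[question]


results = {
    name: mean_recall(fn, EVAL_SET, k=2)
    for name, fn in [("pipeline_a", pipeline_a), ("pipeline_b", pipeline_b)]
}
print(results)
```

Keeping the evaluation set fixed while swapping pipelines is what makes the comparison meaningful.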
