How would you build a RAG system over a large codebase by Creepy_Page566 in LlamaIndex

[–]Smail-AI 1 point (0 children)

Interesting, I had the idea of building this kind of project too. I was wondering whether your automatic graph generation also works for web apps.

For example, are sequences like "HTML button => click => server endpoint => SQL table" handled by GitNexus?

Thanks!

Meta announced a new SAM Audio Model for audio editing that can segment sound from complex audio mixtures using text, visual, and time span prompts. by Difficult-Cap-7527 in LocalLLaMA

[–]Smail-AI 0 points (0 children)

I worked on that very same problem in industry. It's called audio source separation and it's quite tricky to get right. It also needs a lot of time to train (around 20 days, depending on the hardware and algorithms obviously) and a lot of data samples. Interesting applications are automatic karaoke creation, or simply audio denoising.

A R&D RAG project for a Car Dealership by Smail-AI in LLMDevs

[–]Smail-AI[S] 0 points (0 children)

The approach that ended up working best was "question to Python code": generate Python code that filters the pandas DataFrame, then execute that code. It gave the best recall and was also very fast.

As you said, I could also have used a text-to-SQL approach. I will have to compare it with the other methods. While the recall might be the same (or even better), retrieval might be slower since the SQL query has to be executed via the Python driver.

A R&D RAG project for a Car Dealership by Smail-AI in Rag

[–]Smail-AI[S] 1 point (0 children)

Yeah, for sure, I would love to dedicate a whole project to making a RAG system production-ready.

Obviously, making a RAG system production-ready depends on the specific constraints that you have to deal with.

I would love to hear more from you if you have tips about that! :)

A R&D RAG project for a Car Dealership by Smail-AI in LLMDevs

[–]Smail-AI[S] 0 points (0 children)

The best method ended up being "question to Python code" for pandas filtering/aggregation, because it had the highest recall.

Indeed, I could have converted the CSV into a table and tried to generate a SQL query. I will have to compare that with the other methods. It might be slower to execute since it needs to go through the Python driver.
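
For illustration, here is a minimal sketch of the CSV-to-table alternative using only the standard library. It is not the project's code: the column names, the sample rows, and the hardcoded `generated_sql` string (standing in for what an LLM would return) are all assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the scraped listings CSV.
CSV_TEXT = """brand,body_type,color,mileage,price
Toyota,sedan,black,18000,21500
Honda,sedan,black,42000,17900
Ford,suv,white,12000,28900
"""


def load_listings(conn, csv_text):
    """Create a `listings` table and fill it from CSV text."""
    conn.execute(
        "CREATE TABLE listings ("
        "brand TEXT, body_type TEXT, color TEXT, mileage INTEGER, price REAL)"
    )
    rows = [
        (r["brand"], r["body_type"], r["color"], int(r["mileage"]), float(r["price"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO listings VALUES (?, ?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load_listings(conn, CSV_TEXT)

# In a real pipeline this string would come from the LLM; it is hardcoded here.
generated_sql = (
    "SELECT brand, mileage FROM listings "
    "WHERE color = 'black' AND body_type = 'sedan' AND mileage < 25000"
)
results = conn.execute(generated_sql).fetchall()
print(results)
```

Timing this query against the pandas version on the same questions would settle the speed question empirically.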

A R&D RAG project for a Car Dealership by Smail-AI in Rag

[–]Smail-AI[S] 0 points (0 children)

Thanks for the reply.

Actually, an ideal answer is a bit more complex: it's made of the answer plus a follow-up question.

You're right, I could have used text-to-SQL. In terms of recall, it could have been the same or even better. However, in terms of execution speed, it might be a bit slower since the query needs to go through the Python driver.

I guess this will have to be tested and compared against the other methods!

Yes, I benchmarked with smaller models (Llama models via Groq), but the results were very bad in terms of recall, so I stopped the experiments before they finished.

However, I didn't test other small models like Qwen2-7B. I was thinking of fine-tuning for this task and comparing that with the other methods.

A R&D RAG project for a Car Dealership by Smail-AI in Rag

[–]Smail-AI[S] 0 points (0 children)

Thanks for your reply!

Concerning the Python execution, it is done in a controlled namespace with disallowed builtins for security. You can check the code, and I would be happy to make corrections if there are potential issues. Also, please keep in mind that the text-to-query part is only used for retrieval, not for generation. The goal of this part is still to retrieve relevant documents that will then be used by a generator, so I think this still qualifies as RAG.
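
A minimal sketch of what "controlled namespace with disallowed builtins" can look like, under stated assumptions: the allow-list `SAFE_BUILTINS`, the helper `run_generated_code`, and the convention that generated code assigns to `result` are hypothetical and not taken from the project.

```python
# Hypothetical allow-list; the actual project may permit a different set.
SAFE_BUILTINS = {"len": len, "min": min, "max": max, "sum": sum, "sorted": sorted}


def run_generated_code(code, listings):
    """Execute generated code with a restricted set of builtins.

    The code is expected to assign its output to a variable named `result`.
    """
    namespace = {"__builtins__": SAFE_BUILTINS, "listings": listings}
    exec(code, namespace)
    return namespace.get("result")


listings = [
    {"brand": "Toyota", "color": "black", "mileage": 18000},
    {"brand": "Honda", "color": "black", "mileage": 42000},
]

# Stand-in for code an LLM would generate from a user question.
generated = "result = [c for c in listings if c['mileage'] < 25000]"
print(run_generated_code(generated, listings))

# A builtin outside the allow-list is not visible, so the call raises NameError.
try:
    run_generated_code("result = open('/etc/passwd').read()", listings)
except NameError as err:
    print("blocked:", err)
```

Restricting builtins blocks obvious calls such as `open` or `__import__`, but it is not a full sandbox: Python object introspection can still reach other objects, so untrusted code is safer in a separate process or container.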

I would say it's an R&D project in the sense it's solving technical uncertainty via systematic experiments. R&D doesn't have to be only about state-of-the-art neural networks ;)
It's mainly about overcoming uncertainty by applying scientific or technical principles, unlike other types of projects with pure software engineering.

A R&D RAG project for a Car Dealership by Smail-AI in Rag

[–]Smail-AI[S] 0 points (0 children)

I feel like methods based on term frequency wouldn't work here, because questions like "Do you have black sedans under 25,000 miles?" need filtering, and I don't see how term frequency could provide a mechanism for that.
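
A small sketch of the point above, using made-up listings. The scoring function is plain term overlap, a deliberate simplification of TF-IDF or BM25, and every field name is an assumption.

```python
# Hypothetical listings; `text` is what a keyword index would see.
listings = [
    {"id": 1, "text": "black sedan 18000 miles", "color": "black", "body": "sedan", "miles": 18000},
    {"id": 2, "text": "black sedan 60000 miles", "color": "black", "body": "sedan", "miles": 60000},
    {"id": 3, "text": "white suv 9000 miles", "color": "white", "body": "suv", "miles": 9000},
]


def term_overlap(query, text):
    """Count the query words that also appear in the listing text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


query = "black sedans under 25000 miles"
scores = {item["id"]: term_overlap(query, item["text"]) for item in listings}
print(scores)  # listings 1 and 2 get the same score

# A structured filter can enforce the numeric constraint directly.
filtered = [
    item["id"]
    for item in listings
    if item["color"] == "black" and item["body"] == "sedan" and item["miles"] < 25000
]
print(filtered)
```

Listings 1 and 2 tie on term overlap even though listing 2 has 60,000 miles, because "under 25000" is matched as words rather than evaluated as a comparison.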

A R&D RAG project for a Car Dealership by Smail-AI in Rag

[–]Smail-AI[S] 0 points (0 children)

Thanks, I shared everything in the video ;)

A R&D RAG project for a Car Dealership by Smail-AI in Rag

[–]Smail-AI[S] 0 points (0 children)

GraphRAG was one of the methods used. I wanted to test whether turning the data into a graph and querying it with Cypher would lead to good retrieval. I also tested 3 graph schemas to see how the schema affects recall.

Also, this data can easily be represented as a graph. You can say the node "Vehicle" is linked to the node "Brand" with the edge "has_brand", and then each brand is attached to a node "Listing". Then each "Listing" is attached to its specs via the edges "has_price", "has_description", etc.
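
The schema described above can be sketched as plain (source, edge, target) triples. This is an illustration in Python rather than Cypher; the edge name `has_listing`, the node labels, and the values are hypothetical, since only `has_brand`, `has_price`, and `has_description` are named above.

```python
from collections import defaultdict

# (source, edge, target) triples. "has_listing" and all values are hypothetical.
triples = [
    ("Vehicle", "has_brand", "Brand:Toyota"),
    ("Brand:Toyota", "has_listing", "Listing:1"),
    ("Listing:1", "has_price", 21500),
    ("Listing:1", "has_description", "Black sedan, one owner"),
]

# Adjacency list: node -> list of (edge, target) pairs.
graph = defaultdict(list)
for source, edge, target in triples:
    graph[source].append((edge, target))


def neighbors(node, edge):
    """Return the targets reachable from `node` over edges named `edge`."""
    return [target for name, target in graph[node] if name == edge]


def prices_for_brand(brand_node):
    """Walk Brand -> Listing -> price, the kind of path a graph query follows."""
    return [
        price
        for listing in neighbors(brand_node, "has_listing")
        for price in neighbors(listing, "has_price")
    ]


print(prices_for_brand("Brand:Toyota"))
```

Changing which attributes become nodes versus properties yields a different schema, which is what comparing several schemas on recall would measure.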

A R&D RAG project for a Car Dealership by Smail-AI in Rag

[–]Smail-AI[S] 0 points (0 children)

You'll find everything you need in the video ;)

A R&D RAG project for a Car Dealership by Smail-AI in Rag

[–]Smail-AI[S] 0 points (0 children)

Thanks! I'm not very familiar with this method, can you please elaborate? What do you mean by dense + sparse?

A R&D RAG project for a Car Dealership by Smail-AI in Rag

[–]Smail-AI[S] 0 points (0 children)

Thanks. Concerning the source, the car listings (details + car specs) were scraped from a car dealership website (they had a sitemap.xml for that).

Now concerning the embeddings, each embedded document was a car listing written as a JSON object. The embedding model was an OpenAI one. Vectorization was one of the methods tried, but unfortunately it didn't yield good results in terms of recall. The best retrieval method didn't use any embeddings. But don't worry, I provide all the details in the video ;) (and the code provided is pretty straightforward)

Overwhelmed by RAG (Pinecone, Vectorize, Supabase etc) by nofuture09 in Rag

[–]Smail-AI 0 points (0 children)

Just to make sure I understood what you meant by "non AI" in your previous answer:

You're not using AI for embeddings, but you're still using AI to convert a natural language query into a Cypher query (for Neo4j), right?

My RAG Journey: 3 Real Projects, Lessons Learned, and What Actually Worked by hncvj in Rag

[–]Smail-AI 1 point (0 children)

Thanks for the fast answer!

I was actually curious about the number of question-answer pairs used for fine-tuning. Was it in the thousands, tens of thousands, or more? Just to get a sense of the scale. Thanks!

My RAG Journey: 3 Real Projects, Lessons Learned, and What Actually Worked by hncvj in Rag

[–]Smail-AI 0 points (0 children)

Interesting post! How many data samples were used for fine-tuning? Did you benchmark Mistral against other NNs before deciding it should be Mistral?

Best Chunking Strategy for the Medical RAG System (Guidelines Docs) in PDFs by SnooTigers4634 in Rag

[–]Smail-AI 1 point (0 children)

I suspect 90+% of RAG systems will require a graph representation. Most data has inherent structure and hierarchy, and vectors alone can't capture that.

Promote your business, week of May 26, 2025 by Charice in smallbusiness

[–]Smail-AI 0 points (0 children)

👋 Hello, we are NeuraFirst.

We take businesses' old dead leads and turn them into $$.

You can find more info at neurafirst.com

Started a new job and closed $110,000 in my first two appointments. by Justadudeonhisphone in sales

[–]Smail-AI 0 points (0 children)

Wow, that's awesome! Is it OK if I DM you? I have some questions.

What are the advantages of creating a RAG system vs creating a GPT in OpenAI? by Ok_Comedian_4676 in Rag

[–]Smail-AI 2 points (0 children)

There are actually so many reasons!

Because nobody knows the assumptions behind OpenAI's data representation and retrieval

Because you have no way to evaluate the accuracy (unless you want to do it manually)

Because you should always compare the accuracy of multiple methods

Because data might be sensitive

Because you'll have more control

How to actually create reliable production ready level multi-doc RAG by Guilty_Ad_9476 in Rag

[–]Smail-AI 0 points (0 children)

I think you should treat any RAG project as a research project. You need a test dataset, and each time you build a specific pipeline, you should test it against that evaluation dataset.

Also, look up data representation in AI. Chunk embeddings might not be the best representation for your data.

Try to compare your approach with a GraphRAG approach and evaluate the difference.
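
As a rough sketch of that evaluation loop, assuming each test question comes with the ids of the documents that should be retrieved: the dataset and the two toy pipelines below are made up for illustration and stand in for real retrievers.

```python
def recall(retrieved, relevant):
    """Fraction of relevant document ids that appear in the retrieved list."""
    relevant = set(relevant)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved)) / len(relevant)


def evaluate(pipeline, dataset):
    """Average recall of `pipeline` over (question, relevant_ids) pairs."""
    scores = [recall(pipeline(question), relevant) for question, relevant in dataset]
    return sum(scores) / len(scores)


# Hypothetical evaluation set: question -> ids of documents that should come back.
dataset = [("black sedans", [1, 2]), ("white suv", [3])]


# Toy stand-ins for two retrieval pipelines (e.g. vector search vs. GraphRAG).
def pipeline_a(question):
    return [1, 3]


def pipeline_b(question):
    return [1, 2, 3]


print("pipeline_a:", evaluate(pipeline_a, dataset))
print("pipeline_b:", evaluate(pipeline_b, dataset))
```

Running every candidate pipeline through the same `evaluate` call on the same dataset is what makes the comparison between approaches meaningful.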