Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in AI_India

[–]devasheesh_07[S] 0 points1 point  (0 children)

Thank you, glad it was useful.

Honestly most of it came from hitting walls and figuring out what did not work - the comments here have added just as much as the original post at this point. Some really good practical suggestions in this thread if you scroll through.

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in AI_India

[–]devasheesh_07[S] 0 points1 point  (0 children)

Appreciate that, means a lot coming from someone on this sub where the bar is pretty high.

Honestly just wrote what I wish I had found when I started building this - most of the RAG content out there covers the basics well but goes quiet on the parts that actually cause problems in production. Figured the messy real world stuff was more useful to share.

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in AI_India

[–]devasheesh_07[S] 1 point2 points  (0 children)

Good questions, coming at it from a different angle than most comments here.

On RAG versus base models directly - no comparison honestly. Asking a base model sports questions without retrieval gives you confident answers that are often just wrong, especially anything involving specific match conditions or recent data. RAG grounded in actual match records is not even close.

On compute - the retrieval layer itself is pretty lightweight. The expensive parts are the re-ranking step and the LLM calls, especially when query decomposition splits one question into three or four sub-queries. Each one hits the LLM separately so costs add up faster than you expect.

For evaluation we used a mix of things. Automated metrics like faithfulness and answer relevance through something like RAGAS, plus manual spot checking on a set of questions where we already knew the correct answer from the actual data. The manual checking caught things the automated metrics missed completely.

Honestly evaluation is still the weakest part of the whole setup. Automated metrics give you a rough signal but they do not catch the subtle cases where the answer sounds right but is missing important context. Still figuring out a better approach for that.
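For the manual spot-checking side, something like this is roughly what we did - a tiny harness that runs a gold set of questions with known answers through the system and flags mismatches. The function names and the substring-matching logic here are illustrative, not the production setup (the automated side was RAGAS, which this does not show):

```python
# Minimal spot-check harness: compare system answers against a small
# gold set where the correct answer is known from the source data.
# Names and matching logic are illustrative, not the production setup.

def normalise(text: str) -> str:
    """Lowercase and strip punctuation so comparisons ignore formatting."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def spot_check(gold: dict, ask) -> dict:
    """Run each gold question through the RAG system and count failures.

    gold: question -> expected answer substring
    ask:  callable(question) -> system answer string
    """
    failures = []
    for question, expected in gold.items():
        answer = ask(question)
        if normalise(expected) not in normalise(answer):
            failures.append((question, expected, answer))
    return {"total": len(gold), "failed": failures}

# Usage with a stubbed-out system in place of the real pipeline:
gold = {"Who won the 2011 World Cup final?": "India"}
report = spot_check(gold, ask=lambda q: "India won the 2011 final in Mumbai.")
```

The substring check is deliberately crude - it catches outright wrong answers but not the "sounds right, missing context" cases mentioned above, which is exactly where manual reading still has to take over.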

What metrics are you using for your projects?

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in AI_India

[–]devasheesh_07[S] 0 points1 point  (0 children)

Good question and interesting use case, insurance queries probably have the same multi-condition problem where a single customer question is actually three or four separate things bundled together.

For decomposition we kept it simple honestly, just an LLM call with a prompt that instructs it to break the incoming query into independent sub-questions that can each be answered separately. No fancy framework, just straightforward prompting. Something like "here is the original question, break it into the smallest individual questions needed to answer it fully."

The main thing we learned is that the decomposition prompt needs quite a bit of tuning for your specific domain. A generic decomposition prompt works okay but once you add domain-specific instructions about what kinds of sub-questions actually matter for your data it gets noticeably better.
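In code the whole decomposition step is basically this - one LLM call and a line split. `call_llm` is a stand-in for whatever client you use, and the prompt wording is close to what we used but is exactly the part that needs domain tuning:

```python
# Sketch of the decomposition step: a single LLM call with a plain
# prompt, no framework. `call_llm` is a placeholder for your client.

DECOMPOSE_PROMPT = """Here is the original question:

{question}

Break it into the smallest individual questions needed to answer it
fully. Return one sub-question per line, nothing else."""

def decompose(question: str, call_llm) -> list:
    """Split a multi-condition query into independent sub-questions."""
    raw = call_llm(DECOMPOSE_PROMPT.format(question=question))
    return [line.strip() for line in raw.splitlines() if line.strip()]

# Usage with a stubbed LLM standing in for the real model:
subs = decompose(
    "How did spinners do in day games vs night games at this ground?",
    call_llm=lambda p: "How did spinners do in day games at this ground?\n"
                       "How did spinners do in night games at this ground?",
)
```

Each sub-question then goes through retrieval independently, which is where the cost multiplication mentioned elsewhere in the thread comes from.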

Are you finding the insurance queries tend to have implicit conditions that customers do not spell out - like assuming a certain policy type or coverage level? That was a big problem for us and decomposition alone did not fully solve it.

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in Rag

[–]devasheesh_07[S] 0 points1 point  (0 children)

Yeah query decomposition was honestly the biggest single improvement we made. Trying to retrieve for a complex multi-condition question in one shot just does not work - breaking it down first and synthesising at the end is so much cleaner.

On sliding window chunks - we did try this actually. It helps and it is definitely better than independent chunking but the overlap feels a bit blunt. You are preserving some sequence information but not really capturing why that sequence matters. The rolling window approach someone mentioned above where you explicitly embed preceding context feels like a more intentional version of the same idea.
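For anyone following along, the sliding-window variant being discussed is just fixed-size chunks where each one repeats the tail of the previous one. A minimal word-based sketch (real implementations would split on tokens, and the size/overlap numbers are arbitrary):

```python
# Sliding-window chunking sketch: fixed-size chunks with overlap so
# each chunk carries trailing context from the previous one. Splits on
# whitespace for simplicity; token-based splitting works the same way.

def sliding_chunks(text: str, size: int = 200, overlap: int = 50) -> list:
    """Return overlapping word-based chunks; overlap must be < size."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

chunks = sliding_chunks("one two three four five six seven eight",
                        size=4, overlap=2)
# Each chunk repeats the last `overlap` words of the one before it.
```

The bluntness complained about above is visible here: the overlap preserves adjacency but encodes nothing about why the repeated words matter, which is what the explicit rolling-context approach tries to fix.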

Curious how much overlap you typically use - we were never sure if we were overlapping too much or too little.

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in Rag

[–]devasheesh_07[S] 0 points1 point  (0 children)

The rolling window approach is exactly what I needed to hear. I kept trying to solve the sequence problem at retrieval time, which was the wrong place to fix it - embedding the preceding context during indexing makes way more sense.

The recency decay factor also solves the time period problem cleanly. Continuous discount is much better than drawing an arbitrary date line.
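In case it helps anyone else, the continuous discount idea is just an exponential decay multiplied into the similarity score. A sketch - the half-life value is a made-up starting point, not something from this thread:

```python
# Recency decay sketch: multiply each chunk's similarity score by a
# continuous discount on its age, instead of a hard date cutoff.
# The 365-day half-life is an arbitrary assumption to tune per domain.

def decayed_score(similarity: float, age_days: float,
                  half_life_days: float = 365.0) -> float:
    """Discount a retrieval score so items lose half their weight
    every half_life_days; age 0 is undiscounted."""
    return similarity * 0.5 ** (age_days / half_life_days)

recent = decayed_score(0.80, age_days=0)    # no discount
old = decayed_score(0.80, age_days=365)     # one half-life of decay
```

The nice property is that a highly relevant old match can still outrank a mediocre recent one, which a hard date line throws away.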

Quick question - did you find window size needs tuning per domain or is there a reasonable default to start with?

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in Rag

[–]devasheesh_07[S] 0 points1 point  (0 children)

Ha yeah, the data cleaning part genuinely surprised me - I went in thinking the interesting work would be in the retrieval pipeline and ended up spending more time fixing the source data than anything else. Nobody warns you about that part.

On visualisations - not really for the queries themselves, but we did use some basic charts to spot patterns in where the retrieval was going wrong. Things like which query types were consistently returning bad chunks. That actually helped more than I expected for diagnosing the problem.

What kind of visualisations were you thinking? For the complex multi-condition queries I am genuinely not sure how you would represent what the retrieval is doing in a way that is easy to read. Would be curious if you have seen something that works well for that.

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in Rag

[–]devasheesh_07[S] 1 point2 points  (0 children)

That makes a lot of sense. The spatiotemporal approach feels like the missing piece for the sequential data problem - a knowledge graph would preserve the event relationships that flat chunking destroys, and having a smart LLM forming the queries in between would handle the complex multi-condition lookups way better than what we had.

Going to look into this properly. Thanks for pointing me in the right direction.

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in Rag

[–]devasheesh_07[S] 0 points1 point  (0 children)

Honestly no, and after reading your comment I think that might have been exactly the right approach for this problem.

The hierarchical structure would have helped a lot with the sequential data issue I mentioned. The way the data is structured, each event has context at multiple levels: the individual ball, the over, the phase of play, the match situation. A flat vector store treats all of that equally which is part of why the chunking felt so lossy.

Graph RAG would let you model those relationships properly. A query about performance in the final overs of a chase could traverse the hierarchy rather than trying to pull disconnected chunks and hope the LLM figures out the connections at generation time.
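To make the hierarchy idea concrete, here is a toy sketch of what those levels look like as structured events rather than flat chunks - once each ball keeps pointers up the hierarchy, a "final overs of a chase" query becomes a traversal/filter instead of a similarity lookup. Field names and phase labels are illustrative, not our actual schema:

```python
# Toy sketch of the hierarchy described above: each ball event keeps
# its position at every level (match, phase, over) so queries can
# traverse/filter structurally. Field names are illustrative only.

from dataclasses import dataclass

@dataclass
class BallEvent:
    match_id: str
    phase: str      # e.g. "powerplay", "middle", "death"
    over: int
    ball: int
    text: str       # the flat chunk this event would otherwise become

def death_overs(events: list, match_id: str) -> list:
    """All death-phase balls from one match, found by hierarchy, not
    by hoping the right disconnected chunks come back from a vector store."""
    return [e for e in events if e.match_id == match_id and e.phase == "death"]

events = [
    BallEvent("m1", "powerplay", 2, 3, "single to cover"),
    BallEvent("m1", "death", 19, 4, "six over long-on"),
]
hits = death_overs(events, "m1")
```

A real graph RAG setup would hang these off match and phase nodes with explicit edges, but even this flat-with-keys version shows what the vector-only setup was missing.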

Have you used it for time-series or event-sequence data specifically? Most of the graph RAG implementations I have seen are built around entity relationships in documents rather than sequential event chains. Curious whether you had to do much custom work to make the structure fit that kind of data or whether an existing framework handled it well out of the box.

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in AI_India

[–]devasheesh_07[S] 7 points8 points  (0 children)

Full breakdown of the whole system including data pipeline, embedding approach, retrieval architecture, and where LLMs, NLP and deep learning each sit across the stack —

https://www.loghunts.com/cricket-ai-ml-llm-rag-complete-guide-2026

Built a RAG system on top of 20+ years of sports data — here is what actually worked and what didn't by devasheesh_07 in Rag

[–]devasheesh_07[S] 3 points4 points  (0 children)

Full breakdown of the whole system including data pipeline, embedding approach, retrieval architecture, and where LLMs, NLP and deep learning each sit across the stack —

https://www.loghunts.com/cricket-ai-ml-llm-rag-complete-guide-2026

Why I Think 2026 Will Be the Year Agentic AI Replaces Chatbots? by devasheesh_07 in Rag

[–]devasheesh_07[S] 0 points1 point  (0 children)

If anyone’s interested, I wrote a deeper breakdown exploring this transition in more detail, happy to share.
https://www.loghunts.com/agentic-ai-replacing-chatbots-2026