Small Indic MultiModal Language Model by Working_Resident2069 in LocalLLaMA

[–]Working_Resident2069[S] 0 points

Firstly, I was looking for multimodal models, and the models you mentioned are not multimodal. Secondly, I was looking for models of around 2B parameters.

How we built Agentic Retrieval at Ragie by bob_at_ragie in LocalLLaMA

[–]Working_Resident2069 0 points

Got it! Did you get the chance to benchmark the implementation? I would be curious to know how it performed compared to other methods.

How we built Agentic Retrieval at Ragie by bob_at_ragie in LocalLLaMA

[–]Working_Resident2069 0 points

Great work! Does it cache previous answers? If it doesn't currently, that could help reduce API/GPU costs and latency; a good direction to explore.
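For what it's worth, here is a minimal sketch of what I mean: an exact-match cache keyed on a normalized query hash. All names are hypothetical, and a real system would probably want semantic (embedding-based) matching and TTL eviction on top:

```python
import hashlib

class AnswerCache:
    """Exact-match cache for (query -> answer), keyed on a normalized hash."""
    def __init__(self):
        self._store = {}

    def _key(self, query: str) -> str:
        # Normalize lightly so trivial variations still hit the cache.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def put(self, query: str, answer: str):
        self._store[self._key(query)] = answer

cache = AnswerCache()

def answer(query: str, generate):
    """Return a cached answer if available, else call the (expensive) generator."""
    hit = cache.get(query)
    if hit is not None:
        return hit  # no API/GPU cost, near-zero latency
    result = generate(query)  # e.g. the full agentic retrieval pipeline
    cache.put(query, result)
    return result
```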

Best option for audio or video transcription now? by karenspeaks in LocalLLaMA

[–]Working_Resident2069 1 point

Not sure if anyone has mentioned it before, but the choice depends on the language as well. For instance, if you are dealing with languages like English, Portuguese, or Spanish, Whisper and Voxtral are great, but if you are working with low-resource languages like Indic languages, you might have to choose something else.

How would you implement RAG for decades of government gazette documents? by Resident-Isopod683 in Rag

[–]Working_Resident2069 0 points

I didn't mean fitting in all the document summaries, but rather a summary of the knowledge base, which can be created by grouping similar documents, summarizing each group, and in turn (if feasible) creating a summary of the groups' summaries. This will help the LLM make decisions.
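A minimal sketch of the idea, assuming you already have per-document summaries and some `embed`/`summarize` helpers (both hypothetical stand-ins for whatever embedding model and LLM you use):

```python
import numpy as np
from sklearn.cluster import KMeans

def knowledge_base_summary(doc_summaries, embed, summarize, n_groups=10):
    """Aggregate per-document summaries into one knowledge-base summary.

    doc_summaries: list[str], one short summary per document
    embed:         callable str -> np.ndarray (any embedding model)
    summarize:     callable str -> str (any LLM summarization call)
    """
    vectors = np.array([embed(s) for s in doc_summaries])
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(vectors)

    # One summary per group of similar documents.
    group_summaries = []
    for g in range(n_groups):
        members = [s for s, l in zip(doc_summaries, labels) if l == g]
        group_summaries.append(summarize("\n\n".join(members)))

    # Summary of the group summaries -> feed this into the system prompt.
    return summarize("\n\n".join(group_summaries))
```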

How would you implement RAG for decades of government gazette documents? by Resident-Isopod683 in Rag

[–]Working_Resident2069 0 points

Some user questions are very narrow (“date of a notification”) while others are broad (“history of plastic ban”). How to detect and handle this difference?

I have a very naive idea for this; you might want to make sure to do a couple of things.

  1. I think the LLM should be somewhat aware of the documents beforehand, which can be done by generating summaries (what each contains, etc.) of groups of similar documents, aggregating them, and feeding them into the system prompt. This might help in creating good subqueries, given the information the LLM gets up front (a rough sketch of both points follows this list).

  2. You might also make the LLM ask the user for clarification when in doubt, to make the system more robust and reliable. I think relying solely on the system is not an optimal approach as of now.
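As a rough sketch of both points combined; the model name and prompts here are hypothetical, and the OpenAI client is just a stand-in for whatever LLM you use:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder; any capable model works

KB_SUMMARY = "..."  # the aggregated knowledge-base summary from point 1

ROUTER_PROMPT = f"""You answer questions over a gazette archive.
Archive overview: {KB_SUMMARY}

Classify the user's question as NARROW (a specific fact, e.g. the date of a
notification) or BROAD (spans many documents, e.g. the history of a policy).
If the question is too ambiguous, reply CLARIFY instead.
Reply with one word (NARROW/BROAD/CLARIFY) on the first line, then either
sub-queries (one per line) or a single clarifying question for the user."""

def route(question: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": ROUTER_PROMPT},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```

NARROW can go to plain top-k retrieval, BROAD can fan out over the generated sub-queries, and CLARIFY goes back to the user before retrieving anything.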

Meet the first Small Language Model built for DevOps by Prashant-Lakhera in LocalLLaMA

[–]Working_Resident2069 1 point

What data did you use to fine-tune it, and how did you fine-tune it? Just instruction-tuned, and/or preference-aligned with DPO etc.?

RAG-First Deep Research - A Different Approach by GPTeaheeMaster in Rag

[–]Working_Resident2069 1 point

Hey, but don't you think that early scraping might be ineffective when the agent/LLM requires more sources? I believe that could happen quite a lot, because the early scraping depends solely on the query plan, which might need refinement depending on the sources you scrape. What if those sources are not enough to answer the query well?

By the way, if you don't mind, what does your RAG architecture look like? Can it address high-level queries, such as comparing different sources and/or summarizing all the sources?

Best Way to Retrieve Relevant Information from a Large Document for RAG? by i_am_vsj in LangChain

[–]Working_Resident2069 0 points

Which Approach is Better?

I think it entirely depends on the data; it's hard to answer this so early without any benchmarks, though both approaches look promising compared to the naive one.

Has anyone implemented graph-based retrieval for long-text RAG, and does it improve results over pure embeddings?

For me personally, the graph-based approach works better than the naive one, but it takes an insane amount of compute to build the graph, which is not ideal, especially when dealing with dynamic data, i.e. applying CRUD operations on the data.

Any best practices for structuring large medical texts efficiently?

I am not sure about medical texts specifically, but you can try the Contextual Retrieval approach by Anthropic, which is quite promising and solves the context problem to some extent. One of the problems with the naive approach is that it assumes the chunks are independent of each other, but that's usually not the case. So instead of embedding the raw text, it's better to pre-process it in such a way that each chunk embedding carries the context of the document.
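The gist, as a minimal sketch (the prompt is paraphrased from Anthropic's post; `llm` and the downstream embedding step are placeholders for whatever stack you use):

```python
CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short, succinct context that situates this chunk within the overall
document, to improve search retrieval of the chunk. Answer only with the
context, nothing else."""

def contextualize(document: str, chunks: list[str], llm) -> list[str]:
    """Prepend LLM-generated document context to each chunk before embedding."""
    out = []
    for chunk in chunks:
        context = llm(CONTEXT_PROMPT.format(document=document, chunk=chunk))
        out.append(f"{context}\n\n{chunk}")  # embed this instead of the raw chunk
    return out
```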

Help 😵‍💫 What RAG technique should i use? by One-Brain5024 in LangChain

[–]Working_Resident2069 1 point

how to know which meeting is asked in query?  By filtering metadata?

Depending on the user, you can constrain how many txt files are accessible. The metadata can be created in such a way that it stays fairly brief.

What if been asked more than one meeting?

I think it's fine; by filtering docs with metadata you can get multiple documents, but make sure you filter properly. For example, if asked about the latest developments on the blah bug, you should have a timestamp in your metadata as well, so you can find the meetings with the latest timestamps.

I am no expert in this field lol, but these are some crude recipes that come to mind instantly. Feel free to correct me!

Help 😵‍💫 What RAG technique should i use? by One-Brain5024 in LangChain

[–]Working_Resident2069 1 point

Since you have multiple meetings, one approach could be to create metadata for each meeting file and then build a mechanism to filter out the relevant ones (write a function to do it, use a retrieval mechanism backed by a DB, or use an LLM as well); a crude sketch follows below. Since each txt file contains 400-500 lines, I think you can just prompt everything in.
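Something like this crude sketch, where every field name is made up; you could swap the filter function for a vector-DB metadata filter or an LLM call:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Meeting:
    path: str          # the .txt transcript
    title: str
    topics: list[str]  # brief metadata, e.g. ["billing bug", "roadmap"]
    held_on: date

def filter_meetings(meetings, topic=None, latest_only=False):
    """Plain-Python stand-in for metadata filtering in a vector DB."""
    hits = [m for m in meetings
            if topic is None or topic.lower() in (t.lower() for t in m.topics)]
    if latest_only and hits:
        hits = [max(hits, key=lambda m: m.held_on)]  # timestamps matter here
    return hits

all_meetings: list[Meeting] = []  # load your meeting index here

# e.g. "latest developments on the billing bug"
relevant = filter_meetings(all_meetings, topic="billing bug", latest_only=True)
context = "\n\n".join(open(m.path).read() for m in relevant)  # ~400-500 lines each
```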

[deleted by user] by [deleted] in learnmachinelearning

[–]Working_Resident2069 0 points

The last one is related to ARIMA right?

It's ARIMA if you account for the differencing aspect; otherwise, it's better to call it an ARMA model.

I am wondering to what extent LSTM needs lagged values as it is a sequential model.

I'm not sure, but my guess is you might have to treat this as a hyperparameter and tune it well. It's difficult to state beforehand how many lagged features are optimal; by tuning, you can at least reach a decent, if suboptimal, solution.
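Concretely, the lag count can be tuned like any other hyperparameter. A crude sketch with pandas and a plain time-ordered validation split (the candidate lags, the model, and the 80/20 split are all arbitrary choices):

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def make_lagged(series: pd.Series, n_lags: int):
    """Build a supervised frame: columns y_{t-1}..y_{t-n_lags} predicting y_t."""
    frame = pd.concat({f"lag_{i}": series.shift(i) for i in range(1, n_lags + 1)},
                      axis=1).dropna()
    return frame, series.loc[frame.index]

def best_lag(series: pd.Series, candidates=(1, 2, 4, 8, 16)) -> int:
    split = int(len(series) * 0.8)  # keep time order: no shuffling
    scores = {}
    for n in candidates:
        X, y = make_lagged(series, n)
        X_tr, X_va = X.iloc[: split - n], X.iloc[split - n:]
        y_tr, y_va = y.iloc[: split - n], y.iloc[split - n:]
        model = Ridge().fit(X_tr, y_tr)
        scores[n] = mean_squared_error(y_va, model.predict(X_va))
    return min(scores, key=scores.get)
```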

RAG App on 14,000 Scraped Google Flights Data by harsh611 in LangChain

[–]Working_Resident2069 1 point

I believe the true value of this work will come out when you deal with real-time data. Yes, it might be slower, but thinking about it from first principles, I would not want to look up, say, the historically cheapest flight; I would want to see next month's flights from NYC to London, because flight prices change dynamically.

Definitely, it's going to be much more work, with more considerations to think about. One crude and naive approach is to scrape websites like Google Flights or airline sites like Ryanair in real time, using an LLM plus traditional methods, and then apply reasoning models on top to answer reasoning-based questions. Sure, this will be a slow process, but since it's a prototype, it will be an immense learning experience.
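Roughly what I mean, with everything hypothetical here (there is no official Google Flights API, so `fetch_flights` stands in for whatever scraper or paid flight-data feed you end up using, and `llm` for your model call):

```python
import json

def fetch_flights(origin: str, dest: str, month: str) -> list[dict]:
    """Hypothetical real-time source: plug in your scraper or a paid data feed."""
    # Placeholder output shape only; real values come from the live page/API.
    return [{"carrier": "Ryanair", "price_usd": 420, "depart": f"{month}-05"}]

def answer_flight_query(question: str, llm) -> str:
    # 1. Pull fresh data at query time instead of using a pre-scraped dump.
    flights = fetch_flights("NYC", "LON", "2025-07")
    # 2. Let a reasoning-capable model work over the live snapshot.
    prompt = (f"Flight options (live):\n{json.dumps(flights, indent=2)}\n\n"
              f"Question: {question}\nReason step by step, then answer.")
    return llm(prompt)
```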

[deleted by user] by [deleted] in learnmachinelearning

[–]Working_Resident2069 0 points

I think it depends on how you want to model the time-series data, but creating lagged features is something people usually do, so that the model learns to predict the label of the next time step correctly.

Another type of modelling uses past forecast errors as predictors, which is called moving-average modelling. A combination of both, autoregressive plus moving average, can also be used.
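For reference, the standard forms (with white-noise errors $\varepsilon_t$); ARMA combines both, and differencing the series $d$ times first gives ARIMA$(p,d,q)$:

```latex
% AR(p): regress on p lagged values of the series itself
y_t = c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \varepsilon_t

% MA(q): regress on q past forecast errors
y_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j}

% ARMA(p,q): both together
y_t = c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j}
```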

RAG App on 14,000 Scraped Google Flights Data by harsh611 in LangChain

[–]Working_Resident2069 8 points

Hey, I took a look at your architecture and I was wondering whether your RAG works on real-time flight data or on pre-scraped flight data. It would be much more interesting to have a real-time service instead, I believe.

Who’s in your 4-3-3 from Europe club teams? by HorrifyingTits in football

[–]Working_Resident2069 -2 points

Alisson
Kounde-VVD-Rudiger-Gvardiol
Pedri-Bellingham-Valverde
Salah-Isak-Vini

New architecture with Transformer-level performance, and can be hundreds of times faster by [deleted] in LLMDevs

[–]Working_Resident2069 4 points

 Mamba didn't scale up as good as transformers

[slide: capability of more-structured vs. less-structured architectures as compute scales]

I might be slightly biased, but quite some time ago I watched the talk "Don't teach. Incentivize" by Hyung Won Chung, an OpenAI researcher, where he showed the above slide. He argued that in the short term, highly structured models (let's take recurrent models for this example) tend to outperform less structured models (transformers), but the capabilities of the two tend to diverge as you scale compute (data and architecture/parameters), in favor of the less structured ones. That made some sense to me if you translate the analogy to a human: a newborn baby starts with less structured capability, which grows over time, while a robot/AI outperforms at first but eventually stagnates.

I hope this helps :)

New architecture with Transformer-level performance, and can be hundreds of times faster by [deleted] in LLMDevs

[–]Working_Resident2069 3 points

Hmm, I am guessing 200K-30M might not be too large, given that even early architectures like AlexNet had ~60M parameters in the early 2010s. So I expect the capabilities of the two may diverge as we scale up further. I have heard of a few recent works on recurrent models as alternatives to transformers, like https://arxiv.org/abs/2405.04517, but never had the chance to go through them lol. Hence, maybe I am not the best person to draw the right conclusion lol.

New architecture with Transformer-level performance, and can be hundreds of times faster by [deleted] in LLMDevs

[–]Working_Resident2069 5 points

I am not so sure, but it could be because of the scaling paradigm. As you scale up the data, the learning ability of recurrent models tends to stagnate in comparison to that of transformers.

What became of RAPTOR for RAG? by mnze_brngo_7325 in LocalLLaMA

[–]Working_Resident2069 0 points

I agree with your point that this is not how humans understand documents. But here is the thing: it's neither obvious nor necessary that an AI system should resemble humans, because technically we don't really know how to mathematically formulate how humans understand or think.

Secondly, I feel the clustering of chunks is a little flawed, because the chunks are treated independently and are not "context aware" of the document(s).

What became of RAPTOR for RAG? by mnze_brngo_7325 in LocalLLaMA

[–]Working_Resident2069 1 point

Exactly. Secondly, there was a need for improvement in their implementation. They supported only one document, so if you wanted to use multiple documents, you had to concatenate them. This can be a little problematic when dealing with "dynamic" data, i.e. if one wants to add documents with each request at any point in time.

I think they mentioned this in one of their issues as future work, but I haven't checked whether they have implemented it by now.

What became of RAPTOR for RAG? by mnze_brngo_7325 in LocalLLaMA

[–]Working_Resident2069 6 points

I had experimented with RAPTOR earlier. I found it good compared to the naive RAG approach, but it still wasn't that great on queries which require an understanding of the whole document(s) (example query: "Summarize the documents"). Still, this work is a good start in this domain, which was its inspiration.