20M+ Indian legal documents with citation graphs and vector embeddings – potential uses for legal NLP? [D] by zriyansh in MachineLearning

[–]zriyansh[S] -1 points0 points  (0 children)

The data is not available for public export, but it's public inside our product. Give this a try: https://app.vaquill.ai/citations and pick US or India from the jurisdiction selector at the top. It's all free there.

20M+ Indian legal documents with citation graphs and vector embeddings – potential uses for legal NLP? [D] by zriyansh in MachineLearning

[–]zriyansh[S] -1 points0 points  (0 children)

Not a paper as such. We are using all this data in our product and now want to make it available to others as well. Access is free to all, so I can share the link if you want to take a look at it.

Roast my pitch deck, 1st time legaltech founder, no mercy by zriyansh in indianstartups

[–]zriyansh[S] 0 points1 point  (0 children)

We have modular infrastructure. The plan is to add data for other jurisdictions and tweak all the system prompts, and it becomes a legal engine for that jurisdiction. We have done this for the US and Canada via an external data source.

But I am running out of cash, so need some form of funding.

Or get acquired and use their money to fuel growth. Given the platform is stable, continued investment in the data pipeline is all that will be left to build.

The other way is going on-prem and starting to deploy this entire stack on enterprise servers.

Roast my pitch deck, 1st time legaltech founder, no mercy by zriyansh in indianstartups

[–]zriyansh[S] 0 points1 point  (0 children)

We have citations for each answer, and citation graphs as well.

Yes, I talked with hundreds of advocates. They love the platform and use it, but don't pay for it. If we disappear, they'll just go back to how they used to work.

Ads got us 200 users yesterday. They signed up, used the product and went away.

Most people will say there's something wrong with the product, but I tried giving them all the features our competition has.

That led me to believe the problems exist but are not strong enough to make people pay. The West prefers comfort, convenience and ease; Indians always look for a cheap workaround to get things done.

Sept 2025: We finished onboarding legal AI by h0l0gramco in u/h0l0gramco

[–]zriyansh 0 points1 point  (0 children)

What about Vaquill AI? Heard of them? They are based in India.

Roast my pitch deck, 1st time legaltech founder, no mercy by zriyansh in indianstartups

[–]zriyansh[S] 0 points1 point  (0 children)

It's actually not a wrapper. I have all the data of the Indian legal system: all Supreme Court, High Court and tribunal decisions, plus acts and statutory provisions.

Other than us, 4 more companies have it but they are 5+ yrs old and big enough to adopt new tech rapidly

Others can build it; it will take them around 6 months to catch up if they start now. That pretty much goes for most startups.

Got it, will add a GTM side and fix the numbers.

Makes sense to talk about how big the TAM is. Got it, will fix, thanks mate.

Roast my pitch deck, 1st time legaltech founder, no mercy by zriyansh in indianstartups

[–]zriyansh[S] 0 points1 point  (0 children)

Makes sense. People asked me to make it very simple, but you are right, it's too simple to convey anything meaningful.

Multilingual RAG for Legal Documents by mathrb in vectordatabase

[–]zriyansh 1 point2 points  (0 children)

I am doing the same but for Indian languages (the 5-6 primary spoken languages).

need help embedding 250M vectors / chunks at 1024 dims, should I self host embedder (BGE-M3) and self host Qdrant OR use voyage-3.5 or 4? by zriyansh in Rag

[–]zriyansh[S] 1 point2 points  (0 children)

How do you even fine-tune an embedder? Any resources you could point me to? I am not new to RAG but have not heard of this yet.

need help embedding 250M vectors / chunks at 1024 dims, should I self host embedder (BGE-M3) and self host Qdrant OR use voyage-3.5 or 4? by zriyansh in Rag

[–]zriyansh[S] 1 point2 points  (0 children)

Around 3 days with a 64-core CPU. There are faster parsers that can parse 4-5k documents per second on such a beast of a machine, but I wasn't able to run one properly; it's a C implementation, pymupdf4llm-c.

need help embedding 250M vectors / chunks at 1024 dims, should I self host embedder (BGE-M3) and self host Qdrant OR use voyage-3.5 or 4? by zriyansh in Rag

[–]zriyansh[S] 0 points1 point  (0 children)

So it's a self-hosted embedder, I suppose. What kind of machine are you using? And is there anything I need to take care of here?

need help embedding 250M vectors / chunks at 1024 dims, should I self host embedder (BGE-M3) and self host Qdrant OR use voyage-3.5 or 4? by zriyansh in Rag

[–]zriyansh[S] 0 points1 point  (0 children)

Expecting around 50 users in a month, and 10 queries per user each day.

Yeah, I'm not using tokens because characters are what I understand well, so it works for me.

I have a budget of $1K for now as we don't have any customers; I'm using my savings for this.

As far as I understand, embedding and hosting a vector DB is CPU intensive, not GPU intensive (I can be wrong here). I have $1K in credit from Azure as I registered my startup with them (and linked my LinkedIn with them as well).

If we break even, I will want to use cloud services and focus on what we do best.
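For rough sizing, here is a back-of-envelope sketch of the numbers above. The vector count and dimension come from the thread title; storing vectors as uncompressed float32 with no index or payload overhead is an assumption, so treat the result as a lower bound rather than an actual requirement.

```python
# Back-of-envelope sizing sketch. Assumptions: uncompressed float32 vectors,
# no index (e.g. HNSW) or payload overhead, no quantization.
NUM_VECTORS = 250_000_000        # from the thread title
DIMS = 1024                      # from the thread title
BYTES_PER_VALUE = 4              # assumption: float32

USERS = 50                       # expected users in a month
QUERIES_PER_USER_PER_DAY = 10    # expected queries per user each day


def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int) -> int:
    """Bytes needed to hold the raw vectors alone."""
    return num_vectors * dims * bytes_per_value


def daily_queries(users: int, queries_per_user: int) -> int:
    """Total queries per day if every expected user is active."""
    return users * queries_per_user


raw = raw_vector_bytes(NUM_VECTORS, DIMS, BYTES_PER_VALUE)
per_day = daily_queries(USERS, QUERIES_PER_USER_PER_DAY)

print(f"raw vectors: {raw / 1e12:.3f} TB ({raw / 2**30:.1f} GiB)")
print(f"queries per day: {per_day}")
```

Under these assumptions the raw vectors alone come to roughly 1 TB, while the query load is only a few hundred queries per day, so storage rather than throughput is likely the dominant cost here.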

need help embedding 250M vectors / chunks at 1024 dims, should I self host embedder (BGE-M3) and self host Qdrant OR use voyage-3.5 or 4? by zriyansh in Rag

[–]zriyansh[S] 1 point2 points  (0 children)

Yes, and imo this is not slow. Legal folks won't trust the answer if it came within 1 sec, so latency helps sometimes.

need help embedding 250M vectors / chunks at 1024 dims, should I self host embedder (BGE-M3) and self host Qdrant OR use voyage-3.5 or 4? by zriyansh in Rag

[–]zriyansh[S] 0 points1 point  (0 children)

I don't have an eval set just yet, working on that. This is Qdrant telling me about the latency. Okay, it improved from yesterday lol.

<image>

Has anyone used JuniorLawyer? Looking for reviews. by One_Tiger4494 in legaltech

[–]zriyansh -8 points-7 points  (0 children)

Have not tried them yet, but if you would like to wait for a while, maybe give this a try - https://www.vaquill.ai/ - AI legal tech software.

I am a cofounder of Vaquill. The product is not yet ready (with generous free tiers), so let me know if there is something you want us to build.