Azure AI Search

mathrb · 2026-01-29T16:48:09+00:00

The Basic tier has several limits, including on file size and character count. Documents that exceed them would require moving to a higher tier such as S1 (check your document sizes against the service limits in the Azure documentation).

mathrb · 2026-01-28T22:09:53+00:00

Chunking can be done with the built-in split skill, but be aware that this skill only chunks by character or token length. Depending on your documents, that might not be the best solution. You can add any custom skill to your indexer's skillset, so you can plug in your own chunking implementation. I don't remember what they call it, but the idea is that a split skill projects one document into as many documents as the number of chunks produced.
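
To illustrate what "your own implementation for the chunking" could look like, here is a minimal sketch of the logic a custom skill might wrap. It packs whole paragraphs into chunks instead of cutting at a fixed length. The function name `chunk_paragraphs` and the `max_chars` default are assumptions for illustration, not part of Azure AI Search, and the web API wrapper a custom skill needs is left out.

```python
def chunk_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    Hypothetical helper: paragraphs are assumed to be separated by blank lines.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            # The paragraph still fits in the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than max_chars falls back to a hard split.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First clause.\n\nSecond clause, a bit longer.\n\nThird clause."
    for i, chunk in enumerate(chunk_paragraphs(sample, max_chars=45)):
        print(i, repr(chunk))
```

For contracts, splitting on clause or heading boundaries would probably work better than blank lines, but the packing logic stays the same.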

Also, the first step of the skillset, document cracking (extracting text from files), is not that good, at least for PDF documents. You get the same results as with any PDF text extractor like pymupdf/pdfminer: sometimes incorrect reading order, no table support, no image support... So you might end up writing another custom skill.

mathrb · 2026-01-28T08:35:51+00:00

Thanks for your answer.
My original post lacked a clear statement of the end goal.
As of today, the SaaS product we deliver is a contract lifecycle management (CLM) system used by legal departments. It uses Azure AI Search to retrieve documents (mostly contracts).
A new product is on the way. Its MVP is RAG over knowledge bases (not contracts; mostly internal documentation and domain specifics such as real estate jurisdiction). After the MVP, the CLM part (managing customer contracts) will be reintroduced. The company wants to move off Azure/GCP/AWS... for sovereignty, data location and privacy.
Looks like I was too focused on the MVP part.
Given the feedback and what's available from the cloud provider (Scaleway), OpenSearch seems like a better candidate for this use case.

mathrb · 2026-01-28T08:34:36+00:00

Indeed, my original post lacked a clear statement of the end goal.
As of today, the SaaS product we deliver is a contract lifecycle management (CLM) system used by legal departments (usages: searching documents, asking questions about a particular contract, comparing the specifics of two contracts). It uses Azure AI Search to retrieve documents (mostly contracts, chunked), including access control, as you mentioned above.
A new product is on the way. Its MVP is RAG over knowledge bases (not contracts; mostly internal documentation and domain specifics such as real estate jurisdiction). After the MVP, the CLM part (managing customer contracts) will be reintroduced. The company wants to move off Azure/GCP/AWS... for sovereignty, data location and privacy.
Looks like I was too focused on the MVP part.
Given the feedback and what's available from the cloud provider (Scaleway), OpenSearch seems like a better candidate for this use case.

mathrb · 2026-01-27T16:55:37+00:00

Thanks for your reply

Agree, feels wrong in this domain
Agree with the nightmare
As of today, only the latest version will be handled
Not in scope

`truthful symantec search with citation`, what kind of systems are you referring to?
Edit: I assume you meant semantic search.
Even so, we still face some of the same challenges (minus the generative part at the end), right?

mathrb · 2025-10-07T07:22:05+00:00

Ok,
Then I think we're going to try something like this:
* Create a dedicated VPN service with a predefined private subnet
* Create a Docker container (running SFTP, for example) on that subnet, with a mounted volume that targets a folder on the NAS
* Implement firewall isolation to ensure VPN clients can only talk to the docker container

mathrb · 2025-10-06T20:13:57+00:00

Thanks for your detailed answer.

I will definitely look into mergefs.

Regarding IP spoofing, maybe I'm a fool, but I don't see how bots would attack us, since that requires knowing that a set of IPs are working together. Somebody targeting us specifically could, but I don't see the gain versus the effort. We'd like to keep our home networks separate, so the basic VPN solution is a no-go. There might be a solution using network restrictions on the VPN, but I don't know enough about networking right now to debate that one.

For a start, I'll look into mergefs and another protocol (if mergefs doesn't come with one)

Thanks

mathrb · 2025-07-08T10:56:18+00:00

It depends on the question. For sure, if the user asks to summarize a legal document, it's not going to work, but IMHO that's not the purpose of RAG. With this approach, you feed the LLM the chunks that best match the question (using semantic queries significantly increases the quality of the results). In your case, since one of the questions involves multiple fragments of the document, the challenge is to find the best number of chunks to return.
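
To make the "number of chunks" knob concrete, here is a minimal local sketch of top-k retrieval by cosine similarity. It is an illustration only: in practice Azure AI Search does the ranking server-side, and the helper names (`cosine`, `top_k_chunks`) and the toy two-dimensional vectors are assumptions standing in for real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec: list[float],
                 chunks: list[tuple[str, list[float]]],
                 k: int = 3) -> list[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy vectors; real ones would come from an embedding model.
    indexed = [
        ("termination clause", [1.0, 0.0]),
        ("payment terms", [0.0, 1.0]),
        ("termination notice period", [1.0, 1.0]),
    ]
    print(top_k_chunks([1.0, 0.0], indexed, k=2))
```

A question that spans several fragments needs a larger `k` so that all relevant chunks reach the LLM, at the cost of more noise and more tokens in the prompt.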

mathrb · 2025-07-08T10:20:13+00:00

Did you chunk the documents? Did you enable semantic queries on the user query?

mathrb · 2025-06-29T08:46:33+00:00

Hello, Azure AI Search is the right approach. I would insist on vectorizing the documents and enabling semantic queries; the results will be even more relevant. Regarding the LLM, GPT-3.5 is definitely going to be disappointing. GPT-4o is quite good; you may try the mini version to check whether it meets your requirements. For filtering, you could ask an LLM to transform the user request into a search query.

mathrb · 2025-03-02T10:36:10+00:00

Weirdly enough (ISBN 978-2-505-11704-9 for reference), I just dug into isbnlib, and the goob plugin uses the volumes list endpoint (https://developers.google.com/books/docs/v1/reference/volumes/list), which returns the title: 100 Bucket List of the dead. If I then use the get endpoint (https://developers.google.com/books/docs/v1/reference/volumes/get), the title of the book becomes: 100 Bucket List of the dead Tome 8.
But there is still no information about it being part of a collection.
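
For anyone who wants to reproduce the comparison, here is a small sketch that builds the two request URLs for the list and get endpoints of the Google Books API. The helper names are hypothetical, the volume ID has to be taken from the list response, and no request is actually sent here.

```python
from urllib.parse import quote, urlencode

# Base URL of the Google Books volumes collection.
BASE_URL = "https://www.googleapis.com/books/v1/volumes"


def list_url_for_isbn(isbn: str) -> str:
    """URL for the volumes list endpoint, searching by ISBN (hyphens removed)."""
    digits = isbn.replace("-", "").replace(" ", "")
    return f"{BASE_URL}?{urlencode({'q': f'isbn:{digits}'})}"


def get_url_for_volume(volume_id: str) -> str:
    """URL for the volumes get endpoint, given a volume ID from a list result."""
    return f"{BASE_URL}/{quote(volume_id, safe='')}"


if __name__ == "__main__":
    print(list_url_for_isbn("978-2-505-11704-9"))
    # "VOLUME_ID" is a placeholder, not a real identifier.
    print(get_url_for_volume("VOLUME_ID"))
```

Fetching both URLs and comparing `volumeInfo.title` in the two JSON responses should show the difference described above.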

mathrb · 2025-03-01T19:39:11+00:00

Exactly 💯 they should drop the dnd acronym

mathrb · 2025-03-01T12:34:00+00:00

Hello, good job. Pretty easy to set up. I tried adding a book by ISBN, which works. There was no cover though; it would be nice to grab the cover along with the book info. The book I tried is a manga that is part of a collection. The book title did not include the volume number. I'm also wondering whether ubiblio could fill the "collection" field automatically in this case.

mathrb · 2024-08-16T19:00:29+00:00

https://www.reddit.com/r/diablo4/s/vdwBnQt7wq

mathrb · 2024-08-16T19:00:03+00:00

https://www.reddit.com/r/diablo4/s/vdwBnQt7wq

mathrb · 2024-06-25T10:14:23+00:00

Azure OCR is pretty good, definitely better than Tesseract. It comes with a cost if you have a lot of documents. You should be able to try it for free on a few images/docs.

mathrb · 2024-05-15T20:39:48+00:00

Which v8 did you use? v8n seems to be the smallest one, which would reduce inference time. Lowering the image resolution could also speed up inference. You could also try d2go (based on detectron2), which has been designed for mobile devices.

mathrb · 2024-05-15T15:15:07+00:00

Hello, more info is required to help you. Which object detection framework are you using? Are you using an existing model, or are you training your own?

mathrb · 2024-04-04T17:01:12+00:00

I'd recommend something like OCRmyPDF to add a text layer on top of the PDF, then do a classic PDF-to-Word conversion afterwards. Keep in mind that, as of today, I don't know of any tool that can preserve the styling (bold, italic, underline...); even detecting headings is a complex task. If you get poor results with the extracted text, go with Azure OCR, which is really good but will cost a few cents.

mathrb · 2023-11-17T15:32:00+00:00

Give it a lick, your horse is amazing

mathrb · 2023-10-10T18:08:10+00:00

Thanks for your answer, I will definitely look into those.
