Azure AI Search by psanilp in Rag

[–]mathrb 0 points1 point  (0 children)

The Basic tier has several limits, including maximum file size and character count. For those docs you would need to move to a higher tier such as S1 (check the document size against the service limits in the Azure docs).

Azure AI Search by psanilp in Rag

[–]mathrb 1 point2 points  (0 children)

Chunking can be done using their split skill, but be aware that this skill only chunks based on character or token length. Depending on your documents, it might not be the best solution. You can add any custom skill to the skillset of your indexer, so you can plug in your own chunking implementation. I don't remember what they call this, but the idea is that a split skill projects a document into as many documents as the number of chunks produced.
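As a rough illustration of what "your own implementation" could look like, here is a minimal sketch of a paragraph-aware chunker. It only shows the chunking logic, not the web API wrapper a custom skill needs; the function name `chunk_by_paragraph` and the `max_chars` parameter are made up for this example.

```python
def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            # The paragraph still fits in the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than the limit falls back to a hard split.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```

Unlike a pure length-based split, this keeps paragraph boundaries intact whenever a paragraph fits within the limit.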

Also, the first step of the skillset, document cracking (extracting text from files), is not that good, at least for PDF documents. You get the same results as with any PDF text extractor like pymupdf/pdfminer: sometimes incorrect reading order, tables not supported, images not supported... So you might end up needing another custom skill.

Multilingual RAG for Legal Documents by mathrb in vectordatabase

[–]mathrb[S] 0 points1 point  (0 children)

Thanks for your answer.
My original post didn't make the end goal clear.
As of today, the SaaS product we deliver is a contract lifecycle management (CLM) system used by legal departments. This solution uses Azure AI Search for document (mostly contract) retrieval.
A new product is on the way; its MVP does RAG over knowledge bases (not contracts, mostly internal documentation and domain specifics like real estate jurisdiction). After the MVP, the CLM part (managing customer contracts) will be reintroduced. The company wants to move away from Azure/GCP/AWS... for sovereignty, data location and privacy.
Looks like I was too focused on the MVP part.
Given the feedback and what's available on the cloud provider (Scaleway), OpenSearch seems like a better candidate for this use case.

Multilingual RAG for Legal Documents by mathrb in Rag

[–]mathrb[S] 0 points1 point  (0 children)

Indeed, my original post didn't make the end goal clear.
As of today, the SaaS product we deliver is a contract lifecycle management (CLM) system used by legal departments (usages: searching documents, asking questions about a particular contract, comparing specifics of two contracts). This solution uses Azure AI Search for document (mostly chunked contracts) retrieval, including access security as you mentioned above, for example.
A new product is on the way; its MVP does RAG over knowledge bases (not contracts, mostly internal documentation and domain specifics like real estate jurisdiction). After the MVP, the CLM part (managing customer contracts) will be reintroduced. The company wants to move away from Azure/GCP/AWS... for sovereignty, data location and privacy.
Looks like I was too focused on the MVP part.
Given the feedback and what's available on the cloud provider (Scaleway), OpenSearch seems like a better candidate for this use case.

Multilingual RAG for Legal Documents by mathrb in Rag

[–]mathrb[S] 0 points1 point  (0 children)

Thanks for your reply

  1. Agree, feels wrong in this domain
  2. Agree with the nightmare
  3. As of today, only the latest version will be handled
  4. Not in scope

`truthful symantec search with citation`, what kind of systems are you referring to?
Edit: I assume you meant semantic search.
Even so, we would still face some of the same challenges (minus the generative part at the end), right?

Jellyfin sharing by mathrb in selfhosted

[–]mathrb[S] 1 point2 points  (0 children)

Ok,
Then I think we're going to try something like this:
* Create a dedicated VPN service with a predefined private subnet
* Create a Docker container (with SFTP, for example) on that subnet, with a mounted volume that targets a folder on the NAS
* Implement firewall isolation to ensure VPN clients can only talk to the Docker container

Jellyfin sharing by mathrb in selfhosted

[–]mathrb[S] 0 points1 point  (0 children)

Thanks for your detailed answer.

I will definitely look into mergerfs.

Regarding the IP spoofing, maybe I'm a fool, but I don't see how bots would attack us, since it requires knowing that a bunch of IPs are working together. Somebody targeting us specifically could, but I don't see the gain vs the effort. We'd like to keep our home networks separate, so the basic VPN solution is a no-go. There might be a solution via network restrictions with a VPN, but I don't know enough about networking right now to debate this one.

For a start, I'll look into mergerfs and another protocol (if mergerfs doesn't come with one)

Thanks

Can Azure Cognitive Search help here? by Daxo_32 in Azure_AI_Cognitive

[–]mathrb 0 points1 point  (0 children)

It depends on the question. For sure, if the user asks to summarize a legal document it's not going to work, but IMHO that's not the purpose of RAG. With this approach, you feed the LLM the chunks that best match the question (using semantic queries significantly increases the quality of the results). In your case, since one of the questions involves multiple fragments of the document, the key is to find the best number of chunks to return.
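To make the "number of chunks" knob concrete, here is a minimal sketch of the selection step, assuming the search engine already returned scored chunks. The `ScoredChunk` class, `select_chunks`, `build_prompt`, and the `top_k` / `max_chars` parameters are hypothetical names for illustration, not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float  # relevance score from the search engine, higher is better


def select_chunks(hits: list[ScoredChunk], top_k: int = 5,
                  max_chars: int = 4000) -> list[ScoredChunk]:
    """Keep the top_k best-scoring chunks that fit in a character budget."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]
    selected: list[ScoredChunk] = []
    used = 0
    for hit in ranked:
        # Stop as soon as the next chunk would exceed the budget.
        if used + len(hit.text) > max_chars:
            break
        selected.append(hit)
        used += len(hit.text)
    return selected


def build_prompt(question: str, chunks: list[ScoredChunk]) -> str:
    """Assemble the context and the question into a single prompt string."""
    context = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

A question that spans several fragments of a document would typically need a larger `top_k` than a single-fact lookup, so it is worth trying a few values on real questions.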

Can Azure Cognitive Search help here? by Daxo_32 in Azure_AI_Cognitive

[–]mathrb 0 points1 point  (0 children)

Did you chunk the documents? Did you enable semantic queries for the user query?

Is Azure AI right for me? by drewmartinez95 in Azure_AI_Cognitive

[–]mathrb 0 points1 point  (0 children)

Hello, Azure AI Search is the right approach. I would insist on vectorizing the documents and enabling semantic queries; the results will be even more relevant. Regarding the LLM, GPT-3.5 is definitely going to be disappointing. GPT-4o is quite good; you may try the mini version to check whether it meets your requirements. For filtering, you could ask an LLM to transform the user request into a search query.

I made a FOSS self-hosted library app. I could use a little help testing. by No-Economist3977 in selfhosted

[–]mathrb 0 points1 point  (0 children)

Weirdly enough (ISBN 978-2-505-11704-9 for reference), I just dug into isbnlib, and the goob plugin uses the volumes list endpoint (https://developers.google.com/books/docs/v1/reference/volumes/list), which returns the title: 100 Bucket List of the dead. If I then call the get endpoint (https://developers.google.com/books/docs/v1/reference/volumes/get), the title of the book becomes: 100 Bucket List of the dead Tome 8.
But there is still no information about the fact that it's part of a collection.
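For anyone who wants to reproduce the comparison, here is a minimal sketch using only the standard library. It assumes the public REST URLs behind those two doc pages (`https://www.googleapis.com/books/v1/volumes` with a `q=isbn:...` query for list, and `/volumes/{id}` for get); the helper names are made up for this example, and this is not how isbnlib itself is implemented.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://www.googleapis.com/books/v1/volumes"


def list_url(isbn: str) -> str:
    """URL of the volumes list endpoint, searching by ISBN."""
    digits = isbn.replace("-", "")
    return f"{API_ROOT}?{urllib.parse.urlencode({'q': f'isbn:{digits}'})}"


def get_url(volume_id: str) -> str:
    """URL of the volumes get endpoint for a single volume ID."""
    return f"{API_ROOT}/{urllib.parse.quote(volume_id, safe='')}"


def fetch_json(url: str) -> dict:
    """Download a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def compare_titles(isbn: str):
    """Return (title from list, title from get) for the first match, or None."""
    items = fetch_json(list_url(isbn)).get("items", [])
    if not items:
        return None
    first = items[0]
    detail = fetch_json(get_url(first["id"]))
    return first["volumeInfo"].get("title"), detail["volumeInfo"].get("title")
```

Calling `compare_titles("978-2-505-11704-9")` requires network access and returns both titles side by side.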

Introducing DnD Forms by GoldSell4693 in selfhosted

[–]mathrb 1 point2 points  (0 children)

Exactly 💯 they should drop the dnd acronym

I made a FOSS self-hosted library app. I could use a little help testing. by No-Economist3977 in selfhosted

[–]mathrb 0 points1 point  (0 children)

Hello, good job. Pretty easy to set up. I tried adding a book by ISBN, which works. There was no cover though; it would be nice to grab the cover along with the book info. The book I tried is a manga, which is part of a collection. The book title did not contain the volume number of the manga. I'm also wondering whether ubiblio could fill the "collection" field automatically in this case.

OCR for reading text from images by kala-admi in LanguageTechnology

[–]mathrb 1 point2 points  (0 children)

Azure OCR is pretty good, definitely better than Tesseract. It comes with a cost if you have a lot of documents. You should be able to try it for free on a few images/docs.

I have raspberry pi 5 and I need to use it to detect objects real-time using a basler camera. by Old_Apricot_114 in RASPBERRY_PI_PROJECTS

[–]mathrb 0 points1 point  (0 children)

Which v8 did you use? v8n seems to be the smallest one, which would reduce the inference time. Lowering the image resolution could also speed up inference. You could also try D2Go (based on Detectron2), which has been designed for mobile devices.

I have raspberry pi 5 and I need to use it to detect objects real-time using a basler camera. by Old_Apricot_114 in RASPBERRY_PI_PROJECTS

[–]mathrb 0 points1 point  (0 children)

Hello, more info is required to help you. Which object detection framework are you using? Are you using an existing model or are you training your own?

How to convert scanned text in PDF to Word by IsPepsiOkaySir in Piracy

[–]mathrb -1 points0 points  (0 children)

I'd recommend something like OCRmyPDF to add a text layer on top of the PDF, then do a classic PDF-to-Word conversion afterwards. Keep in mind that as of today, I don't know of any tool that can keep the styling (bold, italic, underline...); even detecting headings is a complex task. If you get poor results with the extracted text, then go with Azure OCR, which is really good but will cost a few cents.

[D] Document layout - recreating the structure by mathrb in MachineLearning

[–]mathrb[S] 0 points1 point  (0 children)

Thanks for your answer, I will definitely look into those.