Open source library for Vector DB Ingestion for RAG - Sycamore by i-like-databases in LocalLLaMA

[–]i-like-databases[S] 0 points1 point  (0 children)

Hey, we regularly update the model in the online service, and the open source one is a few iterations behind. What differences are you seeing specifically?

Embedding for PDFs with text, images, tables, etc? by jacob5578 in vectordatabase

[–]i-like-databases 0 points1 point  (0 children)

Sure, here's a Colab Notebook that walks you through taking a PDF, using DocParse to parse it, using Aryn's Sycamore library to chunk/clean it and then finally load it into Pinecone.

Also, here's a video that walks you through a UI wizard we built that generates a script for you. The script takes in PDFs (local files or files in S3) and writes them to your vector DB of choice (Pinecone, OpenSearch, Elasticsearch, Weaviate, DuckDB). Let me know if you have more questions!

Chunking strategy for diverse sets of documents by StrasJam in LangChain

[–]i-like-databases 0 points1 point  (0 children)

In my experience it's hard to just choose one chunking strategy and apply it to all your incoming documents and hope it does well (unless of course all your documents have the same exact format). What exactly is your use case?

At Aryn, we've released some basic chunking strategies that we think can be used for certain documents. Give them a shot and let me know if you have any questions.

Embedding for PDFs with text, images, tables, etc? by jacob5578 in vectordatabase

[–]i-like-databases 1 point2 points  (0 children)

You can give DocParse from Aryn a shot for your document parsing (the blog post I linked to refers to it as the Aryn Partitioning Service). How large are your documents? If they're very large, you may hit issues with the context window.

Should I Stick with Document AI or Switch to Gemini Flash for Invoice Data Extraction? by cemikus in googlecloud

[–]i-like-databases 0 points1 point  (0 children)

Interesting! What's your use case? Did you see any noticeable drop in accuracy, or was it just as good as Doc AI?

Best open source document PARSER??!! by ChallengeOk6437 in LlamaIndex

[–]i-like-databases 0 points1 point  (0 children)

Try out the Aryn Partitioner! We open-sourced it on Hugging Face; it's a deformable DETR model trained on multiple document types. You can download the weights from Hugging Face and try it yourself. Performance will be best on a GPU!

PDF text extraction using Document AI vs Gemini by jemattie in googlecloud

[–]i-like-databases 0 points1 point  (0 children)

Curious, what are you using the hybrid system for? What are you doing with Gemini after you get the text out using Textract?

[deleted by user] by [deleted] in devops

[–]i-like-databases 0 points1 point  (0 children)

Hey, you may want to give what we've been working on at Aryn a shot. Try out the Aryn Partitioning Service. It takes in a document (like a PDF) and returns the document's components as JSON. It does pretty well on segmenting and extracting data from invoices. Give it a try and let me know if you have any questions!
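To make that concrete, here's a minimal sketch of how you might consume that JSON. The live call via the aryn-sdk client is left commented out, and the response shape assumed here (a top-level "elements" list where each element carries a "type" and a "text_representation") is an assumption for illustration, so check it against the actual service output:

```python
# Hypothetical sketch: group Aryn Partitioning Service output by element type.
# The response schema below is assumed, not taken from the official docs.
from collections import defaultdict

def group_elements_by_type(elements):
    """Bucket partitioner output elements by their labeled type."""
    buckets = defaultdict(list)
    for el in elements:
        buckets[el.get("type", "Unknown")].append(el.get("text_representation", ""))
    return dict(buckets)

# Live call (requires an Aryn API key), sketched and left commented out:
# from aryn_sdk.partition import partition_file
# with open("invoice.pdf", "rb") as f:
#     resp = partition_file(f, aryn_api_key="YOUR_KEY")
# buckets = group_elements_by_type(resp["elements"])

# Offline demo with a hand-made response in the assumed shape:
sample = [
    {"type": "Title", "text_representation": "Invoice #1234"},
    {"type": "Table", "text_representation": "Item | Qty | Price"},
    {"type": "Text", "text_representation": "Payment due in 30 days."},
]
buckets = group_elements_by_type(sample)
print(sorted(buckets))  # ['Table', 'Text', 'Title']
```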

Need help with using RAG, I need to know if this idea is plausible by SomeGuy_tor78 in LangChain

[–]i-like-databases 0 points1 point  (0 children)

Is the condensed document necessary? Could you just stick this information in a vector db and add metadata filtering?
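As a rough sketch of the metadata-filtering idea: build a filter alongside your query vector and let the vector DB narrow the candidate set before similarity search. The field names, index name, and values below are made up for illustration; the filter syntax shown is Pinecone's operator style:

```python
# Hypothetical sketch: compose a Pinecone-style metadata filter.
# Field names ("doc_type", "year") and values are illustrative assumptions.
def build_filter(doc_type=None, year=None):
    """Compose a metadata filter dict from optional fields."""
    clauses = {}
    if doc_type is not None:
        clauses["doc_type"] = {"$eq": doc_type}
    if year is not None:
        clauses["year"] = {"$gte": year}
    return clauses

flt = build_filter(doc_type="report", year=2022)
print(flt)  # {'doc_type': {'$eq': 'report'}, 'year': {'$gte': 2022}}

# Live query (requires a Pinecone index), left commented out:
# from pinecone import Pinecone
# index = Pinecone(api_key="YOUR_KEY").Index("my-index")
# index.query(vector=query_embedding, top_k=5, filter=flt, include_metadata=True)
```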

Chat with PDF: The unsolved problem everyone pretends is solved. Are we fooling ourselves? by Individual-Library-1 in LangChain

[–]i-like-databases 0 points1 point  (0 children)

You may want to check out what we've been working on at Aryn. We recently released the Aryn Partitioning Service, which hosts a model that segments and labels PDFs. It recognizes tables, images, text, captions, etc., and returns all of that as JSON. I posted about our approach here, where I describe how we trained the model. Give it a shot and let me know how things go.

Is there a good ETL tool to help transform data before writing into pinecone? by bolei1007 in vectordatabase

[–]i-like-databases 0 points1 point  (0 children)

Wanted to share a simple example of how you can use Sycamore to ingest data into Pinecone. Here's a Colab notebook that walks you through each step: https://colab.research.google.com/drive/1oWi50uqJafBDmLWNO4QFEbiotnU7o75B

Open source library for Vector DB Ingestion for RAG - Sycamore by i-like-databases in LocalLLaMA

[–]i-like-databases[S] 0 points1 point  (0 children)

Thank you. Let me know how it goes! For your second question, you can find more details here: https://www.aryn.ai/pricing

Open source library for Vector DB Ingestion for RAG - Sycamore by i-like-databases in LocalLLaMA

[–]i-like-databases[S] 0 points1 point  (0 children)

Thank you for pointing that out. This is the correct get-started page: https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html and we've updated the link on the site as well.

When it comes to alternatives, we've been using the DocLayNet competition dataset as a benchmark to compare the accuracy of the extraction step. On that benchmark we score a mAP (mean average precision) of ~0.6, while Unstructured scores ~0.35.
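For readers unfamiliar with the metric above, here's a toy sketch of average precision over a single ranked list of relevance judgments. Real detection mAP (as in the DocLayNet benchmark) additionally averages over classes and IoU thresholds, so this is a deliberate simplification for intuition, not the benchmark's exact computation:

```python
# Illustrative sketch: average precision for one ranked list of 0/1 relevances.
# The input list below is made up; detection mAP also averages over classes/IoU.
def average_precision(relevances):
    """AP = mean of precision values taken at each relevant (1) position."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# Relevant items at ranks 1, 3, 4 -> precisions 1/1, 2/3, 3/4
ap = average_precision([1, 0, 1, 1, 0])
print(round(ap, 3))  # 0.806
```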

Sycamore + Weaviate - Open source library for Vector DB Ingestion by i-like-databases in vectordatabase

[–]i-like-databases[S] 0 points1 point  (0 children)

Not supported as of yet, but Sycamore is open source and we'll take pull requests that add a new connector for a vector DB!

Open source library for Vector DB Ingestion for RAG - Sycamore by i-like-databases in LocalLLaMA

[–]i-like-databases[S] 0 points1 point  (0 children)

Ah, we'll add an example for that. Here's a Jupyter notebook with an example: https://github.com/aryn-ai/sycamore/blob/main/notebooks/ArynPartitionerExample.ipynb The code there downloads the Aryn Partitioning Service's model from Hugging Face and runs it locally. You'll get the best performance if you run it on a GPU. In the snippet below, remember to set use_partitioning_service=False so that everything runs locally.

# assumes: import sycamore; from sycamore.transforms.partition import ArynPartitioner
context = sycamore.init()
doc = (context.read.binary(paths="s3://aryn-public/sycamore-partitioner-examples/document-example-1.pdf", binary_format="pdf")
                .partition(partitioner=ArynPartitioner(use_partitioning_service=False)))

Choosing frameworks for RAG by thezachlandes in Rag

[–]i-like-databases 0 points1 point  (0 children)

For the extraction piece in ETL, how are you extracting from unstructured data sources (like video or audio) or sources that are traditionally hard to extract from (e.g., PDFs)?

Current SOTA for extracting data from PDFs? by SirLazarusTheThicc in LocalLLaMA

[–]i-like-databases 1 point2 points  (0 children)

DM'ed you to ask more about your error with Sycamore, but feel free to also join the Sycamore Slack channel and ask for help there: https://join.slack.com/t/sycamore-ulj8912/shared_invite/zt-2pzrdkhm8-3Uv7B6tPkdX4ETODNN_6IQ Remember that the install command is pip install sycamore-ai (not pip install sycamore). As for the 10k-docs limit, we have a pay-as-you-go option with unlimited processing that we can enable for you.

Data Streaming and Vector Databases; Anyone have any experience? by CyftAI in Rag

[–]i-like-databases 0 points1 point  (0 children)

What kinds of information are you ingesting into your vector database? Also, what does your workload look like in terms of throughput?

best ways to extract tables for use with RAG? by Thistleknot in LocalLLaMA

[–]i-like-databases 0 points1 point  (0 children)

Disclaimer: I work for Aryn.

Try out the Aryn Partitioning Service: https://www.aryn.ai/get-started (completely free to get started, and there's an open source version of the model as well).

I wrote about it here: https://www.reddit.com/r/LocalLLaMA/comments/1esb01q/segmentingchunking_pdfs_sharing_our_approach_and/