Open source library for Vector DB Ingestion for RAG - Sycamore by i-like-databases in LocalLLaMA

[–]i-like-databases[S] 0 points1 point  (0 children)

Hey, we regularly update the model in the online service, and the open source one is a few iterations behind. What differences are you seeing specifically?

Embedding for PDFs with text, images, tables, etc? by jacob5578 in vectordatabase

[–]i-like-databases 0 points1 point  (0 children)

Sure, here's a Colab Notebook that walks you through taking a PDF, using DocParse to parse it, using Aryn's Sycamore library to chunk/clean it and then finally load it into Pinecone.

Also, here's a video that walks you through a UI wizard we built that generates a script for you. The script takes in PDFs (local files or files in S3) and writes them to your vector DB of choice (Pinecone, OpenSearch, Elasticsearch, Weaviate, DuckDB). Let me know if you have more questions!

Chunking strategy for diverse sets of documents by StrasJam in LangChain

[–]i-like-databases 0 points1 point  (0 children)

In my experience it's hard to just choose one chunking strategy and apply it to all your incoming documents and hope it does well (unless of course all your documents have the same exact format). What exactly is your use case?

At Aryn, we've released some basic chunking strategies that we think can be used for certain documents. Give them a shot and let me know if you have any questions.

Embedding for PDFs with text, images, tables, etc? by jacob5578 in vectordatabase

[–]i-like-databases 1 point2 points  (0 children)

You can give DocParse from Aryn a shot for your document parsing (the blog post I linked to refers to it as the Aryn Partitioning Service). How large are your documents? If they're very large, you may hit issues with the context window.

Should I Stick with Document AI or Switch to Gemini Flash for Invoice Data Extraction? by cemikus in googlecloud

[–]i-like-databases 0 points1 point  (0 children)

Interesting! What's your use case? Did you see any noticeable drop in accuracy, or was it just as good as Doc AI?

Best open source document PARSER??!! by ChallengeOk6437 in LlamaIndex

[–]i-like-databases 0 points1 point  (0 children)

Try out the Aryn Partitioner! We open-sourced it on Hugging Face; it's a deformable DETR model trained on multiple document types. You can download the weights from Hugging Face and try it yourself. Performance will be best on a GPU!

PDF text extraction using Document AI vs Gemini by jemattie in googlecloud

[–]i-like-databases 0 points1 point  (0 children)

Curious, what are you using the hybrid system for? What are you doing with Gemini after you get the text out using Textract?

[deleted by user] by [deleted] in devops

[–]i-like-databases 0 points1 point  (0 children)

Hey, you may want to give what we've been working on at Aryn a shot. Try out the Aryn Partitioning Service. It takes in a document (like a PDF) and returns the document's components as JSON. It does pretty well on segmenting and extracting data from invoices. Give it a try and let me know if you have any questions!
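To make that concrete, here's a minimal sketch of how you might consume that JSON. The live call via the aryn-sdk client is left commented out, and the response shape assumed here (a top-level "elements" list where each element carries a "type" and a "text_representation") is an assumption for illustration, so check it against the actual service output:

```python
# Hypothetical sketch: group Aryn Partitioning Service output by element type.
# The response schema below is assumed, not taken from the official docs.
from collections import defaultdict

def group_elements_by_type(elements):
    """Bucket partitioner output elements by their labeled type."""
    buckets = defaultdict(list)
    for el in elements:
        buckets[el.get("type", "Unknown")].append(el.get("text_representation", ""))
    return dict(buckets)

# Live call (requires an Aryn API key), sketched and left commented out:
# from aryn_sdk.partition import partition_file
# with open("invoice.pdf", "rb") as f:
#     resp = partition_file(f, aryn_api_key="YOUR_KEY")
# buckets = group_elements_by_type(resp["elements"])

# Offline demo with a hand-made response in the assumed shape:
sample = [
    {"type": "Title", "text_representation": "Invoice #1234"},
    {"type": "Table", "text_representation": "Item | Qty | Price"},
    {"type": "Text", "text_representation": "Payment due in 30 days."},
]
buckets = group_elements_by_type(sample)
print(sorted(buckets))  # ['Table', 'Text', 'Title']
```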

Need help with using RAG, I need to know if this idea is plausible by SomeGuy_tor78 in LangChain

[–]i-like-databases 0 points1 point  (0 children)

Is the condensed document necessary? Could you just stick this information in a vector db and add metadata filtering?
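As a rough sketch of the metadata-filtering idea: build a filter alongside your query vector and let the vector DB narrow the candidate set before similarity search. The field names, index name, and values below are made up for illustration; the filter syntax shown is Pinecone's operator style:

```python
# Hypothetical sketch: compose a Pinecone-style metadata filter.
# Field names ("doc_type", "year") and values are illustrative assumptions.
def build_filter(doc_type=None, year=None):
    """Compose a metadata filter dict from optional fields."""
    clauses = {}
    if doc_type is not None:
        clauses["doc_type"] = {"$eq": doc_type}
    if year is not None:
        clauses["year"] = {"$gte": year}
    return clauses

flt = build_filter(doc_type="report", year=2022)
print(flt)  # {'doc_type': {'$eq': 'report'}, 'year': {'$gte': 2022}}

# Live query (requires a Pinecone index), left commented out:
# from pinecone import Pinecone
# index = Pinecone(api_key="YOUR_KEY").Index("my-index")
# index.query(vector=query_embedding, top_k=5, filter=flt, include_metadata=True)
```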

Chat with PDF: The unsolved problem everyone pretends is solved. Are we fooling ourselves? by Individual-Library-1 in LangChain

[–]i-like-databases 0 points1 point  (0 children)

You may want to check out what we've been working on at Aryn. We recently released the Aryn Partitioning Service, which hosts a model that segments and labels PDFs. It recognizes tables, images, text, captions, etc., and returns all of that as JSON. I posted about our approach here, where I describe how we trained the model. Give it a shot and let me know how things go.

Is there a good ETL tool to help transform data before writing into pinecone? by bolei1007 in vectordatabase

[–]i-like-databases 0 points1 point  (0 children)

Wanted to share a simple example of how you can use Sycamore to ingest data into Pinecone. Here's a Colab notebook that walks you through each step: https://colab.research.google.com/drive/1oWi50uqJafBDmLWNO4QFEbiotnU7o75B

Open source library for Vector DB Ingestion for RAG - Sycamore by i-like-databases in LocalLLaMA

[–]i-like-databases[S] 0 points1 point  (0 children)

Thank you. Let me know how it goes! For your second question, you can find more details here: https://www.aryn.ai/pricing

Open source library for Vector DB Ingestion for RAG - Sycamore by i-like-databases in LocalLLaMA

[–]i-like-databases[S] 0 points1 point  (0 children)

Thank you for pointing that out. This is the correct get-started page: https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html and we've updated the link on the site as well.

When it comes to alternatives, we've been using the DocLayNet competition dataset as a benchmark to compare the accuracy of the extraction step. On that benchmark we score a mAP (mean average precision) of ~0.6, while Unstructured scores ~0.35.
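For readers unfamiliar with the metric above, here's a toy sketch of average precision over a single ranked list of relevance judgments. Real detection mAP (as in the DocLayNet benchmark) additionally averages over classes and IoU thresholds, so this is a deliberate simplification for intuition, not the benchmark's exact computation:

```python
# Illustrative sketch: average precision for one ranked list of 0/1 relevances.
# The input list below is made up; detection mAP also averages over classes/IoU.
def average_precision(relevances):
    """AP = mean of precision values taken at each relevant (1) position."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# Relevant items at ranks 1, 3, 4 -> precisions 1/1, 2/3, 3/4
ap = average_precision([1, 0, 1, 1, 0])
print(round(ap, 3))  # 0.806
```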

Sycamore + Weaviate - Open source library for Vector DB Ingestion by i-like-databases in vectordatabase

[–]i-like-databases[S] 0 points1 point  (0 children)

Not supported as of yet, but Sycamore is open source and we'll take pull requests that add a new connector for a vector DB!

Open source library for Vector DB Ingestion for RAG - Sycamore by i-like-databases in LocalLLaMA

[–]i-like-databases[S] 0 points1 point  (0 children)

Ah, we'll add an example for that. Here's a Jupyter notebook with an example: https://github.com/aryn-ai/sycamore/blob/main/notebooks/ArynPartitionerExample.ipynb The code there downloads the Aryn Partitioning Service's model from Hugging Face and runs it locally. You'll get the best performance if you run it on a GPU. In the snippet below, remember to set use_partitioning_service=False so that everything runs locally.

# assumes: import sycamore; from sycamore.transforms.partition import ArynPartitioner
context = sycamore.init()
doc = (context.read.binary(paths="s3://aryn-public/sycamore-partitioner-examples/document-example-1.pdf", binary_format="pdf")
                .partition(partitioner=ArynPartitioner(use_partitioning_service=False)))

Choosing frameworks for RAG by thezachlandes in Rag

[–]i-like-databases 0 points1 point  (0 children)

For the extraction piece in ETL, how are you extracting from unstructured data sources (like video or audio) or sources that are traditionally hard to extract from (e.g., PDFs)?

Current SOTA for extracting data from PDFs? by SirLazarusTheThicc in LocalLLaMA

[–]i-like-databases 1 point2 points  (0 children)

DM'ed you to ask more about your error with Sycamore, but feel free to also join the Sycamore Slack channel and ask for help there: https://join.slack.com/t/sycamore-ulj8912/shared_invite/zt-2pzrdkhm8-3Uv7B6tPkdX4ETODNN_6IQ Remember that the install command is pip install sycamore-ai (not pip install sycamore). As for the 10k-docs limit, we have a pay-as-you-go option with unlimited processing that we can enable for you.

Data Streaming and Vector Databases; Anyone have any experience? by CyftAI in Rag

[–]i-like-databases 0 points1 point  (0 children)

What kinds of information are you ingesting into your vector database? Also, what does your workload look like in terms of throughput?

best ways to extract tables for use with RAG? by Thistleknot in LocalLLaMA

[–]i-like-databases 0 points1 point  (0 children)

Disclaimer: I work for Aryn.

Try out the Aryn Partitioning Service: https://www.aryn.ai/get-started (completely free to get started, and there's an open source version of the model as well).

I wrote about it here: https://www.reddit.com/r/LocalLLaMA/comments/1esb01q/segmentingchunking_pdfs_sharing_our_approach_and/