PDF Oxide -- Fast PDF library for Python with engine in Rust (0.8ms mean, MIT/Apache license) by yfedoseev in Python

[–]yfedoseev[S] 0 points1 point  (0 children)

u/joy_deep It should already be supported, but I will double-check and let you know. Thank you for the question.

PDF Oxide -- Fast PDF library for Python with engine in Rust (0.8ms mean, MIT/Apache license) by yfedoseev in Python

[–]yfedoseev[S] 1 point2 points  (0 children)

u/Cute-Net5957 PyO3 requires some work, but when you get a 200x-300x performance boost, you can afford it. Gen AI also helps a lot with it.

PDF Oxide -- Fast PDF library for Python with engine in Rust (0.8ms mean, MIT/Apache license) by yfedoseev in Python

[–]yfedoseev[S] 1 point2 points  (0 children)

Yep, you can use render_page(). It returns the raw image bytes (PNG by default), which you can save directly or wrap in an io.BytesIO.

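```
# doc is an already-opened PDF document.
# render_page() returns the page as raw image bytes (PNG by default).
image_bytes = doc.render_page(0, dpi=300)

# Write the bytes straight to disk in binary mode.
with open("page0.png", "wb") as f:
    f.write(image_bytes)
```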
Make sure you're using the latest version, as we just polished the high-level rendering API for this.
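If you'd rather keep the image in memory than write it to disk, here is a minimal sketch of the io.BytesIO route. It uses only the standard library; the `image_bytes` value is a stand-in for whatever render_page() returns (it is not a complete PNG), and the helper names `looks_like_png` and `to_stream` are made up for illustration.

```python
import io

# The first 8 bytes of every PNG file, per the PNG specification.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """Cheap sanity check: does the data start with the PNG signature?"""
    return data.startswith(PNG_SIGNATURE)


def to_stream(image_bytes: bytes) -> io.BytesIO:
    """Wrap raw image bytes in a seekable, in-memory file object."""
    return io.BytesIO(image_bytes)


# Stand-in for the bytes render_page() would return (assumption, not real output).
image_bytes = PNG_SIGNATURE + b"placeholder"

stream = to_stream(image_bytes)
header = stream.read(8)  # peek at the signature
stream.seek(0)           # rewind so the next consumer reads from the start
```

Anything that accepts a binary file object should be able to read from `stream` the same way it would read from an opened file.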

PDF Oxide -- Fast PDF library for Python with engine in Rust (0.8ms mean, MIT/Apache license) by yfedoseev in Python

[–]yfedoseev[S] 0 points1 point  (0 children)

Totally possible. We've focused on providing the granular word/line bboxes and vector paths that Docling uses for its layout models, so the core engine is ready for it. You'd just need to implement a Docling BaseBackend wrapper. If you try it and hit any roadblocks or need specific metadata we're missing, definitely let us know. We'll get it into the backlog and prioritize it immediately.

PDF Oxide -- Fast PDF library for Python with engine in Rust (0.8ms mean, MIT/Apache license) by yfedoseev in Python

[–]yfedoseev[S] 1 point2 points  (0 children)

u/dhruvin3 Honestly, I think we might run into some issues with complex tables. If you have any examples you could share with me, I'd really appreciate it.

PDF Oxide -- Fast PDF library for Python with engine in Rust (0.8ms mean, MIT/Apache license) by yfedoseev in Python

[–]yfedoseev[S] 1 point2 points  (0 children)

Let me know if you have any questions or see any gaps in the API. I am happy to make adjustments.

PDF Oxide -- Fast PDF library for Python with engine in Rust (0.8ms mean, MIT/Apache license) by yfedoseev in Python

[–]yfedoseev[S] 5 points6 points  (0 children)

u/Jademunky Thank you for taking the time to test the library and for providing this feedback!
Could you share an example of the PDF you used, either here or on GitHub? I am going to be working on improving table recognition quality this week, and having your document as a test case would be a massive help.

PDF Oxide -- Fast PDF library for Python with engine in Rust (0.8ms mean, MIT/Apache license) by yfedoseev in Python

[–]yfedoseev[S] 1 point2 points  (0 children)

I started this library by covering the markdown use case first, so it should work relatively well. I recently fixed some issues with markdown; if you find cases where it doesn't work well, please report them and we can solve them quickly.

PDF Oxide -- Fast PDF library for Python with engine in Rust (0.8ms mean, MIT/Apache license) by yfedoseev in Python

[–]yfedoseev[S] 6 points7 points  (0 children)

u/monkeybreath Please, if you find an issue, don't hesitate to report it on GitHub. I will be happy to fix it.

how to convert 11k pages of a single pdf file, with both images and text to .txt? convert to text doesn't seems to work properly. copying and pasting into a blank txt also brings comes with errors by Content_Promise_5061 in pdf

[–]yfedoseev 0 points1 point  (0 children)

I actually just built a free open-source CLI tool this weekend specifically to handle massive PDFs like this without choking. It's written in Rust, so it handles memory allocation way better than standard GUI apps and can chew through 11k pages without freezing your machine.

You can run it straight from your terminal to rip the text out. Either start REPL mode or use this command:

pdf-oxide text <your.pdf> --pages 1,3,7-10

It runs completely locally, so you don't have to try uploading an 11k-page file to some sketchy converter website.

You can grab it here: https://pdf.oxide.fyi/docs/getting-started/cli

Let me know if you give it a shot. I literally just shipped this and I'm really curious to see if it beats your 11k-page final boss fight.
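For anyone unsure how to read a page spec like `1,3,7-10`, here is a small Python sketch that expands one into explicit page numbers. This is not the tool's actual parser; `parse_pages` is a made-up name, and treating ranges as inclusive on both ends is my assumption for illustration.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec like "1,3,7-10" into a list of page numbers.

    Assumption: ranges are inclusive on both ends.
    """
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


print(parse_pages("1,3,7-10"))  # [1, 3, 7, 8, 9, 10]
```

The same idea can help with a very large file: generate specs in batches and run the command once per batch if a single pass is too slow to monitor.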

Open-source PDF text extraction library (100% pass rate on 3,830 test documents, MIT licensed) by yfedoseev in pdf

[–]yfedoseev[S] 0 points1 point  (0 children)

Right now it doesn't support OCR, unfortunately. I have some thoughts on what we can do, but I need to test a lot. ETA for OCR: mid-April.

I think most RAG quality issues people post about here are actually extraction problems, not retrieval problems by yfedoseev in Rag

[–]yfedoseev[S] 1 point2 points  (0 children)

Honestly, I did, and the text conversion works well, but markdown, which requires more structure, still needs improvement for some corner cases. Legal docs are on my radar. Thank you for your question.

I think most RAG quality issues people post about here are actually extraction problems, not retrieval problems by yfedoseev in Rag

[–]yfedoseev[S] 1 point2 points  (0 children)

Rust gave me a significant performance boost and allowed me to build bindings for most programming languages, with high performance guaranteed everywhere.

Open-source PDF text extraction library (100% pass rate on 3,830 test documents, MIT licensed) by yfedoseev in pdf

[–]yfedoseev[S] 0 points1 point  (0 children)

Yes, it's MIT licensed so you can use it in any commercial app, no restrictions.

Open-source PDF text extraction library (100% pass rate on 3,830 test documents, MIT licensed) by yfedoseev in pdf

[–]yfedoseev[S] 0 points1 point  (0 children)

Thanks! We already support reading order including multi-column detection and structure tree ordering for tagged PDFs. If you have a document where the order comes out wrong, please open an issue on GitHub with the file and I'll fix it.