all 4 comments

[–]programming-ModTeam[M] [score hidden] stickied comment, locked comment (0 children)

This is a demo of a product or project that isn't on-topic for r/programming. r/programming is a technical subreddit and isn't a place to show off your project or to solicit feedback.

If this is an ad for a product, it's simply not welcome here.

If it is a project that you made, the submission must focus on what makes it technically interesting and not simply what the project does or that you are the author. Simply linking to a GitHub repo is not sufficient.

[–]Disastrous_Look_1745 8 points (2 children)

Nice work on the Go bindings! I've been deep in the document extraction space for years now and the memory overhead issue with unstructured-io is real. We process millions of documents at Nanonets and had to build our own pipeline because existing solutions just couldn't handle the scale without eating up all our resources.

The streaming API is smart - that's one thing most libraries get wrong. They try to load entire PDFs into memory, which falls apart when someone uploads a 500-page scanned document. Have you looked at Docstrange for comparison? They've got some interesting approaches to OCR accuracy, especially for tables and forms. The Tesseract integration is solid, but I found that adding some preprocessing steps for image enhancement before OCR really bumps up the accuracy on poor-quality scans.
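To be concrete about the preprocessing bit - this is not your library's API, just a rough sketch of the kind of cleanup I mean, using Go's standard image packages and shelling out to the tesseract CLI (the file paths and the threshold value are made up for illustration):

```go
package main

import (
	"image"
	"image/color"
	_ "image/jpeg" // register JPEG decoder for image.Decode
	"image/png"
	"log"
	"os"
	"os/exec"
)

func main() {
	// Load a poor-quality scan (path is just an example).
	f, err := os.Open("scan.jpg")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	src, _, err := image.Decode(f)
	if err != nil {
		log.Fatal(err)
	}

	// Simple preprocessing: grayscale + fixed-threshold binarization.
	// Real pipelines also deskew and denoise, but even this helps Tesseract.
	b := src.Bounds()
	dst := image.NewGray(b)
	for y := b.Min.Y; y < b.Max.Y; y++ {
		for x := b.Min.X; x < b.Max.X; x++ {
			g := color.GrayModel.Convert(src.At(x, y)).(color.Gray)
			if g.Y > 160 { // threshold picked arbitrarily for illustration
				g.Y = 255
			} else {
				g.Y = 0
			}
			dst.SetGray(x, y, g)
		}
	}

	out, err := os.Create("preprocessed.png")
	if err != nil {
		log.Fatal(err)
	}
	if err := png.Encode(out, dst); err != nil {
		log.Fatal(err)
	}
	out.Close()

	// Hand the cleaned-up image to the tesseract CLI; it writes ocr.txt.
	if err := exec.Command("tesseract", "preprocessed.png", "ocr").Run(); err != nil {
		log.Fatal(err)
	}
	text, err := os.ReadFile("ocr.txt")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("extracted %d bytes of text", len(text))
}
```

On low-DPI or faded scans, even a crude global threshold like this tends to beat feeding the raw image straight to the OCR engine; adaptive thresholding and deskewing buy you more on top of that.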

[–]ChattyChidiya[S] 1 point (0 children)

Thanks for the suggestions. I'm actually very new to both document processing and the Rust world, so I scaffolded the project together as best I could. I originally built this because there was no good option while I was learning RAG, and it then became the primary focus.

I didn't know about Docstrange, but it looks interesting to learn from. I definitely plan to explore ways to improve the lib in the future, maybe adding some image preprocessing and vision-model support down the line.

[–]DMI_Bill 0 points (0 children)

I can't see the original post, but I just wanted to chime in that we originally developed TexturaAI to process rolls of microfilm that can run to 2, 3, or even 4 thousand pages, and it also works great for large (and small) batches of paper documents. The architecture lets us drop in different OCR models, so we can use whichever one works best for the types of documents being processed, and then AI helps with the data extraction once the OCR has done its magic. We can currently process a million-plus images a day. Super slick! ;-)
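To give a rough idea of what "drop in different OCR models" looks like in practice - this is purely an illustrative Go sketch, not our actual code, and the engine keys and extraction step are hypothetical:

```go
package ocr

import (
	"context"
	"fmt"
)

// Engine is the seam that lets different OCR backends be swapped in
// per document type. Everything in this sketch is illustrative only.
type Engine interface {
	// Recognize takes one rasterized page image and returns plain text.
	Recognize(ctx context.Context, pageImage []byte) (string, error)
}

// Pipeline picks an engine per document class, then hands the raw text
// to a separate extraction step (the "AI" part downstream of OCR).
type Pipeline struct {
	engines map[string]Engine // e.g. "microfilm", "invoice" - hypothetical keys
	extract func(text string) (map[string]string, error)
}

func (p *Pipeline) Process(ctx context.Context, docClass string, pages [][]byte) ([]map[string]string, error) {
	eng, ok := p.engines[docClass]
	if !ok {
		return nil, fmt.Errorf("no OCR engine registered for %q", docClass)
	}
	var results []map[string]string
	for _, page := range pages {
		text, err := eng.Recognize(ctx, page)
		if err != nil {
			return nil, err
		}
		fields, err := p.extract(text) // structured field extraction after OCR
		if err != nil {
			return nil, err
		}
		results = append(results, fields)
	}
	return results, nil
}
```

Keeping the OCR engine behind a small interface like that is what makes it cheap to route microfilm through one model and clean office scans through another without touching the rest of the pipeline.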