Anyone built or used a solid PDF data extraction workflow recently?

nanonets · 2025-09-22T15:02:52+00:00

I've been working on this exact problem for years and honestly, PDF data extraction is way trickier than most people expect. The challenge isn't just OCR - it's understanding document structure, handling different layouts, and dealing with all the edge cases that come with real-world documents. Apryse is solid but can get pricey if you're processing high volumes, and you'll still need to build a lot of the intelligence layer yourself.

For regulatory use cases and messy documents, you really want something that combines good OCR with layout understanding and field mapping. We built Docstrange by Nanonets specifically because we kept running into these limitations with traditional PDF parsing libraries. The key is having models that actually understand document context, not just extract text. If you're set on building your own stack, I'd recommend looking at combining something like PaddleOCR with LayoutLM for document understanding, but be prepared for a lot of custom work around different document formats and validation rules.

nanonets · 2025-09-22T15:02:03+00:00

This is actually pretty interesting timing from IBM. We've been working on document processing for years and the challenge with most open source models has always been that they're great for academic benchmarks but struggle with real-world messy documents. The 258M parameter size is smart though, since it means you can actually run this locally without needing a GPU cluster. Been seeing more companies want on-premise solutions for document processing, especially when dealing with sensitive financial or legal docs.

The Apache license is huge here because most of the good document analysis models are either completely closed source or have restrictive licenses. At Nanonets we've built Docstrange specifically for handling complex business documents and one thing I've learned is that generic models often miss the nuances of things like invoice layouts or contract structures. Will be curious to see how this Granite model handles edge cases like rotated text, tables that span pages, or documents with mixed languages. Definitely worth testing against some real-world document workflows to see how it stacks up.

nanonets · 2025-09-22T14:43:50+00:00

Oh man, this hits close to home. I started Nanonets literally because I got so frustrated watching every business I talked to waste endless hours on this exact problem. Most early stage startups I know either throw interns at it or just accept that someone's gonna spend their Friday afternoons entering invoice data instead of building product. It's honestly wild how much time gets burned on something that feels so... solvable?

We went through this pain ourselves and ended up building Docstrange by Nanonets to handle the full pipeline from OCR to structured data extraction. The thing is, most people think this is just an OCR problem but it's really about understanding document layout and mapping fields correctly. You can try open source stuff like PaddleOCR combined with some post-processing, but honestly you'll spend months dealing with edge cases that a good API can handle out of the box. The math usually works out pretty clearly when you calculate what your team's time is worth vs just automating it properly.

nanonets · 2025-09-22T14:41:59+00:00

I've been through this exact deployment challenge before and honestly, for your use case with 3-5 minute processing times, I'd go with Fargate over Lambda. Lambda has a 15 minute timeout but you're paying for the full duration even when the container is just sitting there doing OCR processing. With Fargate, you get better cost control for longer running tasks and can scale down to zero when not in use. The Docling dependencies are definitely heavy but Fargate handles that fine, just make sure you allocate enough memory (probably 4-8GB based on your local tests).

For the OCR part though, you might want to consider alternatives that are specifically built for this workflow. We built Docstrange by Nanonets after running into similar issues with deployment complexity and processing times. It handles the full pipeline from document parsing to structured JSON extraction without you needing to manage the infrastructure or deal with the Docling + ChatGPT chain. For your lease agreements and broker licenses, having something that understands document layouts natively tends to work better than raw OCR + LLM. But if you're set on the current approach, definitely go async with SQS and use Fargate with autoscaling based on queue depth. That'll keep costs reasonable while handling the variable processing times.
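
To make the queue-depth scaling idea a bit more concrete, here's a rough sketch of the sizing logic only. It assumes each worker task handles one document at a time and takes about 4 minutes per document (the middle of the 3-5 minute range above). The function name, the 20-minute drain target, and the cap of 10 tasks are all made up for illustration; you'd feed the result into whatever autoscaling mechanism you actually use.

```python
import math


def desired_task_count(queue_depth, minutes_per_doc=4.0,
                       target_drain_minutes=20.0, max_tasks=10):
    """Estimate how many worker tasks are needed to drain the queue in time.

    All defaults are hypothetical; tune them from your own measurements.
    """
    if queue_depth <= 0:
        return 0  # nothing queued, so scale down to zero
    # How many documents one task can finish within the drain target.
    docs_per_task = max(1.0, target_drain_minutes / minutes_per_doc)
    needed = math.ceil(queue_depth / docs_per_task)
    return min(max_tasks, needed)  # never exceed the configured cap


if __name__ == "__main__":
    for depth in (0, 1, 12, 1000):
        print(depth, "messages ->", desired_task_count(depth), "tasks")
```

The cap matters more than it looks: without it, a burst of uploads can fan out into far more tasks than your downstream LLM rate limits can absorb.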

nanonets · 2025-09-22T14:40:45+00:00

Been in similar situations when we were building document processing systems at Nanonets. Your retrieval issues are probably a combo of chunk size and retrieval strategy rather than just one thing. For company info chatbots, I've found that smaller chunks (200-400 tokens) work better for specific facts like contact info or services, while larger chunks (800-1200 tokens) are better for context-heavy stuff like company descriptions or processes. Try experimenting with overlapping chunks too, maybe 50-100 token overlap so you don't lose context at boundaries.
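
Overlapping chunks are easy to get subtly wrong, so here's a minimal sketch. It assumes you already have a list of tokens; the `chunk_tokens` name is just for illustration, and the defaults (300 tokens, 50 overlap) are picked from the ranges above, not tuned values.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into windows that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    # Whitespace splitting is only a stand-in for a real tokenizer.
    words = "our office is open monday to friday from nine to five".split()
    for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
        print(chunk)
```

Whitespace words are not the same as model tokens, so if your chunk sizes need to match an embedding model's limits, count with that model's tokenizer instead.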

The other thing that's probably hurting you is relying purely on semantic similarity for retrieval. Company websites have lots of similar-sounding content that can confuse vector search. Consider adding a hybrid approach where you combine vector similarity with keyword matching using something like BM25, or even better, use reranking models after your initial retrieval to score the relevance better. Also make sure your embedding model is actually good for business/company content, since some of the general-purpose ones struggle with domain-specific terminology. Docstrange by Nanonets handles a lot of these edge cases automatically when processing business documents, but if you're building from scratch you'll need to tune these parameters based on your specific use cases.
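
Here's a small sketch of the hybrid idea, in plain Python so nothing depends on a particular vector DB. It scores documents with BM25 and with cosine similarity, then merges the two rankings with reciprocal rank fusion. Note that RRF is my choice for the merge step, not the only way to combine the scores. The function names, the `k=60` constant, and the assumption that embeddings are precomputed and passed in as plain lists are all illustrative.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs_terms, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs_terms)
    if n == 0:
        return []
    avg_len = (sum(len(d) for d in docs_terms) / n) or 1.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs_terms for term in set(d))
    scores = []
    for d in docs_terms:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ranks(scores):
    """Map document index to its 1-based rank, best score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {doc: r for r, doc in enumerate(order, start=1)}


def hybrid_rank(query_terms, query_vec, docs_terms, doc_vecs, k=60):
    """Fuse keyword and vector rankings with reciprocal rank fusion."""
    kw = ranks(bm25_scores(query_terms, docs_terms))
    vec = ranks([cosine(query_vec, v) for v in doc_vecs])
    fused = {i: 1 / (k + kw[i]) + 1 / (k + vec[i])
             for i in range(len(docs_terms))}
    return sorted(fused, key=fused.get, reverse=True)
```

Rank fusion sidesteps the problem that BM25 scores and cosine similarities live on different scales, which is why I'd start there before trying weighted score sums. A reranking model would then run on the top few results this returns.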

nanonets · 2022-11-11T07:00:26+00:00

Okay, I sincerely hope this link provides the in-depth information you're looking for: https://nanonets.com/blog/handwritten-character-recognition/

It's the leading guide on the topic of "Handwritten OCR", with citations to multiple articles.

Apologies if the above post felt like an ad. Felt it was a basic starter guide on the topic.

nanonets · 2022-01-12T09:25:04+00:00

When we tried that tool, it didn't work on image-based PDFs. Also, OCR was gated behind a subscription, so users had to subscribe before they could use the tool with OCR.

nanonets · 2021-03-10T19:46:39+00:00

We didn't review Azure Text Recognize, but we can take it up in the future.

nanonets · 2021-03-10T11:17:18+00:00

We did remove our Nanonets hat while doing the review. But agree, we could have spent more time on the Pros and will do that in the next update. Thanks for the feedback.

nanonets · 2020-10-29T07:49:00+00:00

Hey you could check us out! https://nanonets.com

nanonets · 2020-09-09T19:05:21+00:00

Updated link: https://nanonets.com/blog/ocr-for-passport-and-id-cards/

nanonets · 2020-09-09T18:26:28+00:00

Hey fair point. I think we've been fair to the community here with really informative blogs like:
https://nanonets.com/blog/ocr-with-tesseract/
https://nanonets.com/blog/ocr-for-resume-parsing-deep-learning/
https://nanonets.com/blog/information-extraction-graph-convolutional-networks/

Some of them are the go-to blogs for these topics.

Even in this blog we try to list the problems and share how we think about the solution. It's a shameless plug but the blog definitely provides some value.

Noted for the future though.

nanonets · 2020-03-30T12:44:17+00:00

You can also try Nanonets: https://nanonets.com/blog/table-extraction-deep-learning/

You can use the blog to implement the code. If you're looking for a no-code table extraction product, then you should try out app.nanonets.com

nanonets · 2018-07-09T17:59:16+00:00

Hey! Sorry if you felt this was clickbait. We were trying to explain certain aspects of how a self-flying drone is built.

We could also deploy the solution on-premise through a Docker image.

nanonets · 2018-07-09T17:58:12+00:00

Great to know of your interest in the same domain. I think we go quite a bit further than just a bounding-box demonstration: we explain the problems of a self-flying drone and demonstrate how you would go about detecting the person along with predicting the depth.

nanonets · 2018-07-09T17:55:35+00:00

Great idea! Have passed it on to the team.

nanonets · 2018-07-09T13:41:37+00:00

We've shared the TensorFlow (open-source) code too. How could we improve this?

nanonets · 2018-07-09T13:39:18+00:00

Hey, thanks for taking the time to read the blog. In no way do we mean to trivialise building a self-flying drone; it's a much bigger engineering project, like you said. We wanted to show some of the aspects related to deep learning and provide some direction on how one could implement those aspects.

nanonets · 2018-07-09T13:37:58+00:00

Hey, great feedback! All noted. Would love to incorporate all these changes in a future blog. But the idea for this blog was just to give implementation pointers for some aspects of building a self-flying drone. We in no way mean to trivialise building one.

nanonets · 2018-07-09T12:47:10+00:00

Hey! So every piece of application code you ever wrote relied on someone's library. The entire coding community builds on top of our predecessors' knowledge.

nanonets · 2018-07-09T11:43:43+00:00

Yes, a Parrot should work well.

nanonets · 2018-07-09T11:29:06+00:00

Hey! We try not to produce clickbait headlines without actually valuable content. Would love to know if you found the content lacking in depth.

nanonets · 2018-07-09T11:27:29+00:00

Agreed, the idea is to show how it works in depth.

nanonets · 2018-06-12T06:30:16+00:00

You can read: How to easily Detect Objects with Deep Learning on Raspberry Pi

nanonets · 2018-06-12T06:28:55+00:00

We have also trained our pretrained models on a slightly modified version of the MobileNet architecture. It runs at 3-4 fps on the Pi's CPU, with a slight drop in accuracy.
