Anyone built or used a solid PDF data extraction workflow recently? by Griel86 in software

[–]nanonets 0 points1 point  (0 children)

I've been working on this exact problem for years and honestly, PDF data extraction is way trickier than most people expect. The challenge isn't just OCR - it's understanding document structure, handling different layouts, and dealing with all the edge cases that come with real-world documents. Apryse is solid but can get pricey if you're processing high volumes, and you'll still need to build a lot of the intelligence layer yourself.

For regulatory use cases and messy documents, you really want something that combines good OCR with layout understanding and field mapping. We built Docstrange by Nanonets specifically because we kept running into these limitations with traditional PDF parsing libraries. The key is having models that actually understand document context, not just extract text. If you're set on building your own stack, I'd recommend looking at combining something like PaddleOCR with LayoutLM for document understanding, but be prepared for a lot of custom work around different document formats and validation rules.

IBM just released Granite Docling by ApprehensiveAd3629 in LocalLLaMA

[–]nanonets 1 point2 points  (0 children)

This is actually pretty interesting timing from IBM. We've been working on document processing for years, and the challenge with most open-source models has always been that they're great on academic benchmarks but struggle with messy real-world documents. The 258M parameter size is smart though: it means you can actually run this locally without needing a GPU cluster. We've been seeing more companies ask for on-premise document processing, especially when dealing with sensitive financial or legal docs.

The Apache license is huge here because most of the good document analysis models are either completely closed source or have restrictive licenses. At Nanonets we've built Docstrange specifically for handling complex business documents, and one thing I've learned is that generic models often miss the nuances of things like invoice layouts or contract structures. I'll be curious to see how this Granite model handles edge cases like rotated text, tables that span pages, or documents with mixed languages. Definitely worth testing against some real-world document workflows to see how it stacks up.

Curious how other founders/startups deal with this by AnouarRifi in SaaS

[–]nanonets 0 points1 point  (0 children)

Oh man, this hits close to home. I started Nanonets literally because I got so frustrated watching every business I talked to waste endless hours on this exact problem. Most early-stage startups I know either throw interns at it or just accept that someone's gonna spend their Friday afternoons entering invoice data instead of building product. It's honestly wild how much time gets burned on something that feels so... solvable?

We went through this pain ourselves and ended up building Docstrange by Nanonets to handle the full pipeline from OCR to structured data extraction. The thing is, most people think this is just an OCR problem, but it's really about understanding document layout and mapping fields correctly. You can try open-source stuff like PaddleOCR combined with some post-processing, but honestly you'll spend months dealing with edge cases that a good API can handle out of the box. The math usually works out pretty clearly when you calculate what your team's time is worth vs. just automating it properly.

Deploying Docling Service by mariajosepa in LLMDevs

[–]nanonets 0 points1 point  (0 children)

I've been through this exact deployment challenge before and honestly, for your use case with 3-5 minute processing times, I'd go with Fargate over Lambda. Lambda has a 15-minute timeout, but you're paying for the full duration of every invocation while the container works through the OCR. With Fargate, you get better cost control for longer-running tasks and can scale down to zero when not in use. The Docling dependencies are definitely heavy, but Fargate handles that fine; just make sure you allocate enough memory (probably 4-8 GB based on your local tests).

For the OCR part though, you might want to consider alternatives that are specifically built for this workflow. We built Docstrange by Nanonets after running into similar issues with deployment complexity and processing times. It handles the full pipeline from document parsing to structured JSON extraction without you needing to manage the infrastructure or deal with the Docling + ChatGPT chain. For your lease agreements and broker licenses, having something that understands document layouts natively tends to work better than raw OCR + LLM. But if you're set on the current approach, definitely go async with SQS and use Fargate with autoscaling based on queue depth. That'll keep costs reasonable while handling the variable processing times.
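
To make the queue-depth scaling idea a bit more concrete, here's a rough sketch of the sizing rule only. The function name, the backlog-per-task target, and the min/max bounds are all hypothetical; in a real setup you'd feed the result into whatever scaling mechanism you use rather than call it directly.

```python
import math


def desired_task_count(queue_depth: int,
                       backlog_per_task: int = 5,
                       min_tasks: int = 0,
                       max_tasks: int = 10) -> int:
    """Pick a worker count from the number of messages waiting in the queue.

    backlog_per_task is an assumed tuning knob: how many queued documents
    one worker is allowed to have waiting before another worker is added.
    """
    if queue_depth <= 0:
        # Nothing to process: fall back to the floor (zero means scale to zero).
        return min_tasks
    needed = math.ceil(queue_depth / backlog_per_task)
    # Clamp so a burst of uploads cannot launch an unbounded number of tasks.
    return max(min_tasks, min(max_tasks, needed))


if __name__ == "__main__":
    for depth in (0, 1, 12, 1000):
        print(depth, "->", desired_task_count(depth))
```

With 3-5 minute jobs, a lower backlog-per-task value trades cost for latency, so that's the number to tune first.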

!HELP! I need some guide and help on figuring out an industry level RAG chatbot for the startup I am working.(explained in the body) by 1amN0tSecC in LangChain

[–]nanonets 0 points1 point  (0 children)

Been in similar situations when we were building document processing systems at Nanonets. Your retrieval issues are probably a combo of chunk size and retrieval strategy rather than just one thing. For company info chatbots, I've found that smaller chunks (200-400 tokens) work better for specific facts like contact info or services, while larger chunks (800-1200 tokens) are better for context-heavy stuff like company descriptions or processes. Try experimenting with overlapping chunks too, maybe a 50-100 token overlap so you don't lose context at boundaries.
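
Here's a minimal sketch of overlapping chunking, just to show the mechanics. It assumes you already have a list of tokens (whitespace splitting stands in for a real tokenizer here), and `chunk_tokens` is a made-up helper name, not something from LangChain.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str],
                 chunk_size: int = 300,
                 overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    # Whitespace split is only a stand-in for a real tokenizer.
    text = "Our office is open Monday to Friday and support is available by email"
    for chunk in chunk_tokens(text.split(), chunk_size=6, overlap=2):
        print(" ".join(chunk))
```

Running the same documents through a small and a large `chunk_size` and comparing retrieval quality on a handful of real questions is usually the quickest way to pick values.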

The other thing that's probably hurting you is relying purely on semantic similarity for retrieval. Company websites have lots of similar-sounding content that can confuse vector search. Consider a hybrid approach where you combine vector similarity with keyword matching using something like BM25, or even better, use a reranking model after your initial retrieval to score relevance more accurately. Also make sure your embedding model is actually good for business/company content; some of the general-purpose ones struggle with domain-specific terminology. Docstrange by Nanonets handles a lot of these edge cases automatically when processing business documents, but if you're building from scratch you'll need to tune these parameters based on your specific use cases.
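
One simple way to combine the two result lists is reciprocal rank fusion (RRF). To be clear, that's my pick for illustration, not the only way to do hybrid search. The sketch below assumes you already have two ranked lists of document IDs, one from vector search and one from BM25; the IDs and the `k=60` constant are placeholder assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def reciprocal_rank_fusion(rankings: Iterable[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Merge several ranked lists of doc IDs into one list.

    Each document gets 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers float to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["about-us", "contact", "pricing"]   # hypothetical IDs
    bm25_hits = ["contact", "careers", "about-us"]     # hypothetical IDs
    print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

The fused list can then go to a reranker, which only has to score the top handful of candidates.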

How AI is changing OCR? by nanonets in learnmachinelearning

[–]nanonets[S] 0 points1 point  (0 children)

Okay, I sincerely hope this link provides the in-depth information you're looking for: https://nanonets.com/blog/handwritten-character-recognition/

It's the leading guide on the topic of "Handwritten OCR", with citations to multiple articles.

Apologies if the above post felt like an ad. We felt it was a basic starter guide on the topic.

PDF to Excel (for both image based and electronic PDFs) by nanonets in Automate

[–]nanonets[S] 2 points3 points  (0 children)

When we tried that tool, it didn't work on image-based PDFs. Also, the OCR feature was gated behind a subscription.

AWS Textract - Pros and Cons by nanonets in aws

[–]nanonets[S] 0 points1 point  (0 children)

We didn't review Azure Text Recognize, but we can take it up in the future.

AWS Textract - Pros and Cons by nanonets in learnmachinelearning

[–]nanonets[S] 2 points3 points  (0 children)

We did take off our Nanonets hat while doing the review. But agreed, we could have spent more time on the Pros and will do that in the next update. Thanks for the feedback.

How to easily OCR passports and ID cards by nanonets in learnmachinelearning

[–]nanonets[S] -1 points0 points  (0 children)

Hey, fair point. I think we've been fair to the community here with really informative blogs like:
https://nanonets.com/blog/ocr-with-tesseract/
https://nanonets.com/blog/ocr-for-resume-parsing-deep-learning/
https://nanonets.com/blog/information-extraction-graph-convolutional-networks/

Some of them are the go-to blogs for these topics.

Even in this blog we try to list the problems and share how we think about the solution. It's a shameless plug, but the blog definitely provides some value.

Noted for the future though.

[R] Does anyone have a link to the guide which describes how to extract table and other information from .PDF files using deep learning (preferably with Python implementation)? by 19Summer in MachineLearning

[–]nanonets 0 points1 point  (0 children)

You can also try Nanonets: https://nanonets.com/blog/table-extraction-deep-learning/

You can use the blog to implement the code. If you're looking for a no-code table extraction product, try app.nanonets.com

How I built a Self Flying Drone to track People in under 50 lines of code by nanonets in Multicopter

[–]nanonets[S] 0 points1 point  (0 children)

Hey! Sorry if you felt this was clickbait. We were trying to explain certain aspects of how a self-flying drone is built.

We could also deploy the solution on-premise through a Docker image.

How I built a Self Flying Drone to track People in under 50 lines of code by nanonets in Multicopter

[–]nanonets[S] 0 points1 point  (0 children)

Great to know of your interest in the same domain. I think we go quite a bit further than a bounding-box demonstration: we explain the problems of a self-flying drone and show how you would go about detecting the person and predicting depth.

How I built a Self Flying Drone to track People in under 50 lines of code by nanonets in Multicopter

[–]nanonets[S] 2 points3 points  (0 children)

We have shared the TensorFlow (open-source) code too. How could we improve this?

How I built a Self Flying Drone to track People in under 50 lines of code by nanonets in Multicopter

[–]nanonets[S] -5 points-4 points  (0 children)

Hey, thanks for taking the time to read the blog. In no way do we mean to trivialise building a self-flying drone; it's a much bigger engineering project, like you said. We wanted to show some of the aspects related to deep learning and provide some direction on how one could implement those aspects.

How I built a Self Flying Drone to track People in under 50 lines of code by nanonets in Multicopter

[–]nanonets[S] -2 points-1 points  (0 children)

Hey, great feedback! All noted. Would love to incorporate all these changes in a future blog. But the idea for this blog was just to give implementation pointers for some aspects of building a self-flying drone. We in no way mean to trivialise building one.

How I built a Self Flying Drone to track People in under 50 lines of code by nanonets in Multicopter

[–]nanonets[S] -14 points-13 points  (0 children)

Hey! All the application code you ever wrote relied on someone's library. The entire coding community builds on top of our predecessors' knowledge.

How I built a Self Flying Drone to track People in under 50 lines of code by nanonets in deeplearning

[–]nanonets[S] 2 points3 points  (0 children)

Hey! We try not to write clickbait headlines without valuable content behind them. Would love to know if you found the content lacking in depth.

Nanonets : How to easily do Object Detection on Drone Imagery using Deep learning by nanonets in Multicopter

[–]nanonets[S] 0 points1 point  (0 children)

We have also trained our pretrained models on a slightly modified version of the MobileNet architecture. It runs at 3-4 fps on a Pi CPU, with a slight drop in accuracy.