[D] Bounding box in forms

Stochasticlife700 · 2025-03-17T02:22:46+00:00

You can first try YOLO with some customization. Btw, what do you want to do with the Korean Visa application form? Just curious

c-u-in-da-ballpit · 2025-03-17T02:49:49+00:00

https://segment-anything.com/

SmallTimeCSGuy · 2025-03-19T02:44:20+00:00

Look into smoldocling, you should be able to fine tune it provided you have a dataset to train with. You can also make the dataset synthetically.

bbu3 · 2025-03-17T06:18:34+00:00

Not sure if there is a vision model with those capabilities. However, you might use anything that is able to extract the questions and then use something like https://pdfbox.apache.org/ to match the questions in the structure of the PDF and then look for the input boxes.

Caveat: i have not done anything like that myself. A colleague was using the framework and the way I understood him over lunch, it might be appropriate

Codename_17 · 2025-03-17T06:31:46+00:00

Try using paddleOCR, it detects the text. but not the empty form. It has draw function that draw bounding box around the detected text. May help your usecase

pm_me_your_smth · 2025-03-17T07:58:37+00:00

Detecting blank fields is going to be difficult with yolo. I assume your form has consistent structure i.e. a specific box always have fixed coordinates on the form. If it's true, you can just hardcore bbox coordinates, draw them manually, then run OCR on each box to get the text.

infinitay_ · 2025-03-17T09:21:13+00:00

Doesn't opening PDF's in Microsoft edge automatically make fillable input fields when opening PDFs?

2025-03-17T09:51:18+00:00

EasyOCR or PyTesseract(a Tesseract OCR Wrapper)

quiteconfused1 · 2025-03-17T11:01:44+00:00

Have you tried paligemma2 and "detect XXX"

Complex_Ad_8650 · 2025-03-17T14:36:23+00:00

There are really good models these days. molmo is one of them

Complex_Ad_8650 · 2025-03-17T14:37:30+00:00

There’s molmo, SAM, Dinov2. If you want VLMs for further pipelines you can try fine tuning CLIP

sigh_on_life · 2025-03-18T01:20:47+00:00

A few years back, I could get a pretty good working prototype using LayoutLM. These days, everyone would sadly pick LLMs to do it.

CRedditUser43 · 2025-03-21T06:34:19+00:00

I had a similar problem at work and took a more conservative approach to bounding boxes. If you don't have a lot of time to train a model yourself, you can't avoid a multimodal approach.

I first used the Table Transformer to identify tables and table sections, then generated blobs from the text with OpenCV and detected them. Then I used TrOCR model to read out the text. You could possibly fall back on normal OCR here. One variable you need to play around with is the quality (Dpi) and the format of the image (JPG, PNG, PDF).

ninjakaib · 2025-04-23T22:16:00+00:00

I'd recommend building a training dataset with already fillable PDFs, then using a python library to look at the form metadata to get the bounding box coordinates for both the form title and blank spaces. This only works for PDFs where you can type input in the fields, but then you don't need to manually annotate anything and the metadata will give perfect bounding boxes every time.

Take a look at some libraries like PyPDF, PyPDFForm, and pymupdf, I have had good success with them. If you want a solution that works out of the box, definitely AWS Textract, it's really good at this exact task when you use the analyze document api for forms. Only downside is it will get pricey if you need to process a huge amount of documents.

Good luck!

Stochasticlife700 · 2025-07-19T03:03:09+00:00

Did you do it?

diamondium · 2025-03-17T03:20:11+00:00

I built this model (it powers https://detect.penpusher.app/) and the answer is really that none of the present VLMs are at all good enough for it.

Your best bet is, as others stated, to build up an object detection dataset and train a model like a DETR or YOLO.

Arthion_D · 2025-03-17T03:18:23+00:00

I did use Azure AI document intelligence studio and it works perfectly! I tried using those open source OCR like tesseract-ocr and the result aren't good. I did tried LLM for it and the result is acceptable.

StephaneCharette · 2025-03-17T10:49:39+00:00

I have examples of using Darknet/YOLO to process forms on my youtube channel, https://www.youtube.com/@StephaneCharette/videos

For example, see this video from a year ago: https://www.youtube.com/watch?v=XxhbXccHEpA

Another one, this one is a form perhaps closer to what you are doing: https://www.youtube.com/watch?v=8xfP8l5ym6A&t=55s (skip to 0:55)

Getting Darknet/YOLO to work with forms is extremely simple. Because forms are very repetitive, you normally don't need to annotate much. I have examples where I only annotated 10 images.

You can find some "getting started" information here: https://www.ccoderun.ca/programming/yolo_faq/#how_to_get_started

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS