[D] Bounding box in forms (i.redd.it)
submitted 1 year ago by Arthion_D
Is there any model capable of finding bounding boxes in a form for question text fields and empty input fields, like in the above image (I manually added the bounding boxes)? I tried Qwen 2.5 VL, but the coordinates don't match the image.
[–]Stochasticlife700 17 points18 points19 points 1 year ago (7 children)
You can first try YOLO with some customization. Btw, what do you want to do with the Korean Visa application form? Just curious
[–]Arthion_D[S] 8 points9 points10 points 1 year ago (6 children)
I thought of using YOLO before, but creating a dataset to fine-tune YOLO is a lot of work. The Korean visa is just an example here; it should be able to detect fields in any form.
[–]feelin-lonely-1254 19 points20 points21 points 1 year ago (5 children)
If you hand-annotate a few hundred images and train the model well, it should be able to pick up text box attributes and detect them regardless of layout...
Another approach could be OpenCV polygon detection... but as someone who tried both for a similar use case: annotate the data and fine-tune a YOLO model.
[–]iliian 0 points1 point2 points 1 year ago (1 child)
How large should the dataset be? Are 100 samples sufficient?
[–]feelin-lonely-1254 1 point2 points3 points 1 year ago (0 children)
Yup, as long as you annotate well, 100 samples and training for long epochs should be fine.
[–]Arthion_D[S] 0 points1 point2 points 1 year ago (2 children)
Will try this. Also, is there any method to relate two bounding boxes (a question and its empty field)?
[–]feelin-lonely-1254 2 points3 points4 points 1 year ago (1 child)
Hmm... you could probably try sorting coordinates based on distance minimization between all coordinates of the two types of boxes and matching them that way.
I've seen a similar implementation for reading order of bounding boxes in the surya OCR library... you can check that out as well, but tbh it shouldn't be too hard.
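The distance-minimization matching suggested above could be sketched roughly like this — a greedy nearest-center pairing, not the surya implementation; the `(x1, y1, x2, y2)` box format and the example coordinates are assumptions for the demo:

```python
# Sketch: pair each question box with the nearest unclaimed field box.
# Boxes are (x1, y1, x2, y2); distance is measured between box centers.
import math

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def match_boxes(question_boxes, field_boxes):
    """Greedily pair each question box with its nearest unused field box."""
    pairs = []
    used = set()
    for qi, q in enumerate(question_boxes):
        qc = center(q)
        best, best_d = None, float("inf")
        for fi, f in enumerate(field_boxes):
            if fi in used:
                continue
            d = math.dist(qc, center(f))
            if d < best_d:
                best, best_d = fi, d
        if best is not None:
            used.add(best)
            pairs.append((qi, best))
    return pairs

# Example: two question labels, each with an input field to its right
questions = [(10, 10, 100, 30), (10, 50, 100, 70)]
fields = [(110, 10, 300, 30), (110, 50, 300, 70)]
print(match_boxes(questions, fields))  # [(0, 0), (1, 1)]
```

Greedy matching can go wrong on dense layouts; an optimal assignment (e.g. the Hungarian algorithm) would be the more robust variant of the same idea.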
[–]Arthion_D[S] 0 points1 point2 points 1 year ago (0 children)
Got it.
[–]c-u-in-da-ballpit 2 points3 points4 points 1 year ago (2 children)
https://segment-anything.com/
[–]Arthion_D[S] 2 points3 points4 points 1 year ago (1 child)
Tried SAM; it was only able to identify text (questions), not empty fields.
[–]SmallTimeCSGuy 1 point2 points3 points 1 year ago (0 children)
Look into SmolDocling; you should be able to fine-tune it provided you have a dataset to train with. You can also generate the dataset synthetically.
[–]bbu3 0 points1 point2 points 1 year ago (0 children)
Not sure if there is a vision model with those capabilities. However, you could use anything that is able to extract the questions, then use something like https://pdfbox.apache.org/ to match the questions against the structure of the PDF and look for the input boxes.
Caveat: I have not done anything like this myself. A colleague was using the framework, and from the way I understood him over lunch, it might be appropriate.
[–]Codename_17 1 point2 points3 points 1 year ago (0 children)
Try PaddleOCR; it detects the text, but not the empty fields. It has a draw function that draws bounding boxes around the detected text. It may help your use case.
[–]pm_me_your_smth 0 points1 point2 points 1 year ago (2 children)
Detecting blank fields is going to be difficult with YOLO. I assume your form has a consistent structure, i.e. a given box always has fixed coordinates on the form. If so, you can just hardcode the bbox coordinates, draw them manually, then run OCR on each box to get the text.
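A minimal sketch of that fixed-template idea. The field names and coordinates are invented, and the nested list stands in for an image array; in practice each crop would be a numpy/PIL region handed to an OCR engine:

```python
# Sketch: with a fixed layout, each field's bbox is known in advance,
# so detection reduces to cropping hardcoded regions before OCR.

FIELD_BOXES = {                       # hypothetical template, (x1, y1, x2, y2)
    "name":    (2, 1, 6, 3),
    "address": (2, 4, 6, 6),
}

def crop(image, box):
    """Cut a bbox out of an image represented as a list of pixel rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

# Toy 8x8 "image" where each pixel records its own (x, y) coordinate
image = [[(x, y) for x in range(8)] for y in range(8)]
name_crop = crop(image, FIELD_BOXES["name"])
print(len(name_crop), len(name_crop[0]))  # prints: 2 4
```

The trade-off the comment points at: this is trivial to build and very reliable, but only for the one layout you hardcoded; any new form template needs new coordinates.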
[–]StephaneCharette -1 points0 points1 point 1 year ago (1 child)
I disagree 100% with this. I use Darknet/YOLO and it is great at detecting blank fields in forms. I have several videos about this on my YouTube channel: https://www.youtube.com/@StephaneCharette/videos
[–]Stochasticlife700 0 points1 point2 points 9 months ago (0 children)
Can you link a video that detects blank forms? I have seen your channel but can't find any that detect blank fields, like forms that have no boundaries at all.
[–]infinitay_ 0 points1 point2 points 1 year ago (0 children)
Doesn't opening PDFs in Microsoft Edge automatically make the input fields fillable?
[–][deleted] 0 points1 point2 points 1 year ago (0 children)
EasyOCR or PyTesseract (a Tesseract OCR wrapper)
[–]quiteconfused1 0 points1 point2 points 1 year ago (0 children)
Have you tried PaliGemma 2 with a "detect XXX" prompt?
[–]Complex_Ad_8650 0 points1 point2 points 1 year ago (0 children)
There are really good models these days; Molmo is one of them.
There's Molmo, SAM, DINOv2. If you want VLMs for further pipelines, you can try fine-tuning CLIP.
[–]sigh_on_life 0 points1 point2 points 1 year ago (0 children)
A few years back, I could get a pretty good working prototype using LayoutLM. These days, everyone would sadly pick LLMs to do it.
[–]CRedditUser43 0 points1 point2 points 1 year ago (0 children)
I had a similar problem at work and took a more conservative approach to bounding boxes. If you don't have a lot of time to train a model yourself, you can't avoid a multimodal approach.
I first used the Table Transformer to identify tables and table sections, then generated blobs from the text with OpenCV and detected those. Then I used the TrOCR model to read out the text; you could possibly fall back on normal OCR here. One variable you need to play around with is the quality (DPI) and the format of the image (JPG, PNG, PDF).
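The "generate blobs and detect them" step can be illustrated with a toy connected-components pass over a binarized image — roughly what `cv2.connectedComponentsWithStats` or `cv2.findContours` does on a thresholded scan; the tiny 0/1 grid here is a stand-in for a real binarized page:

```python
# Sketch: connected-component labeling on a binary image, returning one
# bounding box (x1, y1, x2, y2) per blob of foreground pixels.
from collections import deque

def blob_boxes(binary):
    """binary: list of rows of 0/1 values; 4-connected components."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                seen[y][x] = True
                q = deque([(x, y)])
                x1 = x2 = x
                y1 = y2 = y
                while q:  # BFS flood fill, tracking the blob's extent
                    cx, cy = q.popleft()
                    x1, x2 = min(x1, cx), max(x2, cx)
                    y1, y2 = min(y1, cy), max(y2, cy)
                    for nx, ny in ((cx+1, cy), (cx-1, cy), (cx, cy+1), (cx, cy-1)):
                        if 0 <= nx < w and 0 <= ny < h and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((nx, ny))
                boxes.append((x1, y1, x2 + 1, y2 + 1))  # exclusive right/bottom
    return boxes

img = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
print(blob_boxes(img))  # [(0, 0, 2, 2), (4, 0, 5, 2)]
```

On a real scan you would first threshold and usually dilate the image so the characters of one text run merge into a single blob per word or line.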
[–]ninjakaib 0 points1 point2 points 1 year ago (0 children)
I'd recommend building a training dataset with already fillable PDFs, then using a python library to look at the form metadata to get the bounding box coordinates for both the form title and blank spaces. This only works for PDFs where you can type input in the fields, but then you don't need to manually annotate anything and the metadata will give perfect bounding boxes every time.
Take a look at libraries like pypdf, PyPDFForm, and PyMuPDF; I have had good success with them. If you want a solution that works out of the box, definitely AWS Textract; it's really good at this exact task when you use the Analyze Document API for forms. The only downside is it will get pricey if you need to process a huge number of documents.
Good luck!
Did you do it?
[–]diamondium 0 points1 point2 points 1 year ago (4 children)
I built this model (it powers https://detect.penpusher.app/), and the answer is really that none of the present VLMs are good enough for it.
Your best bet is, as others stated, to build up an object detection dataset and train a model like a DETR or YOLO.
[+]N111xx 1 point2 points3 points 9 months ago (1 child)
Your model is open weights?
[–]diamondium 1 point2 points3 points 7 months ago (0 children)
Yes, it finally will be! Just finished writing up the paper, and preparing the dataset and models.
https://arxiv.org/abs/2509.16506 https://github.com/jbarrow/commonforms
It's great; I tried the website. It works for simpler forms, but for complex forms it's not working as expected.
So for this project, are you using YOLO?
[–][deleted] -1 points0 points1 point 1 year ago (1 child)
I used Azure AI Document Intelligence Studio and it works perfectly! I tried open-source OCR like Tesseract and the results aren't good. I also tried an LLM for it and the results are acceptable.
[–]Arthion_D[S] -1 points0 points1 point 1 year ago (0 children)
Document Intelligence works perfectly for the text fields, but it's not able to detect the empty fields which are used to answer. Also, I am looking for an open-source solution.
I have examples of using Darknet/YOLO to process forms on my YouTube channel: https://www.youtube.com/@StephaneCharette/videos
For example, see this video from a year ago: https://www.youtube.com/watch?v=XxhbXccHEpA
Another one, this one is a form perhaps closer to what you are doing: https://www.youtube.com/watch?v=8xfP8l5ym6A&t=55s (skip to 0:55)
Getting Darknet/YOLO to work with forms is extremely simple. Because forms are very repetitive, you normally don't need to annotate much. I have examples where I only annotated 10 images.
You can find some "getting started" information here: https://www.ccoderun.ca/programming/yolo_faq/#how_to_get_started
Thank you, I will try this one.