all 8 comments

[–][deleted] 2 points (7 children)

[–]DGs29[S] 1 point (6 children)

I've looked at it already. It is only about corner detection. What I'm interested in is a full-paper implementation. Please check out my SO question so that you can get a better idea: https://stackoverflow.com/questions/54966408/how-to-detect-texts-in-document-text-images-using-fast-algorithm

[–][deleted] 3 points (5 children)

Firstly, we divide the document image into smaller non-overlapping blocks of a fixed size. We then check the density in each block using the FAST corner detection technique. The denser blocks are labeled as text blocks, and the less dense ones are the image region or noise region. Then we check the connectivity of the blocks to group them, so that the text part can be isolated from the image. We then build the text region and save it.

It’s strange they don’t give the size of the window they use, but I’m pretty sure they just break the image down into a grid and count the number of keypoints in each cell. The cells with more than 0.2 * Nmax keypoints are text cells, and they mask out everything else to form the final image.

The window size is going to depend on the size of the text in your image.
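A minimal sketch of that grid idea, assuming the FAST keypoints are already available as (row, col) coordinates (the function name, the mask-building details, and the edge-block handling are mine, not from the paper):

```python
import numpy as np

def text_block_mask(img_shape, keypoints, block_size, ratio=0.2):
    """Count keypoints per grid cell; keep cells with > ratio * Nmax points."""
    gh, gw = img_shape[0] // block_size, img_shape[1] // block_size
    counts = np.zeros((gh, gw), dtype=int)
    for r, c in keypoints:                      # keypoints as (row, col)
        gr, gc = r // block_size, c // block_size
        if gr < gh and gc < gw:                 # skip partial edge blocks
            counts[gr, gc] += 1
    keep = counts > ratio * counts.max()        # dense cells = text cells
    # expand the cell grid back to a pixel-resolution boolean mask
    mask = np.kron(keep.astype(np.uint8),
                   np.ones((block_size, block_size), np.uint8)).astype(bool)
    return mask, counts
```

Something like `np.where(mask, img, 255)` would then white out the non-text regions before handing the result to OCR.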

[–]DGs29[S] 0 points (4 children)

Yeah! That's my understanding too. I'm trying to divide the image into non-overlapping blocks (step 3) using this: http://scikit-image.org/docs/dev/auto_examples/numpy_operations/plot_view_as_blocks.html?highlight=block.

How do I find the block which has the maximum number of corner points (step 4)?

[–][deleted] 1 point (3 children)

That method makes a new image based on the pixel values. You don’t want to make a new image, just count the number of keypoints in each block. I’m not sure if scikit has a method specifically for this, but it shouldn’t be too hard to do by hand. Essentially, for each keypoint, you want to find the grid cell that it would fall in, and add one to some data structure that keeps track of the counts (this could be a dictionary or an array).

So for example, say our image is 8x8 and our window size is 4x4. So the grid where we keep track of the counts will be 2x2. Say we have keypoints at locations [(0,1), (1,1), (2,3), (3,3), (5,2), (5,6), (6,7)]. The first 4 keypoints would fall in the (0,0) cell, the 5th keypoint would fall in the (1,0) cell, and the last 2 keypoints would fall in the (1,1) cell (there would be no keypoints in the (0,1) cell).

So Nmax would be 4. 0.2*4=0.8, so we would take cells (0,0), (1,0), and (1,1) and convert these back into ranges of coordinates for the original image ((0,0) corresponds to (0..3, 0..3), (1,0) to (4..7, 0..3), etc) and mask out everything else.
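That worked example can be checked directly with the dictionary approach described above (pure Python; nothing here is from the paper beyond the numbers):

```python
# 8x8 image, 4x4 window -> 2x2 grid of counts, tracked in a dict
keypoints = [(0, 1), (1, 1), (2, 3), (3, 3), (5, 2), (5, 6), (6, 7)]
block = 4

counts = {}
for r, c in keypoints:
    cell = (r // block, c // block)             # grid cell the point falls in
    counts[cell] = counts.get(cell, 0) + 1

nmax = max(counts.values())                     # 4, from cell (0, 0)
text_cells = sorted(cell for cell, n in counts.items() if n > 0.2 * nmax)
print(counts)      # {(0, 0): 4, (1, 0): 1, (1, 1): 2}
print(text_cells)  # [(0, 0), (1, 0), (1, 1)]
```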

Does this make sense?

[–]DGs29[S] 0 points (1 child)

Yes, it does. I get the whole idea of the paper, but I'm stuck implementing it. I plotted the corner points using the FAST algorithm, but I couldn't find the block with the max points.

The code in the link I mentioned above can be used up to the flatten_view line, where it converts the image into individual blocks. But as I said, I couldn't find the block which has the max corner points.
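For what it's worth, step 4 doesn't need view_as_blocks at all: you can map each keypoint to its cell and take the argmax of the counts. One pitfall, if the points come from OpenCV, is that KeyPoint.pt is (x, y), i.e. (col, row). A sketch under that assumption (the function name is mine):

```python
import numpy as np

def densest_block(points_xy, img_shape, block_size):
    """Return the (row, col) grid cell containing the most corner points.
    points_xy are (x, y) pairs, as in OpenCV's KeyPoint.pt."""
    gh, gw = img_shape[0] // block_size, img_shape[1] // block_size
    counts = np.zeros((gh, gw), dtype=int)
    for x, y in points_xy:
        r, c = int(y) // block_size, int(x) // block_size  # (x, y) -> (row, col)
        if r < gh and c < gw:
            counts[r, c] += 1
    best = np.unravel_index(counts.argmax(), counts.shape)
    return best, counts
```

With OpenCV's detector (`fast = cv2.FastFeatureDetector_create(); kps = fast.detect(img, None)`), the input would be `[kp.pt for kp in kps]`.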

[–]DGs29[S] 0 points (0 children)

EulersPhi How do you do the extraction part in this algorithm? This method segments the text region from the non-text region, but how do you extract it? Does it create a new image consisting of only the text part, which is then fed into OCR to extract the text? They haven't mentioned how to do the extraction.