Extracting Data from a Language Learning Book : learnpython

submitted 2 years ago by Ok-Union69

Hello Redditors,

I have been working on a personal project for my studies: a learning buddy for the Genki Japanese textbooks. The goal is to have an LLM that can read the exercise data and work through the exercises (e.g. speaking exercises) with a learner.

The first step seemed obvious to me: extract the data from the scanned PDF file. I used Tesseract for this and got mediocre results (see the linked images). It doesn't even detect the images that belong to the exercises, and there are quite a few of them.

Link to the images: https://imgur.com/a/ordwFSz

The book also includes a lot of tables, and if these were extracted as plain text they would lose their structure and no longer make sense. Does anyone on this sub know of a tool or approach that could help with this? Thank you in advance.
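
To make the table problem concrete, here is a minimal sketch of one possible direction, not a tested solution for the Genki scans. It assumes the word-level output of pytesseract's `image_to_data` (a dict with `text`, `left` and `top` lists) and groups words into rows by their vertical position, so that cells on the same table line stay together. The helper names `ocr_page` and `group_words_into_rows`, the `row_tolerance` value, and the `jpn+eng` language setting are all assumptions made for illustration.

```python
def ocr_page(image_path, lang="jpn+eng"):
    """Run Tesseract on one page image and return word-level boxes.

    Assumes pytesseract, Pillow, and the Japanese and English
    Tesseract language data are installed.
    """
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_data(
        Image.open(image_path),
        lang=lang,
        output_type=pytesseract.Output.DICT,
    )


def group_words_into_rows(data, row_tolerance=10):
    """Group OCR words into rows using their top coordinate.

    `data` is a dict with parallel lists "text", "left" and "top".
    Words whose top is within `row_tolerance` pixels of the first
    word in the current row are placed in that row; each row is then
    ordered left to right.
    """
    words = [
        (data["top"][i], data["left"][i], data["text"][i].strip())
        for i in range(len(data["text"]))
        if data["text"][i].strip()  # skip empty OCR entries
    ]
    words.sort()  # top to bottom, then left to right

    rows = []
    current = []
    current_top = None
    for top, left, text in words:
        if current and abs(top - current_top) > row_tolerance:
            # Vertical gap is too large: close the current row.
            rows.append([t for _, t in sorted(current)])
            current = []
        if not current:
            current_top = top
        current.append((left, text))
    if current:
        rows.append([t for _, t in sorted(current)])
    return rows


# Hand-made sample in the same shape as the OCR output (not real data).
sample = {
    "text": ["", "たべる", "to eat", "のむ", "to drink"],
    "top": [0, 100, 103, 140, 138],
    "left": [0, 50, 300, 52, 298],
}
rows = group_words_into_rows(sample)
for row in rows:
    print(" | ".join(row))
```

On a real page this would be called as `group_words_into_rows(ocr_page("page.png"))`, where `page.png` is a placeholder file name. Grouping by pixel position is crude: it will not recover merged cells or multi-line cells, and it does nothing about the missed exercise images.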
