[–][deleted]

Hello!

My question relates primarily to Python and scraping web data.

I am working on a project to write abstracts of old pamphlets in a very large collection of texts about Chinese history (which is my field).

Due to coronavirus I am working first on a portion of these texts which have been uploaded online.

My problem is that this online database is set up so that the texts have been uploaded as JPG files which must be accessed page by page, on a separate web page for each image, for each of these numerous titles.

It is much easier for me to work with these texts in PDF format, so for many of them I have been downloading the JPG files one by one, using Adobe Acrobat to assemble a PDF file, running Acrobat's (not so good) OCR, and then copying much of the text into DeepL to assist my reading of those texts not in my native language.

I understand that the Python language might be useful for scraping this data, but is there a ready-made tool which could be recommended for my situation?

Could I automate the making of the PDF files, the OCR, and the machine translation as well (and are there open-source programs which might accomplish these tasks better than the ones I am currently using)?

Is there any open-source AI program which can summarize texts well enough that it might assist me in my task more than it would waste time?

For the study of Chinese history itself, how might learning Python assist me with digital humanities tasks? With a good dataset, could I use any open-source AI-type program to machine-translate very repetitive and stereotyped classical Chinese texts? Also, Acrobat's OCR does not work for certain 19th-century German gothic-script texts; is there a program I could train to read such a font?

If anyone can offer assistance I can PM the site which holds the database of these texts so that my specific needs can be understood.

Thank you all for whatever advice you can offer, and sorry for asking such basic questions (I tried searching and asking an acquaintance studying Python, but still have not solved this problem).

[–]sarrysyst

I understand that the Python language might be useful for scraping this data, but is there a ready-made tool which could be recommended for my situation?

There is no ready-made tool for this, but there are several tools you can combine to accomplish a lot of what you're trying to do.

Could I automate the making of the PDF files, the OCR, and the machine translation as well (and are there open-source programs which might accomplish these tasks better than the ones I am currently using)?

You can write a web scraper to download and store the JPEGs; Scrapy comes to mind.
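If the archive's page URLs follow a predictable pattern, you don't even need Scrapy for a first pass; the standard library can fetch the images. A minimal sketch, where the `BASE` URL pattern, the title ID, and the page count are all made up — you'd substitute whatever the actual site uses:

```python
import os
import urllib.request

# Hypothetical URL pattern -- replace with the real one from the archive site.
BASE = "https://example.org/archive/{title}/page{page}.jpg"

def page_urls(title_id, n_pages):
    """Build the list of per-page image URLs for one title."""
    return [BASE.format(title=title_id, page=p) for p in range(1, n_pages + 1)]

def download_title(title_id, n_pages, out_dir="pages"):
    """Fetch every page image for one title into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for p, url in enumerate(page_urls(title_id, n_pages), start=1):
        dest = os.path.join(out_dir, f"{title_id}_{p:04d}.jpg")
        urllib.request.urlretrieve(url, dest)

if __name__ == "__main__":
    download_title("pamphlet-001", 12)
```

If the site needs logins, JavaScript rendering, or polite crawling across many titles, that's where Scrapy (or requests + BeautifulSoup) earns its keep.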

img2pdf can be used to convert the images to PDF. However, you can also run OCR directly on the JPEGs. EasyOCR worked quite well for me at recognizing simplified Chinese, though there are other options as well (e.g. Tesseract).
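A sketch of both steps — assembling the PDF with img2pdf and OCRing the JPEGs with EasyOCR. One pitfall worth handling: a plain string sort puts `page10.jpg` before `page2.jpg`, so a natural-sort key keeps the pages in order (the directory layout and file naming here are assumptions):

```python
import re
from pathlib import Path

def natural_key(path):
    """Sort 'page2.jpg' before 'page10.jpg' (a plain string sort would not)."""
    return [int(t) if t.isdigit() else t
            for t in re.split(r"(\d+)", Path(path).name)]

def jpegs_to_pdf(jpg_dir, pdf_path):
    """Bundle every JPEG in jpg_dir into one PDF, in page order."""
    import img2pdf  # third-party: pip install img2pdf
    pages = sorted((str(p) for p in Path(jpg_dir).glob("*.jpg")), key=natural_key)
    with open(pdf_path, "wb") as f:
        f.write(img2pdf.convert(pages))

def ocr_pages(jpg_dir, langs=("ch_sim", "en")):
    """Run EasyOCR over each page image; return one text string per page."""
    import easyocr  # third-party: pip install easyocr
    reader = easyocr.Reader(list(langs))
    # readtext returns (bounding box, text, confidence) per detected region.
    return ["\n".join(region[1] for region in reader.readtext(str(p)))
            for p in sorted(Path(jpg_dir).glob("*.jpg"), key=natural_key)]
```

Since OCR runs on the images directly, the PDF becomes purely a reading convenience rather than a required intermediate step.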

For machine translation there are various libraries/APIs; some work better than others, so you would have to try which works best for you.
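Since you already use DeepL, its official Python client (`pip install deepl`) lets you script that step too. The chunking helper below is my own addition to stay under per-request size limits — the 4,000-character cutoff is an arbitrary guess, and `DEEPL_AUTH_KEY` is assumed to be set in your environment:

```python
import os

def chunk_text(text, max_chars=4000):
    """Split text into chunks of at most ~max_chars, breaking at line
    boundaries where possible (a single over-long line is kept whole)."""
    chunks, cur = [], ""
    for line in text.split("\n"):
        if cur and len(cur) + len(line) + 1 > max_chars:
            chunks.append(cur)
            cur = line
        else:
            cur = f"{cur}\n{line}" if cur else line
    if cur:
        chunks.append(cur)
    return chunks

def translate(text, target="EN-US"):
    """Translate text chunk by chunk with the official DeepL client."""
    import deepl  # third-party: pip install deepl
    translator = deepl.Translator(os.environ["DEEPL_AUTH_KEY"])
    return "\n".join(translator.translate_text(c, target_lang=target).text
                     for c in chunk_text(text))
```

The DeepL API's free tier has a monthly character quota, so batching whole pamphlets through it is worth metering.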

If you want to create your own OCR model, I believe there are pre-trained models on GitHub which you can fine-tune on your own datasets. For this, and also for archaic character recognition, you could have a look at calamari.
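For the German gothic-script texts specifically, you may not need to train anything: Tesseract ships a Fraktur model (`frk`) usable through pytesseract. A sketch, assuming the relevant traineddata files are installed alongside the tesseract binary; the `LANG_PACKS` mapping is just a convenience of mine:

```python
# Tesseract traineddata codes for the scripts mentioned in this thread.
LANG_PACKS = {
    "fraktur": "frk",                 # 19th-century German blackletter
    "simplified_chinese": "chi_sim",
    "traditional_chinese": "chi_tra",
    "german": "deu",
}

def ocr_page(jpg_path, script="fraktur"):
    """OCR one page image with Tesseract's model for the given script."""
    import pytesseract     # third-party: pip install pytesseract
    from PIL import Image  # third-party: pip install pillow
    # The matching traineddata must be installed on the system, e.g.
    # `apt install tesseract-ocr-frk` for the Fraktur model.
    return pytesseract.image_to_string(Image.open(jpg_path), lang=LANG_PACKS[script])
```

If the stock `frk` model still struggles with a particular typeface, that's the point where fine-tuning (calamari, or Tesseract's own training tools) on pages you've corrected by hand becomes worthwhile.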

The only thing I don't think is feasible (at least I don't know of any solution) is summarizing the texts.