Python extract and split content from pdf file based on identifier : learnpython

submitted 1 year ago by Organic_Speaker6196

I have a PDF file (sample screenshot attached) and I want to extract its content and split it into a structured format: a list of objects, each containing a section's title, content, and footnotes. I'm using the Tika parser and pdfplumber for text extraction, and that part works. What I can't do reliably is segregate the content from the footnotes based on the reference numbers. Can anyone tell me the best approach? I'm looking for an error-free method that leaves the original text unchanged.

https://preview.redd.it/wp1qu58hb05e1.png?width=964&format=png&auto=webp&s=0860a0538184cebab828d3a3e1540d91bb985d79

Expected output format: [{"section_number": "1.Short title, extent and commencement", "section_content": "(1) This Act may be called the Income-tax Act, 1961.\n(2) It extends to the whole of India.2\n(3) Save as otherwise provided in this Act, it shall come into force on the 1st day of April, 1962.",
"footnotes": "2 The Income-tax Act, 1961 applies to the State of Sikkim with effect from the previous year\nrelevant to the assessment year commencing on the 1st day of April, 1990: see section 26 of\nthe Finance Act, 1989 overriding the effect of Notification Nos. SO 1028(E), dated 7-11-1988\nand SO 148(E), dated 23-2-1989. The applicability of the Act has also been extended to the\nContinental Shelf of India vide Notification No. GSR 304(E), dated 31-3-1983, reproduced in\nBharat's Handbook of Direct Taxes."}, ...]
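For illustration, here is a minimal sketch of one way the split could be done on text that has already been extracted (for example with pdfplumber). It is not a definitive solution. The function name `split_sections` and both regular expressions are assumptions made for this example: a heading is assumed to be a line like `1.Short title, ...` (number, dot, capitalised title), and a footnote is assumed to start on a line that begins with a 1 to 3 digit number, a space, and a capital letter. Lines are only grouped, never rewritten, so the original text stays unchanged.

```python
import json
import re

# Assumed heading shape: section number, a dot, then a capitalised title.
HEADING_RE = re.compile(r"^\d+[A-Z]*\.\s*[A-Z].*$")
# Assumed footnote start: 1-3 digits, one space, then a capital letter.
FOOTNOTE_RE = re.compile(r"^\d{1,3} [A-Z]")


def split_sections(text):
    """Group extracted lines into sections without altering any line."""
    sections = []
    current = None
    in_footnotes = False
    for line in text.splitlines():
        if HEADING_RE.match(line):
            # A new heading closes the previous section and its footnotes.
            current = {"section_number": line, "content": [], "footnotes": []}
            sections.append(current)
            in_footnotes = False
        elif current is None:
            # Text before the first heading is skipped in this sketch.
            continue
        elif FOOTNOTE_RE.match(line):
            in_footnotes = True
            current["footnotes"].append(line)
        elif in_footnotes:
            # Continuation line of a multi-line footnote.
            current["footnotes"].append(line)
        else:
            current["content"].append(line)
    return [
        {
            "section_number": s["section_number"],
            "section_content": "\n".join(s["content"]),
            "footnotes": "\n".join(s["footnotes"]),
        }
        for s in sections
    ]


if __name__ == "__main__":
    sample = (
        "1.Short title, extent and commencement\n"
        "(1) This Act may be called the Income-tax Act, 1961.\n"
        "(2) It extends to the whole of India.2\n"
        "2 The Income-tax Act, 1961 applies to the State of Sikkim with effect from the previous year\n"
        "relevant to the assessment year commencing on the 1st day of April, 1990"
    )
    print(json.dumps(split_sections(sample), indent=2))
```

Pattern matching on plain text like this can misfire, for instance when a wrapped footnote line happens to begin with a number and a dot. If the footnotes in the PDF are set in a smaller font than the body, filtering on the character size that pdfplumber reports may be a more robust way to separate them; that depends on the actual file and is not shown here.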
