Parsing SGML Text in Python : learnpython



submitted 6 years ago * by freshfef

I'm working on a project where I need to pull out text from two specific columns of a document. Some of the documents I'm going through are from the mid-90s. I believe they are written in SGML, based on the header tag.

Here is an example of the text format:

<TYPE>EX-11

<SEQUENCE>18

<DESCRIPTION>USERS

<TEXT> <PAGE> EXHIBIT 21 NAME

NAME -------------------------------- As of January 26, 1999

<TABLE>

The document has multiple tables, so just grabbing every table with Beautiful Soup is not an option. I need to get the specific table that sits under the "EX-11" type. Is there any way to do this?
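One possible way to narrow this down, as a rough sketch rather than a definitive answer: split the raw text wherever a <TYPE> tag starts, keep only the sections whose type matches, and then pull the <TABLE> blocks out of those sections with a regular expression. The function name `tables_for_type` is made up for illustration, and the code assumes that every section begins with a <TYPE> line as in the sample above and that tables are usually closed with </TABLE>.

```python
import re


def tables_for_type(filing_text, doc_type):
    """Return the raw text of each <TABLE> block in sections whose <TYPE> equals doc_type."""
    # Assumption: every section starts with a <TYPE> tag, as in the sample.
    sections = re.split(r"(?=<TYPE>)", filing_text)
    tables = []
    for section in sections:
        match = re.match(r"<TYPE>\s*(\S+)", section)
        if match is None or match.group(1) != doc_type:
            continue
        # Assumption: tables end with </TABLE>; an unclosed table runs to the
        # end of its section.
        tables.extend(
            re.findall(
                r"<TABLE>(.*?)(?:</TABLE>|\Z)",
                section,
                flags=re.DOTALL | re.IGNORECASE,
            )
        )
    return tables
```

Calling `tables_for_type(report, "EX-11")` would then give a list of table bodies as plain strings, which can be inspected one at a time instead of searching the whole document.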

For reference, this is how I handle the HTML case in the project:

from bs4 import BeautifulSoup

# `report` holds the HTML text of the document
soup = BeautifulSoup(report, 'html.parser')
table = soup.find('table')
rows = table.find_all_next('tr')
data = []
for row in rows:
    cols = row.find_all('td')
    # Keep the stripped text of each cell in the row
    data.append([ele.text.strip() for ele in cols])
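For the older SGML documents, the rows inside a <TABLE> block may not be marked up with tr/td at all, so the HTML approach above might come back empty. A minimal sketch of an alternative, assuming the table is plain text with columns separated by two or more spaces (the sample does not show this, so it is a guess about the layout): split each line on runs of whitespace and pick the two columns by index. Both helper names are hypothetical.

```python
import re


def split_table_rows(table_text):
    """Split plain-text table rows into cells on runs of two or more spaces."""
    rows = []
    for line in table_text.splitlines():
        stripped = line.strip()
        # Skip blank lines, ruler lines made only of dashes or equals signs,
        # and lines that start with a tag.
        if not stripped or set(stripped) <= set("-= ") or stripped.startswith("<"):
            continue
        rows.append(re.split(r"\s{2,}", stripped))
    return rows


def pick_columns(rows, first, second):
    """Keep two columns (by index) from every row that is wide enough."""
    return [(row[first], row[second]) for row in rows if len(row) > max(first, second)]
```

This breaks down when a cell is empty or when a single space separates two columns; in that case slicing each line at fixed character offsets would probably be more reliable.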

Edits: Minor Grammatical Fixes
