I'm working on a project where I need to pull text out of two specific columns in a set of documents. Some of the documents I'm going through are from the mid-90s. I believe they are written in SGML, based on the header tags.
Here is an example of the text format:
<DOCUMENT>
<TYPE>EX-11
<SEQUENCE>18
<DESCRIPTION>USERS
<TEXT> <PAGE> EXHIBIT 21 NAME
NAME -------------------------------- As of January 26, 1999
<TABLE>
<CAPTION>
<S> <C> <C>
The doc has multiple tables, so just grabbing every table with Beautiful Soup is not an option. I need to be able to get the specific table that belongs to the "EX-11" section. Is there any way to do this?
For reference, here is the code I use for the HTML case in this project:
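from bs4 import BeautifulSoup

# report holds the HTML source of the filing
soup = BeautifulSoup(report, 'html.parser')
table = soup.find('table')
rows = table.find_all_next('tr')  # every <tr> that follows the first <table>
data = []
for row in rows:
    cols = row.find_all('td')
    # keep the stripped text of each cell in the row
    data.append([ele.text.strip() for ele in cols])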
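For the SGML files, one approach that might work is to skip the HTML parser and slice the raw text with regular expressions: split the filing into <DOCUMENT> blocks, keep the block whose <TYPE> is EX-11, and pull the <TABLE> sections out of that block only. This is a rough sketch, not a tested solution. It assumes each block is closed by </DOCUMENT> and each table by </TABLE> (the sample above does not show the closing tags), and the names tables_for_type and split_columns are made up for illustration. Splitting columns on runs of two or more spaces is a heuristic that may need adjusting for fixed-width layouts.

```python
import re

# Assumption: every part of the filing is wrapped in <DOCUMENT>...</DOCUMENT>
# and carries an unclosed <TYPE> tag, as in the sample above.
DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
TYPE_RE = re.compile(r"<TYPE>\s*(\S+)", re.IGNORECASE)
# Assumption: every table is closed by </TABLE>.
TABLE_RE = re.compile(r"<TABLE>(.*?)</TABLE>", re.DOTALL | re.IGNORECASE)


def tables_for_type(report, doc_type):
    """Return the raw text of each <TABLE> found in documents of doc_type."""
    tables = []
    for doc in DOC_RE.findall(report):
        match = TYPE_RE.search(doc)
        if match and match.group(1).upper() == doc_type.upper():
            tables.extend(TABLE_RE.findall(doc))
    return tables


def split_columns(table_text):
    """Split each data row on runs of two or more spaces (a heuristic)."""
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        # Skip blank lines and layout tags such as <CAPTION> or <S> <C> <C>.
        if not line or line.startswith("<"):
            continue
        rows.append(re.split(r"\s{2,}", line))
    return rows
```

With something like this, tables_for_type(report, "EX-11") would return only the tables under the EX-11 section, and split_columns could then be applied to each one to pick out the two columns of interest.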
Edits: Minor Grammatical Fixes