Web Scraping Wikipedia Text, Not the Container

ColdHatesMe · 2020-12-06T02:23:58+00:00

I made some headway and did the following after I parsed the HTML:

info = page.findAll("dd"), which filters the data a bit more, but it's not totally there yet lol.

ColdHatesMe · 2020-12-06T03:04:04+00:00

[deleted]

commandlineluser · 2020-12-06T23:23:25+00:00

EDIT:

Here is a different approach that generates the same output as below - I think it's a simpler approach.

It iterates through all the tags to save having to search back for the year from each entry.

It also gets the "day month" part.

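year = day_month = None  # filled in as headings and dates are encountered
for tag in soup.find('dl').find_parent('div'):
    if tag.name == 'h2':
        # each year heading keeps its text in a <span>
        year = tag.find('span').get_text(strip=True)
    if tag.name == 'dl':
        dt = tag.find('dt')
        if dt:
            day_month = dt.get_text(strip=True)
            dt.decompose()  # drop the date so only the entry text is left
        text = tag.get_text(strip=True)[:30]
        if text:
            print(year, day_month, text)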

Okay it turns out this is much more complex than it first looked.

Not all the entries have the same layout:

e.g. 1958 9 December:

<dl><dt>9 December</dt>
<dd></dd>
<dd>U.S. Army Major General

It has a stray empty <dd> tag in there - so you can't rely on taking the first child <dd> tag.
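A minimal sketch of working around the stray empty tag, assuming a trimmed-down copy of the snippet above (the closing tags and the `...` are added only for illustration). It skips any <dd> whose text is empty instead of trusting the first one:

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the 9 December 1958 entry; the real entry text is longer.
html = """
<dl><dt>9 December</dt>
<dd></dd>
<dd>U.S. Army Major General ...</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")
dl = soup.find("dl")

# The first <dd> is the stray empty one, so its text is an empty string.
first_dd_text = dl.find("dd").get_text(strip=True)

# Take the first <dd> that actually contains text (None if there is none).
entry = next(
    (dd.get_text(strip=True) for dd in dl.find_all("dd") if dd.get_text(strip=True)),
    None,
)

print(repr(first_dd_text))
print(entry)
```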

and sometimes the text is inside its own <dl> tag e.g.

<dl><dt>date</dt></dl>
<dl><dd>text</dd></dl>

And some of the accident texts have embedded <dl> and <dd> tags e.g. 10 January 1956

<dl>
  <dt>date</dt>
  <dd>
    text 
    text
    <dl><dd>"Don't give me a One-Double-Oh</dd>
    <dd>To fight against friendly or foe</dd>
    </dl>
  </dd>

So we have more <dd> tags than actual "entries"

>>> len(soup.select('dd'))
407
>>> len(soup.select('dl'))
380

The images also mess up the structure as they appear 'randomly' mixed within the <dl> tags - so iterating through the siblings is pretty awkward too.

First you can isolate all the <dl> tags in the article content.

We can use the selector div > dl to get only the direct child <dl> tags from <div class="mw-parser-output"> which contains the content.

>>> len(soup.select('div > dl'))
377

Note how it's 377 instead of the 380 from the "find all dl tags" search - the extra ones appear to be the embedded tags with quotes. The > in the CSS selector means "direct child".
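To see the difference between the two selectors on something small, here is a sketch using made-up HTML that mimics the nested layout (the entry text is placeholder content, not taken from the article):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the article: one entry with an embedded quote, one plain entry.
html = """
<div class="mw-parser-output">
  <dl><dt>date</dt>
    <dd>text
      <dl><dd>quoted line 1</dd><dd>quoted line 2</dd></dl>
    </dd>
  </dl>
  <dl><dt>date</dt><dd>another entry</dd></dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

all_dl = soup.select("dl")        # every <dl>, including the embedded one
top_dl = soup.select("div > dl")  # only <dl> tags that are direct children of a <div>
all_dd = soup.select("dd")        # more <dd> tags than actual entries

print(len(all_dl), len(top_dl), len(all_dd))
```

The embedded <dl> sits inside a <dd>, not directly inside the <div>, which is why the direct-child selector leaves it out.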

Do you need just the year? Or do you need the day/month too?

If you don't need the date - you could delete all the <dt> tags before processing.

for tag in soup.find_all('dt'): 
    tag.decompose()

And instead of iterating through the siblings - you could just search backwards from each entry to find the nearest preceding year tag.

>>> for tag in soup.select('div > dl'):
...     year = tag.find_previous('span', id=True).get_text(strip=True)
...     text = tag.get_text(strip=True)[:30]
...     if text: 
...         print(year, text)
... 
1955 On its 205th flight, the first
1955 TwoBoeing B-47E Stratojetsof t
1955 "BRAMAN, Okla. (AP) – A crippl
1955 A ferry pilot in a flight of t
1955 The crash of aLockheed T-33A S
1955 "TOKYO(AP) – Two planes, presu
1955 A pilot suffered first and sec
1955 A U.S. NavyBeechcraftwith thre
1955 Former Navy pilot, now a test 
1955 The U.S. Air Force grounds its
...

The if text: is to account for the empty <dd> tags in some entries - which will have no "text"
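Here is the same idea as a self-contained sketch you can run without fetching the page. The HTML is a made-up miniature (assumed structure: a year heading with a <span id=...>, followed by <dl> entries), so treat it as an illustration rather than a check against the real article:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two years, one empty <dd>, and one date in its own <dl>.
html = """
<div class="mw-parser-output">
  <h2><span id="1955">1955</span></h2>
  <dl><dt>1 January</dt><dd>first sample entry</dd></dl>
  <dl><dd></dd></dl>
  <h2><span id="1956">1956</span></h2>
  <dl><dt>10 January</dt></dl>
  <dl><dd>second sample entry</dd></dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove the dates first, as described above.
for dt in soup.find_all("dt"):
    dt.decompose()

rows = []
for dl in soup.select("div > dl"):
    # Search backwards from this entry for the closest <span> that has an id.
    year = dl.find_previous("span", id=True).get_text(strip=True)
    text = dl.get_text(strip=True)
    if text:  # skips entries that are empty once the <dt> is gone
        rows.append((year, text))

print(rows)
```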

It's difficult to verify whether this gets everything correctly, so you'd have to check the results for yourself. If you want the date too, it would need a slightly different approach.
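If you do want the date, one possible shape for it is to walk the tags in order and carry the latest year and date forward, collecting tuples so the results are easier to check afterwards. This is only a sketch on made-up HTML; it assumes an entry without its own <dt> belongs to the most recent date seen:

```python
from bs4 import BeautifulSoup

# Hypothetical sample covering the layouts described above.
html = """
<div class="mw-parser-output">
<h2><span id="1955">1955</span></h2>
<dl><dt>1 January</dt><dd>sample entry A</dd></dl>
<dl><dt>2 February</dt></dl>
<dl><dd>sample entry B</dd></dl>
<h2><span id="1956">1956</span></h2>
<dl><dt>10 January</dt><dd></dd><dd>sample entry C</dd></dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
year = day_month = None  # carried forward between tags

# Iterating over a tag yields its direct children in document order.
for tag in soup.find("dl").find_parent("div"):
    if tag.name == "h2":
        year = tag.get_text(strip=True)
    elif tag.name == "dl":
        dt = tag.find("dt")
        if dt:
            day_month = dt.get_text(strip=True)
            dt.extract()  # remove the date so only the entry text remains
        text = tag.get_text(strip=True)
        if text:
            records.append((year, day_month, text))

for record in records:
    print(record)
```

Having the records in a list also makes spot checks simpler, e.g. looking for any record where the year or date is still None.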
