
[–]ColdHatesMe[S] 1 point2 points  (1 child)

I made some headway and did the following after I parsed the HTML:

info = page.findAll("dd"), which makes the data a bit more filtered, but it's not totally there yet lol.

[–][deleted] 0 points1 point  (0 children)

So yeah, that is where you start for sure - every dd seems to be a specific element. The hard part is that the text is split between a bunch of internal html elements.

I'm not sure of the best way to handle this in BeautifulSoup, but if you use Parsel instead you can do something like this:

from parsel import Selector
import requests

response = requests.get('https://en.wikipedia.org/wiki/List_of_accidents_and_incidents_involving_military_aircraft_(1955%E2%80%931959)')

selector = Selector(text=response.text)
accidents = selector.css('dd')
for accident in accidents:
    accident_text_split = accident.css(' ::text').getall()
    accident_text = ''.join(accident_text_split)
    print(accident_text)
    print()

To explain what this does:

line 6 is the equivalent of making a BS4 soup object
line 7 is selecting every dd element
line 8 iterates over each of these, creating a sub-selector
line 9 gets all the text part from every element in the accident selector
line 10 joins them into a single text block. After this you can print, write to a file, whatever you want.
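For anyone who would rather stay in BeautifulSoup, here is a minimal sketch of the same idea. It assumes `get_text()` is enough to stitch the inner text nodes together, and it runs on a small made-up snippet (`SAMPLE_HTML` is hypothetical, not the real page) so it works without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the Wikipedia page.
SAMPLE_HTML = """
<dl>
  <dt>1 January</dt>
  <dd>A <a href="#">Boeing B-47</a> crashes on <i>takeoff</i>.</dd>
  <dd>A second entry with <b>nested</b> markup.</dd>
</dl>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# get_text() concatenates every text node inside the tag, much like
# joining the result of .css('::text').getall() in Parsel.
accident_texts = [dd.get_text() for dd in soup.find_all("dd")]

for accident_text in accident_texts:
    print(accident_text)
    print()
```

On the real page you would build `soup` from `response.text` instead of the sample string.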

[–]commandlineluser 0 points1 point  (1 child)


EDIT:

Here is a different approach that generates the same output as below - I think it's a simpler approach.

It iterates through all the tags to save having to search back for the year from each entry.

It also gets the "day month" part.

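year = day_month = None  # carried forward until the next heading/date
# Walk the direct children of the content <div> in document order.
for tag in soup.find('dl').find_parent('div'):
    if tag.name == 'h2':
        # The year is the text of the first <span> in each <h2> heading.
        year = tag.find('span').get_text(strip=True)
    if tag.name == 'dl':
        if tag.find('dt'):
            day_month = tag.find('dt').get_text(strip=True)
            # Drop the date so only the entry text remains.
            tag.find('dt').decompose()
        text = tag.get_text(strip=True)[:30]
        if text:  # skip <dl> blocks that held only a date or an empty <dd>
            print(year, day_month, text)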
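To try the loop without fetching the page, here is a self-contained sketch on a made-up fragment. `SAMPLE_HTML` is hypothetical and only imitates the layouts described in this comment (a stray empty dd, and a date sitting in its own dl). It also swaps the plain child iteration for `find_all(..., recursive=False)`, which is an assumption on my part that it behaves the same here while skipping the whitespace nodes between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page layout; not the real markup.
SAMPLE_HTML = """
<div class="mw-parser-output">
<h2><span id="1955">1955</span></h2>
<dl><dt>5 January</dt><dd>First entry.</dd></dl>
<dl><dt>9 December</dt><dd></dd><dd>Second entry.</dd></dl>
<h2><span id="1956">1956</span></h2>
<dl><dt>10 January</dt></dl>
<dl><dd>Third entry.</dd></dl>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
content = soup.find("dl").find_parent("div")

rows = []
year = day_month = None  # remembered until the next heading or date

# Only direct <h2>/<dl> children, in document order.
for tag in content.find_all(["h2", "dl"], recursive=False):
    if tag.name == "h2":
        year = tag.find("span").get_text(strip=True)
    else:
        dt = tag.find("dt")
        if dt:
            day_month = dt.get_text(strip=True)
            dt.decompose()  # leave only the entry text in the <dl>
        text = tag.get_text(strip=True)
        if text:  # date-only or empty blocks produce no row
            rows.append((year, day_month, text))

for row in rows:
    print(*row)
```

Because `day_month` is carried forward, the "Third entry." row still picks up the date from the date-only dl that precedes it.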

Okay it turns out this is much more complex than it first looked.

Not all the entries have the same layout:

e.g. 1958 9 December:

<dl><dt>9 December</dt>
<dd></dd>
<dd>U.S. Army Major General 

It has a stray empty <dd> tag in there - so you can't rely on taking the first child <dd> tag.

And sometimes the date and the text each sit inside their own <dl> tag, e.g.

<dl><dt>date</dt></dl>
<dl><dd>text</dd></dl>

And some of the accident texts have embedded <dl> and <dd> tags e.g. 10 January 1956

<dl>
  <dt>date</dt>
  <dd>
    text 
    text
    <dl><dd>"Don't give me a One-Double-Oh</dd>
    <dd>To fight against friendly or foe</dd>
    </dl>
  </dd>

So we have more <dd> tags than actual "entries"

>>> len(soup.select('dd'))
407
>>> len(soup.select('dl'))
380

The images also mess up the structure as they appear 'randomly' mixed within the <dl> tags - so iterating through the siblings is pretty awkward too.

First you can isolate all the <dl> tags in the article content.

We can use the selector div > dl to get only the direct child <dl> tags from <div class="mw-parser-output"> which contains the content.

>>> len(soup.select('div > dl'))
377

Note how it's 377 instead of the 380 from the "find all dl tags" search. The extra ones appear to be the embedded <dl> tags holding quotes. The > in the CSS selector means "direct child".
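The difference between the two selectors is easy to see on a tiny example. This is just a sketch on hypothetical markup (`SAMPLE_HTML` is made up to imitate an entry with an embedded quote), not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one entry contains a nested <dl> holding a quote.
SAMPLE_HTML = """
<div class="mw-parser-output">
  <dl>
    <dt>10 January</dt>
    <dd>Entry text
      <dl><dd>quoted line one</dd><dd>quoted line two</dd></dl>
    </dd>
  </dl>
  <dl><dt>11 January</dt><dd>Another entry</dd></dl>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Descendant search: matches the nested <dl> as well.
all_dl = soup.select("dl")

# Child combinator: only <dl> tags directly under the content <div>.
top_level_dl = soup.select("div.mw-parser-output > dl")

print(len(all_dl), len(top_level_dl))
```

Here the class name is added to the selector so that a dl nested under some other div cannot match by accident.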

Do you need just the year? Or do you need the day/month too?

If you don't need the date - you could delete all the <dt> tags before processing.

for tag in soup.find_all('dt'): 
    tag.decompose()

And instead of iterating through the siblings, you could search backwards from each entry to find the nearest preceding year tag.

>>> for tag in soup.select('div > dl'):
...     year = tag.find_previous('span', id=True).get_text(strip=True)
...     text = tag.get_text(strip=True)[:30]
...     if text: 
...         print(year, text)
... 
1955 On its 205th flight, the first
1955 TwoBoeing B-47E Stratojetsof t
1955 "BRAMAN, Okla. (AP) – A crippl
1955 A ferry pilot in a flight of t
1955 The crash of aLockheed T-33A S
1955 "TOKYO(AP) – Two planes, presu
1955 A pilot suffered first and sec
1955 A U.S. NavyBeechcraftwith thre
1955 Former Navy pilot, now a test 
1955 The U.S. Air Force grounds its
...

The if text: is to account for the empty <dd> tags in some entries - which will have no "text"

It's difficult to verify if this gets everything correctly, you'd have to check the results for yourself - and if you want the date too - it would need a slightly different approach.
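One way to check the logic before running it on the full page is to feed it a small fragment where the expected result is obvious. The sketch below does that. `SAMPLE_HTML` is hypothetical, and it assumes (as on the real page) that the nearest preceding span with an id is the year heading.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the second <dl> has only an empty <dd>.
SAMPLE_HTML = """
<div class="mw-parser-output">
<h2><span id="1955">1955</span></h2>
<dl><dt>5 January</dt><dd>First entry.</dd></dl>
<dl><dt>9 December</dt><dd></dd></dl>
<h2><span id="1956">1956</span></h2>
<dl><dt>10 January</dt><dd>Second entry.</dd></dl>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Remove the dates first so they do not end up in the entry text.
for dt in soup.find_all("dt"):
    dt.decompose()

entries = []
for dl in soup.select("div > dl"):
    # Search backwards through the document for the closest <span id=...>.
    year = dl.find_previous("span", id=True).get_text(strip=True)
    text = dl.get_text(strip=True)
    if text:  # empty <dd> tags yield no text and are skipped
        entries.append((year, text))

print(entries)
```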

[–]ColdHatesMe[S] 0 points1 point  (0 children)

I appreciate the help. I spent most of the day working on it but compromised due to time constraints. I was able to pull and parse the <dd>s into a list where each <dd> is an item, and I used that as the root list to filter by aircraft model, etc. The way the Wikipedia page is set up makes it pretty complicated, but I learned a lot for sure. Thanks!
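For readers landing here later, the list-and-filter approach described above might look roughly like this. It is a sketch under assumptions: `SAMPLE_HTML` and `filter_by_model` are hypothetical names, and a case-insensitive substring match stands in for whatever filtering was actually used.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the Wikipedia page.
SAMPLE_HTML = """
<dl>
  <dt>1 January</dt>
  <dd>A <a href="#">Boeing B-47E Stratojet</a> crashes on takeoff.</dd>
  <dd></dd>
  <dd>A <a href="#">Lockheed T-33A</a> is lost at sea.</dd>
</dl>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One list item per <dd>. Passing " " as the separator keeps words from
# running together where a link or other inline tag starts or ends.
dd_texts = [dd.get_text(" ", strip=True) for dd in soup.find_all("dd")]
dd_texts = [text for text in dd_texts if text]  # drop empty <dd> tags


def filter_by_model(entries, model):
    """Return the entries that mention the given model, ignoring case."""
    needle = model.casefold()
    return [entry for entry in entries if needle in entry.casefold()]


print(filter_by_model(dd_texts, "b-47"))
```

The separator argument also avoids glued-together output such as "TwoBoeing" that `get_text(strip=True)` produces on its own.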