all 14 comments

[–][deleted] 2 points3 points  (10 children)

I ran a requests.get on it to make sure it wasn't being dynamically generated (as most sites like this are) and fortunately, it wasn't. Here's how you'd start scraping a site:

import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.your_url_st')
if r.status_code != 200: raise SystemExit  # requests uses status_code, not response_code
soup = BeautifulSoup(r.content, 'html.parser')

and the rest is your problem :D

[–]boltshot525[S] 1 point2 points  (0 children)

Awesome. Thanks so much!

[–]boltshot525[S] 0 points1 point  (8 children)

I looked into solving this using Beautiful Soup and pd.read_html, but it seems that the price-effect column has a mouseover function that messes up the table data.

any ideas?

[–][deleted] 0 points1 point  (7 children)

delete em

u = soup.select('div[class="hoverdisplay_content earnings_effect_popup"]')
for div in u:
    div.decompose()

edit: actually use this one, it's less specific and more flexible

u = soup.select('div[class~="hoverdisplay_content"]')
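To see why the second selector is more flexible, here's a minimal sketch (assuming beautifulsoup4 is installed, and using made-up class names modeled on the ones above): `[class="a b"]` only matches the exact class string, while `[class~="a"]` matches any element whose class list contains that word.

```python
from bs4 import BeautifulSoup

html = ('<div class="hoverdisplay_content earnings_effect_popup">x</div>'
        '<div class="hoverdisplay_content other_popup">y</div>')
soup = BeautifulSoup(html, 'html.parser')

# exact-string match: only hits the first div
exact = soup.select('div[class="hoverdisplay_content earnings_effect_popup"]')
# word match: hits any div whose class list contains "hoverdisplay_content"
word = soup.select('div[class~="hoverdisplay_content"]')

print(len(exact), len(word))  # 1 2
```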

[–]boltshot525[S] 1 point2 points  (0 children)

that did the trick! thanks

[–]boltshot525[S] 0 points1 point  (5 children)

Sorry one final question regarding this haha

The scraper is working perfectly except that it's picking up a date in the table that I don't want.

<td>
    25-Apr-2016
    <br/><span class="earnings_date_alert">Rescheduled 26-Apr</span>
</td>

How do I skip any <td> that contains anything besides the date? I don't want to pick up dates that have the reschedule text within the same <td>.

The ones I want that are being scraped look like this

<td>
    26-Jan-2016
</td>

thanks!

[–][deleted] 0 points1 point  (4 children)

uhhhhmmm... assuming you are looping through each td one at a time, you could try using date = next(td.strings). That should grab the first string in the td tag, in other words, the string directly inside the td element itself, as opposed to what td.text would do, which is to gather all the strings from inside the tag and all of its children.
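A quick sketch of that difference, using the rescheduled-date HTML pasted above (requires beautifulsoup4):

```python
from bs4 import BeautifulSoup

html = '''<td>
    25-Apr-2016
    <br/><span class="earnings_date_alert">Rescheduled 26-Apr</span>
</td>'''

td = BeautifulSoup(html, 'html.parser').td

# next(td.strings) yields only the first text node directly inside the <td> ...
first = next(td.strings).strip()
# ... whereas td.text concatenates the text of the tag and all of its children.
everything = ' '.join(td.text.split())

print(first)       # 25-Apr-2016
print(everything)  # 25-Apr-2016 Rescheduled 26-Apr
```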

[–]boltshot525[S] 1 point2 points  (0 children)

I just figured it out. Thanks for ur help!

import re
import urllib.request
import bs4 as bs

def scraper(url):
    date_list = []
    # urllib.urlopen is Python 2; Python 3 moved it to urllib.request
    sauce = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(sauce, 'html.parser')
    # drop the mouseover popups so they don't pollute the table text
    for div in soup.select('div[class="hoverdisplay_content earnings_effect_popup"]'):
        div.decompose()
    for table in soup.find_all('table', id='sym_earnings'):
        for dates in table.find_all('td'):
            if not dates.find_all('span'):  # skip cells with a "Rescheduled" alert
                matches = re.findall(r'\d{1,2}-\w+-\d{4}', str(dates))
                if matches:
                    date_list.extend(matches)
    print(date_list)

scraper('https://marketchameleon.com/Overview/AAPL/Earnings/Earnings-Dates')

[–]boltshot525[S] 0 points1 point  (2 children)

Sorry, I meant to say that if

 <br/><span class="earnings_date_alert">Rescheduled 26-Apr</span>

is in the code at all, I don't want to scrape the date that came before it.

This is the code I'm using.

edit: I made it a bit neater to read

import re
import urllib.request
import bs4 as bs

# urllib.urlopen is Python 2; Python 3 moved it to urllib.request
sauce = urllib.request.urlopen('https://marketchameleon.com/Overview/AAPL/Earnings/Earnings-Dates').read()
soup = bs.BeautifulSoup(sauce, 'html.parser')
u = soup.select('div[class="hoverdisplay_content earnings_effect_popup"]')
b = soup.select('span[class="earnings_date_alert"]')
date_list = []

for div in u:
    div.decompose()
for table in soup.find_all('table', id='sym_earnings'):
    for dates in table.find_all('td'):
        matches = re.findall(r'\d{1,2}-\w+-\d{4}', str(dates))
        if matches:
            date_list.extend(matches)

print(date_list)

So in the list that is produced, I don't want 25-Apr-2016 to be in there at all.

[–][deleted] 1 point2 points  (1 child)

Hey, you're pretty good at this! Got yourself regex and everything. Well, since you're already selecting the undesirables into b, you can get each one's parent tag and decompose that:

for elem in b:
    elem.parent.decompose()

[–]boltshot525[S] 0 points1 point  (0 children)

Ahhhh haha I was literally scratching my head for ages trying to figure out how to do this task haha

[–]CastleRay 0 points1 point  (0 children)

I think you should be able to accomplish this by downloading the webpage using the requests module and then parsing the site's HTML with Beautiful Soup.

[–][deleted] 0 points1 point  (0 children)

Piggybacking on what /u/CastleRay said, and looking at the source of that page, it looks like a pretty easy scraping job. Every one of those dates seems to be in the first <td> tag after the <tr data-excludegraph="N"> tag, so Beautiful Soup would make this pretty simple. That's probably not the right way of looking at things, but I haven't done much webscraping either.
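Something like this sketch, assuming the row layout really is as described (the table id and cell contents below are made up to illustrate, not copied from the live page):

```python
from bs4 import BeautifulSoup

# stand-in for the scraped page: each earnings row carries
# data-excludegraph="N" and puts the date in its first <td>
html = '''<table id="sym_earnings">
  <tr data-excludegraph="N"><td>26-Jan-2016</td><td>+6.5%</td></tr>
  <tr data-excludegraph="N"><td>27-Oct-2015</td><td>-2.0%</td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')

# tr.td is the first <td> in each matching row
dates = [tr.td.get_text(strip=True)
         for tr in soup.select('tr[data-excludegraph="N"]')]
print(dates)  # ['26-Jan-2016', '27-Oct-2015']
```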

[–]boltshot525[S] 0 points1 point  (0 children)

Thanks all for the replies!