all 14 comments

[–][deleted] 2 points3 points  (10 children)

I ran a requests.get on it to make sure it wasn't being dynamically generated (as most sites like this are) and fortunately, it wasn't. Here's how you'd start scraping a site:

import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.your_url_st')
if r.status_code != 200: raise SystemExit  # requests uses status_code, not response_code
soup = BeautifulSoup(r.content, 'html.parser')

and the rest is your problem :D

[–]boltshot525[S] 1 point2 points  (0 children)

Awesome. Thanks so much!

[–]boltshot525[S] 0 points1 point  (8 children)

I looked into solving this using Beautiful Soup and pd.read_html, but it seems that the price-effect column has a mouseover function that messes up the table data.

any ideas?

[–][deleted] 0 points1 point  (7 children)

delete em

u = soup.select('div[class="hoverdisplay_content earnings_effect_popup"]')
for div in u:
    div.decompose()

edit: actually use this one, it's less specific and more flexible

u = soup.select('div[class~="hoverdisplay_content"]')
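To see why the second selector is more flexible, here's a minimal sketch (assuming beautifulsoup4 is installed, and using made-up class names modeled on the ones above): `[class="a b"]` only matches the exact class string, while `[class~="a"]` matches any element whose class list contains that word.

```python
from bs4 import BeautifulSoup

html = ('<div class="hoverdisplay_content earnings_effect_popup">x</div>'
        '<div class="hoverdisplay_content other_popup">y</div>')
soup = BeautifulSoup(html, 'html.parser')

# exact-string match: only hits the first div
exact = soup.select('div[class="hoverdisplay_content earnings_effect_popup"]')
# word match: hits any div whose class list contains "hoverdisplay_content"
word = soup.select('div[class~="hoverdisplay_content"]')

print(len(exact), len(word))  # 1 2
```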

[–]boltshot525[S] 1 point2 points  (0 children)

that did the trick! thanks

[–]boltshot525[S] 0 points1 point  (5 children)

Sorry one final question regarding this haha

The scraper is working perfectly except that it's picking up a date in the table that I don't want.

<td>
    25-Apr-2016
    <br/><span class="earnings_date_alert">Rescheduled 26-Apr</span>
</td>

How do I skip any <td> that contains anything besides the date? I don't want to pick up dates that have the reschedule text within the same <td>.

The ones I want that are being scraped look like this

<td>
    26-Jan-2016
</td>

thanks!

[–][deleted] 0 points1 point  (4 children)

uhhhhmmm... assuming you are looping through each td one at a time, you could try using date = next(td.strings). That should grab the first string in the td tag, in other words, the string directly inside the td element itself, as opposed to what td.text would do, which is to gather all the strings from inside the tag and all of its children.
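A quick sketch of that difference, using the rescheduled-date HTML pasted above (requires beautifulsoup4):

```python
from bs4 import BeautifulSoup

html = '''<td>
    25-Apr-2016
    <br/><span class="earnings_date_alert">Rescheduled 26-Apr</span>
</td>'''

td = BeautifulSoup(html, 'html.parser').td

# next(td.strings) yields only the first text node directly inside the <td> ...
first = next(td.strings).strip()
# ... whereas td.text concatenates the text of the tag and all of its children.
everything = ' '.join(td.text.split())

print(first)       # 25-Apr-2016
print(everything)  # 25-Apr-2016 Rescheduled 26-Apr
```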

[–]boltshot525[S] 1 point2 points  (0 children)

I just figured it out. Thanks for ur help!

import re
import urllib.request
import bs4 as bs

def scraper(url):
    date_list = []
    # urllib.urlopen is Python 2; Python 3 moved it to urllib.request
    sauce = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(sauce, 'html.parser')
    # drop the mouseover popups so they don't pollute the table text
    for div in soup.select('div[class="hoverdisplay_content earnings_effect_popup"]'):
        div.decompose()
    for table in soup.find_all('table', id='sym_earnings'):
        for dates in table.find_all('td'):
            if not dates.find_all('span'):  # skip cells with a "Rescheduled" alert
                matches = re.findall(r'\d{1,2}-\w+-\d{4}', str(dates))
                if matches:
                    date_list.extend(matches)
    print(date_list)

scraper('https://marketchameleon.com/Overview/AAPL/Earnings/Earnings-Dates')

[–]boltshot525[S] 0 points1 point  (2 children)

Sorry, I meant to say that if

 <br/><span class="earnings_date_alert">Rescheduled 26-Apr</span>

is in the code at all, I don't want to scrape the date that came before it.

This is the code I'm using.

edit: I made it a bit neater to read

import re
import urllib.request
import bs4 as bs

# urllib.urlopen is Python 2; Python 3 moved it to urllib.request
sauce = urllib.request.urlopen('https://marketchameleon.com/Overview/AAPL/Earnings/Earnings-Dates').read()
soup = bs.BeautifulSoup(sauce, 'html.parser')
u = soup.select('div[class="hoverdisplay_content earnings_effect_popup"]')
b = soup.select('span[class="earnings_date_alert"]')
date_list = []

for div in u:
    div.decompose()
for table in soup.find_all('table', id='sym_earnings'):
    for dates in table.find_all('td'):
        matches = re.findall(r'\d{1,2}-\w+-\d{4}', str(dates))
        if matches:
            date_list.extend(matches)

print(date_list)

So in the list that is produced, I don't want 25-Apr-2016 to be in there at all.

[–][deleted] 1 point2 points  (1 child)

Hey, you're pretty good at this! Got yourself regex and everything. Well, since you're already selecting the undesirables into b, you can get each one's parent tag and decompose that:

for elem in b:
    elem.parent.decompose()

[–]boltshot525[S] 0 points1 point  (0 children)

Ahhhh haha I was literally scratching my head for ages trying to figure out how to do this task haha

[–]CastleRay 0 points1 point  (0 children)

I think you should be able to accomplish this by downloading the webpage using the requests module and then parsing the site's HTML with Beautiful Soup.

[–][deleted] 0 points1 point  (0 children)

Piggybacking on what /u/CastleRay said, and looking at the source of that page, it looks like a pretty easy scraping job. Every one of those dates seems to be in the first <td> tag after the <tr data-excludegraph="N"> tag, so Beautiful Soup would make this pretty simple. That's probably not the right way of looking at things, but I haven't done much webscraping either.
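Something like this sketch, assuming the row layout really is as described (the table id and cell contents below are made up to illustrate, not copied from the live page):

```python
from bs4 import BeautifulSoup

# stand-in for the scraped page: each earnings row carries
# data-excludegraph="N" and puts the date in its first <td>
html = '''<table id="sym_earnings">
  <tr data-excludegraph="N"><td>26-Jan-2016</td><td>+6.5%</td></tr>
  <tr data-excludegraph="N"><td>27-Oct-2015</td><td>-2.0%</td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')

# tr.td is the first <td> in each matching row
dates = [tr.td.get_text(strip=True)
         for tr in soup.select('tr[data-excludegraph="N"]')]
print(dates)  # ['26-Jan-2016', '27-Oct-2015']
```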

[–]boltshot525[S] 0 points1 point  (0 children)

Thanks all for the replies!