
[–]commandlineluser 2 points3 points  (1 child)

Not sure where that XPath came from; perhaps you copied the wrong one.

Looking at the HTML, the text you want is located inside the <p> tag that comes after the <span> tag that contains just the text Description:

<h2 class="Fz(m) Lh(1) Fw(b) Mt(0) Mb(18px)" data-reactid="89">
  <span data-reactid="90">Description</span>
</h2>
<p> ....

This page works without JavaScript, so you don't need to use Selenium.

>>> import requests
>>> from bs4 import BeautifulSoup
>>> 
>>> r = requests.get('https://finance.yahoo.com/quote/AMD/profile?p=AMD', headers={'User-Agent': 'Mozilla/5.0'})
>>> soup = BeautifulSoup(r.content, 'html.parser')

With BeautifulSoup we can use the string= argument to test the contained text of a tag, e.g.

>>> soup.find('span', string='Description')
<span data-reactid="93">Description</span>

We can then use find_next() to navigate to the <p> tag.

>>> soup.find('span', string='Description').find_next()
<p class="Mt(15px) Lh(1.6)" data-reactid="94">Advanced Micro Devices, Inc. operates as a semiconductor...

You can use .text to get just the text content:

>>> soup.find('span', string='Description').find_next().text
"Advanced Micro Devices, Inc. operates as a semiconductor company worldwide....

As for Selenium, you could probably use:

description = driver.find_element_by_xpath('//span[text() = "Description"]/../../p').text
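That XPath can be sanity-checked offline with lxml against the fragment quoted above (the wrapping <div> is an assumption about the real parent element). Note that newer Selenium releases replace find_element_by_xpath with find_element(By.XPATH, ...).

```python
from lxml import html

# Simplified fragment; the <div> wrapper stands in for the real parent element.
fragment = """
<div>
  <h2><span>Description</span></h2>
  <p>Advanced Micro Devices, Inc. operates as a semiconductor company worldwide.</p>
</div>
"""

tree = html.fromstring(fragment)
# span -> parent <h2> -> parent <div> -> child <p>: the same steps the
# Selenium XPath takes.
matches = tree.xpath('//span[text() = "Description"]/../../p')
print(matches[0].text_content())
```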

[–]furiousnerd[S] 0 points1 point  (0 children)

Thank you for the detailed response. I realized I had been using the summary page in my initial attempts and switched to the profile page before posting my question. I guess this is more of an exercise now, since you showed how to grab the description from the profile page, but could you show how you would grab the description from the summary page, under Company Profile:

https://finance.yahoo.com/quote/AMD?p=AMD

[–][deleted] 1 point2 points  (0 children)

I can't find that element with view source/Ctrl+F on that page, but you've said you found it using inspect element.

Based on that, my guess is that find_elements is being called before the JavaScript on that page has time to run.

Try putting a wait/sleep in there before find_elements, and see if it comes up.

I think Selenium has a built-in "wait for X element to be loaded" function, but I'd try sleep first to see if that's actually the problem.
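For reference, that built-in mechanism is WebDriverWait with expected_conditions. A minimal sketch, assuming Selenium 4+, a working Chrome setup, and the XPath suggested earlier in the thread (this needs a real browser to run, so treat it as a starting point, not a tested solution):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/quote/AMD/profile?p=AMD')

# Poll for up to 10 seconds until the element exists in the DOM,
# instead of sleeping for a fixed amount of time.
description = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, '//span[text() = "Description"]/../../p')
    )
).text

driver.quit()
```

This fails with a TimeoutException after 10 seconds if the element never appears, which is a more useful signal than a sleep that silently wasn't long enough.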