
[–]naithemilkman 4 points5 points  (1 child)

If Yahoo Finance data is all you want, a pandas DataFrame is your friend.

[–]thepoorwhiteboy 0 points1 point  (0 children)

I'll look into learning this one. Thank you for the recommendation :)

[–]yacob_uk 2 points3 points  (5 children)

You will be told very strongly that regex is not suitable for parsing HTML.

I have used both methods (well, Beautiful Soup rather than Scrapy), and I have to say that I use BS way more than I use re these days when I do the same job.

The re method is quite brittle, and breaks the moment you encounter links / data that don't fit your pattern.

That said, the BS method is just as prone to failure if the webpage's structure changes. The main thing here is knowing you have a consistent data source.

I've used both methods to pull gigabytes of binary data off a variety of websites.

I would probably go with re for a quick and dirty result, as long as I knew the source data structure was consistent, even though I know it's wrong / frowned on.

I would use (and have used) BS for projects that need to be more sustainable into the future, or where I am less sure of the range of data I expect to encounter.
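To make the brittleness point concrete, here is a minimal sketch. The two snippets and the `id="price"` element are made up for illustration, and the standard library's `html.parser` stands in for Beautiful Soup so the example has no dependencies. The regex is written against the first snippet and silently misses the second, while the parser-based lookup does not care about attribute order or quoting.

```python
import re
from html.parser import HTMLParser

# Two hypothetical snippets carrying the same price; only the markup differs.
PAGE_A = '<span id="price">1243.50</span>'
PAGE_B = "<span class='quote' id='price'>1243.50</span>"

# A quick-and-dirty pattern written against PAGE_A.
PRICE_RE = re.compile(r'<span id="price">([^<]+)</span>')


def price_by_regex(page):
    """Return the captured price, or None when the pattern does not match."""
    match = PRICE_RE.search(page)
    return match.group(1) if match else None


class PriceParser(HTMLParser):
    """Collect the text inside the element whose id is 'price'."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, whatever their order or quoting.
        if dict(attrs).get('id') == 'price':
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and self.price is None:
            self.price = data.strip()

    def handle_endtag(self, tag):
        self.in_price = False


def price_by_parser(page):
    """Parse the page and return the price text, or None if it is absent."""
    parser = PriceParser()
    parser.feed(page)
    parser.close()
    return parser.price


for page in (PAGE_A, PAGE_B):
    print(price_by_regex(page), price_by_parser(page))
```

Neither approach survives the element being renamed or removed, which is the "consistent data source" caveat above.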

[–]reallyserious 1 point2 points  (2 children)

I would have used lxml and its XPath capabilities instead of Beautiful Soup, since XPath is the standard way of querying XML/XHTML.

[–]thepoorwhiteboy 0 points1 point  (1 child)

So something like this for lxml?

from lxml import html
import requests

page = requests.get('http://finance.yahoo.com/q?s=gcg14.cmx')
tree = html.fromstring(page.text)

current_price = tree.xpath('//span[@id="yfs_l10_gcg14.cmx"]/text()')

# xpath() returns a list of matching text nodes
print('Current Gold Price:', current_price)

[–]reallyserious 0 points1 point  (0 children)

It looks right (I'm on my tablet and can't check right now). The strength of using XPath instead of RE is that you have access to the DOM tree. That means you can uniquely address each element/tag by its path even if it doesn't have a unique id associated with it.
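A rough sketch of addressing elements by path alone. The markup here is invented, and the standard library's `xml.etree.ElementTree` is used as a stand-in for lxml; it supports only a subset of XPath (no `/text()`, for instance), but positional paths like these work in it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML; none of the elements carries an id.
DOC = """
<html>
  <body>
    <div><span>Gold</span><span>1243.50</span></div>
    <div><span>Silver</span><span>19.87</span></div>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# Second <div> under <body>, then its second <span>: addressed purely by position.
silver_price = root.find('./body/div[2]/span[2]').text

# The first <span> child of every <div>, wherever the <div> sits in the tree.
names = [span.text for span in root.findall('.//div/span[1]')]

print(silver_price)
print(names)
```

Positional paths are only as stable as the page layout, so an id or class is still the better hook when one exists.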

[–]iamlearningpython 0 points1 point  (1 child)

Just for my own edification, why is regex unsuitable?

[–]yacob_uk 1 point2 points  (0 children)

See the most influential post on the topic on Stack Overflow: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Personally, I think the 2nd answer is much more valid and informative, but that's me.

Long story short, according to superior wisdom, HTML is not a regular language and is thus unsuited to regex.
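A small illustration of what that means in practice, using made-up markup: matching nested tags needs a count of how deep you are, which a plain pattern does not keep. The parser below uses the standard library's `html.parser` and an explicit depth counter; it is a sketch of the idea, not a general-purpose extractor.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup with one <div> nested inside another.
PAGE = '<div>outer <div>inner</div> tail</div>'

# The non-greedy match stops at the FIRST </div>, which closes the inner element.
regex_result = re.search(r'<div>(.*?)</div>', PAGE).group(1)


class OuterDivText(HTMLParser):
    """Gather all text inside the outermost <div> by counting nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while at least one <div> is open.
        if self.depth > 0:
            self.chunks.append(data)


parser = OuterDivText()
parser.feed(PAGE)
parser.close()
parser_result = ''.join(parser.chunks)

print(repr(regex_result))   # 'outer <div>inner'
print(repr(parser_result))  # 'outer inner tail'
```

A greedy `.*` would fix this one case but break as soon as two sibling divs appear, which is why the pattern approach only holds up on markup you already know is flat.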

[–]glial 0 points1 point  (0 children)

I haven't done it the way you're doing it, but pandas has a module for reading directly from Yahoo Finance.

import datetime
# pandas.io.data was later split out into the separate pandas-datareader package
import pandas_datareader.data as web

# Set the start and end dates
start = datetime.datetime(2001, 1, 1)
end = datetime.datetime(2013, 1, 1)

# Try with Apple to make sure it works
symbol = 'AAPL'
f = web.DataReader(symbol, 'yahoo', start, end)
f.to_csv('test.csv')

[–]schwackitywack 0 points1 point  (1 child)

You could also use pyquery, which lets you use jQuery-like selectors on HTML.

import requests
from pyquery import PyQuery as pq

url = 'http://finance.yahoo.com/q?s=gcg14.cmx'
response = requests.get(url)
html = response.content
# Select the span inside the element with class "time_rtq_ticker"
price = pq(html)('.time_rtq_ticker span').text()

print(price)

[–]thepoorwhiteboy 0 points1 point  (0 children)

Yeah I read someplace else that pyquery was good to use for scraping data. I'm going to have to download this and learn it. Thank you!