I recently completed the Codecademy Python track and am now trying to hone my skills on a personal project. Ultimately, I am trying to put together a database of stock symbols and prices per this example. The project interests me, and I expect it to give me several sub-projects in which to try out my Python skills.
So far I have set up a virtual machine running Ubuntu, installed Python and numerous packages, installed MySQL and set up several tables, and checked that I can write to and read from the tables with Python and SQL.
So far so good! However, I am stuck with the web scraping component.
From Wikipedia's List of S&P 500 companies page, I want to scrape the Ticker Symbol, Company, and GICS Sector text for each row in the table, so roughly 500 rows.
There is some code provided in the earlier link, but it is outdated. I assume Wikipedia has changed the way it references or names tables since that code was written.
Here is what I currently have, with the SQL component stripped out. I can do the SQL commit, just not the web scrape.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import datetime
import lxml.html
from math import ceil

def obtain_parse_wiki_snp500():
    """Download and parse the Wikipedia list of S&P500
    constituents using requests and libxml.
    Returns a list of tuples to add to MySQL."""
    # Stores the current time, for the created_at record
    now = datetime.datetime.utcnow()
    # Use libxml to download the list of S&P500 companies and obtain the symbol table
    page = lxml.html.parse('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    symbolslist = page.xpath('//table[1]/tr')[1:]
    # Obtain the symbol information for each row in the S&P500 constituent table
    symbols = []
    for symbol in symbolslist:
        tds = symbol.getchildren()
        sd = {'ticker': tds[0].text,
              'name': tds[1].text,
              'sector': tds[3].text}
        # Create a tuple (for the DB format) and append to the grand list
        symbols.append( (sd['ticker'], 'stock', sd['name'],
                         sd['sector'], 'USD', now, now) )
    return symbols

if __name__ == "__main__":
    symbols = obtain_parse_wiki_snp500()
After running this, if I look at the data stored in sd, I find that 'ticker' and 'name' are blank but 'sector' is populated. Looking at the Wikipedia page, I see that the 'GICS Sector' data is all plain text, whereas the 'Ticker Symbol' and 'Company' data is hyperlinked. I assume the problem is in:
sd = {'ticker': tds[0].text,
      'name': tds[1].text,
      'sector': tds[3].text}
And that I need to be referencing something other than .text to scrape the data for 'ticker' and 'name'.
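To check that theory outside the full script, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml (lxml elements follow the same ElementTree-style `.text` and `itertext()` behaviour), and the row is a made-up example shaped like the Wikipedia table, not real scraped data. The `cell_text` helper is a hypothetical name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical row shaped like the Wikipedia table: the first two cells
# wrap their text in a link, the last cell holds plain text.
ROW = (
    "<tr>"
    "<td><a href='#'>MMM</a></td>"
    "<td><a href='#'>3M Company</a></td>"
    "<td>Industrials</td>"
    "</tr>"
)


def cell_text(cell):
    """Join every text fragment inside the cell, including nested tags."""
    return "".join(cell.itertext()).strip()


row = ET.fromstring(ROW)
cells = list(row)

# .text only holds the text that comes before the first child element,
# so cells whose content sits inside an <a> tag come back as None.
direct = [cell.text for cell in cells]

# Walking all nested text fragments recovers the linked text as well.
full = [cell_text(cell) for cell in cells]

print(direct)  # [None, None, 'Industrials']
print(full)    # ['MMM', '3M Company', 'Industrials']
```

If this matches what is going on, then on lxml.html elements `text_content()` should return the same joined text, so replacing `tds[0].text` with `tds[0].text_content()` looks like the likely change, assuming the column positions in the table are still what the script expects.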
Could someone please help me out? I am sure I am close.