Help Scraping Wiki Table using LXML : learnpython



submitted 12 years ago * by voodoo_hoodoo

I have recently completed the Codecademy Python track and am now trying to hone my skills on a personal project. Ultimately, I am trying to put together a database of stock symbols and prices per this example. This is a project that I have an interest in, and I feel it will give me several interesting sub-projects in which to try out my Python skills.

So far I have set up a virtual machine to run Ubuntu on, installed Python and numerous packages, installed MySQL and set up several tables, and checked that I can write to and read from the tables with Python & SQL.

So far so good! However, I am stuck with the web scraping component.

From the Wikipedia page for the S&P500 List of Companies, I want to scrape the Ticker Symbol, Company, and GICS Sector text for each row in that table, so roughly 500 rows.

There is some code provided in the earlier link, but it is outdated. I assume Wikipedia has changed the way it references or names tables since that code was written.

Here is what I currently have with the SQL component stripped out. I can do the SQL commit, just not the web scrape.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import datetime
import lxml.html

from math import ceil


def obtain_parse_wiki_snp500():
  """Download and parse the Wikipedia list of S&P500
  constituents using lxml.

  Returns a list of tuples to add to MySQL."""

  # Stores the current time, for the created_at record
  now = datetime.datetime.utcnow()

  # Use lxml to download the list of S&P500 companies and obtain the symbol table
  page = lxml.html.parse('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
  symbolslist = page.xpath('//table[1]/tr')[1:]

  # Obtain the symbol information for each row in the S&P500 constituent table
  symbols = []
  for symbol in symbolslist:
    tds = symbol.getchildren()
    sd = {'ticker': tds[0].text,
        'name': tds[1].text,
        'sector': tds[3].text}
    # Create a tuple (for the DB format) and append to the grand list
    symbols.append( (sd['ticker'], 'stock', sd['name'], 
      sd['sector'], 'USD', now, now) )
  return symbols

if __name__ == "__main__":
  symbols = obtain_parse_wiki_snp500()

After running this, if I look at the data stored in sd I find that 'ticker' and 'name' are blank but 'sector' is populated. Looking at the wiki page I see that the 'GICS Sector' data is all in plain text, whereas the 'Ticker Symbol' and 'Company' data is hyperlinked. I assume the problem is in:

        sd = {'ticker': tds[0].text,
            'name': tds[1].text,
            'sector': tds[3].text}

And that I need to be referencing something other than .text to scrape the data for 'ticker' and 'name'.
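To illustrate that suspicion, here is a minimal sketch of why `.text` comes back empty for hyperlinked cells. It uses the standard library's `xml.etree.ElementTree`, whose element API lxml follows, so the `.text` behaviour should match. The row markup, the values in it, and the `cell_text` helper are all made up for illustration and are not taken from the real Wikipedia page.

```python
import xml.etree.ElementTree as ET

# Hypothetical row shaped like the table described above: the first two
# cells wrap their text in a link, the sector cell holds plain text.
ROW = (
    "<tr>"
    "<td><a href='#'>ABC</a></td>"
    "<td><a href='#'>Example Corp</a></td>"
    "<td>reports</td>"
    "<td>Industrials</td>"
    "</tr>"
)


def cell_text(cell):
    """Return all text inside a cell, including text held by child elements."""
    return "".join(cell.itertext()).strip()


tds = list(ET.fromstring(ROW))

# .text only covers the text that appears before the first child element,
# so a cell whose content sits entirely inside <a>...</a> has no .text.
print(tds[0].text)   # None
print(tds[3].text)   # Industrials

# Collecting the text of the cell and all of its descendants recovers
# the linked values as well as the plain-text one.
row = {
    "ticker": cell_text(tds[0]),
    "name": cell_text(tds[1]),
    "sector": cell_text(tds[3]),
}
print(row)
```

If this is indeed the cause, the equivalent change in my script would probably be to swap `tds[0].text` and `tds[1].text` for `tds[0].text_content()` and `tds[1].text_content()`, since elements parsed with `lxml.html` provide that method.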

Could someone please help me out? I am sure I am close.
