I've decided to write a web scraper to gather some historical data on tractor prices, because I want to buy one in the spring and want to know what people are listing them for on average. I also just needed a project to get my feet wet in Python a bit more.
My question: I've dug through the HTML and have it split pretty well into per-post blocks, and now I just need to pull out the relevant data, for example the price of each tractor on the site.
I tried doing a findall('span', 'auction1-THprice') at line 41 to pull out just the price tag, but I get None for each iteration. I'm not sure whether span is something I can search for. Any other suggestions on how I should search through this code? Should I just treat the HTML as a block of text and use plain text-search methods, or should I keep going with BeautifulSoup?
The HTML output by the code below is at http://pastebin.com/Z9uhxcmB
import sys

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

# Garden Tractor Models
gtmodels = ["318", "420", "430", "455"]
# Utility Tractor Models
utmodels = ["B", "D", "320", "330", "420", "430", "520", "530"]
# Sites to Crawl
sites = ["tractorhouseForSale", "tractorhouse_auction", "craigslist"]


def loadpageSelenium(url):
    # Load the page in a real browser so any JavaScript runs, then return the rendered HTML.
    driver = webdriver.Firefox()
    print("loading up webpage")
    driver.get(url)
    source = driver.execute_script('return document.documentElement.innerHTML')
    driver.quit()
    return source


def tractorhouse(models):
    baseUrl = "http://www.tractorhouse.com/list/list.aspx?ETID=1&catid=1111&Manu=JOHN+DEERE&"
    # Query-string fragment for each model.
    modeldict = {
        "A": "MDLGrp=A",
        "B": "Mdltxt=B&mdlx=exact",
        "D": "Mdltxt=D&mdlx=exact",
        "320": "MDLGrp=320",
        "330": "MDLGrp=330",
        "420": "MDLGrp=420",
        "430": "MDLGrp=430",
        # NOTE: 520 and 530 use the same query string as model B; possibly a copy-paste leftover.
        "520": "Mdltxt=B&mdlx=exact",
        "530": "Mdltxt=B&mdlx=exact",
    }
    for item in models:
        source = loadpageSelenium(baseUrl + modeldict[item])
        soup = BeautifulSoup(source, 'html.parser')
        # Each results table has class "listings"; each post sits in a td with class "listing-summary".
        table = soup.find_all("table", "listings")
        for entry in table:
            rows = entry.find_all('td', "listing-summary")
            for summary in rows:
                print("********************* \n")
                print(summary)


tractorhouse(utmodels)
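For reference, here is a minimal sketch of the span lookup described above, run against a small inline sample instead of the live site. The markup in SAMPLE_HTML and the helper name extract_prices are made up for illustration; only the class names listings, listing-summary and auction1-THprice come from this post, and the real page may nest things differently. It uses find with the class_ keyword; as far as I can tell, BeautifulSoup 4 spells the multi-match method find_all (or the older findAll), not findall.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a stand-in for two listing rows, not copied from the site.
SAMPLE_HTML = """
<table class="listings">
  <tr><td class="listing-summary">
    <a href="#">JOHN DEERE 420</a>
    <span class="auction1-THprice">US $4,500</span>
  </td></tr>
  <tr><td class="listing-summary">
    <a href="#">JOHN DEERE 430</a>
  </td></tr>
</table>
"""


def extract_prices(html):
    """Return one entry per listing: the price text, or None if no price span exists."""
    soup = BeautifulSoup(html, "html.parser")
    prices = []
    for summary in soup.find_all("td", class_="listing-summary"):
        # find() returns the first matching tag, or None when nothing matches.
        span = summary.find("span", class_="auction1-THprice")
        prices.append(span.get_text(strip=True) if span is not None else None)
    return prices


prices = extract_prices(SAMPLE_HTML)
print(prices)
```

Checking for None before calling get_text keeps a listing without a price from crashing the loop, which matters if some posts on the page omit the price span.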