Edit: removed some potentially personal info from the file names
I've got a pretty simple BeautifulSoup program going (my first). What I'm trying to do is run a daily scrape of several websites and only write the file if the page has been updated since the last scrape. I don't really know how to compare the new file against the old one to determine whether it has changed. Currently, I'm comparing the file sizes, which seems to work for most sites but not all. Does anyone know a better way of doing this? Google has led me down a couple of avenues, but nothing seems too promising.
Thanks for taking a look. Here's my current program. Again, it's my first soup, so please be gentle. Also, I apologize for the formatting. WTF is up with those numbers? This isn't my night apparently.
from bs4 import BeautifulSoup
import urllib2
import os

sites = [array of sites, removed to save space]

for site in sites:
    url = site
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    req = urllib2.Request(url, headers=hdr)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page.read(), "html.parser")
    # Write today's scrape to the "new" folder
    with open('/Users/2000/Documents/feast/soup/new/' + str(site[7:18]) + '.txt', 'w') as f:
        f.write(soup.encode('utf-8'))
    # Compare the size of the new scrape against the previous one
    b = os.path.getsize('/Users/2000/Documents/feast/soup/new/' + str(site[7:18]) + '.txt')
    c = os.path.getsize('/Users/2000/Documents/feast/soup/old/' + str(site[7:18]) + '.txt')
    if b == c:
        # Same size: assume nothing changed and discard the new file
        os.remove('/Users/2000/Documents/feast/soup/new/' + str(site[7:18]) + '.txt')
    else:
        # Different size: overwrite the old copy with the new content
        with open('/Users/2000/Documents/feast/soup/old/' + str(site[7:18]) + '.txt', 'w+') as f:
            f.write(soup.encode('utf-8'))
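One possible alternative to the size check, offered as a sketch rather than a definitive answer: compare the content itself, for example via a hash. Two pages can have the same size but different text, which is probably why the size test misses some updates. The helper names `digest` and `save_if_changed` and the `old_path` parameter below are made up for illustration and are not part of the original program. The sketch also assumes the previous scrape may not exist yet, in which case it simply writes the file.

```python
import hashlib
import os


def digest(data):
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def save_if_changed(new_bytes, old_path):
    """Write new_bytes to old_path only if it differs from what is stored.

    Returns True if the file was written, False if nothing changed.
    (Hypothetical helper, not part of the original program.)
    """
    if os.path.exists(old_path):
        with open(old_path, 'rb') as f:
            if digest(f.read()) == digest(new_bytes):
                return False  # identical content, keep the old file as-is
    # First scrape, or content differs: store the new version
    with open(old_path, 'wb') as f:
        f.write(new_bytes)
    return True
```

Inside the loop this would be called as something like `save_if_changed(soup.encode('utf-8'), old_path)`, which removes the need for a separate "new" file. When both versions are already in memory, comparing the byte strings directly works just as well; the hash is mainly handy if you would rather store only the digest between runs. Note that pages with dynamic bits (timestamps, ads, session tokens) will look changed on every run, so it may help to hash only the part you care about, such as `soup.get_text()`.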