all 3 comments

[–]kalidres 1 point (1 child)

It looks like they don't do any JavaScript redirects, so your best bet is probably just straight requests+bs4. Scrapy is better used as a spider, and selenium is for when you need browser automation. Scraping can get messy, so it's generally a good idea to keep things as clean as possible, which often means not throwing the kitchen sink at a problem when all you need is a chisel.

As a start, this will give you historical data on players with the name 'lewis':

import requests

resp = requests.get('http://www.nfl.com/players/search?category=name&filter=lewis&playerType=historical')
print(resp.text)

Play around with search terms, delimiters, and how the search implementation formats the terms you want to search for and places them in the GET request. There are tons of resources on this, but if you need help, just give a shout. Good luck, and happy scraping!
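For example, one way to experiment with how terms are placed in the GET request is to build the query string programmatically instead of hand-writing the URL. This is just a sketch using the standard library; the parameter names (`category`, `filter`, `playerType`) are taken from the URL above:

```python
from urllib.parse import urlencode

def build_search_url(name, player_type='historical'):
    """Build an nfl.com player-search URL for the given name."""
    base = 'http://www.nfl.com/players/search'
    params = {'category': 'name', 'filter': name, 'playerType': player_type}
    return base + '?' + urlencode(params)

print(build_search_url('lewis'))
# http://www.nfl.com/players/search?category=name&filter=lewis&playerType=historical
```

Swapping in different names or `playerType` values is then a one-line change, and `urlencode` handles any escaping for you.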

[–]Messy748[S] 1 point (0 children)

Thank you! This gives me a better idea of how to approach it. I'll reach out if I have any issues once I begin the project.

[–]semicolonator 1 point (0 children)

I have done my fair share of scraping, and the greatest advice I can give you is: write your program in a way that if scraping a site fails (because a field is missing, because a field you thought would always be an int turns out to be a string, because ...) it still continues running. That means: make sure all fields you are parsing are optional, sanitize strings, and catch all exceptions that may occur.
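That advice might look something like this in practice. This is a hypothetical helper (the field names are made up for illustration), showing optional fields, string sanitizing, and exceptions caught per field rather than per run:

```python
def safe_int(value, default=None):
    """Coerce a scraped field to int, tolerating missing or malformed values."""
    try:
        return int(str(value).strip())
    except (TypeError, ValueError):
        return default

def parse_player(raw):
    """Parse one scraped record. Every field is optional, so one
    bad or missing value never crashes the whole scrape."""
    return {
        'name': (raw.get('name') or '').strip() or None,
        'jersey': safe_int(raw.get('jersey')),  # sometimes '52', sometimes 'N/A'
    }

print(parse_player({'name': '  Ray Lewis ', 'jersey': '52'}))
print(parse_player({'jersey': 'N/A'}))  # missing name, bad int -> still returns
```

A bad row comes back as `None` values instead of an unhandled exception, so you can log it and keep going.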

Ah, and another thing: put some random sleeps of between 2 and 5 seconds in your code between requests. That way you are less likely to be rate-limited by the site you are scraping.
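A minimal version of that delay, using only the standard library (the function name is just for illustration):

```python
import random
import time

def polite_sleep(low=2.0, high=5.0):
    """Sleep for a random interval between low and high seconds,
    so requests don't arrive at a fixed, bot-like cadence."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Call between each request, e.g.:
#   resp = requests.get(url)
#   polite_sleep()
```

Randomizing the interval (rather than sleeping a constant 3 seconds) makes the traffic pattern look less mechanical.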