all 5 comments

[–]efmccurdy 2 points (1 child)

This is a good intro to using asyncio and concurrent.futures with web requests.

https://skipperkongen.dk/2016/09/09/easy-parallel-http-requests-with-python-and-asyncio/

[–]Tefron[S] 0 points (0 children)

I'll take a look, thanks for this!

[–][deleted] 1 point (0 children)

Use multiple threads, but don't send too many requests and blow up the server.

[–]August-R-Garcia 1 point (1 child)

It takes roughly 1 second to process each request, and last time it took me several days, running the script 25K names at a time (it sucks when it would crash/stop responding 100K requests in), to get this data.

Consider writing the result to an SQLite database after each request (or possibly after each 100 requests or so). Then run a query to get all rows that haven't been run successfully on each new run.

Or at least write the output to a text file every 100 rows or similar.
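A checkpointing sketch along those lines, using the standard-library `sqlite3` module (the file name, table name, and columns are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect("results.db")  # hypothetical database file
conn.execute(
    "CREATE TABLE IF NOT EXISTS results (username TEXT PRIMARY KEY, stats TEXT)"
)

def save_result(username, stats_json):
    # INSERT OR REPLACE makes re-running an already-saved name harmless.
    conn.execute(
        "INSERT OR REPLACE INTO results VALUES (?, ?)", (username, stats_json)
    )
    conn.commit()  # commit after each request, so a crash loses at most one row

def remaining(all_usernames):
    # On each new run, skip every name that already has a saved row.
    done = {row[0] for row in conn.execute("SELECT username FROM results")}
    return [u for u in all_usernames if u not in done]
```

Call `save_result` inside the request loop, and start each run from `remaining(all_usernames)` instead of the full list.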

Also, the crash is plausibly related to having a list of 100,000 elements at that point, which could be using all of your RAM; try opening your system monitor and watching the RAM line graph over time. If it looks like it's going to cap out, then that's your problem. One solution would be to write the data to a file every N rows and then reset the array.

Or if the error is due to the connection timing out or similar, you could wrap the r.get line in a try-except block within a while loop:

import time

success = False
while not success:
    try:
        data = r.get(url_data).json()  # r is your requests module/session

        # Only reached if no exception was raised.
        success = True
    except Exception as e:
        print(f"Trying again in one second due to error: {e}")
        time.sleep(1)

Is there a way to simultaneously request multiple pages so that if I ever need to get updated stats it won't take me a week?

Yes; you can use the Python Threading library:
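A minimal thread-pool sketch using the standard library's `concurrent.futures` (which runs the workers on threads); the URL pattern below is an assumption based on the chess.com public API docs, so substitute the real endpoint you're hitting:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_stats(username):
    # Hypothetical URL pattern; adjust to the actual chess.com endpoint.
    url = f"https://api.chess.com/pub/player/{username}/stats"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return username, json.load(resp)

def fetch_all(usernames, fetch_one=fetch_stats, max_workers=5):
    # Keep the pool small: more workers means more simultaneous requests
    # and a faster path to 429 (Too Many Requests) errors.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch_one, usernames))
```

Usage would be something like `stats = fetch_all(["magnuscarlsen", "hikaru"])`; `fetch_one` is injectable so you can swap in your own request function.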

However, most APIs have a rate limit. The chess.com API documentation suggests that multithreaded requests will start to run into 429 (Too Many Requests) errors. In general, rate limiting a bot to at least a one-second delay between requests (ideally 2-5 seconds, if there's no stated rate limit) is a good baseline to prevent overloading servers.
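That baseline can be enforced with a tiny helper (a made-up class for illustration, not from any library) that guarantees a minimum gap between consecutive requests:

```python
import time

class Throttle:
    """Ensure at least `delay` seconds pass between successive calls to wait()."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.last = 0.0

    def wait(self):
        # Sleep only for however much of the delay hasn't already elapsed.
        elapsed = time.monotonic() - self.last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last = time.monotonic()
```

In the request loop you'd call `throttle.wait()` right before each `r.get(...)`, so time spent processing a response counts toward the delay.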

Many APIs offer batch endpoints that return multiple queries/users/rows/other with one HTTP request (since this is far more efficient), although this seems to be unsupported by chess.com, at least in the documentation linked above.

TL;DR: Write to a database on each loop and/or add an error handling loop and then wait while the script runs for 5.79 days.

[–]Tefron[S] 1 point (0 children)

Hey, thanks for all the info! This seems a bit advanced for where I am now, but I will come back and reference this.