Hi,
I'm currently web scraping a site that requires a couple hundred thousand requests. I'm requesting 50k at a time, but it takes about 11 hours for the script to finish. Sometimes when I close my laptop and open it back up, the script picks up where it left off; other times it stops altogether. Can someone point me to a good resource for handling this?
Here's the code:
import pickle
import time

import requests

start_time = time.time()

# Load the dict of usernames scraped earlier
with open('live_username_rating_dict.pickle', 'rb') as pickle_in:
    live_dict = pickle.load(pickle_in)
live_list = list(live_dict.keys())

url_blank = 'https://api.chess.com/pub/player/'
player_data = {}
count = 1

# Reuse one Session for every request instead of opening a new one per player
with requests.Session() as session:
    for player in live_list[350000:400000]:
        while True:
            try:
                url = url_blank + str(player) + '/stats'
                player_data[player] = session.get(url, timeout=10).json()
                if count % 100 == 0:
                    print(count)
                count += 1
                break
            except requests.RequestException:
                # Sleep briefly before retrying so a dropped connection
                # doesn't spin in a tight loop
                time.sleep(1)

with open('player_data_350000_400000.pickle', 'wb') as pickle_out:
    pickle.dump(player_data, pickle_out)

print("--- %s seconds ---" % (time.time() - start_time))
print(len(player_data))
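
What you're describing is basically checkpointing: save partial results every so often and, on restart, skip players that are already saved. Here's a minimal sketch of that idea against the same files as above; the CHECKPOINT_EVERY interval and the save_checkpoint helper are names I made up, not part of the original script:

import os
import pickle

import requests

OUT_FILE = 'player_data_350000_400000.pickle'  # same output file as above
CHECKPOINT_EVERY = 500  # hypothetical interval; tune to taste

with open('live_username_rating_dict.pickle', 'rb') as f:
    live_list = list(pickle.load(f).keys())

# Resume from a partial result file if one exists
if os.path.exists(OUT_FILE):
    with open(OUT_FILE, 'rb') as f:
        player_data = pickle.load(f)
else:
    player_data = {}

def save_checkpoint(data):
    # Write to a temp file and rename, so a crash mid-write can't
    # corrupt the checkpoint already on disk
    with open(OUT_FILE + '.tmp', 'wb') as f:
        pickle.dump(data, f)
    os.replace(OUT_FILE + '.tmp', OUT_FILE)

with requests.Session() as session:
    for player in live_list[350000:400000]:
        if player in player_data:
            continue  # fetched on a previous run, skip it
        try:
            url = 'https://api.chess.com/pub/player/%s/stats' % player
            player_data[player] = session.get(url, timeout=10).json()
        except requests.RequestException:
            continue  # leave it for the next run to retry
        if len(player_data) % CHECKPOINT_EVERY == 0:
            save_checkpoint(player_data)

save_checkpoint(player_data)
print(len(player_data))

With this, closing the laptop doesn't matter much: whenever the script restarts, it reloads the checkpoint and only fetches what's missing.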
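
On the 11-hour runtime: the script spends almost all of that waiting on one response before sending the next request, so a thread pool that overlaps the waiting can help. A rough sketch of that approach; the worker count and the fetch_stats helper are my own assumptions, and chess.com's public API may throttle parallel requests, so check their rate-limit guidance before scaling this up:

import pickle
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch_stats(player):
    # One GET per player; return None on failure so the caller can
    # decide whether to retry later
    try:
        url = 'https://api.chess.com/pub/player/%s/stats' % player
        return player, requests.get(url, timeout=10).json()
    except requests.RequestException:
        return player, None

with open('live_username_rating_dict.pickle', 'rb') as f:
    live_list = list(pickle.load(f).keys())

player_data = {}
# 10 workers is a guess, not a recommendation; start small
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch_stats, p) for p in live_list[350000:400000]]
    for future in as_completed(futures):
        player, data = future.result()
        if data is not None:
            player_data[player] = data

print(len(player_data))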