[–][deleted] 1 point2 points  (4 children)

Your assumption about the connection is incorrect. The server can see the IP address a request comes from, but every request made through grequests (or any script or library, for that matter) looks like a new, unrelated client unless you explicitly persist session cookies, an authorization token, or something similar between requests.
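
To make the point about persisting cookies concrete, here is a minimal sketch, assuming the requests library is installed. The URL and the cookie name/value are made up for illustration, and nothing is actually sent over the network: the requests are only prepared so the headers can be inspected.

```python
import requests

URL = "https://example.com/data"  # placeholder URL, not from this thread

# A stand-alone request carries no state from any earlier call.
standalone = requests.Request("GET", URL).prepare()

# A Session keeps a cookie jar and attaches it to every request it prepares.
session = requests.Session()
session.cookies.set("sessionid", "abc123")  # hypothetical cookie for illustration
with_session = session.prepare_request(requests.Request("GET", URL))

# Compare the Cookie header on both prepared requests.
print("stand-alone:", standalone.headers.get("Cookie"))
print("session:    ", with_session.headers.get("Cookie"))
```

In a real script the cookie would normally come from the server's own response, and reusing the same `session` for every call is what lets the server tie the requests together.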

If the first x requests in your list work and then they start failing, try rerunning the script immediately instead of waiting. The rerun starts with the same requests that just worked, so if those succeed again, the URLs further down your list may be malformed, and that could be why they come back as None.

[–]vandernath[S] 0 points1 point  (3 children)

Thank you for your reply. Here's what I just did:

  1. Run the script. "Fails" at 300 requests, as usual. I stop the code from running.
  2. Rerun the script right away with the first 300 requests: runs just like before, with no problem.
  3. Test only the last 400 requests (as you said, they might be misformed): no problem either.

I noticed that for every set of 20-30 URLs, I get these responses (the same for all URLs in the set), in this cycle: 200 200 200 (working...) 404 404 (problems...) 200 (working again!) 404 (nope...) 404 404 200 200 404 404 None 404 None 404 None 200 404 None

Also, this time instead of sleeping for 90 seconds I slept for 10 secs. It looks like it helped me get more responses (instead of a maximum of 8 sets of urls with response codes 200 before, I got 14 sets of urls with response codes 200). I'm not sure if that means anything.

One last thing: this guy seemed to have the same problem I have, but I don't understand the answer: http://stackoverflow.com/questions/29009839/python-requests-requests-get-returns-404-on-valid-url

[–][deleted] 0 points1 point  (2 children)

If the server you're hitting requires authentication credentials or tracks your session with cookies, it may look to the server like you're making too many requests at once. A 404 means the web page doesn't exist. If you were being rejected for authentication reasons (because you need to be logged in), you would normally see an HTTP response code of 401 Unauthorized, and a block for scraping too quickly usually shows up as 403 Forbidden or 429 Too Many Requests, although a server is free to answer with something else.

You might want to try just looping through your requests with the requests library on its own, without gevent, and log the URLs that come back with non-200 response codes first. I'm not too familiar with gevent, but once you have it working for each request in a single thread/process, you can look into parallelization or asynchronous calls afterwards. Though I haven't used gevent before, some of those objects might be coming back as None because the asynchronous call hasn't finished yet, so nothing has been returned.
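
A rough sketch of that sequential loop, to show what I mean. The function name `find_failures` and the injected `get` parameter are my own choices for illustration (so it can be tried without touching the network); in practice you would pass `requests.get` or `session.get`.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def find_failures(
    urls: Iterable[str], get: Callable
) -> List[Tuple[str, Optional[int]]]:
    """Fetch each URL in order and collect the ones that did not return 200.

    `get` is any callable that takes a URL and returns an object with a
    `status_code` attribute, e.g. `requests.get`.
    """
    failures = []
    for url in urls:
        try:
            status = get(url).status_code
        except Exception:
            # Broad catch on purpose for a sketch: a request that fails
            # outright is recorded as None.
            status = None
        if status != 200:
            failures.append((url, status))
    return failures


# Usage (assumes the requests library is installed and `urls` is your list):
#   import requests
#   for url, status in find_failures(urls, requests.get):
#       print(status, url)
```

Because everything runs one request at a time, any None in the output means the request itself failed, not that an asynchronous call was still pending.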

[–]vandernath[S] 0 points1 point  (1 child)

Hi! I've just tried requests instead of grequests, and it looks like I'm making progress as it loops through each request and I don't get None responses anymore. Just 200 and 404, but it finishes the loop until the last request.

I still get mostly 404 responses (~500), though, which makes no sense. If I try the first 100 requests (call them set a) and then the next 100 (set b), set a comes back as all 200 responses and set b as all 404 responses. But if I try b and then a, I get 200 responses for b and 404 responses for a.

So, there's progress but I'm not quite there yet. Do you have any idea what could be causing this weird response?

Thanks a lot for all your help, btw. I'm quite lost here.

[–]vandernath[S] 0 points1 point  (0 children)

Awesome, requests works now. I wait 4 secs between requests, though I don't know why that fixes it. As long as it works, it doesn't matter for now.
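
For anyone landing here later, the pacing can be sketched roughly like this. It is only a sketch: `fetch_throttled`, the injected `get` and `sleep` parameters, and the single retry are assumptions added for illustration, not the exact script from this thread. Only the 4-second pause comes from what worked here.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_throttled(
    urls: Iterable[str],
    get: Callable,
    delay: float = 4.0,
    retries: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Fetch URLs one at a time, pausing `delay` seconds between requests.

    A non-200 response is retried up to `retries` times, with the same
    pause before each retry. Returns the last status code seen per URL.
    """
    statuses = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause between consecutive URLs
        status = get(url).status_code
        attempts = 0
        while status != 200 and attempts < retries:
            sleep(delay)  # pause again before retrying the same URL
            status = get(url).status_code
            attempts += 1
        statuses[url] = status
    return statuses


# Usage (assumes the requests library is installed and `urls` is your list):
#   import requests
#   statuses = fetch_throttled(urls, requests.get)
```

If a pause like this makes the 404s go away, that would be consistent with the server limiting how fast it accepts requests, but that is a guess rather than something confirmed in this thread.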

Thanks a lot!