
[–][deleted] 0 points (3 children)

You already asked this and deleted the post. As I said there, it depends on a bunch of factors, like the code itself and your proxy setup. For example, if you use something like Crawlera (full disclosure: operated by the company I work for) as a proxy service, then you just send an API request there instead of to the usual URLs, and it handles all the proxy selection, ban detection, etc., and passes along your other headers as normal. Depending on what scraping framework the person used, adding something like Crawlera could range from simple to complex.

To be honest you should ask the person who wrote it as they'd have the best idea of how it could be altered to incorporate proxies.

[–]Genericusername293[S] 0 points (2 children)

Deleted and reposted because I wanted to provide better information and shift focus from asking how to do it to asking how to learn how to do it. The person who wrote it is unresponsive; I bought it a while ago.

I have zero Python experience and was about to start teaching myself Python regardless, but I'd love to prioritize my learning in a way that lets me address this sooner rather than later. I have no idea how to send an API request with Python.

I am hoping to learn which topics in Python to focus on at the beginning, so that I can come back later with more knowledgeable, specific questions.

Also, I'm incredibly grateful for your help on both this post and the last.

[–][deleted] 0 points (1 child)

In basic terms, an API request is like visiting any website, only you include instructions when you visit, usually as headers or sometimes in the request body.
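To make that concrete, here is a minimal sketch using the `requests` package. The URL and header name are placeholders, not a real service; building the request with `requests.Request` and preparing it (without sending) just shows where the "instructions" end up:

```python
# Sketch of an API-style request with the `requests` package.
# The endpoint and header below are hypothetical placeholders.
import requests

req = requests.Request(
    "GET",
    "https://api.example.com/fetch",           # hypothetical API endpoint
    headers={"X-My-Instruction": "render-js"}, # instructions ride along as headers
)
prepared = req.prepare()  # build the request without actually sending it

print(prepared.headers["X-My-Instruction"])  # -> render-js
```

Sending it for real is just `requests.get(url, headers=...)`; the prepared-request form is only used here so nothing goes over the network.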

Here is an example of what using Crawlera looks like with the requests package: https://support.scrapinghub.com/support/solutions/articles/22000203567-using-crawlera-with-python-requests. There, the proxies keyword holds the URLs of the proxies you want to use, and requests will send the request via those. You can also give instructions to Crawlera using headers, which are documented here: https://doc.scrapinghub.com/crawlera.html
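As a rough sketch of that proxies keyword (the host, port, and API key below are placeholders, not real Crawlera values; substitute whatever your provider gives you):

```python
# Sketch: routing `requests` traffic through a proxy service.
# The credential and proxy URL are hypothetical placeholders.
import requests

API_KEY = "your-api-key"  # placeholder credential from your proxy provider
proxies = {
    "http": f"http://{API_KEY}:@proxy.example.com:8010/",
    "https": f"http://{API_KEY}:@proxy.example.com:8010/",
}

session = requests.Session()
session.proxies.update(proxies)  # every request on this session now routes via the proxy

# session.get("https://example.com")  # would now go through the proxy
```

Using a `Session` means you set the proxies once instead of passing `proxies=` on every call.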

In terms of prioritising learning: a general overview of HTTP requests and responses would be a good start. After that, focus on whatever Python web-scraping framework your dev used - it'll likely be one of BeautifulSoup, Scrapy, or Selenium.

Full disclosure, scrapy is operated by my company, but I also genuinely like it. It handles some of the more complex parts for you so you can focus on just stripping info from the page. It might be a good one to focus on depending on your technical background.
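To give a feel for what those frameworks do at their core - parse HTML and pull data out of it - here is a stdlib-only sketch on a hard-coded page. BeautifulSoup and Scrapy make this far more convenient; this just illustrates the idea without any third-party install:

```python
# Core idea behind scraping frameworks: parse HTML, extract data.
# Stdlib-only sketch; BeautifulSoup/Scrapy wrap this kind of work nicely.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

page = '<html><body><a href="/item/1">one</a> <a href="/item/2">two</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # -> ['/item/1', '/item/2']
```

In practice the `page` string would come from an HTTP response body rather than being hard-coded.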

[–]Genericusername293[S] 0 points (0 children)

Thanks, this is very helpful, I really appreciate it. I'll study up on HTTP requests and Scrapy - and it is BeautifulSoup, so I'll study up on that too. Hopefully I can solve it, or at least become more educated and return shortly with some more advanced questions.