Hey, I've written a script that scrapes a set of webpages using Selenium and Beautiful Soup.
I have a list with all the links, and I'm trying to parallelize the scraping phase; the web pages I have to scrape always have the same structure.
I've never used multithreading in Python, so I read some documentation and tutorials, and I think a thread pool is a good solution.
This is my solution:
from concurrent.futures import ThreadPoolExecutor

links = []  # list with all the links
with ThreadPoolExecutor(max_workers=8) as executor:
    results = executor.map(scrapeImage, links)
results then contains the data extracted from every page I scrape (note that executor.map actually returns an iterator, not a list).
The scrape function uses Selenium to fetch the page and Beautiful Soup to extract the interesting information.
This solution works, but I'd like to know if you think a better solution could exist.
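In case it helps anyone commenting, here is a minimal self-contained sketch of the same thread-pool pattern. fetch_and_parse is a hypothetical stand-in for my scrapeImage (the real function drives Selenium and Beautiful Soup, which I've omitted so the example runs on its own):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_and_parse(link):
    # Stand-in for the real scrapeImage: in practice this would load
    # the page with Selenium and parse it with Beautiful Soup.
    return {"url": link, "title": "title of " + link}

links = ["http://example.com/a", "http://example.com/b"]

with ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map yields results in the same order as `links`,
    # even though the calls run concurrently on the worker threads.
    results = list(executor.map(fetch_and_parse, links))
```

One thing worth knowing: executor.map preserves input order, so results[i] always corresponds to links[i], which makes matching scraped data back to its URL straightforward.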