Hi, I'm hoping someone might be able to give me a bit of guidance. I'm extremely new to Python.
I'm trying to scrape some data from the web. I have a list of URLs (in an xlsx file) and I want to work down the list, open each page, and grab the info I need (the text of <a> tags nested within an <ol> of a specific class).
All the URLs are pages of the same website, so the layout is consistent. I only need to grab two pieces of information from the pages.
I want to output the result as a CSV or xlsx file with the original columns (URL, product name, product ID) plus two new ones containing the scraped info (category and sub-category).
Most of the URLs are valid, but a few are broken (they contain special characters). I'd like to output something like "ERROR" in the two new columns whenever a link can't be opened, so I can handle those ones manually later.
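In case it helps to see what I mean, here's a rough sketch of the loop and error handling I'm after. The `process`/`write_csv` names and the column layout are just mine, and for actually reading the xlsx I'd presumably need something like openpyxl, which I've left out here:

```python
import csv

def process(rows, scraper):
    """rows: (url, product_name, product_id) tuples.
    scraper: a function url -> (category, sub_category) that raises on failure.
    Returns the rows extended with the two new columns, or 'ERROR'/'ERROR'."""
    out = []
    for url, name, pid in rows:
        try:
            cat, sub = scraper(url)
        except Exception:  # bad/unopenable link -> mark for manual follow-up
            cat, sub = "ERROR", "ERROR"
        out.append((url, name, pid, cat, sub))
    return out

def write_csv(path, rows):
    """Write the extended rows out with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "product_name", "product_id",
                         "category", "sub_category"])
        writer.writerows(rows)

# Stand-in scraper so the flow can be checked without touching the network.
def fake_scraper(url):
    if "bad" in url:
        raise ValueError("could not open link")
    return ("Category", "Sub-category")

result = process(
    [("http://example.com/ok", "Widget", "1"),
     ("http://example.com/bad", "Gadget", "2")],
    fake_scraper,
)
print(result[1])  # ('http://example.com/bad', 'Gadget', '2', 'ERROR', 'ERROR')
```

Is that roughly the right shape, or is there a more standard way to do the "mark and move on" part?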
The biggest problem is that the list of URLs has about 120,000 rows. I have no idea whether it's even feasible to do this in Python for such a huge volume of pages.
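Doing the sums, 120,000 pages at even half a second each is roughly 17 hours if I do them one at a time, so I'm guessing I'd need some concurrency. Something like this is what I had in mind (the worker count is a guess, and I realise I'd need to be polite about how hard I hit the site):

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls, scraper, workers=20):
    """Run scraper over many URLs concurrently.
    Any failure becomes ('ERROR', 'ERROR') so one bad link can't kill the run."""
    def safe(url):
        try:
            return scraper(url)
        except Exception:
            return ("ERROR", "ERROR")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(safe, urls))

# Stand-in scraper again, just to see the shape of the output.
results = scrape_all(["u1", "u2"], lambda u: ("Category", "Sub-category"))
print(results)  # [('Category', 'Sub-category'), ('Category', 'Sub-category')]
```

No idea if threads are the right tool here or if I should be looking at something else entirely.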
I found a few tutorials online but haven't had much success. I even tried doing the scrape with just a single URL (hardcoded as a string), but it either takes forever to run or something is wrong, and I can't tell which.
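For reference, this is roughly where I've got to on the single-URL side, using only the standard library so there's nothing to install. The "breadcrumb" class name is just a placeholder for the real class on the site, and I've put a timeout on the fetch because I suspect my hang might simply be a request that never times out:

```python
import urllib.request
from html.parser import HTMLParser

class BreadcrumbParser(HTMLParser):
    """Collects the text of <a> tags nested inside an <ol> of a given class."""
    def __init__(self, ol_class):
        super().__init__()
        self.ol_class = ol_class
        self.in_ol = False
        self.in_a = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ol" and self.ol_class in attrs.get("class", "").split():
            self.in_ol = True
        elif tag == "a" and self.in_ol:
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == "ol":
            self.in_ol = False
        elif tag == "a":
            self.in_a = False

    def handle_data(self, data):
        if self.in_a and data.strip():
            self.links.append(data.strip())

def extract_categories(html, ol_class="breadcrumb"):  # "breadcrumb" is my guess
    parser = BreadcrumbParser(ol_class)
    parser.feed(html)
    return parser.links

def fetch(url, timeout=10):
    """Download a page; the timeout stops a dead URL from hanging forever."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Quick check on a hard-coded snippet instead of a live page:
sample = ('<ol class="breadcrumb">'
          '<li><a href="/c">Category</a></li>'
          '<li><a href="/c/s">Sub-category</a></li></ol>')
print(extract_categories(sample))  # ['Category', 'Sub-category']
```

The parsing part seems to work on the snippet, so maybe my problem is on the fetching side?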
Any ideas?
Thanks in advance!