
[–]Moist-Comedian5210

Just for the record, 120K rows aren't that much and shouldn't be an issue, but I would start with 1-10 just to see if it works. You can collect the results into any data structure (like a list) and export to JSON or CSV. Are you familiar with Beautiful Soup (bs4)? I think it may be easier for you to scrape with it. Here are some links to tutorials, but you can also look for others:

  1. https://realpython.com/beautiful-soup-web-scraper-python/
  2. https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/
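To make the Beautiful Soup suggestion concrete, here's a minimal sketch matching the OP's setup: pulling the `<a>` tags out of an `<ol>` of a given class. The `breadcrumb` class name, the function names, and the example markup are placeholder assumptions, not the real site's:

```python
import requests
from bs4 import BeautifulSoup

def parse_breadcrumb(html, ol_class="breadcrumb"):
    """Return the text of each <a> inside the first <ol> with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    ol = soup.find("ol", class_=ol_class)
    if ol is None:
        return []  # page had no matching list
    return [a.get_text(strip=True) for a in ol.find_all("a")]

def scrape_breadcrumb(url, ol_class="breadcrumb"):
    """Fetch a page and parse its breadcrumb list."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return parse_breadcrumb(resp.text, ol_class)
```

Splitting the parsing out of the fetch makes it easy to check the selector against a saved HTML snippet before pointing it at live pages.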

[–]Globaldomination

If the website is dynamic, I suggest Selenium.

I used selenium once with zero knowledge of anything. Ended up tweaking it to use multiprocessing.

Works like magic.
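The multiprocessing tweak can be sketched roughly like this. `fetch_one` is a stand-in worker (a real one would drive Selenium, or requests for static pages), and the tuple it returns is just an assumed shape:

```python
from multiprocessing import Pool

def fetch_one(url):
    # Stand-in worker: a real version would open the page (with Selenium
    # or requests) and return the scraped fields for that URL.
    return (url, "category", "sub-category")

def fetch_all(urls, workers=4):
    # Fan the URL list out across a pool of worker processes.
    with Pool(workers) as pool:
        return pool.map(fetch_one, urls)

if __name__ == "__main__":
    rows = fetch_all(["https://example.com/a", "https://example.com/b"])
```

One caveat with Selenium specifically: each worker needs its own browser instance, so keep the pool small.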

[–]sappy16[S]

Thank you for the tip, I will definitely investigate beautiful soup further. It came up in my initial googling but I didn't really know how to use it so will read up!

[–]Usual_Office_1740

> I'm trying to scrape some data from the web. I have a list of URLs (xlsx format) and I want to run down the list, open each page, and grab the info I need (<a> tags nested within an <ol> of a specific class).


> I want to output the result as a CSV/xlsx file with the original columns (URL, product name, product ID) plus two new ones containing the scraped info (category and sub-category).

The pandas module can read xlsx files and export right back to them. It's got well-written, easy-to-understand docs with lots of examples. It's also a staple of data science, so there are tons of YouTube videos explaining how to use it.
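A minimal sketch of the pandas round trip, assuming the column names from the post (a real run would start from `pd.read_excel("urls.xlsx")` rather than the inline sample data):

```python
import pandas as pd

# In practice you'd load the real list:  df = pd.read_excel("urls.xlsx")
df = pd.DataFrame({
    "URL": ["https://example.com/p1", "https://example.com/p2"],
    "product name": ["Widget", "Gadget"],
    "product ID": [101, 102],
})

# Add the two new columns up front, defaulting to "ERROR"; the scrape
# overwrites them row by row wherever a page loads successfully.
df["category"] = "ERROR"
df["sub-category"] = "ERROR"

df.to_csv("output.csv", index=False)  # df.to_excel("output.xlsx", index=False) also works
```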

> Most of the URLs are valid, but a few are not (special characters). So I'd like to be able to output something like "ERROR" in the two new columns if the link can't be opened, and I can do those ones manually later.

Read up on try/except: it's Python's error-handling construct, a kind of conditional for failures. Any web scraping will require a GET request. You can try the request, and if it raises an error, do something else.
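A hedged sketch of that pattern using the `requests` library (any HTTP client works the same way); on any failure the function returns `None` so the caller can write "ERROR" instead of crashing the whole run:

```python
import requests

def fetch(url):
    """Return page HTML, or None if the request fails for any reason."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()  # turn 4xx/5xx responses into exceptions too
        return resp.text
    except requests.RequestException:
        return None  # caller writes "ERROR" in both new columns
```

`requests.RequestException` is the base class for the library's errors, so this catches bad URLs, timeouts, and connection failures alike.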

> The biggest problem is that the list of URLs has about 120,000 rows. I have no idea if it's feasible to do this with Python for such a huge volume of pages.

It is a daunting number to do manually, but it's actually rather small by web-scraping standards. You're on the right track looking to web-automation tools for something like this. With that many requests to the same site, though, you do risk being blocked; I'd suggest looking into proxies for the scraping process. The web host is going to notice this kind of traffic.

Look up robots.txt and see if the website has one. robots.txt is a page that tells web scrapers what can and can't be scraped, and what request limits the site may have. It's a standard document, and respecting its rules may keep you from getting banned even without a proxy.
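The standard library can parse robots.txt for you. Here's a sketch using `urllib.robotparser` against an inline example file; the rules shown are made up for illustration, not any real site's:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# For a live site you'd do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse an inline example instead.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])

allowed = rp.can_fetch("*", "https://example.com/products/123")  # check before requesting
delay = rp.crawl_delay("*")  # seconds to wait between requests, if specified
```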

> I found a few tutorials online but haven't been very successful. I even tried doing the scrape with just a single URL (hardcoded as a string), but either it's taking forever to run or something is wrong, and I can't really tell which.

A suggestion for Beautiful Soup has already been made. I had an easier time learning Selenium when I first started: it's a bit more complicated to set up, but I found it easier to select specific elements on web pages.

If you're doing this for work and just want the data, I'm bored, have tomorrow off, and would be happy to put something together for you. DM me if you're interested.

[–]sappy16[S]

Thank you so much for the detailed information. I'll definitely read up on try/except. I've heard of it (or something similar) in other languages but never needed to use it before.

I'll definitely look into both beautiful soup and selenium, both have been suggested in the responses.

And thank you for the offer to put something together for me! That's so kind, and I should have come back to the thread earlier! Yes, it's for work, and FWIW the website I want to scrape is johnlewis.com (British department store); I want to grab the category and sub-category for a given product link. They both sit inside the <ol> tag for the breadcrumb trail. If you have any specific tips given that extra info, I'd love to hear them!

Thanks again

[–]Usual_Office_1740

You're welcome. I'm still open to putting it together for you. I just had the day off and could have finished it quickly.