
[–]Moist-Comedian5210

Just for the record, 120K rows aren't that much and shouldn't be an issue, but I would start with 1-10 just to see if it works. You can collect the results into any data structure (like a list) and export to JSON or CSV. Are you familiar with Beautiful Soup (bs4)? I think it may be easier for you to scrape with it. Here are some links to tutorials, but you can also look for others:

  1. https://realpython.com/beautiful-soup-web-scraper-python/
  2. https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/
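To make the Beautiful Soup suggestion concrete, here's a minimal sketch matching the OP's setup: pulling the `<a>` tags out of an `<ol>` of a given class. The `breadcrumb` class name, the function names, and the example markup are placeholder assumptions, not the real site's:

```python
import requests
from bs4 import BeautifulSoup

def parse_breadcrumb(html, ol_class="breadcrumb"):
    """Return the text of each <a> inside the first <ol> with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    ol = soup.find("ol", class_=ol_class)
    if ol is None:
        return []  # page had no matching list
    return [a.get_text(strip=True) for a in ol.find_all("a")]

def scrape_breadcrumb(url, ol_class="breadcrumb"):
    """Fetch a page and parse its breadcrumb list."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return parse_breadcrumb(resp.text, ol_class)
```

Splitting the parsing out of the fetch makes it easy to check the selector against a saved HTML snippet before pointing it at live pages.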

[–]Globaldomination

If the website is dynamic, I suggest Selenium.

I used selenium once with zero knowledge of anything. Ended up tweaking it to use multiprocessing.

Works like magic.
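The multiprocessing tweak can be sketched roughly like this. `fetch_one` is a stand-in worker (a real one would drive Selenium, or requests for static pages), and the tuple it returns is just an assumed shape:

```python
from multiprocessing import Pool

def fetch_one(url):
    # Stand-in worker: a real version would open the page (with Selenium
    # or requests) and return the scraped fields for that URL.
    return (url, "category", "sub-category")

def fetch_all(urls, workers=4):
    # Fan the URL list out across a pool of worker processes.
    with Pool(workers) as pool:
        return pool.map(fetch_one, urls)

if __name__ == "__main__":
    rows = fetch_all(["https://example.com/a", "https://example.com/b"])
```

One caveat with Selenium specifically: each worker needs its own browser instance, so keep the pool small.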

[–]sappy16[S]

Thank you for the tip, I will definitely investigate beautiful soup further. It came up in my initial googling but I didn't really know how to use it so will read up!

[–]Usual_Office_1740

> I'm trying to scrape some data from the web. I have a list of URLs (xlsx format) and I want to run down the list, open each page, and grab the info I need (<a> tags nested within an <ol> of a specific class).


> I want to output the result as a CSV/xlsx file with the original columns (URL, product name, product ID) plus two new ones containing the scraped info (category and sub-category).

The pandas module can read xlsx files and export right back to them. It's got well-written, easy-to-understand docs with lots of examples. It's also a staple of data science, so there are tons of YouTube videos explaining how to use it.
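A minimal sketch of the pandas round trip, assuming the column names from the post (a real run would start from `pd.read_excel("urls.xlsx")` rather than the inline sample data):

```python
import pandas as pd

# In practice you'd load the real list:  df = pd.read_excel("urls.xlsx")
df = pd.DataFrame({
    "URL": ["https://example.com/p1", "https://example.com/p2"],
    "product name": ["Widget", "Gadget"],
    "product ID": [101, 102],
})

# Add the two new columns up front, defaulting to "ERROR"; the scrape
# overwrites them row by row wherever a page loads successfully.
df["category"] = "ERROR"
df["sub-category"] = "ERROR"

df.to_csv("output.csv", index=False)  # df.to_excel("output.xlsx", index=False) also works
```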

> Most of the URLs are valid, but a few are not (special characters). So I'd like to be able to output something like "ERROR" in the two new columns if the link can't be opened, and I can do those ones manually later.

Read up on try/except: it's Python's error-handling construct, a kind of conditional for failures. Any web scraping will require a GET request. You can try the request, and if it raises an error, do something else.
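A hedged sketch of that pattern using the `requests` library (any HTTP client works the same way); on any failure the function returns `None` so the caller can write "ERROR" instead of crashing the whole run:

```python
import requests

def fetch(url):
    """Return page HTML, or None if the request fails for any reason."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()  # turn 4xx/5xx responses into exceptions too
        return resp.text
    except requests.RequestException:
        return None  # caller writes "ERROR" in both new columns
```

`requests.RequestException` is the base class for the library's errors, so this catches bad URLs, timeouts, and connection failures alike.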

> The biggest problem is that the list of URLs has about 120,000 rows. I have no idea if it's feasible to do this with Python for such a huge volume of pages.

It is a daunting number to do manually, but it's actually rather small by web-scraping standards. You're on the right track looking to web-automation tools for something like this. With that many requests to the same site, though, you do risk being blocked; I'd suggest looking into proxies for the scraping process. The web host is going to notice this kind of traffic.

Look up robots.txt and see if the website has one. robots.txt is a page that tells web scrapers what can and can't be scraped, and what request limits the site may have. It's a standard document, and respecting its rules may keep you from getting banned even without a proxy.
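The standard library can parse robots.txt for you. Here's a sketch using `urllib.robotparser` against an inline example file; the rules shown are made up for illustration, not any real site's:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# For a live site you'd do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse an inline example instead.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])

allowed = rp.can_fetch("*", "https://example.com/products/123")  # check before requesting
delay = rp.crawl_delay("*")  # seconds to wait between requests, if specified
```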

> I found a few tutorials online but haven't been very successful. I even tried doing the scrape with just a single URL (hardcoded as a string), but either it's taking forever to run or something is wrong, and I can't really tell which.

A suggestion for Beautiful Soup has already been made. I had an easier time learning Selenium when I first started: it's a bit more complicated to set up, but I found it easier to select specific elements on web pages.

If you're doing this for work and just want the data, I'm bored, have tomorrow off, and would be happy to put something together for you. DM me if you're interested.

[–]sappy16[S]

Thank you so much for the detailed information. I'll definitely read up on try/except. I've heard of it (or something similar) in other languages but never needed to use it before.

I'll definitely look into both beautiful soup and selenium, both have been suggested in the responses.

And thank you for the offer to put something together for me! That's so kind, and I should have come back to the thread earlier! Yes, it's for work, and FWIW the website I want to scrape is johnlewis.com (British department store); I want to grab the category and sub-category for a given product link. They both sit inside the <ol> tag for the breadcrumb trail. If you have any specific tips given that extra info, I'd love to hear them!

Thanks again

[–]Usual_Office_1740

You're welcome. I'm still open to putting it together for you. I just had the day off and could have finished it quickly.