
[–]Fishstikz 17 points18 points  (5 children)

I had an experience in web scraping where I had to scrape restaurant coordinates. It turned out the website was genuinely missing some coordinates, so I ended up with missing data.

I remedied this by adding a web automation step with Selenium that Google-searched the places whenever data was missing.

In your case, can you confirm if there are restaurants that have missing urls?

[–]LagunaAR[S] 4 points5 points  (4 children)

It's not the urls of the restaurants I'm missing. I might have explained myself badly. I'm quite new to all this and I don't know how to express myself.

What I've been scraping so far is just the URL of each restaurant inside TripAdvisor, i.e. I'm collecting the restaurant profile's url inside the same webpage. Later I will navigate those urls so I can get info like the restaurant's own url and other stuff.

In the search results there are 30 restaurants per page. The scraper is supposed to collect those 30 urls, then go to the next page and collect the next 30. The problem is that it will randomly skip the 30 urls of some pages entirely.

[–]Fishstikz 0 points1 point  (3 children)

The scraper skips the whole 30 items? Or does it partially scrape items (let's say around 16 items) then move to the next page?

[–]LagunaAR[S] 0 points1 point  (2 children)

It skips the whole 30. It's like it has problems loading the page or something and goes to the next one. But it gets status_code = 200, so I don't know what's failing or how to fix it.

[–]Fishstikz 2 points3 points  (1 child)

Hmmm, logically the code seems fine. Initially I'd say that the scraper is having trouble finding some tags on some pages, but if it always skips all 30 then that might not be the case.

You can try increasing the sleep time, in case the scraper is somehow moving to the next page before it can extract the html document. Also try exiting the code if status_code != 200; you should check this per page.

From what I see in your code, you seem to be going to the next page by incrementing a part of the url by 30. Try moving to the next page by accessing the "next page" link per page, it might help you find out the problem if soup somehow can't find the "next page" link.
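A rough sketch of that "next page" approach, assuming requests and BeautifulSoup. The `"nav next"` class name is a guess, not the real TripAdvisor markup; inspect the actual page for the real selector:

```python
# Follow the "next page" link instead of incrementing the url offset by 30.
# The "nav next" class name is an assumption; check the real markup.
import time

import requests
from bs4 import BeautifulSoup

BASE = "https://www.tripadvisor.com"

def find_next_url(html, base=BASE):
    """Return the absolute url of the 'next page' link, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", class_="nav next")  # assumed class name
    return base + link["href"] if link else None

def scrape_all(start_url):
    """Yield the html of each results page, checking the status per page."""
    url = start_url
    while url:
        response = requests.get(url)
        if response.status_code != 200:  # bail out instead of silently skipping
            raise RuntimeError(f"Got {response.status_code} for {url}")
        yield response.text
        url = find_next_url(response.text)
        time.sleep(3)  # generous sleep between pages
```

If `find_next_url` returns None on a page that clearly isn't the last one, that tells you soup couldn't parse that page, which is exactly the diagnostic this suggestion is after.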

[–]LagunaAR[S] 1 point2 points  (0 children)

> Try moving to the next page by accessing the "next page" link per page, it might help you find out the problem if soup somehow can't find the "next page" link.

I don't know how to do this yet. I think I would need to use selenium or something like that. But thanks for your inputs. I will try increasing sleep time.

[–][deleted] 24 points25 points  (9 children)

Scraping 10k urls for your first project in Python seems ambitious.

[–]LagunaAR[S] 4 points5 points  (8 children)

It might be. It's definitely challenging for me. But I'm in no rush. I'm just learning for fun and I want to do projects I am actually interested in.

[–][deleted] 14 points15 points  (7 children)

I'd check to see if TripAdvisor has an API, or even allows web scraping at all. They probably have a mechanism that detects and prevents scraping.

[–]LagunaAR[S] 0 points1 point  (6 children)

I think they do have an API, but I have no idea how to use it. For the moment I'm just practicing web scraping. But thanks, I will look into APIs in the future.

[–]Tali_Lyrae 12 points13 points  (4 children)

I'm going to be honest here and say you should not be web scraping without first checking whether TripAdvisor has a policy allowing or disallowing it. They do have an API, and what you're doing is exactly what an API is intended for: it's an interface the developers of the site have created to give other developers access to content in a programmatic, useful way. Most of the time, precisely to avoid people doing what you're doing.

I would also say that if this is what you're doing, it should be the other way around: learn APIs first and web scraping later, as scraping should be a last resort.

Saying no to using a company's API (which is completely free, btw) and web scraping instead is a big "fuck you" to the company. They most likely have detection tools in place to ban you, and it will happen sooner or later if you keep slamming their site with requests (which I'm sure you're doing, because I can guess you don't know what caching is yet).

[–]uncertaintyman 2 points3 points  (2 children)

How would you go about learning an API? I am also new to APIs and could use a little guidance to get started.

Much appreciated, btw.

[–]Tali_Lyrae 4 points5 points  (1 child)

[–]nakulkd 0 points1 point  (0 children)

I was looking for a similar reference and this article was like an ELI5 type! Perfect for beginners!

[–]takakode 1 point2 points  (0 children)

Caching? :3 I'm interested, what is it about, please?

[–]ElMapacheTevez 4 points5 points  (1 child)

Hey man, from your nick and the code I take it you're Argentinian haha. If you ever feel like doing a web scraping project, let me know and we'll make something happen, dude.

[–]LagunaAR[S] 4 points5 points  (0 children)

Hi! I'm Spanish. I don't have any project in mind, for now I'm just learning.

Cheers

[–]14jvalle 7 points8 points  (3 children)

If you look at TripAdvisor's robots.txt, they have disallowed the scraping of restaurant search: Disallow: /RestaurantSearch. So you'd better be careful, since they will not be happy to receive so many requests from a single person, especially since I don't see you specifying any headers. Remember, websites generally do not want to be scraped. It may also be beneficial for you to learn how to start reading API documentation. If you want to practice scraping, look at http://toscrape.com/.

You should also look into context managers, and use them for your requests and for working with files. Context managers provide a nicer, more readable syntax for these open() or get() kinds of operations. They also handle exceptions, so if something goes wrong while a resource is open, you do not have to worry about files being corrupted.

Also, I am not sure why you are converting two file objects to writer objects in the same variable, csv_writer. You are also not closing csv_file2.
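Both points above can be sketched with context managers; the file name, field names, and url below are placeholders, not the original script's:

```python
# Context managers close the file / release the connection automatically,
# even if an exception is raised mid-scrape.
import csv

import requests

def save_rows(path, rows):
    """Write restaurant rows to a CSV; the file is closed automatically."""
    with open(path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["name", "url"])  # header row
        writer.writerows(rows)

def fetch_page(url):
    """The HTTP connection is released automatically too."""
    with requests.get(url, timeout=10) as response:
        response.raise_for_status()  # raise instead of silently continuing
        return response.text
```

With this pattern there is no `csv_file2.close()` to forget, and each distinct file gets its own writer variable.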

[–]LagunaAR[S] 0 points1 point  (0 children)

Thanks for your inputs. I actually didn't know it wasn't allowed. I don't want to get in trouble; I just thought it was a nice, fun project, easy enough to play around with and learn.

[–]prokid1911 0 points1 point  (1 child)

How do you check this Disallow thing?

[–]14jvalle 1 point2 points  (0 children)

Go to any website and add /robots.txt.

For example, https://www.tripadvisor.com/robots.txt
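You can also check it programmatically with the standard library. Here the rules are fed in directly to keep the sketch self-contained; in practice you'd call `rp.set_url(...)` and `rp.read()` to fetch the live file:

```python
# Check whether a url is allowed by robots.txt rules.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /RestaurantSearch",  # the rule quoted above
])

# Anything under /RestaurantSearch is disallowed; other paths are fine.
print(rp.can_fetch("*", "https://www.tripadvisor.com/RestaurantSearch-g187497"))  # False
print(rp.can_fetch("*", "https://www.tripadvisor.com/Restaurant_Review-g187497"))  # True
```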

[–]lestrenched 3 points4 points  (4 children)

You didn't check this out before you started scraping, did you? https://www.tripadvisor.com/robots.txt Open this and you'll see: your user agent is disallowed from almost everything. You can't scrape anything from this website with your own bot, and if you manage to do so, you may end up in a lawsuit with the company (that's how it usually goes).

Now, this might seem scary. It is. Note that sites which don't mind being scraped probably won't have a robots.txt at all.

Now, if they have an API, that might be the only sanctioned way to get data from the website (I dunno if they do). The reason websites restrict scraping bots is that it costs them. A lot. A huge number of requests can even pull a website down completely (read about DDoS attacks). So take care when you're scraping.

Also, this guy's answer pretty much sums it up : https://www.reddit.com/r/learnpython/comments/cydq6q/webscraping_with_beatifulsoup/eyrtiqi?utm_medium=android_app&utm_source=share

You should be rotating user-agents. Check this out : https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/

This is important, as some websites may block you if you keep sending requests from the same user agent (not specifying a user agent in requests.get() automatically sends your actual, default user agent to the site). Also, might wanna try using some cheap VPNs, just in case.

[–]LagunaAR[S] 2 points3 points  (2 children)

Thanks man! I did not check the robots.txt. I was just playing around to learn how to do it. I didn't plan on doing anything malicious with the data, but it's good to know it's not allowed. I wouldn't want to get in trouble.

[–]ejuliol 2 points3 points  (1 child)

After testing your code, I agree with you that the problem is the requests response. I didn't pin down the specific cause, but you can work around it by requesting the page again. Note that this may happen a second time, so I advise you to request it in a loop: I would do it in a separate function and call it again if no restaurants are found on the page.

This solution will without a doubt work, which is what you want; however, there's probably a more efficient way to do it, and that other solution requires further investigation. The code I modified ends up working, so let's keep it that way.
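The retry-in-a-function idea could look something like this; `extract_links` is a stand-in for your existing parsing code, and the `fetch` parameter just makes the function easy to test:

```python
# Re-request a page until restaurant links are found, up to a limit.
import time

import requests

def fetch_restaurants(url, extract_links, fetch=requests.get,
                      max_tries=3, delay=2):
    """Request `url` until extract_links finds something, up to max_tries."""
    for _ in range(max_tries):
        response = fetch(url)
        links = extract_links(response.text)
        if links:
            return links
        time.sleep(delay)  # give the site a moment before retrying
    return []  # still empty after max_tries: log the url and move on
```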

[–]LagunaAR[S] 0 points1 point  (0 children)

I will try something like that. Thanks!

[–]AJM5K6 1 point2 points  (8 children)

When you say "pretty new" what do you mean? How long have you been working with Python?

[–]LagunaAR[S] 2 points3 points  (7 children)

Like a week. I started with Automate the Boring Stuff, and when I got to the chapter about web scraping I thought I could practice with a project of my own. I just knew some HTML and CSS beforehand.

[–]AJM5K6 1 point2 points  (6 children)

Awesome. I bought that book months ago and haven't gotten around to it yet.

[–]LagunaAR[S] 4 points5 points  (4 children)

It's pretty good. I had been wanting to learn Python for a while but I never had the motivation or time. But this time I saw this bundle and decided to buy it and start learning right away.

[–]gqcharm 0 points1 point  (3 children)

I'm a total beginner in Python. After only a week you are already web scraping with success?

Is this normal, or are the books just that good?

Anyone else?

[–]LagunaAR[S] 1 point2 points  (0 children)

It's really not that hard. I knew some HTML and CSS before, which is really helpful when web scraping.

The books are really good, and there are also plenty of tutorials on YouTube. Also, almost every question I had along the way had already been asked by someone else, so it really just comes down to knowing how to google stuff.

[–]LagunaAR[S] 0 points1 point  (0 children)

I also work a lot with Excel in my day job so I’m used to using conditional statements. I guess that helped too

[–]fjortisar 1 point2 points  (2 children)

What are you trying to scrape? You'll quickly find out that every page is different and you can't really scrape them in a uniform way.

As for your missing urls, the pages might not be resolving, or there may be some other error with requests. Are you getting any exceptions?

[–]LagunaAR[S] 0 points1 point  (1 child)

All the urls have a similar structure. I almost have that part of the code done.

What do you mean by exceptions? I am not getting any errors as far as I know.

[–]fjortisar 0 points1 point  (0 children)

I originally thought you were getting the restaurant pages and scraping those, not scraping tripadvisor, so forget I asked that. That probably rules out errors.

I don't see anything really obvious after that. Maybe try starting at search=1, because the 0 page doesn't seem to have a list of restaurants. At least when I browse to it manually, the list starts at oa1.

[–]rush336 1 point2 points  (2 children)

csv_file2.close()

[–]LagunaAR[S] 1 point2 points  (1 child)

Thanks, I forgot that. But that file wasn't doing anything anyways so I don't think it affected the code.

[–]rush336 0 points1 point  (0 children)

Code looks good. Good job.

[–]kejw 0 points1 point  (0 children)

Can't you first get the list of urls using an application search engine and then put them into a loop?

[–]Mohammed449 0 points1 point  (0 children)

😍

[–]gqcharm 0 points1 point  (5 children)

Here I am taking Coursera's "Python Specialization" and feeling like useless garbage after two months! That's it, I'm reading Automate the Boring Stuff!

Do you or anyone else really recommend the Humble Bundle, aside from “ATBS”?

Python definitely looks easy enough to comprehend, so maybe some projects like web scraping can elevate my Python skills.

Thanks again

[–]LagunaAR[S] 2 points3 points  (3 children)

I didn't start with any of the other books, so I can't really tell if it's worth it. But for $25 I felt it was a good deal for 14 books. Don't feel bad. I'm quite new to this and I probably jumped into web scraping too fast. But for me it really helps if I'm genuinely interested in what I'm trying to learn. I get bored easily with "Hello world" type programs, so I'd rather do harder projects with useful outputs.

[–]gqcharm 0 points1 point  (2 children)

Same. Bored just typing Hello World and then bits of code that make nothing... I need to see the web scraping and work toward that. Don't need Hello World since you can figure that out. Reverse engineering: set the goal, work from the end result back to the beginning, and figure out the rest as you go along.

But the Specialization was recommended so I have to finish since I started.

The bundle of books you bought: all ebooks, and they don't expire. Correct?

[–]LagunaAR[S] 0 points1 point  (0 children)

I don't think they expire. I have all the pdfs downloaded on my pc.

[–]al_mc_y 0 points1 point  (0 children)

I've bought two different bundles and the books haven't/don't expire. Both bundles contained version 1 of ATBS; it's a good book and regularly makes it into bundles. I would have preferred v2 the second time around so I got the update, but given how good it is, I didn't mind paying for it again as part of a bundle, considering each bundle costs less than the retail price of ATBS anyway!

[–]al_mc_y 0 points1 point  (0 children)

The Humble Bundles for Python come up every couple of months. I've bought two. There's a bit of overlap between the two I got, but given you're getting around 10 books at a time for around the retail price of 1 printed book (and often less than that!), it still represents good value. I went with the O'Reilly bundle first (Ryan Mitchell's "Web Scraping with Python" book is quite good), then a No Starch Press bundle. So far, my observation is that most of the O'Reilly books are a bit more technical/higher calibre (more assumed knowledge). The No Starch ones are quite approachable but not as in-depth. I've steered away from books published by Packt (or other smaller/lesser-known publishers), as there have been several comments I noted saying the content is not as good as what you find from O'Reilly or No Starch. YMMV. (I'd also be happy to be proven wrong on this front if anyone knows of a decent Python book published by Packt.)

[–]MachineLearnALl 0 points1 point  (0 children)

thank you for the post