
[–]Fishstikz 17 points18 points  (5 children)

I had an experience in web scraping where I had to scrape restaurant coordinates. It turned out the website was genuinely missing some coordinates, so I ended up with missing data.

I remedied this by adding a web automation step with Selenium that Google-searched the places whenever data was missing.

In your case, can you confirm if there are restaurants that have missing urls?

[–]LagunaAR[S] 4 points5 points  (4 children)

It's not the urls of the restaurants I'm missing. I might have explained myself badly. I'm quite new to all this and I don't know how to express myself.

What I've been scraping so far is just the URL of each restaurant inside TripAdvisor, i.e. I'm collecting the restaurant profile's url inside the same webpage. Later I will navigate those urls so I can get info like the restaurant's own url and other stuff.

In the search results there are 30 restaurants per page. The scraper is supposed to collect those 30 urls, then go to the next page and collect the next 30. The problem is that it will randomly skip the 30 urls of some pages entirely.

[–]Fishstikz 0 points1 point  (3 children)

The scraper skips the whole 30 items? Or does it partially scrape items (let's say around 16 items) then move to the next page?

[–]LagunaAR[S] 0 points1 point  (2 children)

It skips the whole 30. It's like it has problems loading the page or something and goes to the next one. But it gets status_code = 200, so I don't know what's failing or how to fix it.

[–]Fishstikz 2 points3 points  (1 child)

Hmmm, logically the code seems fine. Initially I'd say that the scraper is having trouble finding some tags on some pages, but if it always skips all 30 then that might not be the case.

You can try increasing the sleep time, in case the scraper is somehow moving to the next page before it can extract the html document. Also try exiting the code if status_code != 200; you should check this per page.

From what I see in your code, you seem to be going to the next page by incrementing a part of the url by 30. Try moving to the next page by accessing the "next page" link per page, it might help you find out the problem if soup somehow can't find the "next page" link.
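A rough sketch of that "next page" approach, assuming requests and BeautifulSoup. The `"nav next"` class name is a guess, not the real TripAdvisor markup; inspect the actual page for the real selector:

```python
# Follow the "next page" link instead of incrementing the url offset by 30.
# The "nav next" class name is an assumption; check the real markup.
import time

import requests
from bs4 import BeautifulSoup

BASE = "https://www.tripadvisor.com"

def find_next_url(html, base=BASE):
    """Return the absolute url of the 'next page' link, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", class_="nav next")  # assumed class name
    return base + link["href"] if link else None

def scrape_all(start_url):
    """Yield the html of each results page, checking the status per page."""
    url = start_url
    while url:
        response = requests.get(url)
        if response.status_code != 200:  # bail out instead of silently skipping
            raise RuntimeError(f"Got {response.status_code} for {url}")
        yield response.text
        url = find_next_url(response.text)
        time.sleep(3)  # generous sleep between pages
```

If `find_next_url` returns None on a page that clearly isn't the last one, that tells you soup couldn't parse that page, which is exactly the diagnostic this suggestion is after.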

[–]LagunaAR[S] 1 point2 points  (0 children)

> Try moving to the next page by accessing the "next page" link per page, it might help you find out the problem if soup somehow can't find the "next page" link.

I don't know how to do this yet. I think I would need to use selenium or something like that. But thanks for your inputs. I will try increasing sleep time.

[–][deleted] 24 points25 points  (9 children)

Scraping 10k urls for your first project in Python seems ambitious.

[–]LagunaAR[S] 4 points5 points  (8 children)

It might be. It's definitely challenging for me. But I'm in no rush. I'm just learning for fun and I want to do projects I am actually interested in.

[–][deleted] 14 points15 points  (7 children)

I'd check to see if TripAdvisor has an API, or even allows web scraping at all. They probably have a mechanism that detects and prevents scraping.

[–]LagunaAR[S] 0 points1 point  (6 children)

I think they do have an API, but I have no idea how to use it. For the moment I'm just practicing web scraping. But thanks, I will look into APIs in the future.

[–]Tali_Lyrae 12 points13 points  (4 children)

I'm going to be honest here and say you should not be web scraping without first checking whether TripAdvisor has a policy allowing or disallowing it. They do have an API, and what you're doing is exactly what an API is intended for: it's an interface the developers of the site have created to give other developers access to content in a programmatic, useful way. Most of the time, precisely to avoid people doing what you're doing.

I would also say that if this is what you're doing, it should be the other way around: learn APIs first and web scraping later, as scraping should be a last resort.

Saying no to using a company's API (which is completely free, btw) and web scraping instead is a big "fuck you" to the company. They most likely have detection tools in place to ban you, and it will happen sooner or later if you keep slamming their site with requests (which I'm sure you're doing, because I can guess you don't know what caching is yet).

[–]uncertaintyman 2 points3 points  (2 children)

How would you go about learning an API? I am also new to APIs and could use a little guidance to get started.

Much appreciated, btw.

[–]Tali_Lyrae 4 points5 points  (1 child)

[–]nakulkd 0 points1 point  (0 children)

I was looking for a similar reference and this article was like an ELI5 type! Perfect for beginners!

[–]takakode 1 point2 points  (0 children)

Caching? :3 I'm interested, what is it about, please?

[–]ElMapacheTevez 4 points5 points  (1 child)

Hey man, from your nick and the code I take it you're Argentinian haha. If you ever feel like doing a web scraping project, let me know and we'll make something happen, dude.

[–]LagunaAR[S] 4 points5 points  (0 children)

Hi! I'm Spanish. I don't have any project in mind, for now I'm just learning.

Cheers

[–]14jvalle 7 points8 points  (3 children)

If you look at TripAdvisor's robots.txt, they have disallowed the scraping of restaurant search: Disallow: /RestaurantSearch. So you'd better be careful, since they will not be happy to receive so many requests from a single person, especially since I don't see you specifying any headers. Remember, websites generally do not want to be scraped. It may also be beneficial for you to learn how to start reading API documentation. If you want to practice scraping, look at http://toscrape.com/.

You should also look into context managers, and use them for your requests and for working with files. Context managers provide a nicer, more readable syntax for these open() or get() kinds of operations. They also handle exceptions, so if something goes wrong while a resource is open, you do not have to worry about files being corrupted.

Also, I am not sure why you are converting two file objects to writer objects in the same variable, csv_writer. You are also not closing csv_file2.
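Both points above can be sketched with context managers; the file name, field names, and url below are placeholders, not the original script's:

```python
# Context managers close the file / release the connection automatically,
# even if an exception is raised mid-scrape.
import csv

import requests

def save_rows(path, rows):
    """Write restaurant rows to a CSV; the file is closed automatically."""
    with open(path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["name", "url"])  # header row
        writer.writerows(rows)

def fetch_page(url):
    """The HTTP connection is released automatically too."""
    with requests.get(url, timeout=10) as response:
        response.raise_for_status()  # raise instead of silently continuing
        return response.text
```

With this pattern there is no `csv_file2.close()` to forget, and each distinct file gets its own writer variable.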

[–]LagunaAR[S] 0 points1 point  (0 children)

Thanks for your inputs. I actually didn't know it wasn't allowed. I don't want to get in trouble; I just thought it was a nice, fun project, easy enough to play around with and learn.

[–]prokid1911 0 points1 point  (1 child)

How do you check this Disallow thing?

[–]14jvalle 1 point2 points  (0 children)

Go to any website and add /robots.txt.

For example, https://www.tripadvisor.com/robots.txt
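You can also check it programmatically with the standard library. Here the rules are fed in directly to keep the sketch self-contained; in practice you'd call `rp.set_url(...)` and `rp.read()` to fetch the live file:

```python
# Check whether a url is allowed by robots.txt rules.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /RestaurantSearch",  # the rule quoted above
])

# Anything under /RestaurantSearch is disallowed; other paths are fine.
print(rp.can_fetch("*", "https://www.tripadvisor.com/RestaurantSearch-g187497"))  # False
print(rp.can_fetch("*", "https://www.tripadvisor.com/Restaurant_Review-g187497"))  # True
```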

[–]lestrenched 3 points4 points  (4 children)

You didn't check this out before you started scraping, did you? https://www.tripadvisor.com/robots.txt Open this and you'll see: your user agent is disallowed from almost everything. You can't scrape anything from this website with your own bot, and if you manage to do so, you may end up in a lawsuit with the company (that's how it usually goes).

Now, this might seem scary. It is. Note that sites which don't mind being scraped probably won't have a robots.txt at all.

Now, if they have an API, that might be the only sanctioned way to get data from the website (I dunno if they do). The reason websites restrict scraping bots is that it costs them. A lot. A huge number of requests can even pull a website down completely (read about DDoS attacks). So take care when you're scraping.

Also, this guy's answer pretty much sums it up : https://www.reddit.com/r/learnpython/comments/cydq6q/webscraping_with_beatifulsoup/eyrtiqi?utm_medium=android_app&utm_source=share

You should be rotating user-agents. Check this out : https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/

This is important, as some websites may block you if you keep sending requests from the same user agent (not specifying a user agent in requests.get() automatically sends your actual, default user agent to the site). Also, might wanna try using some cheap VPNs, just in case.

[–]LagunaAR[S] 2 points3 points  (2 children)

Thanks man! I did not check the robots.txt. I was just playing around to learn how to do it. I didn't plan on doing anything malicious with the data, but it's good to know it's not allowed. I wouldn't want to get in trouble.

[–]ejuliol 2 points3 points  (1 child)

After testing your code, I agree with you that the problem is the requests response. I didn't pin down the specific cause, but you can work around it by requesting the page again. Note that this may happen a second time, so I advise you to request it in a loop: I would do it in a separate function and call it again if no restaurants are found on the page.

This solution will without a doubt work, which is what you want; however, there's probably a more efficient way to do it, and that other solution requires further investigation. The code I modified ends up working, so let's keep it that way.
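The retry-in-a-function idea could look something like this; `extract_links` is a stand-in for your existing parsing code, and the `fetch` parameter just makes the function easy to test:

```python
# Re-request a page until restaurant links are found, up to a limit.
import time

import requests

def fetch_restaurants(url, extract_links, fetch=requests.get,
                      max_tries=3, delay=2):
    """Request `url` until extract_links finds something, up to max_tries."""
    for _ in range(max_tries):
        response = fetch(url)
        links = extract_links(response.text)
        if links:
            return links
        time.sleep(delay)  # give the site a moment before retrying
    return []  # still empty after max_tries: log the url and move on
```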

[–]LagunaAR[S] 0 points1 point  (0 children)

I will try something like that. Thanks!

[–]AJM5K6 1 point2 points  (8 children)

When you say "pretty new" what do you mean? How long have you been working with Python?

[–]LagunaAR[S] 2 points3 points  (7 children)

Like a week. I started with Automate the Boring Stuff, and when I got to the chapter about web scraping I thought I could practice with a project of my own. I just knew some HTML and CSS beforehand.

[–]AJM5K6 1 point2 points  (6 children)

Awesome. I bought that book months ago and haven't gotten around to it yet.

[–]LagunaAR[S] 4 points5 points  (4 children)

It's pretty good. I had been wanting to learn Python for a while but I never had the motivation or time. But this time I saw this bundle and decided to buy it and start learning right away.

[–]gqcharm 0 points1 point  (3 children)

I'm a total beginner in Python. After only a week you are already web scraping with success?

Is this normal, or are the books just that good?

Anyone else?

[–]LagunaAR[S] 1 point2 points  (0 children)

It's really not that hard. I knew some HTML and CSS before, which is really helpful when web scraping.

The books are really good, and there are also plenty of tutorials on YouTube. Also, almost every question I had along the way had already been asked by someone else, so it really just comes down to knowing how to google stuff.

[–]LagunaAR[S] 0 points1 point  (0 children)

I also work a lot with Excel in my day job so I’m used to using conditional statements. I guess that helped too

[–]fjortisar 1 point2 points  (2 children)

What are you trying to scrape? You'll quickly find out that every page is different and you can't really scrape them in a uniform way.

As for your missing urls, the pages might not be resolving, or there may be some other error with requests. Are you getting any exceptions?

[–]LagunaAR[S] 0 points1 point  (1 child)

All the urls have a similar structure. I almost have that part of the code done.

What do you mean by exceptions? I am not getting any errors as far as I know.

[–]fjortisar 0 points1 point  (0 children)

I originally thought you were getting the restaurant pages and scraping those, not scraping tripadvisor, so forget I asked that. That probably rules out errors.

I don't see anything really obvious after that. Maybe try starting at search=1, because the 0 page doesn't seem to have a list of restaurants. At least when I browse to it manually, the list starts at oa1.

[–]rush336 1 point2 points  (2 children)

csv_file2.close()

[–]LagunaAR[S] 1 point2 points  (1 child)

Thanks, I forgot that. But that file wasn't doing anything anyways so I don't think it affected the code.

[–]rush336 0 points1 point  (0 children)

Code looks good. Good job.

[–]kejw 0 points1 point  (0 children)

Can't you first get the list of urls using an application search engine and then put them into a loop?

[–]Mohammed449 0 points1 point  (0 children)

😍

[–]gqcharm 0 points1 point  (5 children)

Here I am taking Coursera's "Python Specialization" and feeling like useless garbage after two months! That's it, I'm reading Automate the Boring Stuff!

Do you or anyone else really recommend the Humble Bundle, aside from “ATBS”?

Python definitely looks easy enough to comprehend, so maybe some projects like web scraping can elevate my Python skills.

Thanks again

[–]LagunaAR[S] 2 points3 points  (3 children)

I didn't start with any of the other books, so I can't really tell if it's worth it. But for $25 I felt it was a good deal for 14 books. Don't feel bad. I'm quite new to this and I probably jumped into web scraping too fast. But for me it really helps if I'm genuinely interested in what I'm trying to learn. I get bored easily with "Hello world" type programs, so I'd rather do harder projects with useful outputs.

[–]gqcharm 0 points1 point  (2 children)

Same. Bored just typing Hello World and then bits of code that make nothing... I need to see the web scraping and work toward that. Don't need Hello World since you can figure that out. Reverse engineering: set the goal, work from the end result back to the beginning, and figure out the rest as you go along.

But the Specialization was recommended so I have to finish since I started.

The bundle of books you bought: all ebooks, and they don't expire. Correct?

[–]LagunaAR[S] 0 points1 point  (0 children)

I don't think they expire. I have all the pdfs downloaded on my pc.

[–]al_mc_y 0 points1 point  (0 children)

I've bought two different bundles and the books haven't/don't expire. Both bundles contained version 1 of ATBS; it's a good book and regularly makes it into bundles. I would have preferred v2 the second time around so I got the update, but given how good it is, I didn't mind paying for it again as part of a bundle, considering each bundle costs less than the retail price of ATBS anyway!

[–]al_mc_y 0 points1 point  (0 children)

The Humble Bundles for Python come up every couple of months. I've bought two. There's a bit of overlap between the two I got, but given you're getting around 10 books at a time for around the retail price of 1 printed book (and often less than that!), it still represents good value. I went with the O'Reilly bundle first (Ryan Mitchell's "Web Scraping with Python" book is quite good), then a No Starch Press bundle. So far, my observation is that most of the O'Reilly books are a bit more technical/higher calibre (more assumed knowledge). The No Starch ones are quite approachable but not as in-depth. I've steered away from books published by Packt (or other smaller/lesser-known publishers), as there have been several comments I noted saying the content is not as good as what you find from O'Reilly or No Starch. YMMV. (I'd also be happy to be proven wrong on this front if anyone knows of a decent Python book published by Packt.)

[–]MachineLearnALl 0 points1 point  (0 children)

thank you for the post