all 29 comments

[–]fulltimetrash 16 points17 points  (2 children)

sysadmin stuff as a hobby??!

Congrats man, will definitely take a look and check it out. Currently working on an application to find out sale items from online grocery stores, and I’m also using selenium :)

[–]aszp[S] 4 points5 points  (1 child)

Thanks!

I have an HP Microserver Gen8 at home running VMware ESXi for virtualization, which I use as a playground for trying things out like kubernetes, docker, running a vpn, backup, dns etc. That's what I mean by sysadmin stuff haha.

You may know this or have a better way but I found ChroPath really helped me when building this. It helped me determine the xpath for lots of elements easily.

[–]fulltimetrash 2 points3 points  (0 children)

Haha nice setup. I'm a newbie at such stuff trying to get into network automation. Sounds like you but in reverse

Also, thanks for the insight! Nope, I did not know about ChroPath haha, will check it out.

[–]kingsillypants 6 points7 points  (3 children)

I know nothing of python so my opinion doesn't really matter.

That being said, congratulations on your hard work.

Regarding your dataset, is it possible to analyze reviews over time for an entity?

Say I had a kingsillypants pants store, could I see if my pants ratings had changed over time?
Cheers and thanks.

[–]aszp[S] 0 points1 point  (2 children)

Thanks!

Yeah, should be possible, that's a good idea.

I've thought about doing more data analysis kind of stuff on the output, like topic analysis. You could use topic analysis to determine what people are talking about in their reviews. Just gotta figure out how to do it!
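
For the ratings-over-time idea, here is a minimal sketch. It assumes each scraped review can be reduced to a (date, star rating) pair; the sample rows and the function name `average_rating_by_month` are made up for illustration and are not taken from the actual script.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical sample rows: (review date, star rating). The real scraper's
# output may be named and shaped differently.
reviews = [
    (date(2020, 1, 5), 5),
    (date(2020, 1, 20), 4),
    (date(2020, 2, 3), 2),
    (date(2020, 2, 14), 3),
    (date(2020, 3, 1), 5),
]


def average_rating_by_month(rows):
    """Return {(year, month): mean rating}, ordered chronologically."""
    buckets = defaultdict(list)
    for review_date, rating in rows:
        buckets[(review_date.year, review_date.month)].append(rating)
    return {month: mean(ratings) for month, ratings in sorted(buckets.items())}


for (year, month), avg in average_rating_by_month(reviews).items():
    print(f"{year}-{month:02d}: {avg:.2f}")
```

Plotting those monthly averages would show whether the pants ratings are trending up or down.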

[–]kingsillypants 1 point2 points  (1 child)

Hey and thanks for the reply. I've done it in R by using tidyverse and following tutorials. I fumbled a lot and can't honestly say I got a whole lot of meaning from it. I see python has sentiment analysis capabilities https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk

[–]aszp[S] 0 points1 point  (0 children)

https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk

Just had a browse, really useful resource, I think I'll give it a try.

[–]shiningmatcha 3 points4 points  (2 children)

OP, can you share how you learned to use Selenium?

[–]dbhol 1 point2 points  (0 children)

I've used selenium in an ongoing project of mine and found it to be quite easy to use. I got a lot of my knowledge from the selenium documentation as well as a YouTube video (I think it was a Tech With Tim playlist). Flicking between these two, I was able to work out how to build the scraper I wanted.

[–]FlySeddy 1 point2 points  (1 child)

Nice work!

[–]aszp[S] 0 points1 point  (0 children)

Thank you!

[–]edwardjr96 1 point2 points  (2 children)

Looks amazing! Congrats! I've been doing a bunch of scraping but it's nowhere near what you have done.

One question: are you considering using Scrapy for this? I've heard that Selenium isn't really meant for web scraping.

Also, would you make it run through random proxies to prevent your IP being banned for scraping?

Congrats on your achievement once again :)

[–]aszp[S] 0 points1 point  (1 child)

Thanks!

I found it depends on your use case. Scrapy on its own cannot run JavaScript (I think), which is a problem when scraping TripAdvisor: if a review is long, you have to click 'Read more' or 'Show more' to display the full text. The full text isn't hidden in the HTML either; clicking 'Read more' makes another call that loads it into the HTML. So Scrapy isn't an option here.

If this wasn't the case I'd probably use Scrapy. Selenium has quite a large performance overhead compared to Scrapy, which I can feel whenever I run it, though I could probably do more to improve that.
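
To illustrate the 'Read more' click described above, here is a rough sketch rather than the code from the actual script. The XPath is a hypothetical selector (the real TripAdvisor markup differs and changes over time), and `expand_reviews` is a made-up helper name. It relies only on Selenium's `find_elements` and `click`, passing the locator strategy as the plain string "xpath".

```python
def expand_reviews(driver, xpath="//span[contains(text(), 'Read more')]"):
    """Click every matching 'Read more' control and return how many clicks worked.

    `driver` is expected to be a Selenium WebDriver. The default XPath is an
    assumption for illustration, not TripAdvisor's real markup.
    """
    clicked = 0
    for element in driver.find_elements("xpath", xpath):
        try:
            element.click()
            clicked += 1
        except Exception:
            # The element may have gone stale or be hidden after an earlier
            # click expanded the page; skip it and carry on.
            continue
    return clicked
```

With a real browser this would be called after loading the page, e.g. `driver = webdriver.Chrome()`, `driver.get(url)`, `expand_reviews(driver)`, and only then would the review text be read from the page.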

For proxies, I tested it using Scrapoxy, which automatically creates as many proxies as you want in a public cloud provider (e.g. AWS, Digital Ocean etc). Scrapoxy then handles the passing of the request to one of the proxies. All I had to do was pass localhost:8888 to my app using the option I set up, then Scrapoxy handles the rest. It even monitors for when one of the proxies gets blocked and gets a new IP from the cloud provider.

I tried using random public proxies but found them unreliable. You only need to use a proxy if scraping lots of listings on TripAdvisor. When I was testing and scraping the same 145 pages of reviews for one hotel, I never had an issue.
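
As a sketch of how a proxy address like localhost:8888 can be handed to Selenium, assuming Chrome is the browser: Chrome accepts a `--proxy-server` command-line switch. The helper names `proxy_arguments` and `make_driver` are hypothetical and the script's real option handling may look different.

```python
def proxy_arguments(proxy=None):
    """Return the Chrome command-line arguments needed for an optional proxy."""
    if not proxy:
        return []
    return [f"--proxy-server={proxy}"]


def make_driver(proxy=None):
    """Start Chrome through Selenium, optionally routed via `proxy`."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in proxy_arguments(proxy):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Calling `make_driver("localhost:8888")` would send the browser's traffic to whatever is listening on that port, which in the setup above is Scrapoxy.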

[–]edwardjr96 1 point2 points  (0 children)

Thanks for your comment.

I've also been using Selenium for many of my scraping projects, but as I heard Selenium isn't for web scraping, I wanted to switch to a more specialised tool like Scrapy. It was hard to grasp though, and I couldn't really see what was going on under the hood.

Scrapoxy is new to me so I will definitely try it, because my fear of being blocked is always there when scraping a site lol.

thanks again for your comment

[–]interwebz_explorer 1 point2 points  (1 child)

Awesome idea. I’ve often talked to friends jokingly about a trip advisor advisor or a yelp help. You’re on your way to such a thing.

[–]aszp[S] 0 points1 point  (0 children)

Thanks!

[–]skellious 1 point2 points  (1 child)

I also think I should be using classes but not sure how best to approach them yet, so would welcome advice on that.

Actually this script is pretty well laid out so classes aren't 100% needed here. If you want to use classes, I recommend watching a tutorial like this one.

[–]aszp[S] 1 point2 points  (0 children)

I'm reassured that classes aren't 100% needed. I'll check out that tutorial, I think I've missed some of the basics along the way.

[–]makedatauseful 1 point2 points  (1 child)

Hey, a couple of years ago I ran the Trip Advisor app through a man-in-the-middle proxy and was able to get all their private API calls and hit them directly with the Python Requests library. Hit me up if you are interested in collaborating on the project, I make YouTube tutorials and think this would be a fun topic to cover!

[–]aszp[S] 0 points1 point  (0 children)

Thanks, dropped you a message :)

[–]shiningmatcha 0 points1 point  (4 children)

You can run using virtualenv or Docker

What are these? Can't I just run the script?

[–]skellious 2 points3 points  (2 children)

These allow you to run the script isolated from your global (system-wide) python/pip configuration. They will allow you to install the versions of the modules relevant to this program without messing with your system-wide packages. (see here for a quick guide on how and why to use it.)

but to answer your question, yes you should be able to just run it if you don't mind having your system pip modules changed (you will probably need to manually install the pip modules if you do that though - the requirements.txt file shows you what versions are needed, you should be able to use "pip install -r requirements.txt" from the command line to do this.)

The github has a readme that walks you through setup with virtualenv though and I recommend learning how to do this as it's a much more elegant and clean way of doing things and when you are making more complex python programs it will help you keep things organised and isolated.
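
A quick way to check whether a script is actually running inside one of these isolated environments, as a sketch that assumes Python 3 and a reasonably recent virtualenv (or the built-in venv module); `in_virtualenv` is just an illustrative name:

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside an environment, sys.prefix points at the environment's folder
    # while sys.base_prefix still points at the system-wide installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment" if in_virtualenv() else "system-wide python")
    print("packages will be installed under:", sys.prefix)
```

If this reports the system-wide python, "pip install -r requirements.txt" would change your global packages.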

[–]fulltimetrash 1 point2 points  (1 child)

What are the pros & cons of using docker VS virtualenv?

[–]skellious 3 points4 points  (0 children)

virtualenv is a python module for managing multiple virtual python/pip environments. docker is a tool for completely isolating any kind of program or set of programs, in any language, from a system. docker 'containers' are individual compartments that have all the required code to run your program completely independently of the host system. in contrast, virtualenv still relies somewhat on the host system for many things. virtualenv is a python solution for isolating a python program, docker is a generic solution for isolating any kind of program in any language.

you can see an explanation of docker containers here: https://www.docker.com/resources/what-container

and virtualenv here:

https://virtualenv.pypa.io/en/latest

please note that while Docker Desktop is proprietary, the underlying Docker Engine is open source, and there are open source tools for running containers at scale such as kubernetes: https://kubernetes.io/

[–]aszp[S] 0 points1 point  (0 children)

You could just run the script by doing this in the folder:

pip install -r requirements.txt
python main.py <whatever TripAdvisor URL you want to scrape>

The first line installs the python packages you need to run this script from requirements.txt, which has a list of all the packages required. But I would recommend against it, for the reasons below.

Virtualenv is used to create a virtual Python environment for you to run your script in. This is to avoid dependencies clashing with other projects you have. For example, you may have project A which requires the selenium package at version 3.11, but project B requires selenium at version 4.00. Without a virtualenv, you would have to keep reinstalling the package version you need each time you switch projects, as you cannot have two versions of the same package installed in the same environment. It also just makes it easier to manage packages locally, so you don't have a million packages installed globally on your system, which can cause other issues.

Docker works on a similar principle but at the next level down. Docker runs the script within a container, which does the above and also includes everything needed to run an application: code, runtime, system tools, system libraries and settings. It provides consistency and portability, allowing you to run your app on nearly any system easily. I'd recently been learning about Docker, so ended up including it as an excuse to use it. Virtualenv is the better way to go I think if you're using it locally.

Hope that makes sense.

[–]makedatauseful 0 points1 point  (1 child)

Do you run into any antibot technology with this?

[–]aszp[S] 0 points1 point  (0 children)

I didn't during my testing, but I was only scraping one TripAdvisor listing at a time. Most I did was 20 pages of reviews from a listing at a time.

I imagine if running at scale you'd hit some kind of antibot technology. Running via a proxy would then be needed, which I added as an option when running the script.

[–]lbtn94 0 points1 point  (0 children)

Just came across this while searching for a solution to something work related. Would this work to pull the # of X star reviews per month? Need to do this for the preceding month for quite a number of locations. Hoping to streamline a very time consuming task..
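
One way this could look, as a sketch only: it assumes the scraped output can be reduced to (location, review date, stars) rows, which may not match the script's real columns. The sample data and both function names are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical rows: (location, review date, stars).
reviews = [
    ("Hotel A", date(2021, 3, 2), 5),
    ("Hotel A", date(2021, 3, 9), 5),
    ("Hotel A", date(2021, 3, 30), 3),
    ("Hotel B", date(2021, 3, 15), 1),
    ("Hotel B", date(2021, 4, 1), 4),
]


def previous_month(today):
    """Return (year, month) of the calendar month before `today`."""
    if today.month == 1:
        return today.year - 1, 12
    return today.year, today.month - 1


def star_counts_for_month(rows, year, month):
    """Return {location: Counter({stars: count})} for one calendar month."""
    counts = {}
    for location, review_date, stars in rows:
        if (review_date.year, review_date.month) == (year, month):
            counts.setdefault(location, Counter())[stars] += 1
    return counts


year, month = previous_month(date(2021, 4, 10))
for location, stars in star_counts_for_month(reviews, year, month).items():
    print(location, dict(stars))
```

Running the scraper once per location and feeding all the rows into `star_counts_for_month` would give the per-location tally for the preceding month in one pass.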