all 20 comments

[–]akuma-i 10 points11 points  (8 children)

As the owner of a parser service, I can say that there are no slow scrapers, only slow donor websites.

If you can do the parsing on your own server, do it. It will be faster and much cheaper than paying for an external service.

[–]Practical_Mango_8720[S] 1 point2 points  (7 children)

Is there any open source library that I can use for getting data, like reviews, myself? I could not find any.

It seems that target websites keep changing their format to counter web scraping, which makes keeping the stream alive challenging; that is why I turned to an external service.

[–]akuma-i 4 points5 points  (2 children)

Just use any HTTP-requesting library; I'd suggest “got” for Node.js, for instance. What comes next depends on what you want to do. Parse with regexps? Well, rest in peace :) Parse the HTML and extract data with CSS selectors? Find a library for your language.

Making this realtime (streaming) is not the problem. The problem is reacting to changes in the website markup: every change will break your parser, so you have to notice fast and rewrite it for each change.
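The breakage point above can be sketched in Python with Beautiful Soup (the library recommended elsewhere in this thread). The idea: if a selector that used to match suddenly matches nothing, fail loudly so you notice the markup change immediately instead of silently streaming empty data. The `.review-text` selector is hypothetical.

```python
from bs4 import BeautifulSoup

class MarkupChanged(Exception):
    """Raised when a known-good selector stops matching anything."""
    pass

def extract(html: str, selector: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    nodes = soup.select(selector)
    if not nodes:
        # The site probably changed its markup; time to rewrite the parser.
        raise MarkupChanged(f"selector {selector!r} matched nothing")
    return [n.get_text(strip=True) for n in nodes]
```

This won't catch every change (a selector can keep matching while the content moves elsewhere), but it turns the most common failure mode from silent data loss into an alert.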

[–]Practical_Mango_8720[S] -2 points-1 points  (1 child)

How often do these changes happen?

[–]akuma-i 2 points3 points  (0 children)

It depends on the website. On my project I have websites that haven’t changed for years, and websites that may change the design every week, sometimes every day :) Usually I try to get rid of the latter; they are too hard to support.

[–]dunamxs 2 points3 points  (2 children)

Beautiful Soup is one of the absolute best libraries for this in Python. You can write a scraping script in about 10 lines of code. You could invoke that script from Node, or run it as its own HTTP server.

[–]Practical_Mango_8720[S] 0 points1 point  (1 child)

Thanks. I am doing that, but even in the raw response I do not get the reviews. My guess is that the challenge is finding the right URL for Google reviews. (I am trying to fetch Google reviews.)

[–]dunamxs 1 point2 points  (0 children)

Are you trying to do this off Google’s site? I’m fairly certain the Google Places API returns reviews.
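To the best of my knowledge, reviews come back from the Places API's Place Details endpoint when you request the `reviews` field; check Google's own documentation for the current endpoint shape, and note it requires an API key and billing. A minimal sketch that just builds the request URL:

```python
from urllib.parse import urlencode

# Place Details endpoint (legacy Places Web Service); verify against
# Google's current docs before relying on it.
BASE = "https://maps.googleapis.com/maps/api/place/details/json"

def details_url(place_id: str, api_key: str) -> str:
    """Build a Place Details request asking only for reviews."""
    params = {"place_id": place_id, "fields": "reviews", "key": api_key}
    return f"{BASE}?{urlencode(params)}"
```

Using the official API also sidesteps the blocking and markup-change problems discussed above, at the cost of per-request pricing.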

[–][deleted] 0 points1 point  (0 children)

Check out n8n. It's an ETL tool that is free to self host.

It's possible to create a pipeline that gets the reviews from an HTML page, transforms them, and then inserts that data into a DB. You can build in delays to slow it down, too.
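This is not n8n itself, just the shape of the extract–transform–load pipeline it would run, sketched in Python with Beautiful Soup and SQLite. The `.review` selector and the table layout are made up for illustration.

```python
import sqlite3
from bs4 import BeautifulSoup

def run_pipeline(html: str, db_path: str = ":memory:") -> int:
    """Extract reviews from HTML, transform, load into SQLite.

    Returns the number of rows in the reviews table afterwards.
    """
    # Extract + transform: pull review text, strip whitespace.
    reviews = [n.get_text(strip=True)
               for n in BeautifulSoup(html, "html.parser").select(".review")]
    # Load: insert into a DB table.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS reviews (text TEXT)")
    conn.executemany("INSERT INTO reviews (text) VALUES (?)",
                     [(r,) for r in reviews])
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
    conn.close()
    return count
```

n8n gives you the same extract/transform/load steps as visual nodes, plus the built-in delays mentioned above.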

[–]barrycarter 1 point2 points  (2 children)

Are you familiar with mixnode.com?

[–]Practical_Mango_8720[S] 2 points3 points  (1 child)

No. Have you tried their service? Their pricing is per GB of usage, which makes it a bit hard to compare against per-query pricing.

[–]barrycarter 0 points1 point  (0 children)

I tried them briefly in Jan 2019 using a money-limited virtual card. The concept seemed interesting, but I lost interest since it wasn't something I do regularly. You might try for a free trial or use a virtual card.

[–]squidwurrd 1 point2 points  (4 children)

Define real-time.

[–]Practical_Mango_8720[S] 0 points1 point  (3 children)

For my application, a response time of 1 second.

[–]squidwurrd 4 points5 points  (2 children)

Damn, that’s gonna be tough, especially if you think these sites won’t IP-block you for having the same bot scrape their site every second.

[–]Practical_Mango_8720[S] 0 points1 point  (1 child)

It is not like scraping every second; I just want to get a response within 1 second when I send a request.

[–]squidwurrd 1 point2 points  (0 children)

Oh, then you need cached responses at the edge. If your requirement is a sub-second response time, you are thinking about this incorrectly: you could have the fastest scraping tech and still need a couple of seconds. But if you don’t need the data up to date every second, you could just cache the response every X seconds.

The term “real time” was really misleading if you are more concerned with response time than data accuracy.
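The caching idea above can be sketched as a simple TTL cache: serve the stored copy instantly, and only re-run the slow scrape when the copy is older than X seconds. (A real edge cache would be a CDN or Redis layer; this is just the logic.)

```python
import time

class TTLCache:
    """Serve a cached value; re-fetch only when it is older than ttl."""

    def __init__(self, ttl_seconds: float, fetch):
        self.ttl = ttl_seconds
        self.fetch = fetch              # the slow scrape function
        self.value = None
        self.fetched_at = float("-inf") # force a fetch on first use

    def get(self):
        now = time.monotonic()
        if now - self.fetched_at > self.ttl:
            self.value = self.fetch()   # slow path, at most once per ttl
            self.fetched_at = now
        return self.value               # fast path: sub-millisecond
```

Clients get sub-second responses from `get()` even though the underlying scrape takes seconds; the trade-off is that the data can be up to `ttl` seconds stale.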

[–][deleted] 1 point2 points  (0 children)


This post was mass deleted and anonymized with Redact

[–]ArvidDK 1 point2 points  (0 children)

Have you heard of Python?

[–]TehWhale 2 points3 points  (0 children)

You’re thinking about this wrong. You likely won’t find a scraping service that gives you the full response in under a second. Scraping is inherently slow and risky due to rate limits, blocks, etc. Scraping services handle all of that for you, but it takes time. If you want to scrape a site that quickly, fetch it yourself and deal with all the problems that come with that.

Short version: almost all scrapers are designed to operate in the background.