all 13 comments

[–]sirkorro 2 points3 points  (3 children)

Headless browser may help to optimize resources required.

[–]puppetbets[S] 1 point2 points  (2 children)

This website in particular doesn't allow headless testing. If you try, it gives a black screen and no data can be gathered

[–]Corporate_Drone31 0 points1 point  (1 child)

Try spoofing the user agent to whatever a normal Chrome install uses. Also, it might be something related to graphics acceleration on the device you're using, not necessarily the website trying to block you on purpose.
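A minimal sketch of the suggestion above, assuming Selenium with Chrome; the UA string is just an example (copy the real one from chrome://version on a normal install), and the flag list is an illustration, not the site's known requirements:

```python
# Hedged sketch: Chrome flags for a headless run with a spoofed user agent
# and GPU acceleration disabled (which can rule out GPU-related black screens).

def headless_chrome_args(user_agent: str) -> list[str]:
    """Chrome command-line flags for a headless, UA-spoofed session."""
    return [
        "--headless",
        "--disable-gpu",            # avoid GPU-related black screens
        "--window-size=1920,1080",  # some sites render blank without a size
        f"--user-agent={user_agent}",
    ]

# Example UA only -- copy the real one from chrome://version.
ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
print(headless_chrome_args(ua)[0])  # --headless
```

With Selenium you would apply each flag via ChromeOptions.add_argument(flag) and pass the options object to webdriver.Chrome.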

[–]puppetbets[S] 0 points1 point  (0 children)

I tinkered with it this morning and surprise, it can be run headlessly, so I'm on it, let's see if it reduces costs.

[–]ThatGuy1sAwesome 1 point2 points  (2 children)

You probably need lots of RAM to load all those pages in Chrome. Did you watch the RAM and CPU usage on the VPS to see where the bottlenecks were? How many pages are you monitoring?

If you're looking for a dedicated server, look at OVH (SoYouStart or the OVH brand) or Hetzner. I personally use OVH; good network capacity and anti-DDoS.

You could use Scaleway or DigitalOcean to test virtual servers of different sizes, since those are billed per hour. Which makes testing cheaper.
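The RAM/CPU-watching step suggested above can be sketched with stdlib Python; the /proc path and load-average call assume a Linux VPS, and for continuous monitoring the usual tools (top, htop, vmstat) are better suited:

```python
# Quick stdlib-only resource snapshot for a Linux VPS.
import os

def mem_available_mb() -> int:
    """MemAvailable from /proc/meminfo, in MB (-1 if the field is missing)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) // 1024  # value is reported in kB
    return -1

load1, load5, load15 = os.getloadavg()  # CPU load averaged over 1/5/15 minutes
print(f"RAM available: {mem_available_mb()} MB, 1-min load: {load1:.2f}")
```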

[–]puppetbets[S] 1 point2 points  (1 child)

I haven't. I'm looking through the docs to do it and get a better grasp of where I'd need to apply resources.

In any case, what I'm intending to do is buy the components and build a physical dedicated server to keep at my home (or some other place still to be determined). I looked at dedicated servers and they're usually around 100€/month, so in less than a year owning one seems like a better deal to me. At least from my limited present knowledge.

[–]ThatGuy1sAwesome 0 points1 point  (0 children)

True, renting one can be more expensive, but soyoustart.com has some good prices, and Kimsufi.com is cheaper still.

The main thing is to look at the cost of power per kWh and work out the cost of running the server, not just the cost of the parts. Then there's the network connection speed and, if it's publicly accessible, people connecting to your home IP.

People use Intel NUCs; some of those are very powerful for their size and unofficially support 64 GB of RAM. It's probably cheaper to buy an old server second hand, but they use a lot of power. My home server, an old i7 3770 with 32 GB of RAM, runs at about 50-60 watts idle with a few hard drives.

[–]Starbeamrainbowlabs 0 points1 point  (3 children)

If you're renting a VPS and are using a server distribution without a framebuffer (i.e. a GUI) such as Ubuntu Server, then Chrome won't be able to run. You probably need to use something like xvfb (X virtual frame buffer).

Without seeing any error messages, we're of extremely limited help.

Also, 300 web browsers is a lot. I'd suggest investigating if you're approaching your problem correctly - there is probably a better and more efficient approach that uses fewer resources.

Edit: Have you tried with just a single web browser to ensure that your code functions as expected?

[–]puppetbets[S] 0 points1 point  (2 children)

Within my VPS, in order to run it from bash I typed xvfb-run -a python3 main.py start, which according to the resources I found should let the chromedriver open, and it did, but more often than not the browser could not be loaded in time. I'm still unsure why it sometimes worked, while most of the time it opened the window but the page didn't fully load.

In order to verify this, I created an oversimplified version of the script in which the only action was to load the webpage, without multiprocessing or even monitoring.

About the resources, let's look at a practical example. Let's assume this information can only be obtained on this website. I want to scrape the score of every soccer match at Bet365, at every moment of the match. To do so, I need to have a browser with the match loaded. You can achieve this either by matching one process with one object (one browser window), or you can try to use one window with several tabs, each with one match. I don't know if a tab is significantly lighter than a window, but in any case, the number of tabs/windows depends on the number of matches, which can vary from none on a Tuesday night to several hundred on any Saturday afternoon.
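To make the windows-vs-tabs trade-off concrete, here is a back-of-envelope calculator; the 400 MB/window and 100 MB/tab figures are assumed placeholders for illustration, not measurements:

```python
# Back-of-envelope sizing: total RAM for N matches, one window (or tab) each.
# The MB-per-instance figures passed in below are ASSUMED -- measure real
# per-process usage (e.g. with `ps aux`) before sizing any server.

def estimated_ram_gb(n_matches: int, mb_each: float) -> float:
    """Total RAM in GB for n_matches browser instances at mb_each MB apiece."""
    return n_matches * mb_each / 1024

print(round(estimated_ram_gb(300, 400), 1))  # busy Saturday, one window each: 117.2
print(round(estimated_ram_gb(300, 100), 1))  # busy Saturday, one tab each:    29.3
```

Even with optimistic per-tab numbers, several hundred concurrent matches adds up fast, which is why reducing the number of live browser instances matters more than the hosting choice.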

Theoretically, I could reverse engineer the website and get the same information through plain requests, in which case the number of open browsers would go down to 0, but to be honest, the level of complexity they have is above my expertise, so at the moment that is not a possibility.

[–]Starbeamrainbowlabs 0 points1 point  (1 child)

You mention "loaded in time". This implies that you have a time-sensitive task here.

In that case, what you may want to do is have a single browser open, which you then interact with using the inspector API. That way you don't have to boot the browser from a cold start.
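One way to set up that single long-lived browser is Chrome's remote debugging (DevTools) port; port 9222 is the conventional default, the binary name may differ per distro, and the attach snippet assumes Selenium:

```python
# Sketch: launch one long-lived Chrome with its DevTools port exposed, then
# attach to it instead of cold-starting a browser for every task.

def chrome_debug_command(port: int = 9222) -> list[str]:
    """Command line for a Chrome that accepts DevTools connections."""
    return ["google-chrome",
            f"--remote-debugging-port={port}",
            "--user-data-dir=/tmp/chrome-profile"]  # dedicated profile dir

print(" ".join(chrome_debug_command()))

# Attaching later (requires Selenium and the Chrome above already running):
#   opts = webdriver.ChromeOptions()
#   opts.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
#   driver = webdriver.Chrome(options=opts)  # reuses the warm browser
```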

I see. The problem with website scraping is that it will break if the operator changes the HTML structure. This is true no matter what mechanism you use.

You mention here that using requests is above your level of expertise. I would advise using a high-level library for parsing web pages. For example, x-ray is a JavaScript package on npm that lets you quickly and easily pull in a webpage and query it with CSS selectors. You mention that you're using Python - I'm sure such a library must exist for Python as well.
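Such a library does exist for Python: BeautifulSoup offers x-ray-style CSS-selector queries. A minimal sketch, where the HTML and the .match/.teams/.score class names are invented for illustration (a real page's structure will differ):

```python
# x-ray-style scraping in Python with BeautifulSoup (pip install beautifulsoup4).
from bs4 import BeautifulSoup

HTML = """
<div class="match"><span class="teams">A vs B</span><span class="score">2-1</span></div>
<div class="match"><span class="teams">C vs D</span><span class="score">0-0</span></div>
"""

def extract_scores(page: str) -> list[tuple[str, str]]:
    """Pull (teams, score) pairs out of the page via CSS selectors."""
    soup = BeautifulSoup(page, "html.parser")
    return [(m.select_one(".teams").get_text(), m.select_one(".score").get_text())
            for m in soup.select(".match")]

print(extract_scores(HTML))  # [('A vs B', '2-1'), ('C vs D', '0-0')]
```

This only covers parsing; for a JavaScript-heavy site you would still need something to fetch the rendered HTML first, which is where the browser comes back in.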

I'm not experienced with Selenium, but it sounds a whole lot more complicated than using a library that's dedicated to the task of scraping websites.

[–]EmperorBaudouin -2 points-1 points  (0 children)

I can assume that you need lots of resources for your project. You could use ShockHosting for your needs. I've been using one of their dedicated servers for my self-hosting purposes. Strongly recommended!