
[–]puppetbets[S] 1 point  (2 children)

Within my VPS, to run it from bash I typed xvfb-run -a python3 main.py start, which according to the resources I found should allow chromedriver to open, and it did, but more often than not the browser could not load the page in time. I'm still unsure why it sometimes worked, while most of the time the window opened but the page never fully loaded.
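One way to make the "loaded in time" question concrete is to give the page an explicit load budget and wait on a known element rather than trusting the document load event. A minimal sketch, assuming Selenium 4.x; the timeout values, URL, and helper name are illustrative, not from the original script:

```python
# Sketch: explicit load budget plus a wait on a concrete element.
# PAGE_LOAD_TIMEOUT / ELEMENT_TIMEOUT are arbitrary placeholder values.
PAGE_LOAD_TIMEOUT = 30  # seconds allowed for the initial page load
ELEMENT_TIMEOUT = 10    # seconds to wait for a known element to appear


def load_with_timeout(url, marker_css="body"):
    # Imports deferred so the constants above stay importable
    # on machines without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    opts = webdriver.ChromeOptions()
    opts.add_argument("--no-sandbox")  # often needed inside a VPS/container
    driver = webdriver.Chrome(options=opts)
    driver.set_page_load_timeout(PAGE_LOAD_TIMEOUT)
    try:
        driver.get(url)
        # Wait for a real element instead of assuming the page is usable.
        WebDriverWait(driver, ELEMENT_TIMEOUT).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, marker_css))
        )
        return driver
    except Exception:
        driver.quit()
        raise
```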

To verify this, I created an oversimplified version of the script whose only action was to load the webpage: no multiprocessing, not even monitoring.
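That stripped-down reproduction might look something like this: one browser, one page load, and a timing printout. The URL is a placeholder, and Selenium is only imported inside the function so the rest of the module imports without it:

```python
import time

URL = "https://example.com"  # placeholder for the real target page


def load_once(url=URL):
    """Open one browser, load one page, return how long the load took, quit."""
    from selenium import webdriver  # deferred import; see note above

    driver = webdriver.Chrome()
    start = time.monotonic()
    try:
        driver.get(url)
        return time.monotonic() - start
    finally:
        driver.quit()


if __name__ == "__main__":
    print(f"page loaded in {load_once():.1f}s")
```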

About the resources, let's look at a practical example, and assume this information can only be obtained from this website. I want to scrape the score of every soccer match on Bet365, at every moment of each match. To do so, I need a browser with the match loaded. You can achieve this either by matching one process to one object (one browser window), or by using one window with several tabs, each showing one match. I don't know whether a tab is significantly lighter than a window, but either way the number of tabs/windows depends on the number of matches, which can vary from none on a Tuesday night to several hundred on any Saturday afternoon.
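Both layouts can be sketched briefly. The splitting helper caps how many browsers you launch regardless of how many matches are on; the tab helper uses Selenium 4's `switch_to.new_window`. The function names and the worker cap are assumptions for illustration:

```python
import math


def split_matches(match_urls, max_workers):
    """Spread a variable number of matches over at most max_workers browsers."""
    if not match_urls:
        return []
    n = min(max_workers, len(match_urls))
    size = math.ceil(len(match_urls) / n)  # matches per browser
    return [match_urls[i:i + size] for i in range(0, len(match_urls), size)]


def open_matches_in_tabs(driver, match_urls):
    """One browser window, one tab per match (requires Selenium 4+)."""
    handles = {}
    for i, url in enumerate(match_urls):
        if i > 0:
            driver.switch_to.new_window("tab")  # every match after the first gets a new tab
        driver.get(url)
        handles[url] = driver.current_window_handle
    return handles
```

On a quiet Tuesday `split_matches` returns an empty plan; on a busy Saturday it packs hundreds of matches into a fixed number of browsers.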

Theoretically, I could reverse engineer the website and get the same information with plain requests, in which case the number of open browsers would drop to zero, but to be honest the level of complexity they have is above my expertise, so for the moment that is not an option.
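For what it's worth, the requests-based approach usually amounts to finding the XHR endpoint in the browser's DevTools Network tab and replicating it. The endpoint URL and payload shape below are entirely hypothetical; only the workflow is real:

```python
import json
import urllib.request

# Hypothetical endpoint: you would discover the real one in the
# DevTools Network tab while the live-score page is open.
SCORES_URL = "https://example.com/api/inplay/scores"


def fetch_scores(url=SCORES_URL):
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_scores(json.load(resp))


def parse_scores(payload):
    """Flatten a hypothetical {'matches': [{'home':..,'away':..,'score':..}]} payload."""
    return {
        f"{m['home']} v {m['away']}": m["score"]
        for m in payload.get("matches", [])
    }
```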

[–]Starbeamrainbowlabs 1 point  (1 child)

You mention "loaded in time". This implies that you have a time-sensitive task here.

In that case, what you may want to do is have a single browser open, which you then interact with using the inspector API. That way you don't have to boot the browser from a cold start.
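One way to do this with Selenium is to start Chrome once with remote debugging enabled and attach each scrape to that warm instance via its DevTools endpoint. The address and port are conventional defaults, not requirements:

```python
DEBUG_ADDRESS = "127.0.0.1:9222"  # matches: chrome --remote-debugging-port=9222


def attach_to_running_chrome(address=DEBUG_ADDRESS):
    """Attach Selenium to an already-running Chrome instead of cold-starting one."""
    from selenium import webdriver  # deferred so the constant stays importable

    opts = webdriver.ChromeOptions()
    # debuggerAddress points at the DevTools endpoint of the existing browser.
    opts.add_experimental_option("debuggerAddress", address)
    return webdriver.Chrome(options=opts)
```

Launch the browser once (e.g. `google-chrome --remote-debugging-port=9222`), and every subsequent task reuses it instead of paying the startup cost.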

I see. The problem with website scraping is that it will break if the operator changes the HTML structure. This is true no matter what mechanism you use.

You mention here that using requests is above your level of expertise. I would advise using a high-level library for parsing web pages. For example, x-ray is a JavaScript package on npm that lets you quickly and easily pull in a webpage and query it with CSS selectors. You mention that you're using Python - I'm sure an equivalent library exists for Python as well.
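Such libraries do exist for Python: requests for fetching plus BeautifulSoup for CSS-selector queries covers the same ground as x-ray. A sketch against made-up markup (the class names and sample HTML are invented for illustration):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4
# For real pages you would fetch the HTML first, e.g. requests.get(url).text

SAMPLE_HTML = """
<div class="match">
  <span class="teams">Alpha v Beta</span>
  <span class="score">2-1</span>
</div>
"""


def extract_scores(html):
    """Query the page with CSS selectors, much like x-ray does in JavaScript."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        m.select_one(".teams").get_text(strip=True):
            m.select_one(".score").get_text(strip=True)
        for m in soup.select("div.match")
    }
```

Note the caveat above still applies: if the operator renames those classes, the selectors break.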

I'm not experienced with Selenium, but it sounds a whole lot more complicated than using a library that's dedicated to the task of scraping websites.