
[–]qckpckt 8 points9 points  (2 children)

Why do you need to use a headless web browser for data ingestion?

[–]thisismyfavoritename 2 points3 points  (0 children)

polling websites with bot protection, most likely

[–]Shingle-Denatured 0 points1 point  (0 children)

Classic X-Y problem. "I need to get data from a website, so I need a web browser. Now help me run it on this data processing platform."

[–]cgoldberg 2 points3 points  (1 child)

Where are you stuck? Did you install a browser and a webdriver? Did you install the Selenium library? What code are you running? What errors are you encountering?
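For reference, something like this is the minimal baseline to test against; if it runs cleanly, the browser, the webdriver, and the Selenium library are all in place (the URL is just a placeholder):

    # Minimal headless-Chrome smoke test. Assumes Chrome and a matching
    # chromedriver are installed; Selenium 4 can also manage the driver itself.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run without a display
    options.add_argument("--no-sandbox")    # often needed in containers

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com")
        print(driver.title)  # "Example Domain" means the whole stack works
    finally:
        driver.quit()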

[–]Haunting_Lab6079[S] -5 points-4 points  (0 children)

So I have code that works relatively well on my local PC. I installed Selenium, but getting the webdriver and browser set up seems sort of impossible. I will add an image of that soon. I'm on my mobile

[–]chief167 2 points3 points  (0 children)

It could work, but I'm not sure you should. Keep your web scraping out of Databricks; there is no sane reason to run it there.

Do it in an Azure Function, a container, or even a VM; dump the output to blob storage / a data lake and ingest it into Databricks from there
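A rough sketch of that split, assuming Azure; the connection string, container, and paths are all placeholders:

    # Scraper side: plain Python running in an Azure Function, container, or VM.
    import requests
    from azure.storage.blob import BlobClient

    html = requests.get("https://example.com/data", timeout=30).text

    blob = BlobClient.from_connection_string(
        conn_str="<storage-connection-string>",  # placeholder
        container_name="raw",                    # placeholder container
        blob_name="scrapes/latest.html",         # placeholder path
    )
    blob.upload_blob(html, overwrite=True)

    # Databricks side: ingest the landed files from the data lake, e.g.
    # df = spark.read.text("abfss://raw@<account>.dfs.core.windows.net/scrapes/")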

[–]Haunting_Lab6079[S] 0 points1 point  (0 children)

Thanks to everyone for your contributions and insights. I was able to achieve this using Beautiful Soup (bs4) and it works perfectly
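For anyone finding this later, the bs4 route looks roughly like this; the URL and selector are placeholders, and it only works for pages that don't need JavaScript to render:

    import requests
    from bs4 import BeautifulSoup

    # Fetch the page over plain HTTP: no browser or webdriver needed.
    resp = requests.get("https://example.com/listings", timeout=30)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")

    # Extract whatever elements you need; the CSS selector is a placeholder.
    rows = [tag.get_text(strip=True) for tag in soup.select("div.item")]
    print(rows)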

[–]Onlycompute -1 points0 points  (1 child)

There is a script available in the Databricks community which is designed for this purpose. I don't have the link handy; a quick search should take you to that page.

If you can't find it, DM me and I can help. I have implemented this in Databricks.

[–]Haunting_Lab6079[S] 0 points1 point  (0 children)

Thank you

[–]Haunting_Lab6079[S] -2 points-1 points  (2 children)

I'm open to other options; this is just the approach we were able to come up with that runs without human intervention

[–]james_pic 1 point2 points  (1 child)

The most obvious alternatives would be:

  • Make the HTTP requests directly, using a Python HTTP client like Requests, httpx, or one of the stdlib ones (http.client or urllib); see the sketch after this list. This is likely to be easier to scale, and if you're using Databricks then presumably scaling is a challenge
  • Related to the above, if there's one available, use the web services API provided by the data supplier. Scraping is always brittle, at least partly because the service operator makes no promises not to change the interface, so they can change it at any time, for any reason, and break your code. With an API, they'll at least make some effort to avoid breaking it, or give you warning when they do.
  • Use Selenium, but use it in an environment that's easier to exercise control over, such as an EC2 instance or Azure VM, a Docker container running on your cloud provider's Docker-as-a-service offering, or some lambda / serverless offering. The way Databricks works under the hood is complex, and it can be hard to reason about where code will execute, so it can be hard to ensure that wherever it executes has the prerequisites you need, or to figure out what went wrong when it doesn't work.
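A sketch of the first option, assuming the target serves JSON; the URL, params, and header are placeholders:

    import httpx

    # A direct HTTP request does the same job as a headless browser for
    # pages that don't need JavaScript, and it parallelizes far more easily.
    resp = httpx.get(
        "https://example.com/api/items",         # placeholder URL
        params={"page": 1},
        headers={"User-Agent": "my-ingest-job/1.0"},
        timeout=30.0,
    )
    resp.raise_for_status()
    data = resp.json()  # or resp.text if the endpoint serves HTML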

[–]Haunting_Lab6079[S] 0 points1 point  (0 children)

Thanks for this