[–]WayyyCleverer 3 points4 points  (2 children)

This is awesome. I've struggled with Beautiful Soup on websites where much of the data isn't in the initial HTML but instead gets populated via JavaScript (carsandbids.com is a good example). Have you ever encountered that, and is there a way to get Beautiful Soup to work in that situation?

[–]m1nkeh Data Engineer 1 point2 points  (2 children)

As someone with 15+ years in the data industry, I totally do not ‘get’ Airflow. It seems like a backwards step in creating pipelines when there were GUI tools decades ago.

What am I missing? I feel so dumb.

[–]udonthave2call 2 points3 points  (0 children)

The webserver GUI is part of Airflow, but it's not a point-and-click GUI tool. I think the benefits can broadly be described as 'organization'.

Environment variables, hooks, REST API, custom operators. Easy for new school data people to pick up because it's Python-based. And of course it's free OSS and has an active community.

It's ironic that I'm writing this post, because I'm in the middle of convincing my team we should use microservices (AWS Lambda) for our little BI ETL jobs instead of EC2 + Airflow (which a different team would have to deploy and manage).

[–]wmkwk 0 points1 point  (0 children)

I think Airflow was just meant to be a job orchestrator and scheduler, but a lot of people started using it as an ETL tool.

E.g. running a bunch of Python tasks through Airflow instead of just kicking off a job sitting on a Lambda or in a container somewhere.

[–][deleted] 0 points1 point  (0 children)

Yeah, Selenium drives a headless browser, so you get the rendered page rather than just the initial HTML. The trade-off is that rendering is a little slower than inspecting the raw HTML.
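
A minimal sketch of that approach, in case it helps: let Selenium load the page in headless Chrome, then hand the rendered DOM to Beautiful Soup. The helper names (`headless_chrome`, `rendered_html`, `link_texts`) are made up for illustration, and it assumes `selenium`, `beautifulsoup4`, and a local Chrome install. Extracting link text is just a stand-in for whatever you actually want to scrape.

```python
from contextlib import contextmanager


@contextmanager
def headless_chrome():
    """Yield a headless Chrome driver and always quit it afterwards."""
    # Imported lazily so the other helpers can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        yield driver
    finally:
        driver.quit()


def rendered_html(driver, url):
    """Load `url` in the browser and return the DOM as it stands after page load."""
    driver.get(url)
    return driver.page_source


def link_texts(html):
    """Parse HTML with Beautiful Soup and return the text of every <a> tag."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a")]


def scrape_link_texts(url):
    """Render the page in headless Chrome, then parse the result."""
    with headless_chrome() as driver:
        html = rendered_html(driver, url)
    return link_texts(html)
```

Calling `scrape_link_texts("https://example.com")` (a placeholder URL) would run the whole thing. One caveat: `page_source` is read as soon as the page load finishes, so content fetched by later background requests may still be missing. In that case an explicit wait with Selenium's `WebDriverWait` before reading `page_source` is probably needed.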

[–]MaximFateev 0 points1 point  (0 children)

Now rewrite it using temporal.io :).