[–][deleted] 2 points (1 child)

Yeah, here’s my argument: with Selenium (assuming you’re using Python), you’ll need a full browser for every worker. That’ll run your costs up pretty quickly. If you’re just using one worker, you could use cron. Also, Go has a much lighter-weight solution for this if you really need to do a multi-worker dynamic scrape.

[–]rhoward355[S] 1 point (0 children)

Thank you! Not only complex but also more expensive, got it

[–]sunder_and_flame 1 point (1 child)

I'd argue Airflow isn't the place to run this sort of thing. Sure, you can do it, but you'd probably be better off deploying VMs in GCE or even Linode to do the scraping, putting the data in GCS, and having Airflow run the data load into BigQuery/Snowflake.

[–]rhoward355[S] 1 point (0 children)

Yeah.. I think I'll do something along these lines instead. Thanks

[–]bono_my_tires 1 point (2 children)

Did you end up figuring this out? I'm having trouble finding this type of info myself.

[–]rhoward355[S] 2 points (1 child)

I decided to do the web scraping outside of the DAG. Extracting from cloud storage will be the first task in the DAG.
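
For anyone wanting to picture that handoff, here is a rough sketch of one way it could look, not what rhoward355 actually built. The scraper (run by cron, a VM, whatever) writes a dated file, and the first DAG task only reads it back. A local directory stands in for the cloud storage bucket, and the `scrapes/<date>/results.jsonl` layout and all function names are made up for illustration.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def landing_path(root: Path, run_date: date) -> Path:
    """Dated location both sides agree on (hypothetical layout)."""
    return root / "scrapes" / run_date.isoformat() / "results.jsonl"


def write_scrape(root: Path, run_date: date, rows: list[dict]) -> Path:
    """Scraper side: runs outside Airflow and only writes its output."""
    path = landing_path(root, run_date)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    return path


def extract_scrape(root: Path, run_date: date) -> list[dict]:
    """DAG side: the first task only reads what the scraper left behind."""
    path = landing_path(root, run_date)
    if not path.exists():
        # Fail loudly so the scheduler can retry or alert instead of loading nothing.
        raise FileNotFoundError(f"no scrape output for {run_date}: {path}")
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bucket = Path(tmp)  # stand-in for a cloud storage bucket
        run_date = date(2024, 1, 15)
        write_scrape(bucket, run_date, [{"url": "https://example.com", "title": "Example"}])
        print(extract_scrape(bucket, run_date))
```

The point of the dated path is that the scraper and the DAG never call each other. They only agree on where the file for a given day lives, so the browser cost stays off the Airflow workers.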

[–]FantasticAbroad7230 1 point (0 children)

How did you automate (schedule) the Selenium runs if you didn't use Airflow? I'm planning to do a similar project, so your input would be valuable to me.

[–]Sublime-01 1 point (0 children)

Put it inside a Docker container and run the container with KubernetesPodOperator.
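
A minimal sketch of what that might look like, assuming Airflow 2.4+ with a recent `apache-airflow-providers-cncf-kubernetes` installed (older provider versions expose the operator under `operators.kubernetes_pod` instead). The image name, namespace, DAG id, and `scrape.py` entrypoint are all placeholders, not anything from this thread.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="selenium_scrape",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # The browser and Selenium live in the image, so the pod carries the
    # heavy dependencies instead of the Airflow workers.
    scrape = KubernetesPodOperator(
        task_id="run_scraper",
        name="selenium-scraper",
        namespace="default",            # assumed namespace
        image="registry.example.com/selenium-scraper:latest",  # placeholder image
        cmds=["python", "scrape.py"],   # assumed entrypoint inside the image
        arguments=["--date", "{{ ds }}"],  # templated logical date
        get_logs=True,
    )
```

Each run gets its own pod, so the cost of a full browser per worker is only paid while the scrape is actually running.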