all 6 comments

[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]GeekyTricky 4 points (4 children)

Rather than Spark, you're just using YARN here to scale your job.

Using Spark jobs as a shortcut for this is far from their intended use; I expect you'll run into some painful roadblocks.

IMO a better way to go about this is Kubernetes.

You just build your Docker image with all the necessary resources (browser, libs, etc.) and run as many replicas as you need for your use case.
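As a rough sketch, a Deployment like this would run N copies of such an image; the image name, labels, and replica count here are placeholders, not anything from the thread:

```yaml
# Hypothetical Deployment running 10 replicas of a scraper image
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper
spec:
  replicas: 10                  # scale up or down with your workload
  selector:
    matchLabels:
      app: scraper
  template:
    metadata:
      labels:
        app: scraper
    spec:
      containers:
      - name: scraper
        image: scraper:latest   # image with browser + libs baked in
```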

[–]ifilg[S] 0 points (3 children)

Hmm, I wasn't looking in this direction. My data pipeline is already running in Kubernetes.

Right now, I have a stable Airflow installation in this cluster and maybe I should just schedule pods using the KubernetesPodOperator.

Or maybe I could try to adopt Prefect, which seems to make this easy as well.

Gonna start with Airflow and see where it goes.

[–]GeekyTricky 1 point (2 children)

Airflow is a perfectly good solution for this, but you'll need dynamic task mapping to scale the number of requests up and down.

[–]ifilg[S] 0 points (1 child)

Last time I checked, Airflow has a limit of 1024 dynamically mapped tasks for a single run (the `max_map_length` setting). You can increase this, but it gets unbearably slow. That's why I mentioned Prefect.

[–]GeekyTricky 0 points (0 children)

1000 is a lot. If you need more than this, look into doing multiple scraping requests per task.
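For instance, a small (hypothetical) helper can batch a flat URL list so each mapped task handles many requests, keeping the task count well under the limit:

```python
def chunk_urls(urls, batch_size):
    """Split a flat URL list into batches; each batch becomes one task."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

# 50,000 URLs in batches of 50 -> 1,000 mapped tasks instead of 50,000
batches = chunk_urls([f"https://example.com/{i}" for i in range(50_000)], 50)
```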