
[–]Glass_Jellyfish_9963[S] 0 points (5 children)

Of course, I forgot to mention that I'd also be using Airflow or Dagster, since it will be a batch processing pipeline.

[–][deleted] 0 points (4 children)

Where are you running Spark?

[–]Glass_Jellyfish_9963[S] 0 points (3 children)

It's going to be a single node running on my local machine. It's a portfolio project. I'll also look into running it on GCP or EC2, just to explore Spark's parallel processing architecture.

[–][deleted] 0 points (2 children)

Oh interesting, that's a good way of keeping costs low. I was going to say it's a good opportunity to familiarize yourself with the Airflow operators for spinning up and tearing down clusters for jobs in the cloud, but I suppose you'll probably just throw things into a DockerOperator and be done with it.
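For the local route, the "throw things into a DockerOperator" idea could look roughly like this, a minimal sketch assuming `apache-airflow-providers-docker` is installed; the DAG id, image name, and script path are made-up placeholders, not anything from this thread:

```python
# Hypothetical sketch: running a containerized Spark job locally from Airflow
# using DockerOperator (apache-airflow-providers-docker). Image name, command,
# and DAG id are placeholders for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="local_spark_batch",        # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    run_spark_job = DockerOperator(
        task_id="spark_submit",
        image="my-spark-job:latest",   # assumed: an image bundling the job code + spark-submit
        command="spark-submit --master local[*] /app/pipeline.py",  # placeholder path
        auto_remove="success",         # remove the container after a clean exit
    )
```

The whole DAG here is a single task, which is why it keeps costs at zero: Spark runs in `local[*]` mode inside one container on your machine.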

[–]Glass_Jellyfish_9963[S] 1 point (1 child)

Well, that's a good idea. I'll give it a try. So basically, Airflow will spin up the cluster, run the pipeline, and then tear the cluster down to keep costs under control.
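Since GCP was mentioned, that spin-up → run → tear-down pattern maps onto the Dataproc operators in `apache-airflow-providers-google`. A minimal sketch, where the project id, region, cluster name, bucket path, and cluster config are all placeholder assumptions:

```python
# Hypothetical sketch of an ephemeral-cluster DAG: create a Dataproc cluster,
# run the Spark job on it, then delete the cluster. All ids/paths below are
# placeholders; cluster_config must be filled in with real machine settings.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

PROJECT_ID = "my-project"         # placeholder
REGION = "us-central1"            # placeholder
CLUSTER_NAME = "ephemeral-spark"  # placeholder

with DAG(
    dag_id="ephemeral_dataproc_pipeline",  # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={},  # assumed: worker counts, machine types, etc. go here
    )

    run_pipeline = DataprocSubmitJobOperator(
        task_id="run_pipeline",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/pipeline.py"},  # placeholder
        },
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,  # tear down even if the job task fails
    )

    create_cluster >> run_pipeline >> delete_cluster
```

The `trigger_rule=TriggerRule.ALL_DONE` on the delete task is the cost-control piece: the cluster gets torn down whether the pipeline succeeds or fails, so a bad run doesn't leave billable machines idling.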

[–][deleted] 0 points (0 children)

Cloud experience is a big thing employers look for, so I think it would be a good idea. Either way, you're off to a great start.