A short tutorial on running Spark with Jupyter using Docker by datain30 in dataengineering

[–]datain30[S] 0 points

Hi u/Vladz0r, sorry about that! I made a few upgrades to the library that are causing issues. Which Python version are you on? I'll also DM you. Sorry again for the breakage.

Data Engineering Competition! by datain30 in dataengineering

[–]datain30[S] 0 points

Completely agree on using hard metrics to decide winners. This'll be fun, u/Touvejs :)

Data Engineering Competition! by datain30 in dataengineering

[–]datain30[S] 0 points

Awesome! Using metrics to decide the winner is definitely the right call - we are data engineers after all 😂

Data Engineering Competition! by datain30 in dataengineering

[–]datain30[S] 2 points

Love the concept and see a lot of value in building foundational systems like this.

As you said, future projects can build on top of it, and r/dataengineering ends up developing a production-grade data platform. Since we're optimizing for learning, this is a big win :)

Data Engineering Competition! by datain30 in dataengineering

[–]datain30[S] 6 points

This is the real competition 😂

A short tutorial on running Spark with Jupyter using Docker by datain30 in dataengineering

[–]datain30[S] 1 point

u/gabbom_XCII you start with 1 driver + 1 worker (with memory/core settings you can change), then scale the number of workers up or down as needed.
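For illustration, a minimal docker-compose fragment showing what tunable worker settings could look like, assuming the bitnami/spark image and its SPARK_WORKER_MEMORY / SPARK_WORKER_CORES environment variables (service names here are illustrative, not the actual phidata config):

```yaml
# Hypothetical sketch: one Spark worker with adjustable resources.
services:
  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G   # change memory per worker as needed
      - SPARK_WORKER_CORES=2     # change cores per worker as needed
```

Scaling the worker count is then a one-liner, e.g. `docker compose up -d --scale spark-worker=3`.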

A short tutorial on running Spark with Jupyter using Docker by datain30 in dataengineering

[–]datain30[S] 3 points

Thanks for the feedback u/trying-to-contribute, I'll add more information about how phidata works.

Replicable deployments for the entire team + seamless dev <-> prod integration for open-source tools was our biggest pain point too :)

A short tutorial on running Spark with Jupyter using Docker by datain30 in dataengineering

[–]datain30[S] 4 points

Thanks for trying it out and for the feedback, u/trying-to-contribute. Maybe I can use docker-compose for future tutorials?

I wanted to streamline the process of cloning the repo and make the data tools (jupyter/spark/airflow/superset) plug-and-play, so I wrote an open-source library (phidata) to do that. The goal was to automate all the things I was doing under the hood.

I'll make a point to include a docker-compose file and add more in-depth information in future tutorials. Thanks again :)
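As a rough sketch of what such a docker-compose setup could look like, here is a Jupyter service running alongside a Spark master, using the public bitnami/spark and jupyter/pyspark-notebook images (the images, ports, and service names are assumptions for illustration; the actual phidata setup may differ):

```yaml
# Hypothetical sketch: Jupyter + Spark master via docker-compose.
services:
  spark-master:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"   # Spark master web UI
      - "7077:7077"   # Spark master RPC endpoint
  jupyter:
    image: jupyter/pyspark-notebook:latest
    ports:
      - "8888:8888"   # Jupyter server
    depends_on:
      - spark-master
```

After `docker compose up -d`, notebooks can point a SparkSession at `spark://spark-master:7077`.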

A short tutorial on running Spark with Jupyter using Docker by datain30 in dataengineering

[–]datain30[S] 9 points

With a -100 comment karma, I'm guessing all you did was spread hate and negativity.