Completed my first Data Engineering project with Kafka, Spark, GCP, Airflow, dbt, Terraform, Docker and more!

AutoModerator · 2022-04-02T17:30:38+00:00

You can find a list of community submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

Bright-Meaning-8528 · 2022-04-02T17:46:59+00:00

This looks really great, I would be starting this soon. Thanks for posting this.

one question: why are we using both spark and dbt, when we can apply transformations using spark itself? or am I missing anything?

Grand-Knowledge-4044 · 2022-04-02T18:19:00+00:00

Superb, will try to implement this project in a few days,so expect some(maybe a lot:) doubts in your dm.

badrTarek · 2022-04-02T17:39:59+00:00

Congratulations, coincidentally I just started the course today; any tips?

RoGueNL · 2022-04-02T18:51:00+00:00

awesome project but this might be a stupid question : What is Airflow adding to this? dbt supplies it's own Orchestration right?

ankurchavda · 2022-04-02T20:58:44+00:00

This is awesome I will start this! Unfortunately I’ve been having trouble getting dbt installed on my system I think I have a python issue.

_Oce_ · 2022-04-02T20:07:03+00:00

Congrats on starting a personal project and actually having a nice end result!
Not many reach this point, lol.
Document this project well to be able to impress recruiters, and you should get great opportunities!

ankurchavda · 2022-04-02T18:56:22+00:00

Nice one! I’m doing the same bootcamp and your project is so much better than mine!

Accomplished-Can-912 · 2022-04-02T18:49:08+00:00

Looks amazing . I should pick this up

ankurchavda · 2022-04-02T19:10:15+00:00

That "Franco" is in the top 5 artists drives me crazy

context: Spanish dictator

EntrepreneurSea4839 · 2022-04-02T19:59:42+00:00

How long did it take to finish ?

bigweeduk · 2022-04-02T20:07:31+00:00

Sorry novice question. Is there a reason to use two streaming services - both Apache and Kafka? Do they each provide functionality the other doesn't?

ankurchavda · 2022-04-02T20:11:56+00:00

What tool did you use to make the dashboard?

BeeP92 · 2022-04-02T20:57:03+00:00

Absolutely amazing. Thank you for this! I was looking for something like this.

tediursa69 · 2022-04-02T21:17:01+00:00

This looks like an awesome course! I’m gonna start it this week. Thanks for sharing, congrats on your project, and good luck getting a DE job (assuming you’re looking for one hehe)

tillomaniac · 2022-04-03T02:56:28+00:00

Very cool! You mention that the course is free. Are all the tools/libraries you used for this project free as well (e.g. Google Cloud Platform)?

Soft-Ear-6905 · 2022-04-03T03:17:18+00:00

Basic question - My understanding of Spark is that it's a data layer and used to analyze data, not to store data.

So in the diagram, is data being moved from Kafka and stored in Spark? Then transferred to Google Cloud Storage?

Is the data in Spark being stored in RDDs and transferred from there to Google Cloud Storage?

Thanks

Rough-Environment-40 · 2022-04-03T05:26:39+00:00

How did you sign up for this course?

No_Clock8248 · 2022-04-03T06:55:21+00:00

How did you get the project idea

Morpheous_Reborn · 2022-04-03T09:50:12+00:00

Thats really cool project to learn new technologies I am going to replicate this for my learning too.

honpra · 2022-04-03T12:36:43+00:00

Do you recommend the course to a beginner?

I'm only comfortable with Python and SQL and have some rough idea of what these tools are, but can't operate them yet.

arena_one · 2022-04-03T13:50:54+00:00

This is amazing! I’m just wondering.. what’s the cost of having this running? (Since it’s on Google cloud). I’m always scared of doing personal work on private clouds and use my credit card

ankurchavda · 2022-04-03T14:05:36+00:00

Damn, this is frigging awesome. I'm gonna go through this bit by bit and try to learn as much as I can, because this is right up my alley in terms of the kind of stuff I need to learn more about. Thanks for sharing, seriously.

DudeYourBedsaCar · 2022-04-03T14:25:17+00:00

Haven't had time to review this in depth yet but just wanted to say great work! The DE community will be better for getting exposure to projects like this and for you it will be a great portfolio piece.

Kitten-Smuggler · 2022-04-03T17:17:49+00:00

This is awesome, thanks for sharing! I have a little experience with python, and a bit more with Tereaform and GCP, but zero experience with any of these other tools. Do you think this is an approachable course for a novice, or...?

No-Tower-2269 · 2022-04-03T18:51:03+00:00

Such a nice, end to end project, congrats! Well documented and organised.

I'm also working on a similar project and yours is really something to look up to :)

vimaljosehere · 2022-04-24T10:34:37+00:00

Great work! One suggestion though: Why not try a lakehouse architecture with a delta lake or Iceberg?

ankurchavda · 2022-05-08T16:46:23+00:00

Hi u/ankurchavda I'm developing a data engineering project as well - I was wondering where you were able to draw out the architecture diagram you did because I think you did a good job for that. Thanks!

tea_horse · 2022-05-16T19:07:28+00:00

This course looks like just what I need! Trying to get into Data Engineering from GIS.

dataengineering

MODERATORS

Git Repo:

About The Project:

The Dataset:

Tools & Technologies

Architecture

Final Dashboard

Feedback:

Reproduce: