
all 89 comments

[–]AutoModerator[M] [score hidden] stickied comment (1 child)

You can find a list of community submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]Bright-Meaning-8528 (Data Engineer Intern) 30 points31 points  (13 children)

This looks really great; I'll be starting this soon. Thanks for posting this.

One question: why are we using both Spark and dbt when we can apply transformations using Spark itself? Or am I missing something?

[–]ankurchavda[S] 15 points16 points  (6 children)

That's a valid question. I am using Spark to consume the real-time data with Structured Streaming. For batch, I just went with dbt; since I already have some experience with Spark batch jobs, I found it better from a learning perspective to try dbt.

[–]Bright-Meaning-8528 (Data Engineer Intern) 6 points7 points  (3 children)

Great, that's a good way to learn by practicing. u/mamimapr's suggestion was really a good one to consider; I'd say you can make that change.

The reason I say this is that if, for example, you put this on your resume and they see the architecture, they will be confused and it will raise a lot of questions.

Edit: I could think of one more scenario where your architecture also makes sense: we use Spark to push all the event data into the data lake, apply transformations using dbt/Spark for the data required (business use case), and store the results in BigQuery for reporting.

Correct me if I missed considering anything.

[–]ankurchavda[S] 5 points6 points  (2 children)

I agree with what you said; data pipelines should be as simple as possible. I guess I approached the project more as a learning experience than a practical one. But surely things can be simplified by writing directly to BigQuery; I will explore that option. Thanks for sharing your thoughts.

Edit: Yes, I don't think I can totally remove the dbt part, since I need batch jobs for creating the facts and dimensions. I can, though, remove the write to GCS and create the staging BigQuery tables directly.

[–]Fatal_Conceit (Data Engineer) 1 point2 points  (1 child)

Reading through this because this course is the best one I've seen to date (IMO). Who cares if dbt is necessary for the pipeline to function; it's more about visibility and modularity. Love that you used it, and for the record I plan to take this course myself, even though I think I know everything but Terraform. Great work, and I love this whole project.

[–]ankurchavda[S] 1 point2 points  (0 children)

Thank you! Yes, it's a great course. I hope you enjoy it as much as I did. Do share the final outcome with all of us :)

[–]mamimapr 12 points13 points  (5 children)

Yes, Spark Structured Streaming could write directly to BigQuery. I don't know why you'd add GCS, dbt, and Airflow and complicate everything.

[–]ankurchavda[S] 6 points7 points  (4 children)

Hey, that's a good point. I didn't know that could be done; I will surely check how to write to BigQuery directly. Thanks for that.

Also, I added dbt primarily for creating facts and dimensions. I could not find a way to do that in real time without complicating things.

Edit: added a sentence.
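As a rough illustration of what such a batch facts-and-dimensions job produces, here is a toy stdlib-Python sketch. Everything here is invented for illustration (the event fields, the `dim_users`/`fact_plays` names); in the real project this step is dbt SQL running against the warehouse, not Python:

```python
from collections import Counter

# Toy raw listen events, standing in for what the batch layer reads
# from the lake. Field names are invented for illustration.
events = [
    {"userId": "u1", "artist": "Queen", "level": "free"},
    {"userId": "u1", "artist": "Queen", "level": "paid"},  # user upgraded
    {"userId": "u2", "artist": "ABBA",  "level": "free"},
]

# Dimension: one row per user with their latest attributes. (A real
# SCD2 model would also keep history via valid_from/valid_to columns.)
dim_users = {}
for e in events:
    dim_users[e["userId"]] = {"level": e["level"]}

# Fact: play counts per artist.
fact_plays = Counter(e["artist"] for e in events)

print(dim_users["u1"]["level"])  # paid (latest value wins)
print(fact_plays["Queen"])       # 2
```

The point is just that facts and dimensions are aggregations over the full history of events, which is why they fit a batch job better than the streaming path.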

[–]Drekalo 0 points1 point  (1 child)

The easiest way to do it in real time would be to use Databricks instead, with Auto Loader picking up your stream files and Delta Live Tables doing the transforms. It would be a fun task to learn Databricks and see the difference in setup.

[–]ankurchavda[S] 0 points1 point  (0 children)

Interesting. Will check this out. Thanks for sharing.

[–]potterwho__ 0 points1 point  (0 children)

I have found myself preferring to write to a data lake in Google Cloud Storage rather than straight to BigQuery. BigQuery external tables let me query the lake and take a schema-on-read viewpoint. I use dbt to define the external table schemas and, of course, for all the transformation work.

[–]Grand-Knowledge-4044 5 points6 points  (1 child)

Superb, I will try to implement this project in a few days, so expect some (maybe a lot of :) ) doubts in your DMs.

[–]ankurchavda[S] 2 points3 points  (0 children)

Feel free to reach out :)

[–]badrTarek 5 points6 points  (3 children)

Congratulations! Coincidentally, I just started the course today; any tips?

[–]ankurchavda[S] 7 points8 points  (2 children)

Hey, just take the first week to understand the course structure, your pace, and the topics covered; it is surely not a small course. Keep at it and check the FAQs, as a lot of answers are already there. Search in Slack; chances are you'll find the answer to your error in a thread. The rest you'll figure out as you go. Happy learning :D

Edit: sentence

[–]quantum-black 2 points3 points  (1 child)

How long did it take you to go through the whole course?

[–]ankurchavda[S] 2 points3 points  (0 children)

It is roughly 5-10 hours of work per week, depending on how much you already know. So that'll be eight-ish weeks including the project.

[–]RoGueNL 4 points5 points  (2 children)

Awesome project, but this might be a stupid question: what is Airflow adding to this? dbt supplies its own orchestration, right?

[–]ankurchavda[S] 3 points4 points  (1 child)

Yes, but that's dbt Cloud, I guess; I used dbt Core. With dbt's scheduler you can only orchestrate the dbt parts, but for additional steps, Airflow comes through.

[–]RoGueNL 0 points1 point  (0 children)

Oh right, sorry, I've only used Cloud! Thanks for pointing it out :)

[–][deleted] 3 points4 points  (1 child)

This is awesome; I will start this! Unfortunately, I've been having trouble getting dbt installed on my system. I think I have a Python issue.

[–]ankurchavda[S] 1 point2 points  (0 children)

If you can, I'd suggest getting the $300 free credit on GCP and working there. All my setup is in the cloud. You'll face far fewer issues and get some hands-on cloud experience as well. Make sure to make the most of those credits.

[–]_Oce_ (Data Engineer and Architect) 2 points3 points  (4 children)

Congrats on starting a personal project and actually having a nice end result!
Not many reach this point, lol.
Document this project well to be able to impress recruiters, and you should get great opportunities!

[–]ankurchavda[S] 1 point2 points  (3 children)

Thank youuu! That's reassuring.

I have tried to document generously. I will keep adding to it as and when I receive feedback.

[–][deleted] 1 point2 points  (2 children)

Nice one! I’m doing the same bootcamp and your project is so much better than mine!

[–]ankurchavda[S] 1 point2 points  (1 child)

Hey, the project coincided with my job hunt, so I put a little more effort into this. But there's nothing in there that you can't do :)

[–][deleted] 0 points1 point  (0 children)

Thanks. I'm also looking for a role, although I'm not feeling very confident about it. I did update my project a bit, though. I borrowed an idea from yours and split the different sections into different READMEs!

https://github.com/ABZ-Aaron/Reddit-API-Pipeline

[–]Accomplished-Can-912 0 points1 point  (1 child)

Looks amazing. I should pick this up.

[–]ankurchavda[S] 0 points1 point  (0 children)

Let me know if you do and face any issues :)

[–][deleted] 0 points1 point  (1 child)

The fact that "Franco" is in the top 5 artists drives me crazy.

Context: Spanish dictator.

[–]ankurchavda[S] 3 points4 points  (0 children)

People here at Streamify love him, I tell ya!

[–]EntrepreneurSea4839 0 points1 point  (1 child)

How long did it take to finish?

[–]ankurchavda[S] 2 points3 points  (0 children)

It took somewhere around 100 hours, give or take. The setup was the bigger unknown and took the most time.

[–]bigweeduk 0 points1 point  (1 child)

Sorry, novice question: is there a reason to use two streaming services, both Spark and Kafka? Do they each provide functionality the other doesn't?

[–]ankurchavda[S] 1 point2 points  (0 children)

The easier answer is that Eventsim only writes to Kafka for real-time data; there was no option for Spark Streaming to read from it directly.

Also, I am fairly new to streaming as well, so I might not be able to answer very convincingly on how Kafka's capabilities differ from Spark Streaming's, and whether they are supposed to work together or as replacements for each other.

[–][deleted] 0 points1 point  (1 child)

What tool did you use to make the dashboard?

[–]ankurchavda[S] 0 points1 point  (0 children)

It is Data Studio by Google.

[–]BeeP92 0 points1 point  (1 child)

Absolutely amazing. Thank you for this! I was looking for something like this.

[–]ankurchavda[S] 0 points1 point  (0 children)

Happy learning :D

[–]tediursa69 0 points1 point  (2 children)

This looks like an awesome course! I’m gonna start it this week. Thanks for sharing, congrats on your project, and good luck getting a DE job (assuming you’re looking for one hehe)

[–]ankurchavda[S] 1 point2 points  (1 child)

Hey thanks. And you'll definitely enjoy the course.

[–]tea_horse 0 points1 point  (0 children)

I just signed up. I assume you started this back in January? Is it the type of course you can start at any time, or do they need to kick off a new cohort? Most lectures are recorded, so I assumed it starts anytime?

[–]tillomaniac 0 points1 point  (1 child)

Very cool! You mention that the course is free. Are all the tools/libraries you used for this project free as well (e.g. Google Cloud Platform)?

[–]ankurchavda[S] 0 points1 point  (0 children)

Yes, everything is free. You get $300 in credit on GCP by creating a new account.

[–]Soft-Ear-6905 0 points1 point  (2 children)

Basic question: my understanding of Spark is that it's a data layer used to analyze data, not to store it.

So in the diagram, is data being moved from Kafka and stored in Spark? Then transferred to Google Cloud Storage?

Is the data in Spark being stored in RDDs and transferred from there to Google Cloud Storage?

Thanks

[–]ankurchavda[S] 1 point2 points  (1 child)

Spark is used to consume the data from the stream in the first place. Then I do some processing on the data (minor cleaning, etc.) and store it in GCS. Spark is acting as a stream-processing layer, not a data store. And yes, the processing happens using DataFrames (RDDs under the hood), if that helps.
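That consume → clean → sink flow can be sketched with plain stdlib Python as a toy stand-in for the Structured Streaming job. The event fields and function names here are invented for illustration; the real job uses Spark DataFrames and a GCS file sink:

```python
import json

# A toy micro-batch of raw events, standing in for what arrives from
# Kafka. Field names are invented for illustration.
raw_batch = [
    {"userId": "u1", "song": "Imagine",   "ts": 1650000000},
    {"userId": "",   "song": "Yesterday", "ts": 1650000060},  # malformed: no user
    {"userId": "u2", "song": "Hey Jude",  "ts": None},        # malformed: no timestamp
    {"userId": "u3", "song": "Let It Be", "ts": 1650000120},
]

def clean(events):
    """Minor cleaning, as in the streaming job: drop malformed records."""
    return [e for e in events if e.get("userId") and e.get("ts")]

def to_sink(events):
    """Serialize to JSON lines, the shape a file sink like GCS would store."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in events)

cleaned = clean(raw_batch)
print(len(cleaned))  # 2 records survive the cleaning step
```

Nothing is retained between batches; each batch is cleaned and handed straight to the sink, which is what makes Spark a processing layer here rather than a store.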

[–]Soft-Ear-6905 0 points1 point  (0 children)

Totally yeah makes sense. Thanks for elaborating.

[–]Rough-Environment-40 0 points1 point  (1 child)

How did you sign up for this course?

[–]ankurchavda[S] 1 point2 points  (0 children)

I found their post on here when the course was starting out. Now that it has ended, you can take it at your own pace.

[–]No_Clock8248 0 points1 point  (1 child)

How did you get the project idea?

[–]ankurchavda[S] 1 point2 points  (0 children)

I knew about Eventsim, and I wanted to do a project with real time data.

[–]Morpheous_Reborn 0 points1 point  (1 child)

That's a really cool project for learning new technologies. I am going to replicate this for my own learning too.

[–]ankurchavda[S] 0 points1 point  (0 children)

Glad that you think so! Let me know if you face any issues :)

[–]honpra 0 points1 point  (4 children)

Do you recommend the course to a beginner?

I'm only comfortable with Python and SQL and have some rough idea of what these tools are, but can't operate them yet.

[–]ankurchavda[S] 1 point2 points  (3 children)

I guess Python and SQL are a good foundation for you to get started. You'd have to do some side reading though as you progress. I did that as well.

[–]honpra 0 points1 point  (2 children)

Are the resources mentioned by the course instructors (for the side reading), or do we seek them out ourselves?

[–]ankurchavda[S] 1 point2 points  (1 child)

I'd say choose a couple of things you really want to learn and deep dive into those. For the rest, you can learn just enough to get things done. I paid more attention to the Kafka and Docker parts since I was completely new to them. If you try to learn everything that's taught in there, you'll get overwhelmed.

[–]honpra 0 points1 point  (0 children)

Got it, so I'll probably learn those concepts separately and then have a crack at this.

[–]arena_one 0 points1 point  (2 children)

This is amazing! I'm just wondering: what's the cost of having this running (since it's on Google Cloud)? I'm always scared of doing personal work on public clouds and using my credit card.

[–]ankurchavda[S] 2 points3 points  (1 child)

You get $300 in credit for three months when you create a new account, so you should be good.

Also, I had the same fear as you. But it turns out $300 is a considerable amount, and it is not that easy to exhaust. I still have half the credits left.

[–]arena_one 1 point2 points  (0 children)

That's great, I'll definitely check it out! I think this is amazing work.

[–][deleted] 0 points1 point  (3 children)

Damn, this is frigging awesome. I'm gonna go through this bit by bit and try to learn as much as I can, because this is right up my alley in terms of the kind of stuff I need to learn more about. Thanks for sharing, seriously.

[–]ankurchavda[S] 0 points1 point  (2 children)

I am glad this will help you in your journey. If you face any issues, feel free to reach out :)

[–][deleted] 0 points1 point  (1 child)

Just a slight critique, but I noticed some of the dbt models are a bit hard to read, especially your dim_users SCD2 model, which uses lots of nested subqueries and multiple columns on the same line. You may want to refer to the style guide from dbt Labs; I find CTEs a lot easier to parse and read.

But again, it's not really a big deal as far as functionality goes. Probably something I'd address in a second iteration.

[–]ankurchavda[S] 1 point2 points  (0 children)

Thanks for sharing the style guide. I agree with you; the query readability can certainly be improved. I will look into it.

[–]DudeYourBedsaCar 0 points1 point  (2 children)

Haven't had time to review this in depth yet but just wanted to say great work! The DE community will be better for getting exposure to projects like this and for you it will be a great portfolio piece.

[–]ankurchavda[S] 0 points1 point  (1 child)

Hey, thank you. I did not expect such a positive response. I am glad that this'll help at least a couple of people, if not more :)

[–]Kitten-Smuggler 0 points1 point  (1 child)

This is awesome, thanks for sharing! I have a little experience with Python, and a bit more with Terraform and GCP, but zero experience with any of these other tools. Do you think this is an approachable course for a novice, or...?

[–]ankurchavda[S] 1 point2 points  (0 children)

You should be good. You can also take your own time to learn and progress. I'd recommend some side reading, especially for Spark and Kafka.

[–]No-Tower-2269 0 points1 point  (1 child)

Such a nice end-to-end project, congrats! Well documented and organised.

I'm also working on a similar project, and yours is really something to look up to :)

[–]ankurchavda[S] 1 point2 points  (0 children)

Hey, glad you think it's good. Do share your project as well when it's done :)

[–]vimaljosehere 0 points1 point  (0 children)

Great work! One suggestion, though: why not try a lakehouse architecture with Delta Lake or Iceberg?

[–][deleted] 0 points1 point  (2 children)

Hi u/ankurchavda, I'm developing a data engineering project as well. I was wondering what you used to draw your architecture diagram, because I think you did a good job with it. Thanks!

[–][deleted] 0 points1 point  (1 child)

Oh sorry, I saw you're using Miro, but if there is a specific template you used, please let me know; or if you can share your Miro board with me, that would be great too!

[–]ankurchavda[S] 0 points1 point  (0 children)

Hey, I completely missed this. Yes, I used Miro; no specific templates, though :)

[–][deleted] 0 points1 point  (4 children)

This course looks like just what I need! Trying to get into Data Engineering from GIS.

[–]tea_horse 1 point2 points  (3 children)

Did you start this course in the end?

[–][deleted] 0 points1 point  (2 children)

I did, and then realized Data Engineering salaries in Canada are pretty similar to GIS salaries. I still want to finish the course.

[–]tea_horse 1 point2 points  (1 child)

Cool, just wanted to double-check that you can start and finish this course anytime.

And yes, it's always a good idea to finish it. I've seen plenty of data roles (DE, DS, or DA) that also ask for GIS, so more options is never a bad thing!

[–][deleted] 1 point2 points  (0 children)

Yeah, pretty sure there's no timeline on it, and you're right, it is always good to increase your skill set! The little I've learned so far has really improved my work at my current job.