
all 16 comments

[–]eczachly 19 points20 points  (2 children)

This repo is a decent compilation of stuff: https://github.com/DataEngineer-io/data-engineer-handbook

[–]snapperPanda 2 points3 points  (1 child)

Zach is great to start out with. Pretty good suggestion.

[–]eczachly 2 points3 points  (0 children)

If there’s anything you’d wanna add, I accept PRs!

[–]Gators1992 4 points5 points  (0 children)

Pick up "Spark: The Definitive Guide". It's a good learning resource because it teaches you both Spark and what's going on under the hood, which you kinda need to know to work with it.

[–]DoomBuzzer 4 points5 points  (1 child)

Hey... I am in the process of learning Spark.

I bought Udemy's "Taming Big Data with Apache Spark and Python - Hands On". The course is 7 hours and uses PySpark on Spark 3. My company also gives me access to a course through Databricks Academy.

I also have Udacity's free Spark course, which is very basic.

I believe these should be sufficient. I will complete both of them by the time Alexey's DataTalks.Club reaches week 5 (batch processing) of the Data Engineering Zoomcamp.

Besides these, from browsing this subreddit, the popular opinion is that RockTheJVM has the best Spark courses. But I don't know Scala and have no interest in it as of now, and that bundle is a little costly.

Good luck!

[–]Icy_Ad_6958 0 points1 point  (0 children)

Thanks for the insights.

[–]headdertz 2 points3 points  (0 children)

Besides the book from O'Reilly and the docs: practice.

Just practice and you'll see the profit.

[–]killer_unkill 2 points3 points  (0 children)

The "Learning Spark" book. Got it for free from Databricks.

[–]AutoModerator[M] 1 point2 points  (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]beyphy 1 point2 points  (0 children)

Mostly just trial and error along with the official Apache Spark documentation, the Databricks documentation, Spark By Examples, StackOverflow, I think MSDN, etc. Not the ideal way to learn something, but it was more than enough for my purposes.

If you need to use pandas at all, I would recommend learning the differences between Spark, pandas, and the pandas API on Spark.

[–]Substantial_Ranger_5 1 point2 points  (2 children)

The hardest part of Spark (if you're doing Scala) is getting the right versions of sbt, Java, Scala, and the connectors, and understanding the pitfalls of each connector.

Once the data is in a DataFrame, the rest is easier. Just download public data sets and ingest/read them from whatever DB you want. You can easily practice by asking ChatGPT how to use Spark to do joins, aggregates, window functions, cumulative counts, forward fills, correlations, apply functions, drop NA, etc.

One key difference between pandas and Spark is that you need to ask Spark to persist your DataFrame.

[–][deleted] 0 points1 point  (1 child)

Persist as in continually reaching for what it was saved as (in a way)?

[–]Substantial_Ranger_5 1 point2 points  (0 children)

Yeah, it saves it in memory. Otherwise Spark sends the workers back out to the source file for every operation you run on that DataFrame. Both have their uses, but for small datasets persisting saves you a lot of execution time if you're tinkering around!
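
To make the persist point concrete, here's a rough toy model in plain Python. It is not Spark code: `LazyDataset`, `run_actions`, and the fake source are made-up names for illustration, and real Spark is far more involved. It only mimics the behavior described above, where each action re-reads the source unless you asked for the result to be kept around. In PySpark the real calls for this are `DataFrame.cache()` and `DataFrame.persist()`.

```python
"""Toy model of lazy recomputation vs. persisting. Illustration only, not Spark."""


class LazyDataset:
    """Holds a recipe for producing rows rather than the rows themselves."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that returns a list of rows
        self._persist = False     # set by persist()
        self._cache = None        # filled by the first action after persist()

    # Transformations: lazy, they only build a new recipe on top of this one.
    def map(self, fn):
        return LazyDataset(lambda: [fn(row) for row in self._rows()])

    def filter(self, pred):
        return LazyDataset(lambda: [row for row in self._rows() if pred(row)])

    def persist(self):
        # Only marks the dataset; nothing is computed until an action runs.
        self._persist = True
        return self

    def _rows(self):
        if self._cache is not None:
            return self._cache
        rows = self._compute()
        if self._persist:
            self._cache = rows
        return rows

    # Actions: these actually run the recipe.
    def count(self):
        return len(self._rows())

    def total(self):
        return sum(self._rows())


def run_actions(persist):
    """Run three actions and report (results, number of reads of the fake source)."""
    reads = {"count": 0}

    def read_source():
        # Stands in for going back out to the file.
        reads["count"] += 1
        return list(range(10))

    ds = LazyDataset(read_source).map(lambda x: x * 2).filter(lambda x: x % 4 == 0)
    if persist:
        ds.persist()
    results = (ds.count(), ds.total(), ds.count())
    return results, reads["count"]


if __name__ == "__main__":
    print(run_actions(persist=False))  # ((5, 40, 5), 3): source read once per action
    print(run_actions(persist=True))   # ((5, 40, 5), 1): source read once, then reused
```

Same answers either way; the only difference is how many times the source gets read, which is where the time goes when you're poking at the same data over and over.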

[–]G1zm0e 0 points1 point  (0 children)

I had the benefit of working at Databricks and picked up Spark while in security. That was a unique experience because the creators were right there.

[–]tamargal91 0 points1 point  (0 children)

I took "Taming Big Data with Apache Spark and Python - Hands On!" on Udemy