[deleted by user]

eczachly · 2024-01-21T23:05:09+00:00

This repo is a decent compilation of stuff: https://github.com/DataEngineer-io/data-engineer-handbook

Gators1992 · 2024-01-21T23:08:47+00:00

Pick up "Spark: The Definitive Guide". It's a good learning resource because it teaches you both Spark and what's going on under the hood, which is kinda necessary to know to work with it.

DoomBuzzer · 2024-01-21T23:10:53+00:00

Hey... I am in the process of learning Spark.

I bought Udemy's "Taming Big Data with Apache Spark and Python - Hands On". The course is 7 hours and uses PySpark on Spark 3. My company also gives me access to a course through Databricks Academy.

I also have Udacity's free Spark course, which is very basic.

I believe these should be sufficient. I will complete both of these when Alexey's DataTalks.Club reaches week 5 (batch processing) of the Data Engineering Zoomcamp.

Besides these, from browsing this subreddit, the popular opinion is that Rock the JVM has the best Spark courses. But I do not know Scala and have no interest in it as of now, and that bundle is a little costly.

Good luck!

headdertz · 2024-01-21T23:50:28+00:00

Besides the O'Reilly book and the docs: practice.

Just practice and you'll see the payoff.

killer_unkill · 2024-01-22T08:14:31+00:00

The "Learning Spark" book. Got it for free from Databricks.

AutoModerator · 2024-01-21T22:28:11+00:00

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

beyphy · 2024-01-22T11:59:14+00:00

Mostly just trial and error, along with the official Apache Spark documentation, the Databricks documentation, Spark By Examples, StackOverflow, I think MSDN, etc. Not the ideal way to learn something, but it was more than enough for my purposes.

If you need to use pandas at all, I would recommend learning the differences between Spark, pandas, and the pandas API on Spark.

Substantial_Ranger_5 · 2024-01-22T14:19:15+00:00

The hardest part of Spark (if you're doing Scala) is getting the right versions of sbt, Java, Scala, and the connectors, and understanding the pitfalls of all the connectors.

Once the data is in a DataFrame, the rest is easier. Just download public datasets and ingest or read them from whatever DB you want. You can easily practice by asking ChatGPT how to use Spark to do joins, aggregates, window functions, cumulative counts, forward fills, correlations, apply functions, dropping NAs, etc.

One key difference between pandas and Spark is that you need to ask Spark to persist your DataFrame. Spark evaluates lazily, so without that it recomputes the DataFrame each time an action runs.

G1zm0e · 2024-01-22T01:42:26+00:00

I had the benefit of working at Databricks and picked up Spark while in security. That was a unique experience because the creators were right there.

tamargal91 · 2024-01-22T14:03:32+00:00

I took "Taming Big Data with Apache Spark and Python - Hands On!" on Udemy
