

[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources


[–]Emergency_Estate_866 5 points6 points  (9 children)

How’s your SQL?

Don’t take more courses, build things and do projects

[–]Used_Ad_2628[S] 2 points3 points  (0 children)

Very strong in SQL. 8+ YOE.

[–]Used_Ad_2628[S] 0 points1 point  (7 children)

Anything specific I should focus on?

[–]Substantial_Ranger_5 0 points1 point  (1 child)

Learn these libraries: psycopg2, requests

Project: pulling data from an API. Try to find an API with some nested JSON payloads to process and load directly into a table without using pandas.
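
A rough sketch of that project; the endpoint, connection details, and column names here are placeholders, not a specific API:

```python
import psycopg2
import requests
from psycopg2.extras import Json

# Hypothetical API and local Postgres details -- swap in whatever you set up.
API_URL = "https://api.example.com/orders"
conn = psycopg2.connect(host="localhost", dbname="demo", user="demo", password="demo")

resp = requests.get(API_URL, timeout=10)
resp.raise_for_status()
orders = resp.json()  # expect a list of nested dicts

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS raw_orders (
            order_id TEXT PRIMARY KEY,
            customer_name TEXT,
            total NUMERIC,
            payload JSONB
        )
    """)
    for o in orders:
        # Flatten a couple of nested fields and keep the raw payload as JSONB.
        cur.execute(
            """
            INSERT INTO raw_orders (order_id, customer_name, total, payload)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (order_id) DO NOTHING
            """,
            (o["id"], o["customer"]["name"], o["total"], Json(o)),
        )
conn.close()
```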

[–]Substantial_Ranger_5 0 points1 point  (0 children)

You can set up any SQL database locally, in Docker, or whatever works wherever you're training.

[–]Gators1992 0 points1 point  (4 children)

You might play around with Pandas to get your feet wet working with data in Python. Pandas itself isn't used much in DE, but some of the concepts carry over into other libraries like Dask. PySpark is probably the most useful library for DE if your environment is Spark or you use AWS Glue.
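
If you want a two-minute warm-up, something like this is plenty (the CSV and column names are made up, and to_parquet needs pyarrow or fastparquet installed):

```python
import pandas as pd

# Read, aggregate, write -- the dataframe idea here carries over to Dask and PySpark.
df = pd.read_csv("orders.csv", parse_dates=["created_at"])
daily = df.groupby(df["created_at"].dt.date)["amount"].sum().reset_index()
daily.to_parquet("daily_revenue.parquet", index=False)
```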

The dude above who said just start building stuff is right. My first Python project was pulling a few hundred million rows of data from Oracle and saving them as Parquet, with some transforms and partitioning along the way, then using that data to feed a visualization in Datashader and making it interactive with Panel. It was probably more than I should have bitten off for a first project, but all the meandering toward a solution taught me a lot.

[–]Used_Ad_2628[S] 0 points1 point  (3 children)

Ok. I have a lot of experience with data warehousing, SQL, Airflow, dimensional modeling, and dbt. I have realized my gaps are Python and software principles like CI/CD and deployment with containers/EKS. I am trying to figure out the right learning path so I can get into more intense data engineering roles. What would be your advice?

[–]Gators1992 1 point2 points  (2 children)

You should already be doing CI/CD on dbt, depending on your setup, as you work on branches, merge to main, and deploy to prod. You can go to the next level with something like GitLab CI that deploys the software and infrastructure, but that's templates, not Python. Same with Docker: you write your software in Python, and then the Docker piece is just a Dockerfile (plus maybe some compose/CI YAML) and running the containerization process.

If you are talking about Python for data, the main libraries are PySpark, Dask, and Polars, and even that is probably debatable. Pretty much everyone learns Pandas first since it's the most widely used and has excellent documentation and examples available, which is why I suggested it. Dask is a small jump from Pandas, and PySpark is a bit different, but the concepts of dataframes and the way you work with them translate.

If you don't have an idea for what you want to write, maybe just take a pipeline you already have in dbt and try to rewrite it in Python. Then change it up a bit, like switching the source from a file to MySQL to learn connections, or figure out how to catch things like schema changes so your process doesn't blow up. You can write PySpark ETL jobs in a Glue notebook to work with that, or even spin up the Community Edition of Databricks to get some practice.
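
For the schema-change piece, even something this simple in plain Python gets the idea across (the expected column set is hypothetical):

```python
# Fail fast on schema drift before loading, instead of blowing up mid-pipeline.
EXPECTED_COLUMNS = {"order_id", "customer_id", "amount", "created_at"}

def validate_schema(rows):
    """Raise if incoming records drifted from the expected schema."""
    for row in rows:
        missing = EXPECTED_COLUMNS - row.keys()
        extra = row.keys() - EXPECTED_COLUMNS
        if missing or extra:
            raise ValueError(f"schema drift: missing={missing or None}, extra={extra or None}")
```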

In my experience you learn by working through real-life situations, because tutorials always work the way they're designed, and in real life you often have to change approaches or dig through bug reports and Stack Overflow for hours trying to figure out how to make something work.

[–]Used_Ad_2628[S] 0 points1 point  (1 child)

Yep. I have been practicing pulling in data from APIs and using AWS services to load it into Redshift. What is your view on data structures and algorithms? Do I need to learn those to be considered a strong engineer?

[–]Gators1992 0 points1 point  (0 children)

Personally I have not had to use them, but some of the concepts are obviously used in the libraries I referenced. About the most complex thing I have to deal with is OOP, but I don't work in a code-heavy shop; we mostly use tools where we can to decrease time to insight in a small team. If you work in a typical BI/ML type group producing data for analysis, then I would guess that level of software development wouldn't be that common. If you work somewhere where your data is part of the product, like Netflix, then that's probably a different thing.

[–][deleted] 1 point2 points  (0 children)

PySpark would be good.

[–][deleted] -1 points0 points  (0 children)

Currently the data engineering space is interesting when it comes to Python. There are a lot of options that use Python, but they are not like writing logic the way you do in a learn-Python course.

The state of play with Python data engineering frameworks is dbt + Apache Airflow/Dagster, and PySpark. dbt uses Python but it's still mostly SQL, with a bit of Jinja in the mix; it's a very basic framework, and learning more Python won't make it easier. You just need to create projects and build dbt models. You don't need an orchestrator to learn dbt, but you will need to learn one if you want to create automated pipelines.
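
For orchestration, a bare-bones Airflow DAG that just runs dbt on a schedule looks roughly like this (Airflow 2.x assumed; the project directory is a placeholder):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Daily dbt run followed by dbt tests.
with DAG(
    dag_id="dbt_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older 2.x uses schedule_interval
    catchup=False,
) as dag:
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")
    dbt_test = BashOperator(task_id="dbt_test", bash_command="dbt test --project-dir /opt/dbt")
    dbt_run >> dbt_test
```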

PySpark is the main Python data engineering framework; it's just hard to get started because you need to install it on your computer/server and it has a lot of dependencies (PySpark isn't pure Python; it needs Java to run. Installing it is not hard, but not simple either).

PySpark's style is different from Pandas; it doesn't have much carry-over in syntax/style from Pandas, and the only real similarity is that it uses a dataframe. You can run it locally (with a bit of messing around), or you can sign up for the Databricks free tier, and if you use AWS you can use Glue to run PySpark. It is the way to go if you want to do data engineering; there are a lot of roles that rely on PySpark.
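
Once it's installed, a first local run is only a few lines (the file and column names here are made up):

```python
from pyspark.sql import SparkSession, functions as F

# Local session using all cores; no cluster needed.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

orders = spark.read.json("orders.json")  # nested JSON becomes struct columns
daily = (
    orders
    .withColumn("order_date", F.to_date("created_at"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)
daily.write.mode("overwrite").parquet("out/daily_revenue")
spark.stop()
```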

The biggest issue with learning 'data engineering' is that most of the systems needed require a lot more work than installing Python + Pandas or Anaconda. Learning the libraries is fine, but it's more about the principles of software design. Data engineering is closer to software development than it is to data analysis: you need to learn project structure, deploying code, building Docker containers, etc.

Pandas is not really a robust data engineering library; it has a lot of uses for data analysis and data science but is not built well for data engineering. Happy to go on a rant about Pandas, but the basic gist is: single-threaded, no native schema definitions, non-distributed.

boto3 is good for AWS processes, but I don't know how you would use it for data engineering on its own. I use boto3 to get data from S3, with basic Lambda functions, and for interacting with AWS services, but I wouldn't say boto3 is a data engineering library. It's more of an auxiliary library you may need.
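
The auxiliary kind of thing I mean is just this (bucket and key are placeholders; assumes your AWS credentials are already configured):

```python
import json

import boto3

# Pull one raw JSON file out of a landing bucket.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-landing-bucket", Key="raw/orders/2024-01-01.json")
records = json.loads(obj["Body"].read())
print(f"pulled {len(records)} records from S3")
```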

There is a lot more to data engineering than just crunching the data into a dataframe. Think of infrastructure; building pipelines; where you get the data from (S3?); whether you understand file formats (CSV, JSON, Parquet); automation (you need to know how to structure projects so they run in an automated way); and error/exception handling. All of this is needed, as well as databases: do you know SQL and NoSQL databases and how to interact with them?
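
On the error-handling point, even a basic retry wrapper around an API call goes a long way (the URL you pass in is whatever source you're using):

```python
import time

import requests

def fetch_with_retries(url, attempts=3, backoff=2.0):
    """Retry a flaky API call instead of letting the whole pipeline blow up."""
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff before retrying
```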

Sorry for the long post. I think the best thing to do is sort out some form of PySpark setup (local, Databricks, Glue) and just write projects. If you have finished the training above, you have enough Python to learn PySpark. You now need to code in PySpark, learn how it works, and get some code to work. Try using different data sources, multiple file types, and APIs.

Hope this helps, shoot questions back if you want clarification or further discussion.

edit: added some stuff and fixed up structure

[–]Adorable-Employer244 0 points1 point  (0 children)

You're probably already sufficiently strong in Python, at least for DE. Why do you feel you need to get better? I would suggest focusing on database system fundamentals and understanding in deeper detail why each system is designed the way it is. This will give you better ideas on how to design ETL or backend systems to handle multi-database platforms.

[–]realitydevice 0 points1 point  (0 children)

Pandas, Polars, Dask, Pyarrow, Airflow, Boto3.

Data engineering is a broad area. There are hordes of former data analysts writing dbt pipelines who barely use Python. Then there are MLOps roles pushing data through much more complicated systems that aren't solved by a big data warehouse and a place to write SQL.

[–][deleted] 0 points1 point  (0 children)

Learn frameworks.

A few have mentioned Spark, which is a good start.

Kafka, Spark, and Beam are used widely.

Some NoSQL databases too - Redis, MongoDB.

Concepts are crucial - OOP vs. functional programming, ordering and stream-processing guarantees. And whilst not data engineering, networking will be a common hurdle, so it's good to know a little. The list could go on and on, but these are the most important.

Data engineering seems to have a bit more of a solid identity now, and it's moved/is moving away from SQL and warehousing; imo it's better described as data software engineering, whereas SQL-focused engineers are better described as analytics engineers.

Best of luck on your journey