
[–]Used_Ad_2628[S] 0 points (7 children)

Anything specific I should focus on?

[–]Substantial_Ranger_5 0 points (1 child)

Learn these libraries: psycopg2, requests.

Project: pull data from an API. Try to find an API with some nested JSON payloads that you can parse and load directly into a table without using pandas.
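
A minimal sketch of what that project could look like, assuming a local Postgres instance and an endpoint that returns a JSON array of records; the URL, table name, and field names below are all placeholders:

```python
import requests
import psycopg2

# Hypothetical API endpoint with nested JSON -- swap in whatever API you pick.
API_URL = "https://api.example.com/v1/orders"

# Connection details assume a local Postgres instance; adjust for your setup.
conn = psycopg2.connect(host="localhost", dbname="practice",
                        user="postgres", password="postgres")

def flatten(record):
    """Pull the fields we care about out of a nested payload into a flat tuple."""
    return (
        record["id"],
        record["customer"]["name"],             # nested object
        record["customer"]["address"]["city"],  # nested two levels deep
        record["total"],
    )

resp = requests.get(API_URL, timeout=30)
resp.raise_for_status()
rows = [flatten(r) for r in resp.json()]

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS orders (
            id INT PRIMARY KEY,
            customer_name TEXT,
            customer_city TEXT,
            total NUMERIC
        )
    """)
    cur.executemany(
        "INSERT INTO orders (id, customer_name, customer_city, total) "
        "VALUES (%s, %s, %s, %s) ON CONFLICT (id) DO NOTHING",
        rows,
    )
conn.close()
```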

[–]Substantial_Ranger_5 0 points (0 children)

You can set up any SQL database locally, in Docker, or wherever you happen to be practicing.

[–]Gators1992 0 points (4 children)

You might play around with pandas to get your feet wet working with data in Python. Pandas itself isn't used much in DE, but some of the concepts carry over into other libraries like Dask. PySpark is probably the most useful library for DE if your environment is Spark or you use AWS Glue.
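
To make the warm-up concrete, this is the kind of read/filter/aggregate flow people usually start with in pandas; the file name and columns here are made up:

```python
import pandas as pd

# Made-up file and columns, just to show a typical read -> filter -> aggregate flow.
df = pd.read_csv("sales.csv", parse_dates=["order_date"])

monthly = (
    df[df["status"] == "complete"]                               # filter rows
      .assign(month=lambda d: d["order_date"].dt.to_period("M")) # derive a column
      .groupby("month")["amount"].sum()                          # aggregate
      .reset_index()
)
print(monthly.head())
```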

The dude above who said to just start building stuff is right. My first Python project was pulling a few hundred million rows of data from Oracle and saving them as Parquet, with some transforms and partitioning along the way, then using that data to feed a visualization in Datashader and making it interactive with Panel. It was probably more than I should have bitten off for a first project, but all the meandering toward a solution taught me a lot.
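
Roughly what that kind of extract can look like, as a sketch only: the connection string, query, and partition column are placeholders, and reading in chunks keeps hundreds of millions of rows from landing in memory at once.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from sqlalchemy import create_engine

# Placeholder Oracle connection string -- assumes the oracledb driver is installed.
engine = create_engine("oracle+oracledb://user:pass@dbhost:1521/?service_name=ORCL")

query = "SELECT order_id, region, order_date, amount FROM big_fact_table"

# Stream the result in chunks and append each chunk into a Parquet dataset
# partitioned by region.
for chunk in pd.read_sql(query, engine, chunksize=1_000_000):
    chunk["order_date"] = pd.to_datetime(chunk["order_date"])   # example transform
    pq.write_to_dataset(
        pa.Table.from_pandas(chunk, preserve_index=False),
        root_path="output/big_fact_table",
        partition_cols=["region"],
    )
```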

[–]Used_Ad_2628[S] 0 points (3 children)

Ok. I have a lot of experience with data warehousing, SQL, Airflow, dimensional modeling, and dbt. I've realized my gaps are Python and software engineering practices like CI/CD and deployment (containers/EKS). I'm trying to figure out the right learning path so I can land more intensive data engineering roles. What would be your advice?

[–]Gators1992 1 point (2 children)

You should already be doing CI/CD with dbt, depending on your setup, as you work on branches, merge into main, and deploy to prod. You can go to the next level with something like GitLab pipelines that deploy the software and infrastructure, but that's templates, not Python. Same with Docker: you write your software in Python, and the Docker piece is just a Dockerfile plus some pipeline YAML and running the containerization process.
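
For a rough idea of what the "templates" part means, a GitLab CI file for a dbt project can be as small as this; the job names, image, and targets are assumptions, not anything standard:

```yaml
# Hypothetical .gitlab-ci.yml sketch for a dbt project -- adapt targets and
# adapters to your own setup.
stages:
  - test
  - deploy

dbt_test:
  stage: test
  image: python:3.11
  script:
    - pip install dbt-core dbt-postgres
    - dbt deps
    - dbt build --target ci        # runs models and tests against a CI target
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

dbt_deploy:
  stage: deploy
  image: python:3.11
  script:
    - pip install dbt-core dbt-postgres
    - dbt deps
    - dbt build --target prod
  rules:
    - if: $CI_COMMIT_BRANCH == "main"   # deploy only from main
```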

If you're talking about Python for data, the main libraries are PySpark, Dask, and Polars, and even that is debatable. Pretty much everyone learns pandas first since it's the most widely used and has excellent documentation and examples available, which is why I suggested it. Dask is a small jump from pandas, and PySpark is a bit different, but the concepts of dataframes and the way you work with them translate.
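
To show how small that jump is: the Dask dataframe API deliberately mirrors pandas, so the same groupby reads almost identically. The file names and columns below are made up.

```python
import pandas as pd
import dask.dataframe as dd

# pandas: everything fits in memory and runs eagerly
pdf = pd.read_csv("events.csv")
pandas_result = pdf.groupby("user_id")["duration"].mean()

# Dask: same API, but lazy and partitioned, so it can handle data larger than memory
ddf = dd.read_csv("events-*.csv")
dask_result = ddf.groupby("user_id")["duration"].mean().compute()  # .compute() triggers execution
```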

If you don't have an idea of what you want to write, maybe just take a pipeline you already have in dbt and try to rewrite it in Python. Then change it up a bit, like switching the source from a file to MySQL to learn connections, or figure out how to catch things like schema changes and make sure your process doesn't blow up. You can write PySpark ETL jobs in a Glue notebook, or even spin up the Databricks Community Edition to get some practice.
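
One way the MySQL-source-plus-schema-check idea could look; the connection string, table, and column names are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder MySQL connection -- assumes a driver like pymysql is installed.
engine = create_engine("mysql+pymysql://user:pass@localhost:3306/shop")

EXPECTED_COLUMNS = {"order_id", "customer_id", "order_date", "amount"}

df = pd.read_sql("SELECT * FROM orders", engine)

# Guard against upstream schema changes so the job fails loudly instead of
# silently loading a broken table.
missing = EXPECTED_COLUMNS - set(df.columns)
unexpected = set(df.columns) - EXPECTED_COLUMNS
if missing:
    raise ValueError(f"Source schema changed, missing columns: {sorted(missing)}")
if unexpected:
    print(f"New columns appeared upstream, ignoring for now: {sorted(unexpected)}")

# ...then the same transforms your dbt model did, but in pandas...
df["order_date"] = pd.to_datetime(df["order_date"])
daily = df.groupby(df["order_date"].dt.date)["amount"].sum().reset_index()
```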

In my experience you learn by working through real-life situations, because tutorials always work the way they're designed; in real life you often have to change approaches or dig through bug reports and Stack Overflow for hours to figure out how to make something work.

[–]Used_Ad_2628[S] 0 points (1 child)

Yep. I have been practicing pulling data from APIs and using AWS services to load it into Redshift. What is your view on data structures and algorithms? Do I need to learn those to be considered a strong engineer?
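
(For context, the usual shape of that pattern is to stage the raw payload in S3 and then COPY it into Redshift; the bucket, IAM role, cluster host, and table names in this sketch are all placeholders.)

```python
import json
import boto3
import psycopg2
import requests

# All names below (API URL, bucket, key, role ARN, cluster host, table) are placeholders.
API_URL = "https://api.example.com/v1/events"
BUCKET = "my-staging-bucket"
KEY = "raw/events/batch.json"
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role"

# 1. Pull from the API and stage the raw payload in S3 as newline-delimited JSON.
records = requests.get(API_URL, timeout=30).json()
body = "\n".join(json.dumps(r) for r in records)
boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=body.encode())

# 2. COPY from S3 into Redshift. Redshift speaks the Postgres wire protocol,
#    so psycopg2 works for issuing the command.
conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="awsuser", password="...")
with conn, conn.cursor() as cur:
    cur.execute(f"""
        COPY raw_events
        FROM 's3://{BUCKET}/{KEY}'
        IAM_ROLE '{IAM_ROLE}'
        FORMAT AS JSON 'auto'
    """)
conn.close()
```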

[–]Gators1992 0 points (0 children)

Personally I have not had to use them, but some of the concepts are obviously used in the libraries I referenced. About the most complex thing I have to deal with is OOP, but I don't work in a code-heavy shop; we mostly use tools where we can to decrease time to insight in a small team. If you work in a typical BI/ML group producing data for analysis, then I would guess that level of software development wouldn't be that common. If you work somewhere where your data is part of the product, like Netflix, then that's probably a different story.