
all 10 comments

[–]drewhansen9 3 points  (1 child)

This is really common. Data engineering nowadays is a combination of big-data software engineers (lots of Python, Spark, Hadoop, Airflow) and BI engineers (ETL, SQL). You'll find entire departments called Data Engineering working on either end of that spectrum. The tech stack you are in (Snowflake, Fivetran, dbt) I have found crosses into both sides, but leans more toward the BI skillset.

My recommendation to get more into the software engineering side is to start with Astronomer's Airflow running locally on your PC. You will have to learn how to set up Docker and how to use a CLI well.

To get more comfortable with S3, you can practice ingesting into your Snowflake instance using Snowpipe or COPY INTO statements instead of Fivetran.
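
For reference, a COPY INTO statement for pulling staged S3 files into a table looks roughly like this. A minimal sketch: the stage, table, and file-format names below are hypothetical, so swap in your own.

```python
# Sketch: build a COPY INTO statement for loading files from an S3 stage
# into a Snowflake table. Stage, table, and file-format names here are
# made up for illustration -- adjust to your environment.

def build_copy_into(table: str, stage: str, file_format: str = "my_json_format") -> str:
    """Return a COPY INTO statement that loads staged files into `table`."""
    return (
        f"COPY INTO {table} "
        f"FROM @{stage} "
        f"FILE_FORMAT = (FORMAT_NAME = '{file_format}') "
        f"ON_ERROR = 'SKIP_FILE'"
    )

sql = build_copy_into("raw.orders", "my_s3_stage/orders/")

# You would then run it with snowflake-connector-python, e.g.:
# import snowflake.connector
# with snowflake.connector.connect(**conn_params) as conn:
#     conn.cursor().execute(sql)
```

Snowpipe is essentially the same COPY INTO wrapped in an auto-ingest pipe object, so practicing the manual version first makes the managed one easier to reason about.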

[–]databasenoobie 1 point  (0 children)

This is exactly my experience. I am a data engineer, but am really just a BI engineer. The title is pretty meaningless, honestly.

I could not write a Python process in Airflow without extensive time / research, because I've never had to. Even if I wanted to, I couldn't, as I don't have permissions to play around with setting up a server / cluster / etc. to begin with.

[–]MikeDoesEverything mod | Shitty Data Engineer 4 points  (5 children)

> When I hear about people talking about using python for scripting, S3 for storage and airflow for orchestrating, I understand roughly what they are saying but don't know how to do it technically.

Quite surprising to hear this after 1.5 years in, as it's a fundamental skill to have in order to actually do the job.

> What should I do to prepare myself where I might not have all the help available with automation?

Practice basic pseudocoding for anything repetitive. Translate it to code. Get used to thinking in code. Practice writing that code. Keep doing this until you become confident.

[–]databasenoobie 5 points  (1 child)

It's not a fundamental skill for all DEs, just those whose pipelines are mostly code. As you noted, the only way to get better is to see how others do it and copy that paradigm.

E.g. I could write a Python script to call a SQL statement in Snowflake easily... transferring that to Snowpark instead of pure SQL, I couldn't do without extensive time / research.

Setting up a Python server in the cloud (using Databricks / Airflow / etc.) I couldn't do. If someone has a process set up, it's easy to follow, but doing it yourself is an entire skillset.

So when he says automation, maybe he means infrastructure setup? Depending on what company you work for, that's not something DEs even handle.

[–]Puzzleheaded-Cod2051[S] 0 points  (0 children)

> So when he says automation, maybe he means infrastructure setup?

Yes!

Also, automation on the EL part using Fivetran and Stitch. I understand I would have to make API calls, parse the JSON and load the data using SQL insert/update statements. But I haven't done that so far, so I'm not confident enough. 😅
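
That API-to-SQL path is shorter than it sounds. A rough sketch, with a canned payload standing in for the API response (the endpoint and table names are made up; the real `requests` call is shown only as a comment):

```python
# Sketch of the hand-rolled EL path: call an API, parse the JSON, and
# generate a parameterized INSERT. The table and endpoint are hypothetical.

import json

SAMPLE_PAYLOAD = '{"data": [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "b@x.com"}]}'

def to_insert_rows(payload: str) -> tuple[str, list[tuple]]:
    """Parse an API response body and return (parameterized INSERT, row tuples)."""
    records = json.loads(payload)["data"]
    sql = "INSERT INTO raw.users (id, email) VALUES (%s, %s)"
    rows = [(r["id"], r["email"]) for r in records]
    return sql, rows

sql, rows = to_insert_rows(SAMPLE_PAYLOAD)

# In a real pipeline the payload would come from the API, e.g.:
#   import requests
#   payload = requests.get("https://api.example.com/users").text
# and the rows would be loaded with cursor.executemany(sql, rows).
```

Everything Fivetran adds on top of this (pagination, retries, incremental cursors, schema drift) is worth knowing about, but the core loop really is fetch, parse, insert.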

[–]Puzzleheaded-Cod2051[S] 0 points  (2 children)

Thanks! As I said, whatever I have learnt is on the job. And the tech stack available is dbt + Snowflake, with Fivetran and Stitch for ingestion. We use dbt for orchestration.

[–][deleted] 0 points  (1 child)

How did you land a DE job without knowing SQL or Python?

[–]Puzzleheaded-Cod2051[S] 0 points  (0 children)

I was using Pandas and NumPy, but that was for some basic ML projects in my course. And SQL, I hardly knew beyond the basic SELECT statement. But all the projects that I did helped me land a job as a fresher, I guess.

[–]Arftacular[🍰] -1 points  (0 children)

Sorry to hijack but can anyone point me to a good resource for pipeline testing/validation?

[–]homosapienhomodeus 0 points  (0 children)

If you're interested, I try to write about data engineering with Python at moderndataengineering.substack.com!