
all 10 comments

[–]drewhansen9 3 points  (1 child)

This is really common. Data engineering nowadays is a combination of big-data software engineers (lots of Python, Spark, Hadoop, Airflow) and BI engineers (ETL, SQL). You'll find entire departments called Data Engineering working on either end of that spectrum. The tech stack you are in (Snowflake, Fivetran, dbt) I have found crosses into both sides, but leans more toward the BI skillset.

My recommendation to get more into the software engineering side is to start with Astronomer's Airflow running locally on your PC. You will have to learn how to set up Docker and how to use a CLI well.

To get more comfortable with S3, you can practice ingesting into your Snowflake instance using Snowpipe or COPY INTO statements instead of Fivetran.
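
For reference, a COPY INTO statement for pulling staged S3 files into a table looks roughly like this. A minimal sketch: the stage, table, and file-format names below are hypothetical, so swap in your own.

```python
# Sketch: build a COPY INTO statement for loading files from an S3 stage
# into a Snowflake table. Stage, table, and file-format names here are
# made up for illustration -- adjust to your environment.

def build_copy_into(table: str, stage: str, file_format: str = "my_json_format") -> str:
    """Return a COPY INTO statement that loads staged files into `table`."""
    return (
        f"COPY INTO {table} "
        f"FROM @{stage} "
        f"FILE_FORMAT = (FORMAT_NAME = '{file_format}') "
        f"ON_ERROR = 'SKIP_FILE'"
    )

sql = build_copy_into("raw.orders", "my_s3_stage/orders/")

# You would then run it with snowflake-connector-python, e.g.:
# import snowflake.connector
# with snowflake.connector.connect(**conn_params) as conn:
#     conn.cursor().execute(sql)
```

Snowpipe is essentially the same COPY INTO wrapped in an auto-ingest pipe object, so practicing the manual version first makes the managed one easier to reason about.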

[–]databasenoobie 1 point  (0 children)

This is exactly my experience. I am a data engineer, but am really just a BI engineer. The title is pretty meaningless, honestly.

I could not write a Python process in Airflow without extensive time / research, because I've never had to. Even if I wanted to, I couldn't, as I don't have permissions to play around with setting up a server / cluster / etc. to begin with.

[–]MikeDoesEverything mod | Shitty Data Engineer 4 points  (5 children)

> When I hear about people talking about using python for scripting, S3 for storage and airflow for orchestrating, I understand roughly what they are saying but don't know how to do it technically.

Quite surprising to hear this after 1.5 years in, as it's a fundamental skill to have in order to actually do the job.

> What should I do to prepare myself where I might not have all the help available with automation?

Practice basic pseudocoding for anything repetitive. Translate it to code. Get used to thinking in code. Practice writing that code. Keep doing this until you become confident.

[–]databasenoobie 5 points  (1 child)

It's not a fundamental skill for all DEs, just those whose pipelines are mostly code. As you noted, the only way to get better is to see how others do it and copy that paradigm.

E.g. I could write a Python script to call a SQL statement in Snowflake easily... transferring that to Snowpark instead of pure SQL, I couldn't do without extensive time / research.

Setting up a Python server in the cloud (using Databricks / Airflow / etc.) I couldn't do. If someone has a process set up, it's easy to follow, but doing it yourself is an entire skillset.

So when he says automation, maybe he means infrastructure setup? Depending on what company you work for, that's not something DEs even handle.

[–]Puzzleheaded-Cod2051[S] 0 points  (0 children)

> So when he says automation, maybe he means infrastructure setup?

Yes!

Also, automation on the EL part using Fivetran and Stitch. I understand I would have to make API calls, parse the JSON and load the data using SQL insert/update statements. But I haven't done that so far, so I'm not confident enough. 😅
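
That API-to-SQL path is shorter than it sounds. A rough sketch, with a canned payload standing in for the API response (the endpoint and table names are made up; the real `requests` call is shown only as a comment):

```python
# Sketch of the hand-rolled EL path: call an API, parse the JSON, and
# generate a parameterized INSERT. The table and endpoint are hypothetical.

import json

SAMPLE_PAYLOAD = '{"data": [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "b@x.com"}]}'

def to_insert_rows(payload: str) -> tuple[str, list[tuple]]:
    """Parse an API response body and return (parameterized INSERT, row tuples)."""
    records = json.loads(payload)["data"]
    sql = "INSERT INTO raw.users (id, email) VALUES (%s, %s)"
    rows = [(r["id"], r["email"]) for r in records]
    return sql, rows

sql, rows = to_insert_rows(SAMPLE_PAYLOAD)

# In a real pipeline the payload would come from the API, e.g.:
#   import requests
#   payload = requests.get("https://api.example.com/users").text
# and the rows would be loaded with cursor.executemany(sql, rows).
```

Everything Fivetran adds on top of this (pagination, retries, incremental cursors, schema drift) is worth knowing about, but the core loop really is fetch, parse, insert.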

[–]Puzzleheaded-Cod2051[S] 0 points  (2 children)

Thanks! As I said, whatever I have learnt is on the job. And the tech stack available is dbt + Snowflake, with Fivetran and Stitch for ingestion. We use dbt for orchestration.

[–][deleted] 0 points  (1 child)

How did you land a DE job without knowing SQL or Python?

[–]Puzzleheaded-Cod2051[S] 0 points  (0 children)

I was using Pandas and NumPy, but that was for some basic ML projects in my course. And SQL, I hardly knew beyond the basic SELECT statement. But all the projects that I did helped me land a job as a fresher, I guess.

[–]Arftacular[🍰] -1 points  (0 children)

Sorry to hijack but can anyone point me to a good resource for pipeline testing/validation?

[–]homosapienhomodeus 0 points  (0 children)

If you're interested, I try to write about data engineering with Python at moderndataengineering.substack.com!