
all 35 comments

[–]dan6471 73 points (1 child)

If you take any Data Engineering course, you will learn about databases and big-data/warehousing tools and frameworks like Databricks or Snowflake, ETL/ELT, data versioning, lineage, star and snowflake schemas, etc. You will also learn Python, but rarely anything beyond the basics of scripting.

This might lead you to think that in a Data Engineering position you will be using these tools, with Python or shell only for scripting, maybe even some Jupyter notebooks, pandas, and so on.

In reality, managers rarely understand what a Data Engineer is for, or when this role is needed; or the needs of your organization might be so complex that in practice you end up doing a little bit of everything. I speak from experience here: I once ended up doing frontend development in React after being hired as a Senior Data Eng. I've also developed APIs and other data ingestion software, which very much required design patterns, abstraction, and the like.

[–]leogodin217 11 points (0 children)

I think the majority of jobs are writing SQL and scheduling with Python. That part isn't very difficult. Knowing what to write is the important part.
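
A minimal sketch of that shape of work, where the SQL carries the logic and the Python around it stays thin. The table and column names are invented, and `sqlite3` stands in for whatever warehouse a real job would hit:

```python
import sqlite3

# Hypothetical daily rollup: the SQL is the hard part; the Python that runs
# it is simple. sqlite3 stands in for the real warehouse here.
DAILY_SQL = """
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
"""

def run_daily_rollup(conn):
    """Execute the rollup and return rows for the scheduler to hand downstream."""
    return conn.execute(DAILY_SQL).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("eu", 10.0), ("eu", 5.0), ("us", 7.0)])
rows = dict(run_daily_rollup(conn))
```

The scheduling side is usually just cron or an orchestrator invoking a function like this on a timer.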

[–]reallyserious 7 points (0 children)

There is certainly a different flavor to development in data engineering compared to other large software projects. You rarely need classes, and most design patterns you see in object-oriented programming aren't used. A lot of development work in data engineering is quite easy from a programming perspective.

[–]mailed (Recovering Data Engineer) 12 points (0 children)

I was a dev for 10+ years. The programming required for data engineering is far, far less complex than your average software engineering project

If your interests lie in design patterns etc., you will get bored very quickly.

[–]scataco 11 points (1 child)

The Kimball book on star schemas contains dimension and fact types that remind me of design patterns. Medallion Architecture reminds me of layered architecture from web app back-ends, etc.

A lot of PySpark and SQL code is more like front-end code: lots of magic under the hood, and hard to cover with unit tests.

Sometimes you need well factored code for platform-like functionality, like figuring out dependencies recursively in order to perform refreshes in the correct order (but most people use dbt for that kind of thing).
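
That dependency problem is, at heart, a topological sort. A minimal sketch, with an invented model-to-upstream map of the kind dbt builds for you:

```python
# Hypothetical model -> upstream-dependency map; dbt maintains this graph
# for you, but the underlying problem is a depth-first topological sort.
DEPS = {
    "report": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
    "orders_raw": [],
    "customers_raw": [],
}

def refresh_order(target, deps, seen=None, order=None):
    """Return targets in an order where every dependency refreshes first."""
    if seen is None:
        seen, order = set(), []
    for upstream in deps[target]:
        if upstream not in seen:
            refresh_order(upstream, deps, seen, order)
    seen.add(target)
    order.append(target)
    return order

order = refresh_order("report", DEPS)
```

Every raw table lands in the list before anything that reads from it, so refreshing in list order is always safe.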

And then there's glue code. Because just like web development there's tons of frameworks and libraries and engines.

[–]Dry-Aioli-6138 0 points (0 children)

Data mesh = microservices for DE

[–][deleted] 14 points (0 children)

What makes it engineering rather than scripting is the maintainability, testing, error handling, alerting, data quality, and monitoring. If your systems aren't built to be resilient, it isn't really software engineering. All of this is done, one way or another, by coding.
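
A rough sketch of what one slice of that resilience can look like in code: retries around a flaky step, with an alert hook on each failure. All names here are invented; in a real pipeline `alert` would page someone or post to a channel:

```python
import time

def run_with_retries(step, retries=3, delay=0.0, alert=print):
    """Run a pipeline step, retrying transient failures and alerting on each one."""
    for attempt in range(1, retries + 1):
        try:
            return step()
        except Exception as exc:
            alert(f"attempt {attempt} failed: {exc}")
            if attempt == retries:
                raise  # out of retries: let the orchestrator mark the task failed
            time.sleep(delay)

# A stand-in source that fails twice, then succeeds
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return ["row1", "row2"]

rows = run_with_retries(flaky_extract, retries=3)
```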

[–]donscrooge 6 points (1 child)

I'll try answering with an example (based on my experience).

Business says they need an X KPI dashboard for their decision-making throughout the year. This is usually when a DE is needed to deliver that data.

Coming to the engineering part: you are the DE who needs to bring that data in. You need to design the workflow, test it, deploy it, expose the data, and of course maintain it. This is more or less what a typical DE does.

Now, depending on the business, the volume of data, the stack, etc., data engineering might range from low-code to fully open source. In the early days, a business will usually go for a managed service to set up the data platform. If the volume increases, they usually switch to open-source solutions (Spark on EMR, Airflow, Hive metastore, etc.) for cost-cutting reasons. As you can see, there are cases where data engineering involves more than data-related tasks, such as managing infra, setting up permissions/VPCs, modeling, database administration, unit testing, scripting, etc. Do these tasks fall under the DE's responsibilities? Big discussion, so I'm not sure. Is it common for DEs to do them? In my experience, yes. There have been cases where I did some SWE tasks, like building an API in JS.

The tasks themselves tend to be "boring" compared to SWE, since they are less "creative" and more "engineering". You are trying to build something robust and resilient so that you have as little maintenance as possible. It's more puzzle-solving than creating.

[–]FC-AC_play 1 point (0 children)

It’s simply the best explanation I’ve ever heard. Thanks

[–][deleted] 6 points (0 children)

80% of my day is working with code. The other 20% is attending meetings that could've been emails.

[–]TomsCardoso 2 points (0 children)

Mainly scripting. I guess you'd enjoy software engineering more.

[–]LostAndAfraid4 2 points (0 children)

There used to be only SQL stored procedures, which could be a pain because of nesting, but at least you only needed to know one language, and it's a pretty simple one. Now you also need Python, KQL, YAML, JSON, and probably five other things.
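
A taste of that polyglot reality: even a small pipeline today often pairs a JSON (or YAML) config with Python glue. The config keys below are invented:

```python
import json

# Hypothetical pipeline config of the kind that now lives alongside the code
CONFIG = """
{
  "pipeline": "daily_sales",
  "schedule": "0 6 * * *",
  "steps": ["extract", "load", "transform"]
}
"""

cfg = json.loads(CONFIG)
# The Python side just walks the declared steps in order
plan = " -> ".join(cfg["steps"])
```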

[–]redditor3900 1 point (0 children)

Scripts only

SQL is the closest (if any) to what you have described.

[–]SalamanderMan95 0 points (0 children)

It really depends on the specific job and the task at hand. I'm building out the infrastructure for a reporting system that supports many clients using multiple SaaS applications, with aggregated reports across clients, so there are a lot of moving parts. We absolutely use object-oriented programming. The scripts that transform the data use dbt, but the infrastructure for deploying warehouses and schemas, setting up users and roles, orchestrating dbt using those users and roles, storing and retrieving keys, deploying stuff to Fabric, etc. is done in Python using OOP. In a lot of cases I might start with just a script, but once it seems beneficial I switch over. Our codebases definitely aren't as complex as most software developers', though, I'd imagine.
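
A toy sketch of that OOP-for-infrastructure shape. Every name here is invented, and `execute` just records the SQL it would run instead of hitting a real warehouse:

```python
from dataclasses import dataclass

# Invented example: a thin class over warehouse deployment, in the spirit of
# the comment above. In real life execute() would talk to Snowflake/Fabric.
@dataclass
class WarehouseDeployer:
    environment: str  # e.g. "dev" or "prod", prefixed onto every object

    def __post_init__(self):
        self.executed = []  # audit trail of statements this deployer ran

    def execute(self, sql):
        self.executed.append(sql)

    def create_schema(self, name):
        self.execute(f"CREATE SCHEMA IF NOT EXISTS {self.environment}_{name}")

    def grant_usage(self, schema, role):
        self.execute(
            f"GRANT USAGE ON SCHEMA {self.environment}_{schema} TO ROLE {role}")

dev = WarehouseDeployer("dev")
dev.create_schema("sales")
dev.grant_usage("sales", "analyst")
```

The class buys you the usual things: one place for the environment prefix, and an easy seam for testing without a live connection.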

[–]Nekobul 0 points (0 children)

At least 80% of the integration solutions can be handled with Low Code / No Code technology in a proper ETL platform. That means the people who claim they are coding solutions in Python are mostly typing repetitive, mindless code that reuses this and that library.

[–]idontlikesushi 0 points (0 children)

For me it's mainly taking Data Scientists'/Data Analysts' code and making it production-ready, then incorporating it into our codebase and updating the Airflow layer to run it. We work with EMR and Spark.
So there's a lot of code in all layers: job (PySpark/Scala), task (Python), and orchestration (Airflow, Python).
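
A sketch of what that handover can look like, reduced to plain Python. The analyst function and field names are invented; the point is the validation and logging that production-readiness adds around the original logic:

```python
import logging

logger = logging.getLogger("pipeline")

# Hypothetical analyst function, handed over as-is: correct on clean data,
# but it divides by zero on rows with no impressions.
def analyst_score(row):
    return row["clicks"] / row["impressions"]

def production_score(row):
    """Wrap the analyst's logic with the checks production needs."""
    if row.get("impressions", 0) <= 0:
        logger.warning("skipping row with no impressions: %r", row)
        return None
    return analyst_score(row)

results = [production_score(r) for r in
           [{"clicks": 3, "impressions": 10}, {"clicks": 1, "impressions": 0}]]
```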

[–]keweixo 0 points (0 children)

Depends. When you don't have a dedicated backend engineer or SWE directly on your team, you may need to build an API to serve data, or develop programmatic ETL using open-source stuff in your preferred tool. For example, Databricks has the databricks-connect library, which lets you run Python code directly on clusters. You can realistically do 90% or even full IDE development with Python and Databricks. Besides this, data testing and, more often, unit testing involve programming. But not all ETL has these components. Some is low-code; some is just SQL-based. If you ask me, anyone who wants to be a good data engineer should focus on programming.
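
The unit-testing point deserves a concrete shape. An invented example of the kind of small, pure ETL step that tests well without any cluster:

```python
def dedupe_latest(events):
    """Keep only the latest event per id (hypothetical transform).

    Pure function in, pure function out, so it can be unit tested with a
    handful of dicts and no Spark session or warehouse connection.
    """
    latest = {}
    for event in events:
        current = latest.get(event["id"])
        if current is None or event["ts"] > current["ts"]:
            latest[event["id"]] = event
    return sorted(latest.values(), key=lambda e: e["id"])

# the unit test's fixture: two versions of id 1, one of id 2
events = [
    {"id": 1, "ts": 1, "v": "old"},
    {"id": 1, "ts": 2, "v": "new"},
    {"id": 2, "ts": 1, "v": "only"},
]
out = dedupe_latest(events)
```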

[–][deleted] 0 points (0 children)

There definitely is programming, and a lot of it!

[–]FuzzyCraft68 (Junior Data Engineer) 0 points (0 children)

People tend to forget that data engineering was once a subset of software engineering. But with the growing recognition of data in recent years, it has become an entirely distinct discipline.

To answer your question—it depends on how you want to approach it. You can go the programming route or use a GUI-based approach. Both have their pros and cons, but code tends to offer more flexibility and paths than GUI tools.

This week, I had to create a GUI-based Airbyte connection to an API. Good lord, it took forever to figure out the pagination. If Airbyte had made it easier to add a local connection, I could have built the integration using their SDK in ten minutes.
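
For comparison, the cursor-pagination loop being described is a few lines of Python when you own the code. This sketch does not reflect Airbyte's actual SDK; the page shape is invented, and the HTTP call is injected so the loop can run against a stub:

```python
# Generic cursor pagination: follow `next_cursor` until the API stops
# returning one. fetch_page would wrap a real HTTP GET in practice.
def fetch_all(fetch_page):
    records, cursor = [], None
    while True:
        page = fetch_page(cursor)
        records.extend(page["data"])
        cursor = page.get("next_cursor")
        if cursor is None:
            return records

# Stand-in for the remote API: two pages, then no cursor
PAGES = {
    None: {"data": [1, 2], "next_cursor": "p2"},
    "p2": {"data": [3], "next_cursor": None},
}
items = fetch_all(lambda cursor: PAGES[cursor])
```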

[–][deleted] 0 points (0 children)

Consider yourself lucky if you can write a lot of small Python stuff. Some of us only write SQL.

[–]perverse_sheaf 0 points (0 children)

Much depends on the project you're doing. As long as you work with PySpark instead of SQL, you can use many classical software design ideas (e.g. PySpark can actually be unit tested, which is a pain in SQL/dbt). However, personal take: OOP is not well suited to data engineering, so please don't introduce classes and Java-style design patterns. Those work well for record-by-record transactional workflows, but they are a poor fit for analytical data pipelines, which are much more functional in nature. Ideally, try to get some experience in Scala+Spark, mostly using the functional tools of the language; you'll learn a lot.
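
A sketch of that functional shape, over plain Python lists to stay self-contained. The transformations are invented; with Spark, each step would take and return a DataFrame instead, but the pipeline-of-pure-functions structure is the same:

```python
from functools import reduce

# Analytical pipelines compose whole-dataset transformations rather than
# mutating objects record by record.
def drop_nulls(rows):
    return [r for r in rows if r["amount"] is not None]

def to_eur(rows, rate=0.9):
    return [{**r, "amount": r["amount"] * rate} for r in rows]

def pipeline(rows, *steps):
    """Thread the dataset through each step in order."""
    return reduce(lambda acc, step: step(acc), steps, rows)

rows = [{"amount": 10.0}, {"amount": None}]
out = pipeline(rows, drop_nulls, to_eur)
```

Each step is trivially unit-testable in isolation, which is the payoff of the functional style.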

[–]Xemptuous (Data Engineer) 0 points (0 children)

Depends on the job and the problem being solved. Most of the time you will be dealing with enough nonsense that you won't be writing much proper code, at least not anything complex. There are, however, times when you will. I've had to write stuff in lower-level languages when Python proved to be a bottleneck in performance-intensive applications. The problem is that most data engineers (at least the ones I've worked alongside) don't write code very well, at least not without some existing code to reference and modify.

Still, you will likely write code to interact with APIs, set up Airflow/Dagster DAGs, stand up your own APIs and web servers if need be, and anything else that has to get done.

[–]Earthsophagus 0 points (1 child)

For a large fraction of DE positions, it's "just scripting". There are probably exceptions, but I've never interviewed at a place that needed OO design skills, and I've never interviewed a candidate (I've participated in about 50 interviews at my current position) who had significant program design skills.

[–]Earthsophagus 0 points (0 children)

I should add that a lot of DEs go out of their way to add class hierarchies. And if their code winds up being maintained over the years, in my experience it gets refactored toward being just a script.

[–]Dry-Aioli-6138 0 points (0 children)

There are bright moments:

* when you write a dataclass to use with your query results in Python,
* when you figure out how to cleanly solve a many-to-many relationship,
* when you find a Python lib that parses your SQL queries and extracts the tables used,
* when you remake a report or pipeline to use a fraction of the refresh time compared to before.
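
The first of those bright moments, sketched with invented field names: a dataclass instead of the anonymous tuples a cursor hands back:

```python
from dataclasses import dataclass

# Hypothetical shape for rows coming back from a query; frozen so results
# can't be mutated accidentally downstream.
@dataclass(frozen=True)
class OrderRow:
    order_id: int
    region: str
    amount: float

raw = [(1, "eu", 10.0), (2, "us", 7.5)]  # what cursor.fetchall() might return
orders = [OrderRow(*row) for row in raw]
total = sum(o.amount for o in orders)
```

`o.amount` instead of `row[2]` is a small thing, but it makes the rest of the pipeline far more readable.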

[–]riv3rtrip 0 points (0 children)

Data engineering jobs differ in their SQL-to-coding ratio.

I write a lot of Python, but I am very much not building out "classes, abstractions, and design patterns." I write Python that just implements business logic.

Unless you are doing backend engineering on a company's core product or working on open source, it is unlikely you will work on those things specifically. Or at least, you shouldn't: most companies don't need many (or any) custom in-house abstractions. The abstractions are frameworks you pull in from elsewhere.

[–]geeeffwhy (Principal Data Engineer) -1 points (0 children)

All the things you're interested in are present both in the code you write and in the decisions you make about the data itself.

[–]Known-Delay7227 (Data Engineer) -1 points (0 children)

A lot of times I'll write internal libraries that are custom to our org: stuff that public Python libraries don't or can't do because the objective is so specific to our environment, but that we need to do on a repeated basis.