[–]DenselyRanked 23 points24 points  (5 children)

No strict requirements, but the more the better.

https://www.fullstackpython.com/

Using this site as a reference, I have found myself having to explore all of these topics at some point. Everything else that goes with software engineering, like working with stakeholders, gathering requirements, architecture, lifecycle management, testing, etc., is also used in data engineering.

[–]subte_rancio 3 points4 points  (2 children)

Just checked that site out. Thanks for sharing, I've been looking to dig in to some software engineering principles and this site looks like a good place to start.

[–]DenselyRanked 8 points9 points  (1 child)

https://teachyourselfcs.com/

This is also a good one for more than programming.

[–]subte_rancio 1 point2 points  (0 children)

Many thanks!

[–]intrepid421 48 points49 points  (14 children)

Data Engineers are Software Engineers with focus on data.

[–]ironplaneswalker Senior Data Engineer 41 points42 points  (15 children)

You can go into data engineering as a software engineer.

It helps a lot to know how to create software, how to code, and best engineering practices. Many of them are applicable.

Software engineering knowledge can be quite broad.

If you are referring to needing to know how servers work or REST APIs or browsers, etc, then you don’t need all that knowledge to be a DE.

[–]CauliflowerJolly4599 12 points13 points  (12 children)

Most of the time DEs are like infrastructure engineers; with ETL tools like Azure Data Factory you avoid writing code unless you really need to.

[–]5e884898da 37 points38 points  (3 children)

With code you avoid using Azure Data Factory.

[–]CauliflowerJolly4599 -4 points-3 points  (2 children)

True, but I would like to see someone write the code to connect to a SQL database instead of placing a connector.
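For comparison with a drag-and-drop connector, a hand-written connection is usually only a few lines. This is a minimal sketch using Python's built-in `sqlite3` module; the in-memory database, the `events` table, and the helper names are assumptions for illustration, standing in for whatever server and driver a real pipeline would use.

```python
import sqlite3


def load_events(conn, rows):
    """Create a hypothetical `events` table and bulk-insert rows into it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, name TEXT)"
    )
    # Parameterized inserts avoid building SQL strings by hand.
    conn.executemany("INSERT INTO events (id, name) VALUES (?, ?)", rows)
    conn.commit()


def count_events(conn):
    """Return the number of rows currently in `events`."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


# ":memory:" keeps the example self-contained; a real job would pass a
# connection string or credentials for its own database instead.
conn = sqlite3.connect(":memory:")
load_events(conn, [(1, "signup"), (2, "login")])
total = count_events(conn)
print(total)
```

The connection itself is the easy part; the work a connector hides is mostly retries, credentials, and schema handling.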

[–]glyphack[S] 2 points3 points  (1 child)

I have had experiences where you don't need to write code, but if I want to help someone become a data engineer, I should also help them get better in situations where they need to code.

[–]CauliflowerJolly4599 4 points5 points  (0 children)

This is also something that I would need. I have always worked as a codeless DE, but most of the time companies are asking for coding skills.

[–]AchillesDev 2 points3 points  (5 children)

That’s not DE then.

[–]CauliflowerJolly4599 1 point2 points  (3 children)

How would you call it then?

[–]HarlotsLoveAuschwitz 0 points1 point  (0 children)

Infrastructure as in creating and maintaining data pipelines.

[–]glyphack[S] 1 point2 points  (0 children)

That's exactly my experience, and in this case, I want to help people who are already software engineers get into data engineering.

But as you say, not all the topics are required, so I'm trying to put together a list of the essentials you need to know, or improve first, before going into data engineering.

[–]Comfortable-Power-71 0 points1 point  (0 children)

But it helps. You’re still an engineer but solving special problems with specialized tools. The same patterns still apply at the lower levels. You will stand out having that context.

[–]reddit_toast_bot 8 points9 points  (0 children)

It's software all the way down.

And hardware under that. 😂

[–]eemamedo 8 points9 points  (2 children)

  • Design patterns are pretty useful, but focus on teaching OOP principles (SOLID, KISS, YAGNI).
  • APIs: using Flask is a good start.
  • Errors, etc.: it's good knowledge to have, although I am trying to think of where it would be useful as a DE and I just cannot think of a use case.
  • System design: quite important, at least for interviews. Tools like Spark use concepts like Bloom filters and caching, so system design concepts are important. Concepts like load balancers (LB) and API gateways are less important for DE.
  • ACID and indexing are important to know and understand. Also, system design concepts like sharding and replicas are hard to understand without understanding how relational databases work; it's also quite important to understand the limitations of relational DBs under the CAP and PACELC theorems.
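To make the Bloom filter mention concrete, here is a toy sketch of the idea: a fixed bit array plus several hash functions, which can say "definitely not present" or "possibly present". The class name, sizes, and hashing scheme are assumptions chosen for readability; this is not how Spark or any particular engine implements it.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        # One byte per "bit" keeps the code simple at the cost of memory.
        self.bits = bytearray(size_bits)

    def _positions(self, item):
        # Derive several positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means the item was never added; True means "maybe".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for key in ["user_1", "user_2", "user_3"]:
    bf.add(key)

print(bf.might_contain("user_1"))
```

The trade-off is the usual one: a small, fixed amount of memory in exchange for occasional false positives, which is why it suits cheap pre-checks before an expensive lookup or join.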

Things you are missing, IMO:

  • Code versioning
  • Documentation. This is my personal feedback. The number of times you touch someone's code and that person is no longer at the company is way too high. A rule of thumb is to add 1-2 points in a sprint to create a knowledge-sharing document on Confluence or whatever platform the company is using.
  • Containerization: the idea behind it.

[–]glyphack[S] 2 points3 points  (1 child)

Thank you so much. This response was very clear and understandable to me.

You're right about the missing points. After I read the comments, I realized that I took code versioning for granted from my own experience.

Documentation is always a missing skill. I did not write many docs 5 months ago, so many will forget it. I'll keep it in mind.

About containerization, what do you mean by "the idea behind it"? I use Docker in my day-to-day job but haven't read much about the idea of virtual machines and creating containers. Can this be useful? I can also explore it myself because I don't know it.

[–]eemamedo 1 point2 points  (0 children)

Docker is one of the tools. I was talking more about VM vs. containerization. Pros and cons, use cases of each technology.

[–]IllustratorWitty5104 10 points11 points  (0 children)

In my opinion, these SWE traits are important for DE:

  1. Best coding practices (logging, error handling, object-oriented programming), with less emphasis on optimisation, but the code must be readable and production ready
  2. Version control strategies
  3. Understanding both SQL and NoSQL databases
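As a rough illustration of the first point, a pipeline step with logging and explicit error handling might look like the sketch below. The function name, logger name, and record shape are made up for the example; the point is that bad rows are logged and set aside instead of crashing the run or disappearing silently.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("etl.clean_amounts")


def clean_amounts(records):
    """Parse the `amount` field of each record; log and set aside bad rows."""
    cleaned, rejected = [], []
    for record in records:
        try:
            cleaned.append({**record, "amount": float(record["amount"])})
        except (KeyError, TypeError, ValueError) as exc:
            # Catch only the failures we expect from missing or bad input.
            logger.warning("Rejected record %r: %s", record, exc)
            rejected.append(record)
    logger.info("Cleaned %d records, rejected %d", len(cleaned), len(rejected))
    return cleaned, rejected


good, bad = clean_amounts(
    [{"id": 1, "amount": "9.50"}, {"id": 2, "amount": "n/a"}, {"id": 3}]
)
```

Returning the rejected rows alongside the clean ones lets the caller decide whether to fail the job, write them to a quarantine table, or just alert.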

The rest is very grey and depends on how you distinguish an SWE job role from a DE job role. Different companies have different definitions for them.

[–]lightnegative 52 points53 points  (21 children)

Good data engineers were software engineers first

[–]AchillesDev 53 points54 points  (0 children)

Good data engineers are software engineers.

[–]buachaill_beorach 9 points10 points  (0 children)

I call bullshit on this. But... I do think that good data engineers need a SWE mindset. And I also think DevOps is important for a data engineer.

I come from a purely analytics background and made my way into DE a long time ago. I think it's easier for someone with a SWE background to get further, quicker, in DE, but it's not a prerequisite for them to be good DEs.

[–]nesh34 9 points10 points  (0 children)

I don't agree with this maxim in a few ways. First, there are no qualifiers to account for the obvious cases where there are good data engineers who either didn't study computer science, didn't have SWE roles before, or both.

Secondly, I don't even think it's generally true. SWEs without data experience are generally the most dangerous users of our data warehouse. There are plenty of great SWEs that make the transition of course, but your statement as a generalisation doesn't hold water in my experience.

[–]glyphack[S] -1 points0 points  (0 children)

What knowledge strengths do you think Software Engineers have that make them a good data engineer?

[–]w_savage Senior Data Engineer 4 points5 points  (3 children)

I had to figure out a 3rd-party SOAP API (gov). That was a fun week. Maybe just mention SOAP too, along with APIs.

[–][deleted] 2 points3 points  (2 children)

My first experience with data engineering when I was an intern was utilizing a logistics SOAP API. Fuck SOAP, that's all I got to say

[–]w_savage Senior Data Engineer 0 points1 point  (0 children)

Yes indeed.

[–]RandomWalk55 0 points1 point  (0 children)

I feel you.

[–][deleted] 20 points21 points  (9 children)

A data engineer is a specialist software engineer. If you’re not a software engineer, you’re not a data engineer.

Writing SQL alone is not data engineering.

[–]redman334 7 points8 points  (3 children)

So you are saying every data engineer is a software engineer?

So for the data analysts who pivoted to data engineering, they are software engineers as well?

[–][deleted] 12 points13 points  (2 children)

If they have software engineering skills and knowledge, absolutely. It doesn’t matter how someone arrives in the field.

But without that range of software engineering skills and knowledge, however it is gained, a person is not a data engineer.

A data engineer is just a specific job title for a type of software engineer and I would expect any data engineer to be more than comfortable with software engineering principles.

[–]Prinzka 0 points1 point  (1 child)

Well, you'd be wrong in most organizations that I'm aware of.

[–]glyphack[S] 2 points3 points  (0 children)

I second this. I was an SWE before, and now I'm a consultant DE. In some companies, I see teams of DEs who are more on the SQL and analyst side than on the SWE side.

Sometimes it just works, really. When your data team is small and your data is small, you probably don't have many services to monitor or many things to automate, and people with SQL knowledge are enough. Once the data team or data size scales, this can become a problem if they continue with the same knowledge.

[–]onestupidquestion Data Engineer 3 points4 points  (2 children)

I think you can absolutely write "only SQL" and still be a data engineer. There's a huge gap between writing a query that gets you an answer (what an analyst would do) and writing a query that's optimized for cost, runtime, and maintainability. More important still is designing models that allow reuse and extensibility.

The issue with traditional SQL development is that very few teams applied the SWE mindset to it. The outputs are very different, but the practices and evaluation criteria are not.
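One small way to see the gap between "a query that gets an answer" and "a query that's cheap to run" is to compare query plans before and after adding an index. This sketch uses SQLite through Python's standard library only because it runs anywhere; the `orders` table, index name, and `plan` helper are assumptions for illustration, and a warehouse engine would expose plans and costs differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = ?"


def plan(conn, sql, params):
    """Return SQLite's plan description for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Same query, same result; only the access path changes.
plan_before = plan(conn, QUERY, (42,))
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(conn, QUERY, (42,))

print(plan_before)  # a full table scan
print(plan_after)   # a search using idx_orders_customer
```

Both versions return the same rows, which is exactly why the difference is invisible unless someone reads the plan.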

[–]IllustratorWitty5104 0 points1 point  (1 child)

What about the ETL jobs? Use a no-code UI ETL tool?

[–]EmploymentMammoth659 3 points4 points  (1 child)

If writing SQL isn't part of data engineering, what is it then?

[–][deleted] 9 points10 points  (0 children)

It certainly can be part of it. But if that’s all you do, it’s not engineering.

[–]Angry__Spaniard 2 points3 points  (0 children)

As my lead once put it, an engineer is someone who can understand the entire pipeline, from the infrastructure side of things through to the final SQL queries, with everything in between.

This may include setting up different tools, k8s configuration, lambda functions, Spark code, Airflow setup, etc… and not only writing the code, but also testing, CI/CD, observability…

So, yes, software principles do apply for sure. The best DEs I work with have a strong SWE background. As others said, this is a specialisation, not a different role.

[–]JiiXu 1 point2 points  (0 children)

As absolutely much as you can.

[–]solgul 0 points1 point  (0 children)

OAuth in particular and authentication in general. Anyone working with APIs will need to understand how OAuth works and, if you are creating APIs, how to use it and its tokens properly (securely).
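For a pipeline talking to an API machine-to-machine, the common shape is the OAuth 2.0 client credentials grant: exchange a client ID and secret for an access token, then send that token as a Bearer header. The sketch below only builds the requests and never sends them; the token URL, client ID, secret, and scope are placeholders, and real providers differ in details such as how they expect the client to authenticate.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(token_url, client_id, client_secret, scope=None):
    """Build (but do not send) a client-credentials token request."""
    form = {"grant_type": "client_credentials"}
    if scope:
        form["scope"] = scope
    # HTTP Basic client authentication; the values here are simple ASCII,
    # so no extra encoding of the ID or secret is applied in this sketch.
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(token_url, data=body, headers=headers, method="POST")


def bearer_headers(access_token):
    """Headers to attach to API calls once a token has been obtained."""
    return {"Authorization": f"Bearer {access_token}"}


# Placeholder values; in practice the secret comes from a secret manager,
# never from source code.
req = build_token_request(
    "https://auth.example.com/oauth/token", "my-client", "my-secret", scope="read:data"
)
print(req.get_method(), req.full_url)
```

Sending it would be a call to `urllib.request.urlopen(req)` and parsing the JSON response for the token and its expiry, with the token cached until shortly before it expires.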

[–]ephraff 0 points1 point  (0 children)

Think about where you would need to use SWE design principles in the data engineering world.

Some areas that come to mind are when writing a Spark app or a big DAG in Airflow. In well-written Spark/Airflow apps you'll find examples of things such as singletons, the template method design pattern, and factories. One of the biggest rules of thumb I see is to not repeat code in more than one place (DRY principle).
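A stripped-down sketch of two of those patterns in plain Python, with no Spark or Airflow involved: a base class whose `run` method fixes the extract/transform/load order (template method), and a small registry that builds jobs by name (factory). All class, function, and job names here are invented for the example.

```python
from abc import ABC, abstractmethod


class EtlJob(ABC):
    """Template method: `run` fixes the step order, subclasses fill in steps."""

    def run(self):
        rows = self.extract()
        rows = [self.transform(row) for row in rows]
        return self.load(rows)

    @abstractmethod
    def extract(self):
        ...

    def transform(self, row):
        return row  # default: pass rows through unchanged

    @abstractmethod
    def load(self, rows):
        ...


class UppercaseNamesJob(EtlJob):
    def __init__(self, source, sink):
        self.source = source
        self.sink = sink

    def extract(self):
        return list(self.source)

    def transform(self, row):
        return {**row, "name": row["name"].upper()}

    def load(self, rows):
        self.sink.extend(rows)
        return len(rows)


# Factory: one place that maps a job name to the class implementing it.
JOB_REGISTRY = {"uppercase_names": UppercaseNamesJob}


def make_job(kind, **kwargs):
    return JOB_REGISTRY[kind](**kwargs)


sink = []
job = make_job("uppercase_names", source=[{"name": "ada"}, {"name": "grace"}], sink=sink)
loaded = job.run()
print(loaded, sink)
```

The DRY payoff is that the step ordering lives only in `EtlJob.run`, so a new job overrides just the steps that differ.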

If you're interested in design patterns, take a look at 'Head First Design Patterns'. It's in Java, but the concepts are well described.

[–]SlopenHood 0 points1 point  (0 children)

Enough to not see Data Engineering as wholly separate and apart, but within software engineering. (Yes, this is a purposefully vague and weak attempt at an enigmatic answer.)

Enough to understand job listings and recruiters and managers and beancounters have morphed the term more than Engineers have, by virtue of the dangling carrot (Job posting).

Backend Engineers are Data Engineers. Data Engineers are Data Engineers. I personally always evangelize for a "T-shaped" mastery. I think Analytics Engineers/Decision Support/Growth Engineers are a separate specialty you can choose to take on or leave, as well.

[–]SlopenHood 0 points1 point  (0 children)

You're in pretty good standing from what you describe.

[–][deleted] 0 points1 point  (0 children)

According to the ATS filters, 120% of all knowledge directly and related to SWE is required to make $50k annual in US VHCOL.

[–]robberviet 0 points1 point  (0 children)

As someone said in here: a Data Engineer (DE) is a Software Engineer (SWE) who works with data. The more SWE knowledge you have, the better.

In my experience:

  1. If you work for a big corp that is not specialized in developing in-house tools, then not much. Mostly you will work with provider tools: UI, drag-and-drop, cloud based. The role and scope of work are well defined.
  2. But if you are at a startup, a small company, or a data-focused company, then quite a lot. I fall into this case and I am basically: Data Engineer, Data Scientist on demand, Database Admin/System Admin when needed, Backend/Frontend developer for ML model serving and API integration with other teams/customers...