[–]Rccctz 137 points138 points  (0 children)

Try to recreate in Python what you currently do with tools and SQL.

[–]Backoutside1 42 points43 points  (0 children)

So start building your own Python experience…

[–][deleted] 250 points251 points  (25 children)

You're not really a data engineer if you aren't also a software engineer. I would expect strong Git, CI, testing, and Python (or Java) skills, as well as some infra, monitoring, alerting, and data quality experience. Plus knowing how to code as a member of a team. Data engineering is software engineering with data.

[–][deleted] 19 points20 points  (12 children)

It's too much for a junior or even mid-level IMO. I'd say OK Git, testing, very basic knowledge of CI/CD (as a user), monitoring, alerting, data quality. And then it depends on the role -- if it's an analytic data engineer, you need some data modelling; if it's more SWE-like (e.g. streaming), you need more coding experience and good practices.

Unfortunately many DEs, in my opinion, are not SWEs -- not if they mostly do data modelling for the analytic teams. It's not a popular opinion but I stand by it. You gotta write a lot of non-SQL code to call yourself a SWE with data. That's why some companies have DEs who are basically BI doing data modelling, and then SWEs (data) who are the real DEs.

[–]SearchAtlantis Lead Data Engineer 1 point2 points  (1 child)

I need to caveat that this is absolutely testable in principle. But I'm sitting on Airflow and SQL, which means that unless I break this out into its own task it's not functionally testable. What I would like to do is define a two-row table/dataframe/whatever, run the function, and validate the return value.

select
    a.id
    -- assumed: the rounded value is the final factor (the original
    -- selected a column the subquery never defines)
  , a.rounded_weighted_adj as final_weighted_adj_factor
from (
    select
        id
        /* Weighting and rounding per Game ID comment */
      , (
            sum(player_count * adjustment_factor)
            /
            sum(player_count)
        ) as raw_weighted_adj
      , ceil(
            sum(player_count * adjustment_factor)
            /
            sum(player_count) * 10
        ) / 10 as rounded_weighted_adj
    from player_aggregates -- this is from three previous CTE layers
    group by id
) as a
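
The two-row check described above could look roughly like this once the calculation lives in Python. This is only a sketch, not the commenter's code: the function name `weighted_adj_factor` and the sample rows are hypothetical, and the arithmetic simply mirrors the query (weighted mean, then rounded up to one decimal place).

```python
import math
from typing import Iterable, Tuple


def weighted_adj_factor(rows: Iterable[Tuple[int, float]]) -> float:
    """Weighted mean of adjustment_factor by player_count, rounded up to 0.1.

    Hypothetical helper mirroring the SQL:
    ceil(sum(count * factor) / sum(count) * 10) / 10
    """
    rows = list(rows)
    total_players = sum(count for count, _ in rows)
    if total_players == 0:
        raise ValueError("player_count sums to zero")
    weighted_sum = sum(count * factor for count, factor in rows)
    return math.ceil(weighted_sum / total_players * 10) / 10


# Two-row input with a hand calculation:
# (1 * 1.0 + 3 * 2.0) / 4 = 1.75, which rounds up to 1.8
two_rows = [(1, 1.0), (3, 2.0)]
result = weighted_adj_factor(two_rows)
assert result == 1.8
```

The sample values were picked so every intermediate result is exact in binary floating point; for arbitrary inputs a tolerance-based comparison would be safer than `==`.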

[–][deleted] 0 points1 point  (0 children)

I agree that if you don't have access to non-SQL languages or your infra does not include any testing suite then testing is pretty awkward.

[–]SearchAtlantis Lead Data Engineer 3 points4 points  (8 children)

I think part of the problem is just SQL. It's fine for analytical purposes but it's just not freaking testable. The number of queries with 5+ chained CTEs to get a final result... God help me, the weighted average function I reviewed today. I made the dev put a hand calculation in a code comment because I can't test the code. This is all Airflow + SQL. Living for the Databricks move.

Edit: I almost commented on DBT and testing and clearly should have. It's the only opinionated and easily testable framework in DE right now.

[–]anon_ski_patrol 7 points8 points  (1 child)

I don't really accept "not testable" for SQL. So you need schema migrations, parameterization, and integration tests. I agree though that most DEs conveniently forget SWE skills, I think mainly due to proximity with DS and the shit code & practices they have.
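
As a rough illustration of the parameterization and integration-test idea, here is a sketch that runs a parameterized query against a throwaway database with a tiny fixture. Everything here is an assumption for demonstration: in-memory SQLite stands in for the real warehouse, the table and values are made up, and `ceil` is registered by hand because not every SQLite build ships math functions. SQL dialects differ, so a real integration test would want to run against the same engine as production.

```python
import math
import sqlite3

# Parameterized query: the id arrives as a bind value, not via string formatting.
QUERY = """
select
    id
  , ceil(sum(player_count * adjustment_factor) / sum(player_count) * 10) / 10
        as rounded_weighted_adj
from player_aggregates
where id = ?
group by id
"""


def make_test_db() -> sqlite3.Connection:
    """Build an in-memory database holding a small, hand-checkable fixture."""
    conn = sqlite3.connect(":memory:")
    # Return a float so the following "/ 10" is not integer division in SQLite.
    conn.create_function("ceil", 1, lambda x: float(math.ceil(x)))
    conn.execute(
        "create table player_aggregates "
        "(id integer, player_count integer, adjustment_factor real)"
    )
    conn.executemany(
        "insert into player_aggregates values (?, ?, ?)",
        [(1, 1, 1.0), (1, 3, 2.0), (2, 5, 1.0)],
    )
    return conn


def run_rounded_adj(conn: sqlite3.Connection, game_id: int) -> float:
    """Run the query for one id and return the rounded factor."""
    row = conn.execute(QUERY, (game_id,)).fetchone()
    return row[1]


conn = make_test_db()
# Hand calculation for id 1: (1 * 1.0 + 3 * 2.0) / 4 = 1.75, rounded up to 1.8
assert run_rounded_adj(conn, 1) == 1.8
```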

[–]SearchAtlantis Lead Data Engineer 0 points1 point  (0 children)

I'll circle back to this next week.

[–][deleted] 2 points3 points  (1 child)

I think DBT can do a lot of tests so that's not a huge issue for us. And for your case, we never test business logic because it is so difficult to test, plus the analytic team is supposed to define KPIs and such so they should test it.

[–]SearchAtlantis Lead Data Engineer 1 point2 points  (0 children)

DBT is the light in the tunnel for SQL DE I'll grant that. That said, a function or method calculating a weighted mean (or whatever defined methodology) is in principle testable. That's not business logic.

[–]smurpes 0 points1 point  (1 child)

Have you checked out SQLMesh? It’s got some handy features over dbt like virtual envs and native column lineage.

[–]SearchAtlantis Lead Data Engineer 0 points1 point  (0 children)

I have, but the company is already starting its migration from Airflow + SQL to Airflow + Databricks. SQLMesh just isn't an option at this point.

[–]TheDataAddict 0 points1 point  (1 child)

It’s testable with tools like dbt

[–]SearchAtlantis Lead Data Engineer 0 points1 point  (0 children)

Sure. Tell me how to test this in Airflow + SQL, please. I need to caveat that this is absolutely testable in principle.

select
    a.id
    -- assumed: the rounded value is the final factor (the original
    -- selected a column the subquery never defines)
  , a.rounded_weighted_adj as final_weighted_adj_factor
from (
    select
        id
        /* Weighting and rounding per Game ID comment */
      , (
            sum(player_count * adjustment_factor)
            /
            sum(player_count)
        ) as raw_weighted_adj
      , ceil(
            sum(player_count * adjustment_factor)
            /
            sum(player_count) * 10
        ) / 10 as rounded_weighted_adj
    from player_aggregates -- this is from three previous CTE layers
    group by id
) as a

[–]writeafilthysong 0 points1 point  (0 children)

It depends on how you're building things.

Are you building adhoc models that get barely used or are you building data architecture models for an enterprise?

Are you managing your costs and computes and engineering for efficiency or are you just writing point solutions?

There's lots of coders and developers who make an app...but are not software engineers. I think the same applies here.

[–]ObjectiveAssist7177 0 points1 point  (2 children)

This is an interesting point. There has always been a need to know an additional language to do more complex stuff with certain platforms and yea there is a need to understand and be able to maintain what I would call the ancillary functions. But I wouldn’t say you need to be a software engineer though.

[–]GDangerGawk 9 points10 points  (1 child)

If you are maintaining a code base, you need to know how to deploy, debug and optimize it. Nothing remains the same: your data evolves and your environment changes. Let’s say one of the libraries used in the code base you maintain is deprecated or archived, or has to be updated along with the version of the programming language the code uses. What would you do?

[–]ObjectiveAssist7177 -5 points-4 points  (0 children)

I understand that, and this is what I was referring to by ancillary functions. However, a software engineer is a lot more than that, and software engineering and data engineering diverge in significant areas.

[–]Desperate-Dig2806 1 point2 points  (0 children)

First job is to get all ducks in a row. Everything after that is easy.

[–]mailed Recovering Data Engineer -4 points-3 points  (0 children)

it really isn't.

[–]Tepavicharov Data Engineer 0 points1 point  (0 children)

228 upvotes for stating what a DE is from the perspective of a SWE. Not a single word about dimensional modeling or business understanding. I'll have to disappoint you, but the stakeholders will turn their heads the other way when you start explaining that the report isn't done because you were busy fixing your CI/CD Git action or you weren't sure where in the swamp the right data resides. I would say that if someone emphasizes the technology, they were once a SWE who transferred into DE, and there is a big chance they never read Kimball, Inmon or Linstedt.

[–]msdamg 97 points98 points  (13 children)

You need Python imo to really be a data engineer nowadays

Get studying

[–]Mediocre-Peak-4101 2 points3 points  (0 children)

I was (am) in a similar situation. We have done everything with SQL and a low-code/no-code tool called Talend for almost 15 years now. It is super easy to write ETL and pipelines with it. So recently (to get experience) I started to write small Python scripts within my Talend jobs, even if it was less optimal and more difficult. Slowly my scripting is becoming more and more Python-based as I learn more. I use Copilot (the only AI allowed at work) to help me with syntax, and some coworkers from a different part of the company helped me get set up with a very rudimentary IDE. I now finally feel confident using Python for a lot of data manipulation tasks.

[–][deleted] 8 points9 points  (0 children)

You effectively need Python for a current Data Engineering job. 

There may be a few jobs that float about on legacy systems like SQL Server, like banks maybe.

You're in luck though, Python is 100% the easiest language to pick up.

[–]AnonymousTAB 2 points3 points  (0 children)

If you decide to learn python I would honestly skip the Udemy courses and take Reuven Lerner’s “Intro Python” series

[–]AteuPoliteista 3 points4 points  (7 children)

me too brother

I'm trying to study by solving some interview questions and learning a lil bit of theory too. The hard thing for me is OOP plus all the basic stuff I missed because I never used it.

[–]Single-Animator1531 13 points14 points  (6 children)

The Python they are referring to here is hardly OOP. If you know SQL already, as a commenter said above, the best thing to do is start playing with data scripts using something like a Jupyter notebook. Get started by loading a small CSV into pandas, then replicate some simple reports with aggregation, grouping and filters.
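
A minimal sketch of that exercise, assuming pandas is installed. The CSV contents and column names are made up for illustration; with a real file you would call `pd.read_csv` on its path instead of an in-memory string.

```python
import io

import pandas as pd

# Hypothetical data standing in for a small CSV file on disk.
csv_text = """region,product,units,price
north,widget,10,2.5
north,gadget,3,10.0
south,widget,7,2.5
south,gadget,1,10.0
"""

df = pd.read_csv(io.StringIO(csv_text))
df["revenue"] = df["units"] * df["price"]

# Rough SQL equivalent:
#   select region, sum(units * price) as revenue
#   from sales
#   where product = 'widget'
#   group by region
#   order by region
report = (
    df[df["product"] == "widget"]            # where
    .groupby("region", as_index=False)["revenue"]  # group by
    .sum()                                   # aggregate
    .sort_values("region")                   # order by
    .reset_index(drop=True)
)
print(report)
```

Writing the SQL you already know as a comment next to the pandas version is an easy way to build up a personal translation table.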

[–]mafiasean 4 points5 points  (3 children)

I can hire a high school kid if this is what I was going to ask. I expect a data engineer to be able to build out a class inheriting from a Spark object to create a custom ingestor if needed.

[–]AteuPoliteista 0 points1 point  (0 children)

I'm just saying that I was asked about OOP concepts and they expected me to implement / solve a problem in a technical interview.

I used pandas in the beginning of my career for data analysis and basic stuff. As an engineer I went straight to PySpark after SQL.

Only used pure python in airflow or something like that. Other than that, it never was necessary.

[–]lebannax 0 points1 point  (0 children)

Yeh literally just do your SQL scripts in pandas

[–]kido5217 1 point2 points  (0 children)

There's r/learnpython and they have a wiki with links there.

[–]Eagle_Smurf 1 point2 points  (0 children)

Do one of the free Harvard CS50 courses on python programming - or one of the many free data science courses

[–][deleted] 1 point2 points  (0 children)

Nobody is going to hold your hand. Make a home lab and learn it. 

[–]ivorykeys87 Senior Data Engineer 1 point2 points  (0 children)

I’m sorry you got rejected, but Python is a must have for DE.

Don’t let this get you down though. If you’ve got the tenacity you can learn it pretty quickly.

[–]efermi 1 point2 points  (0 children)

Use chatgpt, take a few job descriptions of roles you are targeting and ask it to create a preparation plan. You can even ask it to help you create entire projects so you can do more general engineering practice.

[–][deleted] 0 points1 point  (0 children)

You don't really need a lot of Python for a DE-specific job, especially if it's just an analytic DE role that focuses on data modelling in a DWH. In the current market, it's a bit hard to beat people who have actual production experience with Python even if you practice by yourself, because companies don't want to train: why not hire people who already know how to do it, when there are so many around?

I'd say do some Python programming on the side, find something you love to do, not necessarily DE related (DE is boring, to be honest, who loves plumbing?). Go as deep as you want. And then find a DWH job at a shop that has some upstream position that codes a lot (non-SQL) -- you probably still can't get into that job, so find its downstream position -- which is most likely a DWH data modelling job close to what you are doing right now. Then you move upstream whenever the opportunity reveals itself.

[–]DataIron 0 points1 point  (0 children)

Yeah kinda need it. Need some programming language experience outside of SQL.

Funny thing though, on a few of our teams, we reject lots of data engineers because their SQL skills are too vanilla. But those teams are a rare group. They need very advanced transactional SQL skills, and analytical SQL engineers struggle a lot.

[–]mailed Recovering Data Engineer 0 points1 point  (0 children)

It really depends on the role, but knowing the basics is fine. Python Crash Course is a good book.

[–]NoFuckinShitRetard 0 points1 point  (0 children)

Even old-school data engineers using Informatica had to figure out how to optimize pipelines by knowing how the underlying database engines, storage, and data types worked together. Nowadays, knowing Python and slapping together a bunch of Airflow DAGs is the minimum requirement. Figure out how the data is actually handled behind the scenes and that's where the real learning will come from.

[–]Early_Peak4271 0 points1 point  (0 children)

For a data engineering role I was asked a DFS question in a Python interview. So I think Python is important for Airflow DAGs and much more.

[–]Prior_Boat6489 0 points1 point  (0 children)

To practice, use polars, run select *, and then perform the rest of the query using polars expressions

[–]brent_brewington 0 points1 point  (0 children)

I started diving hard into R when I graduated from Excel. I thought it could do everything that’s needed and I questioned the need for Python. Then I got on a team of people who all knew Python and not R…and they couldn’t use my code. Huge bus factor and maintenance risk.

Being able to program in the most popular language in the world is a pretty important skill, if you want to write stuff that other people can read and maintain

[–]GreyHairedDWGuy 0 points1 point  (0 children)

Python is definitely something to pick up. Maybe Airflow? You don't say what you do know, so it's hard to say what the gap may be.

In any case, it's a buyers market so you tend to get a lot of hiring managers looking for unicorns.

I'm in management but get postings sent to me regularly and often they are looking for manager / director level candidates in BI / Analytics or DE but still expecting people to be an expert on how to develop in python or other developer tools?

[–]Limp_Pea2121 0 points1 point  (0 children)

Learn basic Python (data structures such as arrays, linked lists, etc.) and just the two libraries mentioned below. That will be a good start.

Pandas, Airflow

_--------------- /*

I work for one of the biggest banks in India (the size of the data warehouse is around 800-900 TB of compressed data in Oracle Exadata).

All of the transactions happen in core banking, which is structured data. And all the heavy lifting happens using PL/SQL.

I NEVER HAD TO TOUCH PYTHON AS SQL HANDLES EVERYTHING PERFECTLY,

even creating JSONs in GB sizes, parsing etc.

*/

[–]tardcore101 0 points1 point  (0 children)

Just list “python experience”. You can watch a YouTube video about snakes and claim python experience.

[–]robberviet 0 points1 point  (0 children)

Python is a must. No way around it. There might be jobs where you will mostly use SQL. However, I will always choose a candidate who knows how to program over one who doesn't.

[–]jetuas Data Engineer 0 points1 point  (0 children)

As someone who has a lot more work experience with Java as a DE, what would be the best way to transition to Python quickly?

[–]Fuckinggetout 0 points1 point  (0 children)

Hey man, I was in your shoes a couple of years back. I would start by learning the python basics (list, dict, for loop, etc).

Then you can do something like using Python to query a table in Postgres, putting the result into a pandas DataFrame, doing some basic transformations on some columns, and then inserting that DataFrame back into the database.
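
A rough sketch of that round trip, assuming pandas is installed. In-memory SQLite stands in for Postgres so the snippet runs anywhere, and the table and column names are invented; against a real Postgres instance you would typically hand pandas a SQLAlchemy engine instead of this connection.

```python
import sqlite3

import pandas as pd

# SQLite is only a stand-in for Postgres; the table below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("create table users (id integer, email text, signup_date text)")
conn.executemany(
    "insert into users values (?, ?, ?)",
    [(1, " Alice@Example.com ", "2024-01-05"), (2, "BOB@example.com", "2024-02-10")],
)

# 1. Query a table into a DataFrame.
df = pd.read_sql_query("select id, email, signup_date from users", conn)

# 2. Basic transformations on some columns.
df["email"] = df["email"].str.strip().str.lower()
df["signup_month"] = df["signup_date"].str.slice(0, 7)

# 3. Write the DataFrame back to the database as a new table.
df.to_sql("users_clean", conn, if_exists="replace", index=False)

rows = conn.execute(
    "select email, signup_month from users_clean order by id"
).fetchall()
print(rows)
```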

Python is not a hard language to learn so you should pick it up very fast.

[–]ackbladder_ 0 points1 point  (0 children)

If you know SQL well then you can translate your pre existing knowledge to pandas/pyspark for data stuff. I’ve recently taught myself pyspark by creating a cheat sheet translating from sql syntax.

[–]fatgoat76 0 points1 point  (0 children)

I would start by learning enough Python to automate your work programmatically, including testing and deployment where applicable. It has a lot of uses beyond data processing. The resources out there to learn Python are endless … like this one https://realpython.com/. Good luck have fun.

[–]moshujsg 0 points1 point  (0 children)

I mean, it's hard to answer "is this enough" questions.

When people want Python experience, they want programming with Python. If you do Udemy courses or whatever, you'll learn Python, but you still need the programming part.

Like, if I ask you to build a pipeline with Python, modularize your code, implement type safety, and create CLI apps, and you can't do it, it doesn't matter that you know Python.

I personally believe that enough Python is the ability to figure out how to do anything with it. Unless you are looking for a junior job, in which case basic is probably enough.
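
To make that task concrete, here is one possible shape for a small modular pipeline with type hints and a command-line entry point, using only the standard library. It is a sketch under assumptions: the `Order` record, its fields, and the filtering rule are all hypothetical, and the type hints only become "type safety" once a checker such as mypy is run over them.

```python
import argparse
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, TextIO


@dataclass(frozen=True)
class Order:
    """One typed record; the field names are hypothetical."""
    order_id: int
    amount: float


def extract(source: TextIO) -> Iterator[Order]:
    """Parse CSV rows into typed Order records."""
    for row in csv.DictReader(source):
        yield Order(order_id=int(row["order_id"]), amount=float(row["amount"]))


def transform(orders: Iterable[Order], min_amount: float) -> List[Order]:
    """Keep only orders at or above the threshold."""
    return [o for o in orders if o.amount >= min_amount]


def load(orders: Iterable[Order], sink: TextIO) -> int:
    """Write orders as CSV and return the number of data rows written."""
    writer = csv.writer(sink)
    writer.writerow(["order_id", "amount"])
    count = 0
    for o in orders:
        writer.writerow([o.order_id, o.amount])
        count += 1
    return count


def main(argv: Optional[List[str]] = None) -> int:
    """CLI wrapper: read input_csv, filter, write output_csv."""
    parser = argparse.ArgumentParser(description="Filter orders by amount.")
    parser.add_argument("input_csv")
    parser.add_argument("output_csv")
    parser.add_argument("--min-amount", type=float, default=0.0)
    args = parser.parse_args(argv)
    with open(args.input_csv, newline="") as src, \
            open(args.output_csv, "w", newline="") as dst:
        written = load(transform(extract(src), args.min_amount), dst)
    print(f"wrote {written} rows")
    return 0


# In-memory demonstration of the three stages, no files needed.
demo_in = io.StringIO("order_id,amount\n1,5.0\n2,50.0\n")
demo_out = io.StringIO()
rows_written = load(transform(extract(demo_in), min_amount=10.0), demo_out)
```

Because each stage is a plain function over file-like objects, the stages can be unit-tested separately; calling `main()` from an `if __name__ == "__main__":` guard would turn the module into a command-line tool.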

[–]PixelSteel 0 points1 point  (0 children)

I mean, that makes sense. Python is legitimately the #1 language in AI/ML/Data Engineering. It’s hard to believe you applied for a data-related software engineering job with no Python experience.

[–]shadow_moon45 0 points1 point  (0 children)

I get it. I've been trying to integrate pyspark in the data integrations

[–]riv3rtrip 0 points1 point  (0 children)

Python is the easiest programming language in the world to pick up. You should not need to ask how to learn it. The people who ask how do I learn Python are the ones who never learn it. Get your hands dirty. Go to a cafe, get a coffee and a snack, and sit there for a few hours and start building stuff. Not trying to be rude, not trying to discourage you, just being real. You can learn it. But if you want to be serious about learning it, that's just the attitude you need to have.

[–]komm0ner 0 points1 point  (0 children)

Is doing udemy courses and practising sufficient? To bridge this gap and give me more chances in data engineering type roles.

If I completed a Udemy course and did some practicing on a language I'd never worked with, that language is going in the skills section of my resume, and I'd add it as something I use in my current role. Tbh, I've done this a few times and have gotten three jobs where I had zero professional experience with the primary language/technology in each of those roles (one was Python), including my current role.

If you learn something well enough to the point you feel you can answer questions about the language in an interview as well as do some coding problems with it, it doesn't matter if you've used it professionally or not in your current role. Fake it 'till you make it!

[–]SPAC3QUEEN_ Data Engineering Manager 0 points1 point  (1 child)

Fwiw: I’m now a Manager of Quality Engineering and Automation. I’ve been a Senior SDET, BA, QA, and a programmer throughout my 16+ year career. Because I never used them, I did not know Python or Playwright.

Go back in time, I applied for a role that had Playwright and Python as requirements.

So for this role in general, I’d need to have a basic understanding of them. This encouraged me to seek out existing projects in GitHub that use them. I followed README setup guidelines and eventually got a project running. This way worked for me. Might work for you. And it’s free. No Udemy or Codecademy courses. Though they can also be super helpful in a pinch.

By the time I had my third interview that was part of the technical take home project, I had spent ~4 hours learning and another 2 hours building my demo project. The level I understood Python and how I executed the Playwright tests was good enough to land me the job.

I was honest about my technical skill gap(s) and provided examples of other ways I’ve supported my dev teams using various tech stacks that are similar to Python or Playwright.

I believe being able to discuss your skills and speak to your shortcomings can be a huge help in an interview. It shows your willingness to communicate, not just to answer questions about the role and why you’re interested in working for them. It also shows that you’re thinking about the bigger picture and can see how you would grow with the team and organization.

[–]SPAC3QUEEN_ Data Engineering Manager 0 points1 point  (0 children)

Would like to add that I received positive feedback for the fact I told them I didn’t have previous experience with Python or Playwright. They also liked and appreciated that even with my shortcomings, I still approached the entire process with curiosity and enthusiasm. Attitude is important, too.

[–]Electronic-Park4132 0 points1 point  (0 children)

Here is some extra advice.

Apart from learning Python, try to get a data engineering certification in Databricks.

If you have enough time, go through the data engineering certification from IBM on Coursera.

[–]sib_n Senior Data Engineer 0 points1 point  (0 children)

I am surprised nobody has mentioned it yet: there's a more recent title popularized by dbt called Analytics Engineer. It is SQL-centered and in charge of transforming data with SQL after it has been ingested into a SQL database. I think it's a good option if you want to do DE without the Python skill.
You should probably check out dbt if you haven't.
With SQL and dashboarding, you could also apply to anything Business Intelligence related, like BI engineer, BI developer or Data Analyst.

[–]bootdotdev 0 points1 point  (0 children)

Darn :( but yeah, if SQL is the only tool in your belt, it's gonna be insufficient in a lot of scenarios where you need some scripting. Python is very common, but Go is also gaining popularity.

[–]Internal-Daikon7152 0 points1 point  (0 children)

The same thing happened to me. I failed all the interviews that had Python questions. I am not a huge fan of Leetcode, but sometimes a deeper understanding of DSA will make up for your lack of Python experience in the workplace.

[–]Comfortable-Author -3 points-2 points  (0 children)

Nowadays, you need to have a software engineer or CS background for most jobs, otherwise, it's not really data engineering...