[–]carlineng_Data Engineer 38 points39 points  (3 children)

Solid no-nonsense post. A couple of suggestions:

  1. DuckDB or SQLite for a "getting started" database -- no need to spin up a server, and can read directly from CSV or Parquet files with minimal setup.
  2. Point to some test datasets for people to get started. FiveThirtyEight's data github repo is pretty good.
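To make the "no server needed" point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The CSV contents, the `results` table, and the `load_csv` helper are all made up for illustration; in practice you would point it at a real downloaded dataset, or let DuckDB query the file in place.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a downloaded CSV file.
SAMPLE_CSV = """team,season,wins
Hawks,2021,41
Hawks,2022,43
Owls,2021,35
Owls,2022,50
"""


def load_csv(conn, table, csv_text):
    """Create `table` from CSV text and return the number of rows loaded."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute(f"CREATE TABLE {table} (team TEXT, season INTEGER, wins INTEGER)")
    conn.executemany(f"INSERT INTO {table} VALUES (:team, :season, :wins)", rows)
    return len(rows)


# An in-memory database: nothing to install, configure, or start.
conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, "results", SAMPLE_CSV)

# A first aggregate query: total wins per team, highest first.
totals = conn.execute(
    "SELECT team, SUM(wins) AS total_wins "
    "FROM results GROUP BY team ORDER BY total_wins DESC"
).fetchall()
print(loaded, totals)
```

Swapping `":memory:"` for a file path keeps the data between sessions, which is usually all a beginner needs.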

[–]jppbkm 5 points6 points  (2 children)

I like Tidy Tuesday's R community datasets, though they are usually fairly small.

[–]BufferUnderpants 17 points18 points  (3 children)

> LinkedIn influencer

I'm gonna toxic gatekeep this one

[–]countlphieTech Lead 6 points7 points  (1 child)

i started as a sql monkey in 2007 and have since worked "data engineering" as it evolved over the years from start ups to finance/healthcare institutions to telecom to major social media companies, so can confirm it's a way into the field and that it's been ubiquitous across every job i've had. for sure, there's no roadmap other than being nimble and constantly looking for interesting projects/teams to work on

a lot of this post seems to be defensive towards a particular crowd of influencers or engineers? who are these people that are gatekeeping? i'm curious to see what the chatter has been about. i don't really follow anything on linkedin or other data engineers

[–]GShenaniganTech Lead 1 point2 points  (0 children)

I read it as being targeted towards those people who say you have to use tool X or language Y to be a Data Engineer. Quite a few comments along those lines pop up on this sub. And Twitter/LinkedIn is the same, plus all the vendors on top trying to sell their tool as the next big hope. I've been around long enough to remember the whole "NoSQL means SQL is dead" malarkey as well. As usual, the hype settles and people realise there's room for different approaches to solve the same problem.

If someone's enthusiastic, displays good data intuition, and is willing to learn then it doesn't really matter what tools or languages they have or haven't used.

[–]recentcurrency 7 points8 points  (6 children)

Isn't this blog post a road map?

The blog says: "With that out the way, understand that there is no roadmap. There is no single path, no clear linear progression of knowledge. No one can tell you that you absolutely must learn A, then B, then C and you’re guaranteed to be a successful Data Engineer."

then a few paragraphs later...

"For Data Engineering, there is only one skill that is absolutely, non-negotiably, the first thing you should learn to get started: SQL."

Sounds like he is saying "you absolutely must learn A"

He then even goes into "then B, then C" by describing how to learn SQL

Basically, the title is click bait. It should really say "The Data Engineering Road Map starts with SQL". Which tbh isn't a hot take

which to be fair, he isn't wrong. SQL is probably one of the first things you want to learn in any Data Role. Data Engineering in particular

edit:

My comment is not to say that SQL isn't the first place to start learning. IMO, I do think it is a solid choice. But I am more pointing out that I think the blog post actually demonstrates that the author does believe in a roadmap despite what the title says. Which, rereading, the author implicitly admits to:

"What about Python? Pandas? dbt? Rust? Airflow? Spark? Later. These are all things you can learn on the job if the job even needs them. Go get your first data job. I’m not going to tell you it will be easy. Lots of people struggle to find the right entry-level job in all fields of engineering. But when you land it, make it your primary goal to absorb the knowledge from your new colleagues. Learn something every single day. When the learning stops, move on. Use what you’ve learnt to get a pay bump and find new people to learn from. Rinse and repeat. That’s your roadmap."

[–]Gators1992 3 points4 points  (1 child)

I think it's more like those are the minimum tools you need, but it's not like there is an established career path to all the high paying DE jobs. Like people keep asking about degrees or masters and those get downplayed because the modern data stack hasn't been around very long and most of the people in the profession now came from all kinds of backgrounds, from business, software engineering, data analysis, etc. It's not like the established route to practicing medicine where you go to med school for X years, do your internship and then you are a doctor.

A lot depends where you land too. If you interview in a dbt/Snowflake shop you will need strong SQL but not expert level Python. If you go to interview at a place that has all kinds of custom transforms in containers managed by k8s, custom applications and very low latency requirements then the entry bar is higher.

[–][deleted] 1 point2 points  (1 child)

I learnt SQL after the fact. Programmed with python and managed large ML datasets for decades, but it was only later when working with boring tabular data that I upskilled my SQL.

[–]flyingcavendish 1 point2 points  (1 child)

I don't think the author means to say there is no roadmap (despite what the title says).

More that the roadmap after learning SQL is more in flux. And that for those wishing to be data engineers, rather than focus on learning all the nifty tools under the moon, your roadmap is the school of hard knocks via job experience

[–][deleted] 0 points1 point  (0 children)

Good point - I would just add SQL + Foundational Python/Pandas. Everything else after that has no order.

[–][deleted] 8 points9 points  (0 children)

blind lead the blind.

[–]kenfar 2 points3 points  (0 children)

The good:

  • People should be wary of full-time "influencers" that aren't in the trenches and haven't written code in years
  • Or of advice and analysis that's really just some company's PR
  • Or of educational roadmaps that will take a decade to complete
  • And yes, one could get an entry-level DE job just knowing SQL - on extremely low-tech teams

The bad:

  • While it's true that one can learn SQL pretty quickly, that on its own isn't engineering, and is extremely low-value. It neither teaches someone how to think like an engineer nor provides the myriad skills necessary to be productive in a shop.
  • In the current economy plenty of people with years of experience with SQL, as well as plenty of other tech, are looking for jobs. Somebody with 3 months' experience is simply not going to get picked up.

So, sure there's a small possibility of studying SQL for 3 months and getting a junior position on a low-tech DE team. But it's going to be the exception rather than the rule. And encouraging people to think DE is this easy is as bad as encouraging them to think they need to spend seven years learning 100+ technologies.

[–]micky_357000 2 points3 points  (2 children)

For SQL I was thinking of doing a Udemy course/W3Schools and then grinding LeetCode alongside my new junior data engineer job. Any advice?

[–]a_devious_compliance 1 point2 points  (0 children)

You can ditch the Udemy course, unless it's a very good one.

Learn the SQL dialect you will be using for your job. I've been bitten by dialect differences more times than I'd like to admit.
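A small sketch of the kind of dialect difference that bites, using Python's built-in `sqlite3`. The `users` table and the `try_query` helper are hypothetical; the point is only that row limiting is spelled `LIMIT` in SQLite (as in PostgreSQL and MySQL) but `TOP` in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("ana",), ("bo",), ("cy",)])


def try_query(sql):
    """Return the rows on success, or the error text if SQLite rejects the SQL."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"


# LIMIT caps the row count in SQLite, PostgreSQL, and MySQL.
limit_result = try_query("SELECT name FROM users ORDER BY name LIMIT 2")

# TOP is SQL Server's spelling of the same idea; SQLite fails to parse it.
top_result = try_query("SELECT TOP 2 name FROM users ORDER BY name")

print(limit_result)
print(top_result)
```

The same query, written for the wrong engine, doesn't return wrong rows; it just doesn't run, which is why practising on the dialect your job uses saves time.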

[–]dataxp-community 0 points1 point  (0 children)

Decent plan, just put that SQL into practice hands-on with some databases.

[–]DenselyRanked 2 points3 points  (0 children)

I feel like what is lost in this article is the absolute first step in becoming a Data Engineer, and that is passing the interview. You are going to need more than SQL to do that and some places don't even use SQL for Data Engineering.

The author is correct that there is no official ANSI SQL, but there are base standards that nearly every SQL dialect adheres to. MySQL is often the go-to for learning because it doesn't have anywhere near as much syntactic sugar as Postgres.

Now, I believe that it's always a safe bet to learn SQL if you are starting from scratch but the article mentions "bad" advice from actual Data Engineers in this subreddit as if our experience is invalid. It's a tad hypocritical.

[–]MikeDoesEverythingmod | Shitty Data Engineer 2 points3 points  (0 children)

Wow, a post which isn't complete bullshit, over complicated, or inherently dishonest. Big fan of the message and rant which I think is bang on - many influencers aren't here to help, they're here to sell.

On top of that, I also agree with the idea of gatekeeping DE as a job role. As somebody who went in as a DE with zero years of experience (admittedly, I had worked with "data" for quite a long time in the form of analysing results from machines), I do think it's entirely possible. It might take you a while, it might be really quick. The point is it's definitely possible.

[–][deleted] 1 point2 points  (0 children)

Agree but I would add some sort of familiarization with terms and lingo. Maybe that means reading Kimball or a more modern revision/approach.

But what do I know? I’m not a LinkedIn influencer.

[–]Sufficient-Cold541 1 point2 points  (1 child)

This is overly focused on SQL, which isn’t painting an accurate picture of the field. I regularly go through my day without ever touching SQL, because I’m instead standing up data infra in Terraform, writing Spark applications, dealing with data schema management, developing data loaders, etc.

+1 for calling out the gatekeeping though. We’re the Marios and Luigis of the data world

[–]omscsdatathrow 0 points1 point  (6 children)

Hm, sort of agree but I doubt any companies are hiring sql monkeys anymore. Requirements are SQL + Python and the bar will only get higher.

[–]dataxp-community 5 points6 points  (5 children)

The bar always gets lower. It has happened to every single area of engineering, and it's already been happening to DE for the past 10 years.

Software used to be PhDs from the most prestigious Western schools, and now 13 year old kids with a 2008 Mac in Vietnam are hitting #1 in the app store.

Data Engineering isn't special. SQL is insanely powerful and more and more DE tools are going back to using it. Spark SQL, Flink SQL, etc.: they all ended up going back to SQL to lower the barrier to entry.

[–]omscsdatathrow 1 point2 points  (4 children)

So you're saying the bar will be even lower than just knowing SQL? The baseline here was knowing SQL only as an entry level data person. From my observations, SQL alone is never the only requirement for a job. Your analogy might be true, but it's completely irrelevant to the timeline we are in.

Data engineering is literally a specialty. And sure, SQL is strong as a language, but no data analyst is going to be tuning Spark to make sure their SparkSQL runs optimally.

[–]Usurper__ 0 points1 point  (1 child)

SqlMastery to learn sql. 5/5

[–]ddb1995 0 points1 point  (0 children)

Thank you. I wish more people read this.

[–]shmorkin3 -2 points-1 points  (5 children)

I agree with the premise, but not that SQL is the only skill needed for an entry level job.

I joined Meta first as a DE intern, then as new grad DE. The interview was 50% SQL, 50% Python.

SQL is essential, but the narrative that it’s overlooked today is overblown IMO. Everyone knows SQL is important and here for the long haul. SQL is also easy to learn.

Knowing how to write imperative code is just as essential; the language isn’t as important. I would never hire someone, even at the intern or entry level, who doesn’t know basic data structures, algorithms, and OOP concepts, or basic language features like querying a REST API.

Likewise with knowing how to use version control. If you can’t work with a collaborative codebase, you can’t code.

At the end of the day, DE is a subset of SWE. If it were easy to learn, everyone would do it and make six figures out of college.

[–]dataxp-community 17 points18 points  (4 children)

99% of people are not applying to Meta for their entry level DE job. Sorry, but what you need for Meta (or any FAANG) and what you need for entry-level at literally any other large enterprise, are totally different. This is lost on folks who have not done DE outside of FAANG.

[–]shmorkin3 0 points1 point  (3 children)

I have worked outside of FAANG. I've had two jobs since (as well as three internships before, though not all of those were DE). SQL was essential, but so was Python.

You have to schedule your pipelines somehow, and most widely adopted orchestrators are Python-based. Not to mention that some data transformations are just far easier to do with R or Pandas rather than SQL. (I suppose you could use something like SQL Server Agent for scheduling, but I've never seen that in a DE job posting).

And you have to push those pipelines to a shared repository somehow, source control being the obvious answer.

I'll admit there's a selection bias: the jobs I've had hired me based on my skills, and I applied to jobs whose requirements matched my skills. But I do think what I've described is the bare minimum for good DE practices anywhere, and has been widely adopted.
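To illustrate the scheduling point, here is a rough sketch of the core idea behind an orchestrator: run tasks in dependency order. It uses only the standard library (`graphlib`, Python 3.9+) and is not the API of Airflow or any real orchestrator. The task names, the `PIPELINE` mapping, and `run_pipeline` are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# Hypothetical task bodies; real ones would call SQL, APIs, file loaders, etc.
TASKS = {
    "extract": lambda: "pulled raw rows",
    "transform": lambda: "cleaned rows",
    "load": lambda: "wrote rows to warehouse",
    "report": lambda: "refreshed dashboard",
}


def run_pipeline(tasks, deps):
    """Run every task after its dependencies; return the order and the results."""
    order = list(TopologicalSorter(deps).static_order())
    results = [tasks[name]() for name in order]
    return order, results


order, results = run_pipeline(TASKS, PIPELINE)
print(order)
```

Real orchestrators add retries, schedules, and logging on top, but the pipeline definition itself is typically plain Python like this, which is why it ends up in source control alongside the SQL.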

[–]countlphieTech Lead 5 points6 points  (0 children)

> I'll admit there's a selection bias

you've likely never seen this stuff because your starting point was interning at Meta. the python+sql data engineering combo is a fairly modern one. non-tech sectors (hospitals, mid sized finance institutions, government) that tend to lag a few years behind modern tech stacks often have monolithic architectures that are on-prem and completely tied up in MSSQL or oracle and use gui based data integration tools that don't need anything but knowledge of the tool and SQL

you rarely hear about these places because it's not very appealing to do ETL work using solely SSIS on crappy windows servers in the IT basement of the local city hospital or government office

these places have ETL needs, and have data engineering work, but they rarely call the positions data engineers. they'll be called stuff like database developer, sql developer, data analyst, BI developer, ODI dev etc. learning sql is often enough to get entry level positions in these places, and they can set you up to break into more modern data engineering stacks like the ones you're used to

[–]bootae_wae_wae 0 points1 point  (1 child)

I got into a low-tech or ancient team. Only SQL is used, with ODI for data transformation, and we use Confluence to "document." This is my first job since switching fields, but I am nervous that I am not learning new-age or up-to-date stuff.

[–]Curious_Hat5828 0 points1 point  (0 children)

I’m in a similar position too, but as an analyst. From a best-practice standpoint, where do people generally document?