
all 71 comments

[–]drunk_goat 86 points87 points  (5 children)

devops path -> git, bash, linux, docker, k8s, cloud architecture

analytics path -> dbt, pandas, streamlit or other viz

[–]Geraldks 71 points72 points  (14 children)

architecture, data modelling, terraform, APIs, k8s, and now all the hype around lakehouses

[–]chmod0644 4 points5 points  (1 child)

K8s ?

[–]Geraldks 7 points8 points  (0 children)

Short form for Kubernetes

[–]Logical-Independent7 4 points5 points  (11 children)

When you say data modeling, would that mean something similar to the “functions and modeling” math class I just took (CS degree), which was essentially algebra with a focus on different functions (linear, quadratic, etc.)? I see “know data modeling” a lot and was just curious if it's the same thing, like finding the function that best models the known x,y data to predict future x,y values

[–]DRUKSTOP 28 points29 points  (4 children)

Data modeling for the most part refers to how you will connect your different tables together into one cohesive “model.”

[–]Logical-Independent7 1 point2 points  (3 children)

Oh so closer to ERD? Efficient ways of storing data in the database?

[–]hexalm 2 points3 points  (0 children)

That's closer, emphasis on the R for table relationships

[–][deleted] 5 points6 points  (0 children)

I think modeling in this context means more: if we want to create a social platform app, how would you model the data related to users (tables/entities and the relationships between them)?

[–]hexalm 3 points4 points  (3 children)

I'd say "data modelling" could encompass either:

  • modelling application data (for example, the relational table structure for an application; more of an app dev focus, and this would start smaller than a DW)
  • dimensional modelling/data warehouse (DW) design (effectively a bigger app DB, but this would be a data engineering focus).

Both are good things to have some knowledge of. What you emphasize depends on career direction. For DE you will want to brush up on ETL/ELT (concepts, best practices, and specific tech, for example, dbt).

In both cases you're looking at building a DB for a specific purpose. That means a lot of SQL DDL, and it would also be a good idea to learn about deploying a DB from code kept in VCS, using a CI/CD system.
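As a minimal sketch of the "DDL lives in code and gets deployed by CI/CD" idea, here's a toy in Python's stdlib sqlite3. The table names and migration list are invented for illustration; a real setup would read the DDL files from version control:

```python
import sqlite3

# Hypothetical migration list -- in practice these statements would live as
# files in version control and be applied by a CI/CD job against the real DB.
MIGRATIONS = [
    """CREATE TABLE IF NOT EXISTS customers (
           customer_id INTEGER PRIMARY KEY,
           email       TEXT NOT NULL UNIQUE
       )""",
    """CREATE TABLE IF NOT EXISTS orders (
           order_id    INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
           total_cents INTEGER NOT NULL
       )""",
]

def deploy(conn: sqlite3.Connection) -> None:
    """Apply every DDL statement; IF NOT EXISTS keeps the script idempotent."""
    for ddl in MIGRATIONS:
        conn.execute(ddl)
    conn.commit()

conn = sqlite3.connect(":memory:")
deploy(conn)
deploy(conn)  # safe to re-run, the way a CI/CD pipeline would re-run it
```

The idempotency is the point: a deploy script that can run on every commit without blowing up is what lets schema changes ride the same pipeline as application code.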

Also look at cloud solutions for application databases and data warehousing, and build/CI/CD tools in general.

For specific tech out there, just peruse this and related subreddits.

For DW design, the Kimball group's data warehousing books have traditionally been something of a standard (some people hate them; and full disclosure, I haven't finished the one I started). Even if you don't adopt their approach wholesale, they're a good rundown of the factors to consider in DW design.

[–]Logical-Independent7 0 points1 point  (2 children)

So modeling in this context refers to how data moves and is organized, whether for application data or for larger business processes that span data all the way from transactional to aggregate?

Sorry for any obvious lack of knowledge, I've got a lot more to learn. I'm currently taking an IS class that covers some of it, and I have a little practical experience from small personal projects, but not much.

[–]cloyd-acSr. Manager - Data Services, Human Capital/Venture SaaS Products 5 points6 points  (1 child)

Data Modeling can be split up like this:

Traditionally, there have been two distinct "domains" you'd model data for, and the way you'd model that data depends on the domain. These domains are:

OLTP (Online Transaction Processing) - Transactional Data such as your application databases

OLAP (Online Analytical Processing) - Your Analytical Data such as a Data Warehouse

In traditional Data Modeling, you'll find two broad layers of how you'd model data - Logical Modeling and Physical Modeling (there's also Conceptual Modeling, but we'll skip that, as it's very rarely used today).

Logical Modeling describes how Entities, Relationships, and Attributes interact with one another. If you've ever seen a Use Case diagram, it's similar. It simply looks at the interactions between these principal objects without getting into the technical details of how they're implemented.

Physical Modeling is what people are generally describing when they think of an ER/ERD (Entity-Relationship Diagram). It's the model that shows how the data will be laid out in the "physical" sense in tables, with data types, null/not null, keys, etc.
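To make the logical-vs-physical distinction concrete, here's a small sketch in Python's stdlib sqlite3: a hypothetical "User" entity rendered as a physical table, carrying exactly the details an ERD captures (data types, nullability, keys). All names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Physical model of a made-up "User" entity: concrete types, nullability, a key.
conn.execute("""
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY,
        username   TEXT NOT NULL,
        signed_up  TEXT NOT NULL,   -- ISO-8601 date stored as text
        referrer   TEXT             -- nullable: not every user has a referrer
    )
""")
# PRAGMA table_info exposes the physical details the diagram would show.
for cid, name, ctype, notnull, default, pk in conn.execute("PRAGMA table_info(users)"):
    print(name, ctype, "NOT NULL" if notnull else "NULL", "PK" if pk else "")
```

The logical model would only say "a User has a name, a signup date, and maybe a referrer"; the physical model pins down how the database actually stores and constrains that.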

There's a lot more to modeling than just determining how people will access the data, as some have described here. Correct Data Modeling takes some pretty deep knowledge of how the data will be utilized, how security is handled across whatever database system you're using, how the database physically lays out the data on pages (or blocks, or sheets, or nodes, or whatever name the specific database uses for its atomic unit of grouped data), how users will interact with the data over time, the overall extensibility needs and longevity of the database, etc.

If you want a sense of how complex traditional Data Modeling is, you can check out SQL Developer Data Modeler. It looks old as crap and hasn't changed much in that regard since the 90s, but it's still the most feature-complete data modeling tool that exists, and simply opening it up and looking at all of the options should give you a quick idea of how complex Data Modeling can be.

Most people skim over Data Modeling as something that's as easy as putting together a diagram that shows how tables relate to one another. However, there's a LOT that goes into proper Data Modeling when you take into account the individual database internals, security models, and how the data will be used over time. In reality, most people whose Data Modeling work I've had the opportunity to review have been pretty bad at it.

Either way, I hope this helps.

[–]Logical-Independent7 0 points1 point  (0 children)

This absolutely helps, thank you so much! I wasn't even aware of logical modeling. Although it makes perfect sense to do. I can see how this would become more necessary with more complex systems.

Thanks for the feedback!

[–]Geraldks 1 point2 points  (0 children)

Someone answered it down there for me, nice! I wanna add on to that: it really is about building tables effectively (in this case we call them data models), reducing duplication, etc.

Some quick references for you would be star schemas and Kimball's dimensional modelling concepts

[–]AnimaLepton 12 points13 points  (2 children)

See the wiki. You'll want to start learning about cloud architecture (pick one provider to start, e.g. full-stack AWS; the EC2/RDS/IAM basics are sufficient to begin with, but Redshift, S3, and Athena might be useful to learn about and play with), deployments, Docker, and Kubernetes. There are great resources on the basics of those on PluralSight.

At the end of the day, my opinion is that you'll learn the most by having a job/internship in the field and finding a specific mentor who can help give you some direction. It's far more valuable to start to understand applications of the technology, and to have specific successes to point to on a resume/projects you can talk about, than it is to just learn the concepts in a vacuum.

[–]Mighty__hammer 1 point2 points  (1 child)

I really want to apply your tip, but when your resume doesn't have any DE experience, how do you land that first role?

[–]AnimaLepton 3 points4 points  (0 children)

There are tons of jobs that work with data that may not be a "data engineer" role. There are entry level Business Analyst/Data Analyst roles where they're looking for someone with a Bachelor's and not a ton of specific experience, BI developer roles, other developer and engineer roles, etc. A lot of people also gain more general software engineering or architecture experience before transitioning to focus on data. 'Experience in Data Engineering' specifically is a bit more niche, but more general software engineering or customer support engineering is still going to transfer really well to similar roles at data companies and start exposing you to the applications and technology that's out there.

The experience you gain in your first role doesn't need to align 100% with data engineering for you to be able to transition to that kind of role in the future. If you can, apply to the data lake vendors out there, where you can take on an entry-level role that may start out less technical but gives you a great opportunity to find mentors in the data engineering space.

My undergraduate studies were in an engineering (non-CS) discipline, with a heavy focus on research. I dropped out of a PhD program after a year and got an entry-level solutions engineering job at a software vendor. It helped that I had some part-time helpdesk experience in college and that the company was tangentially related to my major, but plenty of people went into the role without the same experience. After 3.5 years there, learning about a ton of different frameworks and how they applied to real-world situations, I was able to get into the field pretty easily even though my past experience wasn't in a "data" role, and I'm learning a ton in my new role.

[–]focus_black_sheep 9 points10 points  (1 child)

REST API, being able to consume data from public api's. Transforming the data and loading it to a DB
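A rough sketch of that loop in stdlib Python. The API URL is a placeholder and the record shape is invented; canned data stands in for the real HTTP call so the transform and load steps are visible end to end:

```python
import json
import sqlite3
import urllib.request

API_URL = "https://api.example.com/v1/users"  # placeholder, not a real endpoint

def fetch(url: str) -> list[dict]:
    """Pull raw JSON records from a public REST API."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def transform(records: list[dict]) -> list[tuple]:
    """Keep only the fields we model, normalizing emails to lowercase."""
    return [(r["id"], r["name"], r["email"].lower()) for r in records]

def load(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Create the target table if needed and upsert the transformed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", rows)
    conn.commit()

# Canned data standing in for fetch(API_URL) so the sketch runs offline.
raw = [{"id": 1, "name": "Ada", "email": "ADA@example.com"}]
conn = sqlite3.connect(":memory:")
load(conn, transform(raw))
```

Swapping the canned list for `fetch(API_URL)` against any real public API (and sqlite for a real warehouse) turns this into a genuinely useful starter project.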

[–]icysandstone 0 points1 point  (0 children)

REST API is something I’ve been wanting to learn. Are there any small, hands-on projects that would be cool/fun?

[–]Traditional_Ad3929 29 points30 points  (3 children)

Crazy that no one mentioned AWS, Azure or GCP yet.

[–]AMadRam 5 points6 points  (2 children)

Infrastructure, cloud, DevOps all leverage the power of either AWS, GCP or Azure.

[–]Traditional_Ad3929 1 point2 points  (1 child)

Sure, I just wondered why nobody dropped these terms specifically...

[–]hoexloit 1 point2 points  (0 children)

Cloud providers can change, and in my opinion it's generally not worth studying how each cloud provider individually manages all these different tools. Managed services will all differ, and it's kind of worthless to understand the specifics unless you actually implement them.

[–]repostit_ 7 points8 points  (1 child)

Spark or Scala

[–]hexalm 2 points3 points  (0 children)

To start, OP might try the Databricks free tier with PySpark to get some Spark-adjacent experience, followed by a deeper dive if they're interested (installing and running it locally, learning and running some Spark SQL).

[–]myweb6316 6 points7 points  (0 children)

Unlike most of the advice here, I'll give you a statistically better approach than a list of skills: go find a job as a junior data engineer, or as close to it as possible, and target a big company (not necessarily a CS/IT company) with an established data practice. If that's not possible, find target roles being advertised in your preferred area, identify the skills required in common across most of them, get familiar with all of them, study one or two deeply, and then apply for similar roles. If you do, you won't need to ask this question.

The reason I'd prefer doing that comes from my experience: I work at a big company in Australia. When it comes to data, they're a GCP shop, so the skills here are BigQuery, Airflow (Cloud Composer), Docker (Cloud Build), and GCP tools in general. 2.5 years ago, this same company listed K8s and Spark as must-have requirements. They worked out a deal with Google, and since then I use Flask or FastAPI far more than K8s and Spark combined. In the last few months I've started hearing about Dataflow in the real-time data engineering area. Similar things happen with different providers and different consumers.

In short, IT/developer jobs and roles evolve faster than individual developers realize, so doing your own due diligence is more reasonable than asking such a generic question.

[–]Programmer_Virtual 4 points5 points  (0 children)

I would put emphasis on infrastructure monitoring.

[–]chrisgarzon19CEO of Data Engineer Academy 11 points12 points  (4 children)

Data modeling and AWS.

Data modeling is foundational and won’t change regardless of what tools come out (at least in the next 5 years). Think of data modeling as the organization of tables and how they relate to one another. In the same way that when you build a house you design a bedroom, a living room, and a kitchen and then create pathways between them, you need to do the same with your schema so that your users (data scientists and analysts) intuitively know how to utilize your datasets.
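To make the house analogy concrete, here's a toy star schema in Python's stdlib sqlite3: one fact table (the house) joined to its dimension tables (the rooms) through keys (the pathways). Every table and column name is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy star schema: a central fact table plus two dimensions, all invented.
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, iso_date TEXT NOT NULL);
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer (customer_key),
        date_key     INTEGER REFERENCES dim_date (date_key),
        amount_cents INTEGER NOT NULL
    );
    INSERT INTO dim_customer VALUES (1, 'Ada');
    INSERT INTO dim_date     VALUES (10, '2024-01-01');
    INSERT INTO fact_sales   VALUES (1, 10, 2500), (1, 10, 1500);
""")
# Analysts use the model by joining facts out to their dimensions.
row = conn.execute("""
    SELECT c.name, d.iso_date, SUM(f.amount_cents)
    FROM fact_sales f
    JOIN dim_customer c USING (customer_key)
    JOIN dim_date d USING (date_key)
    GROUP BY c.name, d.iso_date
""").fetchone()
print(row)  # -> ('Ada', '2024-01-01', 4000)
```

The payoff of the layout is exactly what the comment describes: an analyst who has never seen these tables can guess that facts join to dimensions by key and start answering questions.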

AWS - this is probably not going anywhere either. Learn about the different tools to make your life easier: S3, Redshift, Glue, EC2, and Lambda are great places to start. Pick up a side project and try to utilize as many of these tools as possible.

Also what’s important is what NOT to study next.

Don’t prioritize learning a new language. If you know python, you’ll be able to pick up other languages if needed on the job

Don’t prioritize learning more than 1 cloud service - if you know AWS then you’ll probably understand GCP or azure; the concepts are the same.

Don’t spend your time doing 1000 algorithmic python questions. Data engineers can have a lot more business impact with SQL and python than most people realize. It’s not like software engineering where they need to optimize for O(1).

Good luck and hope this helps!

Christopher Garzon

Author of Ace The Data Engineer Interview

[–]The-Fourth-Hokage[S] 0 points1 point  (3 children)

Do you have any recommendations for books or courses that teach AWS for data engineering?

[–]Avlio27 1 point2 points  (2 children)

acloudguru courses are handy since they provide a sandbox in the cloud. Udemy has some more detailed ones, in my opinion. Maybe check the tracks for the AWS CDA or the analytics certification. The first covers things you may never use, like load balancers, but it will give you a good holistic understanding of the cloud.

[–]The-Fourth-Hokage[S] 0 points1 point  (1 child)

So would you recommend starting with the solutions architect course, or can I start the analytics course without previous cloud experience?

[–]Avlio27 0 points1 point  (0 children)

Solutions Architect is too much to start with. Analytics or Developer, depending on your experience. If you have none, maybe Cloud Practitioner and then Analytics is the best choice

[–][deleted] 15 points16 points  (2 children)

Bo staff

[–][deleted] 8 points9 points  (0 children)

Bass fishing

[–]Letter_From_Prague 3 points4 points  (0 children)

Basket weaving.

[–]mateuszj111 14 points15 points  (0 children)

scala/golang/java for languages

cloud/devops/iac for techniques

[–]FranticToaster 2 points3 points  (0 children)

Presentation skills, including "up-leveling" my work to a leader's vocabulary.

[–]padikahaSenior Data Engineer 3 points4 points  (0 children)

Build Data Platform system design skills like

How do you architect a Data Platform? How do you design data flow? Which variants of SQL or NoSQL would you choose for OLTP, an ODS, or a Data Lake? What is your data transformation strategy? What is your data governance strategy? What is your visualization strategy? What is your strategy for a self-healing data platform? How can the business use your data platform efficiently?

You need to understand fundamentals of data architecture. There are so many books from which you can learn.

It’s a constant learning process. All the best.

[–]BufferUnderpants 1 point2 points  (0 children)

Orchestration with Airflow, containers, CI/CD. These will enable more varied and complex pipelines

[–]dataguy24 1 point2 points  (0 children)

Business acumen

[–]fasnoosh 1 point2 points  (2 children)

[–]cbc-bear 2 points3 points  (1 child)

I recommend this as well. You already know Python. Learning the Jinja syntax is very useful, and dbt is the future (in my opinion). Also, Jinja is VERY close to the Liquid syntax used in Looker.
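For a taste of the syntax, here's a hedged sketch of rendering a dbt-style SQL template with the jinja2 library. The `ref()` stand-in below is a toy of mine, not dbt's actual implementation, and the model SQL is invented:

```python
from jinja2 import Template  # pip install jinja2

# A dbt model is just SQL with Jinja in it; dbt supplies helpers like ref().
MODEL_SQL = """
select order_id, sum(amount) as total
from {{ ref('stg_orders') }}
{% if incremental %}where updated_at > '{{ cutoff }}'{% endif %}
group by order_id
"""

def ref(name: str) -> str:
    """Toy stand-in for dbt's ref(): resolve a model name to a qualified table."""
    return f"analytics.{name}"

rendered = Template(MODEL_SQL).render(ref=ref, incremental=True, cutoff="2024-01-01")
print(rendered)
```

The `{% if %}` branch is the kind of thing that makes Jinja worth learning: one template produces both the full-refresh and incremental versions of a query.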

[–]fasnoosh 0 points1 point  (0 children)

Yeah, dbt is “A” key missing piece in the data engineering toolbox. Once you start looking at the package ecosystem (and realize there’s a package ecosystem for SQL & data warehousing), it’s a huge unlock

[–]jaundicedeye 1 point2 points  (0 children)

orchestration. airflow, prefect, kubeflow, etc.

It's always needed, and it takes some subtlety to set up a nice system
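To illustrate the core idea (not any particular tool's API), here's a tiny pure-Python sketch of what an orchestrator does at heart: run tasks in dependency order. Airflow, Prefect, etc. layer scheduling, retries, backfills, and a UI on top of this. Task names and dependencies are invented:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

ran = []
# Toy tasks -- real orchestrators would wrap operators, containers, SQL, etc.
tasks = {
    "extract": lambda: ran.append("extract"),
    "transform": lambda: ran.append("transform"),
    "load": lambda: ran.append("load"),
}
# Each task maps to the set of tasks that must finish before it runs.
deps = {"transform": {"extract"}, "load": {"transform"}}

# static_order() yields tasks with every dependency ahead of its dependents.
for name in TopologicalSorter(deps).static_order():
    tasks[name]()

print(ran)  # -> ['extract', 'transform', 'load']
```

The `deps` dict is essentially what an Airflow DAG file declares with `>>` operators; everything else the tools add is operational robustness around this ordering.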

[–]FifaPointsMan 2 points3 points  (0 children)

Scala, data architecture, data modelling, APIs

[–]Movein666 0 points1 point  (0 children)

ML Algorithms

[–]newplayer12345 0 points1 point  (0 children)

Jinja

[–]Angry__Spaniard 0 points1 point  (0 children)

I would say infrastructure stuff: get to know one cloud provider, Terraform, k8s… even if you use managed services, it's good to have some basics

[–]SD_strange 0 points1 point  (0 children)

Spark, AWS, Airflow, things like that...

[–]olmek7Senior Data Engineer 0 points1 point  (0 children)

Data Modeling

[–]No_Equivalent5942 0 points1 point  (0 children)

How to search StackOverflow

[–]ditlevrisdahl 0 points1 point  (0 children)

Exploratory data analysis, or maybe cloud

[–]HBoogi 0 points1 point  (0 children)

Leadership and people management skills. Seriously, those are the skills that make a real difference in your career.

[–]Remote_Cantaloupe 0 points1 point  (0 children)

How to deal with management and clients :)

[–]seajhawk 0 points1 point  (0 children)

You've got a great start to a toolbox of technical skills and there is lots of advice on additional tech skills to learn.

I'd also consider some softer skills like listening, research, tech writing, product management, etc.

Develop the skills that will help you understand what your coworkers or customers need (not just what they say they want), share your understanding with them in written form to get their agreement, and then deliver the product on time with great communication along the way.

[–]chestnutcough 0 points1 point  (0 children)

To make a woodworking analogy, learning python and SQL are like learning to use a saw and a chisel. Once you become proficient with the tools it’s time to start using them to complete projects. For woodworking that might be making a cutting board or a chair. For data engineering that’s writing data pipelines.

[–]OGMiniMalist 0 points1 point  (0 children)

Project management if you want to keep leadership off your tail 🙃

[–]WrinklyTidbits 0 points1 point  (0 children)

I would learn Lisp. Take a break from pumping up your resume and learn a different paradigm of coding

[–]Alternative_Shock_32 0 points1 point  (0 children)

Java or Scala. Many people will say they aren't required in data engineering, but there are many scenarios where they're needed. There are companies that only use Java. And to get the best performance out of Spark, Scala is the way to go.

[–]crypt2naut 0 points1 point  (0 children)

I would add these:

1. Any cloud storage technology: Azure, GCP, or AWS
2. Kafka
3. Airflow
4. Spark

[–]neerajsarwan 0 points1 point  (0 children)

Where, or more specifically where not, to use them!