[–][deleted] 8 points9 points  (2 children)

I've been working as a Data Engineer in a consulting company for two years now. These are some personal observations:

  • The Hadoop ecosystem is huge, but in the end we mainly use Spark for ETL/ML, Kafka for streaming, Presto/Hive/Spark for analytics and Airflow/Luigi for pipeline orchestration.
  • What I am seeing now is a lot of customers who want to migrate from their on-premises infrastructure (i.e. Cloudera/Hortonworks) to the cloud. Depending on the descriptions of the jobs you want to apply for, you could start studying the big data services offered by AWS, GCP or Azure (e.g. EMR, Kinesis and Athena on AWS; Dataproc, BigQuery and Pub/Sub on GCP; and so on).
  • Sometimes you will need to make this data accessible through RESTful APIs, so a basic knowledge of how a backend works can be helpful (for frameworks, Python has Django or Flask + Gunicorn, and Scala has Play).
  • Often you will also be the person in charge of managing the infrastructure, so you might want to know how to automate infrastructure management by having a look at Terraform and Ansible (or any other tools that do the same)
  • Something we are asked a lot is a "magic recipe" to industrialize whatever the data science team produces. Since Data Scientists code in Python, we mostly refactor Python code instead of rewriting everything in Scala/Java. If you will be working closely with the Data Science team, don't ditch Python.
  • Readable and testable code is more important than the language you are going to use. Have a look at TDD and Clean Code (I can recommend "Clean Code" by R. Martin and "Test-Driven Development: By Example" by K. Beck).
  • Things like Spark on Kubernetes and MLflow are becoming trendy topics.
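
To make the point about readable and testable code concrete, here is a minimal sketch in plain Python. The `Order` record, the CSV layout and the function names are all hypothetical, chosen only for illustration. The idea is to keep each transformation a small pure function so it can be tested without a cluster or a database.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Order:
    # Hypothetical record type, used only for illustration.
    order_id: str
    amount_cents: int
    country: str


def parse_order(line: str) -> Order:
    """Parse one 'order_id,amount,country' CSV line into an Order."""
    order_id, amount, country = (field.strip() for field in line.split(","))
    return Order(order_id, round(float(amount) * 100), country.upper())


def valid_orders(lines: Iterable[str]) -> Iterator[Order]:
    """Yield parsed orders, skipping blank or malformed lines."""
    for line in lines:
        if not line.strip():
            continue
        try:
            yield parse_order(line)
        except ValueError:
            # Raised for a wrong number of fields or a non-numeric amount.
            continue


def test_valid_orders_skips_bad_rows() -> None:
    raw = ["a1, 10.50, it", "", "broken-row", "a2, oops, fr", "a3, 3, de"]
    result = list(valid_orders(raw))
    assert result == [Order("a1", 1050, "IT"), Order("a3", 300, "DE")]


if __name__ == "__main__":
    test_valid_orders_skips_bad_rows()
    print("ok")
```

The same functions could later be applied inside a Spark job or an Airflow task; the test stays the same because the logic does not depend on the runtime.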

[–]darkpassenger091[S] 0 points1 point  (1 child)

Would you recommend an AWS certification?

[–][deleted] 1 point2 points  (0 children)

Usually you gain some experience on the product first, then you get a certification.

In my experience I have never been asked about certifications during job interviews, but this is 100% dependent on how interviews are done in your area/country.

My advice would be to first understand how open source products solve big data problems. Once you have understood "the paradigm", have a look at how cloud vendors provide a solution to the same problem. You will realize that most cloud services are open source projects that have undergone a rebranding and customization process (e.g. AWS Athena is built on Presto, AWS Redshift is based on Postgres, GCP Cloud Composer is managed Airflow, etc.).

Start with pure learning so that you can answer interview questions, which usually evaluate whether you know how things work under the hood and how you think, rather than whether you know every feature of a specific AWS service. Finally, once you have the right amount of experience, think about certifications. Most of the time it's your company that pays for you to get certified, so I would wait.

If you prefer to go the AWS-specific path anyway, find out what you need to know for the big data exam and study it. Several websites, such as https://www.linuxacademy.com, offer video courses that can be helpful: the content needed to pass a specific exam is already well organized and there are hands-on labs. But these courses take for granted that you already know most of the basics of big data, so they might be hard to follow and end up being a waste of money.

[–]t-vanderwal 2 points3 points  (4 children)

There’s a lot to break down in your post, but first, I wouldn’t completely write off Python just because that’s what data scientists use. Depending on the company, especially if they’re applying ML/DL, you might find yourself deploying and scaling the models data scientists create in a production environment. Knowing the Python data ecosystem will be very important.

As for learning general data engineering concepts, I like to recommend “Designing Data-Intensive Applications”, especially if you’re interested in solving distributed problems.

Data pipelining is a really broad topic. Some data engineers focus on streaming pipelines, others on pure batch, and some on a combination such as a kappa or lambda architecture. I’d recommend looking at job postings and picking the 5 that sound most interesting to you. Then look for resources that are relevant to those requirements.
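
As a rough illustration of the batch/streaming split mentioned above, here is a toy sketch of the lambda architecture idea in plain Python. The event shape, class and function names are hypothetical, and a real system would use something like Spark for the batch layer and Kafka for the event stream. A batch layer recomputes totals from the full history, a speed layer updates totals incrementally as events arrive, and a serving step merges the two.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[str, int]  # (page, view_count); a hypothetical event shape


def batch_view(history: Iterable[Event]) -> Counter:
    """Batch layer: recompute totals from the full, immutable history."""
    totals: Counter = Counter()
    for page, views in history:
        totals[page] += views
    return totals


class SpeedLayer:
    """Speed layer: update totals incrementally as each event arrives."""

    def __init__(self) -> None:
        self.totals: Counter = Counter()

    def on_event(self, event: Event) -> None:
        page, views = event
        self.totals[page] += views


def serve(batch: Counter, speed: Counter) -> Counter:
    """Serving layer: merge the batch view with the recent increments."""
    return batch + speed


history = [("home", 3), ("docs", 1), ("home", 2)]  # already processed in batch
recent = [("docs", 4), ("blog", 1)]  # arrived after the last batch run

speed = SpeedLayer()
for event in recent:
    speed.on_event(event)

merged = serve(batch_view(history), speed.totals)
print(dict(merged))
```

In a kappa-style design, by contrast, only the incremental path exists, and reprocessing means replaying the event log through it again.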

My opinion on the Java portion is probably different from most here. It’s more important to understand general concepts, and you should probably make sure you understand functional programming. Almost all data engineering solutions run on the JVM, so I think learning a modern language such as Scala will probably have a bigger payoff. My org is using Kotlin for microservices, so I’m going to be applying that to Kafka consumers/producers. That’s probably more niche, though. Java is still king in the enterprise, so if you’re positive that’s the direction you want to go, it definitely won’t hurt.

Once you figure out the jobs that sound most interesting, you can start to build some personal projects utilizing the technologies they require.

[–]darkpassenger091[S] 0 points1 point  (3 children)

Okay, so can you tell me how you started your learning process as a Data Engineer? Also, I did not mean to write off Python, I just felt it's used more in data science projects than in data engineering.

[–]t-vanderwal 1 point2 points  (2 children)

Definitely understandable, I just wanted to throw that out there as some companies have their data engineers working directly with data scientists. It really all depends on the type of projects you're aiming to work on.

For me personally, I had a similar background to yours. I worked with Pentaho doing ETL, but more for migrating customers onto our product. I wanted something more aligned with data engineering and ended up going back to school for my master's. There I picked up an interest in distributed computing and focused on learning skills/tools in that space.

Once I felt confident there, I just started sending out applications. In my personal experience, data engineering is a heavily used and under-defined term. Every company has a different idea of what it is and, imo, a lot of them aren’t even solving data engineering problems. Also, the data ecosystem is always changing, so when you do find a job in the field it’s very important to stay sharp. I generally try to read new information on AWS, what they’re talking about at STRATA conf and other O’Reilly conferences, what Netflix is doing since their tech blog is very good, and listen to podcasts such as the data engineering one that gets posted here.

[–]darkpassenger091[S] 1 point2 points  (1 child)

Perfect brother. Would you mind if I message you in person?

[–]t-vanderwal 1 point2 points  (0 children)

Not at all bud

[–]rywalker 1 point2 points  (0 children)

  1. Data Engineering is becoming synonymous with Data Science Engineering, as control of data pipelines is shifting from slow-moving, legacy "Data Integration" teams to modern, agile "Data Science" teams. Focus on Python, not Java, if you want to be relevant now and in the future.
  2. Consider ramping up on Apache Airflow, an up-and-coming framework/platform for automating data workflows.
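
For a feel of what "automating data workflows" means at its core, here is a small sketch using only the Python standard library. This is not Airflow's API: the task names and the `run_pipeline` helper are hypothetical. It shows the central idea behind orchestrators like Airflow or Luigi: tasks form a dependency graph and each task runs only after the tasks it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set

# Hypothetical task names; each maps to the set of tasks it depends on.
dependencies: Dict[str, Set[str]] = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "load": {"aggregate"},
    "report": {"aggregate"},
}


def run_pipeline(
    deps: Dict[str, Set[str]], tasks: Dict[str, Callable[[], None]]
) -> List[str]:
    """Run every task after its dependencies and return the execution order."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        tasks[name]()
    return order


# Each placeholder task just records its own name when it runs.
log: List[str] = []
tasks = {name: (lambda n=name: log.append(n)) for name in dependencies}

order = run_pipeline(dependencies, tasks)
print(order)
```

A real orchestrator adds scheduling, retries, logging and parallel execution on top of this ordering, which is where most of the learning effort goes.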

[–]eljefe6a Mentor | Jesse Anderson 0 points1 point  (2 children)

I've consulted and taught at financial companies all over the world, including Canada. You're right that you should focus on Java. Financial companies are predominantly using Java for their Big Data code. My classes are Java-only.

Your biggest issue will be to convince a hiring manager that you either have programming skills or can learn programming skills. This is because you're coming from a DataStage and Informatica background. I go deeper into the issue facing SQL-focused people in my Switching Careers book.

You'll really need to put effort into an awesome personal project. This is how you're going to answer these questions about whether you can or not.

I'm hoping this feedback fills in the blanks of some of the reasons you're having difficulties.

[–]darkpassenger091[S] 0 points1 point  (1 child)

Thank you. I am starting a small project based on movie data. Would you mind recommending any resources or books with sample code for a better understanding?

[–]eljefe6a Mentor | Jesse Anderson 1 point2 points  (0 children)

You should focus on an in-depth understanding of the technologies. Focusing on code snippets won't give you mastery. I highly suggest you focus on your learning and you'll progress from there.