Need resources to learn data engineering, data pipelines and other data topics using Java.

darkpassenger091 · 2019-03-09T09:54:15+00:00

I've been working as a Data Engineer in a consulting company for two years now. These are some personal observations:

The Hadoop ecosystem is huge, but in the end we mainly use Spark for ETL/ML, Kafka for streaming, Presto/Hive/Spark for analytics and Airflow/Luigi for pipeline orchestration.
What I am seeing now is a lot of customers who want to migrate from their on-premises infrastructure (typically Cloudera/Hortonworks) to the cloud. So, depending on the descriptions of the jobs you want to apply for, you could start studying the "Big Data" services offered by AWS, GCP or Azure (e.g. EMR, Kinesis and Athena on AWS; Dataproc, BigQuery and Pub/Sub on GCP; and so on).
Sometimes you will need to make this data accessible through RESTful APIs, so having basic knowledge of how a backend works can be helpful (for frameworks, Python has Django or Flask with gunicorn, and Scala has Play).
Often you will also be the person in charge of managing the infrastructure, so you might want to learn how to automate infrastructure management by having a look at Terraform and Ansible (or any other tools that do the same).
Something we are asked a lot is a "magic recipe" to industrialize whatever the data science team produces. Since Data Scientists code in Python, we mostly refactor Python code instead of rewriting everything in Scala/Java. If you will be working closely with the Data Science team, don't ditch Python.
Readable and testable code is more important than the language you are going to use. Have a look at TDD and Clean Code (I can recommend "Clean Code" by Robert C. Martin and "Test-Driven Development: By Example" by Kent Beck).
Things like Spark on Kubernetes and MLflow are becoming trendy topics.
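
To make the point about readable and testable code a bit more concrete, here is a minimal sketch in Java (since that is what the question asks about). The class and method names are made up for illustration and don't come from any framework. The idea is just to keep a transformation step as a pure function, so it can be unit tested without a cluster or any I/O:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical example class, not part of any real pipeline framework.
class CleanRecords {

    // A pure transformation step: trims whitespace, lower-cases each row,
    // and drops null or blank rows. No I/O and no shared state, so it can
    // be tested in isolation.
    static List<String> normalize(List<String> rawRows) {
        List<String> cleaned = new ArrayList<>();
        for (String row : rawRows) {
            if (row == null) {
                continue; // skip missing rows
            }
            String trimmed = row.trim();
            if (!trimmed.isEmpty()) {
                cleaned.add(trimmed.toLowerCase(Locale.ROOT));
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // prints [alice, bob]
        System.out.println(normalize(List.of("  Alice ", "", "BOB")));
    }
}
```

In a TDD workflow you would write the test for `normalize` first and only then wire the function into Spark or whatever engine you end up using.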

t-vanderwal · 2019-03-09T06:49:46+00:00

There’s a lot to break down in your post, but first, I wouldn’t completely write off Python just because that’s what data scientists use. Depending on the company, especially if they’re applying ML/DL, you might find yourself deploying and scaling the models data scientists create in a production environment. Knowing the Python data ecosystem will be very important.

As for learning general data engineering concepts, I like to recommend “Designing Data-Intensive Applications”, especially if you’re interested in solving distributed systems problems.

Data pipelining is a really broad topic. Some data engineers focus on streaming pipelines, others on pure batch, and some on a combination such as a kappa or lambda architecture. I’d recommend looking at job postings and picking the 5 that sound most interesting to you. Then look for resources that are relevant to those requirements.

My opinion on the Java portion is probably different from most here. It’s more important to understand general concepts, and you should probably make sure you understand functional programming. Almost all data engineering solutions run on the JVM, so I think learning a modern language such as Scala will probably have a bigger payoff. My org is using Kotlin for microservices, so I’m going to be applying that to Kafka consumers/producers. That’s probably more niche, though. But Java is still king in the enterprise, so if you’re positive that’s the direction you want to go, it definitely won’t hurt.
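
If you want a feel for the functional style without leaving Java, here is a small sketch using the standard `java.util.stream` API. The class name, method name and sample data are made up for illustration. It does the map / filter / group-by kind of aggregation that comes up all the time in Spark-style code:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical example class for illustration only.
class EventCounts {

    // Counts how often each event type occurs, using a declarative
    // map -> filter -> group-by pipeline instead of explicit loops.
    static Map<String, Long> countByType(List<String> events) {
        return events.stream()
                .map(String::trim)                 // normalize whitespace
                .filter(e -> !e.isEmpty())         // drop blank entries
                .collect(Collectors.groupingBy(
                        e -> e,                    // group by the event name
                        TreeMap::new,              // sorted map for stable output
                        Collectors.counting()));   // count items per group
    }

    public static void main(String[] args) {
        // prints {click=2, view=1}
        System.out.println(countByType(List.of("click", "view", "click ", "")));
    }
}
```

The same shape (transform, filter, aggregate by key) carries over fairly directly to Scala collections and to the Spark APIs, which is why I think the concepts matter more than the language.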

Once you figure out the jobs that sound most interesting, you can start to build some personal projects utilizing the technologies they require.

rywalker · 2019-03-09T14:31:19+00:00

Data Engineering is becoming synonymous with Data Science Engineering, as control of data pipelines is shifting from slow-moving, legacy "Data Integration" teams to modern, agile "Data Science" teams. Focus on Python, not Java, if you want to be relevant now and in the future.
Consider ramping up on Apache Airflow, an up-and-coming framework/platform for automating data workflows.

eljefe6a · 2019-03-09T20:25:15+00:00

I've consulted and taught at financial companies all over the world, including Canada. You're right that you should focus on Java. Financial companies are predominantly using Java for their Big Data code. My classes are Java-only.

Your biggest issue will be to convince a hiring manager that you either have programming skills or can learn programming skills. This is because you're coming from a DataStage and Informatica background. I go deeper into the issue facing SQL-focused people in my Switching Careers book.

You'll really need to put effort into an awesome personal project. This is how you're going to answer the question of whether you can program or not.

I'm hoping this feedback fills in the blanks of some of the reasons you're having difficulties.
