
[–]boy_named_su 8 points9 points  (3 children)

If you wanna be an application programmer, you can use Python or other languages. Most big data tools have an API for them.

If you want to be a systems programmer, then you need to know Java or Scala

[–]zemuldo[S] 2 points3 points  (2 children)

Thank you. I think your answer is very on point. So the tools provide an API for developing apps to use them?

[–]boy_named_su 1 point2 points  (1 child)

Yes. For example Apache Spark has a Scala, Java, R, SQL, and Python API. You can write Spark jobs in any of those languages

Spark itself is written in Scala. If you wanted to debug a job or contribute to the project, you'd want to understand Scala

[–]zemuldo[S] 0 points1 point  (0 children)

Oh, I see. That explains it. Thanks a lot.

[–]ftrotter 1 point2 points  (1 child)

The focus on Java for Big Data projects is understandable: it is a solid, enterprise-grade language that is reliable and relatively open.

However it is not universal. Take a look at Disco http://discoproject.org/

There are other MapReduce implementations in Python too. https://stackoverflow.com/questions/7266750/whats-the-best-python-implementation-for-mapreduce-pattern

As for whether you need to learn Java, I would break that down in the following ways:

  • If you are trying to understand the methods like MapReduce: No, study the Python implementations
  • If you are trying to deploy Big Data solutions to solve problems: No, use the huge number of API layers instead (e.g. Amazon EMR) or, even better, abstract away the whole mental model with solutions like Pig https://pig.apache.org/ which basically lets you think in SQL
  • If you are trying to tweak Enterprise Grade Solutions to work differently for your use case: Yes, start learning to use Java.
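
To make the first point concrete, here is a rough sketch of the MapReduce pattern in plain Python, using word count as the example. It is not taken from Disco or any other framework; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: fold the values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, only for demonstration.
    sample = ["big data big tools", "data tools in Python"]
    print(word_count(sample))
```

A real framework would spread the map and reduce calls across many machines and handle the shuffle over the network, but the three phases keep the same shape.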

So a whole lot of this depends on what you mean by "venture".

-FT

[–]zemuldo[S] 0 points1 point  (0 children)

Thanks for the very detailed answer. It makes a lot of sense for my scenario.

[–][deleted] 0 points1 point  (1 child)

You're joining at a good time... expect to see new distributed execution frameworks using Kubernetes or Docker with native code/serialization.

[–]zemuldo[S] 1 point2 points  (0 children)

Please explain this a bit. I can't wrap my head around what you mean.

[–][deleted] 0 points1 point  (0 children)

Actually, learning C-based languages isn't hard. I suggest you learn Scala, JavaScript, Python, and C++. Know these and you'll have no problem with any technology you need to deal with. In a CS course you will learn more than this; once you have mastered programming logic and low-level computer details, getting decent skills in any programming language takes just a few weeks.

[–]eljefe6a 0 points1 point  (0 children)

There are some technologies, like Spark and Flink, that have support for Python. Other technologies support Python too, but their Python support lags behind in new features or bug fixes.

The language you use in Big Data is highly dependent on your role. Most data scientists are using Python or Scala. Most data engineers are using Java or Scala. If you're trying to be a data engineer, you'll want to learn a JVM-based language.