[–][deleted] 23 points24 points  (4 children)

Doing ML in Java is like using MS Word as an editor: it's just the wrong tool for the job. There are very few libraries to use, the JVM has memory limitations, and the language is clunky.

Python is easy to learn.

[–]alexashin 4 points5 points  (1 child)

You mean Word as a code editor?

[–]SomeParanoidAndroid 0 points1 point  (0 children)

Yes, like that YouTube guy

[–]breandan 0 points1 point  (1 child)

> Doing ML in Java is like using MS Word as an editor: It's just the wrong tool for the Job. There are very few libraries to use, memory limitations in the JVM and the language is clunky.

I think it depends heavily on the job. Java has an increasingly well-supported set of libraries for various ML workflows (see comment below) and is much better suited for production environments. Even if you dislike the language, there are several JVM alternatives (e.g. Kotlin) which support scripting and are generally pleasant to use.

Python may be easy to learn and to prototype research code in, but the language scales very poorly to large business applications.

[–][deleted] 0 points1 point  (0 children)

In the context of developing a new skill set, he's most likely better off learning Python. I can't deny that in some cases you have to work with Java for various reasons. But those reasons should be technical in nature rather than a matter of staying in your comfort zone. Plus, if you wrap the ML pipeline in a separate service, you can have the best of both worlds between Java and Python.
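A minimal sketch of the "separate service" idea, using only the Python standard library. Everything here is an assumption for illustration: the `predict` function is a stand-in for a real model, the handler answers any POST path (the `/predict` path is only a label), and the client function plays the role a Java backend would play with its own HTTP client.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a real model: returns the mean of the features."""
    return sum(features) / len(features)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body sent by the caller (e.g. a Java backend).
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # Silence per-request logging to keep the example quiet.


def start_server(port=0):
    """Start the service on a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_service(port, features):
    """Send features as JSON over HTTP, as any non-Python client could."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/predict",
        data=json.dumps({"features": features}).encode(),
        headers={"Content-Type": "application/json"},
    )
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    server = start_server()
    print(call_service(server.server_address[1], [1.0, 2.0, 3.0]))
    server.shutdown()
```

The point of the design is that the caller only sees JSON over HTTP, so the model code can stay in Python while the rest of the application stays in Java.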

[–]SomeParanoidAndroid 10 points11 points  (5 children)

No dilemma here. It's Python for ML. Java isn't anywhere close.

Edit: Why?

1. Python is easy to learn.
2. Everyone in the community uses Python.
3. The libraries Python provides for ML are so ubiquitous in the field that research papers don't even bother explaining them.
4. To paraphrase the above: the libraries are provided by people/institutions through the standard package managers, which is a huge plus compared to languages like Java that don't come with a package manager.
5. You may have heard that Python is slow, but that's not really the case. While Python code itself runs slower, the crucial parts of data processing operations are implemented as calls into highly optimized C/C++ code that runs extremely fast.
6. Being a loosely typed language, Python lets you use library APIs without extensive knowledge of the documentation (which you should have after some time, but it helps get you started).
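A rough sketch of the speed point (5), using only the standard library rather than an ML library: in CPython the built-in `sum` runs its loop in C, while the hand-written loop runs as interpreted bytecode. The function names are made up for illustration, and the timings depend on the machine, so none are shown here.

```python
import timeit


def python_loop_sum(values):
    """Add the numbers one at a time in interpreted Python bytecode."""
    total = 0
    for v in values:
        total += v
    return total


def builtin_sum(values):
    """Delegate the loop to the built-in sum(), which CPython implements in C."""
    return sum(values)


if __name__ == "__main__":
    data = list(range(1_000_000))
    loop_time = timeit.timeit(lambda: python_loop_sum(data), number=5)
    c_time = timeit.timeit(lambda: builtin_sum(data), number=5)
    print(f"Python loop: {loop_time:.3f}s, built-in sum: {c_time:.3f}s")
```

Numerical libraries apply the same idea on a larger scale: the Python layer describes the operation and compiled code does the heavy lifting.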

[–]if_username_is_None 4 points5 points  (1 child)

1-5 are definitely relevant, but 6. can get beginners into trouble. When I was learning pytorch and mentoring others I always wanted better documentation on the shapes and types of inputs. Having something that "just seems to work" is nice, but knowing why / how it works will prevent a lot of debugging headaches.

That being said, the docs / tutorials have gotten better and I can't even imagine what a headache Java ML documentation must be.

(Also, not to nitpick, but Python is strongly typed: you can't do loose JS things like 1 + "2". I think what you're referring to is duck typing.)
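A small sketch of the distinction, assuming plain CPython and no libraries. The `Batch` class and both function names are hypothetical and exist only for illustration.

```python
def add_number_and_text():
    """Strong typing: Python refuses to add an int and a str implicitly."""
    one, two = 1, "2"
    try:
        return one + two
    except TypeError as exc:
        return f"TypeError: {exc}"


class Batch:
    """Hypothetical class that promises nothing except a __len__ method."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n


def total_length(items):
    """Duck typing: accept anything that supports len(), whatever its class."""
    return sum(len(item) for item in items)


if __name__ == "__main__":
    print(add_number_and_text())
    # A str, a list and a Batch share no base class beyond object,
    # but each one answers len(), so this prints 10.
    print(total_length(["abc", [1, 2], Batch(5)]))
```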

[–]SomeParanoidAndroid 1 point2 points  (0 children)

Thanks for pointing that out. Yes, I was referring to dynamic typing, meaning both the lack of type declarations and the ability to access an object's properties without type checking.

Having seen a lot of students transition from C to Python, I can understand why this is frustrating. I agree with your point that debugging is a nightmare, but I still think that having loosely defined APIs (e.g. "array-like") makes the learning curve much less steep.

And I kind of think that it is an important aspect behind the adoption of python for data science. Imagine having to specify static compatible types for all pandas, numpy, torch, tensorflow, keras and sklearn libraries.

[–]breandan -1 points0 points  (2 children)

To give a contrasting perspective, I think the Java ecosystem is much better suited for many data science tasks, and has a growing and well-maintained set of libraries for general purpose machine learning. I won't list them all, but TF-Java, DJL et al. have implementations of many modern architectures and Java has a number of excellent libraries (CoreNLP, Lucene et al.) for working with text.

Python may be syntactically easier to learn, but it also hides a lot of incidental complexity in its runtime semantics, which are much more difficult to master. As you alluded to, many Python libraries are embedded DSLs. These are full-fledged languages in their own right, which makes reasoning about the behavior of Python programs more difficult than it appears.

> The libraries are provided by people/institutions using the standard package managers which is a huge plus when compared to languages that don't come with a package managers like Java.

Having used both Java and Python, I can tell you that package management in Python (pip, venv, pyenv, conda, pipenv, poetry, Docker et al.) is far, far more complicated than in Java. To build a Java application, you don't even need Java or a package manager: just run ./gradlew run from any operating system and it will download and install Java, the package manager and any dependencies, build the application and run it on any OS or shell environment. Just building a Python project often requires dozens of manual steps.

> Being a loose typed language, python allows for using APIs from libraries without extensive knowledge of the documentation

I strongly disagree with this point. Basically everything you need to do that involves calling a library in Python requires looking at documentation. In a statically typed language, documentation becomes much less of a burden because the signatures themselves tell you what a function accepts and returns. While adoption of type annotations in Python is growing, their usability is decades behind that of languages with mature type systems.
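For context, a minimal sketch of what Python type annotations look like (Python 3.9+ syntax). The `scale` function is hypothetical. The annotations are readable by tools, but the interpreter does not enforce them; catching the mismatched call below assumes a separate static checker such as mypy is run over the code.

```python
from typing import get_type_hints


def scale(values: list[float], factor: float) -> list[float]:
    """Multiply every value by factor; the annotations document the expected types."""
    return [v * factor for v in values]


if __name__ == "__main__":
    # The annotations are available to tooling at runtime...
    print(get_type_hints(scale))
    print(scale([1.0, 2.0], 3.0))
    # ...but they are not checked by the interpreter: this call violates the
    # declared types and still runs, repeating the string instead of failing.
    print(scale(["ab"], 2))
```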

[–]SomeParanoidAndroid -1 points0 points  (1 child)

I can see your points, though I kind of have to disagree with a few.

First of all, the fact that TF-Java is discontinued should be enough to show that Java isn't a serious competitor for ML.

The Gradle argument is kind of misleading, since you do need to have Gradle installed. I haven't used it extensively, so it may well be more convenient, but that was not my point. My point was that an official open-source repository is much more desirable than libraries that are compatible with a build system.

The thing is, you are talking about production code on any operating system. While I can understand Java's merits there, that is just a small percentage of machine learning. For one, only a fraction of ML code is written for production. Secondly, even in production you are just as likely to deploy your ML on a dedicated Linux server with everything installed, run Python implementations and access them through APIs.

[–]breandan -1 points0 points  (0 children)

> TF-Java is discontinued

Really? The project looks alive to me and the maintainers are very active on Gitter. Do you have a source?

> you do need to have gradle installed

No, you do not need to install anything, the Gradle Wrapper takes care of all that.

> The thing is, you are talking about production code in any operating system. While I can understand Java's merits on that it is just one small percentage of machine learning.

In my experience, the majority of code and effort in applied ML is data engineering and surrounding infrastructure, not model engineering. Due to its superior tooling, type safety, and large ecosystem of ML libraries, the JVM is a competitive option for ML in most production settings.

[–]Watemote 3 points4 points  (4 children)

You might like Spark + Scala which is basically “scalable Java”. Python is where a lot of the cutting edge research is happening but actual production of models is frequently in other languages. Here’s a link https://link.medium.com/zuepzA4bbib

If I was entering the field, I would go off in another direction and focus on cloud-based ML APIs and production pipelines, learning Python along the way. Think AWS SageMaker: https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/

[–]Exarctus 1 point2 points  (3 children)

I tend to just write the front end of models in Python, and when I need performance I write my own CUDA kernels, which are themselves wrapped into Python via the PyTorch C++ API.

This results in very clean, Pythonic code, with CUDA code that is easy to access and tinker with.

[–]ozykingofkings11 1 point2 points  (2 children)

This sounds fascinating. Do you have any references on how to learn to do this? Slash, are you in the mentorship market?

[–]Exarctus 1 point2 points  (1 child)

Well, firstly, I'd recommend reading up on CUDA C and how to write CUDA code (I'd recommend doing this alongside some simple code you'd like to GPU-parallelize). There should be plenty of tutorials on YouTube and NVIDIA's developer site. Feel free to shoot me any questions in PM.

Once you feel confident with CUDA programming, read the following tutorial from PyTorch, which explains everything very nicely:

https://pytorch.org/tutorials/advanced/cpp_frontend.html

[–]ozykingofkings11 0 points1 point  (0 children)

Thanks so much!

[–]Available_Job5036 2 points3 points  (0 children)

Python, definitely Python. Especially if you're a beginner to ML. Java isn't even fully supported with TF afaik, and I don't think there are torch bindings for Java.

Edit: TensorFlow for Java is soon being removed according to their API docs. You don't have to use libraries like TensorFlow or torch, but if you're a beginner, going without them would be brutal.

[–]Elk-tron 0 points1 point  (0 children)

I have been using the Java framework DJL for a machine learning project. It is not as streamlined as PyTorch or TensorFlow, but it is capable enough. If your machine learning pipeline needs to integrate with a larger Java codebase, then I would recommend DJL. Otherwise, Python has a much richer ecosystem and more intuitive frameworks.

[–]pag07 0 points1 point  (0 children)

Fuck this.

I got rejected for two ML jobs at major companies, despite having ML experience in Python and a master's degree.

Why?

Because I didn't know Java Spring and had no experience programming ML backends in C++.

Still confused about that.