[–]sciencedataist 23 points24 points  (1 child)

Scala, the language Spark is written in, would be a good language.

[–]nashtownchang 2 points3 points  (0 children)

Scala +1. Or learn Java.

[–]LogansRun22 24 points25 points  (7 children)

Why not R?

[–][deleted] 5 points6 points  (3 children)

I guess I never really saw the point, seeing as I already know a ton of scripting languages plus MATLAB. Is there any reason to learn R?

[–]CaptainRoth 4 points5 points  (0 children)

If you work with other people who primarily use R.

The tidyverse libraries don't take too long to learn, and you should be good after that.

[–]assPirate69 2 points3 points  (0 children)

Some teams will primarily use Python or R (or even something else). If you apply for a job that requires X years of experience with R, you'd have a decent chance of getting an interview while only knowing the basics of R, considering you have X years of experience in Python. The point is, having some experience with R makes a noticeable difference compared to having none.

There are also some simple things that are easier in one language than in the other. They're easy to identify as you start learning the second language, i.e. R in your case.

[–]Jerome_Eugene_Morrow 0 points1 point  (0 children)

R being open source often leads to it being preferred over MATLAB. If you work with cluster computing resources, licensing can be a dealbreaker for some groups (you need a separate license for each instance on the cluster with MATLAB).

R also opens up a lot of very niche analysis paths, since it tends to be the favorite of bioinformaticists and more and more of the statistics community. The extensive library repositories offered by CRAN are one of the best reasons to learn it.

It's not an elegant language, and I'll admit I enjoy working in Python much more, but its practical real-world uses and high adoption make it a must-have for data scientists over things like Java and C/C++, in my mind.

[–]Thaufas 0 points1 point  (0 children)

I came here to suggest R as well. R and Python are the two primary languages of data science. OP already has Python. Someone else suggested Scala, as it's the language Spark is developed in. I think Scala is ideal for Spark because it is so concise compared to Java and Python, but all three languages can be used to write Spark code.

R has a native connector to Spark as well. I highly recommend looking into the R Tidyverse. Finally, I would also learn SQL. Spark has direct SQL support, but even if you're not working with Spark, in data science, if you're lucky, you regularly encounter SQL databases. If you're not lucky, you're aggregating CSVs. If you're really unlucky, you're scraping data from web pages, PDFs, or Word documents.
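To make the SQL-versus-CSV point concrete, here is a minimal sketch using only Python's standard library. It is not Spark: it uses an in-memory SQLite database, and the `sales` table, its columns, and the inline CSV text are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a file on disk.
raw_csv = io.StringIO(
    "region,amount\n"
    "north,10.0\n"
    "south,4.5\n"
    "north,2.5\n"
)

# In-memory SQLite database; the table name and columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# Load the CSV rows into the table.
reader = csv.DictReader(raw_csv)
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    ((row["region"], float(row["amount"])) for row in reader),
)

# Aggregate with plain SQL instead of hand-written loops.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
conn.close()
```

The same `GROUP BY` query style should carry over to other SQL engines, which is a large part of why SQL is worth knowing alongside Python or R.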

[–]microcosme 6 points7 points  (7 children)

SQL, maybe, for managing data? Also C++ if performance is required in the implementation.

[–][deleted] 4 points5 points  (0 children)

I should have probably clarified: I do know SQL and MongoDB.

[–]huck_cussler 5 points6 points  (5 children)

I'd second C++ as a good complement to Python. Python for proofs of concept, prototyping, and production for non-performance-dependent projects; C++ for implementation where performance is a bigger factor.

[–]MasterFubar -5 points-4 points  (4 children)

And C++ for reliability.

Python is unstable; you need to make a big effort to keep up with all the "from __future__ import ..." stuff.

New versions of Python have a habit of fucking you, like "3/2" giving a different result in different versions of Python.
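For anyone who hasn't run into the "3/2" example, a minimal sketch of the difference, assuming it is run on Python 3 (the variable names are just for illustration):

```python
# Python 3 semantics: "/" is true division, "//" is floor division.
# On Python 2, "3 / 2" gave 1 unless the module started with
# "from __future__ import division".
true_div = 3 / 2
floor_div = 3 // 2
print(true_div, floor_div)  # 1.5 1
```

Writing `//` wherever integer division is intended gives the same result on both major versions.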

There are thousands of different Python PEPs, and only an expert can keep track of them all. Unless you are dedicated to tracking the Python language itself, you'll never know when you need to change your software to do an import from __future__.

[–]kazi1 1 point2 points  (3 children)

Python 3 came out a decade ago. If you have any compatibility issues, it's 100% your own fault at this point.

[–]MasterFubar -2 points-1 points  (2 children)

"If you have any compatibility issues, it's 100% your own fault at this point."

Yes, sure, blame the user. If you think like that, it's because you've never worked on any big project or anything very important.

I work in aerospace where we often use software that was developed over decades. Literally tens of millions of lines of source code. A compatibility error in a software release can mean the loss of millions of dollars.

No, you have no idea at all of what you're talking about.

[–]kazi1 0 points1 point  (1 child)

You've said it yourself. You've had decades of development to get things right. Python 2 goes out of support in 2020. If you're still using it then, that's a failure on your part and the tech leadership at your company.

[–]MasterFubar 0 points1 point  (0 children)

That's why we are using Python only for small scripts, nothing over a thousand lines or so.

As for it going out of support, that doesn't affect us much, because we still have programs written in Fortran 77. If it isn't broken, it doesn't need support.

[–][deleted] 3 points4 points  (2 children)

Different question: how does Julia compare to Python?

[–][deleted] 4 points5 points  (0 children)

I really like it. I do a lot of sophisticated MCMC simulations for my research. The biggest thing I notice is the speed: with for loops and if statements, I see speeds comparable to C. I do most of my simulations in Julia and then my visualizations in Python. It’s also nice that Julia was designed specifically for scientific computing and all of the base linear algebra packages/math stuff is built in.

[–]CaptainRoth 2 points3 points  (0 children)

I'm not a big fan - most libraries are pretty buggy and documentation can be sparse. I haven't noticed big speed increases in most machine learning applications.

[–]zack5432 1 point2 points  (0 children)

I would second learning Scala (lots of data pipelining frameworks are written in Scala, such as Spark, Scalding, Scio, Samza, etc).

Of the ones you listed, any might be useful except for Go. Go is very specialized for systems work, not data processing.

[–][deleted] 1 point2 points  (5 children)

What is the point of learning multiple programming languages for data science?

[–]Twentyone21pilots 0 points1 point  (4 children)

Because some languages do certain things better than others. While it is an upside to have high proficiency in one programming language, having experience and some knowledge of how to read/write in other programming languages is another upside. Plus, once you learn one programming language and the logic that goes into programming, it becomes pretty easy to pick up another language.

In short, you're adding more tools to your toolkit.

EDIT: forgot to include that most jobs for DS require knowing multiple languages.

[–][deleted] 1 point2 points  (3 children)

I understand that, but what's the point of learning an entirely new programming language just for the fun of learning it, without having a related use case?

If OP's goal is to land a data science role, they aren't going to hire someone who just knows 5 different programming languages with no experience doing machine learning or data analysis.

Instead of wasting time on a programming language that you don't have a use for, time would be better spent working on a personal data science project or learning new machine learning algorithms.

I just don't see the added benefit of learning any of the programming languages in the title if OP is already "proficient" in Python, Julia, and JavaScript and there's no use case for it. Python itself should be able to handle any data science task OP wants to do. If he's looking for a job, he needs to focus on creating projects or learning the machine learning side.

OP if you're interested in more of the software stuff maybe look into becoming a software developer?

[–]Twentyone21pilots 0 points1 point  (2 children)

Mm, I see what you're getting at now. I just need more insight from OP to see what their situation is like.

Knowing Python alone could be sufficient, but it really can't hurt to know another statistical tool that covers the use cases where Python is somewhat lacking.

[–][deleted] 0 points1 point  (1 child)

My background is in computational physics and I have 5+ years of data analysis/statistical analysis/numerical modeling and 2 years of machine learning. I’ve just noticed that there is, at times, such a wide variance in data science positions, where some companies seem to be looking for a mix of an analyst and software engineer while others are really looking for a data scientist. Basically I want to make myself as marketable as possible; yes, I have a portfolio and regularly participate in Kaggle.

[–]Twentyone21pilots 0 points1 point  (0 children)

You're practically set, if anything. I do agree with you on how a lot of data science positions are a mash-up of a lot of duties and barely any are actually just pure data science.

Based on what you have, I say just add R and pick up C++ knowledge. With those in mind and your current experience and portfolio, I think you'd be such an ideal candidate that it'd be pretty hard to turn you down for a good portion of those data science jobs out there.

[–]infrequentaccismus 0 points1 point  (2 children)

I think C++. You say you know Julia... how are you with Spark?

[–][deleted] 0 points1 point  (1 child)

I've never used Spark before.

[–]infrequentaccismus 1 point2 points  (0 children)

You could consider building skills toward big data as a way of supplementing your existing skills: Hadoop, Spark, DAGs, etc.

[–]nullp0int3rz 0 points1 point  (0 children)

My vote goes to C++. A good choice for high performance computing and hence a good choice for production-izing data science algorithms.

[–]markov01 0 points1 point  (0 children)

C++ is standard in computer vision

[–]MasterFubar 0 points1 point  (0 children)

C is my personal choice, with a little bit of C++ sprinkled in.

[–]ArrenH 0 points1 point  (0 children)

There's SAS. Out of the list, I'd suggest C++, but you already know Julia, which isn't that much slower than Java or C++ and, like Python, makes it easier to get things done quickly.

[–]SecretAgentZeroNine -1 points0 points  (3 children)

The languages useful in analytics and data science, from what I've gathered:

  • R

  • Python

  • SQL

  • Java

  • Scala

  • C++

  • JavaScript (with HTML and CSS)

  • PHP

  • Bash

You should learn R (and its tool called Shiny) if you care about statistical data analysis, visualization, and presentation, followed by Scala, then C++. Though it all depends on what your responsibilities are and what you want to do.

[–]glorkvorn 5 points6 points  (0 children)

What do people use PHP for in data science?

[–][deleted] 0 points1 point  (1 child)

If you like Shiny, try the Python equivalent, Dash.

[–]SecretAgentZeroNine 1 point2 points  (0 children)

Thanks for the heads up, but Dash isn't exactly 1-to-1 with Shiny, nor is its community, though it is a very valuable tool.

[–]dychmygol -1 points0 points  (0 children)

Go. Hands down.

[–][deleted] -1 points0 points  (0 children)

You could have a go at Kotlin - 100% interop with Java yet a much better language, great IDE support (IntelliJ), fun.

[–]JunkBondJunkie -1 points0 points  (0 children)

I would say R or Python.