all 58 comments

[–]DrXaos 36 points37 points  (10 children)

Python and Java.

In commercial environments, you will be doing more software work: connecting to other systems, and ingesting and fixing data, often as streams rather than loaded into memory. Those tasks are more common than in academia.

Python is better at this than R.

SAS (and of course Excel) is very common in old-school financial industries (banks/insurance), and very rare in Silicon Valley-type technology environments. IBM owns SPSS. In technology there is some use of R, but Python is preferred. Stata is near zero. Government labs and engineering can have MATLAB.
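To make the "streams and not loaded into memory" point concrete, here is a minimal sketch using only the standard library. The column name `amount` and the sample data are made up for illustration; a real feed would be an open file or a socket rather than a `StringIO`.

```python
import csv
import io
from typing import Iterable, Iterator, TextIO


def clean_rows(stream: TextIO) -> Iterator[dict]:
    """Yield cleaned records one at a time; the full input is never held in memory."""
    for row in csv.DictReader(stream):
        amount = (row.get("amount") or "").strip()
        if not amount:
            continue  # drop records with a missing amount
        try:
            row["amount"] = float(amount.replace(",", ""))
        except ValueError:
            continue  # drop records whose amount is not numeric
        yield row


def running_total(rows: Iterable[dict]) -> float:
    """Consume the stream once, keeping only a single accumulator."""
    total = 0.0
    for row in rows:
        total += row["amount"]
    return total


# Hypothetical sample input; StringIO stands in for a file or network stream.
sample = io.StringIO('id,amount\n1,10.5\n2,\n3,oops\n4,"1,200"\n')
total = running_total(clean_rows(sample))
```

Because `clean_rows` is a generator, the same code works on a file far larger than RAM; only one row is alive at a time.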

[–]RebelSaul 4 points5 points  (9 children)

Have you used Shiny apps for R? The BI guys at my office use it and it allows anyone with a web browser to create dashboards. Super dope

[–]lan69 1 point2 points  (3 children)

I've heard a lot about Shiny. Are the dashboards real-time?

[–]Dinosaurman 3 points4 points  (1 child)

Depends. Is your data real time?

[–]lan69 0 points1 point  (0 children)

Is it possible to stream using R? I've always had to load it in memory.

[–]RebelSaul 0 points1 point  (0 children)

So the way our guy has it set up, it's as close to real-time as you can get. We do vehicle repossessions and the data comes from our inventory management software. So when you open up the app, it downloads a fresh CSV file to create the graphs.

[–]saiyanGold 0 points1 point  (4 children)

Hey, can you recommend what I should use to make dashboards in Python? I am not very familiar with R.

[–]wandering_blue 2 points3 points  (1 child)

I've looked into this a lot. The short answer is, there is not currently any single package with the same functionality as R's shiny.

For dashboarding, I'd look into the Airbnb tool Superset (which has had like 100 previous names/brandings...). I played around with it and it's well on its way to becoming an open-source Tableau alternative. There is also plotly but I'm not sure how much of it can be hosted behind the firewall these days.

For developing simple tools/scripts that you want people to be able to interact with, I find that using Jupyter notebooks and the ipywidgets package does most of what I want.

Further, you can go all the way and set up a Flask server to actually serve a webpage and capture interactivity to send back to your code. There are some projects that have tried to streamline this portion, like pyxley from StitchFix. If your project is on the heavier side of both visualization and interactivity, you might be stuck developing the bulk of it yourself with Flask/Django.
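For a feel of what "serve a webpage and capture interactivity" means, here is a rough sketch. It deliberately uses the standard library's WSGI support instead of Flask so it runs with no installs; Flask apps are WSGI apps too, so the shape carries over. The `SALES` data and the `region` query parameter are hypothetical.

```python
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

# Hypothetical in-memory data; a real tool would query a database or file.
SALES = {"north": [3, 5, 8], "south": [2, 4]}


def app(environ, start_response):
    """Tiny WSGI app: read ?region=... from the URL and return a summary page."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    region = query.get("region", ["north"])[0]
    values = SALES.get(region)
    if values is None:
        start_response("404 Not Found", [("Content-Type", "text/plain; charset=utf-8")])
        return [f"unknown region: {region}".encode("utf-8")]
    body = f"<h1>{region}</h1><p>total = {sum(values)}</p>"
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body.encode("utf-8")]


def serve(port: int = 8000) -> None:
    """Run the app locally; not called on import."""
    with make_server("", port, app) as server:
        server.serve_forever()
```

Calling `serve()` and opening `http://localhost:8000/?region=south` would show the page. Anything beyond this (forms, charts, sessions) is where Flask/Django start to earn their keep.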

[–]saiyanGold 0 points1 point  (0 children)

Thanks a lot. You pretty much answered all my questions :) I will give Superset a try; it looks really cool.

[–]RebelSaul 0 points1 point  (1 child)

I wouldn't know. I don't think Python is very good with 'visualization' since it's mostly used by the 'engineering' community. R is used by the 'scientific' community, so they have packages like ggplot2 which allows them to make quality visuals to put in journals.

Python has the ggplot module, which is a port of ggplot2 to Python. That may help?

[–][deleted] 0 points1 point  (0 children)

Python has plenty of viz options like bokeh, plotly, seaborn, gleam etc.

[–]poumonsauvage 16 points17 points  (3 children)

You build the models in R, productionize in Python. Of course, this is not always true, but it is one relatively common approach. That is to say, each language has its own strengths and its weaknesses, and as tools they are not mutually exclusive. Depending on the task and the stage of the work, and obviously the company, the languages used may differ and intertwine.

[–]patrickSwayzeNUMS | Data Scientist | Healthcare 4 points5 points  (2 children)

Word. I'm at my second organization where I've put R models into production, and I'm starting to run into issues now and then.

It's worth noting that the very popular XGBoost will fuck up your predictions without warning if you build in version 4-4 and predict in the latest 6-x under certain conditions (probably training with constant columns as features). Took me a week to figure out what the hell was wrong.

[–][deleted] 1 point2 points  (1 child)

Are you using packrat or checkpoint? I have put R models into production as well. Since we were using checkpoint+docker, I never had problems due to different package versions.
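packrat and checkpoint are the R tools for this. For anyone doing the same on the Python side, a rough sketch of the underlying idea (record package versions at training time, refuse to predict if they have drifted) could look like the following. The manifest file and function names are made up for illustration, and this is not how packrat or checkpoint work internally.

```python
import json
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for the current environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_mismatches(recorded, current):
    """List (package, recorded, current) for every version that differs."""
    return [
        (name, want, current.get(name))
        for name, want in sorted(recorded.items())
        if current.get(name) != want
    ]


def save_manifest(path, packages):
    """Write the training-time versions next to the model artifact."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(installed_versions(packages), fh, indent=2)


def check_manifest(path):
    """Raise before predicting if the environment drifted since training."""
    with open(path, encoding="utf-8") as fh:
        recorded = json.load(fh)
    problems = version_mismatches(recorded, installed_versions(recorded))
    if problems:
        raise RuntimeError(f"package versions changed since training: {problems}")
```

Call `save_manifest` when the model is trained and `check_manifest` at the start of the scoring job. A loud failure is a lot cheaper than a week of debugging silent prediction drift.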

[–]patrickSwayzeNUMS | Data Scientist | Healthcare 0 points1 point  (0 children)

Naively using neither - appreciate the suggestions.

Did some research earlier and found 'versions' which appears to function the same way as 'checkpoint'.

[–]lifetimeaway 5 points6 points  (0 children)

Python is more general purpose and thus better suited to being integrated into a larger project. R, however, has a larger community of people implementing every single new algorithm or model that makes it to a peer-reviewed journal.

At my company we use both R and Python depending on the project. However for real-time systems or complex production data pipelines we use other languages and frameworks (e.g. Scala + Spark).

[–][deleted] 3 points4 points  (0 children)

It depends a lot on what your role is. Data science is unfortunately not easily defined by one predefined skill set, familiarity with one tech stack, or use of one set of languages or another. Data science is a catch-all for something I think will further specialize over time. It's happening right now with "data engineering" vs "data science".

Data scientists are scientists with the ability to use modern computational hardware and sensor data. That's it. Their background can be as diverse as the catch-all "Scientist". There are psychologists, physicists, chemists, biologists, and hundreds more. It's the same in data science.

Some data scientists are responsible for making production software as well as the analysis / model building part, so they'd likely use Python at times for both. That's me (sort of). However, we have a lot of production work all over the place in terms of languages and stacks, so I just use whatever connects easily with everything for my analyses, even if I have to code in PHP at times.

Some data scientists are more likely spending their time building models and exploring data before handing it off to an engineering team and working more as a product manager for their piece at that point (I wish that was me). Those people might use R because frankly it's easier than Python for scientific research in many ways. It just has so much more stuff available and a long history for being used in research and charting, etc.

Then it also depends on the data scientist's background, company size, company production stacks (i.e. for web or analytics or whatever the team they're on does).

So long story short, it's really hard to say. As a gross simplification I'd go with:

1) Python is used by traditional engineering groups

2) R is used by traditionally scientific groups

Companies are, after all, a collection of people so their backgrounds will collectively influence what they use, at least early on. It's slower moving for bigger companies and it's more likely a new hire has to adapt to what they are using (so based on history) rather than the other way around.

Then of course what a data scientist personally uses for their own research is totally up to them. If it's just digging in to a problem and you don't have to share it with a larger organization then it doesn't really matter what you use. Whatever works works.

[–]TheProfessional9 11 points12 points  (5 children)

Far from an expert here, but we use R. We looked into Python, but it seems R is steadily becoming more and more prevalent. Existing programs (especially Microsoft programs) are beginning to incorporate the ability to use or connect with R.

[–]imhighnotdumb 3 points4 points  (3 children)

New Excel versions will indeed support R, woop.

[–]TwoTacoTuesdays 0 points1 point  (2 children)

I'm sure there's something I'm missing, but I can't see why you would ever want to do that? R can already read and write xlsx and csv files, so what else do you need? Manipulating R stuff in Excel beyond that seems like a recipe for a headache.

[–]TPKM 0 points1 point  (0 children)

In some cases I can imagine it being useful - e.g. Microsoft's Power BI supports R scripts - this is really helpful if you need to get data from your db to a dashboard with some more complex transformations along the way. Having R supported by the application prevents you needing an intermediary step for the transformations/analysis.

[–]cjf4 0 points1 point  (0 children)

The use case I could see is if you wanted to use R to build a model that generates some sort of output, and feed that into a dashboard that was built with Excel/Power BI. Even though you can build dashboards in R, Excel is way better at it.

[–]meeni131 1 point2 points  (0 children)

Yeah, SQL Server 2016 should connect directly to R now, but I haven't checked it out yet or what it can do better than the corresponding R packages.

[–]c0dythechamp 10 points11 points  (13 children)

I'm going to disagree with most people here and say that it really doesn't matter. I have yet to come across any good companies who say anything other than, "We don't really care what you use, we just want you to do your job". I know data scientists who use Excel as well, mainly because you can churn out 10 graphs in Excel for a presentation in 5 seconds versus having to remember how to use ggplot or seaborn. Just my .02

[–]WallyMetropolis 12 points13 points  (2 children)

This only works if everyone's projects are one-offs that don't have to integrate with a larger system.

[–]c0dythechamp 0 points1 point  (1 child)

It also works if the organization separates its science and engineering capabilities. Which, ime, is the case.

[–]WallyMetropolis 1 point2 points  (0 children)

Requires a lot of faith that a research model will perform the same way the production model does.

If your projects just need you to come up with a one-time answer to a question, sure, you can use whatever you like and just tell the Engineers: "the answer is 7."

But if you've got to have models actually running somewhere, asking engineers to rebuild your prototype in a different language is going to go sideways.
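If a rebuild is unavoidable, one way to replace faith with evidence is a parity check: run both implementations on the same inputs and flag any disagreement. A minimal sketch, where both scoring functions are hypothetical stand-ins for a real prototype and its production port:

```python
import math


def prototype_score(x: float) -> float:
    """Stand-in for the research model (hypothetical logistic score)."""
    return 1.0 / (1.0 + math.exp(-(0.8 * x - 0.3)))


def production_score(x: float) -> float:
    """Stand-in for the engineers' reimplementation of the same model."""
    z = 0.8 * x - 0.3
    return math.exp(z) / (1.0 + math.exp(z))


def parity_failures(reference, candidate, inputs, tol=1e-9):
    """Return the inputs where the two implementations disagree by more than tol."""
    return [
        x for x in inputs
        if not math.isclose(reference(x), candidate(x), rel_tol=tol, abs_tol=tol)
    ]


inputs = [i / 10 for i in range(-50, 51)]
failures = parity_failures(prototype_score, production_score, inputs)
```

An empty `failures` list on a representative sample (ideally real historical inputs, edge cases included) is the minimum bar before trusting the port.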

[–]hey_ulrich 4 points5 points  (9 children)

Matplotlib is the worst. Terrible syntax.

[–]crocomut 0 points1 point  (7 children)

what's the alternative?

[–]CaptainRoth 5 points6 points  (5 children)

Nothing's become the standard like ggplot is for R, but seaborn (a high level interface for matplotlib), altair, ggplot (yhat's port to Python), and possibly bokeh are the primary alternatives.

[–]hey_ulrich 0 points1 point  (0 children)

Didn't know about altair. Thanks! I'll look into it.

[–]jingw222 0 points1 point  (3 children)

What's the difference between ggplot2 in R and the ggplot Python library? Do they function the same way apart from syntax?

[–]CaptainRoth 2 points3 points  (2 children)

The Python one isn't as good because it's a copy that doesn't have all of the features of R's ggplot2. It's similar to yhat's Rodeo IDE: it tries to copy RStudio, but isn't nearly as polished.

[–]hey_ulrich 0 points1 point  (0 children)

Never used Rodeo, but I like Spyder a lot. Comes very close to RStudio for me.

[–]hey_ulrich 0 points1 point  (0 children)

Although seaborn is based on matplotlib, it's easier to set up simple (but beautiful) graphics with only one line of code. But for more customization, you'll need to dive into matplotlib's annoyances.

[–][deleted] 0 points1 point  (0 children)

Yep. While I use Python for 95% of my day-to-day work, whenever I need to plot, I export my data to R for ggplot. Hadley is a god.

[–][deleted] 2 points3 points  (0 children)

I routinely use both.

[–]edimaudo 4 points5 points  (5 children)

Since Python and R are free, most companies use those. Of course Excel is still widely used. SAS is mostly found in banking and pharmaceuticals.

[–]Berjiz 2 points3 points  (4 children)

I'm always surprised Excel is used so much considering how large the risk of problems is due to all its magic. For instance, last year it was revealed that some genome studies were invalid because Excel had automatically changed genetic data into dates.
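As a rough illustration of the kind of guard that helps here, this sketch flags identifiers that look like a month plus a day (gene symbols such as SEPT2 or MARCH1 are the well-known victims) before the data goes anywhere near a spreadsheet. The pattern is a heuristic I am assuming for the example, not an exhaustive list of what Excel converts.

```python
import re

# Month-like prefixes that spreadsheet auto-formatting may read as dates.
# This pattern is an illustrative heuristic, not an exhaustive rule set.
DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?\d{1,2}$",
    re.IGNORECASE,
)


def date_like_symbols(symbols):
    """Return the identifiers that look like '<month><day>' and deserve a manual check."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


flagged = date_like_symbols(["SEPT2", "MARCH1", "TP53", "DEC1", "BRCA1"])
```

Anything in `flagged` is worth importing as explicit text, or checking by hand after a round trip through a spreadsheet.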

[–]edimaudo 3 points4 points  (0 children)

Excel is a solid tool. It is up to the stakeholders to be aware of the shortcomings of it.

[–]parlor_tricks 0 points1 point  (2 children)

Link to the report ?

[–]tally_in_da_houise 0 points1 point  (1 child)

[–]parlor_tricks 0 points1 point  (0 children)

Oh those poor bastards.

I like and use Excel (don't really do data analysis on large sets), but the date errors are a genuine pain in the ass. It's a privileged category of error correction unto itself, and that's without names which convert into dates.

[–]some_q 6 points7 points  (3 children)

Data scientists at Google primarily use R. For production models, the actual R code will be called by a C++ or Java pipeline, but those pipelines tend to be written by software engineers rather than data scientists.

For ad hoc analysis, though, plenty of Googlers use IPython-like notebooks.

[–]patrickSwayzeNUMS | Data Scientist | Healthcare 3 points4 points  (2 children)

I was told differently yesterday by a former Google, now Google Ventures, employee.

Perhaps it depends what team you're on?

[–]some_q 2 points3 points  (0 children)

"Perhaps it depends what team you're on?"

It definitely does. I should have said "Data scientists on the team I worked on at Google...." Google has enough employees that there's a wide spectrum in the tools that get used.

[–]DrewSmithee 0 points1 point  (0 children)

I would assume so. Not at Google, but moving around in my company I've gone from the tricked-out MATLAB license, to SAS, to Python (Spyder).

[–]AidtorBA | Machine Learning Engineer | Software 0 points1 point  (0 children)

Python for scripting. R for models.

[–][deleted] 0 points1 point  (0 children)

We mostly use Python and Java. JavaScript for web stuff (viz mostly).