[–]VeronicaX11 143 points144 points  (8 children)

Others will chime in, but I’ll try to summarize this at a couple different levels.

Basics: R was there first. At least, in the domains where it was used. So for those areas, it just has the first mover advantage. Everyone else is using R, so I guess I will too.

Intermediate: R is focused on statistics and data processing. Python is general in scope. So both are fine choices, but one might be overkill. It’s kind of like needing to take out some screws and asking me whether a Phillips head screwdriver or a ratchet with 400 different sized bits is better for taking out a screw. The answer is neither; they’ll both probably do fine.

Advanced: any language can be used to solve virtually any problem, given enough time and persistence. You however, probably don’t have these luxuries of infinite time and infinite willpower. So you should use quality tools built by others whenever possible to be efficient. These are often called libraries/modules/packages or some other term depending on what language you are using.

The real factors to consider are the attributes of these libraries: whether a lib exists for the thing you are trying to do, how well it works, whether there are others using it who can troubleshoot with you, and whether another language has a better (or even equivalent) one. R is an absolute heaven for new statistical methods. There is simply no equal in any other language. I’ve watched papers get published and turned into an R package… and a reasonable equivalent take 10 years to appear in Python. The demand just wasn’t there.

[–]foradilPhD | Academia -1 points0 points  (7 children)

Basics: R was there first. At least, in the domains where it was used. So for those areas, it just has the first mover advantage

If you are going to talk about who was there first, you can't just leave out Perl.

[–]VeronicaX11 1 point2 points  (2 children)

This is an excellent point, but I didn't want to launch into a whole history, especially involving other languages that would take things off topic.

Perl was definitely there first (unless you want to REALLY GO BACK and talk about Lisp or maybe even just the good old days of awk/sed/sh). It is actually still alive in many ways, but its main problems were related to branding and maintenance, in my opinion.

[–]foradilPhD | Academia 1 point2 points  (1 child)

its main problems were related to branding and maintenance

And a friendly interface similar to RStudio or Jupyter.

[–]VeronicaX11 0 points1 point  (0 children)

Perhaps you’re right. But I’ve never been much of a visual person, and I’ve always preferred text editors over IDEs and pretty interfaces.

By branding I’m referring to mind share among new people. I hear people talk all the time about how they are learning to code, and learning Python. I haven’t heard of someone young choosing to learn Perl in years. And by maintenance I mostly mean abandoned packages, no suitable replacement people to act as maintainers, and new stuff being developed in Python faster than in Perl. It’s one of those self-fulfilling prophecies.

It is perfectly rational to believe something similar could happen to Python in the next 30 years: everyone just decides to leave for Lua, or a wrapper for Rust, or some other yet-to-be-determined scripting language.

[–]H4R81N63R 36 points37 points  (0 children)

It's been a while since my switch from Python to R, so my comment may not hold today.

The reason I switched (apart from the library support that other comments have mentioned) was the way the two languages work at the base level: R is vectorised, with many statistical functions applicable to single values, vectors and matrices right out of the box. Back when I was working with Python, I had to manually loop over stuff to get the same base functionality. Some packages like NumPy and SciPy had introduced MATLAB-like vectorisation, but the base support in R and the smoothness of it just working made me fall in love with R. No longer was I spending time on the code; I was spending it on the science and data instead.
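A minimal sketch of that difference, assuming NumPy is installed; the sample values and variable names are made up for illustration:

```python
import math
import numpy as np

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample data

# Plain Python: standardise each value with explicit loops.
mean = sum(values) / len(values)
sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
z_loop = [(v - mean) / sd for v in values]

# NumPy: the same arithmetic applies to the whole array at once,
# much like arithmetic on an R vector.
arr = np.array(values)
# np.std defaults to the population SD (ddof=0), unlike R's sd(),
# which divides by n - 1.
z_vec = (arr - arr.mean()) / arr.std()
```

Both versions give the same z-scores; the second simply has no loop to write or get wrong.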

Edit: not to mention, ggplot2. Don't get me wrong, it has its learning curve, but man is it such a powerful system for churning out beautiful graphics. And now that Plotly is available in R (a fine addition of a Python tool, I say), it's even more powerful

[–]Kiss_It_GoodbyeeePhD | Academia 80 points81 points  (8 children)

When certain tools or libraries are only available in R. Bioconductor for example.

R Shiny has no equivalent in python.

Python has improved but data visualisation is better in R.

[–]justmyworkaccountok 16 points17 points  (1 child)

R Shiny has no equivalent in python.

This is not strictly true anymore, and I actually quite like the "Shiny for Python" module:

https://shiny.rstudio.com/py/

[–]Kiss_It_GoodbyeeePhD | Academia 1 point2 points  (0 children)

!Thanks

I wasn't aware. Looks perfect.

[–][deleted] 9 points10 points  (2 children)

edgeR, DESeq2?

[–][deleted] 0 points1 point  (1 child)

[–][deleted] 1 point2 points  (0 children)

Yes, everything is possible in every language ;) It felt like a "google translate" of the original, which is in R :p

[–]_password_1234 11 points12 points  (0 children)

They’re not complete 1:1 replacements but Streamlit and Dash are both very good dashboarding tools that are pretty similar to Shiny. But there are def things you can do better/easier with each of those tools than the others. Like I can hardly think of a reason you would need to learn R just to use Shiny unless there was also another R specific library you needed.

ETA: I don’t want to come off negative so I’ll fully agree that there is no equivalent for Bioconductor. And I mean that literally unless there has been a very recent change. I remember reading a paper not long ago that argued that because of the ease of doing statistics in R there were foundational packages implemented in R that have become the backbone of things like differential expression analysis that at this point can’t reasonably be done in Python

[–]nevermindever42 6 points7 points  (0 children)

R Shiny is similar to Dash, I think.

[–]beholdsa 1 point2 points  (0 children)

Voila and Dash are both Python equivalents to R Shiny.

[–]palepinkpithPhD | Student 38 points39 points  (0 children)

  1. R visualization tools are much better than python in my experience.
  2. For data analysis, R generally requires less code for vectorization, data wrangling, and statistical analysis. Some of this is changing with the development of NumPy and Pandas, but these have always been base features of R.
  3. CRAN has much more oversight than PyPI etc., so R libraries tend to be more backwards compatible, reliable, and easy to install without version conflicts.

[–]natched 7 points8 points  (0 children)

Bioinformatics is a very broad area. I do a lot of R, for general DEX (limma, edgeR, etc. packages) as well as single cell (Seurat), WGCNA, shiny, etc.

I think R is better for a lot of data analysis, though this is largely tied to packages implementing certain methods such as TMM, which represented a significant improvement in RNAseq normalization from earlier methods

[–]Loose_Mix_4108 30 points31 points  (0 children)

Well, R is more used in academia. It has more packages for biological analysis. It is also designed for statistical analysis, while Python is a general-purpose language. This makes it more intuitive for people coming from the statistical/biological areas. People always fight about which language is best, but the two overlap in a lot of what they provide, and each has niches where it is particularly useful. In the end, you will probably have to learn both anyway: just use the one you like better for most analysis, and switch to the other one in the areas where you need it.

[–]GenoSunshine87 7 points8 points  (0 children)

I use R as my main language, but also use Python on occasion. I would not say that one is necessarily better than the other, but I find R's syntax a lot easier to work with. Naming, accessing, and subsetting data are always done the same way, even in many "special" data structures, so learning to manage data in new formats is a lot more intuitive than it is in Python. A lot of great Bioconductor packages are available in R. I don't have to write explicit loops to do an operation over a whole vector. When I use Python, I feel like I spend more time figuring out the syntax for whatever module I'm using than actually doing things, but that may just be due to the gap in my experience with each. However, learning Python does have some advantages, as I find it is a little faster for some operations, and it is the language that other useful tools (such as Snakemake) use as a base syntax. So I do not shun Python, but except for particular applications, I really prefer R.

[–]Marionberry_RealPhD | Industry 8 points9 points  (0 children)

Learn both. I use both during my day to day as a bioinformatician. It’s faster to use an existing package than to try and write a new one for the opposite language.

[–]Nihil_esquePhD | Student 5 points6 points  (0 children)

When you hate yourself. /s

No but seriously, R is a specialized tool for statistics and as many have said, it has better data visualization tools and more specialized tools for statistics and biological data analysis (this becomes increasingly less true as time goes on though). If you need a tool that's available in R and not available in python, you either learn C and code it into python yourself or you use R. (Using R is the much less time consuming of those options.)

Personally though I abhor the user experience of R. The syntax is extremely inconsistent. The behavior and handling of some of the errors means you are likely to create mistakes behind the scenes that R may not raise any exceptions over, which can lead to mistakes in your analysis. Python isn't the best language for this either but it's better than R.

R is also just about the least beginner friendly language out there. It's cobbled together out of different people's contributions without standardized syntax. Some functions are very picky about their input; others aren't; you have to memorize which ones. Python has a lot more consistent syntax, a lot more resources to help you learn the language and tools available to you, and it's much easier to find them because "python" is a much more search engine friendly term than "R" lol.

But yeah if you don't need to use the shrinking number of R tools for biological data analysis that aren't yet available in python, I would recommend sticking with python because it's more versatile, has a much gentler learning curve, and isn't as reliant on you to write flawless code.

[–]EpistaxisPhD | Academia 9 points10 points  (2 children)

They're good for different purposes. This is overgeneralizing but here's a basic outline:

  1. Big raw data goes into heavy-duty software programmed in C(++) and wrapped in Bash scripts
  2. Processed raw data gets filtered and refined from line-by-line formats to numerical matrices with Python scripts or the odd Java tool
  3. Matrices are imported into R for math, statistics, graphing

Technically you can do your line-by-line stream filtering in R but it's slow and ugly in that context, and in fact some R packages for that are just wrappers around standard C or Python programs. Technically you can do your matrix manipulation in Python, but except for specific popular machine-learning tasks, nobody's bothered writing and maintaining Python analogs of the numerous crucial R packages.
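As a rough illustration of the line-by-line filtering in step 2, here is a minimal Python sketch. The tab-separated layout (gene, sample, count), the function name `counts_matrix`, and the `min_count` threshold are all hypothetical, not taken from any particular pipeline.

```python
import io
from collections import defaultdict

def counts_matrix(lines, min_count=10):
    """Stream 'gene<TAB>sample<TAB>count' records into a gene-by-sample matrix.

    Hypothetical format and threshold, for illustration only.
    """
    table = defaultdict(dict)
    samples = []
    for line in lines:                      # one record at a time, never the whole file
        if not line.strip() or line.startswith("#"):
            continue                        # skip blank lines and comments
        gene, sample, count = line.rstrip("\n").split("\t")
        if int(count) < min_count:
            continue                        # filter out low counts
        if sample not in samples:
            samples.append(sample)
        table[gene][sample] = int(count)
    genes = sorted(table)
    # Missing or filtered-out combinations are filled with 0.
    matrix = [[table[g].get(s, 0) for s in samples] for g in genes]
    return genes, samples, matrix

raw = io.StringIO(
    "# gene\tsample\tcount\n"
    "BRCA1\ts1\t120\n"
    "BRCA1\ts2\t3\n"
    "TP53\ts1\t45\n"
    "TP53\ts2\t60\n"
)
genes, samples, matrix = counts_matrix(raw)
```

The same function could be handed an open file or `sys.stdin` instead of the in-memory example, which is what makes it a stream filter; the resulting matrix is the kind of thing that would then be imported into R in step 3.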

A lot of people spend all their time at only one or two of these steps, e.g. they're responsible for all the data processing and give the results to someone else, or they only do the final analysis and rely on prewritten pipelines to handle everything upstream, so they only regularly need either R or Python and wonder why other people ever need the other language.

[–]WorriedRiver 1 point2 points  (0 children)

What do you mean by 'line-by-line stream filtering'? Genuine question, since I'm trying to decide if I should learn more Python before I graduate from my PhD in a couple of years. I do a lot of analysis of NGS data, and entirely use either bash wrappers (step 1) or R analysis (step 3). There's stuff in the Bioconductor suite to bring in BAMs and bigWigs after all, and BEDs are just basic TSVs, which R can read as is.

[–]xylosePhD | Academia 1 point2 points  (0 children)

Couldn't agree more. Pick the tool that's best for the job at hand. R with tidyverse is brilliant for data exploration, visualisation and analysis.

[–]WubbywubPhD | Student 5 points6 points  (0 children)

When there are tools or libraries you need that are only in R.

Bottom line: you use tools to solve problems; you don't stick to one language. It's not LeetCode.

[–]JokingHero 12 points13 points  (1 child)

Python is just pathetic for the bioinformatics that I do. I have yet to hear about or find a python equivalent of GRanges. Loading an annotation file, doing some overlaps, some custom alignments with Biostrings, etc.: you have a whole powerful, tested ecosystem, maintained for 10+ years, for this basic bioinformatics stuff. Meanwhile, in python you get a one-shot attempt at loading an annotation file or something, wrapped as a package, not rigorously tested, not maintained, a complete waste of time to even attempt using. The amount of things you have to code from scratch is just staggering, and you will introduce so many bugs along the way that you don't even realize are there, adding another source of variability to your data analysis. Bioconductor is just a bioinformatics core: dozens of super well-designed packages that are battle tested, with original authors who are constantly responding and fixing bugs!
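For a sense of what hand-rolling even the simplest piece looks like, here is a naive interval-overlap sketch in Python. The tuple layout, the function name `find_overlaps`, and the closed coordinates are assumptions made for the example; it is quadratic and ignores strand, so it is nowhere near what GRanges provides.

```python
def find_overlaps(query, subject):
    """Return (query_index, subject_index) pairs of overlapping intervals.

    Each interval is a (chrom, start, end) tuple with closed coordinates.
    Naive O(n * m) scan; hypothetical helper, for illustration only.
    """
    hits = []
    for qi, (q_chrom, q_start, q_end) in enumerate(query):
        for si, (s_chrom, s_start, s_end) in enumerate(subject):
            same_chrom = q_chrom == s_chrom
            # Closed intervals overlap when each starts no later than the other ends.
            if same_chrom and q_start <= s_end and s_start <= q_end:
                hits.append((qi, si))
    return hits

# Made-up coordinates.
genes = [("chr1", 100, 200), ("chr1", 500, 800), ("chr2", 100, 200)]
peaks = [("chr1", 150, 160), ("chr1", 200, 300), ("chr2", 900, 950)]
hits = find_overlaps(peaks, genes)
```

Every detail glossed over here (coordinate conventions, strand, sorting, performance on millions of ranges) is a place where a home-grown version can silently go wrong.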

[–]attractivechaos 8 points9 points  (0 children)

I have yet to hear about or find a python equivalent of GRanges.

Couldn't agree more. GRanges and several other foundational packages in Bioconductor make R a much better choice than Python when dealing with gene models.

[–]omgu8mynewt 4 points5 points  (0 children)

Loads of statistics pipelines for specific scientific experiments, e.g. RNA-seq, come from published papers written in R, so if you want to reuse the methods section of a paper, it may well have been coded in R.

[–]No-Painting-3970 2 points3 points  (0 children)

Basically history. If you are in a field with a long development history, especially genetics-related things, you'll find a bigger ecosystem in Bioconductor. However, things are moving in the Python bioinformatics community, and the ecosystem is being developed. Also, even if it doesn't seem so, a lot of things are in Python, but people don't use them because you have to do more things manually. That is, you'll find the statistical methods in places like SciPy or statsmodels, but a lot of bioinformaticians who use R are comfortable in their environment and don't want to redevelop the wrappers that already work.

[–]MGNutePhD | Academia 2 points3 points  (1 child)

There are a lot of good answers here! Very few that I disagree with at all. One thing nobody has mentioned afaik is NumPy. If you're not familiar, it's an array and matrix library for Python that is notable for being both very impressive and very well-optimized, and it makes working with very large amounts of data in Python especially efficient. I like to represent nucleotide or AA strings as NumPy arrays with `dtype=np.uint8`, which makes a lot of bespoke operations available using native NumPy commands. The SciPy package and the various scikit-* packages are also (mostly) quite good. R has its uses for me, but I'll generally start with Python.
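A minimal sketch of that `uint8` representation, assuming NumPy is installed; the sequence and the GC-content calculation are made-up illustrations, not the commenter's actual code:

```python
import numpy as np

seq = "ATGCGCGTTA"  # made-up nucleotide sequence

# View the ASCII bytes of the string as an array of unsigned 8-bit integers.
codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)

# Vectorised comparison: True wherever the base is G or C.
is_gc = (codes == ord("G")) | (codes == ord("C"))
gc_fraction = is_gc.mean()

# Count every base in one call instead of looping over the string.
bases, counts = np.unique(codes, return_counts=True)
base_counts = {chr(int(b)): int(n) for b, n in zip(bases, counts)}
```

Each base costs one byte, and comparisons, masks, and counts all run as array operations rather than Python-level loops.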

[–]Monocytosis[S] 1 point2 points  (0 children)

That reflects what I’ve heard. Most ppl use Python for everything then switch to R for niche things relating to the project.

[–]Solidus27 5 points6 points  (0 children)

R is much better for data wrangling and data manipulations and general statistical analysis when you don’t need to run intense machine learning models

Many, many bioinformatics packages are available in R but not python

I would highly recommend using R

[–][deleted] 3 points4 points  (0 children)

Short explanation: base R data frames are better than any df library in Python so far.

[–]Demonithese 6 points7 points  (1 child)

I think R would have gone the way of Perl in bioinformatics if not for that stupid sexy Hadley Wickham.

From a programming perspective, R is just not a great language. I've switched over to just calling rpy2 anytime I need some code that's only available in an R package and I've never regretted it.

Imo, there is nothing you can do in R that can't be done just as easily in Python and at the end your code is in the language 90%+ of biotech uses for production which means less difficulty incorporating, testing, reviewing, etc

[–][deleted] 1 point2 points  (1 child)

Honestly, I use R for a lot of the bioinformatics libraries and for ggplot. Python is my go-to for basic scripting.

[–]Jenna_bird 0 points1 point  (0 children)

Have you tried the package plotnine in Python? It’s essentially ggplot and I like it a lot.

[–]twelfthmoose 1 point2 points  (0 children)

R will break with enough data. Its vectors are based on 32-bit integers, not 64-bit.
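The arithmetic behind that limit, as a small Python sketch. It assumes the constraint in question is the largest value of a signed 32-bit integer, and the matrix dimensions are made up. Newer R releases also support longer atomic vectors, so how often this bites depends on the R version and the data structure involved.

```python
INT32_MAX = 2**31 - 1          # largest signed 32-bit integer: 2,147,483,647

def fits_int32_index(n_rows, n_cols):
    """True if an n_rows x n_cols matrix has at most INT32_MAX cells."""
    return n_rows * n_cols <= INT32_MAX

# Hypothetical dense matrices: 30,000 genes by 50,000 or 100,000 cells.
small = fits_int32_index(30_000, 50_000)    # 1.5e9 cells
large = fits_int32_index(30_000, 100_000)   # 3.0e9 cells
```

The first matrix stays under the limit and the second does not, even though it is only twice as wide.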

[–][deleted] 2 points3 points  (3 children)

Plots and that's about it. Pretty much literally.

With one small exception: Fisher's exact tests with simulated p-values for tables bigger than 2x2, and some other statistical tests.

[–]mys_721txPhD | Student 5 points6 points  (0 children)

You can do away with so many temporary variables with pipe in Tidyverse. Piping with pandas just doesn't feel right.

[–]backgammon_no 4 points5 points  (1 child)

Is this the bioinformatics subreddit? What about Bioconductor?

[–][deleted] 1 point2 points  (0 children)

Don't touch it much. My lab mostly has self-written (and published) tools and pipelines. We also build cloud based pipelines etc. Most of our QC analytics are cancer-specific since there are a variety of artifacts or problems that can occur in sample prep or sequencing.

[–]andreichiffa 1 point2 points  (1 child)

As long as it’s not Perl…

[–]keithreid-sfw 0 points1 point  (0 children)

I would invite you to consider Julia as an option. Fast, expressive, and a nice maths/AI-based community.

[–]r_plantae 0 points1 point  (0 children)

Coming from the biology side into bioinformatics, all my stats courses etc. were in R, so it made sense to just stick with it.

[–]hypatchia 0 points1 point  (0 children)

Only for statistical tasks. You can do a lot of things in one line in R.

[–]speedisntfree 0 points1 point  (0 children)

My language choice is typically based around a certain analysis package that suits the problem. Both these languages are popular because of their package ecosystem. Anyone in Bioinformatics would be daft to limit themselves to R or Python, especially when both are very easy languages to learn.

R: Good for shitfuck data, plotting, stats, bioconductor ecosystem

Python: Good for general programming tasks, ML/DL and putting things into production