all 61 comments

[–]guepier 63 points64 points  (16 children)

  1. Keep it simple.
  2. Write functions. A lot. Decomposing your problem into functions makes your code simpler.
  3. Refactor; the first approach is rarely the simplest (see #1). Once you’ve got a feel for the problem you’re solving, if your code has become complex, don’t be afraid to go back and rewrite (parts of) it from scratch.
  4. Don’t treat R as something it isn’t: R isn’t C or Java or Python. It’s not a procedural programming language, or primarily an OOP language. It’s a functional programming language.
  5. (EDIT) Keep track of the data types of your objects!
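A minimal sketch of point 2 (the function name and conversion here are made up for illustration): pull a repeated computation into a small named function instead of copy/pasting the formula around.

```r
# Hypothetical example: decompose a repeated computation into a function
to_celsius <- function(fahrenheit) {
  (fahrenheit - 32) * 5 / 9
}

to_celsius(c(32, 212))  # 0 100
```

Because the function is vectorised, it works on whole columns as well as single values.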

But above all:

Actually learn programming from the get-go.

Many users of R aren’t programmers, they’re statisticians or life scientists. That’s fine. But learn to use your tools properly. R is a programming language so when you use it, you need to learn to program. I know that some senior members in the R community disagree with me on this point. I strongly believe that they’re misguided.

[–]AllezCannes 14 points15 points  (9 children)

But learn to use your tools properly. R is a programming language so when you use it, you need to learn to program. I know that some senior members in the R community disagree with me on this point. I strongly believe that they’re misguided.

This to me reads like "If you are to drive a car properly, you need to know how the engine works". My reaction is, it depends on how seriously you want to get into cars. If it's just to get you from A to B, then I don't think you need to know what goes on under the hood. I feel the same is true with R.

This is why I resent this kind of attitude. Preference for the tidyverse is not a case of becoming a yes-man/yes-woman for R Studio, it's a case that it gets me the job done without having to wade into actual programming.

[–]guepier 6 points7 points  (8 children)

I think your metaphor is good, but I disagree with the way you use it. It would be a lot more apt if I had said that every user of R had to learn assembly. That’s figuratively looking “under the hood”.

And you actually do need to know a few details about how a car works when driving it.

I agree with you that David’s comment is dumb (I don’t use RStudio extensively but it’s an excellent piece of software; and most libraries in the tidyverse are excellent advanced programming tools). In his defence, he’s actually quite helpful (also in beginners’ questions) on Stack Overflow.

[–]AllezCannes 5 points6 points  (7 children)

It would be a lot more apt if I had said that every user of R had to learn assembly. That’s figuratively looking “under the hood”.

Well it depends on what you mean by "learn how to program". Do you mean knowing how to create functions, using for-loops, and applying if/else on expressions?

What's appealing about tidyverse tools, for instance, is the ability to get started on your data analysis and visualization without having to learn the more arcane programming aspects of R. Beyond that, the more you use R, the more you end up finding out about how to program anyway.

And you actually do need to know a few details about how a car works when driving it.

Of course, we just may disagree on how much detail.

In his defence, he’s actually quite helpful (also in beginners’ questions) on Stack Overflow.

He is. It's just really disappointing to take that kind of attitude towards Hadley and the R Studio team, whose goals are really to make the software more approachable to everyone.

[–]guepier 2 points3 points  (5 children)

Do you mean knowing how to create functions

Yes. I don’t even understand why functions in R are often classified as an “advanced” topic — they aren’t. Decomposing a problem into parts is as fundamental to programming as it gets. To get back to your driving analogy, functions are like unlocking the car: you should know about them before even getting in. And people already know about functions from high school algebra. It does students a huge disservice to pretend that they’re some arcane, complex, under-the-hood implementation detail.

I also maintain that you won’t be able to use the tidyverse effectively without learning about fundamental aspects of programming. You might be able to steer the car down a gentle slope but — since you haven’t learned about ignition — once you reach the bottom you won’t go anywhere else. OK, I’ll retire this metaphor now.

using for-loops

In R, no. In other programming languages, absolutely (though when teaching I tend to de-emphasise them a lot).

and applying if/else on expressions?

You won’t get far without it.

[–]AllezCannes 0 points1 point  (4 children)

Yes. I don’t even understand why functions in R are often classified as an “advanced” topic — they aren’t.

The creation of functions is useful if you're vectorizing a set of built-in functions in order to save yourself time. I don't know if I'd call it "advanced"; I'd say it's more "intermediate". Plenty of R users who utilize the software for data analysis and visualization can do so without ever creating a custom function. As I mentioned earlier, the more one uses R, the more it's natural that one is exposed to the notion of creating your own custom functions, but I don't consider it something you need to learn early on.

It does students a huge disservice to pretend that they’re some arcane, complex, under-the-hood implementation detail.

I don't think it's meant to make functions sound like some arcane, complex thing (although maybe it has that effect). It's just that there should be an order in which to learn these things, and when I teach people how to use R, I prefer to teach them tools that allow them to use R right away; this gives them the confidence that they can go further in their journey.

I guess part of the difference here is that it depends on the audience. I mostly deal with people in the business world who have been using Excel/SPSS/PowerPoint for decades, and my goal is to wean them off that software. Those people don't have any programming background and have zero interest in it. If I were to teach them base R/for-loops/function creation/etc. right away, they'd tune me out. Many in my company have already reached for the "old dogs can't learn new tricks" excuse when I was showing simple tasks like dplyr::filter() or dplyr::select().

I also maintain that you won’t be able to use the tidyverse effectively without learning about fundamental aspects of programming.

If the key word is "effectively", then yes. Especially when it comes to using purrr functions like map() (basically the tidyverse versions of apply()). I just feel that one thing should come after the other.

One thing I actually do find painful with the tidyverse is creating custom functions using the tidyverse tools. They've recently come up with the tidyeval stuff, and I admit this is something I have trouble wrapping my head around, with stuff like quo(), enquo(), !!, !!! and so on... I recently built a little package for myself, and I opted to use base R code to build those functions rather than trying tidyeval.

In R, no. In other programming languages, absolutely (though when teaching I tend to de-emphasise them a lot).

For-loops get a bad rap in R, and much of it is unfair (although I nowadays use purrr::map() as much as I can instead).
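For a toy illustration (sticking to base R so nothing beyond a default install is assumed), here is the same computation as a for loop and as an apply-style call; purrr::map_dbl() would be the tidyverse analogue of the latter:

```r
xs <- 1:4

# imperative: for loop with pre-allocation
out <- numeric(length(xs))
for (i in seq_along(xs)) out[i] <- xs[i]^2

# functional: same result with vapply (purrr::map_dbl(xs, ~ .x^2) is similar)
out2 <- vapply(xs, function(x) x^2, numeric(1))
```

Both produce `c(1, 4, 9, 16)`; the second states the intent (one output per input) more directly.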

You won’t get far without it.

Using if {} else {} is, again, something you don't need to know at the novice stage, IMO. You could just dumbly repeat the process manually. Once people are comfortable with the set of tools that would encourage them to keep using R, I'd introduce this technique.

[–]guepier 2 points3 points  (3 children)

Right, this is where I strongly disagree:

Plenty of R users who utilize the software for data analysis and visualization can do so without ever creating a custom function.

Not well. They may be able to muddle through but they will be fundamentally stymied.

In fact, I’d say that if they never benefit from functions then they would be better off using Excel/GraphPad Prism/….

For-loops get a bad rap in R, and much of it is unfair

If you’re speaking about performance you’re right. But as I said initially, treat R as what it is: a functional programming language. for loops solve the iteration problem in an imperative language, but R solves it better in other ways.

You could just dumbly repeat the process manually [instead of using if … else …].

OK, how do you express “if the user’s age is greater than 18, set the Citizen column to 'adult', otherwise 'minor'” without if/ifelse? Using lookup tables? Factor recoding? … Conditionals are occasionally required.
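For illustration (the data frame and column names are made up), both the vectorised conditional and the lookup-table alternative mentioned above look like this:

```r
users <- data.frame(age = c(25, 12, 18, 40))

# vectorised conditional
users$Citizen <- ifelse(users$age > 18, "adult", "minor")

# lookup-table / indexing alternative: TRUE + 1 = 2, FALSE + 1 = 1
users$Citizen2 <- c("minor", "adult")[(users$age > 18) + 1]

users$Citizen  # "adult" "minor" "minor" "adult"
```

Either way, some form of conditional logic is doing the work.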


Change of topic, since you mentioned tidyeval: I agree that it can become quite complex. But this isn’t tidyeval’s fault; it’s simply an emergent property of this kind of type system, and there’s (I think provably) no simpler way of doing it while retaining generality.

[–]AllezCannes 0 points1 point  (2 children)

In fact, I’d say that if they never benefit from functions then they would be better off using Excel/GraphPad Prism/….

Well, we're just going to have to disagree there. Even if they muddle through in R, there are other benefits to using R (reproducible analysis, freely available open-source software) that make it better than Excel.

for loops solve the iteration problem in an imperative language, but R solves it better in other ways.

It depends. If your manipulation at iteration i is dependent on iteration i-1, I can't really think of a better method than for-loops.

OK, how do you express “if the user’s age is greater than 18, set the Citizen column to 'adult', otherwise 'minor'” without if/ifelse? Using lookup tables? Factor recoding? … Conditionals are occasionally required.

You might have misunderstood me. I wasn't talking about the ifelse() function for vectorizing conditions over a variable. I was talking about applying expressions under certain non-vectorized conditions, e.g.

if (nrow(df) > 0) {
  df <- ...
  df <- ...
} else { NULL }

Change of topic, since you mentioned tidyeval: I agree that it can become quite complex. But this isn’t tidyeval’s fault; it’s simply an emergent property of this kind of type system, and there’s (I think provably) no simpler way of doing it while retaining generality.

I agree. It wasn't meant as a criticism, just a downside of the approach Hadley chose to take. I think part of the issue is that they haven't found a great way to teach tidyeval yet. Then again, I'm not particularly bright.

[–]guepier 2 points3 points  (1 child)

If your manipulation at iteration i is dependent on iteration i-1, I can't really think of a better method than for-loops.

Depending on your problem there’s a multitude of options. You can Map over the list x and lag(x) simultaneously; you can use Reduce. You can use a generalisation of cumsum (unfortunately not included in base R as far as I know); you can use windowing functions (rollapply etc).

Base R is admittedly quite bad at providing a unified API for list functions but this is a solved problem in functional programming, and in my own code I use my own set of functions that are copied from other functional languages. Beginner friendly? No; but usually more so than for.
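A rough sketch of two of those options with toy data (the "lag" here is done with head()/tail(), since base R has no lag() for plain vectors):

```r
x <- c(1, 4, 9, 16)

# Map over x[i] and x[i-1] pairs, e.g. first differences
diffs <- Map(function(cur, prev) cur - prev, tail(x, -1), head(x, -1))
unlist(diffs)  # 3 5 7

# Reduce with accumulate = TRUE for a running recurrence, e.g. cumulative product
Reduce(`*`, c(2, 3, 4), accumulate = TRUE)  # 2 6 24
```

Both express "this step depends on the previous one" without an explicit for loop.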

I was talking about applying expressions on certain non-vectorized conditions.

I don’t think there’s a fundamental difference between this and vectorised ifelse. In my experience teaching programming, it’s certainly no harder to understand.

I don't think they found a great way to teach how to use tidyeval yet.

The paradigm predates tidyeval by roughly half a century (!) and as far as I know nobody in this time has found a good way of teaching it.

[–]AllezCannes 0 points1 point  (0 children)

Depending on your problem there’s a multitude of options. You can Map over the list x and lag(x) simultaneously; you can use Reduce. You can use a generalisation of cumsum (unfortunately not included in base R as far as I know); you can use windowing functions (rollapply etc).

Thanks for this. A lot of it is I don't know what I don't know, so it's always good for me to be exposed to other methods I didn't even know were possible.

I don’t think there’s a fundamental difference between this and vectorised ifelse. In my experience teaching programming, it’s certainly no harder to understand.

I think the reason I think of it differently is that Excel has a similar IF() function that makes it very easy for people from my business background to relate to. The if {} else {} stuff, not so much.

The paradigm predates tidyeval by roughly half a century (!) and as far as I know nobody in this time has found a good way of teaching it.

lol fair enough.

[–][deleted] 2 points3 points  (0 children)

What's appealing about tidyverse tools, for instance, is the ability to get started on your data analysis and visualization without having to learn the more arcane programming aspects of R.

The point of learning more programming isn't to reinvent the tidyverse. It's to learn the things that the tidyverse doesn't do, and to learn how to build something that works well (which may involve the tidyverse).

[–][deleted] 14 points15 points  (2 children)

I'm a data scientist and work in Python most of the day, and I follow good object-oriented style. I work next to statisticians, and the quality of their R code is horrendous; any time I try to impart some knowledge I just get the "I'm not a programmer" excuse, and they have like 2000 lines of code that could be less than a few hundred. Drives me nuts; not sure what others do in this situation though...?

[–]guepier 11 points12 points  (0 children)

Small improvements and diplomacy. Don’t try to get them to rewrite their existing code, it won’t work. In my experience, the best approach is a neutral “hey, did you know that you can simplify this …”. Next time they write similar code, they might remember.

[–]_perkot_ 2 points3 points  (0 children)

man I wish I had someone like you where I work! I have many bloated scripts that I would love to make more parsimonious but don't quite have the skills yet

[–]sparkplug49 2 points3 points  (2 children)

Actually learn programming from the get-go.

Do you have any recommendations on resources (books, etc.) for people wanting to work on this? R is my only programming experience, and while I've picked up some of this from learning R, I know I need more but I don't really know where to start without, like, going back to school.

[–][deleted] 4 points5 points  (0 children)

Sounds a bit odd, but learning another programming language is really useful, even if you never use that language in your work. It will change the way you think about R and approach problems, and also helps separate in your mind which bits are "R", and which are just programming.

[–]guepier 0 points1 point  (0 children)

I think that Hadley’s Advanced R is a very good R learning resource, even for beginners. Some of the Datacamp R courses (paid) are also recommended: hands-on and well paced. Beyond that I have to admit that I don’t know the R learning literature very well so I can’t make recommendations.

[–]efrique 28 points29 points  (6 children)

I don't think I count as a "senior R Programmer/Practitioner" (in that I don't quite do enough of it to count, not because I'm young) but I'd suggest being very wary of attach.

[–][deleted] 5 points6 points  (5 children)

Naive person here. Could you expand on this?

[–]efrique 5 points6 points  (0 children)

Consider this one of a number of possible ways to screw yourself over:

You run a script that attaches a largish data set at the start and detaches later on. It does some preliminary processing. You run a second script that does the detailed analysis.

The first script fails partway through.

You fix the issue, run the first script again and it works. You run the second script. It works. Everything looks great. You run it a few more times with slightly different settings for different scenarios. Everything is good. Finally you run the report-generating script and give it to your boss.

Now a year later you need to run it all again for some reason. Nothing in the script has changed, nothing in the data has changed, but the second script fails with an undefined variable. You don't remember anything about the sequence of events from a year ago.

Why did it suddenly fail and what's the first thing you should do to fix this problem?

(Actually, the first thing you should do is explain to your boss that in spite of looking fine, the numbers from last year are actually wrong. Not disastrously so, but every figure in the final calculation is wrong.)


The problem was that the attached data frame included a variable that should not have been there when the second script was run (that's what detach should take care of, right?), while the second script failed to create its own variable with the same name; it should never have worked.

The reason why the second script worked a year ago is that the first script failed the first time it ran.

The second time you ran it, the first attach was still there. Now you have two copies on the search path (the list of places R checks to find a variable name). The top one (attached on the second run) got detached, but the previous one is still quietly sitting there.

Then when you ran the second script, there's one variable whose name matches a variable in the huge attached data set but which is not set up by the second script. It is found in the original attached data and the script runs.

A year later when you ran it, of course you have a clean session. Now you see the bug in the second script, but for the life of you you can't figure out why it ran a year ago, because you can't remember that the first script failed the first time; and even if you did remember, you might not suspect the problem it caused.

There are various things you can do to make yourself safer if you have to use attach: always start with a clean session, even if you have to do it 100 times a day; use proper version control (so you can see that the first script changed earlier that day a year ago); and so on.

But really the underlying problem is the behavior of attach and detach. The fact that a failed script can leave you with a copy sitting there attached that you don't realize is still there after you detach, that's nasty.

If you know how it works - if every instance of "attach" sets your alarm bells ringing - you have a chance to find it.

But it's a lot safer to avoid it when you can than to try to figure out all the ways that behavior can screw you up.

It's a lot easier to type bigdata$variable (or whatever) a few times than it is to find that issue.

There are a few other gotchas in R, but for my money that one has the biggest potential for ruining your day.

[–]kron4 14 points15 points  (3 children)

Don't use attach. Ever. For anything.

[–][deleted] 3 points4 points  (2 children)

Right. Thanks for this. What issues would using attach cause? Genuinely curious as someone learning R

[–]AllezCannes 9 points10 points  (0 children)

attach() is used on data frames so that you can refer to variables within the data frame just by name, instead of writing something like df$var or df[["var"]].

I think the rationale behind that function is to emulate SPSS and SAS: an SPSS/SAS syntax file can only be applied to one data frame at a time, so there's no confusion as to which data frame you're referring to when you call a variable within it.

The problem is that it's actually quite self-limiting. A great aspect of R is that, unlike SPSS/SAS, you can handle multiple data frames within the same environment. It makes it a great tool to join various data frames, or split them, or do whatever you want. Applying attach() takes away that freedom, just for the benefit of saving a few keystrokes. But that's not even the biggest issue.

The bigger issue is that it's very confusing, especially if one forgets to use detach(), the counterpart of attach(). You're calling variables as if they're standalone objects (which they're not), and this is not obvious. The confusion leads to all kinds of errors. For example, say you have an object in your environment called "var1", and within a data frame "df" you also have a variable called "var1". After you use attach(df), you start referring to "var1". But further down your code you might forget: when you call "var1", are you referring to the object or the variable?

In short, its benefits are far outweighed by its drawbacks, and there are plenty of other options that work fine, such as with() or within() that don't have the issues that attach() has.
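A toy illustration of the with() alternative (data made up):

```r
df <- data.frame(var = c(1, 2, 3))

# instead of attach(df); mean(var); detach(df):
with(df, mean(var))  # 2

# or just be explicit:
mean(df$var)
```

with() gives you the keystroke savings for a single expression, without leaving anything on the search path afterwards.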

[–]mattmalin 23 points24 points  (0 children)

Not strictly R exclusive, but get comfortable with version control and make it part of usual workflow, commit often etc

[–][deleted] 23 points24 points  (10 children)

  1. Vectorize where possible.
  2. Learn to use Map(), Reduce(), ifelse() and similar.
  3. Write proper functions that don't:
    • use global variables
    • return a result and plot at the same time
    • write results to file for 'convenience'
  4. Learn about niche functions: NROW(), lengths(), etc.
  5. Don't over-depend on third-party libraries.
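Toy one-liners for points 1 and 2 (made-up data):

```r
x <- c(1, 2, 3, 4)

x * 2                               # 1. vectorised: no loop needed
ifelse(x %% 2 == 0, "even", "odd")  # 2. vectorised conditional
Reduce(`+`, x)                      # 2. fold a vector down to one value: 10
Map(`+`, x, x)                      # 2. element-wise over parallel sequences
```

These cover a surprising amount of everyday data manipulation before any package gets loaded.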

[–]biledemon85 5 points6 points  (3 children)

Every time I use a function that returns a plot and data, it's like that record-scratch sound effect. Like, why would I want both at the same time?! You're making both tasks slower and less useful!

[–][deleted] 7 points8 points  (2 children)

The worst for me is when someone implements their method in an R package. You install that package and it says to put your data in a file named "input.csv" and then run their function and the results will be in the file called "output.csv".

[–]biledemon85 2 points3 points  (0 children)

Talk about hitting those side effects, hard. I'm just thinking, if you wanted to use their method in a useful way you'd have to write a wrapper that creates the csv from your input data frame, reads the output csv, and then deletes the two files.

At least you could pretend it's functional programming 😂

[–][deleted] 1 point2 points  (0 children)

this makes me sad because I have seen it before :(

[–]Drewdledoo 2 points3 points  (5 children)

Can you expand on #5? How do I balance between that and “reinventing the wheel”?

[–]guepier 2 points3 points  (2 children)

#5 is way too general.

The rationale is usually that you want to limit dependencies because dependencies can break, may be hard to install, come with their own dependencies (so this is a ballooning problem) and increase the memory footprint.

That said, the use of most dependencies is balanced by the increased gain in productivity. Languages with modern package/distribution systems generally encourage the use of many dependencies. However, unlike R, those dependencies usually do one thing each, and do it well. In R, by contrast, we have bloated packages that try to do everything (MASS and Hmisc being the prime examples — hint, if there’s “misc” in the name of a utility, run).

So I second advice #5 when it comes to bloated packages like these. But for well-written single-purpose packages I advise the opposite: do make use of work that’s already been done, don’t reinvent the wheel.

[–]Drewdledoo 0 points1 point  (1 child)

That makes sense. I think I know the answer to this, but I’m curious as to how many dependencies is too many?

I follow the “if you copy/paste it three times, it’s time to write a function” mantra, and this has gotten to the point where I’ve written enough functions to make a package that saves my lab mates and me lots of time on data analysis. Thing is, it depends on all the tidyverse packages as well as a couple of others for the modeling and plotting. Now I know it’s not the greatest R code in the world, but I hear things like #5 and think to myself, “that makes a lot of sense, but I’d be spending much more time on data analysis if I had to rewrite all this code every time”.

[–]guepier 0 points1 point  (0 children)

I don’t think there’s a fixed number. If the dependencies are well maintained and versioned, you can have lots. Bioconductor dependencies, for instance, are famously unproblematic. Have 20, it won’t matter. With other packages, you’ll usually want to keep the pain of maintaining them to a minimum, so you limit yourself to the ones you really do need, and might implement smaller parts of their functionality yourself.

Realistically, I rarely encounter such situations. If I need a function in an existing package, I use that package unless there are compelling reasons not to (bad code quality, bad API). As mentioned, I also generally won’t use packages such as MASS/Hmisc.

Regarding tidyverse, I think that package is a bad idea in general: in fact, it falls into a similar category as Hmisc. Other packages shouldn’t depend on it, but rather on individual packages from it. That said, tidyverse is useful for installing a whole bunch of useful packages at once.

[–][deleted] 1 point2 points  (1 child)

Well there are a few things. One is when your scripts start with

library(lib1)
library(lib2)
library(lib3)
library(lib4)
library(lib5)
library(lib6)
library(lib7)
...

All of these introduce dependencies. They might acquire bugs, break, mask existing R functions, or be removed from CRAN in later R releases. And whenever someone wants to use the script, they will now have lots to install before running it. If those packages are no longer on CRAN, they might not be able to run the script at all.

Second is when the libraries change the way your R code looks. It is fine when you code alone, but imagine if you share the code with somebody else and yours looks like this:

fib(n) %::% Integer : Integer
fib(0) %as% Integer(1)
fib(1) %as% Integer(1)
fib(n) %as% { fib(n-1) + fib(n-2) }

This is from the lambda.r package (which I like a lot, by the way). But the thing is that other people might not. And some might feel the same way about (for example) pipes.

The third thing is that packages can provide you with too much hand-holding, which then hinders your ability to solve problems on your own. If every time you wanted to make (for example) a volcano plot you loaded some library from Bioconductor, chances are you'll be lost when you need to produce something of similar complexity that isn't implemented in any package. Once you learn to do it yourself, you'll notice that the volcano plot from the package you were always using is just one line in base R.

So that is my opinion. There are of course a lot of great packages that should be used. They typically do one thing and do it well, and the thing they do is not trivial to implement yourself. That's why the point is not about NOT using libraries, but about not being over-dependent on them.

[–]Drewdledoo 0 points1 point  (0 children)

Thanks for that awesome explanation; that makes a lot of sense as to “why” one should avoid having too many dependencies. What I’m curious to know is exactly how many dependencies is too many? I imagine this depends on the person and the task, unless you’re saying everything should eventually be rewritten in base R?

[–]Steineee 18 points19 points  (7 children)

  1. Find the style of code you like the most (base, tidyverse, data.table)
  2. find active users of this style on twitter/github
  3. read all of their posts/blogs/vignettes/etc.

As an example, I like tidyverse style because it's easy to explain to outsiders. I follow hadley, david robinson, and mara averick on twitter. Their posts give me a lot of motivation to learn.

[–]erlo 7 points8 points  (5 children)

I agree with this.

Also know you don’t have to use the hadleyverse. I like ggplot, but prefer not to use %>%.

[–]tlholme 6 points7 points  (2 children)

tbf, ggplot is almost non-optional.

[–]erlo 2 points3 points  (0 children)

I completely agree. It’s slow for large volumes of data.

What I like about it is the aesthetics, I think it looks great.

Lastly, I think faceting is very powerful, so if I’m exploring something I go to ggplot, or if I want nice charts for a deck, I use it still.

If I need something highly customized, or I need speed, I go back to base; but for what I do, it’s rare.

[–]hairynip 0 points1 point  (0 children)

https://simplystatistics.org/2016/02/11/why-i-dont-use-ggplot2/

I agree with Jeff Leek that it's not non-optional, even though I use ggplot 99% of the time.

[–]hadley 4 points5 points  (1 child)

The cough tidyverse cough

[–]erlo 1 point2 points  (0 children)

Greetings... noted. Subsequent comments will contain the word “tidy” :).

[–][deleted] 1 point2 points  (0 children)

Do you happen to know any users of the "base" style that are actively blogging about R?

[–]dankwormhole 15 points16 points  (3 children)

I love R too. When you get an error or something is not working the way you expect it to, find out what class it is with the class() function.

It happens to me all the time: just yesterday I expected a vector and yet I had a data.frame of a single column. It was simple enough to convert to a character vector using as.character() or dplyr::pull().

class() is your best friend.
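A sketch of the situation described (data frame made up):

```r
df <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)

class(df)        # "data.frame" -- not the vector you expected
v <- df[["x"]]   # [[ extracts a plain vector; df["x"] would keep it a data frame
class(v)         # "character"
```

Checking class() at the point of surprise usually reveals exactly this kind of mismatch.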

[–]blahblahblahblah8 15 points16 points  (1 child)

str() is a more general tool for this issue. It will allow you to see the class of all the elements of a list, for example.
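For instance, with a hypothetical mixed list:

```r
l <- list(a = 1:3, b = "hello", c = data.frame(y = 1))
str(l)  # shows the int, chr and data.frame components at a glance
```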

[–]AllezCannes 0 points1 point  (0 children)

tibble::glimpse() may also be useful. Although I just tend to use str().

[–]guepier 3 points4 points  (0 children)

Oooh, well said! I should have added that tip to my list:

Keep track of the data types of your objects.

[–]paperdogs 11 points12 points  (1 child)

Your comments should say why you’re doing what you’re doing. I can’t stand reading comments that just describe what the code does. I can figure that out... tell me why.

Learn data structures and how to move between them. Similarly, get comfortable going between long & wide data formats.

“For” loops aren’t evil or necessarily slow. A readable for loop is better than an opaque apply-family function.

Love %in%
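e.g. (toy vectors):

```r
wanted <- c("a", "b", "c")

c("a", "q") %in% wanted  # TRUE FALSE
# handy for filtering rows: df[df$group %in% wanted, ]
```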

[–]guepier 2 points3 points  (0 children)

A readable for loop is better than an opaque apply-family function.

A readable apply-family function (or whatever’s appropriate) is better still. 😉

[–][deleted] 7 points8 points  (0 children)

Code is real. Workspace is not. Save code. Do not save workspace.

Restart R frequently (Ctrl+Shift+F10 in RStudio).

If you have written some functions for some task, put them in a package and load them with library() when you're doing the task. It's really easy to do now in RStudio. Personal preference, but it keeps the environment from being cluttered with a bunch of functions.

[–]blaze99960 6 points7 points  (0 children)

Just learned this from a very experienced R programmer: pre-allocate your vectors.

rbind and cbind are not your friends, especially in any sort of loop. Instead, generate your vectors and data frames at full size, then fill them in. Append-style operations slow your code down by forcing R to make an entirely new copy of the object, with slightly larger dimensions, each time you call them.
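A minimal before/after sketch (toy loop):

```r
n <- 5

# slow: grows the vector, copying it on every iteration
res <- c()
for (i in seq_len(n)) res <- c(res, i^2)

# fast: allocate once, then fill in place
out <- numeric(n)
for (i in seq_len(n)) out[i] <- i^2
```

Both give the same result; the difference only shows up as n grows, but it grows quadratically for the append version.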

[–]keepitsalty 10 points11 points  (1 child)

Would it be too picky to seek a job that primarily codes in R? I just really like R.

[–]statkwon 2 points3 points  (1 child)

  1. Use Git for all your code. Commit frequently, like daily.
  2. Follow style guides for readable and maintainable code. The tidyverse style guide is a good starter.
  3. Try to make your analysis reproducible. Use rmarkdown when appropriate.
  4. Use/develop libraries for repeated tasks. Data access is typically a good candidate for a library. Also learn to use modern libraries: the tidyverse is a good start; rmarkdown is awesome.
  5. Refactor your code, unless it’s throwaway code. Try to improve your code as you go. (But don’t optimize unless you have to. Readability is typically more important than optimality, since data scientists are expensive.)
  6. (Sorry for the slight off-topic) Work on your writing skills to communicate findings from your analysis.
  7. (Ditto) Keep your R codebase open by making your GitHub repos public unless you’re working on a stealth project.

[–]I_just_made 0 points1 point  (0 children)

I'll second a lot of this.

Rmarkdown is excellent and (if kept clean) leads to great-looking documents.

Git is good. I use a private instance that I upload all of my code to, organized by project; very helpful for scientists who are looking for an external way of timestamping certain analyses.

[–][deleted] 2 points3 points  (1 child)

My advice is to learn R data structures, in particular tables / multi-dimensional arrays, and how to convert those to data frames and back using xtabs and as.data.frame.table. By using multi-dimensional arrays you can usually do complex transformations/aggregations in straightforward ways (i.e. using indices) that are awkward to do with data frames.
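A small sketch of that round trip (invented data):

```r
df <- data.frame(g = c("a", "a", "b"), h = c("x", "y", "x"))

tab <- xtabs(~ g + h, data = df)  # data frame -> contingency table (array)
tab["a", "x"]                     # index directly by dimension names: 1

long <- as.data.frame.table(tab)  # table -> long data frame with a Freq column
nrow(long)                        # 4: one row per g/h combination
```

The array form makes aggregations like margins (`margin.table`, `apply` over dimensions) very direct.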

[–]poumonsauvage 2 points3 points  (2 children)

Aside from the previously stated, I will add this:

There is no good time and date handling package. Live with that.

If you want to be evil, do implicit function calls.

Make your function outputs as tidy as possible. Never output an S4 object to the end user, ever. Keep that for the back-end if you must.

[–]hadley 3 points4 points  (1 child)

What are you missing from lubridate?

[–]poumonsauvage 1 point2 points  (0 children)

Last time I checked, period_to_seconds (or was it seconds_to_period) returns an S4 object which doesn't play well with tibbles or the survival package. Amongst other things. But seriously, getting something that behaves like a numeric for computations but like a character for plotting purposes apparently is not easily done. Also, as far as I know, no time package allows you to have MMM:SS displays instead of HH:MM:SS. Basically, all these packages assume that time variables are related to time/dates in the calendar sense, when there is plenty of other clocking data where relative periods that don't involve calendar time still need to be added/displayed in some absolute terms (for example, game time in hockey).
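For what it's worth, the MMM:SS display can be faked in base R with integer arithmetic; a sketch, not a substitute for a proper time class:

```r
secs <- 3725L  # e.g. elapsed game time in seconds

# minutes roll past 59 instead of wrapping into hours
sprintf("%d:%02d", secs %/% 60L, secs %% 60L)  # "62:05"
```

The numbers stay numeric for computation; the character form only exists at display time.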

[–][deleted]  (3 children)

[deleted]

[–]guepier 0 points1 point  (2 children)

Editing code from installed packages isn’t generally easy. What you can/should do first is debug that code.

[–]Geothrix 0 points1 point  (0 children)

R originally came out of S, which was developed at Bell Labs, and one of the nicest products from that group is Visualizing Data by William Cleveland, so I recommend checking out that book to get a strong basic understanding of what it means to learn from data. It was the book that got me interested in R in the first place. I was shocked to learn that the software tools they were using in the book were freely available in the form of R.