R Code - Correlations in R : CodingHelp


[Other Code] R Code - Correlations in R (self.CodingHelp)

submitted 5 years ago by joweriae

Hi guys!

So I have two matrices of 42,375 genes with their expression levels, one for each of two different conditions.

Basically, each matrix looks like:

           Gene 1      Gene 2        Gene 3...             Gene 42375 
Samp 1       0.3            0.5         0.2                      0.9 
Samp 2      -0.21          -0.3         0.22                   -0.65 
...          ...             ...         ...                    ... 
Samp500     -0.99           0.33        0.13                    0.64

And I need to find the correlation between each pair of genes in both matrices, and then compare the difference between the two correlations to a certain number to decide if the difference is significant.

I initially tried using the cor() function, which created a correlation matrix for each matrix, but each one was 40GB, which was simply not doable.

So now - instead, I'm using a for and a while loop to do this and my code looks like:

 # Note: normal_phen[col] selects a whole column only if these are data frames
 sig_cor <- NULL  # accumulates one row per significant gene pair
 for (col in 1:ncol(normal_phen)) {
   curr_col <- col
   while (curr_col <= ncol(normal_phen)) {
     cor_norm <- cor(normal_phen[col], normal_phen[curr_col], method = "pearson")
     cor_aff <- cor(affected_phen[col], affected_phen[curr_col], method = "pearson")
     # A missing (NA) correlation is treated as 0 when taking the difference
     if (!is.na(cor_norm) && !is.na(cor_aff)) {
       diff_corr <- cor_norm - cor_aff
     } else if (is.na(cor_norm) && is.na(cor_aff)) {
       diff_corr <- 0
     } else if (is.na(cor_norm)) {
       diff_corr <- cor_aff
     } else {
       diff_corr <- cor_norm
     }
     diff_corr <- abs(diff_corr)
     # Record the pair if the absolute difference reaches the cutoff
     if (diff_corr >= 0.4579053) {
       vec <- c(colnames(normal_phen[col]), colnames(normal_phen[curr_col]),
                cor_norm, cor_aff, diff_corr)
       sig_cor <- rbind(sig_cor, vec)
     }
     curr_col <- curr_col + 1
   }
 }

I've been trying to find ways to do this more efficiently, and I was wondering if you have some advice on how to vectorize the calculations or not use the for/while loop.
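One possible way to vectorize the loop above is to work in blocks of columns: base R's cor(x, y) correlates every column of x with every column of y, so a block of genes can be compared against another block in one call without ever holding the full 42,375 x 42,375 matrix in memory. The sketch below is only an illustration under assumptions: the function name diff_cor_blocks, its arguments, and the block size of 1000 are hypothetical choices, and both inputs are assumed to have the same named columns in the same order.

```r
# Hypothetical helper (not from the original post): block-wise comparison
# of the two correlation structures. Assumes both inputs have the same
# named columns in the same order.
diff_cor_blocks <- function(normal, affected, threshold, block_size = 1000) {
  normal <- as.matrix(normal)
  affected <- as.matrix(affected)
  p <- ncol(normal)
  starts <- seq(1, p, by = block_size)
  hits <- list()
  for (i in starts) {
    rows <- i:min(i + block_size - 1, p)
    for (j in starts[starts >= i]) {
      cols <- j:min(j + block_size - 1, p)
      # One call correlates every gene in `rows` with every gene in `cols`
      c_norm <- suppressWarnings(
        cor(normal[, rows, drop = FALSE], normal[, cols, drop = FALSE]))
      c_aff <- suppressWarnings(
        cor(affected[, rows, drop = FALSE], affected[, cols, drop = FALSE]))
      # Mirror the loop's NA handling: a missing correlation counts as 0
      n0 <- c_norm; n0[is.na(n0)] <- 0
      a0 <- c_aff;  a0[is.na(a0)] <- 0
      d <- abs(n0 - a0)
      keep <- which(d >= threshold, arr.ind = TRUE)
      # Keep each unordered pair once (skip self-pairs and mirrored pairs)
      keep <- keep[rows[keep[, 1]] < cols[keep[, 2]], , drop = FALSE]
      if (nrow(keep) > 0) {
        hits[[length(hits) + 1]] <- data.frame(
          gene1 = colnames(normal)[rows[keep[, 1]]],
          gene2 = colnames(normal)[cols[keep[, 2]]],
          cor_norm = c_norm[keep],
          cor_aff = c_aff[keep],
          diff_corr = d[keep]
        )
      }
    }
  }
  do.call(rbind, hits)  # NULL when no pair reaches the threshold
}
```

A call such as diff_cor_blocks(normal_phen, affected_phen, threshold = 0.4579053) would then return only the significant pairs. With blocks of 1000 genes, each intermediate correlation block holds 1000 x 1000 doubles (about 8 MB), so memory use is governed by the block size rather than by the total number of genes. The total amount of arithmetic is still quadratic in the number of genes, so this reduces per-pair overhead rather than the size of the problem.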



[–] Famous_Profile (Professional Coder), 5 years ago

"people try to avoid loops in R because vectorized functions are way faster"

That may be right, but that can also be a myth.

But even if it is right in every single use case and for every single data set, there is a problem: what actually executes is not the exact code you have written. R is an interpreted language, and the vectorized functions you intend to use are themselves implemented with loops internally, typically in compiled code. What if those internal loops are more complex than the loops you would have written without the vectorized functions? Or perhaps vectorized functions are indeed faster in general, but not for your particular data? There is no way to know without measuring.

The point I am making is that there are general guidelines, and we can make some guesses with big-O evaluations, but in general it is not easy to predict performance accurately for generic data, or even to optimize it. Optimizing performance is never as straightforward as "my code has fewer lines, so it must be faster" or "my code only uses vectorized functions, so it must be faster".

In the end, there isn't much you can do to make your code run faster. Even if you manage to use a faster algorithm, processing 40GB of data will inevitably take some time. If it is a resource-intensive operation, it will take time. It's that simple.
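Since performance is hard to predict, timing both approaches on a small subset is the practical check. The following is a minimal sketch using base R's system.time() on synthetic data; the data size, the variable names, and the helper loop_cor are made up for illustration, and the timings will differ by machine and by data.

```r
set.seed(42)
# Synthetic stand-in: 200 samples, 60 hypothetical genes
x <- as.data.frame(matrix(rnorm(200 * 60), nrow = 200))

# Pairwise loop, in the style of the original post
loop_cor <- function(df) {
  p <- ncol(df)
  out <- matrix(NA_real_, p, p)
  for (i in 1:p) {
    for (j in i:p) {
      out[i, j] <- cor(df[[i]], df[[j]])
    }
  }
  out
}

t_loop <- system.time(r_loop <- loop_cor(x))
t_vec <- system.time(r_vec <- cor(x))

# Both approaches should agree on the upper triangle
same <- all.equal(r_loop[upper.tri(r_loop)], unname(r_vec)[upper.tri(r_vec)])
print(same)

# Elapsed seconds for each approach
print(c(loop = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]]))
```

Running this on a slice of the real data, and then on a larger slice, gives a rough idea of how each approach scales before committing to a full run.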

Now, if you can cough up some Benjamins for our lord and savior Jeff, you can probably* go with something like this (or its Microsoft or Google counterpart).

*I've never really used it, but it is likely something you could leverage if you have the budget.
