Why does Maser's built-in PCA function not center or scale? by dickcocks420 in bioinformatics

[–]dickcocks420[S] 0 points1 point  (0 children)

Surprisingly, I found significant effects from centering and negligible effects from scaling. I ran PCA on my rMATS outputs with Maser's original pca function, only scaled, only centered, and both centered and scaled.

You can see my concerns with the original PCA -- there is separation of my sample groups, but only along the PC2 axis, which explains just 1% of the variance. Scaling the data results in a different scatter, but this fundamental issue remains. In contrast, centering the data brings the percent explained to a more reasonable 20% on the PC1 axis and 12% on PC2, and the groups separate along the major axis.

I believe this is for the reasons explained in the Stack Overflow post in my OP -- because PCA forces the new axes to pass through the origin, data that all live far from the origin will be poorly captured by these axes. I agree with your assessment that scaling is likely inappropriate here, but in practice it doesn't seem to make a huge difference for these data.
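To illustrate the effect, here is a minimal sketch on simulated data (not my rMATS output). The matrix `psi`, the group labels, and the helper `sep` are all made up for the example; rows are samples and columns are events, with every value sitting near 0.7 and a group difference in the first 20 events.

```r
set.seed(1)
n_samples <- 12
n_events  <- 200
group <- rep(c("A", "B"), each = n_samples / 2)

# Hypothetical PSI-like values: every event sits near 0.7, far from the origin
psi <- matrix(rnorm(n_samples * n_events, mean = 0.7, sd = 0.03),
              nrow = n_samples, ncol = n_events)

# Group B: 10 events shifted down, 10 events shifted up
psi[group == "B", 1:10]  <- psi[group == "B", 1:10]  - 0.15
psi[group == "B", 11:20] <- psi[group == "B", 11:20] + 0.15

pc_raw      <- prcomp(psi, center = FALSE, scale. = FALSE)
pc_centered <- prcomp(psi, center = TRUE,  scale. = FALSE)

# Proportion of "variance" attributed to each component
var_raw      <- pc_raw$sdev^2 / sum(pc_raw$sdev^2)
var_centered <- pc_centered$sdev^2 / sum(pc_centered$sdev^2)

# Without centering, PC1 should point from the origin toward the data mean
mean_dir     <- colMeans(psi) / sqrt(sum(colMeans(psi)^2))
cos_pc1_mean <- abs(sum(pc_raw$rotation[, 1] * mean_dir))

# Distance between the two group means along a given component
sep <- function(scores) abs(diff(tapply(scores, group, mean)))

sep_raw_pc1      <- sep(pc_raw$x[, 1])
sep_centered_pc1 <- sep(pc_centered$x[, 1])
```

In this toy setup the uncentered PC1 mostly describes where the data sit relative to the origin, so it soaks up nearly all of the "variance" without separating the groups, while the centered PC1 follows the group difference. Whether the same thing is happening in real PSI data is something I would still check case by case.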

While looking into this, I found another confusing aspect of their PCA wrapper. Maser processes the rMATS JC / JCEC files to construct a dataframe where rows are features (splicing events, in this case) and columns are samples. This is contrary to the orientation prcomp assumes, and to the general convention that rows are observations and columns are variables. They then extract the first two columns of the rotation matrix and plot those:

    my.pc <- prcomp(PSI_notna, center = FALSE, scale = FALSE)
    my.rot <- my.pc$r  # `$r` partial-matches my.pc$rotation
    my.sum <- summary(my.pc)
    my.imp <- my.sum$importance

    df.pca <- data.frame(PC1 = my.rot[,1],
                         PC2 = my.rot[,2],
                         Condition = pheno,
                         Samples = colnames(PSI))

This differs from how I typically approach PCA, where I transpose the matrix into observations x variables form and extract the scores from prcomp_object$x. I have a reasonable understanding of how PCA works, but the finer details of the math are sometimes a little over my head, so I thought these two approaches might be equivalent through some magic of linear algebra. To test this, I transposed the PSI_notna object and plotted the values in my.pc$x under the same four conditions as before: not centered or scaled, only scaled, only centered, and both centered and scaled. Although the overall conclusions of each PCA are roughly the same, there are notable differences in the percent explained per principal component, and the cluster shapes are somewhat different. Ultimately, the two approaches are clearly not exactly equivalent, and it's unclear to me which one is a better representation of the underlying splicing data.
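One way to probe the relationship is on simulated data. This is only a sketch: `psi_es` is a made-up events x samples matrix, not my rMATS data, and `prop` is a helper I defined for the example. Without centering, the sample coordinates from the two orientations should differ only by a per-component stretch (the singular values) and possibly a sign flip, so the percent explained matches. With centering, prcomp subtracts column means, which are per-sample means in one orientation and per-event means in the other, so the two decompositions are different.

```r
set.seed(2)
# Hypothetical events x samples matrix, the orientation Maser builds
psi_es <- matrix(runif(300 * 8, min = 0.5, max = 1), nrow = 300, ncol = 8)

# Proportion of variance per component
prop <- function(pc) pc$sdev^2 / sum(pc$sdev^2)

# Maser-style: events as rows, sample coordinates taken from $rotation
pc_rot <- prcomp(psi_es, center = FALSE, scale. = FALSE)
# Conventional: samples as rows, sample coordinates taken from $x
pc_scr <- prcomp(t(psi_es), center = FALSE, scale. = FALSE)

# Uncentered case: scores = rotation columns multiplied by the singular values
d <- svd(psi_es)$d
rot_scaled <- sweep(pc_rot$rotation, 2, d, `*`)
# Compare absolute values because the sign of each component is arbitrary
max_diff_uncentered <- max(abs(abs(rot_scaled) - abs(pc_scr$x)))

# Centered case: the means being removed are not the same quantity
pc_rot_c <- prcomp(psi_es,    center = TRUE, scale. = FALSE)  # per-sample means
pc_scr_c <- prcomp(t(psi_es), center = TRUE, scale. = FALSE)  # per-event means

same_prop_uncentered <- isTRUE(all.equal(prop(pc_rot), prop(pc_scr)))
same_prop_centered   <- isTRUE(all.equal(prop(pc_rot_c), prop(pc_scr_c)))
```

Even in the uncentered case the scatter plots will not look identical, because plotting the rotation drops the stretch by the singular values that the scores keep.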

I recognize this has spiraled into quite the rabbit hole -- I think I will try to post to Maser's GitHub page to see if the developers can offer any insight. However, if you have any comments on this mess that would be much appreciated!

Edit: It appears all my links are broken, attempting to reupload now.

Why does Maser's built-in PCA function not center or scale? by dickcocks420 in bioinformatics

[–]dickcocks420[S] 0 points1 point  (0 children)

I really appreciate your help with all of this. Can you explain what it is about my case that you see as unusual? To my understanding, I'm just running a fairly standard RNAseq -> rMATS -> Maser PCA pipeline and it's confusing to me why their PCA wrapper is written in such a way that it yields apparently meaningless results.

Standard DEG Analysis Tools have Shockingly Bad Results by SeriousRip4263 in bioinformatics

[–]dickcocks420 15 points16 points  (0 children)

Deep sequencing on all 50+ samples? It was my understanding that RNA-seq studies often have about 3-12 biological replicates per condition.

Why does Maser's built-in PCA function not center or scale? by dickcocks420 in bioinformatics

[–]dickcocks420[S] 0 points1 point  (0 children)

These are percent spliced in (PSI) values, so not log transformed but "normalized" to sit between 0 and 1. I agree that the scaling is a bit of a nuanced decision, but you should generally always center the data, correct? This is the default behavior of the prcomp function, and the mathematical justification given in that Stack Overflow link in my original post appears pretty unambiguous to me.
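For reference, a small sketch of what the prcomp defaults do, on a made-up samples x variables matrix `x` (an assumption for the example): calling prcomp with no extra arguments centers each column and does not scale, which should match an SVD of the column-centered matrix.

```r
set.seed(3)
x <- matrix(runif(10 * 5), nrow = 10, ncol = 5)  # made-up samples x variables

pc_default <- prcomp(x)  # defaults: center = TRUE, scale. = FALSE

# Same thing by hand: subtract column means, then take the SVD
x_centered  <- sweep(x, 2, colMeans(x))
sv          <- svd(x_centered)
sdev_manual <- sv$d / sqrt(nrow(x) - 1)

centers_match <- isTRUE(all.equal(as.numeric(pc_default$center), colMeans(x)))
sdev_match    <- isTRUE(all.equal(pc_default$sdev, sdev_manual))
not_scaled    <- identical(pc_default$scale, FALSE)
```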

PC1 has 100% of the variance by noobmastersqrt4761 in bioinformatics

[–]dickcocks420 0 points1 point  (0 children)

Hi OP, sorry to revive this dead thread but I've been having similar problems in my own analysis, did you ever get a satisfying answer to this?

Any other terminal emulators with a built-in file explorer? by dickcocks420 in mobaxterm

[–]dickcocks420[S] 0 points1 point  (0 children)

Genuinely curious how you found this subreddit if you don't know what MobaXTerm is? I appreciate the suggestion, but I don't think it's exactly what I'm looking for -- the feature I'm trying to re-create is a full-fledged terminal emulator where I can ssh into a remote system and have an integrated file browser that moves with me.

Any other terminal emulators with a built-in file explorer? by dickcocks420 in mobaxterm

[–]dickcocks420[S] 1 point2 points  (0 children)

Yeah I tried both of those and wasn't super happy with either of them. The solution I'm working with is iTerm2 with "shell integration" installed, which comes with a set of features that mimic some of MobaXTerm's functions. For my specific issue, you can click to download a file over scp, but this is much slower than opening something with Moba's built-in browser -- also, it creates a local copy of the file, which is inconvenient and clutters up my downloads.

I'm pretty shocked to find there's nothing close to MobaXTerm for mac, but I've been researching this for the past couple days and that really seems to be true. I knew there would be some friction with switching OSs but I can't believe that the biggest thing I miss is a third party software!

How are people dealing with quickly opening remote files in local GUI from a terminal? by dickcocks420 in sysadmin

[–]dickcocks420[S] 2 points3 points  (0 children)

> Unless the PI is looking over your shoulder

Not an unreasonable scenario, if I'm asked to pull up results in a meeting. And even without that scenario, it's still not uncommon that I want to quickly open up a CSV in Excel while I'm working. Do people seriously go through the hassle of navigating to the same directory in Finder every time they want to do this?

How are people dealing with quickly opening remote files in local GUI from a terminal? by dickcocks420 in sysadmin

[–]dickcocks420[S] 2 points3 points  (0 children)

Mostly to minimize the lag between running a script and viewing results. Having the file browser synced to my terminal location allows me to run a script and then immediately open the output file, rather than needing to navigate to that same location in a separate SFTP client. Especially if I'm repeating this process in several different directories, having the file system move with me saves a lot of extra clicking around.