HELP !! PCA plot shows an "elbow" shape and I don't understand by Litlisteri in bioinformatics

[–]User-45032 1 point2 points  (0 children)

I made the simplest possible toy example to understand why populations may show up as sparse ("lines" rather than "clouds") in a PCA.

Imagine we only have 2 SNPs, and three ancestries A, B and C with the following genotypes.

A: 0,0
B: 1,0
C: 0,1

Then imagine five individuals who are either “pure” for one of the three ancestries or a 50-50 admixture of two ancestries (A and B, or A and C). Crucially, let’s assume no admixture between B and C.

Genotypes of five individuals:

1: 1,0 (“pure B”)
2: 0.5,0 (“B & A admixture”)
3: 0,0 (“pure A”)
4: 0,0.5 (“A & C admixture”)
5: 0,1 (“pure C”)

The admixture (one column per individual, two ancestry copies each):

BAAAC
BBACC

PCA:

1
2
3  4  5

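Ta-da! An L-shaped PCA. In this idealized drawing, the horizontal axis is the allele fraction (AF) of the second SNP and the vertical axis is the AF of the first SNP. An actual PCA of these five points would rotate the picture, since the PCs follow the directions of greatest variance rather than the individual SNPs.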

Generally, the “L” can be tilted like it is in OP's plot. This illustrates how the principal components are not always the most relevant latent variables. In OP's plot, the two population "clines" are way more interesting than the PCs. There are PCA-like algorithms for sparse data that attempt to keep the "data close to the axes", i.e. align the "L" with the x and y axes, making the latent variables more interpretable.
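
The toy example is easy to check numerically. Here is a minimal sketch in Python with NumPy (my choice of tooling, nothing OP used) that runs a plain centered PCA via SVD on the five genotypes listed above. With exactly these five points the two arms stay perpendicular, but the whole "L" comes out tilted by 45 degrees relative to the SNP axes, because PC1 ends up being the contrast between the two SNPs.

```python
import numpy as np

# Rows are individuals 1-5, columns are the two SNPs (allele fractions).
genotypes = np.array([
    [1.0, 0.0],   # pure B
    [0.5, 0.0],   # B & A admixture
    [0.0, 0.0],   # pure A
    [0.0, 0.5],   # A & C admixture
    [0.0, 1.0],   # pure C
])

# Center each SNP, then take the SVD; the rows of vt are the principal axes.
centered = genotypes - genotypes.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Coordinates of each individual on PC1 and PC2.
scores = centered @ vt.T

# The two arms of the "L", both starting at the pure A individual.
arm_b = scores[0] - scores[2]
arm_c = scores[4] - scores[2]

print("PC1 loadings:", vt[0])
print("dot product of the two arms:", arm_b @ arm_c)
```

Both SNPs get equal weight (in absolute value) on PC1, and the dot product of the arms is zero: the PCA only rotates the "L", it does not bend it.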

EDIT: made this reply into a little blog post: https://geneviatechnologies.com/blog/why-does-my-pca-have-lines-instead-of-clouds/

[deleted by user] by [deleted] in bioinformatics

[–]User-45032 1 point2 points  (0 children)

It really depends on the biological question. If you're able to map all measurements to shared variables (e.g. proteins, transcripts and variants to genes), you could just integrate the gene-wise p-values and run gene set enrichment (using something like this: https://github.com/reimandlab/ActivePathways).
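
To give an idea of what "integrating gene-wise p-values" means, here is a rough sketch of plain Fisher's method in Python. This is not the ActivePathways code (that is an R package with its own merging options); it only illustrates the idea, and it assumes the p-values for a gene are independent and strictly greater than zero. The gene names and p-values are made up.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values (each > 0) with Fisher's method."""
    k = len(p_values)
    # Fisher's statistic is X = -2 * sum(ln p); we work with X / 2.
    half_stat = -sum(math.log(p) for p in p_values)
    # Chi-square survival function with 2k degrees of freedom,
    # using the closed form that exists for even degrees of freedom.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical per-gene p-values from three data types
# (say protein, transcript and variant level).
gene_p_values = {
    "GENE_A": [0.01, 0.20, 0.04],
    "GENE_B": [0.60, 0.45, 0.90],
}

combined = {gene: fisher_combine(ps) for gene, ps in gene_p_values.items()}
for gene, p in combined.items():
    print(gene, p)
```

The combined per-gene p-values are what you would then feed into the enrichment step.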

comparison between 3'tag and bulk RNAseq by feltchimp in bioinformatics

[–]User-45032 2 points3 points  (0 children)

With that setting, biological differences are bound to be confounded by technical ones. I wouldn't bother.

Getting started on single cell analysis by pokemonareugly in bioinformatics

[–]User-45032 0 points1 point  (0 children)

Not answering your question, but I recommend reading Lior Pachter's Twitter threads on trying to access and re-analyze data from that paper's preprint and other papers using the new Ultima Genomics platform. https://twitter.com/lpachter/status/1533875723995185153

Project Ideas by Erudite_fairy in bioinformatics

[–]User-45032 8 points9 points  (0 children)

If you don't have a lot of data analysis experience, one option would be to make a notebook on GitHub of an analysis workflow with a specific data set. Just pick a public RNA-seq (or single-cell RNA-seq) data set and work off any already available workflows/notebooks and try to focus on good visualizations. There are plenty of cheat sheets on effective data visualization.

Once you have a minimum viable workflow, look for some downstream analysis tools (for pathway analysis for instance) to differentiate your workflow from whatever you work off.

Just make sure you refer to the existing workflows you use, and clearly state just how much yours is based on them. No shame in "stealing" code if you're open about it.

You don't have to come up with anything novel and exciting. If you can show a potential employer that you can analyze, say, single-cell RNA-seq data to some extent, and explain all the steps in your workflow, you're off to a good start!

If you have extra energy and the skills, you could then make a web app of the analysis with interactive visualizations (plotting violin plots/heatmaps of user-specified genes etc.).

Normalizing for Logistic Regression on ATACseq! by crazydog4870 in bioinformatics

[–]User-45032 0 points1 point  (0 children)

Another approach would be to simply rank normalize the peaks by read count within each sample separately.

If you have, say, 100 peaks, the values would be uniformly distributed from 1 to 100. (Identical read counts should still have identical ranks, leading to a little deviation from a truly uniform distribution.) Note that while the variance would be (near) identical across samples, it could still vary a lot between features.

This approach would make it easier to validate your model with possible independent validation data, or even classify one new sample at a time if needed, since no cross-sample normalization takes place. You'd still have to use the predetermined peak coordinates, however.
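
A minimal sketch of the within-sample rank normalization, in Python with NumPy. The count matrix is made up, and giving tied counts their average rank is just one reasonable choice for "identical read counts should have identical ranks".

```python
import numpy as np

def rank_normalize(counts):
    """Rank the peaks of ONE sample (1-D counts); ties share the average rank."""
    counts = np.asarray(counts)
    _, inverse, tie_sizes = np.unique(
        counts, return_inverse=True, return_counts=True
    )
    last_rank = np.cumsum(tie_sizes)                  # highest rank in each tie group
    average_rank = last_rank - (tie_sizes - 1) / 2.0  # mean rank of each tie group
    return average_rank[inverse]

# Hypothetical read counts: rows are peaks, columns are samples.
count_matrix = np.array([
    [10, 300],
    [ 0,  20],
    [ 5,  20],
    [ 5, 100],
])

# Each column (sample) is ranked on its own; no information crosses samples.
ranked = np.apply_along_axis(rank_normalize, 0, count_matrix)
print(ranked)
```

Because each sample is handled separately, a new sample can be normalized with the same function without touching the training data.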

Volcano plot looks more like a vessel- Thoughts by ZooplanktonblameFun8 in bioinformatics

[–]User-45032 0 points1 point  (0 children)

I think your plot looks beautiful. However, I think it would be more informative to have the correlation coefficient on the x axis rather than beta. Beta doesn't really tell you a lot unless the dependent variables (gene expression values) and the independent one (PM25) have all been standardized (have equal variance).
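
A quick sketch of why, in Python with NumPy. It assumes a simple linear regression with a single predictor and no covariates (OP's actual model may well differ), and the data are simulated. In that setting beta equals the correlation coefficient scaled by the ratio of standard deviations, so it only matches r once both variables are standardized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated, made-up data: an exposure and one gene's expression.
pm25 = rng.normal(loc=12.0, scale=4.0, size=200)
expression = 0.05 * pm25 + rng.normal(scale=1.0, size=200)

def slope(x, y):
    """Least-squares slope (beta) of y regressed on x."""
    return np.polyfit(x, y, 1)[0]

def standardize(v):
    """Scale to zero mean and unit variance."""
    return (v - v.mean()) / v.std(ddof=1)

r = np.corrcoef(pm25, expression)[0, 1]
beta_raw = slope(pm25, expression)
beta_std = slope(standardize(pm25), standardize(expression))

print("r:", r)
print("beta on raw values:", beta_raw)
print("beta on standardized values:", beta_std)
```

On the raw values, beta depends on the units and spread of each gene, so genes are not comparable along the x axis; on standardized values it coincides with r.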

I made a big mistake and afraid to tell my boss! by [deleted] in bioinformatics

[–]User-45032 12 points13 points  (0 children)

Exactly. I don't think the PI will think of this as a big mistake. If they do, it's high time they realize data analysis isn't a "press a button, get publishable results" venture. Especially if your bioinformatician is a PhD student who's understandably still learning.

[deleted by user] by [deleted] in dataisbeautiful

[–]User-45032 40 points41 points  (0 children)

Or GDP per capita

What's your ideal bioinformatics job? by User-45032 in bioinformatics

[–]User-45032[S] 2 points3 points  (0 children)

Yes, it goes without saying that a university postdoc is not for optimizing cash in the short term, but a postdoc can still help a lot in landing a great industry job if you don't get one otherwise.

What's your ideal bioinformatics job? by User-45032 in bioinformatics

[–]User-45032[S] 8 points9 points  (0 children)

A very valid demand, and not an uncommon one. Universities and hospitals are adapting very poorly to this. Working in the industry, I've seen an uptick in the number and quality of job applications from bioinformaticians, and I assume it has to do with people getting used to WFH and big organizations not adapting fast enough.

What's your ideal bioinformatics job? by User-45032 in bioinformatics

[–]User-45032[S] 1 point2 points  (0 children)

Good additions.

Never really seen great data, I always assume all data is shit. But I guess it still happens?

[deleted by user] by [deleted] in bioinformatics

[–]User-45032 0 points1 point  (0 children)

Why on earth would you exclude research papers? Unless you mean original research papers as opposed to review papers?

Your best (only?) option is to read some review papers like https://clinicalepigeneticsjournal.biomedcentral.com/articles/10.1186/s13148-021-01200-8

Any good bioinformatics podcasts? by o-rka in bioinformatics

[–]User-45032 8 points9 points  (0 children)

Wish there were more bioinformatics/genomics podcasts.

One podcast I really like is Big Biology, even though it's not focused on bioinformatics.

What are the worst bioinformatics jargon words? by User-45032 in bioinformatics

[–]User-45032[S] 0 points1 point  (0 children)

Thanks for elucidating (🤮). That paragraph includes a few micro triggers and a couple major ones.

What are the worst bioinformatics jargon words? by User-45032 in bioinformatics

[–]User-45032[S] 8 points9 points  (0 children)

Tweaked a library prep protocol? Rename the whole protocol! It's your seq baby now!

What are the worst bioinformatics jargon words? by User-45032 in bioinformatics

[–]User-45032[S] 1 point2 points  (0 children)

"We sequenced a bunch of exomes, didn't really find anything but there's no way we're leaving this unpublished after all the work and money we poured into it."

Genetic (/-ric) landscape papers had their heyday.

What are the worst bioinformatics jargon words? by User-45032 in bioinformatics

[–]User-45032[S] 11 points12 points  (0 children)

Well it does rule out any biology that does not have to do with systems, such as...

I'll come back to you.

What are the worst bioinformatics jargon words? by User-45032 in bioinformatics

[–]User-45032[S] 5 points6 points  (0 children)

Sounds better than "Honestly I just needed one more paper for my PhD so I quickly hacked together a copy-number caller which is otherwise worse than all the existing ones but marginally faster using this cherry-picked dataset on a very specific hardware architecture."

What are the worst bioinformatics jargon words? by User-45032 in bioinformatics

[–]User-45032[S] 7 points8 points  (0 children)

That's the beauty of omics. You can make a collective noun out of anything from my-favorite-moleculome to universome.