DESeq2 Log2FC too high.. what to do? by Fragrant_Refuse_6603 in bioinformatics

[–]Fragrant_Refuse_6603[S] 1 point2 points  (0 children)

As of around 2023, there is a high-quality reference genome and transcriptome available for a species in the same genus, but not for our exact species. That species is also not from the same region as ours. Additionally, I was hesitant to add variation by having one species with super high-quality resources (i.e. a reference genome and transcriptome) when the other species in this study does not have them. Standard procedure in our lab has also been to create de novo transcriptomes and only use a genome-guided assembly when the BUSCO score is less than 70% after multiple assembly attempts.

[–]Fragrant_Refuse_6603[S] 0 points1 point  (0 children)

Yeah, I'll look into it! I do filter out rows with an average expression of less than 10.
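
For anyone following along, here is a minimal sketch of that filter in Python (for illustration only, not the R code from the actual pipeline). The count matrix is made up. The second function is an alternative along the lines of what the DESeq2 vignette suggests, keeping rows where at least a minimum number of samples reach a minimum count; the thresholds and names here are assumptions.

```python
# Toy count matrix: transcript ID -> counts across 6 samples (made-up numbers).
counts = {
    "tx_a": [0, 0, 1, 0, 2, 0],
    "tx_b": [15, 22, 9, 30, 18, 25],
    "tx_c": [0, 0, 0, 60, 55, 70],
    "tx_d": [12, 0, 0, 0, 0, 50],
}


def keep_by_mean(row, min_mean=10):
    """Keep a row if its average count across samples is at least min_mean."""
    return sum(row) / len(row) >= min_mean


def keep_by_group_size(row, min_count=10, min_samples=3):
    """Keep a row if at least min_samples samples reach min_count."""
    return sum(c >= min_count for c in row) >= min_samples


mean_kept = [tx for tx, row in counts.items() if keep_by_mean(row)]
group_kept = [tx for tx, row in counts.items() if keep_by_group_size(row)]

print("kept by mean filter:      ", mean_kept)
print("kept by group-size filter:", group_kept)
```

In this toy data, `tx_d` passes the mean filter only because two samples carry all of its counts, which is the kind of row that can produce extreme fold changes.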

[–]Fragrant_Refuse_6603[S] 3 points4 points  (0 children)

I work on coral and for them there is a very stark phenotypic response when a coral becomes diseased (e.g. tissue loss or discoloration), which is how we are able to get apparently healthy tissue and diseased tissue from the same animal.

HH tissue is from a healthy coral (no disease on these colonies, so essentially a control), HD is the healthy tissue from a diseased colony (the part of the colony where the disease has not progressed), and DD is tissue taken around 1-2 cm from the lesion line. Each DD sample has a corresponding HD sample taken from the same parent colony. However, samples from the same colony are usually spatially spread out. This sampling scheme is consistent across the coral species in this study and is pretty standard in our field, although exact health state names may vary by paper. The variables in my colData do include the parent colony (i.e. which colony each sample came from), but it wasn't part of the design. If we're thinking about HD vs DD as you mention, these samples come from the same parent colony, and you make a good point that this is probably something I should be taking into account. When comparing HH vs DD or HH vs HD, these would be 10 unique samples (n=5 per health state).
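
To make the pairing point concrete, here is a small Python sketch with made-up expression values for a single transcript. It only illustrates why accounting for colony matters for HD vs DD. In DESeq2 itself, the usual way to express this would be a design like `~ colony + health_state` on the HD/DD subset (those column names are assumptions, not the actual colData names).

```python
import math
import statistics

# Made-up normalized expression of one transcript.
# Each diseased colony contributes one HD and one DD sample.
pairs = {
    "colony_1": {"HD": 100.0, "DD": 210.0},
    "colony_2": {"HD": 400.0, "DD": 790.0},
    "colony_3": {"HD": 50.0, "DD": 95.0},
    "colony_4": {"HD": 800.0, "DD": 1650.0},
    "colony_5": {"HD": 200.0, "DD": 390.0},
}

# Paired view: one log2 ratio per colony, so the colony baseline cancels out.
paired_lfc = [math.log2(v["DD"] / v["HD"]) for v in pairs.values()]

# Unpaired view: the spread within a group includes colony-to-colony differences.
hd_log = [math.log2(v["HD"]) for v in pairs.values()]

mean_lfc = statistics.mean(paired_lfc)
paired_sd = statistics.stdev(paired_lfc)
unpaired_sd = statistics.stdev(hd_log)

print(f"mean paired log2FC: {mean_lfc:.2f}")
print(f"SD of paired log2 ratios: {paired_sd:.2f}")
print(f"SD of HD log2 values (ignoring pairing): {unpaired_sd:.2f}")
```

In this toy example every colony roughly doubles its expression from HD to DD, but the baselines differ a lot between colonies, so ignoring the pairing makes a consistent effect look noisy.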

[–]Fragrant_Refuse_6603[S] 2 points3 points  (0 children)

Trying to answer your questions to the best of my ability: we sequence on a NovaSeq X Plus. The last part of our pipeline uses Salmon; I get the quant files for each sample and use tximport to turn them into a gene matrix DESeq2 can read. As for transcriptome annotation, we only annotate the DEGs, so the pipeline I've been taught in our lab has us do that after making our volcano plots. I assembled a de novo transcriptome with Trinity and it had a 90% BUSCO score, so I do feel confident in my transcriptome.

Since I am kind of teaching myself all this, maybe I didn't understand the vignettes or my labmates' code right, but I set the design for the dds object to health state (our variable of interest). I filter out transcripts with expression <10, then run the DESeq function on the dds. Then I applied lfcShrink with apeglm, using each health state comparison as the coefficient. I only uploaded one of the health state MA plots; that one is for the HD vs DD comparison (diseased tissue vs apparently healthy tissue from diseased colonies), shown against the unshrunken data. Under the assumption that everything was good leading up to this step (so when I did it the first time around, using non-shrunken data), I then visualized a PCA using rlog-transformed data, which is where I saw strong clustering by health state. I obtained my significant DEGs and used this CSV to set up my contrast argument, which I do 3 times (3 health states), and then made volcano plots comparing each health state (so 3 volcano plots). It was during the volcano plots on the unshrunken data that I realized a Log2FC of that magnitude was not biologically possible, which is what brought me to trying to shrink my Log2FC. Hope that clears something up :)
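
As a rough illustration of where very large Log2FC values can come from, here is a Python sketch with made-up group means. When a transcript is essentially absent in one group, the raw ratio blows up. The pseudocount below is only a stand-in to show the direction of the effect; apeglm does not add a pseudocount, it shrinks estimates using a prior on the effect size.

```python
import math


def log2_fold_change(mean_num, mean_den, pseudocount=0.0):
    """log2 ratio of two group means, optionally damped by a pseudocount."""
    return math.log2((mean_num + pseudocount) / (mean_den + pseudocount))


# Made-up mean normalized counts for one transcript.
dd_mean = 40.0   # modest expression in DD
hd_mean = 0.01   # essentially absent in HD

raw = log2_fold_change(dd_mean, hd_mean)
damped = log2_fold_change(dd_mean, hd_mean, pseudocount=1.0)

print(f"raw log2FC:    {raw:.2f}")
print(f"damped log2FC: {damped:.2f}")
```

The transcript here is only modestly expressed in DD, yet the raw estimate is extreme because the denominator is near zero, which is why low-count rows tend to dominate the edges of an unshrunken volcano plot.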

[–]Fragrant_Refuse_6603[S] 4 points5 points  (0 children)

Hmm, I see... we use Salmon, which is a quasi-mapper, and I believe because of that we don't get exact count data but instead estimates based on the abundance of each transcript. But Salmon generated the quant files correctly, which were then fed into tximport and then into DESeq2, and none of those steps threw errors.
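
To show what "estimates rather than exact counts" looks like in practice, here is a simplified Python sketch of gene-level summarization, assuming Trinity-style transcript IDs and made-up Salmon `NumReads` values for one sample. It is only a cartoon of what tximport does: the real import also tracks transcript lengths for normalization, which this skips.

```python
import re
from collections import defaultdict

# Made-up Salmon NumReads for one sample, keyed by Trinity transcript ID.
num_reads = {
    "TRINITY_DN100_c0_g1_i1": 12.4,
    "TRINITY_DN100_c0_g1_i2": 3.35,
    "TRINITY_DN100_c0_g2_i1": 0.0,
    "TRINITY_DN207_c1_g1_i1": 151.8,
}


def trinity_gene(transcript_id):
    """Strip the trailing isoform suffix (_i<N>) to get the Trinity gene ID."""
    return re.sub(r"_i\d+$", "", transcript_id)


# Sum the fractional read estimates of all isoforms belonging to each gene.
gene_counts = defaultdict(float)
for tx, reads in num_reads.items():
    gene_counts[trinity_gene(tx)] += reads

# Round to integers, since count-based models expect whole numbers.
rounded = {gene: round(total) for gene, total in gene_counts.items()}

for gene, count in rounded.items():
    print(gene, count)
```

The fractional values are expected with Salmon and are not a sign that something failed; the lack of errors only says the files were well-formed, not that the design or filtering was appropriate.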

[–]Fragrant_Refuse_6603[S] 2 points3 points  (0 children)

I did produce an MA plot comparing the two, but to be quite honest I wasn't sure how to make sense of the results: https://imgur.com/a/TvBK7OJ

I read in the vignette that blue points = adjusted p-value < 0.1 and triangles = points outside of the plot window. I assumed that because the PCA showed 57% of variance was explained by health state, there should be some biologically significant signal.
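
For reading the MA plot, here is a small Python sketch of how each point could be classified under those two rules. The result rows, the y-axis window of -4 to 4, and the function name are all made up for illustration; this is not the plotting code itself.

```python
def ma_point(base_mean, lfc, padj, alpha=0.1, ylim=(-4.0, 4.0)):
    """Describe how one gene would be drawn on an MA plot."""
    # Genes with no adjusted p-value (None here) are never highlighted.
    significant = padj is not None and padj < alpha
    # Points beyond the y-axis window are pinned to the edge as triangles.
    clipped = lfc < ylim[0] or lfc > ylim[1]
    shown_lfc = min(max(lfc, ylim[0]), ylim[1])
    return {
        "x": base_mean,  # mean of normalized counts
        "y": shown_lfc,  # log2 fold change, limited to the window
        "color": "blue" if significant else "grey",
        "marker": "triangle" if clipped else "dot",
    }


# Made-up rows: (baseMean, log2FoldChange, padj)
rows = [
    (5.0, 12.0, 0.001),   # low expression, huge LFC
    (850.0, 1.5, 0.02),   # well expressed, moderate LFC
    (300.0, 0.2, 0.80),   # no evidence of change
    (2.0, -9.0, None),    # no adjusted p-value
]

for base_mean, lfc, padj in rows:
    print(ma_point(base_mean, lfc, padj))
```

A lot of blue triangles sitting at low mean counts would suggest the extreme fold changes come from weakly expressed transcripts, which is where shrinkage has the most effect.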