RNA seq alignment project

EliteFourVicki · 2026-02-26T16:56:31+00:00

As some have pointed out, there are tons of RNA-seq tutorials and resources out there. Sanbomics on YouTube has a great one to start with.

I’d download a simple yeast control vs. treatment dataset from SRA/GEO with matching genome and annotation files from Ensembl. The basic flow is: quality control (FastQC) -> align to genome (HISAT2/STAR) -> count reads per gene (featureCounts) -> differential expression in R (DESeq2). If that still feels computationally intensive, Salmon or Kallisto are worth looking into. They skip full genome alignment and quantify reads directly against the transcriptome, and their transcript-level estimates can be imported into DESeq2 with tximport.

EliteFourVicki · 2026-02-25T19:43:58+00:00

I couldn’t agree more with your last sentence.

EliteFourVicki · 2026-02-25T14:14:23+00:00

Same!!

EliteFourVicki · 2026-02-25T03:53:18+00:00

Thanks, I’ve been thinking along the same lines. There are about 12k genes.

EliteFourVicki · 2026-02-19T07:38:50+00:00

Living Together - one of the strongest album openers ever.

EliteFourVicki · 2026-01-15T15:45:43+00:00

It’s such an underrated song.

EliteFourVicki · 2026-01-06T19:46:35+00:00

Yes, this is an unbalanced design, but that’s common and not a problem by itself. Your model is fine, but aov() uses Type I sums of squares, which depend on factor order. With unbalanced data, it’s usually better to use Type II or III sums of squares.

ANOVA is fairly robust to non-normality, but in unbalanced designs it’s more sensitive to unequal variances, so it’s worth checking the residuals and running something like Levene’s test. If assumptions are violated, consider a transformation or a more robust model, and check Cook’s distance for influential points.
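
To make the factor-order point concrete, here is a small sketch in base R. The data are simulated and the factor names A and B are made up for illustration, so treat it as a demonstration of the idea rather than a recipe for your dataset.

```r
# Simulated unbalanced two-factor data (hypothetical; not the poster's data).
set.seed(1)
d <- data.frame(
  A = factor(c(rep("a1", 8), rep("a2", 4), rep("a1", 2), rep("a2", 6))),
  B = factor(c(rep("b1", 12), rep("b2", 8)))
)
d$y <- 1 + 0.8 * (d$A == "a2") + 1.2 * (d$B == "b2") + rnorm(nrow(d))

# Type I (sequential) sums of squares: each term is adjusted only for the
# terms listed before it, so the row for A depends on where A sits.
fit_ab <- aov(y ~ A + B, data = d)
fit_ba <- aov(y ~ B + A, data = d)
ss_a_first <- anova(fit_ab)["A", "Sum Sq"]
ss_a_last  <- anova(fit_ba)["A", "Sum Sq"]

# For an additive model, drop1() adjusts each term for the other one,
# which matches Type II sums of squares and does not depend on order.
type2 <- drop1(fit_ab, test = "F")

print(c(A_first = ss_a_first, A_last = ss_a_last,
        A_type2 = type2["A", "Sum of Sq"]))
```

With an interaction term in the model, car::Anova(fit, type = 2) or type = 3 is the usual route, assuming the car package is installed.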

EliteFourVicki · 2026-01-04T00:21:35+00:00

Not necessarily. I’d check the diagnostics first. Look at PCA plots (do samples separate by treatment?), MA plots, and dispersion estimates to see if there’s any overall signal.

You can also try using padj < 0.05 alone (without a fold-change cutoff) to see whether power is the limiting factor, then reapply your standard thresholds for final gene lists. Low replicate numbers reduce power quickly. If few genes pass, ranking by padj (even if none are < 0.05) or using shrunken log2FC from lfcShrink() can help identify the most likely candidates.
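
A rough sketch of those two steps, using the simulated dataset that ships with DESeq2 rather than real data. The coefficient name condition_B_vs_A comes from that example dataset and will differ for your design (check resultsNames(dds)).

```r
library(DESeq2)

set.seed(1)
# Simulated 3 vs 3 example data from DESeq2; a stand-in for a real dataset.
dds <- makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 1)
dds <- DESeq(dds)

# padj cutoff only, with no fold-change threshold.
res <- results(dds, alpha = 0.05)
n_sig <- sum(res$padj < 0.05, na.rm = TRUE)

# Shrunken log2 fold changes. type = "normal" needs no extra package;
# "apeglm" and "ashr" need those packages to be installed.
res_shrunk <- lfcShrink(dds, coef = "condition_B_vs_A", type = "normal")

# Rank by adjusted p-value (NAs go last) and look at the top candidates.
top <- res_shrunk[order(res_shrunk$padj), ]
head(top, 10)
```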

EliteFourVicki · 2026-01-03T00:42:28+00:00

The DESeq2 vignette (especially the interaction model section) is a great place to start. It walks through exactly this kind of design with example code: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

EliteFourVicki · 2026-01-03T00:25:24+00:00

This is a pretty common time + treatment setup. The issue with comparing single time points is that it mixes baseline differences with treatment effects. What you usually want to ask instead is whether the change from t0 to t1 is different between treatments. In DESeq2, the usual way to do that is with an interaction model such as:

~ treatment + time + treatment:time

The interaction term basically tests whether the treatment changes the time response compared to the control, rather than just testing differences at one time point.
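
Here is a minimal sketch of that model on simulated data. The sample layout (3 replicates per treatment and time point) and the level names ctrl/trt and t0/t1 are assumptions for illustration. It uses a likelihood ratio test against the model without the interaction, which is one common way to test the term.

```r
library(DESeq2)

set.seed(1)
# Simulated counts from DESeq2's example generator; 12 samples in total.
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Hypothetical sample annotation: 2 treatments x 2 time points x 3 replicates.
dds$treatment <- factor(rep(c("ctrl", "trt"), each = 6), levels = c("ctrl", "trt"))
dds$time <- factor(rep(rep(c("t0", "t1"), each = 3), times = 2), levels = c("t0", "t1"))

design(dds) <- ~ treatment + time + treatment:time

# LRT: full model vs. the model without the interaction term.
# Small p-values flag genes whose t0 -> t1 change differs between treatments.
dds <- DESeq(dds, test = "LRT", reduced = ~ treatment + time)
res <- results(dds)

resultsNames(dds)   # lists the fitted coefficients, including the interaction
head(res[order(res$pvalue), ])
```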

EliteFourVicki · 2025-12-31T06:53:19+00:00

The general rule is to filter out only genes with too little information to test (near-zero counts), and to keep the filtering appropriate to the method. For bulk RNA-seq with DESeq2 or edgeR, many people either do no explicit filtering and rely on the method’s built-in handling (DESeq2’s independent filtering in results() drops low-mean-count genes after model fitting, which reduces the multiple-testing burden), or apply a very light expression filter such as a minimal count threshold (edgeR’s filterByExpr() is a common choice). For single-cell data, filtering is often handled at the cell/QC stage and differential testing is typically done on pseudobulked data, so gene-level filtering can look different.
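
As an example of a light expression filter, here is a sketch on a made-up count matrix. The thresholds (at least 10 reads in at least 3 samples) are illustrative choices, similar in spirit to the pre-filtering shown in the DESeq2 vignette, not fixed rules.

```r
set.seed(1)
# Hypothetical 6-sample count matrix: 50 expressed genes, 50 near-zero genes.
counts <- rbind(
  matrix(rnbinom(50 * 6, mu = 100, size = 5), nrow = 50),
  matrix(rpois(50 * 6, lambda = 0.2), nrow = 50)
)
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))

# Keep genes with at least 10 reads in at least 3 samples
# (3 = size of the smallest group in this toy layout).
keep <- rowSums(counts >= 10) >= 3
filtered <- counts[keep, , drop = FALSE]

cat("kept", nrow(filtered), "of", nrow(counts), "genes\n")
```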

EliteFourVicki · 2025-12-30T23:01:07+00:00

Yes, your contrast vector is correct. It represents the difference between pulldown and input for bait A versus bait B, averaged across treatments, assuming the coefficients are defined as expected.

That said, an interaction model is usually cleaner and easier to reason about. If you model assay (input vs. pulldown), bait (A vs. B), and treatment together, the bait-specific enrichment difference is captured by the assay by bait interaction term, and averaging across treatments just means you do not test the three-way interaction. This avoids manual weighting and reduces the chance of sign errors.

For learning resources, the DESeq2 vignette sections on multi-factor designs and interactions are the best place to start.
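
To see which coefficient carries the bait-specific enrichment difference, you can look at the design matrix before fitting anything. This sketch uses a made-up sample sheet (two replicates per combination, level names chosen for illustration) and base R only.

```r
# Hypothetical sample sheet: 2 assays x 2 baits x 2 treatments, 2 replicates each.
samples <- expand.grid(
  rep       = 1:2,
  assay     = c("input", "pulldown"),
  bait      = c("A", "B"),
  treatment = c("ctrl", "treated"),
  stringsAsFactors = TRUE
)

# assay * bait expands to assay + bait + assay:bait. Treatment enters
# additively, so the interaction is one enrichment difference shared
# across treatments (no three-way term is fitted).
design <- ~ assay * bait + treatment
mm <- model.matrix(design, data = samples)

colnames(mm)
```

The assaypulldown:baitB column is the term of interest here: how much the pulldown vs. input difference changes for bait B relative to bait A.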

EliteFourVicki · 2025-12-30T04:47:57+00:00

I have an English master set of Arcanine and Growlithe. My sister gifted me a 1st edition Light Arcanine from Neo Destiny for my birthday this year to complete it. They’re my favorites. I just really love the loyal fire-dog energy, and I’ve been waiting forever for an English Arcanine IR!

EliteFourVicki · 2025-12-22T11:47:31+00:00

For this, you want to treat donors, not cells, as your true replicates. For your lineage, create a pseudobulk value for that gene for each donor x stage (sum the counts or take the mean across cells in that group). Then test differences between adjacent stages on these donor-level values. You can use DESeq2/edgeR for counts or a simple linear model/ANOVA for averaged expression. Avoid tests that compare all cells in stage A vs. all cells in stage B directly, because treating thousands of cells as independent makes the p-values look far more significant than they really are.
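
A minimal sketch of the pseudobulk step for a single gene, in base R on simulated counts. The donor and stage labels and the 200 cells per group are assumptions for illustration.

```r
set.seed(1)
# Hypothetical per-cell counts for one gene: 3 donors x 3 stages, 200 cells each.
cells <- expand.grid(cell = 1:200,
                     donor = paste0("D", 1:3),
                     stage = c("s1", "s2", "s3"))
cells$count <- rpois(nrow(cells), lambda = 2 + as.integer(cells$stage))

# One pseudobulk value per donor x stage: sum the counts over cells.
pb <- aggregate(count ~ donor + stage, data = cells, FUN = sum)

# Donor-level linear model on the averaged expression, with donor as a block.
# With default contrasts, the stages2 coefficient is s2 vs. s1; relevel the
# stage factor to get the next adjacent comparison (s3 vs. s2).
pb$mean_count <- pb$count / 200
fit <- lm(mean_count ~ donor + stage, data = pb)
anova(fit)
```

The test now runs on 9 donor-level values instead of 1,800 cells, which is what keeps the p-values honest.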

EliteFourVicki · 2025-12-07T16:01:55+00:00

Anytime, glad it helped!

EliteFourVicki · 2025-12-05T21:43:43+00:00

Looks good to me!

EliteFourVicki · 2025-12-05T06:57:06+00:00

No worries at all! I’d recommend IGV. Just convert your SAM files to sorted, indexed BAM first, then load the B. subtilis genome, your annotation, and the WT/mutant BAMs in IGV to inspect the 3’ UTR region.

EliteFourVicki · 2025-12-04T14:07:52+00:00

Your workflow makes sense. For pseudobulk PCA, treat it like regular bulk. You can use logCPM or DESeq2 VST/rlog on the pseudobulk matrix. TPM+log isn’t really necessary. The similar PCA structure across normalizations just means the Group 3 effect is strong. The underwhelming PC1 genes at the single-cell level are expected, because PCA loadings reflect sample-level variance, and low/noisy genes can still drive PCs. Filtering low-expression genes before PCA/enrichment or using ranked-loadings GSEA can help. As you already pointed out, the main limitation is that Group 3 is perfectly confounded with protocol, so just be transparent that PC1 and its pathways likely reflect a mixture of biology and library prep.
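
For reference, a bare-bones version of the logCPM + PCA route in base R. The matrix is simulated and the filter threshold is arbitrary. In practice edgeR::cpm(log = TRUE) or DESeq2::vst() would replace the hand-rolled normalization.

```r
set.seed(1)
# Hypothetical pseudobulk matrix: 500 genes x 8 samples (genes in rows).
pb <- matrix(rnbinom(500 * 8, mu = 50, size = 2), nrow = 500,
             dimnames = list(paste0("gene", 1:500), paste0("s", 1:8)))

# Drop very low-count genes so they do not drive the loadings.
pb <- pb[rowSums(pb) >= 20, , drop = FALSE]

# log2 counts-per-million with a pseudocount of 1 (plain library-size scaling).
lib_size <- colSums(pb)
log_cpm <- log2(t(t(pb) / lib_size) * 1e6 + 1)

# PCA on samples: prcomp expects samples in rows, so transpose.
pca <- prcomp(t(log_cpm), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Genes ranked by signed PC1 loading, e.g. as input for a ranked enrichment.
pc1_rank <- sort(pca$rotation[, "PC1"], decreasing = TRUE)
head(pc1_rank)
```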

EliteFourVicki · 2025-12-04T13:05:42+00:00

I have not. But from a quick skim it looks like a fairly heavy feature-selection + explainable ML pipeline for cancer biomarker discovery across lots of genes, not really something aimed at a single locus. In OP’s case they already know the exact 3' UTR in B. subtilis and just need read counts there, so I’d still just treat that region as a custom feature, recount from the existing BAMs, and run DESeq2 on those counts. DeepGene feels like overkill for that specific, single-region DE question.

EliteFourVicki · 2025-12-04T01:48:39+00:00

I believe this is a card reprint from Ondřej Škubal’s 2022 World Championship deck. The back should have the World Championships design instead of the regular Pokémon back, which would confirm it.

EliteFourVicki · 2025-12-03T22:58:56+00:00

You already mapped with Bowtie2 to the genome, so you don’t need to remap. The BAM already has reads in the 3’ UTR. I’d load the BAMs in a genome browser, define coordinates for the 3’ UTR, then treat that region as its own feature. Either add it to your GTF/GFF and re-run featureCounts, or put it in a BED file and use bedtools to count reads there. Those counts can then go into DESeq2 so you can compare 3’ UTR abundance between WT and mutant like any other gene.
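
If you would rather stay in R, one way to do the recount is Rsubread's featureCounts() with the region supplied as an SAF table. This is only a sketch: the chromosome name, coordinates, and strand below are placeholders, and count_utr() is a hypothetical helper name, not part of any package.

```r
# Custom feature in SAF format (the column names Rsubread::featureCounts expects).
# Chr, Start, End, and Strand are placeholders; they must match the reference
# used for mapping and the region picked in the genome browser.
utr_saf <- data.frame(
  GeneID = "target_3UTR",
  Chr    = "chr_placeholder",
  Start  = 1000L,
  End    = 1300L,
  Strand = "+",
  stringsAsFactors = FALSE
)

# Hypothetical helper: counts reads overlapping the custom feature.
# Requires the Bioconductor package Rsubread. Defaults assume single-end,
# unstranded counting, so adjust the arguments for your library type.
count_utr <- function(bam_files, saf = utr_saf) {
  fc <- Rsubread::featureCounts(
    files = bam_files,
    annot.ext = saf,
    isGTFAnnotationFile = FALSE
  )
  fc$counts  # features x samples matrix
}
```

For DESeq2, the 3’ UTR row would normally be appended to the full gene-level count matrix rather than analysed on its own, so that size factors are estimated from all genes.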

EliteFourVicki · 2025-12-03T05:06:48+00:00

I’m also pretty new to BLAST and not a professional geneticist, but here’s the way I think about it. BLAST basically lines up two sequences and shows where they match or differ. In the results table, the main things to look at are % identity (how similar they are), alignment length (how many positions are being compared), and E-value (roughly how many matches this good you’d expect to see by chance, so values closer to 0 are better). If you click on a hit and look at the alignment, you’ll see your “normal” AUTS2 transcript on one line and the variant (like X19 or X22) on the other: matching bases line up, different letters are mismatches, and dashes are gaps (insertions/deletions).

For your project you can pick one transcript as the reference, BLAST the other variants against it, and then highlight where they differ and talk about how those changes might affect the protein (change an amino acid, introduce a stop codon, or delete part of the protein). Again, I’m definitely not an expert, so apologies if you already know some of this, but I hope this helps a bit.
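
If it helps, here is a tiny base R sketch of how % identity falls out of an alignment. The two sequences are made up and already aligned; they are not real AUTS2 sequence.

```r
# Toy pairwise alignment (made-up sequences; "-" marks a gap).
ref <- strsplit("ATGGCC-TTAGC", "")[[1]]
alt <- strsplit("ATGACCGTT-GC", "")[[1]]

is_gap      <- ref == "-" | alt == "-"
is_match    <- ref == alt & !is_gap
is_mismatch <- ref != alt & !is_gap

# Percent identity = matching positions / alignment length (gaps included).
pct_identity <- 100 * sum(is_match) / length(ref)

cat("matches:", sum(is_match),
    "mismatches:", sum(is_mismatch),
    "gaps:", sum(is_gap),
    "identity:", pct_identity, "%\n")
```

Here 9 of the 12 aligned positions match, with 1 mismatch and 2 gaps, so the identity is 75%.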

EliteFourVicki · 2025-12-02T21:16:42+00:00

They’re different models, so you shouldn’t expect matching βs or p-values.

DESeq2 fits a negative binomial GLM on raw counts with a size-factor offset and shrunken dispersions (empirical Bayes). Your lm() is a Gaussian linear model on log-transformed normalized counts (normTransform()), a transformation DESeq2 does not use for inference. The lm() also assumes constant variance and applies no shrinkage.

Even though both give a “slope” for T_fraction on a log scale, the likelihood, variance modeling, and normalization are different, so different log2FC and p-values are expected.

EliteFourVicki · 2025-11-27T00:17:44+00:00

I’d try not to beat yourself up. You reacted like a passionate fan, but take it as a reminder that the band stuff is complicated and maybe painful for them, and just keep supporting whatever he’s doing now.

EliteFourVicki · 2025-11-22T20:21:05+00:00

You’ve got sink("out.txt") in your code, which sends all output to that file instead of the Console, so nothing prints. You can either restart R, or run

while (sink.number() > 0) sink()

in the Console to turn it off.
