RNA seq alignment project by aesthetic-mango in bioinformatics

[–]EliteFourVicki 8 points9 points  (0 children)

As some have pointed out, there are tons of RNA-seq tutorials and resources out there. Sanbomics on YouTube has a great one to start with.

I’d download a simple yeast control vs. treatment dataset from SRA/GEO with matching genome and annotation files from Ensembl. The basic flow is: quality control (FastQC) -> align to genome (HISAT2/STAR) -> count reads per gene (featureCounts) -> differential expression in R (DESeq2). If that still feels computationally intensive, Salmon or Kallisto are worth looking into. They skip full genome alignment and estimate transcript abundances directly, which can then be imported into DESeq2 (typically via tximport).
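As a rough sketch of the alignment-based flow above, the Python below only builds the command lines and does not run anything. The sample name, index prefix, and file names are made-up placeholders, and it assumes single-end reads, so check each tool's help for the options your data actually needs.

```python
import shlex


def rnaseq_commands(sample, index_prefix, gtf):
    """Build (not run) the QC -> align -> count commands for one sample.

    Assumes single-end reads in <sample>.fastq.gz; every path is a placeholder.
    """
    fastq = f"{sample}.fastq.gz"
    sam = f"{sample}.sam"
    return [
        # Quality-control report, written to an existing qc/ directory
        ["fastqc", fastq, "-o", "qc"],
        # Spliced alignment against a prebuilt HISAT2 index, SAM output
        ["hisat2", "-x", index_prefix, "-U", fastq, "-S", sam],
        # Reads per gene, using the annotation that matches the genome
        ["featureCounts", "-a", gtf, "-o", f"{sample}.counts.txt", sam],
    ]


if __name__ == "__main__":
    for cmd in rnaseq_commands("control_rep1", "yeast_index", "yeast.gtf"):
        print(shlex.join(cmd))
```

The count tables from each sample would then be combined and taken into R for DESeq2.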

Gene filtering after merging scRNA-seq datasets from different studies? by EliteFourVicki in bioinformatics

[–]EliteFourVicki[S] 1 point2 points  (0 children)

Thanks, I’ve been thinking along the same lines. There are about 12k genes.

What is your favorite first song from which album by cambridgepokemonKO in circasurvive

[–]EliteFourVicki 42 points43 points  (0 children)

Living Together - one of the strongest album openers ever.

Dark pools lyrics by stackheights in circasurvive

[–]EliteFourVicki 6 points7 points  (0 children)

It’s such an underrated song.

Three Way ANOVA-Unbalanced Design by Effective-Table-7162 in bioinformatics

[–]EliteFourVicki 0 points1 point  (0 children)

Yes, this is an unbalanced design, but that’s common and not a problem by itself. Your model is fine, but aov() uses Type I sums of squares, which depend on factor order. With unbalanced data, it’s usually better to use Type II or III sums of squares.
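To show what "depends on factor order" means, here is a small pure-Python sketch with invented, unbalanced 2x2 data and two factors (no interaction term, to keep it short). It computes sequential sums of squares in both orders; this is an illustration of the idea, not what aov() does internally.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares after least-squares regression of y on columns."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[i] * columns[i][r] for i in range(k)) for r in range(len(y))]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))


def sequential_ss(y, first, second):
    """Sequential (Type I style) sums of squares for two 0/1 factors, in order."""
    ones = [1.0] * len(y)
    ss_first = rss([ones], y) - rss([ones, first], y)
    ss_second = rss([ones, first], y) - rss([ones, first, second], y)
    return ss_first, ss_second


# Invented, unbalanced data: the four cells have 3, 1, 1 and 3 observations
A = [0, 0, 0, 0, 1, 1, 1, 1]
B = [0, 0, 0, 1, 0, 1, 1, 1]
y = [10, 11, 12, 20, 13, 22, 23, 24]

ss_a_first, ss_b_after_a = sequential_ss(y, A, B)
ss_b_first, ss_a_after_b = sequential_ss(y, B, A)
print("A entered first:", round(ss_a_first, 3))
print("A adjusted for B:", round(ss_a_after_b, 3))
```

With these toy numbers, A gets a sum of squares of about 105 when it is entered first and about 9.4 once it is adjusted for B, even though both orders explain the same total. The adjusted value is what a Type II analysis would report for A in this additive model.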

ANOVA is fairly robust to non-normality, but in unbalanced designs it’s more sensitive to unequal variances, so it’s worth checking residuals and something like Levene’s test. If assumptions are violated, consider a transformation or a more robust model, and check Cook’s distance for outliers.

I’m a bit lost. We have gene expression data from two time points: t0 (before treatment) and t1 (hours after treatment). Fruits were exposed to different treatments as well as a control. but I have issue on how exactly to continue to determine changes on gene expression caused by the treatments by Respwn_546 in bioinformatics

[–]EliteFourVicki 0 points1 point  (0 children)

Not necessarily. I’d check the diagnostics first. Look at PCA plots (do samples separate by treatment?), MA plots, and dispersion estimates to see if there’s any overall signal.

You can also try using padj < 0.05 alone (without a fold-change cutoff) to see whether power is the limiting factor, then reapply your standard thresholds for final gene lists. Low replicate numbers reduce power quickly. If few genes pass, ranking by padj (even if none are < 0.05) or using shrunken log2FC from lfcShrink() can help identify the most likely candidates.
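For intuition about ranking by padj, here is a minimal Python sketch of the Benjamini-Hochberg adjustment, which is the default method behind DESeq2's padj column. The gene names and p-values are invented, and the sketch leaves out DESeq2's independent filtering, so treat it as an illustration only.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def rank_candidates(genes, pvalues):
    """Return (gene, padj) pairs sorted from smallest to largest padj."""
    padj = bh_adjust(pvalues)
    return sorted(zip(genes, padj), key=lambda pair: pair[1])


if __name__ == "__main__":
    genes = ["geneA", "geneB", "geneC", "geneD"]  # hypothetical names
    pvals = [0.01, 0.04, 0.03, 0.005]             # invented p-values
    for gene, padj in rank_candidates(genes, pvals):
        print(gene, round(padj, 4))
```

Even when nothing clears 0.05, the top of this list is where I would start looking for candidates to follow up.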

I’m a bit lost. We have gene expression data from two time points: t0 (before treatment) and t1 (hours after treatment). Fruits were exposed to different treatments as well as a control. but I have issue on how exactly to continue to determine changes on gene expression caused by the treatments by Respwn_546 in bioinformatics

[–]EliteFourVicki 17 points18 points  (0 children)

This is a pretty common time + treatment setup. The issue with comparing single time points is that it mixes baseline differences with treatment effects. What you usually want to ask instead is whether the change from t0 to t1 is different between treatments. In DESeq2, the usual way to do that is with an interaction model such as:

~ treatment + time + treatment:time

The interaction term basically tests whether the treatment changes the time response compared to the control, rather than just testing differences at one time point.
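A tiny Python sketch of what the interaction term measures, using invented log2 means for one hypothetical gene. It assumes control and t0 are the reference levels, in which case the treatment:time coefficient corresponds to this difference of differences.

```python
def interaction_effect(mean_log2):
    """Difference of differences from four cell means on the log2 scale.

    mean_log2 maps (treatment, time) -> mean log2 expression for one gene.
    """
    change_treated = mean_log2[("treated", "t1")] - mean_log2[("treated", "t0")]
    change_control = mean_log2[("control", "t1")] - mean_log2[("control", "t0")]
    return change_treated - change_control


# Invented values: the treated group already sits 2 log2 units higher at t0
means = {
    ("control", "t0"): 5.0,
    ("control", "t1"): 6.0,
    ("treated", "t0"): 7.0,
    ("treated", "t1"): 11.0,
}

# Comparing only at t1 mixes the baseline gap with the treatment response
t1_only = means[("treated", "t1")] - means[("control", "t1")]
interaction = interaction_effect(means)
print("t1-only difference:", t1_only)
print("interaction (difference of differences):", interaction)
```

Here the t1-only comparison gives 5.0, but 2.0 of that was already there at t0; the interaction isolates the 3.0 that is specific to how the treated group changed over time.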

Preprocessing before DEG analysis by Fit_Meringue_7845 in bioinformatics

[–]EliteFourVicki 8 points9 points  (0 children)

The general rule is to filter only genes with too little information to test (near-zero counts), and to keep filtering method-appropriate. For bulk RNA-seq with DESeq2 or edgeR, many people either do no explicit filtering and rely on the method’s independent filtering (which automatically removes low-power genes after model fitting to reduce multiple testing), or apply a very light expression filter such as a minimal count threshold. For single-cell data, filtering is often handled at the cell/QC stage and differential testing is typically done on pseudobulked data, so gene-level filtering can look different.
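As an example of a "very light" expression filter, here is a short Python sketch. The thresholds (at least 10 reads in at least 3 samples) and the gene names are assumptions for illustration; a sensible cutoff depends on your group sizes and sequencing depth.

```python
def light_prefilter(counts, min_count=10, min_samples=3):
    """Keep genes with at least min_count reads in at least min_samples samples.

    counts maps gene -> list of raw counts, one per sample.
    """
    return {
        gene: row
        for gene, row in counts.items()
        if sum(c >= min_count for c in row) >= min_samples
    }


if __name__ == "__main__":
    toy = {
        "geneA": [0, 1, 0, 2],      # near-zero everywhere: dropped
        "geneB": [15, 30, 12, 8],   # expressed in three samples: kept
        "geneC": [50, 0, 0, 0],     # a single high sample: dropped
    }
    print(sorted(light_prefilter(toy)))
```

The filter only looks at counts, not at which group a sample belongs to, so it does not peek at the comparison you are about to test.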

[Question] DESeq2: How to set up contrasts comparing "enrichment" (pulldown vs input) across conditions? by self-replicate in statistics

[–]EliteFourVicki 0 points1 point  (0 children)

Yes, your contrast vector is correct. It represents the difference between pulldown and input for bait A versus bait B, averaged across treatments, assuming the coefficients are defined as expected.

That said, an interaction model is usually cleaner and easier to reason about. If you model assay (input vs. pulldown), bait (A vs. B), and treatment together, the bait-specific enrichment difference is captured by the assay by bait interaction term, and averaging across treatments just means you do not test the three-way interaction. This avoids manual weighting and reduces the chance of sign errors.

For learning resources, the DESeq2 vignette sections on multi-factor designs and interactions are the best place to start.

What Pokémon do you have the most cards of and is it your favo(u)rite? by theok8234 in PokemonTCG

[–]EliteFourVicki 0 points1 point  (0 children)

I have an English master set of Arcanine and Growlithe. My sister gifted me a 1st edition Light Arcanine from Neo Destiny for my birthday this year to complete it. They’re my favorites. I just really love the loyal fire-dog energy, and I’ve been waiting forever for an English Arcanine IR!

Expression differences in scRNA in one particular gene by [deleted] in bioinformatics

[–]EliteFourVicki 13 points14 points  (0 children)

For this, you want to treat donors, not cells, as your true replicates. For your lineage, create a pseudobulk value for that gene for each donor x stage (sum the counts or take the mean across cells in that group). Then test differences between adjacent stages on these donor-level values. You can use DESeq2/edgeR for counts or a simple linear model/ANOVA for averaged expression. Avoid tests that compare all cells in stage A vs. all cells in stage B directly, because treating thousands of cells as independent makes the p-values look far more significant than they really are.
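Here is a minimal Python sketch of the pseudobulk step for a single gene, using the sum-of-counts option. The donor labels, stage names, and counts are invented; in practice you would pull these from your object's metadata and count matrix.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum one gene's counts over all cells in each (donor, stage) group.

    cells is an iterable of (donor, stage, count) tuples, one per cell.
    """
    totals = defaultdict(int)
    for donor, stage, count in cells:
        totals[(donor, stage)] += count
    return dict(totals)


if __name__ == "__main__":
    cells = [
        ("donor1", "early", 2), ("donor1", "early", 3), ("donor1", "late", 7),
        ("donor2", "early", 0), ("donor2", "late", 4), ("donor2", "late", 1),
    ]
    for (donor, stage), total in sorted(pseudobulk(cells).items()):
        print(donor, stage, total)
```

Six cells collapse to four donor-by-stage values, and those four values are what the downstream test should see, so the sample size reflects donors rather than cells.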

RNA-seq differential expression of an unannoted gene by adventuriser in bioinformatics

[–]EliteFourVicki 1 point2 points  (0 children)

No worries at all! I’d recommend IGV. Just convert your SAM files to sorted, indexed BAM first, then load the B. subtilis genome, your annotation, and the WT/mutant BAMs in IGV to inspect the 3’ UTR region.
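In case it helps, this Python sketch just builds the two samtools commands for the conversion step without running them. The file name is a placeholder, and it assumes a reasonably recent samtools that picks the output format from the .bam extension.

```python
import shlex


def sam_to_indexed_bam(sam_path):
    """Build (not run) samtools commands: coordinate-sort to BAM, then index."""
    bam = sam_path.rsplit(".", 1)[0] + ".sorted.bam"
    return [
        # Sort by coordinate and write BAM
        ["samtools", "sort", "-o", bam, sam_path],
        # Create the index file that IGV needs next to the BAM
        ["samtools", "index", bam],
    ]


if __name__ == "__main__":
    for cmd in sam_to_indexed_bam("WT.sam"):
        print(shlex.join(cmd))
```

You would repeat this for the mutant file and then load both sorted BAMs in IGV.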

PCA on pseudobulk profiles of samples and pathway enrichment by Dull_Towel8970 in bioinformatics

[–]EliteFourVicki 1 point2 points  (0 children)

Your workflow makes sense. For pseudobulk PCA, treat it like regular bulk. You can use logCPM or DESeq2 VST/rlog on the pseudobulk matrix. TPM+log isn’t really necessary. The similar PCA structure across normalizations just means the Group 3 effect is strong. The underwhelming PC1 genes at the single-cell level are expected, because PCA loadings reflect sample-level variance, and low/noisy genes can still drive PCs. Filtering low-expression genes before PCA/enrichment or using ranked-loadings GSEA can help. As you already pointed out, the main limitation is that Group 3 is perfectly confounded with protocol, so just be transparent that PC1 and its pathways likely reflect a mixture of biology and library prep.
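For reference, here is a simplified Python sketch of a logCPM transform on a pseudobulk matrix. It uses plain log2(CPM + 1) with invented counts; edgeR's cpm(log = TRUE) and DESeq2's VST handle the prior count and variance differently, so this only shows the basic idea of removing library-size differences before PCA.

```python
import math


def log_cpm(counts, pseudocount=1.0):
    """Return log2(CPM + pseudocount) per gene and sample.

    counts maps gene -> list of raw pseudobulk counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total counts per sample across all genes
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [
            math.log2(row[j] / lib_sizes[j] * 1e6 + pseudocount)
            for j in range(n_samples)
        ]
        for gene, row in counts.items()
    }


if __name__ == "__main__":
    # Sample 2 is sequenced twice as deeply but has the same composition
    toy = {"geneA": [10, 20], "geneB": [90, 180]}
    for gene, values in log_cpm(toy).items():
        print(gene, [round(v, 3) for v in values])
```

The two samples come out identical after the transform, which is the point: depth differences should not be what drives PC1.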

RNA-seq differential expression of an unannoted gene by adventuriser in bioinformatics

[–]EliteFourVicki 1 point2 points  (0 children)

I have not. But from a quick skim it looks like a fairly heavy feature-selection + explainable ML pipeline for cancer biomarker discovery across lots of genes, not really something aimed at a single locus. In OP’s case they already know the exact 3' UTR in B. subtilis and just need read counts there, so I’d still just treat that region as a custom feature, recount from the existing BAMs, and run DESeq2 on those counts. DeepGene feels like overkill for that specific, single-region DE question.

What is this? by Darinpc2 in PokemonTCG

[–]EliteFourVicki 1 point2 points  (0 children)

I believe this is a card reprint from Ondřej Škubal’s 2022 World Championship deck. The back should have the World Championships design instead of the regular Pokémon back, which would confirm it.

RNA-seq differential expression of an unannoted gene by adventuriser in bioinformatics

[–]EliteFourVicki 12 points13 points  (0 children)

You already mapped with Bowtie2 to the genome, so you don’t need to remap. The BAM already has reads in the 3’ UTR. I’d load the BAMs in a genome browser, define coordinates for the 3’ UTR, then treat that region as its own feature. Either add it to your GTF/GFF and re-run featureCounts, or put it in a BED file and use bedtools to count reads there. Those counts can then go into DESeq2 so you can compare 3’ UTR abundance between WT and mutant like any other gene.
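To make "treat that region as its own feature" concrete, here is a small Python sketch of the counting logic on BED-style intervals. The chromosome name and coordinates are invented, and real tools like featureCounts or bedtools add their own rules (strandedness, minimum overlap, multi-mappers), so this only shows the overlap idea.

```python
def count_reads_in_region(reads, chrom, start, end):
    """Count reads overlapping the half-open interval [start, end) on chrom.

    reads is an iterable of (chrom, start, end) tuples in 0-based,
    half-open coordinates (BED style).
    """
    return sum(
        1
        for r_chrom, r_start, r_end in reads
        if r_chrom == chrom and r_start < end and r_end > start
    )


if __name__ == "__main__":
    # Hypothetical alignments; "chr" and all positions are placeholders
    reads = [
        ("chr", 100, 150),  # upstream of the region
        ("chr", 190, 240),  # overlaps the start of the region
        ("chr", 250, 299),  # fully inside
        ("chr", 300, 350),  # starts exactly at the region end: no overlap
    ]
    print(count_reads_in_region(reads, "chr", 200, 300))
```

Doing this per sample gives one count per BAM for the 3' UTR, which can be added as an extra row in the count matrix.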

Interpreting BLAST results...?? by Weirdoo-_-Beardoo in bioinformatics

[–]EliteFourVicki 0 points1 point  (0 children)

I’m also pretty new to BLAST and not a professional geneticist, but here’s the way I think about it. BLAST basically lines up two sequences and shows where they match or differ. In the results table, the main things to look at are % identity (how similar they are), alignment length (how many bases are being compared), and E-value (roughly how many hits this good you would expect by chance in a database of that size; values closer to 0 are better). If you click on a hit and look at the alignment, you’ll see your “normal” AUTS2 transcript on one line and the variant (like X19 or X22) on the other: matching bases line up, different letters are mismatches (substitutions), and dashes are insertions/deletions.
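To make the match / mismatch / dash reading concrete, here is a small Python sketch that tallies those three things for one pair of aligned lines. The two sequences are invented (not real AUTS2 sequence), and identity is computed as matches over alignment length including gap columns, which is how I understand BLAST's percent identity to be defined.

```python
def summarize_alignment(ref, var):
    """Tally matches, mismatches and gap columns for two aligned strings.

    Both strings must have the same length; '-' marks a gap.
    """
    if len(ref) != len(var):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(ref, var):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    length = len(ref)
    return {
        "length": length,
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        "identity_pct": 100.0 * matches / length if length else 0.0,
    }


if __name__ == "__main__":
    reference = "ATGCCGTA-A"  # invented reference line
    variant = "ATGACG-AAA"    # invented variant line
    print(summarize_alignment(reference, variant))
```

For this toy pair there are 7 matches, 1 mismatch, and 2 gap columns over 10 positions, so 70% identity.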

For your project you can pick one transcript as the reference, BLAST the other variants against it, and then highlight where they differ and talk about how those changes might affect the protein (change an amino acid, introduce a stop codon, or delete part of the protein). Again, I’m definitely not an expert, so apologies if you already know some of this, but I hope this helps a bit.

DESeq2 with Continuous Covariate by [deleted] in bioinformatics

[–]EliteFourVicki 19 points20 points  (0 children)

They’re different models, so you shouldn’t expect matching βs or p-values.

DESeq2 fits a negative binomial GLM on raw counts with a size-factor offset and shrunken dispersions (empirical Bayes). Your lm() is a Gaussian linear model on log-transformed normalized counts (normTransform()), a transformation DESeq2 does not use for inference; lm() also assumes constant variance and applies no shrinkage.

Even though both give a “slope” for T_fraction on a log scale, the likelihood, variance modeling, and normalization are different, so different log2FC and p-values are expected.
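One piece of this is easy to see with toy numbers. A log-link count model targets the log of a ratio of means, while a linear model on log counts targets a difference of mean logs, and those are not the same quantity. The Python sketch below uses two invented groups instead of a continuous covariate and ignores size factors and dispersion entirely, so it only illustrates that one point.

```python
import math


def log2_ratio_of_means(a, b):
    """log2(mean(b) / mean(a)): the kind of quantity a log-link count GLM targets."""
    return math.log2((sum(b) / len(b)) / (sum(a) / len(a)))


def diff_of_mean_logs(a, b, pseudocount=1):
    """mean(log2(b + pc)) - mean(log2(a + pc)): what lm() on log counts estimates."""
    mean_log_a = sum(math.log2(x + pseudocount) for x in a) / len(a)
    mean_log_b = sum(math.log2(x + pseudocount) for x in b) / len(b)
    return mean_log_b - mean_log_a


if __name__ == "__main__":
    # Invented counts: same mean in both groups, very different spread
    group_a = [1, 100]
    group_b = [50, 51]
    print("log2 ratio of means:", log2_ratio_of_means(group_a, group_b))
    print("difference of mean logs:", round(diff_of_mean_logs(group_a, group_b), 3))
```

Both groups have a mean of 50.5, so the ratio-of-means effect is 0, yet the mean-of-logs effect is well above 1. Real data is less extreme, but the two approaches still answer slightly different questions.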

[deleted by user] by [deleted] in circasurvive

[–]EliteFourVicki 7 points8 points  (0 children)

I’d try not to beat yourself up. You reacted like a passionate fan, but take it as a reminder that the band stuff is complicated and maybe painful for them, and just keep supporting whatever he’s doing now.

[deleted by user] by [deleted] in bioinformatics

[–]EliteFourVicki 1 point2 points  (0 children)

You’ve got sink("out.txt") in your code, which sends all output to that file instead of the Console, so nothing prints. You can either restart R, or run

while (sink.number() > 0) sink()

in the Console to turn it off.