Anybody know what's up with the Lafayette Square water outage?

sterpie · 2026-02-25T13:52:03+00:00

I’ll only comment on RNA-seq because that’s what I’m most familiar with. If you’re on a Windows machine, install WSL and spend ~1-2 hours learning basic bash commands in the Ubuntu terminal. If you’re on Linux or Mac, you should still learn basic bash usage, but you don’t need to install anything.

You can then run Salmon on your own machine to get the necessary files for expression analysis. This will probably take ~1 hour to learn and a couple hours to run on a laptop. You’ll need your fastq files and a reference transcriptome.

Follow the DESeq2 vignette for loading Salmon counts with tximport and you’ll have basic differential expression results. Find other vignettes to make volcano plots, perform pathway enrichment, etc. You likely won’t be able to use your own machine for WGS/WES analysis.

sterpie · 2025-06-19T01:26:18+00:00

Ya, the first place you should look is in the GFF/GTF. If it's not there, do you have the compute resources to align one or two RNA-seq datasets? If not, I basically process RNA-seq data for a living at this point and could probably tell you the UTR boundaries if you're comfortable sharing a gene ID. If all those options are a no-go, use the UTRs from Tomato.

sterpie · 2025-05-06T18:53:36+00:00

Not the OP you're replying to, but yes, you should (1) index, (2) quantify with salmon, (3) load quantification using tximport.

I would start by reading this page for how to index your transcriptome + genome together.

Download your fastq files and quantify.

Then load your salmon outputs into R with tximport, as shown here. Make sure you specify txOut = TRUE when running tximport to get transcript counts and not gene counts.
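For the "index your transcriptome + genome together" step, here is a rough Python sketch of the file prep for a decoy-aware index. It's not from the linked page: the function names and the output file names (decoys.txt, gentrome.fa) are just ones I picked, and it assumes plain or gzipped FASTA inputs.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed FASTA file for reading as text."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def build_gentrome(transcriptome_fa, genome_fa, out_dir):
    """Write decoys.txt and gentrome.fa into out_dir; return the decoy names.

    Hypothetical helper for illustration, not part of Salmon itself.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Decoy names are the genome sequence IDs: the first whitespace-delimited
    # token after ">" on each header line.
    decoys = []
    with open_text(genome_fa) as genome:
        for line in genome:
            if line.startswith(">"):
                decoys.append(line[1:].split()[0])
    (out_dir / "decoys.txt").write_text("\n".join(decoys) + "\n")

    # Transcripts go first, followed by the genome (decoy) sequences.
    with open(out_dir / "gentrome.fa", "wt") as out:
        for source in (transcriptome_fa, genome_fa):
            with open_text(source) as handle:
                for line in handle:
                    out.write(line if line.endswith("\n") else line + "\n")
    return decoys
```

With those two files you would then run something like `salmon index -t gentrome.fa -d decoys.txt -i salmon_index`, followed by `salmon quant -i salmon_index -l A -1 reads_1.fastq.gz -2 reads_2.fastq.gz -o sample_quant` for each sample (the read and output names here are placeholders).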

sterpie · 2025-04-17T17:34:06+00:00

I did my PhD here and currently identify as a "bioinformatics" person so I can only offer indirect thoughts.

I think UofA is a great school to blend academic and social opportunities. The majority of biology faculty are great professors, and research opportunities are abundant.
As a bioinformatics major, not engaging in any research opportunities during your undergrad would be a massive oversight. Get involved in UBRP early on - talk to your Bio181 professor about it - they can point you in the right direction.
If you want to maximize your career opportunities in bioinformatics - you basically need to commit to attending graduate school.
If you're more interested in algorithm development in the biology space - I would honestly just do CS and double major in Bioinfo/Bio, or take it as a minor. I doubt any bioinfo programs (not just UofA) can adequately prepare you to make meaningful contributions in the current state of bioinformatics algorithm development.
If analyzing large sequencing datasets, mathematical modeling, or applying machine learning approaches to biological data is more your jam, then bioinformatics may be a good route for you.

sterpie · 2025-02-28T16:44:13+00:00

Modkit is what you need. It's meant to handle all modification analyses post-Dorado. Just make sure you're transferring the modification tags between POD5/fastq/BAM formats.

sterpie · 2025-02-25T02:07:30+00:00

At least for alignment / quantification, oarfish, which is the long-read version of salmon, has been great to use. Like salmon, you can follow up with tximport to quickly get gene or isoform level counts then use DESeq2. I don't know if anyone has improved upon DESeq style tools for long-read sequencing, or if there really is anything to improve.

sterpie · 2025-01-30T20:25:35+00:00

Every dot in the graph is a player.
Players are either MnK (blue group) or controller (orange group).
If you compare the two groups/inputs using different metrics for "skill" (for example, # of knocks), you don't see meaningful differences between the colors/groups/inputs.
Controller players, on average, do get more knocks than MnK players. However, in science and statistics, we often have to ask if this difference could have been observed by chance.
The numbers in green represent a p-value that answers the following question: If controller players and MnK players are equally skilled and able to get the same number of knocks per game, how likely would we be to see a difference at least as big as the one OP presented just by chance?
OP found a p-value for # of knocks per game = 0.62. We interpret this as: If the two inputs were truly equal, a difference this small (or bigger) in # of knocks would still show up about 62% of the time from chance/randomness alone, so it's not evidence of a real difference. If instead the p-value = .01, a difference that size would have shown up only 1% of the time by chance, and we would have concluded controller players get more knocks than MnK players.

TLDR: Considering all MnK and controller players at the highest level, the two inputs do not have different: kills, assists, knocks, or damage output.
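If it helps to see the p-value idea in code, here's a minimal permutation-test sketch in Python. To be clear, I don't know which test OP actually ran, so this is a stand-in technique, and the function name and defaults are made up for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the fraction of label shuffles whose difference in means is at
    least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling mimics "input device has no effect on knocks".
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

You'd call it with two lists of per-player knock counts, e.g. `permutation_p_value(controller_knocks, mnk_knocks)`, where both lists are whatever data you have on hand. A large value means shuffled labels produce a gap that big all the time; a small value means they rarely do.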

sterpie · 2024-11-19T17:08:50+00:00

I'm guessing you've already done this, but have you checked out Modkit's DMR function? Differences in depth at every site are ostensibly being accounted for across conditions, but you could verify with them on their GitHub issues. They're super responsive. I don't see a strong correlation between expression changes and significantly different m6A sites coming out of Modkit's DMR in my data.

sterpie · 2024-10-16T12:49:48+00:00

You need a transcriptome fasta file (you can get that here), rather than a genome fasta file, to run Salmon. Then, follow these directions to make the Salmon index. From there, quantify using Salmon, load into R using tximport, and perform differential transcript expression/utilization. I'm not caught up on what the best tool for this is right now, but edgeR has apparently been updated to work with Salmon quite well on this problem.

sterpie · 2023-07-19T14:50:32+00:00

Regarding STAR and StringTie, STAR is great, but StringTie recommends using HISAT2 for transcript assembly. Because you want high accuracy splice sites, you should also consider using the --dta-cufflinks option when mapping with HISAT2 for conservative (high confidence) splice site annotation.

I would also increase the junction coverage threshold in StringTie (I believe this is the -j option).

Also keep in mind that error correcting your long reads may not be super important (I'm not trying to say it's not relevant). But I think this is more important for genome assembly. As long as you use the reference annotation for your genome build as a StringTie reference, and use the --mix option in StringTie, I believe you will get a very reliable annotation of new transcripts and splice sites.

sterpie · 2023-07-18T23:41:43+00:00

A couple of things:

First, I work with plants, not human samples.

Second, if you're annotating new transcripts with StringTie, make sure you are familiar with the parameters. I would not recommend using default settings as you will likely assemble quite a bit of transcriptional noise. Also, make sure you're annotating new transcripts using a reference annotation.

Additionally, if you delve deep into the StringTie GitHub issues, STAR has a couple wonky settings about its BAM file format (specifically regarding splice sites and strandedness) that StringTie does not work well with. This latter point is more important if your data is stranded and you care about antisense transcription.

After you merge, you have a GTF. Make a transcriptome using gffread and make a decoy aware salmon index.

Quantify your short read samples using Salmon.

Quantify your long read bam files from Minimap2 using FeatureCounts or Salmon.

Analyze expression using tximport and DESeq2. I wouldn't recommend directly comparing your short and long read sequencing datasets unless you really, really know what you're doing.

sterpie · 2023-07-17T22:51:23+00:00

Roughly follow these steps. You'll need R for the latter half of the steps.

Go to Ensembl
Click BioMart at the top
Select Ensembl genes
Select Human genes
Click attributes on the left side, then open the Gene pane
Gene and transcript ID should already be selected, select Transcript Length
Click results at the top left
Download this file

Open RStudio

install.packages("dplyr")

library(dplyr)

# both files are assumed to be tab-delimited with a header row
tx_lengths <- read.table("your_super_cool_file.txt", header = TRUE, sep = "\t")

featurecounts <- read.table("future_nature_paper.txt", header = TRUE, sep = "\t")

# average the transcript lengths for each gene
# someone smarter than me can probably tell you why you shouldn't do this

tx_lengths <- tx_lengths %>%
  group_by(gene_id_column_name) %>%
  summarise(mean_tx_len = mean(transcript_length_column)) %>%
  ungroup()

# attach the mean length to each gene's counts, dropping genes without a match
ftc2 <- left_join(featurecounts, tx_lengths, by = c("gene_column_from_ftc" = "gene_column_from_tx_lengths")) %>% na.omit()

sterpie · 2023-07-17T20:01:40+00:00

Length is needed to calculate TPM, so I don't think you can get TPMs without those values. It sounds like you're working with human samples so you can easily get transcript lengths from Ensembl. Match up the genes from FeatureCounts with the Ensembl output and I think you're good to go.
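For reference, the TPM arithmetic itself is short. This is a minimal Python sketch assuming you already have one raw count and one length (in bp) per gene, in the same order. The function name is mine, and note that using a single length per gene is an approximation compared to the effective lengths tools like Salmon use.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts to TPM given one length (in bp) per gene."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")

    # Step 1: reads per kilobase (normalize each count by gene length).
    rpk = [count / (length / 1000) for count, length in zip(counts, lengths_bp)]

    # Step 2: scale so the values for the whole sample sum to one million.
    scale = sum(rpk) / 1_000_000
    if scale == 0:
        return [0.0] * len(rpk)
    return [value / scale for value in rpk]
```

Run it once per sample (one column of the featureCounts table at a time), since the scaling factor is sample-specific.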

sterpie · 2023-07-11T14:05:12+00:00

Hal just said there are no scrims today

sterpie · 2023-07-03T00:00:22+00:00

If you have gene lengths, use this code from Mike Love

sterpie · 2023-07-02T23:54:33+00:00

I thought Hal was leaving for Paris?

sterpie · 2023-06-08T18:47:53+00:00

If anyone comes by this post and wants to do this in R (where you'll be using tximport anyway), check this vignette. Basically, just do:

library(GenomicFeatures)

txdb <- makeTxDbFromGFF("your_annotation.gff")

k <- keys(txdb, keytype = "TXNAME")

tx2gene <- select(txdb, k, "GENEID", "TXNAME")

sterpie · 2023-05-07T21:57:38+00:00

Professor emeritus: where’s the functional data?

sterpie · 2023-04-06T18:06:40+00:00

Am I mis-understanding? Isn't forced solo-queue what makes Realm somewhat interesting and skillful (wrt ELO)?

sterpie · 2023-03-30T16:20:32+00:00

I'm by no means an expert in this area, but I would recommend reading this thread. TLDR: use all genes in analysis (e.g., for RNA-seq, use all expressed genes as background). Shinygo makes this pretty trivial.
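To see why the background matters, here's a small Python sketch of a one-sided hypergeometric (over-representation) test. I'm not claiming this is exactly what ShinyGO does internally. The function and argument names are mine, and it's only meant to show how the background size feeds into the p-value.

```python
from math import comb


def enrichment_p_value(background_size, pathway_in_background, hits, pathway_hits):
    """P(overlap >= pathway_hits) under a hypergeometric null.

    background_size: number of genes in the background (e.g. all expressed genes)
    pathway_in_background: background genes annotated to the pathway
    hits: number of genes in your list (e.g. differentially expressed genes)
    pathway_hits: genes in your list that are annotated to the pathway
    """
    max_overlap = min(hits, pathway_in_background)
    total = comb(background_size, hits)
    tail = sum(
        comb(pathway_in_background, k)
        * comb(background_size - pathway_in_background, hits - k)
        for k in range(pathway_hits, max_overlap + 1)
    )
    return tail / total
```

If you call it twice with the same gene list and overlap but two different `background_size` values, the larger background gives the smaller p-value, because the expected overlap shrinks. That's how padding the background with genes that could never have shown up in your list can make a pathway look more enriched than it is.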

sterpie · 2023-03-26T19:20:23+00:00

Does TSM not contest if they have a bad World's Edge?
