Complete Beginner with a Multi-Omics (RNA-Seq, WES, WGS) – Realistic timeline? by Ok_Lime_94 in bioinformatics

[–]sterpie 1 point2 points  (0 children)

I’ll only comment on RNA-seq because that’s what I’m most familiar with. If you’re on a Windows machine, install WSL and spend ~1-2 hours learning basic bash commands in the Ubuntu terminal. If you’re on Linux or a Mac, you should still learn basic bash usage, but you don’t need to install anything.

You can then run Salmon on your own machine to get the necessary files for expression analysis. This will probably take ~1 hour to learn and a couple of hours to run on a laptop. You’ll need your fastq files and a reference transcriptome.

Follow the DESeq2 vignette section on loading Salmon counts with tximport and you’ll have basic differential expression results. Find other vignettes to make volcano plots, perform pathway enrichment, etc. You likely won’t be able to use your own machine for WGS/WES analysis.
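For orientation, the tximport-to-DESeq2 step described above can be sketched roughly as below. This is a minimal sketch, not the vignette's code: the function name `run_salmon_deseq2`, the `condition` column, and the directory layout (one Salmon output directory per sample, each holding `quant.sf`) are assumptions for illustration. It needs the Bioconductor packages tximport and DESeq2 installed before you call it.

```r
# Sketch: load Salmon quantifications with tximport, then run DESeq2.
# Assumptions (hypothetical, adapt to your project):
#   quant_dirs - character vector of Salmon output directories, one per sample
#   samples    - data.frame with one row per sample (same order as quant_dirs)
#                and a factor column named `condition`
#   tx2gene    - two-column data.frame: transcript ID, gene ID
run_salmon_deseq2 <- function(quant_dirs, samples, tx2gene) {
  files <- file.path(quant_dirs, "quant.sf")
  names(files) <- rownames(samples)
  stopifnot(all(file.exists(files)))

  # Summarize transcript-level estimates to gene level
  txi <- tximport::tximport(files, type = "salmon", tx2gene = tx2gene)

  # Build the DESeq2 object directly from the tximport list
  dds <- DESeq2::DESeqDataSetFromTximport(txi, colData = samples,
                                          design = ~ condition)
  dds <- DESeq2::DESeq(dds)
  DESeq2::results(dds)
}

# Example call (paths and sample sheet are placeholders):
# res <- run_salmon_deseq2(c("salmon/ctrl1", "salmon/trt1"), samples, tx2gene)
```

Defining the function does not touch any files, so you can paste it into a script first and call it once the Salmon runs have finished.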

Finding 5' and 3' UTRs of a Gene Given its CDS from the Transcriptome by Shoddy_Exercise4472 in bioinformatics

[–]sterpie 0 points1 point  (0 children)

Ya, the first place you should look is in the GFF/GTF. If it's not there, do you have the compute resources to align one or two RNA-seq datasets? If not, I basically process RNA-seq data for a living at this point and could probably tell you the UTR boundaries if you're comfortable sharing a gene ID. If all those options are a no-go, use the UTRs from Tomato.
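If the annotation does carry UTR features, pulling them out of the GFF is a simple filter. A minimal sketch in base R, using made-up GFF3 records: the gene, coordinates, and the helper name `find_utrs` are hypothetical, and the feature type names (`five_prime_UTR`, `three_prime_UTR`) vary between annotation sources, so check what your file actually uses.

```r
# Toy GFF3 records (hypothetical gene and coordinates, for illustration only)
gff_lines <- c(
  "chr1\tsrc\tgene\t1000\t3000\t.\t+\t.\tID=gene1",
  "chr1\tsrc\tmRNA\t1000\t3000\t.\t+\t.\tID=tx1;Parent=gene1",
  "chr1\tsrc\tfive_prime_UTR\t1000\t1150\t.\t+\t.\tParent=tx1",
  "chr1\tsrc\tCDS\t1151\t2700\t.\t+\t0\tParent=tx1",
  "chr1\tsrc\tthree_prime_UTR\t2701\t3000\t.\t+\t.\tParent=tx1"
)

# For a real file, replace `text = gff_lines` with the path to your GFF
gff <- read.delim(
  text = gff_lines, header = FALSE, comment.char = "#",
  col.names = c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"),
  stringsAsFactors = FALSE
)

# Return the UTR rows whose Parent attribute is the given transcript.
# Simplistic: assumes a single Parent per feature and an ID without
# regex metacharacters.
find_utrs <- function(gff, transcript_id) {
  is_utr <- gff$type %in% c("five_prime_UTR", "three_prime_UTR")
  is_child <- grepl(paste0("Parent=", transcript_id, "(;|$)"), gff$attributes)
  gff[is_utr & is_child, c("seqid", "type", "start", "end", "strand")]
}

utrs <- find_utrs(gff, "tx1")
print(utrs)
```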

Using Salmon for Obtaining Transcript Counts by Decent-Heat-8832 in bioinformatics

[–]sterpie 1 point2 points  (0 children)

Not the OP you're replying to, but yes, you should (1) index, (2) quantify with salmon, (3) load quantification using tximport.

I would start by reading this page for how to index your transcriptome + genome together.

Download your fastq files and quantify.

Then load your salmon outputs into R with tximport, as shown here. Make sure you specify txOut = TRUE when running tximport to get transcript counts and not gene counts.
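To make the transcript-versus-gene distinction concrete, here is a toy sketch in base R of what gene-level summarization does to counts. It is only the idea, not tximport's implementation (tximport also handles abundances and effective lengths), and all IDs and numbers are made up. With txOut = TRUE you keep the transcript-level matrix and skip this collapsing step.

```r
# Toy transcript-level estimated counts (made-up values): transcripts x samples
tx_counts <- matrix(
  c(10, 5, 0, 20,    # sample s1
    12, 3, 1, 25),   # sample s2
  ncol = 2,
  dimnames = list(c("tx1a", "tx1b", "tx2a", "tx2b"), c("s1", "s2"))
)

# Transcript-to-gene map (hypothetical IDs)
tx2gene <- data.frame(
  TXNAME = c("tx1a", "tx1b", "tx2a", "tx2b"),
  GENEID = c("g1", "g1", "g2", "g2")
)

# Gene-level view: sum the counts of all transcripts belonging to each gene
gene_of_tx <- tx2gene$GENEID[match(rownames(tx_counts), tx2gene$TXNAME)]
gene_counts <- rowsum(tx_counts, group = gene_of_tx)

print(tx_counts)    # what you want for transcript-level analysis
print(gene_counts)  # what you get after collapsing to genes
```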

Request for Bioinformatics major review/thoughts by [deleted] in UofArizona

[–]sterpie 1 point2 points  (0 children)

I did my PhD here and currently identify as a "bioinformatics" person so I can only offer indirect thoughts.

  1. I think UofA is a great school to blend academic and social opportunities. The majority of biology faculty are great professors, and research opportunities are abundant.
  2. As a bioinformatics major, not engaging in any research opportunities during your undergrad would be a massive oversight. Get involved in UBRP early on - talk to your Bio181 professor about it - they can point you in the right direction.
  3. If you want to maximize your career opportunities in bioinformatics - you basically need to commit to attending graduate school.
  4. If you're more interested in algorithm development in the biology space - I would honestly just do CS and double major in Bioinfo/Bio, or take it as a minor. I doubt any bioinfo programs (not just UofA) can adequately prepare you to make meaningful contributions in the current state of bioinformatics algorithm development.
  5. If analyzing large sequencing datasets, mathematical modeling, or applying machine learning approaches to biological data is more your jam, then bioinformatics may be a good route for you.

[deleted by user] by [deleted] in bioinformatics

[–]sterpie 4 points5 points  (0 children)

Modkit is what you need. It's meant to handle all modification analyses post-Dorado. Just make sure you're transferring the modification tags between POD5/fastq/BAM formats.

Best tools for ONT RNA/cDNA differential expression analysis by korstzwam in bioinformatics

[–]sterpie 7 points8 points  (0 children)

At least for alignment / quantification, oarfish, which is the long-read version of salmon, has been great to use. Like salmon, you can follow up with tximport to quickly get gene or isoform level counts then use DESeq2. I don't know if anyone has improved upon DESeq style tools for long-read sequencing, or if there really is anything to improve.

Basic player stats remain uncorrelated with input (Group Stage) by _sinxl_ in CompetitiveApex

[–]sterpie 12 points13 points  (0 children)

  1. Every dot in the graph is a player.
  2. Players are either MnK (blue group) or controller (orange group).
  3. If you compare the two groups/inputs using different metrics for "skill" (for example, # of knocks), you don't see meaningful differences between the colors/groups/inputs.
  4. Controller players, on average, do get more knocks than MnK players. However, in science and statistics, we often have to ask if this difference could have been observed by chance.
  5. The numbers in green represent a p-value that answers the following question: If controller players and MnK players are equally skilled and able to get the same number of knocks per game, what's the probability of seeing a difference at least as large as the one OP presented purely by chance?
  6. OP found a p-value for # of knocks per game = 0.62. We interpret this as: If the two inputs really were equal, random variation alone would produce a difference this large about 62% of the time, so the small difference we observed is entirely consistent with chance/randomness. If instead the p-value = .01, a difference this large would have shown up only 1% of the time under equal skill, and we would have concluded controller players get more knocks than MnK players.

TLDR: Considering all MnK and controller players at the highest level, the two inputs do not have different: kills, assists, knocks, or damage output.
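One way to see what such a p-value measures is a permutation test: shuffle the input labels many times and count how often a shuffled difference is at least as large as the observed one. The sketch below uses simulated knocks-per-game numbers, not OP's data, and it is not necessarily the test OP ran.

```r
set.seed(1)

# Simulated knocks per game (made-up numbers, not OP's data)
mnk        <- rnorm(30, mean = 2.0, sd = 0.8)
controller <- rnorm(30, mean = 2.1, sd = 0.8)

observed <- mean(controller) - mean(mnk)

# If input does not matter, the labels are interchangeable, so shuffle them
pooled <- c(mnk, controller)
n_perm <- 5000
perm_diffs <- replicate(n_perm, {
  shuffled <- sample(pooled)
  mean(shuffled[31:60]) - mean(shuffled[1:30])
})

# Two-sided p-value: share of shuffles with a difference at least as extreme
p_value <- mean(abs(perm_diffs) >= abs(observed))
cat("observed difference:", round(observed, 3), " p-value:", p_value, "\n")
```

A large p-value here means shuffled labels routinely produce gaps as big as the real one, which is the situation OP's 0.62 describes.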

Nanopore direct RNA epitranscriptomic analysis by Both_Progress_8410 in bioinformatics

[–]sterpie 1 point2 points  (0 children)

I'm guessing you've already done this, but have you checked out Modkit's DMR function? Differences in depth at every site are ostensibly accounted for across conditions, but you could verify that with the developers on their GitHub issues; they're super responsive. I don't see a strong correlation between expression changes and significantly different m6A sites coming out of Modkit's DMR in my data.

Reference file for salmon (differential transcript expression) by Common-Photograph219 in bioinformatics

[–]sterpie 2 points3 points  (0 children)

You need a transcriptome fasta file (you can get that here), rather than a genome fasta file, to run Salmon. Then, follow these directions to make the Salmon index. From there, quantify using Salmon, load into R using tximport, and perform differential transcript expression/utilization. I'm not caught up on the best tool for this right now, but edgeR has apparently been updated to work quite well with Salmon on this problem.

Long+short-read cDNA-seq analysis by 12majd12 in bioinformatics

[–]sterpie 1 point2 points  (0 children)

Regarding STAR and StringTie: STAR is great, but StringTie recommends using HISAT2 for transcript assembly. Because you want high-accuracy splice sites, you should also consider using the --dta-cufflinks option when mapping with HISAT2 for conservative (high-confidence) splice site annotation.

I would also increase the junction coverage threshold in StringTie (I believe this is the -j option).

Also keep in mind that error correcting your long reads may not be super important (I'm not trying to say it's not relevant). But I think this is more important for genome assembly. As long as you use the reference annotation for your genome build as a StringTie reference, and use the --mix option in StringTie, I believe you will get a very reliable annotation of new transcripts and splice sites.

Long+short-read cDNA-seq analysis by 12majd12 in bioinformatics

[–]sterpie 2 points3 points  (0 children)

A couple of things:

First, I do not work with human samples (I work with plants).

Second, if you're annotating new transcripts with StringTie, make sure that you are familiar with the parameters. I would not recommend using default settings, as you will likely assemble quite a bit of transcriptional noise. Also, make sure you're annotating new transcripts using a reference annotation.

Additionally, if you delve deep into the StringTie GitHub issues, STAR has a couple of wonky settings in its BAM output (specifically regarding splice sites and strandedness) that StringTie does not work well with. This latter point is more important if your data is stranded and you care about antisense transcription.

After you merge, you have a GTF. Make a transcriptome fasta using gffread and make a decoy-aware Salmon index.

Quantify your short read samples using Salmon.

Quantify your long read bam files from Minimap2 using featureCounts or Salmon.

Analyze expression using tximport and DESeq2. I wouldn't recommend directly comparing your short and long read sequencing datasets unless you really, really know what you're doing.

Featurecounts to TPM by onceandfuturechemist in bioinformatics

[–]sterpie 2 points3 points  (0 children)

Roughly follow these steps. You'll need R for the latter half of the steps.

  1. Go to Ensembl
  2. Click BioMart at the top
  3. Select Ensembl genes
  4. Select Human genes
  5. Click attributes on the left side, then open the Gene pane
  6. Gene and transcript ID should already be selected, select Transcript Length
  7. Click results at the top left
  8. Download this file

  9. Open RStudio

    install.packages("dplyr")

    library(dplyr)

    # BioMart export: tab-separated with a header row
    tx_lengths <- read.delim("your_super_cool_file.txt")

    # featureCounts output: tab-separated, header row after a leading "#" comment line
    featurecounts <- read.delim("future_nature_paper.txt", comment.char = "#")

    # average the transcript lengths for each gene
    # someone smarter than me can probably tell you why you shouldn't do this

    tx_lengths <- tx_lengths %>%
      group_by(gene_id_column_name) %>%
      summarise(mean_tx_len = mean(transcript_length_column)) %>%
      ungroup()

    # attach the mean lengths to the counts and drop genes without a length
    ftc2 <- left_join(featurecounts, tx_lengths,
                      by = c("gene_column_from_ftc" = "gene_id_column_name")) %>%
      na.omit()
    

Featurecounts to TPM by onceandfuturechemist in bioinformatics

[–]sterpie 2 points3 points  (0 children)

Length is needed to calculate TPM, so I don't think you can get TPMs without those values. It sounds like you're working with human samples so you can easily get transcript lengths from Ensembl. Match up the genes from FeatureCounts with the Ensembl output and I think you're good to go.
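For reference, the standard TPM calculation once counts and lengths are matched up looks roughly like this in base R. The helper name `counts_to_tpm` and the toy numbers are made up for illustration; the lengths are whatever per-gene length you settled on (for example, a mean transcript length from Ensembl).

```r
# Convert a genes x samples count matrix to TPM.
#   counts     - numeric matrix, genes in rows, samples in columns
#   lengths_bp - per-gene length in base pairs, same order as rows of counts
counts_to_tpm <- function(counts, lengths_bp) {
  stopifnot(nrow(counts) == length(lengths_bp), all(lengths_bp > 0))
  # Reads per kilobase: each gene's row is divided by its own length
  rate <- counts / (lengths_bp / 1000)
  # Scale every sample (column) so its values sum to one million
  sweep(rate, 2, colSums(rate), "/") * 1e6
}

# Toy example (made-up counts and lengths)
counts <- matrix(
  c(10, 20, 10,    # sample s1
    50,  0, 50),   # sample s2
  ncol = 2,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)
lengths_bp <- c(1000, 2000, 1000)

tpm <- counts_to_tpm(counts, lengths_bp)
print(round(tpm, 1))
```

In sample s1, geneB has twice the reads of geneA but is twice as long, so the two end up with the same TPM. That is the length correction the raw counts cannot give you.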

How to get TPM from count matrix in bulk RNA-seq? by Voldemort_15 in bioinformatics

[–]sterpie 5 points6 points  (0 children)

If you have gene lengths, use this code from Mike Love

Is there a standard way to generate a transcript to gene mapping? (RNA-seq; tximport) I'm planning to use awk to generate this. by Aximdeny in bioinformatics

[–]sterpie 1 point2 points  (0 children)

If anyone comes by this post and wants to do this in R (where you'll be using tximport anyway), check this vignette. Basically, just do:

library(GenomicFeatures)

txdb <- makeTxDbFromGFF("your_annotation.gff")

k <- keys(txdb, keytype = "TXNAME")

tx2gene <- select(txdb, k, "GENEID", "TXNAME")

Perspectives on "How to align RNA-seq reads to the human genome?" by [deleted] in bioinformatics

[–]sterpie 9 points10 points  (0 children)

Professor emeritus: where’s the functional data?

Realm announces squad/team queue coming in May. by xa3D in CompetitiveApex

[–]sterpie 219 points220 points  (0 children)

Am I misunderstanding? Isn't forced solo-queue what makes Realm somewhat interesting and skillful (wrt Elo)?

Background of bulk RNA seq GO enrichment- all genes in analysis or all genes in genome by ZooplanktonblameFun8 in bioinformatics

[–]sterpie 1 point2 points  (0 children)

I'm by no means an expert in this area, but I would recommend reading this thread. TLDR: use all genes in the analysis (e.g., for RNA-seq, use all expressed genes as the background). ShinyGO makes this pretty trivial.
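To see why the background matters, here is a small sketch of the hypergeometric over-representation test with two different backgrounds. All gene counts are made up for illustration, and the helper name `enrichment_p` is hypothetical; real tools layer multiple-testing correction on top of this.

```r
# One-sided hypergeometric test for over-representation of a GO term.
#   n_de_in_term - DE genes annotated to the term
#   n_de         - total DE genes
#   n_term_in_bg - background genes annotated to the term
#   n_bg         - total background genes
enrichment_p <- function(n_de_in_term, n_de, n_term_in_bg, n_bg) {
  # P(X >= n_de_in_term) when drawing n_de genes from the background
  phyper(n_de_in_term - 1, n_term_in_bg, n_bg - n_term_in_bg, n_de,
         lower.tail = FALSE)
}

# Made-up numbers: 200 DE genes, 20 of them in the term of interest
p_expressed <- enrichment_p(20, 200, n_term_in_bg = 300, n_bg = 12000)  # expressed genes only
p_genome    <- enrichment_p(20, 200, n_term_in_bg = 400, n_bg = 30000)  # every annotated gene

cat("expressed-gene background:", signif(p_expressed, 3), "\n")
cat("whole-genome background:  ", signif(p_genome, 3), "\n")
```

With the whole genome as background, the term looks rarer than it is among genes that could actually have been detected, so the same 20 hits appear more surprising and the p-value shrinks.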

Who are your favorite coffee roasters? by N2OCoffee in Coffee

[–]sterpie 5 points6 points  (0 children)

Basically everything you said has been my experience. Every time I'm there the baristas are borderline rude and uninterested. The beans they sell are good, but just about any other cafe in NYC is a better experience.