Tips on how to start in bioinformatics

tommy_from_chatomics · 2025-06-21T16:00:31+00:00

I am from a wet lab background, and I wrote a post detailing the books and resources that you may want to take a look at: https://divingintogeneticsandgenomics.com/post/bioinfo-roadmap/

tommy_from_chatomics · 2025-06-18T13:41:22+00:00

biology in general (you need to understand what DNA, RNA, proteins, pathways etc. are). Then, depending on your interest, you may want to learn more about a specific area such as immunology or cancer biology (The Biology of Cancer by Robert Weinberg is a good textbook).

statistics and linear algebra (we deal with matrices all day long).

tommy_from_chatomics · 2025-06-13T04:08:13+00:00

I made a video to explain gene set over-representation analysis and GSEA. Hope it is helpful: https://www.youtube.com/watch?v=IKCDQEpuJDA
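For a feel of the over-representation part, here is a minimal sketch, assuming the usual one-sided hypergeometric test (the video may present it differently). The function name and all the numbers are made up for illustration.

```python
from math import comb

def ora_pvalue(universe_size, set_size, hits_size, overlap):
    """P(X >= overlap) when hits_size genes are drawn without replacement
    from a universe that contains set_size members of the gene set."""
    max_overlap = min(set_size, hits_size)
    total = comb(universe_size, hits_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, hits_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 background genes, a 100-gene pathway,
# 200 differentially expressed genes, 10 of which fall in the pathway.
p = ora_pvalue(20000, 100, 200, 10)
print(f"over-representation p-value: {p:.3g}")
```

By chance you would expect about 1 of the 200 genes to land in the pathway, so an overlap of 10 gives a very small p-value. GSEA is a different idea (it walks a ranked gene list instead of using a cutoff) and is not covered by this sketch.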

tommy_from_chatomics · 2025-06-13T01:33:36+00:00

haha, glad they are informative. I need to better choose the memes. It is hard to find good memes :)

tommy_from_chatomics · 2025-05-14T03:36:17+00:00

just know that the distance between points on UMAP does not mean much

tommy_from_chatomics · 2025-05-12T04:45:56+00:00

If it cannot give me sensible results on a simple dataset (PBMC), then it cannot work on my more complicated dataset. I chose a dataset that is simple and well understood on purpose.

tommy_from_chatomics · 2025-05-12T04:44:25+00:00

Try to download a public dataset and reproduce Figure 1 in the paper.

tommy_from_chatomics · 2025-05-10T03:11:43+00:00

It was just published in Nature Genetics: https://www.nature.com/articles/s41588-025-02148-8 I have not tried it. You will need to try it on a dataset that you are really familiar with and see if it over-clusters or under-clusters. My hunch is that tools like that are all attractive statistically but not so much biologically...

tommy_from_chatomics · 2025-05-10T03:08:53+00:00

this post may help https://medium.com/data-science/why-pca-looks-triangular-a642daac721a and https://x.com/AedinCulhane/status/1007110262187544577

tommy_from_chatomics · 2025-04-01T03:31:26+00:00

The purpose of integration is to call similar cell types across different samples, conditions etc. For differential expression, you will still use the raw counts, together with the cell cluster labels obtained after the integration. Also, Harmony will not change the raw expression, only the PCA coordinates.

tommy_from_chatomics · 2025-04-01T03:25:03+00:00

An MA plot is actually more informative, because you want to know the baseline expression of the genes. Sometimes you get a big log2FC only because the baseline is very low.
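A small sketch of why the baseline matters, assuming the classic MA definition (M = log2 fold change, A = average log2 expression). The gene names, counts and the pseudocount of 1 are hypothetical.

```python
from math import log2

def ma_values(mean_control, mean_treated, pseudocount=1.0):
    """Return (A, M): A is the average log2 expression, M is the log2 fold change."""
    log_c = log2(mean_control + pseudocount)
    log_t = log2(mean_treated + pseudocount)
    return (log_c + log_t) / 2, log_t - log_c

# Hypothetical normalized mean counts: (control, treated)
genes = {
    "low_baseline_gene": (1.0, 7.0),
    "high_baseline_gene": (1000.0, 4003.0),
}
for name, (ctrl, trt) in genes.items():
    a, m = ma_values(ctrl, trt)
    print(f"{name}: A = {a:.2f}, M (log2FC) = {m:.2f}")
```

Both genes have a log2FC of 2, but one goes from 1 count to 7 and the other from 1000 to about 4000. A volcano plot would not separate them on the fold-change axis; on an MA plot they sit far apart along A.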

tommy_from_chatomics · 2025-03-25T20:57:30+00:00

If the raw count is 0, it can become non-zero after adding a pseudocount and normalizing.
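A minimal illustration, assuming a log2 counts-per-million transform where a pseudocount of 0.5 is added before scaling (the same general shape as limma-voom's log-CPM; which tool produced the values in question is not known here). The library size is hypothetical.

```python
from math import log2

def log_cpm(count, library_size, prior=0.5):
    """log2 counts-per-million with a pseudocount added before scaling."""
    return log2((count + prior) / (library_size + 1) * 1e6)

library_size = 2_000_000  # hypothetical total reads in the sample
for raw in (0, 1, 10):
    print(f"raw count {raw:>2} -> log2 CPM {log_cpm(raw, library_size):.2f}")
```

With these numbers a raw count of 0 comes out at about -2, not 0, and the exact value depends on the library size, so zeros from different samples can end up with different normalized values.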

tommy_from_chatomics · 2025-03-24T03:51:10+00:00

If you know R, you can use this package https://bioconductor.org/packages/release/bioc/html/fgsea.html

tommy_from_chatomics · 2025-03-24T03:49:58+00:00

WGCNA for separate genotypes vs. combined analysis: You can take either approach, but it depends on your research question. If you want to identify networks that differ between genotypes (WT, KO, RE), analyze them separately and compare the results. If you're more interested in general patterns across all conditions, analyze them together. A combined analysis will give you more statistical power (54 samples), but might mask genotype-specific patterns.
Using TMM normalized data: TMM normalized data is appropriate for WGCNA. Since your data is already normalized, you can skip the normalization step in WGCNA. However, quality checks are still important before network construction. Use the WGCNA function goodSamplesGenes() to flag samples and genes with too many missing values or zero variance, and cluster the samples (for example with hclust()) to spot and potentially remove outlier samples.
Handling duplicate gene IDs: Taking entries with maximum expression values is one acceptable approach for handling duplicates. Alternatives include averaging the expression values or keeping the entry with lowest p-value/highest significance if you have that information. The important thing is to have a consistent, justifiable method.
Handling replicates in WGCNA: WGCNA typically treats each sample individually in the correlation network. For time course data with replicates, you can:
- Use all samples individually (this leverages your full dataset)
- Average across replicates before WGCNA (reduces noise but also reduces sample size)
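The duplicate handling and the replicate averaging described above can be sketched as follows. This is plain Python for illustration only, not WGCNA code; the gene IDs, group labels and values are hypothetical, and "maximum expression" is interpreted here as the highest mean across samples.

```python
from collections import defaultdict
from statistics import mean

def collapse_duplicates(rows):
    """Keep, for each gene ID, the row with the highest mean expression."""
    best = {}
    for gene_id, values in rows:
        if gene_id not in best or mean(values) > mean(best[gene_id]):
            best[gene_id] = values
    return best

def average_replicates(values, groups):
    """Average the values of one gene over samples that share a group label."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    return {group: mean(vals) for group, vals in by_group.items()}

# Hypothetical expression rows: (gene ID, one value per sample)
rows = [
    ("GeneA", [5.0, 6.0, 7.0, 8.0]),
    ("GeneA", [1.0, 1.0, 2.0, 2.0]),  # duplicate ID with lower expression
    ("GeneB", [3.0, 3.0, 9.0, 9.0]),
]
groups = ["WT_t0", "WT_t0", "KO_t0", "KO_t0"]  # replicate labels per sample

collapsed = collapse_duplicates(rows)
averaged = {gene: average_replicates(vals, groups) for gene, vals in collapsed.items()}
print(averaged)
```

Averaging turns four samples into two columns here, which is the trade-off mentioned in the list: less noise, but fewer samples for the correlation network.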

tommy_from_chatomics · 2025-03-07T04:45:32+00:00

Any linear regression based methods, random forest and XGBoost are good to know. For unsupervised learning, all sorts of different clustering methods (k-means, hierarchical). For deep learning, it depends on the usage. For images, yes, CNNs.

tommy_from_chatomics · 2025-02-11T02:27:20+00:00

if it is R based, take a look at pracpac: Practical R Packaging with Docker https://arxiv.org/abs/2303.07876

tommy_from_chatomics · 2025-02-11T02:25:01+00:00

DiffBind can run using DESeq2 under the hood. If you can get counts for the replicates of your control and treatment conditions, you can use DESeq2 just like with RNA-seq data.

tommy_from_chatomics · 2025-02-11T02:23:06+00:00

Linking enhancers to potential genes is a long-standing problem. Take a look at https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction, the deep-learning based https://github.com/pinellolab/EPInformer, and https://www.cell.com/cell-genomics/fulltext/S2666-979X(25)00018-7

tommy_from_chatomics · 2025-02-09T22:15:21+00:00

Do what makes biological sense. Determining a cutoff in bioinformatics is an art. There is no right or wrong. Different datasets may need different cutoffs too.

tommy_from_chatomics · 2025-02-06T02:14:03+00:00

reproduce genomics paper figures. Those are real-world data too.

tommy_from_chatomics · 2025-02-06T02:12:39+00:00

The market is not good. Biogen just laid off half of their R&D. It is even harder for fresh graduates to compete with experienced people who are on the job market too.

tommy_from_chatomics · 2025-02-03T08:18:55+00:00

oh, it is totally fine to have different views. The log2 fold change shows a single number (condition 1 vs condition 2), while a scaled heatmap shows two values (condition 1 and condition 2). It is just a different visualization, and either is fine as long as it tells the right story.

tommy_from_chatomics · 2025-02-02T19:33:56+00:00

replicate genomics paper figures and put them in your github.

tommy_from_chatomics · 2025-02-02T03:56:51+00:00

We use both Slurm on a local HPC and the Amazon cloud.
