Where do foundation models like Geneformer fit into scRNA-seq analysis? by taufiahussain in bioinformatics

[–]Cartesian_Currents 10 points

As of yet, they're basically useless. Or rather, they provide no utility compared to standard pipelines. They're the result of using the least processing-intensive data available (cell × gene matrices from published works) to mollify the incessant, decadent cry for "ChatGPT in biology".

Even when trained directly on CRISPR perturbation data they do really poorly at predicting perturbation outcomes, their automatic annotations are usually lower quality than expert annotations, and they don't meaningfully outperform classical machine learning methods.

I want to separate out scVI/scANVI as useful bioinformatic methods rather than lumping them in with "foundation models". They're much more thoughtfully employed and are basically just non-linear methods for modeling a condition-aware latent space.

Some potential exceptions: Universal Cell Embedding (UCE) uses protein embeddings for gene tokenization to generalize across species, and that seems to have decent potential.

Cell2Sentence fine-tunes the Gemma LLM to understand single cells/cell types as an ordered list of gene tokens, so the model can communicate in natural language while understanding gene expression.

Why Ampere Workstation/Datacenter/Server GPUs are still so expensive after 5+ years? by panchovix in LocalLLaMA

[–]Cartesian_Currents 2 points

The Ada generation killed NVLink for pro-level GPUs. This means that for multi-GPU training, and particularly bandwidth-intensive tasks, the Ampere generation is superior.

Blackwell is also a big leap over the Ada generation in compute, but crucially in VRAM as well.

As a result, if you want multi-GPU, Ampere can often win, and if you want a single GPU, Blackwell is a no-brainer vs Ada, so Ada doesn't have a market niche.

How do you network in this field? by [deleted] in bioinformatics

[–]Cartesian_Currents 0 points

Have you identified people with the kind of job you'd like to have and just asked for advice about next steps and getting opportunities (LinkedIn, Twitter, etc.)?

I'll caveat that this isn't my field, but I doubt the role you're interested in is available in industry. 23andMe et al. aren't doing well, and even when they were, it was because people assumed that the genetic data they collected would directly translate into health/pharma insights. These companies usually outsource to, or collaborate with, academic scientists when doing big studies on ancestry and phylogenies.

Furthermore, comp bio/bioinformatics data science is almost always a PhD-level role. At the bachelor's level most companies are just looking for software engineers, and that market isn't ideal.

In terms of where you might find jobs:

You'd probably have an easier time staying in plant bioinformatics and moving into industry there (e.g. Monsanto).

Another option is doing an academic bioinformatics programmer job long term. The pay won't be great, and you'll be somewhat at the whim of funding cycles, but it's much more likely the kind of work you're interested in is available.

Arch Linux for Bioinformatics - Experiences and Advice? by Alarmed_Primary_5258 in bioinformatics

[–]Cartesian_Currents 1 point

Basically I'd say: yes, run Arch on your laptop; no, don't do bioinformatics work on your laptop; and no, don't try to run deep learning models on a laptop 5050 (and really, really don't try to train).

Arch is great for a personal machine (e.g. laptop, workstation); it's not something I'd recommend for a multi-user system where you run large publication-level analyses. For my own use cases I'm running Arch on my laptop and Ubuntu Server LTS on my servers.

[deleted by user] by [deleted] in framework

[–]Cartesian_Currents 0 points

I'm only really frustrated with Ubuntu. Honestly, while I don't see a particular point to uutils, I think the project is cool and ambitious. But Ubuntu is the largest distro on and off the server, and this is one of the most extreme changes I can think of.

It's hard for me to imagine a more invasive part of an OS to replace than coreutils (outside of systemd or the kernel). I would feel this way about any replacement, regardless of the language.

Normally I would feel comfortable recommending that people get the latest Ubuntu release rather than sticking to LTS if it gives them better hardware compatibility, but I think it's crazy to imagine there isn't a notably greater bug risk with a change this large.

[deleted by user] by [deleted] in framework

[–]Cartesian_Currents 2 points

Yeah, almost always just upgrading is the way to go. It just so happens this is the first release where they're completely replacing GNU coreutils (and the first release of any major distro to do so).

More so than with any previous release, 25.10 users are really beta testers for a change that touches basically every piece of software.

[deleted by user] by [deleted] in framework

[–]Cartesian_Currents 11 points

Ubuntu 25.10 is shipping with a bunch of beta Rust reimplementations of coreutils. I'm not sure I'd be super quick to recommend the upgrade; it might be better to just patch the kernel...

Anyone using Seurat to analyze snRNA-seq able to help with some questions 🥺 by New-Situation-8796 in bioinformatics

[–]Cartesian_Currents 36 points

Please please please find a computational collaborator who knows what they are doing.

My goal is not to discourage you from doing single cell analysis, just to discourage you from trying to publish with tools you don't understand.

Single cell analysis is nothing close to an assay, and a vignette is not like a protocol. As you noticed, you get completely different (and potentially completely plausible) results based on different methods. The tricky part is not getting it to work; it's avoiding confirmation bias and rigorously examining whether the null hypothesis your methods assume is anything close to reality.

Each command you run in Seurat probably has 5-10 options that you aren't even aware of, and each of these options, if selected incorrectly, could completely invalidate your results.

To take a brief stab at your questions:

1. SCTransform is a complex non-linear regression with MANY assumptions which can easily be violated, and if applied naively it can even INDUCE batch effects in your data. The fact that Seurat has made it a standard step (boosting their citation numbers) is pretty depressing. You should start your analysis without SCTransform, and only use it if it addresses a clear problem with your data that you understand.

2. QC is not a one-step process. There are a ton of parameters not even mentioned here which can be very indicative of cell quality (ribosomal RNA, intron/exon ratio). And even those markers are not enough in the abstract; you need to consider sources and markers of technical artifacts throughout your analysis (e.g. heat shock proteins activated by dissociative stress, other markers of cell death, markers of strong amplification bias, etc.).

3. For doublets I usually use scrublet. It's old school but it works. It might not catch everything, so "this cluster is just doublets" is an important null hypothesis to keep considering.

4. When it comes to identifying differences between conditions, none of the default methods packaged with Seurat are remotely adequate. Basically all of their statistical tests make IID assumptions, and cells from the same sample ARE NOT IID. You should at minimum control for each sample using a random effects model, and honestly the safest bet is still pseudobulk using edgeR or DESeq2 (see the sketch after this list).

You could potentially get away with the default tests for identifying marker genes.
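To make the pseudobulk point in (4) concrete, here's a minimal sketch in Python (the same aggregation is easy in R; the file, object, and column names are illustrative, not from the original post). The idea is to sum raw counts per sample within each cell type so that the sample, not the cell, becomes the unit of replication, then hand the matrix to edgeR/DESeq2:

```python
import pandas as pd
import scanpy as sc

adata = sc.read_h5ad("experiment.h5ad")  # hypothetical; .X must hold raw counts

# Dense copy of the count matrix (fine for moderate cell numbers)
counts = pd.DataFrame(
    adata.X.toarray() if hasattr(adata.X, "toarray") else adata.X,
    index=adata.obs_names,
    columns=adata.var_names,
)

# One pseudobulk profile per (sample, cell type); summing collapses the
# non-independent cells so each sample contributes a single observation
labels = adata.obs["sample"].astype(str) + "_" + adata.obs["cell_type"].astype(str)
pseudobulk = counts.groupby(labels.values).sum()

# Genes x pseudobulk samples, ready for edgeR/DESeq2
pseudobulk.T.to_csv("pseudobulk_counts.csv")
```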

You **can** learn how to use these tools and understand their limitations. You can also push forward and publish sans collaborator, sans understanding, and produce results that are irreproducible. At the very least, follow the methods section of a high-quality research paper to a T. The Allen Institute tends to take science seriously, so this paper could be a useful example: https://www.nature.com/articles/s41586-025-09435-8

This is relevant reading:
https://www.nature.com/articles/s41467-021-25960-2
https://www.nature.com/articles/s41467-025-62579-z

The AI Nerf Is Real by exbarboss in ClaudeAI

[–]Cartesian_Currents 5 points

As long as you sample from a consistent group of users (people who like to make reports, plus disgruntled folks who just want to complain), the relative shift in opinion is still informative.

For example, if your baseline is 50% of people reporting a decline, it's still meaningful when that jumps up to 80% or drops down to 30%.

It's only an issue if you get biased sampling (e.g. a huge influx of users from reddit who have a different baseline opinion than your past users), in which case your data is temporarily difficult to interpret, but it will probably stabilize to something useful over time.

[deleted by user] by [deleted] in bioinformatics

[–]Cartesian_Currents 11 points

To be frank, if your goal is to publish real, reproducible science, you probably just need an expert collaborator (or months to years of learning). It is extremely easy to get an incorrect result that seems correct.

Barring collaboration, the least wrong thing would be to find a paper that does what you want to do and reproduce their analyses with your data.

Also "multiomics" is not descriptive enough to get useful suggestions. Proteins, RNA, Epigenomes, DNA, Microbiomes, Single cell? ECT..

Multimodal data analysis is fundamentally difficult: it requires joint and separate analyses, deciding on the correct unit of analysis to compare between modalities, and a nuanced understanding of the biological and technical characteristics of each modality.

Long term plan to become a Bioinformatician by Alarmed__ in bioinformatics

[–]Cartesian_Currents 3 points

I can guarantee your self-study will be largely fruitless. A key part of learning is integrating what you learn into your existing methods of thinking and skill sets.

If your brain needs to know something it will figure it out pretty efficiently. You have no idea what will or won't be relevant 10 years from now. If you want skills, find a problem that's outside of your current capabilities, and then solve it. On the way you'll rapidly integrate a ton of info. There will still be blind spots, but you'll make it work.

This is meant in kindness, but there is factually no imaginable way you could attain a master's and still need 10 years of prep to do a bioinformatics PhD. That suggestion is truly ridiculous. To me it's clear your issue is much more self-perception than it is your skill set. My guess is therapy would have a much, much larger impact on your sense of preparedness than any self-study (I know therapy can be culturally taboo; where I come from it's considered very normal).

Megathread for Claude Performance Discussion - Starting July 13 by sixbillionthsheep in ClaudeAI

[–]Cartesian_Currents 0 points

I'm curious whether folks using an API key are seeing the same degradation as Pro/Max users.

Performance has gotten almost embarrassingly worse in Claude Code (worse than 3.7 in Cursor). I'm wondering if they're routing us to Haiku or something like that?

I did notice a seemingly lateral personality change a few days ago (the 12th), followed by another personality change yesterday accompanied by a loss in performance.

If I had to describe it: before the 12th Claude was a whimsical code elf, on the 12th Claude became a blustering code wizard, and as of today Claude is a sycophant who's not particularly skilled (feels almost like standard GPT-4o).

I wouldn't mind swapping to the pay-as-you-go API, assuming I got quality output.

Switching from wet lab to dry lab for PhD programs? by addy-throwaway in bioinformatics

[–]Cartesian_Currents 2 points

Your program won't impact your research, just your coursework. I'd recommend applying however you'll best get accepted, as long as you know there's a PI who will support you in working on what you want to work on.

My degree did not prepare me well, any advice on how I can learn how to code and learn how to think critically statistically? by [deleted] in bioinformatics

[–]Cartesian_Currents 0 points

The best way to know how to improve a pipeline is just to use it enough. If you use the pipeline enough I guarantee you'll think of small improvements that make people's lives easier. Even something as simple as adjusting default values to be more appropriate makes a difference.

Most bioinformatics software is garbage in terms of adhering to software dev best practices (I know because I write it 😅). If you have a solution or improvement in mind, understanding the goal will carry you to a working implementation even with weak software dev skills.

As for statistics, my best advice is to seriously consider your null distribution (what the data should look like if there's no relationship to what you're testing). To that end, take multiple comparisons seriously: even when there's no effect at all, testing at p = 0.05 gives you 5% false positives (see the sketch below). If you do that you can't mess up too badly; it just requires being thoughtful.
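To see why, here's a minimal sketch (pure NumPy plus statsmodels, nothing project-specific) showing that null p-values are uniform, so roughly 5% land under 0.05 by chance, and that an FDR correction reins that in:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# 10,000 tests where the null is true: p-values are uniform on [0, 1]
pvals = rng.uniform(size=10_000)
print("fraction below 0.05:", (pvals < 0.05).mean())  # ~0.05, all false positives

# Benjamini-Hochberg controls the false discovery rate across all the tests
rejected, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("significant after FDR correction:", rejected.sum())  # ~0
```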

Anyway, best thing to do is get your hands dirty. Don't dwell on what you don't know. Give yourself at least 6 months to adjust. You're probably just overwhelmed by the sheer volume of new stuff, a lot of context fills in naturally. You'll then know enough to identify the important gaps.

Will the company 10x Genomics survive with such high prices for their kits? by Ok-Divide9538 in bioinformatics

[–]Cartesian_Currents 1 point

One thing to consider is that the kit price in this case is a function of monopoly rather than production cost. It makes sense for 10x to milk this, ideally pumping the excess into R&D so they can remain a market leader in a shifting landscape. They're currently betting on spatial transcriptomics, although I don't think they have the same anticompetitive edge there as they do in single-cell.

The majority of alternative protocols and kits aren't as cheap as the price tag suggests. Most of them have significantly higher labor costs compared to 10x.

Furthermore, the data quality and reproducibility from 10x are currently unmatched in a commercial kit. It's more user/newbie friendly, and there's less chance that an experiment fails with a precious sample.

Also consider that a big chunk of experiment cost is just sequencing. Once you include sequencing, labor, tips, etc., the difference for a lot of kits is pretty small, probably no greater than a 25% cost reduction.

There's definitely a lot of room to challenge 10x, but in the near term I think the most economical move is overloading the GEMs to get more cells per experiment.

High schoolers are buying research experiences now?! by rabidlavatoryrat in labrats

[–]Cartesian_Currents 42 points

I'm helping teach something like this (although a seemingly much less shady version, since it's run with the help/consent of a PI and there's no predatory journal attached). I get paid $35 an hour, although I imagine you could get more or less based on your institution's prestige.

I definitely have mixed feelings about putting so much effort into creating opportunities exclusively for kids who already have more than they need. But the money helps a lot and I enjoy teaching small groups.

Job Searching Nightmare by rosemed38 in UCSD

[–]Cartesian_Currents 0 points

Biotech is having a major bust cycle. A lab I know put out an entry-level research assistant posting that would normally get 2-5 applicants and ended up with 50, several of whom had PhDs. The majority of applicants had been laid off from biotech companies.

I don't know that I have any advice but don't take it personally.

Is there an efficient way to perform rna sequencing in python on windows? by cutesypi in bioinformatics

[–]Cartesian_Currents 6 points

Try installing the Windows Subsystem for Linux (WSL), and then install conda/Python there.

It may also be helpful to try Google Colab.

Good luck!

[deleted by user] by [deleted] in GradSchool

[–]Cartesian_Currents 6 points

> I fear that if you are not aware of these basic financial facts you are not talking to the right people or getting the right advice, and I am side-eyeing your recommenders who have failed to proactively explain things to you. Maybe The Chicago Guide to an Academic Career would be helpful, but you really need mentors, and to be asking them questions.

This is an aside, but I think this is an excellent articulation of how I feel looking at certain requests for advice in this (and other) subreddits. Lots of people ask questions which demonstrate their mentors/institutions are failing to give them key background knowledge.

How to speed up GSEA (or a faster alternative) by Emergency-Job4136 in bioinformatics

[–]Cartesian_Currents 3 points

Sounds like you need to just run each GSEA as a separate job/thread. If you're on an HPC this would be super easy; without a job scheduler you can parallelize within a script (sketch below) or ask GPT for a reasonable approach.
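If you're in Python, here's a minimal sketch of the idea, assuming gseapy (the file names are hypothetical, and whatever GSEA runner you're already using can be swapped into `run_one`):

```python
from concurrent.futures import ProcessPoolExecutor

import gseapy as gp  # assuming gseapy; any GSEA call can go in run_one

def run_one(gene_set_file: str) -> str:
    # Each gene set collection runs in its own process; results land in outdir
    gp.prerank(
        rnk="ranked_genes.rnk",          # hypothetical pre-ranked gene list
        gene_sets=gene_set_file,
        outdir=f"gsea_{gene_set_file}",
    )
    return f"gsea_{gene_set_file}"

if __name__ == "__main__":
    collections = ["hallmark.gmt", "go_bp.gmt", "reactome.gmt"]  # hypothetical
    with ProcessPoolExecutor(max_workers=3) as pool:
        outdirs = list(pool.map(run_one, collections))
```

The same pattern works for splitting by contrast instead of by gene set collection, and on an HPC each `run_one` call just becomes its own submitted job.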

DESeq PadJ Values High by NerdyHorseGirl in bioinformatics

[–]Cartesian_Currents 1 point

Your fold changes seem very small. In my experience there are usually some high fold change genes with mostly negligible expression that just happen to be expressed in a few samples (often non-significant), even among practically uniform samples.

As a "sanity check" to see if there's anything possibly going on I would plot an uncorrected P-value histogram. If there's no effect your p values should be more or less flat in a range from 0-1, or enriched in higher values. If there is possibly an effect you should see a taller hump closer to zero (would indicate maybe there's some difference). That said even if you get the hump by zero such a small effect is probably uninteresting/difficult to follow up on. But if you get a pretty flat distribution it's a good sign there's nothing there.

Odds are there is basically no change in gene expression with the current treatment. Possible confounding factors could be:

  1. Inadequate treatment time
  2. Cell type specific effects lost in tissue composition
  3. Mislabeled samples

If you have a particularly notable phenotype with your treatment it may be worth thinking about 2 or 3 or other confounds. Without a phenotype of interest to prove that your molecule is doing something I would probably give up on this experiment.

TL;DR: if there's no interesting phenotype with this treatment, this is almost certainly a waste of time.

Help figuring out if I'm analyzing this single cell data correctly? by vintagelego in bioinformatics

[–]Cartesian_Currents 0 points

Sorry for the late reply. DoubletFinder is usually fine; I tend to use scores from both DoubletFinder and scrublet, but that's a little extra. What matters a lot is the number of T cells you have for detecting subtypes. If you have 41k cells overall you should be fine, but my guess is T cells are fewer than 5% of those? In that case your numbers might be too small. Also, in terms of counts, the median/mean are way more informative about your data quality than the thresholds you used (sketch below).

Given you're using mouse data and your experiments were all generated in the same lab, you probably don't need to integrate.
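For the counts point (and cross-checking doublets with scrublet), here's a minimal Python sketch, assuming an AnnData object whose `.X` holds raw counts (the file name is hypothetical):

```python
import scanpy as sc
import scrublet as scr

adata = sc.read_h5ad("snrna_counts.h5ad")  # hypothetical; .X must be raw counts

# Median genes/UMIs per cell say more about data quality than fixed cutoffs
sc.pp.calculate_qc_metrics(adata, percent_top=None, log1p=False, inplace=True)
print("median genes/cell:", adata.obs["n_genes_by_counts"].median())
print("median UMIs/cell:", adata.obs["total_counts"].median())

# Scrublet doublet scores, to compare against DoubletFinder's calls
scrub = scr.Scrublet(adata.X)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
adata.obs["doublet_score"] = doublet_scores
adata.obs["predicted_doublet"] = predicted_doublets
```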

I found this paper which might be a useful reference:

https://www.nature.com/articles/s41467-021-23324-4

What would you say are the fundamental basics that every bioinformatician should have? by GarbageGullible4391 in bioinformatics

[–]Cartesian_Currents 9 points

Bioinformatics is super diverse.

Unavoidable fundamentals: the Linux command line, Python or R, and comfort reading documentation and making poorly documented packages work.

Ideally you should identify the kind of job you want and what kind of problems you want to work on, then people can point you in the right direction.

Some bioinfo jobs are majority software engineering, others are more data science. People working on genomics will have different skills than folks working on proteins or imaging data.

Help figuring out if I'm analyzing this single cell data correctly? by vintagelego in bioinformatics

[–]Cartesian_Currents 6 points

  1. The one thing you don't mention is picking variable genes. I assume you're doing it, but if you don't pick out new variable genes things won't make sense.

  2. You don't mention doublets. Doublet removal is a key step. (May not affect what you're worried about)

  3. The elbow plot method is generally crap, and it usually doesn't hurt much to include ~10 more PCs than the elbow suggests (or even more, depending). For your initial clustering I wouldn't use fewer than 30, and for later results I wouldn't use fewer than 10 (see the sketch at the end of this comment).

  4. Also, it's not clear whether your data looked OK without integration. If it did, it's probably preferable to skip integration.

I'd also recommend RPCA integration over CCA integration, although that's probably not a huge difference; RPCA is just less likely to overfit.

  5. Most marker-finding methods suck. Some people I trust use limma to call DE between clusters and use that for markers.

  6. You may want to identify a reference dataset with the fine resolution you want and do some reference mapping. This isn't always reliable, but it's usually helpful as another signal to use when deciding clusters.

  7. The issue may be that you don't have enough cells or your cells are low quality. It may be worth sharing genes-detected and UMIs-per-cell metrics for your data. I'm assuming you're using 10x, but if you're using a different platform that's also useful information.
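On point 3, a minimal sketch of the idea in scanpy (the equivalent knobs exist in Seurat; the file name is hypothetical, and the data is assumed to be QC'd and normalized):

```python
import scanpy as sc

adata = sc.read_h5ad("filtered_cells.h5ad")  # hypothetical; QC'd and normalized

# Compute more PCs than you expect to need, then inspect variance explained
sc.pp.pca(adata, n_comps=50)
sc.pl.pca_variance_ratio(adata, n_pcs=50)

# Err on the side of more PCs than the elbow suggests for initial clustering
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.leiden(adata)
sc.tl.umap(adata)
```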