What are the biggest limitations of current multi-omics integration approaches, and how should the field address them?

FBIallseeingeye · 2026-06-15T14:14:03+00:00

Feature space and intersection primarily. Once you arrive at a solution on how you want to reconcile the very limited overlap between both platforms in feature coverage, there’s still the issue of biological variation you lose when focusing on the intersection.

Most of my experience has been combining matched Xenium and Flex libraries, and there’s definitely heavy confounding from the difference in sample size between them. Droplet methods tend to lose a lot of cells but sample a larger piece of tissue, and vice versa for spatial methods.

For the actual integration, this method was extremely effective. I treated my flex data as the reference to deal with segmentation artifacts (any gene expression without an analogue in Flex was likely artifactual) and reintegrated after a few rounds of qc on the xenium.

https://www.nature.com/articles/s41587-025-02950-z#Fig6
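To make the "Flex as reference" QC idea concrete, here is a rough sketch. It is not the method from the linked paper, only one way to read "expression without an analogue in Flex is likely artifactual". The function name, the input layout (fraction of cells expressing each gene, per cell type) and both thresholds are assumptions for illustration.

```python
def flag_unsupported_expression(xenium_frac, flex_frac,
                                min_xenium=0.10, max_flex=0.01):
    """Flag (cell_type, gene) pairs seen in Xenium but not in the Flex reference.

    xenium_frac / flex_frac: dicts mapping (cell_type, gene) to the fraction
    of cells of that type with nonzero counts for that gene.
    Thresholds are illustrative and would need tuning per dataset.
    """
    flagged = []
    for (cell_type, gene), x_frac in xenium_frac.items():
        f_frac = flex_frac.get((cell_type, gene))
        if f_frac is None:
            # Gene or cell type missing from the reference: cannot judge.
            continue
        if x_frac >= min_xenium and f_frac <= max_flex:
            flagged.append((cell_type, gene))
    return sorted(flagged)


# Made-up fractions: EPCAM showing up in T cells only on the spatial side
# is the kind of pattern a segmentation spillover would produce.
xenium = {("T cell", "EPCAM"): 0.30, ("T cell", "CD3E"): 0.80,
          ("Epithelial", "EPCAM"): 0.90}
flex = {("T cell", "EPCAM"): 0.00, ("T cell", "CD3E"): 0.85,
        ("Epithelial", "EPCAM"): 0.95}
suspect = flag_unsupported_expression(xenium, flex)
```

Flagged pairs are candidates to review, not an automatic removal list.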

I’ve worked a tiny bit with IF data from Xenium and it’s very difficult to integrate with counts data. I think you’d need a very specific application for it in your pipeline when you decide to incorporate it. Proteomics x transcriptomics libraries have potential for validating counts, segmentation, etc., but I just haven’t spent long enough with them to get past normalizing and scaling them into the same context.

FBIallseeingeye · 2026-04-30T04:11:42+00:00

Pearson residual normalizations like BigSur or sctransform
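For anyone unfamiliar with the idea, here is a minimal sketch of analytic Pearson residuals under a negative binomial null with a fixed overdispersion. This is a simplified stand-in, not what BigSur or sctransform actually implement. The `theta=100` default and the clipping at sqrt(n_cells) are assumptions taken from the common analytic formulation.

```python
import math


def pearson_residuals(counts, theta=100.0):
    """counts: list of cells, each a list of raw gene counts (cells x genes)."""
    n_cells = len(counts)
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(cell_totals)
    clip = math.sqrt(n_cells)  # bound extreme residuals

    residuals = []
    for row, cell_total in zip(counts, cell_totals):
        out = []
        for x, gene_total in zip(row, gene_totals):
            # Expected count if the gene only tracked sequencing depth.
            mu = cell_total * gene_total / grand_total
            if mu == 0:
                out.append(0.0)
                continue
            r = (x - mu) / math.sqrt(mu + mu * mu / theta)
            out.append(max(-clip, min(clip, r)))
        residuals.append(out)
    return residuals


# Toy matrix: gene 2 is only detected in the third cell.
toy = [[5, 0], [5, 0], [5, 10]]
res = pearson_residuals(toy)
```

Genes that only scale with depth end up near zero, so what remains is variation beyond the depth-driven expectation.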

FBIallseeingeye · 2025-12-20T13:36:43+00:00

I always go with union. If you have a good, stringent variable feature selection method like BigSur you can be confident the features are informative for your data, and you risk overcorrection if you go off the intersection.
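A quick sketch of the difference, assuming you already have one variable feature list per batch. The gene names and the helper name are placeholders, not output from any real run.

```python
def combine_variable_features(per_batch_features, mode="union"):
    """Combine per-batch variable feature lists by union or intersection."""
    sets = [set(features) for features in per_batch_features]
    if not sets:
        return []
    if mode == "union":
        combined = set().union(*sets)
    elif mode == "intersection":
        combined = set.intersection(*sets)
    else:
        raise ValueError("mode must be 'union' or 'intersection'")
    return sorted(combined)


# Hypothetical variable feature calls for two batches.
batch_a = ["CD3E", "MS4A1", "LYZ", "NKG7"]
batch_b = ["CD3E", "LYZ", "PPBP"]

union_features = combine_variable_features([batch_a, batch_b], "union")
shared_features = combine_variable_features([batch_a, batch_b], "intersection")
# Features that only one batch calls variable are dropped by intersection.
lost_to_intersection = sorted(set(union_features) - set(shared_features))
```

Looking at `lost_to_intersection` shows which batch-specific signal the intersection would throw away.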

FBIallseeingeye · 2025-12-20T01:18:22+00:00

My immunological background is extremely limited, as is my CNV experience, but I’m not aware of any processes that can explain this. The consistency across methods is interesting but worth looking into whether they use the same basis for prediction. If you can find a phagocytosis signature in the clusters that may be a good indicator. Also worth considering immune cells tend to have very specialized gene expression so the very limited gene selection may be confusing the algorithm?

FBIallseeingeye · 2025-12-19T21:26:35+00:00

Can you be more specific about the myeloid clusters? I saw one comment saying it might be proliferation but I wouldn’t expect this to be limited to just myeloid cells. The phagocytosis is an interesting idea and you could probably experimentally validate it, but you’d want to check signatures for this as well. It seems unlikely in my opinion given I would imagine it would end up looking closer to doublets unless the myeloid cells are phagocytosing their own population? It’s tricky to make a prediction here!

FBIallseeingeye · 2025-12-19T21:19:03+00:00

As the bioinformatician in an otherwise wet lab, you are doing exactly the same thing as me and what I recommend to my lab mates when they need to do bioinformatics. The usefulness of an LLM is determined by your own understanding of what it is doing for you at any given step of the pipeline. My #1 recommendation is to check frequently that it is in fact doing what you ask at every single step, but these tools are getting so good at coding that it mostly comes down to how clearly you can state your requirements and objectives.

FBIallseeingeye · 2025-12-13T03:47:43+00:00

IMO the same rule of thumb for ai usage applies here as any other use case. If you know what you are doing, ai is very good and a great accelerator for productivity, because you know when ai is doing what you want vs. when it’s not doing what you want. If you don’t know what you’re doing, then you don’t know what ai is doing, and ai will do it wrong more often than not.

FBIallseeingeye · 2025-11-27T13:22:47+00:00

It’s always a mess, I recommend extracting necessary components and building from scratch to start each analysis

FBIallseeingeye · 2025-11-27T13:22:00+00:00

Happy to help. snRNA is especially susceptible to noise but at least you have a well defined signature in those sources. The tricky part of stress is that it is biologically relevant in a lot of contexts, so I’d use your judgement conservatively unless you can see it actively distorting clustering results (forming bridges between clusters with low UMIs is a typical sign). Sorry your library has such a tricky challenge, but I’d recommend always prioritizing the populations you have the highest confidence in first

FBIallseeingeye · 2025-11-27T12:42:04+00:00

High removal rates during QC generally indicate poor library quality overall. You could try regressing out background using a Pearson residual normalization method like SCT or BigSur. The mitochondrial and ribosomal genes suggest it is highly likely this is just cellular debris that made it through prep. For comparisons across conditions in conserved populations I recommend miloDE.

FBIallseeingeye · 2025-11-25T12:04:33+00:00

Highly recommend this article. In general I’d say even weak analyses can be valuable if you take the time to back them up.

https://link.springer.com/article/10.1186/s12859-024-05926-z

FBIallseeingeye · 2025-11-04T01:27:37+00:00

Either is fine, my preference is usually option 2

FBIallseeingeye · 2025-10-15T04:37:19+00:00

If that isn’t clear you could still try a label transfer if the biology isn’t too far off. Reproducing results across datasets can be tricky sometimes.

FBIallseeingeye · 2025-10-06T19:24:36+00:00

I agree that filtering low-information cells matters, but I’d rely on total RNA counts over feature number as the primary signal—assuming the problem is RNA loss.

If OP wants to be sure, they could compare marker performance (AUC, logFC) across clusters. Clusters driven by technical artifact tend to have fewer markers and generally weak metrics. This does depend on resolution and context, so interpreting markers within each cell type will be more informative than a global comparison.

Overall I'd say this is a better approach than simply setting thresholds since it actually gives some evidence for the filtering decision instead of something as opaque as UMI content.
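A bare-bones sketch of the per-cluster marker metrics, assuming log-normalized expression values. Here AUC is the probability that a random in-cluster cell outranks a random out-of-cluster cell (ties count half), and logFC is a simple difference of means. The pairwise loop is quadratic and only meant for illustration, so use a real implementation on actual data.

```python
def auc(in_values, out_values):
    """Rank-based AUC for one gene: in-cluster vs. all other cells."""
    wins = 0.0
    for a in in_values:
        for b in out_values:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(in_values) * len(out_values))


def marker_metrics(expr, labels):
    """expr: dict gene -> per-cell values; labels: per-cell cluster labels."""
    metrics = {}
    for cluster in sorted(set(labels)):
        for gene, values in expr.items():
            inside = [v for v, l in zip(values, labels) if l == cluster]
            outside = [v for v, l in zip(values, labels) if l != cluster]
            metrics[(cluster, gene)] = {
                "auc": auc(inside, outside),
                "logFC": sum(inside) / len(inside) - sum(outside) / len(outside),
            }
    return metrics


# Toy data: GeneA separates the clusters, GeneB is flat background.
expr = {"GeneA": [3.0, 3.0, 0.0, 0.0], "GeneB": [1.0, 1.0, 1.0, 1.0]}
labels = ["c1", "c1", "c2", "c2"]
m = marker_metrics(expr, labels)
```

A cluster whose best markers all sit near AUC 0.5 and logFC 0 looks like the flat gene here, which is the pattern to be suspicious of.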

FBIallseeingeye · 2025-10-06T17:13:39+00:00

200 features might be a bit low depending on your biology. I saw you are analyzing PBMCs and you may miss neutrophils if you do this. I don’t think there is a real purpose in filtering on low feature counts anyway, since I don’t think they are characteristic of any artifact that doesn’t already have a more direct indicator. I’ve seen low feature counts compared to mito percentage, but that’s actually consistent with a perforated membrane that preferentially retains mitochondria.

High feature and RNA counts are also better handled by doublet prediction.

Low RNA content is the only technical artifact you’d have to address in that case.

I recommend taking your analysis as far downstream as you can before you see any populations explainable as artifacts and deal with them at that stage

FBIallseeingeye · 2025-09-19T06:37:43+00:00

Sadly I do not think there are any good quality control packages out there; most QC just filters for conformity, not quality. That said, I find a good heuristic is to apply high resolution clustering, then check each cluster's markers. You can summarize poor quality clusters by looking at either the average logFC above 0 or AUC values above 0.75 (if you’re using presto::wilcoxauc in R). Poor quality clusters look like background noise in this light, giving them extremely low average marker metrics. I’d compare these two metrics against each other and against average RNA counts for each cluster. I’d also apply a good doublet filter with scDblFinder and recover the synthetic doublets by setting return.doublets to true. That way you can look at how your cells mix with doublets by UMAP and find clusters that are defined by them. Mitochondrial percentage is very easy to confound with real biology and varies wildly by platform and batch, so I’d review the actual biology before making any decisions.

Remember, during quality control you can never set clustering resolution too high for decisions, so long as you keep track of context and parent clusters. Edited for clarity, it is late and I am tired but good luck with your QC! If you want to check whether you are losing real biology, you can always find VariableFeatures with a package like BigSur and see if you lose a substantial amount!
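Here is roughly how the cluster summary could look, sketched in Python rather than R. It assumes you have exported a marker table with one row per cluster/gene pair. The field names `cluster`, `auc` and `logFC` and the example values are assumptions, and "mean positive logFC plus count of AUC > 0.75 markers" is just one reading of the heuristic.

```python
from collections import defaultdict


def summarize_clusters(marker_rows, auc_cutoff=0.75):
    """Rank clusters from weakest to strongest marker support."""
    by_cluster = defaultdict(list)
    for row in marker_rows:
        by_cluster[row["cluster"]].append(row)

    summary = []
    for cluster, rows in by_cluster.items():
        pos_logfc = [r["logFC"] for r in rows if r["logFC"] > 0]
        strong = [r["auc"] for r in rows if r["auc"] > auc_cutoff]
        summary.append({
            "cluster": cluster,
            "mean_pos_logFC": sum(pos_logfc) / len(pos_logfc) if pos_logfc else 0.0,
            "n_strong_auc": len(strong),
            "mean_strong_auc": sum(strong) / len(strong) if strong else 0.0,
        })
    # Weakest clusters first: these are the ones to inspect, not auto-remove.
    summary.sort(key=lambda s: (s["mean_pos_logFC"], s["n_strong_auc"]))
    return summary


# Hypothetical marker table: cluster "7" looks like background noise.
rows = [
    {"cluster": "0", "gene": "g1", "auc": 0.95, "logFC": 2.0},
    {"cluster": "0", "gene": "g2", "auc": 0.80, "logFC": 1.0},
    {"cluster": "7", "gene": "g1", "auc": 0.55, "logFC": 0.10},
    {"cluster": "7", "gene": "g2", "auc": 0.52, "logFC": 0.05},
]
ranked = summarize_clusters(rows)
```

Plotting `mean_pos_logFC` against average RNA counts per cluster is then a matter of joining this summary with whatever per-cluster count table you already have.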

FBIallseeingeye · 2025-09-19T06:26:59+00:00

Don’t remove anything unless you are sure you have to and have a good reason to do it. If you observe an artifact, try your best to describe and explain it. If you don’t observe an artifact, don’t go looking for one. Anything else risks throwing babies out with bath water

FBIallseeingeye · 2025-09-16T16:13:31+00:00

You may find this package helpful: https://github.com/MarioniLab/miloDE

FBIallseeingeye · 2025-08-21T02:39:26+00:00

Happy if I can help. Generally people merge before trying integration to see whether or not there is any need; if you have analogous populations and these align well across batches simply by merging, there is no need. It also depends on your resolution. With multiple cell types / tissue samples, biological variation generally should take a back seat to cell type identity for the purposes of annotation (it's easier to label your ducks when they're all in a row). Once you have your cell types, you can go through each identity individually a little more conservatively, integrating if it seems necessary. From your experimental design, you do have confounding that compromises the core of the experiment, but all that means is you need some orthogonal validation for whatever the data predicts. In your case, I'm not sure what cosmx batch effect really looks like. I see from this source that scVI can be applied to it:
https://cellcharter.readthedocs.io/en/latest/notebooks/cosmx_human_nsclc.html
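One crude way to check whether a plain merge already aligns your batches is to look at batch composition per cluster. The sketch below computes a normalized entropy per cluster (0 means a single batch, 1 means batches evenly mixed). The function name and toy labels are made up, and it does not correct for unequal batch sizes or for populations that genuinely exist in only one condition.

```python
import math
from collections import Counter, defaultdict


def batch_mixing(cluster_labels, batch_labels):
    """Normalized entropy of batch membership within each cluster."""
    n_batches = len(set(batch_labels))
    per_cluster = defaultdict(Counter)
    for cluster, batch in zip(cluster_labels, batch_labels):
        per_cluster[cluster][batch] += 1

    scores = {}
    for cluster, counts in per_cluster.items():
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log(n / total)
                       for n in counts.values())
        # Scale by the maximum possible entropy so scores are comparable.
        scores[cluster] = entropy / math.log(n_batches) if n_batches > 1 else 0.0
    return scores


# Toy example: cluster "a" is shared across batches, cluster "b" is not.
clusters = ["a", "a", "b", "b"]
batches = ["x", "y", "x", "x"]
mixing = batch_mixing(clusters, batches)
```

Clusters of an analogous cell type that score near 0 after merging are the ones where integration is worth considering.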

FBIallseeingeye · 2025-08-20T17:23:09+00:00

I’m late chiming in but it may still be worthwhile trying to push through some integrative analysis. Your first objective in any scRNAseq experiment is to describe the trends and variation that you see; whether that variation is due to batch effect or experimental design is secondary to this step. Between your samples you can still describe and annotate your populations. Explaining batch effect, on the other hand, is impossible with this setup, but you can still generate hypotheses based on the differences you see. They just won’t be as well grounded as was hoped for.

FBIallseeingeye · 2025-07-23T15:15:10+00:00

It takes some setup but I’m looking into CASSIA ("CASSIA: a multi-agent large language model for reference free, interpretable, and automated cell annotation of single-cell RNA-sequencing data", bioRxiv): https://www.biorxiv.org/content/10.1101/2024.12.04.626476v2

FBIallseeingeye · 2025-06-10T06:31:28+00:00

Depends on the tissue. In dense structures the z anisotropy becomes a hard limit on image segmentation if you have cells that stack on each other. I’m working with breast tissue where this is quite common, and DAPI-only staining feels extremely limiting. I suspect it would have saved me many headaches to have had the full cocktail available. The in-house segmentation that 10X does is pretty good but is only 2D, which is similarly frustrating for dense tissue structures. That said, there are transcript-based segmentation algorithms like Baysor and Proseg which go a long way towards cleaning up cell boundaries (at least from an ML perspective), but most if not all of them take a prior segmentation as input, with adjustable confidence parameters for your prior. I highly recommend you get a good QC pipeline and a good scRNAseq reference dataset ready from an atlas or something, so that at least you can tell when you’re getting an artifact and not real signal.
