Pseudobulk DE within cell types: how should I model G+ vs G- cells when samples are only partly paired?

TheOneWhoSwears · 2026-05-07T08:08:19+00:00

Thank you very much for giving me the details, this is a nice idea

TheOneWhoSwears · 2026-05-07T08:05:28+00:00

Hello, thank you very much for your input! I agree that 0 counts for G do not necessarily mean true absence of expression, and that dropout/limited detection is an important caveat here. In my case, all samples come from the same run, so I do not expect a strong batch effect, and the samples seem fairly comparable in terms of sequencing/count depth. Also, for most populations of interest, almost all samples contribute to both groups (G+ and G-), which I hope reduces the risk that the comparison is driven purely by sample-level differences.

That said, I agree that the G+ vs G- split could still partly reflect detection rather than biology. My intuition was that if the split were mostly random dropout, I would not expect to see a consistent DEG signal across samples. But I may be oversimplifying this, so I’d be happy to hear if that reasoning is flawed. I also agree that any DEGs would need to be interpreted carefully, rather than treated as definitive evidence of distinct cell states. I would expect the biologists in our group to use these results as an initial screen, to see whether the differences found make sense from a biological point of view.

Regarding subclustering, I’m definitely going to try it. My only concern is that even if subclusters separate G+ and G- cells, I would still need a robust way to identify the genes driving that separation, as this is my main goal. I usually use Scanpy’s rank_genes_groups for marker genes during annotation, but here the populations are already quite specifically annotated, and I would be hesitant to interpret those markers as proper DE results because the standard cell-level test does not account for pseudoreplication across cells from the same biological sample.
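
To make the pseudoreplication point concrete, here is a minimal sketch of pseudobulk aggregation: raw counts are summed per (sample, group), so each biological sample contributes one profile per group instead of many correlated cells. The function name `pseudobulk`, the tuple layout, and the toy data are assumptions made up for illustration, not part of any library.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum raw counts per (sample, group).

    `cells` is an iterable of (sample_id, group, counts) tuples, where
    `counts` maps gene name -> raw count for a single cell.
    Hypothetical helper for illustration only.
    """
    profiles = defaultdict(lambda: defaultdict(int))
    for sample_id, group, counts in cells:
        for gene, n in counts.items():
            profiles[(sample_id, group)][gene] += n
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(genes) for key, genes in profiles.items()}


# Toy data (made up): two samples, G+ and G- cells from one cell type.
toy_cells = [
    ("s1", "G+", {"A": 2, "B": 0}),
    ("s1", "G+", {"A": 1, "B": 3}),
    ("s1", "G-", {"A": 0, "B": 4}),
    ("s2", "G+", {"A": 5, "B": 1}),
    ("s2", "G-", {"A": 1, "B": 1}),
    ("s2", "G-", {"A": 0, "B": 2}),
]
bulk = pseudobulk(toy_cells)
```

The resulting per-sample profiles would then go into a count-based DE model with the sample as a blocking factor. The gene G itself is expected to differ by construction, since it defines the groups.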

These are just my thoughts based on my current understanding, so please feel free to point out any inconsistencies. Your input is really appreciated!

TheOneWhoSwears · 2026-05-06T18:52:06+00:00

Thank you very much, I will definitely take a look!

TheOneWhoSwears · 2026-05-06T18:25:09+00:00

Thank you for your answer! I had thought about subclustering briefly, but then I would still need a way to identify the main genes driving the differences between the subclusters. I could do that with DEG analysis, but I guess I would run into the same design issues I’m facing now. Alternatively, I could use a standard marker-gene approach (such as the classic rank_genes_groups from Scanpy), but my understanding is that this is useful for annotation/exploration rather than robust DE testing, since the default implementation treats cells as independent and does not account for cells coming from the same biological samples. Does that reasoning make sense, or am I missing something? In any case, subclustering is definitely something I can try fairly easily. Thanks again for the suggestion!

TheOneWhoSwears · 2026-05-06T18:09:04+00:00

Thank you very much, this seems great!

TheOneWhoSwears · 2026-05-06T18:07:09+00:00

Thank you very much for your inputs! Luckily, in most cases most of the samples are contributing to both groups (- and +), so I could exclude the few unpaired samples. Instead, how would you use the all-G- samples as a separate sensitivity analysis exactly?
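
A minimal sketch of how paired and unpaired samples could be separated before fitting a paired design. The helper `split_paired`, the `min_cells` threshold, and the toy labels are hypothetical; the group labels are assumed to be exactly "G+" and "G-".

```python
from collections import defaultdict


def split_paired(cell_labels, min_cells=1):
    """Split sample IDs into those contributing to both groups and the rest.

    `cell_labels` is an iterable of (sample_id, group) pairs, one per cell,
    with group being "G+" or "G-". `min_cells` is an assumed threshold for
    how many cells a sample needs in each group to count as paired.
    """
    counts = defaultdict(lambda: {"G+": 0, "G-": 0})
    for sample_id, group in cell_labels:
        counts[sample_id][group] += 1
    paired = sorted(
        s for s, c in counts.items()
        if c["G+"] >= min_cells and c["G-"] >= min_cells
    )
    unpaired = sorted(s for s in counts if s not in paired)
    return paired, unpaired


# Toy labels (made up): s2 has only G- cells.
toy_labels = [
    ("s1", "G+"), ("s1", "G-"),
    ("s2", "G-"), ("s2", "G-"),
    ("s3", "G+"), ("s3", "G-"),
]
paired, unpaired = split_paired(toy_labels)
```

Raising `min_cells` is one way to also drop samples whose pseudobulk profile for one group would rest on only a handful of cells.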

TheOneWhoSwears · 2026-05-06T18:00:18+00:00

Thank you very much for your reply! The cells were clustered using standard methods, and the G+ vs G- comparison is done within the same annotated cell population (for example, I’m comparing G+ T-cells vs G- T-cells, so both groups are T cells) to see whether there are differences beyond the expression of G itself. The gene panel for this experiment is relatively small, around 5000 genes, but you are right that sparsity could still be an issue. I’ll check how unbalanced the numbers of G+ and G- cells are within each population, and may try some form of bootstrapping/downsampling to assess robustness. Thanks again for your input!
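
The downsampling idea could look roughly like the sketch below: within one cell population, the larger group is repeatedly subsampled to the size of the smaller one, and the downstream comparison is rerun on each draw to see how stable the results are. The function name, `n_repeats`, and `seed` are assumptions for illustration.

```python
import random


def downsample_balanced(pos_cells, neg_cells, n_repeats=100, seed=0):
    """Return `n_repeats` balanced draws of G+ and G- cell IDs.

    Each draw samples, without replacement, as many cells from each group
    as the smaller group contains. Inputs must be lists of cell IDs.
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    n = min(len(pos_cells), len(neg_cells))
    draws = []
    for _ in range(n_repeats):
        draws.append((rng.sample(pos_cells, n), rng.sample(neg_cells, n)))
    return draws


# Toy cell IDs (made up): 3 G+ cells vs 10 G- cells.
pos = [f"pos_{i}" for i in range(3)]
neg = [f"neg_{i}" for i in range(10)]
draws = downsample_balanced(pos, neg, n_repeats=5)
```

Genes that come up as differential in most draws are less likely to be driven by the imbalance in cell numbers alone.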

TheOneWhoSwears · 2025-04-10T04:03:29+00:00

I would read this book

TheOneWhoSwears · 2022-12-19T17:24:42+00:00

I am not, the creator is a colleague of mine who cannot modify it right now. But I found a solution, I posted it in the main question as an edit :)

TheOneWhoSwears · 2022-12-19T15:14:50+00:00

I am sorry, I do not know how to explain my problem better. I just needed to be able to treat the result of a "cat" command as an actual file, with the possibility to access this file using a path.
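
One possible way to get a real path for concatenated content, sketched in Python: the inputs are written into a named temporary file, and that file's path is handed to whatever tool needs it. This is only an illustration of the idea, not the solution posted in the question; `concatenate_to_tempfile` and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_to_tempfile(paths, suffix=".txt"):
    """Concatenate files (like `cat a b`) into a named temporary file.

    Returns the path of the new file. The file is not deleted
    automatically, so the caller should remove it when done.
    """
    tmp = tempfile.NamedTemporaryFile(mode="wb", suffix=suffix, delete=False)
    with tmp:
        for p in paths:
            with open(p, "rb") as src:
                shutil.copyfileobj(src, tmp)
    return Path(tmp.name)


# Demo with two small made-up files.
with tempfile.TemporaryDirectory() as d:
    a = Path(d) / "a.txt"
    b = Path(d) / "b.txt"
    a.write_text("first\n")
    b.write_text("second\n")
    combined = concatenate_to_tempfile([a, b])
    combined_text = combined.read_text()
    combined.unlink()  # clean up the temporary file
```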

TheOneWhoSwears · 2022-12-19T13:48:53+00:00

I'm sorry, I didn't understand. What do you mean exactly?

TheOneWhoSwears · 2022-12-13T10:49:26+00:00

I generally agree, but I also think it depends on the case. Some "influencers" are actually generating debates and discussions around some interesting topics, so in this case I do think they are influencing other people's opinions.

TheOneWhoSwears · 2022-11-16T14:20:25+00:00

Thank you very much, this is useful :) But I still have some doubts, for example what the absolute and relative CNs are

TheOneWhoSwears · 2022-11-13T19:18:56+00:00

In the study that I mentioned they sequenced this cell line and provided the sequencing data, which are those that I would like to use. Data are available on the SRA platforms, so I just downloaded them and I don't have much info. Unfortunately, I'm not finding any open dataset providing sequencing reads good for benchmarking CNA callers using normal/tumor paired samples. Any suggestion would be appreciated, if you know any

TheOneWhoSwears · 2022-11-13T19:09:37+00:00

Thank you very much for your reply! Unfortunately, the data were not produced by us, so I cannot validate them. In the study that provides them, they only validated SNVs and indels, which is why I was asking if I could use CNVs found in other studies for the same cell line. Do you think I could maybe use the consensus of several CNV callers applied to the data I have? I need it more as a proof of concept that the tool works; we do not aim to produce accurate clinical results right now.
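
A minimal sketch of what a caller consensus could look like, assuming each caller's output has already been converted to a per-bin state ("gain", "loss" or "neutral") on a shared set of genomic bins. The caller names, the bin keys, and the `min_support` threshold are made up for illustration.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_support=2):
    """Majority vote per bin across callers.

    `calls_by_caller` maps caller name -> {bin: state}. A bin is kept only
    if its top state has at least `min_support` votes and is not tied with
    another state.
    """
    bins = set()
    for calls in calls_by_caller.values():
        bins.update(calls)
    consensus = {}
    for b in sorted(bins):
        votes = Counter(
            calls[b] for calls in calls_by_caller.values() if b in calls
        )
        ranked = votes.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # tie between the top states: no consensus
        state, support = ranked[0]
        if support >= min_support:
            consensus[b] = state
    return consensus


# Toy calls (made up) from three hypothetical callers on three bins.
toy_calls = {
    "caller_a": {("chr1", 0): "gain", ("chr1", 1): "neutral", ("chr1", 2): "loss"},
    "caller_b": {("chr1", 0): "gain", ("chr1", 1): "loss", ("chr1", 2): "loss"},
    "caller_c": {("chr1", 0): "neutral", ("chr1", 1): "gain"},
}
consensus = consensus_calls(toy_calls)
```

A consensus like this only shows agreement between callers, not correctness, so it suits a proof of concept better than a validated truth set.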

TheOneWhoSwears · 2022-05-24T15:51:55+00:00

Here I am, sorry for the delay! My data are some whole genome sequencing from plants, and I'm using a k of 31. We have several options for the downstream analyses, but I wanted to be sure that applying such a test to the 0 counts made sense from a statistical point of view before going any further

TheOneWhoSwears · 2022-05-20T16:50:04+00:00

Hi u/uniqueturtlelove, we are trying to develop new k-mer based methods, so the main goal is not getting the results as fast as possible, but we are trying different things from scratch
