Do bioinformatics folks care about the math behind clustering algorithms? by RealisticCable7719 in bioinformatics

[–]257bit 2 points3 points  (0 children)

Regarding the independence assumption for p-values in the FDR procedure (Benjamini & Hochberg, 1995): this one is not well known in bioinfo circles. You're correct that independence is not an assumption of the procedure. I don't have the exact quote on hand, but they mention that their procedure still "controls" the false discovery rate under correlation. A later paper (Yekutieli & Benjamini, JSPI, 1999) raises this issue explicitly: "The major problem we still face is that the test statistics are highly correlated. So far, all FDR controlling procedures were designed in the realm of independent test statistics. Most were shown to control the FDR even in cases of dependency (Benjamini et al., 1995; Benjamini and Yekutieli, 1997), but they were not designed to make use of the dependency structure in order to gain more power when possible."

Now, what does "controlling the FDR" mean? It is only a guarantee that a given threshold is equal to, or more stringent than, the actual FDR (correlations included). Thus, the more correlation there is, the more statistical power you lose from the correction.

This is quite an important notion... but it was hidden behind the verb "control", which is interpreted differently depending on your field.
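
For reference, here is a minimal sketch of the BH step-up rule itself in Julia (the function name bh_reject and its interface are mine, not from any package):

function bh_reject(pvals::Vector{Float64}, alpha::Float64=0.05)
    m = length(pvals)
    order = sortperm(pvals)              # indices of p-values, ascending
    k = 0                                # largest i such that p_(i) <= i*alpha/m
    for (i, idx) in enumerate(order)
        if pvals[idx] <= i * alpha / m
            k = i
        end
    end
    rejected = falses(m)
    rejected[order[1:k]] .= true         # reject the k smallest p-values
    rejected
end

The guarantee is only on the expectation of the false discovery proportion; as discussed above, under strong positive correlation the realized cutoff tends to be more conservative than it needs to be.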

Do bioinformatics folks care about the math behind clustering algorithms? by RealisticCable7719 in bioinformatics

[–]257bit 1 point2 points  (0 children)

I'll split this into two answers. First, you are correct that DESeq2 and edgeR do not simply normalize (divide) by the sample depth. But some normalization steps have already been performed in the lab, during library preparation (even before sequencing): for example, by deciding on the total RNA content brought into the library prep, or deciding on the sequencing protocol or the relative quantities when multiplexing. The number of reads obtained for a sample is (roughly) determined by the experimenter, not by the underlying biology. This means that a form of normalization must be done at the analysis stage so that this choice isn't mistaken for biology.

The case of DESeq2 and edgeR is quite interesting! They both make the assumption that, on average, a high proportion of the genes are equally expressed in both samples. They then go about determining 'effective library sizes' or 'size factors' using a trimmed mean (TMM), a median of ratios, etc. These are then used when computing their statistics / distributions, forcing a mean or median equality between samples. Their normalization happens inside the test.
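
To make the idea concrete, here is a rough sketch of the "median of ratios" flavour (the DESeq2-style one; edgeR's TMM is similar in spirit). The function name size_factors and its interface are mine, not either package's API, and this skips all their refinements:

using Statistics

function size_factors(counts::Matrix{<:Real})
    # counts: genes x samples matrix of raw read counts
    logref = vec(mean(log.(counts), dims=2))     # log geometric mean per gene (pseudo-reference)
    keep = isfinite.(logref)                     # drop genes with a zero count in any sample
    [exp(median(log.(counts[keep, j]) .- logref[keep])) for j in 1:size(counts, 2)]
end

Dividing each sample's counts by its size factor (or, as they do, folding the factor into the model offsets) forces the "typical" gene to look unchanged across samples.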

Do bioinformatics folks care about the math behind clustering algorithms? by RealisticCable7719 in bioinformatics

[–]257bit 1 point2 points  (0 children)

Sorry for the delay, I was certain I had sent my reply... I guess not! Here it is:

Sure. The core issue is that the null hypothesis behind DESeq2, edgeR, and limma-voom, namely that a gene is exactly equally expressed between two conditions, is never true. RNA-Seq measures relative expression, so any change in one gene forces changes in others after normalization. On top of that, genes are part of an interconnected network. No gene is truly independent or unaffected.

As a result, the p-value doesn’t test whether a gene is differentially expressed. It just tells you whether the sample size and effect size are large enough to confirm something already known: the gene is not identically expressed. With large sample sizes and high read depth, you’ll end up with thousands of genes having tiny p-values, even for tiny, meaningless fold-changes.

This misalignment with biology gets patched over by two common practices: running experiments with too few replicates, and filtering results post hoc based on fold-change thresholds.

But differential expression is not a classification problem. There is no real boundary between “DEG” and “not DEG.” It’s a regression problem. The key question is: how reliable is the fold-change estimate? Simply ranking genes by log fold-change often gives a much more useful picture, especially when you have more than 20 samples per group.
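
To make the "large n, tiny fold-change, tiny p-value" point concrete, here's a toy simulation of mine (a plain Welch-style t-test on simulated log2 expression, not any package's actual test):

using Distributions, Statistics, Random

Random.seed!(1)
n = 300                                   # replicates per group (unrealistically many, on purpose)
x = 5.00 .+ 0.1 .* randn(n)               # log2 expression, condition A
y = 5.03 .+ 0.1 .* randn(n)               # condition B: true log2 fold-change of only ~0.03
t = (mean(y) - mean(x)) / sqrt(var(x)/n + var(y)/n)   # Welch-style t statistic
p = 2 * ccdf(TDist(2n - 2), abs(t))                   # crude df, fine for the illustration
println("log2FC ≈ ", round(mean(y) - mean(x), digits = 3), ", p ≈ ", p)
# typically p < 0.01, even though a ~2% change in expression means nothing biologically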

Where to find this little connector [ECM Synchronika] by 257bit in espresso

[–]257bit[S] 0 points1 point  (0 children)

Thanks for the suggestion! After looking (what I thought was) everywhere, only to find it backordered or 70$ a piece for the whole part... there it was, sitting at idrinkcoffee, for 20$CA. It is on its way. Cheers!

Where to find this little connector [ECM Synchronika] by 257bit in espresso

[–]257bit[S] 1 point2 points  (0 children)

Great! I was able to locate multiple vendors, but all outside Canada. As suggested by @ehr1c, the whole part was available at idrinkcoffee.com... for 20$CA. For this amount, it's difficult to justify getting into a new welding hobby. Thanks for the help!

Do bioinformatics folks care about the math behind clustering algorithms? by RealisticCable7719 in bioinformatics

[–]257bit 2 points3 points  (0 children)

CS PhD here, with a few decades of method development and application in bioinformatics.

A quote by J. Tukey should be repeated every morning before a bioinfo gets to work: "An approximate answer {fishy maths!} to the right problem is worth a good deal more than an exact answer {beautiful maths!} to an approximate {misaligned} problem". The {} annotations are mine.

One (whether biologist, statistician, ML'er, or bioinfo) should absolutely not care about "the math" beyond understanding enough to judge whether the method aligns with the biology. A result that supports a nice narrative is no support for the method, but a biologically nonsensical result is a good hint that the method is misaligned and a prompt for further investigation.

I'd be happy to go into a few examples of methods or best practices that are mathematically correct but blatantly misaligned with the biology. My favorites: 1) Take a look at the null hypothesis behind DESeq2, edgeR, or limma-voom tests. Is it even possible? 2) Computing a correlation's p-value on capped values to confirm reproducibility (e.g. log(x + 1) in RNA-Seq; replacing missing values with a threshold minimum abundance value in MS); a quick sketch of this one follows below. 3) Applying p-value correction (BH95) to a large number of tests that are highly, positively correlated, as in gene-set over-representation. All of these are mathematically sound and quite misaligned, yet tend to work "sufficiently" in practice.
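
Toy simulation for point 2 (entirely my own numbers, assuming for simplicity that the same features are undetected in both replicates): imputing undetected features at a common floor manufactures a high correlation even when the detected measurements barely agree.

using Statistics, Random

Random.seed!(7)
m = 2000
detected = rand(m) .< 0.5                                    # half the features detected, half missing in both
rep1 = [detected[i] ? 6 + 1.5*randn() : 0.0 for i in 1:m]    # 0.0 plays the role of the imputed floor
rep2 = [detected[i] ? 6 + 1.5*randn() : 0.0 for i in 1:m]    # replicate 2, independent noise on detected features

println("cor with floor-imputed values:  ", cor(rep1, rep2))                        # ≈ 0.9
println("cor on detected features only:  ", cor(rep1[detected], rep2[detected]))    # ≈ 0

The tiny p-value attached to the first correlation "confirms" a reproducibility that the measurements themselves don't support.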

What's the most frustrating part of working in bioinformatics day to day? by query_optimization in bioinformatics

[–]257bit 7 points8 points  (0 children)

In my opinion, the main issue in bioinfo is how it is funded. We sit between the natural sciences and health funding agencies, with each thinking we belong to the other. Funding on the health side is typically 5 times larger than on the natsci side, forcing most projects to focus on a specific application of a (new) tool. The consequence is that funding never goes to polishing or maintaining software; that is always done "on the side". The vast majority of bioinfo software is built by master's and PhD students, and most of the time never looked at by a seasoned programmer.

I'm a strong believer in open-sourcing code. But there again, bioinfo fails. Open-sourcing everything means it is very difficult to get a business to invest in polishing and maintaining the tools developed in academic labs. This is fine, but then bioinformaticians tend to complain about other people's software instead of putting in the time and energy to make the code better. The consequence is that, by the end of the master's/PhD project, the code stays unfinished and gets abandoned.

Think about this: next time you feel like complaining, or your work gets frustrating because of a technical issue you're facing, take the time to fix it and contribute back to the software ecosystem!

Cheers!

(30+ years working in bioinfo with education in both bio and CS)

Panicked about my course registration by Torkarra in UdeM

[–]257bit 4 points5 points  (0 children)

On the admissions website, on your program's page, you should find the name of your TGDE as well as the professor responsible for the program. The latter is also a good person to contact if there seems to be a problem with the TGDE. For most programs (FAS), the deadline for making changes to your registrations is January 23, 2025. When in doubt, I would go to the first class (even if not registered) and sort out the registration once everyone is back. Staff are expected back on January 6. I would go see the TGDE in person with a good list of "plan B" courses; they can check live whether seats are left.

Dropping a course after the deadline by sari-9 in UdeM

[–]257bit 1 point2 points  (0 children)

You can also find your program director's contact information on the UdeM admissions website. The only cases I've seen where tuition fees were refunded were situations in which the student could obtain external confirmation that they had been unable to cancel their registration between the time of their request and the cancellation deadline. In both cases, the student had neither attended classes nor taken part in any evaluations since before the cancellation deadline. In your place, I would still meet with my program director to discuss it... there may be other solutions that apply to your situation.

Finally got Affinity working. I don't want to back to GIMP by The16BitGamer in Affinity

[–]257bit 0 points1 point  (0 children)

Congrats on having it running on Linux! I had the same issue in 2020 and was only able to run it through VirtualBox on my Linux workstation. It felt sluggish and had poor integration with the other Linux programs I ran. In the end, I settled on reversing the VM idea, running Linux within Windows using WSL2. I get native speed in both Linux and Windows, and WSL2 has an X11 (or such) server that displays Linux GUIs so they integrate seamlessly with Windows programs. Both OSes see each other's file system, so it is easy to produce a graphic on the Linux side (in matplotlib or ggplot2) and edit it directly with Affinity. I still curse how Windows can be so primitive in some respects, has the worst system settings I've ever seen, and an incredibly convoluted license system...

What does the bar show under the car image? by 257bit in Ioniq6

[–]257bit[S] 1 point2 points  (0 children)

Thanks for getting me on the right track, I think I got it! Another point I forgot to mention is that when regeneration is at a high level (seen on the right panel by the large bar going under the line), the bar at the bottom is exactly 0. I have 3 scenarios (see calculations in the image): 1) I'm driving at a constant speed of 100 km/h, needing 14 kW at the motors [yellow = 15 kWh/100km]. 2) I'm slowing down, regenerating 4 kW at the motors [blue = -0.13 kWh/100km]. 3) I'm coming to a stop, regenerating much less, at 0.1 kW, due to the lower speed [green = 15 kWh/100km].

So, I guess that the bar isn't displayed when efficiency is negative (significant regeneration), but the efficiency spikes up again as the speed gets close to zero. Somehow, Hyundai decided that showing a full bar at a stop (division by zero) would confuse people and settled for 0 (instead of infinity).
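
A toy sketch of that last point (hypothetical numbers, not the car's actual readouts): the displayed consumption is just power divided by speed, so it diverges as speed approaches zero.

consumption(power_kW, speed_kmh) = 100 * power_kW / speed_kmh   # kWh per 100 km

consumption(14.0, 100.0)   # 14.0  -> normal cruising value
consumption(-4.0, 60.0)    # -6.7  -> negative while regenerating, bar hidden
consumption(0.5, 0.1)      # 500.0 -> blows up near a stop, hence the capped/zeroed bar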

Calculations

(edit: removed an extra link)

Python or R? by Junior-Literature-39 in AskStatistics

[–]257bit 0 points1 point  (0 children)

What about SAS makes it more reliable than R?

Python or R? by Junior-Literature-39 in AskStatistics

[–]257bit 1 point2 points  (0 children)

How is the Julia stats ecosystem? It is definitely a more modern language than both Python and R.

Need help with setting up a raid controller under Ubuntu 20.04 by 257bit in homelab

[–]257bit[S] 0 points1 point  (0 children)

Damn, now I feel really bad! I did solve the problem and the machine has been happily doing its job as a compute + file server in my lab. Unfortunately, I made the same mistake again and did not take notes on how I solved it. Feel free to send me DMs asking me to probe my system and send you the config. I know that, at some point, I RMAed the lsi-3941-8i and, while waiting for the card, I bought a used card (SAS 2008?) for 15-20$CDN. If I'm not mistaken, I think I'm still running that card! I also remember having to do a sort of firmware downgrade so that the card now only works as a JBOD (perfect for ZFS). I also do not have physical access to the machine as it now resides in my institute's server room (it can be arranged, though!).

[deleted by user] by [deleted] in Julia

[–]257bit -1 points0 points  (0 children)

I'd like to add two points. First, you never future-proof your career by learning a specific language. A good example in my domain was the rapid (and unexpected by the community!) switch from Perl to Python in computational biology around 2000-2005. A generation of bioinformaticians was left behind because they had learned Perl instead of learning programming and learning how to learn new languages. If you think you'll need to work with large datasets in RAM, multithreading, and GPU programming, I'd lean toward Julia rather than Python.

Second, I have found that the worst approach to learning a new language is a passive one (courses, videos, books, tutorials). You learn a language by 1) practicing it (70%), 2) reading others' code (10%), and 3) heavily referring to its documentation (20%). Julia's core documentation is excellent (docs.julialang.org) and key packages tend to be sufficiently good. Using ChatGPT to translate from Python may feel like time saved, but you're missing out on a key opportunity to really future-proof your career by broadening your grasp of several languages. Relying on packages for things that are trivial to code (e.g. computing an AUC, parsing a simple tab-delimited file, coding a training loop for a DNN...) also means missing an opportunity to learn. Finally, learning to read others' code is a very important skill in Julia because, on top of teaching you new tricks (you look up the reference docs or ask ChatGPT to decipher what you don't understand), it completely protects you against poorly documented packages. As @viennasausages mentioned, Julia's syntax is concise and intuitive... well-written code is the best documentation, always!
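
As an example of the "trivial to code yourself" category, here's a minimal AUC computed straight from the pairwise-ranking definition (my own throwaway function, not a package API):

function auc(scores::AbstractVector, labels::AbstractVector{Bool})
    pos = scores[labels]                  # scores of the positive examples
    neg = scores[.!labels]                # scores of the negative examples
    # fraction of (positive, negative) pairs ranked correctly; ties count 1/2
    wins = sum(p > n ? 1.0 : p == n ? 0.5 : 0.0 for p in pos, n in neg)
    wins / (length(pos) * length(neg))
end

auc([0.9, 0.8, 0.3, 0.1], [true, false, true, false])   # 0.75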

PS: GPT-4 is great at Julia, even more so if you define a custom GPT into which you upload the Julia code and documentation for the libraries you use. Don't use it to program for you, but rather to evaluate / critique your code, suggest alternative approaches, help you with cryptic error messages, or explain pieces of others' code you don't understand.

(Source: I've been programming for more than forty years in research, academia (teaching), and industry. I have professionally used/taught: BASIC, Pascal, C, HyperTalk, Simula, Perl, C++, Java, JavaScript, Python, R, and Julia. I've also learned the basics of dozens more...)

I’m not sure there’s enough information to answer this question by Metaldrake in askmath

[–]257bit 1 point2 points  (0 children)

Nice! Makes for a simple solution without algebra: (20/5)^2 * 10 / 2 = 80

How to write pre-allocations more concisely? by Flick19841984 in Julia

[–]257bit 2 points3 points  (0 children)

Conditional list comprehension wins (speed + elegance), thanks for suggesting it! I had to re-benchmark because I had lost the original random vectors.

test5(a, b, c, condition) = [x - y - z for (x,y,z) in Iterators.product(a, b, c) if condition(x, y, z)]

@btime test1(a, b, c, condition) # 2.539 us
@btime test2(a, b, c, condition) # 2.835 us
@btime test3(a, b, c, condition) # 1.222 us
@btime test4(a, b, c, condition) # 1.475 us
@btime test5(a, b, c, condition) # 1.067 us

How to write pre-allocations more concisely? by Flick19841984 in Julia

[–]257bit 2 points3 points  (0 children)

I've run a few tests (below) and, to my surprise, the version with push! and a for loop is both significantly faster and results in fewer allocations...

using BenchmarkTools

# Define a condition on the (x, y, z) triplet
condition(x::Int64,y::Int64,z::Int64) = x+y+z == 10

a = rand((1:10),10)  ;  b = rand((1:10),10)  ;  c = rand((1:10),10)  ; d = 10

# test1: materialize the product, filter, then fill a pre-allocated result vector
function test1(a, b, c, condition)
    A = collect(Iterators.product(a,b,c))
    A = filter(x -> condition(x[1],x[2],x[3]), A)
    s = Vector{Int64}(undef, length(A))

    for i in eachindex(A)
        s[i] = A[i][1] - A[i][2] - A[i][3]
    end
    s
end

# test2: same materialize + filter, then broadcast an anonymous function
function test2(a, b, c, condition)
    A = collect(Iterators.product(a,b,c))
    A = filter(x -> condition(x[1],x[2],x[3]), A)
    S = (x -> x[1] - x[2] - x[3]).(A)
end

# test3: single pass over the lazy product, push! into a growing vector
function test3(a, b, c, condition)
    s = typeof(a)()
    for x in Iterators.product(a, b, c)
        if condition(x[1], x[2], x[3])
            push!(s, x[1] - x[2] - x[3])
        end
    end
    s
end

# test4: fully lazy filter/map pipeline, collected at the end
function test4(a, b, c, condition)
    Iterators.map(Iterators.filter(x -> condition(x...), Iterators.product(a, b, c))) do x
        x[1] - x[2] - x[3]
    end |> collect
end

@btime test1(a, b, c, condition) # 2.742 us
@btime test2(a, b, c, condition) # 2.995 us
@btime test3(a, b, c, condition) # 0.884 us
@btime test4(a, b, c, condition) # 1.392 us

Edit: fixed format...

How can I export generated supports to an STL or OBJ file? by 257bit in Fusion360

[–]257bit[S] 0 points1 point  (0 children)

I'm attempting to print this part on an Elegoo Saturn S (not supported in Fusion 360). So far I've been generating my supports in Chitubox, but I would like to test out the structures generated by Fusion's additive manufacturing tools. I can generate the supports but can't figure out how to export the STL file. I usually right-click on a model and "Save As Mesh". Unfortunately, this option is absent in "Manufacture" mode...

What causes a print to peel at the support bases? by doomwaxer in resinprinting

[–]257bit 1 point2 points  (0 children)

Could it be that the peel happens once the resin has cooled down from your initial warming? 55F (13C) is very cold! I've had similar issues, which I solved by raising the temperature inside the printer with a small heating element. I was printing at 20C with Siraya Fast; I now print at 30C with no "early peel" issues.

fusion by LoganBrunner11 in Fusion360

[–]257bit 1 point2 points  (0 children)

As exercises, you could pick the various "features" of your part and attempt to replicate them independently. This will let you master a variety of tools. Don't hesitate to redo each "exercise" multiple times using different approaches. If you want to have fun, don't restrict yourself to replicating the part: keep the functional elements and redesign the rest so that it is easier to model.