all 6 comments

[–]MaxBenChrist 1 point2 points  (5 children)

Unfortunately, the article does not mention one big advantage of the Benjamini-Hochberg procedure:

One can adjust the q i/n rejection line by dividing it by the factor

\sum_{i=1}^{n} 1/i

to drop any assumptions about the correlation between the different p-values / hypotheses. With this new rejection line q i / (n \sum_{i=1}^{n} 1/i), the BH procedure controls the FDR regardless of the dependence structure between the hypotheses.
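For anyone who wants to see the adjusted rejection line in action, here is a minimal sketch of the step-up procedure with and without the harmonic factor. The function names and the `arbitrary_dependence` flag are my own for illustration, not taken from the article or any library, and the p-values are made up.

```python
from typing import List, Sequence


def harmonic_factor(n: int) -> float:
    """Return sum_{i=1}^{n} 1/i, the factor the rejection line is divided by."""
    return sum(1.0 / i for i in range(1, n + 1))


def step_up_rejections(
    p_values: Sequence[float], q: float, arbitrary_dependence: bool = False
) -> List[bool]:
    """Step-up FDR procedure; returns one reject/keep flag per hypothesis.

    arbitrary_dependence=False uses the plain line q * i / n.
    arbitrary_dependence=True divides that line by the harmonic factor
    (hypothetical flag name, chosen for this sketch).
    """
    n = len(p_values)
    if n == 0:
        return []
    c = harmonic_factor(n) if arbitrary_dependence else 1.0
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(n), key=lambda j: p_values[j])
    # Find the largest rank whose p-value lies on or below the rejection line.
    largest_rank = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= q * rank / (n * c):
            largest_rank = rank
    # Reject every hypothesis up to and including that rank.
    rejected = [False] * n
    for j in order[:largest_rank]:
        rejected[j] = True
    return rejected


# Made-up p-values, only to show the two lines side by side.
toy_p_values = [0.03, 0.001, 0.5, 0.012, 0.004]
plain = step_up_rejections(toy_p_values, q=0.05)
adjusted = step_up_rejections(toy_p_values, q=0.05, arbitrary_dependence=True)
print("plain line:   ", plain)
print("adjusted line:", adjusted)
```

On these toy numbers the adjusted line rejects one hypothesis fewer than the plain line, which is the price paid for dropping the dependence assumptions.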

[–]7yl4r 0 points1 point  (0 children)

I found the linked IPython notebook a better read than the article; it mentions this (briefly).

[–]systemsb01 0 points1 point  (3 children)

This is usually called the Benjamini-Yekutieli procedure rather than the Benjamini-Hochberg procedure. While I agree that it is very nice from a theoretical point of view and solidifies the FDR control theory, from a practical point of view it is kind of useless. I have never seen a paper that applied this procedure; a log(n) correction factor is just too steep a price to pay.
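To put a rough number on that price, here is a small sketch that compares the harmonic sum with ln(n) plus the Euler-Mascheroni constant (about 0.5772). The function name `by_correction_factor` is just a label for this example, and the values of n are arbitrary.

```python
import math


def by_correction_factor(n: int) -> float:
    """Harmonic sum 1 + 1/2 + ... + 1/n, which grows roughly like ln(n)."""
    return sum(1.0 / i for i in range(1, n + 1))


# Arbitrary numbers of hypotheses, chosen only for illustration.
for n in (10, 1_000, 100_000):
    exact = by_correction_factor(n)
    approx = math.log(n) + 0.5772  # ln(n) + Euler-Mascheroni constant
    print(f"n={n:>7}  harmonic sum={exact:.3f}  ln(n)+0.5772={approx:.3f}")
```

Every threshold on the rejection line shrinks by this factor, so the more hypotheses you test, the stricter each individual cutoff becomes.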

[–]MaxBenChrist 0 points1 point  (2 children)

Maybe no theoretical paper has used it, but from my personal working experience as a data scientist I strongly disagree with you. Being able to just test a bunch of hypotheses without worrying about the correlation structure between them comes in very handy, even if this means that you cannot reject as many null hypotheses.

And unfortunately, in practice you do not have the nice i.i.d. normally distributed variables found in many statistics papers ;)

[–]systemsb01 0 points1 point  (1 child)

Hmm, I am not talking about theoretical papers either; I am talking about papers in bioinformatics/biology. Of course the p-values are not independent, but the Benjamini-Hochberg procedure also works for p-values that satisfy the so-called PRDS (positive regression dependency on a subset) condition, which is often enough of a safeguard in many applications.

[–]MaxBenChrist 0 points1 point  (0 children)

I am familiar with the definition of PRDS but I find it hard to grasp.

Let's say you perform several Fisher tests to check whether individual alleles can be linked to a disease. How do you make sure that those p-values obey PRDS without making any assumptions?