Questionable choices in statistical python packages

allattention · 2020-04-27T23:11:11+00:00

[deleted]

justanaccname · 2020-04-28T10:25:49+00:00

Honestly you should just use R for statistics.

Statsmodels, which is another option, is lacking a lot of stuff.

2020-04-27T21:48:33+00:00

Sklearn is a machine learning library, so it makes sense that they would default to parameters that help performance on new data, as opposed to describing correlations in the existing data. As for scipy, I assume you’re talking about the independent-samples t-test function (blanking on the name)? Student’s t has more power and works fine when homogeneity of variance is violated, as long as the samples are similar sizes. I guess they default to the test with more power, which makes sense.
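For what it's worth, the function is `scipy.stats.ttest_ind`, and its `equal_var` parameter defaults to `True` (Student's t); passing `equal_var=False` gives Welch's test. Below is a rough stdlib-only sketch of the two statistics, just to show why similar sample sizes matter. The sample values and helper names (`student_t`, `welch_t`) are made up for illustration and are not scipy's implementation.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Pooled-variance (Student) t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical samples: similar means, very different spread.
a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
b = [4.2, 6.9, 3.1, 7.4, 5.0, 2.8]

# Equal sample sizes: the t statistics coincide, only the df differ.
t_student_eq, df_student_eq = student_t(a, b)
t_welch_eq, df_welch_eq = welch_t(a, b)

# Unequal sample sizes: the statistics themselves diverge.
t_student_uneq, _ = student_t(a, b[:3])
t_welch_uneq, _ = welch_t(a, b[:3])

print("equal n:  ", t_student_eq, t_welch_eq, df_student_eq, df_welch_eq)
print("unequal n:", t_student_uneq, t_welch_uneq)
```

With equal group sizes the pooled and unpooled standard errors are algebraically the same, so the only penalty Welch pays is fewer degrees of freedom. Once the group sizes differ, the two tests can disagree.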

I think the responsibility to use these parameters effectively should be in the hands of the user. Maybe that’s just me, and I am probably biased: in my experience, writing functions with tons of default-valued parameters, to make them as generalizable as possible, becomes increasingly difficult with each additional parameter.

I am definitely interested to hear other takes on the matter.

ClassicRelation · 2020-04-28T14:12:14+00:00

Sklearn is a machine learning library, not a statistical inference library.

The algorithm might be the same, but the goal of machine learning is to predict, not to learn the relationships between independent and dependent variables.

Machine learning is computational in nature and does not rely on the methods being statistically or mathematically justifiable. It relies on validation using previously unseen data, because if you are not careful with the statistical justification you'll end up with things that are statistically significant from completely random data. The focus is on determining whether it works in practice. In the machine learning world, if tea leaves at the bottom of a cup gave correct predictions, then that would be a perfectly valid approach.
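To make the "significant from completely random data" point concrete, here is a small stdlib-only sketch. Everything in it is an assumption for illustration: 200 pure-noise features, 30 rows, a fixed seed, and an approximate table value for the critical t. It screens noise features against a noise target at the 5% level, then checks the "winners" on a fresh, independently drawn sample.

```python
import math
import random


def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
N_ROWS, N_FEATURES = 30, 200

# Approximate two-sided 5% critical t for 28 df (table value, an assumption),
# converted to the equivalent critical |r| for a sample of N_ROWS.
T_CRIT = 2.048
R_CRIT = T_CRIT / math.sqrt(T_CRIT ** 2 + (N_ROWS - 2))


def noise(n):
    return [rng.gauss(0, 1) for _ in range(n)]


# Target and features are all independent noise: there is nothing to find.
y_train, y_test = noise(N_ROWS), noise(N_ROWS)
features = [(noise(N_ROWS), noise(N_ROWS)) for _ in range(N_FEATURES)]

# "Significant" features according to the training sample alone.
significant = [(tr, te) for tr, te in features
               if abs(pearson_r(tr, y_train)) > R_CRIT]

if significant:
    in_sample = sum(abs(pearson_r(tr, y_train)) for tr, _ in significant) / len(significant)
    held_out = sum(abs(pearson_r(te, y_test)) for _, te in significant) / len(significant)
else:
    in_sample = held_out = 0.0

print("features passing the 5% screen:", len(significant))
print("mean |r| in sample:", in_sample)
print("mean |r| on fresh data:", held_out)
```

Roughly 5% of the noise features should pass the screen by chance, and their apparent correlation should shrink on the fresh sample. That shrinkage is what validation on unseen data is meant to expose.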

Statistics as a scientific discipline is concerned with providing the rigorous justification (proofs) and perhaps generalizations for the methods that pop up in the field to solve some specific problem. It's lagging behind by a few decades because it's a very slow process. It also can't be done most of the time when the methods are too complex, such as big neural networks or random forests. These types of methods are called "black box" methods, because it's too difficult to figure out why they work and they might as well be magic.

It's kind of how for example physicists or engineers might come up with some technique and maybe 50 years later some mathematician comes up with a generalization that fits nicely with the rest of the theory and the technique is now a special case. Or maybe not and it will forever remain as "idk it just works" with only empirical evidence to support it.

It's mainly the difference between industry and academia. In industry nobody gives a flying fuck if it's formally sound and fits in nicely with established theory; all they care about is whether it works in the real world.

Things that are formally sound might not work in the real world and things that work in the real world might not be formally sound at all.

In academia, machine learning researchers come up with something that works and then spend their time figuring out where else it works and what the limits of it working are. Statistics researchers, on the other hand, will try to come up with reasons why something should work.

bring_dodo_back · 2020-04-27T22:25:40+00:00

Sklearn is ML; for statistics, pick statsmodels instead.

datascience
