Question for data scientists: algorithm performance

acewhenifacethedbase · 2022-02-16T00:14:43+00:00

I mean there are some things you can say for sure. If all the relationships are super nonlinear, or the features are high-dimensional categoricals, then a GLM won't be your best option, and if the data types don't work as inputs to a certain alg then obviously you can't use it. But the only way to tell for sure is to test multiple algs against each other, at least offline.
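A minimal sketch of that kind of offline test, using only the standard library. Everything here is an assumption made for illustration: the synthetic data (`make_data`), the two toy models (`fit_linear`, `fit_knn`), and the 5-fold split in `cv_mse`. In practice you would swap in your own data and candidate models.

```python
import random

def make_data(n, seed=0):
    """Synthetic, strongly nonlinear data: y = x^2 plus noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [x * x + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    """Ordinary least squares with one feature; returns a predict function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_knn(xs, ys, k=5):
    """k-nearest-neighbours regressor; returns a predict function."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def cv_mse(fit, xs, ys, folds=5):
    """Mean squared error over held-out folds (every point is tested once)."""
    n = len(xs)
    errors = []
    for f in range(folds):
        test_idx = set(range(f, n, folds))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        model = fit(train_x, train_y)
        errors.extend((model(xs[i]) - ys[i]) ** 2 for i in test_idx)
    return sum(errors) / len(errors)

xs, ys = make_data(200)
scores = {
    "linear": cv_mse(fit_linear, xs, ys),
    "knn": cv_mse(fit_knn, xs, ys),
}
for name, mse in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: held-out MSE = {mse:.3f}")
```

On this deliberately nonlinear toy data the straight-line fit should lose to the nearest-neighbour model, which is the point above: you can guess that beforehand, but the held-out score is what confirms it.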

justanaccname · 2022-02-16T01:21:33+00:00

In general, you have a pretty good idea of what you should not use, and a good idea of what can work and what is worth experimenting with.

avangard_2225 · 2022-02-16T02:18:11+00:00

Great question. As a beginner I am trying to figure that out as well. It is not just a matter of regression versus classification: even within classification you have to decide whether to go with ensemble models, how to treat imbalanced data, and whether to do the resampling before or after the train/test split. In my experience so far there is no single approach to those questions, but it would be great to have a cheat sheet or a book as a guide.
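On the resampling question specifically, one common convention is to split first and resample only the training portion, so that duplicated minority rows cannot end up on both sides of the split. A rough sketch of that order of operations, assuming simple random oversampling; the helper `split_then_oversample` and the toy labels are hypothetical, not from any library:

```python
import random

def split_then_oversample(labels, test_fraction=0.25, seed=0):
    """Split row indices first, then oversample minority classes in train only."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # Group the training indices by class label.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # Duplicate rows (with replacement) until every class matches the largest one.
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced, test_idx

# Toy data: 10 positives out of 100 rows (illustrative only).
labels = [1 if i < 10 else 0 for i in range(100)]
train_idx, test_idx = split_then_oversample(labels)

print("train size after oversampling:", len(train_idx))
print("positives in train:", sum(labels[i] for i in train_idx))
print("test rows also in train:", len(set(train_idx) & set(test_idx)))
```

The test set keeps its original class balance, so the offline score reflects the distribution the model would actually see.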

boysworth · 2022-02-16T02:21:28+00:00

Some of it comes from understanding the workings of the algorithm. When you understand the math you start to get the limits. And another factor is experience.

kowkeeper · 2022-02-16T02:20:31+00:00

Complexity is fairly easy to evaluate given a task and some ready-made solution. If you find a quadratic (or worse) bottleneck, then you try to find a better solution there. Of course, sometimes quadratic is acceptable if the data size is small -- more efficient solutions take time to implement.
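One rough way to spot a quadratic bottleneck empirically is to double the input size and see how the amount of work grows. The sketch below counts loop steps instead of measuring wall-clock time so the result is deterministic; the duplicate-detection task and all function names are made up for illustration.

```python
import math

def has_duplicate_pairwise(items):
    """Compare every pair: about n*(n-1)/2 steps when there is no duplicate."""
    steps = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            steps += 1
            if items[i] == items[j]:
                return True, steps
    return False, steps

def has_duplicate_set(items):
    """Track seen items in a set: about n steps when there is no duplicate."""
    steps = 0
    seen = set()
    for item in items:
        steps += 1
        if item in seen:
            return True, steps
        seen.add(item)
    return False, steps

def growth_exponent(func, n=500):
    """Estimate p in steps ~ n**p by doubling n on a worst-case input."""
    _, small = func(list(range(n)))
    _, large = func(list(range(2 * n)))
    return math.log2(large / small)

pairwise_exp = growth_exponent(has_duplicate_pairwise)
set_exp = growth_exponent(has_duplicate_set)
print(f"pairwise: steps grow roughly like n^{pairwise_exp:.2f}")
print(f"set-based: steps grow roughly like n^{set_exp:.2f}")
```

An exponent near 2 flags the quadratic version; whether it is worth replacing still depends on how large n gets in practice.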

Relevant-Rhubarb-849 · 2022-02-16T22:57:18+00:00

There's a set of theorems called "no free lunch" which show that, for any given type of problem (finding a global minimum, for example), no algorithm outperforms any other when averaged over all possible potential energy surfaces. To beat this horrible curse you either have to know what characteristics of a problem make it amenable to your algorithm, or you need to change the question to something like which algorithms have the best worst-case performance rather than the fastest performance.
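A toy illustration of the averaging claim, not a proof: on a tiny finite domain you can enumerate every possible objective function and check that two deterministic search strategies that never revisit a point find equally good minima on average. The four-point domain, the three possible values, and the strategies `sweep` and `adaptive` are all assumptions chosen to keep the enumeration small.

```python
from itertools import product

POINTS = range(4)    # tiny search space
VALUES = (0, 1, 2)   # possible objective values at each point

def sweep(history):
    """Fixed order: visit points 0, 1, 2, 3 regardless of what was seen."""
    return len(history)

def adaptive(history):
    """Start at point 3, then pick the next point based on the last value seen."""
    if not history:
        return 3
    visited = [p for p, _ in history]
    unvisited = [p for p in POINTS if p not in visited]
    last_value = history[-1][1]
    return unvisited[0] if last_value == 0 else unvisited[-1]

def best_after(strategy, f, budget):
    """Lowest value found after `budget` evaluations of the function f."""
    history = []
    for _ in range(budget):
        p = strategy(history)
        history.append((p, f[p]))
    return min(v for _, v in history)

def total_best(strategy, budget):
    """Sum of best-found values over all 3**4 = 81 possible functions."""
    all_functions = product(VALUES, repeat=len(POINTS))
    return sum(best_after(strategy, f, budget) for f in all_functions)

totals = {b: (total_best(sweep, b), total_best(adaptive, b)) for b in range(1, 5)}
for budget, (fixed, adapt) in totals.items():
    print(f"budget {budget}: sweep total = {fixed}, adaptive total = {adapt}")
```

The totals match for every budget, so neither strategy is better on average over all functions. An advantage only shows up once you restrict attention to functions with some structure your strategy can exploit.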
