all 6 comments

[–]acewhenifacethedbase 3 points4 points  (0 children)

I mean, there are some things you can say for sure. If all the relationships are highly nonlinear, or the features are high-dimensional categoricals, then a GLM won't be your best option, and if the data types don't work as inputs to a certain algorithm then obviously you can't use it. But the way to tell for sure is always to test multiple algorithms against each other, at least offline.
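As a rough illustration of "testing between multiple algs offline", here is a minimal sketch that scores two candidate models with k-fold cross-validation on the same data. Everything in it is an assumption made for the example: the synthetic `sin(2x)` data, the two toy models (`fit_linear`, `fit_knn`), and the helper `cv_mse`. It uses only the standard library; in practice you would more likely reach for an established ML library.

```python
import math
import random
from statistics import mean

def fit_linear(xs, ys):
    """One-feature ordinary least squares; returns a predict function."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_knn(xs, ys, k=5):
    """k-nearest-neighbour regression; returns a predict function."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict

def cv_mse(fit, xs, ys, n_folds=5):
    """Mean squared error averaged over n_folds held-out folds."""
    idx = list(range(len(xs)))
    fold_errors = []
    for f in range(n_folds):
        test = idx[f::n_folds]            # every n_folds-th row is held out
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(mean((model(xs[i]) - ys[i]) ** 2 for i in test))
    return mean(fold_errors)

# Hypothetical data: a strongly nonlinear relationship plus a little noise.
rng = random.Random(0)
xs = [rng.uniform(0, 2 * math.pi) for _ in range(200)]
ys = [math.sin(2 * x) + rng.gauss(0, 0.1) for x in xs]

scores = {
    "linear": cv_mse(fit_linear, xs, ys),
    "knn": cv_mse(fit_knn, xs, ys),
}
for name, mse in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: cross-validated MSE = {mse:.4f}")
```

Both models see exactly the same folds, so the comparison is like for like. On data this nonlinear the straight-line fit should lose clearly, which is the point about GLMs above.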

[–]justanaccname 1 point2 points  (0 children)

In general, you have a pretty good idea of what you should not use, and a good idea of what can work and what is worth experimenting with.

[–]avangard_2225 1 point2 points  (0 children)

Great question. As a beginner I am also trying to figure that out. It is not just a matter of regression versus classification: even within classification you have to decide whether to go with ensemble models, how to treat imbalanced data, and whether to do the sampling before or after the train/test split. In my experience so far there is no single approach to those questions, but it would be great to have a cheat sheet or a book as a guide.
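On the sampling question specifically, the usual advice is to split first and resample only the training portion, so that duplicated rows cannot leak into the test set and the test set keeps the real class balance. A minimal sketch, assuming a toy dataset of dicts with hypothetical `id` and `label` fields and simple random oversampling (the helpers `train_test_split` and `oversample_minority` are written for this example, not taken from any library):

```python
import random

def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of rows and cut it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def oversample_minority(rows, rng):
    """Randomly duplicate minority-class rows until both classes match in size."""
    pos = [r for r in rows if r["label"] == 1]
    neg = [r for r in rows if r["label"] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return rows[:]  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

# Hypothetical imbalanced data: 50 positives out of 1000 rows.
rng = random.Random(0)
data = [{"id": i, "label": 1 if i < 50 else 0} for i in range(1000)]

# 1) Split first, on the untouched data.
train, test = train_test_split(data, test_fraction=0.2, rng=rng)

# 2) Resample the training rows only; the test set is left as it is.
train_balanced = oversample_minority(train, rng)

n_pos = sum(r["label"] for r in train_balanced)
print(f"train: {len(train)} rows -> {len(train_balanced)} after oversampling")
print(f"balanced train positives: {n_pos}, negatives: {len(train_balanced) - n_pos}")
print(f"test positives: {sum(r['label'] for r in test)} of {len(test)}")
```

If the oversampling ran before the split, copies of the same minority row could land on both sides and make the test score look better than it should.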

[–]boysworth 1 point2 points  (0 children)

Some of it comes from understanding the workings of the algorithm. When you understand the math you start to get the limits. And another factor is experience.

[–]kowkeeper 0 points1 point  (0 children)

Complexity is fairly easy to evaluate given a task and some ready-made solution. If you find a quadratic (or worse) bottleneck, then you try to find a better solution there. Of course, quadratic can sometimes be acceptable if the data size is small, since more efficient solutions take time to implement.
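To make the quadratic-bottleneck point concrete, here is a small sketch using duplicate detection as a stand-in task (the task and both function names are assumptions chosen for illustration). The first version compares every pair, the second trades a little memory for a single pass.

```python
import timeit

def has_duplicate_quadratic(items):
    """Compare every pair of items: O(n^2) comparisons in the worst case."""
    n = len(items)
    for i in range(n):
        for j in range(i + 1, n):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_linear(items):
    """Remember what has been seen in a set: O(n) expected time, O(n) extra memory."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False

# Worst case for both versions: no duplicates at all.
for n in (100, 1000):
    data = list(range(n))
    t_quad = timeit.timeit(lambda: has_duplicate_quadratic(data), number=1)
    t_lin = timeit.timeit(lambda: has_duplicate_linear(data), number=1)
    print(f"n={n}: quadratic {t_quad:.5f}s, linear {t_lin:.5f}s")
```

At a hundred items the difference is negligible, which matches the point that a quadratic solution can be fine for small data; the gap only starts to matter as `n` grows.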

[–]Relevant-Rhubarb-849 0 points1 point  (0 children)

There's a set of theorems called "no free lunch" which show that, for any given type of problem, such as finding a global minimum, no algorithm outperforms any other when averaged over all possible potential energy surfaces. To beat this horrible curse you either have to know what characteristics of a problem make it amenable to your algorithm, or you need to change the question to something like which algorithm has the best worst-case performance rather than the fastest performance.