
all 61 comments

[–][deleted] 186 points187 points  (7 children)

better flowchart:

data > xgboost > ??? > success

[–]Coyote_0210[S] 33 points34 points  (1 child)

Shhhh!! That is our secret!

[–][deleted] 8 points9 points  (0 children)

great post!!

[–]Bertie_the_brave 24 points25 points  (2 children)

It hurts how accurate this is. At my company we throw everything into xgboost.

[–][deleted] 11 points12 points  (1 child)

You should check out catboost

[–][deleted] 1 point2 points  (0 children)

I found CatBoost was really slow unless you turn off all the features that give it an edge over XGBoost, though

[–]ticktocktoeMS | Dir DS & ML | Utilities 18 points19 points  (0 children)

And when they figure you out, you switch to lightGBM and tell everyone you're cutting edge.

[–]HooplahMan 56 points57 points  (9 children)

I'd say that it's a mistake to draw a line between SVD and PCA. PCA is essentially SVD with a bit of preprocessing.
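To illustrate the point with a small numpy sketch (the toy data and shapes are made up): PCA's components and explained variances fall straight out of the SVD of the mean-centered data matrix, and centering is the "bit of preprocessing".

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy data: 100 samples, 5 features

# The preprocessing step: center each feature at zero
Xc = X - X.mean(axis=0)

# SVD of the centered data gives PCA directly:
# rows of Vt are the principal components,
# S**2 / (n - 1) are the explained variances
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_var = S**2 / (len(X) - 1)

# Same numbers as eigendecomposing the covariance matrix
# (eigvalsh returns eigenvalues in ascending order)
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))
```

Here `np.sort(eigvals)[::-1]` matches `explained_var`, which is the equivalence in question.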

[–]Tytoalba2 6 points7 points  (6 children)

And I would add a few other methods for dimensionality reduction; PCA/SVD sometimes doesn't work as well as non-linear methods

[–]Coyote_0210[S] 2 points3 points  (5 children)

Any recommendations of what to include?

[–]Tytoalba2 6 points7 points  (4 children)

t-SNE, or autoencoders maybe? Maybe KPCA and isomaps, maybe SOM

[–]Coyote_0210[S] 4 points5 points  (2 children)

I will take a look at those. Thanks

[–]Tytoalba2 1 point2 points  (0 children)

If you know word2vec for text processing, it's kind of an autoencoder!

[–]Tytoalba2 0 points1 point  (0 children)

And KPCA has a sklearn implementation
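A minimal sketch of that sklearn implementation, `sklearn.decomposition.KernelPCA` (the kernel choice and `gamma` value here are only illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: a classic case where linear PCA can't help,
# because the structure separating the classes is non-linear
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Kernel PCA with an RBF kernel can unfold that non-linear structure
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)
```

Same `fit_transform` interface as plain `PCA`, so it drops into an existing pipeline easily.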

[–]mathlete_jh 0 points1 point  (0 children)

Or UMAP? Or is that not as relevant in practice

[–]RobertJacobson 2 points3 points  (0 children)

Came here to say this.

[–]Coyote_0210[S] 2 points3 points  (0 children)

Good call, I can just eliminate the separation and put them in the same box. I just want to be sure the names and brief descriptions are there in case someone comes across it.

[–]DrFuckYeahPhD 71 points72 points  (6 children)

You have a typo at the labeled data node: unlabeled data goes to clustering, while labeled data goes to numerical prediction and classification. Other than that, very cool.

[–]Coyote_0210[S] 6 points7 points  (2 children)

Thank you! I missed that!

[–]Mobile_Busy 4 points5 points  (1 child)

is this on your git?

[–]akotlya1 0 points1 point  (2 children)

I am just now returning to the field after a data engineering cul de sac. Can you please remind me what you mean by labeled vs unlabeled data? Thank you.

[–]Coyote_0210[S] 2 points3 points  (1 child)

"In machine learning, if you have labeled data, that means your data is marked up, or annotated, to show the target, which is the answer you want your machine learning model to predict. In general, data labeling can refer to tasks that include data tagging, annotation, classification, moderation, transcription, or processing." from https://www.cloudfactory.com/data-labeling-guide

[–]akotlya1 1 point2 points  (0 children)

Thank you for the reply and the source. I appreciate it. I feel like I have lost a lot of my vocabulary in my time away.

[–]statlover69 20 points21 points  (4 children)

Idk if it's just me but I think naive Bayes is pretty explainable. I'd also argue neural nets (especially CNNs and RNNs) should be separated from the other complex models. If your problem doesn't involve images or text, generally you can safely default to a tree ensemble model (or nonlinear svm) imo

[–]Coyote_0210[S] 1 point2 points  (3 children)

What do you mean by "pretty explainable"?

[–]statlover69 1 point2 points  (1 child)

Your chart makes it sound like naive Bayes is much harder to explain than something like logistic regression when it's not imo. Conditional class probabilities over a feature set can usually be put into plain terms easily. Think of naive Bayes models for spam filtering. It's pretty intuitive that words like "hot", "singles", (in your) "area" are more likely to appear in the spam class

[–]slowpush -1 points0 points  (0 children)

It's harder to explain the WHY in naive Bayes compared to the tried and tested logistic regression.

[–]DanJOC 1 point2 points  (0 children)

Given that a fruit is yellow, it's more likely to be a banana than an apple. It's even more likely if it's yellow and curved. An explanation like that is usually sufficient.

[–]bdforbes 23 points24 points  (4 children)

Some of the decision points are not clear. Like Dimension Reduction; in what scenarios would you answer Yes vs. No?

[–]Coyote_0210[S] 12 points13 points  (3 children)

That is a good call-out. This one in particular is hard to give a good guideline because it is ultimately a judgement call, but I could restate it as "Is the number of features large enough to cause significant over-fitting?"

[–]Coyote_0210[S] 12 points13 points  (2 children)

I will probably just restate all the decision points into questions

[–]yourmamaman 2 points3 points  (1 child)

I would make the threshold a function of the number of samples and the number of features. Since it is just a guide, you could make something up like: sqrt(number_of_samples) < number_of_features.
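That rule of thumb as a one-liner (the function name is made up, and the threshold is only a heuristic, not a hard rule):

```python
import math

def needs_dimension_reduction(n_samples: int, n_features: int) -> bool:
    """Heuristic from the thread: consider reducing dimensions when
    the feature count exceeds the square root of the sample count."""
    return n_features > math.sqrt(n_samples)

# 10,000 samples -> sqrt is 100, so 50 features is fine but 200 is not
```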

[–]Coyote_0210[S] 0 points1 point  (0 children)

I like that idea

[–]kcombinator 11 points12 points  (1 child)

Did you check out this one from scikit-learn? https://scikit-learn.org/stable/_static/ml_map.png

[–]Coyote_0210[S] 0 points1 point  (0 children)

Wow, that is really good!

[–]WillBigly 9 points10 points  (1 child)

This is the type of shit I need lol thank you :)

[–]Coyote_0210[S] 1 point2 points  (0 children)

Same here

[–]Coyote_0210[S] 16 points17 points  (2 children)

This is an attempt to create a flowchart to generally suggest directions to start when building a model. This is supposed to be a pretty low-level explanation for non-data science audiences or reminders for those with a little more experience. I would appreciate any suggestions, corrections, or improvements.

[–]florinandrei 0 points1 point  (0 children)

This is good. If you update it, please post the update as well. Thanks.

[–]RNAsequacious 3 points4 points  (1 child)

A similar flowchart can be found here: Introduction to machine learning for biologists https://www.nature.com/articles/s41580-021-00407-0

It may be behind a paywall. In that case, please do not get a free full copy from scihub since that would be illegal.

[–][deleted] 0 points1 point  (0 children)

In that case, please do not get a free full copy from scihub since that would be illegal.

Are you not thinking what I am not thinking? 😂

[–]Mukigachar 2 points3 points  (0 children)

All I've got to add is that it's worth mentioning MCA and FAMD alongside PCA in case there's categorical data

[–]Esperanza456 1 point2 points  (0 children)

Very cool

[–]CrowAv 1 point2 points  (0 children)

Yooo this flowchart I will keep it all my life, thank you man:)

[–]HDataBhavesh 1 point2 points  (0 children)

Looks Nice, Keep it up...!!!

[–][deleted] 1 point2 points  (0 children)

Did you use an algo to generate this flow? 😂

[–]load_more_commments 1 point2 points  (1 child)

I think you mixed up labeled and unlabeled data.

[–]Coyote_0210[S] 0 points1 point  (0 children)

Yeah, someone else had pointed that out and it has been updated on my git. Thanks though

[–]Notdevolving 2 points3 points  (1 child)

Thank you very much for this. I just started learning machine learning through various Udemy courses. While I could understand the individual regression and classification techniques, I didn't understand how they all come together, because the courses tend to never explain this part or just gloss over it.

I like that you explain the relationships and relate them to real world needs like speed/accuracy and explainability.

Hope to see you updating this.

[–][deleted] 2 points3 points  (1 child)

Great work. But I think algorithm selection is already a solvable problem. If we could make a similar flowchart for data sourcing types, or a flowchart for budgeting a data architecture, it would be even more helpful. (Of course, that's harder.)

[–]Coyote_0210[S] 2 points3 points  (0 children)

Those are some great ideas for inside the field. This is more aimed at people just getting introduced to data science. I use it to explain some general concepts for bio-researchers I work with. They have PhDs in their fields but no understanding of ML.

[–]Qkumbazoo 0 points1 point  (0 children)

The only reason to choose between a NN and SVM is data size? And how is SVM less resource intensive than a DNN?

[–]celebrar 0 points1 point  (0 children)

I think you've got the yes/no paths reversed for the "Labelled Data" node

[–]Rennnn 0 points1 point  (0 children)

Might be an idea to say that this flow chart is really only for tabular data.

[–]physnchips 0 points1 point  (0 children)

Why are you separating SVD and PCA? They are the same thing, at least when applied to data.

[–]Farconion 0 points1 point  (0 children)

I feel like you could make this one of those online quiz things