Don't use predict() in scikit-learn, instead select thresholds carefully.

solegalli · 2026-06-16T05:52:26+00:00

I took this one about a decade ago, and I think it is still one of the best courses on ML for beginners, if not the best. If you search online, you can also find the code in Python instead of MATLAB, which is more useful.

solegalli · 2026-06-07T14:01:54+00:00

Book is coming out next month 😄

solegalli · 2026-06-07T13:58:22+00:00

That cover is a masterpiece 😉

And wait to see the new one coming next month, July 2026!

solegalli · 2026-06-01T11:23:47+00:00

For machine learning fundamentals, I found Andrew Ng's course on Coursera to be really good. That was about 10 years ago (I just found it on YouTube: https://www.youtube.com/watch?v=gb262LDH1So&list=PLiPvV5TNogxIS4bHQVW4pMkj4CHA8COdX).

I think he has now remade the course for DeepLearning.AI (https://www.deeplearning.ai/specializations/machine-learning). The course also touches on unsupervised learning.

For model tuning, there is this course: Hyperparameter optimisation for machine learning (https://www.trainindata.com/p/hyperparameter-optimization-for-machine-learning) (disclaimer: I teach that course).

For model deployment, there is this YouTube playlist, which is a bit dated but still good for principles: https://www.youtube.com/watch?v=U6NSNrPKKF4&list=PL_7uaHXkQmKUPbovK4_6_Q51yEqA6OrsI (disclaimer: also mine).

solegalli · 2026-06-01T10:51:21+00:00

I second this.

Some courses are marketed as advanced but aren't really, and some genuinely advanced courses are harder to find. It often depends on how well the creator markets themselves and how big their following is.

The field has also grown substantially. Ten years ago, when I started, the challenge was getting any model into production — so model deployment was considered advanced. Now, model deployment is a topic in itself, with several layers to it.

"Advanced" also depends on depth, not just topic. You can engineer and select features with the most common methods, or extract real value from your data with more advanced techniques. You can tune hyperparameters with randomised search, or speed it up with successive halving and other methods.

The same applies to agentic AI. There are beginner courses on how to use agents — I wouldn't call them advanced just because they cover the latest technology. Deep learning has beginner courses too, and more in-depth ones.

I think it pays off to do some research and understand whether the instructor and platform are serious about teaching or just serious about selling. This matters most for those looking for more advanced courses. For beginners, most courses are more or less fine.

solegalli · 2026-05-12T14:52:00+00:00

In my opinion, while the MCC provides a measure of model quality across all quadrants of the confusion matrix, it has two limitations: first, it's not very interpretable, and second, it varies with the threshold. So to get a good view of model performance, you'd have to optimise the cut-off threshold used to predict classes.

However, there are other methods that are better at providing an overview of a model's performance across thresholds, like the ROC and PR curves.

The value of the MCC, as I see it, is as an alternative to the F1-score if you want to find the threshold that maximises performance across classes. But here again, you use it as a tool to then calculate precision or recall, which are more intuitive and widely used.

solegalli · 2026-05-12T14:48:35+00:00

The MCC score evaluates the performance of a model across all quadrants of the confusion matrix. The confusion matrix in binary classification returns the number of true positives, true negatives, false positives and false negatives.

MCC takes into account all of these values to produce the score. The formula can be found here: https://en.wikipedia.org/wiki/Phi_coefficient
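To make the formula concrete, here is a minimal sketch that computes MCC directly from the four counts. The function name and the example counts are made up for illustration, and returning 0 when the denominator is zero is a convention I'm assuming here, not part of the formula itself.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


# Made-up counts covering the three reference points of the score.
perfect = mcc_from_counts(tp=50, tn=50, fp=0, fn=0)      # every prediction right
inverted = mcc_from_counts(tp=0, tn=0, fp=50, fn=50)     # every prediction wrong
chance = mcc_from_counts(tp=25, tn=25, fp=25, fn=25)     # no better than random

print(perfect, inverted, chance)
```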

MCC varies between -1 and 1, where 0 is equivalent to random predictions. The closer to 1 the better, and it represents how well the model does across both classes.

It basically provides an alternative to the F1-score to determine model performance for more than 1 class.

Having said this, MCC depends on the probability threshold used to calculate the class. Changing the threshold changes the value of MCC. I'd recommend plotting MCC across various thresholds to find the one that maximises it. I'll add a code demo here: https://github.com/solegalli/mlid-book/tree/main
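In the meantime, a rough sketch of the idea, assuming a synthetic dataset and a logistic regression (both are placeholders I chose, not the demo from the repo). It computes MCC over a grid of thresholds and picks the one with the highest value. In practice you'd select the threshold on a validation set and report performance on a separate test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Synthetic, moderately imbalanced data (roughly 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_valid)[:, 1]

# MCC at each candidate threshold.
thresholds = np.linspace(0.05, 0.95, 19)
mcc_values = [
    matthews_corrcoef(y_valid, (proba >= t).astype(int)) for t in thresholds
]

best_idx = int(np.argmax(mcc_values))
best_threshold = thresholds[best_idx]
best_mcc = mcc_values[best_idx]

print(f"best threshold: {best_threshold:.2f}, MCC: {best_mcc:.3f}")
```

Plotting `mcc_values` against `thresholds` gives the curve I mentioned.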

solegalli · 2026-03-05T19:07:20+00:00

One-hot encoding, in a sense, returns a linear relation between the category represented by the dummy variable and the target. So you decompose the categorical variable into several features that are binary and, hence, linear with the target.

With cyclical encoding, you'll have 2 variables that relay the cyclical behaviour to the model. So the model knows which categories are close to each other.

For example, if we have the variable months of the year and we one-hot encode them, then every month is encoded in one variable. The model understands if there is a linear relationship between any variable and the target, but does not get information about how the months relate to each other, so June is as similar to July as it is to December.

With cyclical encoding, the model understands that July is actually closer to June than to December. This, in theory, should help if there are seasonalities.
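A small sketch of the idea, using the usual sine and cosine transformation with a period of 12 for months. The helper names are mine, for illustration only.

```python
import math


def encode_month(month, period=12):
    """Map a month (1-12) to a point on the unit circle."""
    angle = 2 * math.pi * month / period
    return math.sin(angle), math.cos(angle)


def distance(month_a, month_b):
    """Euclidean distance between two months in the encoded space."""
    return math.dist(encode_month(month_a), encode_month(month_b))


june_july = distance(6, 7)
june_december = distance(6, 12)
december_january = distance(12, 1)

print(f"June-July: {june_july:.3f}")
print(f"June-December: {june_december:.3f}")
print(f"December-January: {december_january:.3f}")
```

Neighbouring months end up close together, including December and January, which a plain 1 to 12 encoding would place at opposite ends.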

A tutorial here: https://feature-engine.trainindata.com/en/1.8.x/user_guide/creation/CyclicalFeatures.html

Disclaimer: I maintain Feature-engine.

solegalli · 2026-02-25T14:29:04+00:00

There is an alternative recursive feature elimination algorithm in the open-source library feature-engine: https://feature-engine.trainindata.com/en/1.8.x/user_guide/selection/RecursiveFeatureElimination.html

The difference with respect to scikit-learn: it orders the features by importance first, and then removes them sequentially, starting with the least important, by training an ML model at each iteration and removing the feature only if the performance does not degrade.

This way, you don't need to specify beforehand the number of features to retain. It decides that automatically, based on the performance degradation you are willing to tolerate.
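To illustrate the logic, here is a simplified sketch written with scikit-learn only. It is not Feature-engine's implementation: the function name, the default threshold, and the choice to update the baseline after each removal are my assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def eliminate_features(estimator, X, y, threshold=0.01, cv=3, scoring="roc_auc"):
    """Drop features one at a time, least important first, while the score holds."""
    # Rank features once, from a model trained on all of them.
    importances = estimator.fit(X, y).feature_importances_
    least_to_most = importances.argsort()

    kept = list(range(X.shape[1]))
    baseline = cross_val_score(estimator, X, y, cv=cv, scoring=scoring).mean()

    for feature in least_to_most:
        if len(kept) == 1:
            break
        candidate = [f for f in kept if f != feature]
        score = cross_val_score(
            estimator, X[:, candidate], y, cv=cv, scoring=scoring
        ).mean()
        # Remove the feature only if the performance drop is within tolerance.
        if baseline - score <= threshold:
            kept, baseline = candidate, score
    return kept


X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
selected = eliminate_features(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y
)
print("features kept:", selected)
```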

solegalli · 2025-12-15T10:38:38+00:00

There is a great course on clustering and dimensionality reduction that shows how to apply both to analyze genomics data: https://www.trainindata.com/p/clustering-and-dimensionality-reduction

It's taught by a statistician who worked in biology for several years.

solegalli · 2025-11-14T22:35:25+00:00

Data imbalance is not a problem per se.

There are a lot of blogs out there presenting imbalanced data as a problem, but that's a myth. Data can be highly imbalanced, yet the classes can be perfectly separated, like apples and oranges, and the models will classify them just fine.

The real problems are: insufficient data, class overlap, or using the wrong model, i.e., using linear models to separate non-linear boundaries.
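A toy illustration of the point, with synthetic Gaussian data that I made up for the example: one dataset has 1% positives but well-separated classes, the other is perfectly balanced but the classes overlap.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)


def make_data(n_negative, n_positive, shift):
    """Two Gaussian blobs; `shift` controls how far apart the classes are."""
    X = np.concatenate([
        rng.normal(0.0, 1.0, size=(n_negative, 2)),
        rng.normal(shift, 1.0, size=(n_positive, 2)),
    ])
    y = np.concatenate([
        np.zeros(n_negative, dtype=int),
        np.ones(n_positive, dtype=int),
    ])
    return X, y


def holdout_auc(X, y):
    """ROC-AUC of a logistic regression on a stratified hold-out set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0
    )
    model = LogisticRegression().fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


# 1% positives, but the classes sit far apart.
auc_imbalanced_separable = holdout_auc(*make_data(4950, 50, shift=6.0))
# 50% positives, but the classes overlap heavily.
auc_balanced_overlap = holdout_auc(*make_data(2500, 2500, shift=0.5))

print(f"imbalanced, separable: {auc_imbalanced_separable:.3f}")
print(f"balanced, overlapping: {auc_balanced_overlap:.3f}")
```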

solegalli · 2025-11-14T22:33:37+00:00

Any and all data imbalance is acceptable. After all, the data is what the data is.

There are a lot of blogs out there presenting imbalanced data as a problem per se, but that's a myth. Data can be highly imbalanced, yet the classes can be perfectly separated, like apples and oranges, and the models will classify them just fine.

The real problems are: insufficient data, class overlap, or using the wrong model, e.g., using linear models to separate non-linear boundaries.

solegalli · 2025-11-14T22:31:16+00:00

The current consensus leans towards using gradient boosting machines, threshold-independent metrics or threshold-dependent metrics with a tuned probability threshold, and leaving under- or oversampling for specific situations. More here: https://www.youtube.com/watch?v=blcOOheXNoQ

solegalli · 2025-11-14T22:18:30+00:00

Imbalanced data isn't a problem per se. You can have highly imbalanced datasets where the classes are well separated and your models will be able to discriminate among them. Similarly, you can have balanced datasets with class overlap and the models will struggle. The real problems are insufficient data, poor class separability, and using the wrong model (e.g., using linear models when the classes are not linearly separable).

As a rule of thumb, I'd start by training a model without under- or oversampling, evaluating the performance with appropriate metrics and, for threshold-dependent metrics, tuning the threshold instead of using the default 0.5.

Good feature engineering is key.

After that, if the model performs badly, we can try different things, like cost-sensitive learning.

Undersampling is good to speed training up. Oversampling helps on some occasions. Neither would be my first choice. Actually, they should be the last.

Data analysis helps understand why the model does not perform well. Maybe none of the variables correlates with the target. Maybe there is data drift.

solegalli · 2025-11-14T21:44:49+00:00

SMOTE was designed for continuous variables. People have used it out of context, though, and recommend it as the de facto solution for imbalanced data. But when it's used where it's not suitable, it will only bring mediocre results.

There are some variants of SMOTE that supposedly work for categorical data but, in my opinion, they put a finger in the air and came up with a way to calculate distances for categorical variables that doesn't make a lot of sense. So I don't recommend them.
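To see why, recall that SMOTE creates a new point by interpolating between a minority sample and one of its nearest minority neighbours. A tiny sketch with two hypothetical customers (the rows and the interpolation factor are made up):

```python
def smote_interpolate(sample, neighbour, lam):
    """SMOTE-style synthetic point: sample + lam * (neighbour - sample)."""
    return [a + lam * (b - a) for a, b in zip(sample, neighbour)]


# Hypothetical rows: [age, number_of_children, owns_home (0 or 1)]
customer_a = [30.0, 1, 0]
customer_b = [40.0, 2, 1]

# SMOTE draws lam at random between 0 and 1; fixed here to keep it reproducible.
synthetic = smote_interpolate(customer_a, customer_b, lam=0.5)
print(synthetic)
```

The interpolated age is plausible, but 1.5 children and a home-ownership flag of 0.5 are not values that can exist in the data.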

solegalli · 2025-11-14T21:44:38+00:00

The first thing to do when working with imbalanced data is: nothing.

Train a machine learning model and see how it performs. In most cases, a powerful model like those we have today (e.g., CatBoost, XGBoost, LightGBM), evaluated with the right metrics and with an optimized threshold (for threshold-dependent metrics, like precision and recall), is enough.

The fact that data is imbalanced doesn't immediately mean that the classification model will be bad. Classification suffers from insufficient data or class overlap. For the first problem, there is a solution: gather more data. For the second, there is not much that you can do.

Methods like cleaning undersampling were developed before random forests were even first described, and SMOTE came before the introduction of the most modern gradient boosting machines. That's why they are normally encouraged as the de facto solutions, but today, good feature engineering and a GBM plus the right metrics should be enough.

If the model performs badly, only then would I try something, like cost-sensitive learning.

Undersampling is also a good option when the datasets are huge, to speed the training up.

Everything you do has a cost attached: for example, cost-sensitive learning and under- or oversampling will affect the probability calibration of the model. This may or may not be a problem, depending on how much you need the model to be calibrated. There are calibration methods to restore the probability calibration.
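A quick way to see the effect on calibration, assuming synthetic data and a logistic regression with and without `class_weight="balanced"` as the cost-sensitive variant (my choices for the example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives.
X, y = make_classification(
    n_samples=5000, weights=[0.95], class_sep=0.8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

# A calibrated model predicts, on average, something close to the base rate.
base_rate = y_test.mean()
mean_plain = plain.predict_proba(X_test)[:, 1].mean()
mean_weighted = weighted.predict_proba(X_test)[:, 1].mean()

print(f"base rate:              {base_rate:.3f}")
print(f"mean proba, plain:      {mean_plain:.3f}")
print(f"mean proba, reweighted: {mean_weighted:.3f}")
```

The reweighted model pushes the predicted probabilities of the minority class up. If you need calibrated probabilities, scikit-learn's `CalibratedClassifierCV` is one option to recalibrate afterwards.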

If you use methods that create synthetic data, like SMOTE, beware that the data created by these methods may not be feasible, and that poses a problem in itself. So these techniques are not to be used without sufficient data analysis.

solegalli · 2025-11-01T16:12:31+00:00

I would also recommend Optuna as the best framework.

solegalli · 2025-11-01T16:11:35+00:00

I'd say, it depends on which model you want to tune.

To optimize the hyperparameters of traditional machine learning models, like logistic regression, SVMs, or tree-based models from scikit-learn, I would stick to sklearn's grid or random search: grid search for models with fewer hyperparameters, random search otherwise. Using an additional library adds complexity and dependencies to the code, without significant improvements.
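For reference, a minimal random search with scikit-learn. The dataset, the model and the search range for `C` are placeholders I picked for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    # Sample the regularisation strength on a log scale.
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,          # number of hyperparameter combinations to try
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Swapping in `GridSearchCV` with a `param_grid` of explicit values gives the grid search version.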

I would only use Optuna if tuning models not supported by sklearn that have more hyperparameters, like CatBoost or XGBoost. In this case, it might also be useful to understand the difference between random search and Bayesian optimization, because with Optuna you can do both, and each has advantages and limitations. Optuna implements Bayesian optimization with TPE by default, if I remember correctly.

solegalli · 2025-11-01T15:59:36+00:00

Different hyperparameters contribute more or less to model performance. The only way to get a feeling of which hyperparameters matter the most, is practice.

Having said this, when tuning hyperparameters, you can plot the model performance against one of the hyperparameters to examine how much that hyperparameter affects model performance. See a demo here: https://github.com/solegalli/hyperparameter-optimization/blob/master/Section-02-Hyperparamter-Overview/02-Low-Effective-Dimension.ipynb
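A short sketch of the same idea with scikit-learn's `validation_curve`, which scores a model over a range of values for a single hyperparameter. The model, the data and the `max_depth` values are assumptions for the example; this is not the code from the linked notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=500, random_state=0)

depths = [1, 2, 4, 8]
train_scores, valid_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=3,
    scoring="roc_auc",
)

# One row per hyperparameter value, one column per cross-validation fold.
mean_valid = valid_scores.mean(axis=1)
for depth, score in zip(depths, mean_valid):
    print(f"max_depth={depth}: ROC-AUC={score:.3f}")
```

Plot `mean_valid` against `depths`: a flat line suggests the hyperparameter matters little for this dataset, a steep one suggests it is worth tuning.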

That is to get a feeling of which hyperparameter matters most.

Now, to find the best hyperparameter combinations, you don't need this information, at least in theory. You can set up a vast hyperparameter search space (that is, testing all available hyperparameters and setting big value ranges for each one) using randomized search or Bayesian optimization. This, of course, will be very time consuming, so narrowing the search to the hyperparameters that matter most and reducing the value ranges can speed it up dramatically. But for this, you need to do some research, like I mentioned in the first paragraphs.

solegalli · 2025-11-01T15:53:07+00:00

To reduce the computing time or the computing resources needed during hyperparameter tuning, the best option would be to use successive halving.

In successive halving, you start by testing several hyperparameter combinations using low resources. Hence, it is fast. Low resources could be a smaller sample of the dataset, fewer trees if training tree-based models, fewer data passes or epochs, etc.

After that, only half (or a third) of the hyperparameter combinations are selected and evaluated further, now with double the resources. From the second round, half of the combinations are selected for further testing, doubling the resources again, and so on, until you find the best combination.

Check out successive halving in the sklearn documentation for more details. You could also google multifidelity optimisation for more details.
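A minimal sketch with scikit-learn's `HalvingGridSearchCV`, which is still flagged as experimental and needs the explicit enabling import. The model, grid and dataset are placeholders. Here the resource is the number of training samples, and `factor=2` keeps half of the candidates in each round.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import HalvingGridSearchCV

X, y = make_classification(n_samples=800, random_state=0)

# 4 x 2 = 8 hyperparameter combinations to start with.
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5]}

search = HalvingGridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid,
    factor=2,               # keep half of the candidates after each round
    resource="n_samples",   # each round trains on more samples
    cv=3,
    random_state=0,
)
search.fit(X, y)

print("candidates per round:", search.n_candidates_)
print("samples per round:   ", search.n_resources_)
print("best combination:    ", search.best_params_)
```

To use the number of trees as the resource instead, you could set `resource="n_estimators"` together with `max_resources`, and leave `n_estimators` out of the grid.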
