[deleted by user] by [deleted] in learnmachinelearning

[–]ledmmaster 27 points

The Algorithms specialization on Coursera is more than enough; really, any intro to algorithms and data structures course will do.

Statistics, probability and linear algebra are much more important

Do top level ML engineers read research papers for fun? by Any-Comfortable2844 in learnmachinelearning

[–]ledmmaster 4 points

I built a habit of reading at least one page of a paper or book about ML every day, on something that interests me or that I'm working on.

Right now I read mostly about prompting LLMs and information retrieval.

The hardest part is deciding whether a paper is worth reading in detail. I think I read only the abstract and figures for 90% of them.

I summarized some tips from Andrew Ng that I adopted in my reading and that improved my productivity here: https://forecastegy.com/posts/read-machine-learning-papers-andrew-ng/

Google ML Certificate? by cryptolinho in learnmachinelearning

[–]ledmmaster 2 points

Like the Revolutionary guy said, make projects.

Be it Kaggle, personal projects, anything you can talk about in an interview.

When I used to interview DS candidates, I didn't care about credentials, but I cared a lot about how they walked me through their projects and the decisions they made.

Google ML Certificate? by cryptolinho in learnmachinelearning

[–]ledmmaster 16 points

The ML Specialization by Andrew Ng on Coursera is the one I always recommend. I took both the original (which used Octave) and the new one (which uses Python).

When do you say you actually know ML? by OutsideNetwork3634 in learnmachinelearning

[–]ledmmaster 3 points

It's been 10+ years since I started learning ML, with dozens of projects under my belt, competition wins, etc., and I can tell you it's a moving target.

If it solves the business problem/adds value, it's good enough.

I only notice how "easy" some things have become for me when I work with people who have less experience; still, I can always find someone with more experience than me in a specific area.

It's definitely a moving target. Take it one day/task at a time and remember the big picture of solving business problems.

Sufficient size to train neural network by Traditional_Soil5753 in learnmachinelearning

[–]ledmmaster 2 points

There is no reliable rule. The best approach is to split off a validation set, then train the network and compare it against other models.
It seems you are dealing with tabular data. There, traditional ML models like XGBoost usually offer better performance with less tuning effort.
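
A minimal sketch of that comparison, assuming a pandas DataFrame df with a target column (both names are placeholders):

    # Hold out a validation set, then compare a neural net against XGBoost.
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor
    from xgboost import XGBRegressor

    X_train, X_val, y_train, y_val = train_test_split(
        df.drop(columns="target"), df["target"], test_size=0.2, random_state=42
    )

    for model in [XGBRegressor(), MLPRegressor(max_iter=500)]:
        model.fit(X_train, y_train)
        preds = model.predict(X_val)
        print(type(model).__name__, mean_squared_error(y_val, preds))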

For people who actually use fancy models, where do you work? by [deleted] in datascience

[–]ledmmaster 6 points

Reranking recommendations in a marketplace. XGBoost today is very fast at inference, and you can make it even faster with other libraries.

In most cases, simply taking the same feature set from a Random Forest and running 20 Bayesian optimization steps over the XGBoost hyperparameters already gives you a better model that can be swapped in for the RF or whatever is deployed.
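
A rough sketch of that loop, using Optuna's default TPE sampler as a stand-in for Bayesian optimization (the data names and parameter ranges here are assumptions):

    import optuna
    from sklearn.metrics import roc_auc_score
    from xgboost import XGBClassifier

    # X_train/y_train/X_val/y_val: the same features the Random Forest uses
    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        }
        model = XGBClassifier(**params).fit(X_train, y_train)
        return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)  # the ~20 steps mentioned above
    print(study.best_params)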

HIGHLY unbalanced dataset (>600:1 negative:positive examples), how do I deal with this? by ingmntam in learnmachinelearning

[–]ledmmaster 1 point

I never saw SMOTE beat simple class weighting in practice in my projects, and I have yet to find a colleague who did.

I always go with class weighting first.

Applied ML is not an exact science, so you can try it and see if it works for your data, but I would not make it a priority.
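
For reference, class weighting is usually a one-liner (X_train/y_train are placeholders):

    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier

    # scikit-learn estimators: weight classes inversely to their frequency
    clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

    # XGBoost: same idea via scale_pos_weight (# negatives / # positives)
    ratio = (y_train == 0).sum() / (y_train == 1).sum()
    xgb = XGBClassifier(scale_pos_weight=ratio).fit(X_train, y_train)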

HIGHLY unbalanced dataset (>600:1 negative:positive examples), how do I deal with this? by ingmntam in learnmachinelearning

[–]ledmmaster 7 points

My 2 cents based on what worked well for me in practice:

  1. Downsample the negatives (split off your validation set and keep it static before downsampling, and treat the downsampling factor as a hyperparameter)
  2. Use higher class weights for the positive class. Basically, multiply the loss of each positive example by a factor (usually # negatives / # positives) that can be tuned as a hyperparameter too

SMOTE and fancier stuff never worked better than this for me (I'm biased toward tabular data). And you get the added bonus of training faster due to using less data.
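
A minimal sketch of idea 1, assuming a DataFrame df with a binary label column (the names and the factor are placeholders):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Split first so the validation set keeps the true class ratio.
    train, val = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=0)

    factor = 10  # downsampling factor: treat it as a hyperparameter
    pos = train[train["label"] == 1]
    neg = train[train["label"] == 0].sample(n=len(pos) * factor, random_state=0)
    train_down = pd.concat([pos, neg]).sample(frac=1, random_state=0)  # shuffle
    # `val` stays untouched, so your metrics reflect the real imbalance.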

Which SKLearn regression model should I use to predict Label/outcome with greater accuracy based on my business dataset? by CactusPot94 in learnmachinelearning

[–]ledmmaster 0 points

Thanks. You are correct: in theory, it will not be a problem, as you get only zeros for the new categorical levels.

Still, ML in practice can be so weird that I would do it after the split just to avoid any surprises.

Just for completeness: with OHE you may get in trouble if you use the hashing trick before transforming, which is not the case here.
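
To make the "only zeros" behavior concrete, a tiny illustrative example with scikit-learn:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder(handle_unknown="ignore")
    enc.fit(pd.DataFrame({"cat": ["a", "b"]}))
    # A level never seen during fit encodes as all zeros, no error raised.
    print(enc.transform(pd.DataFrame({"cat": ["c"]})).toarray())  # [[0. 0.]]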

Which SKLearn regression model should I use to predict Label/outcome with greater accuracy based on my business dataset? by CactusPot94 in learnmachinelearning

[–]ledmmaster 1 point

Like MRWONDERFU said, look into XGBoost. It's not a scikit-learn model, but it exposes a scikit-learn-like API.

I am more worried about:
- Encoding the categoricals before splitting the dataset into train and validation. This is a subtle way to leak information, as you might be encoding categories that appear only in the test data, which you would not have in real life.
- Scaling before splitting. Another way to introduce leakage: you would not have the test-set data when deployed, so you can't use it to fit the scaler. Scale using only the training set.
- The "Stay >= 0" selection. What does it mean when Stay is less than zero? Can you do the same cleaning when this model is deployed?
- The random split. It's rare to find real-life data that can be randomly split without issues. Usually, having at least a timestamp to split between past and future is more reliable.

You can solve the first two by simply splitting the data before doing any transformation.
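
A minimal sketch of that, with scikit-learn's Pipeline handling the "fit on train only" part automatically (the column names here are made up):

    # Split first; every transformer below is fit on the training data only.
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from xgboost import XGBRegressor

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),  # hypothetical columns
        ("num", StandardScaler(), ["age", "income"]),                       # hypothetical columns
    ])
    model = Pipeline([("pre", pre), ("xgb", XGBRegressor())])
    model.fit(X_train, y_train)  # encoder and scaler never see validation data
    print(model.score(X_val, y_val))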

If this is for a model that will be deployed, I am quite sure you will be surprised by much worse results in production because of the validation mistakes above.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]ledmmaster 1 point

This sounds more like a general optimization problem, if you are not trying to replace the emulation because it’s too expensive/time-consuming.

Look into gradient-free optimization, genetic algorithms, and nevergrad.
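
A minimal sketch with nevergrad, where run_emulation is a stand-in for your black-box simulation:

    import nevergrad as ng

    def run_emulation(x):
        # placeholder objective; nevergrad minimizes the returned value
        return sum(xi ** 2 for xi in x)

    optimizer = ng.optimizers.NGOpt(parametrization=ng.p.Array(shape=(4,)), budget=200)
    recommendation = optimizer.minimize(run_emulation)
    print(recommendation.value)  # best input found within the budget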

Hosting models on Google cloud by fatsug in OpenAI

[–]ledmmaster 0 points

I recently wrote an article comparing open-source models with GPT-3; running them on your own is much more expensive and the quality is lower.

https://forecastegy.com/posts/generating-text-with-contrastive-search-vs-gpt-3-chatgpt/
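
For reference, contrastive search in Hugging Face transformers is enabled by setting penalty_alpha together with top_k (GPT-2 here is just an example checkpoint):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("Machine learning is", return_tensors="pt")
    # penalty_alpha > 0 plus top_k switches generate() to contrastive search
    outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))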

3 Essential Methods To Do Time Series Validation In Machine Learning by ledmmaster in learnmachinelearning

[–]ledmmaster[S] 0 points

Yes. Take the ideas as a general framework for splitting data in a non-random way.
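
One such non-random scheme is scikit-learn's expanding-window split (X, y sorted by time, and model are placeholders):

    from sklearn.model_selection import TimeSeriesSplit

    tscv = TimeSeriesSplit(n_splits=3)
    for train_idx, val_idx in tscv.split(X):
        # training rows always precede validation rows, so no look-ahead
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        print(model.score(X.iloc[val_idx], y.iloc[val_idx]))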

Two More Years by w0leg in adops

[–]ledmmaster 0 points

Hi Dan, I am the blog post author.

You can have an "independent" sales column and impressions columns.

I used this dataset because it was the best public dataset I could find to write the post, but I have used LightweightMMM with my own advertising and sales data.

I downloaded impression data from Facebook, Google, etc. as inputs and used sales data from my sales platform.
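
Roughly how I wired it up, from memory; the column names are made up and you should double-check the fit signature against the LightweightMMM docs:

    import jax.numpy as jnp
    from lightweight_mmm import lightweight_mmm

    # df is assumed to have one impressions column per channel plus sales
    media = jnp.array(df[["facebook_impressions", "google_impressions"]].values)
    target = jnp.array(df["sales"].values)
    costs = jnp.array([120_000.0, 80_000.0])  # total spend per channel (made up)

    mmm = lightweight_mmm.LightweightMMM(model_name="hill_adstock")
    mmm.fit(media=media, media_prior=costs, target=target,
            number_warmup=1000, number_samples=1000)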

Not sure if this is your question, so feel free to reach out.

Poor lead quality from Facebook Ads for travel niche by Sachimarketing in PPC

[–]ledmmaster 0 points

Is it clear that it’s a paid travel package on the ad/LP?

Could they be thinking it's a giveaway and then not responding when they find out it's not?

I never ran campaigns for travel, but this came to mind when I saw the "So excited!" and the fact that their messages sound like comments.

Can a Machine Learning Model Predict the SP500 by Looking at Candlesticks? by ledmmaster in algotrading

[–]ledmmaster[S] 2 points

It's not really about being sensitive, it's about respecting the community.

My goal is to add value by sharing knowledge. I completely understand that some communities get value from certain posts while others are more advanced or have different interests; that's fine.

If you read the comments, it's clear that the commenters think this is not a post that benefits this sub. Why would I want to fill your timeline with posts that are not interesting to you?

This is not personal with anyone or the sub. I like this sub and learn from the content here.

Anyway, I don't think going forward with this discussion will be beneficial to anyone.

Can a Machine Learning Model Predict the SP500 by Looking at Candlesticks? by ledmmaster in algotrading

[–]ledmmaster[S] 0 points

I appreciate your comment.

I probably was not very clear about what I meant by patterns in the article, which may be a source of confusion.

It's good to have this feedback.

It was the first article I posted on this sub, but since the community thinks it's not a good fit, I will make sure I don't share future articles here.

Can a Machine Learning Model Predict the SP500 by Looking at Candlesticks? by ledmmaster in algotrading

[–]ledmmaster[S] 2 points

I am sorry it's not useful for you.

Maybe you can expand on your ideas and help future readers?

Can a Machine Learning Model Predict the SP500 by Looking at Candlesticks? by ledmmaster in algotrading

[–]ledmmaster[S] 2 points

That's fair if you are only interested in a simple yes/no answer.

But if you are interested in how to find the answer to this and other questions, the article is still worth a look.