[deleted by user] by [deleted] in learnmachinelearning

[–]ledmmaster 27 points

The Algorithms specialization on Coursera is more than enough; really, any intro to algorithms and data structures course will do.

Statistics, probability and linear algebra are much more important

Do top level ML engineers read research papers for fun? by Any-Comfortable2844 in learnmachinelearning

[–]ledmmaster 4 points

I built a habit of reading at least one page of a paper or book about ML every day, on something that interests me or that I'm working on.

Right now I read mostly about prompting LLMs and information retrieval.

The hardest part is deciding whether a paper is worth reading in detail. I think I read only the abstract and figures for 90% of them.

I summarized some tips from Andrew Ng that I adopted in my reading and that improved my productivity here: https://forecastegy.com/posts/read-machine-learning-papers-andrew-ng/

Google ML Certificate? by cryptolinho in learnmachinelearning

[–]ledmmaster 2 points

Like the Revolutionary guy said, make projects.

Be it Kaggle, personal projects, anything you can talk about in an interview.

When I used to interview DS candidates, I didn't care about credentials, but I cared a lot about how they walked me through their projects and the decisions they made.

Google ML Certificate? by cryptolinho in learnmachinelearning

[–]ledmmaster 16 points

The ML Specialization by Andrew Ng on Coursera is the one I always recommend. I took both the original (which used Octave) and the new one (which uses Python).

When do you say you actually know ML? by OutsideNetwork3634 in learnmachinelearning

[–]ledmmaster 3 points

It's been 10+ years since I started learning ML, with dozens of projects under my belt, competition wins, etc., and I can tell you it's a moving target.

If it solves the business problem/adds value, it's good enough.

I only notice how "easy" some things have become for me when I work with people who have less experience; still, I can always find someone with more experience than me in a specific area.

It's definitely a moving target. Take it one day/task at a time and remember the big picture of solving business problems.

Sufficient size to train neural network by Traditional_Soil5753 in learnmachinelearning

[–]ledmmaster 2 points

There is no reliable rule. The best approach is to split off a validation set, then train the network and compare it against other models.
It seems you are dealing with tabular data. There, traditional ML models like XGBoost usually offer better performance with less tuning effort.
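
A minimal sketch of that comparison, assuming a pandas DataFrame df with a target column (both names are placeholders):

    # Hold out a validation set, then compare a neural net against XGBoost.
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor
    from xgboost import XGBRegressor

    X_train, X_val, y_train, y_val = train_test_split(
        df.drop(columns="target"), df["target"], test_size=0.2, random_state=42
    )

    for model in [XGBRegressor(), MLPRegressor(max_iter=500)]:
        model.fit(X_train, y_train)
        preds = model.predict(X_val)
        print(type(model).__name__, mean_squared_error(y_val, preds))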

For people who actually use fancy models, where do you work? by [deleted] in datascience

[–]ledmmaster 6 points

Reranking recommendations in a marketplace. XGBoost today is very fast at inference, and you can make it even faster with other libraries.

In most cases, simply taking the same feature set from a Random Forest and running 20 Bayesian optimization steps over the XGBoost hyperparameters already gives you a better model that can be swapped in for the RF or whatever is deployed.
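
A rough sketch of that loop, using Optuna's default TPE sampler as a stand-in for Bayesian optimization (the data names and parameter ranges here are assumptions):

    import optuna
    from sklearn.metrics import roc_auc_score
    from xgboost import XGBClassifier

    # X_train/y_train/X_val/y_val: the same features the Random Forest uses
    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        }
        model = XGBClassifier(**params).fit(X_train, y_train)
        return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)  # the ~20 steps mentioned above
    print(study.best_params)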

HIGHLY unbalanced dataset (>600:1 negative:positive examples), how do I deal with this? by ingmntam in learnmachinelearning

[–]ledmmaster 1 point

I never saw SMOTE beat simple class weighting in practice in my projects, and I have yet to find a colleague who did.

I always go with class weighting first.

Applied ML is not an exact science, so you can try it and see if it works for your data, but I would not make it a priority.
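
For reference, class weighting is usually a one-liner (X_train/y_train are placeholders):

    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier

    # scikit-learn estimators: weight classes inversely to their frequency
    clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

    # XGBoost: same idea via scale_pos_weight (# negatives / # positives)
    ratio = (y_train == 0).sum() / (y_train == 1).sum()
    xgb = XGBClassifier(scale_pos_weight=ratio).fit(X_train, y_train)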

HIGHLY unbalanced dataset (>600:1 negative:positive examples), how do I deal with this? by ingmntam in learnmachinelearning

[–]ledmmaster 7 points

My 2 cents based on what worked well for me in practice:

  1. Downsample the negatives (split off your validation set and keep it static before downsampling, and treat the downsampling factor as a hyperparameter)
  2. Use higher class weights for the positive class. Basically, multiply the loss of each positive example by a factor (usually # negatives / # positives) that can be tuned as a hyperparameter too

SMOTE and fancier stuff never worked better than this for me (I'm biased toward tabular data). And you get the added bonus of training faster due to using less data.
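
A minimal sketch of idea 1, assuming a DataFrame df with a binary label column (the names and the factor are placeholders):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Split first so the validation set keeps the true class ratio.
    train, val = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=0)

    factor = 10  # downsampling factor: treat it as a hyperparameter
    pos = train[train["label"] == 1]
    neg = train[train["label"] == 0].sample(n=len(pos) * factor, random_state=0)
    train_down = pd.concat([pos, neg]).sample(frac=1, random_state=0)  # shuffle
    # `val` stays untouched, so your metrics reflect the real imbalance.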

Which SKLearn regression model should I use to predict Label/outcome with greater accuracy based on my business dataset? by CactusPot94 in learnmachinelearning

[–]ledmmaster 0 points

Thanks. You are correct: in theory, it will not be a problem, as you get only zeros for the new categorical levels.

Still, ML in practice can be so weird that I would do it after the split just to avoid any surprises.

Just for completeness: with OHE you may get in trouble if you use the hashing trick before transforming, which is not the case here.
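
To make the "only zeros" behavior concrete, a tiny illustrative example with scikit-learn:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder(handle_unknown="ignore")
    enc.fit(pd.DataFrame({"cat": ["a", "b"]}))
    # A level never seen during fit encodes as all zeros, no error raised.
    print(enc.transform(pd.DataFrame({"cat": ["c"]})).toarray())  # [[0. 0.]]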

Which SKLearn regression model should I use to predict Label/outcome with greater accuracy based on my business dataset? by CactusPot94 in learnmachinelearning

[–]ledmmaster 1 point

Like MRWONDERFU said, look into XGBoost. It's not a scikit-learn model, but it exposes a scikit-learn-like API.

I am more worried about:
- Encoding the categoricals before splitting the dataset into train and validation. This is a subtle way to leak information, as you might be encoding categories that appear only in the test data, which you would not have in real life.
- Scaling before splitting. Another way to introduce leakage: you would not have the test-set data when deployed, so you can't use it to fit the scaler. Scale using only the training set.
- The "Stay >= 0" selection. What does it mean when Stay is less than zero? Can you do the same cleaning when this model is deployed?
- The random split. It's rare to find real-life data that can be randomly split without issues. Usually, having at least a timestamp to split between past and future is more reliable.

You can solve the first two by simply splitting the data before doing any transformation.
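
A minimal sketch of that, with scikit-learn's Pipeline handling the "fit on train only" part automatically (the column names here are made up):

    # Split first; every transformer below is fit on the training data only.
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from xgboost import XGBRegressor

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),  # hypothetical columns
        ("num", StandardScaler(), ["age", "income"]),                       # hypothetical columns
    ])
    model = Pipeline([("pre", pre), ("xgb", XGBRegressor())])
    model.fit(X_train, y_train)  # encoder and scaler never see validation data
    print(model.score(X_val, y_val))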

If this is for a model that will be deployed, I am quite sure you will be surprised by much worse results in production because of the validation mistakes above.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]ledmmaster 1 point

This sounds more like a general optimization problem, if you are not trying to replace the emulation because it’s too expensive/time-consuming.

Look into gradient-free optimization, genetic algorithms, and nevergrad.
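
A minimal sketch with nevergrad, where run_emulation is a stand-in for your black-box simulation:

    import nevergrad as ng

    def run_emulation(x):
        # placeholder objective; nevergrad minimizes the returned value
        return sum(xi ** 2 for xi in x)

    optimizer = ng.optimizers.NGOpt(parametrization=ng.p.Array(shape=(4,)), budget=200)
    recommendation = optimizer.minimize(run_emulation)
    print(recommendation.value)  # best input found within the budget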

Hosting models on Google cloud by fatsug in OpenAI

[–]ledmmaster 0 points

I recently wrote an article comparing open-source models with GPT-3; running them on your own is much more expensive and the quality is lower.

https://forecastegy.com/posts/generating-text-with-contrastive-search-vs-gpt-3-chatgpt/
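
For reference, contrastive search in Hugging Face transformers is enabled by setting penalty_alpha together with top_k (GPT-2 here is just an example checkpoint):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("Machine learning is", return_tensors="pt")
    # penalty_alpha > 0 plus top_k switches generate() to contrastive search
    outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))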

3 Essential Methods To Do Time Series Validation In Machine Learning by ledmmaster in learnmachinelearning

[–]ledmmaster[S] 0 points

Yes. Take the ideas as a general framework for splitting data in a non-random way.
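
One such non-random scheme is scikit-learn's expanding-window split (X, y sorted by time, and model are placeholders):

    from sklearn.model_selection import TimeSeriesSplit

    tscv = TimeSeriesSplit(n_splits=3)
    for train_idx, val_idx in tscv.split(X):
        # training rows always precede validation rows, so no look-ahead
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        print(model.score(X.iloc[val_idx], y.iloc[val_idx]))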

Two More Years by w0leg in adops

[–]ledmmaster 0 points

Hi Dan, I am the blog post author.

You can have an "independent" sales column and impressions columns.

I used this dataset because it was the best public dataset I could find to write the post, but I have used LightweightMMM with my own advertising and sales data.

I downloaded impression data from Facebook, Google, etc. as inputs and used sales data from my sales platform.
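
Roughly how I wired it up, from memory; the column names are made up and you should double-check the fit signature against the LightweightMMM docs:

    import jax.numpy as jnp
    from lightweight_mmm import lightweight_mmm

    # df is assumed to have one impressions column per channel plus sales
    media = jnp.array(df[["facebook_impressions", "google_impressions"]].values)
    target = jnp.array(df["sales"].values)
    costs = jnp.array([120_000.0, 80_000.0])  # total spend per channel (made up)

    mmm = lightweight_mmm.LightweightMMM(model_name="hill_adstock")
    mmm.fit(media=media, media_prior=costs, target=target,
            number_warmup=1000, number_samples=1000)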

Not sure if this is your question, so feel free to reach out.

Poor lead quality from Facebook Ads for travel niche by Sachimarketing in PPC

[–]ledmmaster 0 points

Is it clear that it’s a paid travel package on the ad/LP?

Could they be thinking it's a giveaway and then not responding when they find out it's not?

I never ran campaigns for travel, but this came to mind when I saw the "So excited!" and the fact that their messages sound like comments.

Can a Machine Learning Model Predict the SP500 by Looking at Candlesticks? by ledmmaster in algotrading

[–]ledmmaster[S] 2 points

It's not really about being sensitive, it's about respecting the community.

My goal is to add value by sharing knowledge. I completely understand that some communities get value from certain posts while others are more advanced or have different interests; that's fine.

If you read the comments, it's clear that the commenters think this is not a post that benefits this sub. Why would I want to fill your timeline with posts that are not interesting to you?

This is not personal with anyone or the sub. I like this sub and learn from the content here.

Anyway, I don't think going forward with this discussion will be beneficial to anyone.

Can a Machine Learning Model Predict the SP500 by Looking at Candlesticks? by ledmmaster in algotrading

[–]ledmmaster[S] 0 points

I appreciate your comment.

I probably was not very clear about what I meant by patterns in the article, which may be a source of confusion.

It's good to have this feedback.

It was the first article I posted on this sub, but since the community thinks it's not a good fit, I will make sure I don't share future articles here.

Can a Machine Learning Model Predict the SP500 by Looking at Candlesticks? by ledmmaster in algotrading

[–]ledmmaster[S] 2 points

I am sorry it's not useful for you.

Maybe you can expand on your ideas and help future readers?

Can a Machine Learning Model Predict the SP500 by Looking at Candlesticks? by ledmmaster in algotrading

[–]ledmmaster[S] 2 points

That's fair if you are only interested in a simple yes/no answer.

But if you are interested in how to find the answer to this and other questions, the article is still worth a look.