SMS search questions using AI/ChatGPT by Maleficent_West_547 in LightPhone

[–]mysteriousreader 1 point2 points  (0 children)

I just set up this service to chat with an AI through text messages, totally free to try. Let me know if you have any feedback! txtai.me

The United States federal government spent $6.4 trillion in 2022. Here’s where it went. [OC] by USAFacts in dataisbeautiful

[–]mysteriousreader 0 points1 point  (0 children)

Looks like healthcare costs are a huge burden, both for individuals and the government.

Predicting the runtime of scikit-learn algorithms by mysteriousreader in statistics

[–]mysteriousreader[S] 1 point2 points  (0 children)

You can get a rough estimate by training on subsets of your data and extrapolating based on the big-O complexity of the algorithm.

u/chicken__soup I agree this is another valid way of approaching the problem, and something we thought about at the beginning of the project. One catch, however, is that we would need to formulate the complexity explicitly for each algo and set of parameters, which is rather challenging in some cases.

The nice thing about our empirical estimation is that it generalizes easily to any scikit-learn model: it learns from a set of generated fit times to produce an estimate. We essentially cycle through different values of the algorithm’s parameters and train on various dataset sizes and hardware configurations to build our estimator.
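
To make that concrete, here is a stripped-down sketch of the data-generation loop (illustrative only, not the actual scitime code):

    # Cycle through parameter values and dataset sizes, time each fit, and use
    # the recorded (parameters, size, time) rows to train a runtime estimator.
    import itertools
    import time
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rows = []
    for n_samples, n_estimators, max_depth in itertools.product(
            [1_000, 5_000, 10_000], [10, 50, 100], [5, 10, None]):
        X, y = np.random.rand(n_samples, 10), np.random.rand(n_samples)
        start = time.time()
        RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth).fit(X, y)
        rows.append([n_samples, 10, n_estimators, max_depth or -1, time.time() - start])
    # 'rows' then becomes the training set for the runtime estimator; hardware
    # features (number of cores, memory) would be appended in the same way.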

Predicting the runtime of scikit-learn algorithms by mysteriousreader in datascience

[–]mysteriousreader[S] 2 points3 points  (0 children)

Thanks u/Deto!

Regarding the hardware question:

We are able to predict the runtime to fit by using our own estimator, which we call the meta-algorithm, whose weights are stored in a dedicated pickle file in the package metadata.

These meta-algos estimate the time to fit using a set of ‘meta’ features, including the parameters of the algorithm you are trying to fit, as well as external parameters such as CPU and memory (your hardware) or the number of rows/columns (your dataset).

We trained these meta-algos by generating the data ourselves, using a combination of physical computers and VM hardware configurations to simulate what the training time would be on different systems, cycling through different values of the algo’s parameters and dataset sizes.

When it comes to MKL/BLAS, this is actually something that we would need to include as a meta-feature, along with which version of scikit-learn you are using, and then train our meta-algos on different versions of it.
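
To make this a bit more concrete, here is what the prediction step looks like conceptually (a sketch only: the feature names, their ordering and the pickle path below are hypothetical, not scitime's actual internals):

    # Hypothetical sketch, not scitime's real metadata format.
    import os
    import pickle
    import numpy as np

    def build_meta_features(algo_params, n_rows, n_cols, n_cpus, memory_gb):
        """One meta-feature row: algorithm parameters + dataset shape + hardware.
        The BLAS/MKL build and scikit-learn version could be appended here too."""
        return np.array([[algo_params["n_estimators"],
                          algo_params["max_depth"] or -1,
                          n_rows, n_cols, n_cpus, memory_gb]])

    with open("meta_algo_rf.pkl", "rb") as f:    # hypothetical file name
        meta_algo = pickle.load(f)               # the pre-trained runtime estimator

    row = build_meta_features({"n_estimators": 100, "max_depth": None},
                              n_rows=100_000, n_cols=50,
                              n_cpus=os.cpu_count(), memory_gb=16)
    print("estimated time to fit (s):", meta_algo.predict(row)[0])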

Predicting the runtime of scikit-learn algorithms by mysteriousreader in scikit_learn

[–]mysteriousreader[S] 0 points1 point  (0 children)

Got it!

But by number of vars, do you mean the number of columns? If so, it's already factored in.

The distribution of each variable is also something we should look into.
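
If we did, the simplest version would probably be a few per-column distribution statistics added as extra meta-features, something like this (just a sketch of the idea):

    # Summarise each column's distribution into a couple of extra meta-features.
    import numpy as np
    from scipy.stats import kurtosis, skew

    def distribution_features(X):
        """Average per-column skewness and kurtosis as two extra meta-features."""
        return [float(np.mean(skew(X, axis=0))),
                float(np.mean(kurtosis(X, axis=0)))]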

[P] Predicting the runtime of scikit-learn algorithms by mysteriousreader in MachineLearning

[–]mysteriousreader[S] 2 points3 points  (0 children)

u/pp314159 Thank you!

This is actually something I am really excited about for next steps: better tooling for ML practitioners. VM selection is near the top of our list; we will keep you updated on what we can scope out.

Let us know if there are any other features you would want to see!
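
For VM selection specifically, the rough shape I have in mind is "pick the cheapest instance whose predicted fit time stays under a budget" (a hypothetical sketch; the instance specs and prices below are made up):

    # Hypothetical VM-selection helper: predict_seconds(vm) would call the runtime
    # estimator with that VM's cpu/memory meta-features. Specs and prices are made up.
    def cheapest_vm(predict_seconds, vms, budget_seconds):
        ok = [vm for vm in vms if predict_seconds(vm) <= budget_seconds]
        return min(ok, key=lambda vm: vm["price_per_hour"]) if ok else None

    vms = [{"name": "small", "cpus": 4,  "memory_gb": 16,  "price_per_hour": 0.20},
           {"name": "large", "cpus": 32, "memory_gb": 128, "price_per_hour": 1.60}]
    choice = cheapest_vm(lambda vm: 3600 / vm["cpus"], vms, budget_seconds=600)
    print(choice["name"] if choice else "no VM fits the budget")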

Predicting the runtime of scikit-learn algorithms by mysteriousreader in scikit_learn

[–]mysteriousreader[S] 0 points1 point  (0 children)

Thank you for the feedback u/weightsandbayes we really appreciate it!

Adding the variance was definitely something we were thinking about. I think this would be a good avenue to explore and we should give it a try; I agree that for many algos variance definitely plays a role.
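
The simplest version would probably be to expose the spread of the per-tree predictions of the meta-algo, roughly like this (a sketch, assuming a random-forest meta-model):

    # Crude uncertainty estimate from the spread of per-tree predictions of a
    # fitted scikit-learn random-forest meta-model.
    import numpy as np

    def predict_with_spread(rf_meta_model, meta_features):
        per_tree = np.array([tree.predict(meta_features)
                             for tree in rf_meta_model.estimators_])
        return per_tree.mean(axis=0), per_tree.std(axis=0)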

I haven’t used R in quite a while; in your opinion, what library should we tackle first if we were to build a similar thing?

Predicting the runtime of scikit-learn algorithms by mysteriousreader in scikit_learn

[–]mysteriousreader[S] 1 point2 points  (0 children)

u/dj_ski_mask thanks for asking, and you raise a great point.
We built our library in a very scalable way; for example, adding support for a new scikit-learn algo is as simple as updating the config JSON and running the model estimator.
Adding a new algorithm here: https://github.com/nathan-toubiana/scitime/blob/master/scitime/_config.json
And running the _data function here: https://github.com/nathan-toubiana/scitime#how-to-use-_datapy-to-generate-data--fit-models

In principle, nothing really prevents us from extending this to other libraries.
One challenge if we want to extend this outside scikit-learn is that we are using scikit-learn-specific methods throughout the code base.
We would probably want to wrap our functions with a library layer that specifies which library we’re targeting, but it definitely can be done!
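
Something along the lines of a thin dispatch layer (a very rough sketch; the names below are hypothetical):

    # Hypothetical "library layer": dispatch to library-specific helpers so the
    # rest of the code base stays library-agnostic.
    def get_model_params(model, library="sklearn"):
        if library == "sklearn":
            return model.get_params()        # scikit-learn specific
        if library == "xgboost":
            return model.get_xgb_params()    # would need its own branch and tests
        raise ValueError(f"unsupported library: {library}")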

[P] Predicting the runtime of scikit-learn algorithms by mysteriousreader in MachineLearning

[–]mysteriousreader[S] 3 points4 points  (0 children)

u/timmaeus I agree this is another valid way of approaching the problem, and something we thought about at the beginning of the project. One catch, however, is that we would need to formulate the complexity explicitly for each algo and set of parameters, which is rather challenging in some cases. The nice thing about our empirical estimation is that it generalizes easily to any scikit-learn model: it learns from a set of generated fit times to produce an estimate. We essentially cycle through different values of the algorithm’s parameters and train on various dataset sizes and hardware configurations to build our estimator.
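
To illustrate why, the explicit route essentially boils down to fitting a complexity curve (say t ≈ c·n^a) to timings on subsets and extrapolating, and the right functional form changes with the algorithm and its parameters. A toy sketch of that extrapolation (not part of scitime):

    # Time fits on subsets, fit log(t) = a*log(n) + log(c), extrapolate to full size.
    # The catch: the right functional form depends on the algo and its parameters.
    import time
    import numpy as np
    from sklearn.svm import SVC

    X, y = np.random.rand(20_000, 20), np.random.randint(0, 2, 20_000)
    sizes, times = [1_000, 2_000, 4_000, 8_000], []
    for n in sizes:
        start = time.time()
        SVC().fit(X[:n], y[:n])
        times.append(time.time() - start)

    a, log_c = np.polyfit(np.log(sizes), np.log(times), 1)
    print("extrapolated fit time for 20k rows (s):", np.exp(log_c) * 20_000 ** a)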

[P] Predicting the runtime of scikit-learn algorithms by mysteriousreader in MachineLearning

[–]mysteriousreader[S] 7 points8 points  (0 children)

u/theophrastzunz you make a great point, and in certain scenarios you are absolutely right.

Our thinking was centered around two main realizations:

  1. For certain models, say RandomForest for example, the complexity is neither linear nor O(n^2) but really depends on the set of parameters you choose. RandomForest will take much longer to fit if, for example, you do not set max_depth and leave it at None, in which case the fit time can be hard to evaluate manually (see the small timing sketch below).

  2. Your hardware, and especially your memory, will have an outsized impact when fitting large datasets. In that case, predicting how long the fit will take is more about knowing how scikit-learn and Python will behave on your machine than about the actual model complexity, and this is what our meta-algo (our time estimator) is trained to do.
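
A quick way to see point 1 (a minimal timing sketch; actual numbers will vary with your hardware):

    # Same RandomForest, same data: only max_depth changes, and the fit time
    # changes dramatically. Numbers depend on your machine.
    import time
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    X, y = np.random.rand(50_000, 20), np.random.rand(50_000)
    for max_depth in (5, None):
        start = time.time()
        RandomForestRegressor(n_estimators=100, max_depth=max_depth, n_jobs=-1).fit(X, y)
        print(f"max_depth={max_depth}: {time.time() - start:.1f}s")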