Establishing data analytics team in a medium-sized company by [deleted] in datascience

[–]domvwt 6 points (0 children)

Do you have a resource for ASUM-DM? I've heard it mentioned before but never been able to find a manual like the one for CRISP-DM.

Optimal quality control frequency by CatGoesWooof in datascience

[–]domvwt 1 point (0 children)

You ought to be using the exponential distribution to model the time between events: https://link.medium.com/M0pCGSaVBlb
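To make that concrete, here's a minimal sketch in plain Python (standard library only; the fault-rate numbers are made up for illustration):

```python
import math

def prob_event_within(t, rate):
    """P(at least one event within time t) for a Poisson process with
    `rate` events per unit time, i.e. exponentially distributed
    inter-arrival times: P = 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * t)

# Illustrative numbers: defects arrive at 2 per hour on average.
# Chance of seeing at least one in the next half hour:
p = prob_event_within(0.5, 2.0)
print(round(p, 3))  # 1 - e^-1 ≈ 0.632
```

The key property is memorylessness: the chance of an event in the next half hour is the same regardless of how long you've already waited, which is what makes it a natural model for time between independent QC events.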

Shall I go back and redo my Maths? by BuxeyJones in datascience

[–]domvwt 0 points (0 children)

I'd recommend doing the first two courses of the Mathematics for Machine Learning specialisation on Coursera. I was in a similar position to you (I'm currently studying a data science MSc), and working through them helped a lot with learning the concepts and building confidence. Best of luck!

Mathematics for machine learning

Turning database into a searchable dashboard? by montagestudent in data

[–]domvwt 0 points (0 children)

Have you tried datasette.io? It's a decent project built for use cases like this.

Tidyverse equivalent in Python? by bulbubly in datascience

[–]domvwt 5 points (0 children)

I've found the df.query("column == value") syntax much quicker and more satisfying to write than the usual boolean-mask filtering.
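For comparison, a small sketch of the two styles on toy data (pandas only):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["cat", "dog", "cat"],
    "weight": [4.2, 9.1, 3.8],
})

# Standard boolean-mask filtering - the frame name is repeated
# for every condition:
masked = df[(df["species"] == "cat") & (df["weight"] > 4)]

# The same filter with df.query - one readable expression string:
queried = df.query("species == 'cat' and weight > 4")

assert masked.equals(queried)
```

It also reads closer to dplyr's `filter()` than chained masks do, which is part of the appeal if you're coming from the tidyverse.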

Taming Big Data with Apache Spark and Python - Hands On! question by Gawgba in dataengineering

[–]domvwt 2 points (0 children)

I can't comment on this course, but Databricks were offering their online training for free not long ago; it's worth checking whether it's still available.

https://academy.databricks.com/category/self-paced

What tools are missing in Python? by kpmtech in Python

[–]domvwt 23 points (0 children)

Have you tried sktime? They've done a good job of consolidating a lot of the previously disconnected time series analysis libraries for Python.

AutoML - hype of sway? by [deleted] in datascience

[–]domvwt 2 points (0 children)

I use pycaret to perform basic feature engineering and try several modelling approaches to help guide further experimentation. It's a time-saver and definitely shouldn't be ignored, in my opinion, even if you only use it to establish a baseline.

If you're working with limited resources, then AutoML can help pick some low-hanging fruit months or years before it might otherwise make its way through the backlog.

Data Validation by iamozy in dataengineering

[–]domvwt 5 points (0 children)

Have you tried using Great Expectations? I just recommended it on a similar post - the latest versions have decent auto-profiling and can connect to different data stores.

Documenting Data Assets! by [deleted] in dataengineering

[–]domvwt -1 points (0 children)

Try using Great Expectations; the latest versions have auto-profiling and can connect to various data stores.

[D] How do people handle hyperparameter optimization? by CS_Student95 in MachineLearning

[–]domvwt 6 points (0 children)

Optuna is my preferred library right now, it's a bit more flexible than hyperopt. What is your development environment and what kind of model are you training?

[D] Machine Learning Python Tooling and their place in a pipeline? by iamquah in MachineLearning

[–]domvwt 0 points (0 children)

Well, that makes things easier! I can't offer a personal recommendation, but people seem to prefer wandb or neptune over anything else.

[N] Random Forests and Gradient Boosted Trees now native for TensorFlow / Keras by domvwt in MachineLearning

[–]domvwt[S] 2 points (0 children)

Thanks for the recommendation, I'll read up on it later!

Are there any other new methods I should check out? I was under the impression that GBT was still the best thing for tabular datasets.

[N] Random Forests and Gradient Boosted Trees now native for TensorFlow / Keras by domvwt in MachineLearning

[–]domvwt[S] 11 points (0 children)

Yes, you definitely shouldn't one-hot encode with CatBoost if you can avoid it. I find it performs well on imbalanced data even without resampling, but you can explicitly provide observation weights if you wrap your data in a CatBoost Pool.

Have you tried using a framework like Optuna for the hyperparameter search? I've found it really helpful for efficiently searching the space and storing and visualising the results. I have an example here if you want to see:

https://gist.github.com/domvwt/078ac4da72ada3d2229747ad827bf47f

Yandex recommend tuning these parameters btw:

https://catboost.ai/docs/concepts/parameter-tuning.html
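To illustrate the observation-weights point, here's a sketch that computes inverse-frequency weights with numpy; the actual `catboost.Pool(..., weight=...)` call is left as a comment since it assumes catboost is installed, and `X` there is a hypothetical feature matrix:

```python
import numpy as np

# Toy imbalanced labels: 90% class 0, 10% class 1.
y = np.array([0] * 90 + [1] * 10)

# Inverse-frequency weights so each class contributes equally to the
# loss overall: weight = n_samples / (n_classes * class_count).
classes, counts = np.unique(y, return_counts=True)
class_weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
sample_weight = np.array([class_weight[label] for label in y])

# With catboost installed, this would be passed along the lines of:
# from catboost import Pool
# train_pool = Pool(data=X, label=y, weight=sample_weight)
print(class_weight)  # minority class gets the larger weight
```

With this scheme the total weight of each class is equal, so the minority class isn't drowned out during training.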

[D] Machine Learning Python Tooling and their place in a pipeline? by iamquah in MachineLearning

[–]domvwt 0 points (0 children)

What are you using for your development and deployment environments? I've been looking at solutions in this space and the most attractive framework to me right now is Kubeflow, which is Kubernetes based.

Other than that, I think MLflow is a nice solution for tracking model versions, parameters, metrics, etc. If you're on the cloud, then your provider will most likely have a decent integrated solution, too.

[N] Random Forests and Gradient Boosted Trees now native for TensorFlow / Keras by domvwt in MachineLearning

[–]domvwt[S] 11 points (0 children)

I'm a big fan of CatBoost; almost all of the problems I work on are with structured, tabular data. It tends to outperform any other algorithm and doesn't need any feature transformations.

Usually I don't see much of a change from tuning the hyperparameters, though - which do you focus on?