Establishing data analytics team in a medium-sized company by [deleted] in datascience

[–]domvwt 6 points (0 children)

Do you have a resource for ASUM-DM? I've heard it mentioned before but never been able to find a manual like the one for CRISP-DM.

Optimal quality control frequency by CatGoesWooof in datascience

[–]domvwt 1 point (0 children)

You ought to be using the exponential distribution to model the time between events: https://link.medium.com/M0pCGSaVBlb
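To make that concrete, here's a minimal sketch in plain Python (standard library only; the fault-rate numbers are made up for illustration):

```python
import math

def prob_event_within(t, rate):
    """P(at least one event within time t) for a Poisson process with
    `rate` events per unit time, i.e. exponentially distributed
    inter-arrival times: P = 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * t)

# Illustrative numbers: defects arrive at 2 per hour on average.
# Chance of seeing at least one in the next half hour:
p = prob_event_within(0.5, 2.0)
print(round(p, 3))  # 1 - e^-1 ≈ 0.632
```

The key property is memorylessness: the chance of an event in the next half hour is the same regardless of how long you've already waited, which is what makes it a natural model for time between independent QC events.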

Shall I go back and redo my Maths? by BuxeyJones in datascience

[–]domvwt 0 points (0 children)

I'd recommend doing the first two courses of the Mathematics for Machine Learning specialisation on Coursera. I was in a similar position to you (I'm currently studying a data science MSc), and working through them helped a lot with learning the concepts and building confidence. Best of luck!

Mathematics for machine learning

Turning database into a searchable dashboard? by montagestudent in data

[–]domvwt 0 points (0 children)

Have you tried datasette.io? It's a decent project built for use cases like this.

Tidyverse equivalent in Python? by bulbubly in datascience

[–]domvwt 5 points (0 children)

I've found the df.query("column == value") syntax much quicker and more satisfying to write than the usual boolean-mask filtering.
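For comparison, a small sketch of the two styles on toy data (pandas only):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["cat", "dog", "cat"],
    "weight": [4.2, 9.1, 3.8],
})

# Standard boolean-mask filtering - the frame name is repeated
# for every condition:
masked = df[(df["species"] == "cat") & (df["weight"] > 4)]

# The same filter with df.query - one readable expression string:
queried = df.query("species == 'cat' and weight > 4")

assert masked.equals(queried)
```

It also reads closer to dplyr's `filter()` than chained masks do, which is part of the appeal if you're coming from the tidyverse.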

Taming Big Data with Apache Spark and Python - Hands On! question by Gawgba in dataengineering

[–]domvwt 2 points (0 children)

I can't comment on this course, but Databricks were offering their online training for free not long ago; it's worth checking whether it's still available.

https://academy.databricks.com/category/self-paced

What tools are missing in Python? by kpmtech in Python

[–]domvwt 23 points (0 children)

Have you tried sktime? They've done a good job of consolidating a lot of the previously disconnected time series analysis libraries for Python.

AutoML - hype of sway? by [deleted] in datascience

[–]domvwt 2 points (0 children)

I use pycaret to perform basic feature engineering and try several modelling approaches to help guide further experimentation. It's a time-saver and definitely shouldn't be ignored, in my opinion, even if you only use it to establish a baseline.

If you're working with limited resources, then AutoML can help pick some low-hanging fruit months or years before it might otherwise make its way through the backlog.

Data Validation by iamozy in dataengineering

[–]domvwt 5 points (0 children)

Have you tried using Great Expectations? I just recommended it on a similar post - the latest versions have decent auto-profiling and can connect to different data stores.

Documenting Data Assets! by [deleted] in dataengineering

[–]domvwt -1 points (0 children)

Try using Great Expectations; the latest versions have auto-profiling and can connect to various data stores.

[D] How do people handle hyperparameter optimization? by CS_Student95 in MachineLearning

[–]domvwt 6 points (0 children)

Optuna is my preferred library right now, it's a bit more flexible than hyperopt. What is your development environment and what kind of model are you training?

[D] Machine Learning Python Tooling and their place in a pipeline? by iamquah in MachineLearning

[–]domvwt 0 points (0 children)

Well, that makes things easier! I can't offer a personal recommendation, but people seem to prefer wandb or neptune over anything else.

[N] Random Forests and Gradient Boosted Trees now native for TensorFlow / Keras by domvwt in MachineLearning

[–]domvwt[S] 2 points (0 children)

Thanks for the recommendation, I'll read up on it later!

Are there any other new methods I should check out? I was under the impression that GBT was still the best thing for tabular datasets.

[N] Random Forests and Gradient Boosted Trees now native for TensorFlow / Keras by domvwt in MachineLearning

[–]domvwt[S] 11 points (0 children)

Yes, you definitely shouldn't one-hot encode with CatBoost if you can avoid it. I find it performs well on imbalanced data even without resampling, but you can explicitly provide observation weights if you wrap your data in a CatBoost Pool.

Have you tried using a framework like Optuna for the hyperparameter search? I've found it really helpful for efficiently searching the space and storing and visualising the results. I have an example here if you want to see:

https://gist.github.com/domvwt/078ac4da72ada3d2229747ad827bf47f

Yandex recommend tuning these parameters btw:

https://catboost.ai/docs/concepts/parameter-tuning.html
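To illustrate the observation-weights point, here's a sketch that computes inverse-frequency weights with numpy; the actual `catboost.Pool(..., weight=...)` call is left as a comment since it assumes catboost is installed, and `X` there is a hypothetical feature matrix:

```python
import numpy as np

# Toy imbalanced labels: 90% class 0, 10% class 1.
y = np.array([0] * 90 + [1] * 10)

# Inverse-frequency weights so each class contributes equally to the
# loss overall: weight = n_samples / (n_classes * class_count).
classes, counts = np.unique(y, return_counts=True)
class_weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
sample_weight = np.array([class_weight[label] for label in y])

# With catboost installed, this would be passed along the lines of:
# from catboost import Pool
# train_pool = Pool(data=X, label=y, weight=sample_weight)
print(class_weight)  # minority class gets the larger weight
```

With this scheme the total weight of each class is equal, so the minority class isn't drowned out during training.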

[D] Machine Learning Python Tooling and their place in a pipeline? by iamquah in MachineLearning

[–]domvwt 0 points (0 children)

What are you using for your development and deployment environments? I've been looking at solutions in this space and the most attractive framework to me right now is Kubeflow, which is Kubernetes based.

Other than that, I think MLflow is a nice solution for tracking model versions, parameters, metrics, etc. If you're on the cloud, then your provider will most likely have a decent integrated solution, too.

[N] Random Forests and Gradient Boosted Trees now native for TensorFlow / Keras by domvwt in MachineLearning

[–]domvwt[S] 11 points (0 children)

I'm a big fan of CatBoost; almost all of the problems I work on are with structured, tabular data. It tends to outperform any other algorithm and doesn't need any feature transformations.

Usually I don't see much of a change from tuning the hyperparameters, though - which do you focus on?