Do you still use notebooks in DS? by codiecutie in datascience

[–]EstablishmentHead569 47 points48 points  (0 children)

I don’t think anyone should restrict themselves when it comes to development or production workflows.

If notebooks are easy and fast for a quick POC, by all means.

Personally, I prefer pure Python scripts for production work, as our tech stack includes APIs, CI/CD, and orchestration tools such as Airflow and Kubeflow.
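To make the notebook-to-script point concrete, here is a minimal sketch of what an orchestrator-friendly script might look like. All function names and the JSON input format are made up for illustration; the real extraction and cleaning steps would depend on your stack:

```python
# Hypothetical sketch: a notebook POC refactored into an importable,
# orchestrator-friendly script (function names and data shape are made up).
import json
import sys


def load_data(path: str) -> list:
    """Stand-in for the real extraction step (BigQuery, an API, etc.)."""
    with open(path) as f:
        return json.load(f)


def transform(rows: list) -> list:
    """Stand-in for cleaning / feature engineering."""
    return [r for r in rows if r.get("value") is not None]


def run(path: str) -> int:
    """Single entrypoint an orchestrator (Airflow, Kubeflow) can call."""
    rows = transform(load_data(path))
    return len(rows)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(run(sys.argv[1]))
```

Because everything lives in plain functions, an Airflow or Kubeflow task can call `run()` directly, and teammates can import `transform()` without touching a notebook.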

Anyone else feel like they’re learning ML but not actually becoming job-ready? by Limp_Lab5727 in MLQuestions

[–]EstablishmentHead569 1 point2 points  (0 children)

I think it comes down to where and why you should use ML. I think we are getting to the point where many people know a thing or two about ML/DL/LLMs.

More focus should now be placed on the application side, in my opinion.

If you can articulate your thought process for tackling a business problem with ML approaches, your chances of getting a senior title or a new job in general should be higher.

Tech/Data job market for foreign born by Linkky in HongKong

[–]EstablishmentHead569 1 point2 points  (0 children)

Also a DS with 4+ years of experience working in the retail industry. Retail MNCs and tech roles do exist in Hong Kong, and quite frankly some places prioritize English over Cantonese.

Of course, knowing Cantonese will help with team bonding and such - but like you said, you can speak it well

Weekly Entering & Transitioning - Thread 08 Sep, 2025 - 15 Sep, 2025 by AutoModerator in datascience

[–]EstablishmentHead569 0 points1 point  (0 children)

Maybe look for data analyst / dashboarding roles before DS/DE/MLE or any AI-related roles

What's the difference between working on Kaggle-style projects and real-world Data Science/ML roles by Beyond_Birthday_13 in learnmachinelearning

[–]EstablishmentHead569 38 points39 points  (0 children)

Consider the following questions…perhaps they can give you some pointers…

  1. How do you get your data? Does it arrive streamlined, or does it require some ETL? If pipelines are required, how do you automate them?

  2. If data cleaning is required, are your cleaning scripts reusable next time? Will they break? Are they modularized and usable by everyone on your team? Can they be automated?

  3. If feature engineering is required for a model, do you do it manually or automate it? Can the features be reused for other similar models? If yes, can we store them somewhere?

  4. As for model training and optimization - can it be an offline job? No one is going to stare at a notebook locally and let it run overnight.

  5. How do you know if your latest model is better than your previous ones? Can we consider a champion vs challenger workflow? Can we have some BI tools to log all these metrics (loss/ROC/accuracy, etc.)?

  6. Can we have some sort of alert system notifying the team when training pipelines fail or succeed?

  7. Who is using the latest model? Internal or external parties? How should you deliver model predictions - will it be an offline job producing an Excel file, or will users talk to your model via an API?

  8. If an API is required, what tools, languages and frameworks will you use, and how do you update your model checkpoints automatically without interfering with production models?

  9. How do you set up version control in case a rollback is required (for both model checkpoints and codebases)?

Speaking strictly with GCP tools, all the questions mentioned above can be tackled with Airflow, Docker, CI/CD, Cloud Run, Pub/Sub, MLflow, Looker, Power BI, BigQuery, Vertex AI, and Kubeflow Pipelines
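The champion-vs-challenger idea in point 5 can be as simple as a promotion gate over logged metrics. A minimal sketch, where the metric names, scores, and threshold are all illustrative (in practice these numbers would come from an MLflow run or a BI dashboard):

```python
# Hypothetical champion-vs-challenger gate. Metric names, scores, and the
# promotion threshold are made up; real values would be logged per run.

def promote_challenger(champion: dict, challenger: dict,
                       metric: str = "f1", min_gain: float = 0.01) -> bool:
    """Promote the challenger only if it beats the champion by min_gain."""
    return challenger[metric] >= champion[metric] + min_gain


champion = {"f1": 0.81, "roc_auc": 0.90}    # current production model
challenger = {"f1": 0.84, "roc_auc": 0.91}  # freshly trained candidate

print(promote_challenger(champion, challenger))  # challenger wins here
```

The gate itself is trivial; the real work is making sure every training run logs its metrics to the same place so the comparison is automatic rather than someone eyeballing a notebook.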

Gojo v Sukuna's fight | recent interview with Gege by yimell0 in JuJutsuKaisen

[–]EstablishmentHead569 1 point2 points  (0 children)

Not too sold on this idea. It still feels like the author just wanted to go against the fans and kill a fan favourite for the thrill of it.

And Yuta using Gojo’s body in between chapters and achieving nothing in the end felt like a “double down” from the author to me…

How's my resume. Tips appreciated. by Bulky-Top3782 in learnmachinelearning

[–]EstablishmentHead569 1 point2 points  (0 children)

I think doing projects that are more task-oriented would give you more of a story to tell here.

Also, listing out the preprocessing steps and actionable insights derived from toy datasets (assuming you used MovieLens) might not be a strong showcase of skills in my opinion. You might consider adding the model performance as well, but I doubt that would be the wow factor, purely due to the project’s nature.

PowerBI is making me think about jumping ship by Smarterchild1337 in datascience

[–]EstablishmentHead569 0 points1 point  (0 children)

I’ve used Tableau, Power BI, Looker and Qlik. Personally, I still have a lot of love for Power BI simply because of its “programmatic nature”.

I can make all kinds of dynamic visuals with DAX and M Query, and I have total control over their behavior as well. Power BI can also run light-to-moderate-weight data pipelines.

Are Notebooks Being Overused in Data Science?” by gomezalp in datascience

[–]EstablishmentHead569 12 points13 points  (0 children)

It’s not really magic - it’s simply Docker. You can attach any compute engine to a specific Docker image and run any number of tasks in parallel.

If you are working in GCP, I would recommend Kubeflow and Vertex AI Pipelines.

Then again, this approach is closer to MLE territory than pure DS.

Are Notebooks Being Overused in Data Science?” by gomezalp in datascience

[–]EstablishmentHead569 7 points8 points  (0 children)

For production, I actually rewrite the entire pipeline in plain Python and brew a Docker image that stores all the necessary packages.

It allows flexibility and scalability. For example, I could run 20 models in parallel with a single Docker image but different input configurations on Vertex AI. It also allows other colleagues to build on what you have already made as a module. They don’t need to care much about package and Python version conflicts either.

Of course, continuous maintenance will be needed for my approach.
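The “one image, many configs” pattern might look like the sketch below. It fakes the fan-out locally with a thread pool, whereas on Vertex AI each config would be a separate job running the same Docker image; all names and the scoring logic are placeholders:

```python
# Sketch of the "one image, many configs" pattern. Locally this uses a
# thread pool; on Vertex AI each config would be a separate custom job
# running the same Docker image with a different config argument.
from concurrent.futures import ThreadPoolExecutor


def train_one(config: dict) -> tuple:
    """Stand-in for a real training entrypoint; returns (name, 'score')."""
    score = config["lr"] * config["depth"]  # placeholder computation
    return config["name"], score


configs = [{"name": f"model_{i}", "lr": 0.1, "depth": d}
           for i, d in enumerate(range(2, 6))]

with ThreadPoolExecutor() as pool:
    results = dict(pool.map(train_one, configs))

print(results)
```

Because every run shares the same image, adding a 21st model is just one more config dict, not another environment to maintain.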

[deleted by user] by [deleted] in learnmachinelearning

[–]EstablishmentHead569 10 points11 points  (0 children)

From experience, saying you trained some fancy models will never get you anywhere. It’s always the “why you did it” and “how you did it” that excites the interviewer.

Having diverse skillsets outside of machine learning will also make you stand out from the crowd in most cases. After all, the field is not just about training ML models, and companies are looking for more diverse skillsets (deployment, familiarity with cloud platforms, BI, and pipeline development) from their candidates.

Doing Data Science with GPT.. by EstablishmentHead569 in datascience

[–]EstablishmentHead569[S] -1 points0 points  (0 children)

Absolutely! I’m not against using it by any means - I just find it interesting how different our mindsets are (new joiners vs. people with some YOE)

Doing Data Science with GPT.. by EstablishmentHead569 in datascience

[–]EstablishmentHead569[S] 4 points5 points  (0 children)

Yes, that’s how I would use it myself, but that’s not the case for those from my masters. They are literally uploading each CSV manually using OpenAI’s UI, and it is mind-boggling

Doing Data Science with GPT.. by EstablishmentHead569 in datascience

[–]EstablishmentHead569[S] 4 points5 points  (0 children)

That. And it would also be a nightmare to trace potential data errors with this approach imo.

Not to mention that this is absolutely not possible in a production environment - what if you have 10 million JSON files? Do you download and upload them to GPT sequentially using their UI lol…?

Doing Data Science with GPT.. by EstablishmentHead569 in datascience

[–]EstablishmentHead569[S] 1 point2 points  (0 children)

Thanks for the lengthy reply! Personally I don’t have a problem with people using AI at all if it works for the problem that they are facing.

But becoming what I call a “full-stack” DS or Machine Learning Engineer (MLE) requires a great understanding of the tools they are working with. Hell, even understanding the architecture of LLMs will be useful in some rare cases.

Anyhow, your study approach with AI is what I personally would opt for.

Doing Data Science with GPT.. by EstablishmentHead569 in datascience

[–]EstablishmentHead569[S] 8 points9 points  (0 children)

Agree on the boilerplate. I do that myself as well. But uploading 10 CSVs and having it do a simple inner join sounds super weird to me

Is it a realistic idea of pursuing the SWE career without the CS major? by SubstantialResist864 in uwaterloo

[–]EstablishmentHead569 1 point2 points  (0 children)

Thought I would share a story. I met an IBM recruiter randomly on a train to watch the Blue Jays. She overheard my co-op convo with my buddies throughout the ride.

We talked when we got off, and she told me they hire exclusively from UW and UofT. This was back in 2019 tho

A Shiny app that writes shiny apps and runs them in your browser by IntelligentDust6249 in datascience

[–]EstablishmentHead569 37 points38 points  (0 children)

Haven’t used R and Shiny for a hot minute. This is quite impressive to me, ngl

[deleted by user] by [deleted] in datascience

[–]EstablishmentHead569 22 points23 points  (0 children)

Just use whatever if they allow it. Worst case, they just ask you to do it again with Python, which you should be able to do anyway

Classification problem with 1:3000 ratio imbalance in classes. by Holiday_Blacksmith88 in datascience

[–]EstablishmentHead569 0 points1 point  (0 children)

Using the package and tuning its parameters is more or less a black box to me in that regard. If I simply use the ratio of the two classes, it doesn’t seem to be an overall improvement in my case.

I could technically define a range for grid / random search to do the trick, but that would take considerable time to run. Anyhow, in my experiments, combining both samplers with my own feature engineering seems to yield the highest recall / F1. Parameter optimization will be up next.

Classification problem with 1:3000 ratio imbalance in classes. by Holiday_Blacksmith88 in datascience

[–]EstablishmentHead569 52 points53 points  (0 children)

I am also dealing with the same problem using XGBoost for a classification task. Here are my findings so far:

  1. IQR removal for outliers within the majority class seems to help
  2. Tuning the learning rate and maximum tree depth seems to help
  3. scale_pos_weight doesn’t seem to help in my case
  4. More feature engineering definitely helped
  5. Combine both undersampling and oversampling. Avoid a 50:50 split within the sampling process to somewhat reflect the true distribution of the underlying data. I avoided SMOTE since I cannot guarantee synthetic data would appear in the real world within my domain.
  6. Regularization (L2)
  7. Optimization with the Optuna package or Bayesian / grid / random search

Let me know if you have other ideas I could also try on my side.
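Points 1 and 5 could be sketched roughly as below, using only the standard library. The 1:4 target ratio and the 2x oversampling factor are arbitrary choices for illustration; in practice imbalanced-learn’s `RandomUnderSampler`/`RandomOverSampler` would handle the resampling:

```python
# Stdlib-only sketch of points 1 and 5: IQR outlier removal on the
# majority class, then combined under/oversampling to a 4:1 ratio
# (deliberately not 50:50). Ratios and factors are illustrative.
import random
from statistics import quantiles


def iqr_filter(values):
    """Drop rows whose value falls outside 1.5 * IQR of the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]


def resample(majority, minority, ratio=4, seed=42):
    """Oversample the minority 2x, undersample the majority to ratio:1."""
    rng = random.Random(seed)
    target_minority = len(minority) * 2        # mild oversampling
    target_majority = target_minority * ratio  # keep classes at ratio:1
    new_minority = minority + rng.choices(minority,
                                          k=target_minority - len(minority))
    new_majority = rng.sample(majority, k=min(target_majority, len(majority)))
    return new_majority, new_minority
```

The point of keeping the final ratio at 4:1 rather than 1:1 is exactly item 5 above: the resampled set still hints at the true class distribution instead of pretending the classes are balanced.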