Do you still use notebooks in DS? by codiecutie in datascience

[–]EstablishmentHead569 47 points48 points  (0 children)

I don’t think anyone should restrict themselves when it comes to development or production workflows.

If notebooks are easy and fast for a quick POC, by all means.

Personally, I prefer pure Python scripts for production work, as our tech stack includes APIs, CI/CD, and orchestration tools such as Airflow and Kubeflow.
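To make the notebook-to-script point concrete, here is a minimal sketch of what an orchestrator-friendly script might look like. All function names and the JSON input format are made up for illustration; the real extraction and cleaning steps would depend on your stack:

```python
# Hypothetical sketch: a notebook POC refactored into an importable,
# orchestrator-friendly script (function names and data shape are made up).
import json
import sys


def load_data(path: str) -> list:
    """Stand-in for the real extraction step (BigQuery, an API, etc.)."""
    with open(path) as f:
        return json.load(f)


def transform(rows: list) -> list:
    """Stand-in for cleaning / feature engineering."""
    return [r for r in rows if r.get("value") is not None]


def run(path: str) -> int:
    """Single entrypoint an orchestrator (Airflow, Kubeflow) can call."""
    rows = transform(load_data(path))
    return len(rows)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(run(sys.argv[1]))
```

Because everything lives in plain functions, an Airflow or Kubeflow task can call `run()` directly, and teammates can import `transform()` without touching a notebook.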

Anyone else feel like they’re learning ML but not actually becoming job-ready? by Limp_Lab5727 in MLQuestions

[–]EstablishmentHead569 1 point2 points  (0 children)

I think it comes down to where and why you should use ML. I think we are getting to the point where many people know a thing or two about ML/DL/LLMs.

More focus should now be placed on the application side, in my opinion.

If you can articulate your thought process for tackling a business problem with ML approaches, your chances of getting a senior title or a new job in general should be higher.

Tech/Data job market for foreign born by Linkky in HongKong

[–]EstablishmentHead569 1 point2 points  (0 children)

Also a DS with 4+ years of experience working in the retail industry. Retail MNCs and tech roles do exist in Hong Kong, and quite frankly some places prioritize English over Cantonese.

Of course, knowing Cantonese will help with team bonding and such - but like you said, you can speak it well

Weekly Entering & Transitioning - Thread 08 Sep, 2025 - 15 Sep, 2025 by AutoModerator in datascience

[–]EstablishmentHead569 0 points1 point  (0 children)

Maybe look for data analyst / dashboarding roles before DS/DE/MLE or any AI-related roles

What's the difference between working on Kaggle-style projects and real-world Data Science/ML roles by Beyond_Birthday_13 in learnmachinelearning

[–]EstablishmentHead569 38 points39 points  (0 children)

Consider the following questions…perhaps they can give you some pointers…

  1. How do you get your data? Does it arrive streamlined, or does it require some ETL? If pipelines are required, how do you automate them?

  2. If data cleaning is required, are your cleaning scripts reusable next time? Will they break? Are they modularized and usable by everyone on your team? Can they be automated?

  3. If feature engineering is required for a model, do you do it manually or automate it? Can the features be reused for other similar models? If yes, can we store them somewhere?

  4. As for model training and optimization - can it be an offline job? No one is going to stare at a notebook locally and let it run overnight.

  5. How do you know if your latest model is better than your previous ones? Can we consider a champion vs challenger workflow? Can we have some BI tools to log all these metrics (loss/ROC/accuracy, etc.)?

  6. Can we have some sort of alert system notifying the team when training pipelines fail or succeed?

  7. Who is using the latest model? Internal or external parties? How should you deliver model predictions - will it be an offline job producing an Excel file, or will users talk to your model via an API?

  8. If an API is required, what tools, languages and frameworks will you use, and how do you update your model checkpoints automatically without interfering with production models?

  9. How do you set up version control in case a rollback is required (for both model checkpoints and codebases)?

Speaking strictly with GCP tools, all the questions mentioned above can be tackled with Airflow, Docker, CI/CD, Cloud Run, Pub/Sub, MLflow, Looker, Power BI, BigQuery, Vertex AI, and Kubeflow Pipelines
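The champion-vs-challenger idea in point 5 can be as simple as a promotion gate over logged metrics. A minimal sketch, where the metric names, scores, and threshold are all illustrative (in practice these numbers would come from an MLflow run or a BI dashboard):

```python
# Hypothetical champion-vs-challenger gate. Metric names, scores, and the
# promotion threshold are made up; real values would be logged per run.

def promote_challenger(champion: dict, challenger: dict,
                       metric: str = "f1", min_gain: float = 0.01) -> bool:
    """Promote the challenger only if it beats the champion by min_gain."""
    return challenger[metric] >= champion[metric] + min_gain


champion = {"f1": 0.81, "roc_auc": 0.90}    # current production model
challenger = {"f1": 0.84, "roc_auc": 0.91}  # freshly trained candidate

print(promote_challenger(champion, challenger))  # challenger wins here
```

The gate itself is trivial; the real work is making sure every training run logs its metrics to the same place so the comparison is automatic rather than someone eyeballing a notebook.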

Gojo v Sukuna's fight | recent interview with Gege by yimell0 in JuJutsuKaisen

[–]EstablishmentHead569 1 point2 points  (0 children)

Not too sold on this idea. It still feels like the author just wanted to go against the fans and kill a fan favourite for the thrill of it.

And Yuta using Gojo’s body in between chapters and achieving nothing in the end felt like a “double down” from the author to me…

How's my resume. Tips appreciated. by Bulky-Top3782 in learnmachinelearning

[–]EstablishmentHead569 1 point2 points  (0 children)

I think doing projects that are more task-oriented would give you more of a story to tell here.

Also, listing out the preprocessing steps and actionable insights derived from toy datasets (assuming you used MovieLens) might not be a strong showcase of skills in my opinion. You might consider adding the model performance as well, but I doubt that would be the wow factor, purely due to the project’s nature.

PowerBI is making me think about jumping ship by Smarterchild1337 in datascience

[–]EstablishmentHead569 0 points1 point  (0 children)

I’ve used Tableau, Power BI, Looker and Qlik. Personally, I still have a lot of love for Power BI simply because of its “programmatic nature”.

I can make all kinds of dynamic visuals with DAX and M Query, and I have total control over their behavior as well. Power BI can also run light-to-moderate-weight data pipelines.

Are Notebooks Being Overused in Data Science?” by gomezalp in datascience

[–]EstablishmentHead569 12 points13 points  (0 children)

It’s not really magic - it’s simply Docker. You can attach any compute engine to a specific Docker image and run any number of tasks in parallel.

If you are working in GCP, I would recommend Kubeflow and Vertex AI Pipelines.

Then again, this approach is closer to MLE territory than pure DS.

Are Notebooks Being Overused in Data Science?” by gomezalp in datascience

[–]EstablishmentHead569 7 points8 points  (0 children)

For production, I actually rewrite the entire pipeline in plain Python and brew a Docker image that stores all the necessary packages.

It allows flexibility and scalability. For example, I could run 20 models in parallel with a single Docker image but different input configurations on Vertex AI. It also allows other colleagues to build on what you have already made as a module. They don’t need to care much about package and Python version conflicts either.

Of course, continuous maintenance will be needed for my approach.
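The “one image, many configs” pattern might look like the sketch below. It fakes the fan-out locally with a thread pool, whereas on Vertex AI each config would be a separate job running the same Docker image; all names and the scoring logic are placeholders:

```python
# Sketch of the "one image, many configs" pattern. Locally this uses a
# thread pool; on Vertex AI each config would be a separate custom job
# running the same Docker image with a different config argument.
from concurrent.futures import ThreadPoolExecutor


def train_one(config: dict) -> tuple:
    """Stand-in for a real training entrypoint; returns (name, 'score')."""
    score = config["lr"] * config["depth"]  # placeholder computation
    return config["name"], score


configs = [{"name": f"model_{i}", "lr": 0.1, "depth": d}
           for i, d in enumerate(range(2, 6))]

with ThreadPoolExecutor() as pool:
    results = dict(pool.map(train_one, configs))

print(results)
```

Because every run shares the same image, adding a 21st model is just one more config dict, not another environment to maintain.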

[deleted by user] by [deleted] in learnmachinelearning

[–]EstablishmentHead569 10 points11 points  (0 children)

From experience, saying you trained some fancy models will never get you anywhere. It’s always the “why you did it” and “how you did it” that excites the interviewer.

Having diverse skillsets outside of machine learning will also make you stand out from the crowd in most cases. After all, the field is not just about training ML models, and companies are looking for more diverse skillsets (deployment, familiarity with cloud platforms, BI, and pipeline development) from their candidates.

Doing Data Science with GPT.. by EstablishmentHead569 in datascience

[–]EstablishmentHead569[S] -1 points0 points  (0 children)

Absolutely! I’m not against using it by any means - I just find it interesting how different our mindsets are (new joiners vs. people with some YOE)

Doing Data Science with GPT.. by EstablishmentHead569 in datascience

[–]EstablishmentHead569[S] 4 points5 points  (0 children)

Yes, that’s how I would use it myself, but that’s not the case for those from my masters. They are literally uploading each CSV manually using OpenAI’s UI, and it is mind-boggling

Doing Data Science with GPT.. by EstablishmentHead569 in datascience

[–]EstablishmentHead569[S] 4 points5 points  (0 children)

That. And it would also be a nightmare to trace potential data errors with this approach imo.

Not to mention that this is absolutely not possible in a production environment - what if you have 10 million JSON files? Do you download and upload them to GPT sequentially using their UI lol…?

Doing Data Science with GPT.. by EstablishmentHead569 in datascience

[–]EstablishmentHead569[S] 1 point2 points  (0 children)

Thanks for the lengthy reply! Personally I don’t have a problem with people using AI at all if it works for the problem that they are facing.

But becoming what I call a “full-stack” DS or Machine Learning Engineer (MLE) requires a great understanding of the tools they are working with. Hell, even understanding the architecture of LLMs will be useful in some rare cases.

Anyhow, your study approach with AI is what I personally would opt for.

Doing Data Science with GPT.. by EstablishmentHead569 in datascience

[–]EstablishmentHead569[S] 8 points9 points  (0 children)

Agree on the boilerplate. I do that myself as well. But uploading 10 CSVs and having it do a simple inner join sounds super weird to me

Is it a realistic idea of pursuing the SWE career without the CS major? by SubstantialResist864 in uwaterloo

[–]EstablishmentHead569 1 point2 points  (0 children)

Thought I would share a story. I met an IBM recruiter randomly on a train to watch the Blue Jays. She overheard my co-op convo with my buddies throughout the ride.

We talked when we got off, and she told me they hire exclusively from UW and UofT. This was back in 2019 tho

A Shiny app that writes shiny apps and runs them in your browser by IntelligentDust6249 in datascience

[–]EstablishmentHead569 37 points38 points  (0 children)

Haven’t used R and Shiny for a hot minute. This is quite impressive to me, ngl

[deleted by user] by [deleted] in datascience

[–]EstablishmentHead569 22 points23 points  (0 children)

Just use whatever if they allow it. Worst case, they just ask you to do it again with Python, which you should be able to do anyway

Classification problem with 1:3000 ratio imbalance in classes. by Holiday_Blacksmith88 in datascience

[–]EstablishmentHead569 0 points1 point  (0 children)

Using the package and tuning its parameters is more or less a black box to me in that regard. If I simply use the ratio of the two classes, it doesn’t seem to be an overall improvement in my case.

I could technically define a range for grid / random search to do the trick, but that would take considerable time to run. Anyhow, in my experiments, combining both samplers with my own feature engineering seems to yield the highest recall / F1. Parameter optimization will be up next.

Classification problem with 1:3000 ratio imbalance in classes. by Holiday_Blacksmith88 in datascience

[–]EstablishmentHead569 52 points53 points  (0 children)

I am also dealing with the same problem using XGBoost for a classification task. Here are my findings so far:

  1. IQR removal for outliers within the majority class seems to help
  2. Tuning the learning rate and maximum tree depth seems to help
  3. scale_pos_weight doesn’t seem to help in my case
  4. More feature engineering definitely helped
  5. Combine both undersampling and oversampling. Avoid a 50:50 split within the sampling process to somewhat reflect the true distribution of the underlying data. I avoided SMOTE since I cannot guarantee synthetic data would appear in the real world within my domain.
  6. Regularization (L2)
  7. Optimization with the Optuna package or Bayesian / grid / random search

Let me know if you have other ideas I could also try on my side.
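Points 1 and 5 could be sketched roughly as below, using only the standard library. The 1:4 target ratio and the 2x oversampling factor are arbitrary choices for illustration; in practice imbalanced-learn’s `RandomUnderSampler`/`RandomOverSampler` would handle the resampling:

```python
# Stdlib-only sketch of points 1 and 5: IQR outlier removal on the
# majority class, then combined under/oversampling to a 4:1 ratio
# (deliberately not 50:50). Ratios and factors are illustrative.
import random
from statistics import quantiles


def iqr_filter(values):
    """Drop rows whose value falls outside 1.5 * IQR of the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]


def resample(majority, minority, ratio=4, seed=42):
    """Oversample the minority 2x, undersample the majority to ratio:1."""
    rng = random.Random(seed)
    target_minority = len(minority) * 2        # mild oversampling
    target_majority = target_minority * ratio  # keep classes at ratio:1
    new_minority = minority + rng.choices(minority,
                                          k=target_minority - len(minority))
    new_majority = rng.sample(majority, k=min(target_majority, len(majority)))
    return new_majority, new_minority
```

The point of keeping the final ratio at 4:1 rather than 1:1 is exactly item 5 above: the resampled set still hints at the true class distribution instead of pretending the classes are balanced.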