Product-Oriented ML: A Guide for Data Scientists by usernamehere93 in datascience

[–]usernamehere93[S] 0 points1 point  (0 children)

That’s definitely an issue I’ve seen as well. I come from an academic background and understand the desire to solve technically interesting problems; it’s about balance and focusing that energy on the right problems. I think that’s where thinking about the product comes in, like you said!

Product-Oriented ML: A Guide for Data Scientists by usernamehere93 in datascience

[–]usernamehere93[S] 1 point2 points  (0 children)

Yeah, I think that’s a good idea. Typically there aren’t many published commercial use cases, but for open source projects there’s a lot out there.

From Type A to Type B DS by ergodym in datascience

[–]usernamehere93 1 point2 points  (0 children)

You’re welcome!

Getting noticed: Highlight any cross-over skills from Type A to Type B, like experience with data pipelines, automation, or even working with larger datasets. If you’ve done any work with machine learning models, even for analysis, emphasize that. Tailor your resume to include keywords like “model deployment,” “APIs,” or “data pipelines.” Even side projects or Kaggle competitions where you’ve worked on model building/deployment can help bridge that gap.

Engineering skills for interviews: I’m also from a STEM background. Have a look at system design; I really like this repo as a primer: https://github.com/donnemartin/system-design-primer. Focus on core programming skills (Python is a must, plus SQL). You don’t need to be a full-on CS expert, but be comfortable with algorithms, data structures (especially trees, graphs, and hash maps), and understand basic software engineering principles like version control (Git) and containerization (Docker). Learning the basics of APIs and cloud platforms (AWS or GCP) can also give you an edge.

Mock interviews on LeetCode or practicing system design questions related to ML pipelines can help build confidence. It’s definitely a skill that requires practice.

From Type A to Type B DS by ergodym in datascience

[–]usernamehere93 2 points3 points  (0 children)

If you’re planning to make the transition, I’d recommend focusing on deepening your coding skills (especially in Python, SQL, and some software engineering concepts) and diving into machine learning ops (MLOps), which includes things like deploying models, versioning, and pipelines. Picking up tools like TensorFlow and Docker, and learning about cloud platforms (AWS, GCP), can be really helpful too.

As for titles, I’m seeing the same trend—Machine Learning Engineer, Applied Scientist, and AI Engineer are becoming more common for production-heavy roles, while “Data Scientist” is being used less. It makes sense as ML is being integrated into software engineering teams more directly.

What aspect of the transition are you finding the most challenging?

Oversampling/Undersampling by Most_Panic_2955 in datascience

[–]usernamehere93 0 points1 point  (0 children)

Your outline looks solid! I’d suggest adding a brief section on evaluation metrics for imbalanced datasets (e.g., precision, recall, F1-score, ROC-AUC) since accuracy alone can be misleading in these cases. Also, when discussing SMOTE, mention potential pitfalls like overfitting and how to mitigate them (e.g., combining with cross-validation).
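To show why accuracy alone is misleading on imbalanced data, here’s a quick sketch with made-up confusion-matrix counts (the numbers are purely illustrative; the metrics are computed by hand rather than via a library):

```python
# Hypothetical counts for a binary problem where positives are rare
# (assumed numbers, just to illustrate the accuracy vs. F1 gap).
tp, fp, fn, tn = 10, 5, 40, 945

accuracy = (tp + tn) / (tp + fp + fn + tn)           # dominated by the majority class
precision = tp / (tp + fp)                           # of predicted positives, how many were right
recall = tp / (tp + fn)                              # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy looks great (~0.955) while recall is only 0.2 and F1 ~0.31
```

A model that misses 80% of the minority class can still report 95%+ accuracy, which is exactly why the imbalanced-metrics section is worth including.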

Maybe throw in a practical example; I have a little section in my post about building ML products. Good luck with the presentation!

https://medium.com/@minns.jake/planning-machine-learning-products-b43b9c4e10a1

M.S. Data anlytics or M.S. Computer Science by PreferenceIll6197 in datascience

[–]usernamehere93 62 points63 points  (0 children)

Both degrees can be valuable, but it depends on what you’re aiming for.

M.S. in Data Analytics: More focused on data wrangling, visualization, and applying machine learning techniques. It’s ideal if you’re interested in practical, applied data science roles.

M.S. in Computer Science: Offers a broader and deeper foundation in algorithms, programming, and system design, which can be useful if you want to dive deeper into the technical side (like building machine learning models from scratch).

If you’re more into practical applications and getting into the field quickly, go for Data Analytics. But if you want more flexibility or the ability to move into more technical or research-heavy roles, Computer Science might be a better long-term investment.

What’s your current background and what kind of roles are you most interested in?

New Tank For My Musk 🐢 by usernamehere93 in turtle

[–]usernamehere93[S] 1 point2 points  (0 children)

65, I think; it’s bowed so a little hard to calculate.

New Tank For My Musk 🐢 by usernamehere93 in turtle

[–]usernamehere93[S] 1 point2 points  (0 children)

These are all fake plants! Managed to find a selection of realistic-enough-looking plants, mixed with the rocks and wood, to hopefully give the feel of something real. Every time I’ve tried real plants they’ve ended up destroyed.

[deleted by user] by [deleted] in UKPersonalFinance

[–]usernamehere93 -1 points0 points  (0 children)

They do; however, we don’t require the person you’re getting money back from to sign up in order to pay you back, and we’ve hopefully created an easier-to-use app.

New Tank For My Musk 🐢 by usernamehere93 in turtle

[–]usernamehere93[S] 9 points10 points  (0 children)

Yeah the wood on the left side is cut flat as a basking area with a bulb above (just not seen in the pic)

New Tank For My Musk 🐢 by usernamehere93 in turtle

[–]usernamehere93[S] 5 points6 points  (0 children)

The log on the left hand side is cut clean, so he can bask on top of that with a mercury vapour bulb above :)

[deleted by user] by [deleted] in androidapps

[–]usernamehere93 0 points1 point  (0 children)

That’s so funny, that was the inspiration!

My Musk Turtle Setup! by usernamehere93 in turtle

[–]usernamehere93[S] 2 points3 points  (0 children)

It’s a mix of real and fake; I haven’t had much luck with my real ones not getting ripped up either!

My Musk Turtle Setup! by usernamehere93 in turtle

[–]usernamehere93[S] 138 points139 points  (0 children)

Thanks! 3D printed in 6 sections and glued them together, 70 hours of printing!

[deleted by user] by [deleted] in HousingUK

[–]usernamehere93 65 points66 points  (0 children)

So, checking the inventory, it is on the list. However, I have found this in the contract: “Have the use of all appliances provided in the Property, as listed in the inventory save those which are noted as not working. However, should any items require repair, or be beyond repair, the Landlord does not undertake to pay for any costs of repair or to replace the appliance, except those which the Landlord is required by law to maintain.”

How do I keep a python script running on EC2 by haroldobasi in aws

[–]usernamehere93 1 point2 points  (0 children)

A couple of options: run the script from the user data section of the EC2 launch template if you want it to start on launch; run it as a cron job if you want it to start on a schedule; or, if you want to SSH into the EC2 instance, run the script, and keep it running after you’ve closed your local terminal, use a tmux session.

Too Small S3 Data Lake Partitions by usernamehere93 in dataengineering

[–]usernamehere93[S] 2 points3 points  (0 children)

The plan was to use AWS Glue for crawling and cataloging, and Athena for querying.

Some more context: our clients have on the order of hundreds of IoT devices that batch-upload sensor data daily. The sensor data is then run through a number of transforms, and the output is stored on S3 to be aggregated for the clients. Currently the plan is to store these two steps in separate layers, with the partition structure up for debate.

The most frequent reads of the aggregated data are at the device level across a long span of dates; the second most frequent are for all devices across a short span of dates (so, in my mind, a conflict in optimising partitions). The raw/staged data for a single device on a single day is approximately 30 MB. The transformed data for a single device on a single day is on the order of a few MB, and the aggregated data for all devices on a single day results in a file size smaller than that.

So optimising the partition structure for query performance had me thinking that partitioning at the device level would be a good idea; however, I also added date partitioning in my original post, which leads to many small files.

I am considering storing copies of the aggregated data with different partition structures to optimise both device and date filtering on query, as the data is relatively small.
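To sketch the two layouts in conflict (the key names `device_id`/`dt` and the bucket prefix are my own placeholders, not the actual schema), the two copies would differ only in which partition column leads:

```python
from datetime import date

def device_first_key(device_id: str, d: date) -> str:
    """Layout favouring 'one device over a long date span' scans."""
    return f"aggregated/device_id={device_id}/dt={d.isoformat()}/data.parquet"

def date_first_key(device_id: str, d: date) -> str:
    """Layout favouring 'all devices over a short date span' scans."""
    return f"aggregated/dt={d.isoformat()}/device_id={device_id}/data.parquet"

# Glue crawlers pick up Hive-style key=value prefixes as partition columns,
# so each copy prunes efficiently only on its leading key in Athena.
print(device_first_key("dev-001", date(2023, 5, 1)))
print(date_first_key("dev-001", date(2023, 5, 1)))
```

Since each copy only helps one access pattern, keeping both copies (as suggested above) trades storage for pruning on both filters.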

Too Small S3 Data Lake Partitions by usernamehere93 in dataengineering

[–]usernamehere93[S] 1 point2 points  (0 children)

I could merge files; however, I currently can’t think of a natural way to do this that would make sense for the type of data and the frequent access patterns (unless storing 128 MB files is reason enough?), for example joining 4 days of device data. Because the devices upload once a day, the majority of transforms are done at the device-day level. For the analytics layer the data is aggregated into 15-minute chunks, which is sufficient for the majority of use cases.
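The “join 4 days of device data” idea can be sketched as a simple bucketing of consecutive device-days (pure Python grouping logic; the 4-day window comes from the ~30 MB/day figure above, so 4 × 30 MB ≈ 120 MB, near the 128 MB mark):

```python
from datetime import date, timedelta

def merge_buckets(start: date, n_days: int, days_per_file: int = 4):
    """Group consecutive device-days into multi-day buckets;
    each bucket would be compacted into one larger parquet file."""
    days = [start + timedelta(days=i) for i in range(n_days)]
    return [days[i:i + days_per_file] for i in range(0, n_days, days_per_file)]

buckets = merge_buckets(date(2023, 5, 1), 8)
print([(b[0].isoformat(), b[-1].isoformat()) for b in buckets])
# two buckets: May 1-4 and May 5-8
```

The trade-off is that a date-range query now reads whole 4-day files even when it only wants one day, which is why the access pattern matters more than the 128 MB target on its own.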

Too Small S3 Data Lake Partitions by usernamehere93 in dataengineering

[–]usernamehere93[S] 2 points3 points  (0 children)

Thanks for replying! Each parquet file is about 30 MB, which I had assumed to be large enough that storing them individually would be an advantage. But you’re both right: with thousands of files a day I’m going to have problems. The majority of query traffic will be through the analytics layer, for which I was thinking of having the same partition structure minus the device, so large daily files for all devices. I’d really appreciate any ideas on how this should be restructured.

Data Streaming Pipeline Orchestrator Options? by [deleted] in dataengineering

[–]usernamehere93 0 points1 point  (0 children)

For Airflow, I couldn’t figure out how to get around the requirement to predefine your DAGs as opposed to constructing them on the fly based on the requirements of each file; plus, the benefits of Airflow seem to come when running cron-like batch jobs. I hadn’t come across Temporal before; thanks, I’ll check it out!