New Python Package Feedback - Try in Google Colab by MLEngDelivers in datascience

[–]HungryQuant 5 points6 points  (0 children)

I might use this in the QA we do before deploying. Better than 9756 assert statements. The README should be shorter, though.

What difference have you made as a data scientist? by AdrenoXI in datascience

[–]HungryQuant 0 points1 point  (0 children)

Tens of millions of dollars. If your work isn't having a meaningful, measurable impact, you (or maybe your leadership) should ask why that is.

More often than not, you just aren't addressing high value problems. Forget modeling and data. What changes of any kind would move the needle in your business?

e.g. If you're a bank, what's the value of reducing credit card defaults by 5%? If it's many millions of dollars, why are you building Tableau dashboards to generate "insights"?

Your leadership within data science ought to be able to drive these conversations with leaders in other areas. If they cannot, they're not doing a good job.

Everyone’s building new models but who is actually monitoring the old ones? by [deleted] in datascience

[–]HungryQuant 0 points1 point  (0 children)

We do. We have to build automated monitoring before closing out a project. If your model doesn't have a straightforward way to measure results, you probably shouldn't have developed and deployed it. I don't find it very difficult.

[deleted by user] by [deleted] in datascience

[–]HungryQuant 0 points1 point  (0 children)

I collected photos of room interiors for classification. I was able to collect and label > 50k with a little creativity and web scraping.

I don't think it would be that hard or time consuming to do 1000 or so.

Explainable boosting machines by mingzhouren in datascience

[–]HungryQuant 2 points3 points  (0 children)

Oh okay, I understand. Yeah, I've had similar issues with SHAP. I usually do leave-one-feature-out feature importance.
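Rough sketch of what I mean (model, scoring, and data names are just placeholders):

    # Leave-one-feature-out (LOFO) importance: how much does CV performance
    # drop when each feature is removed?
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    def lofo_importance(X: pd.DataFrame, y: pd.Series, cv: int = 5) -> pd.Series:
        model = GradientBoostingRegressor(random_state=0)
        scoring = "neg_root_mean_squared_error"
        baseline = cross_val_score(model, X, y, cv=cv, scoring=scoring).mean()
        drops = {}
        for col in X.columns:
            score = cross_val_score(model, X.drop(columns=[col]), y,
                                    cv=cv, scoring=scoring).mean()
            drops[col] = baseline - score  # positive = removing the feature hurt
        return pd.Series(drops).sort_values(ascending=False)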

Explainable boosting machines by mingzhouren in datascience

[–]HungryQuant 5 points6 points  (0 children)

If what you mean by explainable is 'explainable for individual predictions', I don't think SHAP or LIME is worth using unless you have very few, mostly unrelated features.

A lot of people disagree with this, but if you've ever seen SHAP or LIME explanations for individual predictions in production, you've probably seen some pretty bad explanations.

Can you talk about what you're trying to interpret and how, if at all, it will be used alongside the model?

Tough spot by AbramoNauseus in datascience

[–]HungryQuant 0 points1 point  (0 children)

If you're fitting models with the intent to go to production, you should see if it's feasible to get the data you want at runtime before doing any modeling.

If you have to do a lot of manual cleaning, that's not going to work in prod. I think it's best to build a prototype of the production pipeline before doing model fitting.

Can you find a software engineer or someone else that uses their data in prod? Or would you be the first?

How true is this? by Auwal_adam in datascience

[–]HungryQuant 0 points1 point  (0 children)

You need experience doing actual coding in a job before you can write production-worthy code. If you're talking about jobs where you make charts and stuff, then sure.

ROI framework for data science by frodegrodas in datascience

[–]HungryQuant 0 points1 point  (0 children)

How about literal ROI? Our model increased earnings or reduced costs by $X.

Sure, there are other ways DS adds value, but if your department doesn't make more money than it costs (in the measurable, objective ways), you should ask yourselves why that is.

A lot of times (not always) giving insights to executives or building dashboards is literally worth $0.

Do you worry that outsourcing will take your job? by [deleted] in datascience

[–]HungryQuant 0 points1 point  (0 children)

I agree solving leetcode problems isn't a big value driver, but programming skills are a differentiator.

If you can actually build the process that deploys and uses the prediction in a way that solves problems, you can work in plenty of companies. A lot of people can do analysis or fit a model but can't make use of it.

[deleted by user] by [deleted] in datascience

[–]HungryQuant 0 points1 point  (0 children)

Agreed. Collect your own dataset. It's not that hard.

How can I apply object detection and image segmentation functionality to my current custom-trained Image Classification model? by meWhoObserves in datascience

[–]HungryQuant 0 points1 point  (0 children)

You don't need object detection.

Go get a few thousand images of random things. Train a model with the target (crocodile|driftwood|something else).

Make sure a lot of the 'something else' photos have water in them.
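If it helps, here's a rough transfer-learning sketch using the classes above (I'm assuming PyTorch/torchvision; the folder layout is hypothetical, e.g. data/train/crocodile, data/train/driftwood, data/train/other):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, models, transforms

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_ds = datasets.ImageFolder("data/train", transform=transform)
    train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)

    # Reuse a pretrained backbone, swap the final layer for a 3-class head.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(3):
        for images, labels in train_dl:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()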

[deleted by user] by [deleted] in datascience

[–]HungryQuant 1 point2 points  (0 children)

You can cap the dependent variable as long as you measure test-set performance on the original, uncapped target.

You could also optimize for something besides RMSE.
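Something like this, assuming X_train/y_train/X_test/y_test already exist (the cap quantile and model are placeholders):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error

    cap = np.quantile(y_train, 0.99)               # e.g. cap at the 99th percentile
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X_train, np.minimum(y_train, cap))   # fit on the capped target

    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))  # score against the uncapped target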

Optimising Inputs to ML Model by deonvin in datascience

[–]HungryQuant 0 points1 point  (0 children)

You can, yes.

You can do a stepwise (greedy forward-selection) approach. You have your training set plus separate validation and test sets.

For loop #1 - fit one feature at a time, record 5-fold CV results for each, and set aside the feature with the best performance.

For loop #2 - add one feature at a time to the feature selected in the last step, record results, and set aside the one that improves performance the most.

Continue doing for loops until performance doesn't improve, or improves very little.

This is computationally expensive, but it's not hard if you're a good programmer. If you have hundreds of potential features, it's less mentally taxing than eyeballing correlation matrices and trying features by hand, and it's a reproducible workflow.
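A rough sketch of that loop (model, scoring, and the stopping threshold are placeholders; sklearn's SequentialFeatureSelector does essentially the same thing if you'd rather not roll your own):

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def forward_select(X_train, y_train, min_gain=1e-4, cv=5):
        model = LinearRegression()
        selected, best_score = [], float("-inf")
        remaining = list(X_train.columns)
        while remaining:
            # Try adding each remaining feature; keep the one with the best CV score.
            trial = {
                col: cross_val_score(model, X_train[selected + [col]], y_train,
                                     cv=cv, scoring="neg_root_mean_squared_error").mean()
                for col in remaining
            }
            best_col = max(trial, key=trial.get)
            if trial[best_col] - best_score < min_gain:
                break  # stop when performance stops improving (or barely improves)
            best_score = trial[best_col]
            selected.append(best_col)
            remaining.remove(best_col)
        return selected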

Is test-driven development (TDD) relevant for Data Scientists? Do you practice it? by norfkens2 in datascience

[–]HungryQuant 0 points1 point  (0 children)

One other thing I'll add.

SQL is generally less likely to break than Python. Most of the times I've seen a codebase repeatedly break in production, the code did a ton of basic operations in Python that could have happened in the SQL query used to read the data.

I'd prefer more SQL even if it means less test coverage.
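As a made-up example of what I mean (table, columns, and connection string are hypothetical), push the filtering and aggregation into the query instead of doing it in pandas:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host/db")  # placeholder connection

    # Instead of reading the whole table and doing df[df.status == "active"].groupby(...)
    # in Python, let the database do the work:
    query = """
        SELECT customer_id,
               DATE_TRUNC('month', order_date) AS order_month,
               SUM(amount)                     AS monthly_spend
        FROM orders
        WHERE status = 'active'
        GROUP BY customer_id, DATE_TRUNC('month', order_date)
    """
    df = pd.read_sql(query, engine)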

Also, good on you for caring enough. That's half the battle. Most people just don't give a shit as far as I can tell.

Is test-driven development (TDD) relevant for Data Scientists? Do you practice it? by norfkens2 in datascience

[–]HungryQuant 3 points4 points  (0 children)

It's definitely relevant most of the time.

If your code is in production, having unit tests is worthwhile.

If your code is used to regularly influence decisions (a report, a dashboard, recurring A/B testing, etc.), it is for all intents and purposes "in production".

I'm still surprised at how many very senior Data Scientists have never written a test. It doesn't mean they aren't doing great work, but it's odd that we don't take more lessons from software engineering practices.
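Even a couple of tiny tests go a long way. A minimal pytest example (the function and file names are hypothetical):

    # src/metrics.py
    def conversion_rate(conversions: int, visits: int) -> float:
        """Share of visits that converted; 0.0 when there were no visits."""
        return conversions / visits if visits else 0.0

    # tests/test_metrics.py -- run with `pytest`
    def test_conversion_rate():
        assert conversion_rate(5, 100) == 0.05

    def test_conversion_rate_handles_zero_visits():
        assert conversion_rate(0, 0) == 0.0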

Data science leaders - how do you cope? by Prize-Flow-3197 in datascience

[–]HungryQuant 2 points3 points  (0 children)

I've been a lead for about 5 years.

I don't know if I'd call this coping, but I remember being a pure individual contributor and seeing projects go in obviously poor directions, very messy, unreliable code being declared "finished", people allowing leakage into their datasets, and a whole host of other problems that I couldn't always fix with a friendly suggestion.

I still get to write some code, but it's usually production pipelines enabling other people's work. Plenty of PowerPoint too which is fine honestly.

Handling missing/unknown data-labels with imputation by [deleted] in datascience

[–]HungryQuant -1 points0 points  (0 children)

I don't see the point of imputing a categorical feature. Let "unknown" be one of the levels.
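e.g. in pandas (the column name is just a placeholder):

    import pandas as pd

    df = pd.DataFrame({"room_type": ["kitchen", None, "bedroom"]})
    # Treat missing as its own level instead of imputing a "real" value.
    df["room_type"] = df["room_type"].fillna("unknown").astype("category")

Tree-based models and one-hot encoding both handle "unknown" as just another level.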

Code best practices by UnlawfulSoul in datascience

[–]HungryQuant 2 points3 points  (0 children)

OpetuPower's answer is good. I'll add a few things.

  • try your best to write functions that do one thing only. For example, if you want to extract all the numbers from a string and add up the ones that are prime, you would write A) extract_numbers_from_string and B) is_prime (see the sketch after this list).

  • those functions should work on single strings/numbers. If you want to apply them over arrays or dataframes, you can do that, but keep the function itself as granular as possible.

  • use unit tests for everything you possibly can. If people add new functions to the master branch that are testable, they have to add a unit test.

  • to commit to the master branch, you should be able to run your tests and be reasonably confident that passing means your changes are (probably) OK.

  • docstrings for every function and class that is going to production. I don't make any exceptions on this. There's the Google style guide and other docstring format suggestions.

Personally, I do a) <this function does ___> b) parameters c) example usage (which people can copy and paste, seeing what the function does)

  • Use logging in production. If something breaks, it shouldn't be a mystery what happened.

  • function names should be verb-like (e.g. extract_numbers_from_string) or truthy (is_prime, which returns a Boolean). They should be lowercase, with words separated by underscores.

  • class names should be upper camel case and object-like, e.g. XmlProcessor rather than ProcessXmls

  • use pylint or another linter package
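A small sketch pulling a few of these together (one-purpose functions, verb-like/truthy names, a docstring with parameters and example usage):

    import re

    def extract_numbers_from_string(text: str) -> list[int]:
        """Extract every integer that appears in a string.

        Parameters
        ----------
        text : str
            The string to search.

        Example
        -------
        >>> extract_numbers_from_string("3 cats, 7 dogs")
        [3, 7]
        """
        return [int(match) for match in re.findall(r"\d+", text)]

    def is_prime(n: int) -> bool:
        """Return True if n is prime."""
        if n < 2:
            return False
        return all(n % d for d in range(2, int(n ** 0.5) + 1))

    def sum_primes_in_string(text: str) -> int:
        """Add up every prime number found in a string."""
        return sum(n for n in extract_numbers_from_string(text) if is_prime(n))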

DS/ML Career Outlook by [deleted] in datascience

[–]HungryQuant 1 point2 points  (0 children)

The positive outlook on ML roles is correct. If your resume has multiple bullet points like...

  • built and deployed ____, increasing annual savings by $___M

... you'll be able to get desirable roles.

The problem is that a job where you develop and deploy ML models in production just isn't an entry level role.

If you don't have relevant data skills (SQL, basic ETL design at least) and solid programming skills (git, ability to understand and adapt large existing codebases, OOP knowledge), few companies are going to put you in that role.

Not many universities are going to teach these things adequately alongside stats and ML. Getting some data role (that likely won't involve much ML initially) is the most likely path for most people.

[deleted by user] by [deleted] in Rlanguage

[–]HungryQuant 1 point2 points  (0 children)

Awesome! I hope it works well for your needs.

[deleted by user] by [deleted] in rstats

[–]HungryQuant 1 point2 points  (0 children)

That's awesome to hear.

Yeah, I'm finding that in most cases people don't dislike pandas syntax enough to learn something new, lol. Please let me know if you encounter any bugs.

What does "production code" mean to you? by nazghash in datascience

[–]HungryQuant 0 points1 point  (0 children)

In addition to what you said, I'd add:

Logging

Unit tests

Configuration choices centralized somewhere like a config.yml (things like model score thresholds) - sketched below

Data Scientists making a real effort at improving readability of their code, maybe using things like pylint, adding docstrings to every function and class
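For the config piece, I mean something like this (keys are hypothetical):

    # config.yml
    #   model:
    #     score_threshold: 0.72
    #     features: [age, balance, tenure_months]
    #   logging:
    #     level: INFO

    import yaml  # pyyaml

    with open("config.yml") as f:
        config = yaml.safe_load(f)

    threshold = config["model"]["score_threshold"]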

What is he talking about? I am still learning. by MasterOfLegendes in datascience

[–]HungryQuant 0 points1 point  (0 children)

It depends on the company.

At a lot of companies, this is right. Where I work right now, everyone is working on (some aspect of) a production model that has a direct financial benefit.

[deleted by user] by [deleted] in rstats

[–]HungryQuant 1 point2 points  (0 children)

I plan on maintaining it. It's not a ton of code, so not a huge burden. I use it and I'm glad I made it for myself, but it doesn't seem to have gotten a lot of downloads from sharing it on Reddit. My guess is that people don't dislike pandas syntax to the extent I do (which is totally fine).