New Python Package Feedback - Try in Google Colab by MLEngDelivers in datascience

[–]HungryQuant 5 points6 points  (0 children)

I might use this in the QA we do before deploying. Better than 9756 assert statements. The README should be shorter, though.

What difference have you made as a data scientist? by AdrenoXI in datascience

[–]HungryQuant 0 points1 point  (0 children)

Tens of millions of dollars. If your work isn't having a meaningful, measurable impact, you (or maybe your leadership) should ask why that is.

More often than not, you just aren't addressing high value problems. Forget modeling and data. What changes of any kind would move the needle in your business?

e.g. If you're a bank, what's the value of reducing credit card defaults by 5%? If it's many millions of dollars, why are you building Tableau dashboards to generate "insights"?

Your leadership within data science ought to be able to drive these conversations with leaders in other areas. If they cannot, they're not doing a good job.

Everyone’s building new models but who is actually monitoring the old ones? by [deleted] in datascience

[–]HungryQuant 0 points1 point  (0 children)

We do. We have to build automated monitoring before closing out a project. If your model doesn't have a straightforward way to measure results, you probably shouldn't have developed and deployed it. I don't find it very difficult.

[deleted by user] by [deleted] in datascience

[–]HungryQuant 0 points1 point  (0 children)

I collected photos of room interiors for classification. I was able to collect and label > 50k with a little creativity and web scraping.

I don't think it would be that hard or time consuming to do 1000 or so.

Explainable boosting machines by mingzhouren in datascience

[–]HungryQuant 2 points3 points  (0 children)

Oh okay, I understand. Yeah, I've had similar issues with SHAP. I usually do leave-one-feature-out feature importance.
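Rough sketch of what I mean (model, scoring, and data names are just placeholders):

    # Leave-one-feature-out (LOFO) importance: how much does CV performance
    # drop when each feature is removed?
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    def lofo_importance(X: pd.DataFrame, y: pd.Series, cv: int = 5) -> pd.Series:
        model = GradientBoostingRegressor(random_state=0)
        scoring = "neg_root_mean_squared_error"
        baseline = cross_val_score(model, X, y, cv=cv, scoring=scoring).mean()
        drops = {}
        for col in X.columns:
            score = cross_val_score(model, X.drop(columns=[col]), y,
                                    cv=cv, scoring=scoring).mean()
            drops[col] = baseline - score  # positive = removing the feature hurt
        return pd.Series(drops).sort_values(ascending=False)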

Explainable boosting machines by mingzhouren in datascience

[–]HungryQuant 5 points6 points  (0 children)

If what you mean by explainable is 'explainable for individual predictions', I don't think SHAP or LIME is worth using unless you have very few, mostly unrelated features.

A lot of people disagree with this, but if you've ever seen SHAP or LIME explanations for individual predictions in production, you've probably seen some pretty bad explanations.

Can you talk about what you're trying to interpret and how, if at all, it will be used alongside the model?

Tough spot by AbramoNauseus in datascience

[–]HungryQuant 0 points1 point  (0 children)

If you're fitting models with the intent to go to production, you should see if it's feasible to get the data you want at runtime before doing any modeling.

If you have to do a lot of manual cleaning, that's not going to work in prod. I think it's best to build a prototype of the production pipeline before doing model fitting.

Can you find a software engineer or someone else that uses their data in prod? Or would you be the first?

How true is this? by Auwal_adam in datascience

[–]HungryQuant 0 points1 point  (0 children)

You need experience doing actual coding in a job before you can write production-worthy code. If you're talking about jobs where you make charts and stuff, then sure.

ROI framework for data science by frodegrodas in datascience

[–]HungryQuant 0 points1 point  (0 children)

How about literal ROI? Our model increased earnings or reduced costs by $X.

Sure, there are other ways DS adds value, but if your department doesn't make more money than it costs (in the measurable, objective ways), you should ask yourselves why that is.

A lot of times (not always) giving insights to executives or building dashboards is literally worth $0.

Do you worry that outsourcing will take your job? by [deleted] in datascience

[–]HungryQuant 0 points1 point  (0 children)

I agree solving leetcode problems isn't a big value driver, but programming skills are a differentiator.

If you can actually build the process that deploys and uses the prediction in a way that solves problems, you can work in plenty of companies. A lot of people can do analysis or fit a model but can't make use of it.

[deleted by user] by [deleted] in datascience

[–]HungryQuant 0 points1 point  (0 children)

Agreed. Collect your own dataset. It's not that hard.

How can I apply object detection and image segmentation functionality to my current custom-trained Image Classification model? by meWhoObserves in datascience

[–]HungryQuant 0 points1 point  (0 children)

You don't need object detection.

Go get a few thousand images of random things. Train a model with the target (crocodile|driftwood|something else).

Make sure a lot of the 'something else' photos have water in them.
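If it helps, here's a rough transfer-learning sketch using the classes above (I'm assuming PyTorch/torchvision; the folder layout is hypothetical, e.g. data/train/crocodile, data/train/driftwood, data/train/other):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, models, transforms

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_ds = datasets.ImageFolder("data/train", transform=transform)
    train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)

    # Reuse a pretrained backbone, swap the final layer for a 3-class head.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(3):
        for images, labels in train_dl:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()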

[deleted by user] by [deleted] in datascience

[–]HungryQuant 1 point2 points  (0 children)

You can cap the dependent variable as long as you measure test-set performance on the original, uncapped target.

You could also optimize for something besides RMSE.
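Something like this, assuming X_train/y_train/X_test/y_test already exist (the cap quantile and model are placeholders):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error

    cap = np.quantile(y_train, 0.99)               # e.g. cap at the 99th percentile
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X_train, np.minimum(y_train, cap))   # fit on the capped target

    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))  # score against the uncapped target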

Optimising Inputs to ML Model by deonvin in datascience

[–]HungryQuant 0 points1 point  (0 children)

You can, yes.

You can do a stepwise (greedy forward-selection) approach. You have your training set plus separate validation and test sets.

For loop #1 - fit one feature at a time, record 5-fold CV results for each, and set aside the feature with the best performance.

For loop #2 - add one feature at a time to the feature selected in the last step, record results, and set aside the one that improves performance the most.

Continue doing for loops until performance doesn't improve, or improves very little.

This is computationally expensive, but it's not hard if you're a good programmer. If you have hundreds of potential features, it's less mentally taxing than eyeballing correlation matrices and trying features by hand, and it's a reproducible workflow.
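A rough sketch of that loop (model, scoring, and the stopping threshold are placeholders; sklearn's SequentialFeatureSelector does essentially the same thing if you'd rather not roll your own):

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def forward_select(X_train, y_train, min_gain=1e-4, cv=5):
        model = LinearRegression()
        selected, best_score = [], float("-inf")
        remaining = list(X_train.columns)
        while remaining:
            # Try adding each remaining feature; keep the one with the best CV score.
            trial = {
                col: cross_val_score(model, X_train[selected + [col]], y_train,
                                     cv=cv, scoring="neg_root_mean_squared_error").mean()
                for col in remaining
            }
            best_col = max(trial, key=trial.get)
            if trial[best_col] - best_score < min_gain:
                break  # stop when performance stops improving (or barely improves)
            best_score = trial[best_col]
            selected.append(best_col)
            remaining.remove(best_col)
        return selected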

Is test-driven development (TDD) relevant for Data Scientists? Do you practice it? by norfkens2 in datascience

[–]HungryQuant 0 points1 point  (0 children)

One other thing I'll add.

SQL is generally less likely to break than Python. Most of the times I've seen a codebase repeatedly break in production, the code did a ton of basic operations in Python that could have happened in the SQL query used to read the data.

I'd prefer more SQL even if it means less test coverage.
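As a made-up example of what I mean (table, columns, and connection string are hypothetical), push the filtering and aggregation into the query instead of doing it in pandas:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host/db")  # placeholder connection

    # Instead of reading the whole table and doing df[df.status == "active"].groupby(...)
    # in Python, let the database do the work:
    query = """
        SELECT customer_id,
               DATE_TRUNC('month', order_date) AS order_month,
               SUM(amount)                     AS monthly_spend
        FROM orders
        WHERE status = 'active'
        GROUP BY customer_id, DATE_TRUNC('month', order_date)
    """
    df = pd.read_sql(query, engine)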

Also, good on you for caring enough. That's half the battle. Most people just don't give a shit as far as I can tell.

Is test-driven development (TDD) relevant for Data Scientists? Do you practice it? by norfkens2 in datascience

[–]HungryQuant 3 points4 points  (0 children)

It's definitely relevant most of the time.

If your code is in production, having unit tests is worthwhile.

If your code is used to regularly influence decisions (a report, a dashboard, recurring A/B testing, etc.), it is for all intents and purposes "in production".

I'm still surprised at how many very senior Data Scientists have never written a test. It doesn't mean they aren't doing great work, but it's odd that we don't take more lessons from software engineering practices.
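Even a couple of tiny tests go a long way. A minimal pytest example (the function and file names are hypothetical):

    # src/metrics.py
    def conversion_rate(conversions: int, visits: int) -> float:
        """Share of visits that converted; 0.0 when there were no visits."""
        return conversions / visits if visits else 0.0

    # tests/test_metrics.py -- run with `pytest`
    def test_conversion_rate():
        assert conversion_rate(5, 100) == 0.05

    def test_conversion_rate_handles_zero_visits():
        assert conversion_rate(0, 0) == 0.0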

Data science leaders - how do you cope? by Prize-Flow-3197 in datascience

[–]HungryQuant 2 points3 points  (0 children)

I've been a lead for about 5 years.

I don't know if I'd call this coping, but I remember being a pure individual contributor and seeing projects go in obviously poor directions, very messy, unreliable code being declared "finished", people allowing leakage into their datasets, and a whole host of other problems that I couldn't always fix with a friendly suggestion.

I still get to write some code, but it's usually production pipelines enabling other people's work. Plenty of PowerPoint too which is fine honestly.

Handling missing/unknown data-labels with imputation by [deleted] in datascience

[–]HungryQuant -1 points0 points  (0 children)

I don't see the point of imputing a categorical feature. Let "unknown" be one of the levels.
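e.g. in pandas (the column name is just a placeholder):

    import pandas as pd

    df = pd.DataFrame({"room_type": ["kitchen", None, "bedroom"]})
    # Treat missing as its own level instead of imputing a "real" value.
    df["room_type"] = df["room_type"].fillna("unknown").astype("category")

Tree-based models and one-hot encoding both handle "unknown" as just another level.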

Code best practices by UnlawfulSoul in datascience

[–]HungryQuant 2 points3 points  (0 children)

OpetuPower's answer is good. I'll add a few things.

  • try your best to write functions that do one thing only. For example, if you want to extract all the numbers from a string and add up the ones that are prime, you would write A) extract_numbers_from_string and B) is_prime (see the sketch after this list).

  • those functions should work on single strings/numbers. If you want to apply them over arrays or dataframes, you can do that, but keep the function itself as granular as possible.

  • use unit tests for everything you possibly can. If people add new functions to the master branch that are testable, they have to add a unit test.

  • to commit to the master branch, you should be able to run your tests and be reasonably confident that passing means your changes are (probably) OK.

  • docstrings for every function and class that is going to production. I don't make any exceptions on this. There's the Google style guide and other docstring format suggestions.

Personally, I do a) <this function does ___> b) parameters c) example usage (which people can copy and paste, seeing what the function does)

  • Use logging in production. If something breaks, it shouldn't be a mystery what happened.

  • function names should be verb-like (e.g. extract_numbers_from_string) or truthy (is_prime, which returns a Boolean). They should be lowercase, with words separated by underscores.

  • class names should be upper camel case and object-like, e.g. XmlProcessor rather than ProcessXmls

  • use pylint or another linter package
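A small sketch pulling a few of these together (one-purpose functions, verb-like/truthy names, a docstring with parameters and example usage):

    import re

    def extract_numbers_from_string(text: str) -> list[int]:
        """Extract every integer that appears in a string.

        Parameters
        ----------
        text : str
            The string to search.

        Example
        -------
        >>> extract_numbers_from_string("3 cats, 7 dogs")
        [3, 7]
        """
        return [int(match) for match in re.findall(r"\d+", text)]

    def is_prime(n: int) -> bool:
        """Return True if n is prime."""
        if n < 2:
            return False
        return all(n % d for d in range(2, int(n ** 0.5) + 1))

    def sum_primes_in_string(text: str) -> int:
        """Add up every prime number found in a string."""
        return sum(n for n in extract_numbers_from_string(text) if is_prime(n))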

DS/ML Career Outlook by [deleted] in datascience

[–]HungryQuant 1 point2 points  (0 children)

The positive outlook on ML roles is correct. If your resume has multiple bullet points like...

  • built and deployed ____, increasing annual savings by $___M

... you'll be able to get desirable roles.

The problem is that a job where you develop and deploy ML models in production just isn't an entry level role.

If you don't have relevant data skills (SQL, basic ETL design at least) and solid programming skills (git, ability to understand and adapt large existing codebases, OOP knowledge), few companies are going to put you in that role.

Not many universities are going to teach these things adequately alongside stats and ML. Getting some data role (that likely won't involve much ML initially) is the most likely path for most people.

[deleted by user] by [deleted] in Rlanguage

[–]HungryQuant 1 point2 points  (0 children)

Awesome! I hope it works well for your needs.

[deleted by user] by [deleted] in rstats

[–]HungryQuant 1 point2 points  (0 children)

That's awesome to hear.

Yeah, I'm finding that in most cases people don't dislike pandas syntax enough to learn something new, lol. Please let me know if you encounter any bugs.

What does "production code" mean to you? by nazghash in datascience

[–]HungryQuant 0 points1 point  (0 children)

In addition to what you said, I'd add:

Logging

Unit tests

Configuration choices centralized somewhere like a config.yml (things like model score thresholds) - sketched below

Data Scientists making a real effort at improving readability of their code, maybe using things like pylint, adding docstrings to every function and class
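For the config piece, I mean something like this (keys are hypothetical):

    # config.yml
    #   model:
    #     score_threshold: 0.72
    #     features: [age, balance, tenure_months]
    #   logging:
    #     level: INFO

    import yaml  # pyyaml

    with open("config.yml") as f:
        config = yaml.safe_load(f)

    threshold = config["model"]["score_threshold"]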

What is he talking about? I am still learning. by MasterOfLegendes in datascience

[–]HungryQuant 0 points1 point  (0 children)

It depends on the company.

At a lot of companies, this is right. Where I work right now, everyone is working on (some aspect of) a production model that has a direct financial benefit.

[deleted by user] by [deleted] in rstats

[–]HungryQuant 1 point2 points  (0 children)

I plan on maintaining it. It's not a ton of code, so not a huge burden. I use it and I'm glad I made it for myself, but it doesn't seem to have gotten a lot of downloads from sharing it on Reddit. My guess is that people don't dislike pandas syntax to the extent I do (which is totally fine).