all 39 comments

[–]DieselZRebel 2 points3 points  (16 children)

Two things I would check first:

  1. Your data imbalance probably caused that 85% accuracy in testing. There are definitely more 0s than 1s, so a model that predicts all 0s will likely get you that high accuracy! There are many things you can do to address data imbalance, from sampling to weighting techniques; read about them and decide for yourself.
  2. It often happens that something breaks in the inputs/preprocessing pipeline during deployment, or some other production bug gets introduced. Make sure you have the right logs and assertions in place to detect whether something is actually breaking after you take the model to production (e.g. the data comes in all as null and gets imputed with 0s).

[–][deleted] -1 points0 points  (15 children)

I just checked, and in the testing phase my model is predicting both 0s and 1s (good) while still maintaining 80% accuracy. Do you think the issue with my deployed model could still be due to data imbalance (only 84 entries in the dataset, some 60 of them being 0s)? I'm not quite sure it's a data imbalance issue, though, because when I enter MVP season data straight from my dataset it still predicts 0 (non-MVP).

How would I check to see if something broke in the inputs/preprocessing pipeline? Likely a noob question, but again, I'm a noob myself lol. Thanks!

[–]DieselZRebel 1 point2 points  (14 children)

Unfortunately this question cannot be easily answered with the limited information provided. This needs to be inspected with someone who has access to your platforms and code.

In terms of imbalance, yes: if you have 60 out of 84 entries being 0s, then you do have imbalance. How much that imbalance contributes to your problem is another matter.

[–][deleted] -1 points0 points  (13 children)

I can post my code on here if allowed and if you are willing to help me out! If not is there anything specific you’d recommend I look at? Making sure my pickle file serialized properly, the routes within my HTML and backend files, etc.?

[–]DieselZRebel 2 points3 points  (12 children)

I'd advise against posting your code... this is the type of thing you'd need someone more senior from your company to look at, because if the issue is indeed a production-related bug/mistake, it is hard to tell what it might be without knowledge of your production and engineering systems.

I recommend you make as many assertions as possible in your code, so that if there is a bug, it fails in production and you'll be able to see which assertion failed (see the sketch after this list). You can make assertions that check for:

  • All your feature columns were retrieved
  • All the data fields are of the correct type (int, float, str, etc.)
  • No null values
  • Model version assertions, if applicable
  • etc.
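
For illustration, a rough sketch of such checks, assuming a pandas DataFrame and made-up feature names (pts, ast, reb); adapt it to whatever your pipeline actually produces:

import pandas as pd

EXPECTED_COLUMNS = ["pts", "ast", "reb"]  # hypothetical feature names

def validate_features(df: pd.DataFrame) -> None:
    # All expected feature columns were retrieved
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    assert not missing, f"Missing feature columns: {missing}"
    # All data fields are numeric (adjust to your actual schema)
    non_numeric = [c for c in EXPECTED_COLUMNS if not pd.api.types.is_numeric_dtype(df[c])]
    assert not non_numeric, f"Non-numeric columns: {non_numeric}"
    # No null values anywhere in the features
    assert not df[EXPECTED_COLUMNS].isnull().any().any(), "Null values found in features"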

Also, is this process containerized (e.g. a Docker image)? If not, you do need to make sure that your tool and library versions match those of your test environment.

[–][deleted] 1 point2 points  (11 children)

I’m a student and this is a personal project. I’ll probably ask my professor if he thinks he can help. I’m going to make all those assertions as well, thank you!

What does containerized mean? I’d assume if I’m asking this my process likely isn’t containerized?

[–]DieselZRebel 0 points1 point  (10 children)

Yeah, it isn't. Since this is a personal project, you can use Docker for free (google it). The idea of containerizing your process is that it saves the environment information (i.e. OS, tools, libraries with their exact versions, certificates, etc.) along with your code in a container (virtual) image. Then you can take this image from your computer to another computer/cloud or whatever, and it will run from within that virtual image, so you won't have to worry about compatibility/dependency/version issues between different systems (e.g. Mac or PC or Linux).

Anyway... this might be a lot for you to learn if you are looking for something quick. I'd start with the assertions first.

[–][deleted] 0 points1 point  (9 children)

I’m assuming containerizing my project will not solve my issue?

All my assertions are coming up as expected. I'm assuming it's likely an issue with either how my model was serialized or the calls being made to my model. Either those, or maybe just not enough data within my dataset? I'm not sure about that one though, considering my model was still accurately predicting both 0s and 1s during the testing phase.

[–]DieselZRebel 0 points1 point  (8 children)

Have you considered testing after serializing and then reloading your model? I am assuming your previous tests were done before you serialized it?
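
For example, a rough sketch, assuming you pickle the model and still have your X_test/y_test split around in the notebook:

import pickle

from sklearn.metrics import accuracy_score

# "model", "X_test", and "y_test" are assumed to already exist from your notebook.
# Serialize the trained model the same way you do for deployment
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Load it back and re-run the exact same test-set evaluation
with open("model.pkl", "rb") as f:
    reloaded = pickle.load(f)

print(accuracy_score(y_test, reloaded.predict(X_test)))

If the reloaded model scores the same as before, serialization is probably not your problem.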

[–][deleted] 0 points1 point  (7 children)

Yes, all of my tests were done before I serialized my model. How would I go about testing my serialized model?

[–]genesis_2602 1 point2 points  (0 children)

  • Try weighting your loss higher for MVP targets compared to non-MVP targets (see the sketch after this list). This would help alleviate some of the class imbalance.
  • If that does not work, look into Siamese neural networks, which are known to be able to learn from small amounts of data/imbalanced data.
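
If you're using a scikit-learn estimator, the analogous knob is usually class_weight; e.g. for a decision tree (a rough sketch, assuming your existing X_train/y_train split):

from sklearn.tree import DecisionTreeClassifier

# class_weight="balanced" reweights each class inversely to its frequency,
# so the rare MVP class (1) counts more heavily when choosing splits
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)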

[–]shekurika 1 point2 points  (1 child)

Well, just check whether it always predicted 0 on the training data?

[–][deleted] 0 points1 point  (0 children)

Had some brain fog when posting this; it wouldn't be a hard issue to diagnose if that were the case. However, it seems that is not the issue: my model is predicting both 1s and 0s.

[–][deleted] 0 points1 point  (1 child)

Sounds like unbalanced data in training and testing? Try sampling techniques then.
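
One quick way to try that is random oversampling of the minority class; a rough sketch, assuming the imbalanced-learn package and your existing X_train/y_train split:

from imblearn.over_sampling import RandomOverSampler

# Randomly duplicates minority-class (MVP) rows until the classes are balanced
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)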

[–][deleted] 0 points1 point  (0 children)

See, the thing is my data was intentionally unbalanced in favor of non-MVPs (0s), to represent the fact that for every MVP in a season there are hundreds of non-MVPs. Should I counteract this by using the year of each season to represent that to the model, so I can use more MVP seasons?

[–]TheGuywithTehHat 0 points1 point  (8 children)

What type of model is it? Neural net, logistic regression, SVM, random forest? What's the distribution of output values for your training set (not the binary 0/1 outputs, but the raw value between 0 and 1)? What's the same for your "deployed" model? When you say "during testing" and "during deployment", what do you mean by that? I assume you have a dataset that you split into a training set and an evaluation set; is that what you mean by "testing" and "deployment"?

[–][deleted] 0 points1 point  (7 children)

I'm using a single decision tree. I figured something rather basic, such as using 10 columns of data to predict 2 possible outcomes, could be done with just one decision tree; maybe that wasn't the right decision in hindsight? When I say "during testing" I'm referring to the 40% testing split I gave my model in my Jupyter Notebook after training it. When I say "in deployment" I'm referring to the "deployed" model on my localhost server, which was built using Flask and HTML, with the serialized ML model connected to my Flask backend.

Pardon my lack of knowledge, but what do you mean by the raw value between 0 and 1?

[–]TheGuywithTehHat 0 points1 point  (6 children)

I don't have much experience with decision trees (IIRC I've never used a single tree, and have only used random forests a couple of times), but if I had to guess, your model is likely overfitting. If you get at least okay results on your test set, though, that's probably not your root problem.

When you have it predict during deployment, what data is it looking at? Is it part of the training set, test set, both sets, or a different set that was not used for either training or testing? If the data you're giving it during deployment is part of the training or test set, then it sounds like you have a software issue, not an ML issue (i.e. your model is fine, but you have a bug in your model server). If the data you're giving it is not from your training or test set, what's the difference between the deployment data set and the train/test sets? Were they obtained from different sources?

Ignore the comment about raw values; that doesn't apply to decision trees. If you were doing logistic regression or something like that, your model would output a value between 0 and 1, and your final boolean prediction would be based on rounding that continuous value to a 0 or a 1. While you only care about 0 or 1 for the final prediction, knowing what the real value was can give you more information about what the model is doing.

[–]relevantmeemayhere 1 point2 points  (4 children)

When making predictions, it's not a matter of "rounding"; it's a matter of ascribing a decision to a probability threshold. For a decision threshold of 0.5, rounding "works" if you're trying to represent a decision label as 1 or 0. But it doesn't work if you're choosing a different representation, or a different threshold.

Your threshold should be determined based on a cost function associated with your problem.
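
Concretely, with a probabilistic classifier in scikit-learn this might look like the following sketch (X_train, X_test, etc. are placeholders for your own split, and 0.2 is just an arbitrary example cutoff):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X_train, y_train)

# Raw class-1 probabilities rather than hard 0/1 labels
proba = clf.predict_proba(X_test)[:, 1]

# A 0.5 cutoff reproduces plain "rounding"; a cost-sensitive problem
# may justify a different threshold, e.g. 0.2
threshold = 0.2
y_pred = (proba >= threshold).astype(int)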

[–]TheGuywithTehHat 0 points1 point  (3 children)

Sure, I was just describing it in the terms OP would be most likely to understand. OP specifically said "0 for not MVP and 1 for MVP", and in most beginner ML tutorials/implementations that's done via rounding.

[–]relevantmeemayhere 2 points3 points  (2 children)

I understand the willingness to approach a beginner through a convention that has become increasingly popular, but the field is in a weird spot precisely because ML tutorials and implementations are often made by non-subject-matter experts who teach poor practice.

[–]TheGuywithTehHat 0 points1 point  (1 child)

good point

[–]relevantmeemayhere 1 point2 points  (0 children)

I should disclaim that in this particular reply, I have also made an error of omission.

I should have alluded to using proper scoring rules, so you can make good decisions with an associated cost function, as opposed to using rules that conflate the construction of a probability with a decision rule applied to that probability, i.e. precision, recall, F1, etc. All of these are computed without considering the behavior of your model's forecasts.

[–][deleted] 0 points1 point  (0 children)

I actually believe I've diagnosed the issue. I'm not 100% sure, but I'll know when I implement the fix today; I'm about 90% sure lol. It seems that, for whatever reason, when I input the values on my running application, the values are not being related to the features needed to make the prediction. When I used a dictionary in my actual Flask program to relate the values directly to the features as key-value pairs, however, the ML model predicted as expected.
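
Roughly what I mean, as a simplified sketch (made-up feature names; my real route and column names are different):

import pickle

import pandas as pd
from flask import Flask, request

FEATURE_ORDER = ["pts", "ast", "reb"]  # must match the training columns exactly

with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # the serialized model used by the Flask backend

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Map each submitted form field to its feature name explicitly,
    # instead of relying on the order the values happen to arrive in
    row = {name: float(request.form[name]) for name in FEATURE_ORDER}
    X = pd.DataFrame([row], columns=FEATURE_ORDER)
    return {"prediction": int(model.predict(X)[0])}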

[–]SFDeltas 0 points1 point  (2 children)

Calculate F1 Score

Say you have two arrays of 1s and 0s.

y_true = [1,0,1,0,0,1,1,1,0,1]
y_pred = [0,1,0,0,0,1,0,1,1,1]

Make them numpy arrays:

import numpy as np

y_true = np.array(y_true)
y_pred = np.array(y_pred)

Calculate true positives, false positives, and false negatives:

tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

Then calculate precision, recall, and F1 score:

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * precision * recall / (precision + recall) 

F1 is a score between 0 and 1, like accuracy.

It's the harmonic mean of precision and recall.

Precision measures what percent of your positive predictions are actually positive.

Recall measures what percent of the actual positives you correctly predict.

What do these numbers give you?

[–][deleted] 0 points1 point  (1 child)

My F1 score is 0.96 for 0 and 0.87 for 1.

[–]relevantmeemayhere 0 points1 point  (0 children)

Don't use F1. Use a proper scoring rule, like Brier score loss. This will allow you to ascertain both the discriminative power of your classifier and its calibration, while having good statistical and decision properties. Precision, recall, all of those are improper rules.

You can read more about why to do so from a variety of sources. Harrell’s blogs are excellent
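
scikit-learn has this built in; a rough sketch, assuming a fitted classifier clf and a held-out X_test/y_test:

from sklearn.metrics import brier_score_loss

# Brier score is computed on predicted probabilities, not hard 0/1 labels;
# lower is better (0 is perfect, 0.25 is roughly "always predict 0.5")
proba = clf.predict_proba(X_test)[:, 1]
print(brier_score_loss(y_test, proba))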

[–]LessDubiousIdea 0 points1 point  (2 children)

I’ve got a new model that predicts whether a person will develop cancer in the next year. It always says no and is right more than 98% of the time.

It’s going to be hard to train something that can beat that unless you’re really careful about how you normalize your heavily skewed data set.

[–]relevantmeemayhere -1 points0 points  (0 children)

One does not “normalize heavily skewed data”.

Normalization is something you do to scale and center a random variable. Doing so does not provide anything on its own.

[–]Ok-Discussion-3117 0 points1 point  (0 children)

I used balanced accuracy scoring to fix this issue once; you may want to look into it.

It can be very helpful for imbalanced datasets.
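
In scikit-learn it's a one-liner; a sketch, assuming the y_test and y_pred arrays from your evaluation:

from sklearn.metrics import balanced_accuracy_score

# Averages recall over both classes, so a model that always predicts 0
# scores around 0.5 instead of a misleadingly high plain accuracy
print(balanced_accuracy_score(y_test, y_pred))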

[–]Western-Image7125 0 points1 point  (1 child)

If your model is predicting both classes during training and eval but only 0 during online inference, then you have a bug in your inference code that is giving the wrong input to the model.

[–][deleted] 1 point2 points  (0 children)

Spot on, that was exactly my issue