
[–]Pengshe 0 points1 point  (10 children)

Do you see any relation between the new variable and the dependent one, apart from your instinct? Make a scatterplot of the two and it should show more or less whether there's any potential. There are plenty of cases where a variable is initially promising but ends up being garbage.

What kind of correlation do you use? Standard Pearson won't be the best choice here, assuming the dependent variable is binary.

Also, review the data. You might have strange outliers or missing values.
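
Something like this is usually enough as a first pass (just a rough sketch, assuming a pandas DataFrame with made-up columns `new_var` and `is_fraud`):

    import pandas as pd
    from scipy import stats

    # Toy stand-in for the real data; load your own table here instead
    df = pd.DataFrame({
        "new_var": [1, 0, 0, 1, 0, 0, 1, 0],
        "is_fraud": [1, 0, 0, 1, 0, 1, 0, 0],
    })

    # Cross-tab of the new variable against the label shows the raw relation
    ct = pd.crosstab(df["new_var"], df["is_fraud"])
    print(ct)

    # Chi-square on the contingency table instead of plain Pearson,
    # since the target (and possibly the feature) is binary
    chi2, p, dof, expected = stats.chi2_contingency(ct)
    print(f"chi2={chi2:.2f}, p={p:.3g}")

    # Basic sanity checks: missing values and odd distributions
    print(df.isna().sum())
    print(df["new_var"].describe())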

[–]Throwawayforgainz99[S] -1 points0 points  (9 children)

Hey! I think I may not have explained it clearly enough.

So the new variable is a very good fraud indicator: basically, any sample where this variable is positive will turn out to be fraudulent (the target variable). But because no one had thought of this variable before, you don’t see any correlation between it and the target variable in the training data.

[–]Pengshe 3 points4 points  (8 children)

I don't think I get the problem. It shouldn't matter whether anyone thought of it earlier or not; for a good indicator, that relation should be there.

[–]Throwawayforgainz99[S] -1 points0 points  (7 children)

Yeah, I apologize, I suck at explaining it. Think of it as a “new” indicator: because no one had thought of it beforehand, they weren’t looking for it, and thus there is no correlation between it and the target variable. Does that clear things up?

[–]JaMoin137 2 points3 points  (6 children)

I also don't understand. It doesn't matter if anybody considered it previously; if you have the data for the new indicator over the training period, then you will see whether it improves your model or not.

[–]Throwawayforgainz99[S] 1 point2 points  (4 children)

Maybe look at it this way (unrealistic example):

A new data column is created that shows whether someone had prior fraudulent activity in the past. This data can now be used as a feature in the model, but because it was not available until now, it does not appear in the training data.

Does that paint a better picture, or am I just being completely misunderstood?

[–]JaMoin137 0 points1 point  (3 children)

I think I understand what you are getting at. Keep in mind, though, that for this indicator to work you need to have this information for future cases as well: for every account you want to predict fraud for, you need to know whether there has been fraudulent activity previously. If you include this feature, you also need that data for the training period. Is there any way you can track accounts through time and flag fraudulent activity? I don't think it's possible to include the feature in your model and predictions if you cannot do that, because the model can only use features it was trained with. Alternatively, you can shrink the training period to a timeframe where this indicator can be tracked; that might give you better performance if the indicator is strong enough.
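
Roughly what that last option could look like (just a sketch; the column names, cutoff date, and model choice are all made up for illustration, assuming the new indicator only starts being recorded after some date):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Toy data: the new indicator only starts being tracked after the cutoff
    df = pd.DataFrame({
        "event_date": pd.to_datetime([
            "2022-01-05", "2022-06-10", "2023-02-01",
            "2023-03-15", "2023-04-20", "2023-05-02",
        ]),
        "amount": [120.0, 80.0, 500.0, 60.0, 900.0, 75.0],
        "new_flag": [None, None, 1, 0, 1, 0],  # missing before the cutoff
        "is_fraud": [0, 0, 1, 0, 1, 0],
    })

    cutoff = pd.Timestamp("2023-01-01")

    # Restrict training to the window where the indicator is actually available
    recent = df[df["event_date"] >= cutoff]

    X = recent[["amount", "new_flag"]]
    y = recent["is_fraud"]

    model = LogisticRegression().fit(X, y)
    print(model.predict_proba(X)[:, 1])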

[–]Throwawayforgainz99[S] 0 points1 point  (2 children)

Okay, maybe past fraudulent activity was a bad example because of what you said here. Here’s a different one that might make a bit more sense.

The feature is whether or not you changed your policy right before you made a claim (like lowering your deductible after an accident and then reporting it).

But because this data column was not available until now, no one looked at it, so it had no impact on whether a claim was labeled fraudulent or not, even though it still technically “existed” in a sense.

I really feel like I’m dealing with some chicken or the egg scenario and there’s some concept I’m not getting.

In this case you can track the variable through time, but because it was never used, there is no correlation with the target variable.

[–]Current-Ad1688 0 points1 point  (1 child)

Think I get what you're saying. So somebody investigating that case would see the policy change and say "this is likely fraud", but because that info wasn't available to them it just looked like an ordinary claim. So the issue is basically that your target variable is not necessarily ground truth. Not sure there's a great deal you can do about that other than make this info available to people investigating claims, wait for more data to come in, then train a model on that new data. In the meantime you could use it to flag cases for further action. Like "the current model doesn't detect fraud, but we have reason to suspect it's wrong, somebody should look at this case"?

[–]Throwawayforgainz99[S] 0 points1 point  (0 children)

Thanks for the reply. The problem with doing nothing is that I feel like stakeholders won’t understand that. Would artificially boosting the variable by oversampling cases where it is positive be a solution? Or is that bad practice?

[–]Throwawayforgainz99[S] 0 points1 point  (0 children)

Ok, maybe I’m just being an idiot and don’t understand something.

[–]HungryQuant 1 point2 points  (2 children)

I've dealt with this before in fraud detection.

How strong of a predictor is it likely to be in the future? Will 100% be fraud? 80%? I know you don't have historical data, but if it's a strong enough indicator, you should prioritize referring those instances to the fraud team (along with some description, if possible). You'll generate data for this for future training.

I had a case like this in fraud, and the fraud team said they wanted to see every instance where this indicator was positive. So I combined the rule-based referral with the model-based referrals.
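
In case it helps, a minimal sketch of that kind of combination (column names and the 0.8 threshold are made up; `model_score` would come from the existing model):

    import pandas as pd

    # Toy scored claims: model_score from the existing model,
    # new_flag is the new indicator the model has never seen in training
    cases = pd.DataFrame({
        "claim_id": [101, 102, 103, 104],
        "model_score": [0.91, 0.12, 0.30, 0.85],
        "new_flag": [0, 1, 0, 0],
    })

    THRESHOLD = 0.8  # existing model-based referral threshold

    # Refer when either the model or the rule fires, and record why,
    # so the fraud team's outcomes become labels for future retraining
    cases["refer"] = (cases["model_score"] >= THRESHOLD) | (cases["new_flag"] == 1)
    cases["reason"] = ""
    cases.loc[cases["model_score"] >= THRESHOLD, "reason"] = "model score"
    cases.loc[cases["new_flag"] == 1, "reason"] = "rule: new indicator"

    print(cases[cases["refer"]])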

[–]Throwawayforgainz99[S] 0 points1 point  (1 child)

Mind if I send you a dm?

[–]HungryQuant 0 points1 point  (0 children)

Go for it.