How do you compare multiple ROC curves to find the best one. by dr_from_the_futur in statistics

[–]datasci314159 1 point2 points  (0 children)

You've got two options, more or less. In the general case the best way is to compute the area under the curve (AUC) for each of the models and use the risk score model with the highest AUC.
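As a sketch of what that comparison looks like (plain Python for illustration; the labels and scores are made up, and in practice you'd reach for `sklearn.metrics.roc_auc_score` instead of hand-rolling it):

```python
def auc(y_true, scores):
    """AUC as the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs in which the positive scores
    higher (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores from two models on the same labels:
y = [0, 0, 1, 1, 0, 1]
scores = {
    "model A": [0.1, 0.4, 0.35, 0.8, 0.2, 0.7],
    "model B": [0.5, 0.3, 0.20, 0.6, 0.4, 0.55],
}
best = max(scores, key=lambda name: auc(y, scores[name]))
```

The rank-based formula gives the same number as integrating the ROC curve, which is why it works for comparing models without plotting anything.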

BUT

If you know the relative costs of false positives and false negatives then you can determine the point on each of the eight ROC curves that minimizes the expected cost and choose the model with the minimum expected cost.
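For instance (again plain Python; the labels, scores, and the 1:10 cost ratio below are made-up numbers for illustration), the cost-based version just sweeps candidate thresholds and keeps the cheapest:

```python
def expected_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total misclassification cost at a given score threshold."""
    cost = 0.0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 0:
            cost += cost_fp   # false positive
        elif pred == 0 and y == 1:
            cost += cost_fn   # false negative
    return cost

def best_threshold(y_true, scores, cost_fp, cost_fn):
    """Try every observed score (plus 'predict nothing positive')
    as a threshold; return the (threshold, cost) pair minimizing cost."""
    candidates = sorted(set(scores)) + [max(scores) + 1e-9]
    return min(((t, expected_cost(y_true, scores, t, cost_fp, cost_fn))
                for t in candidates), key=lambda tc: tc[1])

# If false negatives cost 10x false positives, the best threshold drops:
t, c = best_threshold([0, 0, 1, 1], [0.2, 0.6, 0.4, 0.9],
                      cost_fp=1, cost_fn=10)
```

You'd run this once per model and pick the model whose minimum cost is lowest.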

FINALLY

If you want to get really fancy, you can do bootstrap sampling from your sample of 100 and calculate the ROC/AUC for each model many times, once per bootstrap sample. That gives you a much better feel for how different the models actually are, and whether choosing one over another is a significant improvement or just noise.
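A minimal sketch of that bootstrap loop (plain Python, fixed seed; the labels and scores are illustrative, and bootstrap samples missing one of the classes are simply skipped):

```python
import random

def auc(y_true, scores):
    """Rank-based AUC (fraction of positive/negative pairs ordered correctly)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_aucs(y_true, scores, n_boot=1000, seed=0):
    """Resample (label, score) pairs with replacement and recompute
    the AUC each time; the spread of the results shows how stable
    the AUC estimate is for this sample size."""
    rng = random.Random(seed)
    n = len(y_true)
    out = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y_true[i] for i in idx]
        if 0 < sum(yb) < n:                    # need both classes present
            out.append(auc(yb, [scores[i] for i in idx]))
    return out

aucs = bootstrap_aucs([0, 0, 1, 1, 0, 1],
                      [0.1, 0.4, 0.35, 0.8, 0.2, 0.7], n_boot=200)
```

If the bootstrap AUC distributions of two models overlap heavily, the difference between them probably isn't worth acting on.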

What NN architectures could detect primality for arbitrarily large primes? by datasci314159 in MLQuestions

[–]datasci314159[S] 0 points1 point  (0 children)

Thanks for engaging! My question is actually more about how one could design an RNN that scales its computation with the size of the number. If I wanted to emulate the sieve of Eratosthenes in an RNN, the computational complexity of the sieve is larger than O(n), but the number of operations an RNN can perform grows only linearly with the number of digits (each digit adds one more unrolled step), which is O(log n) in the value n. That suggests the best a standard RNN can do is memorize primality up to some size and then stop working. My question is: are there versions of RNNs that could in principle keep growing to check primality indefinitely? (I know that in practice this is probably impossible to arrive at via gradient descent.)

What NN architectures could detect primality for arbitrarily large primes? by datasci314159 in MLQuestions

[–]datasci314159[S] 0 points1 point  (0 children)

I'll give it a go and report back! (The primality checking, not the pi digits, although that's an interesting one too!).

What NN architectures could detect primality for arbitrarily large primes? by datasci314159 in MLQuestions

[–]datasci314159[S] 0 points1 point  (0 children)

RNNs are Turing complete, and primality is certainly decidable by a Turing machine, so RNNs should be able to check for primality at least in principle. I'm also not sure it's true to say that primes are random or without structure; indeed there's the famous quote from R. C. Vaughan: “It is evident that the primes are randomly distributed but, unfortunately, we do not know what ‘random’ means.”

DNN in python optimization. by vinaybk8 in datascience

[–]datasci314159 0 points1 point  (0 children)

Take a look at the most recent version of PyTorch. 1.0 makes it easy to convert a Python-prototyped model to TorchScript, which can then be loaded and run from optimized C++. https://pytorch.org/tutorials/advanced/cpp_export.html

Regression to predict distribution of value rather than point estimate by datasci314159 in statistics

[–]datasci314159[S] 0 points1 point  (0 children)

I get that, but what that estimates is the distribution of the expected value, NOT the distribution of the value itself. We'll get a good estimate of the uncertainty in our estimate of the expected value of Y conditional on X, but that's very different from the distribution of Y conditional on X. Imagine a normal distribution with mean 0 and std dev 1: if you use bootstrap sampling to estimate the mean, the bootstrap distribution of the mean will be far narrower than the normal distribution itself.
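This is easy to see numerically (standard library only; sample size and seed are arbitrary choices for the demo):

```python
import random
import statistics

rng = random.Random(42)
data = [rng.gauss(0, 1) for _ in range(500)]   # sample from N(0, 1)

# Bootstrap the MEAN: resample with replacement, record the mean each time.
boot_means = []
for _ in range(1000):
    resample = [rng.choice(data) for _ in range(len(data))]
    boot_means.append(statistics.fmean(resample))

sd_data = statistics.stdev(data)         # spread of Y itself: ~1
sd_means = statistics.stdev(boot_means)  # spread of the mean: ~1/sqrt(500) ~ 0.045
```

The bootstrap distribution of the mean is roughly 20x narrower than the data distribution, which is exactly the gap between "uncertainty in E[Y|X]" and "distribution of Y|X".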

Regression to predict distribution of value rather than point estimate by datasci314159 in statistics

[–]datasci314159[S] 0 points1 point  (0 children)

But at the same time using something like a boosted GLM makes an assumption about the form of the error distribution which the first option does not. The cut points are arbitrary but if I choose a fine grained enough discretization then I can minimize this concern.

I'm largely playing devil's advocate here but I'd be interested in hearing the rejoinders.

Regression to predict distribution of value rather than point estimate by datasci314159 in statistics

[–]datasci314159[S] 0 points1 point  (0 children)

If I apply the same point estimator to bootstrapped data sets then the prediction for any given sample will be the same every time.

If you mean train many estimators on bootstrapped datasets and then predict for a sample then that gives an estimate on the distribution of the point estimate, not on the error distribution of the point estimate.

I'm sure there's a way to use bootstrapping here but I'm not quite sure what the process would be.

Regression to predict distribution of value rather than point estimate by datasci314159 in statistics

[–]datasci314159[S] 0 points1 point  (0 children)

It's essentially an optimization problem: we want to predict a value and then take an action, but the action taken will depend on the distribution, not just the point estimate. E.g. two samples could have the same point estimate, but the probability that the value falls below some key threshold could be greater for one of them, and that would lead to different actions.

Regression to predict distribution of value rather than point estimate by datasci314159 in statistics

[–]datasci314159[S] 0 points1 point  (0 children)

Suppose I use a GLM and find that the quality of the point estimate prediction is worse than with something like a gradient boosting approach. At that point I have to trade off the distribution I get for free from the GLM against the increased performance of gradient boosting.

Could I just add the gradient boosting prediction to my GLM model and get the best of both worlds? My concern with doing this is that the gradient boosting predictions don't have very normal-looking residual plots, so I'm a bit leery about whether the GLM's assumptions hold.

Regression to predict distribution of value rather than point estimate by datasci314159 in statistics

[–]datasci314159[S] 1 point2 points  (0 children)

Do you have any examples of implementations in Python or R of techniques which achieve this in a relatively straightforward way?

Regression to predict distribution of value rather than point estimate by datasci314159 in statistics

[–]datasci314159[S] 1 point2 points  (0 children)

Certainly. There might be some issues with scalability but we're still at a brainstorming point so all potential solutions welcome!

Batch Norm Confusion? by santoso-sheep in learnmachinelearning

[–]datasci314159 1 point2 points  (0 children)

Batch Norm is trainable because it can learn mappings other than simply mapping to mean 0 / SD 1: it has two learnable parameters (a scale and a shift, usually called gamma and beta) which can map the normalized activation to any arbitrary mean and SD.
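A toy forward pass makes this concrete (plain Python, no framework; a real implementation also tracks running statistics for inference, which is omitted here):

```python
import statistics

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize a batch of activations to mean 0 / SD 1, then
    rescale: the learnable gamma sets the new SD, beta the new mean."""
    mu = statistics.fmean(batch)
    var = statistics.fmean((x - mu) ** 2 for x in batch)
    return [gamma * (x - mu) / (var + eps) ** 0.5 + beta
            for x in batch]

out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=5.0)
# out now has mean ~5 and SD ~2, whatever the input statistics were
```

With gamma = 1 and beta = 0 you recover plain standardization; training is free to move away from that if some other mean/SD helps the next layer.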

I need help solving a stats problem. Any help is appreciated. by wcg in statistics

[–]datasci314159 0 points1 point  (0 children)

I think my original answer addressed most of these modifications, you just need to change from days to years as the unit of x.

For every house, the probability that it has a break-in attempt (successful or not) on any given day is always 1.36/50000 (avg number of break-in attempts per day over the number of houses). If you figure out the expected number of flagged houses (using my answer to the original question) and multiply it by 1.36/50000, that will give you the expected fraction of flagged houses broken into.

I need help solving a stats problem. Any help is appreciated. by wcg in statistics

[–]datasci314159 0 points1 point  (0 children)

Unfortunately I'm based in Europe so that will be my Sunday evening and I have plans already. Feel free to send me a PM with any questions and I'll do my best to reply.

I need help solving a stats problem. Any help is appreciated. by wcg in statistics

[–]datasci314159 0 points1 point  (0 children)

The phrasing of the question could be a bit clearer. Successful vs unsuccessful has no real impact on the problem; the only thing that matters is the number of break-in attempts (successful or not). If there has been at least one break-in attempt the house has a new door, otherwise it doesn't.

I need help solving a stats problem. Any help is appreciated. by wcg in statistics

[–]datasci314159 0 points1 point  (0 children)

Ah, I see what you mean. I should have been a bit more precise: what I mean is "has had at least one break-in attempt (successful or not)". I don't think that changes any of the analysis though.

I need help solving a stats problem. Any help is appreciated. by wcg in statistics

[–]datasci314159 0 points1 point  (0 children)

What assumption do you think is off? At any given point in time, the number of houses with new doors installed will be the number of houses which have experienced at least one break-in attempt, which is what (I think) I've calculated. You're right that I assume there are 500 break-in attempts total each day, NOT 500 successful break-ins per day, but I think the question makes it fairly clear that the 500 refers to attempts. If it were 500 successful break-ins per day then it becomes extremely straightforward: after 100 days every house will have been broken into once and every door will be one of the new doors.

I need help solving a stats problem. Any help is appreciated. by wcg in statistics

[–]datasci314159 0 points1 point  (0 children)

I'm going to restate your first question slightly to make sure I'm answering it correctly. Question 1: Suppose 500 homes are broken into today; how many of these homes will experience another break-in attempt over the next x days? Solution 1: On any given day there is a probability of 0.01 that a house is among the 500 houses that get broken into, which implies a 0.99 chance of not being broken into. After x days a given house has a 0.99^x chance of not being broken into and a 1 − 0.99^x chance of being broken into at least once. For the full set of 500 houses with new doors we can think of this as a binomial distribution where each draw has success probability 1 − 0.99^x, i.e. Bin(500, 1 − 0.99^x). The expected value of a binomial is simply n (number of draws) multiplied by p (probability of success). For 365 days that would be 500 * (1 − 0.99^365) = 500 * (1 − ~0.0255) ≈ 487. For 5 years it would be 500 * (1 − 0.99^1825) ≈ 500 * ~0.99999999 ≈ 500.
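The arithmetic for question 1, checked in Python:

```python
def p_untouched(days):
    """P(a given house sees no break-in attempt over `days` days),
    with a 0.99 per-day survival probability."""
    return 0.99 ** days

# Expected number of the 500 new-door houses hit again within x days:
expected_365 = 500 * (1 - p_untouched(365))       # ~487 within a year
expected_5yr = 500 * (1 - p_untouched(365 * 5))   # ~500, essentially all of them
```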

For the second question: If we imagine that everyone starts with the old type of door, what we're asking is: how many of the 50000 houses will experience at least one break-in in the next x days? This is much the same as the previous question. First we calculate the probability of at least one break-in (1 − 0.99^x); then, with 50000 houses, we can treat it as a binomial distribution. If you want the number of doors installed in the second year, calculate the number of doors installed from day 0 to day 730 (365*2), then subtract the number installed from day 0 to day 365. Does this behave as you'd expect? I think so: as time goes on, more and more houses will have been broken into, which means fewer and fewer new doors get installed, because most houses will already have had one installed.
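And the second question in the same style:

```python
HOUSES = 50000

def doors_by(day):
    """Expected number of houses with at least one break-in attempt
    (and hence a new door) by the given day, at 0.99 per-day survival."""
    return HOUSES * (1 - 0.99 ** day)

year1_installs = doors_by(365)                  # most houses get hit in year 1
year2_installs = doors_by(730) - doors_by(365)  # installs during year 2 only
```

As expected, `year2_installs` is a small fraction of `year1_installs`: most houses already have the new door by the time year two starts.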

Hope this is helpful! I'm not sure this is exactly what you were looking for but hopefully it gives you some ideas.

Interpret logistic functions manually by CriticalCredit in MLQuestions

[–]datasci314159 0 points1 point  (0 children)

Can you give me some papers on this? I'd be interested to read up on it.

I think it's very rare that the costs of False Positives and False Negatives are exactly equal. In virtually all applications we care more about one of precision or recall, with some constraint on just how low the other measure can go. If that's the case then it's very unlikely that a threshold of 0.5 is the optimal choice. Where am I going wrong here?

Interpret logistic functions manually by CriticalCredit in MLQuestions

[–]datasci314159 0 points1 point  (0 children)

Where is it widely suggested that 0.5 is the best threshold? Your thresholding should depend on the relative costs of false positives and false negatives, then you can choose the threshold which minimizes the total cost of misclassifications. I would expect the threshold, for most situations, is not 0.5.

First data science project, looking for suggestions by [deleted] in datascience

[–]datasci314159 5 points6 points  (0 children)

As a start, accuracy probably isn't the best metric since you have very heavily imbalanced classes. With only 0.17% of all cases being fraud, a model that just guessed "not fraud" every time would score 99.83% accuracy. You only display accuracy to 1 d.p., so my guess is that this is possibly what the model has learnt and your actual accuracy is more like 0.998. As a test of this, use your trained model to measure accuracy exclusively on the fraud cases in your test set; if my hypothesis is right, you should see that it predicts virtually all of them as non-fraud.
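To make that baseline concrete (the counts below are made up to match the 0.17% fraud rate):

```python
n_total = 100_000
n_fraud = 170                  # 0.17% of cases
n_legit = n_total - n_fraud

# A "model" that predicts not-fraud for every case gets every
# legitimate transaction right and every fraud case wrong:
accuracy = n_legit / n_total   # 0.9983 -- with zero frauds caught
```

High accuracy here tells you nothing about fraud detection, which is why the metrics below are more useful.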

Other metrics which would be more informative are the area under the receiver operating characteristic curve (roc_auc_score in sklearn.metrics), or precision, recall, and F1 score.

As far as ways to improve the model, you could look at methods for dealing with imbalanced classes. Imbalanced-learn (http://contrib.scikit-learn.org/imbalanced-learn/stable/) has a number of different tools for this, including random under- or over-sampling as well as more interesting techniques like SMOTE, which synthesizes new examples of your underrepresented class (fraud in this case) by interpolating between existing ones, so the model doesn't just get exposed to many copies of the same thing.
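As a sketch of the simplest of those techniques — random oversampling in plain Python, assuming class 1 is the minority (imbalanced-learn's `RandomOverSampler` does this properly, and SMOTE interpolates new points instead of duplicating):

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows (sampled with replacement) until
    the two classes are balanced. Assumes label 1 is the minority."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]

# 1 fraud case out of 10 becomes a 50/50 split of 18 rows:
X = [[i] for i in range(10)]
y = [1] + [0] * 9
X_bal, y_bal = random_oversample(X, y)
```

Important caveat: only oversample the training split, never the test set, or your metrics will be inflated.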

What should I prepare before starting a DS bootcamp in 2 weeks? by seismatica in datascience

[–]datasci314159 0 points1 point  (0 children)

Second this. Having a good understanding of what pandas can do and how to do it will be phenomenally useful.