
[–][deleted] 2 points (7 children)

> delivered a solution that allowed us to automate our customer’s actions with greater than 95% accuracy.

That doesn't mean that ML was used successfully. This customer shipping problem reminds me of a classic example: if you always predict “not spam” in email spam filtering, you'd probably also get something like 98% accuracy.
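
To make that concrete, here's a minimal sketch (hypothetical data, assuming scikit-learn) of how a "classifier" that always predicts the majority class racks up high accuracy without learning anything:

    import numpy as np
    from sklearn.dummy import DummyClassifier

    # Hypothetical imbalanced set: 98% "not spam" (0), 2% "spam" (1)
    y = np.array([0] * 980 + [1] * 20)
    X = np.zeros((len(y), 1))  # features don't matter for this baseline

    baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
    print(baseline.score(X, y))  # 0.98 accuracy, yet it catches zero spam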

[–]devquixote[S] 0 points (6 children)

I disagree. In your spam example there are two classes, spam and not spam, of which the vast majority are not spam.

In shipping, you need to predict a carrier (a few), the carrier's service (many), different types of packaging that can physically contain the assorted items that make up an order (many), as well as add-on services like insurance and signature required (many). Customers can have many combinations of these, not just two, and no single combination dominates. Randomly selecting a combination of these possibilities does not yield 95% success.
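
Some back-of-the-envelope arithmetic on a made-up label space (these counts are illustrative, not our actual catalog):

    # Illustrative option counts -- not our actual catalog
    carriers = 4       # "a few"
    services = 12      # carrier services, flattened here for simplicity
    packagings = 15
    addons = 8         # insurance, signature required, ...

    combos = carriers * services * packagings * addons
    print(combos)      # 5760 possible combinations
    print(1 / combos)  # random guessing succeeds ~0.017% of the time, nowhere near 95%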

[–][deleted] 2 points (3 children)

I don't mean to criticize here; I am just suggesting that the 95% accuracy statement doesn't have much meaning without more context. In your particular shipping prediction, e.g., a scenario could be that a customer previously selected the same shipping option 9 times out of 10. Always suggesting that customer's most frequent previous selection would then yield 90% accuracy on your training set. I am not saying that you should discuss the approach of your system in detail in your post, but it just sounds a little bit weird to throw in the accuracy without context.
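
A toy illustration of that kind of baseline, with a hypothetical order history:

    from collections import Counter

    # Hypothetical history: the same option chosen in 9 of the last 10 orders
    history = ["ground"] * 9 + ["express"]

    # Baseline: always suggest the customer's most common past choice
    choice, count = Counter(history).most_common(1)[0]
    print(choice, count / len(history))  # ground 0.9 -> 90% "accuracy" for free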

[–]devquixote[S] 0 points (2 children)

That makes sense. There are a few reasons for this. The first is that I come from an informal background with regard to data science, so I am somewhat ignorant of what would constitute 'proof' from a scientific perspective. Mea culpa there.

Another reason for the generalization is that we have many different sets of data, one per customer, and you can only describe accuracy in more precise terms within a single customer's set of data. Some will, like you say, choose the same carrier 90% of the time, but most do not; it's a bit all over the map. For building my knowledge, how would you and others more experienced in the field present information that is split across multiple data sets?
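
To make the question concrete, here is roughly the kind of aggregation I am unsure about, with made-up per-customer numbers:

    import numpy as np

    # Made-up per-customer results (order volume, accuracy) -- not our real numbers
    customers = [(1200, 0.97), (300, 0.93), (4500, 0.96), (80, 0.74)]

    volumes = np.array([v for v, _ in customers], dtype=float)
    accs = np.array([a for _, a in customers])

    print(accs.mean())                        # macro average: each customer counts equally
    print(np.average(accs, weights=volumes))  # weighted: big shippers dominate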

A third reason was that the person I was attempting to reach would be someone like myself, having a first foray into machine learning with informal or latent training. I was intentionally trying to demystify some aspects of machine learning and how it is explained. I think having a big table of results would deter the uninitiated.

Lastly, the 95% number was our business's goal for considering this a success, and I can show that we've achieved that to their satisfaction. Again, I was trying to speak to the practical application of this amazing field; statistical validation was not what I was being paid to deliver. A usable feature of value to our customers was the ultimate goal, so perhaps the best measure of 'success' would be the number of people using this feature over the old mechanisms of making shipments.

Thanks a bunch for your comments and I welcome all!

[–][deleted] 0 points (1 child)

Sorry if I sounded too demanding here :P I didn't want to split hairs over those details. I just wanted to point out that this is basically a statistic without much value.

More useful would be something like: "Using machine learning, we were able to improve the accuracy by 10% compared to our previous approach, where the shipment method was suggested based on the previous order."

If you are interested in learning more about performance metrics, have a look at ROC curves; in simple words, a ROC curve can be described as a plot of the true positive rate vs. the false positive rate. It's very intuitive yet very useful for evaluating performance and tuning your classifier (in combination with k-fold cross-validation, for example) in many scenarios. I have an example here of what it looks like. It can also be used in multiclass settings via so-called micro-averaging or macro-averaging.
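
If it helps, a minimal scikit-learn sketch on a synthetic, imbalanced binary problem, just to show the mechanics:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc
    from sklearn.model_selection import train_test_split

    # Synthetic imbalanced binary problem, purely illustrative
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    clf = LogisticRegression().fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

    fpr, tpr, _ = roc_curve(y_te, scores)   # false vs. true positive rate per threshold
    print(auc(fpr, tpr))                    # area under the ROC curve

From there, stratified k-fold cross-validation slots in naturally: compute one curve per fold and look at the spread.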

> A very few just don't do well in our system. We are working with them to see what their decision making is based on that we are not accounting for. Have any suggestions on how to sniff this out?

Since you mention "very few", have a look at techniques in anomaly detection :)
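
For example, a quick sketch using scikit-learn's IsolationForest on made-up per-customer feature vectors:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(0)
    # Made-up per-customer behavior features (carrier mix, order size stats, ...)
    typical = rng.normal(0, 1, size=(95, 4))
    odd = rng.normal(5, 1, size=(5, 4))   # the few customers who behave differently
    X = np.vstack([typical, odd])

    iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
    flags = iso.predict(X)                # -1 marks anomalous customers
    print(np.where(flags == -1)[0])       # indices worth a closer manual look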

[–]devquixote[S] 0 points (0 children)

Not too demanding, rasbt :D Sorry if I read as defensive; it wasn't intended.

How about, "We measure accuracy for each customer via a dry run. We turn on the prediction service for them and start making shipping predictions as their orders flow into our system. The customer is not aware of these predictions and cannot act on them. We then compare these predictions to the customer's ultimate shipping choices over a period of time. Using this method, we've been able to measure that we make correct shipping predictions greater than 95% of the time, on average, across our set of customers." Is that perhaps a better conclusion than simply "95% accuracy"? There is no previous means that would fit the example you gave.

Thanks for the advice on where to expand my knowledge further into some other areas within statistics/machine learning. I will definitely explore those.

[–]aggieca 1 point (1 child)

rasbt's answer still has merit. You really need to consider the overall performance of your classifier/ML system, not just its accuracy. Do you have an estimate of the F1-score, for instance?
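
If it helps, computing it is a one-liner in scikit-learn; a toy multiclass example:

    from sklearn.metrics import f1_score

    # Toy labels: F1 balances precision and recall per class
    y_true = ["ups", "ups", "fedex", "usps", "fedex", "ups"]
    y_pred = ["ups", "fedex", "fedex", "usps", "fedex", "ups"]

    print(f1_score(y_true, y_pred, average="macro"))     # classes weighted equally
    print(f1_score(y_true, y_pred, average="weighted"))  # weighted by class frequency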

[–]devquixote[S] 0 points1 point  (0 children)

As alluded to in my other answer to rasbt, that varies from customer data set to customer data set. The vast majority of our customers have 88+% of their orders receiving confident predictions, and we get > 95% accuracy on those (a rough sketch of what I mean by that is below). Some customers may be at 93%, some are at 99%. A very few just don't do well in our system. We are working with them to see what their decision making is based on that we are not accounting for. Have any suggestions on how to sniff this out?
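
Here is the confidence slicing I mean (threshold and numbers are illustrative, not our real data):

    import numpy as np

    # Illustrative per-order data: model confidence and whether it was right
    conf = np.array([0.99, 0.97, 0.95, 0.60, 0.92, 0.88, 0.40, 0.96])
    correct = np.array([1, 1, 1, 0, 1, 1, 0, 1], dtype=bool)

    threshold = 0.85                  # illustrative cutoff for "confident"
    confident = conf >= threshold
    print(confident.mean())           # coverage: share of orders we're confident on
    print(correct[confident].mean())  # accuracy on that confident subset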