A simple example of how to stabilize classification errors in uncertain environments by sacrificing accuracy
I corrected an assertion in my earlier post on the difference between classification rates and classification errors. I mistakenly asserted that for a training dataset with a 70/30 prevalence of binary labels the best random classifier is the 70/30 one. This is not so. The best one is the 100/0 classifier since it will correctly classify the prevalent label all of the time and will therefore have an error rate of 30% on the 70/30 training dataset. This is better than the 70/30 random classifier which has a classification error of 42% on this training dataset.
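To see why the 100/0 classifier comes out ahead, it may help to write the error out. As a sketch, assume a random classifier that assigns the prevalent label with probability q, independently of the input, and data in which that label has prevalence p. The symbols q, p, and e are introduced here for illustration and do not appear in the original post.

```latex
\begin{align}
e(q, p) &= p\,(1 - q) + (1 - p)\,q = p + q\,(1 - 2p) \\
e(0.7,\, 0.7) &= 0.7 \cdot 0.3 + 0.3 \cdot 0.7 = 0.42 \\
e(1,\, 0.7) &= 0.7 \cdot 0 + 0.3 \cdot 1 = 0.30
\end{align}
```

Under these assumptions the error is linear in q with slope 1 - 2p, which is negative whenever p > 1/2, so the error on the training prevalence is smallest at q = 1.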
But this clarification provides a simple example of how sacrificing accuracy stabilizes errors in an uncertain environment. As the previous post shows, the error rate of the 70/30 classifier rises by 4 percentage points, from 42% to 46%, when it is tried on data that has a 60/40 ratio. That is a relative increase of 9.5% (4/42). But the better random classifier, the 100/0 one, sees its error rate rise from 30% to 40%, a relative increase of 33% (10/30)! It is overtrained and therefore much more unstable when applied to data that differs from the characteristics of its training set. The moral: increased stability of the error rate can be achieved by sacrificing accuracy.
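The arithmetic behind these figures can be checked with a few lines of Python. This is a minimal sketch, assuming a random classifier that predicts the prevalent label with probability `q` regardless of the input; the function names `error_rate` and `relative_increase` are made up for this illustration.

```python
def error_rate(q, p):
    """Expected error of a random classifier that predicts the prevalent
    label with probability q on data where that label has prevalence p."""
    # Errors: true prevalent label predicted as the other one, or vice versa.
    return p * (1 - q) + (1 - p) * q


def relative_increase(q, p_train, p_new):
    """Relative change in error when the prevalence shifts from p_train to p_new."""
    base = error_rate(q, p_train)
    return (error_rate(q, p_new) - base) / base


# Compare the 70/30 and 100/0 classifiers when the data shifts from 70/30 to 60/40.
for q in (0.7, 1.0):
    before = error_rate(q, 0.7)
    after = error_rate(q, 0.6)
    change = relative_increase(q, 0.7, 0.6)
    print(f"q={q:.1f}: error {before:.2f} -> {after:.2f}, relative increase {change:.1%}")
```

The 70/30 classifier trades 12 points of error on the training prevalence for a much smaller relative swing when the prevalence moves.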
Of course, these are just specific examples. The way to quantify the benefit of sacrificing accuracy is to consider the average stability of a random classifier under various degrees of variability in the input. This can be done, for example, by generating synthetic input data whose label prevalence is drawn from a beta distribution with varying amounts of spread. As the variability knob on the beta distribution is turned up, one should find that the best strategy for stabilizing error on average is to sacrifice accuracy.
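One possible way to set up such an experiment is sketched below. It rests on several assumptions that are not in the post: the prevalence of the majority label is drawn from a beta distribution with mean 0.7, the "variability knob" is the beta concentration (alpha + beta, where smaller means more spread), and stability is measured as the standard deviation of the error across draws. The names `error_rate` and `error_spread` are hypothetical.

```python
import random
import statistics


def error_rate(q, p):
    """Expected error of a random classifier that predicts the prevalent
    label with probability q on data where that label has prevalence p."""
    return p * (1 - q) + (1 - p) * q


def error_spread(q, mean=0.7, concentration=50.0, n=20000, seed=0):
    """Mean and standard deviation of the error when the prevalence is
    drawn from Beta(mean * concentration, (1 - mean) * concentration)."""
    rng = random.Random(seed)  # fixed seed so every q sees the same draws
    a = mean * concentration
    b = (1 - mean) * concentration
    errors = [error_rate(q, rng.betavariate(a, b)) for _ in range(n)]
    return statistics.mean(errors), statistics.pstdev(errors)


if __name__ == "__main__":
    # Lower concentration means more variability in the prevalence.
    for concentration in (200.0, 50.0, 10.0):
        for q in (1.0, 0.7, 0.5):
            mean_err, spread = error_spread(q, concentration=concentration)
            print(f"concentration={concentration:5.0f} q={q:.1f} "
                  f"mean error={mean_err:.3f} spread={spread:.4f}")
```

In this setup the spread of the error scales with |1 - 2q|, so it shrinks as q moves from 1.0 toward 0.5 while the mean error grows. Other definitions of stability, such as relative change in error, could rank the classifiers differently and would need their own check.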
Now suppose that a user/employer comes to you and asks that you design a classifier based on a 70/30 training dataset. Should you give them the 100/0 or 70/30 random classifier? I think the responsible thing to do is to give them the 70/30 classifier. It will also make them happier! I don’t think many users/employers are going to be happy with a classifier that gets one class completely wrong all of the time in spite of your pleas that it is more accurate! Why? Because users always have costs associated with the errors of your classifier. Would you be happy with a spam filter that was 70% accurate but delivered no mail?
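The spam question can be made concrete with a small cost calculation. This is only a sketch with invented numbers: it assumes 70% of mail is spam, that blocking a legitimate message costs ten times as much as letting a spam message through, and that the filter labels mail as spam at random with probability `q`. The function `expected_cost` and both cost values are hypothetical.

```python
def expected_cost(q, p_spam=0.7, cost_missed_spam=1.0, cost_blocked_mail=10.0):
    """Expected cost per message for a random filter that labels mail as
    spam with probability q. The cost values are invented for illustration."""
    missed_spam = p_spam * (1 - q)    # spam that reaches the inbox
    blocked_mail = (1 - p_spam) * q   # legitimate mail that gets blocked
    return missed_spam * cost_missed_spam + blocked_mail * cost_blocked_mail


# 100/0 blocks everything, 70/30 blocks at random, 0/100 delivers everything.
for q in (1.0, 0.7, 0.0):
    accuracy = 0.7 * q + 0.3 * (1 - q)
    print(f"q={q:.1f}: accuracy {accuracy:.2f}, expected cost {expected_cost(q):.2f}")
```

Under these made-up costs the 100/0 filter is the most accurate of the three and also the most expensive, which is the sense in which accuracy alone can mislead. With different costs the ordering would change.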
October 3rd, 2007 at 2:50 pm
I think the question at hand is: “Which performance metric is most relevant?” Observation-level accuracy is known to suffer from exactly the peculiarity which you describe. My experience in practice, though, is that many clients want probability outputs (as opposed to class label outputs) and need to maximize class separation (measured by, for instance, AUC).
October 8th, 2007 at 7:42 am
I agree, Will. As you say, the question really is what metric is most relevant to the task at hand. My rhetorical excesses are meant to counteract the obsession with highest accuracy that I encounter every day at work.
Since I am using random detectors, the area under the ROC curve (AUC) is fixed for all of my examples. As you say, clients want us to reduce the AUC. But not always. I have a funny story about this from when I worked at Dragon Systems.
Dragon participated in a series of speaker identification competitions sponsored by the NSA in the 1990s. I was lead developer for a system that won 6 out of 9 testing conditions. By won, I mean we got a lower AUC than any of the other entries, including those from research powerhouses like IBM and BBN. George Doddington had stated he would give 25 dollars for the ‘best’ system in each testing condition. When he got up to award his prizes during the conference with the competition participants, I started licking my chops over the 150 dollars I was going to get. George then proceeded to award all the prizes to other systems because his criterion was not lowest AUC! He chose the systems that set their operating point at their lowest cost! That is, even though we had fielded a system that could attain the lowest cost for 6 of the 9 conditions, we had not set it to the lowest cost it could get.
The lesson to me: users sometimes want to minimize costs with what they have.