Archive for November, 2008

Guaranteeing accuracy and precision

Discussions in this blog on precision error estimation via 1 minimization have made it clear that the technique is only good for recovering precision, not accuracy. This post will argue that accuracy can also be guaranteed if all the detectors have a greater than one-half probability of being correct.

By “guaranteed”, I mean that I have a high probability of being correct and that furthermore, I can estimate this probability with high confidence. The idea is not new to me and follows a well-known result: if all detectors have P true>1 /2 then, given enough of them, you can guarantee that simple majority voting will be correct at whatever level you want, for example, ninety-nine per cent of the time.

The idea follows from the following formula

P voting correct= i=n/2 +1 n(n i)p i(1 p) ni,

where p is the probability of each system detecting the right answer (which the formula assumes is the same for all of them. The formula also assumes that the detectors are uncorrelated. This is an important point and one I will return to in future posts. For the moment, I will assume detector independence to make the discussion simpler. The technique I am proposing, however, does not require it.

The probability of simple voting being correct shown in the formula above comes from counting the sequences where more than half the detectors give you the right answer. My main argument is that if you know the detectors are better than one-half individually, you can do quite well given enough of them.

The following plot shows the probability of being correct for five detectors as a function of their individual ability to figure out the correct answer.

Probability of majority voting being correct for five detectors
five-detectors

The dashed line shows probability that majority voting gives you the right choice. Note that for p single detector correct<1 /2 , it pays to ditch voting and just use the output of a single detector. Note also, that there is a point of maximum relative gain. For five detectors this occurs around 0.72 which gets you 0.86 probability of being correct.

Future posts will discuss how we can lever this simple idea in situations where we have no ground truth and are therefore incapable of estimating our correctness rate. Hint: it involves turning the autonomous difference equations used for the mapping application of precision error into majority voting equations.

Site CSS broken

My introduction of cool pop-out images has created all sorts of bad css entries for this site. The site no longer validates as having well-formed CSS (check out the validate button at the bottom of this page). I apologize to the purists out there. I am proud to say, however, that it validates as correct XML and MathML.

Robust voting in uncertain environments

Combining the judgments of different recognizers is always better than using the best one alone. This observation is universal in machine learning realms. But it seldom gets used in practice. Why?

For one, it costs more to implement. Instead of one recognizer, you must deploy several. Computing cycles grow linearly with the number of recognizers if they all have roughly proportional cost. In academic circles, you get rewarded for finding the best recognizer/algoritm so there is little incentive to work in using the work of others along with your own. Companies usually haveĀ  proprietary rights to a few recognizers and want to deploy their best one to save development costs, hog less cycles on the customer’s machine, etc.

in certain conditions, however, it actually pays to use different recognizers and to incur the extra development and computing costs. Noisy or uncertain deployment environments can seriously degrade the performance of a well-tuned system. In such situations, the recognizer’s output could be worse than random and it would be impossible to place any certainty on it output.

The situation changes dramatically when we use different recognizers. To see this, consider the case of multiple recognizers, all of which have a probability greater than one half of getting the right answer. One can establish rigorous bounds for the number of recognizers that we would need to guarantee that a simple majority vote would have, say, greater than 99% probability of being right. This is discussed in the article on the Chernof Bound in Wikipedia.

But simple majority voting has its problems. For one, it is incredibly costly. Suppose that your detectors where correct two-thirds of the time. it would take at least 15 detectors to have 90 per cent probability of being correct when you averaged their individual votes

In addition, what if the testing condition starts to differ too much from the training data? The detectors could have degraded performance and 15 detectors may only give you fifty per cent chance of being correct. How would you know that your probability of being correct has dropped?

In future posts, I will explain a scheme using the precision error methodology that I have discussed in previous posts to create a robust voting system that automatically weighs the vote of a collection of systems according to their actual accuracy on the data, and it does this without any ground truth and in a completely data-driven way. Two system may “flip”, one becoming more accurate than the other and the algorithm I propose to combine their decisions would detect this reversal. The practical effect of this scheme is that one can predict, in the field, automatically the probability that one’s collection of recognizers is correct with an efficiency that is higher than simple voting. In compressed sensing, you use less sensors to reconstruct the same signal. I propose something similar, but for decision making: using less decision makers in a robust fashion to make the correct recognition decision.