Archive for the 'Randomness' Category

On the usefulness of glorious failures

Jon Stewart’s comments on the usefulness of the Apollo missions:

It was that fateful day in July that we planted the Stars and Stripes in the lunar surface, officially claiming the moon as America’s space Puerto Rico. It was all ours. It was the culmination of a dream. … It took us ten years, astronauts’ lives, billions of dollars, and all we did is hit a f***ing golf ball?

have made me think about how exploration is a good onto itself even when it fails. Indeed, it seems that the unintended consequences or discoveries of such explorations are good enough to justify our urge to carry them out.

I offer as my first example of unintended consequences a single photo — the famous Earthrise captured by the Apollo 8 mission on its fourth orbit around the Moon. As recounted in Richard Poole’s Earthrise: How Man First Saw the Earth it was mission commander Frank Borman that first noticed the Earth rising above the gray moonscape and pointed it out to William Anders, the mission scientific crew member. During the previous three lunar orbits, Anders had been busy photographing the lunar surface for possible landing sites. Anders handed the camera with black and white film he was using to Borman but then proceeded to warn him not take any pictures – “Hey, don’t take that, it’s not scheduled.”. Quickly reversing himself, Anders picked up a roll of color film and took the now iconic photo. The cultural influence of this photo is richly detailed in Poole’s book. I’ll mention one that greatly influenced me: it was used as the front cover image for the Last Whole Earth Catalog. As others have pointed out, it is ironic that not one of the people that envisioned space travel ever thought that its most important result would be the perspective we would gain on our home planet. Norman Cousins stated during a 1975 Congressional hearing on the future of space travel: “what was most significant about the lunar voyage was not that men set foot on the Moon, but that they set eye on the Earth.”

My second example on the unintended consequences of exploration and how failures can sometimes be glorious is Columbus’ Voyage to the Americas. Wishing to reach the East Indies, believing that the Earth was actually half as big as it really is, he hit upon the Americas and to his last days did not realize what he had done. We do not know what we will find when we search out from the safety of our known world, and that is why we should.

Apology to my few readers

I want to apologize to my few readers for the lack of posts since May of last year. This is partly due to personal circumstances that overwhelmed me emotionally and mentally. Since last Fall, however, I have rebounded and have been, since Thanksgiving, been engaged in an immensely satisfying intellectual journey that has greatly generalized the work in the ICML 2008 paper.

Unfortunately, since September 29th, I have been employed as a Research Scientist at the Mobility Division of Nuance Communications. This means that my intellectual output now belongs to my employer and I am obligated to mantain the secrecy of my work until such time as it is patented.

My plans for the immediate future of this blog areĀ  to concentrate on general scientific and “philosophical” matters related to the issues I have explored in previous posts and whatever else strikes my fancy.

Guaranteeing accuracy and precision

Discussions in this blog on precision error estimation via 1 minimization have made it clear that the technique is only good for recovering precision, not accuracy. This post will argue that accuracy can also be guaranteed if all the detectors have a greater than one-half probability of being correct.

By “guaranteed”, I mean that I have a high probability of being correct and that furthermore, I can estimate this probability with high confidence. The idea is not new to me and follows a well-known result: if all detectors have P true>1 /2 then, given enough of them, you can guarantee that simple majority voting will be correct at whatever level you want, for example, ninety-nine per cent of the time.

The idea follows from the following formula

P voting correct= i=n/2 +1 n(n i)p i(1 p) ni,

where p is the probability of each system detecting the right answer (which the formula assumes is the same for all of them. The formula also assumes that the detectors are uncorrelated. This is an important point and one I will return to in future posts. For the moment, I will assume detector independence to make the discussion simpler. The technique I am proposing, however, does not require it.

The probability of simple voting being correct shown in the formula above comes from counting the sequences where more than half the detectors give you the right answer. My main argument is that if you know the detectors are better than one-half individually, you can do quite well given enough of them.

The following plot shows the probability of being correct for five detectors as a function of their individual ability to figure out the correct answer.

Probability of majority voting being correct for five detectors The dashed line shows probability that majority voting gives you the right choice. Note that for p single detector correct<1 /2 , it pays to ditch voting and just use the output of a single detector. Note also, that there is a point of maximum relative gain. For five detectors this occurs around 0.72 which gets you 0.86 probability of being correct.

Future posts will discuss how we can lever this simple idea in situations where we have no ground truth and are therefore incapable of estimating our correctness rate. Hint: it involves turning the autonomous difference equations used for the mapping application of precision error into majority voting equations.

Site CSS broken

My introduction of cool pop-out images has created all sorts of bad css entries for this site. The site no longer validates as having well-formed CSS (check out the validate button at the bottom of this page). I apologize to the purists out there. I am proud to say, however, that it validates as correct XML and MathML.

Robust voting in uncertain environments

Combining the judgments of different recognizers is always better than using the best one alone. This observation is universal in machine learning realms. But it seldom gets used in practice. Why?

For one, it costs more to implement. Instead of one recognizer, you must deploy several. Computing cycles grow linearly with the number of recognizers if they all have roughly proportional cost. In academic circles, you get rewarded for finding the best recognizer/algoritm so there is little incentive to work in using the work of others along with your own. Companies usually haveĀ  proprietary rights to a few recognizers and want to deploy their best one to save development costs, hog less cycles on the customer’s machine, etc.

in certain conditions, however, it actually pays to use different recognizers and to incur the extra development and computing costs. Noisy or uncertain deployment environments can seriously degrade the performance of a well-tuned system. In such situations, the recognizer’s output could be worse than random and it would be impossible to place any certainty on it output.

The situation changes dramatically when we use different recognizers. To see this, consider the case of multiple recognizers, all of which have a probability greater than one half of getting the right answer. One can establish rigorous bounds for the number of recognizers that we would need to guarantee that a simple majority vote would have, say, greater than 99% probability of being right. This is discussed in the article on the Chernof Bound in Wikipedia.

But simple majority voting has its problems. For one, it is incredibly costly. Suppose that your detectors where correct two-thirds of the time. it would take at least 15 detectors to have 90 per cent probability of being correct when you averaged their individual votes

In addition, what if the testing condition starts to differ too much from the training data? The detectors could have degraded performance and 15 detectors may only give you fifty per cent chance of being correct. How would you know that your probability of being correct has dropped?

In future posts, I will explain a scheme using the precision error methodology that I have discussed in previous posts to create a robust voting system that automatically weighs the vote of a collection of systems according to their actual accuracy on the data, and it does this without any ground truth and in a completely data-driven way. Two system may “flip”, one becoming more accurate than the other and the algorithm I propose to combine their decisions would detect this reversal. The practical effect of this scheme is that one can predict, in the field, automatically the probability that one’s collection of recognizers is correct with an efficiency that is higher than simple voting. In compressed sensing, you use less sensors to reconstruct the same signal. I propose something similar, but for decision making: using less decision makers in a robust fashion to make the correct recognition decision.

Minimum number of questions revisited

To show off the installation of FancyZoom (a trick I learned while visiting the excellent Language Log), I present a graph of the percentage variation in the mean square precision error as a function of the number of questions used to compute it. The image looks small but you can now click on it to obtain a zoomed in version. Try it!

Variance of precision error estimates for question 9 as a function of number of questions used. Note how good the fit is to a shifted exponential function of the form:
a+b*exp(c(n q3 )).
The measurements are the small dots at n q={4,6,7,10,12 }. The fitted values are a=0.06 , b=0.2 , and c=0.43 . The variable c is the decay constant for the variability in the estimate. In particular, if you calculate its inverse 1 /c you get the number of questions beyond three that will give you less than 33% variability in the estimate. This turns out to be about 2 questions. So ten or twelve questions should be enough for this group of students.

Once again, this suggests that teachers are asking too many questions in their multiple choice exams.

ICML accepts precision error via L1 minimization paper

Our technical report on how to recover precision error estimates with 1 -minimization has been accepted by the 2008 International Conference on Machine Learning.

The paper originally got three anonymous reviews. Two were positive, one strongly negative. In our response to the reviews, we agreed with the general criticism by the reviewers that one experimental demonstration is not enough. In our precision error papers so far, we have only been using one dataset — aerial photographs from the Twenty-Nine Palms region in California. So we are going to include some results from North Carolina forest data to show that our technique works for all sorts of images.

Readers of previous posts may note that besides maps, the precision error has been recovered for questions in a multiple-choice-quesiton (MCQ) exam. It would be nice to include this in our ICML paper, but the title of the paper is “Autonomous geometric precision error estimation in low-level computer vision tasks” so it seems incongruous to do so.

The paper was submitted in early January. Afterwards, we realized that our precision error technique for elevation errors in maps applies to any set of models that make scalar predictions about multiple entities. We are now working on a draft for a Science magazine article that will combine the examples from maps and exams to illustrate the wide applicability of our technique.

Student answers versus random answers

An interesting baseline for thinking about precision error is to consider the case of uniformly random answers. The student may be completely ignorant: you gave a college level test to kindergarten kids. Your questions were so hard or so incomprehensible (think Chris Kattan’s mumbling character giving a “uupp-uizzz” (pop-quiz) to this students) that students are just guessing.

Precision error for random answer responsesActual student answers Note how the answer look uniformly gray. There really is no pattern in the students response. This is a uniform group answering this exam — the random answering makes everyone belong to the same group.

This second figure is actual student answers in a test. The grayness of the diagonal squares is varying. Some questions have a precision error lower than random! Others, just as high. The noisy (imprecise questions) are the ones that gave students the most trouble.

Mean precision error equations

Back in December, Howard Schultz and I wrote a paper for the IEEE Computer Vision and Pattern Recognition 2008 conference. The paper, by the way, was rejected by the reviewers (this should form an interesting article at some future time). In the paper we calculate the horizontal decorrelation length for a collection of maps using a variogram technique.

The plots were not coming out right and Howard pointed out that I had not de-meaned the data. In particular, I had not de-meaned the average precision error. That is, the precision error for a model has a mean.

For some reason, I had thought that this mean error could be exactly solved with the precision error equations. It cannot. Given m models, the precision error equations give you m1 equations. So the mean value also has to be recovered with 1 -minimization.

How should one interpret this mean precision error? In the case of exams, the mean precision error for a question could be interpreted as the hardness or ease for the question. Hard questions will get answered correctly by few people, easy ones will be answered correctly by everyone.

This is what I observe in the introductory Physics class exam I have been analyzing. The hardest questions have the smallest mean error. The easiest ones have the largest mean error.

The correct way of defining the covariance matrix involves substracting this mean. That is, the entries should be of the form δ iδ jδ iδ j and not just plain δ iδ j as I have been writing in previous posts.

Who is causing the variations in predictions?

Precision error is a measure of the variation in the predictions of a collection of models. If there are no variations — all the models agree. There is no precision error. One is perfectly in focus as far as one can tell. But, of course, scientific models disagree. So who is to blame? Is it the data or is it the algorithms used to process the data (the models). Having enough models allows you to do decide who is to blame.

Consider the case of a piece of data used to train one of the models that always lead to a disagreement, no matter what algorithm is used to process it. Who is to blame in this case? The optimal choice seems to me to be to decide that the data is bad.

Now consider an algorithm that always disagrees with all the other models no matter what set of predictions are compared. This seems to suggest that the model is wrong, not the data.

In both these cases, the availability of a large number of models is what allows one to distinguish the two cases. Real data will not be as stark as the examples above. Here is where Fourier analysis and probability theory come in. As the number of models increases one is able to disentangle the two. For small number of models, blaming the data or the model would explain the observed variation equally well. As the number of models increase, assigning blame becomes asymmetric!

This is sort of like the “Is it me or is him/her?” question. Comparing ourselves to only one other person does not allow us to decide who is the crazy one. But the more people we interact with, the sooner we realize who is to blame.

I’ll try to come up with a simple example with a few models to illustrate the point mathematically in a later post.