“Wrong” again

Reading David Freedman’s “Wrong” is making me feel dirty. As a scientist, that is. He has collected a whole menagerie of ways that experts can go wrong. From cops to doctors, business gurus to scientists, he spares no one. The two chapters that are discomforting for me are the “The trouble with scientists, Part I” and “The trouble with scientists, Part II”. He focuses on “surrogate measurements” – the use of one thing to measure another one. We cannot test new drugs on humans, so we test them on mice. We cannot tell what you are thinking, so we take a fMRI scan to see where blood is flowing in your brain. One devastating example of how this can go wrong is the drug Avastin.

Avastin shrinks cancer tumors. That was what it was designed to do, and it does it well. Shrink the tumor, cancer cured. That was the theory, at least. One measurement, tumor size, is a surrogate for another one – cancer prognosis. In this case, a negative surrogate, if you will – decreasing one (tumor size) increases the other one (cancer survival). Trouble is, it decreases tumors but it does not increase survival. And it increases other risk factors (heart attack, bowel perforations).

Freedman then goes on to discuss how the 6 billion a year that is being spent in bioinformatics to link genes to disease has yielded so little tangible medical benefits – an observation recently mentioned in connection with the tenth anniversary of sequencing the human genome. Again, surrogate measurements are to blame: the presence of genetic variants in diseased individuals is taken to measure the likelihood of the disease and is taken itself to be an explanatory cause. You get the disease because you have the defective genes.

I am trying to think how this relates to probability and Bayes’ Theorem. Is it that we assume that if event $b$ follows event $a$ the converse must be true? Or in mathematical terms
$$ p(b | a) \approx p(a | b)$$

This notion of surrogate measurements also occurs in my field, machine learning. One salient example with which I am familiar was mentioned in 2003 at a talk by a NSA director at MITRE. He was discussing the large funding effort given to large vocabulary continuous speech recognizers (the LVCSR program) in the 90′s. NSA scientists were convinced that if we could word spot, we could make better filters for security related communications. Select the telephone calls that mention words like “bomb”, etc., and you get a higher proportion of messages that are related to your security concerns. The LVCSR program (in which I participated during my years at Dragon Systems) had the natural arc of these research initiatives. Rapid gains were observed at the beginning, but eventually all systems saturated to the same level of performance.

The funders wanted to pull the plug when this happened. The program directors responded by carrying out a cheating experiment. They concocted a test set where they had perfect transcriptions of a set of telephone conversations and measured how well they could pick out the threatening ones. To their surprise, word spotting, sucked as a detector of high-importance messages. The ability to solve one problem (speech recognition) was not the answer to the real problem (spot the threatening calls). Even if you had perfect recognition, which we did not at that time, you still would not be able to significantly increase your chance of picking out the communications with high security relevance.

Lest you conclude from this story that the LVCSR program directors should have conducted this cheating experiment at the beginning, before they spent the money, consider the incredible economic benefits that have resulted from our cheap access to very good continuous speech recognizers. To paraphrase something I once heard from Fred Jelinek – research is hard and frequently useless, but the alternative is worse. By the way, the exact quote was “I hate work, but the alternative is worse.”

Robust voting in uncertain environments

Combining the judgments of different recognizers is always better than using the best one alone. This observation is universal in machine learning realms. But it seldom gets used in practice. Why?

For one, it costs more to implement. Instead of one recognizer, you must deploy several. Computing cycles grow linearly with the number of recognizers if they all have roughly proportional cost. In academic circles, you get rewarded for finding the best recognizer/algoritm so there is little incentive to work in using the work of others along with your own. Companies usually have  proprietary rights to a few recognizers and want to deploy their best one to save development costs, hog less cycles on the customer’s machine, etc.

in certain conditions, however, it actually pays to use different recognizers and to incur the extra development and computing costs. Noisy or uncertain deployment environments can seriously degrade the performance of a well-tuned system. In such situations, the recognizer’s output could be worse than random and it would be impossible to place any certainty on it output.

The situation changes dramatically when we use different recognizers. To see this, consider the case of multiple recognizers, all of which have a probability greater than one half of getting the right answer. One can establish rigorous bounds for the number of recognizers that we would need to guarantee that a simple majority vote would have, say, greater than 99% probability of being right. This is discussed in the article on the Chernof Bound in Wikipedia.

But simple majority voting has its problems. For one, it is incredibly costly. Suppose that your detectors where correct two-thirds of the time. it would take at least 15 detectors to have 90 per cent probability of being correct when you averaged their individual votes

In addition, what if the testing condition starts to differ too much from the training data? The detectors could have degraded performance and 15 detectors may only give you fifty per cent chance of being correct. How would you know that your probability of being correct has dropped?

In future posts, I will explain a scheme using the precision error methodology that I have discussed in previous posts to create a robust voting system that automatically weighs the vote of a collection of systems according to their actual accuracy on the data, and it does this without any ground truth and in a completely data-driven way. Two system may “flip”, one becoming more accurate than the other and the algorithm I propose to combine their decisions would detect this reversal. The practical effect of this scheme is that one can predict, in the field, automatically the probability that one’s collection of recognizers is correct with an efficiency that is higher than simple voting. In compressed sensing, you use less sensors to reconstruct the same signal. I propose something similar, but for decision making: using less decision makers in a robust fashion to make the correct recognition decision.

Positive and negative precision error correlations, real or not?

One of the noisy maps based on the synthetic landscape

Synthetic dataset

All the experiments we have been carrying out with precision error have, so far, been with real data. Because of this, we do not have “ground truth” to determine if the reconstruction is correct. That changed today.

Synthetic experiments are a well-known device for studying models or algorithms. By artificially creating data where one knows exactly what is going on, one can then see if the algorithm one is testing is able to reproduce the artificially created “ground truth”.

I did that today by creating an artificial landscape as shown in the left figure at the top. A single example from the ten noisy versions of this landscape is shown in the figure on the right.

The advantage of using precision error is shown in the figures at the bottom. The figure on the left shows what happens after weighting the maps with the discovered precision error covariance matrix. The picture on the right is the result of simple averaging. The difference is clear. The weighted average is better, and precision error estimation was the way to obtain the weights.
Weighted average using precision error covariance matrixSimple average of all ten maps

Not every measurement is perfect

Precision error estimate variance decay

Just to show that not all questions behave as nicely as question 9 in the previous post, here is the plot for question 6 in the same exam.The fit is not as good as for question 9. This is expected, there is no reason why the precision error should decay with a perfect exponential behavior. Nonetheless, it still shows a similar decay constant — about six questions. Remember to click on the image to get the larger image in a zoom pane.

Precision error for parse trees

The precision error equations require that “ground truth” cancel out. It is easy to see what that means for elevations in a map. What does it mean for parse trees in a natural language processing task like sentence parsing?

One way to define distance between trees is to consider the total number of reverse operations that bring them back to a common ancestor. Is that number equal to the number one would get by comparing everything to the “true” parsing? That is, the observed parse prediction’s distance is equal to the true parse distance plus the distance created by the error-transformations.

Substraction makes sense to me in the context of trees: you take everything after the common ancestor. What is addition of parse trees? The union of all edges and vertices. Parse trees are graphs after all.

This addition and subtraction of graphs means that we can use the precisione error equations. Parse trees are added and substracted. In the end, a score is assigned to the difference by counting the number of operations it would take to collapse the resulting graph to disconnected single ancestors.

How do I get a bunch of parsing models to test this idea out?

Asymmetry in likelihood of causing the error

As the number of models increase, the observed pattern in prediction discrepancies allows one to decide what is causing it assuming uniform uncertainty among all possible scenarios. The observed error pattern will be consistent with many different scenarios. In some scenarios, the noisy model predictions are due to the model being correct and the other models being incorrect. In other scenarios, the model is incorrect and the other models are correct. The hypothesis I want to prove is that the “mass” (the number of states) that corresponds to assuming the model is incorrect becomes larger than the one assuming it is correct.

Minimum number of questions needed for an exam

Another application of the precision error covariance matrix is to find out the minimum number of questions needed for an exam. The linear algebra system derived from the precision erorr equations requires at least three scientific models before one can measure the precision error. More is better. But what is enough? How many questions does one need to ask in an exam to be guaranteed that the precision error one measures for the questions is well estimated?

To give an idea of what this number might be, consider first the popular parlor game “Twenty Questions”. One person thinks about something and the rest of the group must guess what it is by asking twenty questions or less. Isn’t it the case that this is almost always possible? Twenty questions is plenty to find out about what someone is thinking even with no prior information. This is, in fact, so automatic that popular toys exist that ask questions and are remarkably good at coming up with what one is thinking in twenty questions or less.

It seems to me, then, that twenty questions is way too many questions to ask in an exam if the purpose of that exam is to figure out if a student is competent in the material covered in class. Using precision error covariance matrices, I can actually calculate what the minimum number of questions are. This can be done by looking at how the precision error estimate varies for a single question as one varies the number of questions it is paired against. As the number of questions it is paired with increase, its precisione error estimate settles down to a value that does not change any further after
a certain threshold is reached. This threshold is the minimum number of questions needed in an exam.

I am now carrying out experiments with the exam data I have to empirically measure the number.

Grading mistake detection with precision error

While making a covariance matrix for eighteen questions in an introductory Physics exam I gave in the Spring of 2006, I discovered another use for the precision error measurements: grading mistake detection.

The figure shown first is my initial try. I computed the student score on each question with the function: $f(\text{correct})=1.0, f(\text{incorrect})=0.0$.Two graded incorrectly. Note the two dark squares at position 5 and 6.

Further investigation showed that I incorrectly scored the two questions. The second figure shows the matrix after I corrected the grading.All graded correctly Note how the squares at position 5 and 6 are now similar to the others.

The precision error covariance matrix also detects grading anomalies!

Random faster than systematic

I am writing a Mathematica program to produce the precision error signal and reconstruction matrix for an arbitrary number of models. The maximum number I had tried before was ten models because it corresponded to the number of maps we have for the 29 Palms dataset.

My first try consisted of systematically creating all possible permutations of the precision error equations, squaring them, and then storing the coefficients. The program would then systematically look at the equations and augment an independent set every time it found an equation that could not be written as linear combinations of the previous ones in the set.

This worked okay for ten or so models, but I want to produce the full covariance matrix for twenty questions in a multiple-choice exam. No problem, I was making some grilled lamb for Easter dinner yesterday, so I put the computer to work and walked away. Three hours later, the computer was still trying to finish the list of all possible equation permutations! I confess that I have not worked out the combinatorics for the equations yet so perhaps it is of order $20!$. This would be 2,432,902,008,176,640,000 combinations. Compare this to $10!=3628800$ and you can see why the computation got hard quickly.

So my second incantation of the program was to do the computation randomly. Two integers $P$ and $Q$ are picked randomly such that $1 \leq P \leq m$ and $1 \leq Q \leq m$ where $m$ is the number of models. These random numbers are then used to randomly sample the model variables and construct a precision error difference equation. If the equation is independent from the set currently at hand, it is kept, otherwise discarded.

This second version is taking about ten minutes to produce a result. This made me think about how we perceive randomness as haphazardly: “Oh, you are just randomly trying to guess the right answer.” We perceive randomly as wasteful or misguided. The case presented here is just another example of how random is sometimes faster than systematic.