Archive for the 'Randomness' Category

Mean precision error equations

Back in December, Howard Schultz and I wrote a paper for the IEEE Computer Vision and Pattern Recognition 2008 conference. The paper, by the way, was rejected by the reviewers (this should form an interesting article at some future time). In the paper we calculate the horizontal decorrelation length for a collection of maps using a variogram technique.

The plots were not coming out right and Howard pointed out that I had not de-meaned the data. In particular, I had not de-meaned the average precision error. That is, the precision error for a model has a mean.

For some reason, I had thought that this mean error could be exactly solved with the precision error equations. It cannot. Given m models, the precision error equations give you m1 equations. So the mean value also has to be recovered with 1 -minimization.

How should one interpret this mean precision error? In the case of exams, the mean precision error for a question could be interpreted as the hardness or ease for the question. Hard questions will get answered correctly by few people, easy ones will be answered correctly by everyone.

This is what I observe in the introductory Physics class exam I have been analyzing. The hardest questions have the smallest mean error. The easiest ones have the largest mean error.

The correct way of defining the covariance matrix involves substracting this mean. That is, the entries should be of the form δ iδ jδ iδ j and not just plain δ iδ j as I have been writing in previous posts.

Who is causing the variations in predictions?

Precision error is a measure of the variation in the predictions of a collection of models. If there are no variations — all the models agree. There is no precision error. One is perfectly in focus as far as one can tell. But, of course, scientific models disagree. So who is to blame? Is it the data or is it the algorithms used to process the data (the models). Having enough models allows you to do decide who is to blame.

Consider the case of a piece of data used to train one of the models that always lead to a disagreement, no matter what algorithm is used to process it. Who is to blame in this case? The optimal choice seems to me to be to decide that the data is bad.

Now consider an algorithm that always disagrees with all the other models no matter what set of predictions are compared. This seems to suggest that the model is wrong, not the data.

In both these cases, the availability of a large number of models is what allows one to distinguish the two cases. Real data will not be as stark as the examples above. Here is where Fourier analysis and probability theory come in. As the number of models increases one is able to disentangle the two. For small number of models, blaming the data or the model would explain the observed variation equally well. As the number of models increase, assigning blame becomes asymmetric!

This is sort of like the “Is it me or is him/her?” question. Comparing ourselves to only one other person does not allow us to decide who is the crazy one. But the more people we interact with, the sooner we realize who is to blame.

I’ll try to come up with a simple example with a few models to illustrate the point mathematically in a later post.

The half life of exam question precision error estimates

It sounds strange to think it. But the precision error formalism I have been discussing is a relation between the data and the models. Who is to blame for the variation that is observed? Is it the data? Is the data noisy? Is it the algorithm used to calculate the model? It could be both.

One way to see this dependency of the error estimation on the data is to consider the diagonal terms in the covariance matrix: the self-correlation in the precision error, the terms of the form δ i 2 . Since this involves only one model, we would expect it to be the error of the model. How does this number vary for a question in an exam as we change the set of questions we compare it against? If we use a few questions, the bias in those questions may skew the bias in our estimate. Another set of a few questions could have a wildly different estimate for the self variance of a question.

As the number of questions that are included in the comparison set are increased, what happens to the fluctuations in the estimates of the self-correlations. Do they decrease? How fast do they decrease? Some preliminary experiments with the exam questions show that with three questions, the estimates vary by 51% of the mean. When eighteen questions are used, they vary by 15%. Fitting these two preliminary data points to an exponential decay curve gives a decay constant for the noise in the precisione error estimate (this sounds so weird — the noise of the noise?). The decay constant one gets from the above figures is fourteen questions.

Fourteen questions seem to be enough to have an estimate of the precision error of each question that is within 27% of the estimate. This seems to suggest that twenty questions is about right after one adds extra questions as insurance padding against data variability.

Minimum number of questions needed for an exam

Another application of the precision error covariance matrix is to find out the minimum number of questions needed for an exam. The linear algebra system derived from the precision erorr equations requires at least three scientific models before one can measure the precision error. More is better. But what is enough? How many questions does one need to ask in an exam to be guaranteed that the precision error one measures for the questions is well estimated?

To give an idea of what this number might be, consider first the popular parlor game “Twenty Questions”. One person thinks about something and the rest of the group must guess what it is by asking twenty questions or less. Isn’t it the case that this is almost always possible? Twenty questions is plenty to find out about what someone is thinking even with no prior information. This is, in fact, so automatic that popular toys exist that ask questions and are remarkably good at coming up with what one is thinking in twenty questions or less.

It seems to me, then, that twenty questions is way too many questions to ask in an exam if the purpose of that exam is to figure out if a student is competent in the material covered in class. Using precision error covariance matrices, I can actually calculate what the minimum number of questions are. This can be done by looking at how the precision error estimate varies for a single question as one varies the number of questions it is paired against. As the number of questions it is paired with increase, its precisione error estimate settles down to a value that does not change any further after
a certain threshold is reached. This threshold is the minimum number of questions needed in an exam.

I am now carrying out experiments with the exam data I have to empirically measure the number.

Grading mistake detection with precision error

While making a covariance matrix for eighteen questions in an introductory Physics exam I gave in the Spring of 2006, I discovered another use for the precision error measurements: grading mistake detection.

The figure shown first is my initial try. I computed the student score on each question with the function: f(correct)=1.0 ,f(incorrect)=0.0 .Two graded incorrectly. Note the two dark squares at position 5 and 6.

Further investigation showed that I incorrectly scored the two questions. The second figure shows the matrix after I corrected the grading.All graded correctly Note how the squares at position 5 and 6 are now similar to the others.

The precision error covariance matrix also detects grading anomalies!

Random faster than systematic

I am writing a Mathematica program to produce the precision error signal and reconstruction matrix for an arbitrary number of models. The maximum number I had tried before was ten models because it corresponded to the number of maps we have for the 29 Palms dataset.

My first try consisted of systematically creating all possible permutations of the precision error equations, squaring them, and then storing the coefficients. The program would then systematically look at the equations and augment an independent set every time it found an equation that could not be written as linear combinations of the previous ones in the set.

This worked okay for ten or so models, but I want to produce the full covariance matrix for twenty questions in a multiple-choice exam. No problem, I was making some grilled lamb for Easter dinner yesterday, so I put the computer to work and walked away. Three hours later, the computer was still trying to finish the list of all possible equation permutations! I confess that I have not worked out the combinatorics for the equations yet so perhaps it is of order 20 !. This would be 2,432,902,008,176,640,000 combinations. Compare this to 10 !=3628800 and you can see why the computation got hard quickly.

So my second incantation of the program was to do the computation randomly. Two integers P and Q are picked randomly such that 1 Pm and 1 Qm where m is the number of models. These random numbers are then used to randomly sample the model variables and construct a precision error difference equation. If the equation is independent from the set currently at hand, it is kept, otherwise discarded.

This second version is taking about ten minutes to produce a result. This made me think about how we perceive randomness as haphazardly: “Oh, you are just randomly trying to guess the right answer.” We perceive randomly as wasteful or misguided. The case presented here is just another example of how random is sometimes faster than systematic.

Precision error vectors are rank-1 tensors

My previous post on precision error tensors was misleading. We tend to think of tensors as complicated mathematical structures. Vectors are rank 1 tensors. Driving home from work today, I realized that I had already shown that precision error vectors can be calculated in our horizontal decorrelation estimation paper. So mathematically speaking, I have already shown that precision error should be treated as tensors. The precision error vector is the rank 1 tensor example. The precision error covariance matrix is the rank-2 tensor. Two examples in the usual tensor progression. At some future time I should calculate the rank-3 tensor. How would one induce representations of the Symmetric group in rank-3 tensors?

Precision error covariance matrix for a Physics multiple-choice exam

Precision error covariance for 10 questions from a Physics test
I have applied the autonomous difference equations to test the quality of ten out of twenty questions I used in a Physics exam I gave in the Spring of 2006 to an introductory class for engineering students. That dark square in position six of the matrix corresponds to the question least likely to be answered correctly. Only 64 students out of 250 answered correctly. The reason this happened was that I gave a very clever wrong answer that attracted most of the students (the correct answer but forgetting to take a square root). I have used the precision error covariance matrix to assess the test maker not the students!

This example also highlights the general applicablity of the precision error covariance matrix. There now exist two experimental verifications of its usefulness: digital elevation maps and test assessment.

Precision error equations for information retrieval, information extraction, and bioinformatics

Today I was able to look at the precision error equations for the digital elevation models and see for the first time that they can be trivially generalized to other machine learning fields like information retrieval, information extraction, and bioinformatics. This is so embarrassingly simple that I cannot stop coming up with new variants every hour or so! The pattern is easy to explain, so I think any reader of this should be able to come up with a variant that applies to their area of work. Please let me know if you do so, I would like to start a catalogue of the many ways this can be done.

Here is the pattern. I’ll start with information retrieval. Assume you have a system that creates a relevance model of a corpus of documents. These relevance models are judgments of the form r(d,q)={0,1 }. The notation is meant to capture that the relevance judgment is a binary decision on whether a particular document d is relevant to a query q. Maybe you made that relevance model using maximum likelihood estimates, or maybe you used Latent Dirichlet Allocation. Each way you calculate that relevance judgment is a model. Assume that you have used different algorithms, or different parameter settings, or whatever, to come up with n different relevance judgments for a set of test queries.

Each relevance judgment of a specific model i can be written as
r estimated(d,q) i=r true(d,q) i+δ(d,q) i

Now consider the following quantities that can be calculated with these many relevance models
q,d(1 E i=1 Er i(d,q))(1 M j=1 Mr j(d,q))
These quantities would not include the r true value. It would cancel out. So the above equation can be written as
q,d(1 E i=1 Eδ i(d,q))(1 M j=1 Mδ j(d,q)). These equations would allow you to recover the precision errors {δ i} for a collection of information retrieval models!

The pattern can now be generalized ad-infinitum to any machine learning task! You pick the model prediction for which you want to measure the precision error.

Do enough models always guarantee that the precision error will be sparse?

I’m writing a talk for the Machine Learning Lunch at UMass/Amherst. It has got me thinking about the issue of sparsity in the precision error. Here are the technical details. The precision error when using n models leads to a n 2 covariance matrix. Because the matrix is symmetric, it really has n(n+1 )/2 independent components. You can think of the covariance matrix as a data structure that lives in a n(n+1 )/2 . It does not inhabit the full space because cross-correlations must be less than the variance of the cross-correlated variables.

The autonomous difference equations that were first publishd in the Swiss paper lead to n(n+1 )/2 n independent equations. So the linear algebra system for the precision error covariance matrix is always under-determined by n equations. The 1 minimization technique advocated by Donoho would be able to solve this system if enough of the entries in the matrix were zero. This is what I mean by: is the precision error covariance matrix sparse enough?

You can calculate that the percentage of equations you have as the following fraction less than one
n(n+1 )/2 nn(n+1 )/2 .
This number is equal to
1 2 n+1
so it approaches a well-determined system (the value one) with a term that dies as order 1 /n. Doesn’t this mean that if you have enough models (n large enough) you would be able to become sparse enough to reconstruct the covariance matrix?

This means that sparsity of the error of the signal behaves empirically different than the sparsity of the signal. This has profound implications for the wide-applicability of the theory Howard Schultz and I have developed.