Archive for the 'Randomness' Category

Grading mistake detection with precision error

While making a covariance matrix for eighteen questions in an introductory Physics exam I gave in the Spring of 2006, I discovered another use for the precision error measurements: grading mistake detection.

The figure shown first is my initial try. I computed the student score on each question with the function: f(correct)=1.0 ,f(incorrect)=0.0 .Two graded incorrectly. Note the two dark squares at position 5 and 6.

Further investigation showed that I incorrectly scored the two questions. The second figure shows the matrix after I corrected the grading.All graded correctly Note how the squares at position 5 and 6 are now similar to the others.

The precision error covariance matrix also detects grading anomalies!

Random faster than systematic

I am writing a Mathematica program to produce the precision error signal and reconstruction matrix for an arbitrary number of models. The maximum number I had tried before was ten models because it corresponded to the number of maps we have for the 29 Palms dataset.

My first try consisted of systematically creating all possible permutations of the precision error equations, squaring them, and then storing the coefficients. The program would then systematically look at the equations and augment an independent set every time it found an equation that could not be written as linear combinations of the previous ones in the set.

This worked okay for ten or so models, but I want to produce the full covariance matrix for twenty questions in a multiple-choice exam. No problem, I was making some grilled lamb for Easter dinner yesterday, so I put the computer to work and walked away. Three hours later, the computer was still trying to finish the list of all possible equation permutations! I confess that I have not worked out the combinatorics for the equations yet so perhaps it is of order 20 !. This would be 2,432,902,008,176,640,000 combinations. Compare this to 10 !=3628800 and you can see why the computation got hard quickly.

So my second incantation of the program was to do the computation randomly. Two integers P and Q are picked randomly such that 1 Pm and 1 Qm where m is the number of models. These random numbers are then used to randomly sample the model variables and construct a precision error difference equation. If the equation is independent from the set currently at hand, it is kept, otherwise discarded.

This second version is taking about ten minutes to produce a result. This made me think about how we perceive randomness as haphazardly: “Oh, you are just randomly trying to guess the right answer.” We perceive randomly as wasteful or misguided. The case presented here is just another example of how random is sometimes faster than systematic.

Precision error vectors are rank-1 tensors

My previous post on precision error tensors was misleading. We tend to think of tensors as complicated mathematical structures. Vectors are rank 1 tensors. Driving home from work today, I realized that I had already shown that precision error vectors can be calculated in our horizontal decorrelation estimation paper. So mathematically speaking, I have already shown that precision error should be treated as tensors. The precision error vector is the rank 1 tensor example. The precision error covariance matrix is the rank-2 tensor. Two examples in the usual tensor progression. At some future time I should calculate the rank-3 tensor. How would one induce representations of the Symmetric group in rank-3 tensors?

Precision error covariance matrix for a Physics multiple-choice exam

Precision error covariance for 10 questions from a Physics test
I have applied the autonomous difference equations to test the quality of ten out of twenty questions I used in a Physics exam I gave in the Spring of 2006 to an introductory class for engineering students. That dark square in position six of the matrix corresponds to the question least likely to be answered correctly. Only 64 students out of 250 answered correctly. The reason this happened was that I gave a very clever wrong answer that attracted most of the students (the correct answer but forgetting to take a square root). I have used the precision error covariance matrix to assess the test maker not the students!

This example also highlights the general applicablity of the precision error covariance matrix. There now exist two experimental verifications of its usefulness: digital elevation maps and test assessment.

Precision error equations for information retrieval, information extraction, and bioinformatics

Today I was able to look at the precision error equations for the digital elevation models and see for the first time that they can be trivially generalized to other machine learning fields like information retrieval, information extraction, and bioinformatics. This is so embarrassingly simple that I cannot stop coming up with new variants every hour or so! The pattern is easy to explain, so I think any reader of this should be able to come up with a variant that applies to their area of work. Please let me know if you do so, I would like to start a catalogue of the many ways this can be done.

Here is the pattern. I’ll start with information retrieval. Assume you have a system that creates a relevance model of a corpus of documents. These relevance models are judgments of the form r(d,q)={0,1 }. The notation is meant to capture that the relevance judgment is a binary decision on whether a particular document d is relevant to a query q. Maybe you made that relevance model using maximum likelihood estimates, or maybe you used Latent Dirichlet Allocation. Each way you calculate that relevance judgment is a model. Assume that you have used different algorithms, or different parameter settings, or whatever, to come up with n different relevance judgments for a set of test queries.

Each relevance judgment of a specific model i can be written as
r estimated(d,q) i=r true(d,q) i+δ(d,q) i

Now consider the following quantities that can be calculated with these many relevance models
q,d(1 E i=1 Er i(d,q))(1 M j=1 Mr j(d,q))
These quantities would not include the r true value. It would cancel out. So the above equation can be written as
q,d(1 E i=1 Eδ i(d,q))(1 M j=1 Mδ j(d,q)). These equations would allow you to recover the precision errors {δ i} for a collection of information retrieval models!

The pattern can now be generalized ad-infinitum to any machine learning task! You pick the model prediction for which you want to measure the precision error.

Do enough models always guarantee that the precision error will be sparse?

I’m writing a talk for the Machine Learning Lunch at UMass/Amherst. It has got me thinking about the issue of sparsity in the precision error. Here are the technical details. The precision error when using n models leads to a n 2 covariance matrix. Because the matrix is symmetric, it really has n(n+1 )/2 independent components. You can think of the covariance matrix as a data structure that lives in a n(n+1 )/2 . It does not inhabit the full space because cross-correlations must be less than the variance of the cross-correlated variables.

The autonomous difference equations that were first publishd in the Swiss paper lead to n(n+1 )/2 n independent equations. So the linear algebra system for the precision error covariance matrix is always under-determined by n equations. The 1 minimization technique advocated by Donoho would be able to solve this system if enough of the entries in the matrix were zero. This is what I mean by: is the precision error covariance matrix sparse enough?

You can calculate that the percentage of equations you have as the following fraction less than one
n(n+1 )/2 nn(n+1 )/2 .
This number is equal to
1 2 n+1
so it approaches a well-determined system (the value one) with a term that dies as order 1 /n. Doesn’t this mean that if you have enough models (n large enough) you would be able to become sparse enough to reconstruct the covariance matrix?

This means that sparsity of the error of the signal behaves empirically different than the sparsity of the signal. This has profound implications for the wide-applicability of the theory Howard Schultz and I have developed.

To err is human, to study your errors is glorious

I’ve been sick all week but today has been the worst. In between my sleeping hallucinations I have been thinking a lot about a proposal I’m currently writing on the use of non-commutative harmonic analysis to study mapping error patterns. It has become clear that the approach we are advocating at the AIRS lab is applicable to other areas of machine learning. Before I describe in additional detail what I mean by this let me present a graphic that abstractly represents the scientific enterprise: An abstract representation of the scientific enterprise
The work I am describing here lies at the bottom of the picture. It is an algorithmic prescription for understanding the precision of our models given a dataset used to construct those models. I present this diagram to make clear the limitations of our work. It is not a description or explanation of errors in general. It is a technique for probing the error patterns in your system. The hard work is still left to you on how to apply it for a specific system that constructs models from data, and its usefulness is not guaranteed. The technique may tell you nothing interesting about your system.

We can view model creation as a black box. It takes data inputs and produces a model. The crucial point is that in some machine learning situations the number of models we can build is very large. Data is model-redundant.The redundancy is sometimes continuous, for example, the initial position of a camera or the weight of a Lagrangian term. But it can also be discrete — a finite set of documents or photographs. Changing the data inputs can then be used to probe the variation of a system’s model predictions. These variations will not be completely random (i.e. patternless). Some documents are more informative, some photographs give us a better view. Therefore we can use the symmetry groups associated with our data inputs to Fourier analyze the model variation. This model variation or model precision is informative about the quality of the data and can be used to reject bad data or discount lower quality inputs.

X-raying the geometric precision error of DEMs with Fourier analysis

In a previous post I mentioned a way of Fourier analyzing the geometric precision error of DEMs. Today I realized that the scheme I proposed can only account for part of the error signal. The approach I proposed is correct but it can only capture one particular aspect of the total error. The simplest way of seeing this is to consider the S 2 symmetry group. This would be the one to use for p=2 photographs. From two photographs I can produce two DEMs: AB and BA. The covariance matrix for these two DEMS would be a 2×2 covariance matrix of the form:
(ab bc).
But the representation induced by S 2 on these two DEMs generates the matrices:
(1 0 0 1 )and(0 1 1 0 )
These two matrices cannot capture the three independent degrees of freedom in the 2×2 covariance matrix. Therefore, the induced representation cannot capture all of the possible errors that are observed when two photograps are used to produce two maps. But the representation would allow you to project out that component of the error that is explained by permutations of the images.

You would need at least p=7 photograps to have enough members in S p to completely model the variation in the DEMs observed when you use two photographs to produce a map. To understand the error that cannot be explained by the permutation group you would need to use three photographs to create a DEM. For p photographs this would create p*(p1 )*(p2 ) DEMs. Since we can produce as much or even more than p! DEMs from p photographs, at some point we will always overwhelm the representational power of the symmetry group of p objects (in this case, p photographs). What error remains after we project out the component that can be modelled by S n? I hypothesize that it would be error that can be further Fourier analyzed by using the symmetry group associated with the orientation and positions of the cameras. These parameters are themselves error prone and would, by virtue of their geometry, only induce certain error patterns.

This viewpoint of the errors would therefore view the observed error as one that can be captured by a succesive series of symmetry groups. One component would be that related to the finite group of S p. Another component would be that one induced by translations and rotations of the camera positions and orientations. Like any real theory of errors, this approach would only peel away layers of error — always remaining would be a nugget of error that would require more and more complex models to decompose. The second law of thermodynamics is not violated!

The metaphor to x-raying in the title of this post comes from using Fourier analysis to study X-ray diffraction photographs by crystals. Crystals induce a certain periodicity on the scattered X-rays even when the sample is crushed into a powder. In other words, the randomly scattered blocks of crystal in the powder individually send a perfect difraction pattern. But the X-ray photograph records the mismash of the signals — the picture is blurry. Nonetheless, the bluriness has a symmetry component that comes from the periodic structure of the crystals and therefore Fourier analysis is able to pick the symmetry in the x-ray caused by the crystal periodicity. The Fourier decompositions for geometric errors are doing the same thing. There are many sources of errors in DEMs from aerial photographs. Some come from the fact that you used individual photographs to create the maps. This component of the error can therefore be accounted by studying representations of the symmetry group of p objects. Others come from uncertainty in the position or orientation of the camera when it took the photograph. These are explained by induced representations of non-abelian Lie groups like 3-D rotations in the space of covariance matrices.

Decreasing precision errors with randomness

If I was to rate the things I have learned from computer science, I would place the algorithmic use of randomness right at the top. The uses of randomness in computations is too vast to start a list here. Check out Probability and Computing: Randomized Algorithms and Probabilistic Analysis for many examples. I want to discuss another way of using randomness in computation by discussing the estimation of precision errors in Digital Elevation Models.

I’ll use some simple linear algebra to explain how precision error can be discussed in the language of compressed sensing. The Swiss paper describes how to turn the estimation of the precision error covariance matrix into a linear algebra problem of the
form
S=Φα.
“S” is the signal. In this case, the autonomous difference terms one can calculate from the DEM elevation estimates. This makes “S” the signal because it can be calculated from what we observe — the DEM elevations. The vector α are the precision error covariance terms the robot is trying to estimate without knowing any ground truth. The “reconstruction” matrix Φ(n) tells you how to go between these two quantities. Φis can be calculated exactly and is only a function of the number n of DEMs.

Randomness comes into the error estimation process because there exist many different ways to specify the reconstruction matrix. Take the example of 10 DEMs I have used before since it corresponds to the case I have studied most in my work with Howard Schultz. The autonomous difference equations give us about 5,000 different ways of calculing quantities that do not depend on ground truth. Out of all those many equations, only 45 are linearly independent. Which 45? Any 45. That means that there are many ways of constructing Φ. So many that I can randomly do it by picking equations from the set of 5,000 equations until I get a set of 45 linearly independent ones.

So it becomes possible to check the precision error estimate many different ways. Could one then use this to improve the error estimation process itself? I do not know but I’m investigating that issue today by running experiments with randomly picked independent sets and plotting how the values vary for the same DEMs used as input.

Autonomous horizontal correlation length in DEM data

The “Swiss paper” that I discussed in an earlier post solved the problem of vertical precision error estimation. I used a set of difference equations that range over {l,m} where l and m are integers from 1 to the number of observations. The equations look like the difference of simple averages. My purpose in using them is that I would be able to cancel out the true value and be left with the error in each measurement. Surprisingly, the set of independent equations you get from considering all possible equations of this type can be turned into a well-determined linear algebra problem for the entries in a particular sparse covariance matrix. This allowed me to measure the vertical uncertainty in a composite Digital Elevation Model (DEM) without knowing ground truth. I currently interpret this as meaning that I have an estimate of how good my DEM model is. I could still be off by a scale, rotation, and translation.

But vertical uncertainty is only the first of two important numbers for the quality of a map. Another important one is the horizontal resolution. How fine grained are the readings in the map? Another way to capture this resolution is to ask what is the horizontal correlation length — how far apart do two measurements have to be so that they are de-correlated with each other. A way to study this is with the variogram. I had never heard of a ‘variogram’ until a couple of years ago. It essentially measures the spatial correlation of data by taking the average of the spatial difference of a function:
E[(f(x)f(x+L)) 2 ]
One typical behavior of this variogram function is that it starts at zero and rises to a plateau in an exponential fashion:
r*(1 exp(L/λ))
The horizontal correlation length is defined to be where the variogram reaches 67% of its final value. In the exponential rise curve this happens when L=λ.

I have had to put off the calculation of the correlation length until today. I was astounded to get almost text-book like exponential rise curves. The autonomous difference equations work for horizontal correlation lengths also! I found a four postings correlation length for our Twenty-Nine Palms dataset. Howard says this is very good resolution. What surprises me is that this could even be done.