To err is human, to study your errors is glorious

I’ve been sick all week but today has been the worst. In between my sleeping hallucinations I have been thinking a lot about a proposal I’m currently writing on the use of non-commutative harmonic analysis to study mapping error patterns. It has become clear that the approach we are advocating at the AIRS lab is applicable to other areas of machine learning. Before I describe in additional detail what I mean by this let me present a graphic that abstractly represents the scientific enterprise: An abstract representation of the scientific enterprise
The work I am describing here lies at the bottom of the picture. It is an algorithmic prescription for understanding the precision of our models given a dataset used to construct those models. I present this diagram to make clear the limitations of our work. It is not a description or explanation of errors in general. It is a technique for probing the error patterns in your system. The hard work is still left to you on how to apply it for a specific system that constructs models from data, and its usefulness is not guaranteed. The technique may tell you nothing interesting about your system.

We can view model creation as a black box. It takes data inputs and produces a model. The crucial point is that in some machine learning situations the number of models we can build is very large. Data is model-redundant.The redundancy is sometimes continuous, for example, the initial position of a camera or the weight of a Lagrangian term. But it can also be discrete — a finite set of documents or photographs. Changing the data inputs can then be used to probe the variation of a system’s model predictions. These variations will not be completely random (i.e. patternless). Some documents are more informative, some photographs give us a better view. Therefore we can use the symmetry groups associated with our data inputs to Fourier analyze the model variation. This model variation or model precision is informative about the quality of the data and can be used to reject bad data or discount lower quality inputs.

X-raying the geometric precision error of DEMs with Fourier analysis

In a previous post I mentioned a way of Fourier analyzing the geometric precision error of DEMs. Today I realized that the scheme I proposed can only account for part of the error signal. The approach I proposed is correct but it can only capture one particular aspect of the total error. The simplest way of seeing this is to consider the $S_2$ symmetry group. This would be the one to use for $p=2$ photographs. From two photographs I can produce two DEMs: $A \rightarrow B$ and $B \rightarrow A$. The covariance matrix for these two DEMS would be a 2×2 covariance matrix of the form:
$$\left( \array{ a \; b \\ b \; c} \right).$$
But the representation induced by $S_2$ on these two DEMs generates the matrices:
$$\left( \array{ 1 \; 0 \\ 0 \; 1} \right) \text{and} \left( \array{ 0 \; 1\\ 1 \; 0} \right)$$
These two matrices cannot capture the three independent degrees of freedom in the 2×2 covariance matrix. Therefore, the induced representation cannot capture all of the possible errors that are observed when two photograps are used to produce two maps. But the representation would allow you to project out that component of the error that is explained by permutations of the images.

You would need at least $p=7$ photograps to have enough members in $S_p$ to completely model the variation in the DEMs observed when you use two photographs to produce a map. To understand the error that cannot be explained by the permutation group you would need to use three photographs to create a DEM. For $p$ photographs this would create $p*(p-1)*(p-2)$ DEMs. Since we can produce as much or even more than $p!$ DEMs from $p$ photographs, at some point we will always overwhelm the representational power of the symmetry group of $p$ objects (in this case, $p$ photographs). What error remains after we project out the component that can be modelled by $S_n$? I hypothesize that it would be error that can be further Fourier analyzed by using the symmetry group associated with the orientation and positions of the cameras. These parameters are themselves error prone and would, by virtue of their geometry, only induce certain error patterns.

This viewpoint of the errors would therefore view the observed error as one that can be captured by a succesive series of symmetry groups. One component would be that related to the finite group of $S_p$. Another component would be that one induced by translations and rotations of the camera positions and orientations. Like any real theory of errors, this approach would only peel away layers of error — always remaining would be a nugget of error that would require more and more complex models to decompose. The second law of thermodynamics is not violated!

The metaphor to x-raying in the title of this post comes from using Fourier analysis to study X-ray diffraction photographs by crystals. Crystals induce a certain periodicity on the scattered X-rays even when the sample is crushed into a powder. In other words, the randomly scattered blocks of crystal in the powder individually send a perfect difraction pattern. But the X-ray photograph records the mismash of the signals — the picture is blurry. Nonetheless, the bluriness has a symmetry component that comes from the periodic structure of the crystals and therefore Fourier analysis is able to pick the symmetry in the x-ray caused by the crystal periodicity. The Fourier decompositions for geometric errors are doing the same thing. There are many sources of errors in DEMs from aerial photographs. Some come from the fact that you used individual photographs to create the maps. This component of the error can therefore be accounted by studying representations of the symmetry group of p objects. Others come from uncertainty in the position or orientation of the camera when it took the photograph. These are explained by induced representations of non-abelian Lie groups like 3-D rotations in the space of covariance matrices.

Decreasing precision errors with randomness

If I was to rate the things I have learned from computer science, I would place the algorithmic use of randomness right at the top. The uses of randomness in computations is too vast to start a list here. Check out Probability and Computing: Randomized Algorithms and Probabilistic Analysis for many examples. I want to discuss another way of using randomness in computation by discussing the estimation of precision errors in Digital Elevation Models.

I’ll use some simple linear algebra to explain how precision error can be discussed in the language of compressed sensing. The Swiss paper describes how to turn the estimation of the precision error covariance matrix into a linear algebra problem of the
form
$$ S = \Phi \alpha. $$
“S” is the signal. In this case, the autonomous difference terms one can calculate from the DEM elevation estimates. This makes “S” the signal because it can be calculated from what we observe — the DEM elevations. The vector $\alpha$ are the precision error covariance terms the robot is trying to estimate without knowing any ground truth. The “reconstruction” matrix $\Phi(n)$ tells you how to go between these two quantities. $\Phi$is can be calculated exactly and is only a function of the number $n$ of DEMs.

Randomness comes into the error estimation process because there exist many different ways to specify the reconstruction matrix. Take the example of 10 DEMs I have used before since it corresponds to the case I have studied most in my work with Howard Schultz. The autonomous difference equations give us about 5,000 different ways of calculing quantities that do not depend on ground truth. Out of all those many equations, only 45 are linearly independent. Which 45? Any 45. That means that there are many ways of constructing $\Phi$. So many that I can randomly do it by picking equations from the set of 5,000 equations until I get a set of 45 linearly independent ones.

So it becomes possible to check the precision error estimate many different ways. Could one then use this to improve the error estimation process itself? I do not know but I’m investigating that issue today by running experiments with randomly picked independent sets and plotting how the values vary for the same DEMs used as input.

Autonomous horizontal correlation length in DEM data

The “Swiss paper” that I discussed in an earlier post solved the problem of vertical precision error estimation. I used a set of difference equations that range over $\{l,m\}$ where $l$ and $m$ are integers from 1 to the number of observations. The equations look like the difference of simple averages. My purpose in using them is that I would be able to cancel out the true value and be left with the error in each measurement. Surprisingly, the set of independent equations you get from considering all possible equations of this type can be turned into a well-determined linear algebra problem for the entries in a particular sparse covariance matrix. This allowed me to measure the vertical uncertainty in a composite Digital Elevation Model (DEM) without knowing ground truth. I currently interpret this as meaning that I have an estimate of how good my DEM model is. I could still be off by a scale, rotation, and translation.

But vertical uncertainty is only the first of two important numbers for the quality of a map. Another important one is the horizontal resolution. How fine grained are the readings in the map? Another way to capture this resolution is to ask what is the horizontal correlation length — how far apart do two measurements have to be so that they are de-correlated with each other. A way to study this is with the variogram. I had never heard of a ‘variogram’ until a couple of years ago. It essentially measures the spatial correlation of data by taking the average of the spatial difference of a function:
$$E[(f(x)-f(x+L))^2]$$
One typical behavior of this variogram function is that it starts at zero and rises to a plateau in an exponential fashion:
$$r*(1-\exp(-L/\lambda))$$
The horizontal correlation length is defined to be where the variogram reaches 67% of its final value. In the exponential rise curve this happens when $L=\lambda$.

I have had to put off the calculation of the correlation length until today. I was astounded to get almost text-book like exponential rise curves. The autonomous difference equations work for horizontal correlation lengths also! I found a four postings correlation length for our Twenty-Nine Palms dataset. Howard says this is very good resolution. What surprises me is that this could even be done.

The random binary detector with the lowest error cost variation

Last night I did some calculations to see what random detector has the lowest error cost variation. That is, given that the detector will be deployed in an uncertain environment, what random detector should you use to minimize its error cost variation. The math is simple so I will show it all in this post.

As in previous posts, I am restricting myself to the simplest case: a binary classification problem. We have two classes, call them A and B. The variability of the testing environment is codified by the single parameter $f_A$, the frequency of class A. The binary random detector is described by a single parameter also, $d_A$ — the classification rate for class A.

The cost of misclassifying class A as class B will be denoted by $C_{A \rightarrow B}$. The cost of classifying class B as class A by $C_{B \rightarrow A}$. The error cost or loss function in this simple example is then equal to
$$L = C_{A \rightarrow B} f_A (1 – d_A) + C_{B \rightarrow A} (1-f_A) d_A.$$

What random detector setting $d_A$ minimizes $\frac{\partial L}{\partial f_A}$, the loss variation under a changing environment? Calculating the derivative one quickly obtains that zero variation in the cost is attainable when
$$ d_A = \frac{C_{A \rightarrow B}}{C_{A \rightarrow B} + C_{B \rightarrow A}}.$$
Note that this is not the most accurate detector setting for a particular environment, $f_A$.

The strategic advantage of stabilizing errors in uncertain environments

This weekend I was talking to a friend from graduate school about how to stabilize errors in uncertain environments by sacrificing accuracy, the subject of the previous two posts. During the conversation, I realized another advantage of stabilizing errors: it makes it easier to modify your detector/classifier in an adversarial situation.

The “arms race” game is a common one in conflict situations. You have some technological advantage over an opponent. The opponent adapts and changes their strategy. This is different from the issue of a noisy environment. Here the signal used by your detector is intentionally modified to circumvent its performance level. The enemy camouflages their tanks better. New strategies are deployed. All of which leads to an increase in the error rate of your detector.

A user confronted with this situation would want to slow down their error rate increase so it can be below their learning rate. Such a user would be willing to sacrifice accuracy for the ability to adapt to an adversary. There is a strategic advantage to stabilizing errors.

A simple example of how to stabilize classification errors in uncertain environments by sacrificing accuracy

I corrected an assertion in my earlier post on the difference between classification rates and classification errors. I mistakenly asserted that for a training dataset with a 70/30 prevalence of binary labels the best random classifier is the 70/30 one. This is not so. The best one is the 100/0 classifier since it will correctly classify the prevalent label all of the time and will therefore have an error rate of 30% on the 70/30 training dataset. This is better than the 70/30 random classifier which has a classification error of 42% on this training dataset.

But this clarification provides a simple example of how sacrificing accuracy leads to stabilizing errors in an uncertain environment. As the previous post shows, the 70/30 classifier has an error rate increase of 4% when tried on data that has a 60/40 ratio. Its error has increased by a relative 9.5% ($(46-42)/42$). But the better random classifier, the 100/0 one, has its error rate increase by 33% ($(40-30)/30$)! It is overtrained and therefore much more unstable when applied to data that differs from the characteristics of its training set. The moral: increased stability of the error rate can be achieved by sacrificing accuracy.

Of course, these are just specific examples. The way to quantify the benefit of sacrificing accuracy is to consider the average stability of a random classifier under various degrees of variability in the input. This can be done, for example, by considering a beta distribution with various amounts of variability to generate synthetic input data. As the variability knob on the beta distribution is increased one should find that the best strategy for stabilizing error on average is to sacrifice accuracy.

Now suppose that a user/employer comes to you and asks that you design a classifier based on a 70/30 training dataset. Should you give them the 100/0 or 70/30 random classifier? I think the responsible thing to do is to give them the 70/30 classifier. It will also make them happier! I don’t think many users/employers are going to be happy with a classifier that gets one class completely wrong all of the time in spite of your pleas that it is more accurate! Why? Because users always have costs associated with the errors of your classifier. Would you be happy with a spam filter that was 70% accurate but delivered no mail?

What is more important: precision or accuracy?

I have just uploaded our autonomous precision error estimation Swiss conference paper to the new UMass/Amherst Digital Repository. I work at the Aerial Imaging and Remote Sensing Lab at UMass/Amherst. For years, Howard Schultz has been doubling the number of Digital Elevation Models (DEMs) his Terrest system makes by using the fact that computer stereo matching algorithms are not symmetric in their inputs. The paper above proves mathematically that this is indeed correct and possible if the correlation between DEMs is sparse, i.e. only a few of the DEMs are highly correlated.

We are now writing a proposal to further this work by connecting it with compressed sensing ideas as we briefly hint at the end of the paper above. This has got us thinking about the difference between precision and accuracy in measurement errors. Precision is the width of your error bars, accuracy is where the centroid of the measurement is located relative to the ‘true’ value. So if you had to choose between a precise and an accurate map, which one would you choose?

A precise map is one that captures the shape of the world very well, it has a lot of details. An accurate map is one that tells you where objects are located in the real world. An accurate but imprecise map is located correctly but it is very fuzzy. The landscape looks melted. A precise but inaccurate map is located wrong but has lots of detail — you tried to map a desert patch and the map says you mapped Paris (there is a caveat to this characterization — horizontal correlation — that I don’t want to get into in this post). The practical significance of this difference is enormous. A precise but inaccurate map is just a rigid body transformation and a scale change of the real world (3 + 3 + 1 = 7 unknown parameters) that can be fixed by measuring 3 ground control points (3 + 3 + 3 = 9 measurements). An imprecise but accurate map would require thousands of measurements to recover the detail lost in the fuzzy estimate. Therefore, precise maps are cheaper to make than accurate ones since measuring ground control points is a time consuming expensive task.

Furthermore, for some tasks, precision is all you really care about. For example, scientist Andrea Laliberte at the Jornada Experimental Range in south-central New Mexico is interested in classifying invasive species in the desert. For this task you need precision, not accuracy. Of course, if you wanted to bomb a particular shrub, you would want accuracy. My point is that precision is sometimes good enough and therefore you can sacrifice accuracy. Does this sound familiar?

On the difference between classification probability and classification error

Last year I realized that a classifier with robust error performance could be created by feeding the errors of a front-end algorithm to a back-end one. To prove the point that this has nothing to do with the particular algorithms, I used a well-known computer science trick — I considered how this would apply to two random classifiers. Initial conversations with others about this idea got a cool reception.

Today I realized that part of the problem is that the distinction between the classification probability of a binary random classifier and its classification error on a particular dataset is not clearly understood. This distinction is important. So I’ll try to explain it on this posting.

Suppose you have a training set for a binary label $\{(x_i,y_i)\}$ where $y_i \in \{-1,1\}$, and $x_i$ is the observed feature vector for each observation. Furthermore, suppose that for this particular dataset the prevalence of the $+1$ class is 70% versus 30% for the $-1$ class. Consider the binary random classifier that selects the $+1$ class 70% of the time and the $-1$ class 30%. What is the classification error of this random algorithm for this particular dataset?

Well, it correctly classifies class $+1$ 49% of the time ($0.7*0.7=0.49$). Class $-1$ is classified correctly 9% of time ($0.3*0.3=0.09$). So the total classification error rate is 42% (100% – 49% – 9%). Note that the error rate of this random classifier on this dataset is not equal to 100% – 70% or 100% – 30% — the classification error rate is not equal to the classification probability.

What does this have to do with error robustness? Users have to deploy our classification algorithms in environments that are unknown to us. The 70/30 random classifier could be deployed in a situation where the prevalence of the classes was 60/40, somewhat different than our training dataset. The error rate of the random classifier on this dataset would then be 46%. Our error rate has increased. For some classification applications we care more about the cost caused by errors than by our accuracy. For example, when I step into a minefield I care more about not stepping on a mine than being accurate about finding a mine every time I gently push a stick into the ground! Therefore, any algorithm that stabilizes the error rate in uncertain environments would be extremely useful to such users. This is a very different line of research than the usual one that you see in the computer science literature where all the focus is on increasing accuracy not stabilizing errors. I’ll leave for another post how one could do precisely this — stabilize the error rate in uncertain environments while sacrificing accuracy by concatenating classification algorithms.