Archive for the 'Machine Learning' Category

Asymmetry in likelihood of causing the error

As the number of models increase, the observed pattern in prediction discrepancies allows one to decide what is causing it assuming uniform uncertainty among all possible scenarios. The observed error pattern will be consistent with many different scenarios. In some scenarios, the noisy model predictions are due to the model being correct and the other models being incorrect. In other scenarios, the model is incorrect and the other models are correct. The hypothesis I want to prove is that the “mass” (the number of states) that corresponds to assuming the model is incorrect becomes larger than the one assuming it is correct.

Variations in student responses to a multiple exam for latent group discovery

Questions in an exam are detectors of student competency. Students are detectors of the correct answers in a test. What is the variation in the student’s model of the correct exam? The precision error equations can be used to construct a covariance matrix for the students instead of the questions. What makes the difference is what is being averaged. When you want to use a test to tell you something about the questions, you average over the students. The covariance matrix is then indexed by the questions. When you want the test to tell you about the students, you average over the questions. The covariance matrix is indexed by students.

All of this suggests that it would be possible to build a completely parameter-less approach to detecting latent groups in students. This would be a different approach from that involved in topic models which use a specific probability distribution — the Dirichlet distribution. In this approach, you would assume a number of groups and arbitrarily assign students to these groups in a probabilistic fashion (60% group 1, 30% group 2, etc.). One can then see how well this group distribution predicts the observed covariance matrix by use of non-commutative harmonic analysis. Group assignment is thereby completely determined by the data — no parameters are needed.

The number of clusters problem as precision error minimization

One possible application of the precision error tensors framework is to use it as a criterion for selecting the number of clusters needed to describe a dataset. The number of clusters problem refers to the generic problem of deciding how many clusters describe a dataset. Many clustering algorithms exist. Deciding which one is appropriate in a particular task is up to the investigator. Suppose that one has settled on a clustering algorithm. An algorithm like k-clusters has no natural stopping criterion. You dial in how many clusters you want, i.e. you manually set the value of k, and the algorithm gives you the data clustered into k groups.

Putting aside the correctness of using a particular algorithm for clustering a specific dataset, we can ask: what number of clusters gives me the smallest precision error? This provides an automatic algorithm for deciding on the optimal number of clusters given the chosen algorithm and the dataset to which it is applied.

Precision error tensors

Mathematical objects have dimensions associated with them. The temperature outside my house is measured as a single number or scalar. It is a one-dimensional quantity. This fact can be observed in how mercury thermometers are built: they are a long tube or line. Thermometers are never built as squares.

The position of house in a city is an example of a two dimensional quantity. It requires two numbers to specify and is therefore two-dimensional. This fact is obvious in that maps of cities are usually printed in a sheet of paper not a very thin strip of paper. The position of the house is expressed as a vector. This vector can be expressed as an ordered series of numbers of the form (v 1 ,v 2 ,,v n). Another way to represent the vector is just with the single symbol v i. You tell me the value of i and I go down the list and read off the component v i.

Generalizing further, we can have matrices like the precision error covariance matrix I have been going on and on about all these months. This matrix can be represented by the symbol m i,j. You now have to tell me two numbers, i ad j, for me to read off the correct entry in the matrix.

We can keep playing this game forever. It is possible to invent mathematical quantities of the form m i,j,k. Three “indices” need to be specified to read off an entry. You can think of this as a cube of numbers.

Precision error covariance matrices can also be generalized to precision error tensors. Instead of just asking how are the errors between two models correlated, we can ask how are the errors of three models correlated. We can have a cube of cross-correlations between the different model errors!

Precision error covariance matrix for a Physics multiple-choice exam

Precision error covariance for 10 questions from a Physics test
I have applied the autonomous difference equations to test the quality of ten out of twenty questions I used in a Physics exam I gave in the Spring of 2006 to an introductory class for engineering students. That dark square in position six of the matrix corresponds to the question least likely to be answered correctly. Only 64 students out of 250 answered correctly. The reason this happened was that I gave a very clever wrong answer that attracted most of the students (the correct answer but forgetting to take a square root). I have used the precision error covariance matrix to assess the test maker not the students!

This example also highlights the general applicablity of the precision error covariance matrix. There now exist two experimental verifications of its usefulness: digital elevation maps and test assessment.

To err is human, to study your errors is glorious

I’ve been sick all week but today has been the worst. In between my sleeping hallucinations I have been thinking a lot about a proposal I’m currently writing on the use of non-commutative harmonic analysis to study mapping error patterns. It has become clear that the approach we are advocating at the AIRS lab is applicable to other areas of machine learning. Before I describe in additional detail what I mean by this let me present a graphic that abstractly represents the scientific enterprise: An abstract representation of the scientific enterprise
The work I am describing here lies at the bottom of the picture. It is an algorithmic prescription for understanding the precision of our models given a dataset used to construct those models. I present this diagram to make clear the limitations of our work. It is not a description or explanation of errors in general. It is a technique for probing the error patterns in your system. The hard work is still left to you on how to apply it for a specific system that constructs models from data, and its usefulness is not guaranteed. The technique may tell you nothing interesting about your system.

We can view model creation as a black box. It takes data inputs and produces a model. The crucial point is that in some machine learning situations the number of models we can build is very large. Data is model-redundant.The redundancy is sometimes continuous, for example, the initial position of a camera or the weight of a Lagrangian term. But it can also be discrete — a finite set of documents or photographs. Changing the data inputs can then be used to probe the variation of a system’s model predictions. These variations will not be completely random (i.e. patternless). Some documents are more informative, some photographs give us a better view. Therefore we can use the symmetry groups associated with our data inputs to Fourier analyze the model variation. This model variation or model precision is informative about the quality of the data and can be used to reject bad data or discount lower quality inputs.

X-raying the geometric precision error of DEMs with Fourier analysis

In a previous post I mentioned a way of Fourier analyzing the geometric precision error of DEMs. Today I realized that the scheme I proposed can only account for part of the error signal. The approach I proposed is correct but it can only capture one particular aspect of the total error. The simplest way of seeing this is to consider the S 2 symmetry group. This would be the one to use for p=2 photographs. From two photographs I can produce two DEMs: AB and BA. The covariance matrix for these two DEMS would be a 2×2 covariance matrix of the form:
(ab bc).
But the representation induced by S 2 on these two DEMs generates the matrices:
(1 0 0 1 )and(0 1 1 0 )
These two matrices cannot capture the three independent degrees of freedom in the 2×2 covariance matrix. Therefore, the induced representation cannot capture all of the possible errors that are observed when two photograps are used to produce two maps. But the representation would allow you to project out that component of the error that is explained by permutations of the images.

You would need at least p=7 photograps to have enough members in S p to completely model the variation in the DEMs observed when you use two photographs to produce a map. To understand the error that cannot be explained by the permutation group you would need to use three photographs to create a DEM. For p photographs this would create p*(p1 )*(p2 ) DEMs. Since we can produce as much or even more than p! DEMs from p photographs, at some point we will always overwhelm the representational power of the symmetry group of p objects (in this case, p photographs). What error remains after we project out the component that can be modelled by S n? I hypothesize that it would be error that can be further Fourier analyzed by using the symmetry group associated with the orientation and positions of the cameras. These parameters are themselves error prone and would, by virtue of their geometry, only induce certain error patterns.

This viewpoint of the errors would therefore view the observed error as one that can be captured by a succesive series of symmetry groups. One component would be that related to the finite group of S p. Another component would be that one induced by translations and rotations of the camera positions and orientations. Like any real theory of errors, this approach would only peel away layers of error — always remaining would be a nugget of error that would require more and more complex models to decompose. The second law of thermodynamics is not violated!

The metaphor to x-raying in the title of this post comes from using Fourier analysis to study X-ray diffraction photographs by crystals. Crystals induce a certain periodicity on the scattered X-rays even when the sample is crushed into a powder. In other words, the randomly scattered blocks of crystal in the powder individually send a perfect difraction pattern. But the X-ray photograph records the mismash of the signals — the picture is blurry. Nonetheless, the bluriness has a symmetry component that comes from the periodic structure of the crystals and therefore Fourier analysis is able to pick the symmetry in the x-ray caused by the crystal periodicity. The Fourier decompositions for geometric errors are doing the same thing. There are many sources of errors in DEMs from aerial photographs. Some come from the fact that you used individual photographs to create the maps. This component of the error can therefore be accounted by studying representations of the symmetry group of p objects. Others come from uncertainty in the position or orientation of the camera when it took the photograph. These are explained by induced representations of non-abelian Lie groups like 3-D rotations in the space of covariance matrices.

Decreasing precision errors with randomness

If I was to rate the things I have learned from computer science, I would place the algorithmic use of randomness right at the top. The uses of randomness in computations is too vast to start a list here. Check out Probability and Computing: Randomized Algorithms and Probabilistic Analysis for many examples. I want to discuss another way of using randomness in computation by discussing the estimation of precision errors in Digital Elevation Models.

I’ll use some simple linear algebra to explain how precision error can be discussed in the language of compressed sensing. The Swiss paper describes how to turn the estimation of the precision error covariance matrix into a linear algebra problem of the
form
S=Φα.
“S” is the signal. In this case, the autonomous difference terms one can calculate from the DEM elevation estimates. This makes “S” the signal because it can be calculated from what we observe — the DEM elevations. The vector α are the precision error covariance terms the robot is trying to estimate without knowing any ground truth. The “reconstruction” matrix Φ(n) tells you how to go between these two quantities. Φis can be calculated exactly and is only a function of the number n of DEMs.

Randomness comes into the error estimation process because there exist many different ways to specify the reconstruction matrix. Take the example of 10 DEMs I have used before since it corresponds to the case I have studied most in my work with Howard Schultz. The autonomous difference equations give us about 5,000 different ways of calculing quantities that do not depend on ground truth. Out of all those many equations, only 45 are linearly independent. Which 45? Any 45. That means that there are many ways of constructing Φ. So many that I can randomly do it by picking equations from the set of 5,000 equations until I get a set of 45 linearly independent ones.

So it becomes possible to check the precision error estimate many different ways. Could one then use this to improve the error estimation process itself? I do not know but I’m investigating that issue today by running experiments with randomly picked independent sets and plotting how the values vary for the same DEMs used as input.

No data is wasted

Compressed sensing caught my attention last year. I was doing a literature search on the Internet to see if anyone else had discussed the autonomous difference equations that Howard Schultz and I had devised to measure the precision errors in Digital Elevation Models (DEMs). One of the basic tenets of compressed sensing is that since many natural signals are sparse, why waste your resources taking many measurements when you can just take fewer to get the same reconstructed signal? For example, our high mega-pixel cameras capture images that we end up compressing anyhow to much smaller files. So why have CCDs with so many pixels?

I have a new hypothesis that would justify all these redundant measurements. Scientists view measurements as two numbers. The guess for the measured quantity, say the temperature of a glass of water, is the one that gets quoted first. But equally important is the error on that temperature guess. So the temperature of the glass should be properly be quoted as 10.0 ±0.2 Celsius degrees, for example. So repeated measurements may not improve the color and intensity estimate for a pixel in a photograph but it dramatically improves our error on that measurement. Measurements should never be wasted! In the case of maps, it means that repeated images of a terrain would not necessarily improve the resolution of the map, but they would have a dramatic effect on the resolution of the error map for our elevation estimates.

I am currently writing a paper for the International Conference on Machine Learning that is studying this hypothesis in the context of DEMs. I’ll post some of the results in a later post if the hypothesis turns out to be correct. I mentioned this hypothesis for my pre-proposal submission to the National Science Foundation’s new Cyber-Enabled Discovery program but I don’t think it will get much traction just yet.

NIPS interesting paper on group theory and Fourier analysis applied to inference

This week I am at the annual Neural Information Processing Systems Conference a fascinating conference that combines many of my scientific interests on machine learning, computer vision, statistics, and natural language processing. Last night I visited the poster by Jonathan Huang on efficient inference for distributions on permutations.

The paper considers the problem of how to reason probabilistically in a tracking task where you have sporadic tracking information of objects. Juang and co-authors end up using concepts like irreducible representations and Clebsch-Gordan coefficients. This may be unfamiliar concepts to the reader but to me they sound like a distant echo of all my physics training since these concepts are all over quantum mechanics and quantum field theory. What a cool paper! I’ll definitely be studying this paper since I have been interested in the issue of permutations with my work on recognizing answer patterns in multiple-choice exams.