Archive for the 'Machine Learning' Category

NIPS interesting paper on group theory and Fourier analysis applied to inference

This week I am at the annual Neural Information Processing Systems Conference a fascinating conference that combines many of my scientific interests on machine learning, computer vision, statistics, and natural language processing. Last night I visited the poster by Jonathan Huang on efficient inference for distributions on permutations.

The paper considers the problem of how to reason probabilistically in a tracking task where you have sporadic tracking information of objects. Juang and co-authors end up using concepts like irreducible representations and Clebsch-Gordan coefficients. This may be unfamiliar concepts to the reader but to me they sound like a distant echo of all my physics training since these concepts are all over quantum mechanics and quantum field theory. What a cool paper! I’ll definitely be studying this paper since I have been interested in the issue of permutations with my work on recognizing answer patterns in multiple-choice exams.

The random binary detector with the lowest error cost variation

Last night I did some calculations to see what random detector has the lowest error cost variation. That is, given that the detector will be deployed in an uncertain environment, what random detector should you use to minimize its error cost variation. The math is simple so I will show it all in this post.

As in previous posts, I am restricting myself to the simplest case: a binary classification problem. We have two classes, call them A and B. The variability of the testing environment is codified by the single parameter f A, the frequency of class A. The binary random detector is described by a single parameter also, d A — the classification rate for class A.

The cost of misclassifying class A as class B will be denoted by C AB. The cost of classifying class B as class A by C BA. The error cost or loss function in this simple example is then equal to
L=C ABf A(1 d A)+C BA(1 f A)d A.

What random detector setting d A minimizes Lf A, the loss variation under a changing environment? Calculating the derivative one quickly obtains that zero variation in the cost is attainable when
d A=C ABC AB+C BA.
Note that this is not the most accurate detector setting for a particular environment, f A.

The strategic advantage of stabilizing errors in uncertain environments

This weekend I was talking to a friend from graduate school about how to stabilize errors in uncertain environments by sacrificing accuracy, the subject of the previous two posts. During the conversation, I realized another advantage of stabilizing errors: it makes it easier to modify your detector/classifier in an adversarial situation.

The “arms race” game is a common one in conflict situations. You have some technological advantage over an opponent. The opponent adapts and changes their strategy. This is different from the issue of a noisy environment. Here the signal used by your detector is intentionally modified to circumvent its performance level. The enemy camouflages their tanks better. New strategies are deployed. All of which leads to an increase in the error rate of your detector.

A user confronted with this situation would want to slow down their error rate increase so it can be below their learning rate. Such a user would be willing to sacrifice accuracy for the ability to adapt to an adversary. There is a strategic advantage to stabilizing errors.

A simple example of how to stabilize classification errors in uncertain environments by sacrificing accuracy

I corrected an assertion in my earlier post on the difference between classification rates and classification errors. I mistakenly asserted that for a training dataset with a 70/30 prevalence of binary labels the best random classifier is the 70/30 one. This is not so. The best one is the 100/0 classifier since it will correctly classify the prevalent label all of the time and will therefore have an error rate of 30% on the 70/30 training dataset. This is better than the 70/30 random classifier which has a classification error of 42% on this training dataset.

But this clarification provides a simple example of how sacrificing accuracy leads to stabilizing errors in an uncertain environment. As the previous post shows, the 70/30 classifier has an error rate increase of 4% when tried on data that has a 60/40 ratio. Its error has increased by a relative 9.5% ((46 42 )/42 ). But the better random classifier, the 100/0 one, has its error rate increase by 33% ((40 30 )/30 )! It is overtrained and therefore much more unstable when applied to data that differs from the characteristics of its training set. The moral: increased stability of the error rate can be achieved by sacrificing accuracy.

Of course, these are just specific examples. The way to quantify the benefit of sacrificing accuracy is to consider the average stability of a random classifier under various degrees of variability in the input. This can be done, for example, by considering a beta distribution with various amounts of variability to generate synthetic input data. As the variability knob on the beta distribution is increased one should find that the best strategy for stabilizing error on average is to sacrifice accuracy.

Now suppose that a user/employer comes to you and asks that you design a classifier based on a 70/30 training dataset. Should you give them the 100/0 or 70/30 random classifier? I think the responsible thing to do is to give them the 70/30 classifier. It will also make them happier! I don’t think many users/employers are going to be happy with a classifier that gets one class completely wrong all of the time in spite of your pleas that it is more accurate! Why? Because users always have costs associated with the errors of your classifier. Would you be happy with a spam filter that was 70% accurate but delivered no mail?

What is more important: precision or accuracy?

I have just uploaded our autonomous precision error estimation Swiss conference paper to the new UMass/Amherst Digital Repository. I work at the Aerial Imaging and Remote Sensing Lab at UMass/Amherst. For years, Howard Schultz has been doubling the number of Digital Elevation Models (DEMs) his Terrest system makes by using the fact that computer stereo matching algorithms are not symmetric in their inputs. The paper above proves mathematically that this is indeed correct and possible if the correlation between DEMs is sparse, i.e. only a few of the DEMs are highly correlated.

We are now writing a proposal to further this work by connecting it with compressed sensing ideas as we briefly hint at the end of the paper above. This has got us thinking about the difference between precision and accuracy in measurement errors. Precision is the width of your error bars, accuracy is where the centroid of the measurement is located relative to the ‘true’ value. So if you had to choose between a precise and an accurate map, which one would you choose?

A precise map is one that captures the shape of the world very well, it has a lot of details. An accurate map is one that tells you where objects are located in the real world. An accurate but imprecise map is located correctly but it is very fuzzy. The landscape looks melted. A precise but inaccurate map is located wrong but has lots of detail — you tried to map a desert patch and the map says you mapped Paris (there is a caveat to this characterization — horizontal correlation — that I don’t want to get into in this post). The practical significance of this difference is enormous. A precise but inaccurate map is just a rigid body transformation and a scale change of the real world (3 + 3 + 1 = 7 unknown parameters) that can be fixed by measuring 3 ground control points (3 + 3 + 3 = 9 measurements). An imprecise but accurate map would require thousands of measurements to recover the detail lost in the fuzzy estimate. Therefore, precise maps are cheaper to make than accurate ones since measuring ground control points is a time consuming expensive task.

Furthermore, for some tasks, precision is all you really care about. For example, scientist Andrea Laliberte at the Jornada Experimental Range in south-central New Mexico is interested in classifying invasive species in the desert. For this task you need precision, not accuracy. Of course, if you wanted to bomb a particular shrub, you would want accuracy. My point is that precision is sometimes good enough and therefore you can sacrifice accuracy. Does this sound familiar?

On the difference between classification probability and classification error

Last year I realized that a classifier with robust error performance could be created by feeding the errors of a front-end algorithm to a back-end one. To prove the point that this has nothing to do with the particular algorithms, I used a well-known computer science trick — I considered how this would apply to two random classifiers. Initial conversations with others about this idea got a cool reception.

Today I realized that part of the problem is that the distinction between the classification probability of a binary random classifier and its classification error on a particular dataset is not clearly understood. This distinction is important. So I’ll try to explain it on this posting.

Suppose you have a training set for a binary label {(x i,y i)} where y i{1,1 }, and x i is the observed feature vector for each observation. Furthermore, suppose that for this particular dataset the prevalence of the +1 class is 70% versus 30% for the 1 class. Consider the binary random classifier that selects the +1 class 70% of the time and the 1 class 30%. What is the classification error of this random algorithm for this particular dataset?

Well, it correctly classifies class +1 49% of the time (0.7 *0.7 =0.49 ). Class 1 is classified correctly 9% of time (0.3 *0.3 =0.09 ). So the total classification error rate is 42% (100% - 49% - 9%). Note that the error rate of this random classifier on this dataset is not equal to 100% - 70% or 100% - 30% — the classification error rate is not equal to the classification probability.

What does this have to do with error robustness? Users have to deploy our classification algorithms in environments that are unknown to us. The 70/30 random classifier could be deployed in a situation where the prevalence of the classes was 60/40, somewhat different than our training dataset. The error rate of the random classifier on this dataset would then be 46%. Our error rate has increased. For some classification applications we care more about the cost caused by errors than by our accuracy. For example, when I step into a minefield I care more about not stepping on a mine than being accurate about finding a mine every time I gently push a stick into the ground! Therefore, any algorithm that stabilizes the error rate in uncertain environments would be extremely useful to such users. This is a very different line of research than the usual one that you see in the computer science literature where all the focus is on increasing accuracy not stabilizing errors. I’ll leave for another post how one could do precisely this — stabilize the error rate in uncertain environments while sacrificing accuracy by concatenating classification algorithms.