The number of clusters problem as precision error minimization

One possible application of the precision error tensors framework is to use it as a criterion for selecting the number of clusters needed to describe a dataset. The number of clusters problem refers to the generic problem of deciding how many clusters describe a dataset. Many clustering algorithms exist. Deciding which one is appropriate in a particular task is up to the investigator. Suppose that one has settled on a clustering algorithm. An algorithm like k-clusters has no natural stopping criterion. You dial in how many clusters you want, i.e. you manually set the value of $k$, and the algorithm gives you the data clustered into $k$ groups.

Putting aside the correctness of using a particular algorithm for clustering a specific dataset, we can ask: what number of clusters gives me the smallest precision error? This provides an automatic algorithm for deciding on the optimal number of clusters given the chosen algorithm and the dataset to which it is applied.

Precision error vectors are rank-1 tensors

My previous post on precision error tensors was misleading. We tend to think of tensors as complicated mathematical structures. Vectors are rank 1 tensors. Driving home from work today, I realized that I had already shown that precision error vectors can be calculated in our horizontal decorrelation estimation paper. So mathematically speaking, I have already shown that precision error should be treated as tensors. The precision error vector is the rank 1 tensor example. The precision error covariance matrix is the rank-2 tensor. Two examples in the usual tensor progression. At some future time I should calculate the rank-3 tensor. How would one induce representations of the Symmetric group in rank-3 tensors?

Precision error tensors

Mathematical objects have dimensions associated with them. The temperature outside my house is measured as a single number or scalar. It is a one-dimensional quantity. This fact can be observed in how mercury thermometers are built: they are a long tube or line. Thermometers are never built as squares.

The position of house in a city is an example of a two dimensional quantity. It requires two numbers to specify and is therefore two-dimensional. This fact is obvious in that maps of cities are usually printed in a sheet of paper not a very thin strip of paper. The position of the house is expressed as a vector. This vector can be expressed as an ordered series of numbers of the form $(v_1, v_2,\ldots,v_n)$. Another way to represent the vector is just with the single symbol $v_i$. You tell me the value of $i$ and I go down the list and read off the component $v_i$.

Generalizing further, we can have matrices like the precision error covariance matrix I have been going on and on about all these months. This matrix can be represented by the symbol $m_{i,j}$. You now have to tell me two numbers, $i$ ad $j$, for me to read off the correct entry in the matrix.

We can keep playing this game forever. It is possible to invent mathematical quantities of the form $m_{i,j,k}$. Three “indices” need to be specified to read off an entry. You can think of this as a cube of numbers.

Precision error covariance matrices can also be generalized to precision error tensors. Instead of just asking how are the errors between two models correlated, we can ask how are the errors of three models correlated. We can have a cube of cross-correlations between the different model errors!

Do enough models always guarantee that the precision error will be sparse?

I’m writing a talk for the Machine Learning Lunch at UMass/Amherst. It has got me thinking about the issue of sparsity in the precision error. Here are the technical details. The precision error when using $n$ models leads to a $n^2$ covariance matrix. Because the matrix is symmetric, it really has $n(n+1)/2$ independent components. You can think of the covariance matrix as a data structure that lives in a $n(n+1)/2$. It does not inhabit the full space because cross-correlations must be less than the variance of the cross-correlated variables.

The autonomous difference equations that were first publishd in the Swiss paper lead to $n(n+1)/2 – n$ independent equations. So the linear algebra system for the precision error covariance matrix is always under-determined by $n$ equations. The $\ell_1$ minimization technique advocated by Donoho would be able to solve this system if enough of the entries in the matrix were zero. This is what I mean by: is the precision error covariance matrix sparse enough?

You can calculate that the percentage of equations you have as the following fraction less than one
$$\frac{n(n+1)/2-n}{n(n+1)/2}.$$
This number is equal to
$$ 1 – \frac{2}{n+1}$$
so it approaches a well-determined system (the value one) with a term that dies as order $1/n$. Doesn’t this mean that if you have enough models ($n$ large enough) you would be able to become sparse enough to reconstruct the covariance matrix?

This means that sparsity of the error of the signal behaves empirically different than the sparsity of the signal. This has profound implications for the wide-applicability of the theory Howard Schultz and I have developed.