Archive for the 'Natural Language Processing' Category

Precision error for parse trees

The precision error equations require that “ground truth” cancel out. It is easy to see what that means for elevations in a map. What does it mean for parse trees in a natural language processing task like sentence parsing?

One way to define the distance between two trees is to count the total number of reverse operations needed to bring them back to a common ancestor. Is that number equal to the number one would get by comparing everything to the “true” parse? That is, is the distance of the observed parse prediction equal to the distance of the true parse plus the distance created by the error transformations?

Subtraction makes sense to me in the context of trees: you take everything after the common ancestor. What is addition of parse trees? The union of all edges and vertices. Parse trees are graphs, after all.

This addition and subtraction of graphs means that we can use the precision error equations. Parse trees are added and subtracted. In the end, a score is assigned to the difference by counting the number of operations it would take to collapse the resulting graph to disconnected single ancestors.

How do I get a bunch of parsing models to test this idea out?

Half-life of English irregular verbs

I picked up a copy of this month’s Discover magazine and found an interesting news item on the half-life of English irregular verbs. This piqued my interest since I have been doing some studying of Natural Language Processing to see how precision error could be used in the field.

A Student’s Introduction to English Grammar (co-written by one of the principals at the Language Log) defines irregular verbs as those that do not have a well-defined rule to generate their inflectional forms. The preterite form of “walk” is “walked”. “Walk” is a regular verb that uses the “-ed” rule for forming the preterite and past participle inflections. On the other hand, “fly” is an irregular verb since the preterite form is given by “flew” and the past participle by “flown”.

Erez Lieberman and co-authors did a quantitative study of how often irregular verbs in English turn regular. From historical records (Old English -> Middle English -> Modern English) they were able to determine that the half-life of an irregular verb is proportional to the square root of its frequency. An irregular verb that is 100 times less frequent in daily use than another verb will regularize 10 times faster than the more frequently used one.
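The arithmetic behind that last sentence can be checked with a short Python sketch. It assumes only the square-root relationship stated above; the function name `half_life_ratio` is made up for illustration.

```python
import math

def half_life_ratio(frequency_ratio):
    """Ratio of two verbs' half-lives, given the ratio of their frequencies.

    Assumes half-life is proportional to sqrt(frequency), as stated above.
    """
    return math.sqrt(frequency_ratio)

# A verb used 100 times less often than another (frequency ratio 1/100).
ratio = half_life_ratio(1 / 100)
print(ratio)      # its half-life is about one tenth as long
print(1 / ratio)  # so it regularizes about 10 times faster
```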

The idea of a half-life comes from nuclear physics. Given a sample of n radioactive atoms, the half-life is the time after which, on average, half the atoms have decayed to another type. The half-life of the uranium isotope U-238 is about 4.5 billion years. This, by the way, explains why we can still find U-238 on Earth (which is, itself, about 4.5 billion years old). If the half-life of U-238 were a million years or less, it would all have disappeared by the time we became clever enough to discover radioactivity (about a hundred years ago).
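To make the uranium argument concrete, here is a small Python sketch of the standard exponential decay law: the fraction remaining after time t is 0.5 raised to t divided by the half-life. The function name and the rounded figure of 4.5 billion years for both Earth's age and the U-238 half-life are assumptions for illustration.

```python
def fraction_remaining(elapsed_years, half_life_years):
    """Expected fraction of atoms not yet decayed after elapsed_years."""
    return 0.5 ** (elapsed_years / half_life_years)

EARTH_AGE_YEARS = 4.5e9  # rounded assumption

# With a half-life of about 4.5 billion years, roughly half the U-238 is left.
print(fraction_remaining(EARTH_AGE_YEARS, 4.5e9))

# With a hypothetical half-life of one million years, 4,500 half-lives have
# passed and essentially nothing would be left.
print(fraction_remaining(EARTH_AGE_YEARS, 1e6))
```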

The half-life for verbs with a frequency between 1/1000 and 1/100 is estimated to be 5,400 years. Examples of verbs in this frequency bin are “begin” and “help”. “Begin” is still irregular (“began”), but “help” decayed from “holp” to “helped” sometime between Middle English and Modern English, although the Oxford English Dictionary says “holp” is still used in obscure American dialects. The OED quotes Mark Twain in “The Prince and the Pauper”: “Of a truth I was right — he hath holpen in a kitchen.”

The most common verbs, “be” and “have”, have not been observed to decay, but extrapolating with the square-root-of-frequency rule allows the authors to estimate a half-life of about 39,000 years! In other words, English as a language will probably die before “be” becomes regular.