Error-tolerant finite-state recognition based on a research of Kemal Oflazer, enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite-state recognizer. For example, suppose we have a recognizer for the regular set over {a, b} described by the regular expression (aba + bab), and we would like to recognize inputs that may be slightly corrupted, for example, abaaaba may be matched to abaaba (correcting for a spurious a), or babbb may be matched to babbab (correcting for a deletion), or ababba may be matched to either abaaba (correcting a b to an a) or to ababab(correcting the reversal of the last two symbols). Error-tolerant recognition can be used in many applications that are based on finite-state recognition, such as morphological analysis, spelling correction, or even tagging with finite-state models (Voutilainen and Tapanainen 1993; Roche and Schabes 1995). The approach presented in his paper uses the finite-state recognizer built to recognize the regular set, but relies on a very efficiently controlled recognition algorithm based on depth-first searching of the state graph of the recognizer. In morphological analysis, misspelled input word forms can be corrected and morphologically analyzed concurrently. In the context of spelling correction, error-tolerant recognition can universally be applied to the generation of candidate correct forms for any language, provided it has a word list comprising all inflected forms, or its morphology has been fully described by automata such as two-level finite-state transducers (Karttunen and Beesley 1992; Karttunen, Kaplan, and Zaenen 1992). The algorithm for error-tolerant recognition is very fast and applicable to languages that have productive compounding, or agglutination, or both, as word formation processes.
We can informally define error-tolerant recognition with a finite-state recognizer as the recognition of all strings in the regular set (accepted by the recognizer), and additiona strings that can be obtained from any string in the set by a small number of unit editing operations.
The notion of error-tolerant recognition requires an error metric for measuring how much two strings deviate from each other. The edit distance between two strings measures the minimum number of unit editing operations of insertion, deletion, replacement of a symbol, and transposition of adjacent symbols (Damerau 1964) that are necessary to convert one string into another. Let Z = zl, z2 . . . . , Zp denote a generic string of p symbols from an alphabet A. Z~] denotes the initial substring of any string Z up to and including the jth symbol. We will use X (of length m) to denote the misspelled string, and Y (of length n) to denote the string that is a (possibly partial) candidate string. Given two strings X and Y, the edit distance ed(X[m], Y[n]) computed according to the recurrence below (Du and Chang 1992) gives the minimum number of unit editing operations required to convert one string to the other.
ed(X[i+ 1],Y[j+ 1]) = ed(X[i],Y~’]) if xi+l = yj+l (last characters are the same)
= 1 + min{ed(X[i - 1], Y[j - 1]), (if both xi = yj+l)
ed(X[i + 1], Y[j]), (and xi+l = yj)
ed(X[i], Y~” + 1])} (last two characters are transposed)
= 1 + min{ed(X[i], Y[j]), otherwise
ed(X[i + 1], Y[j]),
ed(X[i], Y~” + 1])}
ed(X[O],Y~’]) = j 0 < j < n
ed(X[i],Y[O]) = i 0 < i < m
ed(X[-1], Y~’]) = ed(X[i], Y[-1]) = max(re, n) (boundary definitions)
For example, ed(recoginze, recognize) = 1, since transposing i and n in the first string would give the second. Similarly, ed(sailn,failing) = 3 since one could change the initial s of the first string to f, insert an i before the n, and insert a g at the end to obtain the second string.
A (deterministic) finite-state recognizer, R, is described by a 5-tuple R = (Q, A, 6, q0, F) with Q denoting the set of states, A denoting the input alphabet, 8 : Q x A —, Q denoting the state transition function, q0 E Q denoting the initial state, and F C_ Q denoting the final states (Hopcroft and Ullman 1979). Let L c A be the regular language accepted by R. Given an edit distance error threshold t > 0, we define a string X[m] ~ L to be recognized by R with an error at most t, if the set C = {Y[n] I Y[n] c L and ed(X[m],Y[n]) < t} is not empty.