Pattern matching is difficult. If you’ve ever doubted this, consider string unavoidability, the topic of this paper.
Let A = {a,b,...} be an alphabet of letters, and V = {x,y,z,...} be variables that can take as values any nonempty string formed from A. We’ll use a=0 and b=1 for two-letter alphabet examples. A pattern is a mixture of letters and variables, for example p = 0x0. A pattern occurs in a word w, p ≤ w, if some assignment to the variables makes p match a substring of w. Thus, p ≤ 01000, with x = 100. A pattern is avoidable if there are arbitrarily long strings that do not contain it, and unavoidable otherwise. The pattern p = xxx is avoidable. The strings 01, 0110, 01101001, ... avoid it. These strings are composed of two halves, the second being the complement of the first: for example, 0110 joins 1001 to form 01101001. This sequence avoids xxx. On the other hand, p = xyx is unavoidable for strings of length five or more. The string 00111 matches 111 with x=1, y=1; 010101 matches 101 with x=1, y=0; and so on.
It has been established that string avoidability is decidable, but the most natural reduction algorithm was earlier proven intractable by the author of this paper. Here, she explores whether the known necessary conditions on unavoidability may combine to provide a tractable algorithm. The most obvious necessary condition is that, if p is unavoidable and q ≤ p, then q is unavoidable. We saw that xyx is unavoidable. Then so must be xy (however, xx is avoidable). A less obvious condition is that an unavoidable pattern must contain an isolated variable, such as the y in xyx. There are other conditions as well.
The author’s conclusion is, alas, negative. Every algorithm concocted from the known necessary conditions fails on some set of often-subtle examples. For example, p = xyxzuvzyvy is avoidable, but q = xyxzuvyvzyx, which shares a six-variable prefix with p, is unavoidable. It remains an open problem to determine the computational complexity of deciding string unavoidability.