Friday, May 24, 2013

KMP Algorithm

KMP Algorithm

KMP is a string matching algorithm which finds the occurrence of a pattern string in a text string.

The brute force way to find pattern in text is to check each position it could be. For a position in text, it compares pattern with sub-string of text starting from this position, character by character, to see whether a match is found or not. If fails, do the match again from the next position of text, and so foth. Below is the C code extracted from Linux kernel's <string.c>, it returns the first occurrence of pattern (char *s2) in text (char *s1) or returns NULL if fails.
char *strstr(const char *s1, const char *s2)
{
    size_t l1, l2;

    l2 = strlen(s2);
    if (!l2)
        return (char *)s1;
    l1 = strlen(s1);
    while (l1 >= l2) {
        l1--;
        if (!memcmp(s1, s2, l2))
        return (char *)s1;
        s1++;
    }
    return NULL;
}
Suppose the length of text and pattern is n and m respectively, the worst case time complexity is O(nm).

Each time when a mismatch is found, the brute forth method moves to the next position of text and do the memcmp again. In figure 1, pattern ("abaeabac") is compared with text from position i-7. The comparison fails at the last character of pattern ('c' != 'd'). KMP, however, moves more steps rather than moves just one position for next try as it makes use of previous match information.

Fig. 1

For substring pattern[0,...6], KMP knows its longest common prefix and suffix, that is "aba" underlined in red (figure 1). Notice that substring text[i-7,...,i-1] is the same as substring pattern[0,...,6], so we know that text[i-3,...,i-1] (same as the suffix) must be the same as pattern[0,...2] (prefix) and it is the next longest successive match to position i-1 in text. Then it is safe to move pattern to position i-3 for next match as shown in figure 2.
Fig. 2

As text[i-3,...,i-1] is the same as pattern[0,...2], it is unnecessary to compare them again. We only need to start the comparison from position 3 (length of pattern[0,...2]) of pattern and the last mismatch position i of text. Then another mismatch between text[i] and pattern[3] is found, KMP considers the longest common prefix and suffix of substring pattern[0,...,2] again and knows the next comparison should start between pattern[1] and text[i] as shown in figure 3.
Fig. 3
KMP uses an array next to store the length of the longest common suffix and prefix for each substring pattern[0,...,j] (j = 0 to m-1). Suppose a mismatch occurs between pattern[j] and text[i], and next[j-1] = k. Then KMP just needs to start the next match to text[i] from pattern[k], and so forth.

The computation of next array is similar to the match process. Suppose next[0,...,i-1] has been computed and next[i-1] = j (means pattern[0,..j-1] == pattern[i-j,...,i-1]). If pattern[i] is equal to pattern[j] (means pattern[0,...,j] == pattern[i-j,...,i]), the length of the longest common prefix and suffix of pattern[0,...,i] is j+1 (set next[i] = j+1); otherwise, let j next[j-1] and check whether pattern[i] is equal to pattern[j], and so forth.

At any time the text or pattern will move at least one step during matching, so matching takes O(n) time. Similarly, the computation of next array takes O(m) time. The total time complexity of KMP is then O(n+m). Below is the code of KMP in C++.

Code (C++)

void calc_next(char *pattern, int *next)
{
    next[0] = 0;
 
    int j = 0;
    for (int i = 1; pattern[i] != '\0'; ++i) {
        while (j > 0 && pattern[j] != pattern[i])
            j = next[j-1];
                
        if (pattern[j] == pattern[i])
            ++j;
                
        next[i] = j;
    }
}

char *strstr(char *text, char *pattern)
{
    assert (text != NULL && pattern != NULL);
        
    int len2 = strlen(pattern);
    if (len2 == 0)
        return text;
    int len1 = strlen(text);
            
    int *next = new int[len2];
    calc_next(pattern, next);
        
    int j = 0;
    for (int i = 0; i < len1; ++i) {
        while (j > 0 && pattern[j] != text[i])
            j = next[j-1];
                
        if (pattern[j] == text[i])
            ++j;
            
        if (j == len2) {
            delete[] next;
            return text + (i - j + 1);
        } 
    }
        
    delete[] next;
    return NULL;
}


-End-