North Rivers: 2013

Friday, May 24, 2013

KMP Algorithm

KMP is a string matching algorithm which finds the occurrence of a pattern string in a text string.

The brute force way to find pattern in text is to check each position it could be. For a position in text, it compares pattern with sub-string of text starting from this position, character by character, to see whether a match is found or not. If fails, do the match again from the next position of text, and so foth. Below is the C code extracted from Linux kernel's <string.c>, it returns the first occurrence of pattern (char *s2) in text (char *s1) or returns NULL if fails.

char *strstr(const char *s1, const char *s2)
{
    size_t l1, l2;

    l2 = strlen(s2);
    if (!l2)
        return (char *)s1;
    l1 = strlen(s1);
    while (l1 >= l2) {
        l1--;
        if (!memcmp(s1, s2, l2))
        return (char *)s1;
        s1++;
    }
    return NULL;
}

Suppose the length of text and pattern is n and m respectively, the worst case time complexity is O(nm).

Each time when a mismatch is found, the brute forth method moves to the next position of text and do the memcmp again. In figure 1, pattern ("abaeabac") is compared with text from position i-7. The comparison fails at the last character of pattern ('c' != 'd'). KMP, however, moves more steps rather than moves just one position for next try as it makes use of previous match information.

Fig. 1

For substring pattern[0,...6], KMP knows its longest common prefix and suffix, that is "aba" underlined in red (figure 1). Notice that substring text[i-7,...,i-1] is the same as substring pattern[0,...,6], so we know that text[i-3,...,i-1] (same as the suffix) must be the same as pattern[0,...2] (prefix) and it is the next longest successive match to position i-1 in text. Then it is safe to move pattern to position i-3 for next match as shown in figure 2.

Fig. 2

As text[i-3,...,i-1] is the same as pattern[0,...2], it is unnecessary to compare them again. We only need to start the comparison from position 3 (length of pattern[0,...2]) of pattern and the last mismatch position i of text. Then another mismatch between text[i] and pattern[3] is found, KMP considers the longest common prefix and suffix of substring pattern[0,...,2] again and knows the next comparison should start between pattern[1] and text[i] as shown in figure 3.

Fig. 3

KMP uses an array next to store the length of the longest common suffix and prefix for each substring pattern[0,...,j] (j = 0 to m-1). Suppose a mismatch occurs between pattern[j] and text[i], and next[j-1] = k. Then KMP just needs to start the next match to text[i] from pattern[k], and so forth.

The computation of next array is similar to the match process. Suppose next[0,...,i-1] has been computed and next[i-1] = j (means pattern[0,..j-1] == pattern[i-j,...,i-1]). If pattern[i] is equal to pattern[j] (means pattern[0,...,j] == pattern[i-j,...,i]), the length of the longest common prefix and suffix of pattern[0,...,i] is j+1 (set next[i] = j+1); otherwise, let j = next[j-1] and check whether pattern[i] is equal to pattern[j], and so forth.

At any time the text or pattern will move at least one step during matching, so matching takes O(n) time. Similarly, the computation of next array takes O(m) time. The total time complexity of KMP is then O(n+m). Below is the code of KMP in C++.

Code (C++)

void calc_next(char *pattern, int *next)
{
    next[0] = 0;
 
    int j = 0;
    for (int i = 1; pattern[i] != '\0'; ++i) {
        while (j > 0 && pattern[j] != pattern[i])
            j = next[j-1];
                
        if (pattern[j] == pattern[i])
            ++j;
                
        next[i] = j;
    }
}

char *strstr(char *text, char *pattern)
{
    assert (text != NULL && pattern != NULL);
        
    int len2 = strlen(pattern);
    if (len2 == 0)
        return text;
    int len1 = strlen(text);
            
    int *next = new int[len2];
    calc_next(pattern, next);
        
    int j = 0;
    for (int i = 0; i < len1; ++i) {
        while (j > 0 && pattern[j] != text[i])
            j = next[j-1];
                
        if (pattern[j] == text[i])
            ++j;
            
        if (j == len2) {
            delete[] next;
            return text + (i - j + 1);
        } 
    }
        
    delete[] next;
    return NULL;
}

-End-

Tuesday, April 30, 2013

K-th Smallest Element of Two Sorted Arrays

Problem Description:

Given two sorted arrays A and B of size m and n, respectively. Find the k-th (1 <= k <= m+n) smallest element of the total (m+n) elements in O(log(m)+log(n)) time.

Analysis:

This is a popular interview question and one special case of this problem is finding the median of two sorted arrays. One simple solution is using two indices pointing to the head of A and B, respectively. Then increase the index pointing to the smaller element by one till k elements have been traversed. The last traversed element is the k-th smallest one. It is simple but the time complexity is O(m+n).

To get O(log(m)+log(n)) time complexity, we may take advantage of binary search because both of the two arrays are sorted. I first read the binary search method from this MIT handout, which aims to find the median of two sorted arrays. This method can be easily modified to solve the k-th smallest problem:

First of all, we assume that the k-th smallest element is in array A and ignore some corner cases for basic idea explanation. Element A[i] is greater than or equal to i elements and less than or equal to all the remaining elements in array A (A[i] >= A[0..i-1] && A[i] <= A[i+1..m-1]). If A[i] is the k-th smallest element of the total (m + n) elements (A[i] >= k-1 elements in A+B and <= all other elements in A+B), it must be greater than or equal to (k - 1 - i) elements in B and less than or equal to all the remaining elements in B (A[i] >= B[k - 1 - i - 1] && A[i] <= B[k - 1 - i]). Below figure shows that A[i] (in gray) is greater than or equal to all green elements (k-1 in all) in A+B and less than or equal to all blue elements in A+B.

Then it becomes very easy for us to check whether A[i] >= B[k - 1 - i - 1] and A[i] <= B[k - 1 - i]. If so, just return A[i] as the result; if not, there are two cases:

1). A[i] > B[k - 1 - i], which means A[i] is greater than more than k elements in both A and B. The k-th smallest element must be in the lower part of A (A[0..i-1]);

2). otherwise, the k-th smallest element must be in the higher part of A (A[i+1..m-1]).

The search begins with range A[0..m-1] and every time i is chosen as the middle of the range, therefore we can reduce the search size by half until we find the result.

The above algorithm looks good, but it won't give you correct answer because there are many corner cases need to be addressed:

1). The simplest one may be that the k-th smallest element is in B rather than in A. When the entire array A has been traversed and no answer is returned, it means the k-th smallest element must be in B (if k is valid, i.e. 1 <= k <= m+n). We just need to run another "binary search" in B and this time the correct answer is guaranteed to be returned.

2). i >= k. This means A[i] is greater than or equal to k elements in A and will be at least the (k+1)-th element in A+B. In this case, we need to "binary search" in the lower part of A.

3). i + n < k - 1. Similarly to case 2), this means A[i] will be at most the (k-1)-th element in A+B. In this case, we need to "binary search" in the higher part of A.

4). At any time when we refer B[k - 1 - i - 1] and B[k - 1 - i], we should assure the indices are in range [0, n) to avoid out of array bounds error.

The binary search on A and B takes O(log(m)) and O(log(n)) time respectively, so the worst case time complexity is O(log(m)+log(n)). For the problem of finding median of two sorted arrays, the ((m+n)/2 + 1)-th element should be returned if (m+n) is odd. If (m+n) is even, the average of the ((m+n)/2)-th and ((m+n)/2 + 1)-th elements is returned. Below is the code in c++.

Code (C++)

/**
 * Search the k-th element of A[0..m-1] and B[0..n-1] in A[p..q]
 */
int kth_elem(int A[], int m, int B[], int n, int k, int p, int q) {
    if (p > q)
        return kth_elem(B, n, A, m, k, 0, n-1);//search in B
        
    int i = p + (q - p) / 2;
    int j = k - 1 - i - 1;
        
    if ((j < 0 || (j < n && A[i] >= B[j])) 
        && (j+1 >= n || (j+1 >= 0 && A[i] <= B[j+1]))) {
        return A[i];
    } else if (j < 0 || (j+1 < n && A[i] > B[j+1])) {
        return kth_elem(A, m, B, n, k, p, i-1);
    } else {
        return kth_elem(A, m, B, n, k, i+1, q);
    }   
}

double median_two_arrays(int A[], int m, int B[], int n) {
    if ((m + n) % 2 == 1) {
        return kth_elem(A, m, B, n, (m+n)/2+1, 0, m-1);
    } else {
        return (kth_elem(A, m, B, n, (m+n)/2, 0, m-1) +
                kth_elem(A, m, B, n, (m+n)/2+1, 0, m-1)) / 2.0;
    }
}

-End-

Thursday, April 18, 2013

Longest Repeated Substring

This problem is from codeeval.com.

Problem Description:

You are to find the longest repeated substring in a given text. Repeated substrings may not overlap. If more than one substring is repeated with the same length, print the first one you find.(starting from the beginning of the text). NOTE: The substrings can't be all spaces.

Input Sample:
Your program should accept as its first argument a path to a filename. The input file contains several lines. Each line is one test case. Each line contains a test string. eg.

banana
abc

Output Sample:
For each set of input produce a single line of output which is the longest repeated substring. If there is none, print out the string NONE. eg.

an
NONE

Analysis:

The input text is actually a string and suppose the length of string is n.

Brute Force:
The brute force method is easy: for each index pair [i, j] (1 <= i, j <= n) of the string, count the length of non-overlapping repeated substring starting from this two indices . And then get the first repeated substring with the longest length. Although the non-overlapping constraint can help to reduce the character comparison times while checking two substrings, the running time in the worst case is still O(n^3) (can be proved, but omitted here). Obviously the brute force method is not a good choice.

Suffix Array:
Another method is to use suffix array. For the input string str, we first sort all of its suffixes in the suffix array. eg.

str = banana, its suffixes are:
banana
anana
nana
ana
na
a

after sort, the suffix array looks like:
a
ana
anana
banana
na
nana

Then for each two adjacent suffixes on the sorted suffix array, count the length of the common prefix (Note: should avoid counting the overlapped part). The longest repeated prefix appearing first on the original input string is the result. e.g.

a
ana
anana
banana
na
nana
The red prefixes an and green prefixes na both have the same longest length 2, but an is the result because it appears before na in str = banana.

Similar to brute force method, the suffixes comparison in sorting and repeated prefix length counting make the running time be O(n^2lgn) (suppose an O(nlgn) sorting algorithm is used). The length of suffix array is n and it may just need to store the indices pointing to the suffixes in the input string rather than the suffixes themselves. Therefore the space complexity is O(n).

Dynamic Programming:
O(n^2lgn) looks good, however, we can solve this problem in O(n^2) time and O(n^2) space with DP. Suppose we first don't consider the constraint of no overlap. The idea is to find the longest repeated suffix for all pairs of prefixes of the string str. The length of the longest repeated suffix for prefixes str[1..i] and str[1..j] is:

c(i, j) = 1) c(i-1, j-1) + 1   if str[i] == str[j];
          2) 0                 otherwise.

The maximum of c(i, j) (1 <= i, j <= n) is the length of longest repeated substring (suffix) of str. The longest repeated substring should be stored and dynamically updated during the calculating of c(i, j).

Now consider the non-overlapping constraint, otherwise the above process will give us the string itself. Look at prefixes str[1..i] and str[1..j], the end indices of their suffixes are i and j (for simplicity, suppose i <= j), respectively. If their common suffix is longer than j-i, the two suffixes must overlap. The overlapped part should not be counted and the above recursion should be modified as:

c(i, j) = 1) c(i-1, j-1) + 1   if str[i] == str[j] && abs(i-j) > c(i-1, j-1);
          2) 0                 otherwise.

If more than one repeated substring have the same longest length, we just need to take the first one in the input string. More details can be found in the code.

Code (C++)

Bellow is the code of DP method:

std::string repeated_substring(std::string &str) {
    int len = str.length();

    int **c = new int*[len + 1];
    for (int i = 0; i <= len; ++i)
        c[i] = new int[len + 1];
    for (int i = 0; i <= len; ++i) {
        c[i][0] = 0;
        c[0][i] = 0;
    }

    int max_len = 0, index = len + 1;
    for (int i = 1; i <= len; ++i) {
        for (int j = 1; j <= len; ++j) {
            if (str[i-1] == str[j-1] && abs(i-j) > c[i-1][j-1]) {
                c[i][j] = c[i-1][j-1] + 1;
                if (c[i][j] > max_len) {
                    max_len = c[i][j];
                    index = std::min(i, j);
                }
            } else {
                c[i][j] = 0;
            }
        }
    }
    
    for (int i = 0; i <= len; ++i)
        delete[] c[i];
    delete[] c;
 
    if (max_len > 0) {
        std::string ret = str.substr(index - max_len, max_len);
        for (int i = 0; i < max_len; ++i)
            if(ret[i] != ' ')
                return ret;
    }

    return "NONE";
}

-End-

Monday, April 15, 2013

Merge k Sorted Lists

This problem is from leetcode.com.

Problem Description:

Merge k sorted (ascending order) linked lists and return it as one sorted list. The definition for singly-linked list is:

struct ListNode {
    int val;
    ListNode *next;
    ListNode(int x) : val(x), next(NULL) {}
};

and the function prototype is:

ListNode *mergeKLists(vector<ListNode *> &lists);

Analyze and describe its complexity.

Analysis:

This is not a tough problem and there are two ways to solve this kind of k-way merge problem (e.g. merge k sorted arrays): 1) winner/loser tree; 2) minimum heap. Both of them have O(nlog(m)) running time and O(m) space complexity if winner/loser tree and minimum heap are constructed for merge, where n is the total list node number and m is the list number. However, we can achieve O(1) space complexity when the input vector is made as a minimum heap.

For C++, it is easy for us to use several built-in methods in <algorithm> to make a min-heap with a vector and operate on it.

Code (C++):

class comp {
public:
    bool operator() (const ListNode* l, const ListNode* r) const {
        return (l->val > r->val);
    }
};

ListNode *mergeKLists(vector<ListNode *> &lists) {
    vector<ListNode*>::iterator it = lists.begin();
    while (it != lists.end()) {
        if(*it == NULL)
            lists.erase(it);
        else
            ++it;
    }
    if(lists.empty())
        return NULL;

    ListNode *head = NULL, *cur = NULL;
    make_heap(lists.begin(), lists.end(), comp());
    while (!lists.empty()) {
        if (head == NULL)
            head = cur = lists[0];
        else
            cur = cur->next = lists[0];

        pop_heap(lists.begin(), lists.end(), comp());
        int last = lists.size() - 1;
        if (lists[last]->next != NULL) {
            lists[last] = lists[last]->next;
            push_heap(lists.begin(), lists.end(), comp());
        } else {
            lists.pop_back();
        }
    }

    return head;
}

-End-

Saturday, April 13, 2013

Zigzag Conversion

This problem is from leetcode.com.

Problem Description:

The string "PAYPALISHIRING" is written in a zigzag pattern on a given number of rows like this: (you may want to display this pattern in a fixed font for better legibility)

P   A   H   N
A P L S I I G
Y   I   R

And then read line by line: "PAHNAPLSIIGYIR"

Write the code that will take a string and make this conversion given a number of rows:

string convert(string text, int nRows);

convert("PAYPALISHIRING", 3) should return "PAHNAPLSIIGYIR".

Analysis:

Let's first take text = "abcdefghijklmnopqrstuvwxyz" and nRows = 4 as another example. The Zigzag pattern should like this:

a  g  m  s  y
b fh ln rt xz
ce ik oq uw
d  j  p  v

If we ingore the blank spaces between letters in each line, the Zigzag pattern is generated by placing text letters in each line sequentially. First the heading direction is up-bottom, we place 'a', 'b', 'c' and 'd' from the 1st line to the 4th line. And when we reach the bottom line the heading direction should be reversed. Then 'e', 'f' and 'g' are placed from the 3rd line to the 1st line. Each time we reach the top or bottom line the heading direction is reversed.

As the final result is generated by reading letters line by line and space is omited, space is not needed to be considered while generating the Zigzag pattern. Therefore, the pattern actually is of the following form:

agmsy
bfhlnrtxz
ceikoquw
djpv

We can use a vector to store letters in each line. After placing all letters, reading letters in each vector one by one gives us the final result.

Code (C++):

string convert(string s, int nRows) {
    int len = s.length();
    if (len < 3 || nRows == 1)
        return s;

    bool down = true;
    vector<vector<int> > tab(nRows);
    for (int i = 0, row = 0; i < len;) {
        if (down && row < nRows) 
            tab[row++].push_back(s[i++]);
        else if (!down && row >= 0)
            tab[row--].push_back(s[i++]);

        if (row < 0 || row >= nRows) {
            row += (down ? -2 : 2);
            down = !down;
        }
    }

    string ret = "";
    vector<vector<int> >::iterator it = tab.begin();
    for (; it != tab.end(); ++it) {
        vector<int>::iterator it2 = (*it).begin();
        for (; it2 != (*it).end(); ++it2)
            ret += *it2;
    }
    return ret;
}