Thursday, April 18, 2013

Longest Repeated Substring

Longest Repeated Substring

This problem is from codeeval.com.

Problem Description:

You are to find the longest repeated substring in a given text. Repeated substrings may not overlap. If more than one substring is repeated with the same length, print the first one you find.(starting from the beginning of the text). NOTE: The substrings can't be all spaces.

Input Sample:
Your program should accept as its first argument a path to a filename. The input file contains several lines. Each line is one test case. Each line contains a test string. eg.
banana
abc
Output Sample:
For each set of input produce a single line of output which is the longest repeated substring. If there is none, print out the string NONE. eg.
an
NONE

Analysis:

The input text is actually a string and suppose the length of string is n.

Brute Force:
The brute force method is easy: for each index pair [i, j] (1 <= i, j <= n) of the string, count the length of non-overlapping repeated substring starting from this two indices . And then get the first repeated substring with the longest length. Although the non-overlapping constraint can help to reduce the character comparison times while checking two substrings, the running time in the worst case is still O(n^3) (can be proved, but omitted here). Obviously the brute force method is not a good choice.

Suffix Array:
Another method is to use suffix array. For the input string str, we first sort all of its suffixes in the suffix array. eg.
str = banana, its suffixes are:
banana
anana
nana
ana
na
a
after sort, the suffix array looks like:
a
ana
anana
banana
na
nana
Then for each two adjacent suffixes on the sorted suffix array, count the length of the common prefix (Note: should avoid counting the overlapped part). The longest repeated prefix appearing first on the original input string is the result. e.g.
a
ana
anana
banana
na
nana
The red prefixes an and green prefixes na both have the same longest length 2, but an is the result because it appears before na in str = banana. 
Similar to brute force method, the suffixes comparison in sorting and repeated prefix length counting make the running time be O(n^2lgn) (suppose an O(nlgn) sorting algorithm is used).  The length of suffix array is n and it may just need to store the indices pointing to the suffixes in the input string rather than the suffixes themselves. Therefore the space complexity is O(n).

Dynamic Programming:
O(n^2lgn) looks good, however, we can solve this problem in O(n^2) time and O(n^2) space with DP. Suppose we first don't consider the constraint of no overlap. The idea is to find the longest repeated suffix for all pairs of prefixes of the string str. The length of the longest repeated suffix for prefixes str[1..i] and str[1..j] is:
c(i, j) = 1) c(i-1, j-1) + 1   if str[i] == str[j];
          2) 0                 otherwise.
The maximum of c(i, j) (1 <= i, j <= n) is the length of longest repeated substring (suffix) of str. The longest repeated substring should be stored and dynamically updated during the calculating of c(i, j).

Now consider the non-overlapping constraint, otherwise the above process will give us the string itself. Look at prefixes str[1..i] and str[1..j], the end indices of their suffixes are i and (for simplicity, suppose i <= j), respectively. If their common suffix is longer than j-i, the two suffixes must overlap. The overlapped part should not be counted and the above recursion should be modified as:
c(i, j) = 1) c(i-1, j-1) + 1   if str[i] == str[j] && abs(i-j) > c(i-1, j-1);
          2) 0                 otherwise.
If more than one repeated substring have the same longest length, we just need to take the first one in the input string. More details can be found in the code.

Code (C++)

Bellow is the code of DP method:
std::string repeated_substring(std::string &str) {
    int len = str.length();

    int **c = new int*[len + 1];
    for (int i = 0; i <= len; ++i)
        c[i] = new int[len + 1];
    for (int i = 0; i <= len; ++i) {
        c[i][0] = 0;
        c[0][i] = 0;
    }

    int max_len = 0, index = len + 1;
    for (int i = 1; i <= len; ++i) {
        for (int j = 1; j <= len; ++j) {
            if (str[i-1] == str[j-1] && abs(i-j) > c[i-1][j-1]) {
                c[i][j] = c[i-1][j-1] + 1;
                if (c[i][j] > max_len) {
                    max_len = c[i][j];
                    index = std::min(i, j);
                }
            } else {
                c[i][j] = 0;
            }
        }
    }
    
    for (int i = 0; i <= len; ++i)
        delete[] c[i];
    delete[] c;
 
    if (max_len > 0) {
        std::string ret = str.substr(index - max_len, max_len);
        for (int i = 0; i < max_len; ++i)
            if(ret[i] != ' ')
                return ret;
    }

    return "NONE";
}

-End-

4 comments:

  1. c(i, j) = 1) c(i-1, j-1) + 1 if str[i] == str[j] && abs(i-j) > c(i-1, j-1);
    2) 0 otherwise.

    I have some different opinion on this.
    Think about this string
    string src="aaaaaa"
    we have c[0,2]=1; c[1,3]=2; when we build up the table and now comes c[2,4].
    obviously, src.substr(2-2,2)==src.substr(4-2,2)
    c[2,4]=2, no overlapping.
    But your code produces 0 on c[2,4]. Of course this will not harm the result of this problem, but it will harm some similar problems as I can think of.
    I would say the logic would be:
    if(src[i]==src[j])
    c[i][j]= std::min(1+c[i-1][j-1],abs(i-j));
    else
    c[i][j]=0;
    Sorry if I make any mistake.

    ReplyDelete
  2. Your solution did not solve problem of "repeated string can not overlap". Just simply follow the below ideas will do.
    index of substring a: i1
    index of substring b: i2
    common prefix of a and b : c
    if(abs(l1-l2) >=c) add c to result;

    ReplyDelete
  3. There is O(n) algorithm to build the Suffix Array, so we can use O(n) to solve this problem.

    ReplyDelete
  4. Fails for input:-abababab
    Answer should be:-abab, but ur's is giving just ab !!

    ReplyDelete