Pattern Searching


Problem Definition:


Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.

Examples:
1) Input:

  txt[] =  "THIS IS A TEST TEXT"  pat[] = "TEST"

Output:

Pattern found at index 10

2) Input:

  txt[] =  "AABAACAADAABAAABAA"  pat[] = "AABA"

Output:

Pattern found at index 0
Pattern found at index 9
Pattern found at index 13

Solutions:


Naive Pattern Searching 

A brute-force solution: for every possible starting index i in txt[], compare pat[] with txt[i..i+m-1] character by character.
Time Complexity: O(n*m)
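
A minimal C sketch of this brute-force search, following the search(char pat[], char txt[]) signature from the problem statement (the main() driver and its test strings are only illustrative):

#include <stdio.h>
#include <string.h>

void search(char pat[], char txt[])
{
    int m = strlen(pat);
    int n = strlen(txt);

    /* Try every possible starting index i of pat[] in txt[]. */
    for (int i = 0; i <= n - m; i++) {
        int j;
        for (j = 0; j < m; j++)
            if (txt[i + j] != pat[j])
                break;              /* mismatch: slide the pattern by one */
        if (j == m)                 /* pat[0..m-1] matched txt[i..i+m-1] */
            printf("Pattern found at index %d\n", i);
    }
}

int main(void)
{
    search("TEST", "THIS IS A TEST TEXT");   /* prints: Pattern found at index 10 */
    return 0;
}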

Improvement Insights:
When we compare pat[j] with txt[i] and see a mismatch, Naive Pattern Searching simply backtracks to pat[0] and txt[i-j+1] and starts over. We can improve on this from two different angles:
1. When searching, can we avoid backtracking the text index i? The answer is yes: KMP, A Naive Pattern Searching For Special Pattern, and Finite Automata.
KMP: precompute the jump array lps[] of the pattern; when searching, i never moves backward, and on a mismatch we just set j to lps[j-1].
A Naive Pattern Searching For Special Pattern: exploit the property of the special pattern (all characters in the pattern are different); when searching, i never needs to move backward, and on a mismatch at pat[j] we slide the pattern forward by j and restart the comparison from pat[0].
Finite Automata: preconstruct the FA jump table; when searching, i never moves backward, and for each character of txt[] we simply jump to the next state accordingly.

2. When searching, can we quickly predict whether txt[i...i+m-1] and pat[0...m-1] can possibly match? The answer is yes: Rabin-Karp.
Rabin-Karp: precompute the hash values of txt[i...i+m-1] and pat[0...m-1]; when searching, only when the hash values of the two strings are equal do we start matching individual characters.

KMP Algorithm 

Unlike the Naive algo where we slide the pattern by one, we use a value from lps[] to decide the next sliding position. Let us see how we do that. When we compare pat[j] with txt[i] and see a mismatch, we know that characters pat[0..j-1] match with txt[i-j...i-1]. We also know that the first lps[j-1] characters of pat[0...j-1] are both a proper prefix and a suffix of pat[0...j-1], which means we do not need to re-match these lps[j-1] characters with txt[i-lps[j-1]...i-1] because we already know they will match; we simply set j to lps[j-1] and keep i unchanged.
Time Complexity: O(n) for searching, plus O(m) to build lps[].
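
A C sketch of KMP along the lines of the description above. lps[i] stores the length of the longest proper prefix of pat[0..i] that is also a suffix of it; a C99 variable-length array is assumed only to keep the sketch short:

#include <stdio.h>
#include <string.h>

/* Build lps[]: lps[i] = length of the longest proper prefix of pat[0..i]
   that is also a suffix of pat[0..i]. */
void computeLPS(char pat[], int m, int lps[])
{
    int len = 0;                  /* length of the previous longest prefix-suffix */
    int i = 1;
    lps[0] = 0;
    while (i < m) {
        if (pat[i] == pat[len]) {
            lps[i++] = ++len;
        } else if (len != 0) {
            len = lps[len - 1];   /* fall back; do not advance i */
        } else {
            lps[i++] = 0;
        }
    }
}

void search(char pat[], char txt[])
{
    int m = strlen(pat), n = strlen(txt);
    int lps[m];                   /* C99 VLA */
    computeLPS(pat, m, lps);

    int i = 0, j = 0;             /* i scans txt[], j scans pat[]; i never moves backward */
    while (i < n) {
        if (txt[i] == pat[j]) {
            i++;
            j++;
            if (j == m) {
                printf("Pattern found at index %d\n", i - j);
                j = lps[j - 1];   /* keep going to find overlapping matches */
            }
        } else if (j != 0) {
            j = lps[j - 1];       /* reuse the lps[j-1] characters already matched */
        } else {
            i++;
        }
    }
}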

Rabin-Karp Algorithm

Like the Naive algorithm, the Rabin-Karp algorithm also slides the pattern one position at a time. But unlike the Naive algorithm, Rabin-Karp first compares the hash value of the pattern with the hash value of the current substring of the text (txt[i...i+m-1]), and only if the hash values match does it start matching individual characters.
Time Complexity: The average and best-case running time of the Rabin-Karp algorithm is O(n+m), but its worst-case time is O(n*m). The worst case occurs when all characters of the pattern and text are the same, so the hash values of all the substrings of txt[] match the hash value of pat[]. For example, pat[] = "AAA" and txt[] = "AAAAAAA".
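
A C sketch of Rabin-Karp with the usual rolling hash. The alphabet size d = 256 and the prime modulus q passed by the caller (e.g. 101) are assumed example parameters, not something fixed by the text:

#include <stdio.h>
#include <string.h>

#define d 256                     /* number of characters in the input alphabet */

void search(char pat[], char txt[], int q)
{
    int m = strlen(pat), n = strlen(txt);
    int p = 0;                    /* hash value of pat[0..m-1] */
    int t = 0;                    /* hash value of the current window txt[i..i+m-1] */
    int h = 1;                    /* d^(m-1) % q, used to drop the leading character */

    for (int i = 0; i < m - 1; i++)
        h = (h * d) % q;

    /* Initial hash values of the pattern and the first window of text. */
    for (int i = 0; i < m; i++) {
        p = (d * p + pat[i]) % q;
        t = (d * t + txt[i]) % q;
    }

    for (int i = 0; i <= n - m; i++) {
        /* Compare individual characters only when the hash values agree. */
        if (p == t) {
            int j;
            for (j = 0; j < m; j++)
                if (txt[i + j] != pat[j])
                    break;
            if (j == m)
                printf("Pattern found at index %d\n", i);
        }
        /* Rolling hash for the next window: drop txt[i], append txt[i+m]. */
        if (i < n - m) {
            t = (d * (t - txt[i] * h) + txt[i + m]) % q;
            if (t < 0)
                t += q;           /* keep the hash non-negative */
        }
    }
}

For example, search("AABA", "AABAACAADAABAAABAA", 101) prints the same three indices as in example 2 above.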

A Naive Pattern Searching For Special Pattern

In the Naive Pattern Searching algo, we always slide the pattern by 1. When all characters of the pattern are different, we can slide the pattern by more than 1. Let us see how we can do this. When we compare pat[j] with txt[i] and see a mismatch (with j > 0), we know that characters pat[0..j-1] match with txt[i-j...i-1]. Because every character in pat[] is different, pat[0] differs from every character in txt[i-j+1...i-1], so we can slide the pattern by j and restart the comparison at txt[i] and pat[0].
Time complexity:  O(n).
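
A C sketch of this modified Naive search; it assumes, but does not verify, that all characters of pat[] are distinct:

#include <stdio.h>
#include <string.h>

/* Assumes all characters of pat[] are distinct. */
void search(char pat[], char txt[])
{
    int m = strlen(pat), n = strlen(txt);
    int i = 0;

    while (i <= n - m) {
        int j;
        for (j = 0; j < m; j++)
            if (txt[i + j] != pat[j])
                break;

        if (j == m) {
            printf("Pattern found at index %d\n", i);
            i += m;               /* a full match cannot overlap itself when all characters differ */
        } else if (j == 0) {
            i += 1;               /* mismatch on the very first character */
        } else {
            i += j;               /* pat[0] cannot occur in txt[i+1..i+j-1], so skip past it */
        }
    }
}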

Finite Automata

In the FA (Finite Automata) based algorithm, we preprocess the pattern and build a 2D array that represents a Finite Automaton. Construction of the FA is the main tricky part of this algorithm. Once the FA is built, the searching is simple. In search, we simply start from the first state of the automaton and the first character of the text. At every step, we consider the next character of the text, look up the next state in the built FA, and move to that state. If we reach the final state, the pattern has been found in the text.

Time complexity: O(n) for searching. The FA can be constructed in O(m*NO_OF_CHARS) time (hint: we can use something similar to the lps[] array construction in the KMP algorithm to construct the FA).
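
A C sketch of the FA-based search, including the O(m*NO_OF_CHARS) construction hinted at above (the lps-style trick). NO_OF_CHARS = 256 is an assumed alphabet size, and a C99 variable-length array is assumed for the transition table:

#include <stdio.h>
#include <string.h>

#define NO_OF_CHARS 256

/* Build the transition table: TF[state][c] = next state after reading character c. */
void computeTF(char pat[], int m, int TF[][NO_OF_CHARS])
{
    int lps = 0;                  /* state reached by the longest proper prefix-suffix
                                     of the portion of pat[] processed so far */

    for (int c = 0; c < NO_OF_CHARS; c++)
        TF[0][c] = 0;
    TF[0][(unsigned char)pat[0]] = 1;

    for (int i = 1; i <= m; i++) {
        /* By default, state i behaves like state lps. */
        for (int c = 0; c < NO_OF_CHARS; c++)
            TF[i][c] = TF[lps][c];
        if (i < m) {
            TF[i][(unsigned char)pat[i]] = i + 1;     /* the matching character advances */
            lps = TF[lps][(unsigned char)pat[i]];
        }
    }
}

void search(char pat[], char txt[])
{
    int m = strlen(pat), n = strlen(txt);
    int TF[m + 1][NO_OF_CHARS];   /* C99 VLA */
    computeTF(pat, m, TF);

    int state = 0;
    for (int i = 0; i < n; i++) {
        state = TF[state][(unsigned char)txt[i]];
        if (state == m)           /* final state reached: a full match ends at index i */
            printf("Pattern found at index %d\n", i - m + 1);
    }
}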


References:

http://www.geeksforgeeks.org/fundamentals-of-algorithms/

