Updated February 28, 2023

Introduction to KMP Algorithm

KMP Algorithm or Kuth-Morris-Pratt Algorithm is a pattern matching algorithm in the world of computer science and was the first Linear time complexity algorithm for string matching. It was named after Donald Kuth, Vaughan Pratt, and James Morris who together wrote the paper on KMP Algorithm in 1977 although James Morris had independently invented the algorithm in 1970. A String Matching or String Algorithm in Computer Science is string or pattern recognition in a larger space finding similar strings. KMP Algorithm- data thus uses such string algorithms to improve the time taken to find and eliminate such pattern when searching and therefore called linear time complexity algorithm. The time complexity of the KMP Algorithm – data is represented as O (m+n). The KMP Algorithm – data was initially developed by analyzing the Naïve algorithm.

Examples of KMP Algorithm

Let us see the examples mentioned below:

Example #1

First let’s take an example to understand how the usual KMP Algorithm searches for a substring.

Index 0 1 2 3 4 5 6 7 8 9

Text S = b c m a l m n x y z

Pattern P = m a l

Explanation: First it will match the 0th index of pattern with the 0th index of the given text which is ‘m’ with ‘a’. Since they don’t match, we move to the next index of the text and we’ll compare ‘m’ with ‘c’, again it’s not a match. Then, again we will move to the next index and compare ‘m’ with the next index value which is ‘m’ now it’s a match then we move to the next index and search for more matches and compare ‘a’ with the next index value which is ‘a’ . So , again it’s a match and then we move to the next index and move to the next index and compare ‘l’ with ‘l’ .Finally, the whole substring matches and the and we would get 2 which is the index at which the substring exists. Time complexity in the worst case is O(m*n). Where m and n are the length of S and P respectively.

Example #2

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Text S = a c b a c x a b c d a b x a b c d a c b a c d a b c

Pattern P = a c b a c d a b c y

Explanation:

Step 1: In the string, you can see there’s a mismatch at ‘d’ and before character ‘d’ all character are matched.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Text S= a c b a c x a b c d a b x a b c d a c b a c d a b c

Pattern P= a c b a c d a b c y

Step 2: When the mismatched character is found the KMP Algorithm – data searches for substring before the mismatched character.

In the example the mismatched character is ‘d’ and the substring before the mismatched character is “acbac”. Now the substring is checked for proper suffixes and prefixes.

Proper prefix are a, ac,acb,acba
Proper suffix are c, ac,bac,cbac

KMP Algorithm – data find out the suffix and prefix which are common. In this case ‘ac’ is the string which is the longest common substring such that it is the suffix and prefix both.

Step 3: We can see “acbac” matches with the given text as since they have matched then only we have reached the mismatched character ‘d’.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Text S= a c b a c x a b c d a b x a b c d a c b a c d a b c

Pattern P= a c b a c d a b c y

Here, you can see the prefix ‘ac’ in the substring is the suffix in the given string. Therefore, we can directly start searching from. Hence, index 3 s the current starting position now.

Step 4: Again, since ‘ac’ from the substring matches with the ‘ac’ from the given string so we can directly skip two characters and move to the next character which is ‘b’.

Step 5: Now we will compare ‘b’ and ‘x’, they don’t match. Again find out the longest common substring from the substring before the mismatched character, which is common to both suffix and prefix. Here, there is no common between suffix and prefix so we can say ‘previously chosen ‘a’ is not the correct starting position.

Step 6: Therefore we will choose the next character after ‘x’ which is ‘a’ at index 6 as the starting position. Compare ‘a’ with’a’ , it matches and now increments the position.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Text S= a c b a c x a b c d a b x a b c d a c b a c d a b c

Pattern P= a c b a c d a b c y

Step 7: Now compare ‘c’ with ‘b’, it does not match.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Text S= a c b a c x a b c d a b x a b c d a c b a c d a b c

Pattern P= a c b a c d a b c y

Step 8: By the rule find substring before unmatched character and find the longest common substring which is common to both suffix and prefix. There is no common such substring.

Step 9: In this way how the KMP Algorithm – data works. Finally, we will get the matching substring.

Here, you can see it avoids many unnecessary comparisons and reduces the time complexity to O(m+n).

Advantages and Disadvantages of KMP Algorithm

Below are the advantages and disadvantages of the KMP Algorithm:

Advantages

Below are the advantages:

The most obvious advantage of KMP Algorithm – data is that it’s guaranteed worst-case efficiency as discussed.
The pre-processing and the always-on time is pre-defined.
There are no worst-case or accidental inputs.
Preferable where the search string in a larger space is easier and more efficiently searched due to it being a time linear algorithm.
The algorithm needs to move backward in the input text. This is particularly favorable in processing large files.

Disadvantages

Below are the disadvantages:

One of the glaring disadvantages of KMP Algorithm – data is that it doesn’t work well when the size of the alphabets increases. Due to this more and more error occurs.
For processing very large files it also requires resources in the form of processors and that could be a problem for smaller organizations to adopt KMP Algorithm – data

Conclusion

In order to become efficient in learning search and Data Science broadly, it becomes extremely important to be familiar with various kinds of search algorithms. KMP Algorithm is the most widely used and most advantageous for organizations around the world due to its capacity of being the worst-case fault resistant algorithm. It’s beneficial for anyone wanting to make a career in Data Science in general and Search in particular.