Z algorithm

Introduction

There are some algorithms of exact substring searching (e.g. Knuth-Morris-Pratt, Boyer-Moore etc.) I want to explain one of them which is called Z algorithm in some sources.

Z-boxes and Z-values

Let’s consider the concept of Z-box. Take the string S = "abcxxxabyyy". We have an internal part "ab" in the string which repeats its prefix. The internal "ab" is a Z-box. It has the beginning at the position with index 6 and the end in 7 (0-based). Z-boxes are substrings which match string prefixes with the same length. For the Z-box, let’s call the corresponding prefix the prototype (it’s my term but I think it’s OK to use it.)

Z algorithm

In fact, the string "abcxxxabyyy" has three non-empty (i.e. with length greater than zero) Z-boxes:
1) the whole strings matches it’s prefix length 11;
2) 'a' character at the position 6;
3) just considered substring "ab" (6-7).

Let Z-value, Zi(S) or just Zi be the length of a Z-box in string S which starts at the position i. Obviously, Z-value can’t be bigger than the length of the rest of the string.

We can compute all Z-values in the string:

Index 0 1 2 3 4 5 6 7 8 9 10
Character a b c x x x a b y y y
Z-value 11 0 0 0 0 0 2 0 0 0 0

Z-box [0-10] matches the whole string, so has the length 11. Z-box [6-7] matches prefix [0-1], length 2. Z-box [7-7] matches prefix [0-0], length 1.

Let’s ignore empty Z-boxes (with the length 0).

Some examples:

Index 0 1 2 3 4 5
Character a a a a a a
Z-value 6 5 4 3 2 1
Index 0 1 2 3 4
Character a b b b b
Z-value 6 0 0 0 0
Index 0 1 2 3 4 5
Character a b c a b c
Z-value 6 0 0 3 0 0
Index 0 1 2 3 4 5 6 7 8 9 10
Character a b r a c a d a b r a
Z-value 11 0 0 1 0 1 0 4 0 0 1

Naïve computation of Z-values

The naïve algorithm of Z-values computation is very simple: we just need to go through all string indexes starting from 1 and explicitly calculate Zi(S).

In Python it’d look like:

Or you might want some Python magic :)

But simple is better than complex, isn’t it?

Advanced computation of Z-values

The naïve implementation has O(|S|2) time complexity, where |S| is the length of the string. Can it be better? Yes, it can.

What do we know about Z-box? We know that the corresponding substring is exactly the same as the prefix of the string (with the same length). We compute Z-value sequentially from the beginning, so we could use these previous values.

Introduce indexes lti (for “left”) and rti (for “right”). rti indicates the right-most end (the index of the last character) of any Z-box that starts at the position i or to the left of this position at the position lti. In the future we will skip i in lti and rti. Let’s call Z-box in range [lt..rt] (both ends are included) the current Z-box.

For index k inside the current Z-box, let’s call index p=k-lt lying inside the prototype the pair index.

We know the value of Zp. Substrings in Z-box and its prototype are equal. Can we say that Zk = Zp? Under some conditions, yes.

The fact that the current character at the position k is inside previously found Z-box can give us some information. Let’s take substring S(k..rt) and consider two cases.

Case 1

Zp < |S(k..rt)|, i.e. Zp is less then the length of the rest (starting from k) of the current Z-box.
|S(k..rt)| = rt-k+1.
We can add k to the both side of the condition: k+Zp < rt+1.

Note: Pay attention that if we have a string with the length n which starts at the position i then the end of the string will be at the position x+i-1. For example, if n=1 and i=2 then the string begins and ends at the position 2. If i=0 and n=2 then the string begins in 0 and ends in 1.

In this case, we have two identical substrings (Z-box and its prototype), two identical characters inside them (at k and p positions correspondingly).
Moreover, we have the following information:
1) substrings S(p..p+Zp-1) and S(0..Zp-1) are identical (from the meaning of Zp);
2) substrings S(k..k+Zp-1) and S(p..p+Zp-1) are identical because indexes k and k+Zp-1, p and p+Zp-1 lie inside the Z-box and its prototype at the same relative positions;
3) substrings S(k..k+Zp-1) and S(0..Zp-1) are identical (from statement 1 and 2);

4) characters S(p+Zp) and S(Zp) aren’t equal (again, from the meaning of Zp);
5) characters S(k+Zp) and S(p+Zp) are equal because these indexes lie inside Z-box and its prototype at the same relative positions;
6) characters S(k+Zp) and S(Zp) aren’t equal (from statement 4 and 5);

From statements 3 and 5 and the meaning of Zk we can conclude that Zk is equal to Zp.

Thus we can state that when index k is inside the current Z-box with the left border lt, right border rt and Zp < rt-k+1 then Zk = Zp, where p=k-lt.

lt and rt remain the same because k+Zk certainly lies to the left of the current rt.

We need the strict condition Zp < |S(k..rt)| instead of Zp ≤ |S(k..rt)| to be able to say about the equality of characters S(k+Zp) and S(p+Zp) with confidence. If Zp = |S(k..rt)| then S(k+Zp) lies outside the current Z-box and we have no idea about either these characters are equal or not. Why don’t we just check? This smoothly brings us to the case 2.

Case 2

Zp > |S(k..rt)|.
|S(k..rt)| = rt-k+1.
In this case we can’t simply state that Zk = Zp because we don’t know anything about equality of characters beyond the right bounds of prototype and Z-box.

We know that substrings S(k..k+h), S(p..p+h) and S(0..h) are identical. We can simply compare characters from ranges [rt+1..|S|] and [rt-k+1..|S|-k] until the first inequality or the end of the ranges (and the whole string). If we found not equal characters at positions q and k+q then Zk equals the length of the equal substrings, i.e. Zk = q.

Of course, we must set the new value to rt and lt, corresponding to the newly found Z-box because k+q-1 lies to the right of rt (or at least they are equal).

Obviously, when k isn’t inside the current Z-box (k > rt), we have to calculate Zk explicitly.

The formalization of the algorithm might look like this:

Having string S,
Set Z0=|S|, lt=0, rt=0.
For each k, 1 ≤ k < |S| do:
1. If k > rt: calculate Zk(S) explicitly. If Zk > 0, set lt := k and rt := k+Zk-1.
2. Else: p := k-lt.
2.1. If Zp < rt-k+1, set Zk := Zp.
2.2. If Zp ≥ rt-k+1, for each i, rt+1 ≤ i < |S| do:
2.2.1. If S(i) = S(i-k): increment i and continue.
2.2.2. If S(i) ≠ S(i-k): set Zk := i-k, lt := k, rt := i-1 and break.

And in Python:

 

Example

Let’s consider the example: take the string S = "ababxababyabaca" and calculate Z-values by the described algorithm.
We’ll highlight the current Z-box with the blue color in the “Index” row and matching and non-matching characters of a potential Z-box with the green and the red color respectively in the “Character” row.

Z0 = |S| = 15.

Start from k = 1. lt = 0, rt = 0.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15
k

k=1 > rt=0 and we must calculate Z1 explicitly. S(1) ≠ S(0), so Z1 = 0. We don’t treat empty Z-boxes as normal Z-boxes, so we just skip it, no updates of lt and rt are made.

k = 2, lt = 0, rt = 0.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0
k

Again, k=1 > rt=0 and we’re calculating Z2 explicitly. S(2) = S(0), S(3) = S(1), S(4) ≠ S(2), so Z2 = 2. This is a normal newly found Z-box and we update lt := k = 2, rt := k + Zk - 1 = 2 + 2 - 1 = 3.

k = 3, lt = 2, rt = 3.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2
lt rt
p k

Here, k=3 ≤ rt=3. It has the pair index p = k-lt = 3-2 = 1. Zp=Z1=0. Z1 < rt-k+1=1, so we don’t need to examine any additional characters and can say that Z3 = 0.

k = 4, lt = 2, rt = 3.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0
lt rt
k

k=4 > rt=3 and we’re calculating Z4 explicitly. S(4) ≠ S(0), so Z4 = 0.

k = 5, lt = 2, rt = 3.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0 0
lt rt
k

k=5 > rt=3 and we’re calculating Z5 explicitly. Z5 = 4. This is a normal newly found Z-box and we update lt := k=5, rt := k+Zk-1 = 5+4-1 = 8.

k = 6, lt = 5, rt = 8.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0 0 4
lt rt
p k

k=6 ≤ rt=8. The pair index p = k-lt = 6-5 = 1. Zp = Z1 = 0. Z1 < rt-k+1 = 3, so we don’t need to examine any additional characters and can say that Z6 = 0.

k = 7, lt = 5, rt = 8.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0 0 4 0
lt rt
p k

k=7 ≤ rt=8. The pair index p = k-lt = 7-5 = 2. Zp = Z2 = 2. Z2 ≥ rt-k+1=2, so we need to examine characters starting from rt+1=9 and compare them to characters starting from rt+1-k=2. S(9) ≠ S(2), first not equal characters are at positions 2 and 9. So, Z7 = 9-k = 2. Also, we need to update lt := k=7, rt := k+Zk-1 = 7+2-1 = 8.

k = 8, lt = 7, rt = 8.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0 0 4 0 2
lt rt
p k

k=8 ≤ rt=8. The pair index p = k-lt = 8-7 = 1. Zp = Z1 = 0. Z1 < rt-k+1 = 1, so we don’t need to examine any additional characters and can say that Z8 = 0.

k = 9, lt = 7, rt = 8.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0 0 4 0 2 0
lt rt
k

k=9 > rt=8 and we’re calculating Z9 explicitly. S(9) ≠ S(0), so Z9 = 0.

k = 10, lt = 7, rt = 8.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0 0 4 0 2 0 0
lt rt
k

k=10 > rt=8 and we’re calculating Z10 explicitly. S(10) = S(0), S(11) = S(1), S(12) = S(2), S(13) ≠ S(3), so Z10 = 3. Also, we need to update lt := k=10, rt := k+Zk-1 = 10+3-1 = 12.

k = 11, lt = 10, rt = 12.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0 0 4 0 2 0 0 3
lt rt
p k

k=11 ≤ rt=12. The pair index p = k-lt = 11-10 = 1. Zp = Z1 = 0. Z1 < rt-k+1=2, so we don’t need to examine any additional characters and can say that Z11 = 0.

k = 12, lt = 10, rt = 12.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0 0 4 0 2 0 0 3 0
lt rt
p k

k=12 ≤ rt=12. The pair index p = k-lt = 12-10 = 2. Zp=Z2=2 ≥ rt-k+1=1, so we need to examine additional characters starting from rt+1=13 and compare them to characters starting from rt+1-k=1. S(13) ≠ S(1), first not equal characters are at positions 1 and 13. So, Z12 = 13-k = 1. Also, we need to update lt := k=12, rt := k+Zk-1 = 12+1-1 = 12.

k = 13, lt = 12, rt = 12.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0 0 4 0 2 0 0 3 0 1
lt
rt
k

k=13 > rt=12 and we’re calculating Z13 explicitly. S(13) ≠ S(0), so Z13 = 0.

k = 14, lt = 12, rt = 12.

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0 0 4 0 2 0 0 3 0 1 0
lt
rt
k

k=14 > rt=12 and we’re calculating Z14 explicitly. S(14) = S(0), so Z14 = 1.

Finally, we have:

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Character a b a b x a b a b y a b a c a
Z-value 15 0 2 0 0 4 0 2 0 0 3 0 1 0 1

Complexity

Let’s find out (not very strictly) time and space complexity of the algorithm. We have |S| iterations. In each iteration we either already have got a Z-value or we have to do some comparisons. When we find q matches during the comparisons rt will be moved to the right: rtk := rtk-1+q. After each mismatch an iteration finishes, so there aren’t more than |S| mismatches. We have not more than |S| iterations with not more than |S| matches and mismatches as a whole. So, the algorithm has O(|S|) time complexity.
Obviously, it requires O(|S|) space to store Z-values.

Search algorithm

You could ask why we need these Z-values at all. This is an appropriate question :) We can use the idea of Z-values for doing a substring search.
Take a string T (for “text”) and a string P (for “pattern”), both built of characters of some alphabet A. Also, take character '$' which is not in A. This character is called the sentinel. Assemble all these things in the following way: P$T.

Compute Z-values of this composite string by the advanced algorithm described earlier. All this values will be limited to |P| (except Z0, of course) because of the sentinel which can’t be found at any place in P or T. And when you see a Zj = |P| then you can be sure that substring T(j..j+|P|-1) is an occurrence of P in T. And vice versa: you can be confident that all occurrences will be marked in the list of Z-values.

Let’s consider an example. P = "ab", T = "xaybzabxaby":

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13
Character a b $ x a y b z a b x a b y
Z-value 14 0 0 0 1 0 0 0 2 0 0 2 0 0

Even with overlapping matches: P = "aa", T = "xaaay":

Index 0 1 2 3 4 5 6 7
Character a a $ x a a a y
Z-value 8 1 0 0 2 2 1 0

This search algorithm has the O(|P|+|T|) time and space complexity because it has the longest, dominating part – search of Z-values which is O(|S|), |S|=|P|+|T|.

Good algorithm. But there is an obstacle for a general-purpose use: the choice of the sentinel. In some applications you have a short known alphabet. For example, in bioinformatics there are symbols 'A', 'C', 'G' and 'T' and we can simply use '$' as the sentinel. But what about general Unicode strings?

Search algorithm without the sentinel

Fortunately, there is search algorithm that doesn’t need to use sentinels. The idea is in calculating Z-values of a string PT with a simple modification: after each Z-value is calculated, limit it to the |P|, i.e., Zk := min{|S|, Zk}. This limitation plays role of the sentinel.

The Python code:

 

All the code can be found here.

Thank you.

  • Newcoder

    great explain with pictures :)

  • Jude

    For the step-by-step illustration, when k=4 and k=5, they’re > rt because rt = 3 instead of 0. Just small typo, but still pretty awesome!

  • Biru

    I liked Your Tutorial Very Much

  • Guilherme

    Hi Ivan,

    I really enjoyed this blog post. This is one of the most complete resources I found on Z Algorithm. I would like your permission to use some of the images in a homework. Is that ok?

    Best,

    • Hi Guilherme,

      Thank you.

      I would like your permission to use some of the images in a homework. Is that ok?

      Yep, no problem.

  • Pingback: C/C++ program for Z algorithm string matching - WikiStack()

  • Pingback: Can't find the issue in this Ruby code - BlogoSfera()

  • murdocc007

    Thanks a lot for your explanation. Have seen a lot of tutorials on Z algorithm but wasn’t able to understand from any of them. The figures clarifies the explanation. THANKS YOU SO MUCH ! :)

  • Tieu Kieu

    easy understanding explain.
    Thanks a lot ^^

  • John Brubaker

    def z_naive(s):
    Z = [len(s)]

    for k in range(1, len(s)):
    n = 0
    while n + k < len(s) and s[n] == s[n + k]:
    n += 1
    Z.append(n)

    return Z

    OR:

    Z = [len(list(takewhile(lambda p: p[0] == p[1], zip(s, s[k:]))))
    for k in range(len(s))]

    • Surely, but that isn’t as illustrative as the verbose version.

  • Pingback: Z algorithm | thoughts...()

  • Eno Hcnup Nam

    Excellent tutorial! Thanks for the effort. I was wondering which tools did you use to make those illustrations. Is it LaTeX? or an illustration library?

    • Thank you.

      > I was wondering which tools did you use to make those illustrations.
      Actually, it’s Microsoft Visio.

  • Pingback: FindThree | turuthok()

  • SaurabhM

    Why isn’t ‘b’ a z-box? We have prototype at index 1 in the example and the z-box at position 7.

    • Z-boxes match string prefixes, i.e. at the beginning of the string, counting from 0. b doesn’t match character at 0, which is a. That’s why b isn’t a z-box.

  • Gene Ressler

    This is a terrific, careful explanation. Thanks for all this work.

    I have only one nagging question.

    Why is the while loop scan necessary in Case 2? Won’t the first comparison always fail? Your nice picture shows that S(Z_p – 1) = S(p + Z_p – 1) and – since the z-box is maximal – these cannot be equal to S(rt+1). Certainly the loop fails in the first iteration in each case in your example.

    Again many thanks.

    • Hi Gene.

      (If I understand your question correctly,) We actually need to loop when Zp = rt-k+1. Consider an example:
      [(a b c) x y (a b c)] z ... [(a b c) x y (a b c)] x y ....
      In this case, Zp = 3, just reaches the right border of the current Z-box. However, new Z-box will be bigger: Zk = 5, i.e. including x and y, which we will count in while loop.

      • Gene Ressler

        Ah! You’re right. I was thinking only about Zp > rt-k+1. Here the loop body never executes.

  • gaurav kumar

    Great Explanation Man, Awesome :)

  • uli

    Thanks for the great article. In the script for my lecture
    https://www.bio.ifi.lmu.de/mitarbeiter/volker-heun/notes/ab6.pdf
    on page 145, figure 2.44 there is an addition “if(i-1 > r)” statements prior to l := k at the end.

    I’m not sure why this statement exists as I’m just beginning to understand the algorithm, but I noticed that it’s not present in the code. Can you think of any case where that statement would matter?

    • Hi.
      I couldn’t find a case where this additional condition matters.