### Introduction

There are some algorithms of exact substring searching (e.g. Knuth-Morris-Pratt, Boyer-Moore etc.) I want to explain one of them which is called **Z algorithm** in some sources.

### Z-boxes and Z-values

Let’s consider the concept of Z-box. Take the string `S = "abcxxxabyyy"`

. We have an internal part `"ab"`

in the string which repeats its prefix. The internal `"ab"`

is a **Z-box**. It has the beginning at the position with index 6 and the end in 7 (0-based). Z-boxes are substrings which match string prefixes with the same length. For the Z-box, let’s call the corresponding prefix the prototype (it’s my term but I think it’s OK to use it.)

In fact, the string `"abcxxxabyyy"`

has three non-empty (i.e. with length greater than zero) Z-boxes:

1) the whole strings matches it’s prefix length 11;

2) `'a'`

character at the position 6;

3) just considered substring `"ab"`

(6-7).

Let Z-value, `Z`

or just _{i}(S)`Z`

be the length of a Z-box in string _{i}`S`

which starts at the position `i`

. Obviously, Z-value can’t be bigger than the length of the rest of the string.

We can compute all Z-values in the string:

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |

Character | a | b | c | x | x | x | a | b | y | y | y |

Z-value | 11 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |

Z-box [0-10] matches the whole string, so has the length 11. Z-box [6-7] matches prefix [0-1], length 2. Z-box [7-7] matches prefix [0-0], length 1.

Let’s ignore empty Z-boxes (with the length 0).

Some examples:

Index | 0 | 1 | 2 | 3 | 4 | 5 |

Character | a | a | a | a | a | a |

Z-value | 6 | 5 | 4 | 3 | 2 | 1 |

Index | 0 | 1 | 2 | 3 | 4 |

Character | a | b | b | b | b |

Z-value | 6 | 0 | 0 | 0 | 0 |

Index | 0 | 1 | 2 | 3 | 4 | 5 |

Character | a | b | c | a | b | c |

Z-value | 6 | 0 | 0 | 3 | 0 | 0 |

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |

Character | a | b | r | a | c | a | d | a | b | r | a |

Z-value | 11 | 0 | 0 | 1 | 0 | 1 | 0 | 4 | 0 | 0 | 1 |

### Naïve computation of Z-values

The naïve algorithm of Z-values computation is very simple: we just need to go through all string indexes starting from 1 and explicitly calculate `Z`

._{i}(S)

In Python it’d look like:

1 2 3 4 5 6 7 8 9 10 |
def z_naive(s): Z = [len(s)] for k in range(1, len(s)): n = 0 while n + k < len(s) and s[n] == s[n + k]: n += 1 Z.append(n) return Z |

Or you might want some *Python magic* :)

1 2 |
Z = [len(list(takewhile(lambda p: p[0] == p[1], zip(s, s[k:])))) for k in range(len(s))] |

But simple is better than complex, isn’t it?

### Advanced computation of Z-values

The naïve implementation has `O(|S|`

time complexity, where ^{2})`|S|`

is the length of the string. Can it be better? Yes, it can.

What do we know about Z-box? We know that the corresponding substring is exactly the same as the prefix of the string (with the same length). We compute Z-value sequentially from the beginning, so we could use these previous values.

Introduce indexes `lt`

(for “left”) and _{i}`rt`

(for “right”). _{i}`rt`

indicates the right-most end (the index of the last character) of any Z-box that starts at the position _{i}`i`

or to the left of this position at the position `lt`

. In the future we will skip _{i}`i`

in `lt`

and _{i}`rt`

. Let’s call Z-box in range _{i}`[lt..rt]`

(both ends are included) **the current Z-box**.

For index k inside the current Z-box, let’s call index `p=k-lt`

lying inside the prototype **the pair index**.

We know the value of `Z`

. Substrings in Z-box and its prototype are equal. Can we say that _{p}`Z`

? Under some conditions, yes._{k} = Z_{p}

The fact that the current character at the position k is inside previously found Z-box can give us some information. Let’s take substring `S(k..rt)`

and consider two cases.

#### Case 1

`Z`

, i.e. _{p} < |S(k..rt)|`Z`

is less then the length of the rest (starting from _{p}`k`

) of the current Z-box.

`|S(k..rt)| = rt-k+1`

.

We can add `k`

to the both side of the condition: `k+Z`

._{p} < rt+1

Note: Pay attention that if we have a string with the length `n`

which starts at the position `i`

then the end of the string will be at the position `n+i-1`

. For example, if `n=1`

and `i=2`

then the string begins and ends at the position 2. If `i=0`

and `n=2`

then the string begins in 0 and ends in 1.

In this case, we have two identical substrings (Z-box and its prototype), two identical characters inside them (at k and p positions correspondingly).

Moreover, we have the following information:

1) substrings `S(p..p+Z`

and _{p}-1)`S(0..Z`

are identical (from the meaning of _{p}-1)`Z`

);_{p}

2) substrings `S(k..k+Z`

and _{p}-1)`S(p..p+Z`

are identical because indexes _{p}-1)`k`

and `k+Z`

, _{p}-1`p`

and `p+Z`

lie inside the Z-box and its prototype at the same relative positions;_{p}-1

3) substrings `S(k..k+Z`

and _{p}-1)`S(0..Z`

are identical (from statement 1 and 2);_{p}-1)

4) characters `S(p+Z`

and _{p})`S(Z`

aren’t equal (again, from the meaning of _{p})`Z`

);_{p}

5) characters `S(k+Z`

and _{p})`S(p+Z`

are equal because these indexes lie inside Z-box and its prototype at the same relative positions;_{p})

6) characters `S(k+Z`

and _{p})`S(Z`

aren’t equal (from statement 4 and 5);_{p})

From statements 3 and 5 and the meaning of `Z`

we can conclude that _{k}`Z`

is equal to _{k}`Z`

._{p}

Thus we can state that when index `k`

is inside the current Z-box with the left border `lt`

, right border `rt`

and `Z`

then _{p} < rt-k+1`Z`

, where _{k} = Z_{p}`p=k-lt`

.

`lt`

and `rt`

remain the same because `k+Z`

certainly lies to the left of the current _{k}`rt`

.

We need the strict condition `Z`

instead of _{p} < |S(k..rt)|`Z`

to be able to say about the equality of characters _{p} ≤ |S(k..rt)|`S(k+Z`

and _{p})`S(p+Z`

with confidence. If _{p})`Z`

then _{p} = |S(k..rt)|`S(k+Z`

lies outside the current Z-box and we have no idea about either these characters are equal or not. Why don’t we just check? This smoothly brings us to the case 2._{p})

#### Case 2

`Z`

._{p} > |S(k..rt)|

`|S(k..rt)| = rt-k+1`

.

In this case we can’t simply state that `Z`

because we don’t know anything about equality of characters beyond the right bounds of prototype and Z-box._{k} = Z_{p}

We know that substrings `S(k..k+h)`

, `S(p..p+h)`

and `S(0..h)`

are identical. We can simply compare characters from ranges `[rt+1..|S|]`

and `[rt-k+1..|S|-k]`

until the first inequality or the end of the ranges (and the whole string). If we found not equal characters at positions `q`

and `k+q`

then `Z`

equals the length of the equal substrings, i.e. _{k}`Z`

._{k} = q

Of course, we must set the new value to `rt`

and `lt`

, corresponding to the newly found Z-box because `k+q-1`

lies to the right of `rt`

(or at least they are equal).

Obviously, when `k`

isn’t inside the current Z-box (`k > rt`

), we have to calculate `Z`

explicitly._{k}

The formalization of the algorithm might look like this:

Having string `S`

,

Set `Z`

, _{0}=|S|`lt=0`

, `rt=0`

.

For each `k`

, `1 ≤ k < |S|`

do:

1. If `k > rt`

: calculate `Z`

explicitly. If _{k}(S)`Z`

, set _{k} > 0`lt := k`

and `rt := k+Z`

._{k}-1

2. Else: `p := k-lt`

.

2.1. If `Z`

, set _{p} < rt-k+1`Z`

._{k} := Z_{p}

2.2. If `Z`

, for each _{p} ≥ rt-k+1`i`

, `rt+1 ≤ i < |S|`

do:

2.2.1. If `S(i) = S(i-k)`

: increment `i`

and continue.

2.2.2. If `S(i) ≠ S(i-k)`

: set `Z`

, _{k} := i-k`lt := k`

, `rt := i-1`

and break.

And in Python:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
def z_advanced(s): """An advanced computation of Z-values of a string.""" Z = [0] * len(s) Z[0] = len(s) rt = 0 lt = 0 for k in range(1, len(s)): if k > rt: # If k is outside the current Z-box, do naive computation. n = 0 while n + k < len(s) and s[n] == s[n+k]: n += 1 Z[k] = n if n > 0: lt = k rt = k+n-1 else: # If k is inside the current Z-box, consider two cases. p = k - lt # Pair index. right_part_len = rt - k + 1 if Z[p] < right_part_len: Z[k] = Z[p] else: i = rt + 1 while i < len(s) and s[i] == s[i - k]: i += 1 Z[k] = i - k lt = k rt = i - 1 return Z |

#### Example

Let’s consider the example: take the string `S = "ababxababyabaca"`

and calculate Z-values by the described algorithm.

We’ll highlight the current Z-box with the blue color in the “Index” row and matching and non-matching characters of a potential Z-box with the green and the red color respectively in the “Character” row.

`Z`

._{0} = |S| = 15

Start from `k = 1`

. `lt = 0`

, `rt = 0`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | ||||||||||||||

k |

`k=1 > rt=0`

and we must calculate `Z`

explicitly. _{1}`S(1) ≠ S(0)`

, so `Z`

. We don’t treat empty Z-boxes as normal Z-boxes, so we just skip it, no updates of _{1} = 0`lt`

and `rt`

are made.

`k = 2`

, `lt = 0`

, `rt = 0`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | |||||||||||||

k |

Again, `k=1 > rt=0`

and we’re calculating `Z`

explicitly. _{2}`S(2) = S(0)`

, `S(3) = S(1)`

, `S(4) ≠ S(2)`

, so `Z`

. This is a normal newly found Z-box and we update _{2} = 2`lt := k = 2`

, `rt := k + Z`

._{k} - 1 = 2 + 2 - 1 = 3

`k = 3`

, `lt = 2`

, `rt = 3`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | ||||||||||||

lt | rt | ||||||||||||||

p | k |

Here, `k=3 ≤ rt=3`

. It has the pair index `p = k-lt = 3-2 = 1`

. `Z`

. _{p}=Z_{1}=0`Z`

, so we don’t need to examine any additional characters and can say that _{1} < rt-k+1=1`Z`

._{3} = 0

`k = 4`

, `lt = 2`

, `rt = 3`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | |||||||||||

lt | rt | ||||||||||||||

k |

`k=4 > rt=3`

and we’re calculating `Z`

explicitly. _{4}`S(4) ≠ S(0)`

, so `Z`

._{4} = 0

`k = 5`

, `lt = 2`

, `rt = 3`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | 0 | ||||||||||

lt | rt | ||||||||||||||

k |

`k=5 > rt=3`

and we’re calculating `Z`

explicitly. _{5}`Z`

. This is a normal newly found Z-box and we update _{5} = 4`lt := k=5`

, `rt := k+Z`

._{k}-1 = 5+4-1 = 8

`k = 6`

, `lt = 5`

, `rt = 8`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | 0 | 4 | |||||||||

lt | rt | ||||||||||||||

p | k |

`k=6 ≤ rt=8`

. The pair index `p = k-lt = 6-5 = 1`

. `Z`

. _{p} = Z_{1} = 0`Z`

, so we don’t need to examine any additional characters and can say that _{1} < rt-k+1 = 3`Z`

._{6} = 0

`k = 7`

, `lt = 5`

, `rt = 8`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | 0 | 4 | 0 | ||||||||

lt | rt | ||||||||||||||

p | k |

`k=7 ≤ rt=8`

. The pair index `p = k-lt = 7-5 = 2`

. `Z`

. _{p} = Z_{2} = 2`Z`

, so we need to examine characters starting from _{2} ≥ rt-k+1=2`rt+1=9`

and compare them to characters starting from `rt+1-k=2`

. `S(9) ≠ S(2)`

, first not equal characters are at positions 2 and 9. So, `Z`

. Also, we need to update _{7} = 9-k = 2`lt := k=7`

, `rt := k+Z`

._{k}-1 = 7+2-1 = 8

`k = 8`

, `lt = 7`

, `rt = 8`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | 0 | 4 | 0 | 2 | |||||||

lt | rt | ||||||||||||||

p | k |

`k=8 ≤ rt=8`

. The pair index `p = k-lt = 8-7 = 1`

. `Z`

. _{p} = Z_{1} = 0`Z`

, so we don’t need to examine any additional characters and can say that _{1} < rt-k+1 = 1`Z`

._{8} = 0

`k = 9`

, `lt = 7`

, `rt = 8`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | 0 | 4 | 0 | 2 | 0 | ||||||

lt | rt | ||||||||||||||

k |

`k=9 > rt=8`

and we’re calculating `Z`

explicitly. _{9}`S(9) ≠ S(0)`

, so `Z`

._{9} = 0

`k = 10`

, `lt = 7`

, `rt = 8`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | 0 | 4 | 0 | 2 | 0 | 0 | |||||

lt | rt | ||||||||||||||

k |

`k=10 > rt=8`

and we’re calculating `Z`

explicitly. _{10}`S(10) = S(0)`

, `S(11) = S(1)`

, `S(12) = S(2)`

, `S(13) ≠ S(3)`

, so `Z`

. Also, we need to update _{10} = 3`lt := k=10`

, `rt := k+Z`

._{k}-1 = 10+3-1 = 12

`k = 11`

, `lt = 10`

, `rt = 12`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | 0 | 4 | 0 | 2 | 0 | 0 | 3 | ||||

lt | rt | ||||||||||||||

p | k |

`k=11 ≤ rt=12`

. The pair index `p = k-lt = 11-10 = 1`

. `Z`

. _{p} = Z_{1} = 0`Z`

, so we don’t need to examine any additional characters and can say that _{1} < rt-k+1=2`Z`

._{11} = 0

`k = 12`

, `lt = 10`

, `rt = 12`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | 0 | 4 | 0 | 2 | 0 | 0 | 3 | 0 | |||

lt | rt | ||||||||||||||

p | k |

`k=12 ≤ rt=12`

. The pair index `p = k-lt = 12-10 = 2`

. `Z`

, so we need to examine additional characters starting from _{p}=Z_{2}=2 ≥ rt-k+1=1`rt+1=13`

and compare them to characters starting from `rt+1-k=1`

. `S(13) ≠ S(1)`

, first not equal characters are at positions 1 and 13. So, `Z`

. Also, we need to update _{12} = 13-k = 1`lt := k=12`

, `rt := k+Z`

._{k}-1 = 12+1-1 = 12

`k = 13`

, `lt = 12`

, `rt = 12`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | 0 | 4 | 0 | 2 | 0 | 0 | 3 | 0 | 1 | ||

lt rt |
|||||||||||||||

k |

`k=13 > rt=12`

and we’re calculating `Z`

explicitly. _{13}`S(13) ≠ S(0)`

, so `Z`

._{13} = 0

`k = 14`

, `lt = 12`

, `rt = 12`

.

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | 0 | 4 | 0 | 2 | 0 | 0 | 3 | 0 | 1 | 0 | |

lt rt |
|||||||||||||||

k |

`k=14 > rt=12`

and we’re calculating `Z`

explicitly. _{14}`S(14) = S(0)`

, so `Z`

._{14} = 1

Finally, we have:

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

Character | a | b | a | b | x | a | b | a | b | y | a | b | a | c | a |

Z-value | 15 | 0 | 2 | 0 | 0 | 4 | 0 | 2 | 0 | 0 | 3 | 0 | 1 | 0 | 1 |

#### Complexity

Let’s find out (not very strictly) time and space complexity of the algorithm. We have `|S|`

iterations. In each iteration we either already have got a Z-value or we have to do some comparisons. When we find `q`

matches during the comparisons `rt`

will be moved to the right: `rt`

. After each mismatch an iteration finishes, so there aren’t more than _{k} := rt_{k-1}+q`|S|`

mismatches. We have not more than `|S|`

iterations with not more than `|S|`

matches and mismatches as a whole. So, the algorithm has `O(|S|)`

time complexity.

Obviously, it requires `O(|S|)`

space to store Z-values.

#### Search algorithm

You could ask why we need these Z-values at all. This is an appropriate question :) We can use the idea of Z-values for doing a substring search.

Take a string `T`

(for “text”) and a string `P`

(for “pattern”), both built of characters of some alphabet `A`

. Also, take character `'$'`

which is not in `A`

. This character is called **the sentinel**. Assemble all these things in the following way: `P$T`

.

Compute Z-values of this composite string by the advanced algorithm described earlier. All this values will be limited to `|P|`

(except `Z`

, of course) because of the sentinel which can’t be found at any place in _{0}`P`

or `T`

. And when you see a `Z`

then you can be sure that substring _{j} = |P|`T(j..j+|P|-1)`

is an occurrence of `P`

in `T`

. And vice versa: you can be confident that all occurrences will be marked in the list of Z-values.

Let’s consider an example. `P = "ab"`

, `T = "xaybzabxaby"`

:

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |

Character | a | b | $ | x | a | y | b | z | a | b | x | a | b | y |

Z-value | 14 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 |

Even with overlapping matches: `P = "aa"`

, `T = "xaaay"`

:

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

Character | a | a | $ | x | a | a | a | y |

Z-value | 8 | 1 | 0 | 0 | 2 | 2 | 1 | 0 |

This search algorithm has the `O(|P|+|T|)`

time and space complexity because it has the longest, dominating part – search of Z-values which is `O(|S|)`

, `|S|=|P|+|T|`

.

Good algorithm. But there is an obstacle for a general-purpose use: the choice of the sentinel. In some applications you have a short known alphabet. For example, in bioinformatics there are symbols `'A'`

, `'C'`

, `'G'`

and `'T'`

and we can simply use `'$'`

as the sentinel. But what about general Unicode strings?

#### Search algorithm without the sentinel

Fortunately, there is search algorithm that doesn’t need to use sentinels. The idea is in calculating Z-values of a string `PT`

with a simple modification: after each Z-value is calculated, limit it to the `|P|`

, i.e., `Z`

. This limitation plays role of the sentinel._{k} := min{|P|, Z_{k}}

The Python code:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
def search_without_sentinel(pattern, text): """Search without the sentinel.""" s = pattern + text Z = [0] * len(s) Z[0] = len(s) rt = 0 lt = 0 occurrence = [] for k in range(1, len(s)): if k > rt: n = 0 while n + k < len(s) and s[n] == s[n+k]: n += 1 Z[k] = n if n > 0: lt = k rt = k+n-1 else: p = k - lt right_part_len = rt - k + 1 if Z[p] < right_part_len: Z[k] = Z[p] else: i = rt + 1 while i < len(s) and s[i] == s[i - k]: i += 1 Z[k] = i - k lt = k rt = i - 1 Z[k] = min(len(pattern), Z[k]) # An occurence found. if Z[k] == len(pattern): occurrence.append(k - len(pattern)) return occurrence |

All the code can be found in this repository.

Thank you.