Minimum Edit Distance in Python
While going through the NLP course by Jurafsky and Manning on Coursera, I coded a small Python implementation of the Wagner–Fischer algorithm presented in lectures 6, 7 and 8. And here it is! Please refer to the lectures for a more in-depth explanation of the algorithm; I’ll just go quickly through the basics and then present the code.
Introduction
How similar are these two strings?
Many people from different fields often end up asking themselves this question: the computational biologist comparing sequences of bases to see if they contain similar information; the computer scientist implementing speech recognition, trying to make sense of odd recognition results; me fighting with autocorrection for the control of my smartphone.
To actually answer this question, you first need to define some concept of distance between strings.
A useful definition is that of edit distance:
The edit distance is the number of operations (insertions, deletions or substitutions) needed to transform one string into another.
where an insert operation means adding a symbol to the string, deletion means subtracting one, and substitution is a deletion followed by an insertion. Depending on your definition of edit distance, you may just consider insertion and deletion and do without substitution.
For example, we may want to calculate the distance between the strings spell and help:
s | p | e | l | l |
---|---|---|---|---|
h | e | l | p | |
One way to transform help into spell is to align the two el substrings, insert an s at the beginning of help, and perform the remaining substitutions. If we abbreviate insertion, deletion and substitution with i, d and s, we get:
s | p | e | l | l |
---|---|---|---|---|
  | h | e | l | p |
i | s |   |   | s |
so that, depending on whether we consider substitution to be a single operation or two, we end up with an edit distance between spell and help of respectively 3 (1 + 1 + 1) or 5 (1 + 2 + 2).
Minimum edit distance
Normally we are not interested in just any edit distance: we want the minimum edit distance between two strings. How do we compute it?
There are infinitely many ways in which we can transform one string into another. We can get creative with alignments, inserting whole books’ worth of characters and then deleting the ones we don’t need, hiring a chiliad of monkeys randomly tapping on a keyboard until they manage to get from the first string to the second, and so on.
Luckily, if it is the minimum edit distance that we want, we don’t need to search this enormous space naively; we can be smart about it.
Say we have an initial state (the starting string) and an ending state (the final string). To go from one to the other, we apply a sequence of operations: a path connecting them. It turns out that a shortest path between the two states is built out of shortest paths to the intermediate states along the way: the problem has optimal substructure.
This problem can be solved elegantly by dynamic programming.
Dynamic Programming for Minimum Edit Distance: Wagner–Fischer algorithm
With dynamic programming, we solve a large problem by first solving smaller parts of it, and then building on the knowledge we gathered to solve bigger and bigger parts.
In the Wagner-Fischer algorithm, we define a distance matrix \(D_{i,j} = d(X[1\ldots i], Y[1\ldots j])\), the matrix in which index \((i, j)\) corresponds to the minimum edit distance \(d\) between the first \(i\) symbols in \(X\) and the first \(j\) symbols in \(Y\). We first compute \(D_{i,j}\) for small \((i,j)\), and then go for larger and larger \(i\) and \(j\) using the smaller bits that we already computed before.
By doing this, we end up with the minimum edit distance between \(X\) and \(Y\), that is \(d(X[1\ldots m], Y[1 \ldots n])\), where \(m = \vert X \vert\) is the length of the \(X\) string, and \(n = \vert Y \vert\) is the length of the \(Y\) string.
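Concretely, with substitution counted as two operations (as in the code below), each cell is filled in from its three already-computed neighbors via the recurrence

\[
D_{i,j} = \min\big(D_{i-1,j} + 1,\; D_{i,j-1} + 1,\; D_{i-1,j-1} + c(X_i, Y_j)\big),
\]

where \(c(X_i, Y_j) = 0\) if \(X_i = Y_j\) and \(2\) otherwise, with base cases \(D_{i,0} = i\) (delete everything) and \(D_{0,j} = j\) (insert everything).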
I’ll now go through a Python implementation of the algorithm. I’ll be using Python 3, as I wanted Unicode support and I didn’t want to deal with the Unicode nonsense in Python 2. To run the code in Python 2, just take out the Unicode arrows.
First things first, let’s import some libraries. Numpy just makes things cleaner (not much going on here in terms of numerics), and we use tabulate to produce the final tables for backtracking and alignment.
import numpy as np
import tabulate as tb
We jump straight into defining the key function:
def wagner_fischer(word_1, word_2):
    n = len(word_1) + 1  # counting empty string
    m = len(word_2) + 1  # counting empty string
    # initialize D matrix
    D = np.zeros(shape=(n, m), dtype=int)  # np.int is deprecated; plain int works
    D[:, 0] = range(n)
    D[0, :] = range(m)
    # B is the backtrack matrix. At each index, it contains a triple
    # of booleans, used as flags. If B(i,j) = (1, 1, 0) for example,
    # the distance computed in D(i,j) came from a deletion or a
    # substitution. This is used to compute backtracking later.
    B = np.zeros(shape=(n, m), dtype=[("del", 'b'),
                                      ("sub", 'b'),
                                      ("ins", 'b')])
    B[1:, 0] = (1, 0, 0)
    B[0, 1:] = (0, 0, 1)
    for i, l_1 in enumerate(word_1, start=1):
        for j, l_2 in enumerate(word_2, start=1):
            deletion = D[i-1, j] + 1
            insertion = D[i, j-1] + 1
            substitution = D[i-1, j-1] + (0 if l_1 == l_2 else 2)
            mo = np.min([deletion, insertion, substitution])
            B[i, j] = (deletion == mo, substitution == mo, insertion == mo)
            D[i, j] = mo
    return D, B
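As a sanity check on the distances this produces, here is a hypothetical distance-only variant of the same recurrence (the names edit_distance and sub_cost are mine, not from the lectures). It uses plain lists and keeps only the previous row of the matrix, so it returns the distance without backtracking information:

```python
def edit_distance(x, y, sub_cost=2):
    # prev holds row D[i-1, :] of the distance matrix
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]  # D[i, 0]: i deletions
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j-1] + 1,      # insertion
                            prev[j-1] + (0 if a == b else sub_cost)))
        prev = curr
    return prev[-1]

print(edit_distance("spell", "hello"))  # 4 with substitution cost 2
```

With sub_cost=1 this computes the classic Levenshtein distance instead, giving 3 for spell/help as discussed above.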
And here we implement a naive backtrace:
- start from index \((m, n)\),
- look at where the computed value in \((m, n)\) came from,
- in order of preference, follow a substitution, a deletion, or an insertion (that is, go to the cell up and to the left if that’s where the value was computed from, otherwise to the cell above, otherwise to the cell to the left).
def naive_backtrace(B_matrix):
    i, j = B_matrix.shape[0]-1, B_matrix.shape[1]-1
    backtrace_idxs = [(i, j)]
    while (i, j) != (0, 0):
        if B_matrix[i,j][1]:    # substitution
            i, j = i-1, j-1
        elif B_matrix[i,j][0]:  # deletion
            i, j = i-1, j
        elif B_matrix[i,j][2]:  # insertion
            i, j = i, j-1
        backtrace_idxs.append((i,j))
    return backtrace_idxs
This next function takes a backtrace and computes the alignment between the two words. It goes through the operations and takes note of what has been applied at each step, while constructing the alignment.
def align(word_1, word_2, bt):
    aligned_word_1 = []
    aligned_word_2 = []
    operations = []
    backtrace = bt[::-1]  # make it a forward trace
    for k in range(len(backtrace) - 1):
        i_0, j_0 = backtrace[k]
        i_1, j_1 = backtrace[k+1]
        w_1_letter = None
        w_2_letter = None
        op = None
        if i_1 > i_0 and j_1 > j_0:  # either substitution or no-op
            if word_1[i_0] == word_2[j_0]:  # no-op, same symbol
                w_1_letter = word_1[i_0]
                w_2_letter = word_2[j_0]
                op = " "
            else:  # cost increased: substitution
                w_1_letter = word_1[i_0]
                w_2_letter = word_2[j_0]
                op = "s"
        elif i_0 == i_1:  # insertion
            w_1_letter = " "
            w_2_letter = word_2[j_0]
            op = "i"
        else:  # j_0 == j_1, deletion
            w_1_letter = word_1[i_0]
            w_2_letter = " "
            op = "d"
        aligned_word_1.append(w_1_letter)
        aligned_word_2.append(w_2_letter)
        operations.append(op)
    return aligned_word_1, aligned_word_2, operations
Finally, this function formats the results from the Wagner–Fischer algorithm and backtracking into a human-readable table. Each cell contains the minimum edit distance computed from the initial state to that state, together with where it was computed from (that is, which operations could have produced it). The backtrace is highlighted with asterisks.
def make_table(word_1, word_2, D, B, bt):
    w_1 = word_1.upper()
    w_2 = word_2.upper()
    w_1 = "#" + w_1
    w_2 = "#" + w_2
    table = []
    # table formatting in emacs, you probably don't need this line
    table.append(["<r>" for _ in range(len(w_2) + 1)])
    table.append([""] + list(w_2))
    for i, l_1 in enumerate(w_1):
        row = [l_1]
        for j, l_2 in enumerate(w_2):
            v, d, h = B[i, j]
            direction = ("⇑" if v else "") + \
                        ("⇖" if d else "") + \
                        ("⇐" if h else "")
            dist = str(D[i, j])
            cell_str = "{direction} {star}{dist}{star}".format(
                direction=direction,
                star=" *"[(i, j) in bt],
                dist=dist)
            row.append(cell_str)
        table.append(row)
    return table
Now we are ready to compute the minimum edit distance table, backtrace and alignment. Note that the “#+ATTR_HTML” print statements are there to format the table for this website; they don’t serve any other mysterious purpose.
What’s the minimum edit distance between “spell” and “hello”?
word_1 = "spell"
word_2 = "hello"

D, B = wagner_fischer(word_1, word_2)
bt = naive_backtrace(B)

edit_distance_table = make_table(word_1, word_2, D, B, bt)
alignment_table = align(word_1, word_2, bt)

print("Minimum edit distance with backtrace:")
print("#+ATTR_HTML: :border 2 :rules all :frame border "
      ":style text-align: right")  # org-babel export html properties
print(tb.tabulate(edit_distance_table, stralign="right", tablefmt="orgtbl"))
print("\nAlignment:")
print(tb.tabulate(alignment_table, tablefmt="orgtbl"))
Minimum edit distance with backtrace (the starred numbers):
  | # | H | E | L | L | O |
---|---|---|---|---|---|---|
# | *0* | ⇐ 1 | ⇐ 2 | ⇐ 3 | ⇐ 4 | ⇐ 5 |
S | ⇑ *1* | ⇑⇖⇐ 2 | ⇑⇖⇐ 3 | ⇑⇖⇐ 4 | ⇑⇖⇐ 5 | ⇑⇖⇐ 6 |
P | ⇑ 2 | ⇑⇖⇐ *3* | ⇑⇖⇐ 4 | ⇑⇖⇐ 5 | ⇑⇖⇐ 6 | ⇑⇖⇐ 7 |
E | ⇑ 3 | ⇑⇖⇐ 4 | ⇖ *3* | ⇐ 4 | ⇐ 5 | ⇐ 6 |
L | ⇑ 4 | ⇑⇖⇐ 5 | ⇑ 4 | ⇖ *3* | ⇖⇐ 4 | ⇐ 5 |
L | ⇑ 5 | ⇑⇖⇐ 6 | ⇑ 5 | ⇑⇖ 4 | ⇖ *3* | ⇐ *4* |
Alignment:
s | p | e | l | l |   |
---|---|---|---|---|---|
  | h | e | l | l | o |
d | s |   |   |   | i |
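As a final cross-check, Python’s standard library can recover a similar alignment: difflib.SequenceMatcher reports the edit operations between two sequences as “opcodes”. Be aware that difflib optimizes for human-readable diffs rather than for minimum cost, so its operation sequence is not guaranteed to be minimal; it is just a convenient comparison.

```python
import difflib

# Each opcode is (tag, i1, i2, j1, j2), describing how to turn the slice
# of the first string into the slice of the second; tag is one of
# "replace", "delete", "insert" or "equal".
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, "spell", "hello").get_opcodes():
    print(tag, repr("spell"[i1:i2]), "->", repr("hello"[j1:j2]))
```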