Hands on: Levenshtein distance tells the similarity of two strings.
It’s called “edit distance”, because the allowable operations to shift from one string to the other are edits. There are typically 3 operations:
- replace (change 1 character)
Algorithm developer probably asks one big question: what direction to take? What is the ‘algebra’ in Levenshtein? How does the algorithm know which sequence of operations leads to the target string?! In addition, the algorithm shall return the smallest (minimum) score out of possibly many solutions to get to the destination string.
Quite a challenge!
Let’s try recursion.
Recursion is an idea; it’s a bit tweaked at first, if you have never heard of recursion before. Recursion means that an algorithm calls itself as part of the solution process. The function’s “levels of calls” get deeper, until there is a base case, whose result can be returned immediately. Otherwise the recursive algorithm would get stuck, eternally looping, trying to desperately solve the (what should be) base case, but ending up yet again in the base case solution finding… You get the idea! In properly implemented recursive algorithm, once this base case is requested, the result propagates “up” the call chain, to the branches of the callers. The callers merge again, producing ultimately two elementary ordinals (numbers), that can be computed to make the final result.
I’ve no idea at this point whether recursive strategy might work.
We need to start with a very simple, degenerate case.
Well.. one obvious thing is: “to get to any destination string of length 1, from a null (void) string, just add the only character of the source string, let’s say it will always be the C. The length of this is 1 operation”
Practical algorithms needs a couple of phases done right: initialization, basic work round, and have a robust termination condition. Initialization of algorithm sets any tracking & auxiliary variables ready for the working round. The work round takes a step that advances the algorithm’s state towards a solution. Robust termination means that the algorithm doesn’t get stuck in an infinite loop: rather, the algorithm at some point will return a value (or error).
Rules so far
- if dest.len is 1, and src.len is 0, then solution length is 1
- carry a intermediary string (‘current’)
- set current = src
- let variable “deviants” be equal to the number of slots (characters) in “current” that are not equal to the corresponding characters in “dest”
- if current.len == dest.len then calculate deviants AND replace one character in “current”
- in conclusion our Levenshtein algorithm grows or prunes the length of the string, until matching length is reached. Then the final score will be score-so-far plus number of “deviant” characters (since they will be one by one, replaced to be matching characters).
You know, I studied computer science but never graduated. This left me with a sort of deviant, peculiar greed to understand things on my own. I did read a Data Structures and Algorithms course, which was both laborious, and very interesting. I got a 4 our of 5 from the course.
The book we used was Mark A. Weiss’ 2nd edition of Data Structures and Algorithm Analysis in C.
Levenshtein is an algorithm that calculates a sort of digital difference (edit distance) between two strings. To give some sort of idea of what’s going on we can take a few intuitive cases. Then let’s jump into a few Levenshtein implementations and think what are the core pieces that are definitely essential.
Degenerate cases: get out as soon as possible!
With algorithms we always want to ‘exit’ as soon as possible, if there are trivial (fringe cases; edge cases) cases in the input. Handling early these cases gives two benefits:
a) we don’t stress the CPU in a long loop for nothing – this saves wattage (power) – where we can simply test for a trivial case and return immediately
b) we can sometimes develop the rest of the algorithm to be more simple in its structure, now that we can be relying on the fact that there’s not going to be any edge cases left. It’s a straight run without too much branching on “if” conditions. Theoretically speaking the early exits should target those input conditions whose probability of occurrence are the smallest in the whole input data set; thus we save the most amount of vain IF (…) condition testing.
Levenshtein returns immediately on 2 conditions:
- if input string A is of 0 (zero) length, then the function returns length of string B
- vice versa: if B.length is zero, return A.length
This is quite natural. If we compare “nothing” (an empty string) to text ‘ABC’, we do need to insert A, B and C (3 insertions) to get the difference. In addition, by definition this is symmetrical, so that’s why there are 2 trivial length comparisons which both will exit immediately and return the length of the string that does exist.
Main body of Levenshtein algorithm code
Levenshtein’s common implementation consists of 2 FOR loops. Two loops “paint” a 2-dimensional matrix, thus there’s a space complexity of O(n*m) for the algorithm. Remember, the O() complexity (big O notation) often is shown as per “on the order of” accuracy, not necessarily having exact terms in a polynomial. It often suffices to know what’s the dominating term that defines the ultimate behavior of the algorithm’s runtime or space requirements.