Levenshtein Distance Algorithm: Fastest Implementation in C#

Here is a cleaned-up performance test for several different implementations of levenshtein I have blogged about recently. This test was emailed to me by Ahmed Ghoneim, who has also kindly agreed to its publication on my blog. I am very grateful to him for his excellent contribution. I have slightly altered his file to do away with the unnecessary local variables in my C2C# port of the GNULevenshtein method.

I would like to hear from you which methods perform best on your machine. Please drop a comment ^_^!

C#LevenshteinAlgorithmPerformanceTest.cs
code only

PackagePackages
code, data and sample binary in zip and self-executable zip formats

Please note that the GNULevenshtein method was found to be buggy! Here is the new replacement method.

Comments

Anonymous said…
I tried GNULevenshtein version with the words 'fot' and 'foot'. The resulting distance is 0. I thought it was 1.

Thanks
Cetin Sert said…
That's a bug. I will try to fix it somehow. Thanks for your comment ^_^
Cetin Sert said…
I have now tested the original version in C++ and this bug seems to be in the original code and not something I have introduced in my C# port. The section of the code that cuts equal start and end sequences is responsible for this behaviour. Removing this section resolves the issue but I will try to see how I can keep that part intact yet correct the results.
Anonymous said…
Awesome. I am investigating alternative algorithms having a suffix trie dictionary. I already use it for exact matching and I am wondering if it can be used also for approximate matching. Do yo have some good reference for this?
Thanks
Cetin Sert said…
'x' and 'foxot' return 5 instead of 4 and that's after I have fixed the bug in the stripping of equal start and end sequences. Which means there is something fundementally wrong with the reference implementation. I will be looking for a better implementation to replace the GNULevenshtein method altogether. Thanks for bringing this up! I recommend using UnsafeVectorLevenshtein.

By the way I don't know anything about suffix trie dictionaries so I can't help you there. I would like to add you to my email contacts and visit your blog if you have one. Please drop a comment.
Anonymous said…
Sure thing: managed_dot_theory_at_live_dot_com

bye
Cetin Sert said…
Thanks by the way I ported another equivalently optimized but highly obfuscated edit distance implementation which also lets one specify the cost of substitution.

http://www.koders.com/c/fidF5020AACF174C23AF41E57A8DF4080BF1FDC4849.aspx?s=levenshtein

I am going to update my blog ASAP after I have run several tests. Thank you very much for helping me correct my code.
Anonymous said…
There is an algorithm for Levenshtein distance that uses diagonals and lazy evaluation, and runs in O(|A|(1 + DAB)) time (where |A| is the length of the first string, and DAB is the Levenshtein distance). Where the distance is small (strings are similar), this algorithm should be pretty fast. See http://www.csse.monash.edu.au/~lloyd/tildeStrings/Alignment/92.IPL.html

I have implemented it in Java as an exercise (feel free to improve it). http://www.its.caltech.edu/~xuanluo/Levenshtein.java
It should be pretty easy to port it into C# or something like that. I would be interested in how it does if you test it.
Cetin Sert said…
^_^ Open-source is a great culture! Thanks for sharing your code... I'll have a go at it as soon as I can regain my consciousness: I'm literally blown up by my first exposure to F# and Haskell. o_O I can't believe that I missed out on Haskell for so long.
Anonymous said…
Performance test (just run it from your ZIP archive):

Match: MULTIPLICATION Time: 509 Distance: 3 GNULevenshtein
Match: MULTIPLICATION Time: 479 Distance: 3 UnsafeVectorLevenshtein
Match: MULTIPLICATION Time: 657 Distance: 3 VectorLevenshtein
Match: MULTIPLICATION Time: 948 Distance: 3 MatrixLevenshtein
Match: MULTIPLICATION Time: 852 Distance: 3 JaggedLevenshtein

PC: Intel Core Quad 6600
OS: Windows Server 2008 x64
.NET: v3.5 SP1

Thank you for the perfect code!
Anton.
http://kyta.spb.ru
Anonymous said…
Performance measurements (just unpacked zip):

Match: MULTIPLICATION Time: 1033 Distance: 3 GNULevenshtein
Match: MULTIPLICATION Time: 1191 Distance: 3 UnsafeVectorLevenshtein
Match: MULTIPLICATION Time: 1239 Distance: 3 VectorLevenshtein
Match: MULTIPLICATION Time: 3554 Distance: 3 MatrixLevenshtein
Match: MULTIPLICATION Time: 2962 Distance: 3 JaggedLevenshtein

PC: Athlon x2 4850e
OS: Ubuntu 8.04 amd64
.NET: Mono 1.9

Thank you!
Anton.
Anonymous said…
The package download link is broken
Cetin Sert said…
@anonymous: the link has been fixed! Thank you for reporting *^o^* !!

Popular posts from this blog

WordSmith Tools 5.0, Tenka Text in China

Mono 1.2.5 binaries for Solaris 10/x86