Posts

Mono 1.2.6 binaries for Solaris 10/x86

Duncan Mac Leod from Tucan Entertainment has been kind enough to share his Mono 1.2.6 binaries for Solaris 10/x86.

Levenshtein Distance Algorithm: Fastest Implementation in C#

Here is a cleaned-up performance test for several different Levenshtein implementations I have blogged about recently. This test was emailed to me by Ahmed Ghoneim, who has also kindly agreed to its publication on my blog. I am very grateful to him for his excellent contribution. I have slightly altered his file to do away with the unnecessary local variables in my C-to-C# port of the GNULevenshtein method. I would like to hear from you which methods perform best on your machine. Please drop a comment ^_^!
LevenshteinAlgorithmPerformanceTest.cs: code only
Packages: code, data and sample binary in zip and self-executable zip formats
Please note that the GNULevenshtein method was found to be buggy! Here is the new replacement method.
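For readers who want a reference point when comparing the implementations in the test, here is a minimal sketch of the textbook two-row dynamic-programming Levenshtein distance. It is not one of the benchmarked methods and not the replacement for GNULevenshtein; the class and method names (EditDistance, Levenshtein) are made up for illustration.

```csharp
using System;

// Hypothetical helper, not part of libcorsis or the benchmark file.
public static class EditDistance
{
    // Textbook Levenshtein distance using two rows of the DP table.
    public static int Levenshtein(string a, string b)
    {
        if (a == null) throw new ArgumentNullException("a");
        if (b == null) throw new ArgumentNullException("b");
        if (a.Length == 0) return b.Length;
        if (b.Length == 0) return a.Length;

        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // deleting i characters of a
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.Min(Math.Min(deletion, insertion), substitution);
            }

            // Reuse the two buffers instead of allocating a new row.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }
}
```

Calling EditDistance.Levenshtein("kitten", "sitting") returns 3. A simple version like this is mainly useful as a correctness oracle: any faster implementation should agree with it on every input pair.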

Perplexity in Markov N-Gram Models

While implementing the perplexity function on Markov n-gram models as it is described on page 14 of Jurafsky & Martin's SLP (to appear), I came across some floating-point overflow and underflow issues and had to come up with equations to avoid them. Here is my solution in detail and its bigram implementation. It took me a lot of whiteboarding and a few hours to figure this one out, but the resulting libcorsis code is 100% free of foreign intellectual property ^_^. I am now looking for ways to analyze distributions graphically and am testing different encapsulations of probability values and their interaction with the public API.
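To illustrate the kind of underflow involved, here is a small sketch of one common workaround: computing perplexity in log space, PP(W) = exp(-(1/N) * sum of log P(w_i | w_{i-1})), instead of multiplying the raw probabilities. This is a standard technique and not necessarily the derivation used in libcorsis; the class name PerplexitySketch and the input convention (a sequence of already-estimated conditional probabilities) are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration, not the libcorsis implementation.
public static class PerplexitySketch
{
    // Takes the conditional probability of each token given its history
    // (for a bigram model, P(w_i | w_{i-1})) and returns the perplexity.
    public static double Perplexity(IEnumerable<double> conditionalProbabilities)
    {
        if (conditionalProbabilities == null)
            throw new ArgumentNullException("conditionalProbabilities");

        double logSum = 0.0;
        int n = 0;

        foreach (double p in conditionalProbabilities)
        {
            if (p <= 0.0 || p > 1.0)
                throw new ArgumentOutOfRangeException(
                    "conditionalProbabilities", "Each probability must be in (0, 1].");

            // Summing logarithms avoids the underflow that a long
            // product of small probabilities would cause.
            logSum += Math.Log(p);
            n++;
        }

        if (n == 0)
            throw new ArgumentException("At least one probability is required.");

        // exp(-(1/N) * sum(log p)) equals the N-th root of 1 / product(p).
        return Math.Exp(-logSum / n);
    }
}
```

With 2000 tokens that each have probability 0.001, the direct product is far below the smallest positive double and collapses to zero, while the log-space sum stays in a comfortable range and the result comes out as approximately 1000.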

Markov N-Gram Models, C# 3.0, F#, Professorship

Here is a very short update on what I have been doing recently. 1) Thanks to the excellent quality of the introduction to computational linguistics course I am taking this term, I get to learn a great many new things, which boosts the development tempo of my library. Here is a particularly interesting case: Markov N-Gram Models. These are soon going to find some interesting application in my experimental HMM-based PoS-Tagger. 2) C# 3.0 and .NET 3.5 have been released. I'm truly astonished by the improved ease of programming this iteration delivers. I am eagerly waiting to lay my hands on some quality books on both C# and .NET 3.5. 3) F#! After watching Dr. Brian Beckman's latest video on Channel 9 about monads, I decided to learn F#. Parallelization libraries like Parallel FX might provide all that is needed to seamlessly parallelize execution, but learning something natively functional seems to be a great asset for the very near future. 4) I had the chance yesterday to attend a ...

Levenshtein Distance Algorithm: Fastest Implementation in C#

While reading some interesting stuff about minimum edit distances in preparation for today's lecture (ECL/ICL), which is just about 45 minutes away as I'm writing this, I had the chance to test 5 different implementations of the Levenshtein minimum edit distance algorithm. Here is a screenshot first: I'll get into details later but let me announce the winner! And the winner is ... gLDp! gLDp is a funny display name for a Levenshtein implementation from a C project.
original implementation in C: levenshtein.c
my C# port: libcorsis code
C vs CIL vs C#
Now I want to get mercilessly picky with my own port and today's C# compilers. The ternary conditional expressions in my port (lines: #516, #524, #533) are there to circumvent the following restriction:
    // valid C
    int x = 0; int y = 0; int z = 0;
    z += x == y;
x, y and z are initialized to 0, and in the final line z gets incremented by 1. This is valid code in C but causes a compile-time error in C#: C# does no...
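The workaround mentioned above can be sketched as follows. C# has no implicit conversion from bool to int, so the comparison result has to be mapped to 1 or 0 explicitly. The class and method names here are made up for the example and do not come from the libcorsis port.

```csharp
// Hypothetical illustration of the ternary workaround.
public static class BoolToIntSketch
{
    // C equivalent: z += x == y;
    // In C# the bool produced by (x == y) must be converted by hand.
    public static int IncrementIfEqual(int z, int x, int y)
    {
        z += (x == y) ? 1 : 0;
        return z;
    }
}
```

BoolToIntSketch.IncrementIfEqual(0, 0, 0) returns 1, matching the C snippet, while unequal x and y leave z unchanged.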

CORSIS

Tenka Text has been rebranded as CORSIS. The following addresses are no longer accessible:
http://tenkatext.sourceforge.net
http://www.sourceforge.net/projects/tenkatext
You can reach the project on sourceforge.net by searching for "CORSIS". The new domain name of the project is www.corsis.de. The first new binaries are expected to be available in 2008 Q1.

New Segmenter Compiler: Benihime, 紅姫

A quick update on API refactorings! Here is a snapshot of what the code examined in the last post would look like with the refactorings and improvements I have made so far: Before: the code examined in the last post After: For segmenting streams into, say, words, one could also use something like sed on GNU/Linux, a programming language's regular expression implementation, or whatever. So why am I such an otaku? Why not just go with the naive and easy way? Well, I am a performance and control freak, CIL is great fun, and I feel 'pleasure' writing assembly code for a VM, but most importantly, Benihime, 紅姫 makes the perfect training ground for learning language and compiler design. Prior to getting into deep hack mode on Benihime, 紅姫, I had no idea about the differences between 'expressions', 'statements', 'branches' or 'stacks'. Implementing complex boolean expressions in conditional statements like if((c && pm) ...