Tenka Text

January 15, 2007

Hi, this is my first blog post.

I have set up this blog to keep people informed about my personal open-source corpus analysis project, Tenka Text. It is my (Cetin Sert's) open-source alternative to Mike Scott's WordSmith Tools (4).

Since you can follow the link above to get an idea of what my project is and is not, I will skip the introduction and report only the most recent development news.

Back to Basics

Well... first of all, I have decided to refocus on the sub-GUI parts of the code once again. I saw a need to rethink some of the basic design elements, such as the overall inheritance/relation model of the classes and structs in the Tenka.Text namespace. Before going back to GUI development, I want to simplify the word enumerators, do away with the bulky interfaces, learn more about the available inlining optimizations, move from public fields to public properties, and minimize the public exposure of pointers.

System.Reflection.Emit & Runtime Compilation

One of the most research-worthy areas I have come across recently is configuration-based dynamic emission of IL code, as provided by the System.Reflection.Emit namespace. Let's imagine that a user wants to define a custom set of characters across which no word clusters should be enumerated. (A 5-word cluster that starts in one sentence and continues in the next is not likely to make much sense.) There are two possible solutions:

1. A character array (string) stores the user-defined cluster constraints. Each time the cluster enumerator moves to the next word, it checks every character between the current word and the next word against the characters in the user-defined set to determine whether clustering is allowed.
2. Reflection is used in a clever way to emit a dynamic method, which avoids having to store the user-defined cluster constraint characters in an array and retrieve them from there.
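
To make the first way a bit more concrete, here is a minimal sketch in Python rather than in the C# of Tenka.Text itself. The function name `clusters`, the `\w+` definition of a word and the default constraint set `.!?` are all assumptions made for illustration, not taken from the actual code base.

```python
import re

def clusters(text, n, constraints=".!?"):
    """Yield n-word clusters that do not cross any constraint character."""
    # Record each word together with its start and end offsets in the text.
    words = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    blocked = set(constraints)  # the user-defined cluster constraints
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        # Check every character in the gap between neighbouring words
        # against the stored constraint set.
        allowed = all(
            blocked.isdisjoint(text[a_end:b_start])
            for (_, _, a_end), (_, b_start, _) in zip(window, window[1:])
        )
        if allowed:
            yield " ".join(word for word, _, _ in window)
```

With the default constraints, `list(clusters("The cat sat. The dog ran away.", 2))` skips the pair "sat The" because a full stop lies between the two words. Passing an empty constraint string allows every cluster.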

I do not know how easily the second way can be implemented, but I believe it might yield a performance gain. Regular expressions in .NET already make use of this technique: with RegexOptions.Compiled, a pattern is compiled to IL at run time.
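
The idea behind the second way can be sketched without any IL at all. The following Python analogue builds the source of a checking function with the constraint characters baked in as literal comparisons and compiles it at run time. It is only an illustration of the principle: the names `emit_gap_check` and `gap_allows_cluster` are hypothetical, and in .NET the counterpart would be a DynamicMethod filled in through an ILGenerator. It makes no claim about the actual performance gain.

```python
def emit_gap_check(constraints):
    """Build a function gap_allows_cluster(gap) specialized for `constraints`."""
    if constraints:
        # One literal comparison per character, e.g. c == '.' or c == '!'
        tests = " or ".join(f"c == {ch!r}" for ch in constraints)
    else:
        tests = "False"  # no constraints: every gap is allowed
    source = (
        "def gap_allows_cluster(gap):\n"
        "    for c in gap:\n"
        f"        if {tests}:\n"
        "            return False\n"
        "    return True\n"
    )
    namespace = {}
    # Compile the generated source at run time and fetch the new function.
    exec(compile(source, "<emitted>", "exec"), namespace)
    return namespace["gap_allows_cluster"]
```

A call such as `check = emit_gap_check(".!?")` returns a function that no longer looks anything up in a stored set: `check(" ")` is true and `check(". ")` is false.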

If the results turn out to be satisfactory, runtime compilation might lead me to rewrite the whole project. I will do quite a lot of reading on this in the next couple of weeks. A rewrite would mean that I no longer write a rigid instance of Tenka.Text, but rather the abstract logic of Tenka.Text, which can then write specialized instances of itself on user demand.

Performance Test

I tried to create a list of 6-word clusters, disregarding any sentence boundaries or punctuation, and here are the results:

WordSmith Tools (4.0.0.338)

Create Index - ~12 seconds
Open Index - ~3 seconds
Calculate 6-Word Clusters - ~20 seconds

TOTAL: 3 commands, ~35 seconds -.-

Tenka Text (version: SVN 37)

Create 6-Word Cluster List - 5.93 seconds (create: 1.97 + sort: 3.96)

TOTAL: 1 command, 5.93 seconds ^_^
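
For readers who wonder what the two timed phases (create and sort) amount to, here is a rough Python sketch of building a cluster list while ignoring sentence boundaries. It is not the Tenka Text implementation. The function name `cluster_list`, the lower-casing of words and the sort order (most frequent first, ties alphabetical) are assumptions.

```python
import re
from collections import Counter

def cluster_list(text, n=6):
    """Return (cluster, frequency) pairs for all runs of n consecutive words."""
    # Assumption: words are runs of \w characters, compared case-insensitively.
    words = re.findall(r"\w+", text.lower())
    # Create phase: count every run of n consecutive words,
    # regardless of sentence boundaries or punctuation in between.
    counts = Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )
    # Sort phase: most frequent first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```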

By the way... you can also browse the SVN repository and the Ohloh page of my project.
