*^o^* Happy Holidays *^o^*

meeting new friends now ...

11:38 0 comments

2009 - 2010

No news ... I asked someone to not tell me any so I will not post any personal news either. I'll be in Turkey for a week, meeting new friends and spending 2009's last and 2010's first days in their company.
10:58 1 comments

CORSIS .2 First Applications

As part of a research project I am involved in, the development version of CORSIS is now being used and actively extended with the latest technologies. For more information, visit our new wiki.
12:49 0 comments

erlang ! hello

On Tuesday, while reading Programming Erlang at the suggestion of Andreas Cardeneo from the Research Center for Information Technology of the University of Karlsruhe, I started to experiment in Haskell with typed channels and lightweight threads to imitate Erlang-style processes and networks thereof.

Here is the very early and raw code from a few hours of exploration.

And here is a simple interactive session in which a server is created that reads strings into integers. These integers then get dealt (distributed one at a time) to a first layer of 3 parallel nodes, and from there they travel to a second layer:

*Erlang> (server,out) ← serve (read :: String → Integer)
*Erlang> l1ts ← create 3 :: IO ([Chan Integer])
*Erlang> did ← deal move ([(\y → y - x) | x ← [1..3]]) out l1ts
*Erlang> l2ts ← create 3 :: IO ([Chan Integer])
*Erlang> tids ← sequence $ zipWith (link move (2 *)) l1ts l2ts
*Erlang> server ·· [ show x | x ← [2..4] ]
*Erlang> all_ flush l2ts
-- [[2],[2],[2]]
{-
FLOW VIEW

                          l1        l2
                          ---       ---
                           1    →    2
  in     out              ---       ---
  ---    ---         ↗
  "2"     2               ---       ---
  "3"  →  3     →          1    →    2
  "4"     4               ---       ---
  ---    ---         ↘
                          ---       ---
                           1    →    2
                          ---       ---

-- before "all_ flush l2ts" actually only l2 is not empty

ACTUAL STATE

                          l1        l2
                          ---       ---
                                →    2
  in     out              ---       ---
  ---    ---         ↗
                          ---       ---
       →        →               →    2
                          ---       ---
  ---    ---         ↘
                          ---       ---
                                →    2
                          ---       ---
-}

One could use such a system to distribute, say, processing of images to multiple parallel threads that utilize different cores or processors on a single machine. Using copy links (link copy ...) instead of move links (link move ...) would leave copies of the intermediate results on channels and we could see into the individual stages of a complex image manipulation. Tagging messages that travel between such processes would make it possible to divide a problem into pieces, process the pieces in parallel and put the results back together. One could make the inter-process communication lazy and declare a process network structure that produces results only when there is demand from the terminal nodes.
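For readers who do not have the Haskell code at hand, here is a minimal Python sketch of the same three-node, two-layer network, using queue.Queue and threading.Thread in place of Chan and forkIO. The names link, deal and flush echo the session above, but their behaviour here is an assumption: deal is taken to hand out items round-robin, applying the i-th function on the way to the i-th node, and a sentinel value (not part of the original) marks the end of the stream so that flush can terminate.

```python
import queue
import threading

_DONE = object()  # hypothetical end-of-stream marker, not part of the original design


def link(fn, src, dst):
    """Start a thread that moves items from src to dst, applying fn on the way."""
    def run():
        while True:
            item = src.get()
            if item is _DONE:
                dst.put(_DONE)
                return
            dst.put(fn(item))
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def deal(fns, src, dsts):
    """Start a thread that deals items from src to dsts one at a time (round-robin).

    Assumption: the item sent to dsts[k] is first transformed by fns[k].
    """
    def run():
        i = 0
        while True:
            item = src.get()
            if item is _DONE:
                for dst in dsts:
                    dst.put(_DONE)
                return
            k = i % len(dsts)
            dsts[k].put(fns[k](item))
            i += 1
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def flush(chan):
    """Collect everything on a channel up to the end-of-stream marker."""
    items = []
    while True:
        item = chan.get()
        if item is _DONE:
            return items
        items.append(item)


def run_network(inputs):
    """Build server -> deal -> layer 1 -> layer 2 and push the inputs through."""
    server, out = queue.Queue(), queue.Queue()
    link(int, server, out)                      # the "server" reads strings into integers
    layer1 = [queue.Queue() for _ in range(3)]
    deal([lambda y, x=x: y - x for x in (1, 2, 3)], out, layer1)
    layer2 = [queue.Queue() for _ in range(3)]
    for src, dst in zip(layer1, layer2):
        link(lambda v: 2 * v, src, dst)
    for text in inputs:
        server.put(text)
    server.put(_DONE)
    return [flush(chan) for chan in layer2]
```

Under these assumptions, run_network(["2", "3", "4"]) returns [[2], [2], [2]], the same result as the session: each input is reduced to 1 on its way into layer 1 and doubled on its way into layer 2.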

Check out this link for more examples of different process networks!

As I write these lines, I don't know if there is any prior Haskell work about such things but I could imagine this or a more serious endeavour turning into a nice little framework for distributed processing. By using more advanced primitives for communication than forkIO and Chan a, one might span a process network across machine boundaries.

In this blog post, 'process' means "a lightweight thread that reads from a channel, applies a function to what it has read and writes the result to another channel" and NOT "OS process".

This protoprototype of process networks has not been performance-tested in any way. This is very much an experiment and further investigation of the concept may or may not turn out to be useful.

07:11 0 comments

erlang ! hello

Here is the clock example from Programming Erlang (page 155) in Erlang [clock.erl] and its very rudimentary approximations!? in Haskell [clock2.hs, general2.hs].

21:29 1 comments

Mono 2.0 binary package for Solaris 10/x86

Just received some fresh Mono 2.0 binaries for Solaris 10/x86. Many thanks pablo! Mono 2.2 to follow after the weekend *^o^*!
16:54 1 comments

Debugging Haskell: LFG, AVM → DAG

A few days ago I had to do some debugging in Haskell while writing a function to draw directed acyclic graphs of attribute value matrices from LFG F-structures. unsafePerformIO was a great help there. I did most of this during a machine translation lecture given by Kurt Eberle of lingenio.

Here are the debugging functions:

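{-# LANGUAGE UnicodeSyntax #-}
import System.IO.Unsafe (unsafePerformIO)

-- print a value and pass it through unchanged
deb1 :: Show a ⇒ a → a
deb1 a = unsafePerformIO (print a >> return a)

-- like deb1, but prefix the output with a message
deb2 :: Show a ⇒ String → a → a
deb2 msg a = unsafePerformIO (putStrLn (((msg ++) . show) a) >> return a)

-- print b and a only when the predicate p holds for a
deb3 :: (Show a, Show b) ⇒ (a → Bool) → b → a → a
deb3 p b a = unsafePerformIO $ esc p b a
  where
    esc p b a | p a       = putStr ((show b) ++ (show a)) >> return a
              | otherwise = return a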

cond = (== "^OBL")
dobj = deb3 cond

And here is how I used them in context:

core :: Integer → String → [(Integer,String)] → AVM → (String,[(Integer,String)],Integer)
core i s ls (M att (A val)) = (s ++ x i ++ " [label=\"" ++ dobj s att ++ "\"];\n",(i,val):ls,i)
core i s ls (M att (C mms)) = let (xs,xls,xi) = add i i s ls att mms in
                              (s ++ x i ++ " [label=\"" ++ att ++ "\"];\n" ++ xs,xls,xi)
  where
    add f i s ls att []     = ([],[],i)
    add f i s ls att (m:ms) = (cs ++ rs,cls ++ rls,ri)
      where
        (cs,cls,ci) = core ni sm ls m
        (rs,rls,ri) = add f ci [] ls att ms
        sm = x f ++ " -> "
        ni = i + 1

I know core pretty much sucks in many ways o_O!! Still trying to figure out how I could keep such things maintainable in Haskell... (I will have a look at the FGL sometime soon.)
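For comparison, here is a small Python sketch of the general idea behind core: walk a nested attribute value matrix and emit Graphviz DOT text, one node per attribute and one edge per nesting step. It is not a port of the Haskell function. The nested-dict representation, the function name avm_to_dot and the node naming scheme are assumptions made for illustration, and shared substructures are not handled, so the output is always a tree.

```python
from itertools import count


def _quote(text):
    """Escape a label for use inside a double-quoted DOT string."""
    return str(text).replace("\\", "\\\\").replace('"', '\\"')


def avm_to_dot(avm, name="fstructure"):
    """Render a nested-dict AVM as Graphviz DOT text (tree-shaped sketch)."""
    ids = count()
    lines = [f"digraph {name} {{"]

    def visit(label, value, parent):
        node = f"n{next(ids)}"
        lines.append(f'  {node} [label="{_quote(label)}"];')
        if parent is not None:
            lines.append(f"  {parent} -> {node};")
        if isinstance(value, dict):
            # complex value: recurse into each attribute
            for attribute, sub_value in value.items():
                visit(attribute, sub_value, node)
        else:
            # atomic value: draw it as a box-shaped leaf
            leaf = f"n{next(ids)}"
            lines.append(f'  {leaf} [label="{_quote(value)}", shape=box];')
            lines.append(f"  {node} -> {leaf};")

    visit("f", avm, None)
    lines.append("}")
    return "\n".join(lines)
```

Calling avm_to_dot({"pred": "Brief", "num": "sg", "cas": "acc"}) yields text that can be piped into the dot tool. Keeping the node counter inside the function avoids threading an integer through every recursive call, which is much of what makes core hard to read.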

It was very easy to use type classes and unicode infix functions to achieve a really pretty syntax to represent F-structures though ^o^:

s1 :: AVM
s1 = void ≈
  [pred ≈ "antworten<(^SUBJ)(^OBL)>",

   subj ≈ [pred ≈ pro,
           num  ≈ sg,
           cas  ≈ nom ],

   obl  ≈ [pred ≈ "auf<^OBJ>",

           obj ≈ [pred ≈ "Brief",
                  num  ≈ sg,
                  cas  ≈ acc] ],

   tense ≈ present ]

Machine translation is the last lecture before the summer holidays *^o^*! I will have more time to work on Corsis again.

09:17 0 comments

mono string hash code collisions

After reading the following blog post by David R. MacIver about hash code collisions in java strings:

For reasons you either know about by now or don’t care about, I was curious as to how well String’s hashCode was distributed (I suspected the answer was “not very”). I ran a few quick experiments to verify this.

For your amusement, here is a list of all hash collisions between alphanumeric strings of size 2: http://www.drmaciver.com/collisions.txt and here is a list of all which don’t collide with any others http://www.drmaciver.com/noncolliding.txt

Some statistics: There are 3844 alphanumeric strings of size 2. Of these 3570 collide with at least one other string. That is, 274 of these strings (or about 7% of them) *don’t* collide with something else.

Oh well. It’s a good thing no one would be stupid enough to rely on hashCode to distinguish the contents of two objects.

I tested things with .NET 3.5 and MONO 1.9.1 on a 32-bit Windows Vista:

running on .NET 3.5:
3844 two-char strings
14776336 comparisons
0 collisions with one or more items
0 total collisions

running on mono 1.9.1:
3844 two-char strings
14776336 comparisons
**3570** collisions with one or more items
5250 total collisions

2-char-long-string-hashcode-collisions-mono-1.9.1

We then both went on to testing hash code collisions in 3-char-long strings. David in java:


For what it’s worth, even fewer strings have unique hash codes for 3 characters. 3948 don’t collide, or about 1.6% of them.

Also, this of course doesn’t mean that probability of a hash collision is really high. In reality it’s acceptable low. It’s just a demonstration that it’s not hard to find colliding pairs.

and I in mono 1.9.1: 3-char-long-string-hashcode-collisions-mono-1.9.1. 3948 don't collide and the colliding pairs seem to be the same ones again.

Microsoft's implementation seems to be doing a much better job here. I will inform the mono devs about the situation soon ^_^
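The two-character numbers are easy to reproduce. Below is a Python sketch that uses the polynomial documented for Java's String.hashCode (h = 31*h + c over the characters, wrapped to a signed 32-bit integer) and counts how many alphanumeric strings of a given length share their hash with at least one other string. That mono 1.9.1 used the same function is only an inference from the identical counts above; the function names here are made up for the example.

```python
from collections import Counter
from itertools import product
import string

ALPHANUMERIC = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 characters


def java_string_hash(text):
    """Java's documented String.hashCode, as a signed 32-bit integer."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF      # wrap like 32-bit arithmetic
    return h - 0x100000000 if h >= 0x80000000 else h


def collision_stats(length):
    """Return (total, colliding, non_colliding) for all alphanumeric strings of this length."""
    hashes = [java_string_hash("".join(chars))
              for chars in product(ALPHANUMERIC, repeat=length)]
    per_hash = Counter(hashes)
    non_colliding = sum(1 for h in hashes if per_hash[h] == 1)
    return len(hashes), len(hashes) - non_colliding, non_colliding
```

collision_stats(2) gives (3844, 3570, 274), matching both David's Java figures and the mono 1.9.1 run. A classic colliding pair is "Aa" and "BB": 65*31 + 97 and 66*31 + 66 are both 2112.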

17:16 0 comments

Mono 1.9 binary package for Solaris 10/x86

Jonel Rienton has provided a Mono 1.9.0 binary package for Solaris 10/x86! I am thankful to him for his kindness^_^!
06:43 0 comments

okitsune 狐

I am working on a new project called okitsune 狐. It is going to be a simple theorem prover of sorts for logic students. Okitsune is being written in Haskell and is open for contributions: okitsune.sourceforge.net

code sample: analytic tableaux for propositional logic

03:45 3 comments

Haskell and F#: Language Design

After watching A Taste of Haskell by Simon Peyton Jones and having a look at A History of Haskell: Being Lazy with Class, I am left with some open questions regarding the design of F#:

  • Why is F# not lazy?: What is the advantage of eagerness?

  • Is there a point in encapsulating side-effectful operations in monads while programming in F#? IO Monads etc.?

    #light

    // Sample F# IO Monad : OSCON Haskell Video Part 2

    type IO<'a> = IO of 'a

    // get: unit -> IO<string>
    let get () = IO(read_line ())

    // put: string -> IO<unit>
    let put x = IO(print_string x)

    // bind.e: IO<'a> -> 'a
    // bind: IO<'a> -> ('a -> IO<'b>) -> IO<'b>
    let bind (x:IO<'a>) (y:('a -> IO<'b>)) = let e (IO(a)) = a in y (e x)
    let (>>=) = bind

    // gp: unit -> IO<unit>
    let gp () = get () >>= put

    gp ()

  • Why did the F# team choose ML and OCaml over Haskell?

I hope someone from the F# team enlightens me on all these. I am still scrutinizing the 55-page history of Haskell to create a list of questions I want to ask.
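One observation on the second question: because F# is strict, IO(read_line ()) in the sample above performs the read the moment get () is called, so the wrapper only tags a value that has already been produced. A wrapper that actually encapsulates the effect has to hold a function instead. Here is a minimal Python sketch of that deferred variant; the class and function names mirror the F# sample but are otherwise made up, and the read and write parameters exist only to make the example testable.

```python
class IO:
    """A deferred side effect: nothing happens until run() is called."""

    def __init__(self, thunk):
        self._thunk = thunk

    def run(self):
        return self._thunk()

    def bind(self, f):
        """Sequence this action with f, which maps its result to the next IO action."""
        return IO(lambda: f(self.run()).run())


def get(read=input):
    """An action that reads a line when run."""
    return IO(read)


def put(text, write=print):
    """An action that writes text when run."""
    return IO(lambda: write(text))


def gp(read=input, write=print):
    """Read a line, then write it back: get >>= put."""
    return get(read).bind(lambda line: put(line, write))
```

Building gp() has no visible effect; only gp().run() reads and prints. Whether that separation is worth the ceremony in an eager language is exactly the open question.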

00:45 0 comments

Segmenter Compiler: Benihime, 紅姫

Update: The code generation subsystem of Benihime is undergoing major refactoring with two goals: extensive use of generics in the public API and a maximum amount of code reuse.

With major concerns about the performance of the current regex implementation in Mono (1.2.6), I wish I had enough time to spare to submit a standards-compliant regex compiler replacement as a contribution. I hope Benihime will become one in the near future.

00:44 1 comments

Expert F#

On a completely unrelated note, I have been reading Expert F# and I must admit that I am beginning to like the language. If you are about to learn your first programming language or just feel like adding a new one to your arsenal, I strongly recommend having a look at F#: with its interesting mix of paradigms, terse syntax and strong type inference, it is sure to be a pure delight to work with.

18:18 2 comments

Mono 1.2.6 binaries for Solaris 10/x86

Duncan Mac Leod from Tucan Entertainment has been kind enough to share his Mono 1.2.6 binaries for Solaris 10/x86.
13:04 13 comments

Levenshtein Distance Algorithm: Fastest Implementation in C#

Here is a cleaned-up performance test for the several different implementations of Levenshtein I have blogged about recently. This test was emailed to me by Ahmed Ghoneim, who has also kindly agreed to its publication on my blog. I am very grateful to him for his excellent contribution. I have slightly altered his file to do away with the unnecessary local variables in my C2C# port of the GNULevenshtein method.

I would like to hear from you which methods perform best on your machine. Please drop a comment ^_^!

LevenshteinAlgorithmPerformanceTest.cs
code only

Packages
code, data and sample binary in zip and self-executable zip formats

Please note that the GNULevenshtein method was found to be buggy! Here is the new replacement method.

00:55 0 comments

Perplexity in Markov N-Gram Models

While implementing the perplexity function on Markov n-gram models as it is described on page 14 of Jurafsky & Martin's SLP (to appear), I came across some floating-point overflow and underflow issues and had to come up with equations to avoid them.

Here is my solution in detail and its bigram implementation.

It took me a lot of whiteboarding and a few hours to figure this one out but the resulting libcorsis code is 100% foreign intellectual property free ^_^.
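To show the kind of problem involved: multiplying thousands of small conditional probabilities underflows to zero long before the N-th root can be taken. The usual way around this is to work with logarithms, PP = exp(-(1/N) * sum(log p_i)). The Python sketch below shows that common formulation; it is not necessarily the derivation in the linked document, and the function names are chosen for this example only.

```python
import math


def naive_sequence_probability(probabilities):
    """Multiply the per-token probabilities directly (underflows for long sequences)."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


def perplexity(probabilities):
    """Perplexity of a sequence from its per-token conditional probabilities.

    Computed in log space: exp(-(1/N) * sum(log p_i)), which is algebraically
    the same as (p_1 * ... * p_N) ** (-1/N) but never forms the tiny product.
    """
    n = len(probabilities)
    if n == 0:
        raise ValueError("perplexity of an empty sequence is undefined")
    if any(p <= 0.0 for p in probabilities):
        raise ValueError("all probabilities must be positive (smooth the model first)")
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / n)
```

With 2000 tokens of probability 0.01 each, naive_sequence_probability returns 0.0, while perplexity returns 100 up to rounding error.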

I am now looking for ways to analyze distributions graphically and testing different encapsulations of probability values and their interaction with the public API.

07:12 0 comments

Markov N-Gram Models, C# 3.0, F#, Professorship


Here is a very short update on what I have been doing recently.

1) Thanks to the excellent quality of the introduction to computational linguistics course I am taking this term, I get to learn a great deal of new material, which boosts the development tempo of my library. Here is a particularly interesting case: C# Markov N-Gram Models. These are soon going to find some interesting application in my experimental HMM-based PoS-Tagger.

2) C# 3.0 and .NET 3.5 have been released. I'm truly astonished with the improved ease of programming this iteration delivers. Waiting eagerly to lay my hands on some quality books on both C# and .NET 3.5.

3) F#! After watching Dr. Brian Beckman's latest video on Channel 9 about monads I decided to learn F#. Parallelization libraries like Parallel FX might provide all that is needed to seamlessly parallelize execution but learning something natively functional seems to be a great asset for the very near future.

4) I had the chance yesterday to attend a series of talks held by our faculty to decide on the new name to join our department of computational linguistics as a professor.

08:30 0 comments

Levenshtein Distance Algorithm: Fastest Implementation in C#

While reading some interesting stuff about minimum edit distances in preparation for today's lecture (ECL/ICL), which is just about 45 minutes ahead in time as I'm writing this, I had the chance to test 5 different implementations of the Levenshtein minimum edit distance algorithm.

Here is a screenshot first:

I'll get into details later but let me announce the winner!

And the winner is ... gLDp!

gLDp is a funny display name for a Levenshtein implementation from a C project.

original implementation in C: levenshtein.c
my C# port: libcorsis code
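For readers who have not seen the algorithm itself, here is a compact two-row dynamic-programming sketch in Python. It is the textbook formulation and not a port of gLDp or of any of the five implementations tested above.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a                      # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]
```

levenshtein("kitten", "sitting") is 3. Only two rows of the table are alive at any time, so memory use is proportional to the shorter string.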




C vs CIL vs C#

Now I want to get mercilessly picky with my own port and today's C# compilers.

The ternary conditional expressions in my C# port (lines: #516, #524, #533) are there to circumvent the following restriction:

// valid C
int x = 0;
int y = 0;
int z = 0;
z += x == y;

x, y and z are initialized to 0 and in the final line z gets incremented by 1.

This is valid code in C but causes a compile-time error in C#: C# does not let you evaluate boolean expressions as integers (false 0, true 1). So to circumvent this type safety restriction in syntax, I use ternary conditional expressions.

// valid C#
int x = 0;
int y = 0;
int z = 0;
z += x == y ? 1 : 0;

Depending on whether C# compilers special-case this situation, these TCEs may or may not mean a very slight performance overhead.

On the other hand, CIL, the native typed assembly language of .NET, treats boolean values on the stack as integers. I'll add some CIL code to illustrate this later. For libcorsis, I am eventually going to switch to such an assembly version.

So what have I learned today?

  1. C# is a true C descendant.
  2. If you can't C# a piece of C, you can always CIL it.
23:48 0 comments

CORSIS

Tenka Text has been rebranded as CORSIS.

The following addresses are no longer accessible:

  • http://tenkatext.sourceforge.net
  • http://www.sourceforge.net/projects/tenkatext

You can reach the project on sourceforge.net by searching for "CORSIS".

The new domain name of the project is www.corsis.de.

First new binaries are expected to be available in 2008 Q1.

05:10 0 comments

New Segmenter Compiler: Benihime, 紅姫

A quick update on API refactorings! Here is a snapshot of what the code examined in the last post would look like with refactorings and improvements I have made so far:

Before: the code examined in the last post

After:

For segmenting streams into, say, words for example, one could also use something like SED on GNU/Linux, some regular expressions implementation of a programming language or whatever.

So why am I such an otaku? Why not just go with the naive and easy way?

Well, I am a performance and control freak and CIL is great fun and I feel 'pleasure' writing assembly code for a VM but most importantly, Benihime, 紅姫 makes the perfect training ground for learning language and compiler design.

Prior to getting into deep hack mode on Benihime, 紅姫, I had no idea about the differences between 'expressions', 'statements', 'branches' or 'stacks'. Implementing complex boolean expressions in conditional statements like if((c && pm) || pb) at assembly level was entirely outside my domain. But I can publish a tutorial on that today.
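To give a flavour of what "at assembly level" means for a condition like (c && pm) || pb, here is a toy Python sketch that lowers a boolean expression tree into short-circuiting branch instructions and then interprets them. The instruction names brtrue and br are borrowed from CIL, but everything else (the tuple encoding, the labels, the tiny interpreter) is invented for this example and is not how Benihime emits code.

```python
from itertools import count

_labels = count()


def new_label():
    return f"L{next(_labels)}"


def lower(expr, on_true, on_false, code):
    """Append branch code for expr that ends up at on_true or on_false."""
    kind = expr[0]
    if kind == "var":
        code.append(("brtrue", expr[1], on_true))   # jump if the variable is true
        code.append(("br", on_false))               # otherwise take the false exit
    elif kind == "and":
        rhs = new_label()
        lower(expr[1], rhs, on_false, code)         # left false: skip the right side
        code.append(("label", rhs))
        lower(expr[2], on_true, on_false, code)
    elif kind == "or":
        rhs = new_label()
        lower(expr[1], on_true, rhs, code)          # left true: skip the right side
        code.append(("label", rhs))
        lower(expr[2], on_true, on_false, code)
    else:
        raise ValueError(f"unknown node kind: {kind}")
    return code


def run(code, env):
    """Interpret the branch code; return "TRUE" or "FALSE", whichever exit is reached."""
    targets = {ins[1]: i for i, ins in enumerate(code) if ins[0] == "label"}
    pc = 0
    while True:
        ins = code[pc]
        if ins[0] == "label" or (ins[0] == "brtrue" and not env[ins[1]]):
            pc += 1                                  # fall through to the next instruction
            continue
        target = ins[2] if ins[0] == "brtrue" else ins[1]
        if target in ("TRUE", "FALSE"):
            return target
        pc = targets[target]


EXPR = ("or", ("and", ("var", "c"), ("var", "pm")), ("var", "pb"))
CODE = lower(EXPR, "TRUE", "FALSE", [])
```

The point of the exercise is that no boolean value is ever materialised: the "result" of the expression is simply which label execution arrives at, and pm is never inspected when c is false.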

03:59 0 comments

New Segmenter Compiler: Benihime, 紅姫

An interesting design concern has brought the development of my new segmenter compiler to a temporary standstill for tonight: parallelization.

I was trying to refactor and improve the design of my new segmenter compiler Benihime, 紅姫. (It is named after Urahara's sword from the Japanese anime Bleach. Benihime means crimson princess; what could be a more suitable name for a state-of-the-art segmenter? ^o^ theheee~~)

One of my major concerns with the new implementation was decoupling the flow control logic from the segmenter builder and the flow direction. This is essential for being able to reuse the same logic to compile two segmenters that run in opposite directions for example.

Let me illustrate the problem with the help of a C# file I happened to submit as a practice for our introduction to computational linguistics course just 5 days ago:


// /home/sert/Projects/hw1/hw1/Main.cs created with MonoDevelop
//
// project created on 10/21/2007 at 3:19 AM
using System;
using System.IO;

using Tenka;
using Tenka.Counters;
using Tenka.Text;
using Tenka.Text.Segmentation;
using Tenka.Text.Segmentation.Emit;

namespace ECL
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            string path = args[0];
            string text = File.ReadAllText(path);

            SegmenterBuilder builder = new SegmenterBuilder(Common.CustomizationModule);
            CharacterField unibase = builder.DefineCharacterField();
            FlowCharacter current = FlowCharacter.Current(builder);
            FlowCharacter next = FlowCharacter.Next(builder);

            FlowCodeConditionStatement basic = new FlowCodeConditionStatement(current.Matches(char.IsLetterOrDigit));
            basic.TrueBlock.Statements.Add(unibase.Store(current));
            basic.TrueBlock.Statements.Add(FlowControlAction.Continue);

            FlowCodeConditionStatement extended = new FlowCodeConditionStatement(CharacterEqualityExpression.Sequence(current, "'-"));
            extended.TrueBlock.Statements.Add(unibase.Store(null));

            AndSequence seq = new AndSequence();
            seq.Expressions.Add(next.CheckLimits());
            seq.Expressions.Add(unibase.Matches(char.IsLetterOrDigit));
            seq.Expressions.Add(next.Matches(char.IsLetterOrDigit));

            FlowCodeConditionStatement condition = new FlowCodeConditionStatement(seq);
            condition.TrueBlock.Statements.Add(FlowControlAction.Continue);

            extended.TrueBlock.Statements.Add(condition);
            extended.TrueBlock.Statements.Add(FlowControlAction.Break);

            AndSequence cseq = new AndSequence();
            cseq.Expressions.Add(unibase.IsNotNull);
            cseq.Expressions.Add(current.IsCombiningMark);

            FlowCodeConditionStatement combination = new FlowCodeConditionStatement(cseq);
            combination.TrueBlock.Statements.Add(FlowControlAction.Continue);

            FlowControlLogic logic = new FlowControlLogic();
            logic.Statements.Add(basic);
            logic.Statements.Add(extended);
            logic.Statements.Add(combination);
            logic.Statements.Add(unibase.Store(null));
            logic.Statements.Add(FlowControlAction.Break);

            Type type = builder.Compile(FlowDirection.LeftToRight, logic).CreateType();
            Common.CustomizationModule.Save();

            Segmenter segmenter = Segmenter.CreateInstance(type);

            FrequencyList<string> list = new FrequencyList<string>();
            TypeTokenRatioStandardization ttrs = list.Add(text, segmenter, 1000);

            string[] order = list.GetFrequencyOrder();

            Console.WriteLine(" token count: {0}", list.TokenCount);
            Console.WriteLine(" type count: {0}", list.TypeCount);
            Console.WriteLine(" raw type/token ratio: {0}", list.TypeTokenRatio);
            Console.WriteLine(" standardized type/token ratio (base 1000): {0}", ttrs.TypeTokenRatio);
            Console.WriteLine();

            foreach (string key in order)
            {
                Console.Write('\t');
                Console.Write(list[key]);
                Console.Write(' ');
                Console.Write(key);
                Console.WriteLine();
            }
        }
    }
}


Tada~~~ Here is the problem section. unibase is a character field reference expression and needs to be defined and thus depends on a segmenter builder instance. next and other non-current flow character expressions depend on a flow direction, which is unknown until Compile() is called. All three of them are (compilation) context-dependent expressions.

...
CharacterField unibase = builder.DefineCharacterField();
FlowCharacter current = FlowCharacter.Current(builder);
FlowCharacter next = FlowCharacter.Next(builder);
...
Type type = builder.Compile(FlowDirection.LeftToRight, logic).CreateType();

In the quoted version of the sample program, builder.DefineCharacterField() leads to the definition of a field (a build process) outside the context of the builder.Compile() call and unibase stores an internal reference to builder. This is a breach of compilation context.

Additionally, current and next are also initialized with an internal reference [FC->B] to builder. When Compile() is called and the code generation chain triggers Emit() on current or next, they use that internal reference [FC->B] to ask builder on which flow direction value the compilation is being made. I know the explanation sucks big balls but so does this design. It is the worst form of code chaos: Expressions that later end up in statements of the flow control logic are coupled during their initialization with a single segmenter builder instance and its flow direction!! So how in the hell can you use the same logic now with another segmenter builder or flow direction? o__O


After subconscious contemplation for a few days, the following idea suddenly occurred to me early in the morning today: to solve this ugly coupling problem, logic should deal with instantiating a type-less copy of unibase, and direction-less copies of current and next and store references to them in itself. Compile() should then either

(1) create a deep copy of logic in which types and directions of unibase, current and next are bound and use this fully initialized copy for code generation

or

(2) in an atomic pass, (place thread lock), bind unibase, current and next temporarily, generate code and unbind unibase, current and next (release thread lock).

Both approaches have their ups and downs.

(1) may parallelize better as there is nothing to lock but creating deep copies of complex segmentation flow control logics in highly demanding execution scenarios may also cause serious performance issues.

(2) seems to be the way to go as it does not create temporary copies of flow control logics however simple or complex these may be. This means fewer CPU cycles and less memory consumption but managing locks properly may become quite a challenge and I might eventually have to learn or wait for a better atomization construct.
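The two options can be put side by side in a few lines. The Python sketch below is only an illustration of the trade-off and not the Benihime API: FlowCharacter here is a stand-in with a single direction slot, the "generated code" is just a string, and a module-level lock plays the role of the thread lock from option (2).

```python
import copy
import threading


class FlowCharacter:
    """Stand-in for a context-dependent expression: it needs a direction to emit code."""

    def __init__(self, offset):
        self.offset = offset      # 0 = current character, 1 = next character
        self.direction = None     # unbound until a compilation binds it

    def emit(self):
        if self.direction is None:
            raise RuntimeError("expression is not bound to a compilation context")
        step = self.offset if self.direction == "ltr" else -self.offset
        return f"text[i{step:+d}]"


_bind_lock = threading.Lock()


def compile_with_copy(expressions, direction):
    """Option (1): bind a private deep copy; the shared logic is never touched."""
    bound = copy.deepcopy(expressions)
    for expression in bound:
        expression.direction = direction
    return [expression.emit() for expression in bound]


def compile_with_lock(expressions, direction):
    """Option (2): bind in place under a lock, emit, then unbind again."""
    with _bind_lock:
        try:
            for expression in expressions:
                expression.direction = direction
            return [expression.emit() for expression in expressions]
        finally:
            for expression in expressions:
                expression.direction = None
```

Both functions leave the shared expressions unbound afterwards, so the same logic can be compiled for either direction. Option (1) pays for the copy on every call; option (2) serialises all compilations that share the lock.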


Well, that's it - phew~~ what a long post this has become o__O!! Enough talk! I am going on to refactoring the API now.

03:19 0 comments

New Project Openings!

Corsis is seeking code reviewers, documentation writers and framework designers.

If you want to work with us on non-commercial, voluntary terms, apply for a position today!

Project Openings

Corsis started as an open-source answer to commercial corpus analysis software and now constitutes a test bed for research in various subjects such as compiler design, performance optimization and parallel computation.

23:24 0 comments

Character Frequency Calculations

sample plain text in Devanāgarī script
character frequency analysis results
source code
sample program binaries
*You can use the sample program not only with devanagari but with other scripts as well. See example.

A student of final year B.Tech at National Institute of Technology, Hamirpur, India wrote to me today: (emphasis added)

I came across Tenka-text while searching through the net for existing concordance softwares.
I wish to develop a similar kind of free software for the Dev Nagri script which is the script for Hindi, our national Language.

Initially I plan to develop just the character frequency calculation functionality which is not provided by any currently available product.

Could you please guide me a little on how to go about this. I shall value your suggestions very much.


To which I first replied:

...

I'm at work right now and thus do not have access to my development tools and personal library and because of this cannot reply with an example C# program immediately BUT

(Presuming you are familiar with programming languages)

Step 1 - A HashTable to count devanagari characters encountered

You could use a hashtable/dictionary to hold the frequency values of each character and increment these upon each encounter.

Here is a generic counter, which can store frequency values for any type of object:

http://corsis.svn.sourceforge.net/viewvc/corsis/trunk/Tenka/Counters/Counter.cs?revision=321&view=markup

Step 2 - Study of Devanagari Characters

Devanagari ( http://unicode.org/charts/PDF/U0900.pdf ) seems to be making use of combining marks ( http://www.unicode.org/reports/tr15/images/UAX15-NormFig5.jpg, http://www.unicode.org/reports/tr15/ ).

See sequence U+0940 - U+094D ( http://unicode.org/charts/PDF/U0900.pdf ).

So you could use System.Globalization.StringInfo class ( http://msdn2.microsoft.com/en-us/library/system.globalization.stringinfo.aspx ) to feed input into a Counter instance.

Step 3 - Implementation Example

I'll return to you with source code as soon as possible, probably before 2007-09-27T21:06:56Z (utc).

Best Regards,
Cetin Sert
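Steps 1 and 2 of this reply can be sketched in a few lines of Python: group each base character with the combining marks that follow it, then count the groups in a dictionary. This is a simplified approximation, assuming that "starts a new character" can be decided from the Unicode general category alone (categories Mn, Mc and Me attach to the preceding character). It is neither the .NET StringInfo algorithm nor full grapheme cluster segmentation, and the function names are invented for this example.

```python
import unicodedata
from collections import Counter


def text_elements(text):
    """Yield each base character together with the combining marks that follow it."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.category(ch).startswith("M"):
            cluster += ch             # a combining mark extends the current element
        else:
            if cluster:
                yield cluster
            cluster = ch              # anything else starts a new element
    if cluster:
        yield cluster


def character_frequencies(text):
    """Count how often each text element occurs, ignoring whitespace."""
    return Counter(element for element in text_elements(text) if not element.isspace())
```

For the input "\u0915\u093F \u0915\u093F\u0915" (कि कि क), the counter holds 2 for the two-codepoint element कि and 1 for the bare क, which is the kind of length and frequency information the generated XML reports.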


And then offered the following solution:

Here is a simple solution that aims to handle plain text files only:

http://corsis.sourceforge.net/divya/devanagari.zip


1. Extract files.
2. If your plain text file is in Unicode UTF-8 and in hi-IN culture: simply drag and drop the file onto the Tenka.Text.Console.exe and it will create a file named [INPUT-FILE-NAME].charanalysis.xml. You can test the program with the devanagari.txt which is a sample file in UTF-8.
If your plain text file is in some other encoding: execute Tenka.Text.Console.exe by calling

“Tenka.Text.Console.exe -e:[ENCODING-NAME] [INPUT-FILE-NAME]”

If your plain text file is in some other culture: execute Tenka.Text.Console.exe by calling

“Tenka.Text.Console.exe -c:[CULTURE-NAME] [INPUT-FILE-NAME]”

Example for a German plain text file in ISO-8859-1:

“Tenka.Text.Console.exe -e:iso-8859-1 -c:de-DE german.txt”
3. View [INPUT-FILE-NAME].charanalysis.xml with Firefox, IE 7 or something similar.

The numeric value of the length attribute of a character element indicates how many Unicode codepoints are used to encode the character or, in Unicode terms, the grapheme cluster (see Unicode Standard 5.0, Annex #29, Grapheme Cluster Boundaries (on page 1376 in ISBN 0-321-48091-0)).

The numeric value of the frequency attribute of a character (grapheme cluster) element indicates how many times that character was encountered.


I am looking forward to reading your reply.

Best Regards,
Cetin Sert
INF 521, 4-6-2
69120 Heidelberg
Germany

2007-09-27T18:21:43Z


And corrected myself regarding the instructions as I noticed the person might be using Linux and not Windows:

Checking the logs of web sites, I noticed that you might be using Linux. If that’s indeed the case, I need to update the instructions in section 2:

2. You need to install Mono 1.2.5 from http://www.mono-project.com/Downloads ( recommended package: ftp://www.go-mono.com/archive/1.2.5.1/linux-installer/0/mono-1.2.5.1_0-installer.bin ) and then enter for example:

“mono Tenka.Text.Console.exe -e:utf-8 -c:hi-IN [INPUT-FILE-NAME]”

If you have ever called java apps from the terminal window, it’s pretty much the same way of doing this.

Again, a much easier way is to use Windows just for the purpose of testing this application. Quoting runtime requirement notice from Windows downloads page of the main project:

For Windows 98, ME, XP, 2000, 2003 Users:

This program requires Microsoft .NET Framework 2.0 or above to run. You can download the latest official release at the following addresses. This is a one-time installation and if you have already installed the framework, you can proceed with the download of Tenka Text.

> Microsoft .NET Framework 3.0 - bootstrapper:

A small executable that determines your cpu/os architecture automatically and downloads the full redistributable package appropriate for installation.

http://www.microsoft.com/downloads/details.aspx?FamilyID=10cc340b-f857-4a14-83f5-25634c3bf043&DisplayLang=en

> Microsoft .NET Framework 3.0 - full redistributable packages for specific cpu/os architectures:

32-bit/x86) http://go.microsoft.com/fwlink/?LinkId=70848
64-bit/x64) http://go.microsoft.com/fwlink/?LinkId=70849


For Windows Vista Users:

Windows Vista ships with the .NET Framework 3.0 preinstalled. Please proceed with the download of Tenka Text.

If you are on Windows, please follow the original section 2 from the first mail. I kindly ask you to excuse my complicated way of explaining things.

Thinking that you might just want to see the results, I’m uploading the results I got from the sample I sent a link to in the first mail.

I am looking forward to reading your reply.

Best Regards,
Cetin Sert
INF 521, 4-6-2
69120 Heidelberg
Germany


Why have I quoted all of our correspondence? Well I begin to feel that free software development is not just about open source. It is about the whole experience of putting something together and there are more faces to it than just the mere coding.

I truly enjoyed the opportunity that was given to me and this feature is definitely going to find its place in the upcoming releases. That is, if I do get a positive reply regarding the acceptability of the results produced.

09:00 0 comments

New Logo/Wordmark Concept and Links

I just wanted to post a new logo/wordmark concept I've come up with recently. It looks too damn serious ^__^ right? But I like it that way (*^o^*)

By the way I happened to find three interesting new links to Tenka Text.

The first one is on the links page of the Collection of Electronic Resources in Translation Technologies from the University of Ottawa, where they call my software "Un nouveau concurrent de WordSmith que vous pouvez télécharger gratuitement chez vous !" ("A new competitor of WordSmith which you can download for free!").

Second one on the tools page of the Institut für Dokumentologie und Editorik at the University of Cologne.

Third one on the tools page of the Dutch Language Union.

It's very encouraging to see something you do as a hobby get such recognition. ^__^

07:07 0 comments

Japanese Dramas - Fall 2007

Two dramas I'm looking forward to watching in the new season:


モップガール - Mop Girl (wiki)



ジョシデカ! - Joshi Deka! (wiki)

05:14 10 comments

Mono 1.2.5 binaries for Solaris 10/x86

I think I've finally managed to botch it all up. I created a JRE-style zipped package, which you can download here (see notes 1 and 2) at your own risk ;). Whoa, it's taken me 4 days just to get so far with mono/solaris. ^o^ I wanna sleep like the relakkuma now hehe...
  1. These binaries are compiled for 32-bit/x86 processors.
    They also run on 64-bit/x86-64 processors.
  2. They are not compiled for running winforms applications.
To extract the archive use: "gtar -zxvf"
02:46 0 comments

Compiling Mono 1.2.5 on Solaris 10/x86-64

I've been trying to compile Mono, 1.2.4 and 1.2.5, on Solaris 10/x86 and x86-64 for the last couple of days and I am going to write about the horrible amount of effort and time this has cost me so far. If I ever get to have a mono directory with a functional build of mono, the first thing I'm going to do is make the binaries publicly available.
11:45 0 comments

Sertcom, Unicode Standard 5.0 and Japanese Dramas

I've been slacking off on many things recently, including my hobby project. I'm posting from Flensburg, the northernmost city of Germany. Coming from Heidelberg/Mannheim, aka "Delta Region", you get to experience a mild intranational culture shock up in the north. It feels like a big town that cherishes horizontal freedoms rather than the vertically stacked-up big cities I'm used to seeing in Southern Germany.


Now to the reason for my stay in Flensburg. My brother acquired an established telemarketing company in Neumünster, Germany and is renaming it "Sertcom". So I ended up paying him a visit to see how he managed to get so far. The last time I was here he was a team leader. What a rapid development.


Unicode Standard 5.0

The only thing I'm doing nowadays that bears some relevance to my studies and job is reading version 5.0 of the Unicode standard. I'm now almost ready to fix some remaining issues with my segmenter implementation, whose distinguishing qualities are that it is fast and that it is compiled at runtime according to the logic users provide.

Click here to order the hard copy version of this book.


Japanese Dramas

Here are some suggestions for dorama fans from summer season 2007:


ホタルノヒカリ - Hotaru No Hikari (wiki)



パパとムスメの7日間 - Papa to Musume ... (wiki)

Go d-addict yourselves to japanese drama!

22:00 0 comments

GPLv3 is out!


The Free Software Foundation has released the third version of the GNU General Public License. After a thorough review of the license text, I'm planning to switch to GPLv3. Viva Free Software!

02:38 1 comments

WordSmith Tools 5.0, Tenka Text in China

Mike Scott has finally released a public beta of WordSmith Tools 5.0. I want to quote him on a few points:

System Requirements

WordSmith Tools 5 is for Windows 2000 or later. It will be happiest on a fairly modern PC (e.g bought in the last 4 years) with plenty of memory and hard disk space. Or an Intel Mac running Windows.

Now that's an interesting world view we have here. What's an Intel Mac anyway? It's a PC produced by Apple. So this sentence actually reads: System Requirements: a PC running Windows or a PC (produced by Apple) running Windows. Technically deep Mr. Scott seems to be. -__-"

Here is my favourite one:

What's new in version 5

WordSmith is organic software!
Version 4 was a complete new re-write. Since it was launched in 2004, numerous re-compilations of WS4 were issued (about one a week on average), sometimes with a very small bug-fix but other times incorporating changes users had suggested. At the same time the Help was updated. Version 5.0 was started in June 2007, three years after version 4.0 and will continue this organic policy of growth...

Organic software? Come on Mike. You could have come up with a better excuse for suddenly raising the version number by +1.0 point without actually introducing any considerable improvements. Just tell people you felt like jumping or sth like that. Can you imagine Linux devs calling kernel version 2.6.22 3.0 out of the blue just because they wanted to have a new, shiny, attractive version number? What you do is what I would call a live branching in SCM terms.


Well, that's enough advertisement for my competitor.

Let's focus on more serious development news. I stopped working on the user interface for this week. My current development focus is on class library design:
  1. I've completely rewritten FrequencyList class.
    Old code vs New code
  2. I'm now rewriting the Concordance class.
    Old code vs New code
A final word about my project here in this post: Tenka Text has seen its first academic acknowledgement ^__^ at the following blog http://xpq.blogspot.com/ (see T under academic links).

Researchers from the USA and China have always been kind enough to praise Tenka Text for its speed (relative to any other tool (including WST 4 (or 5 or any imaginable later version)) they state to have used) and now I've also seen my project listed among other academic software. That will surely give me the necessary push to keep developing intensively and I plan to release a new version somewhere between 2007-06-30 and 2007-07-07. More on this and other news later!
02:42 0 comments

Multi-threading, Part of Speech, Matoed 2005

I am busy with my studies and cannot spend much time on development for the time being. What little I do is test some ideas and the design of the class library.

One of the tests I conducted was to see how much of a performance gain my single-file frequency listing routines might achieve on multi-core processors with multi-threading. So I wrote some methods that help split a 140 MB file into segments which can be processed on individual threads and added together once all threads return. You can see the code here. Below is the result: creating a frequency list for a file of about 140 MB using four threads on a quad-core machine can be 83% faster than single-threading the same operation:



You should expect to see this discovery taken into account in the next release.
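
For readers who want to picture the split-and-merge idea, here is a minimal Python sketch. It is not the C# code linked above; the names split_at_whitespace, count_segment and frequency_list are made up for illustration, and a word is assumed to be anything between whitespace. Note that threads in CPython will not reproduce the speed-up measured above, so the sketch only shows the logic: cut the text at word boundaries, count each segment on its own worker, then add the partial counts together.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_at_whitespace(text, parts):
    """Cut text into roughly equal segments without splitting a word."""
    segments, start = [], 0
    step = max(1, len(text) // parts)
    while start < len(text):
        end = min(len(text), start + step)
        # Move the cut forward to the next whitespace so no token is halved.
        while end < len(text) and not text[end].isspace():
            end += 1
        segments.append(text[start:end])
        start = end
    return segments


def count_segment(segment):
    """Frequency list for one segment (hypothetical tokenizer: str.split)."""
    return Counter(segment.split())


def frequency_list(text, workers=4):
    """Count every segment on its own worker, then merge the results."""
    segments = split_at_whitespace(text, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_segment, segments))
    total = Counter()
    for partial in partials:
        total.update(partial)  # add the per-worker counts together
    return total
```

The merged result should be identical to counting the whole text in one pass, which is an easy property to check before worrying about speed.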

Another major change I'm planning for the next release is making the Segmenter, Clusterer and FrequencyList classes generic so that 'print' as verb can be analyzed as a separate entity from 'print' as noun etc. Automatic part of speech recognition is not something I'm planning to implement in the near future but I want to make sure that the architecture is ready for anything beyond simple strings.
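
What a generic frequency list buys can be sketched in Python as follows. This is an illustration only, not the Tenka Text class library: the FrequencyList class and its methods are hypothetical, and the part-of-speech tags are supplied by hand since no tagger is involved. The point is that once the element type is a parameter, a (word, tag) pair counts as its own entity.

```python
from collections import Counter
from typing import Generic, Hashable, Iterable, TypeVar

T = TypeVar("T", bound=Hashable)


class FrequencyList(Generic[T]):
    """Hypothetical frequency list over any hashable kind of token."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def add(self, items: Iterable[T]) -> None:
        self._counts.update(items)

    def count(self, item: T) -> int:
        return self._counts[item]

    def types(self) -> int:
        """Number of distinct entries."""
        return len(self._counts)


# Plain strings: both uses of 'print' collapse into one entry.
plain = FrequencyList[str]()
plain.add(["print", "print", "page"])

# (word, tag) pairs: 'print' as verb and as noun are separate entries.
tagged = FrequencyList[tuple]()
tagged.add([("print", "VERB"), ("print", "NOUN"), ("page", "NOUN")])
```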

Oh, by the way I found a photo of mine on the net taken in 2005 at the election of the Mannheim Turkish Students' Association.

00:51 0 comments

Mersenne Prime Discovery

I've come across another limitation of WordSmith Tools 4. It can't process large files. And what do you think is 'large' for WST4? Well... it can be a file as small as 9,35 MB depending on the type of data stored.

While reading Wikipedia, I stumbled upon the GIMPS Project at http://www.mersenne.org/ and found their 44th Mersenne Prime discovery: http://www.mersenne.org/prime10.txt - a 9,808,358-digit integer. I wanted to see how the Wordlister of Tenka Text 0.1.3 performed with such a large single token and in fact I expected it to throw some sort of exception...

to be continued
04:01 1 comments

Binary Release: Tenka Text 0.1.3


Tenka Text 0.1.3* on Windows Vista Ultimate (.NET 2.0>)


New Update Dialog from Tenka Text 0.1.3* on Windows Vista Ultimate


Tenka Text 0.1.3* on Debian 4.0 (Mono 1.2.4 preview 3)

SVN267
0.1.3 Release Notes


This version of Tenka Text introduces dynamism into Wordlister. It’s the first release that juggles with interface design ideas which have come to prove themselves in professional integrated development environments. The main feature of this release is the explorer control in Wordlister, which is a tree view that performs organizational tasks much like the solution explorer control in Visual Studio. Wordlister explorer has been implemented in such a way that enables you to add, remove or switch between frequency lists dynamically. Because graphical user interfaces of integrated development environments are designed for highly complicated usage scenarios and heavy workloads developers tend to have, they stand as great examples for computational linguists that aim to deliver equally professional solutions to corpus research questions. Tenka Text is going to continue to introduce more and more concepts to its graphical user interface in the future – to become the first and best open-source “Rapid Corpus Analysis or RCA” tool available.


Change log:
http://tenkatext.sourceforge.net/docs/0.1.3.htm
http://tenkatext.sourceforge.net/docs/0.1.3.pdf


*: It reads 0.1.2 in the screenshots, it should be 0.1.3.

20:56 0 comments

Tenka Text 0.1.2 Screenshot



Here is a screenshot of a new feature which is going to be available in the wordlister tool of the next release: study explorer - a tree view of the files you are analyzing which helps you switch dynamically back and forth between result views. Close to the alpha or beta, probably all the functionality will have been combined in just one type of window. Things may end up looking the way Visual Studio has looked since 2003. I hope a fully-portable docking suite becomes available by then to make things easy.
19:52 0 comments

Binary Release: Tenka Text 0.1.1


Tenka Text 0.1.1 on Windows Vista Ultimate

This is the first versioned release of Tenka Text.

CLASS LIBRARY

The class library has undergone great changes, one of which is the brand new set of segmenter customization classes, which can be seen at:

http://corsis.svn.sourceforge.net/viewvc/corsis/trunk/Tenka.Text/Tenka.Text/Segmentation/Builder.cs?revision=197&view=markup&pathrev=197

and read about at:

http://tenkatext.blogspot.com

GRAPHICAL USER INTERFACE

Main Window

1) Improved select files dialog.
2) New options dialog.

Wordlister

1) Segmenter Customization: Segmenters used to create frequency lists are now dynamically compiled. (Compilation uses reflection emit and MSIL/CIL/IL assembly to ensure maximum performance.)

2) Export Feature: You can now export/save data from the wordlister in either plain text or xml format. (This feature is also supported on Mono 1.2.3.50.)

3) Wordlister is now at alpha stage: With the addition of the new segmenter compiler and a simple export feature, Tenka Text Wordlister tool now offers the same core functionality as WordSmith Tools 4's Wordlist tool and runs 3x faster.

corresponding svn revision: 201

Tenka Text 0.1.1 on Debian 4.0
22:27 0 comments

Reflection Emit

I've been trying to integrate reflection emit into Tenka Text recently and that's how far I have come in code: Builder.cs.

You can use reflection emit to compile and build types at runtime. A pretty amazing way of using this technology is having an abstract class whose implementation you provide at runtime.
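
A rough analogue in Python, for readers who do not know reflection emit: the sketch below declares an abstract class and only builds its concrete subclass at runtime from a user setting. It uses Python's built-in type() rather than System.Reflection.Emit, so it shows the idea and not the .NET API, and Segmenter, build_segmenter and RuntimeSegmenter are hypothetical names that do not come from Builder.cs.

```python
from abc import ABC, abstractmethod


class Segmenter(ABC):
    """Abstract segmenter: the character test is supplied later."""

    @abstractmethod
    def is_word_char(self, ch):
        ...

    def segment(self, text):
        """Yield maximal runs of characters accepted by is_word_char."""
        word = []
        for ch in text:
            if self.is_word_char(ch):
                word.append(ch)
            elif word:
                yield "".join(word)
                word = []
        if word:
            yield "".join(word)


def build_segmenter(extra_chars):
    """Create and instantiate a concrete Segmenter subclass at runtime."""
    allowed = frozenset(extra_chars)

    def is_word_char(self, ch):
        return ch.isalnum() or ch in allowed

    # type(name, bases, namespace) creates the class object on the fly.
    cls = type("RuntimeSegmenter", (Segmenter,), {"is_word_char": is_word_char})
    return cls()
```

Calling build_segmenter("'") gives a segmenter that keeps father's in one piece, while build_segmenter("") cuts it at the apostrophe.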



Tenka Text is going to use reflection emit to provide custom segmentation. Using reflection emit instead of going for a simpler approach has numerous distinguishing advantages.

I quote my friend Mike Scott from the documentation of his WordSmith Tools 4 here:

[...] you may wish to allow certain additional characters within a word. For example, in English, the apostrophe in father's is best included as a valid character as it will allow processing to deal with the whole word instead of cutting it off short. (If you change language to French you might not want apostrophes to be counted as acceptable mid-word characters.)

Examples:

' (only apostrophes allowed in the middle of a word)

'% (both apostrophes and percent symbols allowed in the middle of a word)

'_ (both apostrophes and underscore characters allowed in the middle of a word)

You can include up to 10.

If you want to allow fathers' too, check the allow to end of word box. If this is checked, any of these symbols will be allowed at either end of a word as long as the character isn't all by itself (as in " ' ").

The italic part of this quote makes the inherent limitation of his programmatically simple implementation in WordSmith Tools 4 clear. It is an array-based approach and thus as the size of the array of custom characters increases so does the time spent in loops: every character examined has to be compared against the array, so the cost grows linearly with the number of custom characters.

To overcome this limitation Tenka Text will emit IL assembly code compiled at runtime according to the settings the user has specified. Some of the tests I performed last week with two character sets of different sizes confirmed that the runtime compilation approach is the way to go. I will publish more on the performance gap between the two approaches later.
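
The contrast between the two approaches can be sketched in Python. This is an illustration under assumptions, not Tenka Text's implementation: Python source is generated and compiled with the built-in compile and exec instead of emitting IL, the function names are hypothetical, and no performance claim is made for the Python version.

```python
def array_based_checker(chars):
    """Approach 1: keep the user's characters in an array and loop over it."""
    table = list(chars)

    def is_mid_word_char(ch):
        for candidate in table:  # one comparison per configured character
            if ch == candidate:
                return True
        return False

    return is_mid_word_char


def compiled_checker(chars):
    """Approach 2: generate code specialised for this exact character set."""
    if chars:
        condition = " or ".join(f"ch == {c!r}" for c in chars)
    else:
        condition = "False"
    source = f"def is_mid_word_char(ch):\n    return {condition}\n"
    namespace = {}
    # Compile the generated source at runtime; the settings are now baked
    # into the function body instead of being looked up in an array.
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["is_mid_word_char"]
```

Both functions answer the same question for any character; the second one simply carries no array and no loop once it has been built.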
22:42 0 comments

Return to Development

Return to Development on Windows Vista Ultimate


Well, all systems are up and running again. I returned to developing Tenka Text on a more serious system. Just check the screenshot to get an idea of how great a bliss my new environment turns the whole business into.

I am using
  • Windows Vista Ultimate
  • Visual Studio 2005
  • Tortoise SVN
  • Ankh SVN
  • SourceGrid

at this stage of development.

05:47 0 comments

Windows Vista Ultimate

And now I am writing from the brand new Windows Vista Ultimate.

I have to say that I truly like this OS and because I bought the OEM Version, it only cost me 189€.

I will be playing around with it for the next couple of days before going back to the development of Tenka Text. Oh by the way... one thing I am truly delighted about is that it does not show any User Account Control warnings if you try to run Tenka Text*. Managed applications are privileged on Vista.
03:31 0 comments

Windows Vista Ultimate

I'm in the process of switching the development from XP/2003 to Windows Vista Ultimate.

To be continued...
03:59 0 comments

Binary Release: Tenka Text pre-alpha 2007-02-27

I spent the night implementing the option I mentioned in the previous post and decided to release a new binary after 22 days. (Corresponding SVN revision: 96)

Tenka Text WordLister is now more precise than WordSmith Tools WordList. It has an option now through which a user can choose to display in the statistics view for frequency lists either
  • the count of tokens that are a certain number of characters long or
  • the count of types that are a certain number of characters long. *NEW*
WordSmith Tools 4 can only display the former but not the latter.
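
The difference between the two options is easy to show with a small Python sketch (an illustration only, with a hypothetical function name and a plain whitespace tokenizer): tokens are all running words, types are the distinct words, and each option groups one of them by length.

```python
from collections import Counter


def length_statistics(tokens):
    """Return (token counts by length, type counts by length)."""
    token_lengths = Counter(len(t) for t in tokens)
    type_lengths = Counter(len(t) for t in set(tokens))
    return token_lengths, type_lengths


tokens = "a rose is a rose is a rose".split()
by_token, by_type = length_statistics(tokens)
# 'rose' occurs three times, so by_token[4] is 3, but it is a single
# type, so by_type[4] is 1.
```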

Check all three screenshots!

Tenka Text WordLister with Length Statistics for Tokens


Tenka Text WordLister with Length Statistics for Types



WordSmith Tools 4 with Length Statistics for Words(Tokens)
21:02 0 comments

Frequency Lists - Statistics View



^_^ Here is the new statistics view for frequency lists from the 2007-02-05 pre-alpha release.

I'm still actively working on frequency lists at GUI and lower levels and some of the following features might be ready for the next release:
  1. An option to have the program display in the statistics view for frequency lists either
    • the count of tokens that are a certain number of characters long or
    • the count of types that are a certain number of characters long. *

    Tenka Text 2007-02-05 and WordSmith Tools 4 can only display the former but not the latter. Only a little GUI work is needed to deliver this option.

  2. Live updates.

    Once in place, users will be able to add or remove files from their frequency lists without the need to recalculate everything from the start.
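
One way such live updates could work is sketched below in Python. It rests on an assumption of mine, namely that the per-file counts are kept alongside the combined list so that a file can be subtracted again; the class LiveFrequencyList is hypothetical and says nothing about how Tenka Text will actually do it.

```python
from collections import Counter


class LiveFrequencyList:
    """Combined frequency list that can gain or lose files incrementally."""

    def __init__(self):
        self._per_file = {}      # file name -> Counter for that file
        self.total = Counter()   # combined counts over all files

    def add_file(self, name, text):
        if name in self._per_file:
            self.remove_file(name)  # re-adding a file replaces its old counts
        counts = Counter(text.split())
        self._per_file[name] = counts
        self.total.update(counts)

    def remove_file(self, name):
        counts = self._per_file.pop(name)
        self.total.subtract(counts)
        self.total += Counter()  # in-place add drops entries that fell to zero
```

Only the counts of the file being added or removed are touched; the other files are never read again.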
Well, that's all the news about Tenka Text 2007-02-05. Let's move on to more personal stuff.

Logographic Failure

I found out that the "Tenka" Logo 「天花」, which was my personal attempt at creating a digital seal with the kanji of a literary word for "snow", has a huge flaw! A senpai from university was kind enough to point out that the kanji combination I chose (「天花」) for representing the sound of "tenka" actually means "smallpox" - a disease. She is now cooperating with me, trying to come up with a new logo.

Institut für Deutsche Sprache

I have been working at IDS since 2006-10-16 as a student assistant in corpus compilation and analysis and have accumulated some experience which I am going to consider putting into Tenka Text at some future date. By the way, I owe my current position to the Mono project team. It is thanks to their commitment to making .NET multi-platform that I am now able to write tools in C# that run on Linux, which is for some reason the preferred platform of the department of lexical analysis at IDS.

Structured Plain Text Extraction from PDFs and Other Office Files

To accomplish a task I was assigned for example, I had to put together several builds of a tool called 'pdftohtml' and a few lines of C# (2.0 with some Linq to Xml) and there I had a multi-platform, open-source and managed pdf import method. This method will definitely find its way into Tenka Text once I have developed a common API for extracting structured plain text from file formats such as PDFs or office files.
20:11 2 comments

Binary Release: Tenka Text pre-alpha 2007-02-05

A new pre-alpha binary release is now available for download! It features the performance of the new frequency list classes and a new statistics view in its WordLister tool. Go download the demo version of the commercially available WordSmith Tools 4 and compare to realize the power of open-source C#!
07:43 0 comments

Performance Optimizations for Frequency Lists


TT5 -> TT8 performance difference

I conducted a performance test on one of the remotely accessible computers of the University of Heidelberg. (2 physical/4 logical CPUs and 2 GB RAM)

The test was performed by creating a frequency list based on the Helsinki Corpus (9.793 KB, single text file) and then sorting it.

As you can see below, my optimization efforts seem to have paid off well.

WordSmith Tools 4.0.0.374 took about 13 seconds to create and sort a word list into:
  • Alphabetical order
  • Frequency order
    • Alphabetical order between types with the same frequency value

TT5 (svn revision 17, binary release: 2006-11-25) required 2,51 seconds to create and 10,07 seconds to sort the list into the same orders.

TT8 (svn revision 66) needed 1,10~ seconds to create and 2,40~ seconds to sort the list into the same orders.

Performance Comparison Table



          WS 4.0.0.374   TT5 SVN17   TT8 SVN62   TT8.1 SVN64   TT8.2 SVN66
Create    11?            2,51~       1,33~       1,30~         1,10~
Sort      2?             10,07~      2,40~       2,40~         2,40~
Total     13~            12,58~      3,73~       3,70          3,50~


Please note that due to the restrictions which applied to my user account, I had no permission to create native images of TT5 or TT8 on the test computer. Both tests were jitted. This was not the case in a previous performance test post where everything was native. TT7 did not perform any alphabetical subsorting between types with the same frequency value.

A Word on Terminology

TT5 and TT7, TT8 and a future TTx are all short names for source code revisions in the SVN repository. Tenka Text is still in the pre-alpha development stages.
04:30 0 comments

演歌 - Enka

While investing some time in improving my new frequency list implementation, I also started to watch some new Japanese Dramas from Winter 2007:
  • Erai Tokoro ni Totsuide Shimatta!
  • Hana Yori Dango 2
  • Enka* no Joou
I just fell in love with the songs in Enka no Joou and decided to cut and retime the subs of one of the enkas in it. Here you can download this wonderful song.

Two great open-source programs helped me do the work in under 2 minutes: VirtualDub and Aegisub. Both are extremely easy to use!
19:07 0 comments

ニュース - News

I have renamed parts of the class library.

Tenka.Text.Segmentation

This is the new name of the namespace which was previously called Tenka.Text.Enumerators. I will be providing some default segment enumerator implementations until I have bought and read a good book on IL, which will mark the turn of events paving the way to runtime-compiled customizable segmentation. I found a way to do this using C# syntax at http://www.codeproject.com/cs/algorithms/matheval.asp.

The new frequency list is also almost done and will completely replace the old one. I have a hard time doing the necessary encapsulation work because I keep thinking that I might be doing something that might adversely affect the performance.
03:54 0 comments

Performance Improvement Tests

To do a quick test on performance improvement, I redid some parts of my counter and switched to the new high performance hashset collection from the Orcas January CTP.

I used the Helsinki Corpus (9.793 KB, single text file) and conducted a single-word frequency list operation (create and sort).

And here are the results:


           TT5 SVN17   TT7 SVN47   Performance Improvement
Create     2,17        1,84        1,17x
Sort       12,42       2,28        5,44x
Display    0,80~       0,80~       -
Total      15,39~      4,92~       3,12x


For your information, WordSmith Tools 4 (4.0.0.374) takes about 23 seconds to perform the same operation.
00:33 0 comments

HashSet - a new high performance set collection from Orcas October CTP

I tested performances of several set implementations. A set is an unordered collection of unique elements.

Time required to create a set of unique words with different implementations (Corpus Size: 278,675 tokens of 21,828 types):
  1. a cheap set implementation which derives from the BCL generic list collection and imposes Contains(T item) checks on each add / insert operation [SVN]

    21,20~ seconds

  2. a set implementation which is a reflected / disassembled partial copy of the BCL generic list collection and performs manually inlined Contains checks on each add / insert operation

    20,70~ seconds

  3. System.Collections.Specialized.StringCollection : if (!set.Contains(word)) set.Add(word);

    18,96~ seconds

  4. HashSet from Orcas October CTP

    0,17~ seconds ^_^ just who can beat this?
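
Why the gap is so large can be seen in a small Python sketch. It is an illustration only, with hypothetical names, and it does not reproduce the .NET collections or the timings above: a list-backed set scans everything it holds on each add, whereas a hash set finds the right bucket directly.

```python
class ListSet:
    """Set built on a list: every add scans all stored items first."""

    def __init__(self):
        self._items = []

    def add(self, item):
        if item not in self._items:  # linear scan, like a Contains check
            self._items.append(item)

    def __len__(self):
        return len(self._items)


def unique_words_list(tokens):
    """Collect the types with the list-backed set."""
    result = ListSet()
    for token in tokens:
        result.add(token)
    return result


def unique_words_hash(tokens):
    """Collect the types with Python's built-in hash set."""
    return set(tokens)
```

Both give the same number of types; only the amount of work per add differs, and that difference grows with the number of types already stored.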

I immediately decided to switch to the new generic HashSet from the BCL guys.

Tenka.Text will greatly benefit from this development especially when sorting word-frequency dictionaries* on their frequencies. (* Your average corpus linguist might have a tendency to call them 'wordlists'.)

Time required to create and sort word cluster frequencies (cluster size: 6 words):
  • TT-2006-11-25 (latest binary release, at the time of this writing) [SVN 17]

    2,35 seconds

  • TT-2007-01-20 [SVN 43]

    0,76 seconds
Such power combined with runtime compilation to provide optimized word or cluster enumerators for any given user customization... and all open-source...

Mike Scott, my archrival, the developer of WordSmith Tools... I am definitely looking forward to the day I release the first alpha version of Tenka Text. I can't help but wonder what your comments will be.
16:19 0 comments

Framework Design Guidelines and FxCop

I finished reading Framework Design Guidelines. It was an insightful book well worth reading. Through it, I came to learn that even in the .NET class libraries there are some unresolved/unnoticed design issues which shipped with the product - the naming of some classes for example, or the fact that StringBuilder still resides in the namespace System.Text and not in System. I am happy to have read this book at the very early stages of my program's development. I have the freedom to completely redo the design of my class libraries now. In fact I have already downloaded FxCop to analyse the code I have produced so far. The list of possible issues was longer than the longest of poems. I will be looking into the matter for the next few weeks. I will also go see how regular expressions use System.Reflection.Emit to compile regex instances on-the-fly.
01:02 0 comments

Tenka Text

Hi, this is my first blog post.

I have established this blog to keep people in the know about my personal open-source corpus analysis project Tenka Text. It is my, Cetin Sert's, open-source alternative to Mike Scott's WordSmith Tools (4).

As you can click the above link to get an introductory idea of what my project is and is not, I will allow myself to skip an introduction and tell only the most recent development news.

Back to Basics

Well... first of all, I have decided to refocus on the sub-gui parts of the code once again. I saw a need to rethink some of the basic design elements like the overall inheritance/relation model of the classes and structs in the Tenka.Text namespace. Before going back to the GUI development, I want to simplify the word enumerators, do away with the bulky interfaces, learn more about the available inlining optimizations, move from public fields to public properties and minimize the public exposure of pointers.

System.Reflection.Emit & Runtime Compilation

One of the most research worthy areas I have come across recently is configuration-based dynamic IL code emitting provided by the System.Reflection.Emit namespace. Let's imagine that a user wants to define a custom set of characters across which no clustered word enumeration should be made. (a 5-word cluster which starts in one sentence and continues in the next is not likely to make much sense.) The solution might be provided in two ways:
  1. A character array (string) is used to store the user-defined cluster constraints and each time the cluster enumerator moves next, it checks each character between the current word and the next word against the characters in the user-defined set of cluster constraints to determine whether the clustering is allowed.
  2. Reflection is used in a clever way to emit a dynamic method, avoiding the need to store in an array and retrieve from there the user-defined cluster constraint characters.
I do not know how easily the second way can be implemented but I believe it might help score a performance gain. Regular expressions in .NET make use of it already.
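
The first of the two ways above can be sketched in Python as follows. It is a simplified illustration, not the Tenka Text enumerator: the function name clusters and the default constraint set ".!?" are assumptions, and a word is taken to be a run of letters, digits and apostrophes. Each character between words is checked against the user-defined set, and a hit empties the window so that no cluster spans it.

```python
from collections import deque


def clusters(text, size, barriers=".!?"):
    """Yield word clusters of the given size that never cross a barrier."""
    window = deque(maxlen=size)  # the most recent `size` words
    word = []
    for ch in text + " ":        # trailing space flushes the last word
        if ch.isalnum() or ch == "'":
            word.append(ch)
            continue
        if word:
            window.append("".join(word))
            word = []
            if len(window) == size:
                yield tuple(window)
        if ch in barriers:       # the array-style check from way 1
            window.clear()       # clusters must not continue past this point


two_word = list(clusters("I came. I saw it all", 2))
# The cluster ('came', 'I') is absent because a full stop separates the words.
```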

Runtime compilation might make me do a complete rewrite of the whole project if the results turn out to be satisfactory. I will do quite a lot of reading on this in the next couple of weeks. This would mean that I don't write a rigid instance of Tenka.Text but the abstract logic of Tenka.Text that can write specialized instances of itself on user demand.

Performance Test

I tried to create a list of 6-word clusters disregarding any sentence boundaries or punctuation and here are the results:

WordSmith Tools (4.0.0.338)
  1. Create Index - 12~ seconds
  2. Open Index - 3~ seconds
  3. Calculate 6-Word Clusters - 20~ seconds
  • TOTAL: 3 commands, 35~ seconds -.-
Tenka Text (version: SVN 37)
  1. Create 6-Word Cluster List - 5,93 seconds (create: 1,97 + sort: 3,96)
  • TOTAL: 1 command, 5,93 seconds ^_^
By the way... you can also browse the SVN and OHLOH of my project.
01:58 0 comments

Cetin Sert

Contact

Cetin Sert
Heidelberg
Baden Württemberg, Germany

cetin.sert@gmail.com [email]
nomadsoul@msn.com [messenger]



Personal Pages