Posts

News

I have renamed parts of the class library. Tenka.Text.Segmentation is the new name of the namespace previously called Tenka.Text.Enumerators. I will provide some default segment enumerator implementations until I have bought and read a good book on IL, which should pave the way to runtime-compiled, customizable segmentation. I found a way to do this using C# syntax at http://www.codeproject.com/cs/algorithms/matheval.asp . The new frequency list is also almost done and will completely replace the old one. I am having a hard time with the necessary encapsulation work because I keep worrying that it might hurt performance.

Performance Improvement Tests

To do a quick test on performance improvement, I redid some parts of my counter and switched to the new high performance HashSet collection from the Orcas January CTP. I used the Helsinki Corpus (9.793 KB, single text file) and conducted a single-word frequency list operation (create and sort). Here are the results:

             TT5 SVN17    TT7 SVN47    Performance Improvement
Create       2,17         1,84         1,17x
Sort         12,42        2,28         5,44x
Display      0,80~        0,80~        -
Total        15,39~       4,92~        3,12x

For your information, WordSmith Tools 4 (4.0.0.374) takes about 23 seconds to perform the same operation.
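For readers unfamiliar with the operation being timed, here is a minimal sketch of a single-word frequency list with separate create and sort steps. It is not the Tenka.Text code: the class name FrequencyList, the method names Create and Sort, the tie-breaking rule and the sample sentence are all assumptions made up for this illustration, and it uses a plain Dictionary rather than whatever the counter uses internally.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the Tenka.Text implementation.
static class FrequencyList
{
    // Create step: count how often each word occurs.
    public static Dictionary<string, int> Create(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            int n;
            counts.TryGetValue(word, out n); // n is 0 when the word is new
            counts[word] = n + 1;
        }
        return counts;
    }

    // Sort step: highest frequency first; ties are broken by ordinal
    // string order (an assumption, chosen to make the result deterministic).
    public static List<KeyValuePair<string, int>> Sort(Dictionary<string, int> counts)
    {
        var entries = new List<KeyValuePair<string, int>>(counts);
        entries.Sort((a, b) =>
        {
            int byCount = b.Value.CompareTo(a.Value);
            return byCount != 0 ? byCount : string.CompareOrdinal(a.Key, b.Key);
        });
        return entries;
    }

    static void Main()
    {
        // Made-up sample text standing in for a corpus.
        string text = "the cat saw the dog and the dog saw the cat";
        foreach (var entry in Sort(Create(text.Split(' '))))
            Console.WriteLine("{0}\t{1}", entry.Key, entry.Value);
    }
}
```

Keeping the two steps separate mirrors the Create and Sort rows in the timings, so each step can be measured on its own.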

HashSet - a new high performance set collection from Orcas October CTP

I tested the performance of several set implementations. A set is an unordered collection of unique elements. Time required to create a set of unique words with different implementations (Corpus Size: 278,675 tokens of 21,828 types):

- a cheap set implementation which derives from the BCL generic list collection and imposes Contains(T item) checks on each add / insert operation [ SVN ]: 21,20~ seconds
- a set implementation which is a reflected / disassembled partial copy of the BCL generic list collection and performs manually inlined Contains checks on each add / insert operation: 20,70~ seconds
- System.Collections.Specialized.StringCollection with if (!set.Contains(word)) set.Add(word); : 18.96~ seconds
- HashSet from Orcas October CTP: 0,17~ seconds ^_^ just who can beat this?

I immediately decided to switch to the new generic HashSet from the BCL guys. Tenka.Text will greatly benefit from this development, especially when sorting word-frequency dictionaries* by frequency. (* Your ...
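The gap comes from how membership is checked: a list-backed set scans its elements on every Contains call, while a hash-based set looks the word up by its hash code. The following is a minimal sketch of the two approaches, not the code I measured. It assumes the generic HashSet<T> in System.Collections.Generic; the class name SetComparison, its method names and the synthetic token list are hypothetical, and the timings it prints will differ from the figures above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical illustration only; not the measured implementations.
static class SetComparison
{
    // List-backed "set": each Contains call is a linear scan of the list.
    public static List<string> UniqueWithList(IEnumerable<string> tokens)
    {
        var set = new List<string>();
        foreach (string word in tokens)
            if (!set.Contains(word)) set.Add(word);
        return set;
    }

    // Hash-based set: Add silently ignores words that are already present.
    public static HashSet<string> UniqueWithHashSet(IEnumerable<string> tokens)
    {
        var set = new HashSet<string>();
        foreach (string word in tokens)
            set.Add(word);
        return set;
    }

    static void Main()
    {
        // Synthetic stand-in for a corpus: 50,000 tokens of 5,000 types.
        var tokens = new List<string>();
        for (int i = 0; i < 50000; i++) tokens.Add("w" + (i % 5000));

        Stopwatch sw = Stopwatch.StartNew();
        int listTypes = UniqueWithList(tokens).Count;
        Console.WriteLine("List + Contains: {0} types in {1} ms", listTypes, sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        int hashTypes = UniqueWithHashSet(tokens).Count;
        Console.WriteLine("HashSet:         {0} types in {1} ms", hashTypes, sw.ElapsedMilliseconds);
    }
}
```

Both methods return the same set of types; only the cost of each membership check differs, which is why the list-based variants slow down as the number of types grows.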

Framework Design Guidelines and FxCop

I finished reading Framework Design Guidelines . It is an insightful book, well worth reading. Through it, I learned that even the .NET class libraries contain some unsolved or unnoticed design issues that shipped with the product: the naming of some classes, for example, or the fact that StringBuilder still resides in the namespace System.Text and not in System. I am happy to have read this book at such an early stage of my program's development. I still have the freedom to completely redo the design of my class libraries. In fact, I have already downloaded FxCop to analyse the code I have produced so far. The list of possible issues was longer than the longest of poems. I will be looking into the matter for the next few weeks. I will also go see how regular expressions use System.Reflection.Emit to compile regex instances on-the-fly.

Tenka Text

Hi, this is my first blog post. I have set up this blog to keep people informed about my personal open-source corpus analysis project Tenka Text . It is my, Cetin Sert's, open-source alternative to Mike Scott's WordSmith Tools (4). Since you can click the above link to get an introductory idea of what my project is and is not, I will allow myself to skip the introduction and report only the most recent development news.

Back to Basics

Well... first of all, I have decided to refocus on the sub-GUI parts of the code once again. I saw a need to rethink some of the basic design elements, such as the overall inheritance/relation model of the classes and structs in the Tenka.Text namespace. Before going back to GUI development, I want to simplify the word enumerators, do away with the bulky interfaces, learn more about the available inlining optimizations, move from public fields to public properties and minimize the public exposure of pointers.

System.Reflection.Emit & Run...

Cetin Sert

Contact
Cetin Sert
Heidelberg, Baden-Württemberg, Germany
cetin.sert@gmail.com [email]
nomadsoul@msn.com [messenger]

Personal Pages
chetin.deviantart.com
My free 3D models on Turbosquid