Posts

Showing posts with the label segmenter

Segmenter Compiler: Benihime, 紅姫

The code generation subsystem of Benihime is undergoing a major refactoring with two goals: extensive use of generics in the public API and as much code reuse as possible. Given my concerns about the performance of the current regex implementation in Mono (1.2.6), I wish I had enough spare time today to contribute a standards-compliant replacement regex compiler. I hope Benihime will become one in the near future.

New Segmenter Compiler: Benihime, 紅姫

A quick update on the API refactorings! Here is a snapshot of what the code examined in the last post looks like with the refactorings and improvements I have made so far: Before: the code examined in the last post After: To segment streams into, say, words, one could also use something like sed on GNU/Linux, the regular expression implementation of some programming language, or whatever. So why am I such an otaku? Why not just go the naive and easy way? Well, I am a performance and control freak, CIL is great fun, and I feel 'pleasure' writing assembly code for a VM. Most importantly, though, Benihime, 紅姫 makes the perfect training ground for learning language and compiler design. Before getting into deep hack mode on Benihime, 紅姫, I had no idea about the differences between 'expressions', 'statements', 'branches' or 'stacks'. Implementing complex boolean expressions in conditional statements like if((c && pm) ...

New Segmenter Compiler: Benihime, 紅姫

An interesting design concern has brought the development of my new segmenter compiler to a temporary standstill for tonight: parallelization. I was trying to refactor and improve the design of my new segmenter compiler, Benihime, 紅姫. (It is named after Urahara's sword from the Japanese anime Bleach. Benihime means crimson princess; what could be a more suitable name for a state-of-the-art segmenter? ^o^ theheee~~) One of my major concerns with the new implementation was decoupling the flow control logic from the segmenter builder and the flow direction. This is essential for reusing the same logic to compile, for example, two segmenters that run in opposite directions. Let me illustrate the problem with the help of a file I happened to submit as an exercise for our introduction to computational linguistics course just 5 days ago: // /home/sert/Projects/hw1/hw1/Main.cs created with MonoDevelop // // project created on 10/21/2007 at 3:19 AM using System; using System.IO; using...

Sertcom, Unicode Standard 5.0 and Japanese Dramas

I've been slacking off on many things recently, including my hobby project. I'm posting from Flensburg, the northernmost city of Germany. Coming from Heidelberg/Mannheim, aka "Delta Region", you get to experience a mild intranational culture shock up in the north. It feels like a big town that cherishes horizontal freedoms rather than the vertically stacked-up big cities I'm used to seeing in Southern Germany. Now to the reason for my stay in Flensburg. My brother acquired an established telemarketing company in Neumünster, Germany, and is renaming it "Sertcom". So I ended up paying him a visit to see how he managed to get so far. The last time I was here he was a team leader. What a rapid development. Unicode Standard 5.0 The only thing I'm doing nowadays that bears some relevance to my studies and job is reading version 5.0 of the Unicode standard. I'm now almost ready to fix some remaining issues with my segmenter implementation that features t...