New Segmenter Compiler: Benihime, 紅姫

parallelizationAn interesting design concern has brought the development of my new segmenter compiler to a temporary standstill for tonight: parallelization.

I was trying to refactor and improve the design of my new segmenter compiler Benihime, 紅姫. (It is named after Urahara's sword from the Japanese anime Bleach. Benihime means crimson princess, what is more suitable to call a state-of-the-art segmenter. ^o^ theheee~~)

One of my major concerns with the new implementation was decoupling the flow control logic from the segmenter builder and the flow direction. This is essential for being able to reuse the same logic to compile two segmenters that run in opposite directions for example.

Let me illustrate the problem with the help of a C# file I happened to submit as a practice for our introduction to computational linguistics course just 5 days ago:


// /home/sert/Projects/hw1/hw1/Main.cs created with MonoDevelop
//
// project created on 10/21/2007 at 3:19 AM
using System;
using System.IO;

using Tenka;
using Tenka.Counters;
using Tenka.Text;
using Tenka.Text.Segmentation;
using Tenka.Text.Segmentation.Emit;

namespace ECL
{
class MainClass
{
public static void Main(string[] args)
{
string path = args[0];
string text = File.ReadAllText(path);

SegmenterBuilder builder = new SegmenterBuilder(Common.CustomizationModule);
CharacterField unibase = builder.DefineCharacterField();
FlowCharacter current = FlowCharacter.Current(builder);
FlowCharacter next = FlowCharacter.Next(builder);

FlowCodeConditionStatement basic = new FlowCodeConditionStatement(current.Matches(char.IsLetterOrDigit));
basic.TrueBlock.Statements.Add(unibase.Store(current));
basic.TrueBlock.Statements.Add(FlowControlAction.Continue);

FlowCodeConditionStatement extended = new FlowCodeConditionStatement(CharacterEqualityExpression.Sequence(current, "'-"));
extended.TrueBlock.Statements.Add(unibase.Store(null));

AndSequence seq = new AndSequence();
seq.Expressions.Add(next.CheckLimits());
seq.Expressions.Add(unibase.Matches(char.IsLetterOrDigit));
seq.Expressions.Add(next.Matches(char.IsLetterOrDigit));

FlowCodeConditionStatement condition = new FlowCodeConditionStatement(seq);
condition.TrueBlock.Statements.Add(FlowControlAction.Continue);

extended.TrueBlock.Statements.Add(condition);
extended.TrueBlock.Statements.Add(FlowControlAction.Break);

AndSequence cseq = new AndSequence();
cseq.Expressions.Add(unibase.IsNotNull);
cseq.Expressions.Add(current.IsCombiningMark);

FlowCodeConditionStatement combination = new FlowCodeConditionStatement(cseq);
combination.TrueBlock.Statements.Add(FlowControlAction.Continue);

FlowControlLogic logic = new FlowControlLogic();
logic.Statements.Add(basic);
logic.Statements.Add(extended);
logic.Statements.Add(combination);
logic.Statements.Add(unibase.Store(null));
logic.Statements.Add(FlowControlAction.Break);

Type type = builder.Compile(FlowDirection.LeftToRight, logic).CreateType();
Common.CustomizationModule.Save();

Segmenter segmenter = Segmenter.CreateInstance(type);

FrequencyList<string> list = new FrequencyList<string>();
TypeTokenRatioStandardization ttrs = list.Add(text, segmenter, 1000);

string[] order = list.GetFrequencyOrder();

Console.WriteLine(" token count: {0}", list.TokenCount);
Console.WriteLine(" type count: {0}", list.TypeCount);
Console.WriteLine(" raw type/token ratio: {0}", list.TypeTokenRatio);
Console.WriteLine(" standardized type/token ratio (base 1000): {0}", ttrs.TypeTokenRatio);
Console.WriteLine();

foreach (string key in order)
{
Console.Write('\t');
Console.Write(list[key]);
Console.Write(' ');
Console.Write(key);
Console.WriteLine();
}
}
}
}


Tada~~~ Here is the problem section. unibase is a character field reference expression and needs to be defined and thus depends on a segmenter builder instance. next and other non-current flow character expressions depend on a flow direction, which is unknown until Compile() is called. All three of them are (compilation) context-dependent expressions.

...
CharacterField unibase = builder.DefineCharacterField();
FlowCharacter current = FlowCharacter.Current(builder);
FlowCharacter next = FlowCharacter.Next(builder);
...
Type type = builder.Compile(FlowDirection.LeftToRight, logic).CreateType();

In the quoted version of the sample program, builder.DefineCharacterField() leads to the definition of a field (a build process) outside the context of the builder.Compile() call and unibase stores an internal reference to builder. This a breach of compilation context.

Additionally, current and next are also initialized with an internal reference [FC->B] to builder. When Compile() is called and the code generation chain triggers Emit() on current or next, they use that internal reference [FC->B] to ask builder on which flow direction value the compilation is being made. I know the explanation sucks big balls but so does this design. It is the worst form of code chaos: Expressions that later end up in statements of the flow control logic are coupled during their initialization with a single segmenter builder instance and its flow direction!! So how in the hell can you use the same logic now with another segmenter builder or flow direction? o__O


After subconscious contemplation for a few days, the following idea suddenly occured to me early in the morning today: to solve this ugly decoupling problem, logic should deal with instantiating a type-less copy of unibase, and direction-less copies of current and next and store references to them in itself. Compile() should then either

(1) create a deep copy of logic in which types and directions of unibase, current and next are bound and use this fully initialized copy for code generation

or

(2) in an atomic pass, (place thread lock), bind unibase, current and next temporarily, generate code and unbind unibase, current and next (release thread lock).

Both approaches have their ups and downs.

(1) may parallelize better as there is nothing to lock but creating deep copies of complex segmentation flow control logics in highly demanding execution scenarios may also cause serious performance issues.

(2) seems to be the way to go as it does not create temporary copies of flow control logics however simple or complex these may be. This means fewer CPU cycles and less memory consumption but managing locks properly may become quite a challange and I might eventually have to learn or wait for a better atomization construct.


Well, that's it - phew~~ what a long post this has become o__O!! Enough talk! I am going on to refactoring the API now.

Comments

Popular posts from this blog

Levenshtein Distance Algorithm: Fastest Implementation in C#

WordSmith Tools 5.0, Tenka Text in China

Mono 1.2.5 binaries for Solaris 10/x86