Frequency Lists - Statistics View



^_^ Here is the new statistics view for frequency lists from the 2007-02-05 pre-alpha release.

I'm still actively working on frequency lists, at the GUI level and below, and some of the following features might be ready for the next release:
  1. An option for the statistics view of frequency lists to display either
    • the count of tokens that are a certain number of characters long or
    • the count of types that are a certain number of characters long. *

    Tenka Text 2007-02-05 and WordSmith Tools 4 can only display the former, not the latter. Only a little GUI work remains to deliver this option.

  2. Live updates.

    Once this is in place, users will be able to add files to or remove files from their frequency lists without having to recalculate everything from scratch.
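
To make the two planned features concrete, here is a minimal sketch in Python (not the C# that Tenka Text is written in, and not its actual implementation). It assumes a token is simply a lower-cased run of word characters; the `FrequencyList` class and its method names are hypothetical. Keeping one counter per file is what lets a file be removed again without recounting the others.

```python
import re
from collections import Counter

# Assumption: a token is a run of word characters, compared case-insensitively.
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(text):
    return [t.lower() for t in TOKEN_PATTERN.findall(text)]


class FrequencyList:
    """Hypothetical frequency list that supports incremental updates."""

    def __init__(self):
        self.counts = Counter()   # word type -> total token count
        self._per_file = {}       # file name -> Counter for that file

    def add_file(self, name, text):
        # Re-adding a known file replaces its old contribution.
        if name in self._per_file:
            self.remove_file(name)
        file_counts = Counter(tokenize(text))
        self._per_file[name] = file_counts
        self.counts.update(file_counts)

    def remove_file(self, name):
        # Subtract only this file's counts; other files are not re-read.
        file_counts = self._per_file.pop(name)
        self.counts.subtract(file_counts)
        self.counts = +self.counts  # unary plus drops entries that fell to zero

    def tokens_by_length(self):
        # Every occurrence of a word counts towards its length.
        result = Counter()
        for word, n in self.counts.items():
            result[len(word)] += n
        return dict(result)

    def types_by_length(self):
        # Each distinct word counts once towards its length.
        result = Counter()
        for word in self.counts:
            result[len(word)] += 1
        return dict(result)
```

For the text "the cat saw the dog", the token view reports five three-letter words, while the type view reports four, because "the" occurs twice but is a single type.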
Well, that's all the news about Tenka Text 2007-02-05. Let's move on to more personal stuff.

Logographic Failure

I found out that the "Tenka" logo 「天花」, my personal attempt at creating a digital seal from the kanji of a literary word for "snow", has a huge flaw! A senpai from university was kind enough to point out that the kanji combination I chose to represent the sound "tenka" (「天花」) actually means "smallpox", a disease. She is now working with me to come up with a new logo.

Institut für Deutsche Sprache

I have been working at IDS since 2006-10-16 as a student assistant in corpus compilation and analysis, and I have accumulated some experience that I am considering putting into Tenka Text at some future date. By the way, I owe my current position to the Mono project team. It is thanks to their commitment to making .NET multi-platform that I am now able to write tools in C# that run on Linux, which is for some reason the preferred platform of the department of lexical analysis at IDS.

Structured Plain Text Extraction from PDFs and Other Office Files

For example, to accomplish a task I was assigned, I had to combine several builds of a tool called 'pdftohtml' with a few lines of C# (2.0 with some LINQ to XML), and there I had a multi-platform, open-source, managed PDF import method. This method will definitely find its way into Tenka Text once I have developed a common API for extracting structured plain text from file formats such as PDFs or office files.
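
The general idea can be sketched as follows. This is a Python illustration under stated assumptions, not the C# code described above. It assumes that `pdftohtml` is on the PATH, that its `-xml` and `-i` options behave as in common builds (XML output written to `<base>.xml`, images ignored), and that the XML consists of `page` elements containing `text` elements. The function names are hypothetical.

```python
import subprocess
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_pdf2xml(xml_data):
    """Turn pdftohtml-style XML (str or bytes) into a list of pages,
    each page being a list of non-empty text lines in document order."""
    root = ET.fromstring(xml_data)
    pages = []
    for page in root.iter("page"):
        lines = []
        for text in page.iter("text"):
            # itertext() also collects text nested in <b>, <i> or <a>.
            content = "".join(text.itertext()).strip()
            if content:
                lines.append(content)
        pages.append(lines)
    return pages


def extract_pdf_text(pdf_path):
    """Run pdftohtml on a PDF and return its text, page by page.

    Assumption: 'pdftohtml -xml -i in.pdf base' writes 'base.xml'.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out_base = Path(tmp) / "out"
        subprocess.run(
            ["pdftohtml", "-xml", "-i", str(pdf_path), str(out_base)],
            check=True,
            capture_output=True,
        )
        xml_data = out_base.with_suffix(".xml").read_bytes()
    return parse_pdf2xml(xml_data)
```

Separating the parser from the external tool call keeps the parsing step testable without a PDF at hand, and it is the part a common extraction API for other file formats could reuse.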
