Character Frequency Calculations

sample plain text in Devanāgarī script
character frequency analysis results
source code
sample program binaries
*You can use the sample program not only with Devanagari but with other scripts as well. See example.

A final-year B.Tech student at the National Institute of Technology, Hamirpur, India wrote to me today (emphasis added):

I came across Tenka-text while searching through the net for existing concordance softwares.
I wish to develop a similar kind of free software for the Dev Nagri script which is the script for Hindi, our national Language.

Initially I plan to develop just the character frequency calculation functionality which is not provided by any currently available product.

Could you please guide me a little on how to go about this. I shall value your suggestions very much.


To which I first replied:

...

I'm at work right now and thus do not have access to my development tools and personal library and because of this cannot reply with an example C# program immediately BUT

(Presuming you are familiar with programming languages)

Step 1 - A HashTable to count devanagari characters encountered

You could use a hashtable/dictionary to hold the frequency values of each character and increment these upon each encounter.

Here is a generic counter, which can store frequency values for any type of object:

http://corsis.svn.sourceforge.net/viewvc/corsis/trunk/Tenka/Counters/Counter.cs?revision=321&view=markup
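To illustrate the idea of Step 1, here is a minimal sketch in Python (chosen for brevity; it is not the linked Counter.cs, and the class and method names are hypothetical). It assumes only that the counted items are hashable, so the same counter can hold single code points, multi-code-point characters or whole words.

```python
from typing import Dict, Generic, Hashable, Iterable, List, Tuple, TypeVar

T = TypeVar("T", bound=Hashable)


class FrequencyCounter(Generic[T]):
    """Hypothetical generic counter: maps any hashable item to its frequency."""

    def __init__(self) -> None:
        self._counts: Dict[T, int] = {}

    def add(self, item: T) -> None:
        # Increment the stored frequency, starting from 0 on first encounter.
        self._counts[item] = self._counts.get(item, 0) + 1

    def add_all(self, items: Iterable[T]) -> None:
        for item in items:
            self.add(item)

    def frequency(self, item: T) -> int:
        # Items that were never encountered have frequency 0.
        return self._counts.get(item, 0)

    def most_common(self) -> List[Tuple[T, int]]:
        # Highest frequency first; ties keep their first-seen order.
        return sorted(self._counts.items(), key=lambda pair: pair[1], reverse=True)
```

Feeding the counter one code point at a time is not enough for Devanagari, which is what Step 2 addresses.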

Step 2 - Study of Devanagari Characters

Devanagari ( http://unicode.org/charts/PDF/U0900.pdf ) seems to be making use of combining marks ( http://www.unicode.org/reports/tr15/images/UAX15-NormFig5.jpg, http://www.unicode.org/reports/tr15/ ).

See sequence U+0940 - U+094D ( http://unicode.org/charts/PDF/U0900.pdf ).

So you could use System.Globalization.StringInfo class ( http://msdn2.microsoft.com/en-us/library/system.globalization.stringinfo.aspx ) to feed input into a Counter instance.
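The following Python sketch shows the kind of grouping Step 2 relies on. It is a deliberate simplification and an assumption on my part, not the StringInfo implementation and not the full grapheme cluster rules of the Unicode Standard: every code point whose general category is a mark (Mn, Mc or Me) is attached to the element before it. The function names are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Iterator


def text_elements(text: str) -> Iterator[str]:
    """Yield each base code point together with the combining marks after it.

    Simplified rule (an assumption, not full Unicode segmentation): a code
    point in general category Mn, Mc or Me joins the preceding element.
    """
    element = ""
    for ch in text:
        if element and unicodedata.category(ch).startswith("M"):
            element += ch          # combining mark: extend the current element
        else:
            if element:
                yield element      # a new base starts: emit the finished element
            element = ch
    if element:
        yield element


def count_text_elements(text: str) -> Counter:
    """Count how often each text element occurs in the given text."""
    return Counter(text_elements(text))


if __name__ == "__main__":
    # U+0939 U+093F U+0928 U+094D U+0926 U+0940 (the word "Hindi" in Devanagari)
    word = "\u0939\u093f\u0928\u094d\u0926\u0940"
    for element in text_elements(word):
        print(element, len(element))
```

Under this rule a consonant followed by a vowel sign or a virama is counted as one character rather than two, which is the point of not counting raw code points.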

Step 3 - Implementation Example

I'll return to you with source code as soon as possible, probably before 2007-09-27T21:06:56Z (utc).

Best Regards,
Cetin Sert


And then offered the following solution:

Here is a simple solution that aims to handle plain text files only:

http://corsis.sourceforge.net/divya/devanagari.zip


1. Extract files.
2. If your plain text file is in Unicode UTF-8 and in hi-IN culture: simply drag and drop the file onto the Tenka.Text.Console.exe and it will create a file named [INPUT-FILE-NAME].charanalysis.xml. You can test the program with the devanagari.txt which is a sample file in UTF-8.

   If your plain text file is in some other encoding: execute Tenka.Text.Console.exe by calling

   “Tenka.Text.Console.exe -e:[ENCODING-NAME] [INPUT-FILE-NAME]”

   If your plain text file is in some other culture: execute Tenka.Text.Console.exe by calling

   “Tenka.Text.Console.exe -c:[CULTURE-NAME] [INPUT-FILE-NAME]”

   Example for a German plain text file in ISO-8859-1:

   “Tenka.Text.Console.exe -e:iso-8859-1 -c:de-DE german.txt”

3. View [INPUT-FILE-NAME].charanalysis.xml with Firefox, IE 7 or something similar.

The numeric value of the length attribute of a character element indicates how many Unicode code points are used to encode the character or, in Unicode terms, the grapheme cluster (see Unicode Standard 5.0, Annex #29, Grapheme Cluster Boundaries, on page 1376 in ISBN 0-321-48091-0).

The numeric value of the frequency attribute of a character (grapheme cluster) element indicates how many times that character was encountered.
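To make the two attributes concrete, here is a small Python sketch that builds a report of this shape. Only the character element and its length and frequency attributes come from the description above. The root element name and the value attribute are assumptions for illustration, and the real .charanalysis.xml layout may differ.

```python
import xml.etree.ElementTree as ET
from typing import Mapping


def build_report(frequencies: Mapping[str, int]) -> ET.Element:
    """Build an XML tree with one <character> element per counted character."""
    root = ET.Element("characters")  # hypothetical root element name
    ordered = sorted(frequencies.items(), key=lambda pair: pair[1], reverse=True)
    for cluster, count in ordered:
        ET.SubElement(
            root,
            "character",
            {
                "value": cluster,             # hypothetical attribute name
                "length": str(len(cluster)),  # number of code points in the character
                "frequency": str(count),      # number of times it was encountered
            },
        )
    return root


if __name__ == "__main__":
    # U+0915 U+093F is a consonant plus vowel sign: one character, two code points.
    report = build_report({"\u0915\u093f": 2, "\u0915": 1})
    print(ET.tostring(report, encoding="unicode"))
```

In Python 3, len() on a string counts code points, so it matches the meaning of the length attribute as described here.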


I am looking forward to reading your reply.

Best Regards,
Cetin Sert
INF 521, 4-6-2
69120 Heidelberg
Germany

2007-09-27T18:21:43Z


And corrected myself regarding the instructions as I noticed the person might be using Linux and not Windows:

Checking the logs of web sites, I noticed that you might be using Linux. If that’s indeed the case, I need to update the instructions in section 2:

2. You need to install Mono 1.2.5 from http://www.mono-project.com/Downloads ( recommended package: ftp://www.go-mono.com/archive/1.2.5.1/linux-installer/0/mono-1.2.5.1_0-installer.bin ) and then enter for example:

“mono Tenka.Text.Console.exe -e:utf-8 -c:hi-IN [INPUT-FILE-NAME]”

If you have ever called java apps from the terminal window, it’s pretty much the same way of doing this.

Again, a much easier way is to use Windows just for the purpose of testing this application. Quoting the runtime requirement notice from the Windows downloads page of the main project:

For Windows 98, ME, XP, 2000, 2003 Users:

This program requires Microsoft .NET Framework 2.0 or above to run. You can download the latest official release at the following addresses. This is a one-time installation and if you have already installed the framework, you can proceed with the download of Tenka Text.

> Microsoft .NET Framework 3.0 - bootstrapper:

A small executable that determines your cpu/os architecture automatically and downloads the full redistributable package appropriate for installation.

http://www.microsoft.com/downloads/details.aspx?FamilyID=10cc340b-f857-4a14-83f5-25634c3bf043&DisplayLang=en

> Microsoft .NET Framework 3.0 - full redistributable packages for specific cpu/os architectures:

32-bit/x86) http://go.microsoft.com/fwlink/?LinkId=70848
64-bit/x64) http://go.microsoft.com/fwlink/?LinkId=70849


For Windows Vista Users:

Windows Vista ships with the .NET Framework 3.0 preinstalled. Please proceed with the download of Tenka Text.

If you are on Windows, please follow the original section 2 from the first mail. I kindly ask you to excuse my complicated way of explaining things.

Thinking that you might just want to see the results, I’m uploading the results I got from the sample I sent a link to in the first mail.

I am looking forward to reading your reply.

Best Regards,
Cetin Sert
INF 521, 4-6-2
69120 Heidelberg
Germany


Why have I quoted all of our correspondence? Well, I have begun to feel that free software development is not just about open source. It is about the whole experience of putting something together, and there are more facets to it than mere coding.

I truly enjoyed the opportunity that was given to me and this feature is definitely going to find its place in the upcoming releases. That is, if I do get a positive reply regarding the acceptability of the results produced.
