HashSet - a new high performance set collection from Orcas October CTP

January 20, 2007

I tested the performance of several set implementations. A set is an unordered collection of unique elements.

Time required to create a set of unique words with different implementations (corpus size: 278,675 tokens of 21,828 types):

A cheap set implementation that derives from the BCL generic list collection and imposes Contains(T item) checks on each add / insert operation [SVN]
~21.20 seconds

A set implementation that is a reflected / disassembled partial copy of the BCL generic list collection and performs manually inlined Contains checks on each add / insert operation
~20.70 seconds

System.Collections.Specialized.StringCollection : if (!set.Contains(word)) set.Add(word);
~18.96 seconds

HashSet from Orcas October CTP
~0.17 seconds ^_^ Just who can beat this?
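
For anyone who wants to try a comparison like this, here is a minimal sketch. It is not the benchmark code behind the numbers above. The class name, the method names and the synthetic corpus are made up for illustration, and a plain List<string> stands in for the list-derived set implementations. Timings will differ from machine to machine, so none are shown.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class SetBenchmarkSketch
{
    // Hypothetical helper: builds a synthetic corpus of tokenCount tokens
    // drawn from typeCount distinct words (assumes tokenCount >= typeCount).
    public static string[] MakeCorpus(int tokenCount, int typeCount)
    {
        var tokens = new string[tokenCount];
        for (int i = 0; i < tokenCount; i++)
            tokens[i] = "w" + (i % typeCount);
        return tokens;
    }

    // List-backed "set": every add pays for a linear Contains scan.
    public static int CountTypesWithList(IEnumerable<string> tokens)
    {
        var set = new List<string>();
        foreach (string word in tokens)
            if (!set.Contains(word)) set.Add(word);
        return set.Count;
    }

    // HashSet: Add looks the word up by hash and ignores duplicates.
    public static int CountTypesWithHashSet(IEnumerable<string> tokens)
    {
        var set = new HashSet<string>();
        foreach (string word in tokens)
            set.Add(word); // returns false if the word is already present
        return set.Count;
    }

    public static void Main()
    {
        // Same token and type counts as the corpus above, but synthetic words.
        string[] corpus = MakeCorpus(278675, 21828);

        Stopwatch watch = Stopwatch.StartNew();
        int listTypes = CountTypesWithList(corpus);
        watch.Stop();
        Console.WriteLine("List:    {0} types in {1} ms", listTypes, watch.ElapsedMilliseconds);

        watch = Stopwatch.StartNew();
        int hashTypes = CountTypesWithHashSet(corpus);
        watch.Stop();
        Console.WriteLine("HashSet: {0} types in {1} ms", hashTypes, watch.ElapsedMilliseconds);
    }
}
```

The gap comes from the lookup cost: the list has to scan its elements on every Contains call, while the hash set goes straight to the bucket for the word's hash code.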

I immediately decided to switch to the new generic HashSet from the BCL guys.

Tenka.Text will benefit greatly from this development, especially when sorting word-frequency dictionaries* by frequency. (* Your average corpus linguist might tend to call them 'wordlists'.)

Time required to create and sort word cluster frequencies (cluster size: 6 words):

TT-2006-11-25 (latest binary release, at the time of this writing) [SVN 17]
2.35 seconds

TT-2007-01-20 [SVN 43]
0.76 seconds
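
For readers unfamiliar with the task being timed, here is a rough sketch of counting word clusters and sorting them by frequency. It is an illustration only, not Tenka Text's code. The class and method names are invented, and a plain Dictionary<string, int> is assumed for the counts.

```csharp
using System;
using System.Collections.Generic;

public static class ClusterFrequencySketch
{
    // Counts every run of clusterSize consecutive tokens and returns the
    // clusters sorted by descending frequency (ties broken by ordinal order).
    public static List<KeyValuePair<string, int>> CountAndSort(IList<string> tokens, int clusterSize)
    {
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + clusterSize <= tokens.Count; i++)
        {
            // Join the tokens of the current window into one cluster key.
            var window = new string[clusterSize];
            for (int j = 0; j < clusterSize; j++)
                window[j] = tokens[i + j];
            string cluster = string.Join(" ", window);

            int seen;
            counts.TryGetValue(cluster, out seen); // seen stays 0 for a new cluster
            counts[cluster] = seen + 1;
        }

        var sorted = new List<KeyValuePair<string, int>>(counts);
        sorted.Sort((a, b) => b.Value != a.Value
            ? b.Value.CompareTo(a.Value)
            : string.CompareOrdinal(a.Key, b.Key));
        return sorted;
    }
}
```

Calling CountAndSort(tokens, 6) on a tokenized corpus would give the 6-word cluster list, most frequent first.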

Such power combined with runtime compilation to provide optimized word or cluster enumerators for any given user customization... and all open-source...

Mike Scott, my archrival, the developer of WordSmith Tools... I am definitely looking forward to the day I release the first alpha version of Tenka Text. I can't help but wonder what your comments will be.
