HashSet - a new high-performance set collection from the Orcas October CTP

I tested the performance of several set implementations. A set is an unordered collection of unique elements.

Time required to build a set of unique words with each implementation (corpus size: 278,675 tokens of 21,828 types):
  1. a cheap set implementation that derives from the BCL generic list collection and imposes a Contains(T item) check on each add / insert operation [SVN]

    ~21.20 seconds

  2. a set implementation that is a reflected / disassembled partial copy of the BCL generic list collection and performs manually inlined Contains checks on each add / insert operation

    ~20.70 seconds

  3. System.Collections.Specialized.StringCollection, guarded by: if (!set.Contains(word)) set.Add(word);

    ~18.96 seconds

  4. HashSet from the Orcas October CTP

    ~0.17 seconds ^_^ Just who can beat this?
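
The gap between the list-based variants and HashSet is what you would expect from the data structures: a list-backed Contains check scans the elements already stored, while a hash set looks an item up by its hash code. Here is a minimal sketch of the two approaches. It is not the benchmark code used for the numbers above; the class and method names (UniqueWords, WithList, WithHashSet) are made up for illustration, and it assumes the HashSet<T> API as it later shipped in System.Collections.Generic, which may differ in detail from the CTP.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not taken from Tenka Text.
public static class UniqueWords
{
    // List-backed "set": every add first scans the list linearly,
    // so the cost grows with tokens x distinct types.
    public static List<string> WithList(IEnumerable<string> tokens)
    {
        var set = new List<string>();
        foreach (string token in tokens)
        {
            if (!set.Contains(token))
                set.Add(token);
        }
        return set;
    }

    // HashSet<T>: Add hashes the item and simply returns false for a
    // duplicate, so no explicit Contains check is needed.
    public static HashSet<string> WithHashSet(IEnumerable<string> tokens)
    {
        var set = new HashSet<string>();
        foreach (string token in tokens)
            set.Add(token);
        return set;
    }
}
```

Wrapping each call in a System.Diagnostics.Stopwatch should reproduce the general shape of the comparison, though the absolute timings will depend on the corpus and the machine.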

I immediately decided to switch to the new generic HashSet from the BCL guys.

Tenka.Text will greatly benefit from this development, especially when sorting word-frequency dictionaries* by frequency. (* Your average corpus linguist might have a tendency to call them 'wordlists'.)

Time required to create and sort word cluster frequencies (cluster size: 6 words):
  • TT-2006-11-25 (latest binary release at the time of this writing) [SVN 17]

    2.35 seconds

  • TT-2007-01-20 [SVN 43]

    0.76 seconds

Such power, combined with runtime compilation that provides optimized word or cluster enumerators for any given user customization... and all of it open source...
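
For readers who have not met cluster frequencies before: a cluster here is a run of consecutive words, and the task is to count how often each run occurs and then order the runs by count. The sketch below shows one straightforward way to do that. It is only an illustration under my own assumptions, not Tenka Text's implementation; the names ClusterFrequencies and CountAndSort are hypothetical, and it says nothing about where the speedup between the two releases came from.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not taken from Tenka Text.
public static class ClusterFrequencies
{
    // Counts every run of `size` consecutive tokens and returns the
    // clusters ordered from most to least frequent.
    public static List<KeyValuePair<string, int>> CountAndSort(IList<string> tokens, int size)
    {
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + size <= tokens.Count; i++)
        {
            // Join the current window into a single key, e.g. "a b c d e f".
            var window = new string[size];
            for (int j = 0; j < size; j++)
                window[j] = tokens[i + j];
            string key = string.Join(" ", window);

            int n;
            counts.TryGetValue(key, out n); // n is 0 when the key is absent
            counts[key] = n + 1;
        }

        var sorted = new List<KeyValuePair<string, int>>(counts);
        // Descending by frequency; ties are broken by ordinal key order
        // so the output is deterministic.
        sorted.Sort((a, b) =>
        {
            int byCount = b.Value.CompareTo(a.Value);
            return byCount != 0 ? byCount : string.CompareOrdinal(a.Key, b.Key);
        });
        return sorted;
    }
}
```

Calling ClusterFrequencies.CountAndSort(tokens, 6) would correspond to the cluster size used in the timings above.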

Mike Scott, my archrival, the developer of WordSmith Tools... I am definitely looking forward to the day I release the first alpha version of Tenka Text. I can't help but wonder what your comments will be.
