HashSet - a new high performance set collection from Orcas October CTP
I tested the performance of several set implementations. A set is an unordered collection of unique elements.
Time required to create a set of unique words with different implementations (Corpus Size: 278,675 tokens of 21,828 types):
- a cheap set implementation which derives from the BCL generic list collection and imposes Contains(T item) checks on each add / insert operation [SVN]
  ~21.20 seconds
- a set implementation which is a reflected / disassembled partial copy of the BCL generic list collection and performs manually inlined Contains checks on each add / insert operation
  ~20.70 seconds
- System.Collections.Specialized.StringCollection : if (!set.Contains(word)) set.Add(word);
  ~18.96 seconds
- HashSet from Orcas October CTP
  ~0.17 seconds ^_^ just who can beat this?
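The gap between the list-backed sets and HashSet is what you would expect from how each one checks membership: a list has to scan its elements on every Contains call, while a hash set looks the word up by its hash code. Here is a minimal sketch of the two approaches. The class name, method names, and sample sentence are made up for illustration, and this is not the benchmark code behind the numbers above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class SetComparison
{
    // List-backed "set": Contains scans the list on every insert,
    // so the total work grows roughly with tokens * types.
    public static List<string> UniqueWithList(IEnumerable<string> tokens)
    {
        var set = new List<string>();
        foreach (string word in tokens)
            if (!set.Contains(word)) set.Add(word);
        return set;
    }

    // Hash-backed set: Add silently ignores duplicates and
    // locates the bucket via the word's hash code.
    public static HashSet<string> UniqueWithHashSet(IEnumerable<string> tokens)
    {
        var set = new HashSet<string>();
        foreach (string word in tokens)
            set.Add(word);
        return set;
    }

    static void Main()
    {
        // Hypothetical stand-in for a tokenized corpus.
        string[] tokens = "the cat sat on the mat and the dog sat too".Split(' ');

        Stopwatch sw = Stopwatch.StartNew();
        int viaList = UniqueWithList(tokens).Count;
        Console.WriteLine("List:    {0} types in {1}", viaList, sw.Elapsed);

        sw = Stopwatch.StartNew();
        int viaHashSet = UniqueWithHashSet(tokens).Count;
        Console.WriteLine("HashSet: {0} types in {1}", viaHashSet, sw.Elapsed);
    }
}
```

On a sample this small the timings mean nothing; the difference only shows up on a corpus with tens of thousands of types, like the one measured above.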
I immediately decided to switch to the new generic HashSet from the BCL guys.
Tenka.Text will benefit greatly from this development, especially when sorting word-frequency dictionaries* by frequency. (* Your average corpus linguist might tend to call them 'wordlists'.)
Time required to create and sort word cluster frequencies (cluster size: 6 words):
- TT-2006-11-25 (latest binary release, at the time of this writing) [SVN 17]
  2.35 seconds
- TT-2007-01-20 [SVN 43]
  0.76 seconds
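For readers who have not met word clusters before, here is a rough sketch of what "create and sort word cluster frequencies" involves: count every run of N consecutive tokens, then order the counts from most to least frequent. This is only an illustration under my own assumptions. ClusterFrequencies, CountClusters, and SortByFrequency are hypothetical names, and nothing here is taken from the Tenka Text source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ClusterFrequencies
{
    // Counts every run of `size` consecutive tokens (a "cluster").
    public static Dictionary<string, int> CountClusters(string[] tokens, int size)
    {
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + size <= tokens.Length; i++)
        {
            // Join tokens[i .. i + size - 1] into one dictionary key.
            string cluster = string.Join(" ", tokens, i, size);
            int n;
            counts.TryGetValue(cluster, out n); // n is 0 when the cluster is new
            counts[cluster] = n + 1;
        }
        return counts;
    }

    // Most frequent first; ties are broken by ordinal string order
    // so the result is deterministic.
    public static List<KeyValuePair<string, int>> SortByFrequency(Dictionary<string, int> counts)
    {
        return counts.OrderByDescending(p => p.Value)
                     .ThenBy(p => p.Key, StringComparer.Ordinal)
                     .ToList();
    }

    static void Main()
    {
        // Hypothetical sample text; cluster size 2 keeps the output readable.
        string[] tokens = "to be or not to be that is the question to be or not".Split(' ');
        foreach (KeyValuePair<string, int> pair in SortByFrequency(CountClusters(tokens, 2)))
            Console.WriteLine("{0}\t{1}", pair.Value, pair.Key);
    }
}
```

The measurements above use a cluster size of 6; passing 6 instead of 2 to CountClusters gives the equivalent counts for a real corpus.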
Mike Scott, my archrival, the developer of WordSmith Tools... I am definitely looking forward to the day I release the first alpha version of Tenka Text. I can't help but wonder what your comments will be.