mono string hash code collisions

After reading the following blog post by David R. MacIver about hash code collisions in java strings:

For reasons you either know about by now or don’t care about, I was curious as to how well String’s hashCode was distributed (I suspected the answer was “not very”). I ran a few quick experiments to verify this.

For your amusement, here is a list of all hash collisions between alphanumeric strings of size 2: http://www.drmaciver.com/collisions.txt and here is a list of all which don’t collide with any others http://www.drmaciver.com/noncolliding.txt

Some statistics: There are 3844 alphanumeric strings of size 2. Of these 3570 collide with at least one other string. That is, 274 of these strings (or about 7% of them) *don’t* collide with something else.

Oh well. It’s a good thing no one would be stupid enough to rely on hashCode to distinguish the contents of two objects.

I tested things with .NET 3.5 and MONO 1.9.1 on a 32-bit Windows Vista:

running on .NET 3.5:
3844 two-char strings
14776336 comparisons
0 collisions with one or more items
0 total collisions

running on mono 1.9.1:
3844 two-char strings
14776336 comparisons
**3570** collisions with one or more items
5250 total collisions

2-char-long-string-hashcode-collisions-mono-1.9.1

We then both went on to testing hash code collisions in 3-char-long strings. David in java:


For what it’s worth, even fewer strings have unique hash codes for 3 characters. 3948 don’t collide, or about 1.6% of them.

Also, this of course doesn’t mean that probability of a hash collision is really high. In reality it’s acceptable low. It’s just a demonstration that it’s not hard to find colliding pairs.

and I in mono 1.9.1: 3-char-long-string-hashcode-collisions-mono-1.9.1. 3948 don't collide and the colliding pairs seem to be the same ones again.

Microsoft's implementation seems to be doing a much better work here. I will inform mono devs soon about the situation ^_^

Comments

Popular posts from this blog

Levenshtein Distance Algorithm: Fastest Implementation in C#

Mono 1.2.5 binaries for Solaris 10/x86

WordSmith Tools 5.0, Tenka Text in China