mono string hash code collisions
After reading the following blog post by David R. MacIver about hash code collisions in java strings:
For reasons you either know about by now or don’t care about, I was curious as to how well String’s hashCode was distributed (I suspected the answer was “not very”). I ran a few quick experiments to verify this.
For your amusement, here is a list of all hash collisions between alphanumeric strings of size 2: http://www.drmaciver.com/collisions.txt and here is a list of all which don’t collide with any others http://www.drmaciver.com/noncolliding.txt
Some statistics: There are 3844 alphanumeric strings of size 2. Of these 3570 collide with at least one other string. That is, 274 of these strings (or about 7% of them) *don’t* collide with something else.
Oh well. It’s a good thing no one would be stupid enough to rely on hashCode to distinguish the contents of two objects.
I tested things with .NET 3.5 and MONO 1.9.1 on a 32-bit Windows Vista:
running on .NET 3.5:
3844 two-char strings
14776336 comparisons
0 collisions with one or more items
0 total collisions
running on mono 1.9.1:
3844 two-char strings
14776336 comparisons
**3570** collisions with one or more items
5250 total collisions
2-char-long-string-hashcode-collisions-mono-1.9.1
We then both went on to testing hash code collisions in 3-char-long strings. David in java:
For what it’s worth, even fewer strings have unique hash codes for 3 characters. 3948 don’t collide, or about 1.6% of them.
Also, this of course doesn’t mean that probability of a hash collision is really high. In reality it’s acceptable low. It’s just a demonstration that it’s not hard to find colliding pairs.
and I in mono 1.9.1: 3-char-long-string-hashcode-collisions-mono-1.9.1. 3948 don't collide and the colliding pairs seem to be the same ones again.
Microsoft's implementation seems to be doing a much better work here. I will inform mono devs soon about the situation ^_^
Comments