Hash Tables in Mathematica

This week, after downloading citing and related paper numbers of around 50.000 papers that were presented as results to various simple searches in PubMed, we wanted to have a way to find out most cited authors.

Both extremely new to Mathematica, we were yet to see anything that looked or behaved like hash tables so our first approach looked like this:

recordsOfAuthor[author_]:=
Select[
records,
Composition[MemberQ[#, author]&, authorsOf]
]

where records were a list of XMLElements containing EndNote X3 records (these roughly correspond to PubMed search results).

Now I am not the guy to immediately understand the O-complexity of any given code, yet the first thing I had told my friend after we had finished downloading the citation numbers and right before we started coding the rest of our analysis tools was that we definitely needed to either find out how to use hash tables in Mathematica or do the whole thing in another language, otherwise we might have some headache.

Our largest EndNote X3 search results file was a single 95MB XML document with 17.278 records (papers/reviews/etc.) and 52.666 unique authors, combined with another 8MB XML document containing the citation and related paper counts we had downloaded from PubMed for each record.

x-axis: record number, y-axis: number of citations


In another part of the code we have a map over all unique authors that uses the above function to sum the citation numbers of all records for every single author.

Let's consider the worst case complexity now:
  • 52.666 unique authors
  • 17.278 records
  • 79.159 authors

The latter two makes each call of recordsOfAuthor cost 17.278 x 79.159 = 1.368.500.792 (1 billion, 368 million, 500 thousand and 792) operations! So with our 52.666 unique authors, we are talking about a whopping: 52.666 x 17.278 x 79.159 = 72.031.772.832.532 operations to find out all records of each author!!!!!!!!! 72 trillion, 31 billion, 772 million, 832 thousand and 532 operations!!!!!!!!!!!

Our recordsOfAuthor was O(n^3). No surprise that it made me wait for the results a whole night ... whose sweet slumber brought upon me a vision of myself being scolded and harshly criticized by my friend for being the slowest and most unsuccessful coder she has ever seen, then googling for a way to use hash tables in Mathematica and seeing myself go to stackoverflow and reading an answer.

Now let's have a look at this:

Clear[dic, dicAdd];
dic[accNum_] := Null (* dictionary *)
dicAdd[record_] := dic[accNumOf @ record] = record;

dicAdd /@ records; (* 17.278 operations *)

Clear[ar, arAdd, recordsOfAuthor];
ar[author_] := {} (* ar: author records *)
arAdd[record_] := (ar[#] = Append[ar[#], accNumOf @ record)& /@ authorsOf @ record;

arAdd /@ records; (* 79.159 operations *)

recordsOfAuthor[author_] := author // ar // dic /@ # & (* O(1) *)

Now we first do 17.278 + 79.159 = 96.447 operations to create an O(1) version of recordsOfAuthor by using function definitions as a hash table.

With this version of recordsOfAuthor the same map over all unique authors has become O(1) * n = O(n), so we have reduced the number of operations
  • from 72 trillion, 31 billion, 772 million, 832 thousand and 532
  • to just 52 thousand, 666!!!!!!

With all the being hated, blamed and made felt worthless by my best friend and one week of delay in development time, this has been an excellent lesson: about experience, understanding, patience, indifference, simplicity and trust – I will never forget her words: "You always say, you'll finish, you'll finish but you never do!!"!

Comments

Popular posts from this blog

Levenshtein Distance Algorithm: Fastest Implementation in C#

Mono 1.2.5 binaries for Solaris 10/x86

WordSmith Tools 5.0, Tenka Text in China