Hash Tables in Mathematica
This week, after downloading the citing- and related-paper counts of around 50.000 papers returned by various simple PubMed searches, we wanted a way to find the most cited authors.
Both of us extremely new to Mathematica, we had yet to see anything that looked or behaved like a hash table, so our first approach looked like this:
recordsOfAuthor[author_] :=
  Select[
    records,
    Composition[MemberQ[#, author] &, authorsOf]
  ] (* linear scan over all records on every call *)
where records was a list of XMLElements containing EndNote X3 records (these roughly correspond to PubMed search results).
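The helpers accNumOf and authorsOf used throughout this post just dig the relevant fields out of such an XMLElement. Here is a minimal sketch of the idea, assuming a made-up element layout (the real EndNote X3 schema is nested differently, but the access pattern is the same):

sampleRecord = XMLElement["record", {}, {
    XMLElement["accession-num", {}, {"19001234"}],
    XMLElement["author", {}, {"Smith J"}],
    XMLElement["author", {}, {"Doe A"}]}];

accNumOf[record_] := First @ Cases[record,
    XMLElement["accession-num", _, {num_}] :> num, Infinity]

authorsOf[record_] := Cases[record,
    XMLElement["author", _, {name_}] :> name, Infinity]

authorsOf @ sampleRecord (* -> {"Smith J", "Doe A"} *)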
Now I am not the guy to immediately see the O-complexity of any given piece of code, yet the first thing I told my friend after we had finished downloading the citation numbers, and right before we started coding the rest of our analysis tools, was that we definitely needed to either find out how to use hash tables in Mathematica or do the whole thing in another language; otherwise we would be in for some headaches.
Our largest EndNote X3 search results file was a single 95MB XML document with 17.278 records (papers/reviews/etc.) and 52.666 unique authors, combined with another 8MB XML document containing the citation and related paper counts we had downloaded from PubMed for each record.
In another part of the code we have a map over all unique authors that uses the above function to sum the citation numbers of all records for every single author.
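That map looked roughly like the following sketch, where citationsOf and uniqueAuthors are hypothetical stand-ins for our actual helper and author list:

citationsOfAuthor[author_] := Total[citationsOf /@ recordsOfAuthor[author]]
citationsOfAuthor /@ uniqueAuthors (* one full recordsOfAuthor scan per author *)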
Let's consider the worst-case complexity now:
- 52.666 unique authors
- 17.278 records
- 79.159 author entries across all records (authors counted once per record they appear on)
The latter two make each call of recordsOfAuthor cost 17.278 × 79.159 = 1.367.709.202 (1 billion, 367 million, 709 thousand and 202) operations! So with our 52.666 unique authors, we are talking about a whopping 52.666 × 17.278 × 79.159 = 72.031.772.832.532 operations to find the records of every author: 72 trillion, 31 billion, 772 million, 832 thousand and 532 operations!
Our recordsOfAuthor was O(n²) per call, so the map over all authors was O(n³). No surprise that it made me wait a whole night for the results ... a night whose sweet slumber brought upon me a vision of myself being scolded and harshly criticized by my friend for being the slowest and most unsuccessful coder she had ever seen, then googling for a way to use hash tables in Mathematica, going to Stack Overflow and reading an answer.

Now let's have a look at this:
Clear[dic, dicAdd];
dic[accNum_] := Null (* dictionary: accession number -> record *)
dicAdd[record_] := dic[accNumOf @ record] = record;
dicAdd /@ records; (* 17.278 operations *)

Clear[ar, arAdd, recordsOfAuthor];
ar[author_] := {} (* ar: author -> accession numbers of their records *)
arAdd[record_] := (ar[#] = Append[ar[#], accNumOf @ record]) & /@ authorsOf @ record;
arAdd /@ records; (* 79.159 operations *)

recordsOfAuthor[author_] := author // ar // dic /@ # & (* O(1) per lookup *)
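The trick is that every dic[...] = ... and ar[...] = ... assignment is stored as a DownValue, and Mathematica dispatches definitions with literal (pattern-free) left-hand sides through an internal hash table, so lookups are effectively constant time. A tiny standalone illustration of the same idea, with made-up keys:

Clear[age];
age[_] := Missing["NotFound"] (* catch-all default for unknown keys *)
age["alice"] = 34;
age["bob"] = 27;
age["alice"] (* -> 34, found by hashed lookup, not a linear scan *)
age["carol"] (* -> Missing["NotFound"], via the catch-all pattern *)

The dic[accNum_] := Null and ar[author_] := {} defaults above play exactly the role of that catch-all: they handle keys that were never assigned.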
Now we first do 17.278 + 79.159 = 96.447 operations to create an O(1) version of recordsOfAuthor by using function definitions as a hash table. With this version of recordsOfAuthor, the same map over all unique authors has become O(1) × n = O(n), so we have reduced the number of operations from 72 trillion, 31 billion, 772 million, 832 thousand and 532 down to just 52 thousand and 666!
For all the being hated, blamed and made to feel worthless by my best friend, and for the week of delay in development time, this has been an excellent lesson: about experience, understanding, patience, indifference, simplicity and trust. I will never forget her words: "You always say you'll finish, you'll finish, but you never do!"