Efficiently implementing the persistence algorithm

Around the time of my recent submission Multicore homology via Mayer Vietoris an interesting optimization was made which seems to have a nice impact on the running time of the persistence algorithm.

When computing persistence  the boundary matrix is usually stored sparsely. In particular, over \mathbb{Z}_2 each column stores only the indices of the rows of the matrix which are nonzero in it's column.

In the parlance of [ZC '05]  each column is identified as:

x = \partial(\textrm{Cascade}[\sigma]).

The column x is laid out in memory as a [dynamic] array of computer words  corresponding to row indices, with the natural order of these words being the filtration order.

The persistence algorithm proceeds iteratively, column by column, adding previous columns to the current one. If x is the current column being operated on the next column we will add to it

can be identified as:

y = \textrm{Cascade[Youngest[}\partial(\textrm{Cascade}[\sigma])\textrm{]]]}

Until either the column is zeroed out or a unique pairing occurs, e.g. y = 0.

Now let me restate the definition of y as a procedure:

Step 1) The youngest entry of x is identified.

Step 2) The address of the column y is computed.

Step 3) The address of the column y is fetched into CPU registers.

Usually, this sequence of three steps occurs only after the previous addition completed. Chain addition is generally are implemented as a set_symmetric_difference from begin() to end() on each chain, e.g. oldest entry of each column to the youngest.

In recent versions of both CTL as well as PHAT a change was made to how these column additions  are carried out. The columns are added now added from youngest to oldest elements. This should have the advantage of better locality of reference.

This is because when inspecting the youngest entry we are loading it, along with some amount of the end of the chain into memory. If we then proceeded to add front to back we might end up retiring this cache line in favor of the beginning of the array.

However, adding backwards has even more benefits. Once we add backwards we know that the first pair of elements we inspect will agree, necessarily. However, it is possible,  that the two vectors we add together agree at their tail significantly more. We may delay the allocation of the output buffer for the sum until we find the first mismatch between these two chains. This saves us a small amount of memory. But their is even more.

Supposing that x is not a positive column, e.g. after a call to mismatch we will find a pair of elements in disagreement, precisely the younger of these two elements is the new youngest element.  We may now compute the address of the next column to add to $x$ and request that the computer prefetches the column at that address.

In other words we get some added low level parallelism: as we add the pair of current columns, we are simultaneously fetching the next column to add.

How well does all of this work? In my opinion, quite well. Here is a  plot of the running time for the persistence algorithm showing the original running time, the running time when adding backwards, and the running time when adding backwards with prefetching, on the various datasets from the multicore paper

Feel free to comment and discuss on the results of this experiments. I'll answer questions the best I can.