Rob Ghrist has released his new book!

You can find it on amazon!

Skip to content
# Applied Topology

## New book! Elementary Applied Topology

## Tenure Track Positions, New College of Florida

## Special issue on “Algorithms and Software for Computational Topology”

## AATRN Seminar: Robert Ghrist

## Applied Algebraic Topology Research Network

## Efficiently implementing the persistence algorithm

## Wasserstein barycenters and Frechet means

## ATMCS 6: Day 5

## ATMCS 6: Day 4

## ATMCS 6: Day 3

Qualitative data analysis

[On behalf of Vic Reiner, UMN]

The New College of Florida now has several openings in statistics and computer science/data science. These positions may be of interest to applied topologists.

Details are here.

(The New College is a public honors college; it’s where Bill Thurston did his undergraduate studies.)

…in the Journal of Symbolic Computation

Aim and Scope:

The interest in algorithms on topological problems and their implementation has rapidly grown during the last decade. One driving force is the emergence of “topological data analysis” which connects topological concepts like Morse theory and homology to the investigation of real-world data. Another recent track of research substantially expands the realm of possibility for computational approaches in 3-manifold and knot theory. Common to these and other developments is the ability to handle large data collections through an efficient algorithmic framework as well as mature software implementations of those. A workshop session at the International Congress of Mathematical Software (ICMS) in August 2014 was dedicated to this topic (http://icms14.appliedtopology.org/).

The Journal of Symbolic Computation (JSC) invites high-quality contributions from researchers in the area of Computational Topology reporting on original research achievements towards algorithms, software, and applications. The list of topics includes, but is not limited to

- (Persistent) homology
- Topological data analysis
- 3-manifold topology and knot theory
- (Discrete) Morse theory

Researchers which are unsure whether their contribution is suitable are encouraged to contact the guest editor.

Submission instructions:

It is recommended to prepare submission in the same format as regular submission to JSC (see the “Guide for Authors” at http://www.journals.elsevier.com/journal-of-symbolic-computation/).

The paper must start with a introduction that

- clearly states the considered problem
- discusses its relevance and related work
- explains the main contribution of the paper
- explains why the contribution is original and non-trivial

There is no page limit on submitted manuscripts. It is required, however, that

- all related work is completely and carefully discussed
- all theorems are rigorously proved
- important definitions/theorems/algorithms are illustrated by well-chosen examples.

All submitted papers will be refereed according to the high standards of JSC.

Guest editor:

Michael Kerber (Max Planck Institute for Informatics) – mkerber@mpi-inf.mpg.de

Deadline:

The submission deadline is January 31 2015. The special issue is planned to appear in Fall 2015

Today, the promised AATRN seminar series got started with Robert Ghrist as the inaugural speaker. His lecture, through WebEx, builds up the cellular sheaf perspective on networks with capacity, Max Flow / Min Cut, and the work done by Ghrist, Yasu Hiraoka, and Sanjeevi Krishnan on categorifying and sheafifying MF/MC.

Among the novel insights coming from this talk even if one has been following the UPenn developments for a while was the connection to Poincare Duality: “Flow duality is a form of Poincare duality” — S. Krishnan

The approach detailed by Krishnan in his preprint (on http://www.math.upenn.edu/~sanjeevi/papers/mfmc.pdf ) encodes a flow network as a sheaf of semi-modules of capacities over a semi-ring over the directed graph of the network. Flows correspond to homology, cuts to cohomology.

MF/MC translates to:

Theorem (S. Krishnan)

Given a network X, a distinguished (virtual) edge e and a capacity sheaf F, (all) flowvalues at e of flows on F correspond to the homotopy limit over all cuts of cutvalues of cuts on F.

This approach, and connecting flows and cuts to a lattice structure on the constraints produces a setting where multi-kind flows can easily be analyzed with the same tools as ordinary flows, and where the algebra fixes the duality gaps that show up when naively searching for minima and maxima.

The talk culminated in a primer on the homology and cohomology of directed sheaves, to set us up to read Krishnan’s paper, constructing compact support cohomology and Borel-Moore homology for sheaves over directed spaces.

The added abstraction levels seem to enable MF/MC theorems for, for instance, probability distributions. It also carries a promise for insights into duality gaps in MF/MC type problems.

Peter Bubenik writes:

[…] Robert Ghrist, Konstantin

Mischaikow, Fadil Santosa and I are starting a Research Network in Applied Algebraic Topology. To start, the main activity of the network will be a weekly interactive online seminar. We have plans to expand our activities in the future.

Rob Ghrist will give the first talk on Tue Sept 23.

Please check out our web site at https://www.ima.umn.edu/topology/ and become a member.

Around the time of my recent submission Multicore homology via Mayer Vietoris an interesting optimization was made which seems to have a nice impact on the running time of the persistence algorithm.

When computing persistence the boundary matrix is usually stored sparsely. In particular, over $$\mathbb{Z}_2$$ each column stores only the indices of the rows of the matrix which are nonzero in it’s column.

In the parlance of [ZC ’05] each column is identified as:

$$x = \partial(\textrm{Cascade}[\sigma]).$$

The column $$x$$ is laid out in memory as a [dynamic] array of computer words corresponding to row indices, with the natural order of these words being the filtration order.

The persistence algorithm proceeds iteratively, column by column, adding previous columns to the current one. If $$x$$ is the current column being operated on the next column we will add to it

can be identified as:

$$y = \textrm{Cascade[Youngest[}\partial(\textrm{Cascade}[\sigma])\textrm{]]]}$$

Until either the column is zeroed out or a unique pairing occurs, e.g. $$y = 0$$.

Now let me restate the definition of $$y$$ as a procedure:

Step 1) The youngest entry of $$x$$ is identified.

Step 2) The address of the column $$y$$ is computed.

Step 3) The address of the column $$y$$ is fetched into CPU registers.

Usually, this sequence of three steps occurs only *after* the previous addition completed. Chain addition is generally are implemented as a set_symmetric_difference from begin() to end() on each chain, e.g. oldest entry of each column to the youngest.

In recent versions of both CTL as well as PHAT a change was made to how these column additions are carried out. The columns are added now added from youngest to oldest elements. This should have the advantage of better locality of reference.

This is because when inspecting the youngest entry we are loading it, along with some amount of the end of the chain into memory. If we then proceeded to add front to back we might end up retiring this cache line in favor of the beginning of the array.

However, adding backwards has even more benefits. Once we add backwards we know that the first pair of elements we inspect will agree, necessarily. However, it is possible, that the two vectors we add together agree at their tail significantly more. We may delay the allocation of the output buffer for the sum until we find the first mismatch between these two chains. This saves us a small amount of memory. But their is even more.

Supposing that $$x$$ is not a positive column, e.g. after a call to mismatch we will find a pair of elements in disagreement, precisely the younger of these two elements is the new youngest element. We may now compute the address of the next column to add to $x$ and request that the computer prefetches the column at that address.

In other words we get some added low level parallelism: as we add the pair of current columns, we are simultaneously fetching the next column to add.

How well does all of this work? In my opinion, quite well. Here is a plot of the running time for the persistence algorithm showing the original running time, the running time when adding backwards, and the running time when adding backwards with prefetching, on the various datasets from the multicore paper

Feel free to comment and discuss on the results of this experiments. I’ll answer questions the best I can.

At this past year at the IMA, there has been some attention spent on a number of interesting aspects in bringing persistent homology closer to statistical methods.

One core step in this process has been to figure out what we *mean* by an average persistence diagram, a question that has an answer proposed by Munch, Bendich, Turner, Mukherjee, Mattingly, Harer and Mileyko in various constellations.

The details on Frechet means for persistent homology is not what this post is about, however. Instead, I want to bring up something I just saw presented today at ICML. Marco Cuturi and Arnaud Doucet got a paper entitled Fast Computation of Wasserstein Barycenters accepted to this large machine learning conference.

In their paper, Cuturi-Doucet present work on computing the barycenter (or average, or mean) of N probability measures using the Wasserstein distance: they articulate the transport optimization problem of finding a measure minimizing Wasserstein distance to all the N given measures, and present a smoothing approach that allows them to bring the problem onto a convex optimization shape. In the talk — though not present in the paper — the authors argue that they can achieve quadratic running times through these approximation steps. Not only that, but their approach ends up amenable for GPGPU computation and significant parallelization and vectorization benefits.

Their approach works anywhere that the Wasserstein metric is defined, and so in particular should work (and most likely give the same results) on the persistence diagram setting studied by the persistent statistician constellations mentioned above.

I for one would be excited to hear anything about the relations between these (mostly) parallel developments.

Ryan H. Lewis summarizes the morning talks, and Andrew Cooper the afternoon talks:

**Christopher Hoffman **gives the first talk about recent advanced in random topology. He began by discussing the stopping time for monotone graph properties. An example is for a sequence of graphs $$\langle G_j \rangle$$ we define:

$$ \tau_{\textrm{connected}} = \min j \textrm{ such that } G_j \textrm{ is connected}$$

For example in the following example we have $$\tau_{\textrm{connected}} = 4$$

He is interested in generalizing two results about monotone Erdos-Renyi random graphs to facts about erdos-renyi random simplicial complexes.

The Linial-Meshulam model is a collection $$Y_i$$ of 2-dimensional simplicial complexes where $$Y_0$$ is a complete graph on $$n$$ vertices and $$Y_i = Y_{i-1} \cup \{\textrm{a } 2 \textrm{ cell} \}.$$

It turns out that the first 2-cycle either has 4 faces with probability converging to $$c_0 = .909$$ or it is larger than $$\frac{n}{\log{(n)}}$$ with probability converging to 1 – $$c_0$$.

To generalize the second result to studying isolated edges one can study when the $$H_1(Y, C)=0$$ for $$\mathbb{Z}, \mathbb{Z}_2,$$ and $$\mathbb{Q}$$ coefficients, as well as the $$\pi_1(Y) = 0$$. He presents a series of results relating the stopping times for these events.

In the future they want to use probabilistic methods to demonstrate the existence of complexes desired but unobtained by classical methods in topology.

**Paul Villeuotrox (sp?)**

Talks about using persistent homology on a wide range of epithelial cells.

By viewing such a structure as a cover of the plane, it’s nerve is a 2D topological space. A filtration of this space is given by assigning to a vertex it’s degree, and each cell the maximum filtration value of it’s boundary.

He has found that persistent homology has proven useful for studying the structure of these cell networks.

He finds that by comparing the barcodes produced from these pipelines to the barcodes produced by complexes built on a random complexes whose underlying graph is endowed with the degree distribution that has been observed empirically, that while persistent $$H_0$$ appears to be similar between these two types of complexes, persistent $$H_1$$ seems to be very different.

**Raul Rabadan** talked about **The Topology of Evolution**.

The only figure in Darwin’s *Origin of Species* is a (mostly-binary) tree. This “Tree of Life” paradigm used starting in the 1970s to analyze genomic sequences. The first major discovery using the tree paradigm was Carl Woese’s 1977 discovery of Domain Archaea.

Woese excluded viruses because they lacked some of the genes he used. But even if he had included them, he would have had trouble: viruses have a high level of horizontal gene transfer, so the choice of tree as a structure to represent the phylogeny is not very good for viruses.

Nor is it very good for bacteria and archaea. Nor is it very food for plants (even Darwin knew this). Nor is it very good for us: when you get gonorrhoea, it gets you! 10% of gonorrhoea genes are human. 8% of human genes are viral. Your genome is something like a “cemetery of past infections” rather than a list of all your ancestors.

If we can’t use a tree to model evolution, what can we use? Mathematically, the problem is:

Given a set of genomes and a way of comparing them, how do we represent their relationships without importing (too many) assumptions from biology?

We would like an answer which is statistical and incorporates the notion of scale. We’d also like to detect when clonal (descent) transfer happens, and when horizontal (non-descent) transfer happens. Answer: use persistent homology to detect the topology of the genetic data!

Persistence detects not just topology, but topology at scales. In 0th homology, scale represents taxa: as we increase the filtration value, we are collecting together more and more distantly related genomes. In 1st homology, a long bar represents a transfer between distantly-related taxa.

For example, though overall flu genes show a lot of cycles, if we restrict our attention to a particular segment there are almost no long-persisting cycles. HIV, on the other hand, has persistent cycles even when we focus on small suites of genes. Thus persistence detects the fact that gene transfer in flu occurs by trading whole segments, whereas in HIV it occurs by trading much smaller units of genetic material.

For human genomic data, persistence bars are about 2 centiMorgans-per-megabase long.

**Michael Robinson** talked about **Morphisms between Logic Circuits**

Logic circuits are described by their truth tables. But computers take time to do computations: fast input switching yields the “wrong” output (as evidenced by the flickering screen on Dr. Robinson’s slides). How can we analyze the failures of circuits due to problems of timing? Use sheaves!

Sheaves allow local specifications (we are really good at understanding small circuits) to determine global behavior (what we need to get a handle on). Plus sheaves whose stalks are vector spaces are `just’ linear-algebraic, so we can compute their cohomology using easy, well-known techniques.

As we try to associate a vector space to each logic gate, we encounter various aspects of engineering practice like *one-hot encoding*.

The zeroth cohomology of the switching sheaf detects the (synchronous) classical logic behavior of the system. The first cohomology of the switching sheaf detects stored information (hence, the possibility of a timing problem in the circuit).

But cohomology of the switching sheaf doesn’t tell us everything. The **categorification **approach says we should consider morphisms to get more information. Given a circuit we want to understand, we can construct a circuit with the same logical behavior.Then we can ask how many morphisms of the switching sheaves there are which cover the identity on inputs and outputs.

Sometimes there aren’t any such morphisms. Sometimes there are a few. Apparently there are never exactly three.

Today’s summaries are provided by Isabel Darcy.

**Omer Bobrowski** talked about Topological Estimation for Super Level Sets. The goal is to determine the homology of an unknown space from a sample of noisy data points. Super-level sets of a density function f correspond to dense regions: {x | f(x) > L}. In general, the density function is not known but can often be estimated. One can try to reconstruct the homology by looking at \(U_n(L, r)\) = the union of balls of fixed radius r around each point in a super level set, {x | f(x) > L}.

But if not enough points are chosen (i.e., L large), then the space may not be adequately covered by \(U_n(L, r)\). If too many points are chosen (i.e., L small), then more noise may be picked up. However one can obtain the correct homology with high probability by looking at how \(U_n(L_1, r)\) includes into \(U_n(L_2, r)\) for \(L_2 < L_1\). This induces a map on their homologies. The image of this map is the homology estimator, which equals the correct homology of the space of interest with high probability.

**Elizabeth Munch** talked about The Interleaving Distance for Reeb Graphs. Reeb graphs provide an efficient description to understand the properties of a real-valued function on a topological space and are useful in many applications. Thus it would be very useful to have a method for comparing two Reeb graphs. Interleavings (and interleaving distances) have been used to compare persistence modules. Interleavings can be applied to Reeb graphs by defining a generalization of a Reeb graph as a functor. One consequence is a concrete algorithm for smoothing a Reeb graph in order to remove noise.

**Peter Bubenik** talked about Generalized Persistence Modules and Stability. Generalized persistence modules is an abstract formulation of persistence modules using category theory which includes many forms of persistence modules that are currently studied. One consequence of this formulation is that one can give simpler common proofs for many standard results such as stability.

**Yuliy Baryshnikov** talked about Integral Operators in Euler World. One can integrate functions that take on a finite number of values using the Euler characteristic as the measure. For example if f(x) = 4 for x in [0, 1] and 0 elsewhere, then the integral of f with respect to the Euler characteristic = 4 times the Euler characteristic of [0, 1] = 4(-1 + 2) = 4. In applications, sometimes it is easier to solve a problem by transforming it into a simpler problem using an integral transform.

An example of an integral transform is convolution: Given functions f and g, one can create a new function, f*g, where the value f*g(x) is obtained from f and g by integrating the product f(t)g(x-t) with respect to the euler characteristic. Given f and f*g, one would like to be able to reconstruct g: that is one would like to calculate the inverse of the Euler integral transform. Cases where one can calculate the inverse transform were discussed.

Summaries for Day 3 are contributed by Rachael Phillips.

**Jeff Erickson **talked about efficiently hex-meshing things with topology. With a hex mesh, a polyhedra with six quadrilateral facets, there can be a quadrilateral mesh that can be extended to a hexahedral mesh of the interior volume. This can only happen when there are an even amount of quadrilaterals and none of the cycles are odd for the hex mesh. If such a mesh exists, then a polyhedron in 3-dimensional Euclidean space with quadrilateral facets can be constructed in polynomial time. These are extended to domains that have disconnected boundaries and are continued from Thurston, Mitchell, and Eppstein, where the odd cycle criteria is trivial. The idea is to look at a quadrilateral figure and extend that figure to the interior. So, the importance is not in the shape of that figure, since we are not looking at the geometry of this figure, but the topology.

**Jose Perea** talked about Obstructions to Compatible Extensions of Mappings. From Betti numbers in 1994 to Zig-Zag persistence in 2009, there have been several classic invariants in algebraic topology. The basic ones being from a point cloud, constructing a filtration and using that filtration to compute Betti numbers, which tell us about the number of k-dimensional holes within a metric space. Instead of this, it would be useful to come up with new ways of encoding multi-scale information from data. The main goal is to be able to fit our data into a model for the best methods. Using extending sections and the retraction problem, Mumford data is used to fit the model. The question is, how far do you have to go for the model to be good? From local to global, the model tells us the death-like events, where an example would be compatible extensions. The birth-like events where the filtration of each level extends to the next. Once these models are found, it extends compatibility once the model has been extended. The main goal being to extend the previous invariant methods to new invariant methods for data analysis.

**Donald Sheehy** talked about Nested Dissection and (Persistent) Homology. Using nested dissection, this is a way of solving systems of linear equations. This method is an improvement of the naive Gaussian elimination. The reason that it is important to improve Gaussian elimination is because it is a long process that takes a lot of computer memory. It is a method that needs to be improved when using computers to solve it. By building a filtered simplicial complex and computing the persistent homology, we can try to speed up the process of elimination. Normally, Gaussian elimination has a running time of $$O(n^3)$$, or even worse using Strassen it is a running time of $$O(n^{\log_27})$$. Nested Dissection removes a random column in a matrix and separates the graph into two pieces. When a matrix is separated in two pieces, it improves the matrix multiplication and using topology, while doing Gaussian on the boundary. The nested dissection computes the persistent homology of the space of the matrix. Using four methods such as Mesh Filtrations, Nested Dissection, Geometric Separation and Output sensitive persistence algorithm, there is a theorem that improves the asymptotic running time of the persistence algorithm.

**Shmuel Weinberger** talked about Complex and Simple “Topological” invariants. It is common in the study of topological data analysis that the values of topological invariants are discussed. When you calculate homology it does not always give you enough information. The goal is not to look at the dynamics, but learn about dynamics. It is possible to look at the persistent homology of any data set, unfortunately, conditions on the noise, “invariance” with high-probability. It is important is to find examples of Probability Approximately Correct (PAC) computable ideas. Knowing when noise is a problem and when it is not is the main goal of this talk.