By Gyulim Jessica Kang
Ronald H. Coase, a British economist and author of Essays on Economics and Economists, once said: “If you torture the data long enough, it will confess.” This quote will resonate with anyone who’s tried data analysis before. Even though data is abundant in modern society, it’s often inconsistent and scattered across various sources. Data analysis is the process of assembling a jumbled jigsaw puzzle, piece by piece, to arrive at the bigger picture: the reason why the data matters.
In pursuit of ever more efficient data analysis, new and improved techniques continue to emerge. Rooted in persistent homology and the premise that the shape of a dataset contains relevant information, topological data analysis, or TDA, is one of the most exciting recent arrivals.
What is Persistent Homology?
Persistent homology is a method for uncovering topological features from data. What is topology? Topology is the branch of mathematics that studies shapes and spaces. What happens when we allow geometric objects to be crumpled, twisted, stretched, and squeezed? Surprisingly, quite a few properties are preserved, and the study of these deformations and the properties that survive them is topology’s primary focus.
The application of persistent homology in data science provides a robust theoretical foundation for a mathematically rigorous analysis of shape within data (Carlsson, 2009). The shape of the data can be interpreted in several forms depending on the data type. For example, let’s say we have data consisting of a finite number of points scattered in a coordinate space; naturally, we start thinking about how these points are roughly shaped or clustered.
This tendency is similar to how we look up at an unusually clear evening sky and see constellations in the shapes of polar bears or scorpions. It’s a human instinct to extract familiar, meaningful shapes from seemingly random and scattered data. TDA aims to encode the persistent homology of a dataset in visual representations accessible to human beings. Ultimately, TDA uses homology to answer meaningful questions such as ‘How can we identify geometric features from data?’ and ‘How do we know that these geometric features are significant?’
Key Concepts
Simplicial Complex
A simplex is the generalization of a triangle, the fundamental shape of geometry, to any number of dimensions. For example, a 0-simplex is a point, a 1-simplex a line segment, a 2-simplex a triangle, a 3-simplex a tetrahedron, and so on (Koplik, 2022). Within TDA, we translate data into a collection of these simplices, called a simplicial complex, and then compute persistent homology on it.
We first draw a circle of radius δ around each data point.
If the distance between two points is less than δ, which means each circle contains the other circle’s center, we connect the two points with a line segment, or a 1-simplex. (Some references instead connect two points as soon as their circles overlap, which amounts to the same construction at twice the radius.)
If three points are all pairwise connected in this way, we fill in the area between them to form a triangle, or a 2-simplex.
This continues in the same pattern for higher-dimensional simplices. From the simplicial complex, we derive the homology groups, which describe its connected components, holes, and cavities (Garin & Tauzin, 2019). H0 commonly represents the group of connected components, H1 the group of holes, and H2 the group of cavities.
(Talebi, 2022)
Let’s test this process by creating a plot of our own! First, we use a NumPy array to generate 8 random points for our point cloud.
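The code behind the plot is not reproduced here, so the following is only a minimal sketch of how such a point cloud might be generated. The seed and the unit-square range are assumptions made for illustration, which means these points will not match the ones in the plot discussed next.

```python
import numpy as np

# Assumption: the seed and the [0, 1) range are illustrative choices,
# not the values used for the plot described in the article.
rng = np.random.default_rng(seed=0)

# 8 points, each with an (x, y) coordinate drawn uniformly from [0, 1).
points = rng.random((8, 2))

print(points.shape)
print(points)
```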
While looking at this collection of points, think about what geometric intuition we can gain. For example, it may appear that the four dots clustered at the top left of the plot are going to form a square.
This simple way of approximating a topological space from a point cloud is called the Vietoris-Rips complex.
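To make the construction concrete, here is a small sketch that builds the edges and triangles of a Vietoris-Rips complex at a single scale. The function name `vietoris_rips` and the toy cloud are hypothetical, and the sketch assumes the convention that two points are joined when their distance is below δ.

```python
from itertools import combinations
from math import dist

def vietoris_rips(points, delta):
    """Return (edges, triangles) of the Vietoris-Rips complex at scale delta.

    Two points are joined when their distance is below delta; a triangle is
    filled in when all three of its edges are present.
    """
    n = len(points)
    edges = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) < delta}
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if {(i, j), (i, k), (j, k)} <= edges]
    return sorted(edges), triangles

# Hypothetical toy cloud: the corners of a unit square plus one distant point.
cloud = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (3.0, 0.5)]

edges, triangles = vietoris_rips(cloud, delta=1.2)
print(edges)      # the four sides of the square
print(triangles)  # empty: the diagonals (length about 1.41) are not yet edges
```

At `delta=1.2` the square's sides form a loop with nothing filling it, which is exactly the kind of hole that H1 records. Raising `delta` to 1.5 adds the diagonals, and the triangles that appear fill the hole in.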
The visualizations in the video above reveal the transformation of the simplicial complex as epsilon, the radius we called δ earlier, grows. As the overlap between the circles increases, edges and triangles appear, separate connected components (H0) merge, and holes (H1) are born and later filled in. The total number of components can only decrease over time; if epsilon is sufficiently large, there will eventually be only one connected component remaining in the complex. In summary, the number of components and holes can be important indicators of key geometric features, and we can obtain them by computing the homology of the complex.
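The shrinking component count can be checked directly. The sketch below uses a union-find structure, which is one common way to count connected components and is not necessarily how the video was produced. The function name `count_components` and the toy cloud are hypothetical.

```python
from itertools import combinations
from math import dist

def count_components(points, epsilon):
    """Count connected components when points closer than epsilon are joined."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links up to the representative of i's component.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) < epsilon:
            parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(len(points))})

# Hypothetical toy cloud: the corners of a unit square plus one distant point.
cloud = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (3.0, 0.5)]

for epsilon in (0.5, 1.2, 2.5):
    print(epsilon, count_components(cloud, epsilon))
```

For this cloud the count goes from 5 (every point on its own) to 2 (the square and the distant point) to 1.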
The Vietoris-Rips complex gave us geometric intuition. However, if we only consider one Vietoris-Rips complex, we can’t know for sure which epsilon value best represents the shape of the data. Instead, we need to summarize how the number of components and holes changes across every epsilon value. The concept of persistent homology originated from this idea. Building on our previous definition, we can define persistent homology as a way of measuring when homological features are born and when they die in a simplicial complex as the epsilon value, or radius, changes, in order to extract meaning from data.
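For connected components, the birth and death values can be computed with very little machinery. The sketch below covers only H0 and processes edges from shortest to longest, merging components as it goes; tracking holes (H1) requires more than this and is usually left to a dedicated TDA library. The function name `h0_persistence` and the toy cloud are hypothetical.

```python
from itertools import combinations
from math import dist, inf

def h0_persistence(points):
    """Return (birth, death) pairs for connected components (H0).

    Every component is born at scale 0. Processing edges from shortest to
    longest, a component dies at the length of the edge that merges it into
    another one; the last surviving component never dies.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:      # this edge merges two components
            parent[root_i] = root_j
            pairs.append((0.0, length))
    pairs.append((0.0, inf))      # the component that persists forever
    return pairs

# Hypothetical toy cloud: the corners of a unit square plus one distant point.
cloud = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (3.0, 0.5)]

for birth, death in h0_persistence(cloud):
    print(f"born at {birth:.2f}, dies at {death:.2f}")
```

Features with a large gap between birth and death are the persistent ones. Here the distant point stays separate until roughly epsilon = 2.06, long after the square's corners have merged at epsilon = 1, which suggests two well-separated clusters.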
Conclusion
In closing, persistent homology has many applications in data science. It is the primary method used in TDA to study qualitative features of data that persist across multiple scales. It has several advantages as well: it’s robust to perturbations of input data, independent of dimensions and coordinates, and provides a compact representation of the qualitative features of the input. For future work, I would like to dive deeper into the mathematical background of persistent homology along with the various applications of TDA in real life.
References