Oleh Sekirkin

Business Professional & Data Enthusiast

Philadelphia, Pennsylvania, United States

Topological Data Analysis

Python Data Analysis

    1. What is TDA
    2. Key Concepts
    3. Code the Algorithm

1. What is TDA

First, let’s explain what TDA, or Topological Data Analysis, means. The basics are quite simple to understand: it is a method for analyzing complex data by focusing on its shape and structure. Topology is a branch of mathematics that studies the properties of shapes that remain unchanged under continuous deformations such as stretching and twisting.

How it works:

    1) Represents data as a set of points in a high-dimensional space.
    2) Then looks at how these points are connected at different scales.
    3) Identifies key features like clusters, loops or holes in the data.
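The three steps above can be sketched for the simplest feature, clusters (connected components). This is a hand-rolled illustration, not what a TDA library does internally: the function names `component_merge_scales` and `count_components` and the toy point cloud are assumptions made for this example, and loops or holes are not covered here.

```python
import itertools
import math


def component_merge_scales(points):
    """Return the scales at which separate clusters merge.

    Every point starts as its own cluster at scale 0. As the scale grows,
    two clusters merge once the shortest edge between them fits the scale.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 2: look at all pairwise connections, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    merge_scales = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # this edge joins two different clusters
            parent[root_i] = root_j
            merge_scales.append(length)
    return merge_scales


def count_components(merge_scales, num_points, scale):
    """Step 3: number of clusters still separate at the given scale."""
    return num_points - sum(1 for s in merge_scales if s <= scale)


# Step 1: data as a set of points (two tight groups, far apart).
cloud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
merges = component_merge_scales(cloud)
for scale in (0.05, 1.0, 10.0):
    print(scale, count_components(merges, len(cloud), scale))
```

At a very small scale every point is its own cluster, at an intermediate scale the two groups show up, and at a large scale everything merges into one. The wide range of scales over which exactly two clusters survive is what marks them as a significant feature.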

The main benefits of this type of data analysis are that it can handle high-dimensional and complex data, and that it can reveal insights that traditional statistical methods might miss.

Unlike traditional statistical methods that often rely on averages or linear relationships, TDA can capture non-linear and geometric features in data. While conventional techniques might struggle with high-dimensional or noisy data, TDA excels in these areas. It complements rather than replaces traditional methods, offering a different perspective that can uncover hidden patterns or structures in complex datasets.

2. Key Concepts

Some key concepts before we start:

    - Simplicial Complexes: the building blocks of Topological Data Analysis. They are a way to represent data points and the relationships between them. A 0-simplex is a point, a 1-simplex is a line segment, a 2-simplex is a triangle, and a 3-simplex is a tetrahedron. These are combined to form more complex structures that represent the data’s topology.

    - Persistent Homology: core technique in TDA that tracks how topological features (connected components, loops, voids) appear and disappear as we change the scale at which we observe the data.

    - Persistence Diagrams: visual representations of persistent homology. They show when topological features appear and disappear across different scales. Points far from the diagonal in these diagrams represent more significant, persistent features.

    - Homology Groups: algebraic structures that capture the essence of topological features; 0-dimensional homology groups represent connected components, 1-dimensional homology groups represent loops, 2-dimensional homology groups represent voids or cavities. Higher-dimensional groups capture more complex structures.

    - Wasserstein Distance: think of it as a measure for comparing two persistence diagrams. A simple way to understand it is with the following example: imagine you have two stacks of sugar (two sets of data). The Wasserstein Distance measures the least amount of effort you would need to move the sugar grains around to transform one stack into the shape of the other. Less effort means the stacks are more similar; if you need to move a lot of grains around, the stacks are different.
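To make the sugar analogy concrete, here is a minimal brute-force sketch of a Wasserstein distance between two tiny persistence diagrams, each given as a list of (birth, death) points. The function names and the toy diagrams are assumptions for illustration. The cost convention used here (Euclidean distance between matched points, and Euclidean distance to the diagonal for unmatched points) is one common choice and is not necessarily identical to what a given library uses. Because it tries every possible matching, it is only usable for a handful of points.

```python
import itertools
import math


def diagonal_cost(point):
    """Euclidean distance from a (birth, death) point to the diagonal."""
    birth, death = point
    return (death - birth) / math.sqrt(2)


def wasserstein_brute_force(dgm_a, dgm_b):
    """Cheapest way to turn diagram A into diagram B.

    Each point is either moved onto a point of the other diagram or
    pushed onto the diagonal (the feature is treated as noise).
    """
    m, n = len(dgm_a), len(dgm_b)
    size = m + n
    # Rows: points of A, then n diagonal slots.
    # Columns: points of B, then m diagonal slots.
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < m and j < n:
                cost[i][j] = math.dist(dgm_a[i], dgm_b[j])
            elif i < m:
                cost[i][j] = diagonal_cost(dgm_a[i])
            elif j < n:
                cost[i][j] = diagonal_cost(dgm_b[j])
            # diagonal slot matched to diagonal slot costs nothing
    return min(
        sum(cost[i][perm[i]] for i in range(size))
        for perm in itertools.permutations(range(size))
    )


dgm_a = [(0.0, 1.0), (0.2, 0.3)]  # one persistent feature, one short-lived
dgm_b = [(0.0, 1.1)]              # one persistent feature
print(wasserstein_brute_force(dgm_a, dgm_b))
print(wasserstein_brute_force(dgm_a, dgm_a))
```

In the first comparison the two persistent features are matched to each other at small cost, and the short-lived point is sent to the diagonal, so little "sugar" has to move. Comparing a diagram with itself costs nothing.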

Topological data analysis has found applications across diverse fields, but in this post I will use it for financial analysis, trying to detect potential market crashes in the S&P500, as it offers a method to determine how similar two time periods might be, based on their underlying patterns.

3. Code the Algorithm

 

from AlgorithmImports import *
import numpy as np
from ripser import Rips
import persim

class MuscularMagentaLion(QCAlgorithm):
    def Initialize(self):
        # Set the start date and the starting cash for the algorithm to work with.
        self.SetStartDate(2024, 1, 1)
        self.SetCash(10000000)
        
        self.eq = self.AddEquity("SPY", Resolution.Hour).Symbol
        
        # Set parameters for the algorithm: first the window size for the calculations,
        # then the threshold for trading decisions.
        self.lookback = 20
        self.threshold = 1
        
        self.rips = Rips(maxdim=2)
        
        # Create a rolling window to store closing prices.
        self.close_window = RollingWindow[float](self.lookback*5)
        
        self.WarmUpFinished = False
    
    def OnData(self, data: Slice):
        if self.IsWarmingUp:
            return
            
        if not (data.ContainsKey(self.eq) and data[self.eq] is not None):
            return
            
        self.close_window.Add(data[self.eq].Close)
        
        if not self.close_window.IsReady:
            return
            
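        # Rolling window to numpy array. RollingWindow keeps the most recent
        # value at index 0, so reverse it to get chronological order.
        closes_list = list(self.close_window)[::-1]
        self.prices = np.array(closes_list)
        
        # Log returns between consecutive closes.
        lgr = np.log(self.prices[1:] / self.prices[:-1])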
        
        # Compute Wasserstein Distance.
        wasserstein_dists = self.compute_wasserstein_distances(lgr, self.lookback, self.rips)
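        # Collapse the (n, 1) array into a single scalar, ignoring any unfilled (NaN) entries.
        wd = float(np.nansum(wasserstein_dists))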
        
        # Trading logic based on Wasserstein Distance.
        if self.Portfolio[self.eq].IsShort:
            if wd >= self.threshold:
                self.SetHoldings(self.eq, 0.80, True)
            else: return
        elif self.Portfolio[self.eq].IsLong:
            if wd <= self.threshold:
                self.SetHoldings(self.eq, -0.80, True)
            else: return
        else: self.SetHoldings(self.eq, 0.80)
    
    def compute_wasserstein_distances(self, log_returns, window_size, rips):
        """Compute the wasserstein distances between consecutive windows of log returns."""
        n = len(log_returns) - (2 * window_size) + 1
        distances = np.full((n, 1), np.nan)
        
        for i in range(n):
            segment1 = log_returns[i:i+window_size].reshape(-1, 1)
            segment2 = log_returns[i+window_size:i+(2*window_size)].reshape(-1, 1)
            
            if segment1.shape[0] != window_size or segment2.shape[0] != window_size:
                continue
                
            # Compute Persistence Diagrams.
            dgm1 = rips.fit_transform(segment1)
            dgm2 = rips.fit_transform(segment2)
            
            # Calculate Wasserstein Distance between the 1-dimensional (H1, loops) persistence diagrams.
            distance = persim.wasserstein(dgm1[1], dgm2[1], matching=False)
            distances[i] = distance
            
        return distances
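To see which segments `compute_wasserstein_distances` compares, the windowing can be sketched on its own without any TDA library. The helper names `log_returns` and `window_pairs` and the synthetic price series are assumptions for illustration; only the index arithmetic mirrors the method above.

```python
import math


def log_returns(prices):
    """Log return between each pair of consecutive prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


def window_pairs(num_returns, window_size):
    """Index triples (start, middle, end) for each comparison.

    segment1 covers [start:middle] and segment2 covers [middle:end],
    matching the slicing used in compute_wasserstein_distances.
    """
    n = num_returns - 2 * window_size + 1
    return [(i, i + window_size, i + 2 * window_size) for i in range(max(n, 0))]


lookback = 20
# Hypothetical, steadily rising prices: lookback * 5 = 100 values,
# the same size as the rolling window in the algorithm.
prices = [100.0 + 0.5 * k for k in range(lookback * 5)]
returns = log_returns(prices)
pairs = window_pairs(len(returns), lookback)
print(len(returns), len(pairs), pairs[0], pairs[-1])
```

With a lookback of 20 and 100 stored prices there are 99 log returns and 60 pairs of adjacent windows, so every new bar triggers 120 persistence computations. That is worth keeping in mind when choosing the resolution and the lookback.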