Clustering ========== .. toctree:: :hidden: Sometimes, we want to group 5C interactions into *clusters* according to some perceived spatial and architectural related-ness. This can happen when we see several nearby "pixels" on a 5C heatmap with high intensities. This section explains the functionality of the ``lib5c`` clustering framework, which can be used to identify these clusters using any of a suite of clustering algorithms. Command-line interfaces ----------------------- A command-line interface to this clustering framework exists as a separate repository, which can be found at https://bitbucket.org/creminslab/3d-clusterbot Exposed functionality --------------------- The algorithms which make up the clustering framework can be found in the :mod:`lib5c.algorithms.clustering` subpackage. Core data structures ~~~~~~~~~~~~~~~~~~~~ peak (``Dict[str, numeric]``) Withing the clustering framework, the term *peak* is used to refer to an object that represents a single interaction. The data structure used to represent a peak is a dictionary of the form:: { 'x': 4, 'y': 7, 'value': 12.43 } we can think of peaks as an unfolded contact matrix, where each entry in the contact matrix gets represented as a separate dictionary. This representation may seem overly verbose, but it has the dual advantage of becoming sparse when interactions when peaks with values below some threshold are excluded from clustering, and of providing a simple representation of clusters as lists of peaks (see below). cluster (``List[Dict[str, numeric]]``) Within the clustering framework, the term *cluster* is used to refer to a list of peaks that have been determined to belong together. Core API ~~~~~~~~ Many different paradigms for clustering operations exist. There are three broadly-defined clustering operations: 1. cluster assembly (the initial construction of clustered from a flat list of candidate peaks) 2. cluster merging (the merging of two clusters into one cluster) 3. cluster splitting (the splitting of clusters into subclusters) Cluster assembly ^^^^^^^^^^^^^^^^ The API for cluster assembly is to define a single function :: make_clusters(peaks, **kwargs) that takes in a list of candidate peaks and returns a list of clusters. The cluster assembly paradigms available are: knn, via :func:`lib5c.algorithms.clustering.knn.make_clusters` Assembles clusters by classifying each peak into a cluster according to the cluster membership of its nearest neighbors. adjacency, via :func:`lib5c.algorithms.clustering.adjacency.make_clusters` Assumes the clusters are simply the connected components made up of peaks, as judged by spatial adjacency. greedy, via :func:`lib5c.algorithms.clustering.greedy.make_clusters` Attempts to grow clusters in a greedy fashion by absorbing all nearby peaks and clusters. Cluster merging ^^^^^^^^^^^^^^^ The API for cluster merging is to define a single function :: merge_to_which(clusters) which takes in a list of clusters and returns the index of the cluster that ``clusters[0]`` should be merged into, or -1 if ``clusters[0]`` shouldn't be merged. The cluster merging paradigms available are: adjacency, via :func:`lib5c.algorithms.clustering.adjacency.merge_to_which` Merges adjacent clusters together. enclave, via :func:`lib5c.algorithms.clustering.enclave.merge_to_which` Merges clusters together if one cluster completely surrounds another. To recursively merge a list of clusters using a ``merge_to_which()`` function, a utility function is provided as :func:`lib5c.algorithms.clustering.util.merge_clusters` which takes in the list of clusters and a reference to the ``merge_to_which()`` function that defines the merge rule to use, and returns a list of recursively-merged clusters. Cluster splitting ^^^^^^^^^^^^^^^^^ The API for cluster splitting is to define a single function :: split_clusters(clusters, **kwargs) which takes in a list of clusters and returns a list of clusters that have been recursively split according to some splitting rule. By convention, the splitting rule should be defined in a function called ``split_cluster()``, but different splitting paradigms may use different signatures for this function. The cluster splitting paradigms available are: valley, via :func:`lib5c.algorithms.clustering.valley.split_clusters` Splits clusters by identifying "valleys" between the "mountains" that correspond to true clusters. quasicontiguity, via :func:`lib5c.algorithms.clustering.quasicontiguity.split_clusters` Splits clusters according to a "quasicontiguity" criterion where clusters that are physically separated into two sufficiently large and sufficiently distant subcomponents get split into separate clusters.