neurosnap.algos.clusterprot module

Implementation of the ClusterProt algorithm from https://neurosnap.ai/service/ClusterProt. ClusterProt is an algorithm for clustering proteins by their structural similarity.

neurosnap.algos.clusterprot.ClusterProt(proteins, chain=None, umap_n_neighbors=0, proj_1d_algo='umap', dbscan_eps=0, dbscan_min_samples=0, eps_scale_factor=0.05)

Run the ClusterProt algorithm on some input proteins.

Clusters proteins using their structural similarity.

Algorithm Description:
  1. Ensure all protein structures are fully loaded

  2. Compute a distance matrix for each loaded protein using the alpha carbons of the selected regions

  3. Get the flattened upper triangle of each distance matrix, excluding the diagonal

  4. Align all the proteins to the reference protein (optional but useful for analysis like the animation)

  5. Create the 2D projection using UMAP

  6. Create clusters for the 2D projection using DBSCAN

  7. Create the 1D projection using either UMAP or PCA (optional but useful for organizing proteins 1-dimensionally)
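Steps 2 and 3 can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the function name `distance_features` and the toy coordinates are hypothetical, and the sketch assumes each structure has already been reduced to an array of alpha-carbon coordinates of equal length.

```python
import numpy as np


def distance_features(ca_coords: np.ndarray) -> np.ndarray:
    """Return the flattened upper triangle of a CA-CA distance matrix.

    ca_coords is an (n_residues, 3) array of alpha-carbon coordinates.
    """
    # Step 2: pairwise Euclidean distances between all alpha carbons.
    diffs = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist_matrix = np.linalg.norm(diffs, axis=-1)
    # Step 3: keep only entries strictly above the diagonal (k=1 skips it).
    rows, cols = np.triu_indices(len(ca_coords), k=1)
    return dist_matrix[rows, cols]


# Hypothetical toy input: two "structures" with three residues each.
structure_a = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [3.0, 4.0, 0.0]])
structure_b = structure_a * 2.0

# One feature row per structure, each holding n * (n - 1) / 2 distances.
features = np.stack([distance_features(structure_a), distance_features(structure_b)])
print(features.shape)
```

Because the distance matrix is symmetric with a zero diagonal, the upper triangle carries all of its information, which is presumably why only that part is passed on to the projection steps.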

Parameters:
  • proteins (Union[List[Structure], StructureEnsemble, StructureStack]) – Structures to cluster. This can be either a list of single-model Structure objects or a multi-model StructureEnsemble/StructureStack, in which case each model is clustered separately.

  • chain (Optional[str]) – Chain ID for ClusterProt to use (must be consistent across all structures). If not provided, all chains are used.

  • umap_n_neighbors (int) – The n_neighbors value to provide to UMAP for the main projection. Leave as 0 to automatically calculate the optimal value. Prior to the 2024-06-14 update this value was fixed at 7.

  • proj_1d_algo (str) – Algorithm to use for the 1D projection. Can be either "umap" or "pca"

  • dbscan_eps (float) – The eps value to provide to DBSCAN. Leave as 0 to automatically calculate the optimal value. Prior to the 2024-04-15 update this value was fixed at 0.5.

  • dbscan_min_samples (int) – The min_samples value to provide to DBSCAN. Leave as 0 to automatically calculate the optimal value. Prior to the 2024-04-15 update this value was fixed at 5.

  • eps_scale_factor (float) – Fraction of the 2D data’s diagonal range used to set DBSCAN’s eps. Recommended: 0.05-0.10 for larger datasets or finer clusters; 0.15 for smaller datasets or broader clustering.
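To make the role of eps_scale_factor concrete, the sketch below shows one plausible reading of "fraction of the 2D data's diagonal range": the scale factor multiplied by the diagonal of the projection's bounding box. The function name `estimate_eps` is hypothetical, and the exact formula used by ClusterProt when dbscan_eps is left at 0 is an assumption here.

```python
import numpy as np


def estimate_eps(projection_2d, eps_scale_factor: float = 0.05) -> float:
    """Assumed heuristic: eps = scale factor * bounding-box diagonal."""
    points = np.asarray(projection_2d, dtype=float)
    # Extent of the data along each axis of the 2D projection.
    extent = points.max(axis=0) - points.min(axis=0)
    # Length of the bounding-box diagonal of the projection.
    diagonal = float(np.linalg.norm(extent))
    return eps_scale_factor * diagonal


# Hypothetical projection spanning 3 units in x and 4 units in y (diagonal 5).
toy_projection = [[0.0, 0.0], [3.0, 4.0], [1.0, 2.0]]
eps_fine = estimate_eps(toy_projection, eps_scale_factor=0.05)
eps_broad = estimate_eps(toy_projection, eps_scale_factor=0.15)
```

Under this reading, a larger scale factor yields a larger neighborhood radius, so DBSCAN merges more points into fewer, broader clusters.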

Returns:

A dictionary containing the results from the algorithm, with the following keys:

  • structures (list): Sorted list of all the Neurosnap structures aligned to the reference structure.

  • titles (list<str>): Display labels for each structure.

  • projection_2d (list<list<float>>): Generated 2D projection of all the structures.

  • cluster_labels (list<float>): List of the labels for each of the structures.
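As an illustration of how the returned dictionary might be consumed, the sketch below groups display titles by cluster label. The helper `group_titles_by_cluster` and the mock results are hypothetical. It assumes that titles and cluster_labels are index-aligned, and that noise points carry the label -1 as in scikit-learn's DBSCAN; whether ClusterProt preserves that convention is not stated here.

```python
from collections import defaultdict


def group_titles_by_cluster(cp_results):
    """Map each cluster label to the titles of the structures assigned to it."""
    clusters = defaultdict(list)
    for title, label in zip(cp_results["titles"], cp_results["cluster_labels"]):
        # Labels are documented as floats; cast to int for readable keys.
        clusters[int(label)].append(title)
    return dict(clusters)


# Mock results using only the documented keys (values are made up).
mock_results = {
    "titles": ["model_1", "model_2", "model_3"],
    "cluster_labels": [0.0, 0.0, -1.0],
    "projection_2d": [[0.1, 0.2], [0.15, 0.22], [5.0, 5.0]],
}
grouped = group_titles_by_cluster(mock_results)
```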

neurosnap.algos.clusterprot.animate_results(cp_results, animation_fpath='cluster_prot.gif')

Animate the ClusterProt results using the aligned proteins and 1D projections.

Parameters:
  • cp_results (Dict) – Results object from ClusterProt run

  • animation_fpath (str) – Output filepath for the animation of all the proteins

neurosnap.algos.clusterprot.create_figure_plotly(cp_results)

Create a scatter plot of the 2D projection from ClusterProt using plotly express.

NOTE: The plotly package must be installed to use this function.

Parameters:
  • cp_results (Dict) – Results object from ClusterProt run
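The three functions in this module can be chained as in the sketch below. The wrapper `run_clusterprot_workflow` is hypothetical, and so is the lazy import, which only serves to let the sketch be defined without neurosnap installed. How the input structures are loaded is outside the scope of this page, so `proteins` is assumed to already be a list of Structure objects or a StructureEnsemble/StructureStack. Whether create_figure_plotly displays the figure or returns it is not specified here, so its return value is ignored.

```python
def run_clusterprot_workflow(proteins, chain=None, animation_fpath="cluster_prot.gif"):
    """Cluster the structures, write the animation, and build the scatter plot."""
    # Imported lazily so this sketch can be defined without neurosnap installed.
    from neurosnap.algos.clusterprot import (
        ClusterProt,
        animate_results,
        create_figure_plotly,
    )

    # Leave the UMAP and DBSCAN parameters at 0 so they are chosen automatically.
    results = ClusterProt(proteins, chain=chain, proj_1d_algo="pca")
    # Write the animation of the aligned proteins to the requested path.
    animate_results(results, animation_fpath=animation_fpath)
    # Requires the plotly package.
    create_figure_plotly(results)
    return results
```

The returned dictionary is the same results object described under ClusterProt, so it can be inspected or post-processed after the plots are generated.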