MMseqs2: Deep understanding of MSA Generation & Optimization

Written by Danial Gharaie Amirabadi

Published 2024-10-17

In protein structure prediction workflows, generating Multiple Sequence Alignments (MSAs) is a key step for improving the accuracy of models such as AlphaFold2 and RoseTTAFold. Traditionally, this process is time-consuming. However, using the MMseqs2 tool provides significant speed-ups while maintaining high accuracy. Here's a closer look at the optimized method for faster MSA generation and filtering, adapted from the ColabFold pipeline.

Schematics of MSA generation, adapted from Dr. Sergey Ovchinnikov's talk: Review and Discussion of AlphaFold3

MSA Generation with MMseqs2

MMseqs2 enables fast generation of MSAs by utilizing a more efficient search process. The query sequence is first searched against the UniRef30 database, a clustered version of UniRef100, with three iterations to retrieve homologous sequences. By searching consensus sequences (29.3 million) instead of the full UniRef100 (277.5 million sequences), a ~10-fold speed improvement is achieved. This process retains sensitivity, allowing for the identification of relevant sequences.

Once the consensus sequences are found, the individual members of each identified UniRef100 cluster are realigned using profiles from the last search iteration, after that they are filtered and added to the MSA. To further refine this, iterative searches are also conducted against the BFD/MGnify or ColabFoldDB databases, extending the search space.

Diversity-Aware Filtering

To ensure the final MSA remains manageable and diverse, MMseqs2 uses an extended version of the HHblits filtering algorithm. This filter is applied in multiple stages during MSA construction:

Initial Cluster Filtering: During UniRef cluster expansion, clusters are filtered to ensure that no two cluster pairs have higher than 95% sequence identity.
Post-Realignment Filtering: After realignment, only hits with a qsc (minimum score per column with master sequence) of 0.8 or higher are retained if there are at least 100 hits. Other thresholds, such as identity, are disabled to allow broader sequence inclusion.
Final MSA Filtering: The MSA is further refined using identity buckets to limit redundancy, keeping the most diverse 3,000 sequences in identity ranges between 0.0–1.0. Buckets with fewer than 1,000 hits are not filtered.

MMseqs2 Database Optimization

A pre-computed index stores cluster consensus sequences, member sequences, and pairwise alignments of the cluster representatives to the cluster members, significantly reducing computational overhead by caching these elements. This allows MMseqs2 to handle large-scale database queries efficiently.

To reduce the memory footprint, databases like BFD and MGnify were merged and filtered. The final database was reduced from 2.5 billion sequences to 513 million, requiring only 84 GB of RAM instead of ~517 GB.

ColabFoldDB was built by expanding the BFD/MGnify databases with metagenomic sequences from various environments. Sequences from multiple sources, including SMAG, MetaEuk, TOPAZ, MGV, GPD, and an updated MetaClust, were compared against BFD/MGnify centroids using MMseqs2. Sequences with at least 30% identity and 90% overlap were assigned to the relevant clusters. Remaining sequences were clustered and appended to the database. Redundancy was reduced by retaining only the 10 most diverse sequences per cluster. The final ColabFoldDB contains over 209 million representative sequences and 738 million members.

MSA Pairing Process

Paired Multiple Sequence Alignments (MSAs) play a critical role in improving the accuracy of protein complex predictions in AlphaFold2. For these predictions to be meaningful, it is essential to ensure that the orthologous genes—those that are evolutionarily related across species—are correctly paired. The pairing strategy is built around the identification of orthologous sequences using taxonomic identifiers, ensuring that sequences are aligned based on evolutionary relationships.

Taxonomic-Based Pairing Strategy

The process of pairing sequences begins by searching each distinct protein sequence within a complex against the UniRef100 database. Initially, the query sequence is searched against the UniRef30 database— a clustered version of UniRef100— using three iterations to achieve a ~10-fold speed improvement while maintaining sensitivity. Once homologous consensus sequences are identified, their UniRef100 cluster members are realigned and added to the MSA. Further iterative searches are performed against additional databases, such as BFD, MGnify, or ColabFoldDB, to expand the search space.

For pairing, only hits that cover all proteins in the complex within a single species are considered. From the retrieved hits, the best match, determined by the smallest E-value and an alignment covering at least 50% of the query sequence, is selected.

This strategy, inspired by the approach of Bryant et al., focuses on pairing orthologous sequences from the same species, ensuring that they represent the same functional gene across different organisms. The resulting paired alignments significantly enhance the ability of AlphaFold2 to accurately predict protein-protein interactions within complexes.

Retrieving Taxonomic Labels for Pairing

To enable accurate pairing, taxonomic identifiers are used to group sequences by species. For each UniRef100 sequence found during the search process, the associated taxonomic identifier is retrieved from the NCBI Taxonomy database. Specifically, the taxonomic labels are extracted from the "lowest common ancestor" field of each UniRef100 sequence, which provides a reliable evolutionary relationship between the sequences. This information is stored in the uniref100.xml file and is used to ensure that only sequences from the same species are paired together.

MMseqs2's Pairaln Module

The pairing process is implemented through MMseqs2's pairaln module, which automates the task of finding and aligning suitable sequence pairs. It handles the filtering of hits based on taxonomic and alignment criteria, ensuring that only orthologous sequences that meet the minimum coverage requirements are paired. By using this module, the MSA is constructed with accurate sequence relationships that reflect evolutionary and functional connections between proteins within a complex.

By pairing MSAs with a focus on taxonomic accuracy and genomic relationships, this process enables AlphaFold2 to make more reliable predictions about protein complex interactions.

Using the Neurosnap Tools Package to Retrieve MSAs

The neurosnap tools package provides a convenient method to generate Multiple Sequence Alignments (MSAs) using ColabFold's API. With the run_mmseqs2_modes function from the package, you can easily retrieve paired or unpaired MSAs for your protein sequences. Below is an example of how to use the run_mmseqs2_modes function to generate an MSA:

from neurosnap.msa import run_mmseqs2_modes

# Input sequence(s)
sequences = [
    "MARGPGLAPPPLRLPLLLLVLAAVTGHTAAQDNCTCPTNKMTVCSPDGPGGRCQCRALGSGMAVDCSTLTSKCLLLKARMSAPKNARTLVRPSEHALVDNDGLYDPDCDPEGRFKARQCNQTSVCWCVNSVGVRRTDKGDLSLRCDELVRTHHILIDLRHRPTAGAFNHSDLDAELRRLFRERYRLHPKFVAAVHYEQPTIQIELRQNTSQKAAGDVDIGDAAYYFERDIKGESLFQGRGGLDLRVRGEPLQVERTLIYYLDEIPPKFSMKRLTAGLIAVIVVVVVALVAGMAVLVITNRRKSGKYKKVEIKELGELRKEPSL",
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSYISSSSSYTNYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTASYYCARGLAGVWGIDVWGQGTLVTVSS",
]

# Generate MSA with default parameters
run_mmseqs2_modes(
    seq=sequences,         # The sequence or list of sequences
    output="my_msa",       # Name for the output directory
    cov=50,                # Coverage threshold
    id=90,                 # Identity threshold
    max_msa=2048,          # Maximum number of MSA sequences
    mode="unpaired_paired" # Alignment mode
)

Parameters

seq: Can be a single sequence (string) or a list of sequences to generate the MSA for.
Output: Specifies the Output directory path.
cov: Minimum coverage percentage for the sequences in the alignment.
id: Identity threshold for filtering similar sequences.
max_msa: The maximum number of sequences to include in the final MSA.
mode: Defines the type of MSA to generate: "unpaired", "paired", or "unpaired_paired".

The run_mmseqs2_modes function supports three different modes for generating MSAs:

"unpaired": This mode generates MSAs for each sequence independently. The function will create separate MSAs for each input sequence and combine them.
"paired": This mode is designed for cases where you have multiple related sequences, such as different chains of a protein complex. It attempts to find paired homologs for all input sequences together. This can help maintain the relationship between the sequences in the resulting MSA.
"unpaired_paired": This is a combination of the previous two modes. It first attempts to generate a paired MSA (like in "paired" mode). If the resulting MSA doesn't reach the maximum number of sequences (specified by max_msa), it then supplements with unpaired MSAs for each sequence (like in "unpaired" mode). This mode aims to provide the most comprehensive MSA by leveraging both paired and unpaired MSAs.

The function adapts its behavior based on the chosen mode and the number of input sequences to provide the most appropriate MSA for your needs.

Conclusion

Multiple Sequence Alignments (MSAs) are essential for protein structure prediction. MMseqs2 offers an efficient method for generating MSAs, significantly reducing computation time while maintaining accuracy. Complementary pairing strategies, particularly for protein complexes, further improve structure predictions. These advancements in MSA generation and pairing contribute to ongoing progress in computational biology and structural bioinformatics.

Acknowledgements

Thanks to Dr. Milot Mirdita for reviewing and providing helpful feedback on this blog post.

Accelerate your lab's
research today

Interested in getting a license? Contact Sales.