neurosnap.msa module#
Provides functions and classes related to processing protein sequence data.
- neurosnap.msa.align_mafft(seqs, ep=0.0, op=1.53, threads=8)[source]#
Generates an alignment using mafft.
- Parameters:
seqs (
Union[str,List[str],Dict[str,str]]) –Can be:
fasta file path,
list of sequences, or
dictionary where values are AA sequences and keys are their corresponding names/IDs
ep (
float) – ep value for mafft, default is 0.00op (
float) – op value for mafft, default is 1.53threads (
int) – Number of MAFFT threads to use (default: 8)
- Return type:
- Returns:
A tuple of the form
(out_names, out_seqs)out_names: list of aligned protein namesout_seqs: list of corresponding protein sequences
- neurosnap.msa.alignment_coverage(seq1, seq2)[source]#
Calculate alignment coverage (%) for two aligned sequences. First sequence should the query sequence in most cases
- neurosnap.msa.consensus_sequence(sequences)[source]#
Generates the consensus sequence from a list of aligned sequences.
The consensus is formed by taking the most common character at each position.
- Parameters:
sequences (
List[str]) – A list of equal-length sequences (e.g., amino acid or DNA).- Return type:
- Returns:
The consensus sequence.
- Raises:
ValueError – If the sequence list is empty or sequences are of unequal lengths.
- neurosnap.msa.pad_seqs(seqs, char='-', truncate=False)[source]#
Pads all sequences to the longest sequences length using a character from the right side.
- Parameters:
- Return type:
- Returns:
The padded sequences
- neurosnap.msa.read_msa(input_fasta, *, size=inf, allow_chars='', drop_chars='', remove_chars='*', uppercase=True, name_allow_all_chars=False, query=None, cov=0, id=0)[source]#
Reads an MSA, a3m, or fasta file and yields (name, seq) pairs as a stream. Returned headers will consist of all characters up until the first space with the “|” character replaced with an underscore.
- Parameters:
input_fasta (
Union[str,TextIOBase]) – Path to read input a3m file, fasta as a raw string, or a file-handle like object to readsize (
float) – Number of rows to readallow_chars (
str) – Sequences that contain characters not included within STANDARD_AAs+allow_chars will throw an exceptiondrop_chars (
str) – Drop sequences that contain these characters. For example,"-X"remove_chars (
str) – Removes these characters from sequences. For example,"*-X"uppercase (
bool) – Converts all amino acid chars to uppercase when Truename_allow_all_chars (
bool) – Uses the entire header string for names instead of the standard regex patternquery (
Optional[str]) – Query amino acid sequence. If not provided, the first sequence in the MSA is used.cov (
int) – Minimum percentage of query sequence coverage required to keep a sequence. It measures the proportion of non-gap positions in the query that are aligned to non-gap positions in the candidate. For example, with a 50% threshold, at least half of the query positions must align. Raising this value filters out shorter/partial matches and increases overall overlap.id (
int) – Minimum percentage of sequence identity required to keep a sequence. Identity is the exact match rate at aligned positions between the query and candidate. Higher values keep only close homologs; lower values allow more diverse sequences.
- Yields:
Tuples of the form
(name, seq)name: protein name from the a3m file, including gapsseq: protein sequence from the a3m file, including gaps
- Return type:
- neurosnap.msa.run_mmseqs2(seqs, output, database='mmseqs2_uniref_env', use_filter=True, use_templates=False, pairing=None)[source]#
Submits amino acid sequences to the ColabFold MMseqs2 API to generate multiple sequence alignments (MSAs), optionally downloading template structures. Results are written to the specified output directory, including: - One combined A3M file per input sequence (named combined_{i}.a3m) - Optional structure templates (if use_templates=True and supported)
- Parameters:
seqs (str or List[str]) – One or more amino acid sequences. If a single string is provided, it is treated as one sequence. Duplicates are automatically de-duplicated to reduce redundant API calls.
output (str) – Path to an output directory. If it exists, it will be deleted and recreated.
database (str) – MMseqs2 search database to use. Must be one of: - “mmseqs2_uniref_env” (environmental sequences + UniRef) - “mmseqs2_uniref” (UniRef only)
use_filter (bool) – Whether to apply diversity/length filtering to limit the size of the resulting MSA. Recommended for performance and downstream quality. If False, may yield larger but noisier MSAs.
use_templates (bool) – Whether to fetch structural templates for each input sequence using hits from the pdb70.m8 file. Automatically disabled if pairing is set.
pairing (str or None) –
If specified, activates MSA pairing mode. Must be one of: - “greedy”: fast pairing using best bidirectional hits - “complete”: exhaustive pairing of all hits - None: disables pairing (default)
Note: Only one MSA is generated per pair using pair.a3m when pairing is enabled. If this is set, use_templates will be ignored.
- Returns:
- a3m_lines: A list of strings, each representing the combined MSA (in A3M format) for each input sequence,
in the same order as provided. These are also written as combined_{i}.a3m files.
- template_paths: If use_templates=True, returns a list of template directory paths (or None for sequences with no templates found).
Otherwise, returns None.
- Return type:
Notes
Internally deduplicates sequences but returns results in original input order.
Implements robust retry logic for ColabFold’s unstable API endpoints, including long-lived 502 errors.
Null bytes in A3M files are stripped to avoid downstream parsing issues.
Original code adapted from ColabFold: sokrypton/ColabFold
Please cite ColabFold if using this in research: https://colabfold.mmseqs.com/
- neurosnap.msa.run_mmseqs2_modes(seq, output, cov=50, id=90, max_msa=2048, mode='unpaired_paired')[source]#
Generate a multiple sequence alignment (MSA) for the given sequence(s) using Colabfold’s API. Key difference between this function and run_mmseqs2 is that this function supports different modes. The final a3m and most useful a3m file will be written as “output/final.a3m”. Code originally adapted from: sokrypton/ColabFold
- Parameters:
seq (
Union[str,List[str]]) – Sequence(s) to generate the MSA for. If a list of sequences is provided, they will be considered as a single protein for the MSA.output (
str) – Output directory path, will overwrite existing results.cov (
int) – Coverage of the MSAid (
int) – Identity threshold for the MSAmax_msa (
int) – Maximum number of sequences in the MSAmode (
str) – Mode to run the MSA generation in. Must be in["unpaired", "paired", "unpaired_paired"]
- neurosnap.msa.run_phmmer(query, database, evalue=10.0, cpu=2)[source]#
Run phmmer using a query sequence against a database and return all the sequences that are considered as hits. Shamelessly stolen and adapted from seanrjohnson/protein_gibbs_sampler
- Parameters:
query (
str) – Amino acid sequence of the protein you want to find hits fordatabase (
str) – Path to reference database of sequences you want to search for hits and create and alignment with, must be a protein fasta fileevalue (
float) – The threshold E value for the phmmer hit to be reportedcpu (
int) – The number of CPU cores to be used to run phmmer
- Return type:
- Returns:
List of hits ranked by how good the hits are
- neurosnap.msa.run_phmmer_mafft(query, ref_db_path, size=None, in_name='input_sequence', phmmer_cpu=2, mafft_threads=8)[source]#
Generate MSA using phmmer and mafft from reference sequences.
- Parameters:
query (
str) – Amino acid sequence of the protein whose MSA you want to createref_db_path (
str) – Path to reference database of sequences with which you want to search for hits and create and alignmentsize (
Optional[int]) – Total number of sequences to return, including the query. Use None to include all hits. Must be a positive integer greater than 1.in_name (
str) – Optional name for input sequence to put in the outputphmmer_cpu (
int) – The number of CPU cores to be used to run phmmer (default: 2)mafft_threads (
int) – Number of MAFFT threads to use (default: 8)
- Return type:
- Returns:
A tuple of the form
(out_names, out_seqs)out_names: list of aligned protein namesout_seqs: list of corresponding protein sequences
- neurosnap.msa.seqid(seq1, seq2, *, count_gaps=False)[source]#
Calculate the pairwise sequence identity of two same length sequences or alignments. Will not perform any alignment steps.
- Parameters:
- Return type:
- Returns:
The pairwise sequence identity, 0 means no matches found, 100 means sequences were identical.