neurosnap.msa module#

Provides functions and classes related to processing protein sequence data.

neurosnap.msa.align_mafft(seqs, ep=0.0, op=1.53, threads=8)[source]#

Generates an alignment using mafft.

Parameters:

seqs (Union[str, List[str], Dict[str, str]]) –
Can be:
- fasta file path,
- list of sequences, or
- dictionary where values are AA sequences and keys are their corresponding names/IDs
ep (float) – ep value for mafft, default is 0.00
op (float) – op value for mafft, default is 1.53
threads (int) – Number of MAFFT threads to use (default: 8)

Return type:

Tuple[List[str], List[str]]

Returns:

A tuple of the form (out_names, out_seqs)

out_names: list of aligned protein names
out_seqs: list of corresponding protein sequences

neurosnap.msa.alignment_coverage(seq1, seq2)[source]#

Calculate alignment coverage (%) for two aligned sequences. First sequence should the query sequence in most cases

Parameters:

seq1 (str) – Query aligned sequence (with gaps).
seq2 (str) – Subject aligned sequence (with gaps).

Return type:

float

Returns:

Percentage of non-gap positions in the query sequence.

neurosnap.msa.consensus_sequence(sequences)[source]#

Generates the consensus sequence from a list of aligned sequences.

The consensus is formed by taking the most common character at each position.

Parameters:: sequences (List[str]) – A list of equal-length sequences (e.g., amino acid or DNA).
Return type:: str
Returns:: The consensus sequence.
Raises:: ValueError – If the sequence list is empty or sequences are of unequal lengths.

neurosnap.msa.pad_seqs(seqs, char='-', truncate=False)[source]#

Pads all sequences to the longest sequences length using a character from the right side.

Parameters:

seqs (List[str]) – List of sequences to pad
chars – The character to perform the padding with, default is “-”
truncate (Union[bool, int]) – When set to True will truncate all sequences to the length of the first, set to integer to truncate sequence to that length

Return type:

List[str]

Returns:

The padded sequences

neurosnap.msa.read_msa(input_fasta, *, size=inf, allow_chars='', drop_chars='', remove_chars='*', uppercase=True, name_allow_all_chars=False, query=None, cov=0, id=0)[source]#

Reads an MSA, a3m, or fasta file and yields (name, seq) pairs as a stream. Returned headers will consist of all characters up until the first space with the “|” character replaced with an underscore.

Parameters:

input_fasta (Union[str, TextIOBase]) – Path to read input a3m file, fasta as a raw string, or a file-handle like object to read
size (float) – Number of rows to read
allow_chars (str) – Sequences that contain characters not included within STANDARD_AAs+allow_chars will throw an exception
drop_chars (str) – Drop sequences that contain these characters. For example, "-X"
remove_chars (str) – Removes these characters from sequences. For example, "*-X"
uppercase (bool) – Converts all amino acid chars to uppercase when True
name_allow_all_chars (bool) – Uses the entire header string for names instead of the standard regex pattern
query (Optional[str]) – Query amino acid sequence. If not provided, the first sequence in the MSA is used.
cov (int) – Minimum percentage of query sequence coverage required to keep a sequence. It measures the proportion of non-gap positions in the query that are aligned to non-gap positions in the candidate. For example, with a 50% threshold, at least half of the query positions must align. Raising this value filters out shorter/partial matches and increases overall overlap.
id (int) – Minimum percentage of sequence identity required to keep a sequence. Identity is the exact match rate at aligned positions between the query and candidate. Higher values keep only close homologs; lower values allow more diverse sequences.

Yields:

Tuples of the form (name, seq)

name: protein name from the a3m file, including gaps
seq: protein sequence from the a3m file, including gaps

Return type:

Iterator[Tuple[str, str]]

neurosnap.msa.run_mmseqs2(seqs, output, database='mmseqs2_uniref_env', use_filter=True, use_templates=False, pairing=None)[source]#

Submits amino acid sequences to the ColabFold MMseqs2 API to generate multiple sequence alignments (MSAs), optionally downloading template structures. Results are written to the specified output directory, including: - One combined A3M file per input sequence (named combined_{i}.a3m) - Optional structure templates (if use_templates=True and supported)

Parameters:

seqs (str or List[str]) – One or more amino acid sequences. If a single string is provided, it is treated as one sequence. Duplicates are automatically de-duplicated to reduce redundant API calls.
output (str) – Path to an output directory. If it exists, it will be deleted and recreated.
database (str) – MMseqs2 search database to use. Must be one of: - “mmseqs2_uniref_env” (environmental sequences + UniRef) - “mmseqs2_uniref” (UniRef only)
use_filter (bool) – Whether to apply diversity/length filtering to limit the size of the resulting MSA. Recommended for performance and downstream quality. If False, may yield larger but noisier MSAs.
use_templates (bool) – Whether to fetch structural templates for each input sequence using hits from the pdb70.m8 file. Automatically disabled if pairing is set.
pairing (str or None) –
If specified, activates MSA pairing mode. Must be one of: - “greedy”: fast pairing using best bidirectional hits - “complete”: exhaustive pairing of all hits - None: disables pairing (default)

Note: Only one MSA is generated per pair using pair.a3m when pairing is enabled. If this is set, use_templates will be ignored.

Returns:

a3m_lines: A list of strings, each representing the combined MSA (in A3M format) for each input sequence,
in the same order as provided. These are also written as combined_{i}.a3m files.
template_paths: If use_templates=True, returns a list of template directory paths (or None for sequences with no templates found).
Otherwise, returns None.

Return type:

Tuple[List[str], Optional[List[str]]]

Notes

Internally deduplicates sequences but returns results in original input order.
Implements robust retry logic for ColabFold’s unstable API endpoints, including long-lived 502 errors.
Null bytes in A3M files are stripped to avoid downstream parsing issues.
Original code adapted from ColabFold: sokrypton/ColabFold
Please cite ColabFold if using this in research: https://colabfold.mmseqs.com/

neurosnap.msa.run_mmseqs2_modes(seq, output, cov=50, id=90, max_msa=2048, mode='unpaired_paired')[source]#

Generate a multiple sequence alignment (MSA) for the given sequence(s) using Colabfold’s API. Key difference between this function and run_mmseqs2 is that this function supports different modes. The final a3m and most useful a3m file will be written as “output/final.a3m”. Code originally adapted from: sokrypton/ColabFold

Parameters:

seq (Union[str, List[str]]) – Sequence(s) to generate the MSA for. If a list of sequences is provided, they will be considered as a single protein for the MSA.
output (str) – Output directory path, will overwrite existing results.
cov (int) – Coverage of the MSA
id (int) – Identity threshold for the MSA
max_msa (int) – Maximum number of sequences in the MSA
mode (str) – Mode to run the MSA generation in. Must be in ["unpaired", "paired", "unpaired_paired"]

neurosnap.msa.run_phmmer(query, database, evalue=10.0, cpu=2)[source]#

Run phmmer using a query sequence against a database and return all the sequences that are considered as hits. Shamelessly stolen and adapted from seanrjohnson/protein_gibbs_sampler

Parameters:

query (str) – Amino acid sequence of the protein you want to find hits for
database (str) – Path to reference database of sequences you want to search for hits and create and alignment with, must be a protein fasta file
evalue (float) – The threshold E value for the phmmer hit to be reported
cpu (int) – The number of CPU cores to be used to run phmmer

Return type:

List[str]

Returns:

List of hits ranked by how good the hits are

neurosnap.msa.run_phmmer_mafft(query, ref_db_path, size=None, in_name='input_sequence', phmmer_cpu=2, mafft_threads=8)[source]#

Generate MSA using phmmer and mafft from reference sequences.

Parameters:

query (str) – Amino acid sequence of the protein whose MSA you want to create
ref_db_path (str) – Path to reference database of sequences with which you want to search for hits and create and alignment
size (Optional[int]) – Total number of sequences to return, including the query. Use None to include all hits. Must be a positive integer greater than 1.
in_name (str) – Optional name for input sequence to put in the output
phmmer_cpu (int) – The number of CPU cores to be used to run phmmer (default: 2)
mafft_threads (int) – Number of MAFFT threads to use (default: 8)

Return type:

Tuple[List[str], List[str]]

Returns:

A tuple of the form (out_names, out_seqs)

out_names: list of aligned protein names
out_seqs: list of corresponding protein sequences

neurosnap.msa.seqid(seq1, seq2, *, count_gaps=False)[source]#

Calculate the pairwise sequence identity of two same length sequences or alignments. Will not perform any alignment steps.

Parameters:

seq1 (str) – The 1st sequence / aligned sequence.
seq2 (str) – The 2nd sequence / aligned sequence.
count_gaps (bool) – When True, include gap positions in the identity calculation.

Return type:

float

Returns:

The pairwise sequence identity, 0 means no matches found, 100 means sequences were identical.

neurosnap.msa.write_msa(output_path, sequences)[source]#

Writes an MSA, a3m, or fasta to a file. Makes no assumptions about the validity of names or sequences.

Parameters:

output_path (Union[str, PathLike]) – Path to output file to write, will overwrite existing files
sequences (Iterable[Tuple[str, str]]) – Iterable of (name, sequence) pairs

neurosnap.msa module#

This Page