neurosnap.msa module#

Provides functions and classes related to processing protein sequence data.

neurosnap.msa.align_mafft(seqs, ep=0.0, op=1.53, threads=8)[source]#

Generates an alignment using mafft.

Parameters:
  • seqs (Union[str, List[str], Dict[str, str]]) –

    Can be:

    • fasta file path,

    • list of sequences, or

    • dictionary where values are AA sequences and keys are their corresponding names/IDs

  • ep (float) – ep value for mafft, default is 0.00

  • op (float) – op value for mafft, default is 1.53

  • threads (int) – Number of MAFFT threads to use (default: 8)

Return type:

Tuple[List[str], List[str]]

Returns:

A tuple of the form (out_names, out_seqs)

  • out_names: list of aligned protein names

  • out_seqs: list of corresponding protein sequences

neurosnap.msa.alignment_coverage(seq1, seq2)[source]#

Calculate alignment coverage (%) for two aligned sequences. First sequence should the query sequence in most cases

Parameters:
  • seq1 (str) – Query aligned sequence (with gaps).

  • seq2 (str) – Subject aligned sequence (with gaps).

Return type:

float

Returns:

Percentage of non-gap positions in the query sequence.

neurosnap.msa.consensus_sequence(sequences)[source]#

Generates the consensus sequence from a list of aligned sequences.

The consensus is formed by taking the most common character at each position.

Parameters:

sequences (List[str]) – A list of equal-length sequences (e.g., amino acid or DNA).

Return type:

str

Returns:

The consensus sequence.

Raises:

ValueError – If the sequence list is empty or sequences are of unequal lengths.

neurosnap.msa.pad_seqs(seqs, char='-', truncate=False)[source]#

Pads all sequences to the longest sequences length using a character from the right side.

Parameters:
  • seqs (List[str]) – List of sequences to pad

  • chars – The character to perform the padding with, default is “-”

  • truncate (Union[bool, int]) – When set to True will truncate all sequences to the length of the first, set to integer to truncate sequence to that length

Return type:

List[str]

Returns:

The padded sequences

neurosnap.msa.read_msa(input_fasta, *, size=inf, allow_chars='', drop_chars='', remove_chars='*', uppercase=True, name_allow_all_chars=False, query=None, cov=0, id=0)[source]#

Reads an MSA, a3m, or fasta file and yields (name, seq) pairs as a stream. Returned headers will consist of all characters up until the first space with the “|” character replaced with an underscore.

Parameters:
  • input_fasta (Union[str, TextIOBase]) – Path to read input a3m file, fasta as a raw string, or a file-handle like object to read

  • size (float) – Number of rows to read

  • allow_chars (str) – Sequences that contain characters not included within STANDARD_AAs+allow_chars will throw an exception

  • drop_chars (str) – Drop sequences that contain these characters. For example, "-X"

  • remove_chars (str) – Removes these characters from sequences. For example, "*-X"

  • uppercase (bool) – Converts all amino acid chars to uppercase when True

  • name_allow_all_chars (bool) – Uses the entire header string for names instead of the standard regex pattern

  • query (Optional[str]) – Query amino acid sequence. If not provided, the first sequence in the MSA is used.

  • cov (int) – Minimum percentage of query sequence coverage required to keep a sequence. It measures the proportion of non-gap positions in the query that are aligned to non-gap positions in the candidate. For example, with a 50% threshold, at least half of the query positions must align. Raising this value filters out shorter/partial matches and increases overall overlap.

  • id (int) – Minimum percentage of sequence identity required to keep a sequence. Identity is the exact match rate at aligned positions between the query and candidate. Higher values keep only close homologs; lower values allow more diverse sequences.

Yields:

Tuples of the form (name, seq)

  • name: protein name from the a3m file, including gaps

  • seq: protein sequence from the a3m file, including gaps

Return type:

Iterator[Tuple[str, str]]

neurosnap.msa.run_mmseqs2(seqs, output, database='mmseqs2_uniref_env', use_filter=True, use_templates=False, pairing=None)[source]#

Submits amino acid sequences to the ColabFold MMseqs2 API to generate multiple sequence alignments (MSAs), optionally downloading template structures. Results are written to the specified output directory, including: - One combined A3M file per input sequence (named combined_{i}.a3m) - Optional structure templates (if use_templates=True and supported)

Parameters:
  • seqs (str or List[str]) – One or more amino acid sequences. If a single string is provided, it is treated as one sequence. Duplicates are automatically de-duplicated to reduce redundant API calls.

  • output (str) – Path to an output directory. If it exists, it will be deleted and recreated.

  • database (str) – MMseqs2 search database to use. Must be one of: - “mmseqs2_uniref_env” (environmental sequences + UniRef) - “mmseqs2_uniref” (UniRef only)

  • use_filter (bool) – Whether to apply diversity/length filtering to limit the size of the resulting MSA. Recommended for performance and downstream quality. If False, may yield larger but noisier MSAs.

  • use_templates (bool) – Whether to fetch structural templates for each input sequence using hits from the pdb70.m8 file. Automatically disabled if pairing is set.

  • pairing (str or None) –

    If specified, activates MSA pairing mode. Must be one of: - “greedy”: fast pairing using best bidirectional hits - “complete”: exhaustive pairing of all hits - None: disables pairing (default)

    Note: Only one MSA is generated per pair using pair.a3m when pairing is enabled. If this is set, use_templates will be ignored.

Returns:

  • a3m_lines: A list of strings, each representing the combined MSA (in A3M format) for each input sequence,

    in the same order as provided. These are also written as combined_{i}.a3m files.

  • template_paths: If use_templates=True, returns a list of template directory paths (or None for sequences with no templates found).

    Otherwise, returns None.

Return type:

Tuple[List[str], Optional[List[str]]]

Notes

  • Internally deduplicates sequences but returns results in original input order.

  • Implements robust retry logic for ColabFold’s unstable API endpoints, including long-lived 502 errors.

  • Null bytes in A3M files are stripped to avoid downstream parsing issues.

  • Original code adapted from ColabFold: sokrypton/ColabFold

  • Please cite ColabFold if using this in research: https://colabfold.mmseqs.com/

neurosnap.msa.run_mmseqs2_modes(seq, output, cov=50, id=90, max_msa=2048, mode='unpaired_paired')[source]#

Generate a multiple sequence alignment (MSA) for the given sequence(s) using Colabfold’s API. Key difference between this function and run_mmseqs2 is that this function supports different modes. The final a3m and most useful a3m file will be written as “output/final.a3m”. Code originally adapted from: sokrypton/ColabFold

Parameters:
  • seq (Union[str, List[str]]) – Sequence(s) to generate the MSA for. If a list of sequences is provided, they will be considered as a single protein for the MSA.

  • output (str) – Output directory path, will overwrite existing results.

  • cov (int) – Coverage of the MSA

  • id (int) – Identity threshold for the MSA

  • max_msa (int) – Maximum number of sequences in the MSA

  • mode (str) – Mode to run the MSA generation in. Must be in ["unpaired", "paired", "unpaired_paired"]

neurosnap.msa.run_phmmer(query, database, evalue=10.0, cpu=2)[source]#

Run phmmer using a query sequence against a database and return all the sequences that are considered as hits. Shamelessly stolen and adapted from seanrjohnson/protein_gibbs_sampler

Parameters:
  • query (str) – Amino acid sequence of the protein you want to find hits for

  • database (str) – Path to reference database of sequences you want to search for hits and create and alignment with, must be a protein fasta file

  • evalue (float) – The threshold E value for the phmmer hit to be reported

  • cpu (int) – The number of CPU cores to be used to run phmmer

Return type:

List[str]

Returns:

List of hits ranked by how good the hits are

neurosnap.msa.run_phmmer_mafft(query, ref_db_path, size=None, in_name='input_sequence', phmmer_cpu=2, mafft_threads=8)[source]#

Generate MSA using phmmer and mafft from reference sequences.

Parameters:
  • query (str) – Amino acid sequence of the protein whose MSA you want to create

  • ref_db_path (str) – Path to reference database of sequences with which you want to search for hits and create and alignment

  • size (Optional[int]) – Total number of sequences to return, including the query. Use None to include all hits. Must be a positive integer greater than 1.

  • in_name (str) – Optional name for input sequence to put in the output

  • phmmer_cpu (int) – The number of CPU cores to be used to run phmmer (default: 2)

  • mafft_threads (int) – Number of MAFFT threads to use (default: 8)

Return type:

Tuple[List[str], List[str]]

Returns:

A tuple of the form (out_names, out_seqs)

  • out_names: list of aligned protein names

  • out_seqs: list of corresponding protein sequences

neurosnap.msa.seqid(seq1, seq2, *, count_gaps=False)[source]#

Calculate the pairwise sequence identity of two same length sequences or alignments. Will not perform any alignment steps.

Parameters:
  • seq1 (str) – The 1st sequence / aligned sequence.

  • seq2 (str) – The 2nd sequence / aligned sequence.

  • count_gaps (bool) – When True, include gap positions in the identity calculation.

Return type:

float

Returns:

The pairwise sequence identity, 0 means no matches found, 100 means sequences were identical.

neurosnap.msa.write_msa(output_path, sequences)[source]#

Writes an MSA, a3m, or fasta to a file. Makes no assumptions about the validity of names or sequences.

Parameters:
  • output_path (Union[str, PathLike]) – Path to output file to write, will overwrite existing files

  • sequences (Iterable[Tuple[str, str]]) – Iterable of (name, sequence) pairs