neurosnap.sequence.protein module

neurosnap.sequence.protein.getAA(query, *, non_standard='reject')

Resolve an amino acid identifier to a canonical record.

This function accepts a 1-letter code, a 3-letter abbreviation, or a full name (all case-insensitive) and returns the corresponding AARecord.

Parameters:
  • query (str) – Amino acid identifier (1-letter code, 3-letter CCD abbreviation, or full name).

  • non_standard ({"reject", "convert", "allow"}, optional) –

    Policy for handling non-standard amino acids (default: “reject”):

    • “reject”: Raise an error if the amino acid is non-standard.

    • “convert”: Map non-standard amino acids to their closest standard equivalent (e.g., MSE → MET).

    • “allow”: Return the non-standard amino acid unchanged.

Returns:

A record containing:

  • code: 1-letter code (may be “?” if unavailable for non-standard AAs).

  • abr: 3-letter abbreviation.

  • name: Full amino acid name.

  • is_standard: Whether the residue is one of the 20 canonical amino acids.

  • standard_equiv_abr: 3-letter abbreviation of the standard equivalent (if applicable).

Return type:

AARecord

Raises:

ValueError – Raised in any of the following cases:

  • query does not match any supported amino acid identifier.

  • non_standard=“reject” and the amino acid is non-standard.

  • non_standard=“convert” but no standard equivalent is defined.
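
The three non_standard policies can be illustrated with a small standalone sketch. This is not the library's implementation: AARecordSketch, get_aa_sketch, and the two-entry table are hypothetical names introduced here, and storing None as the standard equivalent of a standard residue is an assumption. Only the MSE → MET mapping and the field names come from the description above.

```python
from typing import Dict, NamedTuple, Optional


class AARecordSketch(NamedTuple):
    """Hypothetical stand-in for AARecord, using the documented field names."""
    code: str
    abr: str
    name: str
    is_standard: bool
    standard_equiv_abr: Optional[str]


# Tiny illustrative table; the real module covers many more residues.
_RECORDS = [
    AARecordSketch("M", "MET", "METHIONINE", True, None),
    AARecordSketch("?", "MSE", "SELENOMETHIONINE", False, "MET"),
]

# Index every identifier in upper case so lookups are case-insensitive.
_INDEX: Dict[str, AARecordSketch] = {}
for _rec in _RECORDS:
    for _key in (_rec.code, _rec.abr, _rec.name):
        if _key != "?":  # "?" is a placeholder, not a usable identifier
            _INDEX[_key.upper()] = _rec


def get_aa_sketch(query: str, *, non_standard: str = "reject") -> AARecordSketch:
    if non_standard not in ("reject", "convert", "allow"):
        raise ValueError(f"Unknown non_standard policy: {non_standard!r}")
    rec = _INDEX.get(query.strip().upper())
    if rec is None:
        raise ValueError(f"Unknown amino acid identifier: {query!r}")
    if rec.is_standard or non_standard == "allow":
        return rec
    if non_standard == "reject":
        raise ValueError(f"{rec.abr} is a non-standard amino acid")
    # non_standard == "convert": fall back to the standard equivalent, if any.
    if rec.standard_equiv_abr is None:
        raise ValueError(f"No standard equivalent defined for {rec.abr}")
    return _INDEX[rec.standard_equiv_abr]
```

With this sketch, get_aa_sketch("mse", non_standard="convert") returns the MET record, while the default policy raises ValueError for the same query.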

neurosnap.sequence.protein.isoelectric_point(sequence, pKa={'C': 8.5, 'C_TERMINUS': 3.6, 'D': 3.9, 'E': 4.1, 'H': 6.5, 'K': 10.8, 'N_TERMINUS': 8.6, 'R': 12.5, 'U': 5.2, 'Y': 10.1}, *, pH_low=0.0, pH_high=14.0, tol=0.0001, max_iter=100)

Estimate the isoelectric point (pI) of a protein or peptide.

The pI is the pH at which the net charge of the molecule is zero. This function computes the net charge across pH and uses a bisection search to find the root.

Parameters:
  • sequence (str) – Amino acid sequence (one-letter codes). Supports the 20 canonical residues and optionally ‘U’ (selenocysteine). Non-titratable residues contribute no charge.

  • pKa (Dict[str, float]) – Dictionary of pKa values for titratable groups. Must include keys “N_TERMINUS”, “C_TERMINUS”, and for side chains “D”, “E”, “C”, “Y”, “H”, “K”, “R”. If ‘U’ appears in the sequence, include “U” (default 5.2, an approximate value).

  • pH_low (float) – Lower bound of the bracketing interval for the bisection search (default 0.0).

  • pH_high (float) – Upper bound of the bracketing interval for the bisection search (default 14.0).

  • tol (float) – Target absolute net charge tolerance at the solution (default 1e-4).

  • max_iter (int) – Maximum iterations for the bisection search (default 100).

Return type:

float

Returns:

Estimated pI.

Notes

  • Results depend on the chosen pKa set. For consistency with common tools, you may substitute a different pKa dictionary (e.g., Bjellqvist or IPC sets).

  • Pyrrolysine (‘O’) is not included by default due to scarce consensus pKa data; it is treated as non-titratable here. You can add an entry if you have a value.

  • This model ignores sequence-context and microenvironment effects (local shifts in pKa due to neighbors or structure). It’s a good heuristic, not a guarantee.
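
The bisection search described above can be sketched in a few lines. This is a hedged illustration rather than the library's code: charge_sketch and pi_sketch are hypothetical names, the pKa values are copied from the default in the signature, and the stopping rule (absolute net charge below tol) follows the parameter description. Because the net charge only decreases as pH rises, a positive charge at the midpoint means the root lies at a higher pH.

```python
# Default pKa values copied from the signature above.
PKA_DEFAULT = {"C": 8.5, "C_TERMINUS": 3.6, "D": 3.9, "E": 4.1, "H": 6.5,
               "K": 10.8, "N_TERMINUS": 8.6, "R": 12.5, "U": 5.2, "Y": 10.1}


def charge_sketch(sequence, pH, pKa=PKA_DEFAULT):
    """Approximate net charge (hypothetical helper for this sketch)."""
    # Basic groups carry +1 when protonated; acidic groups carry -1 when deprotonated.
    charge = 1.0 / (1.0 + 10 ** (pH - pKa["N_TERMINUS"]))
    charge -= 1.0 / (1.0 + 10 ** (pKa["C_TERMINUS"] - pH))
    for aa in sequence:
        if aa in "KRH":
            charge += 1.0 / (1.0 + 10 ** (pH - pKa[aa]))
        elif aa in "DECYU":
            charge -= 1.0 / (1.0 + 10 ** (pKa[aa] - pH))
    return charge


def pi_sketch(sequence, pH_low=0.0, pH_high=14.0, tol=1e-4, max_iter=100):
    """Bisection on pH until the absolute net charge drops below tol."""
    lo, hi = pH_low, pH_high
    mid = (lo + hi) / 2.0
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        q = charge_sketch(sequence, mid)
        if abs(q) < tol:
            break
        if q > 0:
            lo = mid  # still net positive, so the pI lies at a higher pH
        else:
            hi = mid  # net negative, so the pI lies at a lower pH
    return mid
```

For a single glycine, only the two termini titrate, so the sketch converges near the midpoint of their pKa values, (8.6 + 3.6) / 2 = 6.1.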

neurosnap.sequence.protein.molecular_weight(sequence, aa_mws={'A': 71.0779, 'C': 103.1429, 'D': 115.0874, 'E': 129.11398, 'F': 147.17386, 'G': 57.05132, 'H': 137.13928, 'I': 113.15764, 'K': 128.17228, 'L': 113.15764, 'M': 131.19606, 'N': 114.10264, 'O': 237.29816, 'P': 97.11518, 'Q': 128.12922, 'R': 156.18568, 'S': 87.0773, 'T': 101.10388, 'U': 150.0379, 'V': 99.13106, 'W': 186.2099, 'Y': 163.17326})

Calculate the molecular weight of a protein or peptide sequence.

This function computes the molecular weight by summing the residue masses for each amino acid in the input sequence. By default, it uses average amino acid residue masses (AA_MASS_PROTEIN_AVG), but you can provide a custom mass dictionary (e.g., monoisotopic or free amino acid masses).

The calculation accounts for the loss of one water molecule (H₂O, 18.015 Da) for each peptide bond formed. For a sequence of length n, (n - 1) * 18.015 Da is subtracted from the total.

Parameters:
  • sequence (str) – Amino acid sequence (one-letter codes).

  • aa_mws (Dict[str, float]) – Dictionary mapping amino acid one-letter codes to molecular weights. Defaults to AA_MASS_PROTEIN_AVG.

Return type:

float

Returns:

Estimated molecular weight of the protein or peptide in Daltons (Da).

Raises:
  • ValueError – If the sequence contains an invalid or unsupported amino acid code.

Notes

  • Use AA_MASS_PROTEIN_MONO for monoisotopic mass calculations, typically used in mass spectrometry.

  • Use AA_MASS_PROTEIN_AVG (default) for average residue masses, appropriate for bulk molecular weight estimation.

  • For free amino acids (not incorporated in peptides), use AA_MASS_FREE.

  • Weight dictionaries are defined in constants.py.
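
The water-loss arithmetic described above, taken on its own, can be sketched as follows. This illustrates only the arithmetic and is not the library's implementation: molecular_weight_sketch and FREE_MASS_SKETCH are hypothetical names, and the two-entry table holds approximate free amino acid masses chosen for this example, not the constants from constants.py.

```python
from typing import Dict

WATER_AVG = 18.015  # Da, the value quoted in the description above

# Hypothetical table of approximate free amino acid average masses (Da).
FREE_MASS_SKETCH: Dict[str, float] = {"G": 75.0666, "A": 89.0932}


def molecular_weight_sketch(sequence: str,
                            aa_mws: Dict[str, float] = FREE_MASS_SKETCH) -> float:
    total = 0.0
    for aa in sequence:
        if aa not in aa_mws:
            raise ValueError(f"Unsupported amino acid code: {aa!r}")
        total += aa_mws[aa]
    # One water is lost per peptide bond; a chain of n residues has n - 1 bonds.
    n_bonds = max(len(sequence) - 1, 0)
    return total - n_bonds * WATER_AVG
```

For the dipeptide "GA" this gives 75.0666 + 89.0932 - 18.015 = 146.1448 Da under the assumed masses.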

neurosnap.sequence.protein.net_charge(sequence, pH, pKa={'C': 8.5, 'C_TERMINUS': 3.6, 'D': 3.9, 'E': 4.1, 'H': 6.5, 'K': 10.8, 'N_TERMINUS': 8.6, 'R': 12.5, 'U': 5.2, 'Y': 10.1})

Calculate the net charge of a protein or peptide sequence at a given pH.

This function applies the Henderson–Hasselbalch equation to estimate the protonation state of titratable groups (N-terminus, C-terminus, and ionizable side chains) and computes the overall net charge.

Parameters:
  • sequence (str) – Amino acid sequence in one-letter codes. Supports the 20 canonical residues and optionally ‘U’ (selenocysteine). Non-ionizable residues are ignored.

  • pH (float) – The solution pH at which to evaluate the net charge.

  • pKa (Dict[str, float]) – Dictionary of pKa values for titratable groups. Must include keys “N_TERMINUS”, “C_TERMINUS”, “D”, “E”, “C”, “Y”, “H”, “K”, and “R”. If ‘U’ is present in the sequence, it should also include “U”.

Return type:

float

Returns:

Estimated net charge of the sequence at the given pH.

Notes

Positive charges come from protonated groups:

  • N-terminus

  • Lysine (K)

  • Arginine (R)

  • Histidine (H)

Negative charges come from deprotonated groups:

  • C-terminus

  • Aspartic acid (D)

  • Glutamic acid (E)

  • Cysteine (C)

  • Tyrosine (Y)

  • Selenocysteine (U), if included

The calculation assumes independent ionization equilibria and does not account for local environment or structural effects. It is best interpreted as an approximate charge profile.
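
A minimal sketch of the Henderson–Hasselbalch bookkeeping described in these notes, assuming each group ionizes independently. The name net_charge_sketch is hypothetical and the code is not taken from the library; the pKa values are copied from the default in the signature. A basic group contributes 1 / (1 + 10^(pH - pKa)) and an acidic group contributes -1 / (1 + 10^(pKa - pH)).

```python
from typing import Dict

# Default pKa values copied from the signature above.
PKA_SKETCH: Dict[str, float] = {
    "C": 8.5, "C_TERMINUS": 3.6, "D": 3.9, "E": 4.1, "H": 6.5,
    "K": 10.8, "N_TERMINUS": 8.6, "R": 12.5, "U": 5.2, "Y": 10.1,
}

POSITIVE = ("K", "R", "H")            # side chains charged +1 when protonated
NEGATIVE = ("D", "E", "C", "Y", "U")  # side chains charged -1 when deprotonated


def net_charge_sketch(sequence: str, pH: float,
                      pKa: Dict[str, float] = PKA_SKETCH) -> float:
    def protonated_fraction(pk: float) -> float:
        # Fraction of a basic group that holds its proton at this pH.
        return 1.0 / (1.0 + 10 ** (pH - pk))

    def deprotonated_fraction(pk: float) -> float:
        # Fraction of an acidic group that has lost its proton at this pH.
        return 1.0 / (1.0 + 10 ** (pk - pH))

    # Both termini are always present, whatever the sequence.
    charge = protonated_fraction(pKa["N_TERMINUS"])
    charge -= deprotonated_fraction(pKa["C_TERMINUS"])
    for aa in sequence:
        if aa in POSITIVE:
            charge += protonated_fraction(pKa[aa])
        elif aa in NEGATIVE:
            charge -= deprotonated_fraction(pKa[aa])
        # Non-ionizable residues contribute nothing.
    return charge
```

At very low pH every basic group is protonated and every acidic group is neutral, so a single lysine approaches a charge of +2 (side chain plus N-terminus); at very high pH a single aspartate approaches -2.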

neurosnap.sequence.protein.sanitize_aa_seq(seq, *, non_standard='reject', trim_term=True, uppercase=True, clean_whitespace=True)

Validates and sanitizes an amino acid sequence string.

Parameters:
  • seq (str) – The input amino acid sequence.

  • non_standard (str) – How to handle non-standard amino acids. Must be one of:

    • “reject”: Raise an error if any non-standard residue is found (default).

    • “convert”: Replace non-standard residues with standard equivalents, if possible.

    • “allow”: Keep non-standard residues unchanged.

  • trim_term (bool) – If True, trims terminal stop codons (“*”) from the end of the sequence. Default is True.

  • uppercase (bool) – If True, converts the sequence to uppercase before processing. Default is True.

  • clean_whitespace (bool) – If True, removes all whitespace characters from the sequence. Default is True.

Return type:

str

Returns:

The sanitized amino acid sequence.

Raises:
  • ValueError – If an invalid residue is found and non_standard is set to “reject”, or if a residue cannot be converted when non_standard is “convert”.

  • AssertionError – If non_standard is not one of “allow”, “convert”, or “reject”.
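
The cleaning steps listed in the parameters can be sketched as a standalone function. This is an illustration under assumptions, not the library's code: sanitize_sketch is a hypothetical name, only the “reject” policy is modeled, and the order of the steps (whitespace, then case, then terminal “*”) is a guess rather than something the description states.

```python
STANDARD_CODES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical one-letter codes


def sanitize_sketch(seq: str, *, trim_term: bool = True,
                    uppercase: bool = True,
                    clean_whitespace: bool = True) -> str:
    if clean_whitespace:
        # str.split() with no argument splits on any run of whitespace.
        seq = "".join(seq.split())
    if uppercase:
        seq = seq.upper()
    if trim_term:
        # Drop stop-codon markers from the end of the sequence only.
        seq = seq.rstrip("*")
    # "reject" policy: any residue outside the canonical set is an error.
    for position, aa in enumerate(seq, start=1):
        if aa not in STANDARD_CODES:
            raise ValueError(f"Invalid residue {aa!r} at position {position}")
    return seq
```

Under these assumptions, an input such as " mk t\nay* " is reduced to "MKTAY", while "MKX" raises ValueError because X is not a canonical code.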