neurosnap.protein module#

Provides functions and classes related to processing protein data as well as a feature rich wrapper around protein structures using BioPython.

class neurosnap.protein.Protein(pdb, format='auto')[source]#

Bases: object

Class that wraps around a protein structure.

Utilizes the biopython protein structure under the hood. Atoms that are not part of a chain will automatically be added to a new chain that does not overlap with any existing chains.

Parameters:

pdb (Union[str, Path, IOBase]) – Can be either a file handle, PDB or mmCIF filepath, PDB ID, or UniProt ID
format (str) – File format of the input (“pdb”, “mmcif”, or “auto” to infer format from extension)

__call__(model=None, chain=None, res_type=None)[source]#

Returns a selection of a copy of the internal dataframe that matches the provided query. If no queries are provided, will return a copy of the internal dataframe.

Parameters:

model (Optional[int]) – If provided, returned atoms must match this model
chain (Optional[int]) – If provided, returned atoms must match this chain
res_type (Optional[int]) – If provided, returned atoms must match this res_type

Return type:

DataFrame

Returns:

Copy of the internal dataframe that matches the input query

__sub__(other_protein)[source]#

Automatically calculate the RMSD of two proteins. Model used will naively be the first models that have identical backbone shapes. Essentially just wraps around self.calculate_rmsd()

Parameters:: other_protein (Protein) – Another Protein object to compare against
Return type:: DataFrame
Returns:: Copy of the internal dataframe that matches the input query

align(other_protein, chains1=[], chains2=[], model1=0, model2=0)[source]#

Align another Protein object’s structure to the self.structure of the current object. The other Protein will be transformed and aligned. Uses protein and nucleotide backbone atoms.

Parameters:

other_protein (Protein) – Another Neurosnap Protein object to compare against
chains1 (List[str]) – The chain(s) you want to include in the alignment within the reference protein, set to an empty list to use all chains. If both chains1 and chains2 are provided, they are interpreted as pairwise chain mappings in matching order.
chains2 (List[str]) – The chain(s) you want to include in the alignment within the other protein, set to an empty list to use all chains. If both chains1 and chains2 are provided, they are interpreted as pairwise chain mappings in matching order.
model1 (int) – Model ID of reference protein to align to
model2 (int) – Model ID of other protein to transform and align to reference

calculate_center_of_mass(model=0, chains=None)[source]#

Calculate the center of mass of the protein. Considers only atoms with defined masses.

Parameters:

model (Optional[int]) – Model ID to calculate for, if not provided defaults to 0
chains (Optional[List[str]]) – List of chain IDs to calculate for, if not provided calculates for all chains

Returns:

A numpy array with shape (3,) representing the center of mass (x, y, z)

Return type:

center_of_mass

calculate_distance_matrix(model=None, chain=None)[source]#

Calculate the distance matrix for all alpha-carbon (CA) atoms in the chain. Useful for creating contact maps or proximity analyses.

Parameters:

model (Optional[int]) – The model ID to calculate the distance matrix for, if not provided will use first model found
chain (Optional[str]) – The chain ID to calculate, if not provided calculates for all chains

Return type:

ndarray

Returns:

A 2D numpy array representing the distance matrix

calculate_hydrogen_bonds(chain=None, chain_other=None, *, model=None, donor_acceptor_cutoff=3.5, angle_cutoff=120.0)[source]#

Calculate the number of hydrogen bonds in the protein structure. Hydrogen atoms must be explicitly defined within the structure as implicit hydrogens will not computed. We recommend using a tool like reduce to add missing hydrogens.

Hydrogen bonds are detected based on distance and angle criteria: - Distance between donor and acceptor must be less than donor_acceptor_cutoff. - The angle formed by donor-hydrogen-acceptor must be greater than angle_cutoff.

If model is set to None, hydrogen bonds are calculated only for the first model in the structure.

If chain_other is None:

Hydrogen bonds are calculated for the specified chain or all chains if chain is also None.

If chain_other is set to a specific chain:

Hydrogen bonds are calculated only between atoms of chain and chain_other.

If chain_other is specified but chain is not, an exception is raised.

Parameters:

model (Optional[int]) – Model ID to calculate for. If None, only the first model is considered.
chain (Optional[str]) – Chain ID to calculate for. If None, all chains in the selected model are considered.
chain_other (Optional[str]) – Secondary chain ID for inter-chain hydrogen bonds. If None, intra-chain bonds are calculated.
donor_acceptor_cutoff (float) – Maximum distance between donor and acceptor (in Å). Default is 3.5 Å.
angle_cutoff (float) – Minimum angle for a hydrogen bond (in degrees). Default is 120°.

Return type:

int

Returns:

The total number of hydrogen bonds in the structure.

Raises:

ValueError – If chain_other is specified but chain is not.

calculate_interface_hydrogen_bonding_residues(chain=None, chain_other=None, *, model=None, donor_acceptor_cutoff=3.5, angle_cutoff=120.0)[source]#

Count the number of unique residues that participate in hydrogen bonds at an interface.

A residue is considered hydrogen-bonding if at least one of its atoms participates in a hydrogen bond that satisfies both:

donor–acceptor distance <= donor_acceptor_cutoff (Å)

donor–H–acceptor angle >= angle_cutoff (degrees)

Hydrogen atoms must be explicitly present; add missing hydrogens with a tool like ‘reduce’.

If ‘chain_other’ is None:

Hydrogen bonds are evaluated within ‘chain’ (intra-chain), or across all chains if ‘chain’ is None.

If ‘chain_other’ is provided:

Only inter-chain hydrogen bonds between ‘chain’ and ‘chain_other’ are considered.
If ‘chain_other’ is provided but ‘chain’ is not, an exception is raised.

Parameters:

model (Optional[int]) – Model index to evaluate. If None, defaults to the first model.
chain (Optional[str]) – Primary chain ID to evaluate. If None and ‘chain_other’ is also None, all chains are considered.
chain_other (Optional[str]) – Secondary chain ID for inter-chain evaluation. If None, intra-chain bonds are considered.
donor_acceptor_cutoff (float) – Maximum donor–acceptor distance (Å). Default 3.5.
angle_cutoff (float) – Minimum donor–H–acceptor angle (degrees). Default 120.0.

Return type:

int

Returns:

Integer count of unique residues (from all evaluated chains) that participate in at least one hydrogen bond.

Raises:

ValueError if 'chain_other' is provided but 'chain' is not, or if a specified chain does not exist. –

calculate_protein_volume(model=0, chain=None)[source]#

Compute an estimate of the protein volume using the van der Waals radii. Uses the sum of atom radii to compute the volume.

Parameters:

model (int) – Model ID to compute volume for, defaults to 0
chain (Optional[str]) – Chain ID to compute, if not provided computes for all chains

Return type:

float

Returns:

Estimated volume in Å³

calculate_rmsd(other_protein, chains1=None, chains2=None, model1=0, model2=0, align=True)[source]#

Calculate RMSD between the current structure and another protein. Only compares backbone atoms (N, CA, C). RMSD is in angstroms (Å).

Parameters:

other_protein (Protein) – Another Protein object to compare against
chains1 (Optional[Tuple[List[str], Set[str]]]) – Chain ID of original protein, if not provided compares all chains
chains2 (Optional[Tuple[List[str], Set[str]]]) – Chain ID of other protein, if not provided compares all chains
model1 (int) – Model ID of original protein to compare
model2 (int) – Model ID of other protein to compare
align (bool) – Whether to align the structures first using Superimposer

Return type:

float

Returns:

The root-mean-square deviation between the two structures

calculate_rog(model=0, chains=None, distances_from_com=None)[source]#

Calculate the radius of gyration (Rg) of the protein.

The radius of gyration measures the overall size and compactness of the protein structure. It is calculated based on the distances of atoms from the center of mass (COM).

Parameters:

model (Optional[int]) – The model ID to calculate for. If not provided, defaults to 0.
chains (Optional[List[str]]) – List of chain IDs to calculate for. If not provided, calculates for all chains.
distances_from_com (Optional[ndarray]) – A 1D NumPy array containing the distances (in Ångströms) between each atom and the center of mass. Set to None to automatically calculate this.

Return type:

float

Returns:

The radius of gyration (Rg) in Ångströms. Returns 0.0 if no atoms are found.

calculate_surface_area(model=0, level='R')[source]#

Calculate the solvent-accessible surface area (SASA) of the protein. Utilizes Biopython’s SASA module.

Parameters:

model (int) – The model ID to calculate SASA for, defaults to 0.
level (str) – The level at which ASA values are assigned, which can be one of “A” (Atom), “R” (Residue), “C” (Chain), “M” (Model), or “S” (Structure). The ASA value of an entity is the sum of all ASA values of its children.

Return type:

float

Returns:

Solvent-accessible surface area in Å²

center(x=0.0, y=0.0, z=0.0, model=0, chains=None)[source]#

Translate the structure so its center of mass matches the target coordinates.

Parameters:

x (float) – Target x-coordinate to center on, defaults to 0.0
y (float) – Target y-coordinate to center on, defaults to 0.0
z (float) – Target z-coordinate to center on, defaults to 0.0
model (Optional[int]) – Model ID to center. If None, all models are centered individually.
chains (Optional[List[str]]) – Optional list of chain IDs to center. If None, use all chains in the model.

chains(model=0)[source]#

Returns a list of all the chain names/IDs.

Parameters:: model (int) – The ID of the model you want to fetch the chains of, defaults to 0
Return type:: List[str]
Returns:: Chain names/IDs found within the PDB file

distances_from_com(model=0, chains=None, com=None)[source]#

Calculate the distances of all atoms from the center of mass (COM) of the protein.

This method computes the Euclidean distance between the coordinates of each atom and the center of mass of the structure. The center of mass is calculated for the specified model and chain, or for all models and chains if none are provided.

Parameters:

model (Optional[int]) – The model ID to calculate for. If not provided, defaults to 0.
chains (Optional[List[str]]) – List of chain IDs to calculate for. If not provided, calculates for all chains.
com (Optional[ndarray]) – Center of mass (com) to use, must be a numpy array with shape (3,) representing the center of mass (x, y, z). Set to None to calculate the com automatically

Return type:

ndarray

Returns:

A 1D NumPy array containing the distances (in Ångströms) between each atom and the center of mass.

find_disulfide_bonds(chain=None, model=None, threshold=2.05)[source]#

Find disulfide bonds between Cysteine residues in the structure. Looks for SG-SG bonds within a threshold distance.

Parameters:

chain (Optional[str]) – Chain ID to search. If None, all chains are searched.
model (Optional[int]) – Model ID to search, If None, the first available model is searched.
threshold (float) – Maximum distance to consider a bond between SG atoms, in angstroms. Default is 2.05 Å.

Return type:

List[Tuple]

Returns:

List of tuples of residue pairs forming disulfide bonds

find_hydrophobic_residues(chain=None, model=None)[source]#

Identify hydrophobic residues in the structure.

Parameters:

chain (Optional[str]) – Chain ID to extract from. If None, all chains are checked.
model (Optional[int]) – Model ID to extract from. If None, the first available model is searched.

Return type:

List[Tuple]

Returns:

List of tuples (chain_id, residue) for hydrophobic residues

find_interface_contacts(chain1, chain2, *, model=None, cutoff=4.5, hydrogens=True)[source]#

Identify interface atoms between two chains using a distance cutoff.

Parameters:

chain1 (str) – ID of the binder chain.
chain2 (str) – ID of the target chain.
cutoff (float) – Distance cutoff in Ångströms for defining an interface contact (default 4.5 Å).
hydrogens (bool) – Whether to keep hydrogen atoms when evaluating contacts (set False to exclude them).

Returns:

Paired atoms from the binder and target chains that are within the cutoff distance.

Return type:

List[Tuple[Atom, Atom]]

find_interface_residues(chain1, chain2, *, model=None, cutoff=4.5, hydrogens=True)[source]#

Identify residue-residue contacts between two chains using a distance cutoff.

A residue pair is included when any atom in the binder residue is within cutoff Å of any atom in the target residue.

Parameters:

chain1 (str) – ID of the binder chain.
chain2 (str) – ID of the target chain.
model (Optional[int]) – Index of the model to use (defaults to the first).
cutoff (float) – Distance cutoff in Ångströms for defining an interface contact (default 4.5 Å).
hydrogens (bool) – Whether to keep hydrogen atoms when evaluating contacts (set False to exclude them).

Returns:

Unique residue pairs (binder residue, target residue) that are in contact.

Return type:

List[Tuple[Residue, Residue]]

find_missing_residues(chain=None)[source]#

Identify missing residues in the structure based on residue numbering. Useful for identifying gaps in the structure.

Parameters:: chain (Optional[str]) – Chain ID to inspect. If None, all chains are inspected.
Returns:: List of missing residue positions
Return type:: missing_residues

find_non_interface_hydrophobic_patches(chain_pairs, target_chains=None, *, model=None, cutoff_interface=4.5, hydrogens=True, patch_cutoff=6.0, min_patch_area=40.0)[source]#

Identify solvent-exposed hydrophobic patches that are not part of specified interfaces.

Parameters:

chain_pairs (Iterable[Tuple[str, str]]) – Iterable of chain ID pairs that define interface contacts (e.g. [(‘A’, ‘B’)]).
target_chains (Optional[Iterable[str]]) – Optional iterable of chain IDs to restrict the patch search to. If None, all chains are considered.
model (Optional[int]) – Model index to use. Defaults to the first available model.
cutoff_interface (float) – Atom–atom distance cutoff (Å) for defining an interface contact.
hydrogens (bool) – Whether to include hydrogens when identifying interface contacts.
patch_cutoff (float) – CA–CA distance cutoff (Å) for linking hydrophobic residues into the same patch.
min_patch_area (float) – Minimum total solvent-accessible surface area (Å²) for a patch to be counted.

Return type:

List[Set[Residue]]

Returns:

List of patches, where each patch is a set of residues belonging to that solvent-exposed, non-interface hydrophobic cluster.

find_salt_bridges(chain=None, model=None, cutoff=4.0)[source]#

Identify salt bridges between oppositely charged residues. A salt bridge is defined as an interaction between a positively charged residue (Lys, Arg) and a negatively charged residue (Asp, Glu) within a given cutoff distance.

Parameters:

chain (Optional[str]) – Chain ID to search. If None, all chains are searched.
model (Optional[int]) – Model ID to search, If None, the first available model is searched.
cutoff (float) – Maximum distance for a salt bridge (float)

Return type:

List[Tuple]

Returns:

List of residue pairs forming salt bridges

generate_df()[source]#

Generate the biopandas-like dataframe and update the value of self.df to the new dataframe. This method should be called whenever the internal protein structure is modified or has a transformation applied to it.

Inspired by: https://biopandas.github.io/biopandas

get_aas(chain, model=None)[source]#

Returns the amino acid sequence of a target chain. Ligands, small molecules, and nucleotides are ignored.

Parameters:

chain (str) – The ID of the chain you want to fetch the AA sequence of
model (Optional[int]) – The ID of the model containing the target chain, if set to None will default to first model

Return type:

str

Returns:

The amino acid sequence of the found chain

get_backbone(chains=None, model=None)[source]#

Extract backbone atoms (N, CA, C) from the structure. If model or chain is not provided, extracts from all models/chains.

Parameters:

chains (Optional[Tuple[List[str], Set[str]]]) – Chain ID(s) to extract from. If None, all chains are included.
model (Optional[int]) – Model ID to extract from. If None, all models are included.

Return type:

ndarray

Returns:

A numpy array of backbone coordinates (Nx3)

models()[source]#

Returns a list of all the model names/IDs.

Returns:: Chain names/IDs found within the PDB file
Return type:: models

remove(model, chain=None, resi_start=None, resi_end=None)[source]#

Completely removes all parts of a selection from self.structure. If a residue range is provided then all residues between resi_start and resi_end will be removed from the structure (inclusively). If a residue range is not provided then all residues in a chain will be removed.

Parameters:

model (int) – ID of model to remove from
chain (Optional[str]) – ID of chain to remove from, if not provided will remove all chains in the model
resi_start (Optional[int]) – Index of first residue in the range you want to remove
resi_end (Optional[int]) – Index of last residues in the range you want to remove

remove_non_biopolymers(model=None, chain=None)[source]#

Removes all ligands, heteroatoms, and non-biopolymer residues from the selected structure. Non-biopolymer residues are considered to be any residues that are not standard amino acids or standard nucleotides (DNA/RNA). If no model or chain is provided, it will remove from the entire structure.

Parameters:

model (Optional[int]) – The model ID to process. If None, will use all models.
chain (Optional[str]) – The chain ID to process. If None, will use all chains.

remove_nucleotides(model=None, chain=None)[source]#

Removes all nucleotides (DNA and RNA) from the structure. If no model or chain is provided, it will remove nucleotides from the entire structure.

Parameters:

model (Optional[int]) – The model ID to process. If None, will use all models.
chain (Optional[str]) – The chain ID to process. If None, will use all chains.

remove_waters()[source]#: Removes all water molecules (residues named ‘WAT’ or ‘HOH’) from the structure. It is suggested to call renumber() afterwards as well.

renumber(model=None, chain=None, start=1)[source]#

Renumbers all selected residues. If selection does not exist this function will do absolutely nothing.

Parameters:

model (Optional[int]) – The model ID to renumber. If None, will use all models.
chain (Optional[int]) – The chain ID to renumber. If None, will use all models.
start (int) – Starting value to increment from, defaults to 1.

save(fpath, format='auto')[source]#

Save the structure as a PDB or mmCIF file. Will overwrite any existing files.

Parameters:

fpath (Union[str, Path]) – File path where you want to save the structure
format (str) – File format to save in, either ‘pdb’ or ‘mmcif’, set to ‘auto’ to infer format from extension.

select_residues(selectors, invert=False, model=None)[source]#

Select residues from a protein structure using a string selector.

This method allows for flexible selection of residues in a protein structure based on a string query. The query must be a comma-delimited list of selectors following these patterns:

“C”: Select all residues in chain C.
“B1”: Select residue with identifier 1 in chain B only.
“A10-20”: Select residues with identifiers 10 to 20 (inclusive) in chain A.
“A15,A20-23,B”: Select residues 15, 20, 21, 22, 23, and all residues in chain B.

If any selector does not match residues in the structure, an exception is raised.

Parameters:

selectors (str) – A string specifying the residue selection query.
invert (bool) – Whether to invert the selection.
model (Optional[int]) – The ID of the model to select from. If None, the first model is used.

Returns:

A dictionary where keys are chain IDs and values are sorted: lists of residue sequence numbers that match the query.

Return type:

dict

Raises:

ValueError – If a specified chain or residue in the selector does not exist in the structure.
ValueError – If selector string is empty.

to_sdf(fpath)[source]#

Save the current protein structure as an SDF file. Will export all models and chains. Use remove() method to get rid of undesired regions.

Parameters:: fpath (str) – Path to the output SDF file

neurosnap.protein.calculate_bsa(protein_complex, chain_group_1, chain_group_2, *, model=0, level='R')[source]#

Calculate the buried surface area (BSA) between two groups of chains.

The buried surface area is computed as the difference in solvent-accessible surface area (SASA) when the two groups of chains are in complex versus separate. The BSA is defined as:

BSA = (SASA(group 1) + SASA(group 2)) − SASA(complex)

Parameters:

complex – Complex to calculate BSA for.
chain_group_1 (List[str]) – List of chain IDs for the first group.
chain_group_2 (List[str]) – List of chain IDs for the second group.
model (int) – Model ID to calculate BSA for, defaults to 0.
level (str) – The level at which ASA values are assigned, which can be one of “A” (Atom), “R” (Residue), “C” (Chain), “M” (Model), or “S” (Structure).

Return type:

float

Returns:

The buried surface area (BSA) in Å².

neurosnap.protein.extract_non_biopolymers(pdb_file, output_dir, min_atoms=0)[source]#

Extracts all non-biopolymer molecules (ligands, heteroatoms, etc.) from the specified PDB file and writes them to SDF files. Each molecule is saved as a separate SDF file in the output directory. Automatically adds hydrogens to molecules. Attempts to sanitize the molecule if possible; logs a warning if sanitization fails.

Parameters:

pdb_file (str) – Path to the input PDB file.
output_dir (str) – Directory where the SDF files will be saved. Will overwrite existing directory.
min_atoms (int) – Minimum number of atoms a molecule must have to be saved. Molecules with fewer atoms are skipped.

neurosnap.protein.fetch_accessions(accessions, batch_size=150)[source]#

Fetch protein sequences corresponding to a list of UniProt accession numbers.

This function retrieves sequences from the UniProt API, checking first the UniParc database and then UniProtKB if sequences are missing. Accessions are processed in batches to handle large lists efficiently.

Parameters:

accessions (Iterable[str]) – A list of UniProt accession numbers. Duplicate accessions will be automatically removed.
batch_size (int) – Size of each batch of sequences to fetch from uniprot API per request.

Returns:

A dictionary where keys are accession numbers and values are the corresponding protein sequences. Missing sequences will have the value None.

Return type:

dict

Raises:

requests.exceptions.HTTPError – If the API request fails and raises an HTTP error.

Notes

Batching is performed with a default batch size of 150, which was determined to be optimal during testing.
The function first queries the UniParc API and then queries the UniProtKB API for any missing accessions.

Example

>>> accessions = ["P12345", "Q67890", "A1B2C3"]
>>> sequences = fetch_accessions(accessions)
>>> print(sequences["P12345"])
"MEEPQSDPSV...GDE"

Steps:

Deduplicate the input list of accessions.
Split the accessions into batches to query UniParc.
Query the UniParc API for each batch and store results.
Identify missing accessions and query UniProtKB.
Validate that all input accessions were retrieved successfully.

neurosnap.protein.fetch_uniprot(uniprot_id, head=False)[source]#

Fetches a UniProt or UniParc FASTA entry by its identifier.

This function retrieves a protein sequence in FASTA format using the UniProt REST API. If the given UniProt ID is not found in UniProtKB, the function will attempt to fetch it from UniParc. The function can either fetch the header information (if head is True) or the full sequence.

Parameters:

uniprot_id (str) – The UniProt or UniParc accession ID for the protein sequence.
head (bool, optional) – If True, performs a HEAD request to check if the entry exists without downloading the sequence. Defaults to False.

Returns:

If head is True: Returns True if the ID exists.
If head is False: Returns the protein sequence as a string if successfully fetched.

Return type:

Union[str, bool]

Raises:

Exception – If the UniProt or UniParc ID is not found in either database.
ValueError – If the retrieved FASTA contains too many or no sequences.

Example

```python try:

sequence = fetch_uniprot(“P12345”) print(sequence)

except Exception as e:: print(f”Error: {e}”)

```

neurosnap.protein.find_contacts(atoms1, atoms2, cutoff=4.5)[source]#

Identifies atomic contacts between two sets of atoms within a distance threshold.

Parameters:

atoms1 (List[Atom]) – A list of Bio.PDB.Atom objects from the binder chain.
atoms2 (List[Atom]) – A list of Bio.PDB.Atom objects from the target chain.
cutoff (float) – Distance cutoff (in Å) to consider a contact (default is 4.5 Å).

Returns:

Atom pairs (binder atom, target atom) within the cutoff.

Return type:

List[Tuple[Atom, Atom]]

neurosnap.protein.foldseek_search(protein, mode='3diaa', databases=None, max_retries=10, retry_interval=5, output_format='json')[source]#

Perform a protein structure search using the Foldseek API.

Parameters:

protein (Union[Protein, str]) – Either a Protein object or a path to a PDB file.
mode (str) – Search mode. Must be on of “3diaa” or “tm-align”.
databases (List[str]) – List of databases to search. Defaults to a predefined list if not provided.
max_retries (int) – Maximum number of retries to check the job status.
retry_interval (int) – Time in seconds between retries for checking job status.
output_format (str) – Format of the output, either “json” or “dataframe”.

Return type:

Union[str, DataFrame]

Returns:

Search results in the specified format (JSON string or pandas DataFrame).

Raises:

RuntimeError – If the job fails.
TimeoutError – If the job does not complete within the allotted retries.
ValueError – If an invalid output_format is specified.

neurosnap.protein.getAA(query, *, non_standard='reject')[source]#

Resolve an amino acid identifier to a canonical record.

This function accepts either a 1-letter code, 3-letter abbreviation, or full name (case-insensitive) and returns the corresponding AARecord.

Parameters:

query (str) – Amino acid identifier (1-letter code, 3-letter CCD abbreviation, or full name).
non_standard ({"reject", "convert", "allow"}, optional) –
Policy for handling non-standard amino acids (default: “reject”):
- ”reject”: Raise an error if the amino acid is non-standard.
- ”convert”: Map non-standard amino acids to their closest standard equivalent (e.g., MSE → MET).
- ”allow”: Return the non-standard amino acid unchanged.

Returns:

A record containing: - code: 1-letter code (may be “?” if unavailable for non-standard AAs). - abr: 3-letter abbreviation. - name: Full amino acid name. - is_standard: Whether the residue is one of the 20 canonical amino acids. - standard_equiv_abr: 3-letter abbreviation of the standard equivalent

(if applicable).

Return type:

AARecord

Raises:

ValueError – If query does not match any supported amino acid identifier. If non_standard=”reject” and the amino acid is non-standard. If non_standard=”convert” but no standard equivalent is defined.

neurosnap.protein.isoelectric_point(sequence, pKa={'C': 8.5, 'C_TERMINUS': 3.6, 'D': 3.9, 'E': 4.1, 'H': 6.5, 'K': 10.8, 'N_TERMINUS': 8.6, 'R': 12.5, 'U': 5.2, 'Y': 10.1}, *, pH_low=0.0, pH_high=14.0, tol=0.0001, max_iter=100)[source]#

Estimate the isoelectric point (pI) of a protein or peptide.

The pI is the pH at which the net charge of the molecule is zero. This function computes the net charge across pH and uses a bisection search to find the root.

Parameters:

sequence (str) – Amino acid sequence (one-letter codes). Supports the 20 canonical residues and optionally ‘U’ (selenocysteine). Non-titratable residues contribute no charge.
pKa (Dict[str, float]) – Dictionary of pKa values for titratable groups. Must include keys “N_TERMINUS”, “C_TERMINUS”, and for side chains “D”, “E”, “C”, “Y”, “H”, “K”, “R”. If ‘U’ appears in the sequence, include “U” (default ~5.2, approximate).
pH_low (float) – Lower bound of the bracketing interval for the bisection search (default 0.0).
pH_high (float) – Upper bound of the bracketing interval for the bisection search (default 14.0).
tol (float) – Target absolute net charge tolerance at the solution (default 1e-4).
max_iter (int) – Maximum iterations for the bisection search (default 100).

Return type:

float

Returns:

Estimated pI.

Notes

Results depend on the chosen pKa set. For consistency with common tools, you may substitute a different pKa dictionary (e.g., Bjellqvist or IPC sets).
Pyrrolysine (‘O’) is not included by default due to scarce consensus pKa data; it is treated as non-titratable here. You can add an entry if you have a value.
This model ignores sequence-context and microenvironment effects (local shifts in pKa due to neighbors or structure). It’s a good heuristic, not a guarantee.

neurosnap.protein.molecular_weight(sequence, aa_mws={'A': 71.0779, 'C': 103.1429, 'D': 115.0874, 'E': 129.11398, 'F': 147.17386, 'G': 57.05132, 'H': 137.13928, 'I': 113.15764, 'K': 128.17228, 'L': 113.15764, 'M': 131.19606, 'N': 114.10264, 'O': 237.29816, 'P': 97.11518, 'Q': 128.12922, 'R': 156.18568, 'S': 87.0773, 'T': 101.10388, 'U': 150.0379, 'V': 99.13106, 'W': 186.2099, 'Y': 163.17326})[source]#

Calculate the molecular weight of a protein or peptide sequence.

This function computes the molecular weight by summing the residue masses for each amino acid in the input sequence. By default, it uses average amino acid residue masses (AA_WEIGHTS_PROTEIN_AVG), but you can provide a custom mass dictionary (e.g., monoisotopic or free amino acid masses).

The calculation accounts for the loss of one water molecule (H₂O, 18.015 Da) for each peptide bond formed. For a sequence of length n, (n - 1) * 18.015 Da is subtracted from the total.

Parameters:

sequence (str) – Amino acid sequence (one-letter codes).
aa_mws (Dict[str, float]) – Dictionary mapping amino acid one-letter codes to molecular weights. Defaults to AA_WEIGHTS_PROTEIN_AVG.

Return type:

float

Returns:

Estimated molecular weight of the protein or peptide in Daltons (Da).

Raises:

ValueError – If the sequence contains an invalid or unsupported
amino acid code. –

Notes

Use AA_WEIGHTS_PROTEIN_MONO for monoisotopic mass calculations, typically used in mass spectrometry.
Use AA_WEIGHTS_PROTEIN_AVG (default) for average residue masses, appropriate for bulk molecular weight estimation.
For free amino acids (not incorporated in peptides), use AA_WEIGHTS_FREE.
Weight dictionaries are defined in constants.py.

neurosnap.protein.net_charge(sequence, pH, pKa={'C': 8.5, 'C_TERMINUS': 3.6, 'D': 3.9, 'E': 4.1, 'H': 6.5, 'K': 10.8, 'N_TERMINUS': 8.6, 'R': 12.5, 'U': 5.2, 'Y': 10.1})[source]#

Calculate the net charge of a protein or peptide sequence at a given pH.

This function applies the Henderson–Hasselbalch equation to estimate the protonation state of titratable groups (N-terminus, C-terminus, and ionizable side chains) and computes the overall net charge.

Parameters:

sequence (str) – Amino acid sequence in one-letter codes. Supports the 20 canonical residues and optionally ‘U’ (selenocysteine). Non-ionizable residues are ignored.
pH (float) – The solution pH at which to evaluate the net charge.
pKa (Dict[str, float]) – Dictionary of pKa values for titratable groups. Must include keys “N_TERMINUS”, “C_TERMINUS”, “D”, “E”, “C”, “Y”, “H”, “K”, and “R”. If ‘U’ is present in the sequence, it should also include “U”.

Return type:

float

Returns:

Estimated net charge of the sequence at the given pH.

Notes

Positive charges come from protonated groups: - N-terminus - Lysine (K) - Arginine (R) - Histidine (H)

Negative charges come from deprotonated groups: - C-terminus - Aspartic acid (D) - Glutamic acid (E) - Cysteine (C) - Tyrosine (Y) - Selenocysteine (U), if included

The calculation assumes independent ionization equilibria and does not account for local environment or structural effects. It is best interpreted as an approximate charge profile.

neurosnap.protein.run_blast(sequence, email, matrix='BLOSUM62', alignments=250, scores=250, evalue=10.0, filter=False, gapalign=True, database='uniprotkb_refprotswissprot', output_format=None, output_path=None, return_df=True)[source]#

Submits a BLAST job to the EBI NCBI BLAST web service, checks the status periodically, and retrieves the result. The result can be saved either as an XML or FASTA file. Optionally, a DataFrame with alignment details can be returned.

Parameters:

sequence (Union[str, Protein]) –
The input amino acid sequence as a string or a Protein object.

If a Protein object is provided with multiple chains, an error will be raised, and the user will be prompted to provide a single chain sequence using the Protein.get_aas method.
email (str) – The email address to use for communication if there is a problem.
matrix (str) –
The scoring matrix to use, default is "BLOSUM62".

Must be one of:
```
["BLOSUM45", "BLOSUM62", "BLOSUM80", "PAM30", "PAM70"].
```
alignments (int) –
The number of alignments to display in the result (default is 250). the number alignments must be one of the following:
```
[50, 100, 250, 500, 750, 1000]
```
scores (int) – The number of scores to display in the result, default is 250.
evalue (float) –
The E threshold for alignments (default is 10.0). Must be one of:
```
[0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
```
filter (bool) – Whether to filter low complexity regions (default is False).
gapalign (bool) – Whether to allow gap alignments (default is True).

database (str) –

The database to search in, default is "uniprotkb_refprotswissprot".

Must be one of:

["uniprotkb_refprotswissprot", "uniprotkb_pdb", "uniprotkb", "afdb", "uniprotkb_reference_proteomes", "uniprotkb_swissprot", "uniref100", "uniref90", "uniref50", "uniparc"]

output_format (Optional[str]) – The format in which to save the result, either "xml" or "fasta". If None, which is the default, no file will be saved.
output_path (Optional[str]) – The file path to save the output. This is required if output_format is specified.
return_df (bool) – Whether to return a DataFrame with alignment details, default is True.

Return type:

Optional[DataFrame]

Returns:

A pandas DataFrame with BLAST hit and alignment information, if return_df is True.

The DataFrame contains the following columns: - “Hit ID”: The identifier of the hit sequence. - “Accession”: The accession number of the hit sequence. - “Description”: The description of the hit sequence. - “Length”: The length of the hit sequence. - “Score”: The score of the alignment. - “Bits”: The bit score of the alignment. - “Expectation”: The E-value of the alignment. - “Identity (%)”: The percentage identity of the alignment. - “Gaps”: The number of gaps in the alignment. - “Query Sequence”: The query sequence in the alignment. - “Match Sequence”: The matched sequence in the alignment.

Raises:

AssertionError – If sequence is provided as a Protein object with multiple chains.

neurosnap.protein.sanitize_aa_seq(seq, *, non_standard='reject', trim_term=True, uppercase=True, clean_whitespace=True)[source]#

Validates and sanitizes an amino acid sequence string.

Parameters:

seq (str) – The input amino acid sequence.
non_standard (str) – How to handle non-standard amino acids. Must be one of: - “reject”: Raise an error if any non-standard residue is found (default). - “convert”: Replace non-standard residues with standard equivalents, if possible. - “allow”: Keep non-standard residues unchanged.
trim_term (bool) – If True, trims terminal stop codons (“*”) from the end of the sequence. Default is True.
uppercase – If True, converts the sequence to uppercase before processing. Default is True.
clean_whitespace (bool) – If True, removes all whitespace characters from the sequence. Default is True.

Return type:

str

Returns:

The sanitized amino acid sequence.

Raises:

ValueError – If an invalid residue is found and non_standard is set to “reject”, or if a residue cannot be converted when non_standard is “convert”.
AssertionError – If non_standard is not one of “allow”, “convert”, or “reject”.

neurosnap.protein module#

This Page