Optimizing protein function or fitness is among the most challenging tasks a lab can undertake. Whether the objective is merely to create functional homologs or to improve on existing proteins, modern laboratory techniques tend to be expensive and time consuming, and they leave much to be desired.
In this blog post we explore a novel protein optimization technique that can be run directly from your web browser at effectively no cost. Thanks to recent advances in deep learning and our platform Neurosnap, end-to-end in-silico mutagenesis is now a reality and far more effective than prior methods. In this article we take the brightest natural fluorescent protein, AausFP1, create 100 mutants, and predict their structures using AlphaFold2. All data from our experiment is also publicly available for scrutiny.
Laboratory mutagenesis usually refers to any technique used to deliberately create mutants of a gene of interest. Modern mutagenesis techniques such as site-directed mutagenesis, insertional mutagenesis, transposon mutagenesis, and others are capable of producing mutations, targeted or random, that can then be screened to determine their effect. This allows us to explore the fitness landscape of a protein and, in theory, optimize our protein of interest for a particular application. The major drawback of these kinds of experiments is that the protein sequence space is incredibly vast, and sampling just a fraction of it doesn't guarantee any results (this gets even harder as your protein gets longer). Furthermore, many of these experiments tend to be time consuming and expensive, and can require complicated high-throughput assays in order to even have a chance of succeeding.
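To put a number on how vast the sequence space is, here is a minimal sketch that counts possible variants. The length of 230 residues is a hypothetical placeholder for a GFP-sized protein, not the measured length of AausFP1, and the function names are our own.

```python
import math

AMINO_ACIDS = 20  # the 20 canonical amino acids


def sequence_space_size(length: int) -> int:
    """Number of distinct sequences of a given length."""
    return AMINO_ACIDS ** length


def n_point_mutants(length: int, n_mutations: int) -> int:
    """Number of variants that differ from a fixed wild type at exactly n positions."""
    # Choose which positions mutate, then pick one of 19 alternatives at each.
    return math.comb(length, n_mutations) * (AMINO_ACIDS - 1) ** n_mutations


length = 230  # hypothetical length, for illustration only
for n in (1, 2, 3):
    print(f"variants with exactly {n} substitution(s): {n_point_mutants(length, n):,}")
print(f"digits in the full sequence space: {len(str(sequence_space_size(length)))}")
```

Even restricting the search to a handful of substitutions quickly outgrows what a screening assay can cover, which is why a method that proposes a small set of plausible candidates is attractive.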
A deep mutational scan (DMS) consists of creating an initial library of mutants and homologous sequences of a protein of interest. Mutant sequences can usually be compiled from existing mutagenesis experiments (if they exist), while homologous sequences can be acquired from evolutionary data in the form of homology searches.
Instead of characterizing each protein within your library one by one, a DMS allows you to pool all your mutants together and put them through an experimental assay that measures the fitness of each protein (e.g., reaction rate, thermostability, binding affinity, brightness, or something else). The advantage of this technique is that it allows you to efficiently characterize the effects of mutations from many different samples in parallel. Once you're satisfied with your results you can choose from your most fit mutants, or use the data from the experiment with some newer in-silico techniques to further optimize your protein of interest.
However, two major drawbacks of this technique are that it requires lots of data to be effective and that it needs a compatible assay for screening. If there isn't much data available for your protein, this option isn't ideal.
AlphaFold2 is the first deep learning based approach to predict protein structures from their amino acid sequences and coevolutionary data (in the form of MSAs) with accuracy approaching that of experimental methods for many proteins. Traditional methods for determining protein structures, such as X-ray crystallography or NMR spectroscopy, tend to be expensive, laborious, and time consuming, and they don't guarantee good results. AlphaFold2, on the other hand, is relatively cheap and can usually produce predictions within minutes to hours. Another benefit of AlphaFold2 is its confidence metrics: the model is fairly good at knowing when it's wrong and probably shouldn't be trusted.
ProteinMPNN is another new deep learning model, and it tackles the inverse of AlphaFold2's problem. Instead of predicting a protein structure from an amino acid sequence, ProteinMPNN predicts the amino acids for an input structure. With ProteinMPNN the user specifies the chain(s) in a structure that they want to mask out, where masking is just a fancy word for removing the side chains and hiding the individual amino acids from the model. After this, the model's job is to take that structural information and predict which amino acids have the highest probability at each position. The nice thing about this approach is that because ProteinMPNN has access to structural information, it can directly infer which amino acids fit which positions (at least ideally). This allows us to frame in-silico mutagenesis as an inverse folding problem instead of predicting functional effects position by position.
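As a toy illustration of what "predicting amino acid probabilities for each position" can look like, the sketch below turns per-residue scores (logits) for a single position into probabilities and samples from them. This is not ProteinMPNN's actual code: the logits are made-up numbers, the function names are our own, and the real model conditions on the structure and on previously decoded residues. It does show the general effect a sampling temperature has, with lower values concentrating probability on the top-scoring residue.

```python
import math
import random


def softmax_with_temperature(logits: dict, temperature: float) -> dict:
    """Convert raw per-residue scores into probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {aa: value / temperature for aa, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {aa: math.exp(value - peak) for aa, value in scaled.items()}
    total = sum(exps.values())
    return {aa: value / total for aa, value in exps.items()}


def sample_residue(logits: dict, temperature: float, rng: random.Random) -> str:
    """Draw one amino acid for a position according to its probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    residues = list(probs)
    return rng.choices(residues, weights=[probs[r] for r in residues], k=1)[0]


# Made-up scores for one position (illustration only).
toy_logits = {"G": 2.0, "A": 1.0, "S": 0.0}
for t in (0.1, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, {aa: round(p, 3) for aa, p in probs.items()})
print(sample_residue(toy_logits, 0.1, random.Random(0)))
```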
To kick off our experiment, I went over to FPbase.org, a fluorescent protein database with extensive data on over 850 unique fluorescent proteins, and chose the brightest GFP I could find. At the time of this writing, the brightest protein characterized on FPbase was AausFP1, derived from Aequorea australis. The next step is to get a structure for this protein to feed into ProteinMPNN. Luckily for us, a 2019 study has already deposited the structure of AausFP1 into the PDB with the ID 6S67.
Note that some fluorescent proteins have a tendency to form dimers or other oligomers at higher concentrations. AausFP1 is one of those proteins. For more information, check out this description on FPbase.
So now that we have a protein of interest and its corresponding structure, it's time for us to begin the inverse folding process! To get started with ProteinMPNN, head over to the ProteinMPNN model page. Once on the model page, we upload the AausFP1 structure to the Input Structure field and set the value for Designed Chain to A,B (since there are two chains and we want ProteinMPNN to design both). Additionally, since AausFP1 is a homodimer, we made sure to check the homomer box to ensure both chains have the same sequence. Checking homomer treats the complex as a homo-oligomer and ties the amino acids together at each position across the chains. Other input fields can be left alone, as Neurosnap automatically selects values for each field that suit most use cases. In our example we generate 100 sequences and set the Sampling Temperature to
Now we just hit Run Job and wait for the results. This is usually pretty quick, and in our case took approximately one minute at a cost of $0.10 USD. Once the ProteinMPNN job is complete, we check the Output Sequences and look for the sequences with the lowest scores. ProteinMPNN outputs a score metric for its predictions, where scores closer to zero are correlated with greater prediction quality. Since the input structure we used was a multimer, different chains are delimited by a / character. Note that before inserting these sequences into AlphaFold2 we need to replace the / characters with : characters, as AlphaFold2 delimits chains using the : character. Some attentive readers might also notice XXX regions within the predicted chains. This is because the initial PDB structure contains chromophore molecules within the GFP proteins, which get replaced with the unknown amino acid symbol X. We replace the XXX regions within the output fasta with the correct chromophore region of TYG. In most GFPs the chromophore forms autocatalytically: the protein modifies these residues itself, in effect acting as both enzyme and substrate.
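The two substitutions described above are easy to script when you have many sequences. The following is a minimal sketch, assuming each designed sequence is available as a plain string; the function name and the short example sequence are hypothetical and are not taken from our results.

```python
def mpnn_to_alphafold(seq: str, chromophore: str = "TYG") -> str:
    """Prepare one ProteinMPNN multimer sequence for AlphaFold2 input."""
    # Restore the chromophore residues that were reported as unknown (XXX).
    fixed = seq.strip().replace("XXX", chromophore)
    # Swap the ProteinMPNN chain delimiter (/) for the AlphaFold2 one (:).
    fixed = fixed.replace("/", ":")
    # Any X still present was not part of a chromophore triplet; stop and inspect.
    if "X" in fixed:
        raise ValueError("unexpected unknown residue (X) left in sequence")
    return fixed


# Toy two-chain sequence, for illustration only.
print(mpnn_to_alphafold("MSKGXXXVQA/MSKGXXXVQA"))
```

Raising an error on leftover X characters is a deliberate choice here: silently passing an unknown residue into structure prediction would be harder to notice later.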
Now that we have our sequences, the next step is to predict their structures as a form of "in-silico screening". To do this we download the ProteinMPNN results as a fasta and then use that as an input to the AlphaFold2 model page (just don't forget the modifications mentioned above). It's also worth mentioning that if you don't want to predict the structures of all your predicted sequences, you can always choose the top-n (e.g., top-10, top-20) sequences with the lowest ProteinMPNN scores. Once you've selected the sequences you want to fold within the AlphaFold2 panel, the next step is to configure the rest of the model. The default settings for AlphaFold2 are sufficient for our case. However, in some cases you could benefit greatly from increasing the Number Recycles parameter. More recycling steps tend to result in higher prediction quality, at the cost of longer run times.
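Picking the top-n sequences can be sketched as follows. This assumes each fasta header carries the score as a `score=<number>` field, which is how the open-source ProteinMPNN writes its headers; the exact header layout of a downloaded results file may differ, so treat the regular expression as an assumption to adapt. The function names and the toy records are our own.

```python
import re

# Matches "score=0.73" but not "global_score=0.73" (assumed header layout).
SCORE_RE = re.compile(r"(?<![A-Za-z_])score=([-+]?\d+(?:\.\d+)?)")


def parse_fasta(text: str) -> list:
    """Return (header, sequence) pairs from fasta-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def top_n_by_score(fasta_text: str, n: int) -> list:
    """Return the n records with the lowest score as (score, header, sequence)."""
    scored = []
    for header, seq in parse_fasta(fasta_text):
        match = SCORE_RE.search(header)
        if match:  # records without a score field are skipped
            scored.append((float(match.group(1)), header, seq))
    scored.sort(key=lambda item: item[0])
    return scored[:n]


# Toy records, for illustration only.
toy = ">design_1, score=1.20\nMSKG\n>design_2, score=0.80\nMTKG\n>design_3, score=0.95\nMSRG\n"
for score, header, seq in top_n_by_score(toy, 2):
    print(score, seq)
```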
Finally, once AlphaFold2 is complete we can download the results and use the structure with the highest mean pLDDT. The pLDDT, or predicted Local Distance Difference Test, is a per-residue confidence metric produced by AlphaFold2. Higher pLDDT scores are generally correlated with greater prediction quality for that region, while low scores often coincide with disordered or poorly predicted regions. Another metric that can be used to rank the predicted structures is pTM, where higher values are correlated with higher overall prediction quality. In our example we ranked our predictions by their mean pLDDT and present the highest ranked structure below. Next steps for this pipeline could include synthesizing the top three sequences and screening them with an in vivo assay.
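Ranking by mean pLDDT can be done directly from the predicted PDB files, because AlphaFold2 writes the per-residue pLDDT into the B-factor column of each atom record. The sketch below assumes standard fixed-column PDB formatting and reads one value per residue from the alpha carbon (CA) atoms; the function names are our own.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average per-residue pLDDT, read from the B-factor column of CA atoms."""
    values = []
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, columns 61-66 the B-factor.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no CA atoms found in PDB text")
    return sum(values) / len(values)


def rank_by_mean_plddt(structures: dict) -> list:
    """Sort {name: pdb_text} by mean pLDDT, highest first."""
    scores = {name: mean_plddt(text) for name, text in structures.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

To use it, read each predicted structure into a string (for example with `pathlib.Path(path).read_text()`), collect them in a dictionary keyed by mutant name, and take the first entry of `rank_by_mean_plddt`.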
We uploaded all our results, including the predicted AlphaFold2 structures for each mutant protein, to a public GitHub repository. The pictures below show, respectively, the best predicted structure according to pLDDT, the best according to ProteinMPNN score, and the two structures aligned on each other. The TM-Score between the two aligned structures is
Here we plot various metrics of the resulting predictions. Interestingly, there doesn't seem to be any meaningful correlation between the ProteinMPNN scores and the AlphaFold2 metrics.
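If you want to check for such a correlation on your own results, one option is to compute Pearson and Spearman coefficients between the ProteinMPNN scores and the mean pLDDT values. The sketch below is a plain-Python implementation with no external dependencies. It is not the code used to produce our plots, and the numbers in the example are placeholders rather than measurements from this experiment.

```python
import math


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def ranks(values: list) -> list:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs: list, ys: list) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder values, for illustration only.
mpnn_scores = [0.81, 0.92, 0.77, 1.05, 0.88]
mean_plddts = [93.1, 94.0, 92.5, 93.8, 91.9]
print(round(pearson(mpnn_scores, mean_plddts), 3))
print(round(spearman(mpnn_scores, mean_plddts), 3))
```

Spearman is included alongside Pearson because it only depends on rank order, which makes it less sensitive to a few outlying predictions.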