DNA2.0 has a strong history of protein engineering projects with industry and academia, including ADM, Pfizer, Stanford, California Institute of Technology, and many others. The DNA2.0 ProteinGPS® proprietary technology uses megadimensional, empirical optimization processes to calculate the set of nodes that are information-rich in the relevant space, gene synthesis to make those exact sequences, and machine learning to find the preferred solution.
Protein properties such as activity, substrate specificity, expression ability, affinity, aggregation, immunogenicity, and much more can be engineered through changes in the amino acid sequence of a protein or protein complex. Historically, there are two general strategies for protein engineering: rational protein design relying on a complete mechanistic understanding and directed evolution relying on more or less random search and selection.
|Feature||Typical Library Approach||ProteinGPS® Engineering|
|Size to screen per round||10^4 – 10^12; limited by transformation efficiency||48 – 96|
|Sampling of sequence space||Highly biased due to molecular biology process||Mathematically optimal|
|Assay requirement||High throughput; typically requires a surrogate assay||Low throughput, high quality assay; identical/similar to ‘real’ function|
| ||Usually pick best clone and repeat||Iterative expansion of comprehensive sequence-function map|
|Functional statistics||Fragile, substitutions not internally validated||Robust, substitutions validated in multiple systematic contexts|
|Engineering emphasis||Test||Design, Learn|
DNA2.0 has developed a unique protein engineering platform based on machine learning and Design of Experiment (DoE). The ProteinGPS proprietary technology uses the same megadimensional, empirical optimization processes currently applied to gasoline formulation, web advertising, and stock market investing. ProteinGPS technology relies on DoE to calculate the set of nodes that are information-rich in the relevant space, gene synthesis to make those exact sequences, and machine learning to find the preferred solution.
DNA2.0 combines state-of-the-art gene synthesis with cutting edge machine learning and ‘big data’ to explore and exploit sequence-function relationships, identifying protein variants with significantly improved characteristics; typically assaying 96-400 variants over 3-5 rounds.
- Optimize directly for function in the final application
- Screen only 48-96 variants per round
- Sample space of >10^16
- Low throughput, high quality assay; identical/similar to ‘real’ function
- No redundant clones
- Multidimensional function optimization
- Robust functional statistics, substitutions validated in multiple systematic contexts
Design of Experiment
DoE relies on systematic variance to explore sequence-function correlation efficiently. The figure above illustrates this with an example of optimizing enzymatic reaction conditions: the first set of experiments alters the temperature (X axis), and the second set alters the pH (Y axis) of the reaction. The Z dimension indicates the resulting enzymatic activity. The first sampling (white) and second sampling (yellow) show how this 2-dimensional space is navigated using one-factor-at-a-time (OFAT) experimentation or DoE, with the final outcome of each marked by a red circle. Now imagine an optimization in 5, 10, or even 50 dimensions. The DoE process is faster and more efficient, and because it captures multivariable interactions it can reach the best solution where OFAT stalls.
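The contrast between OFAT and a designed experiment can be sketched in a few lines. This is a minimal illustration under stated assumptions: the response surface `activity`, the coded factor levels, and the OFAT starting pH are all made up, and a plain 3 x 3 full factorial stands in for whatever design a real DoE program would choose. Both strategies spend the same nine runs.

```python
from itertools import product

def activity(temp_level, ph_level):
    """Hypothetical response surface with a strong temperature x pH interaction."""
    return (temp_level + ph_level) - 3 * (temp_level - ph_level) ** 2

LEVELS = range(5)  # coded factor levels 0..4 (illustrative, not real units)

def ofat(start_ph=0):
    """Sweep temperature at a fixed pH, keep the best, then sweep pH."""
    best_t = max(LEVELS, key=lambda t: activity(t, start_ph))
    best_p = max(LEVELS, key=lambda p: activity(best_t, p))
    runs = 2 * len(LEVELS) - 1  # the point shared by both sweeps is measured once
    return (best_t, best_p), activity(best_t, best_p), runs

def factorial(levels=(0, 2, 4)):
    """Vary both factors together on a 3 x 3 grid and keep the best run."""
    design = list(product(levels, repeat=2))
    best = max(design, key=lambda tp: activity(*tp))
    return best, activity(*best), len(design)

ofat_result = ofat()
doe_result = factorial()
print("OFAT:     ", ofat_result)
print("Factorial:", doe_result)
```

On this made-up surface the OFAT search never leaves its starting corner, because improving one factor alone always lowers activity, while the factorial design reaches the high-activity region by changing both factors at once.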
Sequence diversity based on the pairwise Hamming distances in a typical random library (left) and a ProteinGPS dataset (right). The ProteinGPS gene variants (Infologs) enable efficient DoE based sampling of the entire sequence-function space.
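The diversity measure named in this caption is straightforward to compute. The sketch below tallies pairwise Hamming distances for a set of aligned sequences; the four short sequences are invented for illustration and are not taken from any ProteinGPS dataset.

```python
from collections import Counter
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def pairwise_hamming_histogram(seqs):
    """Count how many sequence pairs fall at each Hamming distance."""
    return Counter(hamming(a, b) for a, b in combinations(seqs, 2))

# Made-up aligned variants, for illustration only.
variants = ["MKTAYIAK", "MKTAYLAK", "MRTAYIAR", "MKSAYLAK"]
hist = pairwise_hamming_histogram(variants)
for distance in sorted(hist):
    print(distance, hist[distance])
```

A histogram concentrated at a few small distances suggests many near-duplicate clones, while a set that is deliberately spread out shows a broader and more even distribution.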
Finding the Best Starting Point
Before initiating a ProteinGPS program, there is often a need to identify a good starting point. This step is particularly relevant for biocatalysis and similar applications, and less so for protein therapeutics. DNA2.0 has developed a standardized process to uniformly and accurately sample the phylogenetic tree of one or more protein families. The sampled sequence space is derived from public domain genetic databases and other sources.
Homologs are re-coded for host expression (e.g. E. coli) using DNA2.0 proprietary technology and tagged for solubility and purification. The genes are synthesized, Electra-cloned, sequence verified, expressed, and the relevant functional activity(ies) assessed.
Depending on functional activity outcome from the homologs, it may be relevant to further drill down into one or more of the richer phylogenetic branches for synthesis and testing of additional related homologs. This second iteration is often useful for increased functionality and/or broader intellectual property claims.
The ProteinGPS process may use one, two, or more starting points for the subsequent ProteinGPS engineering depending on the outcome of the phylogenetic search, the number and distribution of the functional properties to engineer, and any other constraints that could affect the search. ProteinGPS starting point(s) are limited only by imagination.
ProteinGPS Engineering Process
All optimization can be divided into two key steps: Variable selection – how to choose amino acid substitutions to test, and Search – how to combine substitutions for best effect.
Variable Selection: DNA2.0 builds a complete alignment of all homologs of a given protein family centered on the starting point(s) and identifies all sequence diversity available. Each amino acid substitution in the alignment is assigned multiple scores based on evolutionary, structural, and functional analysis (if available). Scores for each substitution are averaged, normalized, and mean centered. Substitutions are rank-ordered accordingly and top substitutions are included for ProteinGPS.
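One way to read the scoring step is sketched below. The text does not specify the order of operations or the normalization used, so this sketch assumes that each score type is converted to z-scores across substitutions (which both normalizes and mean-centers it) and that the z-scores are then averaged per substitution. The substitution names, score names, and values are hypothetical.

```python
from statistics import mean, pstdev

def zscores(values):
    """Mean-center and scale to unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]

def rank_substitutions(scores, top_n):
    """scores maps substitution -> {score_name: value}; all share the same score names."""
    names = sorted(scores)
    kinds = sorted(next(iter(scores.values())))
    # Normalize each score type across all substitutions.
    normalized = {
        kind: dict(zip(names, zscores([scores[name][kind] for name in names])))
        for kind in kinds
    }
    # Average the normalized scores for each substitution.
    combined = {name: mean(normalized[kind][name] for kind in kinds) for name in names}
    ranked = sorted(names, key=lambda name: combined[name], reverse=True)
    return ranked[:top_n], combined

# Hypothetical substitutions and scores, for illustration only.
example_scores = {
    "A23V": {"evolutionary": 0.9, "structural": 8.0},
    "K45R": {"evolutionary": 0.5, "structural": 6.0},
    "L78F": {"evolutionary": 0.1, "structural": 2.0},
    "S101T": {"evolutionary": 0.5, "structural": 4.0},
}
top, combined = rank_substitutions(example_scores, top_n=2)
print(top)
```

Putting every score type on a common scale before averaging keeps a score with large raw values (here the structural one) from dominating the ranking.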
Search: The variables identified are incorporated in a systematically varied set of Infologs centered around the starting point(s). In the first round each Infolog is typically 3-5 amino acid substitutions away from all other Infologs in the round, including the starting point. Each substitution is present in 4-6 Infologs to assess additivity. The substitution distribution in the Infolog set is determined through DoE algorithms. This process allows for maximum search efficiency throughout the ‘design-build-test-learn’ cycle.
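The DoE algorithms used to distribute substitutions are not described here, so the following is only a stand-in: a classical balanced incomplete block design built from the lines of an affine plane over the integers modulo 5. It places 25 placeholder substitutions into 30 designed variants so that every variant carries 5 substitutions, every substitution appears in 6 variants, and every pair of substitutions is tested together exactly once. It reproduces the replication idea from the text, not the stated distances between variants.

```python
from collections import Counter
from itertools import combinations

P = 5  # a prime; yields P*P substitutions and P*P + P designed variants

def affine_plane_blocks(p=P):
    """Lines of the affine plane over Z_p, each as a frozenset of point indices."""
    blocks = []
    for slope in range(p):
        for intercept in range(p):
            blocks.append(
                frozenset(x * p + (slope * x + intercept) % p for x in range(p))
            )
    for c in range(p):  # vertical lines x = c
        blocks.append(frozenset(c * p + y for y in range(p)))
    return blocks

# Placeholder substitution names; a real set would come from variable selection.
substitutions = [f"sub{i:02d}" for i in range(P * P)]
designs = [sorted(substitutions[i] for i in block) for block in affine_plane_blocks()]

# How often each substitution, and each pair of substitutions, is used.
usage = Counter(s for d in designs for s in d)
pair_counts = Counter(pair for d in designs for pair in combinations(d, 2))

print(len(designs), "variants;", "uses per substitution:", set(usage.values()))
print("co-occurrences per substitution pair:", set(pair_counts.values()))
```

Because each substitution is seen in several different backgrounds, its effect can later be separated from the effects of the substitutions it happens to travel with.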
1. Design: DoE-based mining of available space and combining substitutions in an information-maximized variant dataset.
2. Build: Synthesizing individually designed Infologs (48-96 per round) ensures that the physical implementation is identical to the virtual design, with no random mutations.
3. Test: Test in commercially relevant assays. Results allow us to deconvolute how substitutions within a protein sequence modify its function.
4. Learn: Establish a sequence-function model from the assay results and cross-validate. Models are assessed based on their predictive value.
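The Learn step can be illustrated with the simplest possible sequence-function model. The sketch below assumes a purely additive model fitted by ordinary least squares and scored by leave-one-out cross-validation; the source does not state which model class or validation scheme is actually used. The substitution effects, the variant set (all pairs of six substitutions), and the fixed perturbation standing in for assay noise are all invented.

```python
from itertools import combinations
from math import sqrt

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(rows, ys):
    """Least-squares weights from the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, row):
    return sum(w * x for w, x in zip(weights, row))

# Made-up additive effects (log scale, relative to the starting protein).
TRUE_EFFECTS = [0.8, 0.5, 0.2, -0.1, -0.4, -0.7]
n = len(TRUE_EFFECTS)
# Each variant carries two substitutions; rows are presence/absence indicators.
variants = [[1 if j in pair else 0 for j in range(n)]
            for pair in combinations(range(n), 2)]
# Fixed +/-0.05 perturbation standing in for assay noise.
noise = [0.05 if i % 2 == 0 else -0.05 for i in range(len(variants))]
measured = [predict(TRUE_EFFECTS, v) + e for v, e in zip(variants, noise)]

weights = fit(variants, measured)

# Leave-one-out cross-validation: predict each variant from all the others.
errors = []
for i in range(len(variants)):
    w = fit(variants[:i] + variants[i + 1:], measured[:i] + measured[i + 1:])
    errors.append(predict(w, variants[i]) - measured[i])
loo_rmse = sqrt(sum(e * e for e in errors) / len(errors))

print("estimated effects:", [round(w, 2) for w in weights])
print("leave-one-out RMSE:", round(loo_rmse, 3))
```

The held-out error is the quantity that speaks to predictive value: a model that only fits the variants it was trained on would show a small training error but a large leave-one-out error.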
Glutathione S-transferases (GST) are key enzymes involved in chemical detoxification and breakdown. Wheat GST with the ability to detoxify a panel of 15 common herbicides was designed using ProteinGPS. The relative functional contribution of 60 amino acid substitutions against 15 substrates was quantified using only 96 Infologs and subsequently improved by making a small set (16) of 2nd generation Infologs. Govindarajan et al., ACS Synth Biol. 2015 Mar 20;4(3):221-7.
Heat map of the catalytic activity of 96 Infologs (rows) against 15 substrates (columns). Functional values are normalized. Dark blue signifies high activity.
Engineered diversity of GST activity and substrate specificity: highly predictable GST sequence-function models against two commercially relevant herbicides were created with quantification of relative functional contribution of 60 amino acid substitutions in two dimensions.
Principal component analysis (PCA) showing improved activity and altered specificity of GST variants. Predicted amino acid changes give desired shifts in substrate specificity.
Researchers at Pfizer and DNA2.0 used ProteinGPS to engineer an aminotransferase for the biocatalysis of a key chiral intermediate in the synthesis of imagabalin, an advanced anxiolytic drug candidate. The starting wild-type protein, Vfat, is an ω-amino acid:pyruvate transaminase with very weak but detectable catalytic activity toward aliphatic amines. Designing and testing fewer than 450 synthesized Vfat variants resulted in an aminotransferase optimized for substrate selectivity and reaction velocity sufficient for the commercial biocatalysis goal. Midelfort et al. Protein Eng Des Sel. 2013. 26(1):25-33.
Process & Enzyme Engineering
Engineering of biocatalysis enzyme for Pfizer pharmaceutical intermediate synthesis. Four rounds (R1-R4) of biocatalytic variants screened for stereospecific activity for desired novel substrate. Several orders of magnitude improvement in specific activity was achieved testing a total of only 300 samples.
Multiple Functional Criteria
An industrial partner desired a protein with increased activity (blue bars), thermostability (red bars), and solubility (green bars) over their current candidate. ProteinGPS was used to characterize substitutions altering functionality in multiple dimensions. Combining positive effect substitutions ultimately produced several variants with orders of magnitude improvement in all 3 criteria.
Search the DNA2.0 Literature Database, containing over 1,200 scientific publications using DNA2.0 technology, for references relevant to your research.
ACS Synth Biol 2015; 4(3):221-227. Mapping of Amino Acid Substitutions Conferring Herbicide Resistance in Wheat Glutathione Transferase. Govindarajan et al.
Protein Eng Des Sel 2013; 26(1):25-33. Redesigning and characterizing the substrate specificity and activity of Vibrio fluvialis aminotransferase for the synthesis of imagabalin. Midelfort, KS. et al.
PNAS 2010; 107(5):1948-53. Reconstructed evolutionary adaptive paths give polymerases accepting reversible terminators for sequencing and SNP detection. Chen, F. et al.
J Biol Chem 2009; 284(39):26229-33. SCHEMA recombination of a fungal cellulase uncovers a single mutation that contributes markedly to stability. Heinzelman, P. et al.
PNAS 2009; 106(14):5610-5. A family of thermostable fungal cellulases created by structure-guided recombination. Heinzelman, P. et al.
Protein Eng Des Sel 2008; 21:699-707. Protein engineering of improved prolyl endopeptidases for celiac sprue therapy. Ehren, Govindarajan, Morón, Minshull, Khosla.
BMC Biotechnol. 2007; 7:16. Engineering proteinase K using machine learning and synthetic genes. Liao, Warmuth, Govindarajan, Ness, Wang, Gustafsson, Minshull.
Curr Opin Chem Biol 2005; 9:202-9. Predicting enzyme function from protein sequence. Minshull, Ness, Gustafsson, Govindarajan.
Curr Opin Biotechnol 2003; 14:366-70. Putting engineering back into protein engineering. Bioinformatic approaches to catalyst design. Gustafsson, Govindarajan, Minshull.
- PEGS 2016: Engineering Biology for Optimized Antibody Production
- Protein Engineering Solutions Brochure
- PepTalk, 2016: Engineering Biological Systems from Genes to Genomes
- BIO World Congress, 2015: Tools for Engineering Better Biology
- PEGS, 2015: Quantitative Biology – Tools to Build Better Biology
- Protein & Antibody Congress London, 2015: Using Quantitative Biology to Engineer Protein Properties
- GRC Biocatalysis, 2014: Systematic Exploration of Sequence Space for Protein Engineering Poster
- Autodesk Ideas Conference, 2012: Gene Synthesis + Machine Learning = BioDesign
Video – ProteinGPS Engineering Overview
Webinar – ProteinGPS Engineering via Systematic Exploration of Space
Learn more about Protein Engineering and Infologs. ProteinGPS relies on identifying key amino acid substitutions through bioinformatics-based mining of available sequence space and combining such substitutions in information maximized Infologs – synthetic gene variants designed to be systematically varied across the searched space.
Using a small set of variants to explore the sequence space systematically helps us understand the effects of substitutions on protein activities and, in turn, determine strategies for exploring the sequence space further. This is achieved by using machine learning techniques to analyze data from a small number of systematically designed variants of the protein, usually on the order of 100. We can address questions related to additivity and multidimensional effects of substitutions on the various properties and activities that can be measured accurately under commercially relevant conditions.
View a pdf of the webinar slides.
Patents: This technology is covered by issued US patents 8825411, 8635029, 8412461, 8401798, 8323930, 8126653, 8005620, 7805252, 7561973, 7561972, and related pending applications.