Identifying synthetic genes and understanding their use in bioengineering
This "Behind the paper" blog post references: Kunjapur, A. M., Pfingstag, P. & Thompson, N. C. Gene synthesis allows biologists to source genes from farther away in the tree of life. Nat. Commun. 9, 4425 (2018). DOI: 10.1038/s41467-018-06798-7
How do new tools influence innovation in research? Neil Thompson’s group at the MIT Sloan School of Management was interested in this question and considered the field of synthetic biology as a testbed for studying the influence of new tools. In the summer of 2014, they reached out to MIT labs participating in the Synthetic Biology Engineering Research Consortium (Synberc, now EBRC) in search of a tutor who could provide subject matter expertise. I was intrigued by what lessons synthetic biologists could potentially draw from innovation across other fields and expressed interest in that tutoring role. Our collaboration began as I covered topics ranging from the basics of the central dogma to genome editing tools such as CRISPR-Cas9. The influence of CRISPR on research innovation was interesting but perhaps too new, so we continued discussing other tools until we reached DNA synthesis and DNA sequencing.
DNA synthesis is considered the key enabling technology for the field of synthetic biology. Its cost has decreased by orders of magnitude during the last two decades, and as a result it has become a routine service used by academic labs across the world. While this has fostered the development of academic and commercial technologies across numerous industrial sectors, some communities are concerned about the reduced barriers to engineering organisms.
Neil observed that the clear decrease in cost during the last two decades could make for a rich economics-oriented manuscript on how this trend has affected innovation in synthetic biology. The ability to identify synthetic DNA sequences would be essential to conduct this kind of study. Yet, to our knowledge the technical literature in synthetic biology contained no strategy for identification of synthetic sequences. Moreover, the need to identify synthetic sequences and the engineered organisms that harbor them had never been greater. These realizations became the genesis of a related but separate line of inquiry that was better suited for a scientific publication.
In “Gene synthesis allows biologists to source genes from farther away in the tree of life”, we present a bird’s eye view on a trend enabled by affordable gene synthesis within the academic biological research community. First, we developed a robust classifier for natural or synthetic genes based on sequence alone. We had a sense that sequences from nature would be contained in a publicly available database and that synthetic sequences would need to be different, but we did not know in what ways and by how much. We used a combination of theory, simulation, and machine learning to arrive at a threshold of sequence percentage identity arising from use of the nucleotide basic local alignment search tool (BLASTn) against the RefSeq reference genomic collection. Philipp Pfingstag’s development and implementation of this simple classifier on a test set of 173 sequences compiled by me resulted a remarkable 97.7% accuracy. Encouraged by this result and outside interest in applying our method to biosurveillance, we applied the strategy to a larger sequence database to investigate whether synthetic sequences were being used differently than their natural counterparts.
We could not have performed this study without tremendous assistance from the Addgene plasmid repository, which provided us with a database that contained over 19,000 unique sequences. Equipped with this rich dataset, Philipp examined one of my pet hypotheses about whether gene synthesis was being used disproportionately for expressing heterologous genes in model organisms. As a metabolic engineer, I view evolutionarily distant genome and metagenome collections as rich treasure troves of biosynthetic clusters, genetic parts, and orthogonal tools. From my own experience I knew that amplification of these natural genes for subsequent expression in the model organism Escherichia coli presents the risk of failed expression due to codon usage and that genomic DNA templates could take a long time to obtain. While in graduate school, I switched over to ordering synthetic codon-optimized DNA sequences to address both concerns and because affordable and synthetic linear DNA fragments became commercially available. But what about the community at large?
It was exciting to observe that the average genetic distance between organisms that we defined as the “source” and “expression” organisms for individual gene sequences was significantly greater for synthetic genes than for natural genes. This underscores one of the effects that DNA synthesis is having on synthetic biology innovation while also highlighting why synthetic sequences are strong indicators of engineered organisms that efficiently exhibit non-native traits. We hope our classification strategy will be part of a suite of tools used to identify such organisms as DNA synthesis technology continues to be democratized.