Synthetic biology has dramatically improved our quality of life in many areas. However, the risk of misuse or abuse of genome engineering techniques is growing. To mitigate this risk, it is critically important to be able to trace an unknown engineered DNA sequence back to its depositing lab. Recently, scientists have tackled this lab-of-origin problem with deep learning approaches (Nielsen and Voigt 2018; Alley et al. 2020). Although “lab-of-origin” prediction accuracy is increasing, neural networks do not provide detailed explanations for their decisions. Yet explicit evidence supporting the rationale of a “lab-of-origin” prediction is crucial for establishing effective biosecurity strategies. This limitation of neural networks motivated us to develop an effective and explainable pipeline, called PlasmidHawk, to predict the true depositing labs of unknown DNA sequences (“PlasmidHawk improves lab of origin prediction of engineered plasmids using sequence alignment.” Nat Commun 12, 1167 (2021). https://doi.org/10.1038/s41467-021-21180-w).
Figure 1. PlasmidHawk pipeline. First, a synthetic sequence pan-genome is built by Plaster. Then the pan-genome is annotated with the depositing lab information. To predict the lab-of-origin of an unknown plasmid, PlasmidHawk aligns the unknown plasmid to the annotated pan-genome (Prediction Step 1) and counts the number of aligned fragments for each lab (Prediction Step 2). Finally, PlasmidHawk calculates a lab score for each lab. The lab(s) with the minimum lab score are the predictions for lab-of-origin (Prediction Step 3).
From the beginning, we decided to apply alignment-based methods to this problem, as alignment is the cornerstone of comparative genomics. The initial idea behind PlasmidHawk was to summarize all the synthetic sequences in the database and see whether we could link any unique parts of the sequences, also referred to as signature sequences, directly to individual labs. The best way to achieve this goal is to build a pan-genome for all the synthetic sequences. A pan-genome is commonly defined as the entire gene set of all sequences in a group. In other words, a pan-genome usually ignores the intergenic regions of the input sequences. However, we postulated that signature sequences may not fall entirely within single genes. At the time of our research, the popular pan-genome construction software used genes as building blocks and could not scale to tens of thousands of input sequences. Therefore, our first task was to develop faster linear pan-genome construction software that takes into account not only genes but also intergenic regions. Our project, Plaster, was designed for this purpose (Wang et al. 2019). It constructs a linear pan-genome that contains the distinct fragment sequences extracted from the input sequences. In this way, Plaster helps us extract signature sequences that cover partial genes or intergenic areas.
After building the pan-genome for all the synthetic sequences in the training database using Plaster, we annotated the pan-genome fragments with the labs that used those fragments in their synthetic plasmids (Figure 1 Annotation). Each pan-genome fragment could be annotated with one or more labs. To predict the “lab-of-origin” of unknown sequences, we aligned the unknown sequences to the pan-genome (Figure 1 Prediction Step 1). An unknown sequence could align with several fragments in the pan-genome. Following the alignment, we first took a naive approach, assigning the labs with the largest number of aligned pan-genome fragments as the predicted depositing labs (Figure 1 Prediction Step 2). To our surprise, the accuracy of this naive approach was already higher than that reported in the Nielsen et al. study (Nielsen and Voigt 2018). One reason we postulated for PlasmidHawk outperforming the deep learning approach is that PlasmidHawk captured longer signature sequences, which the machine learning method failed to do because of the size of its sliding window.
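The naive counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PlasmidHawk's actual code: the function name, the fragment and lab identifiers, and the dictionary used to hold the annotations are all hypothetical, and the alignment itself is assumed to have already produced the list of aligned fragments.

```python
from collections import Counter


def predict_labs_by_fragment_count(aligned_fragments, fragment_labs):
    """Return the lab(s) annotated on the most aligned fragments, plus all counts.

    aligned_fragments: IDs of pan-genome fragments the unknown plasmid aligned to
                       (hypothetical IDs; the aligner is outside this sketch).
    fragment_labs:     mapping from fragment ID to the set of labs annotated on it.
    """
    counts = Counter()
    for fragment in aligned_fragments:
        # Each aligned fragment contributes one count to every lab annotated on it.
        for lab in fragment_labs.get(fragment, ()):
            counts[lab] += 1
    if not counts:
        return [], counts
    best = max(counts.values())
    return sorted(lab for lab, n in counts.items() if n == best), counts


# Toy annotated pan-genome (made-up data for illustration only).
fragment_labs = {
    "frag1": {"lab_A", "lab_B"},
    "frag2": {"lab_A"},
    "frag3": {"lab_B", "lab_C"},
    "frag4": {"lab_B", "lab_C"},
}
top_labs, counts = predict_labs_by_fragment_count(
    ["frag1", "frag2", "frag3"], fragment_labs
)
print(top_labs, dict(counts))
```

In this toy input, lab_A and lab_B each have two aligned fragments, so the naive approach returns both labs and cannot separate them.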
Although our initial attempt had already achieved decent prediction accuracy, we wanted to improve PlasmidHawk's performance further. The key observation was that although an unknown sequence can map to multiple pan-genome fragments, the uniqueness of those aligned fragments varies considerably. Based on this observation, we proposed a scoring scheme that assigns more weight to aligned pan-genome fragments annotated with fewer labs (Figure 1 Prediction Step 3). Additionally, when multiple labs have the same number of aligned pan-genome fragments, the scoring scheme tends to pick the labs with fewer fragments in the pan-genome. Overall, the lab score we proposed refined the output of the naive approach and improved the prediction accuracy dramatically.
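To make the idea of the lab score concrete, here is a minimal Python sketch. The formula is an assumption chosen only to reproduce the three properties described above (rarer fragments weigh more, labs with fewer pan-genome fragments are favored in ties, and the minimum score wins); it is not necessarily the formula used in the published pipeline. The names `lab_scores` and `predict_labs_by_score` and the toy data are hypothetical.

```python
import math
from fractions import Fraction


def lab_scores(aligned_fragments, fragment_labs):
    """Illustrative lab score (an assumed formula, lower is better).

    weight(f) = 1 - (labs annotated on f) / (total labs)
    score(l)  = -log(weight of l's aligned fragments / weight of all l's fragments)
    """
    all_labs = set().union(*fragment_labs.values())
    n_labs = len(all_labs)
    # Fragments annotated with fewer labs are more informative, so they weigh more.
    weight = {f: 1 - Fraction(len(labs), n_labs) for f, labs in fragment_labs.items()}
    aligned = set(aligned_fragments)
    scores = {}
    for lab in all_labs:
        lab_fragments = [f for f, labs in fragment_labs.items() if lab in labs]
        total = sum(weight[f] for f in lab_fragments)
        hit = sum(weight[f] for f in lab_fragments if f in aligned)
        # Normalizing by the lab's total weight favors labs with fewer fragments.
        scores[lab] = -math.log(float(hit / total)) if hit > 0 else math.inf
    return scores


def predict_labs_by_score(aligned_fragments, fragment_labs):
    """Return the lab(s) with the minimum score, plus all scores."""
    scores = lab_scores(aligned_fragments, fragment_labs)
    best = min(scores.values())
    return sorted(lab for lab, s in scores.items() if s == best), scores


# Toy annotated pan-genome (made-up data for illustration only).
fragment_labs = {
    "frag1": {"lab_A", "lab_B"},
    "frag2": {"lab_A"},
    "frag3": {"lab_B", "lab_C"},
    "frag4": {"lab_B", "lab_C"},
}
predicted, scores = predict_labs_by_score(["frag1", "frag2", "frag3"], fragment_labs)
print(predicted)
```

In this toy input, lab_A and lab_B each have two aligned fragments, but lab_A matched a fragment that only it uses and has fewer fragments in the pan-genome overall, so it receives the lowest score.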
Later on, thanks to feedback from the reviewers, we decided to look into the details of the benchmark results for the Convolutional Neural Network (CNN) and PlasmidHawk, in order to find out which techniques can or cannot be traced back to the engineering labs. Both PlasmidHawk and the CNN have trouble identifying the “lab-of-origin” of plasmids whose main backbones came from other labs and into which only single-nucleotide mutations were introduced. To further identify the techniques that are readily traceable, we examined the functions of the signature sequences identified by PlasmidHawk. At the time we were conducting our experiments, serine/threonine-protein kinase PknD appeared as the most frequent signature sequence. In other words, serine/threonine-protein kinase PknD is potentially the most easily traceable genetic element. In addition, CRISPR-associated endonuclease proteins, one of the most popular genetic engineering techniques, were used as signature sequences multiple times (Supplementary Figure 8).
Finally, we believe that PlasmidHawk can help reveal the research landscape in synthetic biology and characterize research diversity. We built a lab-relatedness tree by aligning the synthetic plasmids from each depositing lab to the synthetic plasmid pan-genome. In the lab-relatedness tree, labs that collaborate with each other or work in similar fields tend to appear in the same clade. Furthermore, to assess the research diversity within labs, we proposed within-lab research diversity scores, which count the number of unique sequences each lab has in the pan-genome (Supplementary Note 2). In this way, we identified “influencers” who have been working on a wide range of subfields in synthetic biology.
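As a rough illustration of the counting behind a within-lab diversity score, the following Python sketch tallies pan-genome fragments per lab. The exact definition is given in Supplementary Note 2 and is not reproduced here, so the counting rule is an assumption: by default every fragment annotated with a lab is counted, and the hypothetical `exclusive` flag restricts the count to fragments annotated with that lab alone. The function name and the toy data are also hypothetical.

```python
from collections import Counter


def diversity_scores(fragment_labs, exclusive=False):
    """Count pan-genome fragments per lab (an assumed reading of the score).

    fragment_labs: mapping from fragment ID to the set of labs annotated on it.
    exclusive:     if True, count only fragments annotated with a single lab.
    """
    scores = Counter()
    for labs in fragment_labs.values():
        if exclusive and len(labs) != 1:
            continue  # skip fragments shared between labs
        for lab in labs:
            scores[lab] += 1
    return scores


# Toy annotated pan-genome (made-up data for illustration only).
fragment_labs = {
    "frag1": {"lab_A", "lab_B"},
    "frag2": {"lab_A"},
    "frag3": {"lab_B", "lab_C"},
    "frag4": {"lab_B", "lab_C"},
}
all_counts = diversity_scores(fragment_labs)
exclusive_counts = diversity_scores(fragment_labs, exclusive=True)
print(all_counts.most_common(), dict(exclusive_counts))
```

Under the default rule, lab_B touches the most fragments in this toy pan-genome; under the exclusive rule, only lab_A has a fragment that no other lab uses. Which reading better matches the published score is left open here.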
In conclusion, we propose an alignment-based approach to predict the lab-of-origin of unknown engineered sequences. The goal of this study is not to downplay the importance of deep learning in lab-of-origin analyses, but rather to raise awareness of the value of investing in traditional comparative genomic methods.
Alley, Ethan C., Miles Turpin, Andrew Bo Liu, Taylor Kulp-McDowall, Jacob Swett, Rey Edison, Stephen E. Von Stetina, George M. Church, and Kevin M. Esvelt. 2020. “A Machine Learning Toolkit for Genetic Engineering Attribution to Facilitate Biosecurity.” Nature Communications 11 (1): 6293.
Wang, Qi, R. A. Elworth, Tian Rui Liu, and Todd J. Treangen. 2019. “Faster Pan-Genome Construction for Efficient Differentiation of Naturally Occurring and Engineered Plasmids with Plaster.” In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.