Toehold switches [1] are versatile engineered riboregulators developed for applications ranging from pathogen-sensing diagnostic tools [2] to components in synthetic gene circuits. These riboregulators transition between two stable states depending on the presence of a target nucleotide sequence. If the target sequence is present, the toehold switch undergoes a secondary structure change that allows expression of a reporter gene, producing a colorimetric or fluorometric readout.
However, developing new toehold switches is often time-consuming and uncertain. Screening and fine-tuning a single switch can take weeks from ideation to final design, and researchers might have to order tens or hundreds of sequences to find one or two with the desired output. We therefore hypothesized that we could couple big data and biological insight to automate the prediction and design of toehold switches more reliably [3]. The first challenge was to generate a dataset large enough to power our predictive models. Our team partnered with Angenent-Mari et al. [4] and used simple bioinformatics techniques for sequence analysis to generate a comprehensive set of potential toeholds, which were then experimentally tested in their lab. With this dataset in hand, both teams worked on two orthogonal projects that share one common thread: using deep learning to characterize toehold switches in silico.
Our team approached this challenge by designing and implementing two different machine learning procedures. On one hand, we used a deep learning architecture based on convolutional neural networks, which borrows a variety of concepts from computer vision and image analysis. On the other hand, we treated the toehold sequences as part of a common DNA/RNA language: based on approaches from natural language processing (NLP), we developed an architecture that uses a quasi-recurrent neural network and a tokenized input sequence to represent k-mers as ‘words’ in the toehold ‘sentence’. Each model offered distinct advantages, such as its own visualization techniques. To understand what both types of models were actually ‘learning’, we tested white-box approaches, in particular attention maps and in silico mutagenesis. These methods yielded biologically relevant insights, for example that the 6 to 9 nucleotides around the ribosome binding site are critical for both models’ predictions.
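To make the k-mer ‘words’ and the in silico mutagenesis idea concrete, here is a minimal Python sketch. It is not the code from our paper: the k-mer size, the stride, the function names, and especially `toy_score` (a GC-fraction placeholder standing in for a trained model) are assumptions chosen only for illustration.

```python
from typing import Callable, List, Tuple

BASES = "ACGU"  # RNA alphabet


def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a sequence into overlapping k-mer 'words' (k and stride are assumed values)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def single_mutants(seq: str) -> List[Tuple[int, str, str]]:
    """Enumerate every single-nucleotide substitution as (position, new_base, mutant)."""
    return [
        (i, base, seq[:i] + base + seq[i + 1:])
        for i, ref in enumerate(seq)
        for base in BASES
        if base != ref
    ]


def mutation_effects(seq: str, score: Callable[[str], float]) -> List[float]:
    """Mean absolute change in score at each position, averaged over its three substitutions."""
    baseline = score(seq)
    effects = [0.0] * len(seq)
    for pos, _, mutant in single_mutants(seq):
        effects[pos] += abs(score(mutant) - baseline) / (len(BASES) - 1)
    return effects


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: the GC fraction of the sequence."""
    return sum(c in "GC" for c in seq) / len(seq)


if __name__ == "__main__":
    switch = "AUGGCAUCGA"  # made-up sequence, not a real toehold
    print(kmer_tokens(switch))
    print(mutation_effects(switch, toy_score))
```

With a real model in place of `toy_score`, positions with large effects would be the ones the model is most sensitive to, which is the kind of signal that can be compared against known features such as the ribosome binding site.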
Following these exciting results, we performed a data ablation experiment, in which we trained each of our models with reduced training data. This experiment allowed us to estimate the minimal number of toehold sequences needed for effective training, as measured by model accuracy. Encouragingly, we found that these architectures were still accurate when trained with an order of magnitude less data! Emboldened by the models’ ability to work with less data, and aiming to improve their generalizability, we used transfer learning techniques to fine-tune model weights on a set of 168 toeholds tested in a different experimental context [1]. We were ecstatic to observe that our language model classified a set of 24 manually designed Zika sensors, as tested by Pardee et al. [2], with 100% accuracy. Because these models showed strong predictive power for the design of novel pathogen sensors, we deployed both in an integrated design pipeline. We built two frameworks to optimize sequences, NuSpeak and STORM: NuSpeak constructs toeholds that retain complementarity to 21 nucleotides of the 30-nucleotide target, while STORM allows all 30 nucleotides to vary simultaneously. We experimentally validated our predictions and were encouraged to find strong agreement between in silico and in vitro results.
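The data ablation experiment can be pictured with a short, hedged Python sketch. The helper below retrains on progressively smaller subsets of the training data and records a test metric for each. The names `ablation_curve`, `fit_mean`, and `mse` are hypothetical, the toy ‘model’ simply predicts the mean training label, and drawing nested subsets from a single shuffle is our simplification here rather than a description of the published protocol.

```python
import random
from typing import Any, Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, float]  # (sequence, measured output)


def ablation_curve(
    train: Sequence[Example],
    test: Sequence[Example],
    fractions: Sequence[float],
    fit: Callable[[List[Example]], Any],
    evaluate: Callable[[Any, Sequence[Example]], float],
    seed: int = 0,
) -> Dict[float, float]:
    """Retrain on shrinking subsets of the training data and score each model on the same test set."""
    shuffled = list(train)
    random.Random(seed).shuffle(shuffled)  # one shuffle, so smaller subsets are nested in larger ones
    results: Dict[float, float] = {}
    for frac in fractions:
        n = max(1, int(round(frac * len(shuffled))))
        model = fit(shuffled[:n])
        results[frac] = evaluate(model, test)
    return results


def fit_mean(examples: List[Example]) -> float:
    """Toy stand-in for model training: remember the mean label."""
    return sum(y for _, y in examples) / len(examples)


def mse(model: float, examples: Sequence[Example]) -> float:
    """Mean squared error of the constant prediction on the given examples."""
    return sum((model - y) ** 2 for _, y in examples) / len(examples)


if __name__ == "__main__":
    train = [(f"seq{i}", float(i)) for i in range(100)]  # synthetic placeholder data
    test = [("held_out", 49.5)]
    for frac, err in ablation_curve(train, test, [0.01, 0.1, 1.0], fit_mean, mse).items():
        print(f"{frac:>5}: {err:.3f}")
```

Swapping `fit` and `evaluate` for real training and scoring routines gives a curve of performance against training-set size, which is how one can read off the point where additional data stops helping.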
We hope our manuscript carries on the important work of the original 2014 paper by Green et al. [1] by automating toehold prediction and optimization, steps that would take weeks in the lab but just a few minutes using our computational framework. We believe the white-box results, deep learning models, and transfer learning approaches presented in our work will be useful and impactful for the synthetic biology community as new and creative tools continue to be developed for toehold sensors.
1. Green, A. A., Silver, P. A., Collins, J. J. & Yin, P. Toehold Switches: De-Novo-Designed Regulators of Gene Expression. Cell 159, 925–939 (2014).
2. Pardee, K. et al. Rapid, Low-Cost Detection of Zika Virus Using Programmable Biomolecular Components. Cell 165, 1255–1266 (2016).
3. Valeri, J. A. et al. Sequence-to-function deep learning frameworks for engineered riboregulators. Nat. Commun. (2020). https://doi.org/10.1038/s41467-020-18676-2
4. Angenent-Mari, N., Garruss, A. & Soenksen, L. A deep learning approach to programmable RNA switches. (2020).