Teaching machines to design cells for us
Using machine learning to help us predict (and design) metabolic pathways for the production of biofuels.
The paper in npj Systems Biology & Applications is here: https://www.nature.com/articles/s41540-018-0054-3
It really is a pain to engineer biological systems. The problem is not so much changing the instructions of the cell (its DNA): new CRISPR-based systems are making it easier and easier to make very specific changes, and DNA synthesis is getting exponentially cheaper by the day. The problem is that we cannot predict how those DNA modifications will affect the cell’s behavior. The result is that most metabolic engineering and synthetic biology consists of an Edisonian process of trial and error, where it can take on the order of $150 million and ten years to bring a new molecule to market.
Zak and I got together (with Marcella Gomez) in my office to try to make a dent in this problem of making biology predictive, which we also consider to be one of the fundamental problems of 21st-century biology. Our combined experience working on producing renewable biofuels for a decade convinced us that the DNA-to-phenotype problem would be too hard to tackle directly with the available data. So we needed to divide the problem into smaller problems and solve those first (as my advisor always said: for any problem you can’t solve, there is an easier problem you can; solve that first!).
We decided to focus on trying to predict metabolomics data from proteomics data, since that covers many of the usual metabolic engineering needs. If we can predict the dynamics of the final product as a function of time, we can design pathways with desired Titers, Rates and Yields (the TRY trilogy that is key to marketability). Also, the Joint BioEnergy Institute (JBEI), which we are part of, had produced some of the best proteomics and metabolomics data available, and we were keenly aware of that. Our friends Jorge Alonso Gutierrez and Kevin George had suffered through sleepless nights and meticulous sampling in creating those data sets.
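To make the link between product dynamics and TRY concrete, here is a minimal sketch of how the three numbers could be read off a time series. The function name and the example values are hypothetical, and it assumes the usual definitions: titer as the final product concentration, rate as average productivity over the run, and yield as product formed per unit of substrate consumed.

```python
def titer_rate_yield(times, product, substrate):
    """Compute titer, rate and yield from one fermentation time series.

    times     -- sampling times (e.g. hours)
    product   -- product concentration at each time (e.g. g/L)
    substrate -- substrate concentration at each time (e.g. g/L)
    """
    produced = product[-1] - product[0]
    consumed = substrate[0] - substrate[-1]
    if consumed <= 0:
        raise ValueError("no substrate was consumed; yield is undefined")
    titer = product[-1]                          # final concentration
    rate = produced / (times[-1] - times[0])     # average productivity
    yield_ = produced / consumed                 # product per substrate used
    return titer, rate, yield_


# Made-up numbers, purely for illustration.
titer, rate, yield_ = titer_rate_yield(
    times=[0.0, 10.0], product=[0.0, 2.0], substrate=[20.0, 10.0]
)
print(titer, rate, yield_)
```

A model that predicts the product curve over time gives all three quantities at once, which is why predicting dynamics is more useful than predicting a single end point.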
Kinetic models were really our first choice to predict pathway dynamics, but they did not work well. We had one of the most comprehensive data sets with measurements of protein and metabolite concentrations as a function of time, but no matter how unconstrained our kinetic model was, it was unable to reproduce the observed data. It was hard to pinpoint where the error was: the data? the model? unaccounted-for assumptions?
I finally asked Zak to go into full troubleshooting mode: calculate the derivatives from the experimental data and see if the Michaelis-Menten functions can really produce those derivatives from the protein and metabolite data. Michaelis-Menten definitely failed a lot. Zak went on to ditch Michaelis-Menten completely and start using the machine learning approaches that I had asked him to use on other problems where mechanistic models were not available.
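The troubleshooting step described above can be sketched roughly as follows: estimate the time derivative of a metabolite from its measured time series by finite differences, then compare it with the rate a Michaelis-Menten term would predict from the enzyme and substrate concentrations. This is only an illustration, not the code used in the paper. The data, the `kcat` and `km` values and the function names are all made up, and it assumes a single reaction whose product is not consumed downstream.

```python
def finite_difference(times, values):
    """Estimate d(values)/dt at each time point.

    Uses a central difference at interior points and a one-sided
    difference at the two ends. With unevenly spaced samples the
    interior estimate is only a rough approximation.
    """
    n = len(times)
    derivatives = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        derivatives.append((values[hi] - values[lo]) / (times[hi] - times[lo]))
    return derivatives


def michaelis_menten_rate(enzyme, substrate, kcat, km):
    """Reaction rate predicted by the Michaelis-Menten rate law."""
    return kcat * enzyme * substrate / (km + substrate)


# Hypothetical time series (arbitrary units), for illustration only.
times = [0.0, 2.0, 4.0, 6.0, 8.0]
enzyme = [0.1, 0.4, 0.8, 0.9, 0.9]
substrate = [2.0, 1.6, 1.0, 0.5, 0.2]
product = [0.0, 0.3, 1.1, 1.9, 2.4]

observed = finite_difference(times, product)
predicted = [
    michaelis_menten_rate(e, s, kcat=0.5, km=0.5)  # assumed parameters
    for e, s in zip(enzyme, substrate)
]

# Root-mean-square mismatch between observed and predicted derivatives.
residuals = [o - p for o, p in zip(observed, predicted)]
rmse = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
print("RMSE between observed and predicted rates:", rmse)
```

If no choice of parameters brings the predicted rates close to the observed derivatives, the functional form itself is suspect. That is the point at which replacing the rate law with a function learned from the data becomes attractive.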
I did not think it was going to work, but it did! We only had three time series for each biofuel and we needed to use one for testing, so it did not look like there was enough data to properly train machine learning models. The lack of data is the bane of our existence when applying machine learning methods. But Zak came up with a very clever way to use very obvious assumptions (like data continuity) to produce more data (this is called data augmentation in the ML field). The fits were not fantastic, but they were great for the small amount of data we had and, more importantly, the method showed great improvements as more (simulated) data was added.
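As a rough idea of what continuity-based data augmentation can look like, the sketch below fills in extra points between consecutive measurements by linear interpolation. It is a deliberately simple stand-in and not necessarily the scheme used in the paper; the function name and the sample values are hypothetical.

```python
def interpolate_series(times, values, points_per_interval):
    """Create extra training points by linear interpolation.

    Assumes the measured quantity varies continuously between samples,
    so intermediate values can be approximated by a straight line
    between each pair of consecutive measurements.
    """
    new_times, new_values = [], []
    for i in range(len(times) - 1):
        for k in range(points_per_interval):
            frac = k / points_per_interval
            new_times.append(times[i] + frac * (times[i + 1] - times[i]))
            new_values.append(values[i] + frac * (values[i + 1] - values[i]))
    # Keep the last measured point as well.
    new_times.append(times[-1])
    new_values.append(values[-1])
    return new_times, new_values


# Five made-up measurements become seventeen interpolated points.
times = [0.0, 2.0, 4.0, 6.0, 8.0]
product = [0.0, 0.3, 1.1, 1.9, 2.4]
dense_times, dense_product = interpolate_series(times, product, 4)
print(len(times), "->", len(dense_times))
```

The interpolated points carry no new experimental information. They only encode the assumption that concentrations change smoothly, which is why more real data is still needed for quantitative predictions.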
We need more experimental data in the future to make this method work quantitatively, and we can only get this through automation. We are very excited about the predictive capabilities this approach can bring to the field, and have recently acquired a liquid handler to take samples automatically from our automated fermentation platform. We hope to produce a continuous stream of samples using this arrangement. There are plenty of hurdles to overcome, but we believe the national labs are the best (public) place to bring to fruition the combination of synthetic biology, machine learning and automation that we believe will change biology radically.
A more predictive biology will have a huge impact on society, beyond biofuels and metabolic engineering. Producing renewable biofuels and pharmaceutical products is great, but the success of these initiatives has limited the vision of many of the researchers involved in the field. A fully predictive biology will allow us to design microbial communities to treat human diseases, model human metabolism to (finally) scientifically assess the impact of diet on human physiology and banish fad diets, study the appearance of life and consciousness as emergent properties, terraform other planets, and pursue many other applications that nowadays sound like science fiction. And that effort will bring biology into a new stage of development, where it will fulfill its full potential for humankind’s benefit.
(Photo and video credit: Marilyn Chung, Berkeley Lab)