Three years ago, Simon Rasmussen and I were working in the metagenomics group at the Technical University of Denmark. Back then, we were particularly interested in exploring the microbial communities of relatively unexplored environments: given high-throughput whole-genome sequencing data, what microorganisms could be found, and what would that imply about the environment?
An important step in our workflow was grouping assembled metagenomic contigs by their genome of origin to obtain crude draft genomes - a process known as binning. Simon had some experience developing binning tools, as he had contributed to the 2014 Nielsen et al. binner known as "Canopy". When he discovered the concept of variational autoencoders (VAEs), an idea formed. See, deep learning (DL) was all the rage, and no one had, at least to our knowledge, applied DL to binning. Binning is a kind of clustering problem, the archetypal unsupervised learning problem, and one for which VAEs had shown great promise in other studies. Just as earlier work had shown for other kinds of data, perhaps autoencoding metagenomic sequence data could improve the clustering, and hence improve binning? He wrote a quick proof-of-concept and, surprisingly, achieved promising results almost immediately. He then launched the project, involving me and the other co-authors. At that time, neither Simon nor the rest of us really understood why VAEs worked so well for binning. As is perhaps not atypical in science, we only began to understand it after having worked on it for some time.
Looking back, it's remarkable that it took only a few days to get a performant prototype, but three years and several rewrites to make the code reliable enough, and the findings persuasive and presentable enough, for publication.
Deep learning seems to work well for problems where you are trying to build a statistical model, but can't articulate the model explicitly. In our case, from other people's work on binning tools, we knew that contigs from the same genome have similar abundances across samples (so-called co-abundance) and similar k-mer distributions. These two measures can be represented as numerical vectors. However, it's unclear how these two measures relate to each other, or to the probability that two contigs belong to the same genome. This is an ideal situation for deep learning. And so, we developed a new deep learning-based binning tool, Vamb.
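To make "represented as numerical vectors" a little more concrete, here is a minimal Python sketch of how a single contig could be turned into a co-abundance vector and a k-mer frequency vector. This is an illustration only, not Vamb's actual feature pipeline: the function names, the choice of k = 4, the simple sum-to-one normalisation and the toy input values are all assumptions made for the example, and it ignores details such as reverse complements.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Relative frequency of every DNA k-mer in `sequence`, in a fixed order.

    Hypothetical helper for illustration; reverse complements are not merged.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows containing N or other ambiguous bases
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def abundance_profile(depths):
    """Scale per-sample read depths so they sum to 1 (all zeros if no coverage)."""
    total = sum(depths)
    return [depth / total if total else 0.0 for depth in depths]


def contig_features(sequence, depths, k=4):
    """Concatenate the two vectors into one feature vector for a contig."""
    return abundance_profile(depths) + kmer_frequencies(sequence, k)


# Toy contig with made-up read depths in three samples.
features = contig_features("ACGTACGTTTGACCA", [12.0, 0.5, 7.5])
print(len(features))  # 3 abundance values + 4**4 = 256 k-mer frequencies = 259
```

The sketch only shows that each contig becomes a fixed-length vector. How the two parts should be weighted against each other, and how distances between such vectors relate to genome membership, is exactly the part that is hard to specify by hand.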
In our article, we show that Vamb yields better bins than the other reference-free binners we have tested. In hindsight, however, it's clear that the underlying model and feature engineering of Vamb could still be improved. For example, the assumptions inherent in VAEs do not align well with what we already know about microbial communities - and are not very amenable to clustering in general. While frustrating for those of us who have spent three years developing Vamb, the fact that there are probably still significant gains left on the table opens the door to the development of exciting new DL-based binning methods, and hints at a bright future for deep learning in this domain.