We propose SemiBin (https://github.com/BigDataBiology/SemiBin), which uses a semi-supervised deep siamese neural network to exploit information from reference genomes for metagenomic binning.
Binning is the computational task of grouping assembled contigs from a metagenome in an attempt to reconstruct the genomes present in the original sample.
Our initial idea for SemiBin was inspired by SolidBin [1], which uses semi-supervised learning to perform the binning task. SolidBin proposed using “must-link” and “cannot-link” constraints between contigs: contigs are first annotated against reference genomes; contigs annotated to the same species receive a must-link constraint, while contigs annotated to different species or genera receive a cannot-link constraint. A semi-supervised Ncut algorithm [2] then incorporates these constraints into the clustering. After reading this, we felt that exploiting prior information from reference genomes for binning was a very interesting idea. However, we realized SolidBin’s approach to generating must-link and cannot-link constraints could be improved: relying on annotation can lead to errors (especially for the must-link constraints) and to sampling bias (must-link constraints cover only a small part of the genomes).
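The cannot-link rule above can be sketched as follows. This is a minimal illustration, not SemiBin's or SolidBin's actual code; the function name and data structures are ours:

```python
# Sketch (hypothetical helper, not from SemiBin): two contigs annotated
# to different species, or to different genera, cannot come from the
# same genome, so they receive a cannot-link constraint.
from itertools import combinations

def cannot_link_constraints(annotations):
    """annotations: dict mapping contig name -> (genus, species).
    species may be None when only genus-level annotation is available."""
    constraints = []
    for (c1, (g1, s1)), (c2, (g2, s2)) in combinations(annotations.items(), 2):
        # Different genera, or same genus but clearly different species.
        if g1 != g2 or (s1 is not None and s2 is not None and s1 != s2):
            constraints.append((c1, c2))
    return constraints

anno = {
    "contig_1": ("Escherichia", "coli"),
    "contig_2": ("Escherichia", "coli"),
    "contig_3": ("Bacteroides", "fragilis"),
}
print(cannot_link_constraints(anno))
# contig_1 and contig_2 share a species, so no constraint links them
```

Note that the same-species pair gets no edge at all here; under SolidBin's scheme it would get a must-link constraint, which is exactly the part we found unreliable.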
In our experiments on simulated datasets, cannot-link constraints derived from annotations were accurate enough to use, but the must-link constraints were not. We therefore borrowed an idea from self-supervised learning and generated must-link constraints artificially, by breaking up long contigs.
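The trick is that two fragments cut from the same contig are guaranteed to come from the same genome, so the resulting must-link pairs are error-free by construction. A minimal sketch (the length threshold and the two-way split are our assumptions for illustration, not SemiBin's exact parameters):

```python
# Sketch (not SemiBin's actual implementation): split each sufficiently
# long contig in half; the two halves trivially belong to the same
# genome, yielding an error-free must-link constraint.
def generate_must_link(contigs, min_len=4000):
    """contigs: dict mapping contig name -> sequence.
    Returns (fragments, must_link_pairs)."""
    fragments = {}
    must_link = []
    for name, seq in contigs.items():
        if len(seq) >= min_len:
            mid = len(seq) // 2
            fragments[name + "_1"] = seq[:mid]
            fragments[name + "_2"] = seq[mid:]
            must_link.append((name + "_1", name + "_2"))
        else:
            # Short contigs are kept whole; no constraint is generated.
            fragments[name] = seq
    return fragments, must_link

contigs = {"cA": "ACGT" * 2000, "cB": "ACGT" * 100}
frags, pairs = generate_must_link(contigs)
print(pairs)  # [('cA_1', 'cA_2')]
```

Unlike annotation-based must-link constraints, these pairs can be generated anywhere along the assembly, which also removes the sampling bias mentioned above.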
We also improved how the constraints are handled: instead of the Ncut algorithm, we drew inspiration from contrastive learning (currently a hot topic in computer vision and natural language processing) and used a siamese neural network [3].
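The core of this idea is a contrastive objective: the siamese network embeds two contigs with shared weights, and the loss pulls must-link pairs together while pushing cannot-link pairs at least a margin apart. A minimal sketch of such a loss in plain Python (this is the classic contrastive loss, shown for illustration; it is not SemiBin's exact objective):

```python
import math

def contrastive_loss(emb_a, emb_b, must_link, margin=1.0):
    """Contrastive objective for a pair of embeddings (sketch only).
    must_link pairs are penalised for any distance; cannot-link pairs
    are penalised only when closer than `margin`."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(emb_a, emb_b)))
    if must_link:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Identical embeddings: ideal for a must-link pair, worst case for cannot-link.
print(contrastive_loss([0.1, 0.2], [0.1, 0.2], must_link=True))   # 0.0
print(contrastive_loss([0.1, 0.2], [0.1, 0.2], must_link=False))  # 1.0
```

Training the shared encoder to minimise this loss over all constraint pairs yields an embedding space in which standard clustering can recover the bins.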
We implemented this idea and benchmarked it on simulated and real datasets. The results showed that SemiBin outperformed other binners; on real datasets, SemiBin significantly outperformed the second-best binner.
SemiBin delivered significant improvements, but another question arose: contig annotation and model training take a lot of time, which makes it hard to run SemiBin on large-scale datasets. By default, SemiBin trains the model on one sample and then uses the learnt embeddings for clustering. An obvious first idea was to skip these steps, or to transfer a model trained on one sample to other samples. When we tried this, it worked to some extent, but the results were not as good as the default version. Inspired by the superior results of large pretrained models in computer vision and natural language processing, our next question was: would training on more samples give better results? The answer is yes! Training on more samples gave better results, even outperforming the default version. So we built SemiBin(pretrain), which achieves better binning results and can be used in large-scale metagenomic analyses. Thanks to an anonymous reviewer's comment, we decided to provide pretrained models for more environments. In the end, we provide ten pretrained models: human gut, dog gut, ocean, soil, cat gut, human oral, mouse gut, pig gut, built environment, and wastewater.
In summary, we propose SemiBin, a metagenomic binning tool, and demonstrated that it can reconstruct more high-quality bins than existing binners.
Read the paper here: https://doi.org/10.1038/s41467-022-29843-y
[1] Wang, Z., Wang, Z., Lu, Y. Y., Sun, F., & Zhu, S. (2019). SolidBin: improving metagenome binning with semi-supervised normalized cut. Bioinformatics, 35(21), 4229-4238.
[2] Gu, J., Feng, W., Zeng, J., Mamitsuka, H., & Zhu, S. (2012). Efficient semisupervised MEDLINE document clustering with MeSH-semantic and global-content constraints. IEEE Transactions on Cybernetics, 43(4), 1265-1276.
[3] Śmieja, M., Struski, Ł., & Figueiredo, M. A. (2020). A classification-based approach to semi-supervised clustering with pairwise constraints. Neural Networks, 127, 193-203.