In recent years, advances in sequencing technologies for high-throughput measurement of the transcriptome and epigenome of single cells have brought tremendous opportunities to the research community, expanding our knowledge of neuroscience, developmental biology, immunology and other fields. Each of these modalities, including gene expression, chromatin accessibility and DNA methylation, offers a unique perspective on individual cells. Together, this information allows us to infer cell identities quantitatively, which is a huge step forward compared with qualitative definitions based on morphology, surface proteins and broad functions.
In order to identify cell types and states in an unbiased, comprehensive manner, a variety of single-cell data integration methods have been developed. In 2019, our lab introduced LIGER (linked inference of genomic experimental relationships), an algorithm built upon integrative nonnegative matrix factorization (iNMF, Figure 1). Given a set of single-cell datasets assayed on different cell samples, LIGER flexibly combines them into a shared analysis while retaining the differences between the datasets; the factorization is solved via alternating nonnegative least squares (ANLS). Together with Seurat v3 and Harmony, LIGER was listed as a top performer on batch-effect correction for single-cell datasets.
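For readers who want the underlying optimization problem, the iNMF objective can be written roughly as follows. The notation is an assumption taken from the LIGER paper (Welch et al.), not from this post: $E_i$ is the cells-by-genes expression matrix of dataset $i$, $H_i$ holds its cell factor loadings, $W$ holds the shared metagenes, $V_i$ holds the dataset-specific metagenes, and $\lambda$ is a tuning parameter that penalizes the dataset-specific component.

$$
\min_{H_i \ge 0,\; W \ge 0,\; V_i \ge 0} \;\; \sum_{i} \left\lVert E_i - H_i \left( W + V_i \right) \right\rVert_F^2 \;+\; \lambda \sum_{i} \left\lVert H_i V_i \right\rVert_F^2
$$

ANLS alternates between the blocks $H_i$, $V_i$ and $W$, solving a nonnegative least-squares problem for one block while the others are held fixed.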
However, computational challenges emerge as the available datasets rapidly grow in size (to more than a million cells). We realized that these batch methods do not scale efficiently to such massive datasets, since they require the entire dataset to be held in computer memory throughout the analysis. In addition, none of these methods can incorporate new data without recalculating from scratch.
The key idea of using online learning for data integration was inspired by several concepts from previous papers. In the original LIGER paper, we introduced a method to incorporate new data: we use the previously learned metagene factors to obtain factor loadings for the additional cells as a warm start, then update the factors and factor loadings using all of the data. An additional source of inspiration came from the eXpress method for transcript abundance estimation. Back in 2013, Roberts and Pachter developed an online learning algorithm for streaming assignment of RNA-seq fragments. Another inspiration came from a machine learning paper by Mairal et al., who developed an online dictionary learning algorithm and showed its utility for performing dictionary learning and NMF on large image datasets.
Inspired by these ideas, we decided to improve LIGER by applying online learning to the iNMF problem. The resulting online iNMF algorithm converges faster than the batch methods. Moreover, it does not require all the datasets to be kept in memory while it runs. Instead, it loads fixed-size mini-batches from disk on the fly. Since no more than a single mini-batch of samples is fetched and processed at a time, peak memory usage is decoupled from dataset size, making it possible for researchers to analyze large-scale single-cell datasets even on a laptop.
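To make the mini-batch idea concrete, here is a minimal sketch in Python. It is not the LIGER implementation. It handles a single dataset with plain NMF (no dataset-specific metagenes and no penalty term), and it follows the general style of the online dictionary learning of Mairal et al.: each mini-batch updates two small summary matrices, and the metagenes are refreshed from those summaries. The function names, parameters and synthetic data are all hypothetical.

```python
import numpy as np

def solve_loadings(X, W, n_iter=20, eps=1e-12):
    """Approximate nonnegative least squares for H in X ~ H @ W, with W fixed."""
    G = W @ W.T                      # (k, k) Gram matrix of the metagenes
    C = X @ W.T                      # (cells, k) cross-products
    H = np.zeros((X.shape[0], W.shape[0]))
    for _ in range(n_iter):
        for j in range(W.shape[0]):  # coordinate descent, one factor at a time
            step = (C[:, j] - H @ G[:, j]) / (G[j, j] + eps)
            H[:, j] = np.maximum(0.0, H[:, j] + step)
    return H

def iter_minibatches(X, batch_size):
    """Yield fixed-size row blocks; only one block is materialized at a time."""
    for start in range(0, X.shape[0], batch_size):
        yield np.asarray(X[start:start + batch_size], dtype=float)

def online_nmf(X, k, batch_size=64, n_epochs=3, seed=0, eps=1e-12):
    """Learn k nonnegative metagenes W (k x genes) from mini-batches of X."""
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    W = rng.random((k, n_genes))
    A = np.zeros((k, k))             # running sum of H_b.T @ H_b
    B = np.zeros((k, n_genes))       # running sum of H_b.T @ X_b
    for _ in range(n_epochs):
        for X_b in iter_minibatches(X, batch_size):
            H_b = solve_loadings(X_b, W)
            A += H_b.T @ H_b
            B += H_b.T @ X_b
            for j in range(k):       # refresh each metagene from the summaries
                W[j] = np.maximum(0.0, W[j] + (B[j] - A[j] @ W) / (A[j, j] + eps))
    return W

# Synthetic example: 300 "cells", 40 "genes", 5 underlying factors.
rng = np.random.default_rng(1)
X = rng.random((300, 5)) @ rng.random((5, 40))
W = online_nmf(X, k=5)
H = solve_loadings(X, W)
rel_err = np.linalg.norm(X - H @ W) / np.linalg.norm(X)
```

The summaries `A` and `B` have sizes that depend only on the number of factors and genes, never on the number of cells, which is what keeps peak memory independent of dataset size. In this sketch `X` is an in-memory array, but `iter_minibatches` only needs row slicing, so a disk-backed array such as a `numpy.memmap` could be passed instead.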
Users can apply the online iNMF algorithm to their single-cell datasets in three different scenarios (Figure 2). In Scenario 1, where the datasets are large and fully observed, the algorithm accesses mini-batches from all datasets at the same time and repeatedly updates the metagenes and cell factor loadings over multiple epochs. Peak memory usage can be matched to the available computational resources by manually specifying the mini-batch size. In Scenario 2, where datasets arrive sequentially, the key advantage is that the factorization is efficiently refined as new data arrive, without requiring recalculation each time. In Scenario 3, one can directly project new data into the latent space already learned, without using the new data to update the metagenes.
As a general strategy for Scenario 3, users first use online iNMF to learn metagenes as in Scenario 1 or Scenario 2. Then, they use the shared metagenes to calculate cell factor loadings for a new dataset. Scenario 3 is highly efficient at incorporating data, which allows users to query their data against a curated reference, and it provides increased robustness to dataset differences in newly arriving data. We hope that online iNMF will become increasingly useful for integrating single-cell multi-omic datasets of growing scale from projects such as the BRAIN Initiative, Human Body Map and Human Cell Atlas.
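The projection step in Scenario 3 can be sketched as a nonnegative least-squares problem in which the metagenes are held fixed and only the loadings of the new cells are solved for. The code below is an illustration under simplifying assumptions, not the LIGER API: `project_cells`, `W_ref` and the synthetic data are hypothetical, and the solver is a simple coordinate-descent approximation.

```python
import numpy as np

def project_cells(X_new, W, n_iter=50, eps=1e-12):
    """Compute nonnegative loadings H for new cells so that X_new ~ H @ W.

    W (factors x genes) is treated as a fixed reference and is never modified.
    """
    G = W @ W.T                       # (k, k) Gram matrix of the metagenes
    C = X_new @ W.T                   # (cells, k) cross-products
    H = np.zeros((X_new.shape[0], W.shape[0]))
    for _ in range(n_iter):
        for j in range(W.shape[0]):   # update one factor's loadings at a time
            step = (C[:, j] - H @ G[:, j]) / (G[j, j] + eps)
            H[:, j] = np.maximum(0.0, H[:, j] + step)
    return H

# Hypothetical reference metagenes (4 factors x 30 genes) learned earlier.
rng = np.random.default_rng(0)
W_ref = rng.random((4, 30))
W_before = W_ref.copy()

# Ten new "query" cells generated from the same metagenes.
X_new = rng.random((10, 4)) @ W_ref
H_new = project_cells(X_new, W_ref)

residual = np.linalg.norm(X_new - H_new @ W_ref)
baseline = np.linalg.norm(X_new)      # residual of the all-zero loadings
```

Because the reference metagenes never change, each new cell is placed in the same latent space as the reference cells, and the cost of a query depends only on the size of the new data.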
More details about our work are available in our paper in Nature Biotechnology.
- Welch, J. D. et al. Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell 177, 1873–1887.e17 (2019).
- Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
- Roberts, A. & Pachter, L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10, 71–73 (2013).
- Mairal, J., Bach, F., Ponce, J. & Sapiro, G. Online Learning for Matrix Factorization and Sparse Coding. J. Mach. Learn. Res. 11, 19–60 (2010).