Error tolerant barcodes for single-cell sequencing

Synthesising oligonucleotides from blocks of repeated nucleotides to overcome sequencing errors in error-prone single-cell RNA sequencing.

Error tolerant barcode design

The challenges of short-read single-cell sequencing

The human genome contains a reported 20,352 protein-coding genes,1 which are estimated to encode more than 100,000 different proteins.2 If we include T cell receptor, B cell receptor and antibody diversity, then there are likely to be many millions of unique proteins. The diversity of the human proteome exceeds that of the genome, in part because of alternative splicing and recombination events, which can create many more products from the same gene or from combinations of gene segments, respectively. Current short-read single-cell sequencing methods typically report only the 3' or 5' end of a transcript.3-5 Gene rearrangements and alternative splicing are therefore challenging to detect, which means that we capture only a fraction of the cellular information from which to infer the phenotype of a cell. If we are to understand more about cellular biology, we need more informative single-cell technologies.

Long-read sequencing platforms, such as Oxford Nanopore and PacBio,6, 7 provide true full-length sequencing of mRNA, allowing examination of RNA splicing events, single nucleotide polymorphisms, structural variation, imprinting and chimeric transcripts at the single-cell level. Nanopore sequencing has a relatively high output (50-250 million reads for a PromethION flow cell) compared with PacBio (4 million reads), giving Nanopore the potential to deliver the throughput and economy of droplet-based short-read single-cell sequencing methods. However, Nanopore sequencing has a high basecalling error rate (80-95% accuracy per base),8 which presents a significant challenge for sequencing methods that rely on molecular barcodes, as is the case for single-cell RNA sequencing.

Oxford Nanopore Sequencing is error prone

In droplet-based single-cell sequencing methods, such as Drop-seq, inDrop or 10x Genomics,3-5 cells are co-captured with oligonucleotide-barcoded RNA-capture microbeads in droplets within an oil emulsion. Each droplet becomes a discrete reaction vessel in which two different barcodes are attached to each RNA: a cell barcode for assigning the RNA to an individual cell and a Unique Molecular Identifier (UMI) for removing PCR duplication artefacts. This is followed by pooled library production and short-read sequencing. Accurate barcode assignment is critical for associating each captured mRNA with its cell of origin, and because short-read Illumina sequencing is so accurate, this is a relatively trivial computational task. For Oxford Nanopore sequencing, however, the task becomes challenging and error prone because of the lower basecalling accuracy.
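To give a feel for why per-base accuracy matters so much for barcodes, the short calculation below estimates how often an entire barcode is read without a single error. It is a back-of-the-envelope sketch only: it assumes errors are independent between bases, and the 12-base barcode length is an assumption chosen for illustration rather than a figure from our paper.

```python
def p_error_free(per_base_accuracy: float, length: int) -> float:
    """Probability that every base in a barcode is called correctly.

    Assumes basecalling errors are independent between positions,
    which is a simplification of real Nanopore error profiles.
    """
    return per_base_accuracy ** length


BARCODE_LENGTH = 12  # assumed length, for illustration only

# The two ends of the 80-95% per-base accuracy range quoted in the text.
for accuracy in (0.80, 0.95):
    fraction = p_error_free(accuracy, BARCODE_LENGTH)
    print(f"accuracy {accuracy:.2f}: {fraction:.1%} of barcodes read error-free")
```

Under these assumptions, only around 7% of barcodes would be read perfectly at 80% per-base accuracy, and only a little over half at 95%.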

Overcoming basecalling error rates 

Several approaches have been developed to apply long-read sequencing technologies to single-cell data. The most reliable methods require a single-cell sequencing library to be sequenced first using short-read Illumina sequencing to accurately determine the barcode sequence, followed by Nanopore sequencing.9 We reasoned that we could forgo the need for Illumina sequencing by building up our oligos with repeated nucleotides in the barcode and UMI regions, so that sequencing errors can be pinpointed and corrected without additional short-read data. To simplify synthesis of the barcode and to make repeated bases in the randomly generated UMI possible, we made dimer-nucleoside reverse phosphoramidites, which allowed duplicate bases to be added in a single oligonucleotide synthesis cycle.
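The idea of pinpointing errors with repeated bases can be sketched in a few lines of Python. This is an illustrative toy, not the pipeline used in our paper: it assumes the barcode was synthesised entirely from dimer blocks (for example "AACCGGTT"), it only considers substitution errors and ignores the insertions and deletions that Nanopore reads also contain, and the function names and the whitelist of known barcodes are hypothetical.

```python
from typing import List, Optional, Tuple


def collapse_dimer_blocks(read: str) -> Tuple[str, List[int]]:
    """Collapse a dimer-block barcode read to one base per block.

    A block whose two bases disagree must contain a sequencing error,
    so it is masked with "N" and its block index is reported.
    """
    if len(read) % 2 != 0:
        raise ValueError("dimer-block barcode must have an even length")
    bases, error_blocks = [], []
    for start in range(0, len(read), 2):
        first, second = read[start], read[start + 1]
        if first == second:
            bases.append(first)
        else:
            bases.append("N")
            error_blocks.append(start // 2)
    return "".join(bases), error_blocks


def assign_barcode(collapsed: str, whitelist: List[str]) -> Optional[str]:
    """Return the single whitelist barcode consistent with the read.

    Masked positions ("N") match any base. Returns None if zero or
    several barcodes are consistent, i.e. the read is ambiguous.
    """
    matches = [
        barcode
        for barcode in whitelist
        if len(barcode) == len(collapsed)
        and all(c == "N" or c == b for c, b in zip(collapsed, barcode))
    ]
    return matches[0] if len(matches) == 1 else None


# Hypothetical example: the last block "TA" contains an error.
collapsed, errors = collapse_dimer_blocks("AACCGGTA")
print(collapsed, errors)
print(assign_barcode(collapsed, ["ACGT", "TTGA"]))
```

The point of the design is that a mismatch within a block localises the error to that position, so the remaining, trustworthy positions can still be used to recover the barcode.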

In our recent publication, we introduce this approach as Single-cell COrrected Long-Read sequencing (scCOLOR-seq; pronounced scholar seq).10 We demonstrate its effectiveness using species-mixing experiments to evaluate barcode assignment accuracy, and we examine differential isoform usage and fusion transcripts using myeloma and sarcoma cell line models.

We believe that accurate long-read single-cell sequencing will have a transformational effect on the wider single-cell sequencing community, as longer reads allow users to capture more information about the transcriptional state of a cell. In combination with single-cell copy number variation and mutational analysis, long-read sequencing would have disease diagnostic potential. This will likely be most evident in oncology, where the ability to simultaneously measure fusion transcripts and isoform expression, in addition to calling structural variants, will allow researchers to understand the clonal nature of cancer initiation and development.

Adoption by the wider scientific community

We want to see the wider scientific community benefit from our research. We are therefore currently in discussions with various commercial partners to commercialise the synthesis of scCOLOR-seq beads. Looking ahead, we plan to further improve our technology by synthesising the capture oligonucleotide with blocks of trimer phosphoramidites, which would further improve error correction accuracy.
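One way to see why trimer blocks could help is that three copies of each base allow a simple majority vote, so a single substitution within a block can be corrected rather than merely detected. The sketch below illustrates that reasoning only: the majority-vote rule is our assumption for this example, the function name is hypothetical, and insertions and deletions are again ignored.

```python
from collections import Counter


def correct_trimer_blocks(read: str) -> str:
    """Collapse a trimer-block barcode read by majority vote per block.

    A base seen at least twice in a block wins; a block with three
    different bases cannot be resolved and is masked with "N".
    """
    if len(read) % 3 != 0:
        raise ValueError("trimer-block barcode length must be a multiple of 3")
    corrected = []
    for start in range(0, len(read), 3):
        block = read[start:start + 3]
        base, count = Counter(block).most_common(1)[0]
        corrected.append(base if count >= 2 else "N")
    return "".join(corrected)


# Hypothetical example: the middle block "CCG" holds one substitution.
print(correct_trimer_blocks("AAACCGTTT"))
```

With dimers, a single mismatch tells you that a block is wrong but not which base is correct; with trimers, the same error still leaves two agreeing bases to vote with.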

You can access our Nature Biotechnology paper here.

References 

  1. Pertea, M. et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol 19, 208 (2018).
  2. Savage, N. Proteomics: High-protein research. Nature 527, S6-7 (2015).
  3. Macosko, E.Z. et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161, 1202-1214 (2015).
  4. Klein, A.M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187-1201 (2015).
  5. Zheng, G.X. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 8, 14049 (2017).
  6. Wenger, A.M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155-1162 (2019).
  7. Weirather, J.L. et al. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Res 6, 100 (2017).
  8. Rang, F.J., Kloosterman, W.P. & de Ridder, J. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol 19, 90 (2018).
  9. Lebrigand, K., Magnone, V., Barbry, P. & Waldmann, R. High throughput error corrected Nanopore single cell transcriptome sequencing. Nat Commun 11, 4025 (2020).
  10. Philpott, M. et al. Nanopore sequencing of single-cell transcriptomes with scCOLOR-seq. Nat Biotechnol (2021).

Adam Cribbs

Group leader in Systems Biology, University of Oxford