The Vertebrate Genomes Project
Sequencing all vertebrates, some 66,000 species and maybe more, is now underway. The effort involves 150 researchers at 50 institutions in 12 countries. Here is some information and perspective on the Vertebrate Genomes Project (VGP), including where to find the data.
Scientists in the Genome 10K Consortium have launched the Vertebrate Genomes Project (VGP) to sequence all vertebrates. Some aspects are also described in this Nature Methods technology news feature and in this editorial in Nature Biotechnology.
The assembled genomes are to be near error-free, reference-grade sequences.
Phase I of the Vertebrate Genomes Project is devoted to all 260 orders of vertebrates.
Phase II is about the 1,045 vertebrate families.
Phase III will cover all 9,478 genera.
Phase IV will cover the totality of all vertebrate species.
Here is a mini-documentary about the project. It includes comments by Genome 10K Consortium Chair Erich Jarvis from Rockefeller University and Emma Teeling from University College Dublin.
There is also a conversation about the project from an informatics perspective with David Haussler from the University of California at Santa Cruz, Richard Durbin from the Wellcome Trust Sanger Institute and Gene Myers from the Max Planck Institute of Molecular Cell Biology and Genetics.
This video was taped at the 2018 annual meeting of the Genome 10K Consortium.
And here, as a podcast, is the complete conversation with David Haussler, Richard Durbin and Gene Myers.
At the 2018 Genome 10K Consortium meeting, scientists presented the first 15 genomes of 14 vertebrates: amphibians, birds, fishes, mammals and reptiles.
They are in this video:
The Vertebrate Genomes Project teams are using a number of commercial platforms.
Sequencers from Pacific Biosciences, which Illumina announced plans to acquire in 2018, are being used to generate long reads for Phase I.
Contiguous sequences (contigs) are joined into larger pieces with the 10x Genomics platform.
Bionano's optical mapping approach is used to build those sequences out to a longer scaffold.
Platforms from Arima Genomics, Dovetail Genomics and Phase Genomics generate Hi-C proximity data, which show how near to one another interacting stretches of the genome are.
Where are the data from the Vertebrate Genomes Project?
The data can be found in public archives and on GenomeArk, which is hosted by Amazon Web Services. The data, the public genome databases and genome browsers are there along with computational tools, an arrangement facilitated by the company DNAnexus. Here is a page about the scaffolding workflow.
Erich Jarvis and David Haussler explain the data use policies here. This was taped at the Genome 10K Consortium's 2018 annual meeting. Erich Jarvis speaks first.
A peek at the VGP’s assembly working group
In the assembly working group of the Vertebrate Genomes Project (VGP), Adam Phillippy, a researcher at the National Institutes of Health’s National Human Genome Research Institute (NIH/NHGRI), and Arang Rhie, a postdoctoral fellow in his lab, build and evaluate assembly pipelines and develop new tools. Here, for example, is the VGP's assembly pipeline repository on GitHub.
The team is setting up VGP assembly software 'apps' on the DNAnexus cloud, with publicly available tools and recommended parameters for the community to use. The tools are available via DNAnexus but are still being put together on one page.
The tools that will be there are on the VGP's GitHub repository and information about the scaffolding pipeline can be found here. Up until now, it has not been uncommon in his group to spend six months on assembling one genome, which is too much time for a project on the scale of the VGP, says Phillippy.
For the VGP, Rhie, Phillippy and their colleagues have tested platforms to see which sequencing platform might be best to achieve what they seek: many, near error-free, assembled genomes. That means an error only around every 100,000 bases. The VGP metrics are “a little shy of what the current human reference genome is,” says Phillippy.
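To put the "one error every 100,000 bases" figure in more familiar terms, here is a minimal sketch that converts an error rate into a Phred-scaled quality value (QV), the logarithmic scale commonly used for base and assembly accuracy. The function names and the 1-gigabase genome size are illustrative assumptions, not part of the VGP's tooling.

```python
import math


def phred_qv(errors: int, bases: int) -> float:
    """Phred-scaled quality value for `errors` mistakes in `bases` bases."""
    if errors <= 0 or bases <= 0:
        raise ValueError("errors and bases must be positive")
    return -10 * math.log10(errors / bases)


def expected_errors(genome_size: int, bases_per_error: int) -> float:
    """Number of errors expected in a genome at a given error spacing."""
    return genome_size / bases_per_error


# One error per 100,000 bases, the figure quoted above.
print(round(phred_qv(1, 100_000)))  # 50

# Hypothetical 1-gigabase genome at the same error rate.
print(expected_errors(1_000_000_000, 100_000))  # 10000.0
```

On this scale, each additional 10 points means ten times fewer errors, so an assembly at QV 50 would carry a tenth of the errors of one at QV 40.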
For example, they tried different combinations of sequencers such as instruments from Pacific Biosciences (PacBio) alone, PacBio plus Oxford Nanopore Technologies and Oxford Nanopore alone. For now, they have settled on using PacBio sequencers because of the accuracy they found these technologies could achieve.
They also tested multiple scaffolding platforms, all with a view to maximizing reproducibility and to setting up a standardized pipeline, says Rhie.
Things did not go well when the team aligned an assembled hummingbird scaffold to the existing Anna’s Hummingbird reference, says Rhie. That hummingbird sequence had been completed with Illumina-based paired-end sequencing at 50x coverage, as presented in 2014 in Science.
As it turns out, the genome sequence of the hummingbird was less complete than people had recognized. For the alignment, they moved on to the hummingbird’s next closest relative for which a high-quality assembled genome was available: the chicken.
The zebra finch had previously been sequenced with Sanger-based techniques. And even though this is considered to be the ‘gold standard,’ says Phillippy, “we are seeing more and more that a well done long-read assembly is actually superior in quality.”
They expect many challenges as they assemble the genomes of many animals. Birds have small genomes but amphibian genomes are big: frog genomes are around 10 gigabases and salamander genomes are around 40 gigabases.
The lab of Erich Jarvis at Rockefeller University had been looking at genes associated with vocal learning. But the researchers were just not finding the genes of interest in the zebra finch reference genome, says Phillippy. They realized their work was being hampered by errors in the reference genome. Another issue was that the genome was not haplotype-resolved. For the vertebrate genomes, the assembly working group hopes to achieve haplotype resolution. This will also help to build more haplotype-resolved human genomes, says Phillippy.
Rhie previously developed approaches for phasing human genomes and published de novo assembly and phasing of a Korean genome in Nature in 2016.
Phasing, the separation of genomes into two haplotypes, is especially informative for genomic regions with plenty of heterozygosity, says Rhie. One such region in the human genome is the major histocompatibility complex (MHC). Separating both haplotypes lets one see variants on individual copies, instead of needing to work with a mosaic genome.
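As a toy illustration of why a mosaic sequence can mislead, the sketch below builds a collapsed consensus from two short, made-up haplotypes by switching between them at each heterozygous site. The sequences and function names are hypothetical, and real assemblers do not work this simply; the point is only that the resulting mosaic matches neither of the two copies actually present in the individual.

```python
def heterozygous_sites(hap_a: str, hap_b: str) -> list:
    """Positions where the two haplotypes carry different bases."""
    if len(hap_a) != len(hap_b):
        raise ValueError("toy haplotypes must be the same length")
    return [i for i, (a, b) in enumerate(zip(hap_a, hap_b)) if a != b]


def mosaic_consensus(hap_a: str, hap_b: str) -> str:
    """Collapse two haplotypes into one sequence, switching the source
    haplotype at every heterozygous site (a deliberately naive model)."""
    out = []
    take_a = True
    for a, b in zip(hap_a, hap_b):
        if a != b:
            out.append(a if take_a else b)
            take_a = not take_a
        else:
            out.append(a)
    return "".join(out)


# Made-up 10-base haplotypes that differ at two positions.
hap_a = "ACGTACGTAC"
hap_b = "ACCTACGAAC"

sites = heterozygous_sites(hap_a, hap_b)
mosaic = mosaic_consensus(hap_a, hap_b)

print(sites)                      # [2, 7]
print(mosaic)                     # ACGTACGAAC
print(mosaic in (hap_a, hap_b))   # False
```

With phased haplotypes, the variants at positions 2 and 7 can each be assigned to the copy they sit on; the mosaic pairs one allele from each copy into a combination that exists on neither.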
Currently, says Rhie, they are looking at different levels of heterozygosity in the human genome. “The current pipeline still gives you a better reference genome,” she says. “But it’s not enough for me.” They are working on methods to improve ways of reconstructing “hidden haplotypes” that lie behind regions of variable heterozygosity, areas of both low and high heterozygosity, she says.
An important facet of the VGP is the way early data release is built into the process. “Making the data open like this early on is, to me, a great way to encourage and foster the development of new methods outside of this group,” says Phillippy. “When people see this data set that’s so rich, it will be irresistible to go and develop new methods for comparative genomics, assembly, annotation.”
There are other large-scale genome projects, such as
(Zebra finch image credit at top of page: Sergio Mendoza Hochmann/Moment/Getty)