The Vertebrate Genomes Project

The Vertebrate Genomes Project: What it takes to get it right. Sequencing all the vertebrates on Earth is a big job that requires many techniques. 150 scientists at 50 institutions in 12 countries are involved. More can join. And: labs do not need to be VGP members to use VGP data.
Published in Protocols & Methods

With colleagues around the world, scientists in the Genome 10K Consortium have launched the Vertebrate Genomes Project (VGP) to sequence all vertebrates.

On this page are text, animations, podcasts and video about the VGP. Some aspects are also described in this Nature Methods technology news feature and in this editorial in Nature Biotechnology.

The Genome 10K Consortium plans to sequence all vertebrates on the planet. That's 66,000 species. There may be even more species of vertebrates. The assembled genomes are to be near error-free, reference-grade sequences. 

Phase I of the Vertebrate Genomes Project is devoted to all 260 orders of vertebrates.

Phase II is about the 1,045 vertebrate families.

Phase III will cover all 9,478 genera.

Phase IV will cover the totality of all vertebrate species.

*

Here is a mini-documentary about the project. It includes comments by Genome 10K Consortium Chair Erich Jarvis from Rockefeller University and Emma Teeling from University College Dublin. 

There is also a conversation about the project from an informatics perspective with David Haussler from the University of California at Santa Cruz, Richard Durbin from the Wellcome Trust Sanger Institute and Gene Myers from the Max Planck Institute of Molecular Cell Biology and Genetics.

This video was taped at the 2018 annual meeting of the Genome 10K Consortium. Further below, on this page, you will find a transcript of this video. 


And here, as a podcast, is the complete conversation with David Haussler (left), Richard Durbin (middle) and Gene Myers (right). A transcript of this podcast is below, on this page. 



At the 2018 Genome 10K Consortium meeting, scientists presented the first 15 genomes of 14 vertebrates: amphibians, birds, fishes, mammals and reptiles.

They are in this video and listed below:

An amphibian
Two-lined caecilian (Rhinatrema bivittatum)

These birds
Kakapo (Strigops habroptilus)
Zebra finch (Taeniopygia guttata)
Anna's hummingbird (Calypte anna)

This reptile
Goode's desert tortoise (Gopherus evgoodei)

These fishes
Flier cichlid (Archocentrus centrarchus)
Eastern happy (Astatotilapia calliptera)
Climbing perch (Anabas testudineus)
Tire track eel (Mastacembelus armatus)

These mammals
Greater horseshoe bat (Rhinolophus ferrumequinum)
Pale spear-nosed bat (Phyllostomus discolor)
Canada lynx (Lynx canadensis)
Duck-billed platypus (Ornithorhynchus anatinus)



Here is some information based on a conversation with Arang Rhie and Adam Phillippy of the National Human Genome Research Institute about assembly techniques and pipelines. They are part of the VGP's assembly working group.

The research community has been supportive of the project, says Phillippy. The next milestone is the next batch of genomes.

Phillippy sums up the research community's reaction: "Ok, we agree that this will be good. Come back when you have your 266 reference genomes."


They developed an approach called 'trio binning,' a way to use parental and offspring genomes to discern the differences between haplotypes. It lets researchers reconstruct genomes at higher resolution and work across species. Some regions of the genome are quite diverse; these are regions of high heterozygosity.

Arang Rhie: "The current pipeline still gives you a better reference genome. But it's not enough for me. [laughs] So we are trying to improve the ways to reconstruct the hidden haplotypes that are lying behind this high heterozygosity. One of those approaches was our trio binning."
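The gist of trio binning can be sketched in a few lines of Python. This is a toy illustration, not the VGP's actual implementation; the function names and k-mer size are made up for the example. K-mers that occur only in one parent's reads are used to assign each of the offspring's long reads to a haplotype bin before assembly:

```python
def kmers(seq, k=21):
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bin_read(read, maternal_only, paternal_only, k=21):
    """Assign a long read to the haplotype whose parent-unique
    k-mers it shares the most of; ties stay unassigned."""
    ks = kmers(read, k)
    m = len(ks & maternal_only)
    p = len(ks & paternal_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # e.g. a homozygous region with no informative k-mers
```

Binned reads can then be assembled separately, yielding one assembly per haplotype rather than a mosaic of the two.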



Arang Rhie and Adam Phillippy
NIH/NHGRI researchers Arang Rhie and Adam Phillippy are part of the
VGP's genome assembly working group. They develop pipelines for the
VGP and the wider research community. 

A peek at the VGP’s assembly working group 

In the assembly working group of the Vertebrate Genomes Project (VGP), Adam Phillippy, a researcher at the National Institutes of Health's National Human Genome Research Institute (NIH/NHGRI), and Arang Rhie, a postdoctoral fellow in his lab, build and evaluate assembly pipelines and develop new tools. Here, for example, is the VGP's assembly pipeline repository on GitHub.

The team is setting up VGP assembly software 'apps' on the DNAnexus cloud: publicly available tools for the community to use, along with recommended parameters. The tools are available via DNAnexus but have not yet been gathered on a single page.


The tools that will be there are in the VGP's GitHub repository, and information about the scaffolding pipeline can be found here. Until now, it has not been uncommon in his group to spend six months assembling one genome, which is too much time for a project on the scale of the VGP, says Phillippy.

For the VGP, Rhie, Phillippy and their colleagues have tested platforms to see which sequencing platform might best achieve what they seek: many near error-free assembled genomes. That means roughly one error in every 100,000 bases. The VGP metrics are "a little shy of what the current human reference genome is," says Phillippy.
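Such an accuracy target is often expressed as a Phred-scaled quality value (QV). As a quick sketch of the arithmetic (the function name here is illustrative):

```python
import math

def assembly_qv(errors, bases):
    """Phred-scaled consensus accuracy: QV = -10 * log10(error rate)."""
    return -10 * math.log10(errors / bases)

# One error in every 100,000 bases corresponds to roughly QV 50,
# the ballpark the VGP is aiming for.
target = assembly_qv(1, 100_000)
```

By comparison, an assembly with one error per 1,000 bases would sit at about QV 30.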

For example, they tried different combinations of sequencers such as instruments from Pacific Biosciences (PacBio) alone, PacBio plus Oxford Nanopore Technologies and Oxford Nanopore alone. For now, they have settled on using PacBio sequencers because of the accuracy they found these technologies could achieve. 

They also tested multiple scaffolding platforms, all with a view to maximizing reproducibility and to enabling a standardized pipeline, says Rhie.

The team aligned an assembled hummingbird scaffold to the existing Anna's hummingbird reference, but things didn't go too well, says Rhie. That hummingbird sequence had been completed with Illumina-based paired-end sequencing at 50x coverage, as presented in 2014 in Science.

As it turns out, the genome sequence of the hummingbird was less complete than people had recognized. For the alignment, they moved on to the hummingbird’s next closest relative for which a high-quality assembled genome was available: the chicken genome. 

The zebra finch had previously been sequenced with Sanger-based techniques. And even though this is considered to be the ‘gold standard,’ says Phillippy, “we are seeing more and more that a well done long-read assembly is actually superior in quality.” 

They expect many challenges as they assemble the genomes of many animals. Birds have small genomes but amphibian genomes are big: frog genomes are around 10 gigabases and salamander genomes are around 40 gigabases. 

The lab of Erich Jarvis at Rockefeller University had been looking at genes associated with vocal learning. But the researchers were just not finding the genes of interest in the zebra finch reference genome, says Phillippy. They realized their work was being hampered by errors in the reference genome. Another issue was that the genome was not haplotype-resolved. For the vertebrate genomes, the assembly working group hopes to achieve haplotype resolution. This will also help to build more haplotype-resolved human genomes, says Phillippy. 

Rhie previously developed approaches for phasing human genomes and published de novo assembly and phasing of a Korean genome in Nature in 2016.

Phasing, the separation of genomes into two haplotypes, is especially informative for genomic regions with plenty of heterozygosity, says Rhie. One such region in the human genome is the major histocompatibility complex (MHC). Separating both haplotypes lets one see variants on individual copies, instead of needing to work with a mosaic genome. 
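A schematic way to see what phasing buys (a toy illustration, not a real phasing algorithm): each site carries its two alleles, and separating them yields two haplotype sequences instead of one mosaic consensus:

```python
def split_haplotypes(sites):
    """sites: list of (allele_hap1, allele_hap2) tuples, one per position;
    homozygous sites carry the same base twice."""
    hap1 = "".join(a for a, b in sites)
    hap2 = "".join(b for a, b in sites)
    return hap1, hap2

# The heterozygous middle site ends up with a different base on each
# haplotype rather than an ambiguous consensus call.
sites = [("A", "A"), ("C", "G"), ("T", "T")]
```

In a phased assembly, variants in a region like the MHC can then be read off each copy individually.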

Currently, says Rhie, they are looking at different levels of heterozygosity in the human genome. “The current pipeline still gives you a better reference genome,” she says. “But it’s not enough for me.”  They are working on methods to improve ways of reconstructing “hidden haplotypes” that lie behind regions of variable heterozygosity, areas of both low and high heterozygosity, she says. 

An important facet of the VGP is the way early data release is built into the process. "Making the data open like this early on is, to me, a great way to encourage and foster the development of new methods outside of this group," says Phillippy. "When people see this data set that's so rich, it will be irresistible to go and develop new methods for comparative genomics, assembly, annotation."

*

The Vertebrate Genomes Project teams are using a number of commercial platforms.

Sequencers from Pacific Biosciences (PacBio), which Illumina has announced plans to acquire, are being used to generate long reads for Phase I.

Contiguous sequences are joined into larger pieces with the 10x Genomics platform.

Bionano's optical mapping approach is used to build those sequences out to a longer scaffold.

Platforms from Arima Genomics, Dovetail Genomics and Phase Genomics are used to generate Hi-C proximity data, which reveal how near to one another interacting stretches of the genome are.
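One way to picture how such proximity data are used is a deliberately simplified sketch (real Hi-C scaffolders are far more involved): join contigs greedily by their strongest remaining contact link, keeping each contig inside at most one chain:

```python
def greedy_scaffold(contigs, links):
    """links: {(contig_a, contig_b): contact_count}. Returns the joins
    made, strongest link first, building simple paths of contigs."""
    parent = {c: c for c in contigs}   # union-find to avoid cycles
    degree = {c: 0 for c in contigs}   # at most two neighbors per contig

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    joins = []
    for (a, b), _count in sorted(links.items(), key=lambda kv: -kv[1]):
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            parent[find(a)] = find(b)
            degree[a] += 1
            degree[b] += 1
            joins.append((a, b))
    return joins
```

With links {("c1", "c2"): 100, ("c2", "c3"): 80, ("c1", "c3"): 5}, the weak c1-to-c3 link is discarded because joining it would close a cycle.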

*

Where are the data from the Vertebrate Genomes Project?

The data can be found in public archives and on GenomeArk, which is hosted by Amazon Web Services. The data, public genome databases and genome browsers are there, along with computational tools, an arrangement facilitated by the company DNAnexus. Here is a page about the scaffolding workflow.

-

Erich Jarvis and David Haussler explain the data use policies here. This was taped at the 2018 Genome 10K Consortium's annual meeting. Erich Jarvis speaks first.


Transcript:

Erich Jarvis: We had a lot of discussion, a lot of vetting, of the data use policy for these genomes. You can imagine, with all of the effort the folks up here at the table, some of the people in the audience and others not here have put into generating this data, then making it publicly available, and then somebody else scooping them. So we had to avoid that.

At the same time, a lot of my colleagues are saying: Erich, please share this new zebra finch genome that you have, because the one we're using, we're not happy with. And so the data-use policy basically is: anybody can use these genomes, in the scientific community or otherwise, to publish papers on one gene across species or 5 genes within a species. And they have to notify me as the chair.

For any kind of comparative genomics studies across species, before we publish a major paper or issues on the 260 species, which we expect in roughly two years, that's what we hope, they have to contact me or become part of the consortium and collaborate with us.

We think this is a fair policy; we've had it with different journals, we've had it with annotation centers, the public ones and so forth. And we'll see how it works. This has worked for the Avian Phylogenomics Project, I think it's going to work for us.

David Haussler: Just want to clarify, the data will be available openly, immediately.

Erich Jarvis: The data is available today. 

--

There are other large-scale genome projects, such as

Bat 1K

Birds 10k Genomes Project

Earth BioGenome Project

The NHGRI Dog Genome Project

and more...


(Zebra finch image credit at top of page: Sergio Mendoza Hochmann/Moment/Getty)




-

You will find a transcript of the podcast and video below. If you are able to do so, please listen to the audio or watch the video, because spoken word is distinct. Please note that the transcript might not be identical to the spoken word in all instances. If you would like to quote from the podcast or video, please credit the source and please refer to the audio or video.   


Transcripts

Transcript of

Why sequence all vertebrates on Earth?

Which informatics challenges await?

A mini-documentary by Vivien Marx

Observations at the Genome 10k Annual Conference

with Erich Jarvis, Rockefeller University

Emma Teeling, University College Dublin

David Haussler, UCSC

Gene Myers, Max-Planck Institute of Molecular Cell Biology and Genetics

Richard Durbin, Wellcome Trust Sanger Institute

[0'25”] - Erich Jarvis

What I'm announcing here is the Vertebrate Genomes Project, organized by the Genome 10K Consortium.

[0'30”] - Emma Teeling

I just want to set the scene a little bit, to think about why today is so extraordinary. This year marks 75 years since Erwin Schrödinger wrote his book What is life? And in that little book he mused about what the raw material of inheritance was. What potentially was DNA? As a physicist who moved into biology, he reasoned there had to be some form of aperiodic crystal. And there had to be a way to maintain this crystal. But they didn't know how it was replicated.

And this little book spawned a lot of the ideas of Watson and Crick, Rosalind Franklin, Maurice Wilkins, and the scientific consciousness moved towards discovering what DNA actually was. But that was only 65 years ago.

And there are people alive today here in this audience who were born earlier than that, I'm sure, not me of course. But think about it: 65 years ago, they didn't know really: was it the nucleotides, was it a protein. What was the raw material of inheritance? And now today we're able to sequence the entire genome from multiple different species like this weird caecilian, I know Mark Wilkinson won't agree with me.

But we're now able to go and look at all of life. We're able to uncover the raw material of inheritance, and we're able to then use that information to make trees of life. And now with these new exquisite genomes, the 15 we're talking about today, we're able to uncover maybe what the genetic basis is of the rare, weird adaptations that these extraordinary animals actually have.

And so this is a landmark event, because in 65 years we've come along so much. So in 2009, when Steve asked me to come to Santa Cruz to talk about the idea of Genome 10K, it really was quite outrageous, because we didn't have the methodology yet, we didn't have the DNA, we didn't know how we were gonna sequence it. But Illumina had come on the scene.

And now it was possible, maybe, to sequence genomes. Within a decade we realized that we were going to have to get much better types of DNA. We had to try and work out how we could assemble and move things forward.

[2'30”] And as a group we brought together mathematicians, geneticists, conservation biologists, zoologists, people who actually know these animals, where they exist, what's unique about their lifestyle, and who also had access to DNA. And really, G10k spawned the VGP, spawned Bat1k, spawned the Earth BioGenome Project that Harris is going to talk about. Today we're now able to uncover really the blueprint of all different species. We're able to look at each different base pair and potentially assign it to some type of functional aspect; it's unique.

And for me, as part of Bat1k, the idea that we could actually now go and sequence the genome to that level of exquisite definition of every living bat on this planet, which I think everybody should be working on, personally. They're the most extraordinary mammals. We're gonna be able to uncover what is the genetic basis for flight, for these long healthy lifespans, for some type of immunity against different viruses such as Ebola. They don't die from Ebola, they don't get cancer, and it's all in their genomes.

And now we're gonna be able to uncover it. But it took a decade, from 2009 till now to get to where we are. But I think that good things come for people that wait and we moved the field forward. And this is just the start and the floodgates will be opened. And why should we not now sequence all of life to see how it's evolved on this planet. And so I am glad and honored to be here. And we can talk about bats anytime.

[4'05”] Vivien

Bats and all other vertebrates. The Vertebrate Genomes Project has set out to sequence and assemble the genomes of all vertebrates at high quality. These are to be reference-grade genomes. For plenty of vertebrates no genome sequence exists at all. Existing genomes are often incomplete or contain errors. This has been true, for example, for the zebra finch genome.

Here are the first 15 genomes from the Vertebrate Genomes Project. 14 animals, 15 genomes because this group includes both the female and male zebra finch.

[5'00”]

There are an estimated 66,000 vertebrate species. There may be 71,000. Or more. Tallies vary. Which is why the 150 scientists in over 50 institutions in 12 countries who are part of the Vertebrate Genomes Project have their work cut out for them as they sequence and assemble these genomes. At a meeting at Rockefeller University in New York City they presented the first 15 genomes and talked about what's next.

[5'30”] Erich Jarvis

We've decided we're going to do what's called a reference Vertebrate Genomes Project: not just 10,000 species, which we initially started with as G10k, but all sixty-six thousand vertebrates at high-quality genome assemblies. Starting with all orders of vertebrates, roughly 260, then the families, 1,000 of them, eventually genera and eventually all species.

The first phase would be a proof-of-principle project, which we're announcing today. The second phase, I call it the rocking-and-rolling phase, is where we're starting to do thousands of genomes; then we're really in business. And the third phase: if we complete roughly 10,000 genera we reach the G10k milestone, and finally virtually all species, associated with other consortia, like the Bird 10K or the Bat1k, also represented here today.

[6'18”]

The approach we're using for phase 1 is to have these long reads, PacBio long reads in particular, with 10x Genomics, Bionano Genomics and Arima Genomics scaffolding approaches to pull these, what we call contigs, together into scaffolds using longer and longer and longer-range information.

And to give you an analogy: a short read is like the height of a man or woman. Whereas a long read is like the height of the Empire State Building or even bigger. Whereas the scaffolding approaches are like trying to take these Empire State Buildings together and bridge a continuous set of buildings between Earth and the moon.

[7'04”] And what does that gain you? An example is shown here of what these longer reads gain you if you're trying to assemble a genome that has lots of pieces, and all your pieces are small. In the case of this one, and I think you're kind of guessing what animal it is, it looks a little strange: this wing here is in the wrong place, the face is turned around, it's inverted. There are all these gaps in here that can have errors in them. Long reads make it easier to assemble, and then with long scaffolding approaches you can get all the chromosomes together.

Now this would be a perfect genome. We're not there yet, but we're getting much closer. And here is an approach Arang Rhie and others in our group have been working on, to take ever longer and longer pieces of DNA and string them into longer and longer pieces until you get chromosomes. And then finally we'll annotate that, name individual genes.

[8'03”]

I'm David Haussler. Here next to me Steve O'Brien, Oliver Ryder and I had the idea that we need to organize the communities that were sequencing various genomes. This was almost 10 years ago.

To work together towards a shared technology and an organized exchange of information and mutual support. We created an organization called Genome 10k. The thing that really strikes me is: back then we had no idea how long it would take to get to the point where you had a genome that was complete enough to do science on and could be obtained at a reasonable price.

Many of you have seen the curves of how the cost of sequencing is going down. But what became very clear, and Eric explained so well, is that what we thought was a genome back then really wasn't the genome that was suitable for doing science.

And so it took a long time. Two Assemblathons, I see Benedict Paten in the audience, and the Alignathon that Benedict put together really revealed the state of technology then, and the fact that it was not adequate to go forward.

So we have been keeping the troops excited about this project as best we can while furiously trying to improve technology, along with many others across the planet. I think we've reached a turning point. These new technologies for long-read sequencing and long association, over a megabase, associating one region with another, are spectacular.

And we're thinking of genomes now in terms of haploids. We're talking about reconstructing both chromosomes separately, which was a completely alien concept in the old days. But that is what it takes to get it right at this point.

[10'01”] Gene Myers

The advances in these long-read sequencing and long-range scaffolding technologies are revolutionizing de novo sequencing. I basically left the field in 2003/2004 because with the short reads, you know, producing really high-quality genomes was basically off the table. I saw that, as an informatician, I didn't want to play. That wasn't my game.

But when these new long-range technologies came in, you know, about half a dozen years ago, and really started to come into their own, I saw the potential, and I think several of my colleagues saw the potential, for us to finally be able to produce gold-standard genomes, things that are as good as the best genomes we have today.

You can basically, out of the box, now with these technologies produce a genome that's as good as the Drosophila genome, release 6, which is the best-sequenced model organism. You can get the same continuity numbers and statistics out of this.

So the 15 genomes that we're talking about today. I mean, people have talked about, you know, we're going to do a thousand, we're going to do 10,000 human beings. We're talking about, forgive me, low-quality stuff. What we're talking about are things that will basically, the word was bandied about with the Human Genome Project, things that will stand the test of time.

These are things where it's gonna be one and done. We're gonna do these things and we're not going to have to do them again. Almost everything else that's been done to date is going to have to be done again if you really want to do your science.

What's exciting about today: it's only 15, but it's the first 15 of what's going to be a flood. So what's happening today, and what we're announcing, is the beginning of a trend that's basically going to completely and dramatically alter genomics in the future. Thank you.

[11.49] Richard Durbin

So I wanted to say, I mean, people have said how this is transformative. It's essentially a hundred times longer contiguity than the vast majority of genomes currently in the archives. So that's a good number to keep in your head. It's not alone.

There are other genomes out there, like Drosophila and human and a few others. But the 15 genomes we're presenting are in fact a major contribution to that set of things at this level of millions of bases, rather than tens of thousands of bases, of contiguity.

And despite 20 years of work and all the new technologies, there are still great technical challenges in piecing together the billions of letters in a genome in the right order with the goal of no error, and in particular in scaling that up so that we can do it at the level of the hundreds and thousands and tens of thousands that we want to. So it's great to be part of this team. You know, it's great to be back working with Gene, we worked together 20 years ago, and David and others here, and to have new people join. The team here, in my view, is leading the world in attacking these problems. And it's an exciting time in the field.

Next: a conversation about VGP informatics

[13.05] Richard Durbin

It's an interesting conversation, and my combative, contradictory tendencies have been raised.

So I think, absolutely, the goal has to be to get it correct. I actually think though you can't let total perfection get in the way of, you know, there is a balance between perfection and utility. Which we have to think about and address and meet on each occasion.

Gene Myers

I would agree with that. I just feel it's gone the wrong way.

Richard Durbin 

The other thing: the speed is important, because what happens is sequencing technology is increasing faster than two-fold a year and computing technology increases are slower than two-fold a year. Over the last 30 years, we have managed to keep the computing-to-sequencing costs balanced by improving the efficiency of the computation. And that has been done through a whole lot of algorithmic invention and insight.

At times I think we've been at the cutting edge of computational algorithms in this field, which is kind of surprising, and I think we are.

David Haussler

I think we are at the cutting edge. 

[14'29”] Richard Durbin

And secondly, we've also taken advantage of the fact that the more we sequence, the more it looks like things we've seen before. There's structure there. So I think Gene is right: we have to absolutely get it right. We have to understand what right looks like and be getting it right.

But then there is going to be a lot of scope, and it's going to be necessary, to make that computation time- and space-efficient.

Right now we're not going to manage to scale up a thousandfold, which is what we need to do and will do, without becoming yet more computer-efficient than we are now. So I think that's fun. And then the second thing I want to say…

Gene Myers

Can I respond to that? 

Richard Durbin

In a second. The second thing I wanted to say is that a key thing is the relationship between sequences. So there's a relationship between these pieces we have and how they relate to the truth of the individual genomes, and then the relationship between the individual genomes. And we're building a reference resource.

Our concept of a reference is in a state of change at the moment. David and I have talked about this for years. People have kind of known it's necessary for that to happen. And at the computational level we've talked about graph genomes, and they haven't really taken off. But I've come to realize, and I think, that's still a really essential thing. In fact the data structures used for assembly, some of which Gene introduced, are very closely related to the ones we need for representing genetic variation within and between species. So we've all published on forms of this over the last decades.

[16'25”]

And I think that's a really important area still. But I've come to realize that you don't really want to expose all that, just as we don't expose Burrows-Wheeler transforms and fancy hash tables to end users, or the finite state automaton in BLAST; that wasn't what the BLAST user knew about. We don't want to tell the end user that our reference is a graph. We want to give them something that lets them look at any individual genome in the context of all the other things related to it and transfer information between them. So we have to work out the modality for presenting that to people, and what to put under the hood for doing that.

And I think that remains a major challenge. After several years of effort, in conjunction with people in Santa Cruz, we've just published in Nature Biotechnology this vg package, which is a piece of that from my perspective. But I still see that whole agenda as open, with a huge amount to be done, methodologically and computationally, in getting the right assemblies, and then relating everything to each other and making it all scale so that you can…

[17.45] Gene Myers

Richard and I have known each other for a long time, and actually one of the reasons I like working with Richard is because, I mean, as a mathematician I do tend to suffer from perfectionistic tendencies.

It's wonderful that Richard offsets me with a certain kind of…

Richard Durbin

English pragmatism

Gene Myers

English pragmatism. I wasn't gonna put English in there, but since you did, I guess it's OK. And I mean, really, I actually agree with everything that Richard has said. I only stated what I stated because I feel that much of the work that I've seen has placed the emphasis on efficiency before the aspect of correctness.

Of course you want your processes to run at a speed and with a turnover rate that allows you to basically do your work. Richard: the assembly of PacBio data was taking 400,000 CPU hours and I brought it down to 2,000. I would agree that at 400,000 Amazon cloud CPU hours it's totally untenable.

I mean, then you really have an issue. But you know, now that it's down to 2,000 hours on the HPC cluster, we're at least back in the envelope.

Richard Durbin

It's good for now. 

[19.04] Vivien

And the shorthand of it is?

Gene Myers

My wizardry, computer science stuff, you know. I guess wizard is the best answer.

Richard Durbin

But it isn't magic.

You have to seed things efficiently. You have to find things that are likely to be similar in more and more efficient ways, by index structures. And then you have to have fast ways to evaluate and align around them.

[19.43] Gene Myers

Right. So Richard's correct in the sense that the overarching way to do these comparisons is that you build a pre-computed index. You take each of the objects you want to compare and you build a representation of it, an index, that facilitates the comparison of the objects.
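The pre-computed-index idea Myers describes can be illustrated with a minimal k-mer index in Python. This is a toy sketch, not any production tool: the index of the target is built once, and shared seeds with any query can then be found quickly and later extended into alignments:

```python
from collections import defaultdict

def build_index(target, k=5):
    """Map every k-mer of the target to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index

def seeds(query, index, k=5):
    """Exact k-mer matches as (query_pos, target_pos) pairs; these
    seeds would then be extended into full alignments."""
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]
```

The index is built once per target, so many queries amortize its cost, which is the point of pre-computation.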

Richard Durbin

This is pretty central to computer science now. The whole of internet search is based on, not the same, but related ideas.

Gene Myers

Massive indexing

Richard Durbin

In fact there are a lot of areas of connection between what we do and…

[20.19] David Haussler

I think we're only beginning to really address these deep problems that we'll have computationally. Assembly is the first line of attack on the raw data that we have coming off the machines.

But ultimately, the comparison for scientific understanding between thousands and thousands of genomes will require enormous additional efficiencies of the type that Gene is talking about here. It's compounded, I think, by the richer conceptual framework that is required.

Richard Durbin

You have to get the representation right. 

David Haussler

And that comes down to what Richard was talking about in terms of the representation. We see people come from computer science into genomics and want to apply their tricks. But it takes them many, many years before they understand the true depth of genetics. If you look at what we're dealing with, it's an evolutionary process on the scale of billions of years that has gradually molded these genomes of different species into what they are today, through a very complex set of processes.

[21.45]

It isn't just copy-paste, it isn't just make a substitution here. The actual driving changes are very subtle things, like gene conversion and so forth. When you actually want to model that process, it gets very complicated. One has to make compromises in order to do it efficiently. But ultimately, the conceptual picture that I want people to take home is that these bases, at the DNA base level, derive one from another through this process of evolution.

I loved Emma's aperiodic crystal, to bring back Schrödinger in here. So the information in that original aperiodic crystal has been copied again and again with modification, but modifications of a very rich type. But when we look at a piece of DNA, when people look at a piece of DNA in the human genome browser and they turn on the comparative genomics tracks, they want to see the relatives of that piece of DNA in other species. But increasingly they also want to trace that back to a state that it was in earlier.

[23'06”]

And for another piece of our work, on KRAB zinc fingers, we actually reconstruct the ancestral form of the DNA segment, derive the protein from it, and then actually create the protein in its ancestral form and do experiments on it. So to understand evolution you have to understand it as a historical process, for which biomolecules existed a hundred million years ago, a billion years ago, that we can infer and experiment with. But only if we get the relationships between all of these trillions of bases of DNA reasonably correct; not perfect.

[23'51”] Richard Durbin

The more you know them the better. 

David Haussler

I think we're talking about enough to do science. I liked Erich's introduction, because what we had before was not enough to do his science, and that's the true motivation for digging deeper into technology. We don't know exactly what the minimum formula for doing science is.

But I want to emphasize what Gene was saying: when we tried to compare genomes that were only partially assembled, in little pieces, to each other and reconstruct their evolutionary history, it was not enough to do that science. We did not have enough information in that sequence.

And it's going to be revolutionary to write these new algorithms now that can actually reconstruct the evolutionary history of all of the bases in a large collection of genomes back to their common ancestral form. That's a huge challenge. It's one of the most exciting pieces of software one could write. We really now only have the opportunity to do it successfully because of these fantastic new genomes.

[25'03”] Richard Durbin

We tend to say this is all happening for the first time, but actually for bacteria, one of the three kingdoms of life, this has been done. There are hundreds, I think thousands, of complete, end-to-end perfect bacterial genome sequences that have been assembled, particularly over the last five years, but starting 20 years ago.

Because they are only a few megabases. And in fact the long-read technologies, before we started applying them on this scale, had been applied to make end-to-end perfect bacterial genomes. It exemplifies things: we've seen that absolutely critical events in evolution have taken place through structural variation, insertions and pieces that are distributed. Now people can say the mutation processes and the recombination and DNA-exchange processes are different in bacteria, and that is true to some extent. But I think more of what goes on in bacteria is also happening in humans, in eukaryotes, than people have thought in the past.

A lot of what we're interested in in life is not bacterial. So there's a huge amount to be done. It's fine.

[26.43] Gene Myers

But the structure of the eukaryotic genome is so much richer, too.

Richard Durbin

All of life is rich. If life were all bacterial, life would be the poorer for it. I just want to say that there is material there to work on, at one level, with respect to all these things. We can talk about the relationship between things, which is evolutionary, which is that they've descended by mutation and recombination, and drift and selection.

[27'19”] So these are population-genetic concepts, and one of the great things is that there is a mathematical structure there. Even if it's more complicated than just single bases changing from A to G and so on, which is absolutely true and necessary, the set of basic operations out of which DNA sequences are made can be reduced to a small set. And there is a theory, a rich theory, which has been developed over the last, getting on now for, 100 years.

We're coming up almost to the centenary of the years when people began to write down these processes and think about them mathematically and analytically. The holy trio in that case is Ronald Fisher, Sewall Wright and JBS Haldane, and their descendants. For a long time they worked in the abstract, actually.

David Haussler

They didn't have data.

[28'19”] Richard Durbin

They did amazing things with limited data. But now it's all laid out. And it's that which underlies putting together data, computer science and biology. It's an interesting time.

And that, in Europe, was a major component of the origin of the statistical theory of those processes. So Fisher in particular is seen by some as a possibly slightly wayward father of both population genetics and statistics.

This issue about how sequences are generated and converted to each other and how they relate to each other and how you efficiently manage and compute and make inferences about things is an incredibly rich area.

---

Transcript of podcast

The conversation with David Haussler, Richard Durbin and Gene Myers

Richard Durbin

So my name is Richard Durbin. I'm a professor in the Department of Genetics at the University of Cambridge in the U.K., and now an associate faculty member at the Wellcome Sanger Institute, which is just outside Cambridge and is Europe's leading genome institute.

David Haussler

My name is David Haussler. I'm the director of the UC Santa Cruz Genomics Institute.

Gene Myers

So I'm basically, at this time, an American expat; I'm working in Dresden, Germany, for the Max Planck Society.

Vivien

What is exciting to you about looking at the genomes of animals? Pick your animal. Has anything struck you as peculiar? I mean, are there many more repeats, are there many more things that make it difficult to adapt the tools you already have to analyze these genomes?

[0'50] Richard Durbin

You know, I think we have realized progressively over the last decade that there are hard bits of genomes, and increasingly people have, with big focused efforts, worked out how to get at those. I think the thing that excites us, as Gene said earlier, and me personally, is that now we can see that. It is always the combination of the data you can collect and the computational methodology you use to put it together, and the data we can get now means, we think, that the raw information is present to untangle things.

[2.40”]

At the same time, I think Evan Eichler is a good example of somebody in human genetics who has been demonstrating repeatedly that there is evolutionary and functionally important stuff in the complex material that's hard to assemble.

We've also in fact learned, through organisms such as malaria and a whole load of pathogenic organisms, that a lot of the functionally important stuff, the rapid adaptation, is in this very similar, highly duplicated or structurally complex material, often polymorphic between individuals within a species.

So those have been things that have been hard to get at, and we now have the raw data to get to them. People have been able to sort them out in particular cases, on a case-by-case basis, over the last decade, but we can see ourselves now doing that in a much more automated, high-throughput fashion. And that, to my mind, is one of the drivers for this.

[3'02”] David Haussler

I would amplify this. There are regions in the human genome where we're still discovering great new biology.

Vivien

Centromere, I guess?

David Haussler

Right now, near the centromere. We discovered the activity of the Notch2NL gene, just recently published in Cell, in an area that had been unassembled and, actually, wrongly assembled.

So geneticists for decades were looking at this gene in the wrong place. Notch2NL turns out to be a gene that's specifically active in humans. It's a new gene in the Notch family which is incredibly fundamental to developmental biology.

And here you have a situation where it was ignored because the assembly was first wrong and, in addition, incomplete. When that was completed, we were able to show that this gene emerged in its active form only in humans, about three million years ago, and arguably had a profound impact on the expansion of our cerebral cortex, which is one of my personal favorite genomic traits of people. So these stories, which come from the difficult parts of the genome that Richard was talking about, are ubiquitous. We expect to find them in all species.

We heard from Erich another example, in the birds, where the duplicated parts of the genome were confusing. But it's those duplications that drive innovation. So we have enormous amounts of new biology coming.

[4'50”] Vivien

And data visualization? Just to pick one: the Genome Browser is going to grow with this project, obviously, it sounds like.

David Haussler

Well, yes, the challenge is severe. We now have a new feature, the multi-region feature, which is extremely popular.

Jim Kent's group has designed it so that you can take pieces from different parts of the genome and view them together, seamlessly, in the same view. And this is very, very important for paralogous genes and other kinds of duplicated regions.

It's also important for things like looking at all of the exons of a very long gene at a level of detail. And it's going to be fundamental when we start to think about the fact that there are now two haplotypes for these species, and more to come as we understand the haplotypic variation within a species. Being able to look at the alternative paths through the genomic architecture of a species will be a new era for the Genome Browser. A big challenge.

[6'10”] Gene Myers

I think my friends have already spoken eloquently about initial findings in these difficult-to-assemble regions. I'm basically just a technologist, the guy in the boiler room, and what I'm really all fired up about is that I finally have data sets with sufficient information content, informatically, that I can put these things together near-perfectly.

The challenges are going to be basically to understand small variations, between repeats, between haplotypes, within the context of noisy and erroneous reads. How do we really see those things? That's not a problem that has been solved well.

I think another problem in the assembly domain that people have not focused on sufficiently is really building models, understanding what the repeats in a genome are, and understanding how to assemble them. So developing those techniques is something I'm particularly excited about. And there are going to be huge issues of scale: these assemblies are already difficult to achieve and require high-performance computing.
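
Gene's point about modeling repeats can be made concrete with a toy example: counting k-mer multiplicities in a sequence flags the positions that fall inside exact repeats, which is the simplest version of the repeat-detection problem assemblers face. A minimal sketch in Python; the sequence and value of k are illustrative, not from any VGP pipeline:

```python
from collections import Counter

def kmer_multiplicity(seq, k):
    """Count occurrences of every k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def repeat_positions(seq, k):
    """Return start positions whose k-mer occurs more than once,
    i.e. positions that fall inside candidate exact repeats."""
    counts = kmer_multiplicity(seq, k)
    return [i for i in range(len(seq) - k + 1)
            if counts[seq[i:i + k]] > 1]

# A toy sequence with an exact repeat of "GATTACA":
seq = "GATTACACCGATTACA"
print(repeat_positions(seq, 7))  # prints [0, 9]
```

Real repeat models are far richer (inexact copies, tandem arrays, segmental duplications), but k-mer multiplicity is the usual starting signal.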

[7'33”]

One of the things I tell my students all the time is that the first goal of genome assembly is not to get it fast but to get it right. In other words, one of the problems in the field is that a lot of people, the informatics world, have been focusing on speed. Mine's faster. Mine uses less memory. That is the wrong focus for genome assembly.

The right focus for genome assembly is: "have I put it together correctly, by any means necessary?" And that's the first thing computer scientists like myself need to understand: that is the primary issue.

[8.15”]

It's not my area in particular, but I also think the people who think about the Tree of Life, and about that aspect, are going to be really quite overwhelmed. What are you going to do when we have 66,000 near-perfect genomes? I mean, how do you think about comparing all of those and understanding the differences within them?

I really think that people should already be looking, in the corpus of data that was announced today, at doing comparative genomics and seeing what kinds of analyses they get. Because in a lot of the attempts up until now, if I may speak bluntly, people have spent most of their time dealing with the caveats of having partial and incomplete genomes. That's been the main technical impediment. They spend 80 to 90 percent of their time dealing with the artifacts and the problems.

[9'15']

Just think about the world when that's not the issue, when the issue is the biological meaning of these things. I think that's going to be very important, and I'm very excited about it. That's why I'm really excited about this project: I feel that, at least, and I'm kind of an old dog, before I retire maybe we can get this first and most fundamental piece of information in all of molecular biology to a standard where it's really a reference, really something you can bet the bank on.

[9'50'] Vivien

And the fact that Erich mentioned that his students spent a lot of time and discovered, I guess after an extended period, that something was wrong when they were comparing.

Gene Myers

Yeah and re-sequencing.

Vivien

Was this known?

Richard Durbin

It's happened all over the world many many times.


David Haussler

Many times

Gene Myers

Many times

Vivien

Yes you guys are going to solve this, right?


Gene Myers

We want to spare millions of postdocs around the world that misery. Are there millions? I don't know.

Richard Durbin

You know, it's an interesting conversation, and my combative nature, my contradictory tendencies, have been roused. I think, absolutely, the goal has to be to get it correct. I actually think, though, that you can't let total perfection get in the way; there is a balance between perfection and utility, which we have to think about, address and meet on each occasion.

Gene Myers

I would agree with that. I just feel it's gone the wrong way.

Richard Durbin

The other thing is that speed is important, because sequencing technology is increasing faster than two-fold a year, while computing technology increases more slowly than two-fold a year. Over the last 30 years we have managed to keep the computing-to-sequencing costs balanced by improving the efficiency of the computation. And that has been done through a whole lot of algorithmic invention and insight.

At times I think we've been at the cutting edge of computational algorithms in this field, which is kind of surprising, and I think we are.

David Haussler

I think we are at the cutting edge.

[14'29”] Richard Durbin

And secondly, we've also taken advantage of the fact that the more we sequence, the more it looks like things we've seen before. There's structure there. So I think Gene is right: we have to absolutely get it right, understand what right looks like, and be getting it right.

But then there is going to be a lot of scope, and it is going to be necessary, to make that efficient in computer time and space.

We're not going to manage to scale up a thousandfold, which is what we need to do and will do, without becoming yet more computer-efficient than we are now. So I think that's fun. And then there's the second thing I want to say.

Gene Myers

Can I respond to that?

Richard Durbin

In a second. The second thing I wanted to say is that a key thing is the relationship between sequences. There's the relationship between these pieces we have and how they relate to the truth of the individual genomes, and then the relationship between the individual genomes. And we're building a reference resource.

Our concept of reference is in a state of change at the moment. David and I have talked about this for years; people have kind of known it's necessary for that to happen. At the computational level we've talked about graph genomes, and they haven't really taken off, but I've come to realize, and I think, that it's still a really essential thing. In fact the data structures used for assembly, some of which Gene introduced, are very closely related to the ones we need for representing genetic variation within and between species. We've all published on forms of this over the last decades.

[16'25”]

And I think that's still a really important area. But I've come to realize that you don't really want to expose all of that, just as we don't expose Burrows-Wheeler transforms and fancy hash tables to end users, and the finite state automata in BLAST were not what the BLAST user knew about. We don't want to tell the end user that our reference is a graph. We want to give them something that lets them look at any individual genome in the context of all the other things related to it, and transfer information between them. So we have to work out the modality for presenting that to people, and what to put under the hood to do it.
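
The Burrows-Wheeler transform Richard mentions as something to keep "under the hood" can be illustrated with the naive, textbook construction; production read-mappers build it from a suffix array instead of sorting full rotations. A hedged sketch in Python, with toy input:

```python
def bwt(text):
    """Naive Burrows-Wheeler transform: last column of the sorted
    rotations of text plus a '$' sentinel. O(n^2 log n) here;
    real indexes derive it from a suffix array."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last):
    """Invert the transform by repeatedly prepending the last
    column and re-sorting (the classic naive inversion)."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith("$"))[:-1]

print(bwt("ACAACG"))                # the transformed string
print(inverse_bwt(bwt("ACAACG")))   # round-trips to "ACAACG"
```

The transform groups identical characters from similar contexts together, which is what makes it compressible and searchable; that is the property FM-index-based mappers exploit.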

And I think that remains a major challenge. After several years of effort, in conjunction with people in Santa Cruz, we've just published this vg package in Nature Biotechnology, which from my perspective is a piece of that. But I still see that whole agenda as open, with a huge amount to be done, methodologically and computationally, in getting the right assemblies, then relating everything to each other, and making it all scale so that you can...

[17.45] Gene Myers

Richard and I have known each other for a long time, and actually one of the reasons I like working with Richard is that, as a mathematician, I do tend to suffer from perfectionist tendencies.

It's wonderful that Richard offsets me with a certain kind of

Richard Durbin

English pragmatism

Gene Myers

English pragmatism. I wasn't going to put "English" in there, but since you did, I guess it's OK. And really, I actually agree with everything Richard has said. I only stated what I stated because I feel that much of the work I've seen has placed the emphasis on efficiency before correctness.

Of course you want your processes to run at a speed, and with a turnover rate, that allows you to do your work. Richard: the assembly of PacBio data was taking 400,000 CPU hours, and I brought it down to 2,000. I would agree that at 400,000 CPU hours on the Amazon cloud it's totally untenable.

I mean, then you really have an issue. But now that it's down to 2,000 hours on the HPC cluster, we're at least back in the envelope.

Richard Durbin

It's good for now.

[19.04] Vivien

And the shorthand of it is, you know, "my wizardry".

Gene Myers

My wizardry, computer science stuff, you know, yeah. I guess wizard is the best answer.

Richard Durbin

But it is; it's not magic.

You have to seed things efficiently. You have to find things that are likely to be similar, in more and more efficient ways, by index structures. And then you have to have fast ways to evaluate and align around them.

[19.43] Gene Myers

Right. So Richard's correct in the sense that the overarching way to do these comparisons is that you build a pre-computed index. You take each of the objects you want to compare and build a representation of it, an index, that facilitates the comparison of the objects.
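
The pre-computed index Gene describes can be sketched, in its simplest form, as a k-mer hash of the reference: each query k-mer is looked up to produce seed matches that a real aligner would then extend and score. The reference and query strings below are illustrative:

```python
from collections import defaultdict

def build_index(reference, k):
    """Pre-compute a k-mer -> list-of-positions index over the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_matches(index, query, k):
    """Look up every query k-mer in the index; each hit is a
    (query_pos, reference_pos) seed an aligner would extend."""
    return [(j, i)
            for j in range(len(query) - k + 1)
            for i in index.get(query[j:j + k], [])]

ref = "ACGTACGTTGCA"
idx = build_index(ref, 4)
print(seed_matches(idx, "TACGTT", 4))  # prints [(0, 3), (1, 0), (1, 4), (2, 5)]
```

This is the seed-and-extend pattern behind most modern mappers; production tools replace the plain hash with minimizers or FM-indexes to fit genome-scale data in memory.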

Richard Durbin

This is pretty central to computer science now. The whole of internet search is based on something not the same, but related.

Gene Myers

Massive indexing

Richard Durbin

In fact there are a lot of areas of connection between what we do and...

[20.19] David Haussler

I think we're only beginning to really address these deep problems that we'll have computationally. Assembly is the first line of attack on the raw data coming off the machines.

But ultimately the comparison for scientific understanding between thousands and thousands of genomes will require enormous additional efficiencies of the type Gene is talking about here. And that is compounded, I think, by the richer conceptual framework that is required.

Richard Durbin

You have to get the representation right.

David Haussler

And that comes down to what Richard was talking about in terms of the representation. We see people come from computer science into genomics and want to apply their tricks. But it takes them many, many years before they understand the true depth of genetics. If you look at what we're dealing with, we're dealing with an evolutionary process on the scale of billions of years that has gradually molded these genomes of different species into what they are today, through a very complex set of processes.

[21.45]

It isn't just copy-paste, it isn't just making a substitution here. The actual driving changes are very subtle things, like gene conversion and so forth. When you actually want to model that process, it gets very complicated, and one has to make compromises in order to do it efficiently. But ultimately the fact underlying this, the conceptual picture that I want people to take home, is that these bases, at the DNA base level, derive one from another through this process of evolution.

I loved Emma's aperiodic crystal, bringing Schrödinger back in here. So the information in that original aperiodic crystal has been copied again and again with modification, but modifications of a very rich type. When people look at a piece of DNA in the human genome browser and turn on the comparative genomics tracks, they want to see the relatives of that piece of DNA in other species. But increasingly they also want to trace it back to a state it was in earlier.

[23'06”]

And for another piece of our work, on KRAB zinc fingers, we actually reconstruct the ancestral form of the DNA segment, derive the protein from it, and then actually create the protein in its ancestral form and do experiments on it. So to understand evolution you have to understand it as a historical process, for which biomolecules existed a hundred million years ago, a billion years ago, that we can infer and experiment with. But only if we get the relationships between all of these trillions of bases of DNA reasonably correct; not perfect.

[23'51”] Richard Durbin

The more you know them the better.

David Haussler

I think we're talking about enough to do science. I liked Erich's introduction, because what we had before was not enough to do his science, and that's the true motivation for digging deeper into technology. We don't know exactly what the minimum formula for doing science is.

But I want to emphasize what Gene was saying: when we tried to compare genomes that were only partially assembled, in little pieces, to each other and reconstruct their evolutionary history, it was not enough to do that science. We did not have enough information in that sequence.

And it's going to be revolutionary to write these new algorithms now that can actually reconstruct the evolutionary history of all of the bases in a large collection of genomes back to their common ancestral form. That's a huge challenge. It's one of the most exciting pieces of software one could write. We really now only have the opportunity to do it successfully because of these fantastic new genomes.
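
One classical building block for the ancestral reconstruction David describes is Fitch's small-parsimony algorithm, which infers candidate ancestral bases at one alignment column given a known tree. Production tools use probabilistic substitution models, but the core idea can be sketched as follows; the tree and observed bases are purely illustrative:

```python
def fitch_up(tree, leaf_bases):
    """Bottom-up pass of Fitch parsimony for one alignment column.
    tree: nested 2-tuples for internal nodes, strings for leaf names.
    Returns the candidate-base set at the root and the minimum
    number of substitutions implied on the tree."""
    def walk(node):
        if isinstance(node, str):                  # leaf: observed base
            return {leaf_bases[node]}, 0
        (ls, lc), (rs, rc) = walk(node[0]), walk(node[1])
        inter = ls & rs
        if inter:                                  # children agree: no extra change
            return inter, lc + rc
        return ls | rs, lc + rc + 1                # disagreement costs one substitution

    return walk(tree)

# Toy tree ((human, chimp), (mouse, rat)) with observed bases at one site:
tree = (("human", "chimp"), ("mouse", "rat"))
bases = {"human": "A", "chimp": "A", "mouse": "G", "rat": "A"}
print(fitch_up(tree, bases))  # prints ({'A'}, 1)
```

Scaling this kind of inference from one toy column to trillions of bases across thousands of genomes is exactly the algorithmic challenge being described.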

[25'03”] Richard Durbin

We tend to say this is all happening for the first time, but actually for bacteria, one of the three kingdoms of life, this has been done. There are hundreds, I think thousands, of complete, end-to-end perfect bacterial genome sequences that have been assembled, particularly over the last five years, but starting 20 years ago.

Because they are only a few megabases. And in fact the long-read technologies, before we started applying them on this scale, had been applied to make end-to-end perfect bacterial genomes. It exemplifies things: we've seen that absolutely critical events in evolution have taken place through structural variation, insertions and pieces that are distributed. Now people can say the mutation processes and the recombination and DNA-exchange processes are different in bacteria, and that is true to some extent. But I think more of what goes on in bacteria is also happening in humans, in eukaryotes, than people have thought in the past.

A lot of what we're interested in in life is not bacterial. So there's a huge amount to be done. It's fine.

[26.43] Gene Myers

But the structure of the eukaryotic genome is so much richer, too.

Richard Durbin

All of life is rich. If life were all bacterial, life would be the poorer for it. I just want to say that there is material there to work on, at one level, with respect to all these things. We can talk about the relationship between things, which is evolutionary, which is that they've descended by mutation and recombination, and drift and selection.

[27'19”] So these are population-genetic concepts, and one of the great things is that there is a mathematical structure there. Even if it's more complicated than just single bases changing from A to G and so on, which is absolutely true and necessary, the set of basic operations out of which DNA sequences are made can be reduced to a small set. And there is a theory, a rich theory, which has been developed over the last, getting on now for, 100 years.

We're coming up almost to the centenary of the years when people began to write down these processes and think about them mathematically and analytically. The holy trio in that case is Ronald Fisher, Sewall Wright and JBS Haldane, and their descendants. For a long time they worked in the abstract, actually.

David Haussler

They didn't have data.

[28'19”] Richard Durbin

They did amazing things with limited data. But now it's all laid out. And it's that which underlies putting together data, computer science and biology. It's an interesting time.

And that, in Europe, was a major component of the origin of the statistical theory of those processes. So Fisher in particular is seen by some as a possibly slightly wayward father of both population genetics and statistics.

This issue about how sequences are generated and converted to each other and how they relate to each other and how you efficiently manage and compute and make inferences about things is an incredibly rich area.

