Sequencing the World

It looks like the beginnings of a consortium are taking shape, with the goal of sequencing all life on earth. As something of a genomicist, I am psyched by the goal, unattainable as it may be. I also want to explain why lofty goals are helpful, and why this one will be too.

The Human Genome Project took years to finish and ended up costing about a dollar per base pair; base pairs are the chemical “letters” that make up the genetic code. Since then, sequencing has become orders of magnitude cheaper. The current genome sequencing leader, Illumina, famously announced that a genome could be sequenced for a thousand dollars. If we compare that to the investment required for the human sequence, we certainly have made strides. This is due to improvements in the technology we use to sequence genomes. The most popular way to do it today is to take a sample of DNA from an organism, where it is typically present in long stretches called chromosomes, and break it into short fragments. Since the sample contains many copies of the genome, we end up with more than one copy of each letter. Using the powerful genome sequencers that have been developed, we can sequence a little bit of each of these fragments, producing short “reads,” before using a computer program to assemble the reads into a contiguous sequence. If you can imagine taking a few hundred copies of “Moby Dick” and randomly cutting out stretches of letters before trying to reassemble the book from the fragments by looking for overlap between random fragments, then you understand the basic strategy that genome sequencing uses today.
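The “Moby Dick” strategy can be sketched in a few lines of Python. This is a toy illustration, not how production assemblers work: the function names (`overlap`, `greedy_assemble`) and the `min_len` threshold are made up for this example, and it assumes error-free reads that all come from the same strand.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    Overlaps shorter than `min_len` are ignored (returned as 0), since
    very short matches are likely to be coincidences.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Returns the list of contiguous sequences that are left once no two
    fragments overlap by at least `min_len` characters.
    """
    # Drop exact duplicates and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]

    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


fragments = ["me ishm", "ishmael", "call me i"]
print(greedy_assemble(fragments))
```

A greedy approach like this is easily fooled when the same stretch of letters appears in more than one place, because it cannot tell which copy a fragment came from.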

In spite of the cutting edge technology, it still takes a ton of work to go from a draft genome assembly–which is what you could immediately get after putting a thousand dollars into an Illumina machine and plugging the resulting reads into the computer to assemble–to the kind of gold-standard genome assemblies that we have in well-studied organisms like mice and humans. Typically, more work has to be put in to fill in gaps in the assembly that result from highly repetitive DNA, which confounds assemblers. Scientists sometimes have to do follow-up experiments to prove that their genome assembly is real and is not just a computer error. Finally, the genome sequence is useless until you start to figure out where the genes and other features lie. This means more follow-up experiments and comparing the genome to those of other related organisms.

All of this takes a significant investment of time and treasure, and there is no way that we could do that for all life on earth. You would never be able to have a gold-standard genome assembly for every organism on earth. Much like the oft-told anecdote about restaurants in New York City–where it is said that you could never eat at every restaurant in the city because new ones are opening for business and going out of business faster than you could visit them all–new organisms are evolving and going extinct all of the time. The idea of putting in enough work to get something as polished as the fruit fly genome, let alone the mouse or human genome, is laughable if you start to think about it. But even a rough attempt would allow researchers to gain an appreciation for the diversity of life that exists on earth, specifically at the DNA level. Just having fractions of the genomes of most of the species on earth would allow us to better understand the evolutionary relationships between all life on earth.

As for this goal being a little too big to handle, big goals are important to push us to new heights. Getting to the moon seemed ridiculous at the time, and sequencing the human genome seemed impossible when we first started to plan how to do it. These goals ended up being attainable, but just imagine if they had not been. Even if we had never made it to the moon, we would have still developed the kind of technology that allowed us to put satellites into orbit that now power our ubiquitous mobile devices. Even if the human genome had proved intractable, we would have still ended up with improved sequencing technology. This is because setting these lofty goals has the effect of pushing us to achieve things that we would have never thought to attempt otherwise. If we set out to sequence all life on earth, just imagine what we might find we can do along the way.

*I found a post by professor/blogger Jeff Ollerton who also had his own take on the proposal. While he and I do not agree, he has an interesting take that I enjoyed reading. It should also be said that he has more expertise than me in this area.

A Voyage of Viral Discovery

Richard Dawkins’ Selfish Gene came out 40 years ago, so it is only fitting that I get to write about the most selfish genes of all: viruses. Basically, viruses are pieces of genetic material–either DNA or RNA–surrounded by a protein shell and maybe some lipid membrane. Viruses are not living cells, and they do not fulfill most of the hallmarks of life that many of us learned in middle school: viruses do not catalyze their own chemical reactions, they are not made up of cells, and they do not reproduce on their own. In order to do the chemical reactions necessary to reproduce and make more copies of themselves, viruses must find a way to put that genetic material that they carry into a living host cell and trick the host into using the code as it would use its own genome. This is how the virus manages to make the host into a veritable virus factory.

Since viruses rely on living cells for almost everything, it has not been easy to study them. In fact, we did not even know that viruses existed until the late 19th century. The first viruses were isolated when scientists studying a pathogen found that they could run infectious material through the smallest available filters without removing the infectious factor. At that point, they just called them “non-filterable agents” and reasoned that they must be extremely small, even smaller than bacteria. Experiments by others in the early and mid-20th century went on to discover that viruses were mostly protein and nucleic acid (RNA or DNA), making them radically different from previously known cellular life.

As biologists, we were pretty late to the virus party–shoot, we pretty much knew what cells were shortly after the first microscopes were built in the 1600s, but it somehow took until the 1800s to know that there was something smaller that could cause disease–so it is no surprise that there is still a lot for us to learn about the tiny “non-filterable agents.” Appropriately, a recent paper in Nature claimed to find over 1000 distinct viruses that are all new to science. To make this discovery, the scientists first had to pick a group of cellular hosts in which to look for viruses. They settled on invertebrates, a diverse group of animals that includes everything from insects and squids to sea urchins and earthworms. They also had to decide what type of viruses they would look for, opting to search for RNA viruses, which invade a host using RNA instead of DNA as their genetic material. By collecting and sequencing RNA from over 200 different invertebrate species, they were able to piece together long strands of RNA using the sequencing data and a computer program. However, those long reconstructed strands of RNA did not necessarily come from a virus present within the host. Host cells make their own RNA all of the time using their own DNA as a template. In order to be sure that the piece of RNA they found originated in a virus, they needed a signature that could only be present in a viral RNA. They found that signature in the form of an RNA virus-specific gene called “RNA-dependent RNA polymerase” or RdRp. RNA viruses use RdRp to copy their RNA genome when they invade a host cell, but they have to bring their own, encoded in their RNA genome; animals just do not have an RdRp. (That is, unless you believe this group that claims to have found a possibly-functional RdRp gene in a bat genome. I hope you will agree with me when I say that living things tend to be amazing because all of the rules we have about them are inevitably broken in some other organism.)
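To make the idea of a viral “signature” concrete, here is a toy Python sketch that translates each assembled strand in all six reading frames and flags the ones that contain a marker peptide. It is not the method used in the paper. The marker chosen here, the Gly-Asp-Asp (GDD) tripeptide that is conserved in the RdRp of many RNA viruses, is an assumption of this sketch and is far too short to be reliable on real data, where it would turn up by chance all the time. A real analysis would compare whole protein sequences against known RdRp genes. The function names and example sequences are made up.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Yield the protein translation of all six reading frames."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            yield translate(strand[offset:])


def looks_viral(contig, motif="GDD"):
    """True if any reading frame contains the (hypothetical) marker motif."""
    return any(motif in protein for protein in six_frame_translations(contig))


# Made-up example strands, not real sequences.
contigs = {
    "contig_1": "AUGGGUGAUGACUAA",  # encodes M-G-D-D-stop
    "contig_2": "AUGAAACCCUUUUAA",  # encodes M-K-P-F-stop
}
candidates = [name for name, seq in contigs.items() if looks_viral(seq)]
print(candidates)
```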

With this handy tool to distinguish viral RNAs from the rest of the pool, the authors had a field day discovering new RNA viruses. In addition to classifying viruses based on the host they were discovered within, they also used a technique known as “phylogenetics” to compare the RNA sequences of all of the viruses in order to place them on a tree of life relative to each other. Since all life on earth can ultimately trace its roots back to one common ancestor, shared by everything from humans to bacteria, we can compare the nucleic acid sequences of organisms or viruses in order to infer their evolutionary distance from each other. For example, two viruses with relatively similar RdRp genes would be inferred to be quite closely related compared to a third virus with less sequence in common in the RdRp gene.
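The three-virus comparison can be written out as a small calculation. The Python sketch below scores each pair of sequences by the fraction of positions at which they differ, and treats the pair with the smallest score as the closest relatives. It assumes the sequences are already aligned and of equal length. The virus names and sequences are invented, and real phylogenetic methods use statistical models of sequence change rather than a raw count of differences.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(seqs):
    """Return the names of the two sequences with the smallest distance."""
    return min(
        combinations(sorted(seqs), 2),
        key=lambda pair: p_distance(seqs[pair[0]], seqs[pair[1]]),
    )


# Invented, pre-aligned gene fragments for three hypothetical viruses.
rdrp_fragments = {
    "virus_A": "ATGGGTGATGAC",
    "virus_B": "ATGGGAGATGAC",
    "virus_C": "ATGCGTCATTAC",
}

for first, second in combinations(sorted(rdrp_fragments), 2):
    d = p_distance(rdrp_fragments[first], rdrp_fragments[second])
    print(first, second, round(d, 3))

print("closest relatives:", closest_pair(rdrp_fragments))
```

Here virus_A and virus_B differ at only one of twelve positions, so they would be inferred to be more closely related to each other than either is to virus_C.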

These new viruses were not discovered as human pathogens, so it is unlikely that this finding will have any direct medical relevance. This result can instead be useful for ecologists and evolutionary biologists who want to understand the variety of viruses that infect the invertebrates studied. Moreover, since we know quite a lot about the evolutionary relationships between different invertebrates–owing to us having studied them quite intensely for decades or even centuries–we can now use the new phylogenetic information about viral genome relatedness to start to ask questions about how the viruses co-evolved with their hosts. For instance, a group of related beetles may tend to be infected with related RNA viruses. If this is the case, then it is possible that an early ancestor of those RNA viruses made a living infecting an early ancestor of those beetles. Basic studies like that might also help us to someday understand host-virus co-evolution in humans and our viruses. After all, humans are in no danger of hitting an evolutionary brick wall, and neither are our viral foes.