In June, 4.1 petabytes of cancer data went online on a new platform, run by the University of Chicago, called the Genomic Data Commons. Four petabytes is about sixteen times all the information in the Library of Congress—but only about four percent of the data stored by Facebook as of 2012.
In other words, the data we generate about how we live our lives is much more vast and organized than the data about biological life itself, and the technology used to organize it is more seamless and sophisticated. But given the evolution of technology, that makes sense. Consider how we documented day-to-day life before smartphones and services like Facebook: you had to have a camera, which wrote to a card, from which you transferred photos to a hard drive. That recorded limited metadata, generally date and time and some information about the camera itself.
As phones got better cameras and those cameras were connected by software to the phone's other features, pictures could go straight to the cloud with location data; software on the other end could take easily added metadata—like tagging someone with their name—and connect that to other photos and profiles on cloud platforms like Facebook, which grew to 1.71 billion monthly active users in 12 years.
The amount of data being recorded about the biological lives of people is comparatively sparse, because sequencing a genome is harder than taking a picture or posting an update. But that's about to change:
[There is a] very reasonable possibility that a significant fraction of the world’s human population will have their genomes sequenced. The leading driver of this trend is the promise of genomic medicine to revolutionize the diagnosis and treatment of disease, with some countries contemplating sequencing large portions of their populations: both England and Saudi Arabia have announced plans to sequence 100,000 of their citizens, one-third of Iceland’s 320,000 citizens have donated blood for genetic testing, and researchers in both the US and China aim to sequence 1 million genomes in the next few years. With the world’s population projected to top 8 billion by 2025, it is possible that as many as 25% of the population in developed nations and half of that in less-developed nations will have their genomes sequenced (comparable to the current worldwide distribution of Internet users).
The authors of that study suggest that 100 million to two billion people could have their genomes sequenced by 2025, which is to say somewhere between the number of people on LinkedIn and the number of people on Facebook right now. And the reason is the same one that drove the growth of those platforms: it's getting cheaper and faster to sequence a genome. As of last year, we had about 250 petabytes of human genomic data, and scientists expect that figure to grow by 20 to 400 petabytes per year within a decade.
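For a rough sense of where a range like that comes from, here is a back-of-the-envelope sketch in Python; the split between developed and less-developed nations is a round-number assumption of mine, not a figure from the study.

```python
# Rough sketch of the study's upper-range scenario: 25% of people in
# developed nations and half that fraction elsewhere get sequenced.
# The population split below is a round-number assumption for ~2025,
# not a number from the paper.
developed_pop = 1.3e9        # assumed population of developed nations
less_developed_pop = 6.8e9   # assumed population of less-developed nations

sequenced = 0.25 * developed_pop + 0.125 * less_developed_pop
print(f"{sequenced / 1e9:.2f} billion genomes")
# ~1.18 billion, comfortably inside the study's 100-million-to-2-billion range
```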
So what do you do with all that data? Right now, the Genomic Data Commons is ushering cancer research into its Facebook moment. In the four years from 2007 to 2011, the cost of sequencing a human genome fell from $8.9 million to $10,500, and it's expected to fall below $1,000. That's the bioinformatics equivalent of putting a camera on a smartphone.
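Run the numbers and the scale of that drop becomes clearer; here is a quick sketch using the figures above.

```python
# Fold decrease in sequencing cost, 2007-2011, using the figures cited above.
cost_2007 = 8_900_000   # dollars per genome, 2007
cost_2011 = 10_500      # dollars per genome, 2011

fold = cost_2007 / cost_2011
per_year = fold ** (1 / 4)          # spread over the four years 2007-2011
print(f"~{fold:,.0f}x cheaper overall, roughly {per_year:.1f}x cheaper per year")
# ~848x cheaper overall, roughly 5.4x cheaper per year
# (a Moore's-law pace of halving costs every two years is only ~1.4x per year)
```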
"The current model is that a bioinformatician downloads the data that they need. It's usually almost always in multiple sources. They set up an environment to hold that data, then they spend a fair amount of time analyzing that data. Then they create a paper describing the results and publish it. And then repeat the process," says Robert Grossman, the GDC's principal investigator and the director of the Center for Data Intensive Science at the University of Chicago. "That was fine when they could work from their small workstation that they had, or a larger workstation, or what became over the last decade the computing infrastructure for bioinformaticians—each department or center had a cluster of computers that could provide the storage and computing required. It was expensive to set up and operate that, but there was no alternative."
In an era in which data of all kinds is instantaneously available from all over the world, the barriers to working with large-scale genomic data boggle the mind:
"The data has been available, but was very, very cumbersome to get," [Dr. Louis Staudt, director of the National Cancer Institute's Center for Cancer Genomics] said. "To download all of the data from The Cancer Genome Atlas would take 3 weeks of continuous download [and] require $1 million of software, and a team of people to ensure privacy.… Only very well-funded and well-positioned researchers were able to access the data."
And that's just the data. Anyone who has pulled data from different sources into the same platform (spreadsheets, pictures, video) knows how formats proliferate and how messy data gets when many people are generating it. Genomic data isn't immune, so the commons isn't just a hardware commons; it's a software commons as well, standardizing not just the means of access but the data itself. The hardware side is comparatively straightforward: general-purpose servers of the kind common to commercial cloud computing. The expertise lies in the people and software handling the data itself.
"A lot of the work that we did for the Genomic Data Commons that launched June 6 was to integrate a number of data sets that really hadn't been brought together before. Although there are tools, you need smart data scientists, smart bioinformaticians, to spend time with the data to bring that together. By and large—although there are tools—that is not automated, but still requires good people to take time to understand the data," Grossman says. "Other people like us have done the work of bringing the data together, of cleaning it, of integrating it, and of running the computing infrastructure. So that all that you have to do is come in with your knowledge about cancer, or drug development, et cetera, and use that over the data we have. We're trying to dramatically shorten the path to discovery."
That software commons includes analysis tools stored at the GDC that can be used remotely. Alternatively, an API (application programming interface) gives tools that researchers build elsewhere a common language for working with the data in the commons.
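Here is a minimal sketch of what talking to that API looks like from the outside; the base URL, endpoint, and field names follow the GDC's public REST documentation as I understand it and may differ in detail.

```python
import requests

# Ask the GDC API for a handful of projects; endpoint and field names are
# based on the GDC's public REST documentation and may differ in detail.
BASE = "https://api.gdc.cancer.gov"

resp = requests.get(f"{BASE}/projects", params={"size": 5, "format": "json"}, timeout=30)
resp.raise_for_status()

for hit in resp.json()["data"]["hits"]:
    print(hit.get("project_id"), "-", hit.get("name"))
```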
"To summarize, you can interact directly with our commons, you can use tools or applications or services that interoperates with our commons, or you can use a commons that interoperates with our commons," Grossman says. "Or you could just download the data and do whatever you want."
Our ability to collect, parse, and share this data is growing rapidly, but so is the complexity of what we understand cancer to be. At the cutting edge of this is tumor heterogeneity, our increasing awareness that the genetics of cancer cells can vary greatly even within a single tumor. A recent study, for instance, found that African-American women, a group with disproportionately high breast-cancer mortality, have greater intra-tumor genetic heterogeneity. That study was done using data from The Cancer Genome Atlas, the dataset that would take three weeks to download and one of those that can now be explored and analyzed remotely through the Genomic Data Commons.
"Cancer is a disease of mutations," Grossman says, "and many of those mutations are rare, and often times it's combinations of rare mutations that we have to understand. But we need enough data to provide enough statistical power so that scientists can look at that to make the discoveries they need to understand the mechanisms underlying cancer, the growth, the changes to the genome and the pathways that allow cells to grow and proliferate in the way that leads to cancer."