Big Data Coming In Faster Than Biomedical Researchers Can Process It

Nov 28, 2016
Originally published on November 28, 2016 6:28 pm

Biomedical research is going big-time: Megaprojects that collect vast stores of data are proliferating rapidly. But scientists' ability to make sense of all that information isn't keeping up.

This conundrum took center stage at a meeting of patient advocates, called Partnering For Cures, in New York City on Nov. 15.

On the one hand, there's an embarrassment of riches, as billions of dollars are spent on these megaprojects.

There's the White House's Cancer Moonshot (which seeks to make 10 years of progress in cancer research over the next five years), the Precision Medicine Initiative (which is trying to recruit a million Americans to glean hints about health and disease from their data), the BRAIN Initiative (to map the neural circuits and understand the mechanics of thought and memory) and the International Human Cell Atlas Initiative (to identify and describe all human cell types).

"It's not just that any one data repository is growing exponentially, the number of data repositories is growing exponentially," said Dr. Atul Butte, who leads the Institute for Computational Health Sciences at the University of California, San Francisco.

One of the most remarkable efforts is the federal government's push to get doctors and hospitals to put medical records in digital form. That shift to electronic records is costing billions of dollars, including more than $28 billion in federal incentives alone to hospitals, doctors and others to adopt them. The investment is creating a vast data repository that could potentially be mined for clues about health and disease, the way websites and merchants gather data about you to personalize the online ads you see and for other commercial purposes.

But, unlike the data scientists at Google and Facebook, medical researchers have done almost nothing as yet to systematically analyze the information in these records, Butte said. "As a country, I think we're investing close to zero analyzing any of that data," he said.

Prospecting for hints about health and disease isn't going to be easy. The raw data aren't very robust and reliable. Electronic medical records are often kept in databases that aren't compatible with one another, at least without a struggle. Some of the potentially revealing details are also kept as free-form notes, which can be hard to extract and interpret. Errors commonly creep into these records.

And data culled from scientific studies aren't entirely trustworthy, either.

"So many articles that are published today are going to be wrong in 10 years," said Greg Simon, who leads the Cancer Moonshot. "That's just the history of scientific research, and the question is you just don't know which ones are going to be wrong."

Scientists trying to figure out how to analyze that flood of big data are going to have to cut through the dissonance to find a melody. That takes skill.

"In a world where anything is possible because you have so much data, how do you figure out who has done the math right?" asked Food and Drug Administration Commissioner Robert Califf.

He said the only way to know for sure is to take ideas gleaned from the big datasets and then try them out in people. That means persuading patients to participate in studies.

Just a small percentage do today, "and what we're seeing in our best academic centers, the clinicians say they don't have time to talk to patients about participating in studies," Califf said. "So, far and away this is our No. 1 issue that we're focused on with big data."

These problems aren't just abstractions for Sonia Vallabh. Her mother died in middle age of a rare, fatal genetic disease called prion disease. Vallabh carries the same mutation that afflicted her mother. Vallabh quit her job as a lawyer and is now seeking a doctorate in biological and biomedical sciences at the Broad Institute in Cambridge, Mass.

Vallabh turned to a huge data set of genetic information to see what she could learn about her condition. "It basically confirmed what we thought we knew about my genetic mutation, which is it makes me almost 100 percent likely to die this way by midlife," she said.

But the data also yielded a surprise. Her disease is caused by having too much of a certain protein in her body. And some people with only half as much of this dangerous protein didn't get sick and die.

"So, here's an experiment of nature handed to us on a platter by big data, that says if we can find a way to turn down this disease protein, this protein that wants to kill me, that should be a safe way to delay or prevent disease."

But that's not a question to be answered through data-crunching. Vallabh needs the old-fashioned kind of medical research — laboratory and clinical science — to develop a drug that would reduce the protein safely and effectively.

You can email Richard Harris at rharris@npr.org.

Copyright 2016 NPR.

ARI SHAPIRO, HOST:

In medical research, small-scale projects are giving way to huge efforts that rely on big data. The most famous of these is the Human Genome Project about 15 years ago. Now there are many others, and they share a common problem. As NPR's Richard Harris reports, scientists are gathering mountains of data far more quickly than they're able to make sense of it.

RICHARD HARRIS, BYLINE: Advertisers and retail chains collect a vast amount of data from their customers. And they've learned to squeeze a lot of commercially valuable information from it. So naturally, scientists would like to use the same approach to revolutionize health and medicine. Francis Collins, head of the National Institutes of Health, recently ticked off a long list of data-gathering efforts that followed in the wake of the Human Genome Project, like the Cancer Moonshot and the BRAIN initiatives.

(SOUNDBITE OF ARCHIVED RECORDING)

FRANCIS COLLINS: We have the precision medicine initiative, which aims to figure out, by enrolling a million Americans, what really are the factors that are involved in health and disease?

HARRIS: And doctors coast to coast have spent billions to make their medical records digital so they can be mined for hints about how to improve wellness and conquer disease. Collins led a conversation about the issue at a meeting of advocates called Partnering for Cures, in New York. Atul Butte from UC San Francisco told the audience it's an embarrassment of riches in more ways than one.

(SOUNDBITE OF ARCHIVED RECORDING)

ATUL BUTTE: It's not just that any one data repository is growing exponentially. The number of repositories is growing exponentially.

HARRIS: Spending is now heading into the hundreds of billions of dollars, he said. But what's not growing is scientists' ability to make sense of that avalanche of data.

(SOUNDBITE OF ARCHIVED RECORDING)

BUTTE: As a country, I think we're investing close to zero analyzing any of that data.

HARRIS: And mining it for hints about health and disease isn't nearly as easy as, say, having Google figure out what you like and what ads to serve up for you. The raw data are not very robust and reliable. Electronic medical records can be tricky to work with and likely to have errors. And Greg Simon, who runs the Cancer Moonshot initiative, says data collected from scientific studies aren't trustworthy either.

(SOUNDBITE OF ARCHIVED RECORDING)

GREG SIMON: So many articles that are published today are going to be wrong in 10 years. That's just the history of scientific research. And the question is you just don't know which ones are going to be wrong.

HARRIS: So scientists trying to figure out how to analyze that flood of big data are going to have to cut through the dissonance to find a melody. Robert Califf, commissioner of the Food and Drug Administration, says that's no mean feat.

(SOUNDBITE OF ARCHIVED RECORDING)

ROBERT CALIFF: In a world where anything is possible because you have so much data, how do you figure out who's done the math right, what's inside the box that gives you the answer?

HARRIS: He said the only way to know for sure is to take ideas gleaned from big data sets and then try them out in people. That means convincing patients to participate in studies. Just a small percentage do today.

(SOUNDBITE OF ARCHIVED RECORDING)

CALIFF: And what we're seeing in our best academic centers, the clinicians say they don't have time to talk to patients about participating in studies. So far and away, this is our No. 1 issue now that we're focused on with big data.

HARRIS: These problems aren't just abstractions for Sonia Vallabh. Her mother died of a rare, fatal genetic disease in middle age, called prion disease. Vallabh quit her job as a lawyer and became a medical researcher at the Broad Institute in Cambridge, Mass. She turned to a huge set of genetic information to see what she could learn about her condition.

(SOUNDBITE OF ARCHIVED RECORDING)

SONIA VALLABH: We basically confirmed what we thought we knew about my genetic mutation, which is that it makes me almost a hundred percent likely to die this way by midlife.

HARRIS: But the data also yielded a surprise. Her disease is caused by having too much of a certain protein in her body. And some people with only half as much of this dangerous protein didn't get sick and die.

(SOUNDBITE OF ARCHIVED RECORDING)

VALLABH: So here's an experiment of nature handed to us on a platter by big data that says if we can find a way to turn down this disease protein - this protein that wants to kill me - that should be a safe way to delay or prevent disease.

HARRIS: But that's not a question for big data. Vallabh needs the old-fashioned kind of medical research, laboratory and clinical science, in order to develop a drug that would reduce the protein safely and effectively.

Richard Harris, NPR News. Transcript provided by NPR, Copyright NPR.