Large genomic databases are indispensable for scientists looking for genetic variations associated with diseases. But they come with privacy risks for people who contribute their DNA. A 2013 study (ref. 1) showed that hackers could use publicly available information on the Internet to identify people from their anonymized genomic data.
To address those concerns, a system developed by Bonnie Berger and Sean Simmons, computer scientists at the Massachusetts Institute of Technology (MIT) in Cambridge, uses an approach called differential privacy. It masks the donor’s identity by adding a small amount of noise, or random variation, to the results it returns in response to a user’s query. The researchers published their results in the latest issue of Cell Systems (ref. 2).
The system calculates the statistic that researchers want — such as the chance that one genetic variation is associated with a particular disease, or the top five genetic variations associated with an illness. Then it adds random variation to the result, essentially returning slightly incorrect information. For example, in a query for the top five genetic variations associated with a disease, the system might yield the top four genetic variations and the sixth or seventh variation.
The user would not know which of the returned results are exact and which have been altered, but they could still use the information. It would just be much harder for someone to work out the patient information behind the data.
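The noise-adding step described above can be illustrated with a minimal Python sketch. It is not the MIT system's code: it uses the Laplace mechanism, a standard textbook way of achieving differential privacy, and every name in it (`laplace_noise`, `noisy_count`, `noisy_top_k`, the `epsilon`, `sensitivity` and `scale` parameters, and the variant scores) is a hypothetical chosen for illustration. The noise scale used for the ranking is an arbitrary example value, not a calibrated privacy guarantee.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent Exp(1) draws follows a
    # Laplace(0, 1) distribution; multiplying by `scale` widens it.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    # Textbook Laplace mechanism: the noise scale is sensitivity / epsilon,
    # so a smaller epsilon (stronger privacy) means more noise.
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def noisy_top_k(scores, k, scale, rng):
    # Illustrative only: perturb every score, then rank by the perturbed
    # values. With enough noise, a lower-ranked item can displace a true
    # top-k item, as in the "sixth or seventh variation" example.
    noisy = {name: s + laplace_noise(scale, rng) for name, s in scores.items()}
    return sorted(noisy, key=noisy.get, reverse=True)[:k]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the demo is repeatable
    # Hypothetical association scores for seven variants (made-up numbers).
    scores = {f"variant_{i}": 10.0 - i for i in range(1, 8)}
    print(noisy_count(120, epsilon=0.5, rng=rng))
    print(noisy_top_k(scores, k=5, scale=1.5, rng=rng))
```

Running the script several times with different seeds shows the trade-off the article describes: the returned list is usually close to the true ranking, but no single entry can be trusted to be exact.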
“When you induce a little noise in the system, in many ways it’s not that different from noise in the data to begin with,” says Bradley Malin, a computer scientist at Vanderbilt University in Nashville, Tennessee. “It is reliable to a certain degree.” The US Census Bureau and US Department of Labor have been adding noise to their data in this way for decades, he says.
The privacy of individuals in a …