How to Identify Almost Anyone in a Consumer Gene Database
Researchers have become so adept at mining information from genealogical, medical and police genetic databases that it is becoming difficult to protect anyone’s privacy, including that of people who have never submitted their DNA for analysis.
In one of two separate studies published October 11, researchers report that by analyzing the 1.28 million samples contained in a consumer gene database, they could match 60 percent of the 140 million Americans of European descent to a third cousin or closer relative. That figure, they say in the study published in Science, will soon approach 100 percent as the number of samples grows in such consumer databases as AncestryDNA and 23andMe.
In the second study, in the journal Cell, a different research group shows that police databases, once thought to be made of meaningless DNA useful only for matching suspects with crime scene samples, can be cross-linked with genetic databases to connect individuals to their genetic information. “Both of these papers show you how deeply you can reach into a family and a population,” says Erin Murphy, a professor of law at New York University School of Law. Consumers who decide to share DNA with a consumer database are providing information on their parents, children and third cousins they don’t know about, and even a trace that could point to children who don’t exist yet, she says.
In the first study, researchers looked at the database of 1.28 million Americans of European descent and discovered that it was enough to provide information on millions of individuals who were not in the database. The database came from MyHeritage, a company that both tests genomes, as do AncestryDNA and 23andMe, and allows them to be uploaded for further genealogical analysis, as can be done with the database GEDmatch.
Ancestry and 23andMe said they zealously guard the privacy of their testing results, but the data can become accessible when uploaded to other databases. “Your chance of finding someone that is a third cousin is about 60 percent in U.S. individuals with European ancestry,” says Yaniv Erlich, the first author of the paper and the chief science officer of MyHeritage. He suggested it might be wise to encrypt genetic data to protect personal information, although that could complicate the type of searches police and researchers wish to make.
The technique relies on links between distant relatives. “Think of your family like layers of an onion,” he says. Your closest relatives are parents, children and siblings. The next layer is first cousins, which you might have in higher numbers. Another layer and you reach second cousins, and so on until you could find yourself related to many third cousins you don’t know at all. “When you go to very distant relatives, chances of a link are much higher,” he says. These kinds of links were used earlier this year to identify a suspect in the case of the alleged Golden State Killer, who was connected to the crimes partially via the DNA of relatives in a genetic database.
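The arithmetic behind the onion picture can be roughed out in a few lines of Python. This is a back-of-the-envelope sketch, not the model used in the Science study: it assumes every person has the same number of relatives who would be detectable if they were in the database, that each of those relatives is in the database independently, and that the chance of being in it equals the database's share of the population. The function names and the growth multipliers are hypothetical; only the 1.28 million, 140 million and 60 percent figures come from the article.

```python
import math

# Figures quoted in the article.
DATABASE_SIZE = 1_280_000     # samples in the consumer database
POPULATION = 140_000_000      # Americans of European descent
TARGET_MATCH_RATE = 0.60      # reported chance of a third-cousin-or-closer match


def match_probability(coverage, relatives):
    """Chance that at least one of `relatives` people is in the database.

    Toy assumption: each relative is in the database independently with
    probability `coverage`.
    """
    return 1.0 - (1.0 - coverage) ** relatives


def implied_relatives(coverage, match_rate):
    """Number of detectable relatives that would yield `match_rate`
    under the same toy assumption (inverse of match_probability)."""
    return math.log(1.0 - match_rate) / math.log(1.0 - coverage)


coverage = DATABASE_SIZE / POPULATION
relatives = implied_relatives(coverage, TARGET_MATCH_RATE)

print(f"database coverage: {coverage:.2%}")
print(f"implied detectable relatives per person: {relatives:.0f}")

# Hypothetical growth scenarios: what happens if the database gets bigger.
for growth in (1, 2, 3, 5):
    p = match_probability(min(1.0, coverage * growth), relatives)
    print(f"database x{growth}: match probability {p:.1%}")
```

Under these simplifying assumptions, a database covering less than 1 percent of the population reaches a 60 percent match rate if each person has on the order of 100 findable relatives, and the match rate climbs quickly as coverage grows. That is the intuition behind the researchers' expectation that the figure will approach 100 percent.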
Once police had genetic links to distant relatives, they could draw a large, complex family tree, possibly too large to analyze. But then they could exclude many of the linked individuals based on other data—where they live, their age, their sex and so forth, Erlich notes. Much of that information comes from widely shared family trees drawn by consumers. After pruning the data that way, a pool of, say, 850 relatives could be reduced to 15 who might plausibly be connected to the crimes in question.
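The pruning step Erlich describes amounts to filtering a list of candidates on non-genetic attributes. The sketch below illustrates the idea with a handful of made-up records; the `Relative` fields, the `prune` function and the filter values are all hypothetical and are not drawn from any real investigation or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relative:
    """One person on the reconstructed family tree (made-up fields)."""
    name: str
    sex: str          # "M" or "F"
    birth_year: int
    state: str        # state of residence at the relevant time


def prune(candidates, *, sex, birth_year_range, states):
    """Keep only candidates consistent with what is known about the suspect."""
    earliest, latest = birth_year_range
    return [
        c for c in candidates
        if c.sex == sex
        and earliest <= c.birth_year <= latest
        and c.state in states
    ]


# Entirely fictional pool standing in for the hundreds of linked relatives.
pool = [
    Relative("A", "M", 1945, "CA"),
    Relative("B", "F", 1950, "CA"),   # excluded: sex
    Relative("C", "M", 1990, "CA"),   # excluded: too young
    Relative("D", "M", 1948, "NY"),   # excluded: lived elsewhere
    Relative("E", "M", 1952, "CA"),
]

shortlist = prune(pool, sex="M", birth_year_range=(1940, 1960), states={"CA"})
print([r.name for r in shortlist])
```

Each filter on its own removes only part of the pool, but combining them is what would shrink a pool of 850 relatives to something like 15 in the article's example.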
Then police can start knocking on doors and doing the kind of investigation they do routinely. “It’s really only over the last year that the ability to use these public genealogical databases for identifying individuals became clear,” says Daniel MacArthur, group leader of Massachusetts General Hospital’s Analytic and Translational Genetics Unit. “The academic community didn’t appreciate how large these databases are and how readily they could be used to triangulate genetic identity.”
The second study showed police databases contain more genetic information than researchers had suspected. Forensic databases hold information on a handful of identifying markers called STRs. Consumer databases use a far more detailed panel of markers known as SNPs. Until recently there was thought to be no connection between the two. Now it is clear forensic databases contain some SNP information, says Bruce Weir, a professor of biostatistics at the University of Washington. “For law enforcement, this means if they can’t find a match” in their databases, “they can now seek a match in other databases,” he says. It also means they can track information on relatives, rather than merely matching individuals, he notes. “Practically, it’s an enormous advance.” This raises an important privacy issue, he adds. “Should I be worried that by uploading my data, I make my relatives subject to being found by law enforcement?” That might be acceptable if those relatives committed a crime. “But suppose they didn’t?”
N.Y.U.’s Murphy says when police DNA databases were devised, the DNA in them was supposed to be meaningless junk—just DNA patterns that could be used for matching individuals with one another or with evidence. So no attention was paid to safeguarding the privacy of the information—as is done with health records, cell phone use, social media accounts and other information, she says. Police have free access to DNA and they demand samples even when no crime has been committed. “That represents a breakdown in what had been a high wall between genetics as used for criminal justice purposes and genetics as used in medical diagnosis or genealogy or anything totally unrelated to criminal justice,” she notes.
Asked if she would send her DNA to a consumer database, she says, “Heck no. But I have family members who have done it.” If they are in the database, so is she.