Three years ago, Netflix released a database containing the movie preferences of 480,000 users. Last year, Google released information on the viewing habits of millions of YouTube users to Viacom. In both of these and many other similar cases, the data was anonymized first: fields containing personally identifying information were removed. According to a paper by Paul Ohm (via David Canton), this is not enough. “Data,” he writes, “can either be useful or perfectly anonymous but never both.” The paper outlines how re-identification allows anonymity to be broken and, after reviewing dead-end approaches, suggests how privacy regulation can adapt to reflect the failure of anonymization. This article will briefly explain how anonymization and re-identification work and then discuss Ohm’s proposed reforms.
How it works—or doesn’t
Anonymization (also called de-identification or scrubbing) means removing traditionally recognized identifying information from sets of data. Many privacy laws allow companies to share data if it is anonymized first. The purpose is to leave enough useful information for statistical analysis without allowing a company or researcher to discover the habits of any specific person. For example, a company might release information on TV viewing habits to a researcher. It would assign each person a unique number and remove all names, addresses, phone numbers, and other conventional identifiers. John Smith (born July 20, 1969 and living in Cocoa Beach, FL) becomes merely Subscriber #24601, connected with a list of TV shows and perhaps some demographic information. It sounds like a good scheme, but it turns out that John Smith could be re-identified.
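To make the scrubbing step concrete, here is a minimal sketch in Python. The field names, record layout, and sample data are hypothetical, not Netflix’s actual format; the point is only that direct identifiers are dropped while everything else survives.

```python
# A minimal sketch of anonymization by scrubbing: direct identifiers are
# dropped and each person gets an arbitrary subscriber number, while the
# demographic fields and viewing history are kept for analysis.
# The field names and sample record are hypothetical.
import itertools

subscribers = [
    {"name": "John Smith", "address": "123 Ocean Ave, Cocoa Beach, FL",
     "phone": "321-555-0100", "birthday": "1969-07-20", "sex": "M",
     "city": "Cocoa Beach, FL", "shows": ["Show A", "Show B"]},
]

DIRECT_IDENTIFIERS = {"name", "address", "phone"}

def scrub(records):
    """Replace direct identifiers with a sequential subscriber number."""
    counter = itertools.count(24601)  # arbitrary starting number
    scrubbed = []
    for record in records:
        kept = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        kept["subscriber_id"] = next(counter)
        scrubbed.append(kept)
    return scrubbed

print(scrub(subscribers))
# John Smith is now just subscriber #24601 plus birthday, sex, city, and shows:
# exactly the quasi-identifiers that make re-identification possible.
```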
Re-identification works by finding fingerprints: sets of apparently innocuous facts that, taken together, uniquely identify a person. For example, for any given combination of birthday, sex, and city of residence (say, July 20, 1969, male, Cocoa Beach, FL) there is a 55% chance that it describes only a single person in the entire United States. Add an extra item of information and the odds get better. None of these facts would be considered personally identifying on its own, yet combined they amount, in effect, to a national ID number. If our researcher looked up this combination on a social network, she would probably get only a single result. She now knows that Subscriber #24601 is John Smith.
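To see how the researcher would actually use such a fingerprint, here is a hedged sketch of the lookup, again with invented names and records: index public profiles (say, scraped from a social network) by the three quasi-identifiers and match the scrubbed records against them.

```python
# Sketch of a fingerprint (linkage) attack: index public profiles by
# (birthday, sex, city) and look up each scrubbed record. A unique match
# re-identifies the "anonymous" subscriber. All data here is made up.
scrubbed = [
    {"subscriber_id": 24601, "birthday": "1969-07-20", "sex": "M",
     "city": "Cocoa Beach, FL", "shows": ["Show A", "Show B"]},
]

public_profiles = [
    {"name": "John Smith", "birthday": "1969-07-20", "sex": "M",
     "city": "Cocoa Beach, FL"},
    {"name": "Jane Doe", "birthday": "1972-03-02", "sex": "F",
     "city": "Orlando, FL"},
]

def fingerprint(record):
    return (record["birthday"], record["sex"], record["city"])

# Group the public profiles by fingerprint, then look up each scrubbed record.
profiles_by_fp = {}
for profile in public_profiles:
    profiles_by_fp.setdefault(fingerprint(profile), []).append(profile)

for record in scrubbed:
    matches = profiles_by_fp.get(fingerprint(record), [])
    if len(matches) == 1:
        print("Subscriber #%d is probably %s"
              % (record["subscriber_id"], matches[0]["name"]))
```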
A fingerprint can be any set of facts unlikely to apply to more than one person. Another group of researchers, for example, showed that most Netflix users can be uniquely identified from just two movies they have rented and the dates on which they rented them. So while the government can list the traditionally recognized pieces of information (name, address, SIN, etc.) that are personally identifying on their own, there remain countless other combinations of facts that identify an individual when taken together. Anonymization doesn’t work unless you remove so much information that the data becomes almost useless.
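The same idea generalizes to any fingerprint. Under the simplifying assumption of exact matches (the actual Netflix study handled noisy, approximate information), a short sketch with invented rental data shows how few facts are needed to narrow a dataset down to one subscriber.

```python
# Simplified uniqueness check in the spirit of the Netflix study. The real
# result (Narayanan and Shmatikov) tolerates approximate dates and ratings;
# this exact-match version, with invented data, only illustrates the idea
# that a couple of (movie, date) pairs can single out one subscriber.
rentals = [
    (1, "Movie A", "2005-01-03"), (1, "Movie B", "2005-02-10"),
    (2, "Movie A", "2005-01-03"), (2, "Movie C", "2005-04-22"),
    (3, "Movie B", "2005-02-10"), (3, "Movie C", "2005-04-22"),
]

def subscribers_matching(known_rentals):
    """Return the ids of subscribers whose history contains every known rental."""
    histories = {}
    for sid, movie, date in rentals:
        histories.setdefault(sid, set()).add((movie, date))
    return [sid for sid, history in histories.items()
            if set(known_rentals) <= history]

# Two movies plus their rental dates are already enough to isolate subscriber 1.
print(subscribers_matching([("Movie A", "2005-01-03"),
                            ("Movie B", "2005-02-10")]))  # -> [1]
```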
Policy problems and recommendations
Ohm writes that current privacy schemes share three problematic features: (1) they are primarily concerned with traditional categories of personally identifiable information; (2) they impose onerous requirements; and (3) they place too much faith in anonymization. The first is problematic because almost any information can turn out to be personally identifiable; even if it is not tied to traditionally sensitive data such as health records, it can be used to build fingerprints that unlock other datasets. The latter two are problematic because, in many privacy schemes, companies that anonymize their data escape the most burdensome requirements. In the face of re-identification research, however, anonymization does not offer the protection legislators imagined. And if it no longer provides an easy exemption, then every company that handles data faces the same requirements, which goes too far: a dry cleaner should not have to meet the same standard as a hospital.
Ohm advocates a few changes to existing privacy schemes. (1) They should apply universally (unlike in the US) but be tailored by sector (unlike in the EU), imposing less burdensome requirements on less sensitive data. Legislators must weigh the benefits of free-flowing information and the costs of regulation against the privacy implications of particular types of data. (2) Public releases of data should be restricted even where the data is anonymized and seems innocuous. (3) Limits should be set on the quantity of data retained or shared rather than on the quality or type of data. Since data can be re-identified, allowing companies to store unlimited amounts of anonymized data is equivalent to having no restrictions at all.