Synthetic Data: The Next Solution for Data Privacy?


Gregory Hong is an IPilogue Writer and a 1L JD candidate at Osgoode Hall Law School.


One contentious point from the AI for the Future of Health session at the Bracing for Impact Conference was synthetic data’s potential to solve the privacy concerns surrounding the datasets needed to train AI algorithms. In light of its increasing popularity, I will explore the benefits and dangers of this potential solution.

Concept

The data privacy concern that synthetic data aims to address is very similar to the purpose of differential privacy: protecting anonymized data from being re-identified without reducing data utility. This is distinct from data augmentation, the process of adding new data to an existing real-world dataset in order to provide more training data, such as by rotating images or combining two images to create a new one (a minimal sketch follows below). Data augmentation is typically not useful in the privacy context.
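For contrast, here is a minimal sketch of image data augmentation using the Pillow library; the file name and the choice of transformations are hypothetical and purely illustrative:

```python
from PIL import Image

# Data augmentation: derive new training images from an existing one.
# "cell_scan.png" is a hypothetical file used for illustration only.
original = Image.open("cell_scan.png")

augmented = [
    original.rotate(90),                                  # rotated copy
    original.rotate(270),                                 # another rotation
    original.transpose(Image.Transpose.FLIP_LEFT_RIGHT),  # mirrored copy
]

# Blending two images of the same size is another common augmentation.
blended = Image.blend(original.convert("RGB"),
                      augmented[0].convert("RGB"), alpha=0.5)
```

Note that every output here is derived directly from a real image, which is why augmentation does nothing to protect the privacy of the underlying data.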

In a tech blog post, the Office of the Privacy Commissioner of Canada (“OPC”) describes synthetic data as “fake data produced by an algorithm whose goal is to retain the same statistical properties as some real data, but with no one-to-one mapping between records in the synthetic data and the real data.” Synthetic data is produced by fitting a generative statistical model to real-world source data and then sampling new records from that model; the output is evaluated for statistical similarity to the source alongside privacy metrics. Critically, there is no need to remove quasi-identifying data, that is, data vulnerable to de-anonymization. This results in more complete datasets.
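A toy illustration of that pipeline, assuming the source is a small numeric table; a multivariate Gaussian stands in for the far more sophisticated generative models used in practice, and the data itself is fabricated for the example:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical source data: rows are patients, columns are
# (age, systolic blood pressure). Purely illustrative numbers.
source = rng.multivariate_normal(
    mean=[50, 120], cov=[[100, 30], [30, 80]], size=1000)

# "Fit" a generative statistical model: here, simply the empirical
# mean and covariance of the source data.
mu = source.mean(axis=0)
sigma = np.cov(source, rowvar=False)

# Sample synthetic records from the fitted model. No synthetic row
# corresponds one-to-one with any real row.
synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=1000)

# Evaluate statistical similarity to the source (a crude utility check).
print("source mean:   ", np.round(mu, 1))
print("synthetic mean:", np.round(synthetic.mean(axis=0), 1))
```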

Benefits

Synthetic data provides protection from re-identification through a highly automated process, resulting in datasets that can be readily shared between AI developers without the attendant privacy risks. An Nvidia blog post also points out that there are substantial cost savings: it cites a synthetic data company founder’s estimate that “a single image that could cost $6 from a labeling service can be artificially generated for six cents.” Synthetic data can also be manufactured to reduce bias by deliberately including a wide variety of rare but crucial edge cases. Nvidia uses machine vision for autonomous vehicles as its example, but I think this concept should translate to improving representation of marginalized and under-represented groups in large datasets in healthcare or facial recognition. Many of the Bracing for Impact panelists shared this concern.

Dangers

The OPC notes in its blog many issues and concerns, particularly regarding re-identification. This is especially true if the synthetic data is not generated with sufficient care and the “generative model learns the statistical properties of the source data too closely or too exactly”. In other words, if the model “overfits” the data, the synthetic data will simply replicate the source data, making re-identification easy. There is also concern with membership inference, where an attacker can determine that a particular individual’s record was present in the source data, a risk inherent in publishing anything derived from that data. A 2022 machine learning paper also demonstrated that “synthetic data does not provide a better tradeoff between privacy and utility than traditional anonymization techniques” and that “the privacy-utility tradeoff of synthetic data publishing is hard to predict.” This indicates that the characterization of synthetic data as a “silver bullet” likely oversells its capabilities.
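To make the overfitting danger concrete, here is a deliberately naive sketch, not the method from the cited paper: if an overfit generator memorizes and re-emits source rows, a suspect record’s distance to its nearest synthetic neighbour collapses toward zero, which is the core intuition behind membership inference.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def min_distance(record, synthetic):
    """Distance from a candidate record to its nearest synthetic row."""
    return np.linalg.norm(synthetic - record, axis=1).min()

# Hypothetical setup: an overfit generator that memorizes and re-emits
# the source rows, versus a generator producing genuinely fresh samples.
source = rng.normal(size=(500, 2))
overfit_synthetic = source.copy()             # memorized source data
fresh_synthetic = rng.normal(size=(500, 2))   # independently sampled

target = source[0]  # a record an attacker suspects was in the source

# A near-zero distance suggests the target's record was memorized.
print("overfit generator:", min_distance(target, overfit_synthetic))  # ~0.0
print("fresh generator:  ", min_distance(target, fresh_synthetic))    # > 0
```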

Implementations

Nvidia is using synthetic data in computer vision, but its primary purpose there is not privacy, which shows that the technology serves other important functions. MDClone is a leading platform for synthetic data in healthcare and is making its way to Canada through a partnership with McGill. This is only the beginning: it is predicted that “synthetic data will completely overshadow real data in AI models by 2030.”

Conclusion

Synthetic data has the potential to be highly beneficial, as it may be the answer to the many challenges AI developers face in sharing sensitive data. However, like many developments in AI technology, it requires caution and careful implementation to be effective and is potentially dangerous if relied upon haphazardly.