Consider this: over ninety percent of the world's data, an estimated sixteen zettabytes, was created in the past five to six years. And by 2025, the world's digital data is projected to grow to one hundred and sixty-three zettabytes.
Digital data pervades virtually every aspect of our lives. A recently published IDC study estimates that by 2025, over 20% of the world’s digital data will be utilized by both public and private sector entities to manage, protect and improve our daily lives through increased investments in a wide range of open data initiatives.
While the social utility of open data is self-evident, there are a number of vexing issues that impact privacy rights. Foremost is the reliability of de-identification protocols that remove personally identifiable information from data sets that are published as part of open data initiatives.
A considerable body of recent research indicates that, notwithstanding the application of robust de-identification protocols, the risk of re-identifying personal information remains high. Rapid advances in technology make it easier to combine data from multiple sources that individually may not expose personal data but in combination can compromise privacy rights. Notably, one widely cited study found that 87% of the US population can be re-identified simply by combining three data points: zip code, gender, and date of birth.
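The mechanics of such a linkage attack are easy to illustrate. The sketch below (a purely hypothetical dataset and field names, not the study's actual method) counts how many records are uniquely identified by the combination of zip code, gender, and date of birth; any unique combination can, in principle, be matched against an external source such as a voter roll.

```python
# Toy illustration: how many records are uniquely pinned down by the
# quasi-identifier triple (zip, gender, date of birth)?
from collections import Counter

# Hypothetical "de-identified" records (names removed, but quasi-identifiers kept).
records = [
    {"zip": "02139", "gender": "F", "dob": "1975-03-02", "diagnosis": "A"},
    {"zip": "02139", "gender": "M", "dob": "1975-03-02", "diagnosis": "B"},
    {"zip": "02139", "gender": "F", "dob": "1980-07-14", "diagnosis": "C"},
    {"zip": "94103", "gender": "F", "dob": "1975-03-02", "diagnosis": "A"},
]

# Group records by the quasi-identifier combination.
combo_counts = Counter((r["zip"], r["gender"], r["dob"]) for r in records)

# A record is at high re-identification risk when no other record shares
# its combination, because an outside dataset containing the same fields
# plus a name would identify the individual exactly.
unique = sum(1 for r in records
             if combo_counts[(r["zip"], r["gender"], r["dob"])] == 1)

print(f"{unique} of {len(records)} records are uniquely identifiable "
      f"({unique / len(records):.0%})")
```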
The challenge faced by open data advocates is balancing privacy rights with the utility of published datasets: the more granular a data set, the more valuable it is for research and analysis. Ensuring that secondary uses of data sets safeguard privacy rights is of paramount importance in light of more rigorous privacy regimes such as the General Data Protection Regulation (GDPR), which defines pseudonymization as the processing of personal data such that the data "can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person".
What best practices and technological measures should be considered to meet the GDPR standard?
The first consideration is one of terminology. We must distinguish between two forms of de-identification: anonymization and pseudonymization. The former is designed to remove any association with the data subject through techniques such as masking, generalization, and suppression. The latter substitutes the identity of the data subject with a token, which can be resolved back to the individual only through additional information that is kept separately.
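To make the distinction concrete, the following sketch contrasts the two approaches on a hypothetical record, using illustrative field names and a simple salted-hash token; the token map stands in for the "additional information" that the GDPR requires to be kept separately.

```python
# Minimal sketch contrasting anonymization and pseudonymization.
# All field names, values, and the tokenization scheme are assumptions
# made for illustration only.
import hashlib
import secrets

record = {"name": "Jane Doe", "postcode": "M4B 1B3", "age": 34}

# --- Anonymization: remove or coarsen the association with the subject ---
anonymized = {
    "name": None,                         # suppression of a direct identifier
    "postcode": record["postcode"][:3],   # generalization to a wider area
    "age": "30-39",                       # generalization into an age band
}

# --- Pseudonymization: replace the identity with a token -----------------
salt = secrets.token_hex(16)              # kept separately, under access control
token = hashlib.sha256((salt + record["name"]).encode()).hexdigest()

pseudonymized = {"subject_token": token,
                 "postcode": record["postcode"],
                 "age": record["age"]}
token_map = {token: record["name"]}       # the separately stored re-linking key

print(anonymized)
print(pseudonymized)
```

The difference matters: the anonymized record cannot be tied back to Jane Doe at all, whereas the pseudonymized record can be re-linked by anyone who also holds the token map, which is why that additional information must be held separately and protected.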
If properly applied, these de-identification techniques can exempt de-identified data from the application of the GDPR, thereby giving policymakers the latitude to use such data for secondary purposes such as open data initiatives.
Given the risks of re-identification, the GDPR includes an important caveat: in assessing whether data has been effectively de-identified, account must be taken of all the means "reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly."
The second consideration is the application of de-identification measures that can properly safeguard privacy rights. This requires rigorous protocols that include an assessment of data risk, the likelihood of re-identification, and the utility of the data after de-identification.
A data risk assessment should take into account a number of variables, such as the size of the data set, the sensitivity of the information, and its granularity. Direct identifiers (such as name, address, and social insurance number) are masked and may also be encrypted, while indirect identifiers (such as gender or political affiliation) may also require de-identification through techniques such as generalization, which aggregates granular values into broader categories, or suppression, which removes certain values from the dataset, as sketched below.
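The following sketch shows how generalization and suppression might be applied to a small, hypothetical dataset, together with a simple k-anonymity check (with an assumed threshold of k = 2) as a rough proxy for re-identification risk.

```python
# Illustrative sketch only: the dataset, field names, and threshold K = 2
# are assumptions made for the example, not a prescribed protocol.
from collections import Counter

K = 2  # every quasi-identifier combination must occur at least K times

raw = [
    {"name": "A. Smith", "age": 34, "postcode": "M4B 1B3", "party": "Green"},
    {"name": "B. Jones", "age": 36, "postcode": "M4B 2K1", "party": "Green"},
    {"name": "C. Brown", "age": 52, "postcode": "V6K 4P9", "party": "Liberal"},
]

def de_identify(row):
    """Drop the direct identifier and generalize the quasi-identifiers."""
    decade = row["age"] // 10 * 10
    return {
        "age": f"{decade}-{decade + 9}",      # generalization into an age band
        "postcode": row["postcode"][:3],      # generalization to a wider area
        "party": row["party"],                # sensitive attribute retained
    }

def quasi(row):
    """The quasi-identifier combination used to judge re-identification risk."""
    return (row["age"], row["postcode"])

released = [de_identify(r) for r in raw]

# Suppression: drop rows whose quasi-identifier combination is rarer than K,
# since such rows are the easiest to single out by linkage.
counts = Counter(quasi(r) for r in released)
published = [r for r in released if counts[quasi(r)] >= K]

print(f"Published {len(published)} of {len(raw)} rows; "
      f"suppressed {len(raw) - len(published)} high-risk rows")
```

Note the trade-off this implies: every suppressed row and every coarsened value lowers the risk of re-identification, but it also reduces the granularity, and therefore the utility, of the published dataset.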
Finally, it is important to ensure that data sets intended for public consumption are assigned the highest risk profile, to minimize the risk of re-identification by nefarious actors, while data sets intended for non-public uses should be governed by rigorous data-sharing agreements commensurate with the sensitivity of the data.
Failure to institute such measures and safeguards may expose organizations to onerous enforcement actions under the GDPR.