The Urgent Need for Data Minimization Standards
A central principle in many data protection laws around the globe is data minimization. But we are currently facing a serious issue: there is no legal clarity on what exactly the laws require when they demand data minimization. This lack of specificity undermines organizations’ confidence that the products they are building are responsible and truly comply with regulatory requirements. As a result, apprehension often surrounds the process of bringing innovative technologies into production.
It is clear that data minimization will have different requirements for different use cases. On one end of the spectrum is the redaction of direct identifiers such as names or payment card information such as credit card numbers. On the other end lies anonymization, where re-identification of individuals is extremely unlikely. Between these poles we find pseudonymization, which, depending on the jurisdiction, often means something like reversible de-identification.
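To make the spectrum concrete, here is a minimal Python sketch, using hypothetical field names and a hypothetical key, of how the same record might look under redaction, pseudonymization, and generalization toward anonymization.

```python
# Illustrative sketch only: a toy record processed at three points on the
# data minimization spectrum. Field names and the secret key are hypothetical.
import hmac
import hashlib

record = {"name": "Jane Doe", "card_number": "4111111111111111",
          "age": 34, "city": "Toronto"}

# 1. Redaction of direct identifiers: remove or mask them outright.
redacted = {**record, "name": "[REDACTED]", "card_number": "[REDACTED]"}

# 2. Pseudonymization: replace identifiers with a keyed token that the key
#    holder could map back to the original via a lookup table
#    (reversible de-identification).
SECRET_KEY = b"rotate-me"  # hypothetical key, stored separately from the data

def pseudonym(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

pseudonymized = {**record,
                 "name": pseudonym(record["name"]),
                 "card_number": pseudonym(record["card_number"])}

# 3. Toward anonymization: drop direct identifiers and generalize
#    quasi-identifiers so the record blends into a larger group.
anonymized = {"age_band": "30-39", "region": "Ontario"}

print(redacted, pseudonymized, anonymized, sep="\n")
```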
Many organizations are keen to anonymize their data because, if anonymization is achieved, the data fall outside the scope of data protection laws, as they are no longer considered personal information. But that’s a big if. Some argue that anonymization is not possible. We hold that the claim that data anonymization is impossible is based on a lack of clarity around what anonymization requires, with organizations often wittingly or unwittingly misusing the term for what is actually a redaction of direct identifiers. Another common claim is that data minimization is in irresolvable tension with the use of data at a large scale in the machine learning context. This claim is based not only on a lack of clarity around data minimization but also on a lack of understanding of the extremely valuable data that often surrounds identifiable information, such as data about products, conversation flows, document topics, and more.
Years of research in structured data de-identification have contributed to much of what is understood about the balance of data minimization and data utility.
Given the stark differences in how structured and unstructured data are processed and anonymized, a one-size-fits-all approach to privacy standards and re-identification risk thresholds may not be appropriate. Each type of data presents unique challenges and risks that need tailored approaches.
Without that clarity, even organizations with the best intentions will not consistently get it right and will be left to their best guesses. Many people misinterpret anonymizing data to mean removing names and Social Security numbers while ignoring quasi-identifiers such as religion, approximate location, or a rare disease.
Why we need data minimization standards
Why is not having clear data minimization standards a problem? For one, in the absence of clear standards, organizations disclosing data can do a poor job of de-identifying the data and still claim that it has been anonymized. Inevitably, this will lead to the re-identification of some individuals, even if only by hacktivists trying to prove a point. In a worse scenario, poor de-identification practices can lead to data breaches, which are costly both financially and reputationally.
Secondly, a common refrain among critics is that “true” data anonymization is a myth. These criticisms frequently stem from well-publicized incidents where supposedly “anonymized” data was re-identified. But a closer look at these instances often reveals a salient point: the data in question was not properly anonymized in the first place or anonymization was simply not the right privacy-preserving technique to use for the task at hand.
These ill-informed claims diminish trust in the capable technologies currently being developed, which can effectively and reliably identify personally identifiable information, redact it, add noise and permutations, generalize values, aggregate data, and compute data accuracy and re-identification risks. Such claims may also lead to resistance to data minimization as a whole, given a perceived futility of the effort, or to an unwarranted hesitancy to share de-identified information out of uncertainty about whether the de-identification is good enough.
Either way, current technological capabilities will not be used to their full potential due to unwarranted distrust that is very hard to disprove without certifying bodies for the resulting datasets or technologies. This will negatively impact the availability of securely de-identified or anonymized data for beneficial secondary purposes, e.g., for the development and training of generative AI models.
How we know clear standards can (responsibly) accelerate innovation
HIPAA (the Health Insurance Portability and Accountability Act) in the U.S. is an example of a law that contains a clear de-identification standard. It has provisions that require health data to meet certain criteria to be considered “de-identified” and even provides two distinct methods: Expert Determination and Safe Harbor.
The Expert Determination method hinges on a knowledgeable individual’s analysis that the risk of re-identification is “very small” (§164.514(b)(1) of the HIPAA Privacy Rule). Safe Harbor, on the other hand, prescribes specific identifiers that must be removed for health data to no longer be deemed personal information. These methods are illustrative of a flexible and, in the case of Expert Determination, rigorous approach to data de-identification, one that can inspire other industries. For small organizations that do not have the resources to employ a privacy technology expert to ensure secure de-identification, there can still be clear guidance on which direct and indirect identifiers must be removed before the data can be considered safe to disclose to third parties to enable innovative products and research.
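As an illustration of the Safe Harbor idea (not a complete implementation of the rule), here is a minimal sketch that suppresses a non-exhaustive subset of the 18 identifier categories from a structured record; the field names and the date handling are assumptions for illustration.

```python
# A minimal sketch of Safe-Harbor-style suppression on a structured record.
# The field list covers only a non-exhaustive subset of the 18 HIPAA Safe
# Harbor identifier categories, and the field names are hypothetical.
SAFE_HARBOR_FIELDS = {
    "name", "street_address", "phone", "fax", "email", "ssn",
    "medical_record_number", "health_plan_id", "account_number",
    "license_number", "vehicle_id", "device_id", "url", "ip_address",
}

def safe_harbor_suppress(record: dict) -> dict:
    """Drop listed identifier fields and keep only the year of dates,
    since Safe Harbor permits retaining the year but not full dates."""
    cleaned = {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}
    if "admission_date" in cleaned:  # hypothetical ISO-formatted date field
        cleaned["admission_year"] = cleaned.pop("admission_date")[:4]
    return cleaned

print(safe_harbor_suppress({
    "name": "Jane Doe", "ssn": "123-45-6789",
    "admission_date": "2023-06-14", "diagnosis": "J45.909",
}))
```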
The Safe Harbor rule has rightly been criticized as insufficient for anonymization of data as understood under the GDPR, for example. It is questionable whether unrestricted publication of data sets that fall under the Safe Harbor rule is the right approach. More on that below.
ISO/IEC 27559:2022, the privacy-enhancing data de-identification framework developed by the International Organization for Standardization and the International Electrotechnical Commission, is another example of helpful, yet non-mandatory, guidance on how to properly de-identify data. We have summarized this framework here. It offers an advantage over HIPAA by including an appendix that establishes specific numerical thresholds for identifiability.
Another example of a successful application of a judicially set standard supplemented by expert guidance is revealed by the Office of the Privacy Commissioner of Canada’s investigation of complaints against the Public Health Agency of Canada (“PHAC”) and Health Canada (“HC”) under the Privacy Act. Mobility data obtained from TELUS and other data providers was properly de-identified beyond the “serious possibility” of re-identification threshold before being used in the fight against the COVID-19 pandemic. This standard was set by the Federal Court in Gordon v. Canada (Health), 2008 FC 258, and the Treasury Board Secretariat and other experts have since provided more actionable guidance, down to the range of acceptable cell sizes.
Following this guidance, the parties involved did far more than simply strip the data of personal identifiers. Rather, they:
- Hashed each identifier more than once using SHA-256 (see the sketch after this list)
- Limited access to the data to a small number of individuals
- Monitored access and use
- Implemented permitted use restrictions
- Restricted access and export via the enclave model
- Allowed only import of data that was aggregated in accordance with accepted standards
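As a rough illustration of the first bullet, here is a minimal sketch of hashing an identifier more than once with SHA-256; the salt and the iteration count are assumptions for illustration, not details of the PHAC/TELUS arrangement.

```python
# A minimal sketch of repeated SHA-256 hashing of an identifier.
# The salt and round count are illustrative assumptions.
import hashlib

SALT = b"project-specific-salt"  # hypothetical, stored separately from the data

def hash_identifier(identifier: str, rounds: int = 2) -> str:
    digest = identifier.encode()
    for _ in range(rounds):
        digest = hashlib.sha256(SALT + digest).digest()
    return digest.hex()

print(hash_identifier("416-555-0199"))
```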
To reiterate what access controls and use restrictions have to do with data de-identification: since determining proper de-identification or even anonymization involves a statistical calculation, the likelihood of re-identification is an important factor. This likelihood is generally assessed in the context of the data environment, namely, who has access to the data and which security controls are put in place.
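A toy example of what such a calculation can look like for structured data: the sketch below computes cell sizes over assumed quasi-identifiers and compares the worst-case risk against an illustrative threshold. The quasi-identifiers and the 1-in-5 cell-size cutoff are assumptions; actual thresholds depend on the data environment and the applicable guidance.

```python
# A minimal sketch of a cell-size-based re-identification risk check.
# Quasi-identifiers and the threshold are illustrative assumptions.
from collections import Counter

records = [
    {"age_band": "30-39", "region": "Ontario", "diagnosis": "asthma"},
    {"age_band": "30-39", "region": "Ontario", "diagnosis": "flu"},
    {"age_band": "60-69", "region": "Yukon",   "diagnosis": "asthma"},
]
QUASI_IDENTIFIERS = ("age_band", "region")

# Group records by their quasi-identifier values and find the smallest cell.
cells = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
smallest_cell = min(cells.values())
max_risk = 1 / smallest_cell  # worst-case chance of singling someone out

print(f"smallest cell: {smallest_cell}, max risk: {max_risk:.2f}")
print("acceptable" if smallest_cell >= 5 else "needs more generalization")
```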
In addition to anonymization, data minimization in the form of redaction has been shown to benefit from specific standards that take into account not only the information to be removed but also the security infrastructure surrounding the data. For example, data minimization is a risk mitigator under PCI DSS, where information like account numbers and cardholder names needs to be removed from call and contact center data. Especially when used appropriately and in conjunction with cybersecurity safeguards, redaction in this context prevents crimes like identity theft.
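For illustration, here is a simplistic regex-plus-Luhn sketch of card number redaction from a call transcript; it assumes the number appears as a contiguous digit sequence, and it is exactly the kind of brittle baseline that the unstructured-data discussion below argues is insufficient on its own.

```python
# A simplistic sketch of PAN redaction from a transcript: find candidate
# digit sequences with a regex and confirm them with a Luhn check.
import re

def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact_pans(text: str) -> str:
    def repl(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group())
        return "[REDACTED PAN]" if luhn_valid(digits) else match.group()
    return re.sub(r"\b(?:\d[ -]?){13,19}\b", repl, text)

print(redact_pans("Sure, my card is 4111 1111 1111 1111, expiry 05/27."))
```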
The work that still needs to be done
While we have seen huge improvements in the capabilities of tools that help de-identify data, even unstructured data, it is possible that re-identification technologies advance in parallel and that more data becomes publicly available against which records can be compared, increasing the risk of re-identification.
Moreover, while HIPAA Safe Harbor brings clarity, it does not take into account several pieces of information that may be used to re-identify an individual. As Khaled El Emam argued in “Methods for the de-identification of electronic health records for genomic research” (2011), not requiring the removal of longitudinal data, such as length of stay and time since the last visit, can mean a much higher risk of patient re-identification. For reasons like this, HIPAA Expert Determination, in which an expert determines whether the likelihood of re-identification is low enough for the data to be considered de-identified, is the method of choice for many healthcare organizations.
We must also pay more attention to unstructured data in the dialogue about data de-identification and anonymization. Unstructured data, according to estimates, make up 80 percent of all recorded data. As we explained, unstructured data come with the unique difficulty of identifying where personal data are. This is not terribly hard in a table with columns labelled “SSN” or “name.” It is a much harder problem in unstructured data, given their disfluencies, complicated contexts, varied formats, and multilinguality. However, just as data minimization standards are lacking, there is no accepted standard for how accurate the identification of personal information should be. Organizations therefore have little guidance regarding the required level of accuracy in identifying identifiable information and often opt for a band-aid solution made up of regular expressions and inaccurate machine learning models that may even have been built for a different task. Note that getting this step wrong not only prevents an accurate assessment of the risk in the data but also prevents reliable redaction, let alone anonymization. Identifying the personal data elements in unstructured data is the difficult but essential groundwork required before re-identification risk can be tackled automatically.
What we can already do today
With the recent advances in machine learning (ML), we can now teach machines to do the identification work for us, and much more reliably than regular expressions (regexes), the technique most commonly used for data identification, which often fails, particularly on unstructured data. For example, an ML model can use the context of a conversation to determine whether something constitutes personal information. Instead of searching for set patterns, the model learns from exposure to training data prepared by privacy experts. By annotating the data elements that are personal identifiers, privacy experts effectively train the model to recognize highly complex natural language patterns, based on which it can detect personal information in data it has not seen before.
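As a contrast to fixed patterns, the sketch below runs a generic, publicly available NER model through the Hugging Face transformers pipeline and redacts the detected spans. The model named here is an assumption for illustration and is not Private AI’s system; a production-grade approach would rely on models trained on privacy-expert annotations, as described above.

```python
# A minimal sketch of context-aware entity detection and redaction using a
# generic public NER model (an illustrative choice, not Private AI's system).
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Hi, this is Maria Lopez calling from Vancouver about my June invoice."

redacted = text
# Replace detected spans from the end of the string backwards so offsets stay valid.
for entity in sorted(ner(text), key=lambda e: e["start"], reverse=True):
    label = f"[{entity['entity_group']}]"
    redacted = redacted[:entity["start"]] + label + redacted[entity["end"]:]

print(redacted)
```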
While we don’t have a set standard for personal information detection tools, Private AI builds AI-driven de-identification software that meets and exceeds industry standards. Refer to our Whitepaper for details on how we compare to our competitors, or request a sample report on how the output data from our system has passed HIPAA Expert Determination. Any identification accuracy lower than what the best technology in the industry has to offer will carry through to the de-identification stage and increase the re-identification risk intolerably. With accurately identified and categorized personal information, these identifiers can then be removed or replaced as needed for the use case, maximizing both data privacy and utility.
Conclusion
Embracing rigorous data minimization protocols isn’t just a compliance requirement; it’s a pledge to protect individual privacy while harnessing the full potential of data for the collective good. The current ambiguity surrounding data de-identification, anonymization, and personal information identification standards poses significant challenges. While we have examples in HIPAA, ISO/IEC 27559:2022, and other sources, more comprehensive and universally accepted standards are imperative. Otherwise, we risk failing to live up to our current capabilities of making safe data available for responsible innovation and other beneficial purposes.
About the Author
Kathrin Gardhouse is Private AI’s Privacy Evangelist and a German- and Ontario-trained lawyer specializing in data and AI governance. Her experience includes developing comprehensive privacy and data governance programs for a Toronto-based financial institution and data and AI governance consulting for several boutique firms.
Kathrin’s influence in data and AI governance spans multiple domains. She actively shapes responsible AI policy at a national level while simultaneously offering thought leadership to innovators in privacy-enhancing technologies and advising start-up founders in privacy and AI governance matters. Kathrin can be reached through her LinkedIn and our company website: https://www.private-ai.com/
Patricia Thaine is the Co-Founder & CEO of Private AI, a Microsoft-backed startup that raised their Series A led by the BDC. Private AI was named a 2023 Technology Pioneer by the World Economic Forum and a Gartner Cool Vendor. Patricia was on Maclean’s magazine Power List 2024 for being one of the top 100 Canadians shaping the country. She is also a Computer Science PhD Candidate at the University of Toronto (on leave) and a Vector Institute alumna. Patricia is a recipient of the NSERC Postgraduate Scholarship, the RBC Graduate Fellowship, and the Ontario Graduate Scholarship. She is the co-inventor of one U.S. patent and has 10 years of research and software development experience, including at the McGill Language Development Lab, the University of Toronto’s Computational Linguistics Lab and Department of Linguistics, and the Public Health Agency of Canada. Patricia can be reached through her LinkedIn and our company website: https://www.private-ai.com/