
Data Anonymisation in Clinical Trials

By Clinical Programming Team
February 20, 2025


In recent years, the healthcare industry has increasingly relied on data-driven insights to improve patient outcomes, enhance treatment efficacy, and advance medical research. Clinical trials, in particular, play a pivotal role in this process by providing evidence that shapes the development of new therapies and medical practices. However, as the volume of sensitive patient data used in these trials grows, so does the risk of compromising privacy and confidentiality. This makes data anonymisation a critical practice in clinical research. By removing or encrypting personally identifiable information, data anonymisation safeguards participants' privacy while allowing researchers to utilise data for meaningful analysis. This allows valuable research to be completed whilst maintaining ethical standards, ensuring regulatory compliance, and promoting public trust in medical research.

What is Data Anonymisation?

"Data anonymisation" is the process of irreversibly removing personally identifiable information from one or more clinical datasets such that a patient cannot be linked to their data, even with the use of external data sources, and even by the data controller themselves. This process is also known as "de-identification" in some contexts.

Data protection legislation such as the EU General Data Protection Regulation (GDPR) and the US Health Insurance Portability and Accountability Act (HIPAA) requires anonymisation to be performed before data is shared with third parties; anonymised data itself is exempt from such regulations because it is no longer associated with an individual. While the goal remains the same, the specific methods used can vary depending on the dataset and regulatory requirements, and striking the right balance between privacy and data utility is key. Different approaches to achieving that balance are explored in the techniques described below.

It is accepted that completely infallible anonymisation may not be possible. For example, Recital 26 of EU Directive 95/46/EC acknowledges this caveat by requiring protection only against 'all the means likely reasonably to be used' for reidentification.

 

How Does Pseudonymisation Contribute to Data Privacy?

If the data controller retains some way to reverse the process, for example a dataset which relates subject IDs to patient names, then the data is instead "pseudonymised". This is not a true form of anonymisation because the identifying data is stored rather than destroyed, but pseudonymised data may be treated as if it were anonymised provided that the identifying dataset is held securely by the data controller. This still provides data privacy because the direct identifiers are separated from the data before it is released for analysis. There are certain advantages to pseudonymising data rather than conducting a fully irreversible anonymisation, especially in clinical trials.

One such advantage is the ability to gather additional data and integrate it with existing pseudonymised datasets, thereby enabling more comprehensive analyses. This capability is essential for ongoing research, as it allows new data to be merged with historical information, providing a richer, more detailed picture of patient outcomes and treatment efficacy over time. It requires maintaining a secure link between the subject ID and the corresponding patient details, even when the direct identifiers are removed from the data. Pseudonymisation means that researchers can confidently combine data from multiple sources without compromising privacy, facilitating longitudinal studies, follow-up analyses, and the identification of trends that may not be apparent in isolated datasets.

A second advantage that is particularly unique to clinical trials is the ability to identify specific participants should a safety concern arise after the completion of their involvement in the trial. This capability is critical in situations where adverse effects or unforeseen complications related to the treatment may only become apparent once the patient has finished their participation. In such cases, pseudonymised data allows for the retrieval of relevant participant information, enabling the researchers or clinicians to swiftly reach out to the individual in question. This prompt identification ensures that any necessary preventative or corrective action, such as additional medical monitoring, treatment adjustments, or further follow-up, can be taken without delay.

A double-blinded study is not an example of pseudonymisation because the details being obfuscated relate to the treatment under test rather than an individual patient, but it is a similar concept because the blind may be broken using an additional dataset for analysis or if medically necessary.

 

What is Deanonymisation?

"Deanonymisation" or "reidentification" describes the process of breaking the anonymisation and linking the data back to the individual data subject. Sometimes this is done legitimately by data controllers, especially for pseudonymised data, but it is also attempted by malicious actors.

Deanonymisation is surprisingly easy, because even a few pieces of seemingly innocuous data can create a unique fingerprint. A famous study by Latanya Sweeney in 1997 estimated that 87% of Americans could be uniquely identified by the combination of their ZIP code, date of birth, and sex alone.

Additionally, once one dataset has been deanonymised, that data can be used to aid further reidentification of data pertaining to those particular patients. Thus, care must be taken to ensure that all patient data is fully anonymised, especially because modern computing power allows relentless automated inference attacks to occur. These attacks mean that if some data recorded in a dataset is particularly unique then it cannot be reliably obfuscated and may need to be completely removed from the dataset during anonymisation.

 

Variable Roles

There are three roles that variables can have in a dataset with respect to anonymisation.

The first are "identifiers" or "key attributes" (KAs). These are highly identifying variables such as patient names or contact details which should never be included in analysis datasets.

The second are "quasi-identifiers" (QIDs). These are less identifying but could be used to identify the patient if the data is combined with other quasi-identifiers or external sources and could easily be known to acquaintances of the patient. They commonly appear in demographic datasets and include variables such as age, sex, or country, but can also include other values such as visit dates or height. These are the variables that will typically be anonymised to break the link between the patient and the trial data.

The third are "sensitive attributes" (SAs). These are the variables which contain the actual study endpoints and are far more difficult to anonymise due to the need to preserve correlations and statistical significance.

 

Data Anonymisation Techniques

There are multiple methods of anonymisation, and the anonymiser should use their judgement to choose the most appropriate method or methods for the specific data they are working with. There is no industry-standard process because the required anonymisation is completely dependent on the data in question. All anonymisation operations inherently reduce the amount of overall information in the dataset and therefore decrease the usefulness of the data, so a balance must be struck between data utility and information loss.

Some data types are harder to anonymise than others, depending on how statistically significant the exact values are. Ages can usually be generalised into age groups without issue because analysis endpoints do not tend to use the individual values, but dates are far more complex because relative durations must be kept consistent across all related datasets. Sometimes dates can be redacted and replaced with visit names or numbers, but a masking or perturbation approach is needed if actual date values are necessary for analysis endpoints or if the data does not correlate to visits (such as adverse event or medical history data). Imputed dates pose an additional challenge with these techniques: if the imputation algorithm for incomplete dates uses the first of January, for example, then any offset applied to the imputed date will be obvious and easily reversible.

Some of the most common anonymisation techniques are described below, but this is not an exhaustive list.

Generalisation

"Generalisation" or "bucketing" is the process of replacing a value with a more general alternative to reduce its identifiability, and is generally applied to QIDs. This is commonly done by grouping values into categories and discarding the original value. Examples include ages being replaced by age groups, countries being replaced by continents, or lab results being replaced by reference range indicators (e.g. LOW, NORMAL, or HIGH).
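As a minimal sketch of declarative generalisation, the helpers below bucket ages into ten-year bands and map lab results to reference-range indicators. The bucket width and the example reference range are illustrative assumptions, not values from any specific study.

```python
def generalise_age(age):
    """Replace an exact age with a ten-year bucket, e.g. 47 -> "40-49"."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def generalise_lab(value, range_low, range_high):
    """Replace a lab result with its reference-range indicator."""
    if value < range_low:
        return "LOW"
    if value > range_high:
        return "HIGH"
    return "NORMAL"


# The original values are discarded: only the generalised form is kept.
ages = [47, 52, 61]
age_groups = [generalise_age(a) for a in ages]  # ["40-49", "50-59", "60-69"]
```

Note that the original ages are not stored anywhere after generalisation, which is what makes the operation irreversible.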

Suppression

"Suppression" or "redaction" is the process of completely removing identifiable data from a dataset and is generally applied to QIDs. This can be done by targeted replacement of certain values with placeholder values, or by the complete deletion of variables and/or records which contain sensitive data. This is one of the most effective and irreversible anonymisation techniques that can be applied to data but can result in significant information loss and should be used with care.
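A sketch of both forms of suppression, assuming records are held as simple dictionaries (the variable names and placeholder text are hypothetical):

```python
REDACTED = "REDACTED"


def suppress(records, drop_vars, redact_vars):
    """Suppress identifiable data: delete whole variables outright and
    overwrite targeted values with a placeholder. Both operations are
    irreversible, so the information loss is permanent."""
    suppressed = []
    for record in records:
        # Complete deletion of variables containing sensitive data.
        clean = {k: v for k, v in record.items() if k not in drop_vars}
        # Targeted replacement of values with a placeholder.
        for var in redact_vars:
            if var in clean:
                clean[var] = REDACTED
        suppressed.append(clean)
    return suppressed


rows = [{"subjid": "S001", "comment": "seen at home", "city": "Leeds"}]
out = suppress(rows, drop_vars={"comment"}, redact_vars={"city"})
# out == [{"subjid": "S001", "city": "REDACTED"}]
```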

Masking

"Masking" is the process of replacing real values with artificial ones while preserving key relationships; unlike suppression, which destroys the obfuscated data outright, the underlying structure is retained. Masking is generally applied to QIDs. One common use is to apply an offset to all date values for a patient, preserving the duration between any two dates while obscuring the exact date on which any event occurred. Another is the use of subject IDs to uniquely identify patients across related datasets without the need for KAs.
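The date-offset approach can be sketched as follows: each subject is assigned one fixed, secret offset, and that same offset is applied to every date for that subject across all related datasets. The offset range and seed here are illustrative assumptions.

```python
import datetime
import random


def date_offsets(subject_ids, max_offset_days=30, seed=12345):
    """Assign each subject a fixed, secret day offset. The same offset must
    be used for every date for that subject, across all related datasets."""
    rng = random.Random(seed)
    return {sid: rng.randint(-max_offset_days, max_offset_days)
            for sid in subject_ids}


def mask_date(date, offset_days):
    """Shift a date by the subject's offset: the exact date is obscured but
    the duration between any two of the subject's dates is unchanged."""
    return date + datetime.timedelta(days=offset_days)


offsets = date_offsets(["S001", "S002"])
start = mask_date(datetime.date(2024, 3, 1), offsets["S001"])
end = mask_date(datetime.date(2024, 3, 15), offsets["S001"])
# (end - start).days == 14, exactly as in the unmasked data
```

The offset table itself is identifying and must be held securely by the data controller, or destroyed for a fully irreversible anonymisation.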

Swapping

"Swapping" or "permutation" is the process of shuffling the values of one or more variables between patients, typically within analysis subsets, and is generally applied to SAs. This preserves the individual values while disassociating them from the original patient, and is useful if the exact distribution of values is required but direct correlation with patients is not. However, further analyses must consider that swapping has occurred and take care when calculating new endpoints.
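A sketch of swapping within analysis subsets, assuming dictionary records and using a treatment arm as the hypothetical grouping variable:

```python
import random


def swap_within_groups(records, group_var, sa_var, seed=2024):
    """Shuffle one sensitive attribute between patients within each analysis
    subset. Each subset's distribution of values is preserved exactly, but
    the values are no longer linked to their original patients."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record[group_var], []).append(record)
    for members in groups.values():
        values = [m[sa_var] for m in members]
        rng.shuffle(values)
        for member, value in zip(members, values):
            member[sa_var] = value
    return records


data = [{"arm": "A", "score": 3}, {"arm": "A", "score": 7},
        {"arm": "B", "score": 5}, {"arm": "B", "score": 9}]
swapped = swap_within_groups(data, "arm", "score")
```

After swapping, any per-patient calculation involving the swapped variable is meaningless, which is why downstream analyses must account for the permutation.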

Perturbation and Synthetic Data

"Perturbation" is the process of adding noise to the data to obscure individual values while preserving trends, and is generally used for SAs; it underpins formal privacy frameworks such as "differential privacy". The process is statistically complex and requires the noise to be generated from carefully selected and defined distributions to avoid polluting the data. Adding too much noise ruins data utility, while adding too little does not adequately anonymise the data. Furthermore, noise-removal algorithms could be used by bad actors to deanonymise the data, especially if the noise generation algorithm is too simple.
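A sketch of the classic Laplace mechanism from differential privacy, in which the noise scale is the query sensitivity divided by the privacy parameter epsilon. The sensitivity and epsilon values are assumptions; choosing them for real trial data is a statistical task in its own right.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise by inverse-CDF sampling."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))


def perturb(values, sensitivity, epsilon, seed=99):
    """Add Laplace noise with scale = sensitivity / epsilon. A smaller
    epsilon gives stronger privacy but more noise and lower data utility."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


noisy = perturb([10.0, 20.0, 30.0], sensitivity=1.0, epsilon=0.5)
```

The trade-off described above is visible directly in the `scale` term: halving epsilon doubles the expected magnitude of the noise.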

The natural extension of perturbation is the generation of entirely synthetic data. This uses statistical methods such as linear regressions and averages to replace all values for a variable with entirely new values generated by the determined distribution. This is completely irreversible but any error in the statistical methods used will result in unusable data being generated. However, synthetic data is particularly useful when the actual values are not important, for example to enable statistical models to be trained or validated on realistic data.
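A deliberately simple sketch of synthetic data generation: fit a normal distribution to the observed values and draw entirely new values from it. Assuming normality is itself the kind of statistical-method error the paragraph warns about, so in practice the distribution choice would be validated by a statistician.

```python
import random
import statistics


def synthesise_normal(values, n, seed=7):
    """Replace a numeric variable with entirely new values drawn from a
    normal distribution fitted to the originals. None of the output values
    come from real patients, so the operation is completely irreversible."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


observed = [71.2, 74.8, 69.5, 73.1, 70.6]  # hypothetical weights in kg
synthetic = synthesise_normal(observed, n=5)
```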

 

Quantitative Metrics of Anonymisation

There are three standard metrics for the degree of anonymisation that a dataset has achieved; these are k-anonymity, l-diversity, and t-closeness.

k-Anonymity

The simplest of these is k-anonymity, which defines the minimum number of patients sharing any combination of specified QIDs. The variables chosen for k-anonymity, and the value of k itself, are at the anonymiser's discretion, and the choice depends on the data and the level of anonymisation needed. The value of k is the number of patients, so a "2-anonymous" dataset has at least two patients sharing each possible combination of the specified QIDs. A superscript number may optionally be included to state the number of QID variables used; this notation is known as kᵐ-anonymity. k-anonymity may be used for "automatic generalisation" by providing an algorithm with the list of m QIDs and the required value of k, as opposed to "declarative generalisation", which is entirely defined by the anonymiser. Higher values of k and m result in stronger anonymisation but risk reducing data utility if the ensuing generalisation removes too much fine detail.
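Measuring k for a given dataset is straightforward: count the records in each QID combination and take the smallest group. A sketch, assuming dictionary records with hypothetical variable names:

```python
from collections import Counter


def k_anonymity(records, qids):
    """Return the dataset's k: the size of the smallest group of records
    sharing one combination of the specified quasi-identifiers."""
    counts = Counter(tuple(record[q] for q in qids) for record in records)
    return min(counts.values())


demog = [{"ageg": "40-49", "sex": "F"}, {"ageg": "40-49", "sex": "F"},
         {"ageg": "50-59", "sex": "M"}, {"ageg": "50-59", "sex": "M"},
         {"ageg": "50-59", "sex": "M"}]
# The smallest QID group, ("40-49", "F"), has 2 records, so k = 2.
```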

l-Diversity

This is not the complete picture though, because if all k patients share the same SA values, then anyone with knowledge of the QID values will be able to link the SA value to the patient. This is where l-diversity comes in, which requires that there are at least l different values of each specified SA within each group of k patients; for example, a "3-anonymous 2-diverse" dataset has at least two different values of the specified SA within each group of at least three patients. Again, the choice of SA and the value of l are chosen by the anonymiser. The value of l must be less than or equal to k, and if l is too close to k then it may be difficult to create compatible groups unless the SA values are uniformly distributed across patients. Some uncommon values of one or more QIDs may need to be suppressed or grouped to create large enough patient groups to achieve l-diversity.
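The corresponding measurement counts distinct SA values within each QID group and takes the minimum. A sketch with a hypothetical diagnosis variable as the SA:

```python
def l_diversity(records, qids, sa):
    """Return the dataset's l: the smallest number of distinct sensitive-
    attribute values within any group sharing the same quasi-identifiers."""
    groups = {}
    for record in records:
        key = tuple(record[q] for q in qids)
        groups.setdefault(key, set()).add(record[sa])
    return min(len(values) for values in groups.values())


rows = [{"ageg": "40-49", "dx": "asthma"}, {"ageg": "40-49", "dx": "eczema"},
        {"ageg": "50-59", "dx": "asthma"}, {"ageg": "50-59", "dx": "asthma"}]
# The "50-59" group contains only one distinct diagnosis, so l = 1:
# anyone who knows a patient is in that group learns their diagnosis.
```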

t-Closeness

However, there is still a risk that a k-anonymous, l-diverse dataset could contain a group skewed towards a particular SA value, such as l-1 patients having distinct values of the SA and the rest of the group sharing the remaining value, which weakens the anonymity provided by l-diversity. Additionally, if the number of possible values of the SA is much higher than l, then the small number of distinct values in the group can still provide an attacker with a reduced list of possibilities. The distribution of SAs within the entire dataset is openly available, so matching the distribution of each group to the overall distribution ensures that no additional information is revealed about the SA of a patient with a known combination of QIDs. This is quantified by t-closeness. The most appropriate metric of distribution should be chosen by a statistician for the specific data; the t value is then a decimal representing the maximum allowed difference between the distribution metric of each group and that of the population as a whole.
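As an illustration, the sketch below computes the worst-case distance between any group's SA distribution and the overall distribution. Total variation distance is used here purely as an assumed example metric; as noted above, the appropriate metric is data-dependent and should be chosen by a statistician (the original t-closeness formulation uses Earth Mover's Distance).

```python
from collections import Counter


def _distribution(values):
    """Empirical distribution of a list of categorical values."""
    total = len(values)
    return {v: count / total for v, count in Counter(values).items()}


def t_closeness(records, qids, sa):
    """Return the worst-case total variation distance between any QID
    group's SA distribution and the overall SA distribution. A dataset
    satisfies t-closeness if this value does not exceed t."""
    overall = _distribution([r[sa] for r in records])
    groups = {}
    for record in records:
        groups.setdefault(tuple(record[q] for q in qids), []).append(record[sa])
    worst = 0.0
    for values in groups.values():
        dist = _distribution(values)
        support = set(overall) | set(dist)
        tvd = 0.5 * sum(abs(dist.get(v, 0.0) - overall.get(v, 0.0))
                        for v in support)
        worst = max(worst, tvd)
    return worst
```

A group whose SA distribution exactly matches the overall distribution contributes a distance of zero, which is the ideal case described above.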

 

Conclusion

In conclusion, data anonymisation plays a crucial role in ensuring the privacy and security of personal information in clinical trials. As healthcare data becomes increasingly valuable for research and the development of new treatments, it is essential to apply effective anonymisation techniques to protect patient identities while maintaining the utility of the data. Whether through generalisation, suppression, or masking, each method serves to reduce the risk of reidentification, balancing the need for privacy with the necessity of meaningful analysis. Furthermore, the difference between anonymisation and pseudonymisation is significant, with the latter offering a practical approach to safeguard patient data while allowing for future identification if required. As the potential for deanonymisation grows with advancements in computational power, ongoing vigilance and the use of robust anonymisation techniques are imperative to uphold data integrity and trust in clinical research. Ultimately, the challenge remains to strike the right balance between safeguarding privacy and enabling meaningful scientific progress.

Quanticate’s Statistical Programming team ensures compliance with global data privacy regulations by delivering fully anonymised and audit-ready datasets. Our experts specialise in secure data anonymisation, risk assessment, and regulatory adherence, helping sponsors protect patient privacy while maintaining data utility. With advanced anonymisation techniques and tailored solutions, we safeguard data integrity and streamline compliance processes. Submit an RFI today to discover how we can support your clinical research needs.