Before identifiable information can be collected, used, or shared, researchers must consider relevant legal and ethics requirements such as privacy legislation and informed consent.
While access to data should ideally be as open as possible, access to sensitive data, and especially identifiable data, should also be as closed as necessary. It is possible to reduce the identifiability of data through techniques referred to as ‘de-identification’, ‘anonymisation’, or ‘de-personalising’. Newer approaches such as generating synthetic data also aim to reduce the identifiability of information; however, these methods may not suit all research designs. For example, they may be appropriate for quantitative analysis but not for qualitative studies, where the validity of the research may be reduced if synthetic data are used.
Regardless of the techniques used, in the current age of big data and triangulation methods there is debate about whether any method can reliably ensure the complete removal of identifiable information from data. This does not mean that data cannot be used or shared for research, but that well-defined approaches for managing and working with data must be implemented.
Working with identifiable data
Management of identifiable data
Data may often need to remain identifiable (i.e. contain personal information) during the research process, e.g. for study administration or qualitative analysis. If data is identifiable, ethical and privacy requirements may be met through access control and data security, but establishing a well-defined data management plan before a research activity begins is the most effective way of meeting these requirements. This may include:
- control of access through physical or digital means (e.g. passwords)
- encryption of data, particularly if it is being moved between locations
- ensuring data is not stored in an identifiable and unencrypted format when on easily lost items such as USB keys, laptops and external hard drives
- taking reasonable actions to prevent the inadvertent disclosure, release or loss of sensitive personal information.
Five Safes: Working with identified data
The UK Data Service uses the Five Safes framework to provide secure access to data for work that would not usually be possible with de-identified data. The framework helps data custodians place appropriate controls not just on the data itself, but on the manner in which data are accessed.
‘De-identification’, ‘anonymisation’, and ‘de-personalising’ are approaches commonly undertaken to protect the privacy of individuals. The terms are sometimes used interchangeably, though there is debate about whether this is appropriate. They all aim to reduce the identifiability of data but, as mentioned above, the ability to completely remove the risk of identification is a matter of contention. To simplify the following discussion, the term ‘de-identification’ shall be used to refer to this group of methods.
In addition to protecting individuals, data de-identification may also be used to protect organisations, such as businesses, or other information such as the spatial location of mineral or archaeological findings or endangered species. Data de-identification is not an exact science and judgement calls may still need to be made when de-identifying data.
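For spatial data such as the location of an endangered species, one widely used way to reduce precision is to generalise coordinates before sharing them. The following is a minimal sketch only: the function name, the sample record, and the choice of one decimal place are assumptions made for illustration, and the appropriate level of precision remains a judgement call for each dataset.

```python
def generalise_coordinates(latitude, longitude, decimals=1):
    """Round a point to fewer decimal places so the exact site is not disclosed.

    One decimal place of latitude corresponds to roughly 11 km on the ground.
    """
    return round(latitude, decimals), round(longitude, decimals)


# Hypothetical sighting record; the precise values stay with the research team.
sighting = (-27.46789, 153.02345)
public_location = generalise_coordinates(*sighting)
print(public_location)
```

The original, precise coordinates would be retained under access control, with only the generalised values published.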
De-identification is not a ‘magic bullet’ for sharing and publishing sensitive data. It should be considered as one of a range of activities to protect the privacy of research participants, such as obtaining informed consent for data sharing and controlling access to the data.
Additionally, the validity of some research may be reduced if de-identified data are used for analysis (e.g. qualitative studies of oral histories, historical texts, and stories). However, when archiving or publishing excerpts, derivatives or aggregates of such data, it may be critical to mask the identity of individuals in the data or metadata to protect their privacy.
It is therefore critical to have a clear plan for managing identifiable data through all research stages and when publishing data. Understanding the requirements and risks of using identifiable data at each stage of research will inform the kinds of consent, data security, and access controls required.
Best practice basics for managing de-identification
Here are some tips to start your de-identification:
- plan de-identification early in the research as part of your data management planning
- retain original unedited versions of data for use within the research team and for preservation
- create a de-identification log of all replacements, aggregations or removals made
- store the log separately from the de-identified data files
- identify replacements in text in a meaningful way, e.g. in transcribed interviews indicate replaced text with [brackets] or use XML markup tags such as <anon>…</anon>.
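The logging and bracketing tips above can be sketched in a few lines of Python. This is an illustrative sketch rather than a recommended tool: the function name `deidentify`, the sample transcript, and the replacement terms are all hypothetical, and it assumes a person has already reviewed the text and decided which terms need replacing.

```python
import json
import re
from collections import Counter


def deidentify(text, replacements):
    """Replace each listed term with a [bracketed] descriptor.

    Returns the edited text and a log of the replacements made.
    """
    if not replacements:
        return text, []
    # Match longer terms first so "Dr Chen" wins over a shorter overlapping term.
    terms = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    counts = Counter()

    def mark(match):
        original = match.group(1)
        counts[original] += 1
        return "[" + replacements[original] + "]"

    clean = pattern.sub(mark, text)
    log = [
        {"original": t, "replacement": "[" + replacements[t] + "]", "count": counts[t]}
        for t in replacements
        if counts[t]
    ]
    return clean, log


# Hypothetical transcript and reviewer-chosen replacements.
transcript = "Maria said she met Dr Chen in Ballarat. Maria moved there in 2019."
replacements = {
    "Maria": "Participant 1",
    "Dr Chen": "treating doctor",
    "Ballarat": "regional town",
}
clean_text, log = deidentify(transcript, replacements)
print(clean_text)
# The log re-identifies the data, so store it separately from the cleaned files.
log_json = json.dumps(log, indent=2)
```

Because the log links each replacement back to the original term, it needs the same protection as the original data.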
For more in-depth information and processes, see the resources below.
Australian practical guidance for de-identification
- The Australian Government’s Office of the Australian Information Commissioner (OAIC) and CSIRO Data61 have a ‘De-identification Decision Making Framework’, which is a “practical guide to de-identification, focussing on operational advice”
- The OAIC also provides high-level guidance on de-identification of data and information, outlining what de-identification is, and how it can be achieved
- The Australian Government’s Guidelines for the Disclosure of Secondary Use Health Information for Statistical Reporting, Research and Analysis includes techniques for making a dataset non-identifiable and example case studies
- Australian National Statistical Service’s information on confidentiality and how to confidentialise data
- Queensland Office of the Information Commissioner’s Guideline: privacy and de-identification.
International practical guidance for de-identification
- U.S. Department of Health & Human Services’ guidance on methods for de-identification of health information
- U.S. National Institute of Standards and Technology has a guide to De-Identifying Government Datasets, and a De-Identification of Personal Information publication
- UK Anonymisation Network (UKAN) has a comprehensive Anonymisation Decision-Making Framework
- UK Data Service’s guide to anonymisation
- UK Research Data Network hosts a curated list of resources for managing personal data and best practice for anonymisation and preservation
- The UK Information Commissioner’s Office’s Anonymisation: managing data protection risk code of practice.
Qualitative and audio-visual data
When dealing with qualitative data, such as transcribed interviews or textual answers to surveys, pseudonyms or generic descriptors can be used to replace identifying information rather than blanking it out. Audio and image files can be digitally manipulated to remove identifying information. However, techniques such as voice alteration and image blurring are labour-intensive and expensive, and are likely to damage the research potential of the data.
Agreeing during the consent process on the level of anonymity required will determine what may and may not be recorded, transcribed, or shared. This can be a more effective way of creating data that accurately reflects the research process and participants’ contributions than removing sensitive information after collection. If confidentiality is an issue, it may be better to obtain the participant’s consent to use the data unaltered, but with additional access controls in place.
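For textual data, one advantage of pseudonyms over blanking out is that the same person can be given the same label across every transcript, so a reader can still follow who said what. The sketch below illustrates this under stated assumptions: the function names, participant names, and transcripts are hypothetical, and the names are assumed to have been flagged by a reviewer beforehand.

```python
import re


def build_pseudonym_key(names, prefix="Participant"):
    """Give each distinct name a numbered label, in order of first appearance."""
    pseudonyms = {}
    for name in names:
        if name not in pseudonyms:
            pseudonyms[name] = f"{prefix} {len(pseudonyms) + 1}"
    return pseudonyms


def apply_pseudonyms(text, pseudonyms):
    """Replace whole-word occurrences of each name with its pseudonym."""
    if not pseudonyms:
        return text
    terms = sorted(pseudonyms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda match: pseudonyms[match.group(1)], text)


# Hypothetical names and transcripts.
flagged_names = ["Aroha", "Sam", "Aroha"]
pseudonyms = build_pseudonym_key(flagged_names)
transcripts = [
    "Aroha: I started the job in March.",
    "Sam: Aroha trained me in my first week.",
]
pseudonymised = [apply_pseudonyms(t, pseudonyms) for t in transcripts]
for line in pseudonymised:
    print(line)
```

Automated replacement of this kind only catches the terms it is given, so indirect identifiers such as job titles or places still need manual review.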
The UK Data Archive has advice on anonymising qualitative data, and the Irish Qualitative Data Archive has developed a tool for anonymising qualitative data.
- ARDC Guide: Publishing and sharing sensitive data
- ARDC Guide: Data sharing considerations for Human Research Ethics Committees
- Data management plans
- The Future of Privacy Forum: A visual guide to practical data de-identification
- Office of the Australian Information Commissioner: Guide to securing personal information
- Research Data Network’s (UK) curated list of resources that covers management of personal data
- Office of the National Data Commissioner: Sharing data principles