Anonymisation is a valuable tool that allows data to be shared whilst preserving privacy. Anonymising data means changing identifiers in some way, for example by removing, substituting, distorting, generalising or aggregating them.
A person's identity can be disclosed from:
You decide which information to keep for data to be useful and which to change. Removing key variables, applying pseudonyms, generalising and removing contextual information from textual files, and blurring image or video data could result in important details being missed or incorrect inferences being made. See example 1 and example 2 for balancing anonymisation with keeping data useful for qualitative and quantitative data.
Anonymising research data is best planned early in the research to help reduce anonymisation costs, and should be considered alongside obtaining informed consent for data sharing or imposing access restrictions. Personal data should never be disclosed from research information, unless a participant has given consent to do so, ideally in writing.
Anonymising quantitative data may involve removing or aggregating variables or reducing the precision or detailed textual meaning of a variable.
Primary anonymisation techniques
Example: Remove respondents' names or replace with a code; remove addresses, postcode information, institution and telephone numbers.
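As a minimal sketch of the example above, the following Python function drops direct identifiers from a list of records and replaces each respondent's name with a code. The field names (`name`, `address`, `postcode`, `institution`, `telephone`) and the `R0001`-style code format are assumptions made for illustration, not part of the guidance.

```python
# Hypothetical field names for direct identifiers; adjust to the dataset at hand.
DIRECT_IDENTIFIERS = {"name", "address", "postcode", "institution", "telephone"}


def remove_direct_identifiers(records):
    """Return (anonymised_records, key).

    `key` maps each respondent code back to the original name. It is
    itself personal data and must be stored separately from the shared file.
    """
    anonymised = []
    key = {}
    for i, record in enumerate(records, start=1):
        code = f"R{i:04d}"  # illustrative code format
        key[code] = record.get("name")
        # Keep every field that is not a direct identifier.
        cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        cleaned["respondent_code"] = code
        anonymised.append(cleaned)
    return anonymised, key
```

Sequential codes preserve the order of the input, so the records may need to be shuffled first if that order could itself identify respondents.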
Example: Record the year of birth rather than the day, month and year; record postcode sectors (first 3 or 4 characters) rather than full postcodes; aggregate detailed 'unit group' standard occupational classification employment codes up to 'minor group' codes by removing the last digit.
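The three generalisations in this example could be sketched as below. The function names are hypothetical. The postcode helper assumes a complete, valid UK postcode, whose inward part is always the final three characters; it does no validation.

```python
from datetime import date


def year_of_birth(date_of_birth: date) -> int:
    """Reduce a full date of birth to the year only."""
    return date_of_birth.year


def postcode_sector(postcode: str) -> str:
    """Reduce a full UK postcode to its sector.

    The sector is the outward code plus the first character of the
    inward code, e.g. 'CO4 3SQ' becomes 'CO4 3'.
    """
    compact = postcode.replace(" ", "").upper()
    outward, inward = compact[:-3], compact[-3:]
    return f"{outward} {inward[0]}"


def soc_minor_group(unit_group_code: str) -> str:
    """Aggregate a SOC unit group code to its minor group by dropping the last digit."""
    return unit_group_code[:-1]
```

For instance, `soc_minor_group("2211")` gives `"221"`. Codes are handled as strings so that the digits are treated as labels rather than numbers.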
Example: Detailed areas of medical expertise could indirectly identify a doctor. The expertise variable could be replaced by more general text or be coded into generic responses such as 'one area of medical speciality', 'two or more areas of medical speciality', etc.
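One way to code such a variable into generic responses is sketched below. The function name and the input shape (a list of speciality names per doctor) are assumptions, as is the label used when no speciality is recorded; the other two labels are taken from the example above.

```python
def generalise_expertise(specialities):
    """Replace a list of detailed medical specialities with a generic category."""
    count = len(specialities)
    if count == 0:
        # Assumed label; the guidance does not cover this case.
        return "no area of medical speciality recorded"
    if count == 1:
        return "one area of medical speciality"
    return "two or more areas of medical speciality"
```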
Example: Annual salary could be 'top-coded' to avoid identifying highly paid individuals. A top code of £100,000 or more could be applied, even if lower incomes are not coded into groups.
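Top-coding can be sketched in a few lines of Python. The function name is hypothetical, and the default threshold simply mirrors the £100,000 figure in the example. Values below the threshold are left untouched, so a stored value equal to the threshold should be documented as meaning 'threshold or more'.

```python
def top_code(salaries, threshold=100_000):
    """Cap every salary at `threshold`, leaving lower values unchanged."""
    return [min(salary, threshold) for salary in salaries]
```

With this sketch, `top_code([23_500, 100_000, 250_000])` returns `[23500, 100000, 100000]`.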
Example: In confidential interviews on farms, the names of farmers have been replaced with codes, and other confidential information on the nature of the farm businesses and their locations has been disguised to anonymise the data.
However, related biodiversity data collected on the same farms, using the same farmer codes, may contain detailed locations. For the biodiversity data alone, the location would not be confidential, but farmers could be identified by combining the two datasets.
The link between farmer codes and biodiversity location data should be removed, for example by using separate codes for farmer interviews and for farm locations.
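A minimal sketch of assigning separate codes follows, assuming each farm has some internal identifier. The function names and the `INT-`/`LOC-` prefixes are hypothetical. Each farm receives two unrelated random codes, one for the interview dataset and one for the location dataset, so the shared files cannot be joined on a common code.

```python
import secrets


def _fresh_code(prefix, used):
    """Draw a random code with the given prefix that has not been used yet."""
    while True:
        code = f"{prefix}-{secrets.token_hex(4)}"
        if code not in used:
            used.add(code)
            return code


def assign_unlinked_codes(farm_ids):
    """Return two mappings, farm id -> interview code and farm id -> location code."""
    used = set()
    interview_codes = {}
    location_codes = {}
    for farm_id in farm_ids:
        interview_codes[farm_id] = _fresh_code("INT", used)
        location_codes[farm_id] = _fresh_code("LOC", used)
    return interview_codes, location_codes
```

The two mappings are what link the datasets, so they would need to be kept securely apart from the shared files, or destroyed. Record order in the two files should also differ, otherwise row position alone could re-establish the link.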
Point data may fix the position of individuals, organisations or businesses studied, which could disclose their identity. Point coordinates may be replaced by larger, non-disclosing geographical areas such as polygon features (km² grid, postcode district, county), or linear features (random line, road, river). Point data can also be replaced by meaningful alternative variables that typify the geographical position and represent the reason why the locality was selected for the research, such as poverty index, population density, altitude, vegetation type. In this way, the value of data is maintained, whilst removing disclosing geo-references.
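Replacing a point by a grid square can be sketched as follows. This assumes the coordinates are already in a projected system measured in metres (eastings and northings); the function name and the 1 km default cell size are illustrative choices. Whether a given cell size is actually non-disclosing depends on how many candidate sites fall inside each cell.

```python
import math


def to_grid_cell(easting_m, northing_m, cell_size_m=1000):
    """Return the south-west corner of the grid cell containing the point."""
    cell_easting = math.floor(easting_m / cell_size_m) * cell_size_m
    cell_northing = math.floor(northing_m / cell_size_m) * cell_size_m
    return cell_easting, cell_northing
```

Every point inside the same square maps to the same corner, so the exact position within the cell is no longer recoverable from the shared data.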
A better option may be to keep detailed spatial references intact and to impose access controls on the data instead.
Procedures to anonymise any research data that are destined for sharing or archiving should always be considered together with appropriate informed consent procedures.