Anonymisation is a valuable tool that allows data to be shared whilst preserving privacy. Anonymising data requires that identifiers are changed in some way, for example by being removed, substituted, distorted, generalised or aggregated.

A person's identity can be disclosed from:

  • Direct identifiers such as names, postcode information or pictures
  • Indirect identifiers which, when linked with other available information, could identify someone, for example information on workplace, occupation, salary or age

You decide which information to keep for data to be useful and which to change. Removing key variables, applying pseudonyms, generalising and removing contextual information from textual files, and blurring image or video data could result in important details being missed or incorrect inferences being made. See example 1 and example 2 for balancing anonymisation with keeping data useful for qualitative and quantitative data.

Anonymising research data is best planned early in the research to help reduce anonymisation costs, and should be considered alongside obtaining informed consent for data sharing or imposing access restrictions. Personal data should never be disclosed from research information, unless a participant has given consent to do so, ideally in writing.

Quantitative data

Anonymising quantitative data may involve removing or aggregating variables or reducing the precision or detailed textual meaning of a variable.

Primary anonymisation techniques

  • Remove direct identifiers from a dataset. Such identifiers are often not necessary for secondary research.

Example: Remove respondents' names or replace with a code; remove addresses, postcode information, institution and telephone numbers.
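As a minimal sketch of this step, the Python below drops direct identifiers from each record and replaces the respondent's name with a code. The field names (`name`, `address`, `postcode`, `institution`, `telephone`) and the `R0001`-style code format are assumptions made for illustration, not part of any particular dataset.

```python
# Hypothetical field names; adjust to the dataset at hand.
DIRECT_IDENTIFIERS = {"name", "address", "postcode", "institution", "telephone"}


def strip_direct_identifiers(records):
    """Return (anonymised_records, key).

    Each record loses its direct identifiers and gains a respondent code.
    `key` maps each code back to the original name; it must be stored
    separately from the shared data, or destroyed.
    """
    anonymised = []
    key = {}
    for i, rec in enumerate(records, start=1):
        code = f"R{i:04d}"  # illustrative code format
        key[code] = rec.get("name")
        kept = {k: v for k, v in rec.items() if k not in DIRECT_IDENTIFIERS}
        anonymised.append({"respondent_code": code, **kept})
    return anonymised, key
```

Sequential codes preserve the order of the records. If that order is itself revealing (for example, the order of recruitment), shuffle the records before assigning codes.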

  • Aggregate or reduce the precision of a variable such as age or place of residence. As a general rule, report the most detailed level of geo-referencing that does not risk breaching respondent confidentiality. The exact scale depends on the type of data collected, but very detailed geo-references such as full postcodes or the names of small towns or villages are likely to be problematic. Coded or categorical variables that may be revealing can be aggregated into broader codes. If a disclosive variable cannot be aggregated, consider whether it should be removed from the dataset.

Example: Record the year of birth rather than the day, month and year; record postcode sectors (the outward code plus the first character of the inward code) rather than full postcodes; aggregate detailed 'unit group' standard occupational classification employment codes up to 'minor group' codes by removing the last digit.
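The three reductions in this example can be sketched in Python as follows. The function names are hypothetical. The postcode helper assumes well-formed UK postcodes, whose inward part is always three characters, and takes the sector to be the outward part plus the first inward character. The occupation helper assumes codes are held as four-digit strings.

```python
from datetime import date


def year_of_birth(dob: date) -> int:
    """Keep only the year from a full date of birth."""
    return dob.year


def postcode_sector(postcode: str) -> str:
    """Reduce a full UK postcode to its sector, e.g. 'CO4 3SQ' -> 'CO4 3'.

    Assumes a well-formed postcode: the last three characters are the
    inward part, and everything before them is the outward part.
    """
    compact = postcode.replace(" ", "").upper()
    return f"{compact[:-3]} {compact[-3]}"


def soc_minor_group(unit_group: str) -> str:
    """Drop the last digit of a four-digit unit group code."""
    return unit_group[:-1]
```

Occupation codes are treated as strings rather than integers so that the truncation is a simple slice and any leading zeros would survive.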

  • Generalise the meaning of a detailed text variable by replacing potentially disclosive free-text responses with more general text.

Example: Detailed areas of medical expertise could indirectly identify a doctor. The expertise variable could be replaced by more general text or be coded into generic responses such as 'one area of medical speciality', 'two or more areas of medical speciality', etc.
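One possible way to code such a variable is sketched below. It assumes, purely for illustration, that each free-text response lists specialities separated by semicolons. The label returned for an empty response is also an assumption, not part of the example above.

```python
def generalise_specialities(response: str) -> str:
    """Replace a detailed list of specialities with a generic category.

    Assumes specialities are separated by semicolons (hypothetical format).
    """
    items = [part.strip() for part in response.split(";") if part.strip()]
    if not items:
        # Assumed label for blank responses.
        return "no area of medical speciality stated"
    if len(items) == 1:
        return "one area of medical speciality"
    return "two or more areas of medical speciality"
```

In practice, free-text responses rarely follow one format, so a manual check of the coded values is advisable before the original variable is dropped.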

  • Restrict the upper or lower ranges of a continuous variable to hide outliers if the values for certain individuals are unusual or atypical within the wider group researched. In such circumstances the unusually large or small values might be collapsed into a single code, even if the other responses are kept as actual quantities. Alternatively, all responses might be coded into bands.

Example: Annual salary could be 'top-coded' to avoid identifying highly paid individuals. A top code of £100,000 or more could be applied, even if lower incomes are not coded into groups.
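Top-coding of this kind can be sketched as below. The function name and the label format are illustrative; the £100,000 threshold is the one used in the example above.

```python
TOP_CODE = 100_000  # threshold from the example above


def top_code_salary(salary, threshold=TOP_CODE):
    """Collapse salaries at or above the threshold into a single code.

    Lower salaries are returned unchanged as actual quantities.
    """
    if salary >= threshold:
        return f"£{threshold:,} or more"
    return salary
```

This version mixes numbers and a text label in one variable, mirroring the example. For statistical software that needs a purely numeric column, an alternative is to replace high values with the threshold itself and record the top-coding in the data documentation.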

  • Anonymise relational data where relations between variables in related or linked datasets or in combination with other publicly available outputs may disclose identities.

Example: In confidential interviews on farms, the names of farmers have been replaced with codes, and other confidential information on the nature of the farm businesses and their locations has been disguised to anonymise the data.

However, suppose related biodiversity data collected on the same farms use the same farmer codes and contain detailed locations. For the biodiversity data alone, the locations would not be confidential, but farmers could be identified by combining the two datasets.

The link between farmer codes and biodiversity location data should be removed, for example by using separate codes for farmer interviews and for farm locations.
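A minimal sketch of assigning separate codes is given below. It assumes each farm has some internal identifier (`farm_ids`, hypothetical) known only to the research team. The `F-` and `L-` prefixes are illustrative.

```python
import secrets


def _fresh_code(prefix, used):
    """Draw a random code that has not been issued before."""
    while True:
        code = prefix + secrets.token_hex(4)
        if code not in used:
            used.add(code)
            return code


def assign_unlinked_codes(farm_ids):
    """Give each farm two unrelated random codes.

    One code is for the interview dataset and one for the location
    dataset. Neither can be derived from the other; the returned
    mappings are the only link and must be kept out of the shared data.
    """
    used = set()
    interview_codes = {}
    location_codes = {}
    for farm_id in farm_ids:
        interview_codes[farm_id] = _fresh_code("F-", used)
        location_codes[farm_id] = _fresh_code("L-", used)
    return interview_codes, location_codes
```

Random codes alone do not break the link if both datasets list farms in the same row order, so the rows of at least one dataset should also be reordered before sharing.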

  • Anonymise geo-referenced data by replacing point coordinates with non-disclosing features or variables; or, preferably, keep geo-references intact and impose access restrictions on the data instead.

Point data may fix the position of individuals, organisations or businesses studied, which could disclose their identity. Point coordinates may be replaced by larger, non-disclosing geographical areas such as polygon features (km² grid, postcode district, county), or linear features (random line, road, river). Point data can also be replaced by meaningful alternative variables that typify the geographical position and represent the reason why the locality was selected for the research, such as poverty index, population density, altitude or vegetation type. In this way, the value of the data is maintained, whilst disclosing geo-references are removed.

A better option may be to keep detailed spatial references intact and to impose access controls on the data instead.
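Where point coordinates are replaced rather than access-restricted, the grid option can be sketched as follows. The sketch assumes projected coordinates in metres (an easting and a northing on some national grid) and a default cell size of 1 km. Both are illustrative assumptions.

```python
import math


def to_grid_cell(easting_m, northing_m, cell_size_m=1000):
    """Return the south-west corner of the grid cell containing a point.

    Assumes projected coordinates in metres. Every point in the same
    cell maps to the same corner, so the exact position is not retained.
    """
    cell_e = math.floor(easting_m / cell_size_m) * cell_size_m
    cell_n = math.floor(northing_m / cell_size_m) * cell_size_m
    return cell_e, cell_n
```

The appropriate cell size depends on the data. In sparsely populated areas a 1 km cell may still contain only one farm or household, in which case a larger cell or an access restriction would be needed.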

Procedures to anonymise any research data that are destined for sharing or archiving should always be considered together with appropriate informed consent procedures.

Qualitative data
