Disclosure Assessment

If the identity of research participants is to be kept confidential (for example if this is agreed with participants), then the risks of disclosure need to be considered before, during and after the data are collected.

Disclosure risk is assessed by evaluating the key characteristics or variables in data files that are most likely to lead to participant identification in a specific project. These can be direct identifiers (e.g. a person’s name, national insurance number, picture or detailed geographic location) or indirect identifiers (e.g. large household size, specialised profession, unusual health conditions, verbatim textual responses to survey questions).

Risk assessment is about managing risk, rather than removal of all risk. The risk of identification and the risk of harm from exposing data are evaluated together to assess the overall risk.

Risk analysis can involve running frequency analyses of variables to determine low-frequency responses and extreme outliers. This needs to be complemented by qualitative analysis of risk characteristics based on local knowledge of the data and the population and individuals studied. For example, a single house with solar panels in a small rural village may be highly disclosive if this is not a common feature in that area. It is important to decide whether confidentiality is sufficiently served by concealing individual and household identities, or whether community or other location-specific identifiers also need to be concealed.
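As a rough illustration of the frequency analysis described above, the sketch below counts how often each response to one variable occurs and flags responses that fall under a chosen threshold. The function name `low_frequency_values`, the threshold of 3 and the sample records are all assumptions made up for this example; they are not taken from any real dataset, and a suitable threshold depends on the project.

```python
from collections import Counter


def low_frequency_values(records, variable, threshold=3):
    """Return {value: count} for responses seen fewer than `threshold` times.

    `records` is a list of dicts, one per respondent or household.
    The threshold is an illustrative assumption, not a recommended value.
    """
    counts = Counter(record[variable] for record in records)
    return {value: n for value, n in counts.items() if n < threshold}


# Made-up records for illustration only.
households = (
    [{"cooking_fuel": "Firewood"}] * 5
    + [{"cooking_fuel": "Charcoal"}] * 3
    + [{"cooking_fuel": "Gas/LPG"}]
    + [{"cooking_fuel": "Electricity-solar panel"}]
)

flagged = low_frequency_values(households, "cooking_fuel")
print(flagged)
```

Responses flagged in this way are only candidates for review. Whether they are actually disclosive still depends on the qualitative, local-knowledge assessment described above.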

Once disclosure assessment has been completed, relevant strategies for consent protocols, anonymisation, and regulation of data access can be evaluated and applied.

Millennium Villages study example

This table shows examples of some assessed variables, with the risk identified and the action taken, from a review we carried out of household survey data from the Millennium Villages Impact Evaluation project, northern Ghana. It shows examples of identifiers commonly assessed for disclosure risk (age, community) as well as variables for which local knowledge is essential to indicate risk (cooking fuel type and house wall material).

Extract of household survey variables assessed for disclosure risk

| Variable | Disclosure risk | Action |
| --- | --- | --- |
| Community | Low frequency counts for all named communities; respondents are very easily identifiable (especially in combination with other variables) | Exclude variable from dataset |
| Age | Low counts of older respondents over 75 years old | Top-code age >= 75 as '75 and over' |
| Main occupation during last 12 months | Low counts of very specific occupations | Occupations aggregated into standard occupation codes |
| Ethnicity of the household head | Low counts of specific ethnicities | Recode the low-frequency responses (all responses but 'Mamprusi' and 'Builsa') into 'Other' |
| Household's primary type of energy/fuel used for cooking | Very low counts for 'Gas/LPG' and 'Electricity-solar panel' responses may lead to household identification (especially if combined with other datasets) | Recode all responses into the following main categories: 1 - 'Firewood'; 2 - 'Electricity-based'; 3 - 'Charcoal'; 4 - 'Other'; 5 - 'Don't know'; 6 - 'NA/missing' |
| Main material of the wall of the house | A number of low-frequency responses; exterior features of buildings are easily identifiable and could result in identification of the household | As the main material of the wall refers to the exterior of a building, it may be advisable to recode the low-frequency and 'Other' responses into 'Other (incl. wood-based and stone-based)' and retain the remaining groups |
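Two of the actions in the table, top-coding age and recoding low-frequency responses into 'Other', can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the review: the function names `top_code_age` and `recode_to_other` are hypothetical, and the value 'ExampleGroup' is a placeholder rather than a response from the survey.

```python
def top_code_age(age, cap=75):
    """Collapse ages at or above `cap` into a single top band.

    Returns a string in every case so the recoded variable has one type.
    """
    return f"{cap} and over" if age >= cap else str(age)


def recode_to_other(value, keep):
    """Keep responses listed in `keep`; recode everything else to 'Other'."""
    return value if value in keep else "Other"


# Categories retained in the table's ethnicity example.
KEPT_ETHNICITIES = {"Mamprusi", "Builsa"}

# Illustrative values only; 'ExampleGroup' is a placeholder response.
ages = [34, 75, 81]
ethnicities = ["Mamprusi", "ExampleGroup", "Builsa"]

recoded_ages = [top_code_age(a) for a in ages]
recoded_ethnicities = [recode_to_other(e, KEPT_ETHNICITIES) for e in ethnicities]

print(recoded_ages)
print(recoded_ethnicities)
```

Recoding by an explicit list of categories to keep, rather than by a count threshold, mirrors how the table states the ethnicity action; a threshold-based rule would be an alternative design.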
