Assessing data quality and disclosure risk in numeric data
20 February 2019
London School of Economics
Event is now full, waiting list available.
In this hands-on one-day course you will learn about the principles of, and tools for, assessing data quality and reviewing disclosure risk in numeric data sources. Data assessment is useful whether you want to create high quality data for publishing, thereby supporting the transparency and replication agenda (e.g. to meet funder or journal policy), or simply to check unfamiliar data that you have accessed for reuse. Meeting the requirements of the GDPR when processing and de-identifying data also benefits from quick examination, using tools where possible.
The course will introduce the key elements of data quality and disclosure risk, including file checks, data and metadata checks, and direct and indirect identifiers. The day makes use of two tools to undertake review. The first is QAMyData, which automatically assesses elements of quality such as missingness, duplication, outliers and direct identifiers. A user can set thresholds in the QAMyData tool to indicate what they are prepared to accept (e.g. no missing data, or data must be fully labelled). Issues are identified in both a summary report and a detailed report. The second tool is R sdcMicro, a practical tool for checking disclosure risk by examining combinations of key variables.
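As a rough illustration of the two kinds of check described above, the following Python sketch applies a missingness threshold to one variable and counts how many records share each combination of key variables. It is not the QAMyData or sdcMicro code: the toy records, the variable names and the functions `missing_share`, `fails_threshold` and `risky_combinations` are all hypothetical, chosen only to show the idea.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
records = [
    {"age_band": "30-39", "sex": "F", "region": "London", "income": 31000},
    {"age_band": "30-39", "sex": "F", "region": "London", "income": None},
    {"age_band": "40-49", "sex": "M", "region": "Essex", "income": 45000},
    {"age_band": "30-39", "sex": "F", "region": "London", "income": 28000},
    {"age_band": "70-79", "sex": "M", "region": "Essex", "income": 19000},
]


def missing_share(rows, variable):
    """Proportion of rows in which the variable is missing (None)."""
    return sum(1 for r in rows if r[variable] is None) / len(rows)


def fails_threshold(rows, variable, max_missing):
    """True if the missing share exceeds the user-chosen threshold."""
    return missing_share(rows, variable) > max_missing


def risky_combinations(rows, key_vars, k):
    """Key-variable combinations shared by fewer than k rows."""
    counts = Counter(tuple(r[v] for v in key_vars) for r in rows)
    return {combo: n for combo, n in counts.items() if n < k}


# Quality check: a threshold of 0.0 means "no missing data accepted".
print(fails_threshold(records, "income", max_missing=0.0))

# Disclosure check: combinations held by fewer than 2 respondents.
print(risky_combinations(records, ["age_band", "sex", "region"], k=2))
```

A combination of key variables that occurs only once or twice in a file is the kind of case a disclosure review would look at more closely; the value of `k` used here is an arbitrary choice for the example.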
Practical demonstrations and hands-on exercises will be used throughout. The course will be held in a lab where the software will be installed. Both tools can also be easily downloaded to a laptop, used straight after the workshop, and integrated into data cleaning and processing pipelines by data creators, users, reviewers and publishers.
The course covers:
By the end of the course participants will:
Louise Corti is Director of Collections Development and Data Publishing at the UK Data Service and was PI on the QAMyData NCRM award. She has over 30 years' experience of working in social surveys and data archiving and was instrumental in helping develop and promote the research data management policies and guidelines we see today. She is passionate about transforming the way we handle and treat research data to ensure that it endures.
Sharon Bolton manages the data curation and publishing activities at the UK Data Service. Having spent over 15 years in data archiving, she is an expert on the preparation and documentation of survey and many other kinds of data. She undertakes research to develop and improve new ingest processing tools and is involved in metadata and controlled vocabularies work for a number of European projects, including the use of multilingual thesauri.
Cristina Magder works in the data curation and publishing team at the UK Data Service, specialising in the enhancement of data and documentation for reuse. She undertakes specialised work to prepare microdata, such as the UK's large-scale social surveys, and is involved in scoping new open source QA tools for data. She developed the teaching module on using R sdcMicro for disclosure review, and has been heavily involved in the QAMyData project.
Anca Vlad has responsibility for overseeing the deposit of research data into the UK Data Service’s ‘ReShare’ self-deposit repository. She coordinates and undertakes data disclosure and quality reviews of data collections received, to ensure the publication of high quality data and metadata for reuse. Anca has helped develop the teaching module on assessment of numeric data and has been heavily involved in the QAMyData project.
Myles Offord is a software engineer at the UK Data Service who has researched and developed the open source QAMyData tool as part of a recent NCRM Collaborative Projects award held by the UK Data Service.
All fees include event materials, lunch, morning and afternoon tea. They do not include travel and accommodation costs.
Level: Intermediate (some prior knowledge)
Experience/Knowledge required: Some knowledge of the creation and QA of survey or numeric data is expected, as is familiarity with a statistical software package, e.g. SPSS, Stata or R.
Target audience: Academics, lecturers, researchers and data publishers from all sectors who are interested in the practical elements of assessing numeric data for quality and disclosure risk.
To sign up for the course please visit the event page on the NCRM website: https://www.ncrm.ac.uk/training/show.php?article=9311