Data-level documentation

Data-level, or object-level, documentation describes individual variables in a database or individual objects such as interview transcripts or pictures. It can be embedded in the data files themselves, for example as variable, value and code labels in an SPSS file or as headers in an interview transcript.

Quantitative data

Where possible, variable-level annotation should be embedded within a data file itself. More comprehensive variable-level documentation can also be created using a structured metadata format such as XML.

Documentation for structured tabular data should include, where applicable:

  • variable names, labels and descriptions (maximum 80 characters)
  • units of measurement for variables
  • reference to the question number of a survey or questionnaire
Example: variable 'q11hexw' with label 'Q11: hours spent taking physical exercise in a typical week'; the label gives the unit of measurement and a reference to the question number (Q11)
  • value code labels
Example: variable 'p1sex' = 'sex of respondent' with codes '1=female', '2=male', '8=don't know', '9=not answered'
  • coding and classification schemes explained, with a bibliographic and dated reference (some standards change over time)
Examples: Standard Occupational Classification, 2000, a series of codes to classify respondents' jobs; ISO 3166 alpha-2 country codes, an international standard of 2-letter country codes
  • codes for missing data, with reason data are missing (blanks, system-missing or '0' values are best avoided)
Example: '99=not recorded', '98=not provided (no answer)', '97=not applicable', '96=not known', '95=error'
  • universe information for variables that apply to only part of the sample, for example where cases or questions were skipped
  • derived or constructed variables created after collection, giving the code, algorithm or command files used to create them. Simple derivations, such as grouping age data into age intervals, can be explained in the variable and value labels; complex derivations can be described by providing the algorithms, logical statements or functions used to create the derived variables, such as the SPSS or Stata command files
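
The items in the list above can be gathered in a simple machine-readable data dictionary. The following is a minimal sketch in Python, not a prescribed format: the dictionary layout and the helper names check_labels and describe_value are hypothetical, the 80-character limit is assumed to apply to labels, and treating codes 8 and 9 of 'p1sex' as missing-data codes is likewise an assumption.

```python
# Hypothetical data dictionary built from the examples in the list above.
# The layout (keys such as "unit" or "missing_codes") is an assumption.
DATA_DICTIONARY = {
    "q11hexw": {
        "label": "Q11: hours spent taking physical exercise in a typical week",
        "unit": "hours per week",
        "question": "Q11",
        "value_labels": {},
        # Assumed here: the example missing-data codes apply to this variable.
        "missing_codes": {
            99: "not recorded",
            98: "not provided (no answer)",
            97: "not applicable",
            96: "not known",
            95: "error",
        },
    },
    "p1sex": {
        "label": "sex of respondent",
        "unit": None,
        "question": None,  # not stated in the example
        "value_labels": {1: "female", 2: "male"},
        # Assumed here: 8 and 9 are treated as missing-data codes.
        "missing_codes": {8: "don't know", 9: "not answered"},
    },
}

MAX_LABEL_LENGTH = 80  # assumed to apply to variable labels


def check_labels(dictionary):
    """Return names of variables whose label is empty or too long."""
    return [
        name
        for name, meta in dictionary.items()
        if not meta["label"] or len(meta["label"]) > MAX_LABEL_LENGTH
    ]


def describe_value(dictionary, variable, value):
    """Translate a stored code into its documented meaning."""
    meta = dictionary[variable]
    if value in meta["missing_codes"]:
        return f"missing: {meta['missing_codes'][value]}"
    if value in meta["value_labels"]:
        return meta["value_labels"][value]
    return str(value)  # an ordinary measured value, e.g. hours per week
```

Keeping missing-data codes separate from value labels lets a secondary user exclude them from analysis without guessing which codes are substantive.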

Uncoded, ungrouped and underived raw data provide more re-use options than those where coding, grouping or derivation has been applied, allowing secondary users to apply their own codes, groupings or derivations.
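
A simple derivation can be documented by keeping the raw variable next to the derived one and publishing the rule that links them. The sketch below, in Python, groups age into intervals; the variable names 'age' and 'agegrp', the age bands and the missing-data codes 999 and 9 are all hypothetical choices made for illustration.

```python
# Hypothetical age bands: (lower bound, upper bound, code, label).
AGE_BANDS = [
    (0, 17, 1, "0-17"),
    (18, 34, 2, "18-34"),
    (35, 64, 3, "35-64"),
    (65, 120, 4, "65 and over"),
]
RAW_AGE_MISSING = 999   # assumed missing-data code in the raw variable
AGE_GROUP_MISSING = 9   # assumed missing-data code in the derived variable


def derive_age_group(age):
    """Return the age-group code for a raw age in years."""
    if age == RAW_AGE_MISSING:
        return AGE_GROUP_MISSING
    for lower, upper, code, _label in AGE_BANDS:
        if lower <= age <= upper:
            return code
    return AGE_GROUP_MISSING  # outside every documented band


records = [
    {"id": 1, "age": 16},
    {"id": 2, "age": 42},
    {"id": 3, "age": RAW_AGE_MISSING},
]

# The raw 'age' is kept so that secondary users can apply other groupings.
for record in records:
    record["agegrp"] = derive_age_group(record["age"])
```

Because the bands are held in one table, the same table can be used to generate the value labels for the derived variable.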

Embedding data documentation

Many data software packages have facilities for data annotation and description as variable attributes (labels, codes, data type, missing values), table relationships, etc.

  • SPSS: variable descriptions and attributes such as codes, data type and missing values can be documented for each variable in 'Variable View' or via syntax; with syntax, the embedded documentation is also recorded in the SPSS command file
  • GIS, e.g. ArcGIS: shapefiles or layers and tables can be organised in a geo-database with rich metadata created in ArcCatalog

A structured dataset may also be accompanied by a codebook detailing all variables and their values. This can be created by importing frequency distribution outputs, created from the software package used, into a word processor, with annotation added where necessary.
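
Where frequency outputs from a statistics package are not available, a similar codebook entry can be generated with a short script. The following Python sketch is one possible approach under simple assumptions: the function name codebook_entry and the plain-text layout are hypothetical, and the sample values are invented for illustration only.

```python
from collections import Counter


def codebook_entry(name, label, value_labels, values):
    """Build a plain-text codebook entry with a frequency for each code."""
    counts = Counter(values)
    lines = [f"{name}: {label}"]
    for code in sorted(counts):
        # Flag codes that occur in the data but are not documented.
        text = value_labels.get(code, "(unlabelled)")
        lines.append(f"  {code} = {text}: n={counts[code]}")
    return "\n".join(lines)


# Invented sample values, for illustration only.
entry = codebook_entry(
    "p1sex",
    "sex of respondent",
    {1: "female", 2: "male", 8: "don't know", 9: "not answered"},
    [1, 2, 2, 9, 1, 2],
)
print(entry)
```

Codes reported as unlabelled point to gaps in the documentation that need annotation before the codebook is shared.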

Structured metadata: XML schemas

More comprehensive variable-level documentation, including basic data dictionary information, question text and question routing instructions, can also be created using a structured metadata format. XML is often used for this, as in the Data Documentation Initiative (DDI). Detailed DDI documentation can be created directly from various software packages, using DDI-specific XML authoring tools.
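
To give an impression of what variable-level XML documentation looks like, the Python sketch below writes a simplified fragment whose element names are modelled on DDI Codebook. It is an illustration only: the fragment omits namespaces and required study-level elements, has not been validated against a DDI schema, and the helper variable_to_xml and the question text are hypothetical.

```python
import xml.etree.ElementTree as ET


def variable_to_xml(name, label, question_text, categories):
    """Return a <var> element holding a label, question text and categories."""
    var = ET.Element("var", {"name": name})
    ET.SubElement(var, "labl").text = label
    qstn = ET.SubElement(var, "qstn")
    ET.SubElement(qstn, "qstnLit").text = question_text
    for code, text in categories.items():
        catgry = ET.SubElement(var, "catgry")
        ET.SubElement(catgry, "catValu").text = str(code)
        ET.SubElement(catgry, "labl").text = text
    return var


code_book = ET.Element("codeBook")
data_dscr = ET.SubElement(code_book, "dataDscr")
data_dscr.append(
    variable_to_xml(
        "p1sex",
        "sex of respondent",
        "What is your sex?",  # hypothetical question text
        {1: "female", 2: "male", 8: "don't know", 9: "not answered"},
    )
)

xml_text = ET.tostring(code_book, encoding="unicode")
print(xml_text)
```

In practice a DDI authoring tool or export from the statistics package would produce the complete, schema-valid document.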

Such standardised documentation in XML format can be used by data extraction and analysis engines such as Nesstar; see for example the datasets included in our Nesstar catalogue.

Qualitative data
