File formats and software

The format and software in which research data are created usually depend on how researchers choose to collect and analyse data, which is often determined by discipline-specific standards and customs.

Standard formats for long-term access

All digital information is designed to be interpreted by computer programs to make it understandable; it is therefore inherently dependent on software. All digital data are thus at risk of becoming inaccessible should the hardware or software environments they depend on become obsolete.

Many software packages are backward compatible and can import data created in earlier versions, and competing popular programs are often interoperable. Even so, the safest way to guarantee long-term access to usable data is to convert them to standard formats that most software can interpret and that are suitable for data interchange and transformation.

This typically means using open or standard formats (such as OpenDocument Format or ODF, ASCII, tab-delimited format, comma-separated values or XML) instead of proprietary ones. Some proprietary formats (such as Microsoft Rich Text Format, Microsoft Excel and SPSS) are widely used and likely to be accessible for a reasonable, but not unlimited, time.

Thus, while researchers use the data formats and software best suited to their planned analyses, once analysis is complete and data are being prepared for storage, they should consider converting their research data to standard, interchangeable and longer-lasting formats, so that the data remain usable in the future. Standard formats should likewise be considered for backups of data.
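As an illustration of such a conversion, the following minimal Python sketch writes a small dataset to tab-delimited text and stores the variable labels in a second tab-delimited file, since labels held inside a statistical package are easily lost on export. The function names, the example records and the labels are all hypothetical and chosen only for this sketch; real data would first need to be read from the original software format.

```python
import csv
import io


def write_tab_delimited(records, fieldnames, stream):
    """Write a list of dict records to `stream` as tab-delimited text."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)


def write_variable_labels(labels, stream):
    """Write a variable-name-to-label mapping as a two-column tab-delimited table."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow(["variable", "label"])
    writer.writerows(labels.items())


# Hypothetical example data, standing in for a real dataset.
records = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
labels = {"id": "Respondent identifier", "age": "Age in years"}

# In-memory streams keep the sketch self-contained.
data_out = io.StringIO()
write_tab_delimited(records, list(labels), data_out)

labels_out = io.StringIO()
write_variable_labels(labels, labels_out)

print(data_out.getvalue())
print(labels_out.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` in place of the `io.StringIO` objects.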

For long-term digital preservation, the UK Data Service holds data in such standard formats. At the same time, data are converted to current, common and user-friendly formats for users, and may be migrated forward when needed.

Converting data

Data may need to be converted from the original format to a preferred data preservation format in preparation for long-term storage or for deposit with the UK Data Service. Conversion is best done by the researcher familiar with the data, to ensure data integrity during conversion.

When data are converted from one format to another - through export or by using data translation software - certain changes may occur to the data:

  • for data held in statistical packages, spreadsheets or databases: some data or internal metadata such as missing value definitions, decimal numbers, formulae or variable labels may be lost during conversion to another format, or data may be truncated
  • for textual data: editing such as highlighting, bold text or headers/footers may be lost

After conversion, data should be checked for errors or changes.
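One way to carry out such a check is to compare the original and converted files cell by cell. The sketch below is a minimal illustration, assuming both versions can be read as delimited text; the function name `compare_tables` and the sample data are hypothetical. It reports differing row counts, differing column counts and any changed values, such as truncated decimals or truncated text.

```python
import csv
import io


def compare_tables(original_rows, converted_rows):
    """Return a list of differences between two tables.

    Each table is a list of rows and each row a list of strings,
    with the header as the first row.
    """
    problems = []
    if len(original_rows) != len(converted_rows):
        problems.append(
            f"row count differs: {len(original_rows)} vs {len(converted_rows)}"
        )
    # zip stops at the shorter table; the row-count check above covers the rest.
    for row_no, (orig, conv) in enumerate(zip(original_rows, converted_rows), start=1):
        if len(orig) != len(conv):
            problems.append(
                f"row {row_no}: column count differs: {len(orig)} vs {len(conv)}"
            )
            continue
        for col_no, (before, after) in enumerate(zip(orig, conv), start=1):
            if before != after:
                problems.append(
                    f"row {row_no}, column {col_no}: {before!r} became {after!r}"
                )
    return problems


# Hypothetical sample: a decimal and a text value were truncated on conversion.
original_text = "id,score,comment\n1,3.14159,fine\n2,-9,a longer free-text answer\n"
converted_text = "id\tscore\tcomment\n1\t3.14\tfine\n2\t-9\ta longer free-te\n"

original = list(csv.reader(io.StringIO(original_text)))
converted = list(csv.reader(io.StringIO(converted_text), delimiter="\t"))

problems = compare_tables(original, converted)
for problem in problems:
    print(problem)
```

A comparison of this kind only detects changes to values that survive in both files; metadata such as variable labels or missing value definitions still needs to be checked separately against the original software.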

Qualitative data analysis products

Qualitative data analysis software packages such as NVivo, ATLAS.ti and MAXQDA have export facilities that enable a whole 'project', consisting of the raw data, coding tree, coded data, and associated memos and notes, to be saved.

For archiving such data, the raw data, the final coding tree and any useful memos should be exported prior to deposit. Coded data are preserved by the UK Data Service in their incoming format, but are not normally distributed, as they cannot be exported in a common non-proprietary format.

The UK Data Service is working to encourage the development of data documentation standards using XML. The Data Exchange Tools and Conversion Utilities (DExT) project proposed an XML schema, QuDEx, to represent annotated and complex multimedia data.

At present, coded data are requested infrequently by data users, mainly because the coding process is subjective and often geared towards specific themes, and therefore may not be applicable to the secondary analyst's topic of investigation. However, this is changing, and access to coding schemes can be valuable for teaching and other forms of reuse. For larger studies, there is a stronger case for retaining coded data to aid searching within large bodies of text, although coded data will always be an adjunct to the main body of raw data.
