Data Provenance Metadata: Builds Trust, Credibility and Reproducibility

Metadata is information about an object or resource that describes characteristics such as content, quality, format, location and administrative details. It can describe physical as well as digital items, and it takes many forms, from free text (such as README files) to standardised, structured, machine-readable formats.

Data provenance, a type of metadata, documents why and how the data was produced, and where, when and by whom it was collected. Data provenance metadata ranges from the easily human-readable to the highly technical, and usually requires some knowledge of the domain to create. It enables interpretation and reuse of the data, and builds trust, credibility and reproducibility.

Here are some typical scenarios that show why capturing data provenance metadata is essential:

  1. In data-intensive research, data users are unlikely to be the original data producers. Data producers may configure an instrument or simulation in a certain way to collect primary data, or apply certain methodologies and processes to extract, transform and analyse input data to produce an output data product. Provenance information documents these choices.

  2. Providing provenance metadata as part of the published data is important for determining the quality of the data, the amount of trust one can place in the results, the reproducibility of the results and the reusability of the data.

  3. For data users, the scientific basis of their analysis and the accountability of their research rely largely on the credibility and trustworthiness of their input data, so they may want to check data quality along with the expected level of imprecision.

So how is it captured? 

The capture and maintenance of provenance metadata should occur as a normal part of research and data management processes. Metadata field types such as date created, creator, instrument or software used and data processing methods fall into the category of provenance information. Learn more about data management and how it forms the basis of recording provenance.
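As a rough illustration of the field types just mentioned, the sketch below stores them in a small structured record and writes it out as JSON. The class name, field names and sample values are hypothetical and chosen only for this example; they do not follow any particular metadata standard.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import date


@dataclass
class ProvenanceRecord:
    """Hypothetical minimal provenance record; field names are illustrative only."""
    creator: str
    date_created: str                 # ISO 8601 date string
    instrument_or_software: str
    processing_methods: list[str] = field(default_factory=list)


# Sample values are invented for the example.
record = ProvenanceRecord(
    creator="A. Researcher",
    date_created=date(2024, 3, 1).isoformat(),
    instrument_or_software="example-logger v1.2",
    processing_methods=[
        "removed readings outside the sensor range",
        "averaged to hourly values",
    ],
)

# Serialise the record so it can be stored alongside the data file.
record_json = json.dumps(asdict(record), indent=2)
print(record_json)
```

Keeping a record like this next to the data file, and updating it whenever a processing step is added, is one simple way to make provenance capture part of the normal workflow rather than an afterthought.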

Approaches to capture and represent provenance can be described on a number of dimensions:

  • Recorded as free text (for example, a single README file that describes the data collection and processing methods used) or in a more structured way, ranging from generic standards (e.g. Dublin Core), through discipline-specific metadata standards such as ISO 19115-2, to highly abstract data models such as the W3C Provenance Data Model (PROV-DM) and Provenance Ontology (PROV-O).
  • Captured internally within a software tool or program, or in an external system such as Kepler, Galaxy or Taverna.
  • Represented in machine-readable and/or human-readable form. The human-readable form might just be a textual description, but might also be a visualisation of the machine-readable representation, such as those produced by VisTrails.
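To make the contrast between machine-readable and human-readable forms concrete, here is a minimal sketch that holds one provenance description as structured data and derives a README-style text from it. It borrows the entity, activity and agent terms and the wasGeneratedBy and wasAssociatedWith relations from PROV-DM, but the layout is a simplification for illustration, not a conforming PROV serialisation. All identifiers (the `ex:` names) and the `to_readme` helper are hypothetical.

```python
import json

# Hypothetical dataset, activity and agent identifiers, for illustration only.
provenance = {
    "entity": {"ex:hourly_temperature.csv": {"ex:format": "text/csv"}},
    "activity": {"ex:hourly_averaging": {"ex:software": "example-script v0.1"}},
    "agent": {"ex:a_researcher": {"ex:name": "A. Researcher"}},
    # Each relation is a list of (subject, object) pairs.
    "wasGeneratedBy": [("ex:hourly_temperature.csv", "ex:hourly_averaging")],
    "wasAssociatedWith": [("ex:hourly_averaging", "ex:a_researcher")],
}


def to_readme(prov: dict) -> str:
    """Render the structured description as plain sentences for a README."""
    lines = []
    for entity, activity in prov["wasGeneratedBy"]:
        lines.append(f"{entity} was generated by {activity}.")
    for activity, agent in prov["wasAssociatedWith"]:
        name = prov["agent"][agent]["ex:name"]
        lines.append(f"{activity} was carried out by {name}.")
    return "\n".join(lines)


machine_readable = json.dumps(provenance, indent=2)  # for software to consume
human_readable = to_readme(provenance)               # for people to read
print(human_readable)
```

Deriving the text from the structured form, rather than maintaining the two separately, keeps them from drifting apart as the data is reprocessed.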

Learn more about Data Provenance and current best practice for using and recording it.