Data Provenance

Data provenance is the documentation of where a piece of data comes from and the processes and methodology by which it was produced. Put simply, provenance answers the questions of why and how the data was produced, as well as where, when and by whom.

Data provenance is metadata that confirms the authenticity of that data and enables it to be reused.

Why data provenance matters

Data provenance provides essential information for determining data quality and facilitating reproducibility and reliability of data. Accurately recording data provenance is a cornerstone of good data management.

This is especially important in data-intensive research, where the people using the data are often not the people who produced it.

For data users

Data users know that the scientific basis of their analysis and the accountability of their research rely largely on the credibility and trustworthiness of their input data. They’ll want to check the data’s quality and expected level of imprecision.

Data provenance is the information that confirms the authenticity of a dataset so it can be confidently reused.

For data producers

Data producers may configure an instrument or simulation in a certain way to collect primary data, or apply methodologies and processes to extract, transform and analyse input data to produce an output data product.

Provenance information documents these activities.

Providing provenance metadata as part of the published data helps others assess the quality and reusability of your data, the reproducibility of results and, ultimately, the amount of trust they can place in the results.

How to record and manage data provenance

Provenance is recorded as metadata about the data product. Many routinely collected metadata fields count as provenance information, such as date created, creator, instrument or software used, and data processing methods.
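As a rough illustration, the fields mentioned above could be gathered into a small record and saved alongside the data. This is a minimal sketch only: the function name build_provenance_record, the field names and all sample values are assumptions made for this example, not part of any metadata standard.

```python
import json
from datetime import datetime, timezone


def build_provenance_record(creator, instrument, software, processing_steps, created=None):
    """Collect basic provenance fields into a plain dictionary (hypothetical layout)."""
    created = created or datetime.now(timezone.utc)
    return {
        "date_created": created.isoformat(),
        "creator": creator,
        "instrument": instrument,
        "software": software,
        "processing_methods": list(processing_steps),
    }


# Sample values are invented for illustration.
record = build_provenance_record(
    creator="A. Researcher",
    instrument="Example temperature logger",
    software="example-pipeline 1.2.0",
    processing_steps=["removed out-of-range readings", "averaged to hourly values"],
    created=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)

# Serialise to JSON so the record can be stored next to the data file.
print(json.dumps(record, indent=2))
```

Keeping the record as JSON means the same information can be read by a person and parsed by a program.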

Provenance can be captured and represented in the following ways:

  • recorded as a text string, using a generic or discipline-specific schema, or using a provenance data model
  • captured internally within a software tool or program, or in an external system
  • represented in machine-readable and/or human-readable form.

In its simplest form, provenance can be recorded in a single README text file that describes the data collection and processing methods. Provenance can also be recorded in a more structured way using specific elements in generic metadata standards, such as Dublin Core, or in discipline-specific metadata standards, such as ISO 19115-2. Alternatively, provenance information can be described directly in the W3C Provenance Data Model (PROV-DM) and Provenance Ontology (PROV-O).
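To give a feel for the data-model approach, the sketch below stores provenance using three core PROV-DM concepts (entity, activity, agent) and three of its relations (used, wasGeneratedBy, wasAssociatedWith). It is a simplified, hand-rolled structure for illustration: the ProvDocument class, its methods and the "ex:" identifiers are assumptions of this example, and the output is not a validated PROV serialisation.

```python
from dataclasses import dataclass, field


@dataclass
class ProvDocument:
    """Toy container loosely modelled on PROV-DM concepts (not a real PROV library)."""
    entities: dict = field(default_factory=dict)
    activities: dict = field(default_factory=dict)
    agents: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (relation, subject, object) triples

    def relate(self, relation, subject, obj):
        self.relations.append((relation, subject, obj))

    def inputs_of(self, entity_id):
        """Return the entities used by the activities that generated entity_id."""
        generating = [o for r, s, o in self.relations
                      if r == "wasGeneratedBy" and s == entity_id]
        return sorted(o for r, s, o in self.relations
                      if r == "used" and s in generating)


doc = ProvDocument()
doc.entities["ex:raw.csv"] = {"description": "raw sensor readings"}
doc.entities["ex:clean.csv"] = {"description": "quality-controlled readings"}
doc.activities["ex:cleaning"] = {"method": "removed out-of-range readings"}
doc.agents["ex:researcher"] = {"name": "A. Researcher"}

# PROV-DM argument order: used(activity, entity), wasGeneratedBy(entity, activity),
# wasAssociatedWith(activity, agent).
doc.relate("used", "ex:cleaning", "ex:raw.csv")
doc.relate("wasGeneratedBy", "ex:clean.csv", "ex:cleaning")
doc.relate("wasAssociatedWith", "ex:cleaning", "ex:researcher")

print(doc.inputs_of("ex:clean.csv"))
```

Because the relations are explicit, a question such as "which inputs produced this file?" becomes a simple lookup rather than a matter of reading free text.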

Provenance trails can be captured internally by software tools during their processing activity, such as by the workflow systems Kepler, Galaxy or Taverna. This provenance information is typically only available to other users of the same system or exported to a separate provenance store.
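The idea of a tool capturing provenance while it runs can be sketched with a small Python decorator that logs each processing step as it executes. This is an illustrative assumption, not how Kepler, Galaxy or Taverna work internally; track_provenance, PROVENANCE_LOG and the two sample steps are hypothetical names invented for this example.

```python
import functools
import hashlib
import json
from datetime import datetime, timezone

# In-memory provenance store (hypothetical; a real system might export this elsewhere).
PROVENANCE_LOG = []


def _digest(value):
    """Short, stable fingerprint of a JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def track_provenance(func):
    """Record what ran, when, and fingerprints of its inputs and output."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "step": func.__name__,
            "started": started,
            "ended": datetime.now(timezone.utc).isoformat(),
            "input_digests": [_digest(a) for a in args]
                             + [_digest(kwargs[k]) for k in sorted(kwargs)],
            "output_digest": _digest(result),
        })
        return result
    return wrapper


@track_provenance
def drop_negative(values):
    return [v for v in values if v >= 0]


@track_provenance
def mean(values):
    return sum(values) / len(values)


cleaned = drop_negative([3.0, -1.0, 5.0])
average = mean(cleaned)

for entry in PROVENANCE_LOG:
    print(entry["step"], entry["input_digests"], "->", entry["output_digest"])
```

Matching an output digest from one step to an input digest of the next is what links the steps into a trail.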

Finally, provenance information can be captured in a way that supports machine-to-machine interactions and/or at a higher level that allows human users to easily read the provenance trail. In some cases this might just be a textual description, but it might also involve a visualisation of the machine-readable representation, such as that provided by VisTrails.
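As a small example of moving between the two forms, the sketch below turns a machine-readable list of steps into a numbered textual description. The step layout (agent, activity, input, output keys) and the describe_trail function are assumptions made for this example only.

```python
def describe_trail(steps):
    """Render machine-readable provenance steps as numbered, human-readable sentences."""
    lines = []
    for number, step in enumerate(steps, start=1):
        lines.append(
            f"{number}. {step['agent']} ran '{step['activity']}' "
            f"on {step['input']} to produce {step['output']}."
        )
    return "\n".join(lines)


# Invented sample trail for illustration.
trail = [
    {"agent": "A. Researcher", "activity": "quality control",
     "input": "raw.csv", "output": "clean.csv"},
    {"agent": "A. Researcher", "activity": "hourly averaging",
     "input": "clean.csv", "output": "hourly.csv"},
]

summary = describe_trail(trail)
print(summary)
```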

More data provenance information