Data Provenance
Data provenance is the documentation of where a piece of data comes from and the processes and methodology by which it was produced. Put simply, provenance answers the questions of why and how the data was produced, as well as where, when and by whom.
Data provenance is metadata that confirms the authenticity of that data and enables it to be reused.
Why data provenance matters
Data provenance provides essential information for determining data quality and facilitating reproducibility and reliability of data. Accurately recording data provenance is a cornerstone of good data management.
This is especially important in data-intensive research, where the people using the data are unlikely to be the people who produced it.
How to record and manage data provenance
Provenance is recorded as metadata about the data product. Many routinely collected metadata fields count as provenance information, such as date created, creator, instrument or software used, and data processing methods. Read more about metadata.
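As a rough illustration, the fields named above could be gathered into a small record stored alongside the data. This is a minimal sketch only: the function name, the field names and the sample values are assumptions made for the example, and the checksum is an extra field added here to tie the record to one exact version of the data.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_provenance_record(data: bytes, creator: str, instrument: str, method: str) -> dict:
    """Return basic provenance metadata for `data` (field names are illustrative)."""
    return {
        "date_created": datetime.now(timezone.utc).isoformat(),
        "creator": creator,
        "instrument_or_software": instrument,
        "processing_method": method,
        # Assumption: a checksum links the record to one exact version of the data.
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Hypothetical sample values, used only to show the shape of the record.
record = build_provenance_record(
    b"site,temp_c\nA,21.4\n",
    creator="Example Researcher",
    instrument="Hypothetical logger v1",
    method="Raw export, no filtering",
)
print(json.dumps(record, indent=2))
```

A record like this could be written to a sidecar file next to the data file, so the two travel together.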
Provenance can be captured and represented in the following ways:
- recorded as a text string, using a generic or discipline-specific schema, or using a provenance data model
- captured internally within a software tool or program, or in an external system
- represented in machine readable and/or human readable form.
In its simplest form, provenance can be recorded in a single README text file that describes the data collection and processing methods. Provenance can also be recorded in a more structured way using specific elements in very generic metadata standards, such as Dublin Core, or discipline-specific metadata standards such as ISO 19115-2. Alternatively, provenance information can be described directly in the W3C Provenance Data Model (PROV-DM) and Provenance Ontology (PROV-O).
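To give a feel for the data-model option, the sketch below expresses one processing step using core PROV-DM concepts (entity, activity, agent) and relations (used, wasGeneratedBy, wasAssociatedWith). The layout loosely follows the PROV-JSON serialisation but has not been validated against it, and every identifier, label and timestamp under the `ex:` prefix is hypothetical.

```python
import json

# All identifiers, labels and times below are hypothetical examples.
doc = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {
        "ex:raw.csv": {"prov:label": "Raw sensor export"},
        "ex:clean.csv": {"prov:label": "Quality-controlled data"},
    },
    "activity": {
        "ex:quality_control": {
            "prov:startTime": "2024-01-01T09:00:00Z",
            "prov:endTime": "2024-01-01T09:05:00Z",
        }
    },
    "agent": {"ex:researcher": {"prov:label": "Example Researcher"}},
    # Relations: the activity used the raw file and generated the clean file.
    "used": {
        "_:u1": {"prov:activity": "ex:quality_control", "prov:entity": "ex:raw.csv"}
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:clean.csv", "prov:activity": "ex:quality_control"}
    },
    "wasAssociatedWith": {
        "_:a1": {"prov:activity": "ex:quality_control", "prov:agent": "ex:researcher"}
    },
}


def sources_of(document: dict, entity_id: str) -> list:
    """Return the entities used by the activities that generated `entity_id`."""
    activities = {
        rel["prov:activity"]
        for rel in document["wasGeneratedBy"].values()
        if rel["prov:entity"] == entity_id
    }
    return sorted(
        rel["prov:entity"]
        for rel in document["used"].values()
        if rel["prov:activity"] in activities
    )


print(json.dumps(doc, indent=2))
print(sources_of(doc, "ex:clean.csv"))
```

Because the relations are explicit, a question such as "what was this file produced from?" becomes a simple lookup rather than a matter of reading free text.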
Provenance trails can be captured internally by software tools during their processing activity, such as by the workflow systems Kepler, Galaxy or Taverna. This provenance information is typically either available only to other users of the same system or exported to a separate provenance store.
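The idea of capturing provenance as a side effect of processing can be illustrated with a toy example. The sketch below is not how Kepler, Galaxy or Taverna work internally; it only assumes a hypothetical decorator, `track_provenance`, that appends one record per processing step to an in-memory list standing in for a provenance store.

```python
import functools
import hashlib
from datetime import datetime, timezone

# Stand-in for a provenance store; a real system would persist this.
PROVENANCE_LOG = []


def _digest(value) -> str:
    """Short fingerprint of a value, based on its repr (illustrative only)."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()[:12]


def track_provenance(func):
    """Record the step name, start time and input/output digests of each call."""
    @functools.wraps(func)
    def wrapper(data):
        started = datetime.now(timezone.utc).isoformat()
        result = func(data)
        PROVENANCE_LOG.append({
            "step": func.__name__,
            "started": started,
            "input_digest": _digest(data),
            "output_digest": _digest(result),
        })
        return result
    return wrapper


@track_provenance
def drop_missing(values):
    """Remove missing readings."""
    return [v for v in values if v is not None]


@track_provenance
def to_kelvin(values):
    """Convert Celsius readings to kelvin."""
    return [round(v + 273.15, 2) for v in values]


cleaned = to_kelvin(drop_missing([21.4, None, 19.8]))
for entry in PROVENANCE_LOG:
    print(entry)
```

The output digest of one step matches the input digest of the next, which is what lets the individual records be chained into a trail.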
Finally, provenance information can be captured in a form that supports machine-to-machine interactions, in a higher-level form that human users can easily read, or in both. In some cases the human-readable form might just be a textual description, but it might also involve a visualisation of the machine-readable representation, such as those produced by VisTrails.
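The relationship between the two forms can be sketched in a few lines: the same trail is held as JSON for machines and rendered as plain sentences for people. The record layout, the `describe` function and the sample file names are all assumptions made for this example.

```python
import json

# Machine-readable trail; keys and values are hypothetical.
machine_readable = json.dumps([
    {"output": "clean.csv", "activity": "quality control",
     "input": "raw.csv", "agent": "Example Researcher"},
    {"output": "summary.csv", "activity": "monthly averaging",
     "input": "clean.csv", "agent": "Example Researcher"},
])


def describe(trail_json: str) -> str:
    """Render a JSON provenance trail as numbered, human-readable sentences."""
    steps = json.loads(trail_json)
    lines = [
        f"{i}. {s['output']} was produced from {s['input']} "
        f"by {s['activity']} ({s['agent']})."
        for i, s in enumerate(steps, start=1)
    ]
    return "\n".join(lines)


print(describe(machine_readable))
```

Keeping the machine-readable form as the source of truth and generating the readable form from it avoids the two drifting apart.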
More data provenance information
- The W3C Provenance Working Group published the PROV family of specifications, including the PROV Primer, PROV Ontology (PROV-O), PROV Data Model (PROV-DM), PROV Notation (PROV-N), PROV Constraints, and PROV Access and Query.
- Workshop papers and presentation slides are available from the International Provenance and Annotation Workshop (IPAW), a biennial workshop concerned with issues of data provenance, data derivation and data annotation.
- The ARDC has a playlist of data provenance videos.