Data Provenance

Data provenance is the documentation of where a piece of data comes from and the processes and methodology by which it was produced. Put simply, provenance answers the questions of why and how the data was produced, as well as where, when and by whom.

Data provenance is metadata that helps confirm the authenticity of the data it describes and enables that data to be reused.

Why data provenance matters

Data provenance provides essential information for determining data quality and facilitating reproducibility and reliability of data. Accurately recording data provenance is a cornerstone of good data management.

This is especially important in data-intensive research, where data users are unlikely to be the same people as the data producers.

For data users

Data users know that the scientific basis of their analysis and the accountability of their research rely largely on the credibility and trustworthiness of their input data. They’ll want to check the data’s quality and expected level of imprecision.

Data provenance is the information that confirms the authenticity of a dataset so it can be confidently reused.

For data producers

Data producers may configure an instrument or simulation in a certain way to collect primary data, or apply methodologies and processes to extract, transform and analyse input data to produce an output data product.

Provenance information documents these activities.

Providing provenance metadata as part of the published data helps others assess the quality and reusability of your data, the reproducibility of results and, ultimately, the amount of trust one can place in the results.

How to record and manage data provenance

Provenance is recorded as metadata about the data product. Many routinely collected metadata fields fall into the category of provenance information, such as date created, creator, instrument or software used, and data processing methods. Read more about metadata.
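The fields listed above can be sketched as a small structured record. This is a minimal illustration only: the class name, field names and example values are assumptions made for this sketch and are not drawn from any particular metadata standard.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class ProvenanceRecord:
    """Illustrative provenance metadata for one data product (hypothetical field names)."""
    title: str
    creator: str
    date_created: str  # ISO 8601 date string
    instrument_or_software: str
    processing_methods: List[str] = field(default_factory=list)


# Example values are invented for illustration.
record = ProvenanceRecord(
    title="Example survey dataset",
    creator="A. Researcher",
    date_created="2024-03-01",
    instrument_or_software="example-cleaning-tool 1.2",
    processing_methods=["removed duplicate rows", "converted units to SI"],
)

# Serialise the record so it can be stored or published alongside the data.
provenance_json = json.dumps(asdict(record), indent=2)
print(provenance_json)
```

Keeping the record as plain JSON means it can travel with the data file and be read by people and programs alike.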

Provenance can be captured and represented in the following ways:

recorded as a text string, using a generic or discipline-specific schema, or using a provenance data model
captured internally within a software tool or program, or in an external system
represented in machine-readable and/or human-readable form.

In its simplest form, provenance can be recorded in a single README text file that describes the data collection and processing methods. Provenance can also be recorded in a more structured way using specific elements in generic metadata standards such as Dublin Core, or in discipline-specific metadata standards such as ISO 19115-2. Alternatively, provenance information can be described directly in the W3C Provenance Data Model (PROV-DM) and Provenance Ontology (PROV-O).
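To give a feel for the data-model approach, the sketch below borrows the three core PROV-DM types (entity, activity, agent) and four of its relation names (used, wasGeneratedBy, wasAssociatedWith, wasDerivedFrom). Everything else is an assumption for illustration: the "ex:" identifiers are invented, and the tuple layout is an informal in-memory structure, not a standards-compliant PROV serialisation.

```python
from collections import namedtuple

# One statement linking two identifiers; "kind" is a PROV-DM relation name.
Relation = namedtuple("Relation", ["kind", "subject", "obj"])

# Hypothetical identifiers for the three core PROV-DM types.
entities = {"ex:raw_survey", "ex:cleaned_survey"}
activities = {"ex:cleaning_run"}
agents = {"ex:researcher"}

relations = [
    Relation("used", "ex:cleaning_run", "ex:raw_survey"),
    Relation("wasGeneratedBy", "ex:cleaned_survey", "ex:cleaning_run"),
    Relation("wasAssociatedWith", "ex:cleaning_run", "ex:researcher"),
    Relation("wasDerivedFrom", "ex:cleaned_survey", "ex:raw_survey"),
]


def sources_of(entity, relations):
    """Return every entity that `entity` was derived from, directly or indirectly."""
    found = set()
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for rel in relations:
            if (rel.kind == "wasDerivedFrom"
                    and rel.subject == current
                    and rel.obj not in found):
                found.add(rel.obj)
                frontier.append(rel.obj)
    return found


print(sources_of("ex:cleaned_survey", relations))
```

Once provenance is held as explicit relations, a question such as "which inputs does this product depend on?" becomes a simple traversal.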

Provenance trails can be captured internally by software tools during their processing activity, such as by the workflow systems Kepler, Galaxy or Taverna. This provenance information is typically either available only to other users of the same system or exported to a separate provenance store.
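The idea of a tool capturing provenance while it works can be illustrated with a small decorator that logs each processing step as it runs. This is a generic sketch and does not represent how Kepler, Galaxy or Taverna are implemented; the names record_provenance, provenance_log, drop_missing and scale are hypothetical.

```python
import functools
from datetime import datetime, timezone

# In-memory trail; a real tool might write to an external provenance store instead.
provenance_log = []


def record_provenance(func):
    """Wrap a processing step so each call appends an entry to provenance_log."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        provenance_log.append({
            "step": func.__name__,
            "started_at": started,
            "ended_at": datetime.now(timezone.utc).isoformat(),
            "parameters": kwargs,  # only keyword arguments are recorded here
        })
        return result
    return wrapper


@record_provenance
def drop_missing(values):
    return [v for v in values if v is not None]


@record_provenance
def scale(values, *, factor=1.0):
    return [v * factor for v in values]


cleaned = scale(drop_missing([1.0, None, 3.0]), factor=2.0)
print(cleaned)
for entry in provenance_log:
    print(entry["step"], entry["parameters"])
```

Because the trail is built as a side effect of running the steps, it cannot drift out of date the way a hand-written description can.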

Finally, provenance information can be captured in a way that supports machine-to-machine interactions and/or at a higher level that allows human users to easily read the provenance trail. In some cases this might just be a textual description, but it might also involve a visualisation of the machine-readable representation, as in VisTrails.
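As a small illustration of the two levels, the sketch below takes a machine-readable provenance record and renders it as a textual description for human readers. The JSON layout, the key names and the describe function are assumptions invented for this example.

```python
import json

# Machine-readable form: a hypothetical JSON layout with invented values.
machine_readable = json.dumps({
    "dataset": "cleaned_survey.csv",
    "steps": [
        {"activity": "download", "agent": "A. Researcher", "date": "2024-03-01"},
        {"activity": "remove duplicates", "agent": "A. Researcher", "date": "2024-03-02"},
    ],
})


def describe(record_json):
    """Render a provenance record as a numbered, human-readable trail."""
    record = json.loads(record_json)
    lines = [f"Provenance of {record['dataset']}:"]
    for number, step in enumerate(record["steps"], start=1):
        lines.append(
            f"  {number}. {step['activity']} by {step['agent']} on {step['date']}"
        )
    return "\n".join(lines)


print(describe(machine_readable))
```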

More data provenance information

The W3C Provenance Working Group published the PROV family of documents, including the PROV Primer, PROV Ontology (PROV-O), PROV Data Model (PROV-DM), PROV Notation (PROV-N), PROV Constraints, and PROV Access and Query.
Workshop papers and presentation slides are available from the International Provenance and Annotation Workshop (IPAW), a biennial workshop concerned with issues of data provenance, data derivation and data annotation.
The ARDC has a playlist of data provenance videos.
