Data Versioning

What is data versioning?

A new version of a dataset is created when an existing dataset is reprocessed, corrected or extended with additional data. Data versioning helps track changes to dynamic data, that is, data that changes over time.

Why is data versioning important?

Researchers are required to identify and cite the exact dataset used as a research input in order to support research reproducibility and trustworthiness. This requires good management of data and data revisions, and becomes particularly challenging when the datasets being cited are constantly changed and revised.

For a background on the importance of data versioning and recent developments, read the ARDC article Data Versioning — an epic chapter of a long story. The concept and benefits are also well summarised in W3C’s Data on the Web Best Practices guide.

Data versioning standards

There is no agreed standard or recommendation among data communities as to how and when data should be versioned. Some data providers may not retain a history of changes to a dataset, opting to make only the most recent version available. Other data providers have documented data versioning policies or guidelines based on their own discipline’s practice, which may not be applicable to other disciplines.

Work on best practice for data versioning across data communities is under way globally. The Research Data Alliance Data Versioning working group has come up with guidelines that address the following aspects of data versioning:

  • Revision (version control)
  • Release (data products)
  • Granularity (aggregates, composites, collections and time series)
  • Manifestation (data formats and encodings)
  • Provenance (derived products)

Data versioning tools

There is no one-size-fits-all solution for data versioning and tracking changes. Data comes in different forms and is managed by different tools and methods. In principle, data managers should take advantage of data management tools that support versioning and track changes.

Example approaches include:

  • Git (and GitHub) for data (with size <10Mbit or 100,000 rows), which offers distributed collaboration, provenance tracking, and a simple way to share updates and synchronise datasets
  • Data versioning in ArcGIS, where users can create a geodatabase version derived from an existing version.
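
Where no versioning tool is in place, a content fingerprint is one simple way to notice that a dataset has changed. The sketch below is illustrative only and is not how Git or ArcGIS implement versioning; the function names `fingerprint` and `read_chunks` and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from typing import Iterable, Iterator


def fingerprint(chunks: Iterable[bytes]) -> str:
    """Return a SHA-256 hex digest over the concatenated chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def read_chunks(path: str, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield a file's contents in fixed-size chunks so large files fit in memory."""
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk


# Hypothetical data: the same table before and after a row is appended.
original = [b"site,rainfall_mm\n", b"A,12.5\n"]
appended = original + [b"B,7.1\n"]

# Any change to the bytes yields a different fingerprint,
# which signals that a new version should be recorded.
changed = fingerprint(original) != fingerprint(appended)
```

For a file on disk, `fingerprint(read_chunks("data.csv"))` would give the same kind of digest (the file name is a placeholder). A fingerprint only shows that the data changed; it does not record what changed or why, so it complements rather than replaces a documented versioning policy.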

Citing versioned data

There is no universal way to cite versioned data. The form of the citation statement will depend on factors including publisher instructions, research domain and type of data. Citations to revisable datasets are likely to include version numbers or access dates.

DataCite recommends the following format for citing versioned data: Creator(s) (Publication Year): Title. Version. Publisher. Identifier.
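
As a rough illustration, the format above can be assembled from its parts with a small helper. This is a minimal sketch, not an official DataCite tool: the function name `format_citation`, the semicolon used to separate multiple creators, and all of the example values (including the DOI) are hypothetical.

```python
from typing import Sequence


def format_citation(
    creators: Sequence[str],
    year: int,
    title: str,
    version: str,
    publisher: str,
    identifier: str,
) -> str:
    """Build 'Creator(s) (Publication Year): Title. Version. Publisher. Identifier'."""
    # Assumption: multiple creators are joined with "; ".
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {version}. {publisher}. {identifier}"


# Hypothetical dataset details, for illustration only.
citation = format_citation(
    creators=["Doe, J.", "Nguyen, T."],
    year=2021,
    title="Example Rainfall Observations",
    version="Version 2.1",
    publisher="Example Data Centre",
    identifier="https://doi.org/10.1234/example",
)
print(citation)
```

Keeping the version as its own field makes it straightforward to issue a new citation string each time a dataset is revised.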
