Language Data Commons of Australia

Rescuing vulnerable language collections.

The Challenge

Australia is a massively multilingual country, in one of the world’s most linguistically diverse regions. Significant collections of this intangible cultural heritage have been amassed, including collections of Australian Indigenous languages, regional languages of the Pacific, and Australian English.

There are also language collections important for cybersecurity (AusTalk, Australian National Corpus, corpora of regional languages), for gauging popular sentiment (Australian Twitter Corpus), and for emergency communication (languages of the region and some Indigenous languages).

However, much of Australia’s language data is scattered, hard to find, and in danger of being lost. Many collections remain under-used, and researchers lack the tools and skills to realise their research potential.

The Response

We’re establishing the Language Data Commons of Australia (LDaCA), an integrated national infrastructure that supports language research. It will enable researchers and communities to access and use nationally significant collections of written, spoken, multi-modal and signed text.

The project will:

  • improve researchers’ digital skills and raise awareness of best practice in digital research
  • render valuable collections of national significance more findable, accessible, interoperable and reusable (FAIR) while adhering to CARE principles
  • develop the integrated national technical infrastructure to analyse language collections at scale.

It will support researchers to deliver innovative research outcomes, and will open up the social and economic possibilities of Australia’s language data for translational research in the national interest.

We will:

  • address the challenge of balancing research needs with community rights over language and cultural collections
  • highlight contributions that language research and HASS disciplines can make to STEM research and non-academic applications
  • position Australia internationally as a leading contributor of language collections and digital infrastructure.

Who Will Benefit

Establishing the LDaCA will give researchers more widespread access to Australia’s rich language resources, accelerating the development of language data analysis capability in Australian research and industry.

The Partners

The LDaCA is supported by 3 ARDC programs.

Our partners are:

  • The University of Queensland (lead)
  • Australian National University
  • Monash University
  • The University of Melbourne
  • The University of Sydney
  • AARNet
  • First Languages Australia
  • Australian Institute for Aboriginal and Torres Strait Islander Studies
  • PARADISEC
  • ARC Centre of Excellence for the Dynamics of Language
  • Digital Observatory (QUT)
  • CLARIN

Target Outcomes

The LDaCA will be a sustainable long-term repository for language data collections of national significance. This has implications for the development of Australia’s economy, national security and social and cultural well-being.

Timeframe

November 2022 to June 2024

Current Phase

In progress

ARDC Co-investment

$1,933,000

Project lead

Professor Michael Haugh, School of Languages and Cultures, The University of Queensland