A Language Data Commons for Australia

Language data is an important source of information for revitalising Indigenous language in Australia. New digital research infrastructure backed by the ARDC is being developed to support it.
Banduk Marika and Ernie Dingo from the TV series Talking Language with Ernie Dingo. Made by CAAMA Productions for Imparja Television, distributed by Ronin Films

2022 marks the beginning of the International Decade of Indigenous Languages, declared by the United Nations General Assembly to draw attention to the critical status of many Indigenous languages across the world and encourage action for their preservation, revitalisation and promotion.

The international attention on Indigenous languages comes at a critical time for Australia: most of the 250 Aboriginal and Torres Strait Islander languages, plus around 800 dialects, are in steep decline, triggered by the systematic separation of people from their language through colonisation.[1] The National Indigenous Languages Surveys record the decline, said Ms Christie Wishart, a researcher at Central Queensland University. Ms Wishart is Ngugi from Quandamooka Country (Grandmother) and Nalbo from Jinibara Country (Grandfather).

“Since 2005, despite government funding and increased language programs, we continue to lose languages. We have lost 8 strong languages between 2005 and 2019,” she said.

“I hear the devastation when I’m talking to people who say, ‘I don’t know who my mob is, I don’t know where I come from, I don’t know my language.’

“It is devastating to think that we’re losing something that could genuinely close the gap in a really big way.”

Language and Wellbeing Are Linked

Ms Wishart is researching the connection between First Nations language and wellbeing as part of a higher degree by research.

“There are known links between the loss of First Nations’ language and the resulting disadvantage and reduced wellbeing within First Nations’ communities,” she said. “Language is integral to our identity and a strong identity has positive effects on health and wellbeing.”

The benefits of language are already well known to First Nations communities, but Indigenous-led research by Ms Wishart will contribute to understanding the impact of language on wellbeing at a broader level.

Revitalising and Reawakening Language

On a positive note, the reclamation and revitalisation of First Nations language is growing across Australia, and this is strengthening identity, culture, health and wellbeing in Indigenous communities. Its importance has been recognised in the recent National Agreement on Closing the Gap: Target 16 aims for a sustained increase, by 2031, in the number and strength of Aboriginal and Torres Strait Islander languages being spoken.

According to the National Indigenous Languages Report (2020), “Many Aboriginal and Torres Strait Islander people are actively seeking ways to reconnect with traditional languages. This is painstaking work, but in parts of the country some languages are being reawakened, demonstrating what is possible with community will and ongoing investment.”

Warlpiri translator Theresa Napurrurla Ross with granddaughter Bethalia Kelly. We know that educational outcomes improve when children are taught in their first language, especially in the early years. Image: AIATSIS

Language Data, a Crucial Piece of the Revitalisation Puzzle

Language data is an important source of information for Indigenous language revitalisation, and new digital research infrastructure is being developed with ARDC backing to support it.

Language data includes audio and video recordings of people speaking, and written text, from entire newspapers to tweets. It’s used for linguistic research into pronunciation, syntax, semantics, how language is used, how language and language use change over time, and how language varies across social groups.

Large collections of language data have been amassed in Australia by several institutions. AIATSIS (Indigenous Australian languages) and PARADISEC (South West Pacific languages and Indigenous Australian languages) are among the larger ones.

But much of Australia’s language data is scattered, hard to find, and in danger of being lost.

Now, with co-investment from the ARDC, the Language Data Commons of Australia is being developed by 17 partner institutions as a sustainable long-term resource for language data collections of national significance. Capitalising on existing infrastructure, it will secure vulnerable and dispersed collections, and link with improved analysis environments for new research outcomes.

The Language Data Commons will work like an online portal, with a search function that directs users to the institutions that hold the relevant data. For collections that are at risk of being lost, it will provide a pathway to repositories that will ingest and curate them for the long term. As well as the data collections, it will provide access to tools to analyse the data.

While the Commons is for all languages used in Australia and our region, Australian First Nations languages are at its core.

The Commons project will also aim to strengthen First Nations languages through community outreach and education, and Indigenous-led research on First Nations languages.

To date, little research has focused on language reclamation and its effects on the mental health and wellbeing of First Nations Peoples.

Ms Wishart said, “There is only one study I have identified that is Indigenous-led and uses Indigenous methodologies [and] its main focus is on language reclamation and the impacts on wellbeing of First Nations Peoples.”

Language Data for Non-Linguists

An important part of the project is working with First Nations communities to ensure the responsible sharing of data and tangible benefits for language speakers alongside researchers.

Two industry fellows at the University of Queensland will be a direct link between the project and Aboriginal and Torres Strait Islander communities.

Robert McLellan is a Gooreng Gooreng man and Industry Fellow at the University of Queensland working on the Language Data Commons.

In his role he will be “upholding Indigenous interests in the process and seeing that Aboriginal voices are heard and that we are actively engaging with people in the community.

“In the bigger picture, we’re working towards this digital catalogue that will make these language resources more accessible to everyone. But there’s a large piece of that puzzle that needs to be culturally informed, culturally appropriate, and we need to see that Indigenous peoples are benefitting from this initiative too,” added Mr McLellan.

Some of the Language Data Commons team from the School of Languages and Cultures at the University of Queensland. The team includes over 20 members from across the project partners. (L-R) Simon Musgrave, Alvin Sebastian, Marco Fahmi, Ben Foley, Peter Sefton, Martin Schweinberger. Image: Marc Grimwade/ARDC

Language Data Is Dispersed and “Hidden in a Vault”

On his journey to strengthen his Gooreng Gooreng language, Mr McLellan described his frustration with trying to find and access language data:

“You find your language and it’s hidden in a vault, and the vault might be a ‘sketch grammar’ or other linguistic papers. And if you don’t know linguistics, it’s so frustrating to know you can hold your language in your hand, but you can’t understand it. You don’t know what all of this jargon is about,” said Mr McLellan.

This frustrating experience is echoed in communities Mr McLellan has spoken with who are seeking language data.

The Language Data Commons will be a valuable source of information for those seeking to revitalise language. It will not only be a registry, but will also provide tools to analyse the data for research, and facilitate training on how to use and understand linguistic data.

“Till now, finding language data has really relied on talking to people. And if you weren’t connected with academics, well then you would never know,” said Mr McLellan.

“With the Language Data Commons, if you apply the right search, you’ll be able to access that material, but you’ll also be able to access anthropological materials and other papers that contain less linguistic terminology that can help you get a better understanding about your language.

“With the Language Data Commons, the only barrier that exists is Internet access, which is just one barrier as opposed to the plethora that existed before.”

With the UN spotlight on the importance of Indigenous languages over the coming decade, the Language Data Commons has the potential to strengthen language revitalisation efforts in Australia, and contribute to closing the gap, one dataset at a time.


ARDC Support

The Language Data Commons received co-investment from the ARDC through these programs:

Datasets of national significance managed by project partners PARADISEC and CoEDL are being supported by the ARDC Data Retention Project.

The Language Data Commons is using these ARDC services:


  1. Sivak et al. (2019), doi.org/10.3390/ijerph16203918.

Written by Jo Savill, ARDC. Edited by Mary O’Callaghan. Reviewed by Prof Michael Haugh, Robert McLellan, Christie Wishart, Rowan Brownlee, Catherine Brady, Jenny Fewster, Dr Adrian Burton, Dr Andrew Treloar, Adelle Coote, Ian Duncan, Rosie Hicks