A National Language Data Commons for Australia

With support from the Australian Research Data Commons (ARDC), a language data commons for Australia is being developed to create a sustainable long-term repository for language data collections of national significance.

More than 250 languages are spoken in Australia and a quarter of the world’s languages are spoken in the South West Pacific region.

Aboriginal and Torres Strait Islander languages and Australian English are seen as part of people’s cultural identity. Oral history is part of our cultural heritage.

“Often we think of Australia as a young, English-speaking country. Well, yes, but no. Australia is actually a massively multilingual country. We have Indigenous languages which have been spoken for 50, 60, 70, 80,000 years and some of the longest living cultures in the world. When we think of history, we think of Egypt being pretty old. But, actually, Australia’s language data is just staggeringly older than that,” said Prof Michael Haugh, Head of the School of Languages and Cultures at the University of Queensland.

Language data includes audio and video recordings of people speaking, and written text, from newspapers to tweets. It’s used for linguistic research into pronunciation, syntax, semantics, how language is used, how language and language use change over time, and how language varies across social groups, among other things.

It’s also relevant to the humanities and social sciences more generally, says Prof Haugh.

“If you want to find out about people’s attitudes to racism, for example, analysing how they talk about different groups of people is more enlightening than just asking them if they think they are a racist.”

Several institutions hold secure collections of language data — AIATSIS (Indigenous Australian languages) and PARADISEC (South West Pacific languages and Indigenous Australian languages) are some of the bigger ones. But much of Australia’s language data is scattered, hard to find, and in danger of being lost.

“If you’re from one particular Aboriginal community, and you’re wanting to revitalise your language, you need some data, some examples of the language to work with. Some data might be in a particular university library, some might sit in AIATSIS, and some might sit in PARADISEC. And there’s no way of knowing that. You just have to go to every individual collection and hope for the best,” said Prof Haugh.

Language data has been the responsibility of individual universities.

“People come and go in universities, so it’s not always stable. And if you lose language data, it’s gone forever because it captures particular people and particular moments at particular points in time, and it’s not recoverable.”

The idea for a national language data infrastructure to secure this valuable cultural heritage emerged from the academy of humanities, and was sparked by Nick Thieberger of PARADISEC and other linguists at the 2018 Humanities, Arts and Culture (HAC) data summit.

Later that year, Prof Haugh reached out to the ARDC, which he says “had a galvanising effect” and led to funding a pilot project for a national language data commons in 2019.

Developing a National Language Data Commons

Now, with further co-investment through the ARDC Data Partnerships Program, a consortium of AIATSIS, the ARC Centre of Excellence for the Dynamics of Language, ANU, the University of Melbourne, Monash University and the University of Queensland will jointly develop a national language data commons over the next couple of years.

A platform for analysing the data — the Australian Text Analytics Platform — will also be built, under the ARDC Platforms Program.

The data commons project builds on a foundation of great work done in the linguistics community over a long time, says Prof Haugh, but without the initial ARDC funding for the pilot, a great idea would likely have foundered.

“We had a meeting in Canberra at the National Library with 18 or 20 of us from all over the country — from universities, from ARDC, the National Library, AIATSIS. We spent a day talking through the roadmap and I think that’s the point where it really came together. Without the funding, we just couldn’t have done it.”

The funding also enabled a visit to CLARIN in the Netherlands, which is the gold standard in Europe for language data infrastructure.

The ARDC co-investments will see the project through to mid-2023.

“The ARDC support means we can hire a project manager and software engineers and it will help us to systematically go through the legal, moral and copyright issues that we need to consider and be careful about,” said Prof Haugh.

“Humanities traditionally has done things on a smaller scale, but I think people are seeing the value of forming research groups. And the language data commons is a good example of the linguistics community coming together on a common project to build something together.

“And language data is something that belongs to all of us. I think that’s a nice thing about what we’re doing.”

The national language data commons will work like an online portal, with a search function that directs users to the institutions that hold the relevant data.

“We’re not suggesting we grab all this language data and put it all in one place,” explains Prof Haugh. “Language data has to stay with the institution or community responsible for it.”
