Minding Our Language Data

Next time you speak with a person in their 80s, take a moment to consider how they speak. What words do they use? Is their grammar different to yours? Do you notice anything about their pronunciation?

Professor Catherine Travis from the Australian National University analyses changes in the way English is spoken in Australia over time. For example, she compares how people of different ages speak, from teenagers to octogenarians, as well as comparing how Australians speak today to how we spoke 40 years ago.

To conduct this research, Prof Travis and colleagues compiled the Sydney Speaks Corpus – a collection of recordings of Sydneysiders telling their stories. However, to understand what changes had occurred over several decades, they had to find old recordings.

“There was a set of recordings of Sydney residents speaking English in the 1970s and 80s created by a very famous linguist, Dr Barbara Horvath, which is cited in introductory linguistics textbooks around the world,” said Prof Travis. “I was communicating with Dr Horvath when she said ‘By the way, I’ve got all my cassettes of the recordings sitting in my garage. Do you know anybody who’s interested in that?’

“This is like a gold mine from a linguistic perspective. I was able to digitise them, and incorporate them into the Sydney Speaks Corpus.”

Many more such recordings are waiting to be discovered and documented. And because recordings like these often contain stories about people’s lives and experiences, they are useful not just to linguists, but also to historians, sociologists, anthropologists, and so on.

The benefits of being able to find and access language data are immeasurable. Not only can language data help researchers from many disciplines answer a multitude of questions without having to collect new data – but it is also crucial for Aboriginal and Torres Strait Islander peoples, who are revitalising hundreds of languages in Australia and the region.

[Image: Two people sitting on rocks looking at Sydney Harbour, including the Sydney Harbour Bridge and the city.]

Access to Language Data May Be Restricted

The Language Data Commons of Australia (LDaCA) was created, with ARDC co-investment and expertise, to make language data easier to find. It is a portal that points users to where relevant data is held. Making the “gold mines” of language data scattered across Australia findable is a crucial step towards making this data FAIR – findable, accessible, interoperable and reusable.

However, providing access to language data is not always straightforward. Language data is inherently identifiable and may contain sensitive information. For example, a recording of a conversation between friends will contain identifiable information and may include personal topics, and in this era of deepfakes it could be used for malicious purposes. To make the data accessible, LDaCA provides a way for users to request access to the data through its online portal.

“All personal data is identifiable, so it can’t always be made open,” said Dr Peter Sefton, the technical lead for LDaCA at The University of Queensland. “Research is conducted with many different access licences, meaning that some data can be reused, but a lot has specific restrictions.”

The Sydney Speaks Corpus, for example, contains data that was collected under a few different licences. It includes the NSW Bicentennial Oral History Project recordings from 1987-1988, which are freely accessible in their entirety through the National Library of Australia, as well as recordings from 2016 onwards that are more restricted due to the agreements with the participants about this data collection. LDaCA helps data stewards like Prof Travis manage the access restrictions in accordance with ethical, moral and legal obligations.

“The access conditions for each item in LDaCA are determined by the data steward, and are managed using an authorisation system,” said Dr Sefton. “The access can be a simple click-through licence, where you agree to licence terms, through to a detailed multi-step workflow where applicants are vetted based on criteria assigned by the rights holder, such as qualifications or membership of a cultural group. In some cases, there is a manual approval process.”
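The range of access conditions Dr Sefton describes can be pictured with a small Python sketch. This is only an illustration and not LDaCA's actual authorisation system: the names `AccessMode`, `Applicant`, `Item` and `decide` are hypothetical, and the rule that an applicant must hold every attribute the data steward lists is an assumption made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class AccessMode(Enum):
    """Hypothetical access tiers, loosely following the description above."""
    CLICK_THROUGH = "click_through"  # applicant only agrees to licence terms
    VETTED = "vetted"                # applicant must meet the steward's criteria
    MANUAL = "manual"                # a person reviews each request


@dataclass
class Applicant:
    name: str
    accepted_licence: bool = False
    # e.g. {"linguistics_qualification", "community_member"} (illustrative labels)
    attributes: set[str] = field(default_factory=set)


@dataclass
class Item:
    title: str
    mode: AccessMode
    # Criteria set by the data steward; assumed here to be all-required.
    required_attributes: set[str] = field(default_factory=set)


def decide(applicant: Applicant, item: Item) -> str:
    """Return 'granted', 'denied' or 'pending' for a single access request."""
    # Every tier in this sketch starts with agreement to the licence terms.
    if not applicant.accepted_licence:
        return "denied"
    if item.mode is AccessMode.CLICK_THROUGH:
        return "granted"
    if item.mode is AccessMode.VETTED:
        meets_criteria = item.required_attributes <= applicant.attributes
        return "granted" if meets_criteria else "denied"
    # MANUAL: the request waits for a human decision.
    return "pending"


if __name__ == "__main__":
    researcher = Applicant("R", accepted_licence=True,
                           attributes={"linguistics_qualification"})
    open_item = Item("Openly licensed recording", AccessMode.CLICK_THROUGH)
    restricted = Item("Restricted recording", AccessMode.VETTED,
                      required_attributes={"community_member"})
    print(decide(researcher, open_item))
    print(decide(researcher, restricted))
```

In this sketch the data steward controls the outcome simply by choosing the item's mode and criteria, which mirrors the idea that access conditions are set per item rather than for the whole collection.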

[Image: A map of the Sydney area with colourful bubbles indicating the number of speakers recorded for the Sydney Speaks collection of recordings.] The Sydney Speaks project brings together recordings of 250 speakers from 3 collections of spoken language. The size of each bubble on this map corresponds to the number of speakers. Source: Sydney Speaks.

Giving Access to Communities, Not Just Researchers

In Australia, the Australian Access Federation (AAF) mediates access to hundreds of national and international research platforms and resources using the researcher’s institutional ID and password.

However, since access is based on institutional IDs, a researcher who changes institutions or retires loses access because their email changes. Also, language data is not only useful for researchers working in research institutions – communities might be searching for language data for revitalisation projects or as a cultural record. A different identity authentication system was needed for recordings in LDaCA.

The LDaCA team worked with AAF to integrate CILogon, a system that enables people outside traditional research institutions to create authenticated identities using social logins such as Google and GitHub and access the data they need.

Taking the Stories Beyond Linguistics

Information about the Sydney Speaks Corpus and other language collections is now in LDaCA, and linguists, historians and community groups can request access to the hundreds of hours of language recordings.

“It would be such a waste for those stories to stop at the analysis of the vowels,” said Prof Travis. “That’s what gets us linguists excited, but there’s much, much more to offer.”

Learn more about the Language Data Commons of Australia.

Are you attending eResearch Australasia 2023? Learn more about the access system for LDaCA on Thursday, 19 October 2023 at 2:20 pm (AEST) in the session “Implementing a FAIR and CARE compliant access-control system for cultural data, including Indigenous and other data.”

Join the launch of the Language Data Commons of Australia (LDaCA) at the Australian Linguistics Society conference (ALS 2023) in November 2023.

This project received co-investment (doi.org/10.47486/DP768 and doi.org/10.47486/HIR001) and expertise from the ARDC. It is led by The University of Queensland in partnership with AARNet, ARC Centre of Excellence for the Dynamics of Language, Australian National University, First Languages Australia, Monash University, The University of Melbourne and The University of Sydney.