The Challenge
Australia is a massively multilingual country, in one of the world’s most linguistically diverse regions. Significant collections of this intangible cultural heritage have been amassed, including collections of Aboriginal and Torres Strait Islander languages, regional languages of the Pacific, Australian English and migrant languages, and sign languages of Australia and its region.
There are also language collections important for cybersecurity (AusTalk, Australian National Corpus, corpora of regional languages), for gauging popular sentiment (Australian Twitter Corpus) and for emergency communication (languages of the region and some Indigenous languages).
However, much of Australia’s language data is scattered, hard to find and in danger of being lost. Many collections remain under-utilised, and researchers lack the tools and skills to exploit their research potential.
The Response
We’ve established the Language Data Commons of Australia (LDaCA), an integrated national infrastructure that supports language work and language research. It enables researchers and communities to access and use nationally significant collections of written, spoken, multi-modal and signed language.
The project is:
- improving researchers’ digital skills and raising awareness of best practice in digital research
- rendering valuable collections of national significance more findable, accessible, interoperable and reusable (FAIR) while adhering to CARE principles
- developing the integrated national technical infrastructure to analyse language collections at scale.
It supports researchers to deliver innovative research outcomes and opens up the social and economic possibilities of Australia’s language data for translational research in the national interest.
The Australian Text Analytics Platform (ATAP) is also part of the Language Data Commons of Australia.
Phase 2 – June 2024 to June 2028
In this next phase, LDaCA will:
- develop the social and technical foundations for a national, distributed archival repository of language materials
- continue securing vulnerable and nationally significant collections of Aboriginal and Torres Strait Islander languages, Indigenous languages in Australia’s Pacific region, varieties of Australian English and migrant languages, and sign languages of Australia and its region
- continue to develop the LDaCA data portal for accessing and repurposing language data of significance to researchers and communities, including data that is held in galleries, libraries, archives and museums (GLAM)
- establish workflows that link repositories and analytics environments so that researchers can create fully described, reproducible research on written, spoken, multimodal and signed language
- provide training and develop resources for researchers and communities that support best practice in archiving, sharing, accessing and analysing language data in line with FAIR and CARE principles.
Outcomes
LDaCA has produced two important pieces of research infrastructure:
- The LDaCA data portal makes language data findable and accessible for researchers and communities.
- The Australian Text Analytics Platform (ATAP) provides programmatic interfaces, training materials and access to cloud computing services for researchers to develop their own text analytics as standalone and publishable digital research outputs.
LDaCA is an archival repository for language data collections of national significance. Securing these collections has implications for the development of Australia’s economy, national security, and social and cultural well-being. LDaCA’s efforts focus on the sustainability of data, and on offering tools and training for the collection and analysis of language data. Achievements include:
- developing policies and governance structures for long-term data storage and access
- developing a technology stack, which enables secure storage and provides a basis for tools and services, now and in the future
- establishing relationships with various communities to encourage sustainable data management and data (re)use practices
- developing notebooks that enable researchers to learn how to apply text analytics to their own data or collections held in LDaCA.
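As an illustration of the kind of introductory text analytics such notebooks might cover, the following is a minimal sketch of a word-frequency count. It is not taken from an LDaCA or ATAP notebook: the function name `word_frequencies`, the simple tokenisation rule and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent lower-cased word tokens in text.

    Hypothetical helper for illustration only; real notebooks are likely
    to use richer tokenisation and language-specific handling.
    """
    # Lower-case the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Counter tallies each token; most_common returns (token, count) pairs.
    return Counter(tokens).most_common(top_n)


# Sample text invented for this sketch.
sample = "The language data is scattered. The data is hard to find."
top = word_frequencies(sample, top_n=3)
print(top)
```

The same function could be pointed at a researcher's own text files by reading each file into a string first. The regular expression only matches unaccented Latin letters, so it would need to be adapted for most of the languages held in the collections.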
To date, LDaCA has:
- given 25 conference presentations
- presented over 40 workshops, reaching nearly 1000 people
- secured 25 datasets and built 24 data migration tools
- created 75 software repositories, including public tools such as an RO-Crate profile, a metadata vocabulary and Crate-O, a GUI tool for working with those resources
- engaged with 8 Indigenous communities/organisations in the development process.
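For readers unfamiliar with RO-Crate, the packaging format mentioned in the list above, the sketch below builds a minimal `ro-crate-metadata.json` document following the general RO-Crate 1.1 layout: a metadata descriptor entity plus a root dataset entity. It does not reproduce the LDaCA RO-Crate profile or its metadata vocabulary. The function name `build_minimal_crate` and all field values are assumptions made for illustration.

```python
import json


def build_minimal_crate(name, description, date_published, license_url):
    """Build a minimal RO-Crate 1.1 metadata document as a Python dict.

    Hypothetical helper for illustration; an LDaCA-conformant crate would
    carry additional, profile-specific properties not shown here.
    """
    # Metadata descriptor: identifies this file and points at the root dataset.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # Root dataset: describes the collection as a whole.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": license_url},
    }
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root],
    }


# Placeholder values invented for this sketch, not a real collection.
crate = build_minimal_crate(
    name="Example language collection",
    description="Placeholder description of a small text collection.",
    date_published="2024-06-01",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
print(json.dumps(crate, indent=2))
```

Writing the resulting dictionary to a file named `ro-crate-metadata.json` in the top-level folder of a collection is what turns that folder into a crate. Tools such as Crate-O provide a graphical way to edit this kind of metadata.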
In Phase 2 (June 2024 to June 2028), LDaCA will continue to grow to:
- address the challenge of meeting research needs while respecting community rights for language and cultural collections
- highlight contributions that language-based research and HASS disciplines can make to STEM research and non-academic applications.
LDaCA has built an integrated national technical infrastructure for language data. By creating this foundational infrastructure, it is also contributing to the success and impact of the HASS and Indigenous Research Data Commons (RDC), and positioning Australia internationally as a leading contributor of language collections and digital infrastructure.
Learn more on the LDaCA website.
Who Will Benefit
LDaCA gives researchers better access to Australia’s rich language resources, accelerating the development of language data analysis capability in Australian research and industry.
The Partners
LDaCA is part of the ARDC’s HASS and Indigenous Research Data Commons. It previously received support from the ARDC through:
- Australian Data Partnerships – read the report
- Platforms Program.
Our partners in Phase 2 (2024-2028) (doi.org/10.3565/kq2v-9g52) are:
- The University of Queensland (lead)
- The ARDC
- Australia’s Academic and Research Network (AARNet)
- Australian National University
- Batchelor Institute of Indigenous Tertiary Education
- First Languages Australia
- QUT’s Digital Observatory (QUT)
- The University of Melbourne
- The University of Sydney
- University of Western Australia
Our partners in Phase 1 (2021-2024) (doi.org/10.47486/HIR001) were:
- The University of Queensland (lead)
- The ARDC
- Australian National University
- Monash University
- The University of Melbourne
- The University of Sydney
- AARNet
- First Languages Australia
- PARADISEC
- ARC Centre of Excellence for the Dynamics of Language
- QUT’s Digital Observatory (QUT)
- CLARIN
Further Resources
- Visit the LDaCA website
- Visit the LDaCA data portal
- Visit the Australian Text Analytics Platform (ATAP)
- Join an upcoming workshop or event
- Access LDaCA resources
- Read blog posts from the LDaCA team
- Subscribe to the LDaCA newsletter
- Follow LDaCA on LinkedIn or X
- Read the Project Plan for LDaCA Phase 2 (2024 to 2028)
- Read the Response to Project Plan Feedback for the Language Data Commons of Australia
Related Articles
- Securing Voices of Country
- HASS and Indigenous Research Data Community Exchange Knowledge at Annual Symposium
- Draft Project Plans for the HASS and Indigenous Research Data Commons Now Open for Feedback
- Summer School Shares Computational Skills for HASS and Indigenous Research
- Collections as Data in Australia
- Implementing Indigenous Data Licensing and Access: Empowering Communities and Upholding Cultural Rights
- Empowering HASS and Indigenous Researchers with Essential Computational Skills
- Advancing HASS and Indigenous Research Infrastructure: A Symposium
- Australian Text Analytics Platform Launches
- “Bringing Data to Life: Co-Designing a Language Data Commons” Recap
- Announcing Successful Projects for the ARDC HASS Research Data Commons and Indigenous Research Capability Program
- A National Language Data Commons for Australia