Australian Text Analytics Platform Launches

The Australian Text Analytics Platform (ATAP) is an open source platform with tools and training for researchers to analyse, process and explore text.

Every day, we are bombarded with words from social media platforms, news sites, archives and more, providing an incredible volume of data about the world. Large volumes of text can reveal insights into politics, radicalisation, discrimination, history and the impacts of climate change, provided researchers have the right digital research tools and skills.

The Australian Text Analytics Platform (ATAP) launches today, providing an open source platform with tools and training for researchers to analyse, process and explore text. It provides Australian researchers with access to an ecosystem of data and code repositories, online workspaces, scripts, and training in text analytics.

Text analytics enables data-driven research by extracting and analysing machine-readable information from unstructured text. As large amounts of unstructured text, such as posts from Twitter, become increasingly available, these techniques are becoming more important across diverse research disciplines.

The platform is accessible to researchers with a broad range of experience and skills (including beginners) and across a range of disciplines. The ATAP team supports researchers through hands-on workshops, online training modules and online office hours, as well as advice and collaboration in selected partnerships.

ATAP can work with existing archives, such as the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) and the Australian National Corpus, to provide easier access to the content in their collections.

Professor Michael Haugh from the University of Queensland (UQ), who leads ATAP, said, “ATAP aims to enable a wide range of researchers in the humanities and social sciences to engage with large text-based datasets using computational methods. With the support of ARDC and the important contributions of our wonderful partners at UQ, University of Sydney and AARNet, we have developed the technological foundations for this platform and a series of notebooks that aim to make text analytics more accessible. 

“We hope that this work will enable Australian researchers to use new methods and to develop their skills in text analytics. We look forward to continuing to contribute to the evolution of the research culture in Australia, particularly as we link ATAP with the data-centred activities of the Language Data Commons of Australia (LDaCA) project.”

Dr Andrew Treloar, Director of Platforms and Software at the ARDC, which co-invested in ATAP, said, “Reducing the barriers to entry is critical to increasing the uptake of innovative digital research tools. ATAP enables humanities, arts and social science researchers to focus on the research questions instead of the infrastructure challenges, and will accelerate research and innovation. The ARDC looks forward to seeing the seeds of the ATAP project and the planned LDaCA integration produce a wonderful harvest.”

Putting Text on the Map

One of the tools in ATAP enables researchers to assign geographic coordinates to place names mentioned in text.

The tool was developed in collaboration with Fiannuala Morgan, a PhD student at The Australian National University and a librarian at the National Library of Australia. It builds on Fiannuala’s current research, which uses digital mapping software to analyse 19th-century Australian fiction.

Fiannuala explained that digital research tools may appear to require little human engagement. For example, you might think a researcher simply inputs place names and gets geographic coordinates as output. However, it is not that simple or automated.

“The ATAP tool demonstrates the complexity of a task that seems as superficially simplistic as generating coordinates. But the task we set ourselves was to look at generating coordinates on an international scale, not just in Australia, which is a very challenging process that requires a lot of consideration,” said Fiannuala.

The ATAP Geolocation tools that Fiannuala and the ATAP team created enable researchers to identify place names in historical documents and assign both international and national coordinates. The tools build on a program Fiannuala developed to disambiguate the locations of bushfires mentioned in 5000 newspaper articles and over 300 serialised stories in Australian newspapers in the 1800s. The ATAP Geolocation tools notebook provides a semi-automated approach to geolocation. Unlike proprietary tools such as Google Maps, which provide little information on the source data, the notebook gives researchers control over parameters, which helps them interrogate text data.
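To illustrate why assigning coordinates is only semi-automated, here is a minimal Python sketch of the general idea. It is not the ATAP notebook's code: the toy gazetteer, the `Candidate` type, the `geolocate` function and its `prefer_country` parameter are all hypothetical, and the coordinates are approximate values chosen for illustration. The sketch looks up place names in a text, picks a candidate only when the choice is unambiguous, and flags everything else for a researcher to review.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One possible location for a place name (hypothetical structure)."""
    name: str
    country: str
    lat: float
    lon: float


# Hypothetical toy gazetteer with approximate, illustrative coordinates.
# A real gazetteer would hold many more names and candidates per name.
GAZETTEER = {
    "Richmond": [
        Candidate("Richmond", "AU", -37.823, 144.998),
        Candidate("Richmond", "GB", 51.461, -0.304),
    ],
    "Geelong": [Candidate("Geelong", "AU", -38.147, 144.361)],
}


def geolocate(text, gazetteer=GAZETTEER, prefer_country=None):
    """Return (name, chosen candidate or None, needs_review) per name found.

    prefer_country is a researcher-controlled parameter: when set, only
    candidates in that country are considered. A name is resolved only if
    exactly one candidate remains; otherwise it is flagged for review.
    """
    results = []
    for name, candidates in gazetteer.items():
        # Match the name as a whole word so "Richmond" does not match "Richmonds".
        if not re.search(rf"\b{re.escape(name)}\b", text):
            continue
        if prefer_country is not None:
            candidates = [c for c in candidates if c.country == prefer_country]
        if len(candidates) == 1:
            results.append((name, candidates[0], False))
        else:
            results.append((name, None, True))
    return results


if __name__ == "__main__":
    sample = "The fire spread from Geelong towards Richmond."
    for name, choice, needs_review in geolocate(sample, prefer_country="AU"):
        print(name, choice, "REVIEW" if needs_review else "ok")
```

Without `prefer_country`, "Richmond" has two candidates and is flagged for manual review rather than silently resolved, which is the kind of decision the quoted remarks describe as needing consideration.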

This image depicts the volume of mentions of bushfires in 19th-century articles. The first peak in the graph corresponds to ‘Black Thursday’ (1851), arguably the most significant fire disaster in settler history. The second peak corresponds to ‘Black Monday’, a disaster currently omitted from cultural histories of Victoria (and settler Australia generally). The image was created for Fiannuala Morgan’s research without the ATAP tool, but it shows how the ATAP Geolocation tool could be used. Credit: Fiannuala Morgan, fiannualamorgan.com.

Powerful Visualisation Tool for Text Data

ATAP contains a number of tools for analysing text, including an open source version of Discursis, a powerful visualisation tool for text data. Discursis is a communication analytics technology that allows a user to analyse text-based communication data, such as conversations, web forums and training scenarios.

The graphic below illustrates a debate between former Prime Ministers Kevin Rudd and Tony Abbott held at the National Press Club on 11 August 2013. 

The boxes on the diagonal represent the speaker turns, and the boxes off the diagonal represent the conceptual similarity between each pair of turns. A heavily populated column therefore means that the topics in a turn also appeared in many following turns, and a heavily populated row means that a turn picked up topics from many preceding turns.
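The layout of such a plot can be sketched in a few lines of Python. This is an illustration only and not how Discursis computes conceptual similarity: as a stand-in, it uses plain bag-of-words cosine similarity, and the function names `bag_of_words`, `cosine` and `recurrence_matrix` are hypothetical. Cell (row, column) below the diagonal holds the similarity between a later turn (the row) and an earlier turn (the column), so reading down a column shows following turns and reading along a row shows preceding turns.

```python
import math
import re
from collections import Counter


def bag_of_words(turn):
    """Count lower-cased word tokens in one speaker turn."""
    return Counter(re.findall(r"[a-z']+", turn.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recurrence_matrix(turns):
    """Build a lower-triangular turn-by-turn similarity matrix.

    matrix[later][earlier] is the similarity between two turns; cells above
    the diagonal stay 0.0 because each pair is recorded only once.
    """
    vectors = [bag_of_words(t) for t in turns]
    n = len(turns)
    matrix = [[0.0] * n for _ in range(n)]
    for later in range(n):
        matrix[later][later] = 1.0  # diagonal cell stands for the turn itself
        for earlier in range(later):
            matrix[later][earlier] = cosine(vectors[later], vectors[earlier])
    return matrix


if __name__ == "__main__":
    turns = [
        "The economy needs jobs",
        "Jobs and the economy matter",
        "Climate policy first",
    ]
    for row in recurrence_matrix(turns):
        print(" ".join(f"{value:.2f}" for value in row))
```

In this toy exchange the second turn shares words with the first, so the cell in row 2, column 1 is non-zero, while the third turn introduces a new topic and its row stays empty apart from the diagonal.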

Discursis was originally developed by Dan Angus, Janet Wiles and Andrew Smith. This open source version was engineered by Marius Mather of the Sydney Informatics Hub (University of Sydney) on behalf of ATAP.

A visualisation created using Discursis. It illustrates a debate between former Prime Ministers Kevin Rudd and Tony Abbott held at the National Press Club on 11 August 2013.

Graduate Digital Research Fellowship Program

Also announced today is the ATAP-linked Graduate Digital Research Fellowship, which will run in the first part of 2023. This program provides an exciting opportunity for junior scholars who want to improve their knowledge of digital research methods and incorporate them into their research. Learn more about the ATAP Graduate Digital Research Fellowship Program. Applications close on 30 November.

Join ATAP for a day of activities on Tuesday 29 November at the Australian Linguistic Society Conference, and see more upcoming ATAP workshops and events.

The Australian Text Analytics Platform (ATAP) project is a partnership between the University of Queensland, the University of Sydney and AARNet. It received investment (https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS).