For researchers, it is not always evident whether machine learning (ML) can be useful for their work, at least not without some understanding of how it works and how it can be applied.
A striking recent example is Cryofilter, a deep learning solution for protein structures, developed at HealthHack 2020. Gavin Rice, a PhD student at the University of Queensland’s Institute for Molecular Bioscience, works with datasets containing millions of particle images. He needs to sort the good images from the bad ones, a largely manual process that takes weeks. The good images are then used to construct 3D cellular models for drug and vaccine discovery.
An introductory course on deep learning at Monash University was enough for Gavin to understand what might be possible and to frame his problem for software developers at HealthHack. They used ML to train Cryofilter to distinguish good images from bad in a dataset of over half a million images, with spectacular speed and accuracy. The open-source code is now available to researchers worldwide.
To accelerate the adoption of ML techniques among Australian researchers, Monash University and the University of Queensland (UQ) have come together to build environments that support ML at scale, provide access to targeted training, and support a national community of practice.
The 2-year Environments to Accelerate Machine Learning-Based Discovery project, which is funded under our new platforms initiative, engages strongly with researchers and collaborates with industry.
The concept was sparked by an ARDC discovery project in early 2019 that surveyed researchers in Australia and New Zealand about the challenges of using ML techniques.
Komathy Padmanabhan is one of the project’s principal investigators and the Principal Lead of Data Science and AI Platform at Monash University.
“We knew from the pattern of activities running on the machines at our MASSIVE High-performance Computing Centre that there has been a surge in ML usage”, she says. “We wanted to understand what challenges researchers are facing, either researching ML or using it.”
The survey, jointly run by Monash and UQ, revealed 4 main challenges:
- Data availability and access – Despite this being the era of big data, the availability and accessibility of data remain a challenge; for example, health data generated by hospitals is often unavailable due to privacy concerns.
- Computing environments – Because high-performance computers are expensive, they are shared, and there is always a waiting period.
- Tools – Researchers generally lack awareness of what tools and libraries are available and suitable, and of how to set them up on compute environments.
- Skills – Researchers are not sure how to apply advanced ML techniques in their research and are not aware of the key software packages.
As part of the project, Monash University and UQ (in partnership with QCIF) have already delivered 140 hours of ML training to more than 900 participants from 22 organisations. Five new courses on ML, deep learning and natural language processing have been developed so far and published under a Creative Commons licence, and the team has been providing ‘train the trainer’ sessions so that other organisations can adapt the material for their own users.
Machine learning resources and community
The ML4AU portal, launched at the ARDC eResearch and Data Skills Summit in October 2020, is the repository for all project resources: not just training material but also scripts, a tool library, a data library and a calendar of ML training events. The portal also makes the resources FAIR (findable, accessible, interoperable and reusable).
“ML4AU is freely accessible to anyone—contributed to by everyone and leveraged by everyone”, says Padmanabhan.
The Machine Learning Community of Practice (CoP), to be launched on 15 April 2021, is an opportunity for anyone interested in ML for research to collaborate on emerging needs in ML capabilities and expertise.
“There is a general consensus that we need avenues for eResearch interest groups—data science, machine learning and artificial intelligence—to come together and discuss specific interests, challenges and opportunities together”, says Padmanabhan.
The CoP is open to user communities who apply or intend to apply ML to their research, irrespective of their expertise levels; practitioner communities who have hands-on expertise in ML and can help others apply it; training groups and volunteers interested in developing and delivering ML training; and infrastructure providers who host specialist infrastructure, such as high performance computing.
Led by the ARDC, the CoP will host meetings covering topics such as data curation and pre-processing, tools and libraries, skills, training, computing environments, and public testing and benchmarking datasets. It also aims to host events (training, showcase, networking), symposiums and discussion forums.