Baseline Researcher Access to Public Sector Data
Graphics processing units (GPUs) are an expensive resource and need to be utilised efficiently. Currently, GPU allocation for machine learning (ML) in research cloud and high-performance computing systems offers no elasticity or flexibility. The current model of allocating GPUs through virtual desktops for ML services presents these issues:
Secondly, as deep learning models become more complex and data sets grow, simple ‘standard’ techniques no longer scale to meet the challenge. For example, large data sets often do not fit in the limited system memory and must be split into smaller batches for processing across multiple central processing units (CPUs) and GPUs, often in parallel. Transferring large volumes of data between those CPUs and GPUs is itself a frequent source of inefficiency.
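The batching idea described above can be illustrated with a minimal sketch. It assumes an in-memory list standing in for a data set, and the helper names `iter_batches` and `assign_round_robin` are hypothetical, not part of the project's software. A real pipeline would stream batches from disk and move them to devices through an ML framework.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def assign_round_robin(
    batches: Sequence[Sequence[T]], num_workers: int
) -> List[List[Sequence[T]]]:
    """Assign batches to workers (for example, GPUs) in round-robin order."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    assignment: List[List[Sequence[T]]] = [[] for _ in range(num_workers)]
    for index, batch in enumerate(batches):
        assignment[index % num_workers].append(batch)
    return assignment


# Hypothetical data set of ten samples, split into batches of four
# and shared between two workers.
samples = list(range(10))
batches = list(iter_batches(samples, batch_size=4))
per_worker = assign_round_robin(batches, num_workers=2)
```

Smaller batches reduce the memory needed per device but increase the number of transfers, which is one reason the transfer cost mentioned above matters.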
To address the first concern, this project aims to develop and implement a national ML service that can make efficient use of GPUs in high-performance computing (HPC) clusters and the national ARDC Nectar Research Cloud. The service will be implemented on infrastructure at Monash University, The University of Queensland and the Queensland Cyber Infrastructure Foundation (QCIF), including dedicated GPU infrastructure at the Monash and QCIF nodes of Nectar.
To address the second concern, the project will develop training materials and courses specifically for researchers with high-end requirements for very large datasets and computational power. By developing and sharing profiling expertise with the researchers, the project will enable them to navigate the challenges of deploying deep learning in a diverse research environment and ensure that the necessary infrastructure and operational support are in place to achieve this goal.
In line with the ARDC strategy, this project will provide leading-edge infrastructure and enhance provision of services for the national ARDC Nectar Research Cloud. It will also feed into service provision for the ARDC’s Platforms projects and Thematic Research Data Commons (Thematic RDCs). The consultations and requirements for those programs of work will be reviewed to ensure that they can use the ML service.
The project is being undertaken in 3 phases. The high-level outcomes of each phase are as follows:
Because machine learning models take a large amount of time and resources to train, it is important to be able to measure the efficiency of the training algorithms. This project will provide tools to measure the performance of machine learning training and help identify optimisations.
Professor David Abramson, Director, Research Computing Centre, The University of Queensland
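As a rough illustration of what measuring training efficiency can involve, the sketch below times a number of training steps and reports throughput in samples per second. It is not the project's tooling: `measure_throughput` and `dummy_step` are hypothetical names, and the dummy step is CPU-only arithmetic standing in for a real training step. On a GPU, work is typically launched asynchronously, so a real measurement would need to synchronise the device before reading the clock.

```python
import time
from typing import Callable, Dict


def measure_throughput(step: Callable[[], int], num_steps: int) -> Dict[str, float]:
    """Run `step` repeatedly and report wall-clock throughput.

    `step` must return the number of samples it processed.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    total_samples = 0
    start = time.perf_counter()
    for _ in range(num_steps):
        total_samples += step()
    elapsed = time.perf_counter() - start
    return {
        "samples": float(total_samples),
        "seconds": elapsed,
        "samples_per_second": total_samples / elapsed if elapsed > 0 else float("inf"),
    }


def dummy_step() -> int:
    """Placeholder for one training step on a batch of 32 samples."""
    sum(i * i for i in range(10_000))  # stand-in for real computation
    return 32


stats = measure_throughput(dummy_step, num_steps=5)
print(f"{stats['samples_per_second']:.0f} samples/s over {stats['seconds']:.4f} s")
```

Comparing this figure across batch sizes or worker counts is one simple way to spot where an optimisation helps.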
High-end GPUs are expensive and in high demand by machine learning and AI researchers. We want to deliver the technology to as many researchers as possible without escalating the infrastructure cost. From what we’ve identified, the development phase seems to be the gap where GPUs are required but underutilised. We’ve opted for the Dask framework in the Jupyter environment to increase the seats available in national machine learning services, and we’ve designed and delivered the Machine Learning eResearch Platform (MLeRP) system with various options for the researchers to access powerful GPUs.
Dr Slava Kitaeff, Monash e-Research Centre
Phase 1 of the project was completed in June 2023. Monash University conducted a proof of concept (POC), and its completion report concluded that SLURM is a good tool for managing the job queue because it allows easy transition to or from traditional HPC for other stages of development. Based on this POC, Dask has been chosen as the primary tool to interface with SLURM.
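For readers unfamiliar with the combination, the following sketch shows one common way to connect Dask to a SLURM queue, using the `dask-jobqueue` package. It is an assumption that MLeRP is configured along these lines; the partition name, resource sizes and the helper names `slurm_worker_options` and `start_cluster` are all hypothetical. The `job_extra_directives` argument is the name used by recent `dask-jobqueue` releases (older releases called it `job_extra`).

```python
from typing import Any, Dict


def slurm_worker_options(partition: str, gpus_per_job: int) -> Dict[str, Any]:
    """Build keyword arguments for dask_jobqueue.SLURMCluster (values are illustrative)."""
    return {
        "queue": partition,              # SLURM partition to submit worker jobs to
        "cores": 4,                      # CPU cores per worker job
        "memory": "16GB",                # memory per worker job
        "walltime": "01:00:00",          # time limit per worker job
        "job_extra_directives": [f"--gres=gpu:{gpus_per_job}"],  # request GPUs
    }


def start_cluster(options: Dict[str, Any], jobs: int):
    """Submit `jobs` worker jobs to SLURM and return a connected Dask client.

    Imports are local so the options above can be inspected on a machine
    without Dask or SLURM installed.
    """
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(**options)
    cluster.scale(jobs=jobs)  # ask SLURM for this many worker jobs
    return Client(cluster)


# Hypothetical partition name; calling start_cluster(options, jobs=2)
# only works on a system with SLURM and dask-jobqueue available.
options = slurm_worker_options(partition="gpu", gpus_per_job=1)
```

With this pattern a notebook session holds no GPU while idle, and GPUs are requested from the queue only when work is submitted.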
The national Machine Learning eResearch Platform (MLeRP) service has been developed by Monash University, and the open Beta version was launched on 7 November 2023.
Additionally, The University of Queensland has assessed gaps in training and identified several case studies and research groups that require very large-scale ML computation and would benefit from using multiple GPUs. Each case study highlights unique aspects of deep learning model development, including complexity, efficiency and transfer learning. The cases include:
This project will benefit researchers by:
The following project partners will be collaborating with ARDC on this project: