Graphics processing units (GPUs) are an expensive resource and need to be used efficiently. At present, GPU provision for machine learning (ML) in research cloud and high-performance computing systems offers little elasticity or flexibility. The first concern is that the current model of allocating GPUs through virtual desktops for ML services presents these issues:
- Utilisation of large GPUs reserved for virtual desktops is inefficient.
- The use of partitioned GPUs in virtual desktops may still lead to under-utilisation or prove inadequate for the intended tasks.
The second concern is scale: as deep learning models become more complex and datasets grow, simple ‘standard’ techniques no longer scale to meet the challenge. For example, large amounts of data cannot be accommodated in often limited system memory and must be split into smaller batches for processing across multiple central processing units (CPUs) and GPUs, often in parallel. This in turn can be inefficient, because large volumes of data must be transferred between multiple CPUs and GPUs.
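The batching technique described above can be sketched in a few lines. The following is a minimal illustration using only the Python standard library; the data source and batch size are hypothetical stand-ins for a real dataset and workload:

```python
from itertools import islice

def batches(iterable, batch_size):
    """Yield successive fixed-size batches from any iterable, so a
    dataset too large for memory can be streamed in chunks rather
    than loaded all at once."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical usage: stream a "dataset" of 10 items in batches of 4.
data = range(10)  # stands in for a dataset that may not fit in memory
sizes = [len(b) for b in batches(data, 4)]
print(sizes)  # [4, 4, 2]
```

In a real multi-GPU setting, each batch would be dispatched to a worker (for example via Dask or a deep learning framework's data loader) instead of processed in a single loop.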
To address the first concern, this project aims to develop and implement a national ML service that can make efficient use of GPUs in high-performance computing (HPC) clusters and the national ARDC Nectar Research Cloud. The service will be implemented on infrastructure at Monash University, The University of Queensland and the Queensland Cyber Infrastructure Foundation (QCIF), including dedicated GPU infrastructure at the Monash and QCIF nodes of Nectar.
To address the second concern, the project will develop training materials and courses specifically for researchers with high-end requirements for very large datasets and computational power. By developing and sharing profiling expertise with the researchers, the project will enable them to navigate the challenges of deploying deep learning in a diverse research environment, and will ensure that the necessary infrastructure and operational support are in place to achieve this goal.
In line with the ARDC strategy, this project will provide leading-edge infrastructure and enhance provision of services for the national ARDC Nectar Research Cloud. It will also feed into service provision for the ARDC’s Platforms projects and Thematic Research Data Commons (Thematic RDCs), the consultations for and requirements of which will be reviewed to ensure that the ML service can be utilised by these programs of work.
The project is being undertaken in 3 phases. The high-level outcomes of each phase are as follows:
Phase 1: exploration
- Identify researchers who have large-scale ML requirements and need larger computational resources, and document their use cases and requirements as case studies.
- Explore options for delivery of a national ML service and develop a proof of concept (POC) service.
- Identify gaps in training.
Phase 2: pilot implementation
- Hold workshops to enable the identified users to use the tools they need to access larger-scale computational resources for ML.
- Implement a pilot ML service, and test and refine it with a group of pilot users.
- Implement improved training material based on identified gaps, and develop and deliver training for the pilot ML service.
Phase 3: production implementation
- Run workshops for additional ML users and explore the use of the new ML service to meet large-scale requirements.
- Deploy a production ML service.
- Provide improved training material and run training courses for the production ML service.
Because machine learning models take a large amount of time and resources to train, it is important to be able to measure the efficiency of the training algorithms. This project will provide tools to measure the performance of machine learning training and help identify optimisations.
Professor David Abramson, Director, Research Computing Centre, The University of Queensland
High-end GPUs are expensive and in high demand by machine learning and AI researchers. We want to deliver the technology to as many researchers as possible without escalating the infrastructure cost. From what we’ve identified, the development phase seems to be the gap where GPUs are required but underutilised. We’ve opted for the Dask framework in the Jupyter environment to increase the seats available in national machine learning services, and we’ve designed and delivered the Machine Learning eResearch Platform (MLeRP) system with various options for the researchers to access powerful GPUs.
Dr Slava Kitaeff, Monash e-Research Centre
Who Will Benefit
This project will benefit researchers by:
- further improving access to, and the usability of, computational infrastructure and tools for ML-based research
- enabling more efficient use of expensive GPU servers for ML
- enabling researchers who are developing more complex ML models on very large datasets to use multiple GPUs for their large-scale computation
- upskilling researchers in ML techniques and in the ML tools and platforms that this project will deliver.
The following project partners will be collaborating with ARDC on this project:
- Monash University
- Research Computing Centre (RCC), The University of Queensland
- Queensland Cyber Infrastructure Foundation (QCIF).
The project will:
- Implement a national ML service that can make efficient use of GPUs in HPC clusters and the national ARDC Nectar Research Cloud.
- Develop and deliver training material and courses in support of the service.
- Develop and share profiling expertise with researchers to enable them to navigate the challenges in deploying deep learning in a diverse research environment.
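Profiling, as mentioned above, typically starts with simple, framework-agnostic measurements before moving to specialised GPU tools. As a minimal sketch using only the Python standard library (the training-step function below is a hypothetical CPU-bound stand-in, not real model code), `cProfile` can show where time is spent:

```python
import cProfile
import io
import pstats

def simulate_training_step(n=200_000):
    """Hypothetical stand-in for one training step: a CPU-bound loop."""
    total = 0.0
    for i in range(n):
        total += (i % 7) * 0.5
    return total

# Profile a few "training steps" and report the top entries.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(3):
    simulate_training_step()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)  # show the 5 costliest calls by cumulative time
report = stream.getvalue()
print(report)
```

For GPU workloads, the same measure-first approach applies, but with framework-specific profilers (for example, the profiling tools shipped with the deep learning framework in use) rather than `cProfile`.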
The national Machine Learning eResearch Platform (MLeRP) service has been developed by Monash University, and the open beta version was launched on 7 November 2023. A launch article was published, and an introductory webinar on the service was held on 28 November 2023.
Phase 1 of the project was completed in June 2023. Monash University conducted a proof of concept (POC); the completion report concluded that SLURM is a good tool for managing the job queue, allowing an easy transition to or from traditional HPC in later stages of development. Based on this POC, Dask has been chosen as the primary tool to interface with SLURM.
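The general shape of a Dask-on-SLURM deployment can be sketched with the `dask_jobqueue` package, which launches Dask workers as SLURM jobs. The values below (queue name, cores, memory, GPU directive, worker count) are illustrative assumptions, not MLeRP's actual configuration:

```python
# Sketch: Dask workers provisioned as SLURM jobs (requires the dask and
# dask-jobqueue packages; all resource values here are hypothetical).
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="gpu",             # hypothetical SLURM partition name
    cores=8,                 # CPU cores per SLURM worker job
    memory="32GB",           # memory per SLURM worker job
    walltime="01:00:00",
    job_extra_directives=["--gres=gpu:1"],  # one GPU per worker job
)
cluster.scale(jobs=2)        # ask SLURM to start two worker jobs

client = Client(cluster)
# From here, Dask collections or futures run on the SLURM-provisioned
# workers, e.g. client.submit(train_fn, shard) for a hypothetical
# training function and data shard.
```

Because worker jobs pass through the normal SLURM queue, the same code can scale between a laptop-sized test and a large HPC allocation by changing only the cluster configuration.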
Additionally, The University of Queensland has assessed gaps in training and identified several cases and research groups requiring very large-scale ML computation that would benefit from using multiple GPUs. Each case study highlights unique aspects of deep learning model development, including complexity, efficiency and transfer learning. The cases include:
- signal processing
- large-scale semantic segmentation.
Researchers can also join the Machine Learning Community of Practice for Australia (ML4AU), established under our previous machine learning project. The community facilitates knowledge sharing, best-practice frameworks and advanced training materials.