Latest Updates from the Planet Research Data Commons – Sept 2024
As part of our Research Software Agenda for Australia, the ARDC is working with the research community to shape better research software and to have it recognised as a first-class research output. Each month, we talk to a leading research software engineer (RSE), who shares their experience and tips on creating, sustaining and improving software for research.
This month, we spoke with Dr Manodeep Sinha, a computational astrophysicist at Swinburne University of Technology. Manodeep is a postdoctoral researcher at the university’s Centre for Astrophysics and Supercomputing and a Senior Research Software Scientist at the ARC Centre of Excellence for All Sky Astrophysics in 3 Dimensions (ASTRO 3D). He’s also the founder and co-chair of the RSE Association of Australia and New Zealand (RSE-AUNZ). Recently, he was awarded the ARDC-sponsored Emerging Leaders in Astronomy Software Development Prize by the Astronomical Society of Australia for his work on the open-source Corrfunc package, which speeds up our understanding of the spatial distribution of galaxies significantly – and potentially has applications beyond astronomy.
After completing a degree in electrical engineering and working at a software company for a year, I decided to pursue a career in astronomy. I’ve now been an astronomer for over 20 years, and funnily enough, software has once again become my mainstay.
I really began my journey as an RSE in 2012 when I started a new computational project. With the tools available at the time, it would have required more than 70 years to complete the calculations. Never one to back down, I went down the deep rabbit hole of learning about modern computer hardware and code optimisation. Eventually, I managed to come up with a pipeline that reduced the runtime of a single calculation from 10 minutes to 5 seconds and the overall computation to around 6 months.
The new workflow consisted of several pieces of custom code created specifically for the project and I had no intentions of ever releasing the codes. That mentality changed after a specialist workshop in mid-2014, where seasoned astronomers using state-of-the-art tools told me how long it took for their code to run. My instinct was that I could create something faster, and within 24 hours I had a new version of my code that sped up their calculation by over 6000 times. That made me realise how valuable my code might be to the broader community, and I decided to create a public, open-source, open-development repository for it in late 2015 – the code was none other than Corrfunc.
I value creating and maintaining open-source research software, and I encourage junior researchers to contribute to open-source software as well as consume it. Nowadays, most software I develop is made to be used by others, and by myself in the future! I enjoy working with big data from large simulations, running statistical analysis and creating tools as needed – a new tool means an extra thing to maintain, so you’ve got to find a balance!
Corrfunc is my most significant project, but more on that later – some context first!
Matter in the universe comes in 2 flavours: regular matter that you and I and everything we see is made up of, and dark matter, which, as its name suggests, does not emit or absorb any light. When the universe began, matter was spread uniformly across space-time, except for some tiny density perturbations. The extra gravity from the denser regions pulled in more matter from nearby, and with time, these overdense regions grew to form galaxies. It’s not hard to simulate such gravitational collapse of matter, but because we can’t directly observe the dark matter, we can only infer its existence through what we see in the regular matter. My research is about making a precise and robust connection between what we see and what we can simulate.
Recently, I have worked on the Uchuu Simulations project as part of an international collaboration across Japan, Spain, Italy, USA, Argentina and Australia. The Uchuu simulation is one of the largest simulations of the universe to date, culminating in 125 terabytes of data. To reduce the large amount of data and convert it into a portable and widely understood format, I had to create an HDF5 datatype and ensure the code used to convert the simulation output into HDF5 was super-optimised.
I’m also revamping a semi-analytic galaxy formation model called SAGE, initially created by my line manager, Darren Croton. The new version can now work with 7 different input data formats instead of just one. I was scratching my head for 2 weeks over a bug while implementing code for SAGE to read a new data format, but fortunately I managed to resolve it!
Short for “correlation function”, the name “Corrfunc” is from the legacy code handed over to me by my previous boss, Andreas Berlind. We know galaxies aren’t spread out uniformly across the universe, and a correlation function measures how clustered any observed galaxy distribution is. Specifically, a correlation function quantifies how much more probable it is to find a pair of galaxies at a given distance compared to a smooth universe. Beyond astronomy, correlation functions are a generic statistical tool and can be applied to any distribution of points as long as we know the volume of space those points are spread across. They’re an important topic in computer science, being the focus of one of the finalists for the 2012 ACM Gordon-Bell Prize.
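To make the "how much more probable than a smooth universe" idea concrete, here is a minimal Python sketch of one common way to turn pair counts into a correlation function. It is an illustration only, not Corrfunc's code: the function names are hypothetical, and it assumes the so-called natural estimator (measured pairs divided by the pairs expected for a uniform distribution, minus one) and a volume whose edge effects can be ignored, such as a periodic simulation box.

```python
import math


def expected_random_pairs(n_points, volume, r_lo, r_hi):
    """Pairs expected with separation in [r_lo, r_hi) if the points were
    spread uniformly through `volume` (edge effects ignored)."""
    shell_volume = 4.0 / 3.0 * math.pi * (r_hi**3 - r_lo**3)
    total_pairs = n_points * (n_points - 1) / 2.0
    return total_pairs * shell_volume / volume


def correlation_from_counts(pair_counts, bin_edges, n_points, volume):
    """Natural estimator: xi = (measured pairs / expected pairs) - 1 per bin."""
    xi = []
    for i, measured in enumerate(pair_counts):
        expected = expected_random_pairs(
            n_points, volume, bin_edges[i], bin_edges[i + 1]
        )
        xi.append(measured / expected - 1.0)
    return xi
```

A value of 0 in a bin means the points are no more clustered than a smooth distribution at that separation, while a value of 1 means pairs are twice as common as expected. This is also why the volume has to be known: without it, the expected pair count cannot be computed.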
To calculate the correlation function theoretically, we first need to compute how far every galaxy is from every other galaxy. Once the pair separations are known, we count how frequent the separations are by binning them up. Typically, though, we’re just interested in fairly small separations, so it’s a tremendous waste of time and resources to compute separations for all possible pairs of galaxies, especially given that modern sky surveys are becoming more and more comprehensive, capturing billions of galaxies.
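The all-pairs approach described above can be sketched in a few lines of Python. This is a deliberately naive illustration with a hypothetical function name, not code from Corrfunc: it measures every pair separation and bins it, so the work grows with the square of the number of points.

```python
import math


def brute_force_pair_counts(points, bin_edges):
    """Count unique pairs in each separation bin by checking every pair.

    `points` is a list of (x, y, z) tuples; bins are half-open,
    [bin_edges[b], bin_edges[b + 1]).
    """
    counts = [0] * (len(bin_edges) - 1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            r = math.dist(points[i], points[j])
            for b in range(len(counts)):
                if bin_edges[b] <= r < bin_edges[b + 1]:
                    counts[b] += 1
                    break
    return counts


# Three points with separations 1, 3 and sqrt(10) (about 3.16)
print(brute_force_pair_counts([(0, 0, 0), (1, 0, 0), (0, 3, 0)], [0.5, 1.5, 3.5]))
# prints [1, 2]
```

With a billion galaxies this loop would visit roughly 5 × 10^17 pairs, almost all of them far wider than the separations of interest.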
Corrfunc was designed to address these challenges. Modern CPUs process data more efficiently when the data is contiguous in memory. What Corrfunc does is partition the input galaxy positions into 3D cells and rearrange the positions within their associated cells, such that the galaxies within one cell occupy contiguous memory locations. We can then divide the entire volume into cells with sides the length of the largest separation we’re interested in and search only for galaxy pairs within the neighbouring cells in each dimension. Corrfunc is made faster still through optimisations for certain hardware. The result is super-fast computation. For instance, with a million galaxies like the Milky Way, Corrfunc can compute the correlation function within a few seconds when the legacy code takes more than a minute. For a sophisticated research question, this is the difference between years versus months of computation to arrive at a result. You can follow the gory details of Corrfunc in our 2 papers – Sinha and Garrison, 2020 and Sinha and Garrison, 2021.
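The cell-based search can be illustrated with a small pure-Python sketch. It is not Corrfunc's implementation: the names are hypothetical, it ignores periodic boundaries, and a dictionary of Python lists does not give the contiguous memory layout or hardware-specific optimisations described above. It only shows how using cells as wide as the largest separation restricts the search to neighbouring cells.

```python
import math
import random
from collections import defaultdict
from itertools import product


def bin_index(r, bin_edges):
    """Return the index of the half-open bin containing r, or None."""
    for b in range(len(bin_edges) - 1):
        if bin_edges[b] <= r < bin_edges[b + 1]:
            return b
    return None


def grid_pair_counts(points, bin_edges):
    """Count unique pairs per separation bin, searching only adjacent cells."""
    r_max = bin_edges[-1]

    # Group points into cubic cells whose side equals the largest separation.
    cells = defaultdict(list)
    for p in points:
        cells[tuple(math.floor(c / r_max) for c in p)].append(p)

    counts = [0] * (len(bin_edges) - 1)
    for key, members in cells.items():
        # Any pair closer than r_max lies in the same cell or an adjacent one.
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple(k + o for k, o in zip(key, offset))
            # Visit each unordered pair of cells once; skip empty cells.
            if other < key or other not in cells:
                continue
            neighbours = cells[other]
            same_cell = other == key
            for i, p in enumerate(members):
                # Within one cell, pair each point only with later points.
                start = i + 1 if same_cell else 0
                for q in neighbours[start:]:
                    b = bin_index(math.dist(p, q), bin_edges)
                    if b is not None:
                        counts[b] += 1
    return counts


# Demo on 500 random points in a 10 x 10 x 10 box (seeded for repeatability)
rng = random.Random(0)
points = [tuple(rng.uniform(0.0, 10.0) for _ in range(3)) for _ in range(500)]
print(grid_pair_counts(points, [0.1, 0.5, 1.0]))
```

The result matches an all-pairs count over the same bins, but each point is only compared with points in its own and the 26 surrounding cells, so the cost scales with the local density rather than with the total number of points.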
Corrfunc is open-source and available for everybody to use. Putting aside research benefits, Corrfunc reduces the massive carbon footprint astronomers create through supercomputing by being so efficient. It really is a win-win!
That said, Corrfunc is not all shiny and great. We have an incredible amount of boilerplate code duplicated everywhere, and modifying the boilerplate requires changing over 20 instances within the codebase, which becomes a huge maintenance burden. I’m now evaluating how we can reorganise the codebase to keep the performance while reducing the maintenance burden.
Corrfunc is now cited over 50 times yearly, and some of these projects are only feasible because of it. A group of geophysics researchers in the US used it to study the strength and frequency of earthquakes! After all, Corrfunc works with any set of points – it doesn’t matter whether those points represent galaxies or molecules.
Back in 2016, one of the PhD students in our group was having trouble computing a particular quantity because the computation was so slow. After talking to them, I realised that Corrfunc could be adapted and extended to serve their use case. I explained how Corrfunc was designed and what they would need to do to adapt Corrfunc, and within a few weeks, the student had a working version that was 600 times faster! I’m happy that I could help them get through their computational bottleneck. Still, this story highlights how much of a boost researchers can get from seeing, adapting and working with good research software.
I’m very grateful for getting the Emerging Leaders in Astronomy Software Development Prize, which is a confirmation of Corrfunc’s impact on the astronomy community. But to be honest, I was pretty conflicted about applying for the prize, being a senior researcher and having already been recognised as a technical expert. I applied because this was the only opportunity to get a national award for my software contributions.
In 2017, I had a chat with Michelle Barker, then working at Nectar and now leading the Research Software Alliance (ReSA). I told her how surprised I was at the lack of an Australian RSE community, and she encouraged me to form one. It felt impossible at first – how could I, a postdoc on a precarious fixed-term contract, start an entire national community? But trusting that the time had come for an RSE community for Australia (and New Zealand), I did it anyway, and thus was born RSE-AUNZ.
RSE-AUNZ is, first and foremost, about people. RSEs make deep contributions to the research process but are typically overlooked within the existing academic credit system. As a community, we can share our experiences, learn from one another and develop different approaches towards increasing the visibility and awareness of the RSE role. RSEs are essential for modern research, and now, with the ubiquitous use of AI-generated code, which can be wrong in subtle ways, we will need a steady RSE workforce to correctly capitalise on the new AI tools. We are also committed to creating and sustaining an inclusive community. The tech space is heavily male-dominated, and we actively try to have better representation within the Steering Committee.
Thanks to the fantastic work of the Steering Committee and the ongoing support from the ARDC, RSE-AUNZ has grown into a network of nearly 450 members. Based on the recent ARDC community review, RSE-AUNZ is progressing well as assessed from an external viewpoint and with community input. I’d like to give a shout-out to Rowland Mosbergen, who has been the driving force behind RSE-AUNZ and oversees the meetups, the RSE-AUNZ strategy document and the most recent RSE Asia Australia Unconference.
But while there’s more awareness and recognition of RSE, there is still a lot to be done. Until research evaluation processes include research software as a first-class research output, we will have to fight on. I once asked how I could get credit for my research software outputs while applying for promotion, and I was told that creating research software is considered such an integral part of research that we do not get any additional credit! That will require culture change at research organisations and policy change at the national level. And how soon the culture change happens will depend on how much we value FAIR, maintainable, high-quality, reproducible and open research. Initiatives are happening around the world, including in the Netherlands, so I am hopeful that change is coming.
I’m a member of the UK RSE Society and the Astronomical Society of Australia, but I also spend a lot of time fostering culture change. For instance, I like to help students, especially those from historically excluded communities, with their computing questions on the condition that they pay it forward. My goal is to cultivate a more sustainable and inclusive culture around technical expertise.
I was also involved in a weekly student-run event called “Cookies-n-Code”, where students would come together to learn and share new tools and techniques and, of course, enjoy some cookies! Again, we were very conscious of the diversity of the speakers. If you wait for people to volunteer, you end up with a heavily skewed representation. I keep hearing about similar events at other institutions (most recently at the biotech company CSL) – this goes to show the need for such an informal avenue to enable junior researchers to work better with research software.
You can connect with Manodeep via his personal website, GitHub, LinkedIn and Twitter.
If you’d like to be part of a growing community of RSEs in Australia, become a member of RSE-AUNZ – it’s free!
The inaugural ARDC Eureka Prize for Excellence in Research Software has been awarded to Dr Minh Bui and Professor Robert Lanfear of the Australian National University for the free, open-source IQ-TREE2, which turns DNA data into crucial evolutionary insights.
Hear from Dr Bui and Prof Lanfear on receiving the award and recognising research software.
Learn more about the ARDC’s Research Software Agenda for Australia.
The ARDC is funded through the National Collaborative Research Infrastructure Strategy (NCRIS) to support national digital research infrastructure for Australian researchers.