Collectra: Building Text and Image Extraction Pipelines
The Collectra project (born from the Hespi herbarium pipeline) provides a modular framework for building automated information extraction pipelines. It enables researchers to extract structured data from vast visual collections – including specimens, fossils, and art – transforming uncatalogued images into searchable, FAIR-compliant datasets. Read this case study to learn about digitising collection materials.
This case study is for:
- Collection managers with large backlogs of digitised images but no searchable text
- Environmental and life scientists needing to mobilise historical biodiversity records
- Digital humanists seeking to detect specific objects or text in visual archives
By the end of this resource, you should be able to:
- chain object detection (YOLO) and OCR/HTR into a custom pipeline
- use active learning (clustering) to train AI models with minimal manual effort
- package data into RO-Crates for long-term research interoperability.
Software tools
Packaging
Publications
- Turnbull, R., Fitzgerald, E., Thompson, K. M., & Birch, J. L. (2025). Hespi: A pipeline for automatically detecting information from herbarium specimen sheets. BioScience, 75(8), 637–648. https://doi.org/10.1093/biosci/biaf042
- Turnbull, R., & Birch, J. (2025, August 19). Botanical time machines: AI is unlocking a treasure trove of data held in herbarium collections. The Conversation. https://doi.org/10.64628/AA.337addajq
- Thompson, K. M., Turnbull, R., Fitzgerald, E., & Birch, J. L. (2023). Identification of herbarium specimen sheet components from high-resolution images using deep learning. Ecology and Evolution, 13(8), e10395. https://doi.org/10.1002/ece3.10395
The 170,000-Image Bottleneck
Imagine you have 170,000 historical herbarium sheets or 10,000 museum fossil labels. They are digitised, but they are just pictures – dark data that cannot be searched. To track how a plant species has moved due to climate change, or how a fossil was originally classified, you need to read those labels. By hand, this would take a researcher decades of tedious labour.
Dr Robert Turnbull and the team at the University of Melbourne faced this exact bottleneck. Their solution was Collectra, a tool that takes an image, finds the label, reads the handwriting, and turns it into a database entry in seconds. This isn’t just a biological tool; it is a step change for any humanities or science project drowning in untranscribed images, cutting digitisation timelines from decades to months.
Collectra is being enhanced through the Enhanced Analytics for HASS and Indigenous Data project, part of the ARDC Community Data Lab.
Watch the case study, recorded at the 2026 HASS and Indigenous Research Data Commons Summer School, and read the summary below. A reference table of the tools and acronyms mentioned appears at the end of this case study.
From Curation to Computation: The Lego Approach
The project has moved beyond niche specimen processing to become a general-purpose utility for the humanities, arts and social sciences (HASS). By automating the extraction of text and objects, it allows scholars to move from looking at images to analysing datasets.
The team demonstrated this versatility by adapting the pipeline to find every sandglass (hourglass) in a collection of historical paintings. This revealed errors in existing museum metadata where the objects had been missed for centuries, showing that AI can serve as a second set of eyes for the curator, discovering what the human eye has overlooked.
Pipeline Architecture: A Technical Blueprint
The Collectra framework uses a modular network graph where each task is a building block that can be swapped or modified.
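The building-block idea can be sketched in Python as a small directed graph of named steps. This is a minimal illustration under stated assumptions, not Collectra's actual API: the `Pipeline` class, the step names, the stub functions standing in for YOLO detection and OCR/HTR, and the sample input are all hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Sequence


class Pipeline:
    """A toy task graph: each step is a function of its upstream results."""

    def __init__(self) -> None:
        self.steps: Dict[str, Callable[..., Any]] = {}
        self.deps: Dict[str, List[str]] = {}

    def add(self, name: str, func: Callable[..., Any], after: Sequence[str] = ()) -> None:
        # Re-adding an existing name swaps that block for a new implementation.
        self.steps[name] = func
        self.deps[name] = list(after)

    def run(self, source: Any) -> Dict[str, Any]:
        results: Dict[str, Any] = {}
        # static_order() yields each step only after all of its dependencies.
        for name in TopologicalSorter(self.deps).static_order():
            # Steps with no dependencies receive the raw input instead.
            upstream = [results[d] for d in self.deps[name]] or [source]
            results[name] = self.steps[name](*upstream)
        return results


def detect_labels(image: dict) -> list:
    # Stand-in for an object detector such as YOLO: returns the found regions.
    return image["regions"]


def transcribe(regions: list) -> list:
    # Stand-in for OCR/HTR such as TrOCR or Tesseract.
    return [region["text"].strip() for region in regions]


def to_record(regions: list, texts: list) -> dict:
    # Combine region types and transcriptions into one database-style record.
    return {region["kind"]: text for region, text in zip(regions, texts)}


pipeline = Pipeline()
pipeline.add("detect", detect_labels)
pipeline.add("transcribe", transcribe, after=["detect"])
pipeline.add("record", to_record, after=["detect", "transcribe"])

# Made-up input that mimics one digitised sheet with two detected regions.
sheet = {"regions": [{"kind": "species", "text": " Acacia dealbata "},
                     {"kind": "collector", "text": "J. Smith"}]}
record = pipeline.run(sheet)["record"]

# Swap one block without touching the rest of the graph.
pipeline.add("transcribe",
             lambda regions: [r["text"].strip().upper() for r in regions],
             after=["detect"])
swapped = pipeline.run(sheet)["record"]
```

Because each step only names the steps it depends on, replacing the transcriber (or adding a refinement step after it) leaves the detection and record-building blocks unchanged.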
Workarounds, Surprises and Unexpected Learnings
Guidance for Researchers
Start with pre-trained models
Clustering is key
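As one illustration of how clustering can cut labelling effort, the sketch below groups similar images and asks a person to label only one image per group. It is a simplified stand-in, not the project's method: it uses a greedy "leader" clustering pass over hypothetical 2-D feature vectors, and the function names, the `radius` value, and the label names are all assumptions.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def leader_cluster(vectors: List[Vector], radius: float) -> List[List[int]]:
    """Greedy clustering: join the first cluster whose leader is within
    `radius`, otherwise start a new cluster with this vector as leader."""
    clusters: List[List[int]] = []  # lists of indices; the first is the leader
    for i, vec in enumerate(vectors):
        for members in clusters:
            if math.dist(vec, vectors[members[0]]) <= radius:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def propagate_labels(clusters: List[List[int]],
                     ask_human: Callable[[int], str]) -> Dict[int, str]:
    """Request one manual label per cluster and copy it to every member."""
    labels: Dict[int, str] = {}
    for members in clusters:
        label = ask_human(members[0])
        for i in members:
            labels[i] = label
    return labels


# Hypothetical image embeddings; real ones would come from a vision model.
features = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.2, 5.0), (0.05, 0.05), (9.0, 0.0)]
clusters = leader_cluster(features, radius=1.0)

asked: List[int] = []  # records which images needed a manual label
answers = {0: "typed label", 2: "handwritten label", 5: "stamp"}  # made-up labels


def ask_human(index: int) -> str:
    asked.append(index)
    return answers[index]


labels = propagate_labels(clusters, ask_human)
print(f"{len(asked)} manual labels cover {len(labels)} images")
```

In this toy run, three manual labels cover six images. Propagated labels are only as good as the clusters, so a spot check of each group is a sensible safeguard before using them for training.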
Prioritise provenance
Technical Stack: Collectra Pipeline Architecture
| Category | Exact tool or service | Purpose |
|---|---|---|
| Detection | YOLO | Identifying and cropping labels or specific objects within an image |
| Transcription | TrOCR or Tesseract | Extracting handwriting and printed text from visual media |
| Refinement | Multimodal LLMs (OpenAI’s GPT or Anthropic’s Claude Sonnet) | Post-processing text to correct errors and structure data |
| Annotation | Active learning (clustering) | Grouping images to reduce manual labelling work by up to 90% |
| Packaging | RO-Crate | Bundling data and metadata into a sharable, citable research object |
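To make the packaging row concrete, the sketch below writes a minimal `ro-crate-metadata.json` using only the Python standard library. It follows the basic layout of the RO-Crate 1.1 specification (a metadata descriptor plus a root dataset with name, description, date published and licence), but the helper names, file paths, dataset description, and licence choice are hypothetical and do not describe how Collectra itself builds crates.

```python
import json
from pathlib import Path
from typing import Dict


def build_crate_metadata(name: str, description: str, date_published: str,
                         license_url: str, files: Dict[str, str]) -> dict:
    """Return RO-Crate 1.1 metadata; `files` maps relative paths to names."""
    graph = [
        {   # Descriptor that identifies this JSON file as the crate metadata.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset: the research object as a whole.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "license": {"@id": license_url},
            "hasPart": [{"@id": path} for path in files],
        },
    ]
    # One entity per packaged file, linked from the root via hasPart.
    graph += [{"@id": path, "@type": "File", "name": label}
              for path, label in files.items()]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


def write_crate(folder: Path, metadata: dict) -> Path:
    """Save the metadata file at the top level of the crate folder."""
    path = folder / "ro-crate-metadata.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path


# Made-up crate contents for illustration only.
metadata = build_crate_metadata(
    name="Example transcribed specimen labels",
    description="Label text extracted from digitised sheets (illustrative).",
    date_published="2026-01-01",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    files={"transcriptions.csv": "Extracted label text",
           "images/sheet_0001.jpg": "Source image"},
)
```

Calling `write_crate(Path("my_crate"), metadata)` on an existing folder would place the metadata file beside the data files it describes. Dedicated RO-Crate libraries handle validation and richer provenance, which this sketch leaves out.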
This is a case study from the Enhanced Analytics for HASS and Indigenous Data project, part of the ARDC Community Data Lab.