Collectra: Building Text and Image Extraction Pipelines

The Collectra project (born from the Hespi herbarium pipeline) provides a modular framework for building automated information extraction pipelines. It enables researchers to extract structured data from vast visual collections – including specimens, fossils, and art – transforming uncatalogued images into searchable, FAIR-compliant datasets. Read this case study to learn about digitising collection materials.

This resource is for:

  • Collection managers with large backlogs of digitised images but no searchable text
  • Environmental and life scientists needing to mobilise historical biodiversity records
  • Digital humanists seeking to detect specific objects or text in visual archives

By the end of this resource, you should be able to:

  • chain object detection (YOLO) and OCR/HTR into a custom pipeline
  • use active learning (clustering) to train AI models with minimal manual effort
  • package data into RO-Crates for long-term research interoperability.

Software tools

Packaging

Publications

  • Turnbull, R., Fitzgerald, E., Thompson, K. M., & Birch, J. L. (2025). Hespi: A pipeline for automatically detecting information from herbarium specimen sheets. BioScience, 75(8), 637–648. https://doi.org/10.1093/biosci/biaf042
  • Turnbull, R., & Birch, J. (2025, August 19). Botanical time machines: AI is unlocking a treasure trove of data held in herbarium collections. The Conversation. https://doi.org/10.64628/AA.337addajq
  • Thompson, K. M., Turnbull, R., Fitzgerald, E., & Birch, J. L. (2023). Identification of herbarium specimen sheet components from high-resolution images using deep learning. Ecology and Evolution, 13(8), e10395. https://doi.org/10.1002/ece3.10395

Australian Research Data Commons 2026, Collectra: Building Text and Image Extraction Pipelines, viewed 15 May 2026, https://ardc.edu.au/resource/collectra-building-text-and-image-extraction-pipelines/.

The 170,000-Image Bottleneck

Imagine you have 170,000 historical herbarium sheets or 10,000 museum fossil labels. They are digitised, but they are just pictures – dark data that cannot be searched. To track how a plant species has moved due to climate change, or how a fossil was originally classified, you need to read those labels. By hand, this would take a researcher decades of tedious labour.

Dr Robert Turnbull and the team at the University of Melbourne faced this exact bottleneck. Their solution was Collectra, a tool that takes an image, finds the label, reads the handwriting, and turns it into a database entry in seconds. It is not only a biological tool: it offers a step change in speed for any humanities or science project drowning in untranscribed images, reducing digitisation timelines from decades to months.

Collectra is being enhanced through the Enhanced Analytics for HASS and Indigenous Data project, part of the ARDC Community Data Lab.

Watch the case study, recorded at the 2026 HASS and Indigenous Research Data Commons Summer School, and read the summary below. A reference table of the tools and acronyms mentioned is at the end of this case study.

From Curation to Computation: The Lego Approach

The project moved beyond niche specimen processing to a general-purpose HASS utility. By automating the extraction of text and objects, it allows scholars to move from looking at images to analysing datasets.

The team demonstrated this versatility by adapting the pipeline to find every sandglass (hourglass) in a collection of historical paintings. This revealed errors in existing museum metadata where the objects had been missed for centuries, showing that AI can serve as a second set of eyes for the curator, discovering what the human eye has overlooked.

Pipeline Architecture: A Technical Blueprint

The Collectra framework uses a modular network graph where each task is a building block that can be swapped or modified.
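The building-block idea can be illustrated with a minimal sketch. This is not Collectra's actual API: the `Pipeline` class, the step names and the stand-in functions are all hypothetical, and the "image" is a toy dictionary rather than pixel data. It assumes only Python 3.9+ and the standard library, and shows how steps arranged in a dependency graph can be run in order and swapped individually.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Pipeline:
    """A small directed graph of named steps (hypothetical, not Collectra's API)."""

    def __init__(self):
        self.steps = {}  # step name -> function
        self.deps = {}   # step name -> names of the steps it consumes

    def add(self, name, func, after):
        """Register a step, or replace an existing step of the same name."""
        self.steps[name] = func
        self.deps[name] = list(after)
        return self

    def run(self, image):
        """Run every step in dependency order and return all intermediate results."""
        results = {"input": image}
        for name in TopologicalSorter(self.deps).static_order():
            if name == "input":
                continue  # the raw image is supplied by the caller, not computed
            args = [results[dep] for dep in self.deps[name]]
            results[name] = self.steps[name](*args)
        return results


# Stand-in blocks. In a real pipeline these would wrap a detector such as YOLO,
# an OCR/HTR model such as TrOCR or Tesseract, and an LLM clean-up step.
def detect_label(image):
    return image["regions"]["label"]


def transcribe(crop):
    return crop["text"]


def refine(text):
    return " ".join(text.split())  # collapse stray whitespace


pipeline = (
    Pipeline()
    .add("detect", detect_label, after=["input"])
    .add("transcribe", transcribe, after=["detect"])
    .add("refine", refine, after=["transcribe"])
)

# Toy input invented for this example; a real input would be pixel data.
sheet = {"regions": {"label": {"text": "Acacia   dealbata \n Coll. 1902"}}}
result = pipeline.run(sheet)["refine"]

# Swapping one block leaves the rest of the graph untouched.
pipeline.add("transcribe", lambda crop: crop["text"].upper(), after=["detect"])
swapped = pipeline.run(sheet)["refine"]
```

Because each step only declares which earlier outputs it needs, replacing the transcription block does not require changes to detection or refinement, which is the property the Lego analogy describes.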

Workarounds, Surprises and Unexpected Learnings

Technical Stack: Collectra Pipeline Architecture

Category | Tool or service | Purpose
Detection | YOLO | Identifying and cropping labels or specific objects within an image
Transcription | TrOCR or Tesseract | Extracting handwriting and printed text from visual media
Refinement | Multimodal LLMs (OpenAI’s GPT or Anthropic’s Claude Sonnet) | Post-processing text to correct errors and structure data
Annotation | Active learning (clustering) | Grouping images to reduce manual labelling work by up to 90%
Packaging | RO-Crate | Bundling data and metadata into a sharable, citable research object
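To make the packaging row concrete, the sketch below builds the kind of `ro-crate-metadata.json` content that an RO-Crate 1.1 package carries. It is an illustration only, not Collectra's own packaging code: the function name, file paths, dataset title and licence are assumptions made up for the example, and only the Python standard library is used.

```python
import json


def build_ro_crate_metadata(name, description, date_published, license_url, files):
    """Return a minimal RO-Crate 1.1 metadata document as a Python dict.

    `files` is a list of dicts with hypothetical keys: path, name, format.
    """
    graph = [
        {   # Metadata descriptor: identifies this file and the spec it follows.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset: describes the crate as a whole.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "license": {"@id": license_url},
            "hasPart": [{"@id": f["path"]} for f in files],
        },
    ]
    # One data entity per packaged file.
    for f in files:
        graph.append({
            "@id": f["path"],
            "@type": "File",
            "name": f["name"],
            "encodingFormat": f["format"],
        })
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


# All values below are invented for illustration.
crate = build_ro_crate_metadata(
    name="Example transcribed specimen labels",
    description="Label text extracted from digitised specimen images.",
    date_published="2026-05-15",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    files=[
        {"path": "labels.csv", "name": "Transcribed label data", "format": "text/csv"},
        {"path": "images/sheet_0001.jpg", "name": "Specimen image", "format": "image/jpeg"},
    ],
)
metadata_json = json.dumps(crate, indent=2)
```

Saving `metadata_json` as `ro-crate-metadata.json` in the same folder as the listed files would give a basic crate; a production workflow would likely add contextual entities such as authors and organisations.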

This is a case study from the Enhanced Analytics for HASS and Indigenous Data project, part of the ARDC Community Data Lab.