FAIR data 101 training – Interoperable #6

Presented: 10 June 2020

Presenter: Liz Stokes

#6 in the eight-webinar FAIR data 101 training series.


Liz Stokes:
Hello everybody and thank you for your patience. I had some interesting moments there, connecting and being confounded by myself and systems. So, thank you. Thanks a lot. Let’s get into interoperability today. My name is Liz Stokes and I would like to acknowledge the traditional owners of the land on which we’re meeting. I’m coming to you from Sydney, Australia, where I’m based at the moment, on the traditional lands of the Gadigal people of the Eora Nation. I’d like to pay my respects to elders past and present and extend that respect to any of those nations’ people who are with us today.

Liz Stokes:
Let’s get into interoperability without all of that faff. I’m going to… move that picture of myself so I don’t really look at it. So, Matthias talked on Tuesday about the reasoning behind interoperability and its place amongst the FAIR data principles, showed some examples of what goes wrong when you don’t have common, agreed-upon standards, and talked about the utility of having the data schema with the data, which brings everything together. Today, I’m going to pretty much go over the same stuff and hopefully go a little deeper into what we mean when we talk about vocabularies, and share a couple of tools and some resources to help you get your head around interoperability.

Okay. As you know, you can always join us on the Slack channel for any questions that come up. If you’ve got any questions from what I’m saying today, please throw them into either the questions tab or the chat tab and we will have a Q&A in about half an hour. I’ll try and speed up somewhere. Okay? So, interoperability is what we’re going to cover today. What does it even mean? Well, let’s just get into it. Okay?

So, to be interoperable, data needs to be interpretable by a computer so that it can be combined with other data. The data needs to use community-agreed formats, languages and vocabularies. The (meta)data also needs to use community-agreed standards and vocabularies, and can contain links to related information using identifiers, which is essentially what that I3 point means: (meta)data includes qualified references to other (meta)data.

Now, let’s have a look at all of those principles another way. So, data that is interpretable by computers; community-agreed formats, languages and vocabularies. And the metadata itself also follows some of these FAIR principles. Okay? So, the metadata also uses community-agreed standards, and there are links to related information using identifiers.

So, this is my first warning here. We might just get into the weeds, which is good. Okay? Because when we’re talking about vocabularies, classification systems and determining community standards, inevitably we’ll get into value judgements as to what is a weed, what is grass, and what distinctions are helpful now and in the future. Being humans, we don’t always get it right, so it’s a process and we all learn to trust in the process. And I hope that by talking about some FAIR-looking vocabularies and a couple of metadata schemas, we’ll even get over to linked data as a way of coming back and having another perspective on what interoperability means. Hopefully, I can get to ontologies, but if we don’t, I’ll just send you off to some awesome things.

Okay. So, community standards. Let’s focus now on community standards around research data and what kinds of scenarios people might be thinking of when we consider what a community-agreed standard could help us solve. So, here are a few from the fairsharing.org website. Here we have researchers asking: my funder’s data policy recommends the use of established standards, but which are the widely endorsed and applicable standards for my crop data? Funders and journal editors may be asking: what are the mature standards and standard-compliant databases that we should be recommending to our authors? Maybe the journal editors are looking for a repository that they can use to host data related to the publications they publish.

Down in the bottom right corner, we have librarians and data managers who are looking at genomic rice data in a particular format which has now been deprecated. So, they’re looking to find out what the new format is and what they can do to migrate their legacy data into a format that might be more widely used or meets current standards. And then, finally, we have curators and developers who are looking at sharing social science data. So, their question is: we need a standard way of doing this, but who should we talk to and what options are out there?

I do recommend FAIRsharing. As you can see in that little circle in the middle of this infographic, it’s a curated collection of standards, databases and policies that you can all use to help verify appropriate repository services.

Okay. But, this is my… Things can all go pear-shaped, and humans don’t always get it right. So, what you can see here is a picture of Copernicus’ heliocentric solar system, where you’ve got the sun, or Sol, in the middle and then concentric circles showing the orbits of the planets around the sun. So, it took us humans a while to figure out where we were in the solar system. But not just figure it out; also accept it and incorporate it into a socially acceptable world view.

It’s interesting that we now know Copernicus drew on the science of many Islamic astronomers who were leading the charge there. And he even delayed publication of this model for years and years because he didn’t have enough data, or really proof, of his theory. Then, 50 years later, when Galileo used the telescope to prove it, well, his purposes were more… he was actually trying to change community standards, and unfortunately he was tried by the Inquisition and then placed under house arrest until he died. So, hopefully community development of metadata standards is not going to be that difficult. But, I do want to reassure anyone who is getting fired up about this that, well, humans are funny and cultural change is hard, and you might as well ask any data librarian who’s had to promote data management plans how that goes.

Let’s move on to the things that make it easier to aid interoperability. I’m going to talk a bit about vocabularies and why we would want to have vocabularies in our research data. A vocabulary is the most standard way of setting out the common language that a discipline has agreed to use to refer to concepts of interest in that discipline. Researchers planning observational surveys need to define their data items clearly, and an agreed vocabulary or standard makes a good starting point for translating concepts into other vocabularies so that collaboration can occur. So, we’ve got vocabularies happening at the data description level and we also have vocabularies working at the metadata description level as well.

So, I’m going to start off by talking about the metadata schema for aggregating Australian research data. I think I’ve made some mention of this previously; now I’m going to get really into it. It’s the Registry Interchange Format for Collections and Services, officially known as RIF-CS. It’s based on an ISO standard (ISO 2146) and yet, unlike the ISO standard it’s based on, it is free to access: you can check out the vocabularies that RIF-CS uses, and you can even repurpose it yourself. Okay? So, its main purpose is in aggregating Australian research data so that it can be displayed in Research Data Australia, the platform that ARDC provides. The two important things that I wanted to focus on, in terms of interoperability, are that this metadata schema establishes relationships between objects using qualified references, and that it uses a bunch of controlled vocabularies.

Now, let’s have a look at a… Well, this is a little like an ontology model, in that this diagram shows the relationships between different objects within the metadata schema. So, RIF-CS has four types of objects in its registry. They’re party, which represents people or groups; collection, which is an aggregation of physical or digital objects, so a collection object is what we use for describing a data set or a collection, which are both types of collection; activity, the third object there, which usually translates to a research project but could be another kind of activity; and then service.

So, these four objects come straight from the ISO standard. Okay? And you can see there, between these different objects, little arrows pointing to things that can happen between these objects and how they’re related. And those of you who have fond memories of our webinars on protocols should hopefully be gratified to see down here that there’s a line up to services, which are delivered through protocols. Okay? And that’s going to aid interoperability. Okay? And then over here on the left, the access policy. So, you could imagine: we’ve covered access and protocols in interoperability previously, and now we’re moving on to looking at how the metadata might identify these relationships.
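
To make the idea of a qualified reference concrete, here is a minimal sketch in Python of what a RIF-CS-style registry record might look like when built programmatically. The element names follow the RIF-CS documentation, but the keys, the group name and the relation type are invented for illustration; treat it as a sketch, not a canonical RIF-CS serialization.

```python
# A hedged sketch of a RIF-CS-style record: a collection carrying a
# qualified reference to a party. Keys, group and relation type are
# invented for illustration.
import xml.etree.ElementTree as ET

registry_object = ET.Element("registryObject", group="Example University")
ET.SubElement(registry_object, "key").text = "example.edu.au/collection/42"

collection = ET.SubElement(registry_object, "collection", type="dataset")
name = ET.SubElement(collection, "name", type="primary")
ET.SubElement(name, "namePart").text = "Magpie observation dataset"

# The qualified reference: not just a link to another record, but a link
# whose relation type comes from a controlled vocabulary.
related = ET.SubElement(collection, "relatedObject")
ET.SubElement(related, "key").text = "example.edu.au/party/7"
ET.SubElement(related, "relation", type="hasCollector")

print(ET.tostring(registry_object, encoding="unicode"))
```

The relatedObject element is what makes the reference “qualified”: the link carries a relation type, so a machine can tell a collector from, say, a funder.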

Now, let’s go in. So, actually, I’m going to jump over to some of the… no… some of the vocabularies that we have. Let’s see if that works. I might need to… Oh. Pretty good, I think. Okay. Hopefully you can see… Ah, yes. So, what I’ve just linked to is a list of the vocabularies that RIF-CS uses. You can see at the top a little index of the different vocabularies, and then down here we get some definitions of what they are. What I would like to highlight here, for example, is the access rights type. There are three access rights types: open, conditional or restricted, and there’s some information about when you might want to use each particular type of access rights. Okay?

Another interesting vocabulary is the date types. Okay? Another way of thinking about these vocabularies, by the way, is that they’re a list of options. I’m going to scroll down a little; close your eyes if this makes you dizzy. Down to the dates. Right. Okay. So, as you can see here, all of the date types that RIF-CS uses for collections, if you notice the little dc prefix, are actually reusing elements from the Dublin Core metadata schema. In fact, that’s Dublin Core inside another metadata schema. And that’s because there are different kinds of dates that pertain to a collection or a data set: when it was made available, when it was created. Maybe it was accepted or submitted on a particular date, or there might be a range for which that resource or thing is valid. Okay?
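
As a small thought experiment, here is a sketch of how those lists of options can be enforced in code. The access rights values mirror the three types just described; the date type values echo the dc-prefixed Dublin Core terms on the vocabulary page, though the exact spellings here should be treated as illustrative rather than authoritative.

```python
# Controlled vocabularies as plain sets; values outside the list are
# rejected. Spellings are illustrative, not authoritative.
ACCESS_RIGHTS_TYPES = {"open", "conditional", "restricted"}
DATE_TYPES = {  # reused from Dublin Core, hence the "dc" prefix
    "dc.available", "dc.created", "dc.dateAccepted",
    "dc.dateSubmitted", "dc.issued", "dc.valid",
}

def validate(value: str, vocabulary: set, field: str) -> None:
    """Raise if a metadata value falls outside its controlled vocabulary."""
    if value not in vocabulary:
        raise ValueError(
            f"{field} must be one of {sorted(vocabulary)}, got {value!r}")

validate("open", ACCESS_RIGHTS_TYPES, "accessRights type")  # passes
validate("dc.created", DATE_TYPES, "date type")             # passes

try:
    validate("sometimes", ACCESS_RIGHTS_TYPES, "accessRights type")
except ValueError as err:
    print(err)  # not in the vocabulary, so it is rejected
```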

And I might come… I’m going to scroll a little more again, so avert your eyes. Okay. I’m going to come down to the identifier type. Here’s a very useful vocabulary: a list of persistent identifiers. As you can see, they all have their prefixes down the side, acronyms and what they stand for. So, that’s it. That’s a fun time in the vocabulary set that we use, and, in fact, I might just jump on over to Research Data Australia.

Let me make that nice and big. Okay. So, back over to Research Data Australia, these controlled vocabularies come into play when we’re using search and find [inaudible 00:15:27]. Actually, if I click over to choose the publicly accessible online term, the little check box there, this will return results which have… I’ll just click on the first one there. Results which have open access metadata. Okay? Or open access research data. Okay? As you can see there, when we look down at the access part, we can see that it’s open and [inaudible 00:16:03] the metadata sent us to the correct record.

I’m going to scroll down a little here and just take you into the relationships. So, here is a little graph. Thinking back to when Matthias was talking to you about linked open data, with different subjects and objects and their relationship, or the predicate that identifies the relationship between them, here is a nice graph that shows you how these things are related to each other. You can see that these little green circles represent people, or, in RIF-CS, these would be party records. Okay? So, a person is a party according to the RIF-CS metadata schema. This particular one, Professor Lisa Bero, is collector of and also principal investigator of this particular data set here. Okay? Which is also associated with these other four data sets in this little graph. So, this graph shows us the relationships between these different objects. We can also see that there’s a [inaudible 00:17:35] on a website here that’s also listed, and that is related to one of these data sets as well.

And if I go down the bottom to my favorite view, the registry view, this will help us. Okay. So, now we see RIF-CS in all its glory. Okay? And we get to see the metadata elements here down the side in bold, and then the values of the metadata that have been provided. Okay? And if I take you to the related objects… this is really where the metadata provides qualified links to other metadata. We can see that this record is a collection, of type data set. And it is related to these other objects, which are party records. And I think we can see other [inaudible 00:18:50] related to these other objects, which are collections. Okay?
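
Here is a small sketch of the shape of that relationships view: each qualified link is a subject-predicate-object triple, the same shape Matthias introduced for linked data. The keys and relation names below are invented for illustration.

```python
# Each edge is (subject, relation, object): a qualified link between
# registry objects. Keys and relation names are illustrative.
edges = [
    ("party:lisa-bero", "isCollectorOf", "collection:dataset-1"),
    ("party:lisa-bero", "isPrincipalInvestigatorOf", "collection:dataset-1"),
    ("collection:dataset-1", "hasAssociationWith", "collection:dataset-2"),
]

def related_objects(key):
    """Everything directly linked to `key`, with the qualifying relation."""
    out = [(rel, obj) for subj, rel, obj in edges if subj == key]
    out += [(rel, subj) for subj, rel, obj in edges if obj == key]
    return out

print(related_objects("collection:dataset-1"))
# [('hasAssociationWith', 'collection:dataset-2'),
#  ('isCollectorOf', 'party:lisa-bero'),
#  ('isPrincipalInvestigatorOf', 'party:lisa-bero')]
```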

Now, let’s move on a little, shall we? Okay. Out of [inaudible 00:19:08]. Oh. This is where we get into the weeds, everyone. Okay? So, you could have a look, you could check out RIF-CS in its schema documentation, and that’s what we’re looking at now in the components. Right? Okay? For example, we could go down to what a collection looks like. And this kind of documentation is telling us how to understand, or rather telling machines and, I guess, data librarians how to understand the elements and how they relate to each other in this metadata schema. So, this is quite a formal way of describing the data model that is used for resource [inaudible 00:20:04] and displayed in Research Data Australia.

I’m going to draw back from [inaudible 00:20:14] now. But anybody can go in there and explore at any time. Can you still hear me? I’m going to trust that you can.

Matthias:
Yes. We can still hear you Liz.

Liz Stokes:
Thank you. Okay. Now let’s go back to our webinar. Right. Okay. So, remember the magpies, okay, that Matthias was talking about. What I really want to talk about is another metadata schema called Darwin Core. Sounds very similar to Dublin Core. Okay? The reason I wanted to talk about this is because it’s an example of a metadata schema that has undergone some change. Okay? So, this diagram, you don’t have to learn it really; in fact, this is an early diagram or representation of Darwin Core, I apologize, as it has since become a widely used open access standard for biodiversity data.

It was developed to provide a simple way to document and share information about species occurrences, whether that was in the field or in a museum collection. It’s been used to integrate hundreds of millions of records through a global biodiversity federation, the Global Biodiversity Information Facility (GBIF). Because it’s so widely used, it has the benefit of bringing together lots of different kinds of contribution. So, what I wanted to tell you about here, as you can see in the middle, is how Darwin Core contains location, organism data, geological context and taxon data. Okay? You can see down the bottom here that there are a few metadata standards that contribute to Darwin Core. So, these are being reused: there are some Dublin Core elements that are being reused by Darwin Core, and there are also… sorry, the mouse was doing funny things… there are also links out to other extensions. Okay? I’ll take you up to Apple Core, which is an extension of Darwin Core that focuses on herbaria and plant material. Okay?

Now, so, this is what Darwin Core looked like in 2012. More recently, there have been additions to Darwin Core that support the aggregation of sampling-event data sets. Okay? So, there’s a new event core component of Darwin Core that places the sampling event at the center of the simplified data set. And so, it links its protocol (and when I say protocol, I mean how they do the science, not how they transfer the data), efforts and measurements to the species occurrences derived from the sampling event. Okay? So, as a result, researchers can tap into more complex and quantitatively richer records for analysis and combine them alongside others which are focused on single organisms or individual [inaudible 00:24:00]. Okay? So, these changes could lead to improvements in the quality and usefulness of data sets that are already published on, for example, the Atlas of Living Australia and other biodiversity data repositories.
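
To illustrate the event core idea, here is a hedged sketch: the sampling event is described once, with its protocol and effort, and each occurrence points back to it via eventID. The field names are Darwin Core terms, but the values and identifiers are invented.

```python
# The sampling event sits at the centre; occurrences link back via eventID.
# Field names are Darwin Core terms; values are invented for illustration.
event = {
    "eventID": "urn:example:event:2020-06-10-transect-1",
    "eventDate": "2020-06-10",
    "samplingProtocol": "20-minute two-hectare bird survey",
    "samplingEffort": "2 observers x 20 minutes",
}
occurrences = [
    {"eventID": event["eventID"], "vernacularName": "Australian Magpie",
     "individualCount": 4},
    {"eventID": event["eventID"], "vernacularName": "Purple Swamphen",
     "individualCount": 1},
]

# Joining an occurrence back to the shared event recovers the protocol
# and effort without repeating them on every row.
for occ in occurrences:
    if occ["eventID"] == event["eventID"]:
        print(occ["vernacularName"], "via", event["samplingProtocol"])
```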

So, this is where I want to get into this linked open data stuff. Okay? Wait a minute. Let me try and back that up with a picture. Coming back to the magpies here: we’ve got all these occurrence records here. I’m just taking you down to how the Darwin Core metadata schema links out to these other online resources. And so, by being able to link this record to other data out there, we can… this is where we’re getting to: the utility of linked data and FAIR data is that we’re using persistent identifiers to link between different repositories, and that when we have well-described data models, the data can be reused more efficiently by humans and also by the systems and machines that we create.

Okay. Here is a nice example of mapping different kinds of data to Darwin Core terms. What I’m sharing with you now is a page from the ALA blog. And you can see here raw data, which has been collected. This is structured data in a table which has been designed to serve the purposes of data collection. Okay? So, you’ve got your vernacular names down here, like the purple swamphen, and then we have different columns for the different localities. Okay?

But this is not how Darwin Core is laid out. Darwin Core is laid out in a different format, and that’s the second table here. You can see how, instead of having the date up here in the top left corner, we have date information coming through under event date. Okay? And then we have the vernacular name and locality. Okay? Indicating different components of the data. Okay? This is actually a really good blog and I reckon you should have a read of it. But right now it’s time for me to get back to linked open data. Make that nice and big.
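
In the same spirit as that blog post, here is a minimal sketch of the reshaping: a field sheet with one column per locality becomes one Darwin Core row per occurrence. The species, localities and counts are invented for illustration.

```python
# A field-collection table: one row per species, one column per locality.
field_sheet = {
    "eventDate": "2020-06-10",
    "counts": {
        "Purple Swamphen": {"Lake Wendouree": 3, "Albert Park": 1},
        "Australian Magpie": {"Lake Wendouree": 5},
    },
}

# Reshape into the Darwin Core layout: one row per occurrence, with the
# date under eventDate and the place under locality, as in the second table.
darwin_core_rows = [
    {
        "eventDate": field_sheet["eventDate"],  # dwc:eventDate
        "vernacularName": name,                 # dwc:vernacularName
        "locality": locality,                   # dwc:locality
        "individualCount": count,               # dwc:individualCount
    }
    for name, localities in field_sheet["counts"].items()
    for locality, count in localities.items()
]

for row in darwin_core_rows:
    print(row)
```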

Yeah. So, this might be… I wanted to get to this by the end of this thing because it illustrates the five stars of linked data that were defined by Tim Berners-Lee. And I think it’s a really interesting way of looking back at interoperability and the FAIR data principles, because it has a lot in common with how we talk about these principles. But it’s different. So, it’s not the same, but I hope that by offering it up, it can give you something to think about. So, these stars indicate levels of things that you can do to publish data and make it available to others.

So, the first thing is step one here, and you get one star for publishing data on the web and making it available. And in this case they’ve used a PDF to publish it, put it up there. And they’ve also put a license on it. That’s the first kind of step so that people know that they can reuse it.

Step two is making that data available as structured data. So, we can see we’ve moved from tables in a PDF to actual tables in an Excel spreadsheet. Good step. That means that people can more easily grab that data and do other things with it: plug it into some statistical analysis software or another tool. Except we’re using a proprietary format here; we’re reliant on being able to read Excel spreadsheets.

So, the third star is to use an open format. Okay? Here we have CSV, the comma-separated values format of the data, which can be read by anything. CSV is an open format.

Step four, and here we have the beginning of the Resource Description Framework, or RDF. One of the cornerstones of the Resource Description Framework is to use URIs, or Uniform Resource Identifiers, to point to things so that other people can point to your stuff. So, using URIs to point to concepts in your vocabulary, for example, or in your metadata schema or in your ontology.

And step five, the fifth star, is linked open data, where you get to link your data to other data so that you can provide context. Okay? That’s really where we get into that heady world of linked open data and querying across many things. So, that’s the future. All right? Okay? And I should caveat: it’s not necessarily the case that all of this must happen for all data. But it is a significant driver.
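
As a sketch of what steps four and five look like in practice, here is a small Python example (using the rdflib library) that turns one CSV row into RDF triples whose subject is a URI, with Darwin Core terms as the predicates. The example.org base URI and the data values are invented; only the Darwin Core namespace is real.

```python
# Turning a CSV row into RDF: the subject is a URI others can point at,
# and the predicates come from the Darwin Core vocabulary. The base URI
# and the values are invented for illustration.
import csv
import io
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")    # Darwin Core terms
EX = Namespace("https://example.org/occurrences/")  # hypothetical base

raw = ("occurrenceID,vernacularName,locality,eventDate\n"
       "001,Australian Magpie,Sydney,2020-06-10\n")

g = Graph()
for row in csv.DictReader(io.StringIO(raw)):
    subject = EX[row["occurrenceID"]]  # a URI, not just a string
    g.add((subject, RDF.type, DWC.Occurrence))
    g.add((subject, DWC.vernacularName, Literal(row["vernacularName"])))
    g.add((subject, DWC.locality, Literal(row["locality"])))
    g.add((subject, DWC.eventDate, Literal(row["eventDate"])))

print(g.serialize(format="turtle"))
```

The fifth star is then a matter of swapping literals for URIs where one exists, for example pointing the locality at a gazetteer entry, so your data links into other people’s data.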

One thing I also wanted to highlight is that the Australian Government has a Digital Continuity 2020 Policy, and I wanted to share this with you because it’s another facet of how interoperability is driving change in how we manage data at a government level. And this here is a component from that policy: “Agencies will have interoperable information, systems and processes that meet standards for short and long-term management, improve information quality and enable information to be found, managed, shared and reused easily and efficiently.” I mean, it’s a… thing.

Sorry, Siri. Right. So, the National Archives have some good resources and scenario mappings for government agencies who are building interoperability into their systems and processes. I just wanted to highlight here some of those scenarios for things you might do if you wanted to take this interoperability a little further. That relies on looking at streamlining business processes, and you might undertake a legacy data migration to move data or to upgrade your metadata. Or maybe it’s looking at a data exchange activity with stakeholders, or reviewing the standards that you have for data publication and sharing. Anyhow, you’ll get a copy of these slides and you can go and see those scenarios, which are in friendly storyboard maps.

I’m going to kind of skip over ontologies at the moment. But, here are a few nice little ontological tools, if you’d like. And actually, if I did this again, I would probably start off with the comic of Ada Lovelace and Charles Babbage and then have a look at how someone has created an ontology of that comic.

And here, let’s finish up with Research Vocabularies Australia. Okay? Because here at the ARDC, we’ve got some tools that can help you develop your own vocabularies and publish them so you can make them FAIR. So, our vocabulary service indexes vocabularies so they can be used to tag items in catalogs and search portals, which can help you provide keywords and other search aids. The service also offers machine-to-machine services which support activities like creating, managing and querying vocabularies.

So, in the ARDC vocabulary services suite, we’ve got a vocab editor, which is called PoolParty. So, we contract the software PoolParty, where you can create and manage vocabs, collaborate with others and browse concepts using the built-in visualization tool. You can also query your vocab from a SPARQL endpoint, which is very handy. We also have a repository, which is where any vocabs you create can be stored, and that’s where you can publish them so your vocabulary can be accessed via the portal at Research Vocabularies Australia. And if you’re using the repository, then on the other side, in the portal, other people will be able to find your vocabulary as well.
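
To give a feel for what querying a vocabulary’s SPARQL endpoint involves, here is a minimal sketch over plain HTTP. The endpoint URL is a placeholder, not the real Research Vocabularies Australia address; the query itself is ordinary SPARQL asking for SKOS concept labels.

```python
# Querying a SPARQL endpoint over HTTP. The endpoint URL is hypothetical;
# the SKOS namespace and the SPARQL protocol are standard.
import requests

ENDPOINT = "https://example.org/vocabs/sparql"  # placeholder endpoint
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?label
WHERE {
  ?concept a skos:Concept ;
           skos:prefLabel ?label .
}
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()
for binding in response.json()["results"]["bindings"]:
    print(binding["concept"]["value"], "->", binding["label"]["value"])
```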

So, I feel I’ve gone all around the world a little bit on interoperability, and there are a lot of different directions that we could go in. But, in tying this all together, I’d like to finish with this idea: if you want interoperability to work for you in making data FAIR, then it’s important to make sure that the data model you’re using for your metadata is well defined and well structured. So, it uses standards such as the Resource Description Framework, RDF, or other standards for describing structured data, so that machines can parse that information and humans can also understand it.

Secondly, use controlled vocabularies which are also well documented, which people can access, and which are resolvable using PIDs. That’s another way of enabling your data to be FAIR, because the vocabularies and the metadata that you’re using to describe that data are also FAIR: findable, accessible, interoperable and reusable. And finally, one part of that is that the metadata includes cross-references which provide contextual meaning to the data.

That’s the beauty of linked data and what we hope to enable. So, before we get into questions and answers, I’m going to remind you to… Don’t forget to fill in the feedback at the end of this webinar because it’s really helpful for us to know how we’re doing and sometimes at the end of a long webinar that’s when the questions come up. So, just in case you don’t feel like you’ve got any questions now, I’m just giving you an [inaudible 00:37:18] there. So, Matthias, that’s enough from me. Does anyone have any questions about all of this interoperability?

Matthias:
Yes. So, we have one question so far, and that is about RIF-CS and RDA. The relationship of parties linked to a data set: how is that connection made? Is that manually entered in the related objects field in RDA?

Liz Stokes:
Yes. Well, the related objects… Sorry. Can you hear me now?

Matthias:
Yes.

Liz Stokes:
Good. So, the related objects in RDA is [inaudible 00:38:13]. Let me go back to it.

Matthias:
While Liz is doing that, I would like to remind everybody that if you do have a question, please enter it into the questions module.

Liz Stokes:
Okay. So, let me get this straight: was the question about relating parties to each other, or parties to collections?

Matthias:
Relating, literally the question was relating parties to a data set.

Liz Stokes:
Okay. So, you can have the… You don’t need to supply party records necessarily alongside collection records. Most of the time you’re actually sending collection records that say the collection is related to a party. And because of that link that has been established, RDA will infer the relationship to the party from… no, to the collection from the party. I guess this is going to sound really confusing, but when you send records to RDA, it’s not like you’re sending all the collection records with their relationships plus the parties with their own relationships as well. So, you have these sets of relationships that run in parallel. They link up with each other because of the persistent identifiers that you have included. So, the relationship doesn’t have to happen in both sets of metadata; it can be in just one, because the bidirectional link is inferred. I’m not sure that that’s actually very clear in my description. I apologize.
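
What Liz describes can be sketched as a tiny inference step: the collection record asserts one direction, and a table of inverse relations supplies the other. The relation names here are illustrative, not the official RIF-CS list.

```python
# One asserted, qualified relationship yields both directions.
INVERSES = {
    "hasCollector": "isCollectorOf",
    "hasPrincipalInvestigator": "isPrincipalInvestigatorOf",
}

def infer_bidirectional(subject, relation, obj):
    """Return the asserted link plus its inferred reverse."""
    return [(subject, relation, obj),
            (obj, INVERSES[relation], subject)]

print(infer_bidirectional("collection:42", "hasCollector", "party:7"))
# [('collection:42', 'hasCollector', 'party:7'),
#  ('party:7', 'isCollectorOf', 'collection:42')]
```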

Matthias:
No. That’s clear enough. Okay. We don’t have any more questions coming in, but we are getting close to the hour, so I’ll probably hand back to you, Liz, to wrap up.

Liz Stokes:
Okay. Well, thank you everybody. Gee, you’ve been really quiet today. That’s been nice. And I hope to see you some time on Slack and in our community discussions next week. Okay. Thanks very much. Bye.
