
Metadata

Metadata is "data about data".

Metadata is used to store information about data assets that are stored in the GBADs knowledge engine. We strive for metadata to be FAIR (Findable, Accessible, Interoperable, and Reusable).

In addition, we collect metadata about the processes that ingest data into the Knowledge Engine so that all data lineage is tracked.

Metadata Schema

"A metadata schema is a set of rules about what sorts of subject-predicate-object statements one is allowed to make, and how one is allowed to make them." - Jeffrey Pomerantz

A subject-predicate-object statement consists of:

  • Subject = the thing being described
  • Object = the thing describing the subject
  • Predicate = relationship between the subject and object

For example:

  • Subject = FAOSTAT QCL dataset
  • Object = FAO
  • Predicate = creator

In this subject-predicate-object statement the FAO is the creator of the FAOSTAT QCL dataset.
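The statement above can be written down as a small data structure. The following is a minimal sketch in Python; the `Statement` class and its field names are illustrative assumptions, not part of the GBADs code base.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One subject-predicate-object statement (hypothetical helper class)."""

    subject: str    # the thing being described
    predicate: str  # the relationship between subject and object
    obj: str        # the thing describing the subject

    def sentence(self) -> str:
        # Render the statement as a plain English sentence.
        return f"{self.obj} is the {self.predicate} of the {self.subject}"


statement = Statement(subject="FAOSTAT QCL dataset", predicate="creator", obj="FAO")
print(statement.sentence())
```

A metadata schema then amounts to restricting which values `predicate` may take and what kind of value is allowed as the object for each one.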

Based on this model, we can create a metadata schema that defines the predicates (also called elements) that we would like to use to describe a resource. Metadata vocabularies such as Dublin Core, schema.org, PROV-DM, and DCAT provide metadata elements that can be used to describe data. There is no 'one-size-fits-all' approach to metadata: several standard metadata element sets exist because what you include in metadata depends on your use case.

We have selected metadata elements from schema.org and PROV-DM to describe data and trace data lineage in the knowledge engine (see Figure below).

[Figure: metadataModel]

Encoding Schema

Each metadata element should have instructions on the values expected for it. For example, there are many different ways to specify a date: 01/04/23 could mean January 4th, 2023 or April 1st, 2023. Therefore, the value of any element specifying a date should use ISO-8601 so that all dates are formatted in a standard fashion.
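The date ambiguity can be demonstrated with Python's standard `datetime` module. This is only an illustration of the problem and of the ISO 8601 form; the variable names are made up for the example.

```python
from datetime import date, datetime

ambiguous = "01/04/23"

# The same string yields two different dates depending on the assumed convention.
month_first = datetime.strptime(ambiguous, "%m/%d/%y").date()  # read as January 4th, 2023
day_first = datetime.strptime(ambiguous, "%d/%m/%y").date()    # read as April 1st, 2023

# ISO 8601 calendar dates (YYYY-MM-DD) have only one reading.
print(month_first.isoformat())  # 2023-01-04
print(day_first.isoformat())    # 2023-04-01

# An ISO 8601 date string can be parsed without any format hint.
parsed = date.fromisoformat("2023-01-04")
```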

The encoding schema for each metadata element used in the metadataModel is found below:

| Element | Encoding Scheme | Expected Type |
|---|---|---|
| name | free text | str |
| codeRepository | link to GitHub repo | str |
| runtimePlatform | name of programming language or platform used at runtime (need controlled vocabulary) | str |
| dateCreated | ISO-8601 | datetime |
| startTime | ISO-8601 | datetime |
| endTime | ISO-8601 | datetime |
| prov:type | Controlled vocabulary to be built for use case (i.e ingestionEvent, dataCleaning etc.) | str |
| description | free text | str |
| url | url | str |
| identifier | url, doi, or uri | str |
| license | url | str |
| temporalCoverage | ISO-8601 | datetime |
| creator | free text | str |
| inDefinedTermSet | url | str |
| termCode | code from defined term set | str |
| Place | GeoNames | str |
| contentSize | File size in megabytes | float |
| fileFormat | File format. One of: csv, json, dbtable etc. (controlled vocabulary required) | str |
| contentUrl | url | url |
| uploadDate | ISO-8601 | datetime |
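One way such an encoding scheme could be enforced is a small check of each record against the expected types. The sketch below covers only a subset of the elements in the table above; the `validate` function, the `FILE_FORMATS` set, and the sample record are hypothetical and are not the GBADs implementation.

```python
from datetime import datetime

# Expected Python type per element, taken from a subset of the table above.
EXPECTED_TYPES = {
    "name": str,
    "description": str,
    "dateCreated": datetime,
    "contentSize": float,  # file size in megabytes
    "fileFormat": str,
}

# Assumption: a stand-in for the controlled vocabulary that still has to be defined.
FILE_FORMATS = {"csv", "json", "dbtable"}


def validate(record: dict) -> list:
    """Return a list of problems found in a metadata record (empty if none)."""
    problems = []
    for element, value in record.items():
        expected = EXPECTED_TYPES.get(element)
        if expected is None:
            problems.append(f"{element}: not in the encoding scheme")
        elif not isinstance(value, expected):
            problems.append(
                f"{element}: expected {expected.__name__}, got {type(value).__name__}"
            )
    fmt = record.get("fileFormat")
    if isinstance(fmt, str) and fmt not in FILE_FORMATS:
        problems.append(f"fileFormat: '{fmt}' is not in the controlled vocabulary")
    return problems


# Made-up record, for illustration only.
record = {
    "name": "example dataset",
    "dateCreated": datetime.fromisoformat("2023-01-04T00:00:00"),
    "contentSize": 1.5,
    "fileFormat": "csv",
}
print(validate(record))
```

Returning a list of problems rather than raising on the first one lets a curator see every issue in a record at once.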

Decision needed:

Currently, keywords for metadata are created by extracting terms (such as species) from data sets.

A controlled vocabulary needs to be created so that keywords can be linked to it. We have begun to do this by collecting all species classifications and definitions from data sources; however, synonyms have not yet been identified.


Vocabularies and Ontologies

Pre-existing vocabularies and ontologies will be accessed, refined, compared and extended to create a controlled vocabulary for GBADs. The semantics of each data source will be assessed to ensure that the words used to describe data are consistent between data sources.

  • Vocabularies for data sources that don’t cite vocabulary standards will be obtained, and their terms will be compared to pre-existing data standards such as AGROVOC (FAO’s controlled vocabulary).
  • Collected vocabularies will be compared across all data sources to see how their descriptions of terms compare.
  • The goal is to provide a standard for GBADs that increases the interoperability and quality of data, ultimately leading to better models and estimates.
    - Controlled vocabularies also lead to better systems and allow tasks to be automated.

AgroPortal is an ontology mapping tool that will allow GBADs to identify suitable ontologies and mappings between standardized vocabularies related to the agricultural sector.

  • We also acknowledge that we cannot expect data contributors to change their vocabularies to follow that of GBADs (and if we did ask, it may discourage people from contributing data). This underlines the importance of vocabulary mappings.
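A vocabulary mapping can be as simple as a lookup from the terms contributors use to a preferred term. The sketch below is a minimal illustration; the terms in `TERM_MAPPING` are invented examples, not the GBADs controlled vocabulary, and `to_controlled_term` is a hypothetical helper.

```python
from typing import Optional

# Assumption: illustrative synonyms only, not an actual GBADs or AGROVOC mapping.
TERM_MAPPING = {
    "cattle": "cattle",
    "bovine": "cattle",
    "cows": "cattle",
    "swine": "swine",
    "pigs": "swine",
    "hogs": "swine",
}


def to_controlled_term(term: str) -> Optional[str]:
    """Map a contributor's term to the preferred term, or None if it is unmapped."""
    return TERM_MAPPING.get(term.strip().lower())


print(to_controlled_term(" Bovine "))  # cattle
print(to_controlled_term("llama"))     # None
```

Contributors keep their own wording, and unmapped terms come back as `None` so they can be flagged for review instead of being silently dropped.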

Metadata Storage and Management

"All the knowledge is in connections"

-- David Rumelhart

GBADs Informatics uses Neo4j, a graph database management system, to manage and store metadata and information about individuals and groups involved in the project. As you will learn in this section, a graph database is a type of database that leverages the idea of connections between entities as a method to derive insights and new knowledge from otherwise disconnected data.

What is a graph database?

A graph database is a type of database that stores data using relationships between main ideas or entities. The relationships between different entities show connectedness, allowing more insights to be drawn than with a traditional relational database. Because data is highly complex and multidimensional in terms of structure, provenance, governance, security and semantics, GBADs uses graph databases for master metadata management and data cataloguing. By leveraging the dynamic nature of the graph database and structuring our graph model in a way that improves understanding of the many dimensions of data, we can both visualize and understand how data flows outside and inside our organization. Graph databases also allow us to add to and change the structure as the structure of the information about data changes. This will become clearer as we introduce the preliminary GBADs graph data model.


Traditionally, in a relational database, data are organized into a series of tables. Each table has columns, and some tables share common columns. These common columns let you specify joins between tables, resulting in a new table.

The biggest advantage of relational databases is the ability to join tables on common columns to derive insights. On the other hand, relational databases require rigid schemas, so database engineers must structure their data to fit the schema. This assumes that we already know what all of our data looks like, which isn't always the case in research.
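To make the idea of a join on a common column concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names are assumptions made up for this example; only the FAO and FAOSTAT QCL values come from the earlier example in this page.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema must be declared before any data can be stored.
conn.executescript(
    """
    CREATE TABLE organization (org_id INTEGER PRIMARY KEY, org_name TEXT);
    CREATE TABLE dataset (
        dataset_id INTEGER PRIMARY KEY,
        dataset_name TEXT,
        org_id INTEGER  -- common column shared with the organization table
    );
    INSERT INTO organization VALUES (1, 'FAO');
    INSERT INTO dataset VALUES (1, 'FAOSTAT QCL', 1);
    """
)

# Join the two tables on the common column to produce a new result table.
rows = conn.execute(
    """
    SELECT dataset.dataset_name, organization.org_name
    FROM dataset
    JOIN organization ON dataset.org_id = organization.org_id
    """
).fetchall()
print(rows)  # [('FAOSTAT QCL', 'FAO')]
conn.close()
```

Recording a new kind of fact, such as which process ingested the dataset, would mean altering this schema or adding tables first, which is the rigidity described above.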


Parts of a graph database

Graph databases are made up of nodes (entities) and edges (relationships). Nodes can have properties and labels, while edges serve as the connections, or relationships, between nodes.

A graph model is a model of what kinds of nodes you are representing and how they are connected (what relationships you will have).
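The node and edge concepts can be sketched in plain Python without a database. This is not Neo4j code and not the GBADs graph model; the `Node` and `Edge` classes, the labels `Dataset` and `Organization`, and the relationship name `CREATOR` are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                      # what kind of entity this is
    properties: dict = field(default_factory=dict)  # key-value details about it


@dataclass
class Edge:
    source: Node
    relationship: str  # the type of connection between the two nodes
    target: Node


fao = Node("Organization", {"name": "FAO"})
qcl = Node("Dataset", {"name": "FAOSTAT QCL"})

# The graph is the set of edges, each one connecting two nodes.
edges = [Edge(qcl, "CREATOR", fao)]


def neighbours(node: Node, relationship: str) -> list:
    """Follow outgoing edges of one relationship type from a node."""
    return [e.target for e in edges if e.source is node and e.relationship == relationship]


creators = neighbours(qcl, "CREATOR")
print([n.properties["name"] for n in creators])  # ['FAO']
```

Adding a new kind of node or relationship only requires appending to `edges`, with no schema change, which mirrors the flexibility attributed to graph databases above.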

[Figure: Graph Model]

Graph Database and Metadata API

To be updated when the API is launched.