Research Data – Latest News & Worth Knowing

Asynchronous Tracking and Description of Research Data Changes in Distributed Systems With Interoperable Metadata

June 13th, 2024 | by Benedikt Heinrichs and Arlinda Ujkani
Graduation Celebration (Source: Freepik)

In the world of digital research, there are many different ways of storing data. But how can research data be managed in a way that makes it accessible and usable for everyone who should have access to it? In this blog post, we look at how we can tackle this challenge by introducing a method to fill in missing information about the origin of data. This approach is intended to make it easier to find and use research data – in line with the FAIR Principles.



Benedikt Heinrichs (Source: Benedikt Heinrichs)

The FAIR principles provide guidance on the findability, accessibility, interoperability and reusability of research data, but concrete implementation guidelines are often lacking. Research data management teams have therefore built various implementations, such as the platform Coscine, that attempt to put these principles into practice. However, such platforms run into a problem: researchers often deposit their data with established storage providers, which leads to a loss of information about data provenance.

In his doctoral thesis, Benedikt Heinrichs developed methods that fill in missing data provenance information, compare data representations and generate interoperable metadata. These methods were integrated into a standards-based research data management system (here: Coscine) to support the implementation of the FAIR principles and to improve research processes. The four methods are presented below.



Asynchronous Data Provenance

The thesis considered the following cases of asynchronous data provenance, each of which is described using the PROV Ontology (PROV-O):

Versions: Changes to the data are recognized, such as the addition of a sentence to a text. In this simplest case, the same identifier is kept for the data, and only older representations of it need to be compared.

Variants: Here too, changes to the data are recognized, but new data is derived from other, older data. Examples are renaming or combining elements of older data. This case is more complex than versions, as it requires searching through other data.

Invalidation: Data is recognized as invalidated when it has been deleted. This special case is based on listing previous representations and reveals missing (deleted) data.
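The three cases above can be sketched as PROV-O statements. In the sketch below, triples are modelled as plain Python tuples; the identifiers and helper functions are illustrative assumptions, not the thesis implementation, but the properties used (prov:wasRevisionOf, prov:wasDerivedFrom, prov:invalidatedAtTime) are the vocabulary PROV-O provides for exactly these cases.

```python
# Sketch: the three asynchronous-provenance cases as PROV-O triples.
# The ex:dataset identifiers are hypothetical; only the PROV-O
# properties come from the ontology itself.

PROV = "http://www.w3.org/ns/prov#"
EX = "https://example.org/dataset/"

def version(new_id, old_id):
    """A new representation of the *same* data (e.g. a sentence added)."""
    return [(EX + new_id, PROV + "wasRevisionOf", EX + old_id)]

def variant(new_id, *source_ids):
    """New data derived from one or more older datasets (rename, merge)."""
    return [(EX + new_id, PROV + "wasDerivedFrom", EX + s) for s in source_ids]

def invalidation(old_id, deleted_at):
    """Data missing from the current listing: marked as invalidated."""
    return [(EX + old_id, PROV + "invalidatedAtTime", deleted_at)]

graph = (version("report-v2", "report-v1")
         + variant("merged-table", "table-a", "table-b")
         + invalidation("report-v1", "2024-06-13T00:00:00Z"))
for s, p, o in graph:
    print(s, p, o)
```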


Research Data Similarity

To determine the comparability of research data, a new approach was developed that is based on comparing interoperable metadata sets. Instead of inspecting the research data directly, an abstraction, namely the interoperable metadata sets, ensures that research data can be compared regardless of format.

The concrete method for calculating this comparability consists of several steps. These include filtering out irrelevant relationship triples and subjects, and using ontologies such as DCAT to structure the metadata into comparable catalogs and datasets. A simplification step was also implemented that removes unique identifiers in order to avoid false similarities. This process, with its filter, structure and simplify steps, was defined as the FSS process.
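The filter, structure and simplify steps can be illustrated roughly as follows. The predicate filter set and the identifier pattern are assumptions made for the sake of the example; the actual FSS process in the thesis is more elaborate.

```python
# Rough sketch of the FSS (filter-structure-simplify) idea on plain
# (subject, predicate, object) triples. The filter set and the UUID
# pattern are illustrative, not the thesis implementation.
import re

IRRELEVANT_PREDICATES = {"http://purl.org/dc/terms/modified"}  # example filter set
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

def fss(triples):
    # Filter: drop relationship triples considered irrelevant for comparison.
    kept = [t for t in triples if t[1] not in IRRELEVANT_PREDICATES]
    # Structure: group statements by subject, mirroring in simplified form
    # DCAT's organisation into catalogs and datasets.
    catalog = {}
    for s, p, o in kept:
        catalog.setdefault(s, set()).add((p, o))
    # Simplify: strip unique identifiers so they cannot cause false
    # (dis)similarities between otherwise equal descriptions.
    return {
        UUID_RE.sub("<id>", s): {(p, UUID_RE.sub("<id>", str(o))) for p, o in ps}
        for s, ps in catalog.items()
    }
```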

The remaining challenge was evaluating how well such comparisons work. Various comparison methods were tested, including removing parts of the FSS process, applying the comparison directly to the research data, and using other similarity metrics. Similarity comparisons on scientific datasets showed that, given sufficient metadata quality, the methods based on interoperable metadata detect at least some changes, while methods based on the raw research data detect none.
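As a minimal illustration of such a similarity metric, the Jaccard index over two already filtered and simplified metadata sets could look like this. The metric choice and the example triples are illustrative and not necessarily those used in the thesis.

```python
# Illustrative similarity metric on two interoperable metadata sets,
# modelled here as sets of (subject, predicate, object) triples.
# Jaccard similarity is one plausible choice among several metrics.

def jaccard(a: set, b: set) -> float:
    """|A & B| / |A | B|: 1.0 for identical sets, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

before = {("ds", "title", "Sensor run"), ("ds", "format", "text/csv")}
after = {("ds", "title", "Sensor run 2"), ("ds", "format", "text/csv")}
print(jaccard(before, after))  # 1 shared triple out of 3 distinct ones
```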


Automatic Interoperable Metadata Extraction

An important topic is the correct representation of research data through accurate and detailed metadata. Many approaches exist for this, but most of them rely on manual description. Part of the dissertation was therefore to pursue an automatic approach that extracts metadata from research data and makes it interoperable. With such an approach, research data can be described with interoperable metadata independently of its format, provided that an extracting method exists for that format. For this reason, the approach places a strong focus on extensibility. The resulting metadata extractor can and will be integrated into various research processes, e.g. NFDIMatwerk or Coscine.
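The extensibility idea can be sketched as a small extractor registry: each format plugs in its own extracting method, and data without a registered method simply yields no metadata. The registry design and the property names below are assumptions for illustration, not the extractor's actual API.

```python
# Minimal sketch of an extensible, format-aware metadata extractor
# registry. Property names (dcterms:extent, ex:lineCount) are
# illustrative assumptions.
from typing import Callable, Dict

EXTRACTORS: Dict[str, Callable[[bytes], dict]] = {}

def register(media_type: str):
    """Register an extracting method for a media type; new formats plug in here."""
    def decorator(fn):
        EXTRACTORS[media_type] = fn
        return fn
    return decorator

@register("text/plain")
def extract_text(data: bytes) -> dict:
    text = data.decode("utf-8", errors="replace")
    return {"dcterms:extent": len(text),
            "ex:lineCount": text.count("\n") + 1}

def extract(media_type: str, data: bytes) -> dict:
    fn = EXTRACTORS.get(media_type)
    # Metadata can only be produced when an extracting method exists.
    return fn(data) if fn else {}
```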

One example of such a method is object recognition. It recognizes, for instance, that six bananas are present in an image of a fruit basket. Depending on the interpretation, however, one could also conclude that only five bananas are shown, which highlights the room for interpretation left by the implemented method.


Integration Into Standards-Based Research Data Management System – Coscine As a Use Case

Coscine (Collaborative Scientific Integration Environment) is a research data management platform that supports multiple storage providers. It offers key features such as research data management, metadata management and easy access to storage space.

However, there were some challenges at the beginning of the work. The APIs were individually defined and not based on standardized architectures. In addition, no information about the origin of the data was collected, which made it difficult to trace and track data.

Therefore, one goal of the work was to convert the platform to a standards-based architecture and to integrate the presented methods in order to improve the efficiency and FAIR compliance of Coscine.

For these reasons, it was necessary to transform the use case into a standards-based research data management system. An evaluation was carried out to determine the relevant standards: requirements were derived from the use case, and the evaluation revealed that no single standard fulfils all of them. A combination of existing standards was therefore recommended.

The transformation required the semantic lifting of Coscine: stored entities (e.g. storage resources) are now described using suitable standards.

In addition, relevant APIs were created that follow defined standards (e.g. the Linked Data Platform, LDP). The connections between the individual entities were described in detail to ensure a consistent and standards-compliant integration.
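Semantic lifting can be pictured as emitting standards-based triples for each stored entity. The sketch below describes a hypothetical storage resource using DCAT terms; the IRIs, the helper function and the exact modelling are illustrative assumptions, not Coscine's actual data model.

```python
# Hedged sketch of "semantic lifting": describing a stored entity
# (here a hypothetical storage resource) with DCAT terms so that it
# becomes part of a standards-based graph.

DCAT = "http://www.w3.org/ns/dcat#"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def lift_storage_resource(iri: str, title: str, download_url: str):
    """Return triples describing a storage resource as a dcat:Dataset."""
    dist = iri + "#distribution"
    return [
        (iri, RDF_TYPE, DCAT + "Dataset"),
        (iri, DCTERMS + "title", title),
        (iri, DCAT + "distribution", dist),
        # The distribution links the abstract dataset to a concrete
        # location at the storage provider.
        (dist, RDF_TYPE, DCAT + "Distribution"),
        (dist, DCAT + "downloadURL", download_url),
    ]
```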


Summary of Benefits

By implementing asynchronous data provenance, interoperable metadata and comparability of research data, unstructured and hard-to-access data is transformed into organized and easily accessible research data with detailed metadata. Asynchronous data provenance enables continuous and time-delayed collection of data provenance information. Interoperable metadata ensures the compatibility and comprehensibility of metadata, while the comparability of research data facilitates the identification of similarities between different datasets.

Asynchronous data provenance supports the description of research data, even if the link between older revisions has been lost. By restoring links, the path that research data has traveled can be better traced and research can be better reproduced.

It also allows the provenance and flow of data to be tracked, improving transparency and traceability. Interoperable metadata allows a uniform and comprehensible description of data, which increases the findability and usability of the data. By automatically extracting this data, a clear description of the research data can be generated, which significantly increases retrievability and understanding.

The comparability of research data supports the precise linking and comparison of datasets. This makes it easier for researchers to find relevant data and to recognize relationships between different datasets, increasing the efficiency and effectiveness of research.



The development of a method for asynchronous data provenance made it possible to track changes and identify different change events. To determine the similarity of research data, a method based on interoperable metadata was developed and successfully tested on various use cases. An automatic extraction pipeline ensures the interoperability of the extracted metadata. What remains for the future is user-friendly access to the developed technologies, which is currently under development.


Responsible for the content of this article are Benedikt Heinrichs and Arlinda Ujkani.
