Measuring the quality of Relational-to-RDF mappings


Darya Tarasowa, Christoph Lange, Sören Auer
University of Bonn, Germany

KESW2015, Moscow, Russia
September 30th, 2015

Motivation

  • Translating the data stored in relational databases (RDB) to the linked data format is an essential prerequisite for evolving the current Web of documents into a Web of Data
  • In order to be effectively and efficiently reusable, linked data should meet certain data quality requirements
  • Assessing the quality of a linked dataset created from RDB may be a laborious, repetitive task if the dataset is frequently recreated from its RDB source, e.g. after any update to the RDB.

1. Is it possible to positively influence the quality of a linked dataset created from RDB by improving the quality of the RDB2RDF mapping that produces the linked data?

2. If so, how can the quality of the RDB2RDF mapping be measured?

Related work

  • RDB2RDF Mapping Approaches
    • not explicitly focused on collecting requirements
    • requirements can be found in best practice and mapping approach descriptions
    • we collected and formally defined the metrics found
  • RDB2XML
    • mappings are mostly produced automatically and do not allow customization -> evaluations only consider the performance of the mapping algorithms and the quality of the produced XML data
    • efficiency of query processing -> we adapted it for RDB2RDF as the simplicity metric
    • information preservation has been shown to be important -> we provide objective metrics for it in the Faithfulness of Output dimension
  • Requirements for Ontology Matching
    • precision and recall -> we adapt these metrics to the evaluation of RDB2RDF mappings and include them in the Faithfulness of Output dimension as coverage and accuracy of data presentation metrics
  • Measuring Linked Data Quality
    • The current paper assumes that the quality of a linked dataset is influenced by the mapping that produces it and thus categorizes the metrics from the survey from the perspective of the RDB2RDF mapping. We select metrics that are related to the mapping process and adapt them to the RDB2RDF domain. 

Approach

  • 4 quality dimensions with 14 objective metrics overall
  • assign weights to the metrics 
  • choice of weights depends on the goal of the mapping process. 
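
As a minimal sketch of how weighted metrics could be combined into one score: the snippet below assumes each metric yields a value in [0, 1] and that the overall score is a weighted mean. Both are assumptions made for illustration; the metric names, scores and weights are hypothetical and not taken from the paper.

```python
from typing import Mapping


def weighted_quality(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted mean of metric scores; weights need not sum to 1."""
    total_weight = sum(weights[m] for m in scores)
    if total_weight <= 0:
        raise ValueError("at least one metric needs a positive weight")
    return sum(scores[m] * weights[m] for m in scores) / total_weight


# Hypothetical metric scores and weights, for illustration only.
scores = {"coverage": 0.9, "simplicity": 0.5, "standard_compliance": 1.0}
publishing_weights = {"coverage": 1.0, "simplicity": 1.0, "standard_compliance": 2.0}

print(round(weighted_quality(scores, publishing_weights), 3))  # prints 0.85
```

A different mapping goal would be expressed by passing a different weights dictionary to the same function.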

Dimensions and Metrics defined

Automation of the task

An approach by Anastasia Dimou¹, Ruben Verborgh¹, Sebastian Hellmann², Dimitris Kontokostas², Jens Lehmann², Markus Freudenberg², Erik Mannens¹ and Rik Van de Walle¹, to be published at ISWC2015

¹ Ghent University - iMinds - Multimedia Lab, Belgium; ² University of Leipzig, Institute of Computer Science, AKSW, Germany

Input: mapping in R2RML

Output: list of violations of the mapping, refined mapping in R2RML

Tool used: RDFUnit (an RDF validation framework)

Assessed: consistency of the mapping definitions against the R2RML schema and, mainly, consistency and quality assessment of the dataset to be generated. The second point is handled by emulating the resulting RDF dataset to assess its schema conformance.

What is being assessed?

  1. Consistency validation of the mapping definitions
    • 78 automatically generated test cases available in RDFUnit
    • support all OWL axioms in the R2RML ontology (e.g., each Triples Map should have exactly one Subject Map)
  2. Consistency validation and quality assessment of the dataset as projected by its mapping definitions
    • instead of validating the predicate against the subject and object, extract the predicate from the Predicate Map and validate it against the Term Maps that define how the subject and object will be formed.
    • For instance, the extracted predicate expects a Literal as object, but the Term Map that generates the object is a Referencing Object Map, which generates resources instead.
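
The first kind of check (consistency of the mapping definitions) can be illustrated with a small sketch of the "exactly one Subject Map" constraint. This is not how RDFUnit implements it: the mapping is modelled here as plain (subject, predicate, object) tuples, Triples Maps are assumed to be explicitly typed as rr:TriplesMap, and the rr:subject shortcut is ignored. The ex: identifiers are hypothetical.

```python
from collections import Counter

RR = "http://www.w3.org/ns/r2rml#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def subject_map_violations(triples):
    """Return the Triples Maps that do not have exactly one rr:subjectMap."""
    triples_maps = {s for s, p, o in triples if p == RDF_TYPE and o == RR + "TriplesMap"}
    counts = Counter(s for s, p, o in triples if p == RR + "subjectMap")
    return sorted(tm for tm in triples_maps if counts[tm] != 1)


# Hypothetical mapping with one valid and two invalid Triples Maps.
mapping = [
    ("ex:TM1", RDF_TYPE, RR + "TriplesMap"),
    ("ex:TM1", RR + "subjectMap", "ex:SM1"),
    ("ex:TM2", RDF_TYPE, RR + "TriplesMap"),   # no Subject Map
    ("ex:TM3", RDF_TYPE, RR + "TriplesMap"),
    ("ex:TM3", RR + "subjectMap", "ex:SM3a"),
    ("ex:TM3", RR + "subjectMap", "ex:SM3b"),  # two Subject Maps
]

print(subject_map_violations(mapping))  # prints ['ex:TM2', 'ex:TM3']
```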

Metrics assessed

Examples

The WHERE clause of the SPARQL test case that detects a missing language tag is:

?resource ?P1 ?c .
FILTER (lang(?c) = '')

In order to detect the same violation directly from a mapping definition, the WHERE clause of the assessment query is adjusted as follows:

?poMap rr:predicate ?P1 ;
rr:objectMap ?resource .
?P1 rdfs:range rdf:langString .
FILTER NOT EXISTS {?resource rr:language ?lang}
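
To make the logic of the mapping-level query easier to follow, here is a rough Python equivalent over plain (subject, predicate, object) tuples. It is only a sketch of the same pattern, not RDFUnit code: prefixed names are treated as opaque strings, and ex:label with range rdf:langString is a hypothetical property.

```python
def missing_language(mapping, schema):
    """Object maps whose predicate has range rdf:langString but which declare no rr:language."""
    lang_predicates = {s for s, p, o in schema
                       if p == "rdfs:range" and o == "rdf:langString"}
    has_language = {s for s, p, o in mapping if p == "rr:language"}

    # Collect the predicates declared by each predicate-object map.
    predicates_of = {}
    for s, p, o in mapping:
        if p == "rr:predicate":
            predicates_of.setdefault(s, set()).add(o)

    violations = set()
    for s, p, o in mapping:
        if (p == "rr:objectMap"
                and o not in has_language
                and predicates_of.get(s, set()) & lang_predicates):
            violations.add(o)
    return sorted(violations)


# Hypothetical schema and mapping, for illustration only.
schema = [("ex:label", "rdfs:range", "rdf:langString")]
mapping = [
    ("ex:poMap1", "rr:predicate", "ex:label"),
    ("ex:poMap1", "rr:objectMap", "ex:om1"),   # no rr:language -> violation
    ("ex:poMap2", "rr:predicate", "ex:label"),
    ("ex:poMap2", "rr:objectMap", "ex:om2"),
    ("ex:om2", "rr:language", "en"),
]

print(missing_language(mapping, schema))  # prints ['ex:om1']
```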

RDFUnit can annotate test cases by requesting additional variables and binding them to specific result properties.

<5b7a80b8> a rut:ExtendedTestCaseResult;
rut:testCase rutt:rr-produces-range-errors ;
# (...) Further result annotations
spin:violationRoot ex:objectMapX ;
spin:violationPath rr:class ;
spin:violationValue ex:Person ;
rut:missingValue foaf:Person ;
ex:erroneousPredicate foaf:knows .

Automatic refinement

  • Range-level violations: The Predicate Map is used to retrieve the property and identify its range, which is then compared to the corresponding Object Map or Referencing Object Map.
DEL: ex:objectMapX rr:class ex:Person .
ADD: ex:objectMapX rr:class foaf:Person .
MOD: adjust the definition of ex:Person 
  • Domain-level violations: the type(s) assigned to the Subject Map are compared recursively with each predicate’s domain, as specified in the different Predicate Maps.
  • Violations identified when the mapping definitions are instantiated with values from the input source can lead to a new round of refinements
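
The range-level refinement above can be sketched as a function that turns a detected mismatch into DEL/ADD suggestions. This is an illustrative simplification under stated assumptions, not the authors' implementation: it compares class names by string equality only and ignores subclass relationships, and the function name is hypothetical.

```python
def suggest_range_refinement(object_map, assigned_class, expected_class):
    """Return DEL/ADD edits replacing a mismatching rr:class, or [] if it already matches."""
    if assigned_class == expected_class:
        return []
    return [
        ("DEL", object_map, "rr:class", assigned_class),
        ("ADD", object_map, "rr:class", expected_class),
    ]


# Values taken from the annotated violation example in this deck.
edits = suggest_range_refinement("ex:objectMapX", "ex:Person", "foaf:Person")
for op, s, p, o in edits:
    print(f"{op}: {s} {p} {o} .")
```

The MOD alternative (adjusting the definition of ex:Person instead) needs a human decision and is therefore left out of the sketch.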

Prospects for automatic implementation

  1. Quality of the mapping representation
    1. Data accessibility: yes
    2. Standard compliance: yes
  2. Faithfulness of the output
    1. Coverage: yes
    2. Accuracy: yes
    3. Incorporation of domain semantics: yes
  3. Quality of the output
    1. Simplicity: partly (experts are needed to define the frequently requested values)
    2. Data quality: yes
    3. Data integration: yes
  4. Interoperability
    1. Reuse of existing ontologies: yes
    2. Quality of reused elements: partly (requires an API, e.g. from http://lov.okfn.org)
    3. Accuracy of reused properties: partly (only with respect to the property definitions)
    4. Accuracy of reused classes: no, as it is a purely semantic metric
    5. Quality of declared classes/properties: yes


Thank you!