Querying XML

  • Today:

    • XQuery

    • XIRQL

  • Lecture 15

    • Vector space approaches



XQuery

  • SQL for XML

  • Usage scenarios

    • Human-readable documents

    • Data-oriented documents

    • Mixed documents (e.g., patient records)

  • Relies on

    • XPath

    • XML Schema datatypes

  • Turing complete

  • XQuery is still a working draft.



XQuery

  • The principal forms of XQuery expressions are:

    • path expressions

    • element constructors

    • FLWR ("flower") expressions

    • list expressions

    • conditional expressions

    • quantified expressions

    • datatype expressions

  • Evaluated with respect to a context



FLWR

  • FOR $p IN document("bib.xml")//publisher LET $b := document("bib.xml”)//book[publisher = $p] WHERE count($b) > 100 RETURN $p

  • FOR generates an ordered list of bindings of publisher names to $p

  • LET associates to each binding a further binding of the list of book elements with that publisher to $b

  • at this stage, we have an ordered list of tuples of bindings: ($p,$b)

  • WHERE filters that list to retain only the desired tuples

  • RETURN constructs for each tuple a resulting value



Queries Supported by XQuery

  • Location/position (“chapter no.3”)

  • Simple attribute/value

    • /play/title contains “hamlet”

  • Path queries

    • title contains “hamlet”

    • /play//title contains “hamlet”

  • Complex graphs

    • Employees with two managers

  • Subsumes: hyperlinks

  • What about relevance ranking?



How XQuery makes ranking difficult

  • All documents in set A must be ranked above all documents in set B.

  • Fragments must be ordered in depth-first, left-to-right order.



XQuery: Order By Clause

  • for $d in document("depts.xml")//deptno

  • let $e := document("emps.xml")//emp[deptno = $d]

  • where count($e) >= 10

  • order by avg($e/salary) descending

  • return { $d, {count($e)}, {avg($e/salary)} }



XQuery Order By Clause

  • Order by clause only allows ordering by “overt” criterion

    • Say by an attribute value

  • Relevance ranking

    • Is often proprietary

    • Can’t be expressed easily as function of set to be ranked

    • Is better abstracted out of query formulation (cf. www)



XIRQL

  • University of Dortmund

    • Goal: open source XML search engine

  • Motivation

    • “Returnable” fragments are special

      • E.g., don’t return a some text fragment

    • Structured Document Retrieval Principle

    • Empower users who don’t know the schema

      • Enable search for any person no matter how schema encodes the data

      • Don’t worry about attribute/element



Atomic Units

  • Specified in schema

  • Only atomic units can be returned as result of search (unless unit specified)

  • Tf.idf weighting is applied to atomic units

  • Probabilistic combination of “evidence” from atomic units



XIRQL Indexing



Structured Document Retrieval Principle

  • A system should always retrieve the most specific part of a document answering a query.

  • Example query: xql

  • Document:

    • 0.3 XQL

    • 0.5 example

    • 0.8 XQL 0.7 syntax

    • Return section, not chapter



Augmentation weights

  • Ensure that Structured Document Retrieval Principle is respected.

  • Assume different query conditions are disjoint events -> independence.

  • P(chapter,XQL)=P(XQL|chapter)+P(section|chapter)*P(XQL|section) – P(XQL|chapter)*P(section|chapter)*P(XQL|section) = 0.3+0.6*0.8-0.3*0.6*0.8 = 0.636

  • Section ranked ahead of chapter



Datatypes

  • Example: person_name

  • Assign all elements and attributes with person semantics to this datatype

  • Allow user to search for “person” without specifying path



XIRQL: Summary

  • Relevance ranking

  • Fragment/context selection

  • Datatypes (person_name)

  • Semantic relativism

    • Attribute/element





Creator: Tgbyrdmc

Contributors:
-


Licensed under the Creative Commons
Attribution ShareAlike CC-BY-SA license


This deck was created using SlideWiki.