Today:
XQuery
XIRQL
Lecture 15
Vector space approaches
SQL for XML
Usage scenarios
Human-readable documents
Data-oriented documents
Mixed documents (e.g., patient records)
Relies on
XPath
XML Schema datatypes
Turing complete
XQuery is still a working draft.
The principal forms of XQuery expressions are:
path expressions
element constructors
FLWR ("flower") expressions
list expressions
conditional expressions
quantified expressions
datatype expressions
Evaluated with respect to a context
FOR $p IN document("bib.xml")//publisher LET $b := document("bib.xml”)//book[publisher = $p] WHERE count($b) > 100 RETURN $p
FOR generates an ordered list of bindings of publisher names to $p
LET associates to each binding a further binding of the list of book elements with that publisher to $b
at this stage, we have an ordered list of tuples of bindings: ($p,$b)
WHERE filters that list to retain only the desired tuples
RETURN constructs for each tuple a resulting value
Location/position (“chapter no.3”)
Simple attribute/value
/play/title contains “hamlet”
Path queries
title contains “hamlet”
/play//title contains “hamlet”
Complex graphs
Employees with two managers
Subsumes: hyperlinks
What about relevance ranking?
All documents in set A must be ranked above all documents in set B.
Fragments must be ordered in depth-first, left-to-right order.
for $d in document("depts.xml")//deptno
let $e := document("emps.xml")//emp[deptno = $d]
where count($e) >= 10
order by avg($e/salary) descending
return
Order by clause only allows ordering by “overt” criterion
Say by an attribute value
Relevance ranking
Is often proprietary
Can’t be expressed easily as function of set to be ranked
Is better abstracted out of query formulation (cf. www)
University of Dortmund
Goal: open source XML search engine
Motivation
“Returnable” fragments are special
E.g., don’t return a
Structured Document Retrieval Principle
Empower users who don’t know the schema
Enable search for any person no matter how schema encodes the data
Don’t worry about attribute/element
Specified in schema
Only atomic units can be returned as result of search (unless unit specified)
Tf.idf weighting is applied to atomic units
Probabilistic combination of “evidence” from atomic units
A system should always retrieve the most specific part of a document answering a query.
Example query: xql
Document:
Return section, not chapter
Ensure that Structured Document Retrieval Principle is respected.
Assume different query conditions are disjoint events -> independence.
P(chapter,XQL)=P(XQL|chapter)+P(section|chapter)*P(XQL|section) – P(XQL|chapter)*P(section|chapter)*P(XQL|section) = 0.3+0.6*0.8-0.3*0.6*0.8 = 0.636
Section ranked ahead of chapter
Example: person_name
Assign all elements and attributes with person semantics to this datatype
Allow user to search for “person” without specifying path
Relevance ranking
Fragment/context selection
Datatypes (person_name)
Semantic relativism
Attribute/element