Vector space approaches
SQL for XML
Mixed documents (e.g., patient records)
XML Schema datatypes
XQuery is still a working draft.
The principal forms of XQuery expressions are:
FLWR ("flower") expressions
Evaluated with respect to a context
FOR $p IN document("bib.xml")//publisher
LET $b := document("bib.xml")//book[publisher = $p]
WHERE count($b) > 100
RETURN $p
FOR generates an ordered list of bindings of publisher names to $p
LET associates to each binding a further binding of the list of book elements with that publisher to $b
At this stage, we have an ordered list of tuples of bindings: ($p, $b)
WHERE filters that list to retain only the desired tuples
RETURN constructs for each tuple a resulting value
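The four steps above can be sketched in Python as a rough emulation of the query's semantics, not as an XQuery implementation. The sample document, the function name `flwr`, and the lowered threshold (1 instead of 100) are assumptions made only so that the tiny example produces output.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml (not from the source).
BIB_XML = """
<bib>
  <book><title>A</title><publisher>Addison-Wesley</publisher></book>
  <book><title>B</title><publisher>Addison-Wesley</publisher></book>
  <book><title>C</title><publisher>Morgan Kaufmann</publisher></book>
</bib>
"""

def flwr(root, threshold):
    """Emulate FOR / LET / WHERE / RETURN over an in-memory document."""
    results = []
    # FOR: one binding of $p per publisher element, in document order
    for p in root.iter("publisher"):
        # LET: bind $b to the list of book elements with that publisher
        b = [book for book in root.iter("book")
             if book.findtext("publisher") == p.text]
        # WHERE: keep only the ($p, $b) tuples that pass the filter
        if len(b) > threshold:
            # RETURN: construct a result value for each surviving tuple
            results.append(p.text)
    return results

root = ET.fromstring(BIB_XML)
publishers = flwr(root, threshold=1)
print(publishers)  # ['Addison-Wesley', 'Addison-Wesley']
```

Because the sketch binds one tuple per publisher element, a publisher that occurs in two book entries is returned twice.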
Location/position (“chapter no.3”)
/play/title contains “hamlet”
title contains “hamlet”
/play//title contains “hamlet”
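The difference between the three path conditions above can be illustrated with a small Python sketch. The sample play, the helper name `contains`, and the reading of "contains" as a case-insensitive substring match are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the source); the root is <play>.
PLAY_XML = """
<play>
  <title>Hamlet</title>
  <act>
    <title>Act I</title>
    <scene><title>Hamlet meets the ghost</title></scene>
  </act>
</play>
"""

def contains(elements, term):
    """Return the text of every element whose text contains the term."""
    return [e.text for e in elements if term.lower() in (e.text or "").lower()]

root = ET.fromstring(PLAY_XML)

# /play/title: only title elements that are direct children of the play
child_titles = contains(root.findall("title"), "hamlet")

# title: any title element, wherever it occurs in the document
any_titles = contains(root.iter("title"), "hamlet")

# /play//title: title elements at any depth below the play
descendant_titles = contains(root.findall(".//title"), "hamlet")

print(child_titles)       # ['Hamlet']
print(any_titles)         # ['Hamlet', 'Hamlet meets the ghost']
print(descendant_titles)  # ['Hamlet', 'Hamlet meets the ghost']
```

In this sample the last two conditions coincide because every title sits below the root play; they would differ in a collection where titles also occur outside play elements.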
Employees with two managers
What about relevance ranking?
All documents in set A must be ranked above all documents in set B.
Fragments must be ordered in depth-first, left-to-right order.
for $d in document("depts.xml")//deptno
let $e := document("emps.xml")//emp[deptno = $d]
where count($e) >= 10
order by avg($e/salary) descending
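A rough Python emulation of the clauses above shows how the order by step sorts the filtered tuples by an explicit, computed value. The sample data, the function name `big_departments`, the returned (deptno, average salary) pairs, and the lowered size threshold (2 instead of 10) are assumptions; the query as shown does not include its return clause.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical stand-ins for depts.xml and emps.xml (not from the source).
DEPTS_XML = "<depts><deptno>10</deptno><deptno>20</deptno><deptno>30</deptno></depts>"
EMPS_XML = """
<emps>
  <emp><deptno>10</deptno><salary>50</salary></emp>
  <emp><deptno>10</deptno><salary>70</salary></emp>
  <emp><deptno>20</deptno><salary>90</salary></emp>
  <emp><deptno>20</deptno><salary>80</salary></emp>
  <emp><deptno>30</deptno><salary>100</salary></emp>
</emps>
"""

def big_departments(depts_root, emps_root, min_size):
    """Emulate for / let / where / order by over two in-memory documents."""
    tuples = []
    for d in depts_root.iter("deptno"):                    # for $d
        e = [emp for emp in emps_root.iter("emp")
             if emp.findtext("deptno") == d.text]          # let $e
        if len(e) >= min_size:                             # where count($e) >= ...
            salaries = [float(emp.findtext("salary")) for emp in e]
            tuples.append((d.text, mean(salaries)))
    # order by avg($e/salary) descending
    tuples.sort(key=lambda t: t[1], reverse=True)
    return tuples

ranked = big_departments(ET.fromstring(DEPTS_XML), ET.fromstring(EMPS_XML), min_size=2)
print(ranked)  # [('20', 85.0), ('10', 60.0)]
```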
Order by clause only allows ordering by “overt” criterion
Say by an attribute value
Relevance ranking is often proprietary
It can’t be expressed easily as a function of the set to be ranked
It is better abstracted out of query formulation (cf. the WWW)
University of Dortmund
Goal: open source XML search engine
“Returnable” fragments are special
E.g., don’t return a
Structured Document Retrieval Principle
Empower users who don’t know the schema
Enable search for any person no matter how schema encodes the data
Don’t worry about attribute/element
Specified in schema
Only atomic units can be returned as result of search (unless unit specified)
Tf.idf weighting is applied to atomic units
Probabilistic combination of “evidence” from atomic units
A system should always retrieve the most specific part of a document answering a query.
Example query: xql
Return section, not chapter
Ensure that Structured Document Retrieval Principle is respected.
Assume different query conditions are independent events, so that P(A or B) = P(A) + P(B) - P(A)*P(B).
P(chapter, XQL) = P(XQL|chapter) + P(section|chapter) * P(XQL|section) - P(XQL|chapter) * P(section|chapter) * P(XQL|section)
= 0.3 + 0.6 * 0.8 - 0.3 * 0.6 * 0.8
= 0.636
Section ranked ahead of chapter
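The arithmetic above can be checked with a short Python sketch. The function and variable names are hypothetical; the three probabilities are the ones used in the example.

```python
def combine_independent(p_a, p_b):
    """P(A or B) for independent events A and B."""
    return p_a + p_b - p_a * p_b

p_xql_given_chapter = 0.3      # evidence in the chapter itself
p_section_given_chapter = 0.6  # weight of the section within the chapter
p_xql_given_section = 0.8      # evidence in the section

# Evidence that reaches the chapter through its section
via_section = p_section_given_chapter * p_xql_given_section

p_chapter = combine_independent(p_xql_given_chapter, via_section)

# Rank the two candidate units by score, highest first
ranking = sorted(
    [("chapter", p_chapter), ("section", p_xql_given_section)],
    key=lambda pair: pair[1],
    reverse=True,
)

print(round(p_chapter, 3))        # 0.636
print([name for name, _ in ranking])  # ['section', 'chapter']
```

Since 0.8 exceeds 0.636, the section is returned ahead of the chapter, which is the behavior the Structured Document Retrieval Principle asks for.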
Assign all elements and attributes with person semantics to this datatype
Allow user to search for “person” without specifying path
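One way to picture the datatype idea is a lookup that treats every element or attribute mapped to "person" as searchable, regardless of its path. The sketch below is only an illustration under assumptions: the set `PERSON_NAMES`, the sample record, and the function `search_person` are hypothetical and are not taken from any particular system.

```python
import xml.etree.ElementTree as ET

# Hypothetical assignment of element and attribute names to the
# "person" datatype; in practice this would come from the schema.
PERSON_NAMES = {"author", "editor", "patient", "physician"}

# Hypothetical sample document (not from the source).
DOC_XML = """
<record>
  <patient>Ann Lee</patient>
  <visit physician="Bob Ray"><note>Referred by Ann Lee</note></visit>
  <book><author>Ann Lee</author><title>Ann Lee: a life</title></book>
</record>
"""

def search_person(root, name):
    """Find the name in any element or attribute typed as a person."""
    hits = []
    for elem in root.iter():
        # Elements with person semantics
        if elem.tag in PERSON_NAMES and name in (elem.text or ""):
            hits.append(("element", elem.tag))
        # Attributes with person semantics
        for attr, value in elem.attrib.items():
            if attr in PERSON_NAMES and name in value:
                hits.append(("attribute", attr))
    return hits

root = ET.fromstring(DOC_XML)
print(search_person(root, "Ann Lee"))  # [('element', 'patient'), ('element', 'author')]
print(search_person(root, "Bob Ray"))  # [('attribute', 'physician')]
```

The note and title elements also mention "Ann Lee" but are not typed as persons, so they are not returned; the user never has to name a path or know whether the schema uses an element or an attribute.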