Text Retrieval and Mining
eXtensible Markup Language
A framework for defining markup languages
No fixed collection of markup tags
Each XML language targeted for application
All XML languages share features
Enables building of generic tools
An XML document is an ordered, labeled tree
character data leaf nodes contain the actual data (text strings)
element nodes, are each labeled with
a name (often called the element type), and
a set of attributes, each consisting of a name and a value,
can have child nodes
Elements are denoted by markup tags
Element start tag: foo
The character data: thetext
Matching element end tag:
HTML is a markup language for a specific purpose (display in browsers)
XML is a framework for defining markup languages
HTML can be formalized as an XML language (XHTML)
XML defines logical structure only
HTML: same intention, but has evolved into a presentation language
Separate syntax from semantics to provide a common framework for structuring information
Allow tailor-made markup for any imaginable application domain
Support internationalization (Unicode) and platform independence
Be the future of (semi)structured information (do some of the work now done by databases)
Represent semi-structured data (data that are structured, but don’t fit relational model)
XML is more flexible than DBs
XML is more structured than simple IR
You get a massive infrastructure for free
CML – chemical markup language
WML – wireless markup language
ThML – theological markup language
EVERY man naturally desires knowledge
; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars.
Augustine, Confessions V. 4.
Schema = syntax definition of XML language
Schema language = formal language for expressing XML schemas
Document Type Definition
XML Schema (W3C)
Relevance for XML IR
Our job is much easier if we have a (one) schema
XML Indexing and Search
Uses XML document as logical unit
PCDATA (parsed character data)
DB modified for XML
Generic IR system modified for XML
Most native XML databases have taken a DB approach
Evaluate path expressions
No IR type relevance ranking
Only a few that focus on relevance ranking
Data-centric XML: used for messaging between enterprise applications
Mainly a recasting of relational data
Content-centric XML: used for annotating content
Rich in text
Demands good integration of text retrieval functionality
E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price
There is no document unit in XML
How do we compute tf and idf?
Global tf/idf over all text context is useless
IR systems don’t store content (only index)
Need to go to document for retrieving/displaying fragment
E.g., give me the Abstracts of Papers on existentialism
Where do you retrieve the Abstract from?
Easier in DB framework
There is one schema
User understands schema
In practice: rare
Schemas not known in advance
Users don’t understand schemas
Need to identify similar elements in different schemas
Help user find relevant nodes in schema
Author, editor, contributor, “from:”/sender
What is the query language you expose to the user?
Specific XML query language? No.
Forms? Parametric search?
In general: design layer between XML and user
Why you don’t want to use a DB
Contains vs “is about”
DB has no notion of ordering
Vector space approaches
SQL for XML
Mixed documents (e.g., patient records)
XML Schema datatypes
XQuery is still a working draft.
The principal forms of XQuery expressions are:
FLWR ("flower") expressions
Evaluated with respect to a context
FOR $p IN document("bib.xml")//publisher LET $b := document("bib.xml”)//book[publisher = $p] WHERE count($b) > 100 RETURN $p
FOR generates an ordered list of bindings of publisher names to $p
LET associates to each binding a further binding of the list of book elements with that publisher to $b
at this stage, we have an ordered list of tuples of bindings: ($p,$b)
WHERE filters that list to retain only the desired tuples
RETURN constructs for each tuple a resulting value
Location/position (“chapter no.3”)
/play/title contains “hamlet”
title contains “hamlet”
/play//title contains “hamlet”
Employees with two managers
What about relevance ranking?
All documents in set A must be ranked above all documents in set B.
Fragments must be ordered in depth-first, left-to-right order.
for $d in document("depts.xml")//deptno
let $e := document("emps.xml")//emp[deptno = $d]
where count($e) >= 10
order by avg($e/salary) descending
Order by clause only allows ordering by “overt” criterion
Say by an attribute value
Is often proprietary
Can’t be expressed easily as function of set to be ranked
Is better abstracted out of query formulation (cf. www)
University of Dortmund
Goal: open source XML search engine
“Returnable” fragments are special
E.g., don’t return a
Structured Document Retrieval Principle
Empower users who don’t know the schema
Enable search for any person no matter how schema encodes the data
Don’t worry about attribute/element
Specified in schema
Only atomic units can be returned as result of search (unless unit specified)
Tf.idf weighting is applied to atomic units
Probabilistic combination of “evidence” from atomic units
A system should always retrieve the most specific part of a document answering a query.
Example query: xql
Return section, not chapter
Ensure that Structured Document Retrieval Principle is respected.
Assume different query conditions are disjoint events -> independence.
P(chapter,XQL)=P(XQL|chapter)+P(section|chapter)*P(XQL|section) – P(XQL|chapter)*P(section|chapter)*P(XQL|section) = 0.3+0.6*0.8-0.3*0.6*0.8 = 0.636
Section ranked ahead of chapter
Assign all elements and attributes with person semantics to this datatype
Allow user to search for “person” without specifying path
A very basic introduction.
What are the primitives we need?
Inverted index: give me all elements matching text query Q
We know how to do this – treat each element as a document
Give me all elements (immediately) below any instance of the Book element
Combination of the above
Number each element
Maintain a list of parent-child relationships
E.g., Chapter:21 Book:8
Enables immediate parent
But what about “the word Hamlet under a Scene element under a Play element?
View the XML document as a text document
Build a positional index for each element
Mark the beginning and end for each element, e.g.,
Play → Doc:1(27) → Doc:1(2033) ---→
/Play → Doc:1(1122) → Doc:1(5790) ---→
Verse → Doc:1(431) → Doc:4(33) ---→
/Verse → Doc:1(867) → Doc:4(92) ---→
Term:droppeth → Doc:1(720)
droppeth under Verse under Play.
Path containment etc. can essentially be solved by positional inverted indexes
Retrieval consists of “merging” postings
All the compression tricks etc. from 276A are still applicable
Complications arise from insertion/deletion of elements, text within elements
Beyond the scope of this course
Jan-Marco Bremer’s publications on xml and ir: http://www.db.cs.ucdavis.edu/~bremer
www.w3.org/XML - XML resources at W3C
Ronald Bourret on native XML databases: http://www.rpbourret.com/xml/ProdsNative.htm
Norbert Fuhr and Kai Grossjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th International ACM SIGIR Conference, New Orleans, Louisiana, September 2001.
ORDPATHs: Insert-Friendly XML Node Labels.