Query Relaxation for XML Database

Project funded NSF Award # 0219442

1. Project Overview

1.1 Project Introduction & Motivation

As the World Wide Web becomes a major means of disseminating and sharing information, there has been an exponential increase in the amount of data in web-complaint format such as HyperText Markup Language (HTML) and Extensible Markup Language (XML). XML is essentially a text representation of hierarchical (tree-like) data where a meaningful piece of data is bounded by matching starting and ending tags, such as <name> and </name>. Due to its simplicity as compared to SGML and its relative expressiveness as compared to HTML, XML has become a popular format for information representation and data exchange.

To cope with the tree-like structures in the XML model, several XML-specific query languages have been proposed (e.g. XPath, XQuery, etc.). All these query languages aim at the exact matching of query conditions. Answers are found when those XML documents match the given query condition exactly. However, this may not always be the case in the XML model. To remedy this condition, we propose to develop a query relaxation framework for searching answers that approximately match the given query conditions. Query relaxation enables systems to relax the user query to a less restricted form to devise approximate answers.

In the XML domain, because of the flexible nature of the XML model allows varied structures and/or values and the non-rigid XML tag syntax enables to embed a wealth of meta information in XML documents, the need for such query relaxation increases. The following points illustrates that query relaxation is more important for the XML model than the relational model:

  1. Unlike in the relational model where users are given a relatively small-sized schema to ask queries, the schema in the XML model is substantially bigger and more complex. Therefore, it is often unrealistic for users to understand the full schema and compose very complex queries at once. Therefore, it becomes critical to be able to relax the user's query when the original query yields null or not sufficient answers.
  2. As the number of data sources available on the web increases, it becomes common to build a system where data is gathered from heterogeneous data sources and where the structures of the participating data sources are different, even though they are using the same ontology. Therefore, the capability to query against differently-structured data sources becomes more important. Query relaxation allows the query to be structurally relaxed and routed to diverse data sources with different structures.

We propose a cooperative XML (CoXML) system for approximate query answering. The objective of CoXML is to develop an XML query relaxation enabled search engine such that when users' queries have empty or insufficient answers, our system can relax the queries to get approximate answers.

1.2 Summary of Approaches
CoXML is based on the following two key techniques:
  1. Extending the standard XQuery with relaxation constructs and controls. Such a relaxed XQuery (RLXQuery) language enable users to specify relaxation constructs (e.g., approximate, preferred, unacceptable query conditions etc.), and relaxation controls (e.g., relaxation orders, answer set size, etc.).
  2. Using a knowledge-based relaxation index XML Type Abstraction Hierarchy (XTAH), to guide the XML query relaxation process. XTAH is a hierarchical tree-like multi-level knowledge structure where similar XML fragments are clustered based on the inter XML fragment distance measure. An XTAH has two node types: internal nodes and leaf nodes. An internal node has a representative tree that summarizes the characteristics, including the structure and semantics, of all the XML fragments in that cluster. A leaf node has a representative tree as well as a set of XML fragments that are similar to each other. Guided by the XTAH, based on the relaxation constructs in RLXQuery, CoXML processes the query and returns approximately matched answers.
1.3 CoXML System Overview

The diagram above illustrates the architecture of our CoXML system, which consists of a relaxation controller, a RLXQuery parser, a RLXQuery preprocessor, a database manager, a relaxation manager, an XTAH manager, an XTAH generator, an execution trace manager and a post-processor. The functionalities of each module are briefly described as follows:

  • XTAH Generator
    Given a set of XML documents and some domain knowledge, the XTAH generator automatically generates XTAHs that summarize the content and structure characteristics of the underlying XML documents.
  • Relaxation Controller
    The relaxation controller is the main interface to the outside world. The controller is the controlling unit of RLXQuery search engine and it dispatches the query to different modules to allow them to perform some specific task to the query. It is the module that decides when to stop query relaxation based on the criteria specified by the user.
  • RLXQuery Parser
    The RLXQuery Parser transforms an RLXQuery into an XQueryRep object. The Query Parser performs two services. The first part is responsible for checking the syntax of the query. If the syntax is correct, then the query is parsed to the second part. The second part extracts all the in formation from the query and creates an XQueryRep object.
  • RLXQuery Preprocessor
    The RLXQuery Preprocessor transforms relaxation constructs (if any) in the query into the standard XQuery constructs and communicates with the database manager to check whether there are enough answers available. If not, the relaxation controller will send the query to the relaxation manager.
  • Database Manager
    The functionality of the database manager is to find exact matched answers to a standard XQuery.
  • Relaxation Manager
    The relaxation manager performs the following functionalities: 1) building a relaxation structure based on the specified relaxation constructs and controls; 2)communicating with the XTAH manager to get relaxed conditions; and 3)modifying the query accordingly.
  • XTAH Manager
    The functionality of the XTAH manager is to select an appropriate XTAH for guiding the query relaxation process.
  • Post-processor
    The functionality of the post-processor is to accept unsorted answers and output these answers in a ranked order.
  • Execution Trace Manager
    The Execution Trace keeps track of all the actions incurred on the XQueryRep by each module. All modules need to communicate with the Execution Trace to record their changes to the XQueryRep.
  • Ranking of approximate answers
    A model has been developed to rank answers based on both content and structure relaxation.

2. Screenshots Examples

Example of using a GUI interface to help users to specify relaxations in queries

3. Members

Wesley W. Chu Principal Investigator wwc AT cs.ucla.edu
Shaorong Liu Ph.D. Student sliu AT cs.ucla.edu
Ruzan Shahinian M.S. Student
Joseph Chen M.S. Student jccs05 AT gmail.com
Christian Cardenas M.S. Student
Anna Putnam M.S. Student
Eric Sung M.S. Student  
Tony Lee M.S. Student  

4. Publications

4.1 Papers
4.2 Presentations & Posters

4.3 Master Comprehensive Exam Reports
  • Tony Lee. Indexing Technique for XML structure patterns. MaSters Thesis, Computer Science Department, UCLA, Winter 2003. PDF
  • Eric Sung. Parser for Relaxation-Enabled XML Query Language. Masters Thesis, Computer Science Department, UCLA, Fall 2004. PDF
  • Christian Cardenas. Parsing and Preprocessing of RLXQuery for XML Query Relaxation. Masters Thesis, Computer Science Department, UCLA, Spring 2004. PDF
  • Anna Putnam. The Relaxation Manager for XML Query Relaxation. Masters Thesis, Computer Science Department, UCLA, Spring 2004. PDF