[xquery-talk] A Couple of Questions - OOXML and SQL

Jim Melton jim.melton at acm.org
Thu Apr 3 10:17:28 PST 2008


Scott,

Actually, that is a very good (and, IMHO, important) question.  I'm 
glad you asked it in this forum.

First, I have to admit that I don't know anything specific about 
OOXML.  I have a general clue what it is, of course, but don't know 
details about the structure, schemas, etc.  IF OOXML happened to 
contain only formatting information (which a typical word processing 
document would), then XQuery wouldn't be any help locating, say, all 
red parts or all employees whose birthdays are this month.  You'd be 
stuck looking for <heading1> or <emphasis> information.

On the other hand, if (as I suspect it does) OOXML permits document 
authors to define their own elements to represent semantic concepts 
such as "part", "color", "employee", and "birthday", then XQuery 
would provide one good way of looking for such data in OOXML 
documents.  Somebody with more specific information about OOXML will 
have to pursue this question for you.
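
To make the contrast concrete, here is a minimal sketch in Python rather than XQuery. The element names (heading1, emphasis, part, name, color) are hypothetical and are not taken from any OOXML schema. The path used in red_part_names is roughly what an XQuery author might write as //part[color = 'red']/name.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the element names are invented for illustration
# and do not come from any real OOXML schema.
FORMATTING_ONLY = """
<doc>
  <heading1>Parts list</heading1>
  <para>Widget, <emphasis>red</emphasis></para>
  <para>Gadget, <emphasis>blue</emphasis></para>
</doc>
"""

SEMANTIC = """
<doc>
  <part><name>Widget</name><color>red</color></part>
  <part><name>Gadget</name><color>blue</color></part>
</doc>
"""


def emphasized_text(xml_text):
    """Return the text of every <emphasis> element (a formatting query)."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.iter("emphasis")]


def red_part_names(xml_text):
    """Return the names of parts whose <color> child is 'red' (a semantic query)."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a small XPath subset, including [child='text'].
    return [p.findtext("name") for p in root.findall("./part[color='red']")]


print(emphasized_text(FORMATTING_ONLY))
print(red_part_names(SEMANTIC))
print(red_part_names(FORMATTING_ONLY))
```

Against the formatting-only document, the best a query can do is list what was emphasized; the semantic query finds nothing there because no element says what a "part" or a "color" is.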

Now, to respond to your question about query algebras...

As you obviously know, SQL is based on the relational model, which is 
in turn based on set theory.  This is possible in large part because 
of the way SQL (and relational) data is structured.  And I used the 
word "structured" very carefully to imply two things: "...the way the 
data is logically organized..." *and* "...the fact that the data is 
highly structured in nature...".  It's the latter aspect that makes 
set theory work so well with SQL/relational data.  In the 
SQL/relational models (they're not quite the same thing, as you also 
know), data is represented as "things" that we call "tuples" or 
"rows", each of which has one or more "attributes" or "columns", each 
of which contains a piece of data (in SQL's case, that piece of data 
is permitted to be the special flag that we call "the null value" to 
indicate that the data is absent, irrelevant, unknown, not 
applicable, etc.).  We refer to data with this characteristic as 
"structured" or "regular".

By contrast, data represented in XML is often much less regular.  In 
some kinds of XML, the data might well be very regular and thus 
highly structured (with, in fact, more "structure" than is normally 
feasible in the relational model).  But XML doesn't require 
that.  Consider the case of a book or newspaper or contract, in which 
characteristics such as boldface, italics, underlining, etc. are 
sprinkled (effectively) at random throughout paragraphs.  That sort 
of data is very difficult to represent in a relational world because 
of its unpredictability.

Consider another form of data in which values captured from physical 
sensors are represented.  Every second, data is gathered, put into an 
XML format, and transmitted to a consumer.  But sensors (like all physical 
objects) are unreliable and sometimes one or more sensors do not 
respond to the inquiry, or some of the data from a single sensor is 
garbled but the rest valid.  It is necessary to be able to transmit 
partial information to the consumers.  That, in turn, means that the 
structure of the XML data is (allowed to be) irregular, or less structured.
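
A small Python sketch of that sensor case follows. The feed layout, element names, and the "##garbled##" marker are all assumptions made up for illustration. Sensor "c" never answered and is simply absent, sensor "b" lacks a pressure value, and sensor "d" sent a garbled temperature.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-second sensor feed; names and layout are invented.
FEED = """
<readings time="2008-04-03T10:17:28">
  <sensor id="a"><temp>21.5</temp><pressure>101.3</pressure></sensor>
  <sensor id="b"><temp>22.0</temp></sensor>
  <sensor id="d"><temp>##garbled##</temp><pressure>100.9</pressure></sensor>
</readings>
"""


def to_float(text):
    """Convert text to float, or None when the value is missing or garbled."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def parse_feed(xml_text):
    """Map each sensor id to its readings, keeping whatever partial data arrived."""
    root = ET.fromstring(xml_text)
    rows = {}
    for sensor in root.findall("sensor"):
        rows[sensor.get("id")] = {
            # findtext returns None when the child element is absent.
            "temp": to_float(sensor.findtext("temp")),
            "pressure": to_float(sensor.findtext("pressure")),
        }
    return rows


for sensor_id, values in parse_feed(FEED).items():
    print(sensor_id, values)
```

The XML simply omits what it does not have. Flattening it into fixed rows forces a placeholder (None here, much like SQL's null value) for every gap.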

When data becomes semi-structured (as in the sensor example) or 
unstructured (consider a 4-year-old child playing with a word 
processor), it's difficult to apply mathematical principles, 
especially set theory, to such data.  That doesn't mean that all is lost.

But, first, allow me to clarify the SQL situation a bit more.  Yes, 
SQL can be transformed into an algebra, and most products do this to 
a large degree for execution.  But SQL itself is more nearly (not 
completely, though!) a calculus than it is an algebra.  SQL, like 
XQuery, is mostly a declarative language in which query authors state 
the intent of the query instead of the algorithm for finding the 
answers.  Thus, in essence, an SQL compiler transforms a 
calculus-like language into an algebra for execution (well, for 
further compilation into executable code).  It is the optimizer in an 
SQL system that guides that transformation to make queries reasonably 
efficient.
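
The difference between stating intent and composing operators can be sketched in Python. This is an illustration only, not how any SQL product is implemented: the table, its columns, and the helper names select, project, and algebra_plan are hypothetical. The declarative version hands a query to SQLite through the standard sqlite3 module, while the algebraic version spells out a selection followed by a projection.

```python
import sqlite3

# Hypothetical data; the table and column names are invented.
ROWS = [
    {"name": "Widget", "color": "red", "qty": 3},
    {"name": "Gadget", "color": "blue", "qty": 5},
    {"name": "Gizmo", "color": "red", "qty": 7},
]


def select(rows, predicate):
    """Algebraic selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def project(rows, columns):
    """Algebraic projection: keep only the named columns (duplicates kept, as in SQL)."""
    return [tuple(r[c] for c in columns) for r in rows]


def algebra_plan(rows):
    """An explicit operator tree: project(name) over select(color = 'red')."""
    return project(select(rows, lambda r: r["color"] == "red"), ["name"])


def declarative(rows):
    """State what is wanted and let the SQL engine decide how to compute it."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE parts (name TEXT, color TEXT, qty INTEGER)")
    con.executemany("INSERT INTO parts VALUES (:name, :color, :qty)", rows)
    result = con.execute(
        "SELECT name FROM parts WHERE color = 'red' ORDER BY name"
    ).fetchall()
    con.close()
    return result


print(sorted(algebra_plan(ROWS)))
print(declarative(ROWS))
```

Both calls produce the same names. The SQL text never says in which order to filter and project, whereas the operator tree does.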

Viewed from 10,000 meters, XQuery is not substantially different.  It is a 
declarative, calculus-like language that is often (usually? always?) 
translated into a sort of algebraic form.  And optimizers are 
responsible for making the result of that transformation 
efficient.  But that's where the similarity breaks down.  There is 
not, as far as I am aware (full disclosure: I do not read a lot of 
research papers, so I could be way out of date here), a 
well-defined, rigorous algebra associated with XML data, the 
XPath/XQuery Data Model, or XQuery.  I would not be surprised if 
one never appears, but I wouldn't be stunned if one eventually does, either.

Hope this helps,
    Jim


>Date: Wed, 2 Apr 2008 22:29:39 -0700
>From: "Tsao, Scott" <scott.tsao at boeing.com>
>Subject: [xquery-talk] A Couple of Questions - OOXML and SQL
>To: <talk at x-query.com>
>
>During a recent XQuery Overview presentation, a couple of questions
>were raised for which I am searching for answers:
>
>    1. Office Open XML (OOXML) is a file format used by the Microsoft
>Office 2007 applications. Can XQuery be used to get meaningful
>information from an OOXML document, or would it only return items based
>on formatting aspects (all heading 1s, or all list items)?
>
>    2. SQL is based in part on Set theory from Mathematics, and Set
>algebra. It allows set operations "update all red projects to green."
>Does XQuery support set algebra? For example, SQL join is a set
>operation that has inner, outer, Cartesian forms.
>http://en.wikipedia.org/wiki/Algebra_of_sets
>
>Do you have answers to those questions? If you do, please do share!
>
>
>Thanks,
>
>Scott Tsao
>Associate Technical Fellow
>The Boeing Company

========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
   Co-Chair, W3C XML Query WG; XQX (etc.) editor    Fax : +1.801.942.3345
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA          Personal email: jim at melton dot name
========================================================================
=  Facts are facts.   But any opinions expressed are the opinions      =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
======================================================================== 


