[xquery-talk] A Couple of Questions - OOXML and SQL
jim.melton at acm.org
Thu Apr 3 10:17:28 PST 2008
Actually, that is a very good (and, IMHO, important) question. I'm
glad you asked it in this forum.
First, I have to admit that I don't know anything specific about
OOXML. I have a general clue what it is, of course, but don't know
details about the structure, schemas, etc. IF OOXML happened to
contain only formatting information (which a typical word processing
document would), then XQuery wouldn't be any help locating, say, all
red parts or all employees whose birthdays are this month. You'd be
stuck looking for <heading1> or <emphasis> information.
On the other hand, if (as I suspect it does) OOXML permits document
authors to define their own elements to represent semantic concepts
such as "part", "color", "employee", and "birthday", then XQuery
would provide one good way of looking for such data in OOXML
documents. Somebody with more specific information about OOXML will
have to pursue this question for you.
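As a rough illustration of the difference (not a statement about what OOXML actually contains), here is a minimal Python sketch using the standard library's ElementTree. The element names `part`, `name`, `color`, `heading1`, and `emphasis` are hypothetical and are not taken from the OOXML schemas. In XQuery/XPath terms, the semantic lookup would be roughly `//part[color = 'red']/name`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these element names are invented for illustration
# and are NOT taken from the OOXML schemas.
FORMATTING_ONLY = """
<doc>
  <heading1>Parts list</heading1>
  <para>The <emphasis>red</emphasis> widget costs 4.50.</para>
  <para>The blue gadget costs 7.25.</para>
</doc>
"""

SEMANTIC = """
<doc>
  <part><name>widget</name><color>red</color></part>
  <part><name>gadget</name><color>blue</color></part>
  <part><name>sprocket</name><color>red</color></part>
</doc>
"""

def red_parts(xml_text):
    """Return the names of <part> elements whose <color> child is 'red'."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset; [color='red'] keeps
    # <part> elements that have a <color> child with exactly that text.
    return [p.findtext("name") for p in root.findall(".//part[color='red']")]

def emphasized_text(xml_text):
    """Return the text of every <emphasis> element (formatting only)."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.iter("emphasis")]

print(red_parts(SEMANTIC))               # ['widget', 'sprocket']
print(red_parts(FORMATTING_ONLY))        # []
print(emphasized_text(FORMATTING_ONLY))  # ['red']
```

With formatting-only markup, the most a query can report is that the word "red" was emphasized; it cannot tell which part is red.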
Now, to respond to your question about query algebras...
As you obviously know, SQL is based on the relational model, which is
in turn based on set theory. This is possible in large part because
of the way SQL (and relational) data is structured. And I used the
word "structured" very carefully to imply two things: "...the way the
data is logically organized..." *and* "...the fact that the data is
highly structured in nature...". It's the latter aspect that makes
set theory work so well with SQL/relational data. In the
SQL/relational models (they're not quite the same thing, as you also
know), data is represented as "things" that we call "tuples" or
"rows", each of which has one or more "attributes" or "columns", each
of which contains a piece of data (in SQL's case, that piece of data
is permitted to be the special flag that we call "the null value" to
indicate that the data is absent, irrelevant, unknown, not
applicable, etc.). We refer to data with this characteristic as
"structured" or "regular".
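To make "regular" concrete, here is a small sketch using Python's built-in sqlite3 module. The `employee` table and its rows are invented for illustration. Every row has the same columns, an absent value is stored as the null value, and a single set-oriented statement applies to all qualifying rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, invented for this example.
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, birthday TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("Ann", "Sales", "1970-04-12"),
        ("Bob", "Sales", None),  # birthday unknown: stored as NULL
        ("Cho", "Design", "1985-11-30"),
    ],
)

# Regularity: every row has exactly the same three columns,
# even when one of the values is absent.
rows = conn.execute(
    "SELECT name, dept, birthday FROM employee ORDER BY name"
).fetchall()
widths = {len(r) for r in rows}

# A set-oriented query: employees whose birthday falls in April.
# The row with a NULL birthday simply does not qualify.
april = conn.execute(
    "SELECT name FROM employee WHERE strftime('%m', birthday) = '04'"
).fetchall()

# The null value has to be tested for explicitly.
unknown = conn.execute(
    "SELECT name FROM employee WHERE birthday IS NULL"
).fetchall()

print(widths)   # {3}
print(april)    # [('Ann',)]
print(unknown)  # [('Bob',)]
```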
By contrast, data represented in XML is often much less regular. In
some kinds of XML, the data might well be very regular and thus
highly structured (with, in fact, more "structure" than is normally
feasible in the relational model). But XML doesn't require
that. Consider the case of a book or newspaper or contract, in which
characteristics such as boldface, italics, underlining, etc. are
sprinkled (effectively) at random throughout paragraphs. That sort
of data is very difficult to represent in a relational world because
of its unpredictability.
Consider another form of data in which values captured from physical
sensors are represented. Every second, data is gathered, put into an
XML format, and transmitted to a consumer. But sensors (like all physical
objects) are unreliable, and sometimes one or more sensors do not
respond to the inquiry, or some of the data from a single sensor is
garbled but the rest valid. It is necessary to be able to transmit
partial information to the consumers. That, in turn, means that the
structure of the XML data is (allowed to be) irregular, or less structured.
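A minimal sketch of such a message and of a consumer that tolerates the gaps follows. The element and attribute names (`readings`, `sensor`, `temp`, `humidity`, `id`) are assumptions made up for the example, not any real sensor format.

```python
import xml.etree.ElementTree as ET

# Hypothetical message: sensor s2 reported no humidity, s3 did not
# respond at all, and s4 reported no temperature.
MESSAGE = """
<readings time="2008-04-03T10:17:28">
  <sensor id="s1"><temp>21.5</temp><humidity>40</humidity></sensor>
  <sensor id="s2"><temp>22.1</temp></sensor>
  <sensor id="s4"><humidity>38</humidity></sensor>
</readings>
"""

def parse_readings(xml_text):
    """Map sensor id -> {measurement name: value}, keeping only what arrived."""
    root = ET.fromstring(xml_text)
    result = {}
    for sensor in root.findall("sensor"):
        # Each sensor contributes whatever child elements it managed to
        # send; nothing forces every sensor to have the same set.
        result[sensor.get("id")] = {
            child.tag: float(child.text) for child in sensor
        }
    return result

readings = parse_readings(MESSAGE)

# Consumers must ask "is this value present?" before using it.
temps = {sid: v["temp"] for sid, v in readings.items() if "temp" in v}

print(readings)
print(temps)  # {'s1': 21.5, 's2': 22.1}
```

Forcing this into a fixed-width table would mean either padding every missing value with nulls or splitting the data across several tables.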
When data becomes semi-structured (as in the sensor example) or
unstructured (consider a 4-year-old child playing with a word
processor), it's difficult to apply mathematical principles,
especially set theory, to such data. That doesn't mean that all is lost.
But, first, allow me to clarify the SQL situation a bit more. Yes,
SQL can be transformed into an algebra, and most products do this to
a large degree for execution. But SQL itself is more nearly (not
completely, though!) a calculus than it is an algebra. SQL, like
XQuery, is mostly a declarative language in which query authors state
the intent of the query instead of the algorithm for finding the
answers. Thus, in essence, an SQL compiler transforms a
calculus-like language into an algebra for execution (well, for
further compilation into executable code). It is the optimizer in an
SQL system that guides that transformation to make queries reasonably
efficient.
XQuery is not substantially different when viewed from 10,000 meters. It is a
declarative, calculus-like language that is often (usually? always?)
translated into a sort of algebraic form. And optimizers are
responsible for making the result of that transformation
efficient. But that's where the similarity breaks down. There is
not, as far as I am aware (full disclosure: I do not read a lot of
research papers, so I could be 'way out of date here), a
well-defined, rigorous algebra associated with XML data, the
XPath/XQuery Data Model, or XQuery. I would not be surprised if
there never is one, but I wouldn't be stunned if one appears, either.
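To illustrate the calculus-versus-algebra distinction for the SQL case, here is a hedged Python sketch. The declarative query states only what is wanted; the hand-written `select`, `project`, and `join` functions are simplified stand-ins for relational algebra operators, composed into one possible plan. The tables, data, and function names are all invented for the example, and no real optimizer is claimed to work this way.

```python
import sqlite3

PARTS = [
    {"pno": 1, "name": "widget", "color": "red"},
    {"pno": 2, "name": "gadget", "color": "blue"},
    {"pno": 3, "name": "sprocket", "color": "red"},
]
ORDERS = [
    {"pno": 1, "qty": 10},
    {"pno": 3, "qty": 5},
    {"pno": 2, "qty": 7},
]

# Simplified algebra-style operators over lists of dicts.
def select(rows, pred):
    return [r for r in rows if pred(r)]

def project(rows, cols):
    return [tuple(r[c] for c in cols) for r in rows]

def join(left, right, key):
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def algebra_plan():
    """One explicit plan: filter parts first, then join, then project."""
    red = select(PARTS, lambda r: r["color"] == "red")
    return sorted(project(join(red, ORDERS, "pno"), ["name", "qty"]))

def declarative_query():
    """The same question stated declaratively; the engine picks the plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE part (pno INTEGER, name TEXT, color TEXT)")
    conn.execute("CREATE TABLE ord (pno INTEGER, qty INTEGER)")
    conn.executemany("INSERT INTO part VALUES (:pno, :name, :color)", PARTS)
    conn.executemany("INSERT INTO ord VALUES (:pno, :qty)", ORDERS)
    sql = """
        SELECT p.name, o.qty
        FROM part p JOIN ord o ON p.pno = o.pno
        WHERE p.color = 'red'
        ORDER BY p.name
    """
    return conn.execute(sql).fetchall()

print(algebra_plan())       # [('sprocket', 5), ('widget', 10)]
print(declarative_query())  # [('sprocket', 5), ('widget', 10)]
```

Filtering before joining is only one of several equivalent orderings; choosing among them is the kind of decision an optimizer makes.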
Hope this helps,
>Date: Wed, 2 Apr 2008 22:29:39 -0700
>From: "Tsao, Scott" <scott.tsao at boeing.com>
>Subject: [xquery-talk] A Couple of Questions - OOXML and SQL
>To: <talk at x-query.com>
>During a recent XQuery Overview presentation, there were a couple of
>questions raised which I am searching for answers:
> 1. Office Open XML (OOXML) is a file format used by the Microsoft
>Office 2007 applications. Can XQuery be used to get meaningful
>information from an OOXML document, or would it only return items based
>on formatting aspects (all heading 1s, or all list items).
> 2. SQL is based in part on Set theory from Mathematics, and Set
>algebra. It allows set operations "update all red projects to green."
>Does XQuery support set algebra? For example, SQL join is a set
>operation that has inner, outer, Cartesian forms.
>Do you have answers to those questions? If you do, please do share!
>Associate Technical Fellow
>The Boeing Company
Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144
Co-Chair, W3C XML Query WG; XQX (etc.) editor Fax : +1.801.942.3345
Oracle Corporation Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA Personal email: jim at melton dot name
= Facts are facts. But any opinions expressed are the opinions =
= only of myself and may or may not reflect the opinions of anybody =
= else with whom I may or may not have discussed the issues at hand. =