[xquery-talk] Regular Expression search

Michael Kay mhk at mhk.me.uk
Fri Dec 16 10:10:53 PST 2005


> Search and querying are very different. Search is basically for
> document-centric XML (like XHTML), whereas querying is for data-centric
> XML (like invoices, etc). If you're using regular expressions for
> data-centric XML, then I'd say you have a design flaw - but not if you
> are using them for document-centric XML.

That seems very simplistic to me, for a number of reasons. 

(1) The distinction between document-centric and data-centric is not a
hard-and-fast one. If you take any real application, for example a system
for managing insurance claims, then it contains a spectrum of information
from highly-structured to very loosely-structured. One of the big benefits
of XML is that we can now handle this full spectrum using a single
technology.

(2) XML structures are often designed primarily for information interchange,
not for storage and query. The database often needs to contain the message
as transmitted or received. The fact that the XML design is not optimized
for query is not a design flaw; it is a consequence of the fact that
information interchange rather than query is now the primary driver.

(3) I can think of many perfectly good reasons for using regular expressions
to search highly structured data, even when it was designed primarily for
querying. For example, if I receive an invoice that's damaged in the post and
I can't quite read the purchase order number, I might want to do a regular
expression search for the parts of the number that I can read. 
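
As a rough illustration of point (3), here is a minimal Python sketch of
searching structured invoice data for a partly legible purchase order number.
The element names (invoice, poNumber), the "PO-nnnnn-X" number format, and the
find_by_partial_po helper are all assumptions made up for this example; they do
not come from any real schema. In XQuery itself the same filter would
presumably be written with fn:matches.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical invoice data: element names and the PO number format
# are assumptions for illustration only.
INVOICES = """
<invoices>
  <invoice><poNumber>PO-48213-A</poNumber><total>120.00</total></invoice>
  <invoice><poNumber>PO-48713-A</poNumber><total>75.50</total></invoice>
  <invoice><poNumber>PO-51213-B</poNumber><total>310.00</total></invoice>
</invoices>
"""

def find_by_partial_po(xml_text, pattern):
    """Return the PO numbers whose whole text matches the regular expression."""
    regex = re.compile(pattern)
    root = ET.fromstring(xml_text)
    return [
        po.text
        for po in root.iter("poNumber")
        if po.text and regex.fullmatch(po.text)
    ]

# Legible parts: "PO-48", one unreadable digit, "13-", one unreadable letter.
matches = find_by_partial_po(INVOICES, r"PO-48\d13-[A-Z]")
print(matches)  # ['PO-48213-A', 'PO-48713-A']
```

The data stays fully structured; the regular expression is only used to narrow
down candidates when the exact key value is not known.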

(4) Any argument that says "in data-centric XML there should be no implicit
structure in textual fields, it should all be denoted by explicit markup"
can be applied equally well to document-centric XML. In both cases the
argument is false: it's entirely reasonable to store a UK postcode such as
"RG4 7BS" as a single string even though the "RG4" on its own carries
meaning; the same applies to dates, part numbers, etc. The granularity of markup
involves a design compromise; you can't argue that finer-grained markup is
always better.
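
To make the postcode example in point (4) concrete, the following Python
sketch keeps the postcode as a single string and uses a regular expression to
recover the implicit structure only when it is needed. The pattern is a
deliberately simplified assumption, not a complete validator for UK postcodes,
and the helper names split_postcode and in_district are hypothetical.

```python
import re

# Simplified assumption: outward code (area letters plus district),
# optional whitespace, then inward code (digit plus two letters).
POSTCODE = re.compile(r"^([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})$")

def split_postcode(value):
    """Return (outward, inward) for a postcode string, or None if no match."""
    m = POSTCODE.match(value.strip().upper())
    if m is None:
        return None
    return m.group(1), m.group(2)

def in_district(value, district):
    """True if the postcode's outward code equals the given district."""
    parts = split_postcode(value)
    return parts is not None and parts[0] == district

print(split_postcode("RG4 7BS"))      # ('RG4', '7BS')
print(in_district("RG4 7BS", "RG4"))  # True
```

Storing the value as one string and extracting "RG4" on demand is one side of
the design compromise; marking up the outward and inward parts separately would
be the other.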

Michael Kay
http://www.saxonica.com/



