[xquery-talk] Regular Expression search

John Snelson jsnelson at sleepycat.com
Thu Dec 15 18:12:35 PST 2005


Martin Probst wrote:
>>It's not impossible - although I don't know of any XML database that 
>>does use indexes for regular expressions at the moment.
> 
> Are you sure? It's probably possible for simple cases (e.g. "Foo|Bar"),
> but for general regular expressions? How would you do that?

Sleepycat's Berkeley DB XML has what it calls a "substring" index which 
is currently used to optimise fn:contains(), fn:starts-with() and 
fn:ends-with(). This works by splitting the content down into sequential 
three character segments, ie:

"abccccb" is split into "abc", "bcc", "ccc", "ccb"

This type of index could be used to optimise regular expression. If you 
define a regular expression to match the string above, it might look 
like this:

"abc+b"

 From this regular expression, you can see that the keys you need to 
look up in the container are:

"abc" & ("bcb" | ("bcc" & "ccb") | ("bcc" & "ccc" & "ccb"))

Where "&" is set intersect of the index results, and "|" is set union. 
So a simple combination of index lookups will give you a candidate set 
for matching the regular expression. Of course, you will need to double 
check that the candidate set /actually/ fits the regular expression, 
after doing the index lookups.

> Apart from that, if you need regular expressions to search your XML,
> there's probably a major problem with your XML design ;-)

Search and querying are very different. Search is basically for 
document-centric XML (like XHTML), where as querying is for data-centric 
XML (like invoices, etc). If you're using regular expressions for 
data-centric XML, then I'd say you have a design flaw - but not if you 
are using them for document-centric XML.

John

-- 
John Snelson, Berkeley DB XML Engineer
Sleepycat Software, Inc
http://www.sleepycat.com

Contracted to Sleepycat through Parthenon Computing Ltd
http://blog.parthcomp.com/dbxml


More information about the talk mailing list