[xquery-talk] Regular Expression search

Thu Dec 15 09:24:24 PST 2005

John Snelson wrote:

> Martin Probst wrote:
> 
>> be careful with expressions like //*[matches(., "...")]. Because
>> regular expressions can be _very_ complex it's impossible to speed
>> this query up by an index, so you'll be matching the RegExp over
>> each node in the document - not good.
> 
> It's not impossible - although I don't know of any XML database that
> does use indexes for regular expressions at the moment.
> 
> John

Hey John,

MarkLogic actually uses indexes for wildcard queries.  For example, the
original poster's question about finding things starting with
"MyNameIs" could be solved efficiently using a query like this:

//(subTagA|subTagB)[starts-with(., "MyNameIs")]

That should execute efficiently against a large data set if the 
character indexes are enabled.  If the poster instead wanted any word 
token to start with that sequence of characters (rather than the element 
itself), he could use the MarkLogic function cts:contains() and the * 
wildcard:

//(subTagA|subTagB)[cts:contains(., "MyNameIs*")]
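
For anyone without MarkLogic at hand, the difference between those two
queries can be roughly approximated in standard XQuery.  This is only a
sketch: the sample data is made up, tokenize() with "\W+" is a crude
stand-in for real word tokenization, and nothing here uses an index.

```xquery
(: Hypothetical sample data, not from the original poster. :)
let $doc :=
  <doc>
    <subTagA>MyNameIsBob and more</subTagA>
    <subTagB>Hello MyNameIsAlice</subTagB>
    <subTagA>Nothing here</subTagA>
  </doc>
return (
  (: element value itself starts with the prefix: first element only :)
  $doc/(subTagA|subTagB)[starts-with(., "MyNameIs")],
  (: any word starts with the prefix: first and second elements :)
  $doc/(subTagA|subTagB)[
    some $t in tokenize(., "\W+") satisfies starts-with($t, "MyNameIs")
  ]
)
```

The second predicate has to tokenize every candidate element at query
time, which is exactly the work an index-backed wildcard query avoids.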

The cts:* functions operate on tokens rather than simple character 
sequences, providing search engine style features.  You can see the 
difference in the previously discussed query to find the token "Name". 
Using standard XQuery you write this:

//*[contains(., "Name")]

But this matches "xName" and "Nameste".  When I search for "foo" I don't 
want to find "food"!  Using cts:contains() you match just word tokens:

//*[cts:contains(., "Name")]

The tokens are broken at index time according to language rules, and you 
have the option at query time to specify stemming rules (should Names 
and Naming match?), case sensitivity (is "name" ok?), thesaurus (what 
about "nom de plume"?), and so on.
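
A rough way to see the substring vs. token distinction without MarkLogic
is the sketch below.  It assumes made-up sample data and uses tokenize()
and lower-case() from the standard function library as a crude
substitute for cts:contains(); it does no stemming or thesaurus
expansion and cannot use an index.

```xquery
(: Hypothetical sample data for illustration only. :)
let $doc :=
  <doc>
    <a>My Name is Bob</a>
    <b>xName</b>
    <c>Nameste</c>
    <d>her name here</d>
  </doc>
return (
  (: substring match: a, b, c :)
  $doc/*[contains(., "Name")]/name(),
  (: whole-word match, case sensitive: a :)
  $doc/*[tokenize(., "\W+") = "Name"]/name(),
  (: whole-word match, case insensitive: a, d :)
  $doc/*[tokenize(lower-case(.), "\W+") = "name"]/name()
)
```

Splitting on "\W+" ignores language-specific rules, so treat it as an
illustration of the idea rather than a replacement for index-time
tokenization.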

It's fun stuff.  I wrote about this in longer form at:
http://idealliance.org/proceedings/xtech05/papers/02-04-01/

-jh-