[xquery-talk] Regular Expression search

John Snelson jsnelson at sleepycat.com
Thu Dec 15 18:00:37 PST 2005


Hi Martin,

I've used the MarkLogic database. However, the custom cts:contains() 
function is not quite the same as the full regular expressions accepted 
by fn:matches(), is it?
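
To make the difference concrete, here is a rough sketch of the kind of 
query I have in mind. The pattern is made up for illustration (the 
original question only mentioned the "MyNameIs" prefix); it uses anchors, 
a character class and a quantifier, which as far as I can tell a simple 
* wildcard cannot express:

```xquery
(: Hypothetical pattern: "MyNameIs" followed by one or more digits,   :)
(: anchored so that it must match the element's whole string value.   :)
(: Uses only the standard fn:matches() function, no vendor extension. :)
//(subTagA|subTagB)[matches(., "^MyNameIs[0-9]+$")]
```

This would select an element whose string value is "MyNameIs42", but not 
one containing "MyNameIsBob" or "xMyNameIs42".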

Does MarkLogic use its indexes to optimise fn:matches()?

John

Jason Hunter wrote:
> MarkLogic actually uses indexes for wildcard queries.  For example, the 
> original poster's questions about finding things starting with 
> "MyNameIs" could be solved efficiently using a query like this:
> 
> //(subTagA|subTagB)[starts-with(., "MyNameIs")]
> 
> That should execute efficiently against a large data set if the 
> character indexes are enabled.  If the poster instead wanted any word 
> token to start with that sequence of characters (rather than the element 
> itself), he could use the MarkLogic function cts:contains() and the * 
> wildcard:
> 
> //(subTagA|subTagB)[cts:contains(., "MyNameIs*")]
> 
> The cts:* functions operate on tokens rather than simple character 
> sequences, providing search engine style features.  You can see the 
> difference in the previously discussed query to find the token "Name". 
> Using standard XQuery you write this:
> 
> //*[contains(., "Name")]
> 
> But this matches "xName" and "Nameste".  When I search for "foo" I don't 
> want to find "food"!  Using cts:contains() you match just word tokens:
> 
> //*[cts:contains(., "Name")]
> 
> The tokens are broken at index time according to language rules, and you 
> have the option at query time to specify stemming rules (should Names 
> and Naming match?), case sensitivity (is "name" ok?), thesaurus (what 
> about "nom de plume"?), and so on.
> 
> It's fun stuff.  I wrote about this in longer form at:
> http://idealliance.org/proceedings/xtech05/papers/02-04-01/
> 
> -jh-
> _______________________________________________
> talk at xquery.com
> http://xquery.com/mailman/listinfo/talk


-- 
John Snelson, Berkeley DB XML Engineer
Sleepycat Software, Inc
http://www.sleepycat.com

Contracted to Sleepycat through Parthenon Computing Ltd
http://blog.parthcomp.com/dbxml

