[xquery-talk] Regular Expression search
John Snelson
jsnelson at sleepycat.com
Thu Dec 15 18:00:37 PST 2005
Hi Martin,
I've used the MarkLogic database. However, the custom cts:contains()
function is not quite the same as the full regular expressions accepted
by fn:matches(), is it?
Does MarkLogic use its indexes to optimise fn:matches()?
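To make the comparison concrete, here is a rough sketch of how the cases from the quoted message might be written with fn:matches() alone. The sample data and the regular expressions are my own assumptions for illustration; only the element names come from the quoted queries. The regexes only approximate token matching, since they know nothing about language-aware tokenising, stemming or a thesaurus.

```xquery
(: Hypothetical sample data, invented for this sketch. :)
let $doc :=
  <root>
    <subTagA>MyNameIsJohn and more</subTagA>
    <subTagB>Hello MyNameIsJason</subTagB>
    <subTagA>xName</subTagA>
  </root>
return (
  (: Element text starts with the prefix, like starts-with(). :)
  $doc/(subTagA|subTagB)[matches(., "^MyNameIs")],

  (: Some whitespace-delimited word starts with the prefix;
     a crude stand-in for a token wildcard such as "MyNameIs*". :)
  $doc/(subTagA|subTagB)[matches(., "(^|\s)MyNameIs")],

  (: "Name" as a whole word. XQuery regexes have no \b word
     boundary, so \W and the string anchors are used instead. :)
  $doc/*[matches(., "(^|\W)Name(\W|$)")]
)
```

With plain fn:matches() a processor would presumably have to test every candidate element unless it can rewrite the regex into an index lookup, which is what I'm asking about.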
John
Jason Hunter wrote:
> MarkLogic actually uses indexes for wildcard queries. For example, the
> original poster's questions about finding things starting with
> "MyNameIs" could be solved efficiently using a query like this:
>
> //(subTagA|subTagB)[starts-with(., "MyNameIs")]
>
> That should execute efficiently against a large data set if the
> character indexes are enabled. If the poster instead wanted any word
> token to start with that sequence of characters (rather than the element
> itself), he could use the MarkLogic function cts:contains() and the *
> wildcard:
>
> //(subTagA|subTagB)[cts:contains(., "MyNameIs*")]
>
> The cts:* functions operate on tokens rather than simple character
> sequences, providing search engine style features. You can see the
> difference in the previously discussed query to find the token "Name".
> Using standard XQuery you write this:
>
> //*[contains(., "Name")]
>
> But this matches "xName" and "Nameste". When I search for "foo" I don't
> want to find "food"! Using cts:contains() you match just word tokens:
>
> //*[cts:contains(., "Name")]
>
> The tokens are broken at index time according to language rules, and you
> have the option at query time to specify stemming rules (should Names
> and Naming match?), case sensitivity (is "name" ok?), thesaurus (what
> about "nom de plume"?), and so on.
>
> It's fun stuff. I wrote about this in longer form at:
> http://idealliance.org/proceedings/xtech05/papers/02-04-01/
>
> -jh-
> _______________________________________________
> talk at xquery.com
> http://xquery.com/mailman/listinfo/talk
--
John Snelson, Berkeley DB XML Engineer
Sleepycat Software, Inc
http://www.sleepycat.com
Contracted to Sleepycat through Parthenon Computing Ltd
http://blog.parthcomp.com/dbxml