[xquery-talk] XQuery and \w, \W in regex (Saxon 8)

Thu Nov 17 12:44:06 PST 2005

Saxon contains code (originally written by James Clark) to translate from
schema regular expressions to Java regular expressions. This code translates
\W to [\p{P}\p{Z}\p{C}]. However, if you write the regular expression \p{P}
then it gets translated to

(?:[\p{P}[\u00AB\u2018\u201B\u201C\u201F\u2039\u00BB\u2019\u201D\u203A]]|\uD800[\uDD00-\uDD01\uDF9F])

So it seems a second translation phase is needed for \W.
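
As a rough illustration of what that second phase could amount to, here is a
minimal Java sketch. It is not Saxon's actual translator code: the class name
NonWordDemo, the method stripNonWord, and the constant EXTRA_QUOTES are all
hypothetical. It assumes the only correction needed is to add the BMP
quotation-mark characters listed in the translated \p{P} above to the class
used for \W, and it leaves out the supplementary-plane alternative for brevity.

```java
import java.util.regex.Pattern;

public class NonWordDemo {
    // Quotation-mark characters taken from the translated \p{P} shown above.
    // They are listed explicitly so the result does not depend on how a given
    // JDK classifies them under \p{P}. (Hypothetical constant name.)
    private static final String EXTRA_QUOTES =
        "\\u00AB\\u2018\\u201B\\u201C\\u201F\\u2039\\u00BB\\u2019\\u201D\\u203A";

    // Sketch of an expanded \W: punctuation (plus the explicit quotes),
    // separators, and "other" characters. Supplementary characters omitted.
    private static final Pattern NON_WORD =
        Pattern.compile("[\\p{P}" + EXTRA_QUOTES + "\\p{Z}\\p{C}]");

    // Removes every character matched by the expanded \W class.
    static String stripNonWord(String input) {
        return NON_WORD.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Straight ASCII quotes, as in $string1 of the original report.
        System.out.println(stripNonWord("\"quoted\""));
        // Curly quotes U+201C and U+201D, as in $string2.
        System.out.println(stripNonWord("\u201Cquoted\u201D"));
    }
}
```

With the quotes listed explicitly, both calls should strip the surrounding
quotation marks, whichever way the underlying JDK treats \p{P}.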

I hate writing this kind of code because there is nothing in the Java
specification that suggests it should handle \p{P} differently from the way
it is specified in Schema (and hence XPath). James appears to have been
working from the JDK implementation rather than from the spec.

Thanks for reporting the problem.

Michael Kay
http://www.saxonica.com/  

> -----Original Message-----
> From: Michael Kay [mailto:mhk at mhk.me.uk] 
> Sent: 16 November 2005 18:38
> To: 'David Sewell'; 'talk at xquery.com'
> Subject: RE: [xquery-talk] XQuery and \w, \W in regex (Saxon 8)
> 
> \W is defined in XML Schema Part 2 to match all characters in 
> the Unicode Punctuation, Separator, and Other categories (P, 
> Z, and C). 201C and 201D are in group P, so on the face of 
> it, you appear to be right. I'll look into it. (Saxon is 
> relying partly on Java for its regular expression matching, 
> but it does preprocess the regex first to ensure conformance 
> with the XPath rules rather than the Java rules.)
> 
> Michael Kay
> http://www.saxonica.com/
>  
> 
> > -----Original Message-----
> > From: talk-bounces at xquery.com 
> > [mailto:talk-bounces at xquery.com] On Behalf Of David Sewell
> > Sent: 16 November 2005 17:22
> > To: talk at xquery.com
> > Subject: [xquery-talk] XQuery and \w, \W in regex (Saxon 8)
> > 
> > Given this code:
> > 
> >   let $string1 := '"quoted"'
> >   let $string2 := "&#x201c;quoted&#x201d;"
> >   return
> >   ( replace($string1, "\W", ""),
> >     replace($string2, "\W", "")
> >   )
> > 
> > Saxon 8.6b returns
> > 
> >   quoted
> >   "quoted"
> > 
> > (where the " " in the second line are Unicode curly quotation marks).
> > 
> > Is this a bug in the regex handling? U+201C and U+201D should be
> > treated as separators, no? (Likewise single curly quotes, U+2018 and
> > U+2019; I haven't tried other punctuation in that code block.)
> > 
> > -- 
> > David Sewell, Editorial and Technical Manager
> > Electronic Imprint, The University of Virginia Press
> > PO Box 400318, Charlottesville, VA 22904-4318 USA
> > Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
> > Email: dsewell at virginia.edu   Tel: +1 434 924 9973
> > Web: http://www.ei.virginia.edu/