[xquery-talk] XQuery and \w, \W in regex (Saxon 8)

Michael Kay mhk at mhk.me.uk
Wed Nov 16 18:37:46 PST 2005


\W is defined in XML Schema Part 2 to match all characters in the Unicode
Punctuation, Separator, and Other categories (P, Z, and C). 201C and 201D
are in group P, so on the face of it, you appear to be right. I'll look into
it. (Saxon is relying partly on Java for its regular expression matching,
but it does preprocess the regex first to ensure conformance with the XPath
rules rather than the Java rules.)
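
As a rough sketch of what that definition implies, the same replacement can
be written with the three categories spelled out explicitly instead of \W.
The variable name $nonword is only illustrative, and nothing here has been
checked against Saxon 8.6b. Going by the definition above, a conformant
processor should return "quoted" (without any quotation marks) for both
strings, since U+0022, U+201C and U+201D are all in category P.

```xquery
(: Sketch only: \W written out as an explicit character class over the
   Unicode categories P (Punctuation), Z (Separator) and C (Other).
   $nonword is a hypothetical helper name, not part of the original query. :)
let $nonword := "[\p{P}\p{Z}\p{C}]"
let $string1 := '"quoted"'
let $string2 := "“quoted”"
return
  ( replace($string1, $nonword, ""),
    replace($string2, $nonword, "")
  )
```

If the explicit form and \W give different results for the second string,
that would suggest the problem lies in how \W is translated rather than in
the category tables themselves.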

Michael Kay
http://www.saxonica.com/
 

> -----Original Message-----
> From: talk-bounces at xquery.com 
> [mailto:talk-bounces at xquery.com] On Behalf Of David Sewell
> Sent: 16 November 2005 17:22
> To: talk at xquery.com
> Subject: [xquery-talk] XQuery and \w, \W in regex (Saxon 8)
> 
> Given this code:
> 
>   let $string1 := '"quoted"'
>   let $string2 := "“quoted”"
>   return
>   ( replace($string1, "\W", ""),
>     replace($string2, "\W", "")
>   )
> 
> Saxon 8.6b returns
> 
>   quoted
>   "quoted"
> 
> (where the " " in the second line are Unicode curly quotation marks).
> 
> Is this a bug in the regex handling? U+201C and U+201D should be treated
> as separators, no? (Likewise single curly quotes, U+2018 and U+2019; I
> haven't tried other punctuation in that code block.)
> 
> -- 
> David Sewell, Editorial and Technical Manager
> Electronic Imprint, The University of Virginia Press
> PO Box 400318, Charlottesville, VA 22904-4318 USA
> Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
> Email: dsewell at virginia.edu   Tel: +1 434 924 9973
> Web: http://www.ei.virginia.edu/
> _______________________________________________
> talk at xquery.com
> http://xquery.com/mailman/listinfo/talk
> 



