[xquery-talk] replace() breaks on Unicode 5.1 character?

Fri Apr 4 08:12:44 PST 2008

Dear XQueriers,

I posted an inquiry about this to the eXist mailing list, but further 
testing reveals that I get the same behavior from Saxon 9B, which 
suggests that it's a general XQuery issue, and therefore perhaps more 
appropriate here.

I'm working with some Unicode 5.1 materials (Unicode 5.1 becomes 
official on April 4). I need to adjust the collation order for some of 
the new 5.1 characters so that they sort like existing 5.0 characters. 
I thought I could do this with the XPath replace() function, along the 
lines of:

for $i in $verbs order by replace($i,'abc','def') return ...

where 'abc' is a list of the 5.1 characters that need to be remapped and 
'def' is a list of the 5.0 characters which they should sort as, 
respectively.
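
To make the intended character-by-character mapping concrete, here is a 
minimal sketch using plain ASCII stand-ins (the string 'cab' and the 
variable $s are illustrative only, not the real data). It assumes the 
standard XPath 2.0 definitions, in which replace() treats its second 
argument as a regular expression matched as a whole, while translate() 
maps characters one-for-one by position, so the two may not behave alike 
when given a list of characters:

```xquery
xquery version "1.0";
(: Illustrative ASCII stand-ins; $s is a hypothetical test value. :)
let $s := 'cab'
return (
  (: replace(): 'abc' is a regular expression that matches the
     substring "abc" as a unit; "cab" contains no such substring,
     so the string is expected to come back unchanged :)
  replace($s, 'abc', 'def'),
  (: translate(): maps a->d, b->e, c->f one character at a time,
     which for "cab" gives "fde" :)
  translate($s, 'abc', 'def')
)
```

If translate() turns out to be the better fit, the order by clause would 
take the same three arguments in the same order.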

Here's a sample standalone test script:

xquery version "1.0";
let $verbs := ('пёс','ꙁка','abc','ghi','Def')
return
<html>
<head><meta http-equiv="Content-Type" 
content="text/html;charset=utf-8"/></head>
<body>
{
(:for $i in $verbs order by replace($i,'Dꙁ','dз'):)
for $i in $verbs order by replace($i,'D','d')
return
<p>{$i}</p>
}
</body>
</html>

The character at the beginning of the second member of the sequence 
$verbs is a Unicode 5.1 character (u+A641)--or, at least, it is in my 
editor (it may have been damaged during mail transfer, so if you want to 
try to replicate the problem, you may need to reinsert it). 
(Documentation for the new early Cyrillic Unicode 5.1 characters that I 
use is available at http://www.dulug.dk/jtc1/sc2/wg2/docs/n3194.pdf and 
should go on-line at http://www.unicode.org shortly.)

If I run this script as is, upper-case 'D' is replaced by lower-case 'd' 
for sorting purposes, and the strings are sorted in byte-value order 
after folding the upper-case 'D' to lower case. This is what I want, at 
least as far as it goes. If, however, I replace the "for $i in $verbs" 
line with the one above it (currently commented out), which is meant to 
replace both 'D' with 'd' and u+A641 with the Unicode 5.0 character 
u+0437 (which is how it should be sorted), the sorting reverts to 
strict byte-value order, with neither the new Cyrillic replacement nor 
the previously functional Latin case folding. If I understand correctly, 
even in Unicode 5.0 u+A641 is a valid Unicode character (it is a member 
of the "not assigned" category), and it clearly can be sorted (by byte 
order in the first version of my script), but apparently it cannot 
appear in an argument in the replace() function, and if it does appear 
there, it seems to disable the function entirely.
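
One way to isolate the new character from the rest of the script, and 
from any damage in mail transfer, is to build it from its code point and 
apply replace() to it alone. This is only a diagnostic sketch: the 
variable names are made up for illustration, 42561 and 1079 are the 
decimal values of u+A641 and u+0437, and what a given processor returns 
for it has not been checked here.

```xquery
xquery version "1.0";
(: codepoints-to-string() avoids pasting the character itself.
   42561 = U+A641 (the new 5.1 character), 1079 = U+0437. :)
let $new := codepoints-to-string(42561)
let $old := codepoints-to-string(1079)
let $word := concat($new, 'ка')
return (
  (: code points of the test word before any replacement :)
  string-to-codepoints($word),
  (: code points after replacing only the single new character;
     if replace() accepts it, the first value should become 1079 :)
  string-to-codepoints(replace($word, $new, $old))
)
```

Comparing the two sequences of code points shows whether the 
single-character replacement took effect, independently of how the 
output is rendered or sorted.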

I'd be grateful for advice about how to achieve the sorting I need while 
I wait for my Java distribution to catch up to Unicode 5.1. I'd also 
appreciate any insight anyone may be able to supply about why the 
construction I've been using works for Latin case folding but breaks on 
my "not assigned" Unicode 5.1 characters.

Sincerely,

David
djbpitt+xml at pitt.edu