[xquery-talk] replace() breaks on Unicode 5.1 character?
David J Birnbaum
djbpitt+xml at pitt.edu
Fri Apr 4 08:12:44 PST 2008
Dear XQueriers,
I posted an inquiry about this to the eXist mailing list, but further
testing reveals that I get the same behavior from Saxon 9B, which
suggests that it's a general XQuery issue, and therefore perhaps more
appropriate here.
I'm working with some Unicode 5.1 materials (Unicode 5.1 becomes
official on April 4). I need to adjust the collation order for some of
the new 5.1 characters to tell them to sort similarly to existing 5.0
characters. I thought I could do this with the XPath replace function,
along the lines of:
for $i in $verbs order by replace($i,'abc','def') return ...
where 'abc' is a list of the 5.1 characters that need to be remapped and
'def' is a list of the 5.0 characters which they should sort as,
respectively.
Here's a sample standalone test script:
xquery version "1.0";
let $verbs := ('пёс','ꙁка','abc','ghi','Def')
return
  <html>
    <head><meta http-equiv="Content-Type"
      content="text/html;charset=utf-8"/></head>
    <body>
    {
      (:for $i in $verbs order by replace($i,'Dꙁ','dз'):)
      for $i in $verbs order by replace($i,'D','d')
      return
        <p>{$i}</p>
    }
    </body>
  </html>
The character at the beginning of the second member of the sequence
$verbs is a Unicode 5.1 character (u+A641)--or, at least, it is in my
editor (it may have been damaged during mail transfer, so if you want to
try to replicate the problem, you may need to reinsert it).
(Documentation for the new early Cyrillic Unicode 5.1 characters that I
use is available at http://www.dulug.dk/jtc1/sc2/wg2/docs/n3194.pdf and
should go on-line at http://www.unicode.org shortly.)
If I run this script as is, upper-case 'D' is replaced by lower-case 'd'
for sorting purposes, and the strings are sorted in byte-value order
after folding the upper-case 'D' to lower case. This is what I want, at
least as far as it goes. If, however, I replace the "for $i in $verbs"
line with the one above it (currently commented out), so that it
replaces both 'D' with 'd' and u+A641 with a Unicode 5.0 character
(u+0437), which is how it should be sorted, the sorting reverts to
strict byte-value order, without either the new Cyrillic replacement or
the previously functional Latin case folding. If I understand correctly,
even in Unicode 5.0 u+A641 is a valid Unicode character (it is a member
of the "not assigned" category), and it clearly can be sorted (by byte
order in the first version of my script), but apparently it cannot
appear in an argument in the replace() function, and if it does appear
there, it seems to disable the function entirely.
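A smaller query may help isolate this. The sketch below is only a
diagnostic, not a fix: it prints each string next to the result of
replace() and of translate() called with the same two arguments. It
assumes that replace() treats its second argument as a regular expression
matched as a whole (so 'Dꙁ' would match only a 'D' immediately followed
by u+A641), whereas translate() maps characters one for one. The element
names are made up for the test, and the character references stand in for
the literal characters in case those are damaged in transit.

```xquery
xquery version "1.0";
(: Diagnostic sketch only; element names are hypothetical. :)
(: &#xA641; is the Unicode 5.1 character, &#x437; its intended 5.0 stand-in. :)
let $verbs := ('пёс', '&#xA641;ка', 'abc', 'ghi', 'Def')
for $i in $verbs
return
  <p>
    <orig>{$i}</orig>
    <!-- pattern is a regex: matches 'D' followed directly by U+A641 -->
    <replaced>{replace($i, 'D&#xA641;', 'd&#x437;')}</replaced>
    <!-- maps 'D' to 'd' and U+A641 to U+0437, one character at a time -->
    <translated>{translate($i, 'D&#xA641;', 'd&#x437;')}</translated>
  </p>
```

If the <replaced> values come back unchanged while the <translated>
values show the substitutions, the difference would lie in how replace()
reads its pattern rather than in the Unicode 5.1 character itself.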
I'd be grateful for advice about how to achieve the sorting I need while
I wait for my Java distribution to catch up to Unicode 5.1. I'd also
appreciate any insight anyone may be able to supply about why the
construction I've been using works for Latin case folding but breaks on
my "not assigned" Unicode 5.1 characters.
Sincerely,
David
djbpitt+xml at pitt.edu