[xquery-talk] Convert diacritics to low-ascii

Tue Jun 21 08:30:44 PDT 2011

And thanks to Andrew as well.

:)

-----Original Message-----
From: Andrew Welch [mailto:andrew.j.welch at gmail.com]
Sent: Tuesday 21 June 2011 14:53
To: Geert Josten
CC: Houghton,Andrew; talk at x-query.com
Subject: Re: [xquery-talk] Convert diacritics to low-ascii

On 21 June 2011 13:33, Geert Josten <geert.josten at daidalos.nl> wrote:
> Thanks Andy!
>
> Works just fine in XQuery too, but I have to admit that it looks a bit funny to me. Replace something with nothing and still end up with all the characters? Can anyone explain what this \p{M} is matching? The Unicode spec isn't making it much clearer to me. :-P
>

Taken from:

http://www.regular-expressions.info/unicode.html

"\p{M} or \p{Mark}: a character intended to be combined with another
character (e.g. accents, umlauts, enclosing boxes, etc.)."

When NFD or NFKD is used, each diacritic is represented by a separate
combining character following the base letter. For example, e acute
becomes the letter e followed by the combining acute accent, so you
can just remove that character and be left with the e. (In my earlier
post you can see that the acute is Unicode character 769, i.e. U+0301.)
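
As a rough illustration of the same idea outside XQuery, here is a
minimal Python sketch. It is an assumption-laden stand-in, not the
expression from this thread: Python's built-in re module does not
support \p{M}, so the sketch checks each character's Unicode general
category instead (any category starting with "M" is a mark). The
function name strip_diacritics is made up for this example.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop every combining mark (categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


# e acute (U+00E9) decomposes into the letter e (101) plus the combining acute (769).
print([ord(ch) for ch in unicodedata.normalize("NFD", "\u00e9")])  # [101, 769]

# Removing the marks leaves only the base letters.
print(strip_diacritics("caf\u00e9 na\u00efve"))  # cafe naive
```

Letters that have no canonical decomposition (for example the Danish
o with stroke) pass through unchanged, since there is no separate
mark to remove.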

-- 
Andrew Welch
http://andrewjwelch.com