[xquery-talk] XQuery simulation of XSLT 2.0 grouping

Sat Oct 22 19:48:18 PDT 2005

Interesting question. I don't think there's a better way of expressing this
in XQuery, and yes, the XQuery solution is much more dependent on smart
optimization, which in this case Saxon is failing to achieve. I'm going to
take a careful look at why: the algorithms are there to build an index and
use it, but for some reason the join pattern isn't being recognized in this
instance.
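
To make the cost concrete, here is a minimal sketch of the same query shape
on a tiny inline corpus. The five-word sequence is invented for illustration
and stands in for the tokenized play. Evaluated naively, the predicate
rescans the whole of $corpus once per distinct word, which is the quadratic
behaviour an index-based join would avoid:

```xquery
(: Hypothetical five-word corpus; stands in for the tokenized document. :)
let $corpus := ('the', 'cat', 'the', 'dog', 'the')
for $w in distinct-values($corpus)
(: Without an index, this scans all of $corpus for every distinct $w. :)
let $freq := count($corpus[. eq $w])
order by $freq descending
return <word word="{$w}" frequency="{$freq}"/>
```

With n words and d distinct values that is roughly n * d comparisons, against
a single pass over the input when the grouping is done through an index.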

In XSLT 1.0, of course, for-each-group wasn't available, but you could
always use user-defined indexing (xsl:key) to do efficient direct access, so
you were far less dependent on that kind of optimization.

(Actually both the XSLT and XQuery solutions give the wrong answer on my
current build, unless an option is set to prevent whitespace stripping. The
expression tokenize(string(.)) to get the list of words in the document is
dependent on the whitespace between the LINE elements being preserved, and
the Data Model spec now says that whitespace text nodes in element content
should be discarded. The SPEECH elements have element-only content, so the
last word of each line is now run into the first word of the next. I fear
this change in the spec, which I've delayed implementing for a while, is
going to hit a few people.)
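
The effect can be reproduced in a self-contained query. This is only a
sketch: it uses the boundary-space declaration on a constructed element as a
stand-in for whitespace stripping on a parsed document (an assumption made to
keep the example runnable without an input file), and the SPEECH content is
invented:

```xquery
(: 'strip' drops the whitespace-only text between the LINE elements,
   mimicking a tree built with whitespace in element content discarded. :)
declare boundary-space strip;

let $speech :=
  <SPEECH>
    <LINE>To be or not</LINE>
    <LINE>to be</LINE>
  </SPEECH>
(: string($speech) is "To be or notto be": nothing separates the lines. :)
return tokenize(string($speech), '\W+')
(: result: "To", "be", "or", "notto", "be" :)
```

Changing the declaration to "declare boundary-space preserve;" keeps the
whitespace text nodes, so "not" and "to" come out as separate tokens again.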

Saxon-SA does offer an extension function that simulates the XSLT
for-each-group construct - it's an interesting use case for higher-order
functions - see
http://www.saxonica.com/documentation/extensions/functions/for-each-group.html

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: talk-bounces at xquery.com 
> [mailto:talk-bounces at xquery.com] On Behalf Of David Sewell
> Sent: 22 October 2005 17:36
> To: talk at xquery.com
> Subject: [xquery-talk] XQuery simulation of XSLT 2.0 grouping
> 
> Early in his XSLT 2.0 Programmer's Reference, Michael Kay presents an
> example of the power of XSLT 2.0 by giving the brief code required to
> produce a word frequency list, sorted in descending order of frequency,
> for all the words in a document (using Shakespeare's "Othello" in XML as
> an example). This is the template that does the work:
> 
>   <xsl:template match="/">
>     <wordcount>
>       <xsl:for-each-group group-by="." select="
>             for $w in tokenize(string(.), '\W+') return lower-case($w)">
>         <xsl:sort select="count(current-group())" order="descending"/>
>         <word word="{current-grouping-key()}" frequency="{count(current-group())}"/>
>       </xsl:for-each-group>
>     </wordcount>
>   </xsl:template>
> 
> The following XQuery produces the identical output:
> 
>   declare variable $corpus :=
>       for $w in tokenize(doc("othello.xml"), '\W+') return lower-case($w);
>   declare variable $wordList := distinct-values($corpus);
>   <wordcount> {
>        for $w in $wordList
>        let $freq := count($corpus[. eq $w])
>        order by $freq descending
>        return <word word="{$w}" frequency="{$freq}"/>
>   }</wordcount>
> 
> However, on my system the XSLT version takes 1.93 seconds to execute
> using Saxon 8.51, while the XQuery takes 210 seconds. I realize that
> XQuery 1.0 does not contain the grouping facilities of XSLT 2.0, but
> I still have a couple of questions:
> 
> 1. Am I overlooking a more efficient way of writing the query?
> 
> 2. If not, is the assumption that one will need to rely on
>    implementation-dependent optimization for this type of
>    XQuery code, possibly relying on extension functions?
> 
> David
> 
> -- 
> David Sewell, Editorial and Technical Manager
> Electronic Imprint, The University of Virginia Press
> PO Box 400318, Charlottesville, VA 22904-4318 USA
> Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
> Email: dsewell at virginia.edu   Tel: +1 434 924 9973
> Web: http://www.ei.virginia.edu/