[xquery-talk] XQuery simulation of XSLT 2.0 grouping

Sat Oct 22 22:10:08 PDT 2005

I've now tweaked the Saxon-SA optimizer to pick up this case. There were two
things getting in the way:

(a) it was only checking the first reference to the variable $corpus to look
for an expression that would benefit from indexing, whereas in this case
it's the second reference that benefits

(b) it was only considering local variables (let $x) for indexing, and not
global variables (declare variable $x).
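
To illustrate the distinction in (b), here is a sketch of the query from the
original message (quoted below) rewritten so that $corpus and $wordList are
local variables bound with let rather than global variables. The added
string() call only makes the atomization of the document node explicit. This
shows the syntactic difference only; it is not a claim about which form any
particular Saxon release will index. Note that the first reference to $corpus
is still the distinct-values() call and the filter is the second, so point (a)
applies to this form too.

```xquery
(: Sketch only: same logic as the quoted query, using FLWOR-local bindings
   instead of "declare variable". Assumes othello.xml is resolvable from
   the query's base URI, as in the original. :)
<wordcount>{
  let $corpus :=
      for $w in tokenize(string(doc("othello.xml")), '\W+')
      return lower-case($w)
  let $wordList := distinct-values($corpus)
  for $w in $wordList
  (: second reference to $corpus: the filter that benefits from an index :)
  let $freq := count($corpus[. eq $w])
  order by $freq descending
  return <word word="{$w}" frequency="{$freq}"/>
}</wordcount>
```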

Bear in mind that Saxon-SA 8.5 was the first version to include join
optimization, and it takes a while for optimizers to mature; they have to be
refined on a case-by-case basis.

The execution times on my machine, for three runs each, are now:

Saxon-SA XQuery:   1953/641/801 ms
Saxon-B XQuery:    72874/67898/69220 ms
Saxon-B XSLT:      1182/440/231 ms

(I always make at least three runs when measuring performance, to eliminate
the effect of Java VM startup costs; there's an undocumented option -3 on the
command line for this.)

So the XQuery code is still a bit slower than the XSLT code. This isn't
surprising, because the XSLT code finds the unique values and assembles the
sets of duplicates in a single pass, whereas the XQuery code still does two
separate passes. But the important thing is that it now scales with document
size.
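
For comparison, later versions of the language (XQuery 3.0 onwards) added a
group by clause, which lets the query state the grouping directly, much as
xsl:for-each-group does. The sketch below assumes an XQuery 3.0 processor and
the same othello.xml input; it was not an option in XQuery 1.0, and whether a
given processor evaluates it in a single pass is implementation-dependent.

```xquery
xquery version "3.0";

(: Sketch only: requires XQuery 3.0 "group by", which XQuery 1.0 lacks. :)
<wordcount>{
  for $w in tokenize(string(doc("othello.xml")), '\W+')
  let $key := lower-case($w)
  group by $key
  (: after grouping, $w is bound to every token that shares this key :)
  let $freq := count($w)
  order by $freq descending
  return <word word="{$key}" frequency="{$freq}"/>
}</wordcount>
```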

Thanks for contributing the use case! The changes, of course, should benefit
many other queries as well. They will also benefit some XSLT
transformations.

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: talk-bounces at xquery.com 
> [mailto:talk-bounces at xquery.com] On Behalf Of David Sewell
> Sent: 22 October 2005 17:36
> To: talk at xquery.com
> Subject: [xquery-talk] XQuery simulation of XSLT 2.0 grouping
> 
> Early in his XSLT 2.0 Programmer's Reference, Michael Kay presents an
> example of the power of XSLT 2.0 by giving the brief code required to
> produce a word frequency list, sorted in descending order of frequency,
> for all the words in a document (using Shakespeare's "Othello" in XML as
> an example). This is the template that does the work:
> 
>   <xsl:template match="/">
>     <wordcount>
>       <xsl:for-each-group group-by="." select="
>             for $w in tokenize(string(.), '\W+') return lower-case($w)">
>         <xsl:sort select="count(current-group())" order="descending"/>
>         <word word="{current-grouping-key()}" frequency="{count(current-group())}"/>
>       </xsl:for-each-group>
>     </wordcount>
>   </xsl:template>
> 
> The following XQuery produces the identical output:
> 
>   declare variable $corpus :=
>       for $w in tokenize(doc("othello.xml"), '\W+') return lower-case($w);
>   declare variable $wordList := distinct-values($corpus);
>   <wordcount> {
>        for $w in $wordList
>        let $freq := count($corpus[. eq $w])
>        order by $freq descending
>        return <word word="{$w}" frequency="{$freq}"/>
>   }</wordcount>
> 
> However, on my system the XSLT version takes 1.93 seconds to execute
> using Saxon 8.51, while the XQuery takes 210 seconds. I realize that
> XQuery 1.0 does not contain the grouping facilities of XSLT 2.0, but
> I still have a couple of questions:
> 
> 1. Am I overlooking a more efficient way of writing the query?
> 
> 2. If not, is the assumption that one will need to rely on
>    implementation-dependent optimization for this type of
>    XQuery code, possibly relying on extension functions?
> 
> David
> 
> -- 
> David Sewell, Editorial and Technical Manager
> Electronic Imprint, The University of Virginia Press
> PO Box 400318, Charlottesville, VA 22904-4318 USA
> Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
> Email: dsewell at virginia.edu   Tel: +1 434 924 9973
> Web: http://www.ei.virginia.edu/