[xquery-talk] Optimal expression for gathering related elements

David Lee dlee at calldei.com
Tue Mar 1 19:35:40 PST 2011

That's nice clean syntax, cant wait until I get my hands on XQuery 3 in the
HE version of Saxon (hint hint).


< Enter processor specific comments -- probably off-list >


Performance wise I tested my code (XQuery 1) vs XSLT 2 code doing the same
thing (using for-each-group) using Saxon 9.3 HE

This was using a 200MB XML file with about 17,000 ungrouped elements and
4,000 grouped elements.


XSLT2 took about 6 seconds

XQuery took about 4 minutes.


(Both took about 1 G of JVM).


I'm hopeful XQuery 3.0 will optimize group-by so that I dont have to switch
to xslt for these kinds of simple things.





David A. Lee

dlee at calldei.com



From: talk-bounces at x-query.com [mailto:talk-bounces at x-query.com] On Behalf
Of Oliver Hallam
Sent: Tuesday, March 01, 2011 7:23 PM
To: talk at x-query.com
Subject: Re: [xquery-talk] Optimal expression for gathering related elements


This really is the best way to do this in XQuery 1.0, and is a common
pattern; this is why the "group by" clause was added to XQuery 3.0

In XQuery 3.0 this would be as simple as:

for $drugs in /all/drug
let $id := $drugs/@id
group-by $id
  <unique_drug id="{$id[1]}">
      for $drug in $drugs

Performance in XQuery 1.0 may not be as bad as you fear though. I can't 
vouch for other processors, but this particular pattern is one that we 
have put a great deal of effort in to spotting in XQSharp.

When I put your query through our processor I get the following query plan:
let $:temp:41 :=
left outer hash join
  for $id in
  for $drug in $:temp:41
id)  =
{ http://www.w3.org/2005/xpath-functions:data($drug/attribute::id) })
  element {drug} { 
fs:item-sequence-to-node-sequence($drug/child::element()) }
  element {unique_drug} { (attribute {id} { $id cast as 
http://www.w3.org/2001/XMLSchema:untypedAtomic* } , 
fs:item-sequence-to-node-sequence($:temp:42)) }

The key thing to note here is the "left outer hash join". This should 
perform in linear time. What is in fact happening here is that an index 
is first built for the right hand input to the join (in this case the 
values of @id for each /all/drug node), and then for each distinct value 
of /all/drug/@id all matching nodes are selected and the <unique_drug> 
element is returned.

XQSharp should be smart enough to spot that the join is joining the keys 
on the right hand side to their own distinct values. The query plan 
should have looked something like this:

  for $drug in /all/drug
group by
  <unique_drug id="{$id}">{$:temp}</unique_drug>

If this optimization had kicked in correctly then the whole query would 
have been performed in a single pass of the document. In this case an 
index is built for the /all/drug elements (keyed on data(@id)) and then 
the index is iterated. Performance should be about as fast as it 
possibly could be!

It seems though that this optimization hasn't happened with the lates 
build of XQSharp; I have filed a bug report and will investigate why 
this is the case with your query.

Even without this final optimization though the query does perform in 
linear time; it just builds two indexes of the data rather than one (one 
during the computation of the distinct values, and one to perform the join).

If you are interested, I have written a more detailed analysis of the 
different join optimizations performed by XQSharp here: 

Oliver Hallam

On 01/03/2011 15:33, David Lee wrote:
> I find I use this pattern frequently when I need to group multiple
> associated with some shared identifier (say @id)
> Suppose I have something like
> <all>
> <drug @id="1"><text>texta</text></drug>
> <drug @id="2"> <text>textb</text></drug>
> <drug @id="3"> <text>textc</text></drug>
> <drug @id="1"> <text>textd</text></drug>
> <drug @id="2"> <text>texte</text></drug>
> ...
> And I want to create a set of combined entries
> like
> <all>
> <unique_drug @id="1">
> <drug><text>texta</text></drug>
> <drug> <text>textd</text></drug>
> </unique_drug>
> <unique_drug @id="2">
> <drug><text>textb</text></drug>
> <drug> <text>texte</text></drug>
> </unique_drug>
> ..
> What I do is something like this :
> ( typed into email, not tested ...)
> for $id in distinct-values(/all/drug/@id)
> return
> <unique_drug id="{$id}">
> {
> for $drug in /all/drug[@id eq $id]
> return
> <drug>{$drug/*}</drug>
> }
> </unique_drug>
> What I was offhand wondering, is if there isnt something more direct (in
> XQuery 1).
> It seems both verbose and inefficient, although of course efficiency is a
> processor issue. (maybe it makes indexes ...)
> But still ... it seems it has to scan all elements to get the unique id's
> then re-scan them N times to get the matching elements,
> when I can imagine a syntax which would do both at once in linear time as
> opposed to (presumably) exponential time.
> It seems like something a declarative expression should be able to state
> succinctly.
> Any suggestions ? Or am I just fantasizing
> -David
> ----------------------------------------
> David A. Lee
> dlee at calldei.com
> http://www.xmlsh.org
> _______________________________________________
> talk at x-query.com
> http://x-query.com/mailman/listinfo/talk

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://x-query.com/pipermail/talk/attachments/20110301/4bf7f5be/attachment-0001.htm

More information about the talk mailing list