[xquery-talk] Optimal expression for gathering related elements

Oliver Hallam oliver at xqsharp.com
Wed Mar 2 00:22:54 PST 2011


This really is the best way to do this in XQuery 1.0, and is a common 
pattern; this is why the "group by" clause was added to XQuery 3.0

In XQuery 3.0 this would be as simple as:

for $drugs in /all/drug
let $id := $drugs/@id
group-by $id
return
   <unique_drug id="{$id[1]}">
     {
       for $drug in $drugs
       return
         <drug>{$drug/*}</drug>
     }
</unique_drug>


Performance in XQuery 1.0 may not be as bad as you fear though. I can't
vouch for other processors, but this particular pattern is one that we
have put a great deal of effort in to spotting in XQSharp.


When I put your query through our processor I get the following query plan:
let $:temp:41 :=
  step
     $fs:dot_0/child::all
     child::drug
left outer hash join
   for $id in
     http://www.w3.org/2005/xpath-functions:distinct-values(
       http://www.w3.org/2005/xpath-functions:data(
         step
           $:temp:41
           attribute::id
       )
     ,
       "http://www.w3.org/2005/xpath-functions/collation/codepoint"
     )
to
   for $drug in $:temp:41
on
fs:convert-operand-to-atomic-type[http://www.w3.org/2001/XMLSchema:string]($id)  
= 
fs:convert-operand-to-atomic-type[http://www.w3.org/2001/XMLSchema:string](cardinality-check[?] 

{ http://www.w3.org/2005/xpath-functions:data($drug/attribute::id) })
aggregate
   element {drug} {
fs:item-sequence-to-node-sequence($drug/child::element()) }
as
   $:temp:42
return
   element {unique_drug} { (attribute {id} { $id cast as
http://www.w3.org/2001/XMLSchema:untypedAtomic* } ,
fs:item-sequence-to-node-sequence($:temp:42)) }



The key thing to note here is the "left outer hash join". This should
perform in linear time. What is in fact happening here is that an index
is first built for the right hand input to the join (in this case the
values of @id for each /all/drug node), and then for each distinct value
of /all/drug/@id all matching nodes are selected and the <unique_drug>
element is returned.

XQSharp should be smart enough to spot that the join is joining the keys
on the right hand side to their own distinct values. The query plan
should have looked something like this:

   for $drug in /all/drug
group by
   data($drug/@id)
aggregate
<drug>{$drug/*}</drug>
as
   $:temp
return
<unique_drug id="{$id}">{$:temp}</unique_drug>

If this optimization had kicked in correctly then the whole query would
have been performed in a single pass of the document. In this case an
index is built for the /all/drug elements (keyed on data(@id)) and then
the index is iterated. Performance should be about as fast as it
possibly could be!

It seems though that this optimization hasn't happened with the lates
build of XQSharp; I have filed a bug report and will investigate why
this is the case with your query.

Even without this final optimization though the query does perform in
linear time; it just builds two indexes of the data rather than one (one
during the computation of the distinct values, and one to perform the join).

If you are interested, I have written a more detailed analysis of the
different join optimizations performed by XQSharp here:
http://xqsharp.blogspot.com/2010/05/join-optimizations-in-xqsharp.html



Oliver Hallam
XQSharp

On 01/03/2011 15:33, David Lee wrote:
 > I find I use this pattern frequently when I need to group multiple 
elements
 > associated with some shared identifier (say @id)
 > Suppose I have something like
 >
 > <all>
 > <drug @id="1"><text>texta</text></drug>
 > <drug @id="2"> <text>textb</text></drug>
 > <drug @id="3"> <text>textc</text></drug>
 > <drug @id="1"> <text>textd</text></drug>
 > <drug @id="2"> <text>texte</text></drug>
 > ...
 >
 >
 > And I want to create a set of combined entries
 > like
 > <all>
 > <unique_drug @id="1">
 > <drug><text>texta</text></drug>
 > <drug> <text>textd</text></drug>
 > </unique_drug>
 > <unique_drug @id="2">
 > <drug><text>textb</text></drug>
 > <drug> <text>texte</text></drug>
 > </unique_drug>
 >
 > ..
 >
 >
 > What I do is something like this :
 > ( typed into email, not tested ...)
 >
 > for $id in distinct-values(/all/drug/@id)
 > return
 > <unique_drug id="{$id}">
 > {
 > for $drug in /all/drug[@id eq $id]
 > return
 > <drug>{$drug/*}</drug>
 > }
 > </unique_drug>
 >
 >
 > What I was offhand wondering, is if there isnt something more direct (in
 > XQuery 1).
 > It seems both verbose and inefficient, although of course efficiency is a
 > processor issue. (maybe it makes indexes ...)
 > But still ... it seems it has to scan all elements to get the unique id's
 > then re-scan them N times to get the matching elements,
 > when I can imagine a syntax which would do both at once in linear time as
 > opposed to (presumably) exponential time.
 > It seems like something a declarative expression should be able to state
 > succinctly.
 >
 > Any suggestions ? Or am I just fantasizing
 >
 > -David
 >
 >
 >
 >
 >
 >
 >
 > ----------------------------------------
 > David A. Lee
 > dlee at calldei.com
 > http://www.xmlsh.org
 >
 >
 > _______________________________________________
 > talk at x-query.com
 > http://x-query.com/mailman/listinfo/talk

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://x-query.com/pipermail/talk/attachments/20110302/589ac6a1/attachment.htm


More information about the talk mailing list