<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body bgcolor="#ffffff" text="#000000">

    This really is the best way to do this in XQuery 1.0, and it is a
    common pattern; this is why the "group by" clause was added to
    XQuery 3.0.<br>

    <br>

    In XQuery 3.0 this would be as simple as:<br>

    <pre>

<tt>for $drugs in /all/drug</tt>

<tt>let $id := $drugs/@id</tt>

<tt>group by $id</tt>

<tt>return</tt>

<tt>  &lt;unique_drug id="{$id[1]}"&gt;</tt>

<tt>    {</tt>

<tt>      for $drug in $drugs</tt>

<tt>      return</tt>

<tt>        &lt;drug&gt;{$drug/*}&lt;/drug&gt;</tt>

<tt>    }</tt>

<tt>&lt;/unique_drug&gt;</tt>


</pre>

    <br>

    Performance in XQuery 1.0 may not be as bad as you fear, though. I
    can't vouch for other processors, but this particular pattern is one
    that we have put a great deal of effort into spotting in XQSharp.<br>

    <br>

    <br>

    When I put your query through our processor I get the following

    query plan:<br>

    <tt>let $:temp:41 :=</tt><br>

    <tt>&nbsp;step</tt><br>

    <tt>&nbsp;&nbsp;&nbsp; $fs:dot_0/child::all</tt><br>

    <tt>&nbsp;&nbsp;&nbsp; child::drug</tt><br>

    <tt>left outer hash join</tt><br>

    <tt>&nbsp; for $id in</tt><br>

    <tt>&nbsp;&nbsp;&nbsp; http://www.w3.org/2005/xpath-functions:distinct-values(</tt><br>
    <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; http://www.w3.org/2005/xpath-functions:data(</tt><br>

    <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; step</tt><br>

    <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $:temp:41</tt><br>

    <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; attribute::id</tt><br>

    <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; )</tt><br>

    <tt>&nbsp;&nbsp;&nbsp; ,</tt><br>

    <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; "http://www.w3.org/2005/xpath-functions/collation/codepoint"</tt><br>

    <tt>&nbsp;&nbsp;&nbsp; )</tt><br>

    <tt>to</tt><br>

    <tt>&nbsp; for $drug in $:temp:41</tt><br>

    <tt>on</tt><br>

    <tt>fs:convert-operand-to-atomic-type[http://www.w3.org/2001/XMLSchema:string]($id) =
      fs:convert-operand-to-atomic-type[http://www.w3.org/2001/XMLSchema:string](cardinality-check[?]</tt><br>
    <tt>{ http://www.w3.org/2005/xpath-functions:data($drug/attribute::id) })</tt><br>

    <tt>aggregate</tt><br>

    <tt>&nbsp; element {drug} { </tt><br>

    <tt>fs:item-sequence-to-node-sequence($drug/child::element()) }</tt><br>

    <tt>as</tt><br>

    <tt>&nbsp; $:temp:42</tt><br>

    <tt>return</tt><br>

    <tt>&nbsp; element {unique_drug} { (attribute {id} { $id cast as </tt><br>

    <tt>http://www.w3.org/2001/XMLSchema:untypedAtomic* } , </tt><br>

    <tt>fs:item-sequence-to-node-sequence($:temp:42)) }</tt><br>

    <br>

    <br>

    <br>

    The key thing to note here is the "left outer hash join", which
    should perform in linear time. What is in fact happening here is
    that an index is first built for the right-hand input to the join
    (in this case the values of @id for each /all/drug node); then, for
    each distinct value of /all/drug/@id, all matching nodes are selected
    and the &lt;unique_drug&gt; element is returned.<br>
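    <br>
    As a rough illustration of the hash join described above, here is a
    small Python sketch. It is not XQSharp's implementation; the
    function name and the list of (id, text) pairs standing in for the
    /all/drug elements are assumptions made for the example.<br>
    <pre>
```python
from collections import defaultdict


def hash_join_group(drugs):
    """Group (id, text) pairs in the style of the hash join plan.

    drugs is a hypothetical stand-in for the /all/drug elements:
    each pair holds the value of @id and the element's text.
    """
    # First index: the distinct ids, in order of first appearance.
    # This plays the role of distinct-values in the plan.
    distinct_ids = list(dict.fromkeys(drug_id for drug_id, _ in drugs))

    # Second index: a hash table over the right-hand input, keyed on id.
    index = defaultdict(list)
    for drug_id, text in drugs:
        index[drug_id].append(text)

    # Probe: one lookup per distinct id, so the input is never rescanned.
    return [(drug_id, index[drug_id]) for drug_id in distinct_ids]


drugs = [("1", "texta"), ("2", "textb"), ("3", "textc"),
         ("1", "textd"), ("2", "texte")]
result = hash_join_group(drugs)
```
</pre>
    With the sample data from the original question, result holds three
    groups: id "1" with texta and textd, id "2" with textb and texte,
    and id "3" with textc. Both indexes are built in one pass each, which
    is where the linear running time comes from.<br>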

    <br>

    XQSharp should be smart enough to spot that the join is joining the
    keys on the right-hand side to their own distinct values. The query
    plan should have looked something like this:<br>

    <br>

    <tt>&nbsp; for $drug in /all/drug<br>

      group by<br>

      &nbsp; data($drug/@id)<br>

      aggregate<br>

      &nbsp; &lt;drug&gt;{$drug/*}&lt;/drug&gt;<br>

      as<br>

      &nbsp; $:temp<br>

      return<br>

      &nbsp; &lt;unique_drug id="{$id}"&gt;{$:temp}&lt;/unique_drug&gt;</tt><br>

    <br>

    If this optimization had kicked in correctly then the whole query
    would have been performed in a single pass of the document. In this
    case an index is built for the /all/drug elements (keyed on
    data(@id)) and then the index is iterated. Performance should be
    about as fast as it possibly could be!<br>
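    <br>
    The single-pass version can be sketched in the same way. Again this
    is only an illustrative Python sketch under the same assumptions
    (a hypothetical list of (id, text) pairs), not the code XQSharp
    generates.<br>
    <pre>
```python
def group_single_pass(drugs):
    """Build one index keyed on id, then iterate over it."""
    groups = {}
    for drug_id, text in drugs:
        # setdefault creates the group the first time an id is seen.
        groups.setdefault(drug_id, []).append(text)
    # Python dicts keep insertion order, so groups come out in order
    # of first appearance of each id.
    return list(groups.items())


drugs = [("1", "texta"), ("2", "textb"), ("3", "textc"),
         ("1", "textd"), ("2", "texte")]
grouped = group_single_pass(drugs)
```
</pre>
    Here only one index is built, and the groups are read straight out
    of it, so there is no separate distinct-values step.<br>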

    <br>

    It seems, though, that this optimization hasn't happened with the
    latest build of XQSharp; I have filed a bug report and will
    investigate why this is the case with your query.<br>

    <br>

    Even without this final optimization, though, the query does perform
    in linear time; it just builds two indexes of the data rather than
    one (one during the computation of the distinct values, and one to
    perform the join).<br>

    <br>

    If you are interested, I have written a more detailed analysis of
    the different join optimizations performed by XQSharp here:<br>

<a class="moz-txt-link-freetext" href="http://xqsharp.blogspot.com/2010/05/join-optimizations-in-xqsharp.html">http://xqsharp.blogspot.com/2010/05/join-optimizations-in-xqsharp.html</a><br>

    <br>

    <br>

    <br>

    Oliver Hallam<br>

    XQSharp<br>

    <br>

    On 01/03/2011 15:33, David Lee wrote:<br>

    &gt; I find I use this pattern frequently when I need to group

    multiple elements<br>

    &gt; associated with some shared identifier (say @id)<br>

    &gt; Suppose I have something like<br>

    &gt;<br>

    &gt; &lt;all&gt;<br>

    &gt; &lt;drug @id="1"&gt;&lt;text&gt;texta&lt;/text&gt;&lt;/drug&gt;<br>

    &gt; &lt;drug @id="2"&gt;

    &lt;text&gt;textb&lt;/text&gt;&lt;/drug&gt;<br>

    &gt; &lt;drug @id="3"&gt;

    &lt;text&gt;textc&lt;/text&gt;&lt;/drug&gt;<br>

    &gt; &lt;drug @id="1"&gt;

    &lt;text&gt;textd&lt;/text&gt;&lt;/drug&gt;<br>

    &gt; &lt;drug @id="2"&gt;

    &lt;text&gt;texte&lt;/text&gt;&lt;/drug&gt;<br>

    &gt; ...<br>

    &gt;<br>

    &gt;<br>

    &gt; And I want to create a set of combined entries<br>

    &gt; like<br>

    &gt; &lt;all&gt;<br>

    &gt; &lt;unique_drug @id="1"&gt;<br>

    &gt; &lt;drug&gt;&lt;text&gt;texta&lt;/text&gt;&lt;/drug&gt;<br>

    &gt; &lt;drug&gt; &lt;text&gt;textd&lt;/text&gt;&lt;/drug&gt;<br>

    &gt; &lt;/unique_drug&gt;<br>

    &gt; &lt;unique_drug @id="2"&gt;<br>

    &gt; &lt;drug&gt;&lt;text&gt;textb&lt;/text&gt;&lt;/drug&gt;<br>

    &gt; &lt;drug&gt; &lt;text&gt;texte&lt;/text&gt;&lt;/drug&gt;<br>

    &gt; &lt;/unique_drug&gt;<br>

    &gt;<br>

    &gt; ..<br>

    &gt;<br>

    &gt;<br>

    &gt; What I do is something like this :<br>

    &gt; ( typed into email, not tested ...)<br>

    &gt;<br>

    &gt; for $id in distinct-values(/all/drug/@id)<br>

    &gt; return<br>

    &gt; &lt;unique_drug id="{$id}"&gt;<br>

    &gt; {<br>

    &gt; for $drug in /all/drug[@id eq $id]<br>

    &gt; return<br>

    &gt; &lt;drug&gt;{$drug/*}&lt;/drug&gt;<br>

    &gt; }<br>

    &gt; &lt;/unique_drug&gt;<br>

    &gt;<br>

    &gt;<br>

    &gt; What I was offhand wondering, is if there isnt something more

    direct (in<br>

    &gt; XQuery 1).<br>

    &gt; It seems both verbose and inefficient, although of course

    efficiency is a<br>

    &gt; processor issue. (maybe it makes indexes ...)<br>

    &gt; But still ... it seems it has to scan all elements to get the

    unique id's<br>

    &gt; then re-scan them N times to get the matching elements,<br>

    &gt; when I can imagine a syntax which would do both at once in

    linear time as<br>

    &gt; opposed to (presumably) exponential time.<br>

    &gt; It seems like something a declarative expression should be able

    to state<br>

    &gt; succinctly.<br>

    &gt;<br>

    &gt; Any suggestions ? Or am I just fantasizing<br>

    &gt;<br>

    &gt; -David<br>

    &gt;<br>

    &gt;<br>

    &gt;<br>

    &gt;<br>

    &gt; <br>

    &gt;<br>

    &gt;<br>

    &gt; ----------------------------------------<br>

    &gt; David A. Lee<br>

    &gt; <a class="moz-txt-link-abbreviated" href="mailto:dlee@calldei.com">dlee@calldei.com</a><br>

    &gt; <a class="moz-txt-link-freetext" href="http://www.xmlsh.org">http://www.xmlsh.org</a><br>

    &gt;<br>

    &gt;<br>

    &gt; _______________________________________________<br>

    &gt; <a class="moz-txt-link-abbreviated" href="mailto:talk@x-query.com">talk@x-query.com</a><br>

    &gt; <a class="moz-txt-link-freetext" href="http://x-query.com/mailman/listinfo/talk">http://x-query.com/mailman/listinfo/talk</a><br>

    <br>

  </body>

</html>