<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#ffffff" text="#000000">
This really is the best way to do this in XQuery 1.0, and it is a
common pattern; this is why the "group by" clause was added to
XQuery 3.0.<br>
<br>
In XQuery 3.0 this would be as simple as:<br>
<pre>
<tt>for $drugs in /all/drug</tt>
<tt>let $id := $drugs/@id</tt>
<tt>group by $id</tt>
<tt>return</tt>
<tt> <unique_drug id="{$id[1]}"></tt>
<tt> {</tt>
<tt> for $drug in $drugs</tt>
<tt> return</tt>
<tt> <drug>{$drug/*}</drug></tt>
<tt> }</tt>
<tt></unique_drug></tt>
</pre>
<br>
Performance in XQuery 1.0 may not be as bad as you fear, though. I
can't vouch for other processors, but this particular pattern is one
that we have put a great deal of effort into spotting in XQSharp.<br>
<br>
<br>
When I put your query through our processor I get the following
query plan:<br>
<tt>let $:temp:41 :=</tt><br>
<tt> step</tt><br>
<tt> $fs:dot_0/child::all</tt><br>
<tt> child::drug</tt><br>
<tt>left outer hash join</tt><br>
<tt> for $id in</tt><br>
<tt> http://www.w3.org/2005/xpath-functions:distinct-values(</tt><br>
<tt> http://www.w3.org/2005/xpath-functions:data(</tt><br>
<tt> step</tt><br>
<tt> $:temp:41</tt><br>
<tt> attribute::id</tt><br>
<tt> )</tt><br>
<tt> ,</tt><br>
<tt> "http://www.w3.org/2005/xpath-functions/collation/codepoint"</tt><br>
<tt> )</tt><br>
<tt>to</tt><br>
<tt> for $drug in $:temp:41</tt><br>
<tt>on</tt><br>
<tt> fs:convert-operand-to-atomic-type[http://www.w3.org/2001/XMLSchema:string]($id)</tt><br>
<tt> =</tt><br>
<tt> fs:convert-operand-to-atomic-type[http://www.w3.org/2001/XMLSchema:string](cardinality-check[?] {</tt><br>
<tt> http://www.w3.org/2005/xpath-functions:data($drug/attribute::id)</tt><br>
<tt> })</tt><br>
<tt>aggregate</tt><br>
<tt> element {drug} { </tt><br>
<tt>fs:item-sequence-to-node-sequence($drug/child::element()) }</tt><br>
<tt>as</tt><br>
<tt> $:temp:42</tt><br>
<tt>return</tt><br>
<tt> element {unique_drug} { (attribute {id} { $id cast as</tt><br>
<tt> http://www.w3.org/2001/XMLSchema:untypedAtomic* } ,</tt><br>
<tt> fs:item-sequence-to-node-sequence($:temp:42)) }</tt><br>
<br>
<br>
<br>
The key thing to note here is the "left outer hash join", which
should perform in linear time. What happens is that an index is first
built for the right-hand input to the join (in this case the /all/drug
nodes, keyed on the value of @id). Then, for each distinct value of
/all/drug/@id, all matching nodes are looked up in that index and the
<unique_drug> element is returned.<br>
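<br>
To make the idea concrete, here is a rough hand-written sketch of
what such a join amounts to. It is not XQSharp output, and it assumes
a processor that supports the map functions from XQuery 3.1 (they are
not available in XQuery 1.0); the names $drugs and $index are only
illustrative:<br>
<pre>
<tt>let $drugs := /all/drug</tt>
<tt>(: build the index once: each id value maps to all drug elements carrying it :)</tt>
<tt>let $index := map:merge(</tt>
<tt>  for $d in $drugs</tt>
<tt>  return map:entry(string($d/@id), $d),</tt>
<tt>  map { "duplicates": "combine" }</tt>
<tt>)</tt>
<tt>(: probe the index once per distinct id instead of rescanning /all/drug :)</tt>
<tt>for $id in distinct-values($drugs/@id)</tt>
<tt>return</tt>
<tt>  <unique_drug id="{$id}"></tt>
<tt>  {</tt>
<tt>    for $drug in $index(string($id))</tt>
<tt>    return</tt>
<tt>      <drug>{$drug/*}</drug></tt>
<tt>  }</tt>
<tt>  </unique_drug></tt>
</pre>
Each drug element is visited once to build the index and once when
its group is emitted, which is where the linear behaviour comes from.<br>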
<br>
XQSharp should be smart enough to spot that the join is joining the
keys on the right-hand side to their own distinct values. The query
plan should have looked something like this:<br>
<br>
<tt> for $drug in /all/drug<br>
group by<br>
data($drug/@id)<br>
aggregate<br>
<drug>{$drug/*}</drug><br>
as<br>
$:temp<br>
return<br>
<unique_drug id="{$id}">{$:temp}</unique_drug></tt><br>
<br>
If this optimization had kicked in correctly then the whole query
would have been performed in a single pass of the document. In that
plan a single index is built for the /all/drug elements (keyed on
data(@id)) and then the index itself is iterated. Performance should
be about as fast as it possibly could be!<br>
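<br>
As a hand-written approximation of that single-index plan (a sketch
only, assuming XQuery 3.1 map support, which XQuery 1.0 does not
have; $index is an illustrative name), the index can be built in one
pass and then iterated directly, with no separate distinct-values()
step:<br>
<pre>
<tt>(: one pass over /all/drug builds the only index, keyed on the id value :)</tt>
<tt>let $index := map:merge(</tt>
<tt>  for $d in /all/drug</tt>
<tt>  return map:entry(string($d/@id), $d),</tt>
<tt>  map { "duplicates": "combine" }</tt>
<tt>)</tt>
<tt>(: iterate the index itself; every key is already a distinct id :)</tt>
<tt>for $id in map:keys($index)</tt>
<tt>return</tt>
<tt>  <unique_drug id="{$id}"></tt>
<tt>  {</tt>
<tt>    for $drug in $index($id)</tt>
<tt>    return</tt>
<tt>      <drug>{$drug/*}</drug></tt>
<tt>  }</tt>
<tt>  </unique_drug></tt>
</pre>
Note that map:keys() returns the keys in an implementation-dependent
order, so the groups may not come out in document order.<br>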
<br>
It seems, though, that this optimization hasn't happened with the
latest build of XQSharp; I have filed a bug report and will
investigate why this is the case with your query.<br>
<br>
Even without this final optimization, though, the query does perform
in linear time; it just builds two indexes of the data rather than
one (one during the computation of the distinct values, and one to
perform the join).<br>
<br>
If you are interested, I have written a more detailed analysis of
the different join optimizations performed by XQSharp here:<br>
<a class="moz-txt-link-freetext" href="http://xqsharp.blogspot.com/2010/05/join-optimizations-in-xqsharp.html">http://xqsharp.blogspot.com/2010/05/join-optimizations-in-xqsharp.html</a><br>
<br>
<br>
<br>
Oliver Hallam<br>
XQSharp<br>
<br>
On 01/03/2011 15:33, David Lee wrote:<br>
> I find I use this pattern frequently when I need to group
multiple elements<br>
> associated with some shared identifier (say @id)<br>
> Suppose I have something like<br>
><br>
> <all><br>
> <drug @id="1"><text>texta</text></drug><br>
> <drug @id="2">
<text>textb</text></drug><br>
> <drug @id="3">
<text>textc</text></drug><br>
> <drug @id="1">
<text>textd</text></drug><br>
> <drug @id="2">
<text>texte</text></drug><br>
> ...<br>
><br>
><br>
> And I want to create a set of combined entries<br>
> like<br>
> <all><br>
> <unique_drug @id="1"><br>
> <drug><text>texta</text></drug><br>
> <drug> <text>textd</text></drug><br>
> </unique_drug><br>
> <unique_drug @id="2"><br>
> <drug><text>textb</text></drug><br>
> <drug> <text>texte</text></drug><br>
> </unique_drug><br>
><br>
> ..<br>
><br>
><br>
> What I do is something like this :<br>
> ( typed into email, not tested ...)<br>
><br>
> for $id in distinct-values(/all/drug/@id)<br>
> return<br>
> <unique_drug id="{$id}"><br>
> {<br>
> for $drug in /all/drug[@id eq $id]<br>
> return<br>
> <drug>{$drug/*}</drug><br>
> }<br>
> </unique_drug><br>
><br>
><br>
> What I was offhand wondering, is if there isnt something more
direct (in<br>
> XQuery 1).<br>
> It seems both verbose and inefficient, although of course
efficiency is a<br>
> processor issue. (maybe it makes indexes ...)<br>
> But still ... it seems it has to scan all elements to get the
unique id's<br>
> then re-scan them N times to get the matching elements,<br>
> when I can imagine a syntax which would do both at once in
linear time as<br>
> opposed to (presumably) exponential time.<br>
> It seems like something a declarative expression should be able
to state<br>
> succinctly.<br>
><br>
> Any suggestions ? Or am I just fantasizing<br>
><br>
> -David<br>
><br>
> ----------------------------------------<br>
> David A. Lee<br>
> <a class="moz-txt-link-abbreviated" href="mailto:dlee@calldei.com">dlee@calldei.com</a><br>
> <a class="moz-txt-link-freetext" href="http://www.xmlsh.org">http://www.xmlsh.org</a><br>
><br>
><br>
> _______________________________________________<br>
> <a class="moz-txt-link-abbreviated" href="mailto:talk@x-query.com">talk@x-query.com</a><br>
> <a class="moz-txt-link-freetext" href="http://x-query.com/mailman/listinfo/talk">http://x-query.com/mailman/listinfo/talk</a><br>
<br>
</body>
</html>