<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>RE: [xquery-talk] Count of Distinct elements performance problem</TITLE>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.2963" name=GENERATOR></HEAD>
<BODY>
<DIV dir=ltr align=left><SPAN class=657215319-23082006><FONT face=Arial
color=#0000ff size=2>Did you tell us what product you are
using?</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=657215319-23082006><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=657215319-23082006><FONT face=Arial
color=#0000ff size=2>As I mentioned, the recursive code depends heavily on
optimization in the processor - as indeed does everything, with the kind of data
volumes you are dealing with.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=657215319-23082006><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=657215319-23082006><FONT face=Arial
color=#0000ff size=2>Michael Kay</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=657215319-23082006><FONT face=Arial
color=#0000ff size=2>http://www.saxonica.com/</FONT></SPAN></DIV><BR>
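<DIV dir=ltr align=left><FONT face=Arial color=#0000ff size=2>Where a processor does not optimize recursion, one possible alternative is to avoid recursion altogether. The following is only an untested sketch. It assumes the same document path as the query quoted below, and the names $n, $starts, $start, $k and $next are introduced here purely for illustration. It records the positions in the sorted sequence at which the value changes, then takes the distance between neighbouring positions as the group size.</FONT></DIV>
<PRE>
let $sortedYears :=
  for $dvalue in doc('sampleXML.xml')/Body/Title/Group/ModelYear
  order by $dvalue
  return string($dvalue)
let $n := count($sortedYears)
(: positions at which a new run of equal values begins :)
let $starts :=
  for $i in 1 to $n
  where $i eq 1 or $sortedYears[$i] ne $sortedYears[$i - 1]
  return $i
for $start at $k in $starts
(: a run ends just before the next start, or at the end of the sequence :)
let $next := if ($k lt count($starts)) then $starts[$k + 1] else $n + 1
return &lt;gp value="{$sortedYears[$start]}" count="{$next - $start}"/&gt;
</PRE>
<DIV dir=ltr align=left><FONT face=Arial color=#0000ff size=2>Whether this helps depends on how cheaply the processor can index into a sequence: if $sortedYears[$i] is not a constant-time lookup, the scan itself could become quadratic.</FONT></DIV><BR>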
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #0000ff 2px solid; MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> talk-bounces@xquery.com
[mailto:talk-bounces@xquery.com] <B>On Behalf Of </B>Kusunam,
Srinivas<BR><B>Sent:</B> 23 August 2006 19:47<BR><B>To:</B>
talk@xquery.com<BR><B>Subject:</B> RE: [xquery-talk] Count of Distinct
elements performance problem<BR></FONT><BR></DIV>
<DIV></DIV>
<DIV id=idOWAReplyText4495 dir=ltr>
<DIV dir=ltr><FONT face=Arial color=#000000 size=2>Mike,</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>Thanks for the grouping code. I have
fixed this XQuery, and now I am getting a
"java.lang.StackOverflowError" with this approach, even on half the file (with
222500 Model Years). Is there anything wrong with this XQuery, or is 222500 too
much for it?</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>Here is the XQuery:</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
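<DIV dir=ltr>declare function local:groups($seq as xs:string*, $s as xs:string, $count as xs:integer) {<BR>
&nbsp;&nbsp;if (empty($seq))<BR>
&nbsp;&nbsp;then &lt;gp value="{$s}" count="{$count}"/&gt;<BR>
&nbsp;&nbsp;else<BR>
&nbsp;&nbsp;&nbsp;&nbsp;if ($seq[1] eq $s)<BR>
&nbsp;&nbsp;&nbsp;&nbsp;then local:groups($seq[position() &gt; 1], $s, $count+1)<BR>
&nbsp;&nbsp;&nbsp;&nbsp;else<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(&lt;gp value="{$s}" count="{$count}"/&gt;,<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;local:groups($seq[position() &gt; 1], $seq[1], 1))<BR>
};</DIV>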
<DIV dir=ltr> </DIV>
<DIV dir=ltr>let $mdoc := doc('sampleXML.xml')/Body</DIV>
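<DIV dir=ltr>let $sourModelYear := $mdoc/Title/Group/ModelYear<BR>
return<BR>
&lt;Elements&gt;<BR>
&nbsp;&nbsp;&lt;Element name="ModelYear"&gt;<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&lt;frequencyDistribution&gt;<BR>
&nbsp;&nbsp;&nbsp;&nbsp;{<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;let $sortedYears :=<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for $dvalue in $sourModelYear<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;order by $dvalue<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return string($dvalue)<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;local:groups($sortedYears[position()&gt;1], $sortedYears[1], 1)<BR>
&nbsp;&nbsp;&nbsp;&nbsp;}<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&lt;/frequencyDistribution&gt;<BR>
&nbsp;&nbsp;&lt;/Element&gt;<BR>
&lt;/Elements&gt;</DIV>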
<DIV dir=ltr> </DIV></DIV>
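<DIV dir=ltr><FONT face=Arial size=2>One detail of local:groups above may matter on a processor that optimizes tail calls: the second recursive call is not in tail position, because its result is concatenated with a gp element. A possible variant, sketched here and untested, carries the finished groups in an extra parameter so that both recursive calls are tail calls. The function name local:groups-acc and the parameter $acc are hypothetical and introduced only for this illustration.</FONT></DIV>
<PRE>
declare function local:groups-acc(
    $seq as xs:string*, $s as xs:string, $count as xs:integer,
    $acc as element(gp)*) as element(gp)+ {
  if (empty($seq))
  (: end of input: emit the accumulated groups plus the final one :)
  then ($acc, &lt;gp value="{$s}" count="{$count}"/&gt;)
  else if ($seq[1] eq $s)
  (: same value: extend the current group :)
  then local:groups-acc($seq[position() &gt; 1], $s, $count + 1, $acc)
  (: new value: close the current group and start the next one :)
  else local:groups-acc($seq[position() &gt; 1], $seq[1], 1,
                        ($acc, &lt;gp value="{$s}" count="{$count}"/&gt;))
};
</PRE>
<DIV dir=ltr><FONT face=Arial size=2>It would be called as local:groups-acc($sortedYears[position()&gt;1], $sortedYears[1], 1, ()). On a processor without tail call optimization it still recurses once per item, so it is unlikely to cure the StackOverflowError by itself.</FONT></DIV>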
<DIV dir=ltr><BR>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> Michael Kay
[mailto:mhk@mhk.me.uk]<BR><B>Sent:</B> Tue 8/22/2006 4:14 PM<BR><B>To:</B>
Kusunam, Srinivas; talk@xquery.com<BR><B>Subject:</B> RE: [xquery-talk] Count
of Distinct elements performance problem<BR></FONT><BR></DIV>
<DIV>
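<P><FONT size=2>Something like this:<BR><BR>
(: the namespace URI is a placeholder; any URI will do :)<BR>
declare namespace f = "http://example.com/functions";<BR><BR>
declare function f:groups($seq as xs:string*, $s as xs:string, $count as xs:integer)<BR>
as element(gp)+ {<BR>
&nbsp;&nbsp;if (empty($seq))<BR>
&nbsp;&nbsp;then &lt;gp value="{$s}" count="{$count}"/&gt;<BR>
&nbsp;&nbsp;else<BR>
&nbsp;&nbsp;&nbsp;&nbsp;if ($seq[1] eq $s)<BR>
&nbsp;&nbsp;&nbsp;&nbsp;then f:groups($seq[position() &gt; 1], $s, $count+1)<BR>
&nbsp;&nbsp;&nbsp;&nbsp;else (&lt;gp value="{$s}" count="{$count}"/&gt;,<BR>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;f:groups($seq[position() &gt; 1], $seq[1], 1))<BR>
};<BR><BR>
let $sortedYears :=<BR>
&nbsp;&nbsp;for $dvalue in $sourModelYear<BR>
&nbsp;&nbsp;order by $dvalue<BR>
&nbsp;&nbsp;return string($dvalue)<BR><BR>
(: use string() rather than /text() - strings are simpler and likely to be<BR>
smaller, and /text() is fragile in the face of comments :)<BR><BR>
return<BR>
&nbsp;&nbsp;f:groups($sortedYears[position()&gt;1], $sortedYears[1], 1)<BR><BR>Not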
tested.<BR><BR>This is likely to perform well if there are a small number of
large groups,<BR>less well if there's a large number of small groups - but it
depends on how<BR>good the implementation is at recursion.<BR><BR>Michael
Kay<BR><A
href="http://www.saxonica.com/">http://www.saxonica.com/</A><BR><BR>>
-----Original Message-----<BR>> From: Kusunam, Srinivas [<A
href="mailto:SKusunam@rlpt.com">mailto:SKusunam@rlpt.com</A>]<BR>> Sent: 22
August 2006 20:44<BR>> To: Michael Kay; talk@xquery.com<BR>> Subject:
RE: [xquery-talk] Count of Distinct elements<BR>> performance
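problem<BR>><BR>> Michael,<BR>> Thanks a lot for your reply.<BR>><BR>> I
understand what you are suggesting, and it is a<BR>> good idea. But I am not
clear about how to do the positional<BR>> grouping. Here is how I assume my
modified XQuery would look:<BR>><BR>
> let $mdoc := doc('input.xml')/Body<BR>
> let $sourModelYEAR := $mdoc/Title/ModelYear<BR>
> return<BR>
> &lt;Elements&gt;<BR>
> &nbsp;&nbsp;&lt;Element name="ModelYear"&gt;<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&lt;Distribution&gt;<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;{<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for $dvalue in $sourModelYEAR/text() ------ Extract<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;order by $dvalue ------ Sort<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;............<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;How do we do the grouping of $dvalue, i.e. the ModelYears (1991, 1992, 1995, 1997 etc.)?<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.............<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;distribution&gt;<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;value&gt;{ $dvalue }&lt;/value&gt;<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;count&gt;{ $eachcount }&lt;/count&gt;<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;/distribution&gt;<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;}<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&lt;/Distribution&gt;<BR>
> &nbsp;&nbsp;&lt;/Element&gt;<BR>
> &lt;/Elements&gt;<BR>><BR>> Thanks,<BR>> Srinivas<BR>><BR>>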
-----Original Message-----<BR>> From: Michael Kay [<A
href="mailto:mhk@mhk.me.uk">mailto:mhk@mhk.me.uk</A>]<BR>> Sent: Tuesday,
August 22, 2006 2:20 PM<BR>> To: Kusunam, Srinivas; talk@xquery.com<BR>>
Subject: RE: [xquery-talk] Count of Distinct elements<BR>> performance
problem<BR>><BR>> A performance question like this can only be answered
with<BR>> respect to a specific product.<BR>><BR>> There aren't many
XQuery engines that will handle an 8GB<BR>> file, so you're doing quite
well.<BR>><BR>> You might find that a multi-pass approach works
faster:<BR>><BR>> (a) extract the values<BR>><BR>> (b) sort
them<BR>><BR>> (c) use a recursive scan to do positional grouping -
depends<BR>> on your product supporting tail call
optimization<BR>><BR>><BR>> That's likely to have O(n*log(n))
performance rather than O(n^2).<BR>><BR>> Michael Kay<BR>> <A
href="http://www.saxonica.com/">http://www.saxonica.com/</A><BR>><BR>><BR>><BR>>
> -----Original Message-----<BR>> > From:
talk-bounces@xquery.com<BR>> > [<A
href="mailto:talk-bounces@xquery.com">mailto:talk-bounces@xquery.com</A>] On
Behalf Of Kusunam, Srinivas<BR>> > Sent: 22 August 2006 18:36<BR>>
> To: talk@xquery.com<BR>> > Subject: [xquery-talk] Count of Distinct
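elements performance problem<BR>> ><BR>> > I am trying to
find the count of distinct elements (Model Year).<BR>> > Here is my XQuery.
It takes 4 hrs to get the count from an 8GB file.<BR>> > There are around
3000 distinct Model Years in this file.<BR>> ><BR>
> > let $mdoc := doc('input.xml')/Body<BR>
> > let $sourModelYEAR := $mdoc/Title/ModelYear<BR>
> > return<BR>
> > &lt;Elements&gt;<BR>
> > &nbsp;&nbsp;&lt;Element name="ModelYear"&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&lt;Distribution&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;{<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for $dvalue in fn:distinct-values($sourModelYEAR)<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;let $eachcount := count($mdoc/Title[ModelYear=$dvalue])<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;distribution&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;value&gt;{ $dvalue }&lt;/value&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;count&gt;{ $eachcount }&lt;/count&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;/distribution&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;}<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&lt;/Distribution&gt;<BR>
> > &nbsp;&nbsp;&lt;/Element&gt;<BR>
> > &lt;/Elements&gt;<BR>> ><BR>> > This query seems to loop through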
the document once for each value, i.e.<BR>> > about 3000 times overall. I know this
should be easily achievable<BR>> > if we had Group-By in
XQuery. Does any XQuery engine support<BR>> > (custom) Group-By
now?<BR>> > Or is there any other way to make this query
efficient?<BR>> ><BR>> > Whereas if I add one more element to
find the pattern of<BR>> > the data, it finishes the job within
40 minutes. Why this odd behavior?<BR>> ><BR>> > let $mdoc :=
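doc('input.xml')/Body<BR>
> > let $sourModelYEAR := $mdoc/Title/ModelYear<BR>
> > return<BR>
> > &lt;Elements&gt;<BR>
> > &nbsp;&nbsp;&lt;Element name="ModelYear"&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&lt;Distribution&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;{<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for $dvalue in fn:distinct-values($sourModelYEAR)<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;let $eachcount := count($mdoc/Title[ModelYear=$dvalue])<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;distribution&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;value&gt;{ $dvalue }&lt;/value&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;count&gt;{ $eachcount }&lt;/count&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;/distribution&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;}<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&lt;/Distribution&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&lt;PatternDistribution&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;{<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for $phonenum in distinct-values($sourModelYEAR/translate(., '0123456789', '9999999999'))<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;pattern&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;type&gt;{ $phonenum }&lt;/type&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;count&gt;{count($sourModelYEAR[translate(., '0123456789', '9999999999') eq $phonenum])}&lt;/count&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;/pattern&gt;<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;}<BR>
> > &nbsp;&nbsp;&nbsp;&nbsp;&lt;/PatternDistribution&gt;<BR>
> > &nbsp;&nbsp;&lt;/Element&gt;<BR>
> > &lt;/Elements&gt;<BR>> ><BR>> ><BR>> > Thanks,<BR>> >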
Srini<BR>> >
*****************************************************************<BR>> >
This message has originated from RLPTechnologies,<BR>> > 26955
Northwestern Highway, Southfield, MI 48033.<BR>> ><BR>> >
RLPTechnologies sends various types of email communications. <BR>>
> If this email message concerns the potential licensing of an RLPT<BR>>
> product or service, and you do not wish to receive further emails<BR>>
> regarding Polk products, forward this email to
Do_Not_Send@rlpt.com<BR>> > with the word "remove" in the subject
line.<BR>> ><BR>> > The email and any files transmitted with it
are confidential and<BR>> > intended solely for the individual or entity
to whom they are<BR>> > addressed.<BR>> ><BR>> > If you have
received this email in error, please delete<BR>> this message<BR>> >
and notify the Polk System Administrator at postmaster@rlpt.com.<BR>> >
*****************************************************************<BR>>
><BR>> ><BR>> >
_______________________________________________<BR>> >
talk@xquery.com<BR>> > <A
href="http://xquery.com/mailman/listinfo/talk">http://xquery.com/mailman/listinfo/talk</A><BR><BR></FONT></P></DIV></BLOCKQUOTE></BODY></HTML>