[xquery-talk] Count of Distinct elements performance problem
Michael Kay
mhk at mhk.me.uk
Tue Aug 22 22:14:27 PDT 2006
Something like this:
let $sortedYears :=
for $dvalue in $sourModelYear
order by $dvalue
return string($dvalue)
(: use string() rather than /text() - strings are simpler and likely to be
smaller, and /text() is fragile in the face of comments :)
return
f:groups($sortedYears[position()>1], $sortedYears[1], 1);
declare function f:groups($seq as xs:string()*, $s as xs:string, $count as
xs:integer)
as xs:integer {
if (empty($seq))
then ()
else
if ($seq[1] eq $s)
then f:count-group($seq[position() > 1], $s, $count+1)
else (<gp value="{$s}" count="{$count}"/>,
f:count-group($seq[position() > 1], $s, 1))
}
Not tested.
This is likely to perform well if there are a small number of large groups,
less well if there's a large number of small groups - but it depends on how
good the implementation is at recursion.
Michael Kay
http://www.saxonica.com/
> -----Original Message-----
> From: Kusunam, Srinivas [mailto:SKusunam at rlpt.com]
> Sent: 22 August 2006 20:44
> To: Michael Kay; talk at xquery.com
> Subject: RE: [xquery-talk] Count of Distinct elements
> performance problem
>
> Michael,
> Thanks a lot for your reply.
>
> I understand what you are suggesting and it is a
> good idea. But I am not clear about doing positional
> grouping. Here is how I am assuming my modified XQuery would like:
>
> let $mdoc := doc('input.xml')/Body
> let $sourModelYEAR := $mdoc/Title/ModelYear return <Elements>
> <Element name="ModelYear">
> <Distribution>
> {
> for $dvalue in $sourModelYEAR/text() ------ Extract
> order by $dvalue ------ Sort
> ............
> How do we do Grouping of $dvalue i.e.
> ModelYear's (1991, 1992, 1995, 1997 etc)????????
> .............
> return
> <distribution>
> <value>{ $dvalue }</value>
> <count>{ $eachcount }</count>
> </distribution>
> }
> </Distribution>
> </Element>
> </Elements>
>
> Thanks,
> Srinivas
>
> -----Original Message-----
> From: Michael Kay [mailto:mhk at mhk.me.uk]
> Sent: Tuesday, August 22, 2006 2:20 PM
> To: Kusunam, Srinivas; talk at xquery.com
> Subject: RE: [xquery-talk] Count of Distinct elements
> performance problem
>
> A performance question like this can only be answered with
> respect to a specific product.
>
> There aren't many XQuery engines that will handle an 8Gb
> file, so you're doing quite well.
>
> You might find that a multi-pass approach works faster:
>
> (a) extract the values
>
> (b) sort them
>
> (c) use a recursive scan to do positional grouping - depends
> on your product supporting tail call optimization
>
>
> That's likely to have O(n*log(n)) performance rather than O(n^2).
>
> Michael Kay
> http://www.saxonica.com/
>
>
>
> > -----Original Message-----
> > From: talk-bounces at xquery.com
> > [mailto:talk-bounces at xquery.com] On Behalf Of Kusunam, Srinivas
> > Sent: 22 August 2006 18:36
> > To: talk at xquery.com
> > Subject: [xquery-talk] Count of Distinct elements
> performance problem
> >
> > I am trying to find count of distinct elements (Model Year).
> > Here is my XQuery. It takes 4 hrs to get the count from 8GB file.
> > There are around 3000 distinct Model years in this file.
> >
> > let $mdoc := doc('input.xml')/Body
> > let $sourModelYEAR := $mdoc/Title/ModelYear return <Elements>
> > <Element name="ModelYear">
> > <Distribution>
> > {
> > for $dvalue in fn:distinct-values($sourModelYEAR)
> > let $eachcount := count($mdoc/Title[ModelYear=$dvalue])
> > return
> > <distribution>
> > <value>{ $dvalue }</value>
> > <count>{ $eachcount }</count>
> > </distribution>
> > }
> > </Distribution>
> > </Element>
> > </Elements>
> >
> > This Query seems to loop through the document for each value i.e.
> > overall 3000 times. I know this should be easily achievable
> if we have
> > Group-by in XQuery. Do any XQuery engine supports
> > (custom) Group-By now?
> > Or is there any other way to make this query efficient?
> >
> > Where as if I add one more element to find the pattern of
> the data it
> > finishes the job within 40 minutes? Why is this odd behavior?
> >
> > let $mdoc := doc('input.xml')/Body
> > let $sourModelYEAR := $mdoc/Title/ModelYear return <Elements>
> > <Element name="ModelYear">
> > <Distribution>
> > {
> > for $dvalue in fn:distinct-values($sourModelYEAR)
> > let $eachcount := count($mdoc/Title[ModelYear=$dvalue])
> > return
> > <distribution>
> > <value>{ $dvalue }</value>
> > <count>{ $eachcount }</count>
> > </distribution>
> > }
> > </Distribution>
> > <PatternDistribution>
> > {
> > for $phonenum in
> > distinct-values($sourModelYEAR/translate(.,
> > '0123456789','9999999999'))
> > return
> > <pattern>
> > <type>{ $phonenum }</type>
> > <count>{count($sourModelYEAR[translate(.,
> > '0123456789', '9999999999') eq $phonenum])}</count>
> > </pattern>
> > }
> > </PatternDistribution>
> > </Element>
> > </Elements>
> >
> >
> > Thanks,
> > Srini
> > *****************************************************************
> > This message has originated from RLPTechnologies,
> > 26955 Northwestern Highway, Southfield, MI 48033.
> >
> > RLPTechnologies sends various types of email communications.
> > If this email message concerns the potential licensing of an RLPT
> > product or service, and you do not wish to receive further emails
> > regarding Polk products, forward this email to Do_Not_Send at rlpt.com
> > with the word "remove" in the subject line.
> >
> > The email and any files transmitted with it are confidential and
> > intended solely for the individual or entity to whom they are
> > addressed.
> >
> > If you have received this email in error, please delete
> this message
> > and notify the Polk System Administrator at postmaster at rlpt.com.
> > *****************************************************************
> >
> >
> > _______________________________________________
> > talk at xquery.com
> > http://xquery.com/mailman/listinfo/talk
>
> *****************************************************************
> This message has originated from RLPTechnologies,
> 26955 Northwestern Highway, Southfield, MI 48033.
>
> RLPTechnologies sends various types of email
> communications. If this email message concerns the
> potential licensing of an RLPT product or service, and
> you do not wish to receive further emails regarding Polk
> products, forward this email to Do_Not_Send at rlpt.com
> with the word "remove" in the subject line.
>
> The email and any files transmitted with it are confidential
> and intended solely for the individual or entity to whom they
> are addressed.
>
> If you have received this email in error, please delete this
> message and notify the Polk System Administrator at
> postmaster at rlpt.com.
> *****************************************************************
>
>
More information about the talk
mailing list