[xquery-talk] Count of Distinct elements performance problem

Michael Kay mhk at mhk.me.uk
Tue Aug 22 22:14:27 PDT 2006


Something like this:

let $sortedYears :=
  for $dvalue in $sourModelYear
  order by $dvalue
  return string($dvalue)

(: use string() rather than /text() - strings are simpler and likely to be
smaller, and /text() is fragile in the face of comments :)

return
  f:groups($sortedYears[position()>1], $sortedYears[1], 1);


declare function f:groups($seq as xs:string()*, $s as xs:string, $count as
xs:integer)
as xs:integer {
  if (empty($seq))
  then ()
  else
    if ($seq[1] eq $s)
    then f:count-group($seq[position() > 1], $s, $count+1)
    else (<gp value="{$s}" count="{$count}"/>,
        f:count-group($seq[position() > 1], $s, 1))
}

Not tested.

This is likely to perform well if there are a small number of large groups,
less well if there's a large number of small groups - but it depends on how
good the implementation is at recursion.

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: Kusunam, Srinivas [mailto:SKusunam at rlpt.com] 
> Sent: 22 August 2006 20:44
> To: Michael Kay; talk at xquery.com
> Subject: RE: [xquery-talk] Count of Distinct elements 
> performance problem
> 
> Michael,
>          Thanks a lot for your reply.
> 
>          I understand what you are suggesting and it is a 
> good idea. But I am not clear about doing positional 
> grouping. Here is how I am assuming my modified XQuery would like:
> 
> let $mdoc := doc('input.xml')/Body
> let $sourModelYEAR := $mdoc/Title/ModelYear return <Elements>
>    <Element name="ModelYear"> 
>         <Distribution>
>         {
>           for $dvalue in $sourModelYEAR/text()    ------ Extract
> 	    order by $dvalue                        ------ Sort 
> 	    ............
> 		How do we do Grouping of $dvalue i.e. 
> ModelYear's (1991, 1992, 1995, 1997 etc)????????
>           .............   	
>           return  
>               <distribution>
>                 <value>{ $dvalue }</value>
>                 <count>{ $eachcount }</count>
>               </distribution>
>         }
>         </Distribution>
>     </Element>      
>  </Elements>
> 
> Thanks,
> Srinivas
> 
> -----Original Message-----
> From: Michael Kay [mailto:mhk at mhk.me.uk]
> Sent: Tuesday, August 22, 2006 2:20 PM
> To: Kusunam, Srinivas; talk at xquery.com
> Subject: RE: [xquery-talk] Count of Distinct elements 
> performance problem
> 
> A performance question like this can only be answered with 
> respect to a specific product.
> 
> There aren't many XQuery engines that will handle an 8Gb 
> file, so you're doing quite well.
> 
> You might find that a multi-pass approach works faster:
> 
> (a) extract the values
> 
> (b) sort them
> 
> (c) use a recursive scan to do positional grouping - depends 
> on your product supporting tail call optimization
> 
> 
> That's likely to have O(n*log(n)) performance rather than O(n^2). 
> 
> Michael Kay
> http://www.saxonica.com/
> 
> 
> 
> > -----Original Message-----
> > From: talk-bounces at xquery.com
> > [mailto:talk-bounces at xquery.com] On Behalf Of Kusunam, Srinivas
> > Sent: 22 August 2006 18:36
> > To: talk at xquery.com
> > Subject: [xquery-talk] Count of Distinct elements 
> performance problem
> > 
> > I am trying to find count of distinct elements (Model Year). 
> > Here is my XQuery. It takes 4 hrs to get the count from 8GB file. 
> > There are around 3000 distinct Model years in this file.
> > 
> > let $mdoc := doc('input.xml')/Body
> > let $sourModelYEAR := $mdoc/Title/ModelYear return <Elements>
> >     <Element name="ModelYear"> 
> >         <Distribution>
> >         {
> >           for $dvalue in fn:distinct-values($sourModelYEAR)
> >           let $eachcount := count($mdoc/Title[ModelYear=$dvalue])
> >           return  
> >               <distribution>
> >                 <value>{ $dvalue }</value>
> >                 <count>{ $eachcount }</count>
> >               </distribution>
> >         }
> >         </Distribution>
> >     </Element>      
> >  </Elements>
> > 
> > This Query seems to loop through the document for each value i.e.
> > overall 3000 times. I know this should be easily achievable 
> if we have 
> > Group-by in XQuery. Do any XQuery engine supports
> > (custom) Group-By now?
> > Or is there any other way to make this query efficient?
> > 
> > Where as if I add one more element to find the pattern of 
> the data it 
> > finishes the job within 40 minutes? Why is this odd behavior?
> > 
> > let $mdoc := doc('input.xml')/Body
> > let $sourModelYEAR := $mdoc/Title/ModelYear return <Elements>
> >     <Element name="ModelYear"> 
> >         <Distribution>
> >         {
> >           for $dvalue in fn:distinct-values($sourModelYEAR)
> >           let $eachcount := count($mdoc/Title[ModelYear=$dvalue])
> >           return  
> >               <distribution>
> >                 <value>{ $dvalue }</value>
> >                 <count>{ $eachcount }</count>
> >               </distribution>
> >         }
> >         </Distribution>
> >         <PatternDistribution>
> >         {
> >             for $phonenum in
> > distinct-values($sourModelYEAR/translate(.,
> > '0123456789','9999999999'))
> >             return
> >                 <pattern>
> >                     <type>{ $phonenum }</type>
> >                     <count>{count($sourModelYEAR[translate(.,
> > '0123456789', '9999999999') eq $phonenum])}</count>
> >                 </pattern>
> >         }
> >         </PatternDistribution>
> >     </Element>      
> > </Elements>
> > 
> > 
> > Thanks,
> > Srini
> > *****************************************************************
> > This message has originated from RLPTechnologies,
> > 26955 Northwestern Highway, Southfield, MI 48033.
> > 
> > RLPTechnologies sends various types of email communications.  
> > If this email message concerns the potential licensing of an RLPT 
> > product or service, and you do not wish to receive further emails 
> > regarding Polk products, forward this email to Do_Not_Send at rlpt.com 
> > with the word "remove" in the subject line.
> > 
> > The email and any files transmitted with it are confidential and 
> > intended solely for the individual or entity to whom they are 
> > addressed.
> > 
> > If you have received this email in error, please delete 
> this message 
> > and notify the Polk System Administrator at postmaster at rlpt.com.
> > *****************************************************************
> > 
> > 
> > _______________________________________________
> > talk at xquery.com
> > http://xquery.com/mailman/listinfo/talk
> 
> *****************************************************************
> This message has originated from RLPTechnologies,
> 26955 Northwestern Highway, Southfield, MI 48033.
> 
> RLPTechnologies sends various types of email
> communications.  If this email message concerns the
> potential licensing of an RLPT product or service, and
> you do not wish to receive further emails regarding Polk
> products, forward this email to Do_Not_Send at rlpt.com
> with the word "remove" in the subject line.
> 
> The email and any files transmitted with it are confidential
> and intended solely for the individual or entity to whom they
> are addressed.
> 
> If you have received this email in error, please delete this
> message and notify the Polk System Administrator at
> postmaster at rlpt.com.
> *****************************************************************
> 
> 



More information about the talk mailing list