[xquery-talk] Count of Distinct elements performance problem

Michael Kay mhk at mhk.me.uk
Tue Aug 22 20:19:51 PDT 2006


A performance question like this can only be answered with respect to a
specific product.

There aren't many XQuery engines that will handle an 8Gb file, so you're
doing quite well.

You might find that a multi-pass approach works faster:

(a) extract the values

(b) sort them

(c) use a recursive scan to do positional grouping - depends on your product
supporting tail call optimization


That's likely to have O(n*log(n)) performance rather than O(n^2). 

Michael Kay
http://www.saxonica.com/



> -----Original Message-----
> From: talk-bounces at xquery.com 
> [mailto:talk-bounces at xquery.com] On Behalf Of Kusunam, Srinivas
> Sent: 22 August 2006 18:36
> To: talk at xquery.com
> Subject: [xquery-talk] Count of Distinct elements performance problem
> 
> I am trying to find count of distinct elements (Model Year). 
> Here is my XQuery. It takes 4 hrs to get the count from 8GB 
> file. There are around 3000 distinct Model years in this file. 
> 
> let $mdoc := doc('input.xml')/Body
> let $sourModelYEAR := $mdoc/Title/ModelYear return <Elements>
>     <Element name="ModelYear"> 
>         <Distribution>
>         {
>           for $dvalue in fn:distinct-values($sourModelYEAR)
>           let $eachcount := count($mdoc/Title[ModelYear=$dvalue])
>           return  
>               <distribution>
>                 <value>{ $dvalue }</value>
>                 <count>{ $eachcount }</count>
>               </distribution>
>         }
>         </Distribution>
>     </Element>      
>  </Elements>
> 
> This Query seems to loop through the document for each value i.e.
> overall 3000 times. I know this should be easily achievable 
> if we have Group-by in XQuery. Do any XQuery engine supports 
> (custom) Group-By now?
> Or is there any other way to make this query efficient?
> 
> Where as if I add one more element to find the pattern of the 
> data it finishes the job within 40 minutes? Why is this odd behavior?
> 
> let $mdoc := doc('input.xml')/Body
> let $sourModelYEAR := $mdoc/Title/ModelYear return <Elements>
>     <Element name="ModelYear"> 
>         <Distribution>
>         {
>           for $dvalue in fn:distinct-values($sourModelYEAR)
>           let $eachcount := count($mdoc/Title[ModelYear=$dvalue])
>           return  
>               <distribution>
>                 <value>{ $dvalue }</value>
>                 <count>{ $eachcount }</count>
>               </distribution>
>         }
>         </Distribution>
>         <PatternDistribution>
>         {
>             for $phonenum in 
> distinct-values($sourModelYEAR/translate(.,
> '0123456789','9999999999'))
>             return
>                 <pattern>
>                     <type>{ $phonenum }</type>
>                     <count>{count($sourModelYEAR[translate(.,
> '0123456789', '9999999999') eq $phonenum])}</count>
>                 </pattern>
>         }
>         </PatternDistribution>
>     </Element>      
> </Elements>
> 
> 
> Thanks,
> Srini
> *****************************************************************
> This message has originated from RLPTechnologies,
> 26955 Northwestern Highway, Southfield, MI 48033.
> 
> RLPTechnologies sends various types of email communications.  
> If this email message concerns the potential licensing of an 
> RLPT product or service, and you do not wish to receive 
> further emails regarding Polk products, forward this email to 
> Do_Not_Send at rlpt.com with the word "remove" in the subject line.
> 
> The email and any files transmitted with it are confidential 
> and intended solely for the individual or entity to whom they 
> are addressed.
> 
> If you have received this email in error, please delete this 
> message and notify the Polk System Administrator at 
> postmaster at rlpt.com.
> *****************************************************************
> 
> 
> _______________________________________________
> talk at xquery.com
> http://xquery.com/mailman/listinfo/talk



More information about the talk mailing list