[xquery-talk] Count of Distinct elements performance problem
Michael Kay
mhk at mhk.me.uk
Tue Aug 22 20:19:51 PDT 2006
A performance question like this can only be answered with respect to a
specific product.
There aren't many XQuery engines that will handle an 8Gb file, so you're
doing quite well.
You might find that a multi-pass approach works faster:
(a) extract the values
(b) sort them
(c) use a recursive scan to do positional grouping - depends on your product
supporting tail call optimization
That's likely to have O(n*log(n)) performance rather than O(n^2).
Michael Kay
http://www.saxonica.com/
> -----Original Message-----
> From: talk-bounces at xquery.com
> [mailto:talk-bounces at xquery.com] On Behalf Of Kusunam, Srinivas
> Sent: 22 August 2006 18:36
> To: talk at xquery.com
> Subject: [xquery-talk] Count of Distinct elements performance problem
>
> I am trying to find count of distinct elements (Model Year).
> Here is my XQuery. It takes 4 hrs to get the count from 8GB
> file. There are around 3000 distinct Model years in this file.
>
> let $mdoc := doc('input.xml')/Body
> let $sourModelYEAR := $mdoc/Title/ModelYear return <Elements>
> <Element name="ModelYear">
> <Distribution>
> {
> for $dvalue in fn:distinct-values($sourModelYEAR)
> let $eachcount := count($mdoc/Title[ModelYear=$dvalue])
> return
> <distribution>
> <value>{ $dvalue }</value>
> <count>{ $eachcount }</count>
> </distribution>
> }
> </Distribution>
> </Element>
> </Elements>
>
> This Query seems to loop through the document for each value i.e.
> overall 3000 times. I know this should be easily achievable
> if we have Group-by in XQuery. Do any XQuery engine supports
> (custom) Group-By now?
> Or is there any other way to make this query efficient?
>
> Where as if I add one more element to find the pattern of the
> data it finishes the job within 40 minutes? Why is this odd behavior?
>
> let $mdoc := doc('input.xml')/Body
> let $sourModelYEAR := $mdoc/Title/ModelYear return <Elements>
> <Element name="ModelYear">
> <Distribution>
> {
> for $dvalue in fn:distinct-values($sourModelYEAR)
> let $eachcount := count($mdoc/Title[ModelYear=$dvalue])
> return
> <distribution>
> <value>{ $dvalue }</value>
> <count>{ $eachcount }</count>
> </distribution>
> }
> </Distribution>
> <PatternDistribution>
> {
> for $phonenum in
> distinct-values($sourModelYEAR/translate(.,
> '0123456789','9999999999'))
> return
> <pattern>
> <type>{ $phonenum }</type>
> <count>{count($sourModelYEAR[translate(.,
> '0123456789', '9999999999') eq $phonenum])}</count>
> </pattern>
> }
> </PatternDistribution>
> </Element>
> </Elements>
>
>
> Thanks,
> Srini
> *****************************************************************
> This message has originated from RLPTechnologies,
> 26955 Northwestern Highway, Southfield, MI 48033.
>
> RLPTechnologies sends various types of email communications.
> If this email message concerns the potential licensing of an
> RLPT product or service, and you do not wish to receive
> further emails regarding Polk products, forward this email to
> Do_Not_Send at rlpt.com with the word "remove" in the subject line.
>
> The email and any files transmitted with it are confidential
> and intended solely for the individual or entity to whom they
> are addressed.
>
> If you have received this email in error, please delete this
> message and notify the Polk System Administrator at
> postmaster at rlpt.com.
> *****************************************************************
>
>
> _______________________________________________
> talk at xquery.com
> http://xquery.com/mailman/listinfo/talk
More information about the talk
mailing list