[xquery-talk] Optimizing a large join (was: no subject)

Michael Kay mhk at mhk.me.uk
Mon Apr 17 13:56:08 PDT 2006


It's going to take a pretty clever optimizer to avoid this being O(n^2). The
only way I can see out of this (apart from using XSLT 2.0 with its built-in
grouping capability!) is to create a temporary document in which each node
has its full path as an extra attribute: the grouping and counting then
becomes an equijoin which is much more likely to be optimized using a
hash-join approach.
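
A minimal sketch of that idea, using only XQuery 1.0 constructs. The
element name "n", the attributes "path" and "parent", and the function
local:annotate are hypothetical names chosen for illustration; only
book_sample.xml comes from the original query. Whether the two equality
predicates are actually evaluated as hash joins depends on the processor,
so treat this as something to try rather than a guaranteed fix.

```xquery
(: Copy the element tree into a temporary structure in which every
   element carries its own full path and its parent's path. :)
declare function local:annotate($node as element(), $parentPath as xs:string)
  as element()
{
  let $path := concat($parentPath, "/", local-name($node))
  return
    <n path="{$path}" parent="{$parentPath}">
      { for $c in $node/* return local:annotate($c, $path) }
    </n>
};

let $tmp   := document { local:annotate(doc("book_sample.xml")/*, "") }
let $nodes := $tmp//n
(: Skip the document element, whose parent path is the empty string. :)
for $p in distinct-values($nodes[@parent ne ""]/@path)
(: Both lookups below are plain value comparisons on an attribute,
   i.e. equijoins, instead of recomputing a path string per node. :)
let $group  := $nodes[@path = $p]
let $parent := string($group[1]/@parent)
let $ratio  := round(count($group) div count($nodes[@path = $parent]) * 100)
return
  <STATISTICS>
    <PATH>{$p}</PATH>
    <RATIO>{$ratio}</RATIO>
  </STATISTICS>
```

Each path is computed once while the temporary tree is built, so the
recursive walk up the ancestors is no longer repeated for every node. The
per-value counts for text nodes are left out here; they could be handled
the same way by also copying the normalized text into the temporary tree.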

Of course, you have already learnt that the built-in query processor in
Stylus Studio is unlikely to be the best choice for this problem, so I'm not
sure why you are persisting with that.

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: talk-bounces at xquery.com [mailto:talk-bounces at xquery.com] On Behalf
> Of fatma helmy
> Sent: 17 April 2006 07:47
> To: talk at xquery.com
> Subject: [xquery-talk] (no subject)
> 
> I need your advice on optimizing this code. I am using Stylus Studio,
> Professional Edition.
> 
> I run this code to produce statistics on an XML file of size 4M. The
> program finished the job after 30 minutes. Now I supply it with an XML
> file of size 11M and it did not stop. The code is as follows:
> 
> declare function local:pathOfNode($node)
> {
>   if (empty($node/..)) then ""
>   else concat(local:pathOfNode($node/..), "/", local-name($node))
> };
> let $j := doc("book_sample.xml")
> let $paths := for $n in $j//* return local:pathOfNode($n)
> let $childpaths :=
>   (for $item in $paths
>    where count(tokenize(substring-after(string($item), "/"), "/")) > 1
>    return $item)
> for $p in distinct-values($childpaths)
> let $toks := tokenize(string($p), "/")
> let $papa := string-join(subsequence($toks, 1, count($toks) - 1), "/")
> let $var := substring-after(string($p), "/")
> let $leafs :=
>   $j//text()[normalize-space()]
>     [string-join(ancestor-or-self::element()/name(), '/') eq $var]
> return
>   <STATISTICS>
>     <PATH>{string($p)}</PATH>
>     <RATIO>{string(round(count($childpaths[. = $p]) div
>                          count($paths[. = $papa]) * 100))}</RATIO>
>     {for $val in distinct-values($leafs)
>      return <value-per-path
>               value='{normalize-space($val)}'
>               count='{count($leafs[. eq normalize-space($val)])}'/>}
>   </STATISTICS>
> 
> This code produces all paths and then calculates the ratio of each
> node's frequency relative to its parent's frequency.
> 
> The question now: is there any unnecessary code that delays the
> performance to that extent?
> 
> How can I enhance my code to produce only the paths of nodes whose ratio
> is greater than a certain value, so as to prune infrequent paths from the
> start and not go further into them?
> 
> 
> 


