[xquery-talk] run time error

Michael Kay mhk at mhk.me.uk
Tue May 9 10:40:37 PDT 2006


First try increasing the memory available to the JVM. I generally use

java -Xms512M -Xmx512M

However, you may also need to find a way of writing your query in a way that
is less greedy in its use of memory. I've changed the layout of your code
below to make it legible, and added comments prefixed MHK>


declare function local:pathOfNode($node) {
     if (empty($node/..)) 
     then "" 
     else concat(local:pathOfNode($node/..), "/", local-name($node))
};

let $j:= doc("test.XML")
let $paths := for $n in $j//* return local:pathOfNode($n)
let $childpaths:= (for $item in $paths 
                   where count(tokenize(substring-after(string($item),"/"),"/")) > 1
                   return $item)

MHK> Both $paths and $childpaths are referenced more than once, so their
values will be stored in memory. If your 11Mb document is reasonably
structure-rich, it might well contain 200K elements, each having an
expanded path of say 100 characters, which is 200 bytes at two bytes per
character. So each of these two variables is going to occupy about 40Mb
(200K x 200 bytes). Further on, $leafs will be held in memory in the same
way.

MHK>I'd suggest you start by only taking the distinct paths:

MHK>let $paths := distinct-values(for $n in $j//* return local:pathOfNode($n))

MHK>which means you will never need to hold the full set of paths in memory.


for $p in distinct-values($childpaths)
let $toks:= tokenize(string($p),"/")

MHK>It seems very wasteful to carefully build up the concatenated path, and
then split it up again by tokenizing. But I'm afraid, given the absence of
comments and the unhelpful variable names, I've lost the thread of what
you're trying to achieve here.
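
One way to avoid the round trip, sketched under assumptions: the path of an
element's parent is just local:pathOfNode applied to the parent node, so the
value held in $papa below can be obtained without tokenizing anything. The
<pair> element and its attribute names are hypothetical, used only to show
the two values side by side; local:pathOfNode and test.XML come from the
query itself.

```xquery
declare function local:pathOfNode($node) {
  if (empty($node/..))
  then ""
  else concat(local:pathOfNode($node/..), "/", local-name($node))
};

(: Elements that have an element parent: the same set the
   count(tokenize(...)) > 1 filter keeps in $childpaths. :)
for $n in doc("test.XML")//*[parent::*]
return
  <pair path="{local:pathOfNode($n)}"
        parent="{local:pathOfNode($n/..)}"/>
```

For an element at /a/b this yields path="/a/b" and parent="/a", the same
strings that the tokenize and string-join steps produce.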

let $papa:= string-join(subsequence($toks, 1, count($toks) - 1), "/")
let $var:=substring-after(string($p),"/")
let $leafs:=$j//text()[normalize-space()][string-join(ancestor-or-self::element()/name(),'/') eq $var]

MHK>Any particular reason you used a recursive function to form the path for
elements, but are using string-join to form the paths for text nodes?
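
For illustration, the same helper could serve both cases: the path of a text
node's parent element is local:pathOfNode(..) evaluated with the text node
as the context item. This is a sketch, not a drop-in replacement. It
compares against $p rather than $var because the helper's result keeps the
leading "/", and it uses local-name() where the original predicate uses
name(), so the two can differ for prefixed element names. The attribute
names on <PATH> are hypothetical.

```xquery
declare function local:pathOfNode($node) {
  if (empty($node/..))
  then ""
  else concat(local:pathOfNode($node/..), "/", local-name($node))
};

let $j := doc("test.XML")
for $p in distinct-values($j//*[parent::*]/local:pathOfNode(.))
(: Non-blank text nodes whose parent element has path $p. :)
let $leafs := $j//text()[normalize-space()][local:pathOfNode(..) eq $p]
return <PATH name="{$p}" leaf-count="{count($leafs)}"/>
```

This only makes the two path computations consistent; it does nothing by
itself to reduce the cost of scanning every text node once per path.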

MHK>I suspect that the expression $j//text()[normalize-space()] is going to
be pulled out of the "for $p" loop, so it only needs to be evaluated once:
but that's another great chunk of memory gone. If you're using Saxon-SA then
it's also likely to be indexed to avoid an O(n^2) join.

return
  <STATISTICS>
    <PATH>
      {string($p)}
    </PATH>
    <RATIO>
      {string( round( count($childpaths[.=$p]) div
                       count($paths[.=$papa]) * 100 ) )}
    </RATIO>
      {for $val in distinct-values($leafs)
       return <value-per-path 
                      value='{normalize-space($val)}'
                      count='{count($leafs[. eq normalize-space($val)])}'/>}
  </STATISTICS>

MHK> I've been trying to find suggestions for improving this code but I have
difficulty seeing exactly what it's doing - it seems to be collecting some
basic statistics on the structures present in the document, but it's doing
so in a pretty heavy-handed way. 

To be quite honest, I'd suggest writing this in XSLT. It's a grouping
problem, and XSLT 2.0 has built-in grouping operators which XQuery 1.0
lacks. This is likely to give you a far more efficient solution, both in
space and time usage. For starters, if you do

<xsl:for-each-group select="$j//*" group-by="local:pathOfNode(.)">

then this gives you a group which is the set of nodes having the same path -
so the groups are sets of nodes, not sets of paths.

If you don't want to switch languages, you could consider using the
higher-order saxon:for-each-group() extension function which gives you the
same functionality in XQuery.
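
If neither option is available, the idea that a group is a set of nodes can
still be sketched in plain XQuery 1.0 using the distinct-values idiom. This
is only an illustration: it recomputes local:pathOfNode for every element
once per distinct path, so it should not be expected to match the
efficiency of a built-in grouping operator. The COUNT element is
hypothetical.

```xquery
declare function local:pathOfNode($node) {
  if (empty($node/..))
  then ""
  else concat(local:pathOfNode($node/..), "/", local-name($node))
};

let $elements := doc("test.XML")//*
for $path in distinct-values($elements/local:pathOfNode(.))
(: The group is the set of element nodes sharing this path,
   not a set of path strings. :)
let $group := $elements[local:pathOfNode(.) eq $path]
return
  <STATISTICS>
    <PATH>{$path}</PATH>
    <COUNT>{count($group)}</COUNT>
  </STATISTICS>
```

Because $group holds element nodes, further statistics (for example the
non-blank text values under $group/text()) can be read from the nodes
directly instead of being matched by path string a second time.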

Michael Kay
http://www.saxonica.com/



