[xquery-talk] get highest number

Wolfgang wolfgang at exist-db.org
Sat May 17 19:52:43 PDT 2008


>> Performance in a database is very dependent on how many pages you read from
>> disk. If there's an index that contains the node identifiers for all the
>> attributes named id, and that index occupies one page, then you can evaluate
>> //@id with one page access. But to evaluate max(//@id) you need to
>> dereference those node identifiers, which may mean accessing one page for
>> each separate id attribute.
> 
> So the index just contains the node identifiers - the actual values
> for those nodes still need to be fetched from disk?

In our implementation, most path expressions operate on references, not 
on the actual nodes. The references are just pairs of documentId/nodeId 
and do not occupy much space in an index. As Michael pointed out, the 
entire sequence of @id references can be stored in a few disk pages. 
Also, all possible relationships between nodes can be computed by 
comparing node ids. This means that expressions like /root//@id are 
evaluated without loading the actual nodes. All subexpressions just 
return references to nodes, not the nodes themselves.

To serialize or atomize a node, the node references need to be 
dereferenced, which means loading the node from the sequence of pages in 
which the document is stored. This may indeed result in one page access 
operation for every attribute.

 > If so, if you indexed //@id/text() and then did max(//@id/text())
 > would that be a solution?

Creating a "value"-index on //@id/text() would at least have the 
advantage that all required atomized values could be stored in 
subsequent pages, thus causing less IO. However, in many cases the 
expression within max() could be rather complex, making it hard to 
decide on what exactly the index should be created. I was recently 
thinking about a solution which just pre-fetches the atomized values on 
demand, thus speeding up the second access, though not the first one.

Anyway, those are implementation-specific questions, which may not 
belong on this list, at least not in detail.

Wolfgang


More information about the talk mailing list