[xquery-talk] get highest number
wolfgang at exist-db.org
Sat May 17 19:52:43 PDT 2008
>> Performance in a database is very dependent on how many pages you read from
>> disk. If there's an index that contains the node identifiers for all the
>> attributes named id, and that index occupies one page, then you can evaluate
>> //@id with one page access. But to evaluate max(//@id) you need to
>> dereference those node identifiers, which may mean accessing one page for
>> each separate id attribute.
> So the index just contains the node identifiers - the actual values
> for those nodes still need to be fetched from disk?
In our implementation, most path expressions operate on references, not
on the actual nodes. The references are just pairs of documentId/nodeId
and do not occupy much space in an index. As Michael pointed out, the
entire sequence of @id references can be stored in a few disk pages.
Also, all possible relationships between nodes can be computed by
comparing node ids. This means that expressions like /root//@id are
evaluated without loading the actual nodes. All subexpressions just
return references to nodes, not the nodes themselves.
To serialize or atomize a node, the node references need to be
dereferenced, which means loading the node from the sequence of pages in
which the document is stored. This may indeed result in one page access
operation for every attribute.
> If so, if you indexed //@id/text() and then did max(//@id/text())
> would that be a solution?
Creating a "value"-index on //@id/text() would at least have the
advantage that all required atomized values could be stored in
subsequent pages, thus causing less IO. However, in many cases the
expression within max() could be rather complex, making it hard to
decide on what exactly the index should be created. I was recently
thinking about a solution which just pre-fetches the atomized values on
demand, thus speeding up the second access, though not the first one.
Anyway, those are implementation-specific questions, which may not
belong on this list, at least not in detail.
More information about the talk