[xquery-talk] Release of the GCX XQuery Engine
mike at saxonica.com
Mon Feb 5 09:07:54 PST 2007
> As I see it, there are two kinds of "streaming" implementations:
> - pull-based: expressions are evaluated at-need (lazily) when
> and if their results are needed; or
> - push-based: expressions are evaluated eagerly, but
> sub-parts of the results are "pushed" to a "consumer" as they
> are generated, while avoiding creating of complete "reified"
> sequences, if possible.
> The "data-base-oriented" implementations seem to be
> pull-based, while more "document-oriented" ones may be
> push-based. At least Qexo falls into the latter category,
> and my impression is Saxon does too.
> My impression from your web-page is that GCX is "push-based",
> so it should be comparable to Qexo and Saxon.
Saxon in fact uses a mixture of pull and push, with some user control over
the choice. By default, though, especially when results are being
serialized, push seems to work better at present. This is because using push
seems to make it easier to avoid constructing the result tree in memory.
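
To make the pull/push distinction concrete, here is a minimal sketch in Python (not XQuery, and not how Saxon or Qexo is actually written). The function names and the `trace` list are hypothetical, introduced only for illustration. In the pull version the consumer drives evaluation and items it never asks for are never computed; in the push version the producer drives evaluation and hands each item to a consumer callback without first building the whole sequence.

```python
from itertools import islice

def pull_items(source, trace):
    """Pull style: a generator computes each item only when the consumer asks."""
    for item in source:
        trace.append(item)          # record that this item was evaluated
        yield item.upper()

def push_items(source, consumer, trace):
    """Push style: the producer evaluates eagerly and hands each item to a
    consumer callback as soon as it is ready, never building a full list."""
    for item in source:
        trace.append(item)
        consumer(item.upper())

words = ["a", "b", "c", "d"]

# Pull: the consumer takes only two items, so only two are evaluated.
pull_trace = []
first_two = list(islice(pull_items(words, pull_trace), 2))

# Push: every item is evaluated and delivered to the consumer in turn.
push_trace = []
received = []
push_items(words, received.append, push_trace)
```

When the consumer is a serializer, the push version can write each item out as it arrives, so no result sequence has to be held in memory.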
However, it's not clear to me that the distinction here is really important.
The critical issue I think is whether you can process queries without
building a tree representation of the *source* document in memory. Saxon
currently does that in two limited cases: for the subset of XPath that's
used in XML Schema integrity constraints, and for the "serial processing
mode" in XSLT (which is applicable only to stylesheets that follow a very
stereotyped coding pattern). There's certainly an opportunity to achieve
this kind of streaming over a much wider range of queries.
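
As a rough illustration of processing a query without keeping the whole source tree, the sketch below uses `iterparse` from Python's standard `xml.etree.ElementTree` module to evaluate something like `sum(//item/price)` one `item` at a time. The element names and the sample document are assumptions made up for the example. It only shows the general idea and is not a description of Saxon's serial processing mode.

```python
import io
import xml.etree.ElementTree as ET

def sum_prices_streaming(source):
    """Sum the <price> of every <item>, discarding each item once handled."""
    total = 0.0
    # An "end" event fires when an element's closing tag has been parsed,
    # so the item's children are complete at that point.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            total += float(elem.findtext("price"))
            # Drop the item's children and text; only an empty shell
            # stays attached to its parent.
            elem.clear()
    return total

# Hypothetical sample input.
sample = b"<items><item><price>1.5</price></item><item><price>2.5</price></item></items>"
total = sum_prices_streaming(io.BytesIO(sample))
```

This works only because the query touches each `item` once, in document order. A query that needs to look back at earlier nodes, or compare nodes by identity, would not fit this pattern without further analysis.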
One of the obstacles in practice, which I haven't seen addressed in any of
the academic research, is the requirement for stability. This means that if
a query reads the same document more than once, it needs to get the same
result (identical nodes) each time. In turn this means that if a query does
doc($x) and doc($y), you can't safely avoid building the tree unless you can
prove that $x and $y will be different URIs (or, perhaps, that the identity
of the nodes makes no difference to the outcome). This is the kind of corner
case that makes optimization in practice much harder than it is in academic
theory: you're not allowed to do an optimization that benefits 99.99% of
queries if it causes incorrect results for the other 0.01%. Subsetting the
language in ways that don't affect the conclusions is legitimate; but
ignoring parts of the language specification that have a significant bearing
on the issue is not.
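
The stability requirement can be sketched as follows, again in Python and purely for illustration. `DocumentPool` and the `loader` callback are hypothetical names, not part of any real engine's API. The pool parses each URI once and returns the same tree on every later request, which is what makes node identity stable but also what forces the tree to be kept. Parsing the document afresh on each call would give equal-looking but distinct nodes.

```python
import xml.etree.ElementTree as ET

class DocumentPool:
    """Hypothetical per-query cache: one parsed tree per URI."""

    def __init__(self, loader):
        self._loader = loader       # assumed callback: URI -> XML text
        self._docs = {}

    def doc(self, uri):
        # Parse on first request only; later requests return the same tree,
        # so the nodes are identical, not merely equal in content.
        if uri not in self._docs:
            self._docs[uri] = ET.fromstring(self._loader(uri))
        return self._docs[uri]

def fake_loader(uri):
    # Stand-in for fetching the document; always returns the same text.
    return "<a><b/></a>"

pool = DocumentPool(fake_loader)
x = "http://example.com/data.xml"
y = "http://example.com/data.xml"   # happens to be the same URI as x

stable = pool.doc(x) is pool.doc(y)                  # same tree both times
unstable = ET.fromstring(fake_loader(x)) is ET.fromstring(fake_loader(y))
```

An engine that streams both `doc($x)` and `doc($y)` independently behaves like the second case, which is only safe if it can show the URIs differ or that node identity cannot affect the result.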