[xquery-talk] Screen-scraping with XQuery

Michael Kay mhk at mhk.me.uk
Thu Mar 31 19:45:45 PST 2005


> 
> In that post I also rewrote the example with XSLT 2.0, which shows how
> close XQuery 1.0 and XSLT 2.0 are for a use case like this. In passing
> I admit having switched from Qexo to Saxon (sorry Per ;-).
> 

If you're screen scraping HTML, then some of the "extras" in XSLT 2.0 versus
XQuery actually make a big difference. A very common requirement is the
"positional grouping" problem: turning

<h1>heading</h1>
<p>first</p>
<p>second</p>
<h2>subheading</h2>
<p>third</p>
<h2>subheading</h2>
<p>fourth</p>

into something that's respectably hierarchical. XSLT 2.0 has a construct
specifically for this: <xsl:for-each-group group-starting-with="h2">. In
XSLT 1.0, and in XQuery, this is a tough challenge (especially if you choose
an XQuery product that doesn't offer the sibling axes).
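
A minimal sketch of positional grouping with xsl:for-each-group, under some
assumptions that are not part of the original example: the scraped HTML has
already been tidied into well-formed, no-namespace XML, and the flat elements
are children of a body element. The output names doc, section and subsection
are purely illustrative.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <!-- Assumption: the flat h1/h2/p sequence sits directly under <body> -->
  <xsl:template match="body">
    <doc>
      <!-- Outer pass: each h1 starts a new group -->
      <xsl:for-each-group select="*" group-starting-with="h1">
        <!-- Context item is the first item of the group;
             title is empty if the group does not start with an h1 -->
        <section title="{self::h1}">
          <!-- Inner pass: within the section, each h2 starts a new group.
               Items before the first h2 form a leading group of their own. -->
          <xsl:for-each-group select="current-group()[not(self::h1)]"
                              group-starting-with="h2">
            <xsl:choose>
              <xsl:when test="self::h2">
                <subsection title="{.}">
                  <xsl:copy-of select="current-group()[not(self::h2)]"/>
                </subsection>
              </xsl:when>
              <xsl:otherwise>
                <!-- Leading content that precedes any h2 -->
                <xsl:copy-of select="current-group()"/>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:for-each-group>
        </section>
      </xsl:for-each-group>
    </doc>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the fragment above, this should give one section holding the first
two paragraphs, followed by two subsections with one paragraph each.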

Also, XSLT 2.0's regular expression processing is more powerful, with the
<xsl:analyze-string> construct: this is another thing that's often needed in
screen-scraping applications, where the structure is indicated by textual
signals rather than explicit markup.
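
As a sketch of what xsl:analyze-string offers, here is a hypothetical case:
paragraphs whose text contains ISO-style dates (yyyy-mm-dd) that should become
markup. The date element and its attributes are invented for illustration.
Note that the regex attribute is an attribute value template, so the curly
braces of the quantifiers have to be doubled.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="p">
    <p>
      <!-- {{4}} is an escaped {4}: regex is an attribute value template -->
      <xsl:analyze-string select="string(.)"
                          regex="(\d{{4}})-(\d{{2}})-(\d{{2}})">
        <xsl:matching-substring>
          <!-- regex-group(n) returns the nth captured subexpression -->
          <date year="{regex-group(1)}"
                month="{regex-group(2)}"
                day="{regex-group(3)}"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- Text between matches is passed through unchanged -->
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </p>
  </xsl:template>

</xsl:stylesheet>
```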

Also, the approach of using template rules in XSLT means that it's much
easier to write logic that can cope with unpredictable variations in the
structures you find in the source documents. Screen-scraping is a classic
case where you don't have a precise specification of your possible range of
inputs.
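
A rough illustration of that style, again assuming tidied, no-namespace input
and invented output names (heading, para, emph): rules are written only for
the elements you care about, and a low-priority catch-all discards any
unexpected wrapper while still processing whatever is inside it.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Elements we know how to handle -->
  <xsl:template match="h1 | h2">
    <!-- substring(local-name(), 2) turns "h1" into "1", "h2" into "2" -->
    <heading level="{substring(local-name(), 2)}">
      <xsl:apply-templates/>
    </heading>
  </xsl:template>

  <xsl:template match="p">
    <para><xsl:apply-templates/></para>
  </xsl:template>

  <xsl:template match="b | strong">
    <emph><xsl:apply-templates/></emph>
  </xsl:template>

  <!-- Catch-all for anything unforeseen (font, span, div, ...):
       drop the tag but keep processing its content. This mirrors the
       built-in element rule and is spelled out here only for clarity.
       Text nodes are copied by the built-in text rule. -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```

Because match="*" has a lower default priority than a rule that names an
element, the specific rules win wherever they apply, whatever nesting the
source page happens to use.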

We did some interesting screen-scraping as part of the process of managing
the last-call issues on the specs: automatic extraction of the email archive
on public-qt-comments. This was actually a very simple process, and could be
done happily in either language.

Michael Kay
http://www.saxonica.com/



