[xquery-talk] XQuery function to do elementX/not-elementX chunking?

Tue Sep 13 23:41:47 PDT 2005

There's a fair bit of literature on this in an XSLT 1.0 context - the
standard method relies on a recursive use of xsl:apply-templates scanning
over the siblings. Look for "XSLT positional grouping". 

As a result of this experience, there are standard facilities for doing
positional grouping in XSLT 2.0 - for reference, an XSLT 2.0 solution is:

<xsl:for-each-group select="node()"
       group-starting-with="ref | node()[preceding-sibling::node()[1][self::ref]]">
  <span><xsl:copy-of select="current-group()"/></span>
</xsl:for-each-group>

That is, it splits the sequence of child nodes into groups, starting a new
group at any ref element or at any node whose immediately preceding sibling
is a ref element, and then wraps each group in a span element.
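
For readers who want to see the grouping rule in action without an XSLT 2.0 processor, here is a rough Python sketch of the same rule using the standard library's xml.dom.minidom. The helper names (is_ref, group_starting_with) are made up for illustration, and the sample input is the element from the original question; treat it as a sketch of the rule, not as what an XSLT processor does internally.

```python
from xml.dom.minidom import parseString

def is_ref(node):
    # True only for element nodes named "ref" (text nodes have no tagName).
    return node.nodeType == node.ELEMENT_NODE and node.tagName == "ref"

def group_starting_with(nodes):
    """Start a new group at each ref and at each node that directly follows a ref."""
    groups = []
    prev = None
    for node in nodes:
        if not groups or is_ref(node) or (prev is not None and is_ref(prev)):
            groups.append([])
        groups[-1].append(node)
        prev = node
    return groups

doc = parseString(
    "<ref>This text has <i>italics</i> and <ref>an embedded ref</ref> and "
    "more text including <b>boldface</b> and <ref>another ref</ref> "
    "and a bit more text.</ref>"
)
groups = group_starting_with(doc.documentElement.childNodes)
# Serialise each group so the chunks are easy to inspect.
chunks = ["".join(n.toxml() for n in g) for g in groups]
```

Each entry of chunks corresponds to one span in the XSLT output: runs of non-ref nodes alternate with single ref elements.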

If you really have to do this in XQuery, make sure your product offers the
following-sibling axis. It's optional in the spec, but I think most vendors
have woken up to the fact that it's essential when handling documents rather
than data.

Then you basically need to recurse over the siblings with something like
this:

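(: The f prefix must be bound in the prolog; this URI is only a placeholder. :)
declare namespace f = "http://example.org/f";

declare function f:makeSpan($this as node()?) as element(span)* {
  if ($this) then
    if ($this/self::ref) then
      (: a ref gets a span of its own; carry on with the next sibling :)
      (<span>{$this}</span>,
       f:makeSpan($this/following-sibling::node()[1]))
    else
      (: this node plus its following siblings, up to but excluding the next ref :)
      (<span>{$this/(.|following-sibling::node())
                 except
               $this/following-sibling::ref[1]/(.|following-sibling::node())
             }</span>,
       f:makeSpan($this/following-sibling::ref[1]))
  else ()
};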

Not tested, and probably has bugs! It's probably fairly similar to your own
"ugly" solution.

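Since the XQuery above is untested, one way to sanity-check the recursion is to mirror its control flow in another language. The following Python sketch does that with xml.dom.minidom, using nextSibling in place of the following-sibling axis. The function names and the sample input are invented for illustration, and the sketch returns lists of nodes rather than building span elements.

```python
from xml.dom.minidom import parseString

def is_ref(node):
    # True only for element nodes named "ref".
    return node.nodeType == node.ELEMENT_NODE and node.tagName == "ref"

def make_spans(node):
    """Return a list of node lists, one per span, walking the sibling chain."""
    if node is None:
        return []
    if is_ref(node):
        # A ref forms a group of its own; recurse on the next sibling.
        return [[node]] + make_spans(node.nextSibling)
    # Otherwise collect this node and its following siblings up to the next ref.
    run = []
    while node is not None and not is_ref(node):
        run.append(node)
        node = node.nextSibling
    return [run] + make_spans(node)

doc = parseString("<p>a<ref>b</ref><ref>c</ref>d<i>e</i></p>")
spans = make_spans(doc.documentElement.firstChild)
texts = ["".join(n.toxml() for n in s) for s in spans]
```

Note that two adjacent ref elements end up in separate groups, which matches the recursive function: each ref is wrapped on its own.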
Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: talk-bounces at xquery.com 
> [mailto:talk-bounces at xquery.com] On Behalf Of David Sewell
> Sent: 13 September 2005 21:01
> To: talk at xquery.com
> Subject: [xquery-talk] XQuery function to do elementX/not-elementX chunking?
> 
> Given an XML element like this:
> 
>   <ref>This text has <i>italics</i> and <ref>an embedded ref</ref> and
>     more text including <b>boldface</b> and <ref>another ref</ref>
>     and a bit more text.</ref>
> 
> I want to break this into a sequence of sequences of its node children
> like so (with text nodes represented as strings, ignoring linebreaks):
> 
> (
>   ( 'This text has ', <i>italics</i>, ' and ' ) ,
>   <ref>an embedded ref</ref>,
>   ( ' and more text including ', <b>boldface</b>, ' and ' ),
>   <ref>another ref</ref>,
>   ' and a bit more text.'
> )
> 
> In other words, pull out alternating sequences of (1) <ref> elements and
> (2) other nodes that are not <ref> elements. (The practical application is
> so that the <ref>s can be transformed into HTML <span>s without
> permitting embedded <span>s -- they are HTML-legal but cause certain
> problems.)
> 
> I was able to do this by writing a 10-line function that relies on a
> fairly clunky process of selecting all the <ref> children and then
> chunking the other nodes that precede and/or follow them; it relies on
> some fairly ugly use of preceding-sibling(), following-sibling(),
> name(), and the '>>' operator. It's so ugly that I don't want to inflict
> it on this list (unless someone insists).
> 
> Does anyone have a simple, elegant way to do this?
> 
> -- 
> David Sewell, Editorial and Technical Manager
> Electronic Imprint, The University of Virginia Press
> PO Box 400318, Charlottesville, VA 22904-4318 USA
> Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
> Email: dsewell at virginia.edu   Tel: +1 434 924 9973
> Web: http://www.ei.virginia.edu/