[xquery-talk] Partitioning data with XQuery

Mon Apr 25 22:49:37 PDT 2005

Brilliant! (assuming that it works; I haven't tried it yet. ;)

It's a machine translation. I can't imagine any rational programmer who
would purposefully code this way. It's Academy Award data that started life
as HTML, got converted into XHTML, and is now being transformed via a series
of XQuery transformations into much more generic XML.

The data for each of the 1,858 Academy Awards at this site (i.e. who won and
who lost for each award) appears inside a single <TD> element in an HTML
table. However, some 70-odd Academy Awards over the years came in two
variations. For example, Best Director in the early years was divided into
Best Director/Comedy and Best Director/Drama. Likewise, Best Costume Design
awards between 1948 and 1960 were divided into Costume
Design/Black-and-White and Costume Design/Color. In such cases, data for
both sub-awards still appears inside a single <TD> at this site, with the
start of each sub-award being marked by a <FONT> element that provides the
heading for the sub-award. The <FONT> tag corresponds to what I've been
calling more generically <subRecordStart/>.
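
As a rough sketch of how such a cell could be split at its <FONT> markers,
the query below uses the "<<" and ">>" node-order operators suggested in the
quoted reply. The inline <td>, its lower-cased element names, and the
<subAward> wrapper are illustrative assumptions, not the actual site data or
the final output format. It is written to cope with any number of markers,
not just two.

```xquery
(: Hypothetical, abbreviated XHTML cell; the real cells carry more entries
   and attributes. :)
let $td :=
  <td>
    <font>(Black-and-White)</font><br/>
    MORGAN! - Jocelyn Rickards<br/>
    <b><u>WHO'S AFRAID OF VIRGINIA WOOLF? - Irene Sharaff</u></b><br/>
    <font>(Color)</font><br/>
    GAMBIT - Jean Louis<br/>
    <b><u>A MAN FOR ALL SEASONS - Joan Bridge, Elizabeth Haffenden</u></b><br/>
  </td>
(: One group per <font> marker. :)
for $start in $td/font
(: The next marker, or the empty sequence for the last group. :)
let $next := $start/following-sibling::font[1]
return
  <subAward heading="{ string($start) }">
  {
    (: Children of the cell that follow this marker in document order
       and precede the next marker, if there is one. :)
    $td/node()[. >> $start and (empty($next) or . << $next)]
  }
  </subAward>
```

The explicit empty($next) test matters: comparing a node against an empty
sequence with "<<" yields an empty result, which the predicate would treat
as false and so drop the whole last group.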

For example, here's the original HTML for one award (simplified slightly)
that corresponds to what I've been calling more generically just <record>:

<TD VALIGN=BOTTOM>
<FONT COLOR="FFF3A3" SIZE=4>(Black-and-White)</FONT><BR>
	THE GOSPEL ACCORDING TO ST. MATTHEW - Danilo Donati<BR>
	MANDRAGOLA - Danilo Donati<BR>
	MISTER BUDDWING - Helen Rose<BR>

	MORGAN! - Jocelyn Rickards<BR>
	<B><U>WHO'S AFRAID OF VIRGINIA WOOLF? - Irene Sharaff</U></B><BR>
<FONT COLOR="FFF3A3" SIZE=4>(Color)</FONT><BR>
	GAMBIT - Jean Louis<BR>
	HAWAII - Dorothy Jeakins<BR>
	JULIET OF THE SPIRITS - Piero Gherardi<BR>

	<B><U>A MAN FOR ALL SEASONS - Joan Bridge, Elizabeth Haffenden</U></B><BR>
	THE OSCAR - Edith Head<BR>
</TD>

and here's what the same data is going to look like once I'm done:

<award year="1966">
     <costumeDesignBlackAndWhite>
          <lost>
               <person>Danilo Donati</person>
               <picture>The Gospel According to St. Matthew</picture>
          </lost>
          <lost>
               <person>Danilo Donati</person>
               <picture>Mandragola</picture>
          </lost>
          <lost>
               <person>Helen Rose</person>
               <picture>Mister Buddwing</picture>
          </lost>
          //  ...
          <won>
               <person>Irene Sharaff</person>
               <picture>Who's Afraid of Virginia Woolf</picture>
          </won>
          //  ...
     </costumeDesignBlackAndWhite>
</award>
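
A minimal sketch of one way the individual lines might be turned into
<won>/<lost> entries. It assumes that each entry has the form
"PICTURE - Person", that only winners are wrapped in <b>, and that element
names are lower-cased in the XHTML. The local:entry function is hypothetical;
converting titles to mixed case and splitting multiple names into separate
<person> elements are left out.

```xquery
(: Hypothetical helper: builds one <won> or <lost> element from a line of
   the form "PICTURE - Person". A title that itself contains " - " would be
   split at its first occurrence, so such lines would need special care. :)
declare function local:entry($line as xs:string, $won as xs:boolean)
  as element()
{
  let $picture := normalize-space(substring-before($line, " - "))
  let $person  := normalize-space(substring-after($line, " - "))
  return
    element { if ($won) then "won" else "lost" } {
      <person>{ $person }</person>,
      <picture>{ $picture }</picture>
    }
};

(: Hypothetical, abbreviated cell contents for a single sub-award. :)
let $td :=
  <td>
    MORGAN! - Jocelyn Rickards<br/>
    <b><u>WHO'S AFRAID OF VIRGINIA WOOLF? - Irene Sharaff</u></b><br/>
  </td>
(: Losers are bare text nodes, the winner sits inside <b>. :)
for $n in $td/(text() | b)
(: Skip whitespace-only text nodes between entries. :)
where contains($n, " - ")
return local:entry(string($n), $n instance of element(b))
```

With the sample cell above, this should produce one <lost> entry for
Jocelyn Rickards followed by one <won> entry for Irene Sharaff, with the
picture titles still in upper case.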

Howard

 > -----Original Message-----
 > From: talk-bounces at xquery.com [mailto:talk-bounces at xquery.com]On Behalf
 > Of Brian Maso
 > Sent: Monday, April 25, 2005 5:18 PM
 > To: howardk at fatdog.com; xquery-talk
 > Subject: Re: [xquery-talk] Partitioning data with XQuery
 >
 >
 > I haven't tested this exact code, but the "<<" and ">>" operators, which
 > compare the relative positions of nodes in document order, can be used to
 > get the job done.
 >
 > for $record in $records
 > let $firstSubRecord := $record/subRecordStart[1],
 >      $secondSubRecord := $record/subRecordStart[2]
 > return
 >    (: parenthesized sequence, so that both new records are returned :)
 >    (
 >    <record>
 >    {
 >      $record/(* | text())[. >> $firstSubRecord and . << $secondSubRecord]
 >    }
 >    </record>,
 >    <record>
 >    {
 >      $record/(* | text())[. >> $secondSubRecord]
 >    }
 >    </record>
 >    )
 >
 > Is this a translation of a traditional flat file into XML or something?
 > Using empty elements to delimit sections is a little strange, and that's
 > the only kind of source I can imagine this kind of thing coming from.
 >
 > Brian Maso
 >
 > At 04:29 PM 4/25/2005, Howard Katz wrote:
 >
 > >I need to repartition some XML data using XQuery, and I can't see how to
 > >do it. The basic data looks something like this:
 > >
 > ><record>
 > >      <subRecordStart/>          (: marks start of new sub-record :)
 > >      some pcdata_1
 > >      <someElement_1/>
 > >      some more pcdata_1
 > >      <anotherElement_1/>
 > >      ... etc
 > >
 > >      <subRecordStart/>          (: marks start of new sub-record :)
 > >      some more pc data_2
 > >      <yetAnotherElement_2/>
 > >      yet some more pc data_2
 > >      <andYetAnotherElement_2/>
 > >      ... etc
 > ></record>
 > >...
 > >
 > >The contents of each <record> consist of exactly two <subRecordStart/>
 > >elements, plus some undetermined mixture of elements and text nodes. Each
 > ><record> needs to be replaced by two new <record> elements formed by
 > >partitioning its contents into two parts. The place where each new record
 > >is to begin is indicated by a <subRecordStart/> marker, with the first
 > ><subRecordStart/> marker being the first element child of <record>. Other
 > >than that and the fact that there are exactly two markers per record, the
 > >rest of the contents are not known in advance.
 > >
 > >On application of the appropriate XQuery, the single record above would be
 > >replaced by the following two:
 > >
 > ><record>
 > >      some pcdata_1
 > >      <someElement_1/>
 > >      some more pcdata_1
 > >      <anotherElement_1/>
 > ></record>
 > ><record>
 > >      some more pc data_2
 > >      <yetAnotherElement_2/>
 > >      yet some more pc data_2
 > >      <andYetAnotherElement_2/>
 > ></record>
 > >
 > >This doesn't look difficult, but a solution eludes me. Can somebody suggest
 > >an XQuery that would be able to do this?
 > >TIA,
 > >Howard
 > >
 > >
 > >_______________________________________________
 > >talk at xquery.com
 > >http://xquery.com/mailman/listinfo/talk