[xquery-talk] Count a specific word in a document

Michael Strasser M.Strasser at gpo.com
Thu Jun 14 08:22:31 PDT 2007


Michael


Thanks for your thorough response and the warning about text(). I spared 
everyone the source document because it is not very good (and obviously 
it is long). I started with 
http://www.simonandkevin.com/ElijahLibretto.htm and converted its source 
from MS-Worded HTML to XHTML using a text editor. Its markup is visual, 
not structural. (My next XQuery project might be to convert its markup 
to a structural one.)

An excerpt is:

  <td>
    <p>
    <i>Elijah</i>
    <br/>
    Draw near, all ye people, come to me . . .
    </p>
    <p>
    Lord God of Abraham, Isaac and Israel, this day let
    it be known that Thou art God, and that I am Thy
    servant! Lord God of Abraham! Oh show to all this
    people that I have done these things according to
    Thy word.
    Oh hear me, Lord, and answer me!
    Lord God of Abraham, Isaac and Israel, oh hear me
    and answer me, and show this people that Thou art
    Lord God. And let their hearts again be turned!
    </p>
  </td>

So you see that $elijah//td/p[i = 'Elijah'] will not capture everything 
Elijah sings, because only the first paragraph of a passage carries the 
<i> label. In fact, I ended up using this to capture all paragraphs of 
his sung text:

  let $td := doc("/db/mjs/ElijahLibretto.xhtml")/html//td[p/i = 'Elijah']
  let $elijah-para := $td/p[i = 'Elijah' or i = 'Both' or count(i) = 0]

(<i>'Both'</i> marks his lines of duet with the Widow.)

Thanks also for fixing up my use of tokenize(). I don't like using 
something I don't understand (especially in a public forum). The results 
were different using "\W+": I got 37 occurrences of 'Lord' instead of 36 
(Jonathan Robie's regexp didn't tokenise 'Lord?' correctly).
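
For anyone following along, here is a minimal sketch of the counting step 
as I understand it. The inline <td> is a shortened, made-up stand-in for 
the real document, and $words is just a name chosen for illustration; the 
actual query in the earlier messages may differ.

```xquery
xquery version "1.0";

(: Hypothetical sample standing in for the real libretto document. :)
let $td :=
  <td>
    <p><i>Elijah</i><br/>Draw near, all ye people, come to me . . .</p>
    <p>Lord God of Abraham! Oh hear me, Lord, and answer me!</p>
  </td>

(: Labelled paragraphs plus unlabelled continuation paragraphs. :)
let $elijah-para := $td/p[i = 'Elijah' or i = 'Both' or count(i) = 0]

(: Split on runs of non-word characters, so 'Lord,' 'Lord!' and 'Lord?'
   all yield the token 'Lord'. The comparison below is case-sensitive. :)
let $words :=
  for $p in $elijah-para
  return tokenize(string($p), "\W+")

return count($words[. = 'Lord'])
```

On this sample the query should return 2. Note that string($p) also picks 
up the speaker label inside <i>, which does not matter when counting 
'Lord' but would if the word being counted were 'Elijah'.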

Is there a web repository of XQuery questions and answers like Dave 
Pawson's very useful Q&A for XSLT?


Michael Strasser

(P.S. Why did I choose this strange exercise? Last year I sang the part 
of Elijah and wondered how often he uttered the word 'Lord'. Merely 
going through the score and counting is not geeky enough!)


