[xquery-talk] Count a specific word in a document

Michael Kay mike at saxonica.com
Wed Jun 13 12:18:53 PDT 2007


>   for $elijah in doc("/db/mjs/ElijahLibretto.xhtml")/html
>   let $elijah-para := $elijah//td/p[i/text() = 'Elijah']
>   let $txt := string-join($elijah-para/text(), " ")

You haven't shown your source document, but the above seems surprising to
me. If the paragraph in question has

<p><i>Elijah</i> Rise then ye priests of Baal, select and slay a bullock,
and I then will call on the Lord Jehovah</p>

then this will work. But in general, when a paragraph has mixed content,
using /text() is dangerous, because it drops any content that sits inside
nested markup. For example, it would fail with:

<p><i>Obadiah</i> <quote>If with all your hearts you truly seek me, ye shall
ever surely find me.</quote> Thus saith our God.</p>

I would normally expect to see

let $txt := string($elijah-para)

(except that this will probably be done implicitly anyway).

In fact the use of /text() is very common in XQuery circles, and in my view
it's usually wrong. You nearly always want the string value of the element
rather than its text node children: /string() rather than /text(), except,
as I say, that the conversion to a string is usually implicit.
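
To make the difference concrete, here is a minimal sketch. The paragraph is a
hypothetical one, shortened from the Obadiah example; it is not taken from the
actual source document.

```xquery
(: Hypothetical mixed-content paragraph, shortened from the Obadiah example :)
let $p := <p><i>Obadiah</i>: <quote>seek me</quote> and ye shall find me.</p>
return (
  (: text node children of p only: the words inside i and quote are lost :)
  string-join($p/text(), " "),
  (: string value of p: all descendant text, in document order :)
  string($p)
)
```

The first result contains neither "Obadiah" nor "seek me"; the second contains
the whole sentence. Note that string() accepts at most one item, so if
$elijah-para selects several paragraphs, something like
string-join($elijah-para/string(), " ") would be needed instead.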


>   let $words := tokenize($txt,"(\s|[,.!:;]|[n][b][s][p][;])+")

A strange regular expression, this. Firstly, '[n][b][s][p][;]' can be written
'nbsp;'. But I wouldn't normally expect to see the literal text nbsp; in your
source. If there's an entity reference &nbsp; in your lexical XML then the
text node will contain a U+00A0 (non-breaking space) character, and it is
this that you should match, by using '&#xa0;' in your regular expression. But
a better regex is \W+, which matches any run of "non-word" characters.
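
As a rough illustration of the two alternatives, the sketch below tokenizes a
made-up sentence (an assumption, not text from the libretto) that contains a
non-breaking space written as the character reference &#xa0;.

```xquery
(: Made-up sample text; &#xa0; is a non-breaking space (U+00A0) :)
let $txt := "Rise then, ye priests&#xa0;of Baal: call on the Lord!"
return (
  (: the original pattern, with the nbsp alternative replaced by &#xa0; :)
  tokenize($txt, "(\s|[,.!:;]|&#xa0;)+"),
  (: the simpler pattern: split on any run of non-word characters :)
  tokenize($txt, "\W+")
)
```

Both calls should split "priests" and "of" into separate tokens. Because the
sample ends with a separator, each result ends with a zero-length string; that
is harmless when you are only comparing tokens against a particular word.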
> 
> I can't figure out how to count the number of string tokens 
> that are 'Lord'. I can get them with:
> 
>   for $word in $words
>   return $word[$word = 'Lord']
> 
> but I can't seem to get the count of them.

count(tokenize($txt, '\W+')[.='Lord'])
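
The predicate [.='Lord'] filters the token sequence, and count() then counts
what is left. A self-contained sketch, using a made-up $txt in place of the
text extracted from the document:

```xquery
(: Made-up sample text standing in for the extracted paragraph text :)
let $txt := "The Lord is God. Call on the Lord, the Lord Jehovah; O lord hear."
return (
  (: case-sensitive match: counts "Lord" but not "lord" :)
  count(tokenize($txt, "\W+")[. = "Lord"]),
  (: case-insensitive variant, if that is what is wanted :)
  count(tokenize($txt, "\W+")[lower-case(.) = "lord"])
)
```

With this sample the first count is 3 and the second is 4. Wrapping your
original FLWOR expression in count( ... ) would give the same answer as the
first.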

Michael Kay
http://www.saxonica.com/


