[xquery-talk] Omission in near() when used in mixed content

Wed Jan 21 14:03:26 PST 2009

Dear all,

we are currently seeing problems with near() when the search words span
element boundaries. We have a fulltext index with content="mixed"
defined for the collection. We know that the index as such works, since
near() works as expected with single words, even when they are split by
element tags. Nevertheless, when searching for a sequence of multiple
words, the search fails if at least one of the words is split by an element.

Assume the following xql:

---

declare namespace tei = "http://www.tei-c.org/ns/1.0";

let $q := "mixed test"
return //tei:u[near(. , $q)]

---
and this sample document:

---

<?xml version="1.0" encoding="utf-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<!-- snipped header -->
  <text>
    <body>
      <div>
        <u xml:id="u1"> this the first mi<seg type="overlap">xed test
</seg> </u>
        <u xml:id="u2"> this the second mi<anchor/>xed test </u>
        <u xml:id="u3"> this is the third <seg type="overlap"> mixed
</seg> test </u>
        <u xml:id="u4"> this is last <seg type="overlap"> mixed test
</seg> </u>
      </div>
    </body>
  </text>
</TEI>
---

Two searches yield very different results, even though IMHO they should
return the same elements:

1) $q="mixed"  returns tei:u with id u1,u2,u3,u4
2) $q="mixed test" only returns tei:u with id u3,u4
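
To make the expectation concrete, here is a minimal sketch of an unindexed
comparison. It uses only the standard XQuery functions contains() and
normalize-space() on the string value of each tei:u, with the four
utterances from the sample inlined in a variable ($doc is just an
illustrative name) so that it runs standalone. It is a plain substring
test, so it does not reproduce the word-distance semantics of near() and
does not use the fulltext index, but it should return all four ids:

---

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: the four utterances from the sample document, inlined for illustration :)
let $doc :=
  <div xmlns="http://www.tei-c.org/ns/1.0">
    <u xml:id="u1"> this the first mi<seg type="overlap">xed test
</seg> </u>
    <u xml:id="u2"> this the second mi<anchor/>xed test </u>
    <u xml:id="u3"> this is the third <seg type="overlap"> mixed
</seg> test </u>
    <u xml:id="u4"> this is last <seg type="overlap"> mixed test
</seg> </u>
  </div>
let $q := "mixed test"
(: normalize-space() collapses the line breaks and repeated blanks in the
   string value, so element boundaries no longer matter for the match :)
for $u in $doc//tei:u[contains(normalize-space(.), $q)]
return string($u/@xml:id)

---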

Does anybody see different behaviour? I might have misinterpreted
something in the docs, so that my assumption that the second search
should return the same four tei:u elements is wrong, or there could be
a bug in near() or in the fulltext index causing this issue.
Either way, I would be very glad to get some hints on how to work
around this issue, as I am currently implementing searches over highly
segmented texts.

cheers,

Stefan