[xquery-talk] An analyze-string stumper

Joe Wicentowski joewiz at gmail.com
Mon Apr 23 12:50:52 PDT 2018


Hi Patrick,

Thanks for your reply!  That 4th version is certainly promising, but I
wonder, will it capture a case I have but regrettably didn't mention
explicitly: more than 2 numbers?  Here's an example:

  The most significant elements in the package were 18 F-104 fighters and
100 M 48 tanks. (72, 76, 77, 82, 89, 95, 99, 107, 111, 125)

Here, I've got more than 2 numbers inside the parentheses, so I can't count
on a parenthesis to begin or end each number.  I was hoping to find a pattern
that would wrap each of the numbers inside the parentheses in an <fn:group>
element, without risking inadvertent hits on numbers outside the
parentheses.
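
One fallback, if a single pattern proves elusive, is a two-pass approach.
The sketch below is only that, a sketch: the function name
local:paren-numbers is made up for illustration, and it assumes the numbers
are always separated by a comma and a single space.  The first
analyze-string matches only parenthesized lists made up entirely of numbers;
the second pulls the individual numbers out of each list.

```xquery
xquery version "3.1";

(: Hypothetical helper: returns the numbers found inside parenthesized,
   comma-separated lists of digits such as "(72, 76, 77)".
   Assumes ", " (comma plus one space) as the only separator. :)
declare function local:paren-numbers($text as xs:string) as xs:string* {
  (: Pass 1: match only parenthesized lists consisting entirely of numbers. :)
  for $list in analyze-string($text, "\(\d+(, \d+)*\)")/fn:match
  (: Pass 2: pull the individual numbers out of each matched list. :)
  return analyze-string(string($list), "\d+")/fn:match/string()
};

local:paren-numbers(
  "18 F-104 fighters and 100 M 48 tanks. (72, 76, 77, 82)"
)
```

With this sample string the call should return the strings "72", "76", "77"
and "82".  The numbers 18, 104, 100 and 48 are left alone because pass 1
never matches text outside a numbers-only pair of parentheses.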

I'd take any solution or hint, but what really threw me about my attempts
was that I wasn't able to use the open and close parentheses to anchor my
search and allow arbitrary repeats of number-plus-optional-comma-and-space
"(\d+(, )?)+" within a pair of parentheses.  I couldn't see why this wasn't
capturing each of the numbers within the parentheses.
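
A likely explanation, going by the regex rules in XQuery Functions 3.1: when
a capturing group sits inside a repeated construct, it is reported once per
match rather than once per repetition, and it keeps only the last substring
it matched.  The minimal test below, using the pattern from my original
message against a shortened input string, illustrates this.

```xquery
xquery version "3.1";

(: The inner (\d+) group is repeated by the outer (?: ... )+ group.
   Only the final repetition is reported as an fn:group element. :)
analyze-string("(79, 171)", "\((?:(\d+)(?:, )?)+\)")//fn:group/string()
```

On a conforming processor this should return the single string "171", which
agrees with the result reported in the original message: "79" is part of the
match but is not wrapped in an <fn:group> element.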

Joe

On Mon, Apr 23, 2018 at 2:54 PM Patrick Durusau <patrick at durusau.net> wrote:

> Joe,
>
> Forgive the length but I'm likely to bump my head on this issue in the
> future, so a fuller than necessary explanation:
>
> Started with the simplest regex that would capture the parens:
>
> 1. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
> sent a message to Israeli Foreign Minister Abba Eban calling upon Israel to
> endorse openly Resolution 242, and on May 13 President Johnson sent a
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser, urging
> him to seize the unique opportunity offered by the Jarring mission to
> achieve peace. (79, 171) ", "\(\d.*\)")
> 1. Result: <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
>   <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent a
> message to Israeli Foreign Minister Abba Eban calling upon Israel to
> endorse openly Resolution 242, and on May 13 President Johnson sent a
> letter to United Arab Republic </fn:non-match>
>   <fn:match>(UAR) President Gamal Abdel Nasser, urging him to seize the
> unique opportunity offered by the Jarring mission to achieve peace. (79,
> 171)</fn:match>
>   <fn:non-match> </fn:non-match>
> </fn:analyze-string-result>
>
> OK, so what do we know about the desired matches? Digits plus (, ) with no
> spaces. Yes?
>
> 2. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
> sent a message to Israeli Foreign Minister Abba Eban calling upon Israel to
> endorse openly Resolution 242, and on May 13 President Johnson sent a
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser, urging
> him to seize the unique opportunity offered by the Jarring mission to
> achieve peace. (79, 171) ", "\(\d+, \d+\)")
>
> So I match parens plus digits, ", " (comma plus whitespace), digits plus
> paren.
>
> 2. Result: <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
>   <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent a
> message to Israeli Foreign Minister Abba Eban calling upon Israel to
> endorse openly Resolution 242, and on May 13 President Johnson sent a
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser, urging
> him to seize the unique opportunity offered by the Jarring mission to
> achieve peace. </fn:non-match>
>   <fn:match>(79, 171)</fn:match>
>   <fn:non-match> </fn:non-match>
> </fn:analyze-string-result>
>
> I need to split the two numbers, and what better way to do that than
> alternation?
>
> 3. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
> sent a message to Israeli Foreign Minister Abba Eban calling upon Israel to
> endorse openly Resolution 242, and on May 13 President Johnson sent a
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser, urging
> him to seize the unique opportunity offered by the Jarring mission to
> achieve peace. (79, 171) ", "\(\d+ | \d+\)")
>
> 3. Result: <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
>   <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent a
> message to Israeli Foreign Minister Abba Eban calling upon Israel to
> endorse openly Resolution 242, and on May 13 President Johnson sent a
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser, urging
> him to seize the unique opportunity offered by the Jarring mission to
> achieve peace. (79,</fn:non-match>
>   <fn:match> 171)</fn:match>
>   <fn:non-match> </fn:non-match>
> </fn:analyze-string-result>
>
> You're probably already laughing because you see my mistake, which I correct
> in #4:
>
> 4. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
> sent a message to Israeli Foreign Minister Abba Eban calling upon Israel to
> endorse openly Resolution 242, and on May 13 President Johnson sent a
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser, urging
> him to seize the unique opportunity offered by the Jarring mission to
> achieve peace. (79, 171) ", "\(\d+|\d+\)")
>
> 4. Result: <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
>   <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent a
> message to Israeli Foreign Minister Abba Eban calling upon Israel to
> endorse openly Resolution 242, and on May 13 President Johnson sent a
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser, urging
> him to seize the unique opportunity offered by the Jarring mission to
> achieve peace. </fn:non-match>
>   <fn:match>(79</fn:match>
>   <fn:non-match>,</fn:non-match>
>   <fn:match> 171)</fn:match>
>   <fn:non-match> </fn:non-match>
> </fn:analyze-string-result>
>
> The error was here: "\(\d+ | \d+\)". Its first alternative matches only an
> open paren and digits followed by a space, whereas the number in question
> was followed by a comma and *no space*.
>
> Know thy data!
>
> Examples created on BaseX. BTW, I started from known good examples in
> XQuery Functions 3.1, verified that they worked and then created the search
> strings.
>
> Hope this helps!
>
> Patrick
>
> On 04/23/2018 12:22 PM, Joe Wicentowski wrote:
>
> Hi all,
>
> I have encountered an unexpected challenge constructing a regex for a
> pattern I am looking for.  I am looking for numbers in parentheses.  For
> example, in the following string:
>
>   "On February 13, 1968, Secretary of State Dean Rusk sent a
>     message to Israeli Foreign Minister Abba Eban calling upon Israel to
>     endorse openly Resolution 242, and on May 13 President Johnson sent a
>     letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
>     urging him to seize the unique opportunity offered by the Jarring
>     mission to achieve peace. (79, 171)"
>
> ... I would like to match "79" and "171" (but not "UAR" or "13" or
> "1968").  I have been trying to construct a regex for use with
> analyze-string to capture this pattern, but I have not been successful.  I
> have tried the following:
>
>   analyze-string($string, "(?:\()(?:(\d+)(?:, )?)+(?:\))")
>
> In other words, there are these 3 components:
>
>   1. (?:\() a non-capturing group consisting of an open parens, followed by
>   2. (?:(\d+)(?:, )?)+ one or more non-capturing groups consisting of (a
> number followed by an optional, non-matching comma-and-space), followed by
>   3. (?:\)) a non-capturing group consisting of a close parens
>
> I was expecting to get the following output:
>
>   <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
>     <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent a
>     message to Israeli Foreign Minister Abba Eban calling upon Israel to
>     endorse openly Resolution 242, and on May 13 President Johnson sent a
>     letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
>     urging him to seize the unique opportunity offered by the Jarring
>     mission to achieve peace. </fn:non-match>
>     <fn:match>(<fn:group nr="1">79</fn:group>,
>       <fn:group nr="1">171</fn:group>)</fn:match>
>   </fn:analyze-string-result>
>
> However, the actual result is that the first number ("79") is skipped, and
> only the 2nd number ("171") is captured:
>
>   <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
>     <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent a
>     message to Israeli Foreign Minister Abba Eban calling upon Israel to
>     endorse openly Resolution 242, and on May 13 President Johnson sent a
>     letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
>     urging him to seize the unique opportunity offered by the Jarring
>     mission to achieve peace. </fn:non-match>
>     <fn:match>(79,
>       <fn:group nr="1">171</fn:group>)</fn:match>
>   </fn:analyze-string-result>
>
> What am I missing?  Can anyone suggest a regex that is able to capture
> both numbers inside the parentheses?  Or do I need to make a two-pass run
> through this, finding parenthetical text with a first analyze-string like
> "\(.+\)" and then looking inside its matches with a second analyze-string
> like "(\d+)(?:, )?"?
>
> Thanks,
> Joe
>
>
> _______________________________________________
> talk at x-query.com
> http://x-query.com/mailman/listinfo/talk
>
>
> --
> Patrick Durusau
> patrick at durusau.net
> Technical Advisory Board, OASIS (TAB)
> Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
> Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)
>
> Another Word For It (blog): http://tm.durusau.net
> Homepage: http://www.durusau.net
> Twitter: patrickDurusau
>
> _______________________________________________
> talk at x-query.com
> http://x-query.com/mailman/listinfo/talk