[xquery-talk] An analyze-string stumper
Patrick Durusau
patrick at durusau.net
Mon Apr 23 11:50:42 PDT 2018
Joe,
Forgive the length but I'm likely to bump my head on this issue in the
future, so a fuller than necessary explanation:
Started with the simplest regex that would capture the parens:
1. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(.*\)")
1. Result: <fn:analyze-string-result
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic </fn:non-match>
<fn:match>(UAR) President Gamal Abdel Nasser, urging him to seize the
unique opportunity offered by the Jarring mission to achieve peace. (79,
171)</fn:match>
<fn:non-match> </fn:non-match>
</fn:analyze-string-result>
OK, so what do we know about the desired matches? Digits plus (, ) with
no spaces. Yes?
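
A greedy ".*" between escaped parens runs from the first open paren to the last close paren, which is the shape of the match in result 1. The sketch below is only an illustration, not something from the original exchange: $text is a shortened, made-up stand-in for the full sentence, and the reluctant quantifier ".*?" is one possible way to get one match per parenthetical.

```xquery
xquery version "3.1";

(: Hypothetical shortened input; an assumption, not the original sentence. :)
let $text := "Republic (UAR) President Nasser. (79, 171) "
return (
  (: greedy: a single match from the first "(" to the last ")" :)
  fn:analyze-string($text, "\(.*\)")/fn:match/string(),
  (: reluctant: stops at the nearest ")" so each parenthetical matches alone :)
  fn:analyze-string($text, "\(.*?\)")/fn:match/string()
)
```

On this input the first call should return one long string and the second two short ones, "(UAR)" and "(79, 171)". Neither pattern tells digits from letters, which is why the next step narrows it.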
2. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(\d+, \d+\)")
So I match parens plus digits, ", " (comma plus whitespace), digits plus
paren.
2. Result: <fn:analyze-string-result
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. </fn:non-match>
<fn:match>(79, 171)</fn:match>
<fn:non-match> </fn:non-match>
</fn:analyze-string-result>
I need to split the two numbers, and what better way to do that than
alternation?
3. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(\d+ | \d+\)")
3. Result: <fn:analyze-string-result
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79,</fn:non-match>
<fn:match> 171)</fn:match>
<fn:non-match> </fn:non-match>
</fn:analyze-string-result>
You're probably already laughing because you see my mistake, which I
correct in #4:
4. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(\d+| \d+\)")
4. Result: <fn:analyze-string-result
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. </fn:non-match>
<fn:match>(79</fn:match>
<fn:non-match>,</fn:non-match>
<fn:match> 171)</fn:match>
<fn:non-match> </fn:non-match>
</fn:analyze-string-result>
The error was here: "\(\d+ | \d+\)". The first alternative would only
match an open paren and digits followed by a white space, whereas the
number in question was followed by a comma and *no space*.
Know thy data!
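
To see the effect of that one space in isolation, here is a small side-by-side sketch. It is an assumption-laden illustration: $text is a shortened, made-up stand-in for the full sentence, and the second pattern is simply the pattern from #3 with the space after the first \d+ removed.

```xquery
xquery version "3.1";

(: Hypothetical shortened input; an assumption, not the original sentence. :)
let $text := "Republic (UAR) President Nasser. (79, 171) "
(: first pattern: as in #3; second: same, minus the space after the first \d+ :)
for $regex in ("\(\d+ | \d+\)", "\(\d+| \d+\)")
return
  (: one output line per pattern, each match wrapped in [ ] so spaces show :)
  $regex || " => " || string-join(
    fn:analyze-string($text, $regex)/fn:match ! ("[" || string(.) || "]"),
    " ")
```

On this input the first pattern should report only "[ 171)]", while the second should report "[(79]" and "[ 171)]". The brackets are there only to make leading spaces in a match visible.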
Examples created on BaseX. BTW, I started from known good examples in
XQuery Functions 3.1, verified that they worked and then created the
search strings.
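
For the original goal of getting at the bare numbers, a two-pass approach along the lines Joe suggests in his message may be simpler than a single regex. In his output, the capturing group inside the repeated group keeps only its last repetition. The following is a minimal sketch under stated assumptions: $text is a shortened, made-up stand-in for the full sentence, and both patterns are my own guesses at what fits this data, not tested against the full document set.

```xquery
xquery version "3.1";

(: Hypothetical shortened input; an assumption, not the original sentence. :)
let $text := "On May 13, Republic (UAR) President Nasser. (79, 171) "
(: pass 1: parentheticals that contain only numbers separated by ", " :)
for $paren in fn:analyze-string($text, "\(\d+(, \d+)*\)")/fn:match
(: pass 2: every run of digits inside the string value of that match :)
for $num in fn:analyze-string(string($paren), "\d+")/fn:match
return xs:integer(string($num))
```

On this input the query should return the integers 79 and 171 and skip "UAR" and "13", since neither sits inside a digits-only parenthetical.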
Hope this helps!
Patrick
On 04/23/2018 12:22 PM, Joe Wicentowski wrote:
> Hi all,
>
> I have encountered an unexpected challenge constructing a regex for a
> pattern I am looking for. I am looking for numbers in parentheses.
> For example, in the following string:
>
> "On February 13, 1968, Secretary of State Dean Rusk sent a
> message to Israeli Foreign Minister Abba Eban calling upon Israel to
> endorse openly Resolution 242, and on May 13 President Johnson sent a
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
> urging him to seize the unique opportunity offered by the Jarring
> mission to achieve peace. (79, 171)"
>
> ... I would like to match "79" and "171" (but not "UAR" or "13" or
> "1968"). I have been trying to construct a regex for use with
> analyze-string to capture this pattern, but I have not been
> successful. I have tried the following:
>
> analyze-string($string, "(?:\()(?:(\d+)(?:, )?)+(?:\))")
>
> In other words, there are these 3 components:
>
> 1. (?:\() a non-capturing group consisting of an open parens,
> followed by
> 2. (?:(\d+)(?:, )?)+ one or more non-capturing groups consisting of
> (a number followed by an optional, non-matching comma-and-space),
> followed by
> 3. (?:\)) a non-capturing group consisting of a close parens
>
> I was expecting to get the following output:
>
> <fn:analyze-string-result
> xmlns:fn="http://www.w3.org/2005/xpath-functions">
> <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk
> sent a
> message to Israeli Foreign Minister Abba Eban calling upon Israel to
> endorse openly Resolution 242, and on May 13 President Johnson sent a
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
> urging him to seize the unique opportunity offered by the Jarring
> mission to achieve peace. </fn:non-match>
> <fn:match>(<fn:group nr="1">79</fn:group>,
> <fn:group nr="1">171</fn:group>)</fn:match>
> </fn:analyze-string-result>
>
> However, the actual result is that the first number ("79") is skipped,
> and only the 2nd number ("171") is captured:
>
> <fn:analyze-string-result
> xmlns:fn="http://www.w3.org/2005/xpath-functions">
> <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk
> sent a
> message to Israeli Foreign Minister Abba Eban calling upon Israel to
> endorse openly Resolution 242, and on May 13 President Johnson sent a
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
> urging him to seize the unique opportunity offered by the Jarring
> mission to achieve peace. </fn:non-match>
> <fn:match>(79,
> <fn:group nr="1">171</fn:group>)</fn:match>
> </fn:analyze-string-result>
>
> What am I missing? Can anyone suggest a regex that is able to capture
> both numbers inside the parentheses? Or do I need to make a two-pass
> run through this, finding parenthetical text with a first
> analyze-string like "\(.+\)" and then looking inside its matches with
> a second analyze-string like "(\d+)(?:, )?"?
>
> Thanks,
> Joe
>
>
> _______________________________________________
> talk at x-query.com
> http://x-query.com/mailman/listinfo/talk
--
Patrick Durusau
patrick at durusau.net
Technical Advisory Board, OASIS (TAB)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)
Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau