Michael Dyck jmdyck at ibiblio.org
Mon Apr 27 08:44:59 PDT 2015

On 15-04-27 05:53 AM, Michael Kay wrote:
> Agreed.
> To confuse matters, though, I see that we still have the problematic
> statement in A.2 "When tokenizing, the longest possible match that is
> consistent with the EBNF is used."

In the CR period for XQuery 3.0, we changed that sentence from
     "valid in the current context"
to
     "consistent with the EBNF"
(See meeting 541.)

> This to my mind has always suggested the idea that the tokenization is
> sensitive to the grammatical context. And in some cases it is; you don't
> want to go looking for QNames or IntegerLiterals when you're in
> DirElementContent, just because a QName or IntegerLiteral is longer than
> a Char.


> However, it could also be read as meaning that given "12 div3",
> tokenizing "div3" as one token is not consistent with the EBNF (it
> doesn't lead to a valid parse),

Yes, I believe that's how that sentence is supposed to be read. There are no 
possible continuations of "12 div3" that conform to the EBNF, but there 
*are* continuations of "12 div" that conform to the EBNF. So, when the 
tokenizer is positioned just before the 'd', "div" is the longest possible 
match (LPM) that is consistent with the EBNF, so the next token is "div".
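
To make that reading concrete, here is a minimal sketch in Python. It is not the spec's algorithm: the one-rule grammar, the TERMINALS table, and the helpers allowed_next, longest_match and tokenize are all hypothetical, and "consistent with the EBNF" is approximated by "is a terminal the toy grammar permits after the previous token".

```python
import re

# Hypothetical toy grammar (an assumption, not the XQuery EBNF):
#   Expr ::= IntegerLiteral ("div" IntegerLiteral)*
TERMINALS = {
    "IntegerLiteral": re.compile(r"[0-9]+"),
    "div": re.compile(r"div"),
    "NCName": re.compile(r"[A-Za-z_][A-Za-z0-9_]*"),
}

def allowed_next(prev_kind):
    """Terminal kinds the toy grammar permits after the previous token."""
    if prev_kind is None or prev_kind == "div":
        return ["IntegerLiteral"]
    return ["div"]  # after an IntegerLiteral, only "div" can continue Expr

def longest_match(text, pos, kinds):
    """Return (kind, end) for the longest match at pos among kinds, or None."""
    best = None
    for kind in kinds:
        m = TERMINALS[kind].match(text, pos)
        if m and (best is None or m.end() > best[1]):
            best = (kind, m.end())
    return best

def tokenize(text):
    """Longest match restricted to terminals the grammar allows here."""
    tokens, pos, prev = [], 0, None
    while True:
        while pos < len(text) and text[pos].isspace():
            pos += 1  # skip whitespace between tokens
        if pos == len(text):
            return tokens
        best = longest_match(text, pos, allowed_next(prev))
        if best is None:
            raise SyntaxError(f"no consistent token at offset {pos}")
        kind, end = best
        tokens.append((kind, text[pos:end]))
        pos, prev = end, kind

# Ignoring the grammar, the longest match at the 'd' is the NCName "div3":
#   longest_match("12 div3", 3, TERMINALS)  ->  ("NCName", 7)
# Restricted to what the grammar allows there, it is "div":
#   tokenize("12 div3")
#   ->  [("IntegerLiteral", "12"), ("div", "div"), ("IntegerLiteral", "3")]
```

Note that this sketch happily tokenizes "12 div3" as three tokens; whether the result is error-free is a separate question.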

>  so it should be tokenized as two tokens.

Well, that's less clear, but I think it's one valid interpretation.

> I don't think that has ever been the intent, and I guess section A.2.2 on
> delimiting and non-delimiting terminals was added to eliminate this
> interpretation.

I don't think there's a problem with saying it's tokenized as two tokens. 
Just because a text can be tokenized doesn't mean it's free of syntax 
errors. And section A.2.2 gives just one of the many requirements that a 
sequence of tokens must satisfy in order to be error-free. (Specifically, 
"div" and "3" are adjacent non-delimiting terminal symbols, and so must be 
separated by Whitespace and/or Comments.)

So, in that view, A.2.2 wasn't added to modify the interpretation of the LPM 
rule, it was added to flag some of the cases that the LPM rule "lets through".
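
As a rough illustration of that two-stage view, the following Python sketch runs a separate A.2.2-style check over an already-tokenized text. The NON_DELIMITING set is an assumed subset of the spec's list, the (kind, start, end) token shape and the adjacency_errors name are hypothetical, and any gap between two tokens is assumed to consist only of Whitespace and/or Comments.

```python
# Assumed subset of the non-delimiting terminal symbols listed in A.2.2.
NON_DELIMITING = {"IntegerLiteral", "NCName", "div"}

def adjacency_errors(tokens):
    """Flag adjacent non-delimiting terminals with nothing between them.

    tokens: list of (kind, start, end) tuples in source order, where
    start/end are character offsets into the original text.
    Returns a list of (left_kind, right_kind, offset) for each violation.
    """
    errors = []
    for (k1, _s1, e1), (k2, s2, _e2) in zip(tokens, tokens[1:]):
        touching = (e1 == s2)  # no Whitespace or Comment in between
        if touching and k1 in NON_DELIMITING and k2 in NON_DELIMITING:
            errors.append((k1, k2, s2))
    return errors

# "12 div3" tokenizes fine under the LPM rule, but fails this check:
div3 = [("IntegerLiteral", 0, 2), ("div", 3, 6), ("IntegerLiteral", 6, 7)]
# adjacency_errors(div3)  ->  [("div", "IntegerLiteral", 6)]

# "12 div 3" passes:
div_3 = [("IntegerLiteral", 0, 2), ("div", 3, 6), ("IntegerLiteral", 7, 8)]
# adjacency_errors(div_3)  ->  []
```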


