[xquery-talk] Necessary whitespace

Benito van der Zander benito at benibela.de
Tue Apr 28 13:33:17 PDT 2015


Hi Michael,

> I don't think there's a problem with saying it's tokenized as two 
> tokens. Just because a text can be tokenized doesn't mean it's free of 
> syntax errors. And section A.2.2 gives just one of the many 
> requirements that a sequence of tokens must satisfy in order to be 
> error-free. (Specifically, "div" and "3" are adjacent non-delimiting 
> terminal symbols, and so must be separated by Whitespace and/or 
> Comments.) 

What if it tokenizes "div." in
12!(12 div.)
as two tokens, "div" and "."?
"." is a terminal symbol, and "div" is not an NCName there, just part of 
a MultiplicativeExpr.

Or "<<a>2</a>" in
1<<a>2</a>
as "<" and "<a>2</a>"?

"<<" is the longer match, but it is not consistent with the EBNF.
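
To make the two readings concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the terminal set is made up and heavily simplified, and the names TERMINALS, AFTER_OPERAND and longest_match are hypothetical, not taken from the spec or from any real XQuery processor. It compares a context-free longest match with one restricted to the terminals that may follow a complete operand.

```python
import re

# Hypothetical, heavily simplified terminal set (NOT the real XQuery grammar).
TERMINALS = {
    "IntegerLiteral": re.compile(r"[0-9]+"),
    # NCName characters include "." and "-", so "div." is a possible NCName.
    "NCName": re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*"),
    "div": re.compile(r"div"),
    "<<": re.compile(r"<<"),
    "<": re.compile(r"<"),
    ".": re.compile(r"\."),
}

# Assumption: terminals acceptable right after a complete operand such as "12".
AFTER_OPERAND = {"div", "<<", "<"}


def longest_match(text, pos, allowed=None):
    """Return (name, lexeme) of the longest terminal matching at pos.

    If `allowed` is given, only those terminal names are candidates;
    otherwise every terminal in TERMINALS competes.
    """
    best = None
    for name, pattern in TERMINALS.items():
        if allowed is not None and name not in allowed:
            continue
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


if __name__ == "__main__":
    text = "12!(12 div.)"
    pos = text.index("div")
    print(longest_match(text, pos))                 # context-free candidates
    print(longest_match(text, pos, AFTER_OPERAND))  # only what may follow "12"
    print(longest_match("1<<a>2</a>", 1, AFTER_OPERAND))
```

In this toy setup the restricted call picks "div" for the first example, but it still picks "<<" for the second, because "<<" is itself allowed after an operand. Getting "<" there would need a check against the whole parse, not just against the current context.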

Cheers,
Benito



On 04/27/2015 05:44 PM, Michael Dyck wrote:
> On 15-04-27 05:53 AM, Michael Kay wrote:
>> Agreed.
>>
>> To confuse matters, though, I see that we still have the problematic
>> statement in A.2 "When tokenizing, the longest possible match that is
>> consistent with the EBNF is used."
>
> In the CR period for XQuery 3.0, we changed that sentence from
>     "valid in the current context"
> to
>     "consistent with the EBNF"
> (See meeting 541.)
>
>> This to my mind has always suggested the idea that the tokenization is
>> sensitive to the grammatical context. And in some cases it is; you don't
>> want to go looking for QNames or IntegerLiterals when you're in
>> DirElementContent, just because a QName or IntegerLiteral is longer than
>> a Char.
>
> Right.
>
>> However, it could also be read as meaning that given "12 div3",
>> tokenizing "div3" as one token is not consistent with the EBNF (it
>> doesn't lead to a valid parse),
>
> Yes, I believe that's how that sentence is supposed to be read. There 
> are no possible continuations of "12 div3" that conform to the EBNF, 
> but there *are* continuations of "12 div" that conform to the EBNF. 
> So, when the tokenizer is positioned just before the 'd', "div" is the 
> longest possible match (LPM) that is consistent with the EBNF, so the 
> next token is "div".
>
>>  so it should be tokenized as two tokens.
>
> Well, that's less clear, but I think it's one valid interpretation.
>
>> I don't think that has ever been the intent, and I guess section A.2.2 on
>> delimiting and non-delimiting terminals was added to eliminate this
>> interpretation.
>
> I don't think there's a problem with saying it's tokenized as two 
> tokens. Just because a text can be tokenized doesn't mean it's free of 
> syntax errors. And section A.2.2 gives just one of the many 
> requirements that a sequence of tokens must satisfy in order to be 
> error-free. (Specifically, "div" and "3" are adjacent non-delimiting 
> terminal symbols, and so must be separated by Whitespace and/or 
> Comments.)
>
> So, in that view, A.2.2 wasn't added to modify the interpretation of 
> the LPM rule, it was added to flag some of the cases that the LPM rule 
> "lets through".
>
> -Michael


