(Final?) krb5.conf Lexer/Parser Proposal

Wed Jan 4 17:46:34 EST 2006

Hello and Happy New Year!

Please recall the krb5.conf lexer proposals made and discussed 
earlier (links: 
http://mailman.mit.edu/pipermail/krbdev/2005-November/003892.html , 
http://mailman.mit.edu/pipermail/krbdev/2005-December/003944.html ).
After discovering more and more unfriendly edge cases, including a 
nasty one printed in an O'Reilly book, I have come to the 
unfortunate conclusion that it is not possible to robustly make the 
lexer whitespace-neutral while supporting existing krb5.conf 
files.  Whitespace is and will be syntactic.  A moment of silence for 
the previous proposal.

The new lexer/parser proposal embraces line-based syntax and is much 
less of a change from the existing parser.  To a first-order 
approximation, the new proposal is the same as the existing parser except that:
   - comments beginning with the pound sign '#' may appear at the end 
of any line.
   - quoted strings are acceptable everywhere, not just in relation 
values.  Consequently, all section, tag, and value names can contain 
any characters.
   - the "final" signifier (asterisk '*') is no longer supported or 
treated as special.

Therefore, please feel free to skip or skim the nitty-gritty part 
below, and look at the Noteworthy Features section for interesting 
changes.  If you have any concerns, questions, or (non-xml-themed) 
suggestions, please let me know.  Thanks!

Yours,

Joe Calzaretta
Software Development & Integration Team
MIT Information Services & Technology

-----------------
The New krb5.conf Lexer/Parser Proposal, a.k.a., the Nitty-Gritty Part:

The input file is divided into lines, terminated by linebreaks (or 
by end-of-stream for the last line).

For each line:
   Initial whitespace is ignored.
   All text after an unquoted pound sign '#' (including the pound 
sign itself) is stored as comment text and subsequently ignored.
   If the first nonwhitespace character is a semicolon ';', the 
entire line is stored as comment text and subsequently ignored.
   If the line is all whitespace, the entire line is ignored.
   At this point, initial whitespace, comments, and blank lines have 
been stripped.
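
The per-line steps above can be sketched roughly as follows.  This is 
only an illustration in Python, not the planned implementation: the 
names find_unquoted and strip_line are made up here, the sketch assumes 
that only double quotes delimit quoted strings and that a backslash 
escapes the next character inside them, and dropping trailing 
whitespace at this stage is my assumption.

```python
def find_unquoted(text, target):
    """Return the index of the first `target` outside double quotes, or -1."""
    in_quotes = False
    i = 0
    while i < len(text):
        c = text[i]
        if in_quotes:
            if c == "\\":
                i += 1          # skip the escaped character
            elif c == '"':
                in_quotes = False
        elif c == '"':
            in_quotes = True
        elif c == target:
            return i
        i += 1
    return -1


def strip_line(line):
    """Split one physical line into (content, comment).

    Hypothetical helper: content is what the parser goes on to examine,
    comment is the text stored and then ignored.
    """
    text = line.lstrip()                 # initial whitespace is ignored
    if text.startswith(";"):             # whole line is a comment
        return "", text.rstrip()
    pos = find_unquoted(text, "#")       # '#' outside quotes starts a comment
    if pos >= 0:
        return text[:pos].rstrip(), text[pos:].rstrip()
    return text.rstrip(), ""             # assumption: trailing whitespace dropped
```

A line whose content comes back empty is a blank or comment-only line 
and would simply be skipped.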

After this processing, the first character of the line is examined.
   If this character is an open square bracket '[', the line is a Section Line.
   If this character is a close curly brace '}', the line is a 
CloseSubsection Line.
   If this character is an open curly brace '{', AND the previous 
line was a DanglingSubsection line (more later), the line is a 
RescuedSubsection Line.
   Otherwise, the line is a SubsectionOrRelation line.

Section Lines:
   All text after the initial open square bracket '[' and up to but 
not including the first unquoted close square bracket ']' is 
considered the Raw Section Name.  A line without the close bracket or 
with any nonwhitespace text after the close bracket is considered an error.

CloseSubsection Lines:
   Any nonwhitespace text after the initial close curly brace '}' is 
considered an error.

SubsectionOrRelation Lines:
   All text up to but not including the first unquoted equal sign '=' 
is considered the Raw Tag name.  A line without such an equal sign is 
considered an error.
   The first nonwhitespace character after such an equal sign '=' is 
examined.
     If this character is an open curly brace '{', the line is a 
Subsection Line.
     If there is no such character, the line is a DanglingSubsection Line.
     Otherwise, the line is a Relation Line.

Subsection and RescuedSubsection Lines:
   Any nonwhitespace text after the unquoted open curly brace '{' is 
considered an error.

Relation Lines:
    All text after the unquoted equal sign '=' is considered the Raw 
Value Name.
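
To make the line types concrete, here is a rough Python sketch of the 
classification described above.  The function names and the string 
labels are hypothetical; the input is assumed to be a non-empty line 
that has already had comments and initial whitespace stripped.  Checks 
for trailing text (e.g., after '{' or '}') are left out.

```python
def find_unquoted(text, target):
    """Return the index of the first `target` outside double quotes, or -1."""
    in_quotes = False
    i = 0
    while i < len(text):
        c = text[i]
        if in_quotes:
            if c == "\\":
                i += 1          # skip the escaped character
            elif c == '"':
                in_quotes = False
        elif c == '"':
            in_quotes = True
        elif c == target:
            return i
        i += 1
    return -1


def classify(line, prev_kind=None):
    """Return the line type of one stripped, non-empty line."""
    first = line[0]
    if first == "[":
        return "Section"
    if first == "}":
        return "CloseSubsection"
    if first == "{" and prev_kind == "DanglingSubsection":
        return "RescuedSubsection"
    # Otherwise: a SubsectionOrRelation line, which needs an unquoted '='.
    eq = find_unquoted(line, "=")
    if eq < 0:
        return "Error"
    rest = line[eq + 1:].lstrip()
    if not rest:
        return "DanglingSubsection"   # nothing after '='
    if rest[0] == "{":
        return "Subsection"
    return "Relation"
```

Note that the caller has to pass in the previous line's type, since 
that is the only way a line starting with '{' can be accepted.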

Raw Name Canonicalization:
   All Raw Section/Tag/Value Names are canonicalized as follows:
     Any text within a quoted string is unescaped in the manner of 
ANSI C (C90 spec).
(e.g., "[\\Huh\x3F]" => [\Huh?])
     All whitespace within a quoted string is preserved.
     Whitespace between two quoted strings is eliminated (this provides 
string concatenation much like ANSI C).
     Whitespace at the beginning and end of the Raw Name is eliminated.
     All other whitespace (i.e., whitespace before or after an unquoted word)
is condensed to a single space (like 'collapse' whitespace handling in xml).
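
As an illustration of the canonicalization rules, here is a rough 
Python sketch.  The names are hypothetical, and several things are 
simplifying assumptions on my part: quotes are assumed to be balanced, 
"\x" takes at most two hex digits (C itself allows more), unknown 
escapes keep the escaped character, and the result is a Python string 
rather than a byte string.

```python
import re

# Single-character escapes from C90.
SIMPLE = {"a": "\a", "b": "\b", "f": "\f", "n": "\n", "r": "\r",
          "t": "\t", "v": "\v", "\\": "\\", "'": "'", '"': '"', "?": "?"}

ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{1,2}|[0-7]{1,3}|.)", re.DOTALL)

# A quoted string, a run of whitespace, or an unquoted word.
TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|(\s+)|([^\s"]+)', re.DOTALL)


def unescape(body):
    """Unescape the inside of a quoted string (simplified C90 rules)."""
    def repl(match):
        esc = match.group(1)
        if esc[0] == "x" and len(esc) > 1:
            return chr(int(esc[1:], 16))      # \xhh
        if esc[0] in "01234567":
            return chr(int(esc, 8))           # \ooo
        return SIMPLE.get(esc, esc)           # \n, \\, \", ...
    return ESCAPE.sub(repl, body)


def canonicalize(raw):
    """Canonicalize a Raw Section/Tag/Value Name."""
    tokens = []
    for match in TOKEN.finditer(raw):
        quoted, space, word = match.groups()
        if quoted is not None:
            tokens.append(("q", unescape(quoted)))
        elif space is not None:
            tokens.append(("s", " "))         # condensed to a single space
        else:
            tokens.append(("w", word))
    # Whitespace at the beginning and end is eliminated.
    while tokens and tokens[0][0] == "s":
        tokens.pop(0)
    while tokens and tokens[-1][0] == "s":
        tokens.pop()
    out = []
    for i, (kind, text) in enumerate(tokens):
        # Whitespace between two quoted strings is eliminated.
        if kind == "s" and tokens[i - 1][0] == "q" and tokens[i + 1][0] == "q":
            continue
        out.append(text)
    return "".join(out)
```

With this sketch, the raw name "foo " "bar" becomes foo bar (the space 
comes from inside the first quoted string), and foo     bar also 
becomes foo bar (collapsed whitespace).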

Lines may generally occur in any order, but some situations are 
considered errors.  Errors occur:
   If the first line is not a Section Line.
   If a DanglingSubsection Line is not immediately followed by a 
RescuedSubsection Line.
   If a Section Line or the end-of-stream appears within a Subsection 
(after fewer CloseSubsection lines than Subsection/RescuedSubsection Lines).
   If a CloseSubsection Line appears outside of a Subsection (after 
an equal number of CloseSubsection and Subsection/RescuedSubsection Lines).
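
The ordering rules amount to tracking the previous line type and the 
current subsection depth.  A rough Python sketch, operating on a list 
of line-type labels (the labels, the error codes, and the function name 
are all hypothetical; resetting the depth when a Section Line appears 
inside a subsection is just one possible recovery choice):

```python
def check_order(kinds):
    """Return a list of (line_number, error_code) for a sequence of line types."""
    errors = []
    depth = 0          # open Subsections minus CloseSubsections
    prev = None
    for n, kind in enumerate(kinds, start=1):
        if n == 1 and kind != "Section":
            errors.append((n, "first-line-not-section"))
        if prev == "DanglingSubsection" and kind != "RescuedSubsection":
            errors.append((n - 1, "dangling-not-rescued"))
        if kind in ("Subsection", "RescuedSubsection"):
            depth += 1
        elif kind == "CloseSubsection":
            if depth == 0:
                errors.append((n, "close-outside-subsection"))
            else:
                depth -= 1
        elif kind == "Section" and depth > 0:
            errors.append((n, "section-inside-subsection"))
            depth = 0  # assumption: treat the open subsections as closed
        prev = kind
    # End-of-stream checks.
    if prev == "DanglingSubsection":
        errors.append((len(kinds), "dangling-not-rescued"))
    if depth > 0:
        errors.append((len(kinds), "eof-inside-subsection"))
    return errors
```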

---------------------------------
Noteworthy Features of the Proposed Lexer/Parser

=> The asterisk '*' signifier for "final" lines is no longer 
supported.  Asterisks are not considered special characters at 
all.  If this is undesirable or surprising, please let me know.

=> The semicolon ';' only signifies a comment at the beginning of a 
line, whereas the pound sign '#' signifies a comment whenever it 
appears unquoted.  Note the following lines:
   # a comment
   foo = bar # a comment
   foo = "bar # NOT a comment"
   ; a comment
   foo = bar ; NOT a comment
Why this difference?  Some existing krb5.conf files' relation values 
(notably the auth_to_local value) may have unquoted semicolons ';' in 
them.  As far as we have seen, no existing krb5.conf files' relation 
values or tags use unquoted pound signs '#'.  If this is untrue, 
please let me know.

=> Relation values may not start with an unquoted open curly brace 
'{'.  For example, the line:
   foo = { bar
is considered an error.  Note that the existing parser would treat 
this as a relation assigning the value "{ bar" to the tag "foo".  The 
existing parser's behavior is confusing enough that it is probably 
best discarded.  If this is untrue, please let me know.

=> DanglingSubsections and RescuedSubsections:  The existing parser 
allows the open curly brace '{' for subsections to appear on the line 
after the equal sign '=', like so:
   foo =              # dangling subsection
   {                     # rescued subsection, yay!
      bar = baz
   }
This syntax is pretty, because you can line up the open and close 
curlies.  But it violates the one-line-per-element 
linebreaks-are-syntactic rule to which the parser otherwise strictly 
adheres.  Personally, I don't like this because it is an extra layer 
of complexity, and the corresponding format for relation values is not valid:
   foo =   # dangling relation?
   bar     # can't be rescued, and doesn't parse.  boo!
Anyway, this syntax continues to be supported in the new proposal 
(actually improved because the existing parser doesn't allow comments 
to appear on lines between the equal sign and the open curly 
brace).  If anyone thinks this should be eliminated or supported 
differently, please let me know.

=> All section names, tag names, and value names may contain any 
character (except for null '\0') including whitespace.  Since all 
such names also support ANSI C quoted strings, there is a way to 
include any special character.  For example,
   ["foo]"] # close bracket in a section name
   "} foo" = bar # close curly brace at start of a tag name.
   "foo " = bar # space at the end of a tag name
   foo     bar = baz #single collapsed space in the middle of a tag name.
   "foo=" = bar # equal sign in a tag name.
   "#foo" = bar #pound sign in a tag name.
   foo = "{ bar" #open curly brace at start of a value name.
   foo = "\"bar\"" #quotation marks in a value name.
   foo = "\x3F" #raw byte value for a question mark.  Iffy!
Note that some of these would be errors in the existing parser, while 
others would be interpreted very differently.  Also note that 
allowing "\xhh" and "\ooo" byte codes could get a bit iffy in the 
future.  Right now, data is just stored as null-terminated byte 
strings, with no guaranteed interpretation of codes outside the 7-bit 
ASCII range.  We are planning to eventually use UTF-8 as the internal 
representation of data.  If your krb5.conf file is in UTF-8, and the 
byte codes specified either in raw form or via "\xhh" encoding are 
UTF-8, you will probably not see surprising behavior.  Other 
encodings may be unhappy when using byte codes.  If any of these are 
surprising or seem wrong, please let me know.
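
To illustrate why byte codes outside 7-bit ASCII are iffy, here is a 
small Python sketch of what a future UTF-8 interpretation might run 
into.  The function decode_value is purely hypothetical and is not part 
of any planned API; it only shows that the same character written as 
UTF-8 bytes decodes fine, while its Latin-1 byte does not.

```python
def decode_value(raw):
    """Return the value as text if the bytes are valid UTF-8, else None."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# "\x3F" is plain ASCII, so any interpretation agrees on it.
ascii_value = decode_value(b"\x3F")

# "caf\xc3\xa9" is e-acute encoded as UTF-8: decodes cleanly.
utf8_value = decode_value(b"caf\xc3\xa9")

# "caf\xe9" is e-acute in Latin-1: not valid UTF-8, so None here.
latin1_value = decode_value(b"caf\xe9")
```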

=> Error Recovery: In all error cases, reasonable recovery steps can 
be taken to continue parsing.  For example, if the first line is not 
a Section Line, it can be treated as a comment line.  As another 
example, if a Section Line does not contain an unquoted close square 
bracket ']', the parser can pretend that one exists.  Thus
   [foo  #whoops
can be interpreted as
   [foo ] #whoops
The default parser behavior is to note the error and perform error 
recovery.  Thus a tree will always be produced, regardless of syntax 
errors.   When the parser returns, it returns the tree as well as the 
list of errors.  The calling function can decide whether the errors 
are fatal or ignorable.  This allows the existing API to be 
implemented (most errors are fatal), as well as more flexible parse 
functions which allow certain classes of error.  I can, upon request, 
talk about the specific recovery steps planned for each of the error 
cases (i.e., please let me know).
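
As a rough illustration of the "return the tree plus the list of 
errors" idea, here is a Python sketch limited to Section Lines.  All 
names are hypothetical, the "tree" is reduced to a list of raw section 
names, other line types are simply skipped, and the input lines are 
assumed to have comments and initial whitespace already stripped.  The 
recovery steps shown are the two examples given above.

```python
def find_unquoted(text, target):
    """Return the index of the first `target` outside double quotes, or -1."""
    in_quotes = False
    i = 0
    while i < len(text):
        c = text[i]
        if in_quotes:
            if c == "\\":
                i += 1          # skip the escaped character
            elif c == '"':
                in_quotes = False
        elif c == '"':
            in_quotes = True
        elif c == target:
            return i
        i += 1
    return -1


def parse_sections(lines):
    """Return (raw_section_names, errors); syntax errors never raise."""
    names, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.startswith("["):
            if not names:
                # Recovery: text before the first section is treated as a comment.
                errors.append((lineno, "text before first section"))
            continue                      # other line types are out of scope here
        close = find_unquoted(line, "]")
        if close < 0:
            # Recovery: pretend the ']' exists at the end of the line.
            errors.append((lineno, "missing ']'"))
            close = len(line)
        elif line[close + 1:].strip():
            errors.append((lineno, "text after ']'"))
        names.append(line[1:close].strip())
    return names, errors


def parse_strict(lines):
    """A caller that treats every error as fatal."""
    names, errors = parse_sections(lines)
    if errors:
        raise ValueError(errors)
    return names
```

A more lenient caller could instead inspect the error list and ignore 
certain classes of error while still using the result.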

--------------------

Whew, that's it!  Mostly.  I have not specified here how comments are 
attached into the tree, although the existing API ignores comments 
anyway.  And I'm sure there are other issues I haven't touched on, so 
if you have questions... you know.  Thanks for your time and patience 
if you've read all this!  :-)

--Joe