(Final?) krb5.conf Lexer/Parser Proposal
Joseph Calzaretta
saltine at MIT.EDU
Wed Jan 4 17:46:34 EST 2006
Hello and Happy New Year!
Please recall the krb5.conf lexer proposals made and discussed
earlier (links:
http://mailman.mit.edu/pipermail/krbdev/2005-November/003892.html ,
http://mailman.mit.edu/pipermail/krbdev/2005-December/003944.html ).
After discovering more and more unfriendly edge cases, including a
nasty one printed in an O'Reilly book, I have come to the
unfortunate conclusion that it is not possible to robustly make the
lexer whitespace-neutral while supporting existing krb5.conf
files. Whitespace is and will be syntactic. A moment of silence for
the previous proposal.
The new lexer/parser proposal embraces line-based syntax and is much
less of a change from the existing parser. As a first-order
approximation, the new proposal is the same as the existing parser except that:
- comments beginning with the pound sign '#' may appear at the end
of any line.
- quoted strings are acceptable everywhere, not just in relation
values. Consequently, all section, tag, and value names can contain
any characters.
- the "final" signifier (asterisk '*') is no longer supported or
treated as special.
Therefore, please feel free to skip or skim the nitty-gritty part
below, and look at the Noteworthy Features section for interesting
changes. If you have any concerns, questions, or (non-xml-themed)
suggestions, please let me know. Thanks!
Yours,
Joe Calzaretta
Software Development & Integration Team
MIT Information Services & Technology
-----------------
The New krb5.conf Lexer/Parser Proposal, a.k.a., the Nitty-Gritty Part:
The input file is divided into lines, each terminated by a linebreak (or
by end-of-stream for the last line).
For each line:
- initial whitespace is ignored.
- all text after an unquoted pound sign '#' (including the pound
sign itself) is stored as comment text and subsequently ignored.
- if the first nonwhitespace character is a semicolon ';', the
entire line is stored as comment text and subsequently ignored.
- if the line is all whitespace, the entire line is ignored.
At this point, leading whitespace, comments, and blank lines have been stripped.
After this processing, the first character of the line is examined.
- If this character is an open square bracket '[', the line is a Section Line.
- If this character is a close curly brace '}', the line is a
CloseSubsection Line.
- If this character is an open curly brace '{', AND the previous
line was a DanglingSubsection Line (more later), the line is a
RescuedSubsection Line.
- Otherwise, the line is a SubsectionOrRelation Line.
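The stripping and first-character classification above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the planned implementation: the names find_unquoted, strip_line, and classify are hypothetical, a "quoted string" is assumed to be delimited by double quotes with backslash escapes, and the caller is assumed to decide whether blank or comment-only lines count as "the previous line".

```python
def find_unquoted(text, target, start=0):
    """Return the index of the first `target` outside double quotes, or -1."""
    in_quotes = False
    i = start
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == "\\":
                i += 1  # skip the escaped character (assumed C-style escapes)
            elif ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
        elif ch == target:
            return i
        i += 1
    return -1


def strip_line(line):
    """Drop leading whitespace and any comment; '' means nothing is left."""
    text = line.lstrip()
    if text.startswith(";"):
        return ""  # semicolon comments only count at the start of a line
    pos = find_unquoted(text, "#")
    return text if pos == -1 else text[:pos]


def classify(stripped, previous_kind):
    """Classify a stripped line by its first character."""
    if not stripped:
        return "Blank"  # hypothetical label for ignored lines
    first = stripped[0]
    if first == "[":
        return "Section"
    if first == "}":
        return "CloseSubsection"
    if first == "{" and previous_kind == "DanglingSubsection":
        return "RescuedSubsection"
    return "SubsectionOrRelation"
```

Note that strip_line leaves trailing whitespace in place; the canonicalization rules described later take care of it.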
Section Lines:
All text after the initial open square bracket '[' and up to but
not including the first unquoted close square bracket ']' is
considered the Raw Section Name. A line without the close bracket or
with any nonwhitespace text after the close bracket is considered an error.
CloseSubsection Lines:
Any nonwhitespace text after the initial close curly brace '}' is
considered an error.
SubsectionOrRelation Lines:
All text up to but not including the first unquoted equal sign '='
is considered the Raw Tag name. A line without such an equal sign is
considered an error.
The first nonwhitespace character after such an equal sign '=' is
examined.
If this character is an open curly brace '{', the line is a
Subsection Line.
If there is no such character, the line is a DanglingSubsection Line.
Otherwise, the line is a Relation Line.
Subsection and RescuedSubsection Lines:
Any nonwhitespace text after the unquoted open curly brace '{' is
considered an error.
Relation Lines:
All text after the unquoted equal sign '=' is considered the Raw
Value Name.
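As a rough illustration of the SubsectionOrRelation rules, here is a Python sketch that splits an already-stripped line at its first unquoted equal sign. The function names and the tuple-shaped result are hypothetical, and quoted strings are assumed to use double quotes with backslash escapes; find_unquoted is repeated here only so the block runs on its own.

```python
def find_unquoted(text, target, start=0):
    """Return the index of the first `target` outside double quotes, or -1."""
    in_quotes = False
    i = start
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
        elif ch == target:
            return i
        i += 1
    return -1


def split_tag_line(stripped):
    """Return (kind, raw_tag, raw_value) for a SubsectionOrRelation line."""
    eq = find_unquoted(stripped, "=")
    if eq == -1:
        return ("Error", stripped, None)  # no unquoted '=' on the line
    raw_tag = stripped[:eq]
    rest = stripped[eq + 1:]
    after = rest.lstrip()
    if not after:
        return ("DanglingSubsection", raw_tag, None)
    if after[0] == "{":
        if after[1:].strip():
            return ("Error", raw_tag, None)  # text after the open brace
        return ("Subsection", raw_tag, None)
    return ("Relation", raw_tag, rest)
```

The raw tag and raw value are returned uncanonicalized, so they may still carry surrounding whitespace and quotes.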
Raw Name Canonicalization:
All Raw Section/Tag/Value Names are canonicalized as follows:
Any text within a quoted string is unescaped in the manner of
ANSI C (C90 spec).
(e.g., "[\\Huh\x3F]" => [\Huh?])
All whitespace within a quoted string is preserved.
Whitespace between two quoted strings is eliminated. (provides string
concatenation much like ANSI C)
Whitespace at the beginning and end of the Raw Name is eliminated.
All other whitespace (i.e., whitespace before or after an unquoted word)
is condensed to a single space (like 'collapse' whitespace handling in XML).
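The canonicalization rules above might look something like this in Python. Treat it as a sketch under assumptions: read_quoted and canonicalize are hypothetical names, the result is a Python string of code points rather than a byte string, "\x" is limited to two hex digits (C itself consumes every following hex digit), and an unrecognized escape simply keeps the escaped character.

```python
SIMPLE = {"a": "\a", "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
          "v": "\v", "\\": "\\", "'": "'", '"': '"', "?": "?"}
OCT = "01234567"
HEX = "0123456789abcdefABCDEF"


def read_quoted(raw, i):
    """Unescape the quoted string opening at raw[i]; return (text, next index)."""
    out = []
    i += 1
    while i < len(raw) and raw[i] != '"':
        ch = raw[i]
        if ch != "\\" or i + 1 >= len(raw):
            out.append(ch)
            i += 1
            continue
        nxt = raw[i + 1]
        if nxt in SIMPLE:
            out.append(SIMPLE[nxt])
            i += 2
        elif nxt in OCT:  # \ooo: one to three octal digits
            j = i + 1
            while j < len(raw) and j < i + 4 and raw[j] in OCT:
                j += 1
            out.append(chr(int(raw[i + 1:j], 8) & 0xFF))
            i = j
        elif nxt == "x" and i + 2 < len(raw) and raw[i + 2] in HEX:
            j = i + 2  # \xhh: at most two hex digits (assumption)
            while j < len(raw) and j < i + 4 and raw[j] in HEX:
                j += 1
            out.append(chr(int(raw[i + 2:j], 16)))
            i = j
        else:
            out.append(nxt)  # unknown escape: keep the character (assumption)
            i += 2
    return "".join(out), i + 1


def canonicalize(raw):
    """Apply the quoting and whitespace rules to a Raw Name."""
    pieces = []  # (kind, text): "q" quoted, "u" unquoted word, "ws" whitespace
    i = 0
    while i < len(raw):
        if raw[i] == '"':
            text, i = read_quoted(raw, i)
            pieces.append(("q", text))
        elif raw[i].isspace():
            while i < len(raw) and raw[i].isspace():
                i += 1
            pieces.append(("ws", " "))
        else:
            j = i
            while j < len(raw) and raw[j] != '"' and not raw[j].isspace():
                j += 1
            pieces.append(("u", raw[i:j]))
            i = j
    out = []
    for k, (kind, text) in enumerate(pieces):
        if kind != "ws":
            out.append(text)
        elif k == 0 or k == len(pieces) - 1:
            continue  # whitespace at either end is eliminated
        elif pieces[k - 1][0] == "q" and pieces[k + 1][0] == "q":
            continue  # whitespace between two quoted strings: concatenation
        else:
            out.append(" ")  # everything else collapses to one space
    return "".join(out)
```

With these assumptions, canonicalize(r'"[\\Huh\x3F]"') gives [\Huh?], matching the example in the rules.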
Lines may generally occur in any order, but some situations are
considered errors. Errors occur:
- If the first line is not a Section Line.
- If a DanglingSubsection Line is not immediately followed by a
RescuedSubsection Line.
- If a Section Line or the end-of-stream appears within a Subsection
(after fewer CloseSubsection lines than Subsection/RescuedSubsection Lines).
- If a CloseSubsection Line appears outside of a Subsection (after
an equal number of CloseSubsection and Subsection/RescuedSubsection Lines).
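These ordering rules amount to tracking a nesting depth. A minimal Python sketch, assuming the lines have already been classified and that blank and comment-only lines were dropped beforehand; check_order, the kind labels as strings, and the message texts are all hypothetical, and resetting the depth at a misplaced Section Line is just one possible recovery choice.

```python
def check_order(kinds):
    """Return a list of (line_index, message) for a sequence of line kinds."""
    errors = []
    depth = 0        # Subsection/RescuedSubsection lines not yet closed
    previous = None
    for index, kind in enumerate(kinds):
        if index == 0 and kind != "Section":
            errors.append((index, "first line is not a Section Line"))
        if previous == "DanglingSubsection" and kind != "RescuedSubsection":
            errors.append((index - 1, "DanglingSubsection Line not rescued"))
        if kind in ("Subsection", "RescuedSubsection"):
            depth += 1
        elif kind == "CloseSubsection":
            if depth == 0:
                errors.append((index, "CloseSubsection outside a Subsection"))
            else:
                depth -= 1
        elif kind == "Section" and depth > 0:
            errors.append((index, "Section Line inside a Subsection"))
            depth = 0  # assumed recovery: close all open subsections
        previous = kind
    if previous == "DanglingSubsection":
        errors.append((len(kinds) - 1, "DanglingSubsection Line not rescued"))
    if depth > 0:
        errors.append((len(kinds), "end-of-stream inside a Subsection"))
    return errors
```

An empty list means the sequence of lines is well ordered.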
---------------------------------
Noteworthy Features of the Proposed Lexer/Parser
=> The asterisk '*' signifier for "final" lines is no longer
supported. Asterisks are not considered special characters at
all. If this is undesirable or surprising, please let me know.
=> The semicolon ';' only signifies a comment at the beginning of a
line, whereas the pound sign '#' signifies a comment whenever it
appears unquoted. Note the following lines:
# a comment
foo = bar # a comment
foo = "bar # NOT a comment"
; a comment
foo = bar ; NOT a comment
Why this difference? Some existing krb5.conf files' relation values
(notably the auth_to_local value) may have unquoted semicolons ';' in
them. As far as we have seen, no existing krb5.conf files' relation
values or tags use unquoted pound signs '#'. If this is untrue,
please let me know.
=> Relation values may not start with an unquoted open curly brace
'{'. For example, the line:
foo = { bar
is considered an error. Note that the existing parser would treat
this as a relation assigning the value "{ bar" to the tag "foo". The
existing parser's behavior is confusing enough that it is probably
best discarded. If this is untrue, please let me know.
=> DanglingSubsections and RescuedSubsections: The existing parser
allows the open curly brace '{' for subsections to appear on the line
after the equal sign '=', like so:
foo = # dangling subsection
{ # rescued subsection, yay!
bar = baz
}
This syntax is pretty, because you can line up the open and close
curlies. But it violates the one-line-per-element
linebreaks-are-syntactic rule to which the parser otherwise strictly
adheres. Personally, I don't like this because it is an extra layer
of complexity, and the corresponding format for relation values is not valid:
foo = # dangling relation?
bar # can't be rescued, and doesn't parse. boo!
Anyway, this syntax continues to be supported in the new proposal
(actually improved because the existing parser doesn't allow comments
to appear on lines between the equal sign and the open curly
brace). If anyone thinks this should be eliminated or supported
differently, please let me know.
=> All section names, tag names, and value names may contain any
character (except for null '\0') including whitespace. Since all
such names also support ANSI C quoted strings, there is a way to
include any special character. For example,
["foo]"] # close bracket in a section name
"} foo" = bar # close curly brace at start of a tag name.
"foo " = bar # space at the end of a tag name
foo bar = baz #single collapsed space in the middle of a tag name.
"foo=" = bar # equal sign in a tag name.
"#foo" = bar #pound sign in a tag name.
foo = "{ bar" #open curly brace at start of a value name.
foo = "\"bar\"" #quotation marks in a value name.
foo = "\x3F" #raw byte value for a question mark. Iffy!
Note that some of these would be errors in the existing parser, while
others would be interpreted much differently. Also note that
allowing "\xhh" and "\ooo" byte codes could get a bit iffy in the
future. Right now, data is just stored as null-terminated byte
strings, with no guaranteed interpretation of codes outside the 7-bit
ASCII range. We are planning to eventually use UTF-8 as the internal
representation of data. If your krb5.conf file is in UTF-8, and the
byte codes specified either in raw form or via "\xhh" encoding are
UTF-8, you will probably not see surprising behavior. Other
encodings may be unhappy when using byte codes. If any of these are
surprising or seem wrong, please let me know.
=> Error Recovery: In all error cases, reasonable recovery steps can
be taken to continue parsing. For example, if the first line is not
a Section Line, it can be treated as a comment line. As another
example, if a Section Line does not contain an unquoted close square
bracket ']', the parser can pretend that one exists. Thus
[foo #whoops
can be interpreted as
[foo ] #whoops
The default parser behavior is to note the error and perform error
recovery. Thus a tree will always be produced, regardless of syntax
errors. When the parser returns, it returns the tree as well as the
list of errors. The calling function can decide whether the errors
are fatal or ignorable. This allows the existing API to be
implemented (most errors are fatal), as well as more flexible parse
functions which allow certain classes of error. I can, upon request,
talk about the specific recovery steps planned for each of the error
cases (i.e., please let me know).
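To make the "tree plus list of errors" idea concrete, here is a small Python sketch of how callers could apply different policies to the same parse result. Everything named here (ParseIssue, ConfigSyntaxError, strict, lenient, and the error class strings) is hypothetical, and strict treats every recorded error as fatal, which is a simplification of "most errors are fatal".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseIssue:
    line: int
    code: str  # hypothetical error class, e.g. "missing-close-bracket"


class ConfigSyntaxError(Exception):
    """Raised when a caller decides the recorded errors are fatal."""


def strict(result):
    """Existing-API style policy: any recorded error is fatal."""
    tree, issues = result
    if issues:
        raise ConfigSyntaxError(issues)
    return tree


def lenient(result, ignorable):
    """Accept the tree if every recorded error is in an ignorable class."""
    tree, issues = result
    fatal = [issue for issue in issues if issue.code not in ignorable]
    if fatal:
        raise ConfigSyntaxError(fatal)
    return tree
```

For instance, given a result of ({"foo": {}}, [ParseIssue(1, "missing-close-bracket")]) for the "[foo #whoops" line, lenient with that class marked ignorable returns the tree, while strict raises.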
--------------------
Whew, that's it! Mostly. I have not specified here how comments are
attached into the tree, although the existing API ignores comments
anyway. And I'm sure there are other issues I haven't touched on, so
if you have questions... you know. Thanks for your time and patience
if you've read all this! :-)
--Joe