Proposed lexer for parsing krb5.conf files
Joseph Calzaretta
saltine at MIT.EDU
Mon Nov 21 14:22:05 EST 2005
Hello,
I'm working on rewriting the MIT Kerberos profile library code. The new
work should:
- define a well-documented and rational format and grammar for krb5.conf
files, while supporting existing files as much as possible.
- provide a new well-documented API for profile manipulation, while
supporting the existing API (except for its bugs)
At this stage, I have a new proposed lexer design which reads in krb5.conf
files and splits them into tokens. Please review (or at least skim) and let
me know if you have any concerns, questions, suggestions, or ideas. Thanks!
Yours,
Joe Calzaretta
Software Development & Integration Team
MIT Information Services & Technology
krb5.conf Lexer Design Proposal:
------------
Main objectives for the lexer:
- The lexer should allow for any valid byte sequence to occur as section
names, relation tags, or relation values. For example, '=' may appear in
relation tags.
- The lexer should allow comments to appear anywhere in the krb5.conf
file. For example, comments should be allowed within curly braces.
- The lexer should, as much as possible, support existing krb5.conf files
in use. For example, the lexer should deal correctly with
"realms/REALMNAME/auth_to_local" relation values containing regular
expressions.
Lesser objectives:
- The lexer should, as much as possible, not use syntactic whitespace. For
example, "foo\n=bar\n" should be interpreted the same as "foo = bar\n".
---
Terminology:
Tree structure:
Section Names are the top-level nodes of the tree, and appear within square
brackets in krb5.conf files. (e.g., "[SectionName]")
Relation Tags are the non-top-level internal nodes of the tree, and appear
before equal signs in krb5.conf files. (e.g., "RelationTag =")
Relation Values are the leaf nodes of the tree, and appear after equal
signs in krb5.conf files. (e.g., "= RelationValue")
Subtrees are groups of nodes of the tree, and appear encased in curly
braces in krb5.conf files. (e.g., "= { Subtree }")
Relations are the connections between a Relation Tag and either a Subtree
or a Relation Value.
Token types:
'[' or TT_OPEN_SECTION
']' or TT_CLOSE_SECTION
'{' or TT_OPEN_SUBTREE
'}' or TT_CLOSE_SUBTREE
'=' or TT_RELATION
'#' or TT_COMMENT
'T' or TT_TEXT
'!' or TT_END
Whitespace:
a WS (whitespace) character is any of ' ', '\f', '\v', '\t'.
an LB (linebreak) character is any of '\r', '\n'.
an LW (linear whitespace) character is any whitespace or linebreak character.
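To make the character classes and token types concrete, here is a rough Python sketch (all names are mine and purely illustrative; the proposal itself does not specify an implementation):

```python
# Illustrative sketch of the character classes and token types above;
# names are hypothetical, not from any existing or proposed code.

WS = set(" \f\v\t")   # whitespace characters
LB = set("\r\n")      # linebreak characters
LW = WS | LB          # linear whitespace characters

# Token types, keyed by their single-character mnemonics.
TOKEN_TYPES = {
    '[': 'TT_OPEN_SECTION',
    ']': 'TT_CLOSE_SECTION',
    '{': 'TT_OPEN_SUBTREE',
    '}': 'TT_CLOSE_SUBTREE',
    '=': 'TT_RELATION',
    '#': 'TT_COMMENT',
    'T': 'TT_TEXT',
    '!': 'TT_END',
}

def classify_marker(c):
    """Map the first non-LW character (the marker) to a token type."""
    if c is None:
        return '!'        # no marker at all: end-of-stream
    if c in "[]{}=#":
        return c          # token type equals the marker itself
    if c == ';':
        return '#'        # semicolons also begin comments
    return 'T'            # anything else begins a text token
```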
---
Behavior of the lexer:
The lexer splits the input stream into tokens. The token type is
determined by the first non-LW character encountered. This character is
called the marker. If the marker is one of '[', ']', '{', '}', '=', or
'#', the token type is equal to the marker. If the marker is ';', the
token type is '#' (TT_COMMENT). If the marker is any other character, the
token type is 'T' (TT_TEXT). Finally, if there is no marker (due to
end-of-stream), the token type is '!' (TT_END). If end-of-stream is
encountered after the marker, the token terminates just before the
end-of-stream. Additionally:
Tokens of type '[', ']', '{', '}', or '=' terminate just before the
first non-WS character after the marker.
Tokens of type TT_COMMENT terminate just before the first LB character
after the marker.
Tokens of type TT_TEXT follow a more complicated rule:
They do not terminate within a quoted string (between a quotation mark
and an unescaped quotation mark).
When outside a quoted string, they terminate in any of the following cases:
(1) when an LB character is encountered. The token is terminated just
before the LB.
(2) when an '{', '}', or '=' character is encountered. The token is
terminated just before the '{', '}', or '='.
(3) when a ']', '#', or ';' is encountered, followed by an LW
character. The token is terminated just before the ']', '#', or ';'.
(4) when a WS character is encountered, followed by a '['. The token
is terminated just before the '['.
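For concreteness, here is a minimal Python sketch of these tokenization rules exactly as stated (names are mine; this is an illustration of the rules, not the proposed implementation):

```python
# Minimal, illustrative tokenizer for the rules as stated above.
# Hypothetical sketch only; not the proposed C implementation.

WS = set(" \f\v\t")   # whitespace
LB = set("\r\n")      # linebreaks
LW = WS | LB          # linear whitespace

def tokenize(s):
    """Split s into (type, text) pairs; type is the one-character mnemonic."""
    toks = []
    i, n = 0, len(s)
    while True:
        while i < n and s[i] in LW:      # skip LW to find the marker
            i += 1
        if i >= n:
            toks.append(('!', ''))       # no marker: end-of-stream
            return toks
        c = s[i]
        if c in "[]{}=":
            # Terminate just before the first non-WS character after the marker.
            j = i + 1
            while j < n and s[j] in WS:
                j += 1
            toks.append((c, s[i:j]))
        elif c in "#;":
            # Comment: terminate just before the first linebreak.
            j = i + 1
            while j < n and s[j] not in LB:
                j += 1
            toks.append(('#', s[i:j]))
        else:
            # Text token: quoted strings suppress termination.
            j, quoted = i, False
            while j < n:
                ch = s[j]
                if quoted:
                    if ch == '\\' and j + 1 < n:
                        j += 2                          # skip escaped char
                        continue
                    if ch == '"':
                        quoted = False
                elif ch == '"':
                    quoted = True
                elif ch in LB or ch in "{}=":           # cases (1) and (2)
                    break
                elif ch in "]#;" and j + 1 < n and s[j + 1] in LW:
                    break                               # case (3)
                elif ch in WS and j + 1 < n and s[j + 1] == '[':
                    j += 1                              # case (4)
                    break
                j += 1
            toks.append(('T', s[i:j]))
        i = j
```

Note this follows case (3) exactly as stated: a ']', '#', or ';' ends a text token only when the next character is linear whitespace.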
Text canonicalization:
After tokenizing, tokens of type TT_TEXT are canonicalized to remove
internal quotes and trim whitespace:
- Any text within a quoted string is unescaped in the manner of ANSI C.
(e.g., "[\\Huh\x3F]" => [\Huh?])
- All whitespace within a quoted string is preserved.
- Whitespace between two quoted strings is eliminated. (This provides
string concatenation much as in ANSI C.)
- Whitespace at the beginning and end of the token is eliminated.
- All other whitespace (i.e., whitespace before or after an unquoted word)
is condensed to a single space (like 'collapse' whitespace handling in XML).
For example:
>This is some "wei" "rd" text!
is canonicalized to
>This is some weird text!
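A rough Python sketch of this canonicalization (names and the deliberately partial escape table are mine; purely illustrative):

```python
# Hypothetical sketch of TT_TEXT canonicalization as described above;
# illustrative only, with a partial escape table for brevity.
import re

WS = " \f\v\t"

# A few ANSI C escapes for illustration; a real implementation
# would cover the full set.
_ESCAPES = {'n': '\n', 't': '\t', '\\': '\\', '"': '"'}

def _unescape(body):
    """Unescape the contents of one quoted string in the manner of ANSI C."""
    out, i = [], 0
    while i < len(body):
        if body[i] == '\\' and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt == 'x':                          # "\x##" hex escape
                out.append(chr(int(body[i + 2:i + 4], 16)))
                i += 4
                continue
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
            continue
        out.append(body[i])
        i += 1
    return ''.join(out)

def canonicalize(text):
    """Apply the canonicalization rules to one TT_TEXT token."""
    # Odd-numbered parts are quoted strings, even-numbered parts are not.
    parts = re.split(r'("(?:\\.|[^"\\])*")', text)
    out = []
    for k, part in enumerate(parts):
        if k % 2:
            out.append(_unescape(part[1:-1]))       # quotes removed, WS kept
        elif 0 < k < len(parts) - 1 and part.strip(WS) == '':
            pass                  # whitespace between two quoted strings: drop
        else:
            # Collapse remaining whitespace runs to a single space.
            out.append(re.sub('[' + re.escape(WS) + ']+', ' ', part))
    return ''.join(out).strip(WS)                   # trim the ends
```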
---
Notes on lexer behavior:
This lexer specifically does not support the "finalizer" token '*'. We
suspect that this feature is not being used in practice (although we should
certainly be corrected if we're wrong), and even if it is, it is quite a
headache to support and implement properly. The lexer treats the '*'
character as ordinary text.
Note that the semicolon ';' is a comment character, even though this is
largely undocumented and (according to my research) used much less
frequently than the pound sign '#' for comments.
The complicated rules for text tokens (specifically the whitespace parts of
cases (3) & (4)) are there specifically to support unquoted auth_to_local
values of the form
>RULE:[2:$1;$2](^.*;admin$)s/;admin$//
which must be interpreted as single text tokens despite containing '[',
']', and ';' characters. In a perfect world these characters would have
been escaped or quoted to begin with, but this is not such a
world. Therefore, while we can interpret the line
>foo = bar; more text
as a value of "bar" assigned to the tag "foo" with a comment of "; more
text", we are forced to interpret a line like:
>foo = bar;more text
as a value of "bar;more text" assigned to the tag "foo".
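The distinction comes down to a single lookahead character. A tiny self-contained check (the function name and standalone form are mine, not part of the proposal):

```python
# Illustrative check of text-token rule (3) for semicolons;
# hypothetical helper, not part of the proposed lexer itself.

LW = set(" \f\v\t\r\n")   # linear whitespace

def semicolon_ends_text(line, i):
    """True if the ';' at index i ends the text token (starting a comment),
    i.e., per rule (3), when the next character is linear whitespace."""
    return line[i] == ';' and i + 1 < len(line) and line[i + 1] in LW
```

So "foo = bar; more text" (whitespace after the ';') produces a comment, while "foo = bar;more text" keeps the ';' inside the value.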
This document does not describe the parser and how the tokens are arranged
into a tree structure.
---
Pros and cons of this lexer:
Pro:
- linebreaks are less syntactic than in the old parser. Inputs like "foo =
\n\t\t bar\n", "foo \n = bar\n", "foo \n = \n bar\n", and "foo = bar\n" are
all interpreted the same way.
- all possible byte codes can appear in all text tokens (i.e., section
names, relation tags, and relation values). In the previous parser,
section names could not contain the ']' character, relation tags could not
contain spaces or the '=' character, and relation values could not contain
the '{' character. Additionally, the new parser allows unprintable
characters to always be represented as more portable quoted hex codes "\x##".
- the lexer does not need to maintain any state or to interact with the
parser in a complicated way.
Con:
- the lexer requires two-character lookahead. (case (4) for text tokens)
- text token termination rules are not intuitive and may cause confusion.
(e.g., "bar# baz", "bar[ baz", "bar ]baz", "bar? baz", "bar ?baz", and "bar
baz" are all interpreted as single text tokens, but "bar #baz", "bar [baz",
"bar] baz", "bar{ baz", "bar {baz", and "bar \n baz" are not.)
---
krb5.conf examples:
The following files are shown split into tokens (types only):
File:
>[normal] #c1
> foo = bar #c2
> baz = { #c3
> quux = quuux quuuux #c4
> } #c5
is split into:
>[ T ] # T = T # T = { # T = T # } # !
File:
>[one-liner] foo = { bar = { baz = quux } } #compact
is split into:
>[ T ] T = { T = { T = T } } # !
File:
>[ # This is technically allowable,
>long # but do we want to support it?
>] # It doesn't seem that crazy,
>foo # but then again,
>= # it doesn't seem that vital either.
>bar # I don't know.
is split into:
>[ # T # ] # T # = # T # !
------------