Proposed lexer for parsing krb5.conf files

Paul Moore paul.moore at centrify.com
Mon Nov 21 14:32:43 EST 2005


Why not go crazy, drag krb5 kicking and screaming into the 21st
century, and make the config file XML?
You get the parser, editors, validation, low-level API, pretty display,
base syntax rules, and base extension capability all for free.

-----Original Message-----
From: krbdev-bounces at MIT.EDU [mailto:krbdev-bounces at MIT.EDU] On Behalf
Of Joseph Calzaretta
Sent: Monday, November 21, 2005 11:22 AM
To: krbdev at MIT.EDU
Subject: Proposed lexer for parsing krb5.conf files

Hello,

I'm working on rewriting the MIT Kerberos profile library code.  The new
work should:
- define a well-documented and rational format and grammar for krb5.conf
files, while supporting existing files as much as possible.
- provide a new well-documented API for profile manipulation, while
supporting the existing API (except for its bugs).

At this stage, I have a new proposed lexer design which reads in
krb5.conf files and splits them into tokens.  Please review (or at least
skim) and let me know if you have any concerns, questions, suggestions,
or ideas.  Thanks!

Yours,

Joe Calzaretta
Software Development & Integration Team
MIT Information Services & Technology

krb5.conf Lexer Design Proposal:
------------

Main objectives for the lexer:
- The lexer should allow any valid byte sequence to occur as a section
name, relation tag, or relation value.  For example, '=' may appear in
relation tags.
- The lexer should allow comments to appear anywhere in the krb5.conf
file.  For example, comments should be allowed within curly braces.
- The lexer should, as much as possible, support existing krb5.conf
files in use.  For example, the lexer should deal correctly with
"realms/REALMNAME/auth_to_local" relation values containing regular
expressions.

Lesser objectives:
- The lexer should, as much as possible, not use syntactic whitespace.
For example, "foo\n=bar\n" should be interpreted the same as "foo =
bar\n".

---

Terminology:

Tree structure:
Section Names are the top-level nodes of the tree, and appear within
square brackets in krb5.conf files.  (e.g., "[SectionName]")
Relation Tags are the non-top-level internal nodes of the tree, and
appear before equal signs in krb5.conf files.  (e.g., "RelationTag =")
Relation Values are the leaf nodes of the tree, and appear after equal
signs in krb5.conf files.  (e.g., "= RelationValue")
Subtrees are groups of nodes of the tree, and appear encased in curly
braces in krb5.conf files.  (e.g., "= { Subtree }")
Relations are the connections between a Relation Tag and either a
Subtree or a Relation Value.
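
Purely to illustrate the terminology, the tree produced on top of these
tokens might be represented by something like the struct below; this is
a hypothetical sketch, not the actual profile library's representation:

struct profile_node {
    char *name;                     /* section name or relation tag     */
    char *value;                    /* relation value; NULL for a       */
                                    /* section or a tag with a subtree  */
    struct profile_node *children;  /* first node of the subtree        */
    struct profile_node *next;      /* next sibling at the same level   */
};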

Token types:
'[' or TT_OPEN_SECTION
']' or TT_CLOSE_SECTION
'{' or TT_OPEN_SUBTREE
'}' or TT_CLOSE_SUBTREE
'=' or TT_RELATION
'#' or TT_COMMENT
'T' or TT_TEXT
'!' or TT_END
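
For illustration, the token types and the marker-based classification
described under "Behavior of the lexer" below could be written in C
roughly as follows (the names are only for this sketch, not the actual
profile library API):

#include <stdio.h>   /* for EOF */

typedef enum {
    TT_OPEN_SECTION,   /* '[' */
    TT_CLOSE_SECTION,  /* ']' */
    TT_OPEN_SUBTREE,   /* '{' */
    TT_CLOSE_SUBTREE,  /* '}' */
    TT_RELATION,       /* '=' */
    TT_COMMENT,        /* '#' (or ';') */
    TT_TEXT,           /* 'T' */
    TT_END             /* '!' */
} token_type;

/* Map a marker character to its token type. */
static token_type classify_marker(int marker)
{
    switch (marker) {
    case '[':  return TT_OPEN_SECTION;
    case ']':  return TT_CLOSE_SECTION;
    case '{':  return TT_OPEN_SUBTREE;
    case '}':  return TT_CLOSE_SUBTREE;
    case '=':  return TT_RELATION;
    case '#':
    case ';':  return TT_COMMENT;
    case EOF:  return TT_END;      /* no marker: end-of-stream */
    default:   return TT_TEXT;
    }
}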

Whitespace:
a WS (whitespace) character is any of ' ', '\f', '\v', '\t'.
an LB (linebreak) character is any of '\r', '\n'.
an LW (linear whitespace) character is any whitespace or linebreak
character.
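
These three classes translate directly into C predicates; a minimal
sketch (again, illustrative names only):

static int is_ws(int c) { return c == ' ' || c == '\f' || c == '\v' || c == '\t'; }
static int is_lb(int c) { return c == '\r' || c == '\n'; }
static int is_lw(int c) { return is_ws(c) || is_lb(c); }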

---

Behavior of the lexer:

The lexer splits the input stream into tokens.  The token type is
determined by the first non-LW character encountered.  This character is
called the marker.  If the marker is one of '[', ']', '{', '}', '=', or
'#', the token type is equal to the marker.  If the marker is ';', the
token type is '#' (TT_COMMENT).  If the marker is any other character,
the token type is 'T' (TT_TEXT).  Finally, if there is no marker (due to
end-of-stream), the token type is '!' (TT_END).  If end-of-stream is
encountered after the marker, the token terminates just before the
end-of-stream.  Additionally:

Tokens of type '[', ']', '{', '}', or '=' terminate just before the
first non-WS character after the marker.

Tokens of type TT_COMMENT terminate just before the first LB character
after the marker.

Tokens of type TT_TEXT follow a more complicated rule:
   They do not terminate within a quoted string (between an opening
quotation mark and the next unescaped quotation mark).
   When outside a quoted string, they terminate in any of the following
cases:
   (1)  when an LB character is encountered.  The token is terminated
just before the LB.
   (2)  when an '{', '}', or '=' character is encountered.  The token is
terminated just before the '{', '}', or '='.
   (3)  when a ']', '#', or ';' character is encountered, followed by
an LW character.  The token is terminated just before the ']', '#', or
';'.
   (4)  when a WS character is encountered, followed by an '['.  The
token is terminated just before the '['.
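
To make these rules concrete, here is a rough sketch of the text-token
termination logic, reusing the is_ws()/is_lb()/is_lw() predicates
sketched above; the lex_input wrapper and all other names are invented
for this sketch, and it is not the proposed implementation itself.
Cases (3) and (4) are why the input is wrapped with a two-character
pushback buffer (the lookahead mentioned under "Con" below):

#include <stdio.h>

struct lex_input {
    FILE *fp;
    int pushback[2];   /* two characters of pushback is enough */
    int npushed;
};

static int lex_getc(struct lex_input *in)
{
    return in->npushed > 0 ? in->pushback[--in->npushed] : getc(in->fp);
}

static void lex_ungetc(struct lex_input *in, int c)
{
    in->pushback[in->npushed++] = c;
}

/* Consume the rest of a TT_TEXT token after its marker has been read.
 * Accumulating the token's characters into a buffer is omitted, as is
 * the case where the marker itself is a quotation mark. */
static void scan_text_token(struct lex_input *in)
{
    int c, next, in_quote = 0, escaped = 0;

    while ((c = lex_getc(in)) != EOF) {
        if (in_quote) {                          /* never terminate inside quotes */
            if (escaped)
                escaped = 0;
            else if (c == '\\')
                escaped = 1;
            else if (c == '"')
                in_quote = 0;
            continue;
        }
        if (c == '"') {
            in_quote = 1;
            continue;
        }
        if (is_lb(c)) {                          /* case (1) */
            lex_ungetc(in, c);
            return;
        }
        if (c == '{' || c == '}' || c == '=') {  /* case (2) */
            lex_ungetc(in, c);
            return;
        }
        if (c == ']' || c == '#' || c == ';') {  /* case (3) */
            next = lex_getc(in);
            if (next == EOF)
                break;                           /* ']' stays; token ends at EOF */
            if (is_lw(next)) {
                lex_ungetc(in, next);
                lex_ungetc(in, c);
                return;
            }
            lex_ungetc(in, next);                /* ']'/'#'/';' stays in the text */
            continue;
        }
        if (is_ws(c)) {                          /* case (4) */
            next = lex_getc(in);
            if (next == '[') {
                lex_ungetc(in, next);            /* token ends before the '[' */
                return;
            }
            if (next != EOF)
                lex_ungetc(in, next);
            continue;
        }
        /* any other character is simply part of the token */
    }
    /* end-of-stream after the marker: the token ends just before it */
}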

Text canonicalization:

After tokenizing, tokens of type TT_TEXT are canonicalized to remove
internal quotes and trim whitespace:
- Any text within a quoted string is unescaped in the manner of ANSI C. 
(e.g., "[\\Huh\x3F]" => [\Huh?])
- All whitespace within a quoted string is preserved.
- Whitespace between two quoted strings is eliminated.  (This provides
string concatenation much like ANSI C.)
- Whitespace at the beginning and end of the token is eliminated.
- All other whitespace (i.e., whitespace before or after an unquoted
word) is condensed to a single space (like 'collapse' whitespace
handling in XML).

For example:
>This is            some    "wei"                 "rd"  text!
is canonicalized to
>This is some weird text!
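
As a rough sketch of the whitespace and quoting part of this
canonicalization (the ANSI C unescaping step is left out, and the
function name is invented for this sketch), operating in place on a
NUL-terminated copy of the token:

static void canonicalize_text(char *buf)
{
    char *src = buf, *dst = buf;
    int in_quote = 0;     /* inside a quoted string?                 */
    int pending_ws = 0;   /* unquoted whitespace seen but not copied */
    int after_quote = 0;  /* last output came from a quoted string?  */

    for (; *src != '\0'; src++) {
        if (in_quote) {
            if (*src == '\\' && src[1] == '"') {
                *dst++ = *++src;                /* keep escaped quote */
            } else if (*src == '"') {
                in_quote = 0;                   /* closing quote dropped */
                after_quote = 1;
            } else {
                *dst++ = *src;                  /* quoted whitespace preserved */
            }
        } else if (*src == ' ' || *src == '\f' || *src == '\v' || *src == '\t') {
            pending_ws = 1;                     /* remember, don't copy yet */
        } else {
            int opening_quote = (*src == '"');
            /* leading whitespace and whitespace between two quoted
             * strings is dropped; everything else collapses to ' ' */
            if (pending_ws && dst != buf && !(opening_quote && after_quote))
                *dst++ = ' ';
            pending_ws = 0;
            if (opening_quote) {
                in_quote = 1;                   /* opening quote dropped */
            } else {
                *dst++ = *src;
                after_quote = 0;
            }
        }
    }
    *dst = '\0';                                /* trailing whitespace dropped */
}

Run on the example above, this sketch produces the same canonical
"This is some weird text!".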

---

Notes on lexer behavior:

This lexer specifically does not support the "finalizer" token '*'.  We
suspect that this feature is not being used in practice (though please
correct us if we're wrong), and in any case it would be quite a
headache to support and implement properly.  The lexer treats the '*'
character as text.

Note that the semicolon ';' is a comment character, even though this is
largely undocumented and (according to my research) used much less
frequently than the pound sign '#' for comments.

The complicated rules for text tokens (specifically the whitespace parts
of cases (3) & (4)) are there specifically to support unquoted
auth_to_local values of the form
>RULE:[2:$1;$2](^.*;admin$)s/;admin$//
which must be interpreted as single text tokens despite containing '[',
']', and ';' characters.  In a perfect world these characters would have
been escaped or quoted to begin with, but this is not such a world.
Therefore, while we can interpret the line
>foo = bar; more text
as a value of "bar" assigned to the tag "foo" with a comment of "; more
text", we are forced to interpret a line like:
>foo = bar;more text
as a value of "bar;more text" assigned to the tag "foo".

This document does not describe the parser and how the tokens are
arranged into a tree structure.

---

Pros and cons of this lexer:

Pro:
- linebreaks are less syntactic than in the old parser.  Inputs like
"foo = \n\t\t bar\n", "foo \n = bar\n", "foo \n = \n bar\n", and
"foo = bar\n" are all interpreted the same way.
- all possible byte codes can appear in all text tokens (i.e., section
names, relation tags, and relation values).  In the previous parser,
section names could not contain the ']' character, relation tags could
not contain spaces or the '=' character, and relation values could not
contain the '{' character.  Additionally, the new parser always allows
unprintable characters to be represented as the more portable quoted
hex codes "\x##".
- the lexer does not need to maintain any state or to interact with the
parser in a complicated way.

Con:
- the lexer requires two-character lookahead.  (case (4) for text
tokens)
- text token termination rules are not intuitive and may cause
confusion. 
(e.g., "bar# baz", "bar[ baz", "bar ]baz", "bar? baz", "bar ?baz", and
"bar baz" are all interpreted as single text tokens, but "bar #baz",
"bar [baz", and "bar] baz", "bar{ baz", "bar {baz", and "bar \n baz" are
not.)

---

krb5.conf examples:

The following files are shown split into tokens (types only):

File:
>[normal]  #c1
>   foo = bar #c2
>   baz = { #c3
>     quux = quuux quuuux #c4
>   } #c5
is split into:
>[ T ] # T = T # T = { # T = T # } # !

File:
>[one-liner] foo = { bar = { baz = quux } } #compact
is split into:
>[ T ] T = { T = { T = T } } # !

File:
>[       # This is technically allowable,
>long    # but do we want to support it?
>]       # It doesn't seem that crazy,
>foo     # but then again,
>=       # it doesn't seem that vital either.
>bar     # I don't know.
is split into:
>[ # T # ] # T # = # T # !

------------



