Proposed lexer for parsing krb5.conf files
Joseph Calzaretta
saltine at MIT.EDU
Mon Nov 21 14:22:05 EST 2005
Hello,
I'm working on rewriting the MIT Kerberos profile library code. The new
work should:
- define a well-documented and rational format and grammar for krb5.conf
files, while supporting existing files as much as possible.
- provide a new well-documented API for profile manipulation, while
supporting the existing API (except for its bugs)
At this stage, I have a new proposed lexer design which reads in krb5.conf
files and splits them into tokens. Please review (or at least skim) and let
me know if you have any concerns, questions, suggestions, or ideas. Thanks!
Yours,
Joe Calzaretta
Software Development & Integration Team
MIT Information Services & Technology
krb5.conf Lexer Design Proposal:
------------
Main objectives for the lexer:
- The lexer should allow for any valid byte sequence to occur as section
names, relation tags, or relation values. For example, '=' may appear in
relation tags.
- The lexer should allow comments to appear anywhere in the krb5.conf
file. For example, comments should be allowed within curly braces.
- The lexer should, as much as possible, support existing krb5.conf files
in use. For example, the lexer should deal correctly with
"realms/REALMNAME/auth_to_local" relation values containing regular
expressions.
Lesser objectives:
- The lexer should, as much as possible, not use syntactic whitespace. For
example, "foo\n=bar\n" should be interpreted the same as "foo = bar\n".
---
Terminology:
Tree structure:
Section Names are the top-level nodes of the tree, and appear within square
brackets in krb5.conf files. (e.g., "[SectionName]")
Relation Tags are the non-top-level internal nodes of the tree, and appear
before equal signs in krb5.conf files. (e.g., "RelationTag =")
Relation Values are the leaf nodes of the tree, and appear after equal
signs in krb5.conf files. (e.g., "= RelationValue")
Subtrees are groups of nodes of the tree, and appear encased in curly
braces in krb5.conf files. (e.g., "= { Subtree }")
Relations are the connections between a Relation Tag and either a Subtree
or a Relation Value.
Token types:
'[' or TT_OPEN_SECTION
']' or TT_CLOSE_SECTION
'{' or TT_OPEN_SUBTREE
'}' or TT_CLOSE_SUBTREE
'=' or TT_RELATION
'#' or TT_COMMENT
'T' or TT_TEXT
'!' or TT_END
Whitespace:
a WS (whitespace) character is any of ' ', '\f', '\v', '\t'.
an LB (linebreak) character is any of '\r', '\n'.
an LW (linear whitespace) character is any whitespace or linebreak character.
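To make the character classes and token types concrete, here is a rough Python sketch (all names are mine and purely illustrative; the proposal itself does not specify an implementation):

```python
# Illustrative sketch of the character classes and token types above;
# names are hypothetical, not from any existing or proposed code.

WS = set(" \f\v\t")   # whitespace characters
LB = set("\r\n")      # linebreak characters
LW = WS | LB          # linear whitespace characters

# Token types, keyed by their single-character mnemonics.
TOKEN_TYPES = {
    '[': 'TT_OPEN_SECTION',
    ']': 'TT_CLOSE_SECTION',
    '{': 'TT_OPEN_SUBTREE',
    '}': 'TT_CLOSE_SUBTREE',
    '=': 'TT_RELATION',
    '#': 'TT_COMMENT',
    'T': 'TT_TEXT',
    '!': 'TT_END',
}

def classify_marker(c):
    """Map the first non-LW character (the marker) to a token type."""
    if c is None:
        return '!'        # no marker at all: end-of-stream
    if c in "[]{}=#":
        return c          # token type equals the marker itself
    if c == ';':
        return '#'        # semicolons also begin comments
    return 'T'            # anything else begins a text token
```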
---
Behavior of the lexer:
The lexer splits the input stream into tokens. The token type is
determined by the first non-LW character encountered. This character is
called the marker. If the marker is one of '[', ']', '{', '}', '=', or
'#', the token type is equal to the marker. If the marker is ';', the
token type is '#' (TT_COMMENT). If the marker is any other character, the
token type is 'T' (TT_TEXT). Finally, if there is no marker (due to
end-of-stream), the token type is '!' (TT_END). If end-of-stream is
encountered after the marker, the token terminates just before the
end-of-stream. Additionally:
Tokens of type '[', ']', '{', '}', or '=' terminate just before the
first non-WS character after the marker.
Tokens of type TT_COMMENT terminate just before the first LB character
after the marker.
Tokens of type TT_TEXT follow a more complicated rule:
They do not terminate within a quoted string (between a quotation mark
and an unescaped quotation mark).
When outside a quoted string, they terminate in any of the following cases:
(1) when an LB character is encountered. The token is terminated just
before the LB.
(2) when an '{', '}', or '=' character is encountered. The token is
terminated just before the '{', '}', or '='.
(3) when a ']', '#', or ';' is encountered, followed by an LW
character. The token is terminated just before the ']', '#', or ';'.
(4) when a WS character is encountered, followed by a '['. The token
is terminated just before the '['.
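For concreteness, here is a minimal Python sketch of these tokenization rules exactly as stated (names are mine; this is an illustration of the rules, not the proposed implementation):

```python
# Minimal, illustrative tokenizer for the rules as stated above.
# Hypothetical sketch only; not the proposed C implementation.

WS = set(" \f\v\t")   # whitespace
LB = set("\r\n")      # linebreaks
LW = WS | LB          # linear whitespace

def tokenize(s):
    """Split s into (type, text) pairs; type is the one-character mnemonic."""
    toks = []
    i, n = 0, len(s)
    while True:
        while i < n and s[i] in LW:      # skip LW to find the marker
            i += 1
        if i >= n:
            toks.append(('!', ''))       # no marker: end-of-stream
            return toks
        c = s[i]
        if c in "[]{}=":
            # Terminate just before the first non-WS character after the marker.
            j = i + 1
            while j < n and s[j] in WS:
                j += 1
            toks.append((c, s[i:j]))
        elif c in "#;":
            # Comment: terminate just before the first linebreak.
            j = i + 1
            while j < n and s[j] not in LB:
                j += 1
            toks.append(('#', s[i:j]))
        else:
            # Text token: quoted strings suppress termination.
            j, quoted = i, False
            while j < n:
                ch = s[j]
                if quoted:
                    if ch == '\\' and j + 1 < n:
                        j += 2                          # skip escaped char
                        continue
                    if ch == '"':
                        quoted = False
                elif ch == '"':
                    quoted = True
                elif ch in LB or ch in "{}=":           # cases (1) and (2)
                    break
                elif ch in "]#;" and j + 1 < n and s[j + 1] in LW:
                    break                               # case (3)
                elif ch in WS and j + 1 < n and s[j + 1] == '[':
                    j += 1                              # case (4)
                    break
                j += 1
            toks.append(('T', s[i:j]))
        i = j
```

Note this follows case (3) exactly as stated: a ']', '#', or ';' ends a text token only when the next character is linear whitespace.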
Text canonicalization:
After tokenizing, tokens of type TT_TEXT are canonicalized to remove
internal quotes and trim whitespace:
- Any text within a quoted string is unescaped in the manner of ANSI C.
(e.g., "[\\Huh\x3F]" => [\Huh?])
- All whitespace within a quoted string is preserved.
- Whitespace between two quoted strings is eliminated. (This provides
string concatenation much as in ANSI C.)
- Whitespace at the beginning and end of the token is eliminated.
- All other whitespace (i.e., whitespace before or after an unquoted word)
is condensed to a single space (like 'collapse' whitespace handling in XML).
For example:
>This is some "wei" "rd" text!
is canonicalized to
>This is some weird text!
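A rough Python sketch of this canonicalization (names and the deliberately partial escape table are mine; purely illustrative):

```python
# Hypothetical sketch of TT_TEXT canonicalization as described above;
# illustrative only, with a partial escape table for brevity.
import re

WS = " \f\v\t"

# A few ANSI C escapes for illustration; a real implementation
# would cover the full set.
_ESCAPES = {'n': '\n', 't': '\t', '\\': '\\', '"': '"'}

def _unescape(body):
    """Unescape the contents of one quoted string in the manner of ANSI C."""
    out, i = [], 0
    while i < len(body):
        if body[i] == '\\' and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt == 'x':                          # "\x##" hex escape
                out.append(chr(int(body[i + 2:i + 4], 16)))
                i += 4
                continue
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
            continue
        out.append(body[i])
        i += 1
    return ''.join(out)

def canonicalize(text):
    """Apply the canonicalization rules to one TT_TEXT token."""
    # Odd-numbered parts are quoted strings, even-numbered parts are not.
    parts = re.split(r'("(?:\\.|[^"\\])*")', text)
    out = []
    for k, part in enumerate(parts):
        if k % 2:
            out.append(_unescape(part[1:-1]))       # quotes removed, WS kept
        elif 0 < k < len(parts) - 1 and part.strip(WS) == '':
            pass                  # whitespace between two quoted strings: drop
        else:
            # Collapse remaining whitespace runs to a single space.
            out.append(re.sub('[' + re.escape(WS) + ']+', ' ', part))
    return ''.join(out).strip(WS)                   # trim the ends
```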
---
Notes on lexer behavior:
This lexer specifically does not support the "finalizer" token '*'. We
suspect that this feature is not being used in practice (although we should
certainly be corrected if we're wrong), and even if it is, it is quite a
headache to support and implement properly. The lexer treats the '*'
character as ordinary text.
Note that the semicolon ';' is a comment character, even though this is
largely undocumented and (according to my research) used much less
frequently than the pound sign '#' for comments.
The complicated rules for text tokens (specifically the whitespace parts of
cases (3) & (4)) are there specifically to support unquoted auth_to_local
values of the form
>RULE:[2:$1;$2](^.*;admin$)s/;admin$//
which must be interpreted as single text tokens despite containing '[',
']', and ';' characters. In a perfect world these characters would have
been escaped or quoted to begin with, but this is not such a
world. Therefore, while we can interpret the line
>foo = bar; more text
as a value of "bar" assigned to the tag "foo" with a comment of "; more
text", we are forced to interpret a line like:
>foo = bar;more text
as a value of "bar;more text" assigned to the tag "foo".
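The distinction comes down to a single lookahead character. A tiny self-contained check (the function name and standalone form are mine, not part of the proposal):

```python
# Illustrative check of text-token rule (3) for semicolons;
# hypothetical helper, not part of the proposed lexer itself.

LW = set(" \f\v\t\r\n")   # linear whitespace

def semicolon_ends_text(line, i):
    """True if the ';' at index i ends the text token (starting a comment),
    i.e., per rule (3), when the next character is linear whitespace."""
    return line[i] == ';' and i + 1 < len(line) and line[i + 1] in LW
```

So "foo = bar; more text" (whitespace after the ';') produces a comment, while "foo = bar;more text" keeps the ';' inside the value.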
This document does not describe the parser and how the tokens are arranged
into a tree structure.
---
Pros and cons of this lexer:
Pro:
- linebreaks are less syntactic than in the old parser. Inputs like "foo =
\n\t\t bar\n", "foo \n = bar\n", "foo \n = \n bar\n", and "foo = bar\n" are
all interpreted the same way.
- all possible byte codes can appear in all text tokens (i.e., section
names, relation tags, and relation values). In the previous parser,
section names could not contain the ']' character, relation tags could not
contain spaces or the '=' character, and relation values could not contain
the '{' character. Additionally, the new parser allows unprintable
characters to always be represented as more portable quoted hex codes "\x##".
- the lexer does not need to maintain any state or to interact with the
parser in a complicated way.
Con:
- the lexer requires two-character lookahead. (case (4) for text tokens)
- text token termination rules are not intuitive and may cause confusion.
(e.g., "bar# baz", "bar[ baz", "bar ]baz", "bar? baz", "bar ?baz", and "bar
baz" are all interpreted as single text tokens, but "bar #baz", "bar [baz",
"bar] baz", "bar{ baz", "bar {baz", and "bar \n baz" are not.)
---
krb5.conf examples:
The following files are shown split into tokens (types only):
File:
>[normal] #c1
> foo = bar #c2
> baz = { #c3
> quux = quuux quuuux #c4
> } #c5
is split into:
>[ T ] # T = T # T = { # T = T # } # !
File:
>[one-liner] foo = { bar = { baz = quux } } #compact
is split into:
>[ T ] T = { T = { T = T } } # !
File:
>[ # This is technically allowable,
>long # but do we want to support it?
>] # It doesn't seem that crazy,
>foo # but then again,
>= # it doesn't seem that vital either.
>bar # I don't know.
is split into:
>[ # T # ] # T # = # T # !
------------