Unicode and APIs

Wed Sep 19 19:55:10 EDT 2007

On Sep 19, 2007, at 17:59, Sam Hartman wrote:
>>> How badly will things break if we add a context flag that says
>>> "everything is UTF8" and minimize other API changes?
>
>     Jeffrey> That would certainly work for the interface between the
>     Jeffrey> MIT library and the calling application.
>
> Let's focus on the API issue for now.  Ken disagrees with you or at
> least implied he did by saying that it would be hard.  Let's see what
> his concerns are.

Mostly, I was just thinking about the ideas kicked around for adding  
*_utf8 versions of a bunch of GSSAPI functions.  For krb5 only, yeah,  
we could probably set another context flag, but I think it would go  
onto my list of "things I'd like to see fixed if we ever get to  
rework the API in a big way". :)  Long term, I think what I'd rather  
see would be us just using UTF-8, and maybe having a hook for
just-send-8 (JS8 for short, below).  In 5-10 years I'd rather have an
API that efficiently deals with the environment we're using then,
rather than having to default to expecting 20th-century environments
and then calling routines to upgrade from that.

The context flag might be a good start, but how much would that help  
with issues I brought up like needing to try a password as JS8 and as  
UTF-8?  Do we really want an application to have to create two  
contexts?  Or do we switch UTF-8 mode off and on?

What would code trying to get initial credentials look like, then?

   read princname_local
   read password_local
   convert princname_local to princname_utf8
   convert password_local to password_utf8
   enable utf-8 in context
   get_init_creds (princname_utf8, password_utf8)
   if success then return success
   if local == utf-8 then return error
   if princname_local == princname_utf8 && password_local == password_utf8 then return error
   # Okay, maybe the password was set in just-send-8 mode.
   disable utf-8 in context
   get_init_creds (princname_local, password_local)

And if we let get_init_creds prompt for the password, the user will  
get prompted twice.  (Does the context flag mean any input we get is  
going to be UTF-8?  Do the prompter functions we supply have to do  
conversions?)  That's all well and fine for minimizing our API  
changes, but for the application programmer and user, it kind of  
sucks.  (Obviously, that's just a rough outline.  If the first error  
is one indicating the principal exists but the password isn't right,  
we can probably assume the UTF-8 form of the principal name is the  
correct one and the other won't be found.  Or can we?)
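
To make the "local == utf-8" comparisons in the outline concrete, here
is a rough Python sketch (not krb5 code) of when the two byte forms of
the same typed string actually differ.  ISO-8859-1 as the local
encoding and the helper names are assumptions picked for illustration.

```python
# Illustration only: ISO-8859-1 stands in for "the local encoding".
LOCAL_ENCODING = "iso-8859-1"

def local_and_utf8(text, local_encoding=LOCAL_ENCODING):
    """Return (just-send-8 bytes, UTF-8 bytes) for the same typed text."""
    return text.encode(local_encoding), text.encode("utf-8")

def needs_retry(text, local_encoding=LOCAL_ENCODING):
    """True when a failed UTF-8 attempt could still succeed as JS8."""
    js8, utf8 = local_and_utf8(text, local_encoding)
    return js8 != utf8

print(needs_retry("hunter2"))         # pure ASCII: both forms identical -> False
print(needs_retry("p\u00e4ssword"))   # contains U+00E4: forms differ -> True
```

Only strings with non-ASCII characters ever need the second attempt,
which is why the outline can bail out early when the forms match.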

Can we wind up with a JS8-encoded principal name using a UTF-8  
password, or vice versa?  If so, we're looking at more calls from the  
application, to get the principal name encoding right and to get the  
password encoding right, somewhat independently, and the single flag  
in the context doesn't make as much sense.
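
If the two encodings really are independent, the application is
looking at up to four attempts rather than two.  A small Python sketch
of that enumeration follows; the function name and the ISO-8859-1
default are hypothetical, and it assumes the typed text is
representable in the local encoding.

```python
from itertools import product

def candidate_attempts(princ_text, pw_text, local_encoding="iso-8859-1"):
    """List distinct (principal bytes, password bytes) pairs, UTF-8 first."""
    princ_forms = [princ_text.encode("utf-8"), princ_text.encode(local_encoding)]
    pw_forms = [pw_text.encode("utf-8"), pw_text.encode(local_encoding)]
    attempts = []
    for pair in product(princ_forms, pw_forms):
        if pair not in attempts:  # ASCII-only strings collapse duplicates
            attempts.append(pair)
    return attempts

# ASCII principal, non-ASCII password: two distinct attempts.
print(len(candidate_attempts("alice@EXAMPLE.COM", "p\u00e4ssword")))
# Non-ASCII in both: all four combinations are distinct.
print(len(candidate_attempts("jos\u00e9@EXAMPLE.COM", "p\u00e4ssword")))
```

A single on/off flag in the context can only express two of those four
combinations.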

For more friendly compatibility, I think we'd want a function that  
takes both forms (of both strings) and does the retrying as  
necessary, internally.  Or a callback function to do the conversions.
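
Roughly what I have in mind for the "takes both forms" function,
sketched in Python rather than against the real krb5 API.  Everything
here is hypothetical: get_init_creds_both is not an existing function,
and `attempt` is a stand-in callable for the exchange with the KDC.

```python
def get_init_creds_both(princ_utf8, princ_local, pw_utf8, pw_local, attempt):
    """Try the UTF-8 forms first, then just-send-8.

    `attempt` is a hypothetical callable(principal_bytes, password_bytes)
    returning True on success.  Returns the pair that worked, or None.
    """
    tried = []
    for princ, pw in ((princ_utf8, pw_utf8), (princ_local, pw_local)):
        if (princ, pw) in tried:
            continue  # local form is identical to UTF-8; nothing new to try
        tried.append((princ, pw))
        if attempt(princ, pw):
            return princ, pw
    return None

# Toy "KDC" whose password was set in just-send-8 mode (ISO-8859-1 here).
stored = {b"alice@EXAMPLE.COM": "p\u00e4ssword".encode("iso-8859-1")}

def fake_kdc(princ, pw):
    return stored.get(princ) == pw

result = get_init_creds_both(
    b"alice@EXAMPLE.COM", b"alice@EXAMPLE.COM",
    "p\u00e4ssword".encode("utf-8"), "p\u00e4ssword".encode("iso-8859-1"),
    fake_kdc,
)
print(result)  # the JS8 pair succeeds on the second attempt
```

The user types the password once; the retry happens inside the helper
instead of in every application.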

On the other end, krb5_rd_req gives the receiver the principal name  
from the request; that'll presumably be in the form it was sent on  
the wire?  Which means regardless of context setting, it could be  
either UTF-8 or JS8.  Do we need to return to the application an  
indication as to which principal name form was correct?  If not, how  
can it properly check against an ACL file like .k5login, presumably  
maintained with one consistent encoding?
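
One possible way an application could cope without such an indication
is to guess, as in this Python sketch.  It is a heuristic and only an
assumption about how one might do it: bytes that happen to decode as
UTF-8 are taken to be UTF-8, which can misread a JS8 name.  The
function names and the ISO-8859-1 default are hypothetical.

```python
def wire_name_to_text(name_bytes, local_encoding="iso-8859-1"):
    """Guess the text form of a principal name received on the wire.

    Returns (text, form) where form is "utf-8" or "js8".  Heuristic only:
    anything that decodes as UTF-8 is assumed to be UTF-8.
    """
    try:
        return name_bytes.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return name_bytes.decode(local_encoding), "js8"

def acl_allows(name_bytes, acl_entries, local_encoding="iso-8859-1"):
    """Check a wire principal against ACL entries held as text."""
    text, _form = wire_name_to_text(name_bytes, local_encoding)
    return text in acl_entries

acl = {"jos\u00e9@EXAMPLE.COM"}  # ACL kept in one consistent form
print(acl_allows("jos\u00e9@EXAMPLE.COM".encode("utf-8"), acl))
print(acl_allows("jos\u00e9@EXAMPLE.COM".encode("iso-8859-1"), acl))
```

Having the library report which form matched would avoid the
guesswork entirely.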

Should ktutil work in UTF-8 or JS8 mode for names and passwords?  New  
command-line switch?  Okay, that's not an API issue... but we might  
want to change ktutil to talk to the KDC to figure out the answer, if  
we want to be friendly about it, rather than requiring the local  
admin to know the answer beforehand.

What about non-ASCII data in the config files -- UTF-8 or local  
encoding?  They're read at init_context time; you don't get to set a  
flag first unless we create a new init_context variant.  (And an  
init_secure_context variant, and an internal init_context_kdc  
variant.)  So there's another API change, unless we dictate terms.  I  
suppose we could extend the parsing code to let the config file say,  
"I'm in UTF-8", but it would need to be backwards-compatible  
syntactically, and that doesn't play nicely with the API functions  
that return all matches against a list of names from the full set of  
config files, without associated UTF-8-ness flags for each response.

And I'm not even thinking now about what happens if we try using  
GSSAPI in an environment where we've started mixing in UTF-8 with  
local encoding at the Kerberos layer.

>     Jeffrey> I think the challenge will be credential caches, keytabs,
>     Jeffrey> replay caches, etc.  Those are resources which are shared
>     Jeffrey> with other Kerberos implementations that will not
>     Jeffrey> necessarily be happy if the character sets changes.

True, but unless we've got places to stuff additional data ("oh, and  
here's the UTF-8 version of the name"), we may just have to bite the  
bullet and have a flag day or something.

Ken