non-ascii password in kerberos authentication

Wed Oct 31 00:21:49 EDT 2007

Xu Qiang wrote:
>> -----Original Message-----
>> From: Jeffrey Altman [mailto:jaltman at secure-endpoints.com]
>> Sent: Wednesday, October 31, 2007 11:21 AM
>> To: Xu Qiang
>> Cc: Ken Raeburn; Paul Moore; Su Huang (FXSGSC) Yi; krbdev at mit.edu
>> Subject: Re: non-ascii password in kerberos authentication
>>
>> Microsoft does not use the ISO-Latin-1 character set, they use the
>> ANSI-Latin-1 character set (which was not issued by ANSI, it's just
>> what they call it.)
> 
> Jeff,
> 
> Yes, I know MS doesn't use ISO-Latin-1, which is why I said they didn't follow the standard strictly.
> That is why the euro sign appears as 0x80, and not 0xA4 (the value in ISO-Latin-9).

The history of ISO-Latin-9 is that it was created many years after
ISO-Latin-1 through ISO-Latin-8 had been standardized.  When the Euro
was created there was an obvious need to add the character, but you
can't change a standard after it is published, and the ISO-Latin
character sets follow the rules of ISO-2022, which prohibits printable
characters in the C1 control character range (0x80-0x9F).  As a result
they had to create the new ISO-Latin-9 character set, which includes the
Euro character for Western Europe by replacing the generic currency
sign at 0xA4.  Since the two sets assign different characters to the
same code point, they need to be different character sets.
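
To make the Latin-1 / Latin-9 difference concrete, here is a minimal
sketch.  It assumes Python's bundled "iso8859-1" and "iso8859-15" codec
tables match the published standards; the helper names are made up for
illustration.

```python
import unicodedata


def differing_bytes(codec_a, codec_b):
    """Return the byte values that two single-byte codecs decode differently."""
    return [
        value
        for value in range(256)
        if bytes([value]).decode(codec_a) != bytes([value]).decode(codec_b)
    ]


def c1_is_control(codec):
    """True if every byte in 0x80-0x9F decodes to a control character."""
    return all(
        unicodedata.category(bytes([value]).decode(codec)) == "Cc"
        for value in range(0x80, 0xA0)
    )


# The same byte, 0xA4, is the currency sign in Latin-1 and the Euro in Latin-9.
print(ascii(b"\xa4".decode("iso8859-1")))
print(ascii(b"\xa4".decode("iso8859-15")))

# Every position where the two sets disagree, and whether C1 stays reserved.
print([hex(v) for v in differing_bytes("iso8859-1", "iso8859-15")])
print(c1_is_control("iso8859-1"), c1_is_control("iso8859-15"))
```

Neither set has a printable character at 0x80, which is why the Euro
could not simply be placed there.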

Microsoft's ANSI character sets are closer in heritage to the IBM Code
Pages.  IBM CP850 was the Western European character set that included
all of the characters of ISO-Latin-1 plus the box drawing characters and
many other characters used within Western Europe that could not fit in
ISO-Latin-1.  The reason these additional characters could fit in the
IBM Code Pages is that unlike the ISO-2022 based character sets, the IBM
Code Pages did not reserve the C1 control character range.

Microsoft's ANSI character sets, like the IBM Code Pages (which
Microsoft calls OEM Character Sets), do not reserve the C1 control
character range, and therefore there was room to support both the
generic currency sign and the Euro character within the ANSI Latin-1
character set.

For additional historical reference, when the Euro character was
originally introduced, IBM modified CP850 to include it by replacing the
dotless-i (0xD5), which is used extensively in Turkish.  This produced
significant backlash, which resulted in the introduction of CP858 with
the Euro character and restoration of the dotless-i to CP850.  However,
the damage had already been done.  The ISO committees decided against
repeating IBM's folly.

Microsoft in its infinite wisdom decided that the OEM Code Pages needed
replacing and created a matching "ANSI" code page for each of the ISO
Latin character sets.  Code Page 1252 is called "ANSI Latin-1" and
includes all of the printable characters from ISO Latin-1 at the same
code points.  However, it also adds a number of characters not found in
ISO Latin-1, and there is not a one-to-one mapping with characters in
the OEM Code Pages.  The original version of Code Page 1252 did not
include the Euro character, nor did it include the "S with caron" or
"Z with caron" characters.  These have been added over time, and there
are still five unused code points that can be assigned values at some
time in the future.
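
The current state of CP1252 can be checked with a short sketch.  It
assumes Python's bundled "cp1252" codec reflects the present-day
mapping and treats unassigned bytes as decode errors; the helper names
are hypothetical.

```python
def unassigned_bytes(codec):
    """Return the byte values that the codec refuses to decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


def matches_latin1_printables(codec):
    """True if bytes 0xA0-0xFF decode to the same code points as ISO Latin-1."""
    return all(
        bytes([value]).decode(codec) == chr(value)
        for value in range(0xA0, 0x100)
    )


# The five code points that are still unused.
print([hex(v) for v in unassigned_bytes("cp1252")])

# The Euro and the currency sign coexist, and the caron letters are present.
for name, char in [
    ("euro", "\u20ac"),
    ("currency sign", "\u00a4"),
    ("S with caron", "\u0160"),
    ("Z with caron", "\u017d"),
]:
    print(name, char.encode("cp1252").hex())

print(matches_latin1_printables("cp1252"))
```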

> But anyway, the euro sign cannot be treated as a single-byte char, because its value in UTF-16 (or UCS-2LE)
> is 0x20AC, which needs both bytes. So appending 0x00 to the original byte (the MIT way)
> will not work, regardless of whether it is 0xA4 or 0x80.

No, it won't.  Your issue is that the code points in the Windows ANSI
Latin-1 (CP1252) character set do not all map to Unicode code points
with the same numeric value, which is what NUL-stuffing assumes.  As a
result, in order to solve this problem you are going to have to
implement some way of communicating the character set used by your
application to the Kerberos library and then replace the dumb
NUL-stuffing algorithm with one that actually performs a character-set
translation.
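
A rough sketch of the difference between the two approaches.  This is
not the library's code: nul_stuff() only imitates the behaviour
described in this thread, and translate() with its charset parameter is
a hypothetical stand-in for "communicating the character set" to the
library.

```python
def nul_stuff(password: bytes) -> bytes:
    """Widen each byte to 16 bits by appending 0x00, with no translation."""
    return b"".join(bytes([b, 0]) for b in password)


def translate(password: bytes, charset: str) -> bytes:
    """Decode the password from its declared charset, then encode as UTF-16LE."""
    return password.decode(charset).encode("utf-16-le")


ascii_pw = b"secret"
euro_cp1252 = b"p\x80ss"   # "p", Euro, "ss" as entered on a CP1252 system
euro_latin9 = b"p\xa4ss"   # the same password entered on an ISO-Latin-9 system

# Pure ASCII: both approaches agree, which is why the problem goes unnoticed.
print(nul_stuff(ascii_pw) == translate(ascii_pw, "cp1252"))

# With the Euro, NUL-stuffing yields 0x0080 or 0x00A4 instead of 0x20AC.
print(nul_stuff(euro_cp1252).hex(), translate(euro_cp1252, "cp1252").hex())
print(nul_stuff(euro_latin9).hex(), translate(euro_latin9, "iso8859-15").hex())

# Once the charset is known, both inputs produce the same UTF-16LE string.
print(translate(euro_cp1252, "cp1252") == translate(euro_latin9, "iso8859-15"))
```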

If you know that your application always uses only a single character
set, then you could (for your own distribution) bypass the character-set
communication and simply replace the NUL-stuffing code with
character-set translation routines for CP1252 to Unicode UCS-2LE.
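
For the single-character-set case, the replacement routine can be a
small table lookup.  The sketch below is one possible shape, not an
existing Kerberos function: the name cp1252_to_ucs2le is made up, and
rejecting the five unassigned bytes with an error is an assumption
about the desired behaviour.

```python
import struct

# CP1252 bytes in 0x80-0x9F and their Unicode code points.  The five bytes
# absent from this table (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned.
CP1252_HIGH = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
    0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160,
    0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019,
    0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9E: 0x017E, 0x9F: 0x0178,
}


def cp1252_to_ucs2le(password: bytes) -> bytes:
    """Translate a CP1252 byte string to UCS-2LE, one 16-bit unit per byte."""
    out = bytearray()
    for b in password:
        if 0x80 <= b <= 0x9F:
            if b not in CP1252_HIGH:
                # Assumption: fail loudly rather than guess at a mapping.
                raise ValueError(f"byte 0x{b:02X} is unassigned in CP1252")
            code_point = CP1252_HIGH[b]
        else:
            # 0x00-0x7F and 0xA0-0xFF have the same value in Unicode.
            code_point = b
        out += struct.pack("<H", code_point)
    return bytes(out)


print(cp1252_to_ucs2le(b"p\x80ss").hex())
```

Only the 0x80-0x9F range needs the table; everywhere else the result is
identical to NUL-stuffing.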

Jeffrey Altman