Concurrency issues with FILE ccache

Fri Apr 9 11:35:26 EDT 2021

On 2021-04-06 at 19:28, Greg Hudson wrote:
> On 4/6/21 11:48 AM, Osipov, Michael (LDA IT PLM) wrote:
>> gssapi.raw.misc.GSSError: Major (851968): Unspecified GSS failure.  Minor code may provide more information, Minor (100001): Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_1000)
> 
> This is not expected, and bears investigation.  It suggests an EINVAL,
> EEXIST, EFAULT, EBADF, or EWOULDBLOCK error from one of the I/O
> operations performed by fcc_store(), none of which are expected.  If
> you're building libkrb5, you could try modifying interpret_error() to
> pass those error codes through in order to find out which one is happening.
> 
> Getting multiple cache entries for a service is normal when multiple
> threads or processes initiate contexts to the same (new) service within
> a short window.
> 

Hi Greg,

I was able to compile and install 1.19.1 in the GitLab
Runner and verified that py-gssapi picks it up from LD_LIBRARY_PATH.
Unfortunately, 1.19.1 suffers from the same problem as 1.17. I
tried to narrow it down with strace, but that changes the runtime
behavior of the application and the error disappears. I then patched the
fcc_store() function:
> $ git diff
> diff --git a/src/lib/krb5/ccache/cc_file.c b/src/lib/krb5/ccache/cc_file.c
> index 9a9b45a6e..7f604c0f4 100644
> --- a/src/lib/krb5/ccache/cc_file.c
> +++ b/src/lib/krb5/ccache/cc_file.c
> @@ -1000,8 +1000,9 @@ fcc_store(krb5_context context, krb5_ccache id, krb5_creds *creds)
>      if (ret)
>          goto cleanup;
>      nwritten = write(fileno(fp), buf.data, buf.len);
> -    if (nwritten == -1)
> +    if (nwritten == -1) {
>          ret = interpret_errno(context, errno);
> +        printf("errno: %d, ret: %d\n", errno, ret); }
>      if ((size_t)nwritten != buf.len)
>          ret = KRB5_CC_IO;

but the output did not appear. Then I patched interpret_errno()
directly, in the branch that returns the internal error:
> @@ -1293,6 +1294,7 @@ interpret_errno(krb5_context context, int errnum)
>      case EWOULDBLOCK:
>  #endif
>          ret = KRB5_FCC_INTERNAL;
> +        printf("errnum: %d, ret: %d\n", errnum, ret);
>          break;
>      /*
>       * The rest all map to KRB5_CC_IO.  These errnos are listed to
I had exactly one failure in the job and received exactly this:
> errnum: 17, ret: -1765328188
errnum 17 is EEXIST here, and ret is the KRB5_FCC_INTERNAL value
assigned just above the printf().
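
To make the number-to-name step easy to repeat, here is a small sketch
that names an errno value if it belongs to the group Greg listed above
(EINVAL, EEXIST, EFAULT, EBADF, EWOULDBLOCK). The helper name
internal_errno_name() is made up for illustration and is not part of
libkrb5; the list is taken from Greg's message, not checked against the
source.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper, not libkrb5 code: returns the symbolic name of
 * errnum if it is one of the values Greg listed as leading to the
 * "Internal credentials cache error", or NULL otherwise.
 */
const char *internal_errno_name(int errnum)
{
    if (errnum == EINVAL)
        return "EINVAL";
    if (errnum == EEXIST)
        return "EEXIST";
    if (errnum == EFAULT)
        return "EFAULT";
    if (errnum == EBADF)
        return "EBADF";
    if (errnum == EWOULDBLOCK)
        return "EWOULDBLOCK";
    return NULL;
}
```

Using the symbolic constants rather than raw numbers keeps the lookup
correct on platforms where the numeric values differ from Linux.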

I am quite sure that this is a race condition: stat() is performed and
the file does not exist, then open() for writing is performed, but in
the meantime the file has already been created in parallel, so the
later call fails with EEXIST.
I assumed the culprit to be fcc_initialize() and added a printf() there:
> fcc_initialize()
> errnum: 17, ret: -1765328188
> fcc_initialize()
> errnum: 17, ret: -1765328188
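
For illustration, the kind of race I suspect can be sketched as below.
This is not the libkrb5 code: check_then_create(), race_hook and
competitor_creates() are hypothetical names, and the hook only simulates
a second thread or process creating the file in the window between the
existence check and the exclusive create.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical hook run between the check and the create; it stands in
 * for whatever a concurrent thread or process does in that window. */
typedef void (*race_hook)(const char *path);

/* Check-then-create. Returns 0 on success, otherwise the errno value. */
int check_then_create(const char *path, race_hook hook)
{
    struct stat st;
    int fd;

    if (stat(path, &st) == 0)
        return 0;               /* file already exists, nothing to do */
    if (errno != ENOENT)
        return errno;

    if (hook != NULL)
        hook(path);             /* window in which a competitor can win */

    /* O_EXCL makes open() fail with EEXIST if the file appeared
     * after the stat() above. */
    fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd == -1)
        return errno;
    close(fd);
    return 0;
}

/* Simulated competitor: creates the file inside the window. */
void competitor_creates(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);

    if (fd != -1)
        close(fd);
}
```

Calling check_then_create(path, competitor_creates) on a path that does
not exist yet returns EEXIST, while the same call with a NULL hook
returns 0. Whether fcc_initialize() really follows this pattern is my
assumption, not something I have confirmed in the source.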

What now?

Michael