
yahoo-eng-team team mailing list archive

[Bug 1833085] [NEW] Zero-downtime upgrades lead to spurious token validation failures when caching is enabled


Public bug reported:

When performing keystone's zero-downtime upgrade routine with caching
enabled, we observe validation failures for valid tokens.

The problem is that if both running versions share a cache, both may
cache validated tokens, but the type of the cached token changed between
the two versions (and the cache pickles the objects-to-cache directly).
In queens it is a dict, in rocky it is a dedicated type `TokenModel`.
This causes exceptions when tokens are loaded from the cache and don't
have the expected attributes.
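The failure mode can be illustrated with a minimal sketch (the class and
attribute names here are stand-ins, not keystone's actual code): one
version pickles a dict into the cache, and the other loads it back
expecting an object with attributes.

```python
import pickle


class TokenModel:
    """Stand-in for the rocky-era token type (hypothetical attributes)."""

    def __init__(self, user_id):
        self.user_id = user_id


# A queens-era process caches a plain dict; the cache pickles it as-is.
cached_bytes = pickle.dumps({"user_id": "abc123"})

# A rocky-era process later loads the same entry and treats it as a
# TokenModel -- but pickle faithfully restores the original dict.
token = pickle.loads(cached_bytes)
try:
    token.user_id            # dicts have no attributes -> AttributeError
except AttributeError:
    print("token validation fails with an attribute error")
```

The pickled bytes carry the concrete type of whatever was stored, so the
reading side has no way to coerce the entry into the type it expects.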

The offending code is
<https://github.com/openstack/keystone/blob/stable/queens/keystone/token/provider.py#L165>
vs.
<https://github.com/openstack/keystone/blob/stable/rocky/keystone/token/provider.py#L150>:
the `@MEMOIZE_TOKEN` decorator serializes the tokens into the cache;
both versions use the same keyspace, but the type of the cached objects
has changed.
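To make the keyspace collision concrete, here is a toy memoization
decorator (an assumption for illustration, not keystone's actual
`MEMOIZE_TOKEN` implementation) whose cache key depends only on the
token ID, so entries written by one version are read back by the other:

```python
# Shared dict standing in for the shared memcached keyspace.
_cache = {}


def memoize_token(func):
    """Toy memoizer: the key ignores which code version wrote the entry."""

    def wrapper(token_id):
        key = ("validate_token", token_id)   # same key in both versions
        if key not in _cache:
            _cache[key] = func(token_id)
        return _cache[key]

    return wrapper


class TokenModel:
    """Stand-in for the rocky token type."""

    def __init__(self, token_id):
        self.id = token_id


@memoize_token
def validate_queens(token_id):
    return {"id": token_id}                  # queens caches a dict


@memoize_token
def validate_rocky(token_id):
    return TokenModel(token_id)              # rocky expects a TokenModel


validate_queens("t1")                        # queens node populates the key
entry = validate_rocky("t1")                 # rocky node gets a cache hit...
print(type(entry).__name__)                  # ...and receives the dict
```

Because the cache hit short-circuits the decorated function, the rocky
code path never gets a chance to build the object type it expects.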

Disabling caching (by setting `[caching] enabled = false` in the config)
or disabling all but one keystone instance fixes the problem (of course,
disabling all but one instance defeats the whole purpose of a
zero-downtime upgrade; this was done only to confirm the cause of the
issue).
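For reference, the workaround amounts to the following fragment in
keystone.conf on every node for the duration of the rolling upgrade:

```ini
[caching]
enabled = false
```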

This issue and the possible workaround (disabling the cache) should at
least be documented. If it is safe to run the instances with separate
caches (per instance or per version), that may be a workaround with less
of a performance impact, but I am not sure whether it would be safe with
respect to token invalidation. My understanding is that on token
revocation, the keystone instance handling the API request invalidates
its cache entry and adds the revocation event to the database. So if the
token was already stored as validated in the other cache, it would still
be accepted as valid by the keystone services using that cache. With a
load balancer in front of the keystones, the revoked token would
therefore sometimes validate.
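The revocation concern can be modelled as a toy sketch (this is my
assumed behaviour for the split-cache scenario, not observed keystone
code): two per-version caches, where a revoke handled by one instance
purges only that instance's cache.

```python
# Two hypothetical per-version caches (the proposed split-cache setup).
cache_queens = {}
cache_rocky = {}


def validate(cache, token_id):
    """Cache-backed validation: a hit skips the revocation check."""
    return cache.setdefault(token_id, "valid")


def revoke(handling_cache, token_id):
    """Revoke handled by one instance: only its own cache is purged.

    (The revocation event would also be written to the shared database,
    but a cache hit on the other side never consults it.)
    """
    handling_cache.pop(token_id, None)


validate(cache_queens, "tok")    # both versions have validated and
validate(cache_rocky, "tok")     # cached the same token
revoke(cache_queens, "tok")      # revoke lands on a queens node

print("tok" in cache_rocky)      # the rocky cache still serves it
```

If this model is accurate, a load balancer alternating between the two
pools would intermittently accept the revoked token, which is why I
hesitate to recommend split caches without confirmation.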

** Affects: keystone
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Identity (keystone).
https://bugs.launchpad.net/bugs/1833085

Title:
  Zero-downtime upgrades lead to spurious token validation failures when
  caching is enabled

Status in OpenStack Identity (keystone):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/keystone/+bug/1833085/+subscriptions