
yahoo-eng-team team mailing list archive

[Bug 1833085] [NEW] Zero-downtime upgrades lead to spurious token validation failures when caching is enabled


Public bug reported:

When performing keystone's zero-downtime upgrade routine with caching
enabled, we observe validation failures for valid tokens.

The problem is that if both running versions share a cache, both may
cache validated tokens, but the type of the cached token changed between
the two versions (and the cache pickles the objects-to-cache directly).
In queens it is a dict, in rocky it is a dedicated type `TokenModel`.
This causes exceptions when tokens are loaded from the cache and don't
have the expected attributes.
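The failure mode can be illustrated with a minimal sketch (the class and
attribute names here are stand-ins, not keystone's actual code): one
version pickles a dict into the cache, and the other loads it back
expecting an object with attributes.

```python
import pickle


class TokenModel:
    """Stand-in for the rocky-era token type (hypothetical attributes)."""

    def __init__(self, user_id):
        self.user_id = user_id


# A queens-era process caches a plain dict; the cache pickles it as-is.
cached_bytes = pickle.dumps({"user_id": "abc123"})

# A rocky-era process later loads the same entry and treats it as a
# TokenModel -- but pickle faithfully restores the original dict.
token = pickle.loads(cached_bytes)
try:
    token.user_id            # dicts have no attributes -> AttributeError
except AttributeError:
    print("token validation fails with an attribute error")
```

The pickled bytes carry the concrete type of whatever was stored, so the
reading side has no way to coerce the entry into the type it expects.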

The offending code is
<https://github.com/openstack/keystone/blob/stable/queens/keystone/token/provider.py#L165>
vs.
<https://github.com/openstack/keystone/blob/stable/rocky/keystone/token/provider.py#L150>:
the `@MEMOIZE_TOKEN` decorator serializes the tokens into the cache;
both versions use the same keyspace, but the type of the cached objects
has changed.
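To make the keyspace collision concrete, here is a toy memoization
decorator (an assumption for illustration, not keystone's actual
`MEMOIZE_TOKEN` implementation) whose cache key depends only on the
token ID, so entries written by one version are read back by the other:

```python
# Shared dict standing in for the shared memcached keyspace.
_cache = {}


def memoize_token(func):
    """Toy memoizer: the key ignores which code version wrote the entry."""

    def wrapper(token_id):
        key = ("validate_token", token_id)   # same key in both versions
        if key not in _cache:
            _cache[key] = func(token_id)
        return _cache[key]

    return wrapper


class TokenModel:
    """Stand-in for the rocky token type."""

    def __init__(self, token_id):
        self.id = token_id


@memoize_token
def validate_queens(token_id):
    return {"id": token_id}                  # queens caches a dict


@memoize_token
def validate_rocky(token_id):
    return TokenModel(token_id)              # rocky expects a TokenModel


validate_queens("t1")                        # queens node populates the key
entry = validate_rocky("t1")                 # rocky node gets a cache hit...
print(type(entry).__name__)                  # ...and receives the dict
```

Because the cache hit short-circuits the decorated function, the rocky
code path never gets a chance to build the object type it expects.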

Disabling caching (by setting `[caching] enabled = false` in the config)
or disabling all but one keystone instance fixes the problem (of course,
disabling all but one instance defeats the whole purpose of a
zero-downtime upgrade; this was done only to confirm the cause of the
issue).
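For reference, the workaround amounts to the following fragment in
keystone.conf on every node for the duration of the rolling upgrade:

```ini
[caching]
enabled = false
```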

This issue and the possible workaround (disabling the cache) should at
least be documented. If it is safe to run the instances with separate
caches (per instance or per version), that may be a workaround with less
of a performance impact, but I am not sure whether it would be safe with
respect to token invalidation. My understanding is that on token
revocation, the keystone instance handling the API request invalidates
its cache entry and adds the revocation event to the database. So if the
token was already stored as validated in the other cache, it would still
be accepted as valid by the keystone services using that cache. With a
load balancer in front of the keystones, the revoked token would
therefore sometimes validate.
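The revocation concern can be modelled as a toy sketch (this is my
assumed behaviour for the split-cache scenario, not observed keystone
code): two per-version caches, where a revoke handled by one instance
purges only that instance's cache.

```python
# Two hypothetical per-version caches (the proposed split-cache setup).
cache_queens = {}
cache_rocky = {}


def validate(cache, token_id):
    """Cache-backed validation: a hit skips the revocation check."""
    return cache.setdefault(token_id, "valid")


def revoke(handling_cache, token_id):
    """Revoke handled by one instance: only its own cache is purged.

    (The revocation event would also be written to the shared database,
    but a cache hit on the other side never consults it.)
    """
    handling_cache.pop(token_id, None)


validate(cache_queens, "tok")    # both versions have validated and
validate(cache_rocky, "tok")     # cached the same token
revoke(cache_queens, "tok")      # revoke lands on a queens node

print("tok" in cache_rocky)      # the rocky cache still serves it
```

If this model is accurate, a load balancer alternating between the two
pools would intermittently accept the revoked token, which is why I
hesitate to recommend split caches without confirmation.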

** Affects: keystone
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Identity (keystone).
https://bugs.launchpad.net/bugs/1833085

Title:
  Zero-downtime upgrades lead to spurious token validation failures when
  caching is enabled

Status in OpenStack Identity (keystone):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/keystone/+bug/1833085/+subscriptions