yahoo-eng-team team mailing list archive

[Bug 1380776] [NEW] Uploading and downloading VHDs via Glance XenAPI plugin doesn't always retry when it should

 

Public bug reported:

Encountered a situation where one glance node could not talk to the
registry, which resulted in a high number of upload_vhd errors. The
Glance XenAPI plugin doesn't properly differentiate between
server-permanent errors (only that server will keep failing) and
globally permanent errors (every server will fail). It stops retrying on
either kind, which is only reasonable when there is a single glance
node. With many glance nodes, retrying on a different server is
preferable.

Ideally:

Retry until:
1. A non-retryable error is encountered (e.g. 403)
2. Max retries is reached
3. No servers left to retry (i.e. every server was dropped from the retry list due to a permanent error)
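
A minimal sketch of the three stop conditions above, in Python. This is
not the plugin's code: the exception classes, upload_with_retries, and
the upload callable are hypothetical names chosen for illustration, and
the mapping of HTTP statuses to error classes is an assumption based on
the examples in this report (403 as globally permanent, 500 as
server-permanent).

```python
import random


# Hypothetical error classes; the real plugin's exceptions may differ.
class RetryableError(Exception):
    """Ephemeral failure (e.g. a timeout); any server may be tried again."""


class ServerPermanentError(Exception):
    """This server keeps failing (e.g. HTTP 500); drop it, try the others."""


class GloballyPermanentError(Exception):
    """No server can succeed (e.g. HTTP 403); stop immediately."""


class UploadFailed(Exception):
    """Raised when retries stop without success; carries every error seen."""

    def __init__(self, errors):
        super().__init__("upload failed after %d error(s)" % len(errors))
        self.errors = errors


def upload_with_retries(servers, upload, max_retries=3):
    """Call upload(server) until it succeeds or a stop condition is hit.

    Returns (result, errors) on success, where errors lists every
    (server, exception) pair seen along the way.
    """
    candidates = list(servers)  # servers still eligible for a retry
    errors = []                 # every error is kept, not only the last
    attempts = 0
    last = None
    # Stop conditions 2 and 3: retries exhausted, or no servers left.
    while candidates and attempts <= max_retries:
        # Prefer a server other than the one that just failed.
        choices = [s for s in candidates if s != last] or candidates
        server = random.choice(choices)
        attempts += 1
        try:
            return upload(server), errors
        except GloballyPermanentError as exc:
            # Stop condition 1: a non-retryable error.
            errors.append((server, exc))
            break
        except ServerPermanentError as exc:
            # Drop only the failing server; the others may still work.
            errors.append((server, exc))
            candidates.remove(server)
        except RetryableError as exc:
            errors.append((server, exc))
        last = server
    raise UploadFailed(errors)
```

With a single glance node, a server-permanent error empties the
candidate list, so the loop ends after one attempt instead of retrying
against a server that is known to be broken.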

If the glance nodes sit behind a load balancer (proxy), this approach
could result in the LB being treated as a single glance endpoint, so a
server error from any node behind it would end the retries. The
alternative, retrying on server errors without dropping the failing
servers from the list, could result in unnecessary retries, especially
when there is only a single glance node.


Additionally, if multiple errors are encountered, only the last error is logged as an instance error. Every error should be recorded.


Examples:

Current:

* The plugin tries to upload using 1 of n glance nodes (n > 1)
* An ephemeral (retryable) error is encountered
* The plugin retries using a different glance node
* An error related to a server fault (e.g. 500) is encountered
* The plugin does not retry
* Instance fault

Expected:

* The plugin tries to upload using 1 of n glance nodes (n > 1)
* An ephemeral (retryable) error is encountered
* The error is recorded as an instance fault
* The plugin retries using a different glance node
* An error related to a server fault (e.g. 500) is encountered
* The plugin retries using a different glance node
* Success

** Affects: nova
     Importance: Undecided
     Assignee: Jesse J. Cook (jesse-j-cook)
         Status: In Progress

** Changed in: nova
       Status: New => In Progress

** Changed in: nova
     Assignee: (unassigned) => Jesse J. Cook (jesse-j-cook)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1380776

Title:
  Uploading and downloading VHDs via Glance XenAPI plugin doesn't always
  retry when it should

Status in OpenStack Compute (Nova):
  In Progress


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1380776/+subscriptions

