[Bug 1969971] Re: Live migrations failing due to remote host identification change

 

The nova-cloud-controller charm creates hostname, fqdn and ip address
entries in known_hosts for each compute host. It does this using the
'private-address' and 'hostname' settings on the cloud-compute
relation. private-address will be the address resolvable from
libvirt-migration-network (if configured), otherwise the unit's
private-address.
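
Roughly, the entry construction looks like this (an illustrative
sketch, not the actual charm code; the reverse-DNS lookup and the
function name are assumptions):

    import socket

    def known_hosts_entry(hostname, private_address, host_key):
        # Sketch of what ends up recorded per compute host: the relation
        # 'hostname', an fqdn derived from private-address (assumed here
        # to come from reverse DNS), and the address itself.
        try:
            fqdn = socket.gethostbyaddr(private_address)[0]
        except socket.herror:
            fqdn = private_address  # no PTR record; fall back to the address
        return "{},{},{} {}".format(hostname, fqdn, private_address, host_key)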

Here is the problem: the hostname added to known_hosts is taken from
the 'hostname' relation setting, BUT the fqdn is resolved from
private-address. This means that if nova-compute is configured to use
network X as its management network and libvirt-migration-network is
set to a different network, the fqdn in known_hosts will be from the
latter. This is all fine until nova-compute needs to do a VM resize
and the image used to build the VM no longer exists in Glance, at
which point Nova uses instance.hostname from the database to perform
an scp from source to destination, and this fails because that
hostname (the fqdn from the management network) is not in known_hosts.
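
To make the mismatch concrete (all names and addresses here are made
up for illustration):

    # Names covered by the known_hosts entry the charm wrote:
    known_hosts_names = {
        "compute-1",                        # relation 'hostname'
        "compute-1.migration.example.com",  # fqdn resolved from private-address
        "10.20.0.5",                        # private-address itself
    }

    # Name Nova uses for the resize scp, taken from instance.hostname:
    scp_target = "compute-1.mgmt.example.com"

    # Not covered, so strict host key checking rejects the connection.
    assert scp_target not in known_hosts_names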

This is something that Nova should ultimately support, but in the
interim the suggestion is that nova-cloud-controller always adds the
management network fqdn to known_hosts as well.
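
In code terms that would mean extending the earlier sketch to cover
both networks, something like (again illustrative; the mgmt_address
parameter is an assumption):

    import socket

    def known_hosts_entry(hostname, private_address, mgmt_address, host_key):
        # As before, but also include the management network address and
        # whatever fqdn it resolves to, so the scp done on resize can
        # verify the host key as well.
        names = [hostname, private_address, mgmt_address]
        for addr in (private_address, mgmt_address):
            try:
                names.append(socket.gethostbyaddr(addr)[0])
            except socket.herror:
                pass  # no PTR record for this address
        # De-duplicate while preserving order for the hostlist field.
        hostlist = ",".join(dict.fromkeys(names))
        return "{} {}".format(hostlist, host_key)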

** Also affects: nova
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1969971

Title:
  Live migrations failing due to remote host identification change

Status in OpenStack Nova Cloud Controller Charm:
  New
Status in OpenStack Compute (nova):
  New

Bug description:
  I've encountered a cloud where, for some reason (maybe a redeploy of a
  compute; I'm not sure), I'm hitting this error in nova-compute.log on
  the source node for an instance migration:

  2022-04-22 10:21:17.419 3776 ERROR nova.virt.libvirt.driver [-] [instance: <REDACTED INSTANCE UUID>] Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+ssh://<REDACTED IP>/system: Cannot recv data: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
  Someone could be eavesdropping on you right now (man-in-the-middle attack)!
  It is also possible that a host key has just been changed.
  The fingerprint for the RSA key sent by the remote host is
  SHA256:<REDACTED FINGERPRINT>.
  Please contact your system administrator.
  Add correct host key in /root/.ssh/known_hosts to get rid of this message.
  Offending RSA key in /root/.ssh/known_hosts:97
    remove with:
    ssh-keygen -f "/root/.ssh/known_hosts" -R "<REDACTED IP>"
  RSA host key for <REDACTED IP> has changed and you have requested strict checking.
  Host key verification failed.: Connection reset by peer: libvirt.libvirtError: operation failed: Failed to connect to remote libvirt URI qemu+ssh://<REDACTED IP>/system: Cannot recv data: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

  This interferes with instance migration.

  There is a workaround:
  * From the source node, manually SSH to the destination node as both the root and nova users.
  * Manually clear the offending known_hosts entries reported by the ssh command.
  * Verify that, once cleared, the root and nova users can successfully connect via SSH.

  Obviously, this is cumbersome on clouds with large numbers of compute
  nodes; it would be better if the charm were able to avoid this issue.
  A scripted version of the workaround is sketched below.
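
  (Node list and paths below are hypothetical. Re-learning keys with
  ssh-keyscan is trust-on-first-use: verify fingerprints out of band
  if a real MITM is suspected, and repeat for the nova user's
  known_hosts.)

  import subprocess

  # Hypothetical list of destination compute addresses/hostnames.
  compute_nodes = ["10.20.0.5", "10.20.0.6"]
  known_hosts = "/root/.ssh/known_hosts"

  for node in compute_nodes:
      # Same cleanup ssh suggests in the error above: drop the stale entry.
      subprocess.run(["ssh-keygen", "-f", known_hosts, "-R", node])
      # Re-learn the current host key (trust-on-first-use caveat applies).
      scan = subprocess.run(["ssh-keyscan", "-t", "rsa", node],
                            capture_output=True, text=True, check=True)
      with open(known_hosts, "a") as fh:
          fh.write(scan.stdout)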

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1969971/+subscriptions