yahoo-eng-team team mailing list archive

[Bug 1947753] Re: Evacuated instances are not removed from the source

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Sylvain Bauza <1947753@xxxxxxxxxxxxxxxxxx>
Date: Wed, 27 Oct 2021 14:32:40 -0000
Reply-to: Bug 1947753 <1947753@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

OK, let me get it right.

You say that when you want to evacuate an instance, you don't really know whether the original compute service is running correctly, right?
That's basically why Nova checks that the host is not operational, i.e. has somehow 'failed', before allowing an evacuation.
You're right that sometimes Nova thinks the compute service isn't faulty, and then you can't evacuate. At other times Nova thinks the compute service *is* faulty, and then you can evacuate.

If you evacuate in that second case, then indeed you could have problems *if* the host is actually still running.
That's why we generally recommend that operators "fence" the faulty source host detected by Nova before evacuating.

Either way, if the service continues to run, it verifies the evacuation
status periodically and deletes the evacuated instances from the host.
So maybe you're hitting a race when you evacuate while a compute fault
is only transient, and that is when you see a problem.

If so, I'd recommend, as I said, that you 'fence' the host before evacuating instances... or wait a little before evacuating them if the issue is transient.
Maybe that's something related to the healthchecks we want to work on: if you got a more accurate status for a faulty compute service, you wouldn't issue evacuations unless you were sure it had gone down.
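
The "fence or wait" advice above can be sketched as a small operator-side guard. This is only an illustration under stated assumptions: `HostState`, `safe_to_evacuate` and the 300-second default grace period are hypothetical names and values, not part of Nova, and waiting alone never proves that the host is really down.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical snapshot of what an operator knows about a source host."""
    fenced: bool                   # operator confirmed the host is powered off or isolated
    seconds_reported_down: float   # how long the compute service has looked down


def safe_to_evacuate(state: HostState, grace_seconds: float = 300.0) -> bool:
    """Allow evacuation only if the host is fenced, or if it has looked down
    for at least `grace_seconds` (assumed value), so that a transient fault
    is less likely to be the cause."""
    if state.fenced:
        return True
    return state.seconds_reported_down >= grace_seconds
```

With this guard, a host that has only just been reported down and is not fenced is rejected, which matches the suggestion to wait when the fault may be transient.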

Marking the bug report as Opinion, but I'm more than happy to discuss
it with you, Belmiro, on #openstack-nova if you wish.

** Changed in: nova
Status: New => Opinion

** Changed in: nova
Importance: Undecided => Wishlist

** Tags added: evacuate

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1947753

Title:
Evacuated instances are not removed from the source

Status in OpenStack Compute (nova):
Opinion

Bug description:
Instance "evacuation" is a great feature and we are trying to take advantage of it.
But it has some limitations, depending on how "broken" the node is.

Let me give some context...

In the scenario where the compute node loses connectivity (broken
switch port, loose network cable, ...) or nova-compute is stuck
(filesystem issue), evacuating instances can have some unexpected
consequences and lead to data corruption in the application (for
example in a DB application).

If a compute node (or an entire set of compute nodes) loses connectivity, nova-compute and the instances are "not available".
If the node runs critical applications (let's suppose a MySQL DB), the cloud operator could be tempted to "evacuate" the instance to recover the critical application for the user. At this point the cloud operator may not yet know what the compute node issue is, and maybe it won't be possible to shut the node down (management network affected?, ...), or the operator may simply not want to interfere with the work of the repair team.

The repair team fixes the issue (it can take a few minutes or hours...)
and nova-compute and the instances are available again.

The problem is that nova-compute doesn't destroy the evacuated
instances in the source.

```
2021-10-19 11:17:51.519 3050 WARNING nova.compute.resource_tracker [req-0ed10e35-2715-466a-918b-69eb1fc770e8 - - - - -] Instance fc3be091-56d3-4c69-8adb-2fdb8b0a35d2 has been moved to another host foo.cern.ch(foo.cern.ch). There are allocations remaining against the source host that might need to be removed: {u'resources': {u'VCPU': 1, u'MEMORY_MB': 1875}}.
```

At this point we have 2 instances sharing the same IP and possibly
writing into the same volume.

Only when nova-compute is restarted (I guess that was always the
assumption... the compute node was really broken) the evacuated
instances in the affected node are removed.

```
2021-10-19 15:39:49.257 21189 INFO nova.compute.manager [req-ded45b0c-20ab-4587-9533-8c613d977f79 - - - - -] Destroying instance as it has been evacuated from this host but still exists in the hypervisor
2021-10-19 15:39:52.949 21189 INFO nova.virt.libvirt.driver [ ] Instance destroyed successfully.
```

I would expect nova-compute to constantly check for evacuated instances and then remove them.
Otherwise, this requires a lot of coordination between different support teams.

Should this be moved to a periodic task?
https://github.com/openstack/nova/blob/e14eef0719eceef35e7e96b3e3d242ec79a80969/nova/compute/manager.py#L1440
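
To make the periodic-task idea concrete, here is a minimal sketch of what one pass of such a check could look like. It is not Nova's code: `EvacuationRecord`, `instances_to_destroy`, `periodic_cleanup` and the callables passed in are hypothetical stand-ins for whatever the compute manager would use to list local guests, look up evacuation records and destroy a guest.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class EvacuationRecord:
    """Hypothetical stand-in for a record describing one evacuation."""
    instance_uuid: str
    source_host: str
    done: bool  # True once the instance runs on the destination


def instances_to_destroy(this_host: str,
                         local_instances: Set[str],
                         records: Iterable[EvacuationRecord]) -> List[str]:
    """Return UUIDs still present locally although they were evacuated away."""
    evacuated = {r.instance_uuid for r in records
                 if r.source_host == this_host and r.done}
    return sorted(evacuated & local_instances)


def periodic_cleanup(this_host: str,
                     list_local: Callable[[], Iterable[str]],
                     list_records: Callable[[], Iterable[EvacuationRecord]],
                     destroy: Callable[[str], None]) -> List[str]:
    """One pass of the hypothetical periodic task: destroy stale local guests."""
    doomed = instances_to_destroy(this_host, set(list_local()), list_records())
    for uuid in doomed:
        destroy(uuid)
    return doomed
```

Running `periodic_cleanup` on a timer, rather than only at service start, is the behaviour the question above asks for; whether that is safe when the source host is only partially reachable is exactly what would need discussion.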

I'm running Stein, but looking into the code, we have the same behaviour in master.

References

[Bug 1947753] [NEW] Evacuated instances are not removed from the source
From: Belmiro Moreira, 2021-10-19