fuel-dev team mailing list archive

[Fuel] HA Fixes Catalogue

 

Fuelers,

I have compiled a catalogue of all OpenStack HA fixes that we have
implemented so far, have researched, or still need to research and
implement.

Here is a summary of where things stand today (I've added the same
list to https://etherpad.openstack.org/p/fuel-ha-rabbitmq):

Applied in 5.0, needs a backport to 4.1.1:
- https://review.openstack.org/78178 ocf-neutron-dhcp-orphan
- https://review.openstack.org/93927 nova-reap-deleted-instance
- https://review.openstack.org/77276 oslo-ccn-handling
- https://review.openstack.org/76686 oslo-kombu-reconnect-delay
Proposed for 5.0:
- https://review.openstack.org/93884 ocf-haproxy-vip-colocate
- https://review.openstack.org/93411 rabbitmq-keepalive
- https://review.openstack.org/93815 kernel-match-tcp-keepalive-to-nova-report-interval
- https://review.openstack.org/93883 rabbitmq-hosts-shuffle
Must be implemented in 5.0:
- python-kombu-and-amqp-upgrade (multiple CCN fixes)
- https://launchpadlibrarian.net/160766270/transport.py.patch python-amqp-tcp-user-timeout
- https://bugs.launchpad.net/fuel/+bug/1312177 pacemaker-neutron-agent-stickiness
- https://bugs.launchpad.net/fuel/+bug/1297355 ocf-galera-full-stop
- https://bugs.launchpad.net/fuel/+bug/1293680 ocf-galera-take-donor-out
Should be implemented in 5.1:
- https://bugs.launchpad.net/fuel/+bug/1318936 rabbitmq-does-not-restart
Known not to help, or to cause breakage:
- https://review.openstack.org/34949 rabbitmq-amqp-heartbeat (requires
a heartbeat periodic task in every OpenStack component)

Below is the full catalogue:

pacemaker-haproxy-reload
- applied in 4.0
- https://bugs.launchpad.net/fuel/+bug/1259639
- https://review.openstack.org/61453

ceph-mon-list
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1268579
- https://review.openstack.org/73106

ocf-neutron-agent-pid-matching
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1269334
- https://review.openstack.org/67101

ocf-galera-restart-wait
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1281625
- https://review.openstack.org/74431

pacemaker-fd-leak
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1272840
- https://github.com/ClusterLabs/libqb/commit/b327dbec7380e7de6896f9bb6cb1ca58677f4ed8

pacemaker-broadcast-calculation
- applied in 4.1  # TODO(angdraug): report to upstream
- https://bugs.launchpad.net/fuel/+bug/1277614
- https://review.openstack.org/72438

rabbitmq-hosts
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1285449
- https://review.openstack.org/77409

mysql-read-timeout
- applied in 4.1
- https://bugs.launchpad.net/fuel/+bug/1285449
- https://review.openstack.org/77643

drop-mysql-on-disconnect
- applied in 4.1.1, 5.0  # TODO(angdraug): confirm all fixes are present in 5.0
- https://bugs.launchpad.net/fuel/+bug/1288438
- https://review.openstack.org/81225

haproxy-netns
- applied in 4.1.1, 5.0
- https://review.openstack.org/82518

rabbitmq3
- applied in 4.1.1, 5.0
- depends on rabbitmq3-ha-mode
- https://bugs.launchpad.net/fuel/+bug/1288831

rabbitmq3-ha-mode
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1296922
- https://review.openstack.org/84707

rabbitmq-init-retry
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1314617
- https://review.openstack.org/88593

ocf-gratuitous-arp
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1310676
- https://review.openstack.org/89378

neutron-l3-rootwrap
- applied in 4.1.1, 5.0  # TODO(rmoe): confirm how this is related to
  the neutron umask/pid flock bug (0751)
- https://bugs.launchpad.net/fuel/+bug/1310926
- https://bugs.launchpad.net/neutron/+bug/1311804

ocf-neutron-l3-cleanup-ns
- applied in 4.1.1, 5.0
- https://review.openstack.org/89872

ocf-neutron-dhcp-cleanup-ns
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1285929
- https://review.openstack.org/89557

rabbitmq-fd-ulimit
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1279594
- https://gerrit.mirantis.com/10566

ocf-neutron-agent-lost-mysql
- applied in 4.1.1, 5.0
- https://bugs.launchpad.net/fuel/+bug/1287716
- https://review.openstack.org/77895

ocf-neutron-dhcp-orphan
- applied in 5.0  # TODO(xenolog): backport to 4.1.1
- https://bugs.launchpad.net/fuel/+bug/1285929
- https://review.openstack.org/78178

nova-reap-deleted-instance
- applied in 5.0, proposed for 4.1.1
- https://review.openstack.org/93927

oslo-ccn-handling
- applied in 5.0  # TODO(angdraug): backport to 4.1.1
- https://review.openstack.org/77276

oslo-kombu-reconnect-delay
- applied in 5.0  # TODO(angdraug): backport to 4.1.1
- https://review.openstack.org/76686

ocf-haproxy-vip-colocate
- https://review.openstack.org/93884

rabbitmq-keepalive
- https://review.openstack.org/93411

kernel-match-tcp-keepalive-to-nova-report-interval
- https://review.openstack.org/93815
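
For illustration only, and not taken from the review above: the entry
name suggests keeping TCP keepalive probing in step with nova's report
interval. The actual change may well set system-wide kernel values
instead. Below is a per-socket sketch in Python, assuming Linux-style
socket options; enable_keepalive and its timing values are hypothetical.

```python
import socket


def enable_keepalive(sock, idle=10, interval=5, count=3):
    """Turn on TCP keepalive and tune timings where the platform allows.

    idle, interval and count are made-up example values, not the ones
    used in Fuel.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before probing
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:  # not every platform exposes all three
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With these example values a dead peer would be noticed after roughly
idle + interval * count seconds.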

rabbitmq-hosts-shuffle
- https://review.openstack.org/93883
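
As a rough sketch of the idea the entry name suggests (the review may
do this differently, for example at deployment time): give each client
its own random ordering of the broker list, so that clients do not all
try the same RabbitMQ node first. shuffled_hosts and the host names are
hypothetical.

```python
import random


def shuffled_hosts(hosts, rng=random):
    """Return a new list with the same brokers in random order."""
    result = list(hosts)  # copy, so the caller's list is left untouched
    rng.shuffle(result)
    return result


if __name__ == "__main__":
    # Hypothetical broker addresses, for illustration only.
    print(shuffled_hosts(["node-1:5672", "node-2:5672", "node-3:5672"]))
```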

python-kombu-and-amqp-upgrade
- # NOTE(angdraug): multiple CCN handling fixes
- # TODO(rmoe): try kombu 3.0.15 and amqp 1.4.5; if that breaks, check
  whether kombu 2.5.13 and amqp 1.0.13 are enough

python-amqp-tcp-user-timeout
- depends on python-kombu-and-amqp-upgrade
- https://launchpadlibrarian.net/160766270/transport.py.patch
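
A minimal sketch of what the entry name points at, assuming Linux: the
TCP_USER_TIMEOUT socket option bounds how long sent data may stay
unacknowledged before the kernel drops the connection. This is not the
linked patch; set_user_timeout and the 9000 ms value are hypothetical.

```python
import socket

# Linux defines TCP_USER_TIMEOUT as 18; fall back to that number when
# the running Python does not expose the constant.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_user_timeout(sock, timeout_ms):
    """Ask the kernel to abort the connection if sent data stays
    unacknowledged for longer than timeout_ms milliseconds (Linux only)."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    set_user_timeout(s, 9000)  # example value only
    s.close()
```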

pacemaker-neutron-agent-stickiness
- https://bugs.launchpad.net/fuel/+bug/1312177

ocf-galera-full-stop
- # NOTE(angdraug): requires a rewrite of galera OCF script
- https://bugs.launchpad.net/fuel/+bug/1297355

ocf-galera-take-donor-out
- https://bugs.launchpad.net/fuel/+bug/1293680

rabbitmq-does-not-restart
- # NOTE(angdraug): managing rabbitmq by pacemaker is proposed
- https://bugs.launchpad.net/fuel/+bug/1318936

rabbitmq-amqp-heartbeat
- reverted  # NOTE(angdraug): requires a heartbeat periodic task in
  every OpenStack component
  <https://lists.launchpad.net/openstack/msg15111.html>
- https://review.openstack.org/34949

Please respond if you know about any other HA fixes and improvements
that can help avoid breakage of OpenStack, RabbitMQ, and MySQL on
failover.

Thanks,

-- 
Dmitry Borodaenko