touch-packages team mailing list archive

[Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

 

After several corosync/pacemaker restarts failed to help,
I was able to work around this by adding a 'uidgid'
entry for hacluster:haclient:

* from /var/log/syslog:
Aug 31 18:33:18 juju-machine-3-lxc-3 corosync[901082]:  [MAIN  ] Denied connection attempt from 108:113
$ getent passwd 108
hacluster:x:108:113::/var/lib/heartbeat:/bin/false
$ getent group 113
haclient:x:113:

* add uidgid config:
# echo $'uidgid {\n  uid: hacluster\n  gid: haclient\n}' > /etc/corosync/uidgid.d/hacluster

* restart => OK (crm status, etc.)
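
For reference, the lookup and config steps above could be scripted
roughly as follows. This is only a sketch: the function names
denied_ids and uidgid_stanza are made up for illustration, and it
assumes the "Denied connection attempt from UID:GID" syslog format
shown above.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of corosync).

# Extract the numeric "UID GID" pair from a corosync
# "Denied connection attempt from UID:GID" log line.
# Prints nothing if the line does not match.
denied_ids() {
    printf '%s\n' "$1" |
        sed -n 's/.*Denied connection attempt from \([0-9][0-9]*\):\([0-9][0-9]*\).*/\1 \2/p'
}

# Render a uidgid stanza for the given user and group names,
# in the same layout as the echo command above.
uidgid_stanza() {
    printf 'uidgid {\n  uid: %s\n  gid: %s\n}\n' "$1" "$2"
}
```

The numeric ids printed by denied_ids can be resolved with
"getent passwd" and "getent group" as shown above, and the result
written out (as root) with:
uidgid_stanza hacluster haclient > /etc/corosync/uidgid.d/hacluster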

I can't explain why other units work fine without
this ACL addition (perhaps a race during service setup/start?).

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to lxc in Ubuntu.
https://bugs.launchpad.net/bugs/1439649

Title:
  Pacemaker unable to communicate with corosync on restart under lxc

Status in lxc package in Ubuntu:
  Confirmed
Status in pacemaker package in Ubuntu:
  Confirmed

Bug description:
  We've seen this a few times with three-node clusters, all running in
  LXC containers: pacemaker fails to restart correctly because it can't
  communicate with corosync, resulting in a down cluster.  Rebooting the
  containers resolves the issue, so we suspect some sort of bad state
  in either corosync or pacemaker.

  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: mcp_read_config: Configured corosync to accept connections from group 115: Library error (2)
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: main: Starting Pacemaker 1.1.10 (Build: 42f2063):  generated-manpages agent-manpages ncurses libqb-logging libqb-ipc lha-fencing upstart nagios  heartbeat corosync-native snmp libesmtp
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: cluster_connect_quorum: Quorum acquired
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1000
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1003
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node juju-machine-4-lxc-4[1001] - state is now member (was (null))
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1003
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1003] - state is now member (was (null))
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: main: CRM Git Version: 42f2063
  Apr  2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr  2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]:   notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr  2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr  2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]:  [MAIN  ] Denied connection attempt from 109:115
  Apr  2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]:  [QB    ] Invalid IPC credentials (1033732-1033746).
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:    error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:    error: main: HA Signon failed
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:    error: main: Aborting startup
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:    error: pcmk_child_exit: Child process attrd (1033746) exited: Network is down (100)
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:  warning: pcmk_child_exit: Pacemaker child process attrd no longer wishes to be respawned. Shutting ourselves down.
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: pcmk_shutdown_worker: Shuting down Pacemaker
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: stop_child: Stopping crmd: Sent -15 to process 1033748
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:  warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:  warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
  Apr  2 11:41:32 juju-machine-4-lxc-4 cib[1033743]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: terminate_cs_connection: Disconnecting from Corosync
  Apr  2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]:  [MAIN  ] Denied connection attempt from 109:115
  Apr  2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]:  [QB    ] Invalid IPC credentials (1033732-1033743).
  Apr  2 11:41:32 juju-machine-4-lxc-4 cib[1033743]:    error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
  Apr  2 11:41:32 juju-machine-4-lxc-4 cib[1033743]:     crit: cib_init: Cannot sign in to the cluster... terminating
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: stop_child: Stopping pengine: Sent -15 to process 1033747
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:    error: pcmk_child_exit: Child process cib (1033743) exited: Network is down (100)
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:  warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: stop_child: Stopping lrmd: Sent -15 to process 1033745
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: stop_child: Stopping stonith-ng: Sent -15 to process 1033744
  Apr  2 11:41:34 juju-machine-4-lxc-4 corosync[1033732]:  [TOTEM ] A new membership (10.245.160.62:284) was formed. Members joined: 1000
  Apr  2 11:41:41 juju-machine-4-lxc-4 stonith-ng[1033744]:    error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
  Apr  2 11:41:41 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: pcmk_shutdown_worker: Shutdown complete
  Apr  2 11:41:41 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error

  ProblemType: Bug
  DistroRelease: Ubuntu 14.04
  Package: pacemaker 1.1.10+git20130802-1ubuntu2.3
  ProcVersionSignature: User Name 3.16.0-33.44~14.04.1-generic 3.16.7-ckt7
  Uname: Linux 3.16.0-33-generic x86_64
  NonfreeKernelModules: vhost_net vhost macvtap macvlan xt_conntrack ipt_REJECT ip6table_filter ip6_tables ebtable_nat ebtables veth 8021q garp xt_CHECKSUM mrp iptable_mangle ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables nbd ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi openvswitch gre vxlan dm_crypt bridge dm_multipath intel_rapl stp scsi_dh x86_pkg_temp_thermal llc intel_powerclamp coretemp ioatdma kvm_intel ipmi_si joydev sb_edac kvm hpwdt hpilo dca ipmi_msghandler acpi_power_meter edac_core lpc_ich shpchp serio_raw mac_hid xfs libcrc32c btrfs xor raid6_pq hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse tg3 ptp pata_acpi hpsa pps_core
  ApportVersion: 2.14.1-0ubuntu3.7
  Architecture: amd64
  Date: Thu Apr  2 11:42:18 2015
  SourcePackage: pacemaker
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1439649/+subscriptions