touch-packages team mailing list archive
Message #76434
[Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc
I'm seeing this problem in another environment with a similar deployment
(3 LXC containers):
Apr 20 16:39:26 juju-machine-3-lxc-4 crm_verify[31774]: notice: crm_log_args: Invoked: crm_verify -V -p
Apr 20 16:39:27 juju-machine-3-lxc-4 cibadmin[31786]: notice: crm_log_args: Invoked: cibadmin -p -P
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]: error: cib_cs_destroy: Corosync connection lost! Exiting.
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]: error: crmd_quorum_destroy: connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: warning: qb_ipcs_event_sendv: new_event_notification (782-785-6): Bad file descriptor (9)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: warning: send_client_notify: Notification of client crmd/8ad990ba-cf09-4ba3-b74b-a7d05d377a1b failed
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: error: crm_abort: crm_glib_handler: Forked child 760 to record non-fatal assert at logging.c:63 : Source ID 4601370 was not found when attempting to remove it
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_child_exit: Child process cib (780) exited: Invalid argument (22)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: notice: pcmk_process_exit: Respawning failed child process: cib
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_child_exit: Child process crmd (785) exited: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: notice: pcmk_process_exit: Respawning failed child process: crmd
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: crit: attrd_cs_destroy: Lost connection to Corosync service!
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: notice: main: Exiting...
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: notice: main: Disconnecting client 0x7ff985e478e0, pid=785...
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: mcp_cpg_destroy: Connection destroyed
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: debug: crm_update_callsites: Enabling callsites based on priority=7, files=(null), functions=(null), formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]: debug: crm_update_callsites: Enabling callsites based on priority=7, files=(null), functions=(null), formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]: notice: main: CRM Git Version: 42f2063
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]: error: stonith_peer_cs_destroy: Corosync connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: crit: cib_init: Cannot sign in to the cluster... terminating
Apr 20 16:50:02 juju-machine-3-lxc-4 crmd[767]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Apr 20 16:50:05 juju-machine-3-lxc-4 crmd[767]: warning: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
These are the only processes running on one of the nodes:
root 782 0.0 0.0 81464 1828 ? Ss Feb12 25:13 /usr/lib/pacemaker/lrmd
haclust+ 784 0.0 0.0 73920 776 ? Ss Feb12 8:25 /usr/lib/pacemaker/pengine
root 780 0.8 0.0 130256 4152 ? Ssl 16:50 0:00 /usr/sbin/corosync
A possible explanation could be: http://thread.gmane.org/gmane.linux.highavailability.corosync/592/focus=639
I only have logs for one of the nodes. I'm trying to get logs from the
other 2 nodes to get a better understanding of what was happening with
the communication.
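
For comparing the three nodes once the other logs arrive, a small script along these lines could help. It is only a sketch: it assumes the syslog layout shown in the excerpt above ("daemon[pid]: severity: function: message"), and the names LINE_RE and summarize are made up for illustration. It counts error and crit lines per daemon and PID, so it is easy to see which Pacemaker children lost their corosync connection on each node.

```python
import re
from collections import Counter

# Matches lines shaped like the Pacemaker entries in the excerpt above:
#   "Apr 20 16:50:01 host daemon[pid]: severity: function: message"
# Lines with a different layout (for example corosync's "[MAIN ]" entries)
# do not match and are skipped.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>[\w-]+)\[(?P<pid>\d+)\]:\s+(?P<severity>\w+):\s+"
    r"(?P<func>\w+):\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Count 'error' and 'crit' lines per (daemon, pid)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("severity") in ("error", "crit"):
            counts[(m.group("daemon"), int(m.group("pid")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Reads log lines from stdin and prints the noisiest daemons first.
    for (daemon, pid), n in summarize(sys.stdin).most_common():
        print(f"{daemon}[{pid}]: {n} error/crit lines")
```

Running it over the same time window on each node should show whether all three lost the CPG connection at 16:50:01 or only this one.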
--
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to lxc in Ubuntu.
https://bugs.launchpad.net/bugs/1439649
Title:
Pacemaker unable to communicate with corosync on restart under lxc
Status in lxc package in Ubuntu:
Confirmed
Status in pacemaker package in Ubuntu:
Confirmed
Bug description:
We've seen this a few times with three-node clusters, all running in
LXC containers: pacemaker fails to restart correctly because it can't
communicate with corosync, resulting in a down cluster. Rebooting the
containers resolves the issue, so we suspect some sort of bad state
in either corosync or pacemaker.
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: mcp_read_config: Configured corosync to accept connections from group 115: Library error (2)
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: main: Starting Pacemaker 1.1.10 (Build: 42f2063): generated-manpages agent-manpages ncurses libqb-logging libqb-ipc lha-fencing upstart nagios heartbeat corosync-native snmp libesmtp
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: cluster_connect_quorum: Quorum acquired
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1000
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1003
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: crm_update_peer_state: pcmk_quorum_notification: Node juju-machine-4-lxc-4[1001] - state is now member (was (null))
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1003
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1003] - state is now member (was (null))
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: main: CRM Git Version: 42f2063
Apr 2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Apr 2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [MAIN ] Denied connection attempt from 109:115
Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [QB ] Invalid IPC credentials (1033732-1033746).
Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: error: main: HA Signon failed
Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: error: main: Aborting startup
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: error: pcmk_child_exit: Child process attrd (1033746) exited: Network is down (100)
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: warning: pcmk_child_exit: Pacemaker child process attrd no longer wishes to be respawned. Shutting ourselves down.
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping crmd: Sent -15 to process 1033748
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
Apr 2 11:41:32 juju-machine-4-lxc-4 cib[1033743]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: terminate_cs_connection: Disconnecting from Corosync
Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [MAIN ] Denied connection attempt from 109:115
Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [QB ] Invalid IPC credentials (1033732-1033743).
Apr 2 11:41:32 juju-machine-4-lxc-4 cib[1033743]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Apr 2 11:41:32 juju-machine-4-lxc-4 cib[1033743]: crit: cib_init: Cannot sign in to the cluster... terminating
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping pengine: Sent -15 to process 1033747
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: error: pcmk_child_exit: Child process cib (1033743) exited: Network is down (100)
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping lrmd: Sent -15 to process 1033745
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 1033744
Apr 2 11:41:34 juju-machine-4-lxc-4 corosync[1033732]: [TOTEM ] A new membership (10.245.160.62:284) was formed. Members joined: 1000
Apr 2 11:41:41 juju-machine-4-lxc-4 stonith-ng[1033744]: error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Apr 2 11:41:41 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: pcmk_shutdown_worker: Shutdown complete
Apr 2 11:41:41 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error
ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: pacemaker 1.1.10+git20130802-1ubuntu2.3
ProcVersionSignature: Ubuntu 3.16.0-33.44~14.04.1-generic 3.16.7-ckt7
Uname: Linux 3.16.0-33-generic x86_64
NonfreeKernelModules: vhost_net vhost macvtap macvlan xt_conntrack ipt_REJECT ip6table_filter ip6_tables ebtable_nat ebtables veth 8021q garp xt_CHECKSUM mrp iptable_mangle ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables nbd ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi openvswitch gre vxlan dm_crypt bridge dm_multipath intel_rapl stp scsi_dh x86_pkg_temp_thermal llc intel_powerclamp coretemp ioatdma kvm_intel ipmi_si joydev sb_edac kvm hpwdt hpilo dca ipmi_msghandler acpi_power_meter edac_core lpc_ich shpchp serio_raw mac_hid xfs libcrc32c btrfs xor raid6_pq hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse tg3 ptp pata_acpi hpsa pps_core
ApportVersion: 2.14.1-0ubuntu3.7
Architecture: amd64
Date: Thu Apr 2 11:42:18 2015
SourcePackage: pacemaker
UpgradeStatus: No upgrade log present (probably fresh install)