kernel-packages team mailing list archive

[Bug 1505564] Re: Soft lockup with "block nbdX: Attempted send on closed socket" spam

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: Dan Streetman <dan.streetman@xxxxxxxxxxxxx>
Date: Tue, 15 Dec 2015 02:34:37 -0000
Reply-to: Bug 1505564 <1505564@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Ok, here's my analysis of the latest dump.

There are 3 kernel migrate threads waiting; this is the cause of the
softlockup - specifically pid 101 on cpu 13 is where the softlockup (and
then panic, due to panic on softlockup enabled) happens, and the other 2
migrate threads (pid 79 and 151) are also waiting.  All are waiting for
multi_cpu_stop to finish.  The way multi_cpu_stop works is: the caller
sets up one or more cpus to coordinate stopping; in multi_cpu_stop, the
state machine moves from MULTI_STOP_PREPARE through disable irqs, to run
(the provided function), to exit when done.  However, only the specified
cpus (in the cpumask) will run the function.  The state machine doesn't
proceed to the next step until all cpus have processed the current
state.
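
The lockstep behaviour described above can be sketched as a small
userspace model. This is not the kernel's code: the struct, the function
names and every state name other than MULTI_STOP_PREPARE are assumptions
made for illustration. Each call to stop_model_step() stands for one
cpu's stopper thread getting a chance to run.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CPUS 8

/* Simplified states, in the order described above (names are assumed). */
enum stop_state { STOP_PREPARE, STOP_DISABLE_IRQ, STOP_RUN, STOP_EXIT };

struct stop_model {
    int num_cpus;            /* cpus taking part in the stop */
    enum stop_state state;   /* shared state, the same for all cpus */
    bool acked[MAX_CPUS];    /* cpus that processed the current state */
    int fn_runs;             /* how often the provided function ran */
};

static void stop_model_init(struct stop_model *m, int num_cpus)
{
    assert(num_cpus > 0 && num_cpus <= MAX_CPUS);
    m->num_cpus = num_cpus;
    m->state = STOP_PREPARE;
    m->fn_runs = 0;
    for (int i = 0; i < MAX_CPUS; i++)
        m->acked[i] = false;
}

/* One cpu processes the current state once. The shared state advances
 * only after every participating cpu has done so. */
static void stop_model_step(struct stop_model *m, int cpu, bool in_mask)
{
    assert(cpu >= 0 && cpu < m->num_cpus);
    if (m->state == STOP_EXIT || m->acked[cpu])
        return;              /* finished, or still waiting for the others */
    if (m->state == STOP_RUN && in_mask)
        m->fn_runs++;        /* only cpus in the mask run the function */
    m->acked[cpu] = true;

    for (int i = 0; i < m->num_cpus; i++)
        if (!m->acked[i])
            return;          /* some cpu has not caught up yet */
    for (int i = 0; i < m->num_cpus; i++)
        m->acked[i] = false;
    m->state = (enum stop_state)(m->state + 1);
}
```

In this model, stepping cpu 0 any number of times leaves the state at
STOP_PREPARE until cpu 1 is stepped as well, which is the shape of the
stall seen in the dump.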

This is where the problem comes in.  In this case, it's a migration of
tasks from one numa node to another, via numa rebalancing.  In this
particular case, there are 3 rebalancing events happening: cpu 3 and cpu
10, cpu 3 and cpu 13, cpu 3 and cpu 20.  The migrate threads on cpus 10,
13, and 20 are running multi_cpu_stop, but they are stuck waiting because
cpu 3 still has it in its queue.

cpu 3 is writing bytes to the serial port, and currently waiting for
confirmation that the serial port write completed.  This wait is done
via checking the serial port register for CTS, then if it's not set
delaying for 1us, and trying again.  However, this is all inside a held
spinlock, with irqs disabled.  So while this serial port r/w is being
done, nothing else will run on this cpu.  But - the code limits this to
1 second, so presumably it shouldn't lock up the cpu for longer than 1
second or so (I haven't dug too far into this, so the function may be
called multiple times with the lock held).
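
A bounded polling wait of the kind described above can be sketched as
follows. This is a model, not the serial driver's code: ready_fn,
delay_us_fn, wait_for_ready() and the fake_port test double are all
hypothetical names, and the register read and the 1us delay are passed
in as callbacks so the loop can run in userspace.

```c
#include <assert.h>
#include <stdbool.h>

#define WAIT_LIMIT_US 1000000u   /* 1 second, counted in 1us polling steps */

/* Hypothetical hooks standing in for the register read and the delay. */
typedef bool (*ready_fn)(void *ctx);
typedef void (*delay_us_fn)(void *ctx, unsigned int usecs);

/* Poll until the port reports ready or the limit is reached.
 * Returns the number of 1us delays that were requested. */
static unsigned int wait_for_ready(ready_fn ready, delay_us_fn delay_us,
                                   void *ctx)
{
    unsigned int waited = 0;

    assert(ready && delay_us);
    while (!ready(ctx) && waited < WAIT_LIMIT_US) {
        delay_us(ctx, 1);
        waited++;
    }
    return waited;
}

/* Test double: becomes ready after a fixed number of polls. */
struct fake_port {
    unsigned int ready_after;
    unsigned int polls;
    unsigned long delayed_us;
};

static bool fake_ready(void *ctx)
{
    struct fake_port *p = (struct fake_port *)ctx;
    return p->polls++ >= p->ready_after;
}

static void fake_delay(void *ctx, unsigned int usecs)
{
    struct fake_port *p = (struct fake_port *)ctx;
    p->delayed_us += usecs;
}
```

Note that in this sketch the limit counts loop iterations, not elapsed
time, so it only amounts to 1 second if each requested 1us delay really
takes about 1us.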

For whatever reason, that serial port r/w seems to be taking a long
time.  The migrate threads on the other cpus are waiting for it to
finish, so that the migrate thread on cpu 3 can run, and move the
multi_cpu_stop state machine along.  But that doesn't happen in time to
avoid the softlockup detector.

The multi_cpu_stop function could arguably use the addition of
touch_nmi_watchdog(), since it intentionally spins on the cpu with
interrupts disabled - doing so would avoid the softlockup detector (but
would not change the system behavior).  However, it's not really its
fault, since the real cause is the other cpu(s) it's waiting for being
locked.
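
To illustrate why touching the watchdog would only silence the report, here
is a toy model of a softlockup detector. All names are hypothetical and
the threshold is whatever the caller picks; it is not the kernel's
watchdog implementation.

```c
#include <assert.h>
#include <stdbool.h>

struct softlockup_model {
    unsigned long last_touch;   /* time of the last touch, in seconds */
    unsigned long threshold;    /* seconds without a touch before firing */
};

static void watchdog_touch(struct softlockup_model *w, unsigned long now)
{
    w->last_touch = now;
}

static bool watchdog_check(const struct softlockup_model *w,
                           unsigned long now)
{
    return now - w->last_touch > w->threshold;
}

/* Model a cpu spinning from `start` for `duration` seconds, optionally
 * touching the watchdog on every tick. Returns true if the detector
 * would have fired at any point during the spin. */
static bool spin_with_irqs_off(struct softlockup_model *w,
                               unsigned long start,
                               unsigned long duration, bool touch)
{
    bool fired = false;

    for (unsigned long t = start; t <= start + duration; t++) {
        if (touch)
            watchdog_touch(w, t);
        if (watchdog_check(w, t))
            fired = true;
    }
    return fired;
}
```

With a threshold of 20, a 30 second spin fires the detector without the
touch and stays quiet with it, yet the cpu spins for the same 30 seconds
either way.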

Back on cpu 3 (the one the others are waiting on), that delay is
implemented using the TSC.  Unfortunately, the TSC is a generally
unreliable clock source, so it's possible there is a problem in the
delay function.
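
As a rough illustration of how a counter-based delay depends on its clock,
here is a userspace sketch. It does not read the real TSC: counter_fn,
delay_by_counter() and the fake_counter test double are hypothetical, and
the max_polls bound exists only so the sketch terminates when the fake
counter is stalled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hook standing in for a cycle counter read. */
typedef uint64_t (*counter_fn)(void *ctx);

/* Busy-wait until `cycles` counter ticks have elapsed, or until
 * `max_polls` reads have been made. Returns the number of polls that
 * found the delay still incomplete. */
static uint64_t delay_by_counter(counter_fn read_counter, void *ctx,
                                 uint64_t cycles, uint64_t max_polls)
{
    uint64_t start, polls = 0;

    assert(read_counter);
    start = read_counter(ctx);
    while (polls < max_polls && read_counter(ctx) - start < cycles)
        polls++;
    return polls;
}

/* Test double: advances by `step` on every read (0 means stalled). */
struct fake_counter {
    uint64_t value;
    uint64_t step;
};

static uint64_t fake_read(void *ctx)
{
    struct fake_counter *c = (struct fake_counter *)ctx;
    c->value += c->step;
    return c->value;
}
```

With a counter that advances normally the loop exits after a handful of
polls; with a counter that does not advance it only stops at the
artificial bound, which is the kind of misbehaviour a bad clock source
could cause in a delay loop.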

To determine that, can you please boot with the "notsc" parameter, which
will change the udelay function to use a simple loop instead of the TSC,
and reproduce the softlockup?

** Changed in: linux (Ubuntu)
     Assignee: Rafael David Tinoco (inaddy) => Dan Streetman (ddstreet)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1505564

Title:
  Soft lockup with "block nbdX: Attempted send on closed socket" spam

Status in linux package in Ubuntu:
  In Progress

Bug description:
  Some of our nova compute hosts regularly freeze, sometimes for a few
  hours, with kern.log getting spammed with:

  block nbdX: Attempted send on closed socket

  and a few "CPU soft lockup" messages (see attached log). This clears
  up when the queue gets cleared, e.g.:

  block nbdX: queue cleared

  trusty hosts with kernel version 3.19.0-30-generic.
  --- 
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Nov 24 12:23 seq
   crw-rw---- 1 root audio 116, 33 Nov 24 12:23 timer
  AplayDevices: Error: [Errno 2] No such file or directory
  ApportVersion: 2.14.1-0ubuntu3.19
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  DistroRelease: Ubuntu 14.04
  IwConfig: Error: [Errno 2] No such file or directory
  MachineType: HP ProLiant DL385 G7
  Package: linux (not installed)
  PciMultimedia:
   
  ProcEnviron:
   TERM=screen-256color
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB: 0 radeondrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.19.0-36-generic root=UUID=13289ac9-8dc9-4feb-b6bd-ca7db66b21d6 ro console=tty0 console=ttyS1,38400 nosplash crashkernel=384M-:512M nox2apic intremap=off
  ProcVersionSignature: Ubuntu 3.19.0-36.41~14.04.1hf00090138v20151122b1-generic 3.19.8-ckt9
  RelatedPackageVersions:
   linux-restricted-modules-3.19.0-36-generic N/A
   linux-backports-modules-3.19.0-36-generic  N/A
   linux-firmware                             1.127.18
  RfKill: Error: [Errno 2] No such file or directory
  Tags:  trusty uec-images
  Uname: Linux 3.19.0-36-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups:
   
  _MarkForUpload: True
  dmi.bios.date: 02/02/2014
  dmi.bios.vendor: HP
  dmi.bios.version: A18
  dmi.chassis.type: 23
  dmi.chassis.vendor: HP
  dmi.modalias: dmi:bvnHP:bvrA18:bd02/02/2014:svnHP:pnProLiantDL385G7:pvr:cvnHP:ct23:cvr:
  dmi.product.name: ProLiant DL385 G7
  dmi.sys.vendor: HP

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1505564/+subscriptions
References

[Bug 1505564] [NEW] Soft lockup with "block nbdX: Attempted send on closed socket" spam
From: Junien Fridrick, 2015-10-13