kernel-packages team mailing list archive

[Bug 1534345] Re: Ubuntu 15.10 Crashing Frequently on EC2 Instances w/ Enhanced Networking

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: Stefan Bader <stefan.bader@xxxxxxxxxxxxx>
Date: Wed, 20 Jan 2016 16:29:28 -0000
Reply-to: Bug 1534345 <1534345@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Note that the following is not a final statement, just some thoughts
shared for whoever else is looking at this report (and for me to
remember). While I found nothing that really looked odd in the
xen-netfront code, I saw there was a change to the generic timer code:

commit 1dabbcec2c0a36fe43509d06499b9e512e70a028
  timer: Use hlist for the timer wheel hash buckets

That change was part of 4.2, but if it were the cause I would expect
problems not only on AWS instances. Then again, it might just be that
bare-metal servers with similarly high traffic tend to be upgraded much
less often. Anyway, part of the change above seems to be a swap of the
special meanings of the list pointer values. I am not sure I grasp the
implications yet. With the doubly linked lists used before, the pointer
to the next element seemed to serve as the pending indicator and the
pointer to the previous element was invalidated with a LIST_POISON2
value. Now it's the other way round. I am referring to the detach_timer
function, which is called from __run_timers via detach_expired_timer.
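
To make the swap of pointer roles concrete, here is a minimal user-space
sketch of the two detach variants as I understand them. It is not the
kernel source: the struct layouts are simplified, the names detach_old,
detach_new, pending_old and pending_new are made up for illustration, the
assert() preconditions are my own addition, and the poison constant is the
value seen in RAX in the trace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Poison value as it appears in RAX in the trace (assumed x86_64). */
#define LIST_POISON2 ((void *)(uintptr_t)0xdead000000200200ULL)

/* Simplified stand-ins for the kernel list types. */
struct list_head  { struct list_head *next, *prev; };
struct hlist_node { struct hlist_node *next, **pprev; };

/* Before the change: next == NULL means "not pending", prev is poisoned. */
static int pending_old(const struct list_head *e)
{
    return e->next != NULL;
}

static void detach_old(struct list_head *e, int clear_pending)
{
    assert(e->next != NULL);        /* sketch-only precondition: pending */
    e->next->prev = e->prev;        /* unlink from the doubly linked list */
    e->prev->next = e->next;
    if (clear_pending)
        e->next = NULL;             /* pending indicator */
    e->prev = LIST_POISON2;         /* invalidated pointer */
}

/* After the change: pprev == NULL means "not pending", next is poisoned. */
static int pending_new(const struct hlist_node *e)
{
    return e->pprev != NULL;
}

static void detach_new(struct hlist_node *e, int clear_pending)
{
    struct hlist_node *next = e->next;
    struct hlist_node **pprev = e->pprev;

    assert(pprev != NULL);          /* sketch-only precondition: pending */
    *pprev = next;                  /* unlink from the hash bucket */
    if (next)
        next->pprev = pprev;        /* dereferences next if it is non-NULL */
    if (clear_pending)
        e->pprev = NULL;            /* pending indicator */
    e->next = LIST_POISON2;         /* invalidated pointer */
}
```

The point of the sketch is that in the new variant a node that has already
been detached carries LIST_POISON2 in next, which is non-NULL, so a second
unlink of the same node would write through the poison value.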

The crash happens at offset 0x116 in run_timer_softirq (that's 278
decimal). The disassembly of that function around there is:

   0xffffffff810e5c1e <+254>:	mov    %r15,0x8(%rbx)
   0xffffffff810e5c22 <+258>:	nopl   0x0(%rax,%rax,1)
   // Guess this is __hlist_del(struct hlist_node *n)
   // rax = n->next
   0xffffffff810e5c27 <+263>:	mov    (%r15),%rax
   // rdx = n->pprev
   0xffffffff810e5c2a <+266>:	mov    0x8(%r15),%rdx
   0xffffffff810e5c2e <+270>:	test   %rax,%rax
   // *(n->pprev) = n->next
   0xffffffff810e5c31 <+273>:	mov    %rax,(%rdx)
   // if (n->next == NULL) jump
   0xffffffff810e5c34 <+276>:	je     0xffffffff810e5c3a <run_timer_softirq+282>
   // (n->next)->pprev = n->pprev (but n->next is LIST_POISON2 / invalid ptr)
   0xffffffff810e5c36 <+278>:	mov    %rdx,0x8(%rax)
   0xffffffff810e5c3a <+282>:	testb  $0x10,0x2a(%r15)
   // here we seem to be back in the inlined detach_timer, with clear_pending assumed true
   // entry->next = LIST_POISON2 and entry->pprev = NULL
   0xffffffff810e5c3f <+287>:	movabs $0xdead000000200200,%rax
   0xffffffff810e5c49 <+297>:	movq   $0x0,0x8(%r15)
   0xffffffff810e5c51 <+305>:	mov    %rax,(%r15)
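
The address arithmetic behind the annotations can be double-checked with a
small sketch. The helper names are made up for illustration, the offset 0x8
is taken from the "mov %rdx,0x8(%rax)" instruction, and the canonical-address
test assumes 48-bit virtual addresses (4-level paging), which I believe
applies here but have not verified for this instance type.

```c
#include <assert.h>
#include <stdint.h>

#define LIST_POISON2_VAL 0xdead000000200200ULL /* RAX in the trace */
#define PPREV_OFFSET     0x8ULL                /* from "mov %rdx,0x8(%rax)" */

/* Start of a function, given the faulting RIP and the symbol offset. */
static uint64_t function_start(uint64_t rip, uint64_t offset)
{
    assert(rip >= offset);
    return rip - offset;
}

/* Address written by "(n->next)->pprev = n->pprev". */
static uint64_t faulting_store_address(uint64_t next)
{
    return next + PPREV_OFFSET;
}

/* Canonical means bits 63..47 are all zeros or all ones (assumed 48-bit VA). */
static int is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;
    return top == 0 || top == 0x1ffffULL;
}
```

With RIP ffffffff810e5c36 and offset 0x116, function_start() gives
ffffffff810e5b20, which matches the <+254> line at ffffffff810e5c1e. The
store at <+278> targets LIST_POISON2 + 8 = dead000000200208, which is not a
canonical address under the assumption above. That would be consistent with
the trace reporting a general protection fault rather than a page fault.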

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1534345

Title:
  Ubuntu 15.10 Crashing Frequently on EC2 Instances w/ Enhanced
  Networking

Status in linux package in Ubuntu:
  Triaged

Bug description:
  Lots of details and history of the problem here:
  https://askubuntu.com/questions/710747/after-upgrading-to-15-10-from-15-04-ec2-webservers-have-become-very-unstable

  10 of my webservers have started crashing immediately following the
  15.10 upgrade. As far as what exactly defines a "crash", Instance
  Status Checks fail, and I can no longer SSH to the machine. Background
  daemons running on the system stop responding, and nothing is written
  to the logs.

  After weeks of working with the AWS team, I finally fixed a netconsole
  issue via "echo 7 > /proc/sys/kernel/printk" and got netconsole
  working properly, and finally have a trace:

  
  [21410.260077] general protection fault: 0000 [#1] SMP
  [21410.261976] Modules linked in: isofs xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables ppdev intel_rapl iosf_mbi xen_fbfront fb_sys_fops input_leds serio_raw i2c_piix4 parport_pc 8250_fintek parport mac_hid netconsole configfs autofs4 crct10dif_pclmul crc32_pclmul cirrus syscopyarea sysfillrect sysimgblt aesni_intel ttm aes_x86_64 drm_kms_helper lrw gf128mul glue_helper ablk_helper cryptd psmouse drm ixgbevf pata_acpi floppy
  [21410.264054] CPU: 0 PID: 26957 Comm: apache2 Not tainted 4.2.0-23-generic #28-Ubuntu
  [21410.264054] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
  [21410.264054] task: ffff8803f9809b80 ti: ffff8803f999c000 task.ti: ffff8803f999c000
  [21410.264054] RIP: 0010:[<ffffffff810e5c36>]  [<ffffffff810e5c36>] run_timer_softirq+0x116/0x2d0
  [21410.264054] RSP: 0000:ffff8803ff203e98  EFLAGS: 00010086
  [21410.264054] RAX: dead000000200200 RBX: ffff8803ff20e9c0 RCX: ffff8803ff203ec8
  [21410.264054] RDX: ffff8803ff203ec8 RSI: 0000000000011fc0 RDI: ffff8803ff20e9c0
  [21410.264054] RBP: ffff8803ff203f08 R08: 000000000000a77a R09: 0000000000000000
  [21410.264054] R10: 0000000000000020 R11: 0000000000000004 R12: 000000000000007c
  [21410.264054] R13: ffffffff8172aaf0 R14: 0000000000000000 R15: ffff8803af955be0
  [21410.264054] FS:  00007fb0ce6e8780(0000) GS:ffff8803ff200000(0000) knlGS:0000000000000000
  [21410.264054] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [21410.264054] CR2: 00007fb0ce51e130 CR3: 00000003fb233000 CR4: 00000000001406f0
  [21410.264054] Stack:
  [21410.264054]  ffff8803ff203eb8 ffff8803ff20f5f8 ffff8803ff20f3f8 ffff8803ff20f1f8
  [21410.264054]  ffff8803ff20e9f8 ffff8803af955b58 dead000000200200 00000000f60fabc0
  [21410.264054]  0000000000011fc0 0000000000000001 ffffffff81c0b0c8 0000000000000001
  [21410.264054] Call Trace:
  [21410.264054]  <IRQ>
  [21410.264054]  [<ffffffff8107f846>] __do_softirq+0xf6/0x250
  [21410.264054]  [<ffffffff8107fb13>] irq_exit+0xa3/0xb0
  [21410.264054]  [<ffffffff814a4499>] xen_evtchn_do_upcall+0x39/0x50
  [21410.264054]  [<ffffffff817f1f6b>] xen_hvm_callback_vector+0x6b/0x70
  [21410.264054]  <EOI>
  [21410.264054] Code: 81 e6 00 00 20 00 48 85 d2 48 89 45 b8 0f 85 30 01 00 00 4c 89 7b 08 0f 1f 44 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 47 2a 10 48 b8 00 02 20 00 00 00 ad de 49 c7
  [21410.264054] RIP  [<ffffffff810e5c36>] run_timer_softirq+0x116/0x2d0
  [21410.264054]  RSP <ffff8803ff203e98>

  I don't have a vmcore at the moment, but I'm trying to get one from
  AWS and should have one in the next couple of days. This is happening
  frequently and repeatedly since I first upgraded to 15.10 in early
  December.

  
  ubuntu@xxx-web-xx:~$ lsb_release -a
  No LSB modules are available.
  Distributor ID:	Ubuntu
  Description:	Ubuntu 15.10
  Release:	15.10
  Codename:	wily
  ubuntu@xxx-web-xx:~$ uname -a
  Linux xxx-web-xx 4.2.0-23-generic #28-Ubuntu SMP Sun Dec 27 17:47:31 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
  ubuntu@xxx-web-xx:~$

  ProblemType: Bug
  DistroRelease: Ubuntu 15.10
  Package: linux-image-4.2.0-23-generic 4.2.0-23.28
  ProcVersionSignature: User Name 4.2.0-23.28-generic 4.2.6
  Uname: Linux 4.2.0-23-generic x86_64
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Jan 14 15:42 seq
   crw-rw---- 1 root audio 116, 33 Jan 14 15:42 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.19.1-0ubuntu5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  Date: Thu Jan 14 21:31:14 2016
  Ec2AMI: ami-d5e7adbf
  Ec2AMIManifest: (unknown)
  Ec2AvailabilityZone: us-east-1d
  Ec2InstanceType: m4.xlarge
  Ec2Kernel: unavailable
  Ec2Ramdisk: unavailable
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
  MachineType: Xen HVM domU
  PciMultimedia:
   
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=<set>
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB:
   0 cirrusdrmfb
   1 xen
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.2.0-23-generic root=UUID=9bd55602-81dd-4868-8cfc-b7d63f8f8d7e ro console=tty1 console=ttyS0 crashkernel=256M@0M
  RelatedPackageVersions:
   linux-restricted-modules-4.2.0-23-generic N/A
   linux-backports-modules-4.2.0-23-generic  N/A
   linux-firmware                            1.149.3
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  SourcePackage: linux
  UdevLog: Error: [Errno 2] No such file or directory: '/var/log/udev'
  UpgradeStatus: Upgraded to wily on 2015-12-15 (29 days ago)
  dmi.bios.date: 12/07/2015
  dmi.bios.vendor: Xen
  dmi.bios.version: 4.2.amazon
  dmi.chassis.type: 1
  dmi.chassis.vendor: Xen
  dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd12/07/2015:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
  dmi.product.name: HVM domU
  dmi.product.version: 4.2.amazon
  dmi.sys.vendor: Xen

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1534345/+subscriptions
References

[Bug 1534345] [NEW] Ubuntu 15.10 Crashing Frequently on EC2 Instances w/ Enhanced Networking
From: Will Buckner, 2016-01-14