group.of.nepali.translators team mailing list archive

[Bug 1632045] Re: KVM: PPC: Book3S HV: Migrate pinned pages out of CMA

To: group.of.nepali.translators@xxxxxxxxxxxxxxxxxxx
From: Tim Gardner <tim.gardner@xxxxxxxxxxxxx>
Date: Fri, 28 Oct 2016 14:20:22 -0000
Reply-to: Bug 1632045 <1632045@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
** Changed in: linux (Ubuntu Xenial)
       Status: In Progress => Fix Committed

** Changed in: linux (Ubuntu)
       Status: In Progress => Fix Released

** Also affects: linux (Ubuntu Zesty)
   Importance: High
     Assignee: Tim Gardner (timg-tpi)
       Status: Fix Released

-- 
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1632045

Title:
  KVM: PPC: Book3S HV: Migrate pinned pages out of CMA

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Yakkety:
  Fix Committed
Status in linux source package in Zesty:
  Fix Released

Bug description:
  ---Problem Description---
  https://github.com/open-power/supermicro-openpower/issues/59

  SW/HW Configuration

  PNOR image version: 5/3/2016
  BMC image version: 0.25
  CPLD Version: B2.81.01
  Host OS version: Ubuntu 16.04 LTS
  UbuntuKVM Guest OS version: Ubuntu 14.04.4 LTS
  HTX version: 394
  Processor: 00UL865 * 2
  Memory: SK hynix 16GB 2Rx4 PC4-2133P * 16

  Summary of Issue

  Two UbuntuKVM guests are each configured with 8 processors, 64 GB of
  memory, 1 disk of 128 GB, 1 network interface, and 1 GPU (pass-
  through'd from the Host OS's K80).

  The two guests are each put into a Create/Destroy loop, with HTX
  running on each of the guests (NOT HOST) in between its creation and
  destruction. The mdt.bu profile is used, and the processors, memory,
  and the GPU are put under load. The HTX session lasts 9 minutes.

  While this is running, the amount of available (free) memory in the
  Host OS slowly decreases, and this can continue until there is no
  more free memory for the Host OS to do anything, including creating
  the two VM guests. It seems that after every cycle, a small portion
  of the memory that was allocated to the VM guest does not get
  released back to the Host OS, and eventually this adds up to take
  all the available memory in the Host OS.

  At some point, the VM guest(s) might get disconnected and will display
  the following error:

      error: Disconnected from qemu:///system due to I/O error

      error: One or more references were leaked after disconnect from
  the hypervisor

  Then, when the Host OS tries to start the VM guest again, the
  following error shows up:

      error: Failed to create domain from guest2_trusty.xml
      error: internal error: early end of file from monitor, possible problem: Unexpected error in spapr_alloc_htab() at /build/qemu-c3ZrbA/qemu-2.5+dfsg/hw/ppc/spapr.c:1030:
      2016-05-23T16:18:16.871549Z qemu-system-ppc64: Failed to allocate HTAB of requested size, try with smaller maxmem

  The Host OS syslog, as seen HERE, also contains quite a few errors.
  To list just a few:

      May 13 20:27:44 191-136 kernel: [36827.151228] alloc_contig_range: [3fb800, 3fd8f8) PFNs busy
      May 13 20:27:44 191-136 kernel: [36827.151291] alloc_contig_range: [3fb800, 3fd8fc) PFNs busy
      May 13 20:27:44 191-136 libvirtd[19263]: *** Error in `/usr/sbin/libvirtd': realloc(): invalid next size: 0x000001000a780400 ***
      May 13 20:27:44 191-136 libvirtd[19263]: ======= Backtrace: =========
      May 13 20:27:44 191-136 libvirtd[19263]: /lib/powerpc64le-linux-gnu/libc.so.6(+0x8720c)[0x3fffaf6a720c]
      May 13 20:27:44 191-136 libvirtd[19263]: /lib/powerpc64le-linux-gnu/libc.so.6(+0x96f70)[0x3fffaf6b6f70]
      May 13 20:27:44 191-136 libvirtd[19263]: /lib/powerpc64le-linux-gnu/libc.so.6(realloc+0x16c)[0x3fffaf6b87fc]
      May 13 20:27:44 191-136 libvirtd[19263]: /usr/lib/powerpc64le-linux-gnu/libvirt.so.0(virReallocN+0x68)[0x3fffaf90ccc8]
      May 13 20:27:44 191-136 libvirtd[19263]: /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so(+0x8ef6c)[0x3fff9346ef6c]
      May 13 20:27:44 191-136 libvirtd[19263]: /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so(+0xa826c)[0x3fff9348826c]
      May 13 20:27:44 191-136 libvirtd[19263]: /usr/lib/powerpc64le-linux-gnu/libvirt.so.0(virEventPollRunOnce+0x8b4)[0x3fffaf9332b4]
      May 13 20:27:44 191-136 libvirtd[19263]: /usr/lib/powerpc64le-linux-gnu/libvirt.so.0(virEventRunDefaultImpl+0x54)[0x3fffaf931334]
      May 13 20:27:44 191-136 libvirtd[19263]: /usr/lib/powerpc64le-linux-gnu/libvirt.so.0(virNetDaemonRun+0x1f0)[0x3fffafad2f70]
      May 13 20:27:44 191-136 libvirtd[19263]: /usr/sbin/libvirtd(+0x15d74)[0x52e45d74]
      May 13 20:27:44 191-136 libvirtd[19263]: /lib/powerpc64le-linux-gnu/libc.so.6(+0x2319c)[0x3fffaf64319c]
      May 13 20:27:44 191-136 libvirtd[19263]: /lib/powerpc64le-linux-gnu/libc.so.6(__libc_start_main+0xb8)[0x3fffaf6433b8]
      May 13 20:27:44 191-136 libvirtd[19263]: ======= Memory map: ========
      May 13 20:27:44 191-136 libvirtd[19263]: 52e30000-52eb0000 r-xp 00000000 08:02 65540510 /usr/sbin/libvirtd
      May 13 20:27:44 191-136 libvirtd[19263]: 52ec0000-52ed0000 r--p 00080000 08:02 65540510 /usr/sbin/libvirtd
      May 13 20:27:44 191-136 libvirtd[19263]: 52ed0000-52ee0000 rw-p 00090000 08:02 65540510 /usr/sbin/libvirtd
      May 13 20:27:44 191-136 libvirtd[19263]: 1000a730000-1000a830000 rw-p 00000000 00:00 0 [heap]
      May 13 20:27:44 191-136 libvirtd[19263]: 3fff60000000-3fff60030000 rw-p 00000000 00:00 0
      May 13 20:27:44 191-136 libvirtd[19263]: 3fff60030000-3fff64000000 ---p 00000000 00:00 0
      May 13 20:50:33 191-136 kernel: [38196.502926] audit: type=1400 audit(1463197833.497:4025): apparmor="DENIED" operation="open" profile="libvirt-d3ade785-c1c1-4519-b123-9d28704c2ad4" name="/sys/devices/pci0003:00/0003:00:00.0/0003:01:00.0/0003:02:08.0/0003:03:00.0/devspec" pid=24887 comm="qemu-system-ppc" requested_mask="r" denied_mask="r" fsuid=110 ouid=0
      May 13 20:50:33 191-136 virtlogd[3727]: End of file while reading data: Input/output error

  Notes

  Host OS's free memory will also slowly decrease when HTX is NOT
  executed at all on the guests between guest Create/Destroy, but at a
  much slower pace. VM guests can also still fail to be created, with
  the same error message, even though the Host OS might still have
  plenty of free memory left:

      error: Failed to create domain from guest2_trusty.xml
      error: internal error: early end of file from monitor, possible problem: Unexpected error in spapr_alloc_htab() at /build/qemu-c3ZrbA/qemu-2.5+dfsg/hw/ppc/spapr.c:1030:
      2016-05-23T16:18:16.871549Z qemu-system-ppc64: Failed to allocate HTAB of requested size, try with smaller maxmem

  However, this happened only once so far, and after it completed about 3924 Create/Destroy cycles.
  The other guest that was running the same test concurrently did NOT have any issues and went on to 4,600+ cycles.

   
  ---uname output---
  Host OS version: Ubuntu 16.04 LTS
  UbuntuKVM Guest OS version: Ubuntu 14.04.4 LTS
   
  Machine Type = SMC 
   
  I do not see any actual evidence of all memory being used. What I do
  see is:

  1. "Failed to allocate HTAB" - happens because we run out of
  _contiguous_ chunks of CMA memory, not just any RAM

  2. libvirtd[19263]: *** Error in `/usr/sbin/libvirtd': realloc():
  invalid next size: 0x000001000a780400 *** - this looks more like
  memory corruption than insufficient memory

  I suggest collecting statistics using something like this shell
  script:

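  #!/bin/sh
  # Log MemFree and CmaFree from /proc/meminfo after every guest cycle.

  while true
  do
   <here you put guest start/stop>
   # One tab-separated line per cycle: MemFree first, then CmaFree.
   grep -e "\(CmaFree:\|MemFree:\)" /proc/meminfo | paste -d "\t" - - >> mymemorylog
  done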

  and attaching the resulting mymemorylog to this bug. It would also be
  interesting to know whether the issue can be reproduced without the
  NVIDIA driver loaded in the guest, or even without passing the NVIDIA
  GPU through to the guest. Meanwhile, I am running my own tests to see
  if I can reproduce this behavior.
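
  The log produced by the loop above can be reduced to a single trend
  line. The following is only a minimal sketch and is not part of the
  original report: the function name summarize_memlog is hypothetical,
  and it assumes every line of mymemorylog has the form
  "MemFree: <n> kB<TAB>CmaFree: <n> kB", which is what the grep and
  paste pipeline should emit given the field order in /proc/meminfo.

```shell
#!/bin/sh
# Hypothetical helper: report how MemFree and CmaFree changed between the
# first and the last sample in a log written by the loop above.
# Argument: path to the log (defaults to ./mymemorylog).
summarize_memlog() {
    awk '
        # Expected fields: MemFree: <kB> kB CmaFree: <kB> kB
        $1 == "MemFree:" && $4 == "CmaFree:" {
            if (n == 0) { mem0 = $2; cma0 = $5 }   # remember first sample
            mem = $2; cma = $5; n++                # track latest sample
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples=%d MemFree_delta_kB=%d CmaFree_delta_kB=%d\n",
                   n, mem - mem0, cma - cma0
        }
    ' "${1:-mymemorylog}"
}
```

  Calling "summarize_memlog mymemorylog" after a number of cycles prints
  one line. A CmaFree delta that keeps growing more negative while
  MemFree stays roughly stable would point at the CMA area rather than
  at general RAM exhaustion.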

  Ok, located the problem, will post a patch tomorrow to the public
  lists.

  Basically, when QEMU dies, its DMA pages are unpinned only when its
  memory context is destroyed. That was expected to happen when the QEMU
  process exits, but it may actually happen a lot later: if some kernel
  thread was executed on the same context and took a reference to it,
  the very last memory context release does not happen until that
  thread is scheduled again.

  == Comment: #15 - Leonardo Augusto Guimaraes Garcia <lagarcia@xxxxxxxxxx> - 2016-08-24 08:15:00 ==
  (In reply to comment #14)
  > On my host, I have 10 guests running. Sum of all 10 guests memory will come
  > up to 69GB.

  Ok... So, this is quite different from what is in the bug description.
  In the bug description, I read:

  "Two UbuntuKVM guests are each configured with 8 processors, 64 GB of
  memory, 1 disk of 128 GB, 1 network interface, and 1 GPU (pass-
  through'd from the Host OS's K80).

  The two guests are each put into a Create/Destroy loop, with HTX
  running on each of the guests (NOT HOST) in between its creation and
  destruction. The mdt.bu profile is used, and the processors, memory,
  and the GPU are put under load. The HTX session lasts 9 minutes."

  What is the scenario being worked on in this bug? I suggest you open
  a new bug for your issue if needed, and we continue to investigate
  the original issue here.

  > 
  > I am trying to bring up 11th guest which is having 5Gb memory and it fails:
  > 
  > root@lotkvm:~# virsh start --console lotg12
  > error: Failed to start domain lotg12
  > error: internal error: process exited while connecting to monitor:
  > 5076802818bda30000000000003f2,format=raw,if=none,id=drive-virtio-disk0
  > -device
  > virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,
  > id=virtio-disk0,bootindex=1 -drive
  > file=/dev/disk/by-id/wwn-0x6005076802818bda30000000000003f4,format=raw,
  > if=none,id=drive-virtio-disk1 -device
  > virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk1,
  > id=virtio-disk1 -netdev tap,fd=41,id=hostnet0 -device
  > virtio-net,netdev=hostnet0,id=net0,mac=52:54:00:9b:53:77,bus=pci.0,addr=0x1,
  > bootindex=2 -chardev pty,id=charserial0 -device
  > spapr-vty,chardev=charserial0,reg=0x30000000 -device
  > virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x2 -msg timestamp=on
  > 2016-08-24T12:00:50.375315Z qemu-system-ppc64: Failed to allocate KVM HPT of
  > order 26 (try smaller maxmem?): Cannot allocate memory

  This is not because you don't have available memory. This is because
  you don't have CMA memory available. Please, take a look at LTC bug
  145072 comment 5 and subsequent comments.
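
  To put a number on the "order 26" in the quoted error: in the KVM
  interface the HPT order is the base-2 logarithm of the table size in
  bytes, so order 26 is a 64 MiB table that has to come from one
  contiguous CMA range. The sketch below is only an illustration under
  that assumption. The function names are hypothetical, and comparing
  against CmaFree is a necessary check, not a sufficient one, because
  CmaFree says nothing about contiguity.

```shell
#!/bin/sh
# Size in KiB of a hash page table of the given order (order = log2 of bytes).
hpt_size_kib() {
    echo $(( (1 << $1) / 1024 ))
}

# CmaFree value in kB from a meminfo-style file (default: /proc/meminfo).
cma_free_kib() {
    awk '$1 == "CmaFree:" { print $2 }' "${1:-/proc/meminfo}"
}

# Compare the HPT size for an order against the total free CMA memory.
# Arguments: HPT order, optional meminfo-style file.
check_hpt_vs_cma() {
    need=$(hpt_size_kib "$1")
    free=$(cma_free_kib "$2")
    free=${free:-0}    # treat a missing CmaFree line as zero
    if [ "$free" -lt "$need" ]; then
        echo "CmaFree $free kB < HPT $need kB: not enough CMA memory"
    else
        echo "CmaFree $free kB >= HPT $need kB: enough in total, but it must also be contiguous"
    fi
}
```

  For example, "check_hpt_vs_cma 26" on the host compares the current
  CmaFree value with the 65536 kB that an order 26 HPT needs.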

  > 
  > 
  > I waited for an hour and retried guest start.. It fails still..
  > 
  > Current memory on host :
  > -----------
  > root@lotkvm:~# free -g
  >               total        used        free      shared  buff/cache   available
  > Mem:            127          73           0           0          53          53
  > Swap:            11           4           6

  I think there are actually two separate problems here.

  (A) Pages in the CMA zone are getting pinned and causing fragmentation
  of the CMA zone, leading to the messages saying "qemu-system-ppc64:
  Failed to allocate HTAB of requested size, try with smaller maxmem".
  This happens because the guest is doing PCI passthrough with DDW
  enabled and hence pins all its memory. If guest pages happen to be
  allocated in the CMA zone, they get pinned there and then can't be
  moved for a future HPT allocation.
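
  One rough way to watch for problem (A) on a host is to count the
  "alloc_contig_range: ... PFNs busy" kernel messages of the kind shown
  in the syslog excerpt in the bug description. This is only a sketch:
  the function name count_cma_busy is hypothetical, the default path
  /var/log/syslog is an assumption about the host's logging setup, and
  such messages only indicate that a contiguous allocation attempt hit
  busy pages, not which pages were pinned or why.

```shell
#!/bin/sh
# Hypothetical helper: count kernel log lines reporting that a contiguous
# range could not be allocated because some pages in it were busy.
# Argument: log file to scan (assumed default: /var/log/syslog).
count_cma_busy() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'alloc_contig_range: .*PFNs busy' "${1:-/var/log/syslog}"
}
```

  Sampling this count before and after a batch of guest Create/Destroy
  cycles gives a crude indication of whether contiguous allocations are
  failing more often over time.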

  Balbir was looking at the possibility of moving the pages out of the
  CMA zone before pinning them, but this work was dependent on some
  upstream refactoring which seems to be stalled.

  (B) On VM destruction, the pages are not getting unpinned and freed in
  a timely fashion. Alexey debugged this issue and has posted two
  patches to fix the problem: "powerpc/iommu: Stop using @current in
  mm_iommu_xxx" and "powerpc/mm/iommu: Put pages on process exit". These
  patches touch two maintainers' areas (powerpc and vfio) and hence need
  two maintainers' concurrence, and thus haven't gone anywhere yet.

  (Of course, issue (B) exacerbates issue (A).)

  After moving the host and guests to the 4.8 kernel, almost all of the
  memory on the host is still getting used.

  Are there any updates here, or any patches that we can expect soon?
  Please let us know.

  Thanks,
  Manju

  
  4.8 does not yet have the fix for the pinned page migrations. I am not sure of the status of https://patchwork.kernel.org/patch/9238861/ upstream. I checked to see if I could find it in any git tree, but could not. I suspect we need this fix in first.

  > Balbir - Is this fixed in the latest 4.8 kernel out today?
  My patch is in powerpc-next

  https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=2e5bbb5461f138cac631fe21b4ad956feabfba22

  It should hit 4.9, and we can backport it. I am also trying to work
  on improvements to the patch for the future. I am not sure of the
  status of aik's patch.

  Balbir Singh.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632045/+subscriptions