kernel-packages team mailing list archive

[Bug 1352995] Comment bridged from LTC Bugzilla

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: bugproxy <bugproxy@xxxxxxxxxx>
Date: Thu, 21 Aug 2014 19:49:29 -0000
Reply-to: Bug 1352995 <1352995@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
*** This bug is a duplicate of bug 1357014 ***
    https://bugs.launchpad.net/bugs/1357014

------- Comment From hartb@xxxxxxxxxx 2014-08-21 19:39 EDT-------
Thank you for confirming!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1352995

Title:
  ERAT Multihit machine checks

Status in “linux” package in Ubuntu:
  Fix Released
Status in “linux” source package in Trusty:
  In Progress
Status in “linux” source package in Utopic:
  Fix Released

Bug description:
  -- Problem Description --
  Our project involves porting a 3rd-party out-of-tree module to LE Ubuntu on Power.

  We've been seeing occasional ERAT Multihit machine checks with kernels
  ranging from the LE Ubuntu 14.04 3.13-based kernel through the very
  latest 3.16-rc5 mainline kernels.

  Our kernels are running directly on top of OPAL/Sapphire in PowerNV
  mode, with no intervening KVM host.

  FSP dumps captured at the time of the ERAT detection show that there
  are duplicate mappings in force for the same real page, with the
  duplicate mappings being for different sized pages.  So, for example,
  the same 4K real page will be referred to by a 4K mapping and an
  overlapping 16M mapping.
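
  As an illustration of the overlap described above, here is a small
  Python sketch. It is not taken from the kernel or the driver, and the
  helper names page_range and ranges_overlap are made up for this note.
  It takes the effective address from the machine check report in this
  description and assumes, purely for the arithmetic, that both a 4K and
  a 16M page cover it.

```python
# Sketch only: shows how a 4K page and a 16M page can cover the same address.
# The page shifts are the usual values for 4K (2**12) and 16M (2**24) pages.
PAGE_SHIFT_4K = 12
PAGE_SHIFT_16M = 24


def page_range(addr, shift):
    """Return (start, end) of the naturally aligned page containing addr."""
    size = 1 << shift
    base = addr & ~(size - 1)
    return base, base + size


def ranges_overlap(a, b):
    """True if the half-open ranges a and b share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


# Effective address from the machine check report in this bug description.
ea = 0x3FFF7D0F0000

small = page_range(ea, PAGE_SHIFT_4K)
huge = page_range(ea, PAGE_SHIFT_16M)

print("4K  page:", hex(small[0]), "-", hex(small[1]))
print("16M page:", hex(huge[0]), "-", hex(huge[1]))
print("overlap:", ranges_overlap(small, huge))
```

  The 4K range lies entirely inside the 16M range, so a translation
  cache holding entries for both page sizes would have two candidates
  for the same address.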

  Aneesh has been working with us on this.

  We are currently testing this patchset (in git format-patch --stdout
  format). We are still seeing ERAT multihits with these changes. Most
  of these changes have already been posted externally; some of them
  were updated after that.

  Current status: when hitting a multihit ERAT, I don't find duplicate
  hash PTE entries, which possibly indicates a missed flush or a race.

  Dar value is 3fff7d0f0000 psize 0
  slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
  Dump the rest of 256 entries 
  Dar value is 3fff7d0f0000 psize 0
  slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
  Done..
  Dump the rest of 256 entries 
  Done..
  Found hugepage shift 0 for ea 3fff7d0f0000 with ptep 1f283d8000383
  Severe Machine check interrupt [Recovered]
    Initiator: CPU
    Error type: ERAT [Multihit]
      Effective address: 00003fff7d0f0000

  That is what I am finding on machine check. I am searching the hash
  PTEs with base page sizes 4K and 64K and printing the matching hash
  table entries. b_size = 15 and a_size = -1 both indicate 4K.

  -aneesh

  I guess we now have a race in the unmap path. I am auditing the
  hpte_slot_array usage.

  We do check for hpte_slot_array != NULL in invalidate. But if we hit
  two pmdp_splitting flushes, one will skip the invalidate as per the
  current code and go ahead and mark hpte_slot_array NULL. I have a
  patch in the repo which tries to work around that. But I am not sure
  whether we really can have two pmdp_splitting flushes simultaneously,
  because we call that under pmd_lock.

  Still need to look at the details.

  -aneesh

  I added more debug prints, and this is what I found. Before a hugepage
  flush I added debug prints to dump the hash table to see if we are
  failing to clear any hash table entries. After every update we seem
  to have clearly updated the hash table. On one MCE, some of the
  relevant parts of the logs are:

  pmd_hugepage_update dumping entries for 0x3fff71000000 with clr = 0xffffffffffffffff set = 0x0
  .....

  .....

  dump_hash_pte_group dumping entries for 0x3fff7191da8c with clr = 0x0 set = 0x0
  func = dump_hash_pte_group, addr = 3fff7191da8c psize = 0 slot = 1174024 v = 4001a9245cff7181 r = 7dfb5d193 with b_size = 0 a_size = 0 count = 2333
  func = dump_hash_pte_group, addr = 3fff71000000 psize = 0 slot = 1155808 v = 4001a9245cff7105 r = 7cc038196 with b_size = 0 a_size = 9 count = 0
  func = dump_hash_pte_group, addr = 3fff710a2000 psize = 0 slot = 1157104 v = 4001a9245cff7105 r = 7cc038116 with b_size = 0 a_size = 9 count = 162
  func = dump_hash_pte_group, addr = 3fff710e6000 psize = 0 slot = 1156560 v = 4001a9245cff7105 r = 7cc038196 with b_size = 0 a_size = 9 count = 230
  func = dump_hash_pte_group, addr = 3fff71378000 psize = 0 slot = 1161504 v = 4001a9245cff7105 r = 7cc038116 with b_size = 0 a_size = 9 count = 888

  So we end up clearing the huge pmd at 0x3fff71000000, and at that
  point we didn't have anything in the hash table. That is the last
  pmdp_splitting_flush or pmd_hugepage_update event on that address.
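
  The dump above can be cross-checked with a little address arithmetic.
  The Python sketch below was written for this note; the helper names
  are hypothetical, and a 16M huge page with 4K subpages is assumed. It
  takes the addr and count values copied from the dump_hash_pte_group
  lines and checks that each address falls inside the 16M region at
  0x3fff71000000. For these five lines, count also equals the 4K subpage
  index within that region; that this is what count means is an
  inference from the numbers, not something the log states.

```python
# Sketch only: cross-check the addr/count pairs from the log above.
HUGE_SHIFT = 24  # assumed 16M huge page
BASE_SHIFT = 12  # assumed 4K base page
HUGE_PMD_ADDR = 0x3FFF71000000  # address in the pmd_hugepage_update line

# (addr, count) pairs copied from the dump_hash_pte_group lines.
ENTRIES = [
    (0x3FFF7191DA8C, 2333),
    (0x3FFF71000000, 0),
    (0x3FFF710A2000, 162),
    (0x3FFF710E6000, 230),
    (0x3FFF71378000, 888),
]


def huge_base(addr, huge_shift=HUGE_SHIFT):
    """Start of the naturally aligned huge page containing addr."""
    return addr & ~((1 << huge_shift) - 1)


def subpage_index(addr, huge_shift=HUGE_SHIFT, base_shift=BASE_SHIFT):
    """Index of the base-size subpage of addr within its huge page."""
    offset = addr & ((1 << huge_shift) - 1)
    return offset >> base_shift


for addr, count in ENTRIES:
    inside = huge_base(addr) == HUGE_PMD_ADDR
    print(hex(addr), "inside:", inside,
          "index:", subpage_index(addr), "count:", count)
```

  Every address lands in the same 16M region as the cleared huge pmd,
  and the computed index matches the logged count on each line.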

  Can we audit the driver code to understand its large/huge page usage
  and whether it is making any x86 assumptions around the page table
  accessors? For example, ppc64 rules around page table access are
  stricter than x86. We don't have flush_tlb_* functions, and we need to
  make sure we hold the ptl while updating the page table and also flush
  the hash pte while holding the lock.

  Attaching the log also

  -aneesh

  Aneesh writes:

  Can we audit the driver code to understand its large/huge page usage
  and whether it is making any x86 assumptions around the page table
  accessors? For example, ppc64 rules around page table access are
  stricter than x86. We don't have flush_tlb_* functions, and we need to
  make sure we hold the ptl while updating the page table and also flush
  the hash pte while holding the lock.

  
  Yes, we can do that (all the driver code that's specific to Linux is
  in the kernel-interface subdirectory, so you can take a look as well).
  But I'm not quite sure what we'd be looking for.

  The driver doesn't have any explicit awareness of huge-pages; it
  doesn't intend or expect to interact with them in any way.  And I
  wouldn't expect the driver to be updating the kernel's page tables
  itself but rather to use some set of (relatively safe) services to
  do that.

  So if you can tell us what we might want to look for in the driver
  code, we'll be happy to do that.

  I do notice a couple uses of __flush_tlb() and global_flush_tlb(), but
  those are under x86 ifdefs and won't be compiled in for Power.  The
  intent of the code using those is to flush the caches when the driver
  changes the cache attribute of memory regions between cached and
  uncached.

  The driver's linux kernel interface code does contain references to
  updating "pte", but those should all be the PTEs that are used by the
  adapter, not the linux kernel page table entries.

  After some additional looking, I see that there are some code paths in
  the driver's kernel interface layer that at least refer to the kernel
  page table structures (see the references to pte_t, pmd_t, pgd_t,
  etc.) in kernel_interface/nv-linux.h and nv.c.

  But again, these are code paths that should only be compiled in for
  x86 (and in this case for kernel versions < 2.6.1) as far as I can
  see.

  Can you try the new patchset? I was able to run recreat1.sh in a loop
  more than 8 times now. I will leave it running for the rest of the
  day and will check again tomorrow morning.

  I still need to get clarification from the hardware guys on calling
  tlbie in a loop for huge pages.

  -aneesh

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1352995/+subscriptions