kernel-packages team mailing list archive

[Bug 1352995] [NEW] ERAT Multihit machine checks

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: Launchpad Bug Tracker <1352995@xxxxxxxxxxxxxxxxxx>
Date: Tue, 05 Aug 2014 16:39:04 -0000
Reply-to: Bug 1352995 <1352995@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
You have been subscribed to a public bug:

-- Problem Description --
Our project involves porting a 3rd-party out-of-tree module to LE Ubuntu on Power.

We've been seeing occasional ERAT Multihit machine checks with kernels
ranging from the LE Ubuntu 14.04 3.13-based kernel through the very
latest 3.16-rc5 mainline kernels.

Our kernels are running directly on top of OPAL/Sapphire in PowerNV
mode, with no intervening KVM host.

FSP dumps captured at the time of the ERAT detection show that there are
duplicate mappings in force for the same real page, with the duplicate
mappings being for different sized pages.  So, for example, the same 4K
real page will be referred to by a 4K mapping and an overlapping 16M
mapping.
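To make the reported condition concrete, here is a minimal userspace sketch of the check "do two mappings of different page sizes cover the same real memory?". The struct, the function names and the addresses used in the usage note are hypothetical illustrations; they are not taken from the FSP dump tooling or from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one translation (not a kernel structure). */
struct mapping {
    uint64_t real_base;  /* real address of the first byte mapped */
    uint64_t page_size;  /* in bytes, power of two (e.g. 4K or 16M) */
};

/* True if the two mappings share at least one byte of real memory. */
static bool mappings_overlap(const struct mapping *a, const struct mapping *b)
{
    /* Page sizes must be non-zero powers of two. */
    assert(a->page_size && (a->page_size & (a->page_size - 1)) == 0);
    assert(b->page_size && (b->page_size & (b->page_size - 1)) == 0);

    uint64_t a_end = a->real_base + a->page_size;  /* exclusive end */
    uint64_t b_end = b->real_base + b->page_size;  /* exclusive end */

    return a->real_base < b_end && b->real_base < a_end;
}

/* The case described above: overlapping mappings with different page sizes. */
static bool is_mixed_size_duplicate(const struct mapping *a,
                                    const struct mapping *b)
{
    return a->page_size != b->page_size && mappings_overlap(a, b);
}
```

For example, with made-up addresses, a 4K mapping at 0x7ca0f0000 and a 16M mapping at 0x7ca000000 would be flagged by is_mixed_size_duplicate(), while a 4K mapping at 0x7cb000000 (just past the end of the 16M range) would not.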

Aneesh has been working with us on this.

We are currently testing this patchset (in git format-patch --stdout
format). We are still seeing ERAT multihits with these changes. Most of
these changes have already been posted externally; some of them were
updated after that.

Current status: when hitting an ERAT multihit, I don't find duplicate
hash PTE entries. So it possibly indicates a missed flush or a race.

Dar value is 3fff7d0f0000 psize 0
slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
Dump the rest of 256 entries 
Dar value is 3fff7d0f0000 psize 0
slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
Done..
Dump the rest of 256 entries 
Done..
Found hugepage shift 0 for ea 3fff7d0f0000 with ptep 1f283d8000383
Severe Machine check interrupt [Recovered]
  Initiator: CPU
  Error type: ERAT [Multihit]
    Effective address: 00003fff7d0f0000

That is what I am finding on machine check. I am searching the hash PTEs
with base page sizes 4K and 64K and printing matching hash table entries.
b_size = 15 and a_size = -1 both indicate 4K.

-aneesh

I guess we now have a race in the unmap path. I am auditing the
hpte_slot_array usage.

We do check for hpte_slot_array != NULL in invalidate. But if we hit two
pmdp_splitting_flush calls, one will skip the invalidate as per the
current code and will go ahead and mark hpte_slot_array NULL. I have a
patch in the repo which tries to work around that. But I am not sure
whether we can really have two pmdp_splitting_flush calls
simultaneously, because we call that under pmd_lock.

Still need to look at the details.

-aneesh

I added more debug prints, and this is what I found. Before a hugepage
flush I added debug prints to dump the hash table to see if we are
failing to clear any hash table entries. After every update we seem to
have clearly updated the hash table. For one MCE, some of the relevant
parts of the logs are:

pmd_hugepage_update dumping entries for 0x3fff71000000 with clr = 0xffffffffffffffff set = 0x0
.....

.....

dump_hash_pte_group dumping entries for 0x3fff7191da8c with clr = 0x0 set = 0x0
func = dump_hash_pte_group, addr = 3fff7191da8c psize = 0 slot = 1174024 v = 4001a9245cff7181 r = 7dfb5d193 with b_size = 0 a_size = 0 count = 2333
func = dump_hash_pte_group, addr = 3fff71000000 psize = 0 slot = 1155808 v = 4001a9245cff7105 r = 7cc038196 with b_size = 0 a_size = 9 count = 0
func = dump_hash_pte_group, addr = 3fff710a2000 psize = 0 slot = 1157104 v = 4001a9245cff7105 r = 7cc038116 with b_size = 0 a_size = 9 count = 162
func = dump_hash_pte_group, addr = 3fff710e6000 psize = 0 slot = 1156560 v = 4001a9245cff7105 r = 7cc038196 with b_size = 0 a_size = 9 count = 230
func = dump_hash_pte_group, addr = 3fff71378000 psize = 0 slot = 1161504 v = 4001a9245cff7105 r = 7cc038116 with b_size = 0 a_size = 9 count = 888

So we end up clearing the huge pmd at 0x3fff71000000, and at that point
we didn't have anything in the hash table. That is the last
pmdp_splitting_flush or pmd_hugepage_update event on that address.

Can we audit the driver code to understand its large/huge page usage and
whether it is making any x86 assumptions around the page table
accessors? For example, ppc64 rules around page table access are
stricter than on x86. We don't have flush_tlb_* functions, and we need to
make sure we hold ptl while updating the page table and also flush the
hash PTE while holding the lock.
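As a rough illustration of that rule, here is a userspace model (not kernel code) in which the page table update and the hash PTE flush happen inside the same critical section. The pthread mutex stands in for ptl, and struct pte_model and the model_* functions are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one Linux PTE plus the hash PTE derived from it. */
struct pte_model {
    pthread_mutex_t ptl;    /* stands in for the page table lock */
    uint64_t pte;           /* 0 means "no mapping" */
    bool hash_entry_valid;  /* is there a hash PTE for this mapping? */
    int flushes;            /* number of hash PTE flushes performed */
};

static void model_init(struct pte_model *m, uint64_t pte)
{
    assert(m != NULL);
    pthread_mutex_init(&m->ptl, NULL);
    m->pte = pte;
    m->hash_entry_valid = (pte != 0);
    m->flushes = 0;
}

/*
 * Clear the PTE and flush the matching hash PTE without dropping the
 * lock in between, so no other path can observe a cleared PTE that
 * still has a live hash entry. Returns the old PTE value.
 */
static uint64_t model_clear_and_flush(struct pte_model *m)
{
    pthread_mutex_lock(&m->ptl);
    uint64_t old = m->pte;
    m->pte = 0;
    if (m->hash_entry_valid) {
        m->hash_entry_valid = false;
        m->flushes++;
    }
    pthread_mutex_unlock(&m->ptl);
    return old;
}

/* Invariant: a hash entry must never outlive its PTE. */
static bool model_consistent(struct pte_model *m)
{
    pthread_mutex_lock(&m->ptl);
    bool ok = (m->pte != 0) || !m->hash_entry_valid;
    pthread_mutex_unlock(&m->ptl);
    return ok;
}
```

The point of the sketch is only the ordering: if the flush were done after dropping the lock, another path could see pte == 0 while hash_entry_valid is still true, which is the kind of window a missed flush or race would need.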

Attaching the log also

-aneesh

Aneesh writes:

Can we audit the driver code to understand its large/huge page usage and
whether it is making any x86 assumptions around the page table
accessors? For example, ppc64 rules around page table access are
stricter than on x86. We don't have flush_tlb_* functions, and we need to
make sure we hold ptl while updating the page table and also flush the
hash PTE while holding the lock.


Yes, we can do that (all the driver code that's specific to Linux is in
the kernel-interface subdirectory, so you can take a look as well).  But
I'm not quite sure what we'd be looking for.

The driver doesn't have any explicit awareness of huge-pages; it doesn't
intend or expect to interact with them in any way.  And I wouldn't
expect the driver to be updating the kernel's page tables itself but
rather to use some set of (relatively safe) services to do that.

So if you can tell us what we might want to look for in the driver code,
we'll be happy to do that.

I do notice a couple uses of __flush_tlb() and global_flush_tlb(), but
those are under x86 ifdefs and won't be compiled in for Power.  The
intent of the code using those is to flush the caches when the driver
changes the cache attribute of memory regions between cached and
uncached.

The driver's Linux kernel interface code does contain references to
updating "pte", but those should all be the PTEs that are used by the
adapter, not the Linux kernel page table entries.

After some additional looking, I see that there are some code paths in
the driver's kernel interface layer that at least refer to the kernel
page table structures (see the references to pte_t, pmd_t, pgd_t, etc.)
in kernel_interface/nv-linux.h and nv.c.

But again, these are code paths that should only be compiled in for x86
(and in this case for kernel versions < 2.6.1) as far as I can see.

Can you try the new patchset? I was able to run recreat1.sh in a loop
more than 8 times now. I will leave it running for the rest of the day
and will check again tomorrow morning.

I still need to get clarification from the hardware guys on calling
tlbie in a loop for huge pages.

-aneesh

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: architecture-ppc64le bugnameltc-113570 severity-high targetmilestone-inin1410
-- 
ERAT Multihit machine checks
https://bugs.launchpad.net/bugs/1352995
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.