kernel-packages team mailing list archive
Message #76744
[Bug 1352995] Re: ERAT Multihit machine checks
All of these patches are in http://bugs.launchpad.net/bugs/1357014 and
were released with Ubuntu-3.16.0-9.14. I'll start working on
backporting them to 14.04.
** Changed in: linux (Ubuntu Utopic)
Status: Confirmed => In Progress
** Changed in: linux (Ubuntu Utopic)
Assignee: (unassigned) => Tim Gardner (timg-tpi)
** Changed in: linux (Ubuntu Utopic)
Status: In Progress => Fix Released
** Also affects: linux (Ubuntu Trusty)
Importance: Undecided
Status: New
** Changed in: linux (Ubuntu Trusty)
Status: New => In Progress
** Changed in: linux (Ubuntu Trusty)
Assignee: (unassigned) => Tim Gardner (timg-tpi)
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1352995
Title:
ERAT Multihit machine checks
Status in “linux” package in Ubuntu:
Fix Released
Status in “linux” source package in Trusty:
In Progress
Status in “linux” source package in Utopic:
Fix Released
Bug description:
-- Problem Description --
Our project involves porting a 3rd-party out-of-tree module to LE Ubuntu on Power.
We've been seeing occasional ERAT Multihit machine checks with kernels
ranging from the LE Ubuntu 14.04 3.13-based kernel through the very
latest 3.16-rc5 mainline kernels.
Our kernels are running directly on top of OPAL/Sapphire in PowerNV
mode, with no intervening KVM host.
FSP dumps captured at the time of the ERAT detection show that there
are duplicate mappings in force for the same real page, with the
duplicate mappings being for different sized pages. So, for example,
the same 4K real page will be referred to by a 4K mapping and an
overlapping 16M mapping.
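The condition described above (one real page covered by mappings of different sizes) can be sketched as a simple range-overlap check. This is only an illustration: the function name, the tuple layout, and the addresses are made up for the example and are not taken from the FSP dumps.

```python
from itertools import combinations

KB = 1024
MB = 1024 * KB


def find_mixed_size_overlaps(mappings):
    """Return pairs of mappings that cover a common real address
    but use different page sizes.

    Each mapping is a (real_base, page_size) tuple; real_base is
    assumed to be aligned to page_size.
    """
    conflicts = []
    for (base_a, size_a), (base_b, size_b) in combinations(mappings, 2):
        if size_a == size_b:
            continue  # same-size duplicates are a different problem
        # Two half-open ranges [base, base + size) overlap when each
        # one starts before the other one ends.
        if base_a < base_b + size_b and base_b < base_a + size_a:
            conflicts.append(((base_a, size_a), (base_b, size_b)))
    return conflicts


# Hypothetical data: a 16M mapping, a 4K mapping inside it, and an
# unrelated 4K mapping elsewhere.
mappings = [
    (0x7CC000000, 16 * MB),
    (0x7CC038000, 4 * KB),
    (0x7DFB5D000, 4 * KB),
]
conflicts = find_mixed_size_overlaps(mappings)
```

Here only the first two mappings are reported, since the 4K page lies inside the 16M range while the third mapping is disjoint from both.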
Aneesh has been working with us on this.
We are currently testing this patchset (in git format-patch --stdout
format). We are still hitting ERAT multihits with these changes. Most
of these changes have already been posted externally. Some of them
were updated after that.
Current status: when hitting a multihit ERAT, I don't find duplicate
hash pte entries. So it possibly indicates a missed flush or a race.
Dar value is 3fff7d0f0000 psize 0
slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
Dump the rest of 256 entries
Dar value is 3fff7d0f0000 psize 0
slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
Done..
Dump the rest of 256 entries
Done..
Found hugepage shift 0 for ea 3fff7d0f0000 with ptep 1f283d8000383
Severe Machine check interrupt [Recovered]
Initiator: CPU
Error type: ERAT [Multihit]
Effective address: 00003fff7d0f0000
That is what I am finding on the machine check. I am searching the
hash ptes with base page sizes 4K and 64K and printing the matching
hash table entries. b_size = 15 and a_size = -1 both indicate 4K.
-aneesh
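One way to double-check the "no duplicate hash pte entries" reading is to parse the dump lines and count distinct (slot, v, r) triples. The sketch below is an assumption about how one might post-process the log; the regular expression simply mirrors the line format shown above, and the helper name is hypothetical. The same slot is printed twice in the log, presumably once per base page size searched, so it collapses to a single entry.

```python
import re

# Matches lines such as:
#   slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
ENTRY_RE = re.compile(
    r"slot = (?P<slot>\d+) v = (?P<v>[0-9a-f]+) r = (?P<r>[0-9a-f]+) "
    r"with b_size = (?P<b_size>-?\d+) a_size = (?P<a_size>-?\d+)"
)


def distinct_entries(log_text):
    """Return the set of distinct (slot, v, r) triples found in the log."""
    seen = set()
    for match in ENTRY_RE.finditer(log_text):
        seen.add((
            int(match["slot"]),
            int(match["v"], 16),   # v and r are printed in hex without 0x
            int(match["r"], 16),
        ))
    return seen


log_text = """\
Dar value is 3fff7d0f0000 psize 0
slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
Dump the rest of 256 entries
Dar value is 3fff7d0f0000 psize 0
slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
Done..
"""
entries = distinct_entries(log_text)
```

More than one distinct triple for the same effective address would be the signature of a true duplicate; here there is exactly one.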
I guess we now have a race in the unmap path. I am auditing the
hpte_slot_array usage.
We do check for hpte_slot_array != NULL in invalidate. But if we hit
two pmdp_splitting flushes, one will skip the invalidate as per the
current code and will go ahead and mark hpte_slot_array NULL. I have a
patch in the repo which tries to work around that. But I am not sure
whether we really can have two pmdp_splitting flushes simultaneously,
because we call that under pmd_lock.
I still need to look at the details.
-aneesh
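The concern above can be pictured with a user-space toy model. This is not kernel code and not the patch from the repo: the class and method names are hypothetical, a threading.Lock stands in for pmd_lock, and a Python list stands in for hpte_slot_array. The point of the sketch is only the invariant: the check, the invalidate, and the reset to None all happen under one lock, so a second caller can skip the work without leaving stale entries behind.

```python
import threading


class HugePmdModel:
    """Toy stand-in for a huge PMD and its hash-PTE slot tracking."""

    def __init__(self, slots):
        self.lock = threading.Lock()    # plays the role of pmd_lock
        self.slot_array = list(slots)   # plays the role of hpte_slot_array
        self.hash_table = set(slots)    # hash PTEs currently installed
        self.flushes = 0                # number of entries invalidated

    def splitting_flush(self):
        """Invalidate every tracked slot once; return True if this call did it."""
        with self.lock:
            if self.slot_array is None:
                # Another caller already invalidated and cleared the array.
                return False
            for slot in self.slot_array:
                self.hash_table.discard(slot)
                self.flushes += 1
            # Clear the array only after the invalidate, still under the lock.
            self.slot_array = None
            return True


pmd = HugePmdModel(slots=range(8))
results = []
threads = [
    threading.Thread(target=lambda: results.append(pmd.splitting_flush()))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Exactly one of the two callers performs the invalidate, and the modeled hash table ends up empty either way. If the reset to None could happen without the preceding invalidate, entries would remain in hash_table with nothing left to track them.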
I added more debug prints, and this is what I found. Before a hugepage
flush I added debug prints to dump the hash table to see if we are
failing to clear any hash table entries. After every update we seem
to have cleanly updated the hash table. On the MCE, some of the
relevant parts of the logs are:
pmd_hugepage_update dumping entries for 0x3fff71000000 with clr = 0xffffffffffffffff set = 0x0
.....
.....
dump_hash_pte_group dumping entries for 0x3fff7191da8c with clr = 0x0 set = 0x0
func = dump_hash_pte_group, addr = 3fff7191da8c psize = 0 slot = 1174024 v = 4001a9245cff7181 r = 7dfb5d193 with b_size = 0 a_size = 0 count = 2333
func = dump_hash_pte_group, addr = 3fff71000000 psize = 0 slot = 1155808 v = 4001a9245cff7105 r = 7cc038196 with b_size = 0 a_size = 9 count = 0
func = dump_hash_pte_group, addr = 3fff710a2000 psize = 0 slot = 1157104 v = 4001a9245cff7105 r = 7cc038116 with b_size = 0 a_size = 9 count = 162
func = dump_hash_pte_group, addr = 3fff710e6000 psize = 0 slot = 1156560 v = 4001a9245cff7105 r = 7cc038196 with b_size = 0 a_size = 9 count = 230
func = dump_hash_pte_group, addr = 3fff71378000 psize = 0 slot = 1161504 v = 4001a9245cff7105 r = 7cc038116 with b_size = 0 a_size = 9 count = 888
So we end up clearing the huge pmd at 0x3fff71000000, and at that
point we didn't have anything in the hash table. That is the last
pmdp_splitting_flush or pmd_hugepage_update event on that address.
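As a reading aid for the dump_hash_pte_group lines above: the count field matches the 4K sub-page index of each address within the 16M huge page starting at 0x3fff71000000. That interpretation is inferred from the arithmetic only; the meaning of count is not stated in the thread, and the 4K sub-page size and 16M huge page size are assumptions. A short check:

```python
HUGE_PAGE_SIZE = 16 * 1024 * 1024  # 16M huge page (assumption)
BASE_PAGE_SHIFT = 12               # 4K sub-pages (assumption)


def subpage_index(addr, huge_base):
    """Index of the 4K sub-page containing addr within the huge page."""
    offset = addr - huge_base
    if not 0 <= offset < HUGE_PAGE_SIZE:
        raise ValueError("address is outside the huge page")
    return offset >> BASE_PAGE_SHIFT


huge_base = 0x3FFF71000000

# addr -> count, copied from the log lines above.
logged = {
    0x3FFF7191DA8C: 2333,
    0x3FFF71000000: 0,
    0x3FFF710A2000: 162,
    0x3FFF710E6000: 230,
    0x3FFF71378000: 888,
}
computed = {addr: subpage_index(addr, huge_base) for addr in logged}
```

All five computed indices agree with the logged count values, so every address in the excerpt falls inside the huge page whose pmd was cleared.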
Can we audit the driver code to understand its large/huge page usage
and whether it is making any x86 assumptions around the page table
accessors? For example, ppc64 rules around page table access are
stricter than on x86. We don't have flush_tlb_* functions, and we need
to make sure we hold the ptl while updating the page table and also
flush the hash pte while holding the lock.
Attaching the log also
-aneesh
Aneesh writes:
Can we audit the driver code to understand its large/huge page usage
and whether it is making any x86 assumptions around the page table
accessors? For example, ppc64 rules around page table access are
stricter than on x86. We don't have flush_tlb_* functions, and we need
to make sure we hold the ptl while updating the page table and also
flush the hash pte while holding the lock.
Yes, we can do that (all the driver code that's specific to Linux is in the kernel-interface subdirectory, so you can take a look as well). But I'm not quite sure what we'd be looking for.
The driver doesn't have any explicit awareness of huge pages; it
doesn't intend or expect to interact with them in any way. And I
wouldn't expect the driver to be updating the kernel's page tables
itself, but rather to use some set of (relatively safe) services to
do that.
So if you can tell us what we might want to look for in the driver
code, we'll be happy to do that.
I do notice a couple uses of __flush_tlb() and global_flush_tlb(), but
those are under x86 ifdefs and won't be compiled in for Power. The
intent of the code using those is to flush the caches when the driver
changes the cache attribute of memory regions between cached and
uncached.
The driver's Linux kernel interface code does contain references to
updating "pte", but those should all be the PTEs that are used by the
adapter, not the Linux kernel page table entries.
After some additional looking, I see that there are some code paths in the driver's kernel interface layer that at least refer to the kernel page table structures (see the references to pte_t, pmd_t, pgd_t, etc.) in
kernel_interface/nv-linux.h and nv.c.
But again, these are code paths that should only be compiled in for x86 (and in this case for kernel
versions < 2.6.1) as far as I can see.
Can you try the new patchset? I have been able to run recreat1.sh in a
loop more than 8 times now. I will leave it running for the rest of
the day and will check again tomorrow morning.
I still need to get clarification from the hardware guys on calling
tlbie in a loop for huge pages.
-aneesh