Discussion:
[PATCH 03/22] HWPOISON: Add support for poison swap entries v2
Wu Fengguang
2009-06-15 02:45:23 UTC
Permalink
From: Andi Kleen <***@linux.intel.com>

Memory migration uses special swap entry types to trigger special actions on
page faults. Extend this mechanism to also support poisoned swap entries, to
trigger poison handling on page faults. This allows follow-on patches to
prevent processes from faulting in poisoned pages again.

v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
v3: Better overflow fix (Hidehiro Kawai)

Reviewed-by: Wu Fengguang <***@intel.com>
Reviewed-by: Hidehiro Kawai <***@hitachi.com>
Signed-off-by: Andi Kleen <***@linux.intel.com>

---
include/linux/swap.h | 34 ++++++++++++++++++++++++++++------
include/linux/swapops.h | 38 ++++++++++++++++++++++++++++++++++++++
mm/swapfile.c | 4 ++--
3 files changed, 68 insertions(+), 8 deletions(-)

--- sound-2.6.orig/include/linux/swap.h
+++ sound-2.6/include/linux/swap.h
@@ -34,16 +34,38 @@ static inline int current_is_kswapd(void
* the type/offset into the pte as 5/27 as well.
*/
#define MAX_SWAPFILES_SHIFT 5
-#ifndef CONFIG_MIGRATION
-#define MAX_SWAPFILES (1 << MAX_SWAPFILES_SHIFT)
+
+/*
+ * Use some of the swap file numbers for other purposes. This
+ * is a convenient way to hook into the VM to trigger special
+ * actions on faults.
+ */
+
+/*
+ * NUMA node memory migration support
+ */
+#ifdef CONFIG_MIGRATION
+#define SWP_MIGRATION_NUM 2
+#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM)
+#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
#else
-/* Use last two entries for page migration swap entries */
-#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)-2)
-#define SWP_MIGRATION_READ MAX_SWAPFILES
-#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + 1)
+#define SWP_MIGRATION_NUM 0
#endif

/*
+ * Handling of hardware poisoned pages with memory corruption.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define SWP_HWPOISON_NUM 1
+#define SWP_HWPOISON MAX_SWAPFILES
+#else
+#define SWP_HWPOISON_NUM 0
+#endif
+
+#define MAX_SWAPFILES \
+ ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+
+/*
* Magic header for a swap area. The first part of the union is
* what the swap magic looks like for the old (limited to 128MB)
* swap area format, the second part of the union adds - in the
--- sound-2.6.orig/include/linux/swapops.h
+++ sound-2.6/include/linux/swapops.h
@@ -131,3 +131,41 @@ static inline int is_write_migration_ent

#endif

+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for hardware poisoned pages
+ */
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+ return swp_entry(SWP_HWPOISON, page_to_pfn(page));
+}
+
+static inline int is_hwpoison_entry(swp_entry_t entry)
+{
+ return swp_type(entry) == SWP_HWPOISON;
+}
+#else
+
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ return swp_entry(0, 0);
+}
+
+static inline int is_hwpoison_entry(swp_entry_t swp)
+{
+ return 0;
+}
+#endif
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+static inline int non_swap_entry(swp_entry_t entry)
+{
+ return swp_type(entry) >= MAX_SWAPFILES;
+}
+#else
+static inline int non_swap_entry(swp_entry_t entry)
+{
+ return 0;
+}
+#endif
--- sound-2.6.orig/mm/swapfile.c
+++ sound-2.6/mm/swapfile.c
@@ -697,7 +697,7 @@ int free_swap_and_cache(swp_entry_t entr
struct swap_info_struct *p;
struct page *page = NULL;

- if (is_migration_entry(entry))
+ if (non_swap_entry(entry))
return 1;

p = swap_info_get(entry);
@@ -2083,7 +2083,7 @@ static int __swap_duplicate(swp_entry_t
int count;
bool has_cache;

- if (is_migration_entry(entry))
+ if (non_swap_entry(entry))
return -EINVAL;

type = swp_type(entry);
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: ***@kvack.org
Wu Fengguang
2009-06-15 02:45:34 UTC
Permalink
From: Andi Kleen <***@linux.intel.com>

Useful for some testing scenarios, although specific testing is often
done better through MADV_HWPOISON.

This can be done with the x86-level MCE injector too, but this interface
allows doing it independently of low-level x86 changes.

Open issues:
Should be disabled for cgroups.

Signed-off-by: Andi Kleen <***@linux.intel.com>

---
mm/Kconfig | 4 ++++
mm/Makefile | 1 +
mm/hwpoison-inject.c | 41 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 46 insertions(+)

--- /dev/null
+++ sound-2.6/mm/hwpoison-inject.c
@@ -0,0 +1,41 @@
+/* Inject a hwpoison memory failure on an arbitrary pfn */
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+static struct dentry *hwpoison_dir, *corrupt_pfn;
+
+static int hwpoison_inject(void *data, u64 val)
+{
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ printk(KERN_INFO "Injecting memory failure at pfn %Lx\n", val);
+ memory_failure(val, 18);
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
+
+static void pfn_inject_exit(void)
+{
+ if (hwpoison_dir)
+ debugfs_remove_recursive(hwpoison_dir);
+}
+
+static int pfn_inject_init(void)
+{
+ hwpoison_dir = debugfs_create_dir("hwpoison", NULL);
+ if (hwpoison_dir == NULL)
+ return -ENOMEM;
+ corrupt_pfn = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir,
+ NULL, &hwpoison_fops);
+ if (corrupt_pfn == NULL) {
+ pfn_inject_exit();
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+module_init(pfn_inject_init);
+module_exit(pfn_inject_exit);
--- sound-2.6.orig/mm/Kconfig
+++ sound-2.6/mm/Kconfig
@@ -242,6 +242,10 @@ config KSM
config MEMORY_FAILURE
bool

+config HWPOISON_INJECT
+ tristate "Poison pages injector"
+ depends on MEMORY_FAILURE && DEBUG_KERNEL
+
config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
depends on !MMU
--- sound-2.6.orig/mm/Makefile
+++ sound-2.6/mm/Makefile
@@ -43,5 +43,6 @@ endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
+obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
--
--
Wu Fengguang
2009-06-15 02:45:22 UTC
Permalink
Needed for a later patch that walks rmap entries on its own.

This used to be very frowned upon, but memory-failure.c does
some rather specialized rmap walking and rmap has been stable
for quite some time, so I think it's ok now to export it.

Signed-off-by: Andi Kleen <***@linux.intel.com>

---
include/linux/rmap.h | 6 ++++++
mm/rmap.c | 4 ++--
2 files changed, 8 insertions(+), 2 deletions(-)

--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -116,6 +116,12 @@ int try_to_munlock(struct page *);
int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
#endif

+/*
+ * Called by memory-failure.c to kill processes.
+ */
+struct anon_vma *page_lock_anon_vma(struct page *page);
+void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
#else /* !CONFIG_MMU */

#define anon_vma_init() do {} while (0)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -191,7 +191,7 @@ void __init anon_vma_init(void)
* Getting a lock on a stable anon_vma from a page off the LRU is
* tricky: page_lock_anon_vma rely on RCU to guard against the races.
*/
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
{
struct anon_vma *anon_vma;
unsigned long anon_mapping;
@@ -211,7 +211,7 @@ out:
return NULL;
}

-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
spin_unlock(&anon_vma->lock);
rcu_read_unlock();
--
--
Wu Fengguang
2009-06-15 02:45:27 UTC
Permalink
From: Wu Fengguang <***@intel.com>

So as to eliminate one #ifdef in the C source.

Proposed by Nick Piggin.

Acked-by: Nick Piggin <***@suse.de>
Signed-off-by: Wu Fengguang <***@intel.com>
---
arch/x86/mm/fault.c | 3 +--
include/linux/mm.h | 7 ++++++-
2 files changed, 7 insertions(+), 3 deletions(-)

--- sound-2.6.orig/arch/x86/mm/fault.c
+++ sound-2.6/arch/x86/mm/fault.c
@@ -820,14 +820,13 @@ do_sigbus(struct pt_regs *regs, unsigned
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;

-#ifdef CONFIG_MEMORY_FAILURE
if (fault & VM_FAULT_HWPOISON) {
printk(KERN_ERR
"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
tsk->comm, tsk->pid, address);
code = BUS_MCEERR_AR;
}
-#endif
+
force_sig_info_fault(SIGBUS, code, address, tsk);
}

--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -700,11 +700,16 @@ static inline int page_mapped(struct pag
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_MAJOR 0x0004
#define VM_FAULT_WRITE 0x0008 /* Special case for get_user_pages */
-#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned page */

#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */

+#ifdef CONFIG_MEMORY_FAILURE
+#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned page */
+#else
+#define VM_FAULT_HWPOISON 0
+#endif
+
#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)

/*
--
--
Wu Fengguang
2009-06-15 02:45:21 UTC
Permalink
From: Andi Kleen <***@linux.intel.com>

Hardware poisoned pages need special handling in the VM and shouldn't be
touched again. This requires a new page flag. Define it here.

The page flags wars seem to be over, so it shouldn't be a problem
to get a new one.

v2: Add TestSetHWPoison (suggested by Johannes Weiner)
v3: Define TestSetHWPoison on !CONFIG_MEMORY_FAILURE (Fengguang)

Acked-by: Christoph Lameter <***@linux.com>
Reviewed-by: Wu Fengguang <***@intel.com>
Signed-off-by: Andi Kleen <***@linux.intel.com>

---
include/linux/page-flags.h | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

--- sound-2.6.orig/include/linux/page-flags.h
+++ sound-2.6/include/linux/page-flags.h
@@ -51,6 +51,9 @@
* PG_buddy is set to indicate that the page is free and in the buddy system
* (see mm/page_alloc.c).
*
+ * PG_hwpoison indicates that a page got corrupted in hardware and contains
+ * data with incorrect ECC bits that triggered a machine check. Accessing is
+ * not safe since it may cause another machine check. Don't touch!
*/

/*
@@ -102,6 +105,9 @@ enum pageflags {
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+ PG_hwpoison, /* hardware poisoned page. Don't touch */
+#endif
__NR_PAGEFLAGS,

/* Filesystems */
@@ -182,6 +188,9 @@ static inline void ClearPage##uname(stru
#define __CLEARPAGEFLAG_NOOP(uname) \
static inline void __ClearPage##uname(struct page *page) { }

+#define TESTSETFLAG_FALSE(uname) \
+static inline int TestSetPage##uname(struct page *page) { return 0; }
+
#define TESTCLEARFLAG_FALSE(uname) \
static inline int TestClearPage##uname(struct page *page) { return 0; }

@@ -265,6 +274,16 @@ PAGEFLAG(Uncached, uncached)
PAGEFLAG_FALSE(Uncached)
#endif

+#ifdef CONFIG_MEMORY_FAILURE
+PAGEFLAG(HWPoison, hwpoison)
+TESTSETFLAG(HWPoison, hwpoison)
+#define __PG_HWPOISON (1UL << PG_hwpoison)
+#else
+PAGEFLAG_FALSE(HWPoison)
+TESTSETFLAG_FALSE(HWPoison)
+#define __PG_HWPOISON 0
+#endif
+
static inline int PageUptodate(struct page *page)
{
int ret = test_bit(PG_uptodate, &(page)->flags);
@@ -389,7 +408,7 @@ static inline void __ClearPageTail(struc
1 << PG_private | 1 << PG_private_2 | \
1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
- 1 << PG_unevictable | __PG_MLOCKED)
+ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)

/*
* Flags checked when a page is prepped for return by the page allocator.
--
--
Wu Fengguang
2009-06-15 02:45:40 UTC
Permalink
When a page is corrupted, users may care about
- does it hit some important area?
- can its data be recovered?
- can it be isolated to avoid a deadly future reference?
so that they can take proper actions like an emergency sync/shutdown or
scheduling a reboot at some convenient time.

Signed-off-by: Wu Fengguang <***@intel.com>
---
mm/memory-failure.c | 78 +++++++++++++++++++++++++++++++++++-------
1 file changed, 66 insertions(+), 12 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -312,11 +312,32 @@ static const char *hwpoison_outcome_name
[RECOVERED] = "Recovered",
};

+enum hwpoison_page_type {
+ PAGE_IS_KERNEL,
+ PAGE_IS_FS_METADATA,
+ PAGE_IS_FILE_DATA,
+ PAGE_IS_ANON_DATA,
+ PAGE_IS_SWAP_CACHE,
+ PAGE_IS_FREE,
+};
+
+static const char *hwpoison_page_type_name[] = {
+ [ PAGE_IS_KERNEL ] = "kernel",
+ [ PAGE_IS_FS_METADATA ] = "fs_metadata",
+ [ PAGE_IS_FILE_DATA ] = "file_data",
+ [ PAGE_IS_ANON_DATA ] = "anon_data",
+ [ PAGE_IS_SWAP_CACHE ] = "swap_cache",
+ [ PAGE_IS_FREE ] = "free",
+};
+
struct hwpoison_control {
unsigned long pfn;
struct page *p; /* corrupted page */
struct page *page; /* compound page head */
int outcome;
+ int page_type;
+ unsigned data_recoverable:1;
+ unsigned page_isolated:1;
};

/*
@@ -358,8 +379,14 @@ static int me_pagecache_clean(struct hwp
page_cache_release(p);

mapping = page_mapping(p);
- if (mapping == NULL)
+ if (mapping == NULL) {
+ hpc->page_isolated = 1;
return RECOVERED;
+ }
+
+ /* clean file backed page is recoverable */
+ if (!PageDirty(p) && !PageSwapBacked(p))
+ hpc->data_recoverable = 1;

/*
* Now truncate the page in the page cache. This is really
@@ -368,12 +395,14 @@ static int me_pagecache_clean(struct hwp
* has a reference, because it could be file system metadata
* and that's not safe to truncate.
*/
- if (!S_ISREG(mapping->host->i_mode) &&
- !invalidate_complete_page(mapping, p)) {
- printk(KERN_ERR
- "MCE %#lx: failed to invalidate metadata page\n",
- hpc->pfn);
- return FAILED;
+ if (!S_ISREG(mapping->host->i_mode)) {
+ hpc->page_type = PAGE_IS_FS_METADATA;
+ if (!invalidate_complete_page(mapping, p)) {
+ printk(KERN_ERR
+ "MCE %#lx: failed to invalidate metadata page\n",
+ hpc->pfn);
+ return FAILED;
+ }
}

truncate_inode_page(mapping, p);
@@ -382,6 +411,8 @@ static int me_pagecache_clean(struct hwp
hpc->pfn);
return FAILED;
}
+
+ hpc->page_isolated = 1;
return RECOVERED;
}

@@ -467,6 +498,7 @@ static int me_swapcache_dirty(struct hwp
if (!isolate_lru_page(p))
page_cache_release(p);

+ hpc->page_isolated = 1;
return DELAYED;
}

@@ -478,6 +510,8 @@ static int me_swapcache_clean(struct hwp
page_cache_release(p);

delete_from_swap_cache(p);
+ hpc->data_recoverable = 1;
+ hpc->page_isolated = 1;

return RECOVERED;
}
@@ -587,6 +621,10 @@ static void page_action(struct page_stat
"MCE %#lx: %s page still referenced by %d users\n",
hpc->pfn, ps->msg, page_count(hpc->page) - 1);

+ if (page_count(hpc->page) > 1 ||
+ page_mapcount(hpc->page) > 0)
+ hpc->page_isolated = 0;
+
/* Could do more checks here if page looks ok */
atomic_long_add(1, &mce_bad_pages);

@@ -735,6 +773,10 @@ void memory_failure(unsigned long pfn, i
hpc.p = p;
hpc.page = p = compound_head(p);

+ hpc.page_type = PAGE_IS_KERNEL;
+ hpc.data_recoverable = 0;
+ hpc.page_isolated = 0;
+
/*
* We need/can do nothing about count=0 pages.
* 1) it's a free page, and therefore in safe hand:
@@ -747,9 +789,12 @@ void memory_failure(unsigned long pfn, i
* that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
*/
if (!get_page_unless_zero(p)) {
- if (is_free_buddy_page(p))
+ if (is_free_buddy_page(p)) {
+ hpc.page_type = PAGE_IS_FREE;
+ hpc.data_recoverable = 1;
+ hpc.page_isolated = 1;
action_result(&hpc, "free buddy", DELAYED);
- else
+ } else
action_result(&hpc, "high order kernel", IGNORED);
return;
}
@@ -770,9 +815,18 @@ void memory_failure(unsigned long pfn, i
/*
* Torn down by someone else?
*/
- if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
- action_result(&hpc, "already truncated LRU", IGNORED);
- goto out;
+ if (PageLRU(p)) {
+ if (PageSwapCache(p))
+ hpc.page_type = PAGE_IS_SWAP_CACHE;
+ else if (PageAnon(p))
+ hpc.page_type = PAGE_IS_ANON_DATA;
+ else
+ hpc.page_type = PAGE_IS_FILE_DATA;
+ if (!PageSwapCache(p) && p->mapping == NULL) {
+ action_result(&hpc, "already truncated LRU", IGNORED);
+ hpc.page_type = PAGE_IS_FREE;
+ goto out;
+ }
}

for (ps = error_states;; ps++) {
--
--
Wu Fengguang
2009-06-15 02:45:26 UTC
Permalink
From: Andi Kleen <***@linux.intel.com>

Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is
very similar to VM_FAULT_OOM, the only difference is that a different
si_code is passed to user space and the new addr_lsb field is initialized.

v2: Make the printk more verbose/unique

Signed-off-by: Andi Kleen <***@linux.intel.com>

---
arch/x86/mm/fault.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)

--- sound-2.6.orig/arch/x86/mm/fault.c
+++ sound-2.6/arch/x86/mm/fault.c
@@ -167,6 +167,7 @@ force_sig_info_fault(int si_signo, int s
info.si_errno = 0;
info.si_code = si_code;
info.si_addr = (void __user *)address;
+ info.si_addr_lsb = si_code == BUS_MCEERR_AR ? PAGE_SHIFT : 0;

force_sig_info(si_signo, &info, tsk);
}
@@ -798,10 +799,12 @@ out_of_memory(struct pt_regs *regs, unsi
}

static void
-do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+ unsigned int fault)
{
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
+ int code = BUS_ADRERR;

up_read(&mm->mmap_sem);

@@ -817,7 +820,15 @@ do_sigbus(struct pt_regs *regs, unsigned
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;

- force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+#ifdef CONFIG_MEMORY_FAILURE
+ if (fault & VM_FAULT_HWPOISON) {
+ printk(KERN_ERR
+ "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
+ tsk->comm, tsk->pid, address);
+ code = BUS_MCEERR_AR;
+ }
+#endif
+ force_sig_info_fault(SIGBUS, code, address, tsk);
}

static noinline void
@@ -827,8 +838,8 @@ mm_fault_error(struct pt_regs *regs, uns
if (fault & VM_FAULT_OOM) {
out_of_memory(regs, error_code, address);
} else {
- if (fault & VM_FAULT_SIGBUS)
- do_sigbus(regs, error_code, address);
+ if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON))
+ do_sigbus(regs, error_code, address, fault);
else
BUG();
}

--
Wu Fengguang
2009-06-15 02:45:24 UTC
Permalink
Add new SIGBUS codes for reporting machine checks as signals. When
the hardware detects an uncorrected ECC error it can trigger these
signals.

This is needed for telling KVM's qemu about machine checks that happen to
guests, so that it can inject them, but it might also be useful for other programs.
I find it useful in my test programs.

This patch merely defines the new types.

- Define two new si_codes for SIGBUS. BUS_MCEERR_AO and BUS_MCEERR_AR
* BUS_MCEERR_AO is for "Action Optional" machine checks, which means that some
corruption has been detected in the background, but nothing has been consumed
so far. The program can ignore those if it wants (but most programs would
already get killed)
* BUS_MCEERR_AR is for "Action Required" machine checks. This happens
when corrupted data is consumed or the application ran into an area
which has been known to be corrupted earlier. These require immediate
action and cannot just be returned to. Most programs would kill themselves.
- They report the address of the corruption in the user address space
in si_addr.
- Define a new si_addr_lsb field that reports the extent of the corruption
to user space. That's currently always a (small) page. The user application
cannot tell where in this page the corruption happened.

AK: I plan to write a man page update before anyone asks.

Signed-off-by: Andi Kleen <***@linux.intel.com>

---
include/asm-generic/siginfo.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

--- sound-2.6.orig/include/asm-generic/siginfo.h
+++ sound-2.6/include/asm-generic/siginfo.h
@@ -82,6 +82,7 @@ typedef struct siginfo {
#ifdef __ARCH_SI_TRAPNO
int _trapno; /* TRAP # which caused the signal */
#endif
+ short _addr_lsb; /* LSB of the reported address */
} _sigfault;

/* SIGPOLL */
@@ -112,6 +113,7 @@ typedef struct siginfo {
#ifdef __ARCH_SI_TRAPNO
#define si_trapno _sifields._sigfault._trapno
#endif
+#define si_addr_lsb _sifields._sigfault._addr_lsb
#define si_band _sifields._sigpoll._band
#define si_fd _sifields._sigpoll._fd

@@ -192,7 +194,11 @@ typedef struct siginfo {
#define BUS_ADRALN (__SI_FAULT|1) /* invalid address alignment */
#define BUS_ADRERR (__SI_FAULT|2) /* non-existant physical address */
#define BUS_OBJERR (__SI_FAULT|3) /* object specific hardware error */
-#define NSIGBUS 3
+/* hardware memory error consumed on a machine check: action required */
+#define BUS_MCEERR_AR (__SI_FAULT|4)
+/* hardware memory error detected in process but not consumed: action optional*/
+#define BUS_MCEERR_AO (__SI_FAULT|5)
+#define NSIGBUS 5

/*
* SIGTRAP si_codes
--
--
Wu Fengguang
2009-06-15 02:45:36 UTC
Permalink
We'll be exporting them in places other than /proc/kpageflags.
For example, in hwpoison uevents for describing the poisoned page.

Signed-off-by: Wu Fengguang <***@intel.com>
---
fs/proc/page.c | 40 +--------------------------------
include/linux/page-flags.h | 42 +++++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+), 38 deletions(-)

--- sound-2.6.orig/fs/proc/page.c
+++ sound-2.6/fs/proc/page.c
@@ -72,48 +72,12 @@ static const struct file_operations proc

/* These macros are used to decouple internal flags from exported ones */

-#define KPF_LOCKED 0
-#define KPF_ERROR 1
-#define KPF_REFERENCED 2
-#define KPF_UPTODATE 3
-#define KPF_DIRTY 4
-#define KPF_LRU 5
-#define KPF_ACTIVE 6
-#define KPF_SLAB 7
-#define KPF_WRITEBACK 8
-#define KPF_RECLAIM 9
-#define KPF_BUDDY 10
-
-/* 11-20: new additions in 2.6.31 */
-#define KPF_MMAP 11
-#define KPF_ANON 12
-#define KPF_SWAPCACHE 13
-#define KPF_SWAPBACKED 14
-#define KPF_COMPOUND_HEAD 15
-#define KPF_COMPOUND_TAIL 16
-#define KPF_HUGE 17
-#define KPF_UNEVICTABLE 18
-#define KPF_HWPOISON 19
-#define KPF_NOPAGE 20
-
-/* kernel hacking assistances
- * WARNING: subject to change, never rely on them!
- */
-#define KPF_RESERVED 32
-#define KPF_MLOCKED 33
-#define KPF_MAPPEDTODISK 34
-#define KPF_PRIVATE 35
-#define KPF_PRIVATE_2 36
-#define KPF_OWNER_PRIVATE 37
-#define KPF_ARCH 38
-#define KPF_UNCACHED 39
-
static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
{
return ((kflags >> kbit) & 1) << ubit;
}

-static u64 get_uflags(struct page *page)
+u64 page_uflags(struct page *page)
{
u64 k;
u64 u;
@@ -214,7 +178,7 @@ static ssize_t kpageflags_read(struct fi
else
ppage = NULL;

- if (put_user(get_uflags(ppage), out)) {
+ if (put_user(page_uflags(ppage), out)) {
ret = -EFAULT;
break;
}
--- sound-2.6.orig/include/linux/page-flags.h
+++ sound-2.6/include/linux/page-flags.h
@@ -132,6 +132,46 @@ enum pageflags {
PG_slub_debug = PG_error,
};

+/*
+ * stable flag numbers exported to user space
+ */
+
+#define KPF_LOCKED 0
+#define KPF_ERROR 1
+#define KPF_REFERENCED 2
+#define KPF_UPTODATE 3
+#define KPF_DIRTY 4
+#define KPF_LRU 5
+#define KPF_ACTIVE 6
+#define KPF_SLAB 7
+#define KPF_WRITEBACK 8
+#define KPF_RECLAIM 9
+#define KPF_BUDDY 10
+
+/* 11-20: new additions in 2.6.31 */
+#define KPF_MMAP 11
+#define KPF_ANON 12
+#define KPF_SWAPCACHE 13
+#define KPF_SWAPBACKED 14
+#define KPF_COMPOUND_HEAD 15
+#define KPF_COMPOUND_TAIL 16
+#define KPF_HUGE 17
+#define KPF_UNEVICTABLE 18
+#define KPF_HWPOISON 19
+#define KPF_NOPAGE 20
+
+/* kernel hacking assistances
+ * WARNING: subject to change, never rely on them!
+ */
+#define KPF_RESERVED 32
+#define KPF_MLOCKED 33
+#define KPF_MAPPEDTODISK 34
+#define KPF_PRIVATE 35
+#define KPF_PRIVATE_2 36
+#define KPF_OWNER_PRIVATE 37
+#define KPF_ARCH 38
+#define KPF_UNCACHED 39
+
#ifndef __GENERATING_BOUNDS_H

/*
@@ -284,6 +324,8 @@ TESTSETFLAG_FALSE(HWPoison)
#define __PG_HWPOISON 0
#endif

+u64 page_uflags(struct page *page);
+
static inline int PageUptodate(struct page *page)
{
int ret = test_bit(PG_uptodate, &(page)->flags);
--
--
Wu Fengguang
2009-06-15 02:45:33 UTC
Permalink
From: Andi Kleen <***@linux.intel.com>

Impact: optional, useful for debugging

Add a new madvise subcommand to inject poison for some
pages in a process' address space. This is useful for
testing the poison page handling.

Open issues:

- This patch allows root to tie up arbitrary amounts of memory.
Should this be disabled inside containers?
- There's a small race window between getting the page and injecting.
The patch drops the ref count because otherwise memory_failure
complains about dangling references. In theory with a multi threaded
injector one could inject poison for a process foreign page this way.
Not a serious issue right now.

v2: Use write flag for get_user_pages to make sure to always get
a fresh page
v3: Don't request write mapping (Fengguang Wu)

Reviewed-by: Wu Fengguang <***@intel.com>
Signed-off-by: Andi Kleen <***@linux.intel.com>

---
include/asm-generic/mman-common.h | 1
mm/madvise.c | 36 ++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)

--- sound-2.6.orig/mm/madvise.c
+++ sound-2.6/mm/madvise.c
@@ -207,6 +207,38 @@ static long madvise_remove(struct vm_are
return error;
}

+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Error injection support for memory error handling.
+ */
+static int madvise_hwpoison(unsigned long start, unsigned long end)
+{
+ /*
+ * RED-PEN
+ * This allows tying up arbitrary amounts of memory.
+ * Might be a good idea to disable it inside containers even for root.
+ */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ for (; start < end; start += PAGE_SIZE) {
+ struct page *p;
+ int ret = get_user_pages(current, current->mm, start, 1,
+ 0, 0, &p, NULL);
+ if (ret != 1)
+ return ret;
+ put_page(p);
+ /*
+ * RED-PEN page can be reused in a short window, but otherwise
+ * we'll have to fight with the reference count.
+ */
+ printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
+ page_to_pfn(p), start);
+ memory_failure(page_to_pfn(p), 0);
+ }
+ return 0;
+}
+#endif
+
static long
madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
@@ -307,6 +339,10 @@ SYSCALL_DEFINE3(madvise, unsigned long,
int write;
size_t len;

+#ifdef CONFIG_MEMORY_FAILURE
+ if (behavior == MADV_HWPOISON)
+ return madvise_hwpoison(start, start+len_in);
+#endif
if (!madvise_behavior_valid(behavior))
return error;

--- sound-2.6.orig/include/asm-generic/mman-common.h
+++ sound-2.6/include/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
+#define MADV_HWPOISON 12 /* poison a page for testing */

/* compatibility flags */
#define MAP_FILE 0
--
--
Wu Fengguang
2009-06-15 02:45:35 UTC
Permalink
- check for page_mapped_in_vma() on anon pages
- test and use page->mapping instead of page_mapping()
- cleanup some comments

If there are no objections, this patch will be folded into the big high-level patch.

Signed-off-by: Wu Fengguang <***@intel.com>
---
include/linux/rmap.h | 1 +
mm/memory-failure.c | 20 +++++++++++---------
mm/rmap.c | 2 +-
3 files changed, 13 insertions(+), 10 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -122,8 +122,6 @@ struct to_kill {

/*
* Schedule a process for later kill.
- * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
- * TBD would GFP_NOIO be enough?
*/
static void add_to_kill(struct task_struct *tsk, struct page *p,
struct vm_area_struct *vma,
@@ -227,6 +225,9 @@ static void collect_procs_anon(struct pa
if (!tsk->mm)
continue;
list_for_each_entry (vma, &av->head, anon_vma_node) {
+ if (!page_mapped_in_vma(page, vma))
+ continue;
+
if (vma->vm_mm == tsk->mm)
add_to_kill(tsk, page, vma, to_kill, tkc);
}
@@ -245,7 +246,7 @@ static void collect_procs_file(struct pa
struct vm_area_struct *vma;
struct task_struct *tsk;
struct prio_tree_iter iter;
- struct address_space *mapping = page_mapping(page);
+ struct address_space *mapping = page->mapping;

/*
* A note on the locking order between the two locks.
@@ -275,16 +276,17 @@ static void collect_procs_file(struct pa

/*
* Collect the processes who have the corrupted page mapped to kill.
- * This is done in two steps for locking reasons.
- * First preallocate one tokill structure outside the spin locks,
- * so that we can kill at least one process reasonably reliable.
*/
static void collect_procs(struct page *page, struct list_head *tokill)
{
struct to_kill *tk;

- tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
- /* memory allocation failure is implicitly handled */
+ /*
+ * First preallocate one to_kill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliably.
+ */
+ tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);
+
if (PageAnon(page))
collect_procs_anon(page, tokill, &tk);
else
@@ -657,7 +659,7 @@ static void hwpoison_user_mappings(struc
* Error handling: We ignore errors here because
* there's nothing that can be done.
*/
- if (kill)
+ if (kill && p->mapping)
collect_procs(p, &tokill);

/*
--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -134,6 +134,7 @@ int page_wrprotect(struct page *page, in
*/
struct anon_vma *page_lock_anon_vma(struct page *page);
void page_unlock_anon_vma(struct anon_vma *anon_vma);
+int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);

#else /* !CONFIG_MMU */

--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -315,7 +315,7 @@ pte_t *page_check_address(struct page *p
* if the page is not mapped into the page tables of this VMA. Only
* valid for normal file or anonymous VMAs.
*/
-static int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
+int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
{
unsigned long address;
pte_t *pte;
--
--
Wu Fengguang
2009-06-15 02:45:41 UTC
Permalink
This allows user space to implement flexible policies.
For example, it may either do an emergency sync/shutdown
or schedule a reboot at some convenient time, depending
on the severity of the corruption.

Signed-off-by: Wu Fengguang <***@intel.com>
---
Documentation/vm/memory-failure | 68 ++++++++++++++++++
mm/memory-failure.c | 110 +++++++++++++++++++++++++++++-
2 files changed, 175 insertions(+), 3 deletions(-)

--- /dev/null
+++ sound-2.6/Documentation/vm/memory-failure
@@ -0,0 +1,68 @@
+Memory failure and hardware poison events
+
+Memory may have soft errors, and the more memory you have, the more errors.
+Normally hardware hides that from you by correcting it, but in some cases you
+can get multi-bit errors which lead to uncorrected errors that the hardware
+cannot hide.
+
+This does not necessarily mean that the hardware is broken; for example it can
+be caused by cosmic particles hitting an unlucky transistor. So it can really
+happen in normal operation.
+
+Some hardware (e.g. Nehalem-EX) supports background memory scrubbing in order
+to report memory corruption before it is consumed. The kernel will then try
+to isolate the corrupted memory page, restore data, and finally send a uevent
+to the user space.
+
+A memory poison uevent will be
+
+ # udevadm monitor --environment --kernel
+ KERNEL[1245030313.702625] change /kernel/mm/hwpoison/hwpoison (hwpoison)
+ UDEV_LOG=3
+ ACTION=change
+ DEVPATH=/kernel/mm/hwpoison/hwpoison
+ SUBSYSTEM=hwpoison
+ EVENT=poison
+ PHYS_ADDR=0x19e1c000
+ PAGE_FLAGS=0x80008083c
+ PAGE_COUNT=3
+ PAGE_MAPCOUNT=1
+ PAGE_DEV=8:2
+ PAGE_INODE=56169
+ PAGE_INDEX=9
+ PAGE_TYPE=file_data
+ PAGE_ISOLATED=1
+ DATA_RECOVERABLE=0
+ SEQNUM=2109
+
+where
+
+ PHYS_ADDR the physical page address
+ PAGE_FLAGS the kpageflags bits defined in Documentation/vm/pagemap.txt
+ PAGE_COUNT the original page reference count
+ PAGE_MAPCOUNT the original page map count
+
+ PAGE_TYPE where the error lands, can be one of
+ "kernel" - a kernel page that may contain some critical data structure
+ "fs_metadata" - a filesystem metadata page
+ "file_data" - a file data page
+ "anon_data" - a page belonging to some process(es)
+ "swap_cache" - it's in the swap cache; the kernel cannot tell if it was an
+ anon_data page or a tmpfs' file_data page
+ "free" - a free page; not used by anyone
+
+For "file_data" pages, the following three vars are available:
+
+ PAGE_DEV the file's MAJOR:MINOR device numbers in decimal
+ PAGE_INODE the file's inode number in decimal
+ PAGE_INDEX the file offset, in units of page size
+
+ PAGE_ISOLATED if 1, we are sure that the page won't be consumed in the future.
+ if 0, the error page is still referenced by someone, and may be
+ consumed at any time, which will be detected/stopped by hardware,
+ and trigger an instant machine reboot.
+
+ DATA_RECOVERABLE if 1, no data are lost. For example, it's a free page, or a
+ clean page whose data can be reloaded from disk. In these
+ cases, the user space will not see the error at all.
+
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -330,7 +330,11 @@ static const char *hwpoison_page_type_na
[ PAGE_IS_FREE ] = "free",
};

+static struct kset *hwpoison_kset;
+static struct kobject hwpoison_kobj;
+
struct hwpoison_control {
+ struct kobj_uevent_env *env;
unsigned long pfn;
struct page *p; /* corrupted page */
struct page *page; /* compound page head */
@@ -340,6 +344,51 @@ struct hwpoison_control {
unsigned page_isolated:1;
};

+static void hwpoison_uevent_page(struct hwpoison_control *hpc)
+{
+ struct page *p = hpc->page;
+
+ if (hpc->env == NULL)
+ return;
+
+ add_uevent_var(hpc->env, "EVENT=poison");
+ add_uevent_var(hpc->env, "PHYS_ADDR=%#lx", hpc->pfn << PAGE_SHIFT);
+ add_uevent_var(hpc->env, "PAGE_FLAGS=%#Lx", page_uflags(p));
+ add_uevent_var(hpc->env, "PAGE_COUNT=%d", page_count(p));
+ add_uevent_var(hpc->env, "PAGE_MAPCOUNT=%d", page_mapcount(p));
+}
+
+static void hwpoison_uevent_file(struct hwpoison_control *hpc)
+{
+ struct address_space *mapping = page_mapping(hpc->page);
+
+ if (hpc->env == NULL)
+ return;
+
+ if (!mapping || !mapping->host)
+ return;
+
+ add_uevent_var(hpc->env, "PAGE_DEV=%d:%d",
+ MAJOR(mapping->host->i_sb->s_dev),
+ MINOR(mapping->host->i_sb->s_dev));
+ add_uevent_var(hpc->env, "PAGE_INODE=%lu", mapping->host->i_ino);
+ add_uevent_var(hpc->env, "PAGE_INDEX=%lu", hpc->page->index);
+}
+
+static void hwpoison_uevent_send(struct hwpoison_control *hpc)
+{
+ if (hpc->env == NULL)
+ return;
+
+ add_uevent_var(hpc->env, "PAGE_TYPE=%s",
+ hwpoison_page_type_name[hpc->page_type]);
+ add_uevent_var(hpc->env, "PAGE_ISOLATED=%d",
+ hpc->page_isolated);
+ add_uevent_var(hpc->env, "DATA_RECOVERABLE=%d",
+ hpc->data_recoverable);
+ kobject_uevent_env(&hwpoison_kobj, KOBJ_CHANGE, hpc->env->envp);
+}
+
/*
* Error hit kernel page.
* Do nothing, try to be lucky and not touch this instead. For a few cases we
@@ -769,10 +818,19 @@ void memory_failure(unsigned long pfn, i
return;
}

+ hpc.env = kzalloc(sizeof(struct kobj_uevent_env), GFP_NOIO);
+ if (!hpc.env) {
+ printk(KERN_ERR
+ "MCE %#lx: cannot allocate memory for uevent\n",
+ pfn);
+ }
+
hpc.pfn = pfn;
hpc.p = p;
hpc.page = p = compound_head(p);

+ hwpoison_uevent_page(&hpc);
+
hpc.page_type = PAGE_IS_KERNEL;
hpc.data_recoverable = 0;
hpc.page_isolated = 0;
@@ -796,7 +854,7 @@ void memory_failure(unsigned long pfn, i
action_result(&hpc, "free buddy", DELAYED);
} else
action_result(&hpc, "high order kernel", IGNORED);
- return;
+ goto out;
}

/*
@@ -825,16 +883,62 @@ void memory_failure(unsigned long pfn, i
if (!PageSwapCache(p) && p->mapping == NULL) {
action_result(&hpc, "already truncated LRU", IGNORED);
hpc.page_type = PAGE_IS_FREE;
- goto out;
+ goto out_unlock;
}
}

+ hwpoison_uevent_file(&hpc);
+
for (ps = error_states;; ps++) {
if ((p->flags & ps->mask) == ps->res) {
page_action(ps, &hpc);
break;
}
}
-out:
+out_unlock:
unlock_page(p);
+out:
+ hwpoison_uevent_send(&hpc);
+}
+
+static void hwpoison_release(struct kobject *kobj)
+{
+}
+
+static struct kobj_type hwpoison_ktype = {
+ .release = hwpoison_release,
+};
+
+static int hwpoison_kobj_init(void)
+{
+ int err;
+
+ hwpoison_kset = kset_create_and_add("hwpoison", NULL, mm_kobj);
+ if (!hwpoison_kset)
+ return -ENOMEM;
+
+ hwpoison_kobj.kset = hwpoison_kset;
+
+ err = kobject_init_and_add(&hwpoison_kobj, &hwpoison_ktype, NULL,
+ "hwpoison");
+ if (err)
+ return -ENOMEM;
+
+ kobject_uevent(&hwpoison_kobj, KOBJ_ADD);
+
+ return 0;
}
+
+
+static int __init hwpoison_init(void)
+{
+ return hwpoison_kobj_init();
+}
+
+static void __exit hwpoison_exit(void)
+{
+ kset_unregister(hwpoison_kset);
+}
+
+module_init(hwpoison_init);
+module_exit(hwpoison_exit);
--
Andi Kleen
2009-06-15 06:29:34 UTC
Post by Wu Fengguang
This allows the user space to do some flexible policies.
For example, it may either do emergency sync/shutdown
or to schedule reboot at some convenient time, depending
on the severeness of the corruption.
I don't think it's a good idea to export that much detailed information.
That would become a stable ABI, but it might not be possible to keep
all these details stable; e.g. the map count or reference count are
internal implementation details that shouldn't be exposed.
And what is a user space application to do with the inode? Run
find -inum?

Also, we already report the event using the low level logging mechanism,
in a relatively stable form.

It's also unclear to me what an application would do with that much
detail.

I would suggest dropping this part and the earlier flags move.

Please, only bug fixes at this stage.

-Andi

Wu Fengguang
2009-06-15 09:56:39 UTC
Post by Andi Kleen
Post by Wu Fengguang
This allows the user space to do some flexible policies.
For example, it may either do emergency sync/shutdown
or to schedule reboot at some convenient time, depending
on the severeness of the corruption.
I don't think it's a good idea to export that much detailed information.
That would become a stable ABI, but might not be possible to keep
all these details stable. e.g. map count or reference count are
internal implementation details that shouldn't be exposed.
And what is an user space application to do with the inode? Run
find -inum?
I had planned to export the file path too, if available. But you are right,
we should really think twice before exporting these.
Post by Andi Kleen
Also, we already report the event using the low level logging mechanism,
in a relatively stable form.
It's also unclear to me what an application would do with that much
detail.
I would suggest dropping this part and the earlier flags move.
Please, only bug fixes at this stage.
OK, I'll get rid of the uevent patches numbered 16-21 in the next release.

Thanks,
Fengguang

Greg KH
2009-06-16 00:35:57 UTC
Post by Wu Fengguang
+static void hwpoison_release(struct kobject *kobj)
+{
+}
+
+static struct kobj_type hwpoison_ktype = {
+ .release = hwpoison_release,
+};
{sigh}

Why did you include an empty release function? Was it because the
kernel complained when you had no release function? So, why would you
think that the acceptable solution to that warning would be an empty
release function instead?

Hint, this is totally wrong, provide a release function that ACTUALLY
DOES SOMETHING!!! Read the kobject documentation for details as to what
you need to do here.

ugh, I'm so tired of repeating this year after year after year, I feel
like a broken record...

This is broken, please fix it.

greg k-h

Wu Fengguang
2009-06-15 02:45:30 UTC
From: Wu Fengguang <***@intel.com>

If memory corruption hits the free buddy pages, we can safely ignore them.
No one will access them until page allocation time, then prep_new_page()
will automatically check and isolate PG_hwpoison page for us (for 0-order
allocation).

This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.

Note that the common case -- allocating a single page -- doesn't
do any more work than before. Allocating order > 0 does a bit more work,
but that's relatively uncommon.

This simple implementation may drop some innocent neighbor pages; hopefully
that is not a big problem, because the event should be rare enough.

This patch adds some runtime costs to high order page users.

[AK: Improved description]

v2: Andi Kleen:
Port to -mm code
Move check into separate function.
Don't dump stack in bad_pages for hwpoisoned pages.

Signed-off-by: Wu Fengguang <***@intel.com>
Signed-off-by: Andi Kleen <***@linux.intel.com>

---
mm/page_alloc.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)

--- sound-2.6.orig/mm/page_alloc.c
+++ sound-2.6/mm/page_alloc.c
@@ -233,6 +233,12 @@ static void bad_page(struct page *page)
static unsigned long nr_shown;
static unsigned long nr_unshown;

+ /* Don't complain about poisoned pages */
+ if (PageHWPoison(page)) {
+ __ClearPageBuddy(page);
+ return;
+ }
+
/*
* Allow a burst of 60 reports, then keep quiet for that minute;
* or allow a steady drip of one report per second.
@@ -646,7 +652,7 @@ static inline void expand(struct zone *z
/*
* This page is about to be returned from the page allocator
*/
-static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+static inline int check_new_page(struct page *page)
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
@@ -655,6 +661,18 @@ static int prep_new_page(struct page *pa
bad_page(page);
return 1;
}
+ return 0;
+}
+
+static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+{
+ int i;
+
+ for (i = 0; i < (1 << order); i++) {
+ struct page *p = page + i;
+ if (unlikely(check_new_page(p)))
+ return 1;
+ }

set_page_private(page, 0);
set_page_refcounted(page);
--
KAMEZAWA Hiroyuki
2009-06-15 09:41:12 UTC
On Mon, 15 Jun 2009 10:45:30 +0800
Post by Wu Fengguang
If memory corruption hits the free buddy pages, we can safely ignore them.
No one will access them until page allocation time, then prep_new_page()
will automatically check and isolate PG_hwpoison page for us (for 0-order
allocation).
This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.
Note that the common case -- only allocating a single page, doesn't
do any more work than before. Allocating > order 0 does a bit more work,
but that's relatively uncommon.
This simple implementation may drop some innocent neighbor pages, hopefully
it is not a big problem because the event should be rare enough.
This patch adds some runtime costs to high order page users.
[AK: Improved description]
Port to -mm code
Move check into separate function.
Don't dump stack in bad_pages for hwpoisoned pages.
---
mm/page_alloc.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/page_alloc.c
+++ sound-2.6/mm/page_alloc.c
@@ -233,6 +233,12 @@ static void bad_page(struct page *page)
static unsigned long nr_shown;
static unsigned long nr_unshown;
+ /* Don't complain about poisoned pages */
+ if (PageHWPoison(page)) {
+ __ClearPageBuddy(page);
+ return;
+ }
Hmm ? why __ClearPageBuddy() is necessary ?

Thanks,
-Kame
Post by Wu Fengguang
+
/*
* Allow a burst of 60 reports, then keep quiet for that minute;
* or allow a steady drip of one report per second.
@@ -646,7 +652,7 @@ static inline void expand(struct zone *z
/*
* This page is about to be returned from the page allocator
*/
-static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+static inline int check_new_page(struct page *page)
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
@@ -655,6 +661,18 @@ static int prep_new_page(struct page *pa
bad_page(page);
return 1;
}
+ return 0;
+}
+
+static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+{
+ int i;
+
+ for (i = 0; i < (1 << order); i++) {
+ struct page *p = page + i;
+ if (unlikely(check_new_page(p)))
+ return 1;
+ }
set_page_private(page, 0);
set_page_refcounted(page);
--
Wu Fengguang
2009-06-15 10:16:20 UTC
Post by KAMEZAWA Hiroyuki
On Mon, 15 Jun 2009 10:45:30 +0800
Post by Wu Fengguang
If memory corruption hits the free buddy pages, we can safely ignore them.
No one will access them until page allocation time, then prep_new_page()
will automatically check and isolate PG_hwpoison page for us (for 0-order
allocation).
This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.
Note that the common case -- only allocating a single page, doesn't
do any more work than before. Allocating > order 0 does a bit more work,
but that's relatively uncommon.
This simple implementation may drop some innocent neighbor pages, hopefully
it is not a big problem because the event should be rare enough.
This patch adds some runtime costs to high order page users.
[AK: Improved description]
Port to -mm code
Move check into separate function.
Don't dump stack in bad_pages for hwpoisoned pages.
---
mm/page_alloc.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/page_alloc.c
+++ sound-2.6/mm/page_alloc.c
@@ -233,6 +233,12 @@ static void bad_page(struct page *page)
static unsigned long nr_shown;
static unsigned long nr_unshown;
+ /* Don't complain about poisoned pages */
+ if (PageHWPoison(page)) {
+ __ClearPageBuddy(page);
+ return;
+ }
Hmm ? why __ClearPageBuddy() is necessary ?
Because this page is considered to be "allocated" out of the buddy
system, even though we fail the allocation here.

The page is now owned by no one, especially not owned by the buddy
allocator.

Thanks,
Fengguang

KAMEZAWA Hiroyuki
2009-06-15 23:52:22 UTC
On Mon, 15 Jun 2009 18:16:20 +0800
Post by Wu Fengguang
Post by KAMEZAWA Hiroyuki
On Mon, 15 Jun 2009 10:45:30 +0800
Post by Wu Fengguang
If memory corruption hits the free buddy pages, we can safely ignore them.
No one will access them until page allocation time, then prep_new_page()
will automatically check and isolate PG_hwpoison page for us (for 0-order
allocation).
This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.
Note that the common case -- only allocating a single page, doesn't
do any more work than before. Allocating > order 0 does a bit more work,
but that's relatively uncommon.
This simple implementation may drop some innocent neighbor pages, hopefully
it is not a big problem because the event should be rare enough.
This patch adds some runtime costs to high order page users.
[AK: Improved description]
Port to -mm code
Move check into separate function.
Don't dump stack in bad_pages for hwpoisoned pages.
---
mm/page_alloc.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/page_alloc.c
+++ sound-2.6/mm/page_alloc.c
@@ -233,6 +233,12 @@ static void bad_page(struct page *page)
static unsigned long nr_shown;
static unsigned long nr_unshown;
+ /* Don't complain about poisoned pages */
+ if (PageHWPoison(page)) {
+ __ClearPageBuddy(page);
+ return;
+ }
Hmm ? why __ClearPageBuddy() is necessary ?
Because this page is considered to be "allocated" out of the buddy
system, even though we fail the allocation here.
The page is now owned by no one, especially not owned by the buddy
allocator.
I just wonder "why __ClearPageBuddy() is necessary."

When bad_page() is called, a page is removed from buddy allocator and no
PG_buddy flag at all....I'm sorry if you added bad_page() caller in buddy allocator.

Buddy Allocator I call here is just 2 functions.
- __free_one_page()
- expand()


Bye,
-Kame
Post by Wu Fengguang
Thanks,
Fengguang
Wu Fengguang
2009-06-16 00:34:40 UTC
Post by KAMEZAWA Hiroyuki
On Mon, 15 Jun 2009 18:16:20 +0800
Post by Wu Fengguang
Post by KAMEZAWA Hiroyuki
On Mon, 15 Jun 2009 10:45:30 +0800
Post by Wu Fengguang
If memory corruption hits the free buddy pages, we can safely ignore them.
No one will access them until page allocation time, then prep_new_page()
will automatically check and isolate PG_hwpoison page for us (for 0-order
allocation).
This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.
Note that the common case -- only allocating a single page, doesn't
do any more work than before. Allocating > order 0 does a bit more work,
but that's relatively uncommon.
This simple implementation may drop some innocent neighbor pages, hopefully
it is not a big problem because the event should be rare enough.
This patch adds some runtime costs to high order page users.
[AK: Improved description]
Port to -mm code
Move check into separate function.
Don't dump stack in bad_pages for hwpoisoned pages.
---
mm/page_alloc.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/page_alloc.c
+++ sound-2.6/mm/page_alloc.c
@@ -233,6 +233,12 @@ static void bad_page(struct page *page)
static unsigned long nr_shown;
static unsigned long nr_unshown;
+ /* Don't complain about poisoned pages */
+ if (PageHWPoison(page)) {
+ __ClearPageBuddy(page);
+ return;
+ }
Hmm ? why __ClearPageBuddy() is necessary ?
Because this page is considered to be "allocated" out of the buddy
system, even though we fail the allocation here.
The page is now owned by no one, especially not owned by the buddy
allocator.
I just wonder "why __ClearPageBuddy() is necessary."
When bad_page() is called, a page is removed from buddy allocator and no
PG_buddy flag at all....I'm sorry if you added bad_page() caller in buddy allocator.
You are right. But I didn't add bad_page() callers either :)
Post by KAMEZAWA Hiroyuki
Buddy Allocator I call here is just 2 functions.
- __free_one_page()
- expand()
Right. Then the original __ClearPageBuddy() call in bad_page() is
questionable; I guess this line was there just for the sake of safety
(i.e. in case the buddy allocator itself goes wrong):

sound-2.6/mm/page_alloc.c

@@ -269,7 +269,6 @@ static void bad_page(struct page *page)
dump_stack();
out:
/* Leave bad fields for debug, except PageBuddy could make trouble */
===> __ClearPageBuddy(page);
add_taint(TAINT_BAD_PAGE);
}


Thanks,
Fengguang

Hugh Dickins
2009-06-16 11:29:45 UTC
Post by Wu Fengguang
Right. Then the original __ClearPageBuddy() call in bad_page() is
questionable, I guess this line was there just for the sake of safety
sound-2.6/mm/page_alloc.c
@@ -269,7 +269,6 @@ static void bad_page(struct page *page)
dump_stack();
/* Leave bad fields for debug, except PageBuddy could make trouble */
===> __ClearPageBuddy(page);
add_taint(TAINT_BAD_PAGE);
}
I didn't put that in for the case of the buddy allocator going wrong
(not sure if there could be such a case - I don't mean that the buddy
allocator is provably perfect! but how would it get here if it were
wrong?). No, I put that in for the case when the flag bits in struct
page have themselves got corrupted somehow, and hence we arrive at
bad_page(): most of the bits are best left as they are, to provide
maximum debug info; but leaving PageBuddy set there might conceivably
allow this corrupted struct page to get paired up with its buddy later,
and so freed for reuse, when we're trying to make sure it's never reused.

Hugh

Wu Fengguang
2009-06-16 11:40:24 UTC
Post by Hugh Dickins
Post by Wu Fengguang
Right. Then the original __ClearPageBuddy() call in bad_page() is
questionable, I guess this line was there just for the sake of safety
sound-2.6/mm/page_alloc.c
@@ -269,7 +269,6 @@ static void bad_page(struct page *page)
dump_stack();
/* Leave bad fields for debug, except PageBuddy could make trouble */
===> __ClearPageBuddy(page);
add_taint(TAINT_BAD_PAGE);
}
I didn't put that in for the case of the buddy allocator going wrong
(not sure if there could be such a case - I don't mean that the buddy
allocator is provably perfect! but how would it get here if it were
wrong?). No, I put that in for the case when the flag bits in struct
page have themselves got corrupted somehow, and hence we arrive at
bad_page(): most of the bits are best left as they are, to provide
maximum debug info; but leaving PageBuddy set there might conceivably
allow this corrupted struct page to get paired up with its buddy later,
and so freed for reuse, when we're trying to make sure it's never reused.
Hugh, thank you for the detailed explanations! You are always informative :)

Thanks,
Fengguang

Wu Fengguang
2009-06-15 02:45:25 UTC
From: Andi Kleen <***@linux.intel.com>

- Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now
architectures have to explicitely enable poison page support, so
this is forward compatible to all architectures. They only need
to add it when they enable poison page support.
- Add poison page handling in swap in fault code

v2: Add missing delayacct_clear_flag (Hidehiro Kawai)
v3: Really use delayacct_clear_flag (Hidehiro Kawai)

Reviewed-by: Wu Fengguang <***@intel.com>
Reviewed-by: Hidehiro Kawai <***@hitachi.com>
Signed-off-by: Andi Kleen <***@linux.intel.com>

---
include/linux/mm.h | 3 ++-
mm/memory.c | 18 +++++++++++++++---
2 files changed, 17 insertions(+), 4 deletions(-)

--- sound-2.6.orig/mm/memory.c
+++ sound-2.6/mm/memory.c
@@ -1315,7 +1315,8 @@ int __get_user_pages(struct task_struct
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
return i ? i : -ENOMEM;
- else if (ret & VM_FAULT_SIGBUS)
+ if (ret &
+ (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
return i ? i : -EFAULT;
BUG();
}
@@ -2595,8 +2596,15 @@ static int do_swap_page(struct mm_struct
goto out;

entry = pte_to_swp_entry(orig_pte);
- if (is_migration_entry(entry)) {
- migration_entry_wait(mm, pmd, address);
+ if (unlikely(non_swap_entry(entry))) {
+ if (is_migration_entry(entry)) {
+ migration_entry_wait(mm, pmd, address);
+ } else if (is_hwpoison_entry(entry)) {
+ ret = VM_FAULT_HWPOISON;
+ } else {
+ print_bad_pte(vma, address, pte, NULL);
+ ret = VM_FAULT_OOM;
+ }
goto out;
}
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
@@ -2620,6 +2628,10 @@ static int do_swap_page(struct mm_struct
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
+ } else if (PageHWPoison(page)) {
+ ret = VM_FAULT_HWPOISON;
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+ goto out;
}

lock_page(page);
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -700,11 +700,12 @@ static inline int page_mapped(struct pag
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_MAJOR 0x0004
#define VM_FAULT_WRITE 0x0008 /* Special case for get_user_pages */
+#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned page */

#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */

-#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS)
+#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)

/*
* Can be called by the pagefault handler when it gets a VM_FAULT_OOM.
--
Wu Fengguang
2009-06-15 02:45:28 UTC
From: Andi Kleen <***@linux.intel.com>

try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very
straightforward, because each of these flags changes multiple behaviours (e.g.
migration not only sets up migration ptes, it also turns off aging, etc.)
Also, the different flags interact in magic ways.

A later patch in this series adds another mode to try_to_unmap, so
this quickly becomes unmanageable.

Replace the different flags with an action code (migration, munlock, munmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straightforward and allows easier extension
to new behaviours. Change all the callers to declare what they want to
do.

This patch is supposed to be a nop in behaviour. If anyone can prove
it is not, that would be a bug.

Cc: ***@hp.com
Cc: ***@suse.de
Reviewed-by: Wu Fengguang <***@intel.com>
Signed-off-by: Andi Kleen <***@linux.intel.com>

---
include/linux/rmap.h | 14 +++++++++++++-
mm/migrate.c | 2 +-
mm/rmap.c | 40 ++++++++++++++++++++++------------------
mm/vmscan.c | 2 +-
4 files changed, 37 insertions(+), 21 deletions(-)

--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -85,7 +85,19 @@ static inline void page_dup_rmap(struct
*/
int page_referenced(struct page *, int is_locked,
struct mem_cgroup *cnt, unsigned long *vm_flags);
-int try_to_unmap(struct page *, int ignore_refs);
+
+enum ttu_flags {
+ TTU_UNMAP = 0, /* unmap mode */
+ TTU_MIGRATION = 1, /* migration mode */
+ TTU_MUNLOCK = 2, /* munlock mode */
+ TTU_ACTION_MASK = 0xff,
+
+ TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
+ TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
+};
+#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
+
+int try_to_unmap(struct page *, enum ttu_flags flags);

/*
* Called from mm/filemap_xip.c to unmap empty zero page
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -912,7 +912,7 @@ void page_remove_rmap(struct page *page)
* repeatedly from either try_to_unmap_anon or try_to_unmap_file.
*/
static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
- int migration)
+ enum ttu_flags flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -934,11 +934,13 @@ static int try_to_unmap_one(struct page
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
*/
- if (!migration) {
+ if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
ret = SWAP_MLOCK;
goto out_unmap;
}
+ }
+ if (!(flags & TTU_IGNORE_ACCESS)) {
if (ptep_clear_flush_young_notify(vma, address, pte)) {
ret = SWAP_FAIL;
goto out_unmap;
@@ -978,12 +980,12 @@ static int try_to_unmap_one(struct page
* pte. do_swap_page() will wait until the migration
* pte is removed and then restart fault handling.
*/
- BUG_ON(!migration);
+ BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
entry = make_migration_entry(page, pte_write(pteval));
}
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
- } else if (PAGE_MIGRATION && migration) {
+ } else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
/* Establish migration entry for a file page */
swp_entry_t entry;
entry = make_migration_entry(page, pte_write(pteval));
@@ -1152,12 +1154,13 @@ static int try_to_mlock_page(struct page
* vm_flags for that VMA. That should be OK, because that vma shouldn't be
* 'LOCKED.
*/
-static int try_to_unmap_anon(struct page *page, int unlock, int migration)
+static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
unsigned int mlocked = 0;
int ret = SWAP_AGAIN;
+ int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;

if (MLOCK_PAGES && unlikely(unlock))
ret = SWAP_SUCCESS; /* default for try_to_munlock() */
@@ -1173,7 +1176,7 @@ static int try_to_unmap_anon(struct page
continue; /* must visit all unlocked vmas */
ret = SWAP_MLOCK; /* saw at least one mlocked vma */
} else {
- ret = try_to_unmap_one(page, vma, migration);
+ ret = try_to_unmap_one(page, vma, flags);
if (ret == SWAP_FAIL || !page_mapped(page))
break;
}
@@ -1197,8 +1200,7 @@ static int try_to_unmap_anon(struct page
/**
* try_to_unmap_file - unmap/unlock file page using the object-based rmap method
* @page: the page to unmap/unlock
- * @unlock: request for unlock rather than unmap [unlikely]
- * @migration: unmapping for migration - ignored if @unlock
+ * @flags: action and flags
*
* Find all the mappings of a page using the mapping pointer and the vma chains
* contained in the address_space struct it points to.
@@ -1210,7 +1212,7 @@ static int try_to_unmap_anon(struct page
* vm_flags for that VMA. That should be OK, because that vma shouldn't be
* 'LOCKED.
*/
-static int try_to_unmap_file(struct page *page, int unlock, int migration)
+static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
{
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1222,6 +1224,7 @@ static int try_to_unmap_file(struct page
unsigned long max_nl_size = 0;
unsigned int mapcount;
unsigned int mlocked = 0;
+ int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;

if (MLOCK_PAGES && unlikely(unlock))
ret = SWAP_SUCCESS; /* default for try_to_munlock() */
@@ -1234,7 +1237,7 @@ static int try_to_unmap_file(struct page
continue; /* must visit all vmas */
ret = SWAP_MLOCK;
} else {
- ret = try_to_unmap_one(page, vma, migration);
+ ret = try_to_unmap_one(page, vma, flags);
if (ret == SWAP_FAIL || !page_mapped(page))
goto out;
}
@@ -1259,7 +1262,8 @@ static int try_to_unmap_file(struct page
ret = SWAP_MLOCK; /* leave mlocked == 0 */
goto out; /* no need to look further */
}
- if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
+ if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
+ (vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
if (cursor > max_nl_cursor)
@@ -1293,7 +1297,7 @@ static int try_to_unmap_file(struct page
do {
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if (!MLOCK_PAGES && !migration &&
+ if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
(vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
@@ -1333,7 +1337,7 @@ out:
/**
* try_to_unmap - try to remove all page table mappings to a page
* @page: the page to get unmapped
- * @migration: migration flag
+ * @flags: action and flags
*
* Tries to remove all the page table entries which are mapping this
* page, used in the pageout path. Caller must hold the page lock.
@@ -1344,16 +1348,16 @@ out:
* SWAP_FAIL - the page is unswappable
* SWAP_MLOCK - page is mlocked.
*/
-int try_to_unmap(struct page *page, int migration)
+int try_to_unmap(struct page *page, enum ttu_flags flags)
{
int ret;

BUG_ON(!PageLocked(page));

if (PageAnon(page))
- ret = try_to_unmap_anon(page, 0, migration);
+ ret = try_to_unmap_anon(page, flags);
else
- ret = try_to_unmap_file(page, 0, migration);
+ ret = try_to_unmap_file(page, flags);
if (ret != SWAP_MLOCK && !page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
@@ -1378,8 +1382,8 @@ int try_to_munlock(struct page *page)
VM_BUG_ON(!PageLocked(page) || PageLRU(page));

if (PageAnon(page))
- return try_to_unmap_anon(page, 1, 0);
+ return try_to_unmap_anon(page, TTU_MUNLOCK);
else
- return try_to_unmap_file(page, 1, 0);
+ return try_to_unmap_file(page, TTU_MUNLOCK);
}

--- sound-2.6.orig/mm/vmscan.c
+++ sound-2.6/mm/vmscan.c
@@ -661,7 +661,7 @@ static unsigned long shrink_page_list(st
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page, 0)) {
+ switch (try_to_unmap(page, TTU_UNMAP)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
--- sound-2.6.orig/mm/migrate.c
+++ sound-2.6/mm/migrate.c
@@ -669,7 +669,7 @@ static int unmap_and_move(new_page_t get
}

/* Establish migration ptes or remove ptes */
- try_to_unmap(page, 1);
+ try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);

if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
Wu Fengguang
2009-06-15 02:45:32 UTC
From: Andi Kleen <***@linux.intel.com>

Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.

This is done at the VM level by marking a page hwpoisoned
and taking the appropriate action based on the type of page
it is.

The code that does this is portable and lives in mm/memory-failure.c.

To quote the overview comment:

* High level machine check handler. Handles pages reported by the
* hardware as being corrupted usually due to a 2bit ECC memory or cache
* failure.
*
* This focuses on pages detected as corrupted in the background.
* When the current CPU tries to consume corruption the currently
* running process can just be killed directly instead. This implies
* that if the error cannot be handled for some reason it's safe to
* just ignore it because no corruption has been consumed yet. Instead
* when that happens another machine check will happen.
*
* Handles page cache pages in various states. The tricky part
* here is that we can access any page asynchronous to other VM
* users, because memory failures could happen anytime and anywhere,
* possibly violating some of their assumptions. This is why this code
* has to be extremely careful. Generally it tries to use normal locking
* rules, as in get the standard locks, even if that means the
* error handling takes potentially a long time.
*
* Some of the operations here are somewhat inefficient and have non
* linear algorithmic complexity, because the data structures have not
* been optimized for this case. This is in particular the case
* for the mapping from a vma to a process. Since this case is expected
* to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl, vm.memory_failure_early_kill.
The default is early kill.

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together, and rmap knowledge has been seeping out anyway.

v2: Fix anon vma unlock crash (noticed by Johannes Weiner <***@cmpxchg.org>)
Handle pages on free list correctly (also noticed by Johannes)
Fix inverted try_to_release_page check (found by Chris Mason)
Add documentation for the new sysctl.
Various other cleanups/comment fixes.
v3: Use blockable signal for AO SIGBUS for better qemu handling.
Numerous fixes from Fengguang Wu:
New code layout for the table (redone by AK)
Move the hwpoison bit setting before the lock (Fengguang Wu)
Some code cleanups (Fengguang Wu, AK)
Add missing lru_drain (Fengguang Wu)
Do more checks for valid mappings (inspired by patch from Fengguang)
Handle free pages and fixes for clean pages (Fengguang)
Removed swap cache handling for now, needs more work
Better mapping checks to avoid races (Fengguang)
Fix swapcache (Fengguang)
Handle private2 pages too (Fengguang)
v4: Various fixes based on review comments from Nick Piggin
Document locking order.
Improved comments.
Slightly improved description
Remove bogus hunk.
Wait properly for writeback pages (Nick Piggin)
v5: Improve various comments
Handle page_address_in_vma() failure better by SIGKILL and also
make message debugging only
Clean up printks
Remove redundant PageWriteback check (Nick Piggin)
Add missing clear_page_mlock
Reformat state table to be <80 columns again
Use truncate helper instead of manual truncate in me_pagecache_*
Check for metadata buffer pages and reject them.
A few cleanups.
v6:
Fix a printk broken in the last round of cleanups.
More minor cleanups and fixes based on comments from Fengguang Wu.
Rename /proc/meminfo header to "HardwareCorrupted"
Add a printk for the failed mapping case (Fengguang Wu)
Better clean page check (Fengguang Wu)
v7:
Fix kernel oops by action_result(invalid pfn) (Fengguang Wu)
make tasklist_lock/anon_vma locking order straight (Nick Piggin)
use the safer invalidate page for possible metadata pages (Nick Piggin)

Cc: ***@tiscali.co.uk
Cc: ***@suse.de
Cc: ***@oracle.com
Acked-by: Rik van Riel <***@redhat.com>
Reviewed-by: Hidehiro Kawai <***@hitachi.com>
Signed-off-by: Wu Fengguang <***@intel.com>
Signed-off-by: Andi Kleen <***@linux.intel.com>

---
Documentation/sysctl/vm.txt | 28 +
fs/proc/meminfo.c | 9
include/linux/mm.h | 6
kernel/sysctl.c | 14
mm/Kconfig | 3
mm/Makefile | 1
mm/filemap.c | 4
mm/memory-failure.c | 771 ++++++++++++++++++++++++++++++++++
mm/rmap.c | 4
mm/truncate.c | 3
10 files changed, 839 insertions(+), 4 deletions(-)

--- sound-2.6.orig/mm/Makefile
+++ sound-2.6/mm/Makefile
@@ -42,5 +42,6 @@ obj-$(CONFIG_SMP) += allocpercpu.o
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
--- /dev/null
+++ sound-2.6/mm/memory-failure.c
@@ -0,0 +1,771 @@
+/*
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Authors: Andi Kleen, Fengguang Wu
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focuses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states. The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non-linear complexity with the number
+ * of mappings. In short it can be quite slow. But since memory corruptions
+ * are rare we hope to get away with this.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
+ * - pass bad pages to kdump next kernel
+ */
+#define DEBUG 1
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+int sysctl_memory_failure_early_kill __read_mostly = 1;
+
+atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
+
+/*
+ * Send all the processes who have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+ unsigned long pfn)
+{
+ struct siginfo si;
+ int ret;
+
+ printk(KERN_ERR
+ "MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
+ pfn, t->comm, t->pid);
+ si.si_signo = SIGBUS;
+ si.si_errno = 0;
+ si.si_code = BUS_MCEERR_AO;
+ si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+ si.si_trapno = trapno;
+#endif
+ si.si_addr_lsb = PAGE_SHIFT;
+ /*
+ * Don't use force here, it's convenient if the signal
+ * can be temporarily blocked.
+ * This could cause a loop when the user sets SIGBUS
+ * to SIG_IGN, but hopefully no one will do that?
+ */
+ ret = send_sig_info(SIGBUS, &si, t); /* synchronous? */
+ if (ret < 0)
+ printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+ t->comm, t->pid, ret);
+ return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also some races are possible while we get from the
+ * error detection to actually handling it.
+ */
+
+struct to_kill {
+ struct list_head nd;
+ struct task_struct *tsk;
+ unsigned long addr;
+ unsigned addr_valid:1;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do. We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ * TBD would GFP_NOIO be enough?
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+ struct vm_area_struct *vma,
+ struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ struct to_kill *tk;
+
+ if (*tkc) {
+ tk = *tkc;
+ *tkc = NULL;
+ } else {
+ tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+ if (!tk) {
+ printk(KERN_ERR
+ "MCE: Out of memory while machine check handling\n");
+ return;
+ }
+ }
+ tk->addr = page_address_in_vma(p, vma);
+ tk->addr_valid = 1;
+
+ /*
+ * In theory we don't have to kill when the page was
+ * munmapped. But it could also be a mremap. Since that's
+ * likely very rare, kill anyway just out of paranoia, but use
+ * a SIGKILL because the error is not contained anymore.
+ */
+ if (tk->addr == -EFAULT) {
+ pr_debug("MCE: Unable to find user space address %lx in %s\n",
+ page_to_pfn(p), tsk->comm);
+ tk->addr_valid = 0;
+ }
+ get_task_struct(tsk);
+ tk->tsk = tsk;
+ list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ *
+ * Only do anything when DOIT is set, otherwise just free the list
+ * (this is used for clean pages which do not need killing)
+ * Also when FAIL is set do a force kill because something went
+ * wrong earlier.
+ */
+static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+ int fail, unsigned long pfn)
+{
+ struct to_kill *tk, *next;
+
+ list_for_each_entry_safe (tk, next, to_kill, nd) {
+ if (doit) {
+ /*
+ * In case something went wrong with munmapping,
+ * make sure the process doesn't catch the
+ * signal and then access the memory. Just kill it.
+ */
+ if (fail || tk->addr_valid == 0) {
+ printk(KERN_ERR
+"MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
+ pfn, tk->tsk->comm, tk->tsk->pid);
+ force_sig(SIGKILL, tk->tsk);
+ }
+
+ /*
+ * In theory the process could have mapped
+ * something else on the address in-between. We could
+ * check for that, but we need to tell the
+ * process anyways.
+ */
+ else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
+ pfn) < 0)
+ printk(KERN_ERR
+ "MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
+ pfn, tk->tsk->comm, tk->tsk->pid);
+ }
+ put_task_struct(tk->tsk);
+ kfree(tk);
+ }
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ struct vm_area_struct *vma;
+ struct task_struct *tsk;
+ struct anon_vma *av;
+
+ read_lock(&tasklist_lock);
+
+ av = page_lock_anon_vma(page);
+ if (av == NULL) /* Not actually mapped anymore */
+ goto out;
+
+ for_each_process (tsk) {
+ if (!tsk->mm)
+ continue;
+ list_for_each_entry (vma, &av->head, anon_vma_node) {
+ if (vma->vm_mm == tsk->mm)
+ add_to_kill(tsk, page, vma, to_kill, tkc);
+ }
+ }
+ page_unlock_anon_vma(av);
+out:
+ read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ struct vm_area_struct *vma;
+ struct task_struct *tsk;
+ struct prio_tree_iter iter;
+ struct address_space *mapping = page_mapping(page);
+
+ /*
+ * A note on the locking order between the two locks.
+ * We don't rely on this particular order.
+ * If you have some other code that needs a different order
+ * feel free to switch them around. Or add a reverse link
+ * from mm_struct to task_struct, then this could be all
+ * done without taking tasklist_lock and looping over all tasks.
+ */
+
+ read_lock(&tasklist_lock);
+ spin_lock(&mapping->i_mmap_lock);
+ for_each_process(tsk) {
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+ if (!tsk->mm)
+ continue;
+
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+ pgoff)
+ if (vma->vm_mm == tsk->mm)
+ add_to_kill(tsk, page, vma, to_kill, tkc);
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+ read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes that have the corrupted page mapped, to kill them.
+ * This is done in two steps for locking reasons.
+ * First preallocate one tokill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliably.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+ struct to_kill *tk;
+
+ tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
+ /* memory allocation failure is implicitly handled */
+ if (PageAnon(page))
+ collect_procs_anon(page, tokill, &tk);
+ else
+ collect_procs_file(page, tokill, &tk);
+ kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+ FAILED, /* Error handling failed */
+ DELAYED, /* Will be handled later */
+ IGNORED, /* Error safely ignored */
+ RECOVERED, /* Successfully recovered */
+};
+
+static const char *action_name[] = {
+ [FAILED] = "Failed",
+ [DELAYED] = "Delayed",
+ [IGNORED] = "Ignored",
+ [RECOVERED] = "Recovered",
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing; try to be lucky and not touch this. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p, unsigned long pfn)
+{
+ return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p, unsigned long pfn)
+{
+ return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p, unsigned long pfn)
+{
+ printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn);
+ return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p, unsigned long pfn)
+{
+ return DELAYED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p, unsigned long pfn)
+{
+ struct address_space *mapping;
+
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ mapping = page_mapping(p);
+ if (mapping == NULL)
+ return RECOVERED;
+
+ /*
+ * Now truncate the page in the page cache. This is really
+ * more like a "temporary hole punch".
+ * Don't do this for block devices when someone else
+ * has a reference, because it could be file system metadata
+ * and that's not safe to truncate.
+ */
+ if (!S_ISREG(mapping->host->i_mode) &&
+ !invalidate_complete_page(mapping, p)) {
+ printk(KERN_ERR
+ "MCE %#lx: failed to invalidate metadata page\n",
+ pfn);
+ return FAILED;
+ }
+
+ truncate_inode_page(mapping, p);
+ if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
+ pr_debug("MCE %#lx: failed to release buffers\n", pfn);
+ return FAILED;
+ }
+ return RECOVERED;
+}
+
+/*
+ * Dirty pagecache page.
+ * Issues: when the error hit a hole page the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p, unsigned long pfn)
+{
+ struct address_space *mapping = page_mapping(p);
+
+ SetPageError(p);
+ /* TBD: print more information about the file. */
+ if (mapping) {
+ /*
+ * IO error will be reported by write(), fsync(), etc.
+ * who check the mapping.
+ * This way the application knows that something went
+ * wrong with its dirty file data.
+ *
+ * There's one open issue:
+ *
+ * The EIO will be only reported on the next IO
+ * operation and then cleared through the IO map.
+ * Normally Linux has two mechanisms to pass IO error
+ * first through the AS_EIO flag in the address space
+ * and then through the PageError flag in the page.
+ * Since we drop pages on memory failure handling the
+ * only mechanism open to use is through AS_EIO.
+ *
+ * This has the disadvantage that it gets cleared on
+ * the first operation that returns an error, while
+ * the PageError bit is more sticky and only cleared
+ * when the page is reread or dropped. If an
+ * application assumes it will always get error on
+ * fsync, but does other operations on the fd before
+ * and the page is dropped in between, then the error
+ * will not be properly reported.
+ *
+ * This can already happen even without hwpoisoned
+ * pages: first on metadata IO errors (which only
+ * report through AS_EIO) or when the page is dropped
+ * at the wrong time.
+ *
+ * So right now we assume that the application DTRT on
+ * the first EIO, but we're not worse than other parts
+ * of the kernel.
+ */
+ mapping_set_error(mapping, EIO);
+ }
+
+ return me_pagecache_clean(p, pfn);
+}
+
+/*
+ * Clean and dirty swap cache.
+ *
+ * Dirty swap cache page is tricky to handle. The page could live both in page
+ * cache and swap cache (i.e. the page is freshly swapped in). So it could be
+ * referenced concurrently by two types of PTEs:
+ * normal PTEs and swap PTEs. We try to handle them consistently by calling
+ * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
+ * and then
+ * - clear dirty bit to prevent IO
+ * - remove from LRU
+ * - but keep in the swap cache, so that when we return to it on
+ * a later page fault, we know the application is accessing
+ * corrupted data and shall be killed (we installed simple
+ * interception code in do_swap_page to catch it).
+ *
+ * Clean swap cache pages can be directly isolated. A later page fault will
+ * bring in the known good data from disk.
+ */
+static int me_swapcache_dirty(struct page *p, unsigned long pfn)
+{
+ ClearPageDirty(p);
+ /* Trigger EIO in shmem: */
+ ClearPageUptodate(p);
+
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ return DELAYED;
+}
+
+static int me_swapcache_clean(struct page *p, unsigned long pfn)
+{
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ delete_from_swap_cache(p);
+
+ return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support, so we cannot find the original mapper. In theory we could
+ * walk all MMs and look for the mappings, but that would be non-atomic and
+ * racy. Need rmap for hugepages for this. Alternatively we could employ a
+ * heuristic, like just walking the current process and hoping it has it
+ * mapped (that should usually be true for the common "shared database cache"
+ * case).
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p, unsigned long pfn)
+{
+ return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * A page state is defined by its current page->flags bits.
+ * The table matches them in order and calls the right handler.
+ *
+ * This is quite tricky because we can access the page at any time
+ * in its life cycle, so all accesses have to be extremely careful.
+ *
+ * This is not complete. More states could be added.
+ * For any missing state don't attempt recovery.
+ */
+
+#define dirty (1UL << PG_dirty)
+#define sc (1UL << PG_swapcache)
+#define unevict (1UL << PG_unevictable)
+#define mlock (1UL << PG_mlocked)
+#define writeback (1UL << PG_writeback)
+#define lru (1UL << PG_lru)
+#define swapbacked (1UL << PG_swapbacked)
+#define head (1UL << PG_head)
+#define tail (1UL << PG_tail)
+#define compound (1UL << PG_compound)
+#define slab (1UL << PG_slab)
+#define buddy (1UL << PG_buddy)
+#define reserved (1UL << PG_reserved)
+
+static struct page_state {
+ unsigned long mask;
+ unsigned long res;
+ char *msg;
+ int (*action)(struct page *p, unsigned long pfn);
+} error_states[] = {
+ { reserved, reserved, "reserved kernel", me_ignore },
+ { buddy, buddy, "free kernel", me_free },
+
+ /*
+ * Could in theory check if slab page is free or if we can drop
+ * currently unused objects without touching them. But just
+ * treat it as standard kernel for now.
+ */
+ { slab, slab, "kernel slab", me_kernel },
+
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+ { head, head, "huge", me_huge_page },
+ { tail, tail, "huge", me_huge_page },
+#else
+ { compound, compound, "huge", me_huge_page },
+#endif
+
+ { sc|dirty, sc|dirty, "swapcache", me_swapcache_dirty },
+ { sc|dirty, sc, "swapcache", me_swapcache_clean },
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+ { unevict|dirty, unevict|dirty, "unevictable LRU", me_pagecache_dirty},
+ { unevict, unevict, "unevictable LRU", me_pagecache_clean},
+#endif
+
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+ { mlock|dirty, mlock|dirty, "mlocked LRU", me_pagecache_dirty },
+ { mlock, mlock, "mlocked LRU", me_pagecache_clean },
+#endif
+
+ { lru|dirty, lru|dirty, "LRU", me_pagecache_dirty },
+ { lru|dirty, lru, "clean LRU", me_pagecache_clean },
+ { swapbacked, swapbacked, "anonymous", me_pagecache_clean },
+
+ /*
+ * Catchall entry: must be at end.
+ */
+ { 0, 0, "unknown page state", me_unknown },
+};
+
+static void action_result(unsigned long pfn, char *msg, int result)
+{
+ printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
+ pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
+ msg, action_name[result]);
+}
+
+static void page_action(struct page_state *ps, struct page *p,
+ unsigned long pfn)
+{
+ int result;
+
+ result = ps->action(p, pfn);
+ action_result(pfn, ps->msg, result);
+ if (page_count(p) != 1)
+ printk(KERN_ERR
+ "MCE %#lx: %s page still referenced by %d users\n",
+ pfn, ps->msg, page_count(p) - 1);
+
+ /* Could do more checks here if page looks ok */
+ atomic_long_add(1, &mce_bad_pages);
+
+ /*
+ * Could adjust zone counters here to correct for the missing page.
+ */
+}
+
+#define N_UNMAP_TRIES 5
+
+/*
+ * Do all that is necessary to remove user space mappings. Unmap
+ * the pages and send SIGBUS to the processes if the data was dirty.
+ */
+static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
+ int trapno)
+{
+ enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+ int kill = sysctl_memory_failure_early_kill;
+ struct address_space *mapping;
+ LIST_HEAD(tokill);
+ int ret;
+ int i;
+
+ if (PageReserved(p) || PageCompound(p) || PageSlab(p))
+ return;
+
+ if (!PageLRU(p))
+ lru_add_drain();
+
+ /*
+ * This check implies we don't kill processes if their pages
+ * are in the swap cache early. Those are always late kills.
+ */
+ if (!page_mapped(p))
+ return;
+
+ if (PageSwapCache(p)) {
+ printk(KERN_ERR
+ "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+ ttu |= TTU_IGNORE_HWPOISON;
+ }
+
+ /*
+ * Propagate the dirty bit from PTEs to struct page first, because we
+ * need this to decide if we should kill or just drop the page.
+ */
+ mapping = page_mapping(p);
+ if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) {
+ if (page_mkclean(p))
+ SetPageDirty(p);
+ else {
+ kill = 0;
+ ttu |= TTU_IGNORE_HWPOISON;
+ printk(KERN_INFO
+ "MCE %#lx: corrupted page was clean: dropped without side effects\n",
+ pfn);
+ }
+ }
+
+ /*
+ * First collect all the processes that have the page
+ * mapped. This has to be done before try_to_unmap,
+ * because ttu takes the rmap data structures down.
+ *
+ * This also has the side effect to propagate the dirty
+ * bit from PTEs into the struct page. This is needed
+ * to actually decide if something needs to be killed
+ * or errored, or if it's ok to just drop the page.
+ *
+ * Error handling: We ignore errors here because
+ * there's nothing that can be done.
+ */
+ if (kill)
+ collect_procs(p, &tokill);
+
+ /*
+ * try_to_unmap can fail temporarily due to races.
+ * Try a few times (RED-PEN better strategy?)
+ */
+ for (i = 0; i < N_UNMAP_TRIES; i++) {
+ ret = try_to_unmap(p, ttu);
+ if (ret == SWAP_SUCCESS)
+ break;
+ pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
+ }
+
+ if (ret != SWAP_SUCCESS)
+ printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
+ pfn, page_mapcount(p));
+
+ /*
+ * Now that the dirty bit has been propagated to the
+ * struct page and all unmaps done we can decide if
+ * killing is needed or not. Only kill when the page
+ * was dirty, otherwise the tokill list is merely
+ * freed. When there was a problem unmapping earlier
+ * use a more forceful, uncatchable kill to prevent
+ * any accesses to the poisoned memory.
+ */
+ kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+ ret != SWAP_SUCCESS, pfn);
+}
+
+/**
+ * memory_failure - Handle memory failure of a page.
+ * @pfn: Page Number of the corrupted page
+ * @trapno: Trap number reported in the signal to user space.
+ *
+ * This function is called by the low level machine check code
+ * of an architecture when it detects hardware memory corruption
+ * of a page. It tries its best to recover, which includes
+ * dropping pages, killing processes etc.
+ *
+ * The function is primarily of use for corruptions that
+ * happen outside the current execution context (e.g. when
+ * detected by a background scrubber)
+ *
+ * Must run in process context (e.g. a work queue) with interrupts
+ * enabled and no spinlocks held.
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+ struct page_state *ps;
+ struct page *p;
+
+ if (!pfn_valid(pfn)) {
+ printk(KERN_ERR
+ "MCE %#lx: memory outside kernel control: Ignored\n",
+ pfn);
+ return;
+ }
+
+ p = pfn_to_page(pfn);
+ if (TestSetPageHWPoison(p)) {
+ action_result(pfn, "already hardware poisoned", IGNORED);
+ return;
+ }
+
+ /*
+ * We need/can do nothing about count=0 pages.
+ * 1) it's a free page, and therefore in safe hand:
+ * prep_new_page() will be the gate keeper.
+ * 2) it's part of a non-compound high order page.
+ * Implies some kernel user: cannot stop them from
+ * R/W the page; let's pray that the page has been
+ * used and will be freed some time later.
+ * In fact it's dangerous to directly bump up page count from 0,
+ * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
+ */
+ if (!get_page_unless_zero(compound_head(p))) {
+ action_result(pfn, "free or high order kernel", IGNORED);
+ return;
+ }
+
+ /*
+ * Lock the page and wait for writeback to finish.
+ * It's very difficult to mess with pages currently under IO
+ * and in many cases impossible, so we just avoid it here.
+ */
+ lock_page_nosync(p);
+ wait_on_page_writeback(p);
+
+ /*
+ * Now take care of user space mappings.
+ */
+ hwpoison_user_mappings(p, pfn, trapno);
+
+ /*
+ * Torn down by someone else?
+ */
+ if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
+ action_result(pfn, "already truncated LRU", IGNORED);
+ goto out;
+ }
+
+ for (ps = error_states;; ps++) {
+ if ((p->flags & ps->mask) == ps->res) {
+ page_action(ps, p, pfn);
+ break;
+ }
+ }
+out:
+ unlock_page(p);
+}
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -815,6 +815,7 @@ extern int vmtruncate(struct inode * ino
extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);

void truncate_inode_page(struct address_space *mapping, struct page *page);
+int invalidate_complete_page(struct address_space *mapping, struct page *page);

#ifdef CONFIG_MMU
extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1327,5 +1328,10 @@ void vmemmap_populate_print_last(void);
extern int account_locked_memory(struct mm_struct *mm, struct rlimit *rlim,
size_t size);
extern void refund_locked_memory(struct mm_struct *mm, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern int sysctl_memory_failure_early_kill;
+extern atomic_long_t mce_bad_pages;
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
--- sound-2.6.orig/kernel/sysctl.c
+++ sound-2.6/kernel/sysctl.c
@@ -1344,6 +1344,20 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = &scan_unevictable_handler,
},
+#ifdef CONFIG_MEMORY_FAILURE
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "memory_failure_early_kill",
+ .data = &sysctl_memory_failure_early_kill,
+ .maxlen = sizeof(sysctl_memory_failure_early_kill),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
+
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
--- sound-2.6.orig/fs/proc/meminfo.c
+++ sound-2.6/fs/proc/meminfo.c
@@ -95,7 +95,11 @@ static int meminfo_proc_show(struct seq_
"Committed_AS: %8lu kB\n"
"VmallocTotal: %8lu kB\n"
"VmallocUsed: %8lu kB\n"
- "VmallocChunk: %8lu kB\n",
+ "VmallocChunk: %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+ "HardwareCorrupted: %8lu kB\n"
+#endif
+ ,
K(i.totalram),
K(i.freeram),
K(i.bufferram),
@@ -140,6 +144,9 @@ static int meminfo_proc_show(struct seq_
(unsigned long)VMALLOC_TOTAL >> 10,
vmi.used >> 10,
vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+ ,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
);

hugetlb_report_meminfo(m);
--- sound-2.6.orig/mm/Kconfig
+++ sound-2.6/mm/Kconfig
@@ -239,6 +239,9 @@ config KSM
Enable the KSM kernel module to allow page sharing of equal pages
among different tasks.

+config MEMORY_FAILURE
+ bool
+
config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
depends on !MMU
--- sound-2.6.orig/Documentation/sysctl/vm.txt
+++ sound-2.6/Documentation/sysctl/vm.txt
@@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
+- memory_failure_early_kill
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
@@ -53,7 +54,6 @@ Currently, these files are in /proc/sys/
- vfs_cache_pressure
- zone_reclaim_mode

-
==============================================================

block_dump
@@ -275,6 +275,32 @@ e.g., up to one or two maps per allocati

The default value is 65536.

+=============================================================
+
+memory_failure_early_kill:
+
+Control how to kill processes when an uncorrected memory error (typically
+a 2-bit error in a memory module) that cannot be handled by the kernel
+is detected in the background by hardware. In some cases (like the page
+still having a valid copy on disk) the kernel will handle the failure
+transparently without affecting any applications. But if there is
+no other up-to-date copy of the data, it will kill processes to prevent
+any data corruption from propagating.
+
+1: Kill all processes that have the corrupted, not-reloadable page mapped
+as soon as the corruption is detected. Note this is not supported
+for a few types of pages, like kernel-internal allocations or
+the swap cache, but it works for the majority of user pages.
+
+0: Only unmap the corrupted page from all processes and kill only a
+process that tries to access it.
+
+The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
+handle this if they want to.
+
+This is only active on architectures/platforms with advanced machine
+check handling and depends on the hardware capabilities.
+
==============================================================

min_free_kbytes:
--- sound-2.6.orig/mm/filemap.c
+++ sound-2.6/mm/filemap.c
@@ -105,6 +105,10 @@
*
* ->task->proc_lock
* ->dcache_lock (proc_pid_lookup)
+ *
+ * (code doesn't rely on that order, so you could switch it around)
+ * ->tasklist_lock (memory_failure, collect_procs_ao)
+ * ->i_mmap_lock
*/

/*
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -36,6 +36,10 @@
* mapping->tree_lock (widely used, in set_page_dirty,
* in arch-dependent flush_dcache_mmap_lock,
* within inode_lock in __sync_single_inode)
+ *
+ * (code doesn't rely on that order so it could be switched around)
+ * ->tasklist_lock
+ * anon_vma->lock (memory_failure, collect_procs_anon)
*/

#include <linux/mm.h>
--- sound-2.6.orig/mm/truncate.c
+++ sound-2.6/mm/truncate.c
@@ -118,8 +118,7 @@ truncate_complete_page(struct address_sp
*
* Returns non-zero if the page was successfully invalidated.
*/
-static int
-invalidate_complete_page(struct address_space *mapping, struct page *page)
+int invalidate_complete_page(struct address_space *mapping, struct page *page)
{
int ret;
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: ***@kvack.org
Wu Fengguang
2009-06-15 02:45:38 UTC
Permalink
In most places we want to test/operate on the compound head page.
The raw poisoned page is recorded in hwpoison_control.p for the
places that need it.

Signed-off-by: Wu Fengguang <***@intel.com>
---
mm/memory-failure.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -314,7 +314,8 @@ static const char *hwpoison_outcome_name

struct hwpoison_control {
unsigned long pfn;
- struct page *page;
+ struct page *p; /* corrupted page */
+ struct page *page; /* compound page head */
int outcome;
};

@@ -732,13 +733,17 @@ void memory_failure(unsigned long pfn, i
}

p = pfn_to_page(pfn);
- hpc.pfn = pfn;
- hpc.page = p;
if (TestSetPageHWPoison(p)) {
- action_result(&hpc, "already hardware poisoned", IGNORED);
+ printk(KERN_ERR
+ "MCE %#lx: already hardware poisoned: Ignored\n",
+ pfn);
return;
}

+ hpc.pfn = pfn;
+ hpc.p = p;
+ hpc.page = p = compound_head(p);
+
/*
* We need/can do nothing about count=0 pages.
* 1) it's a free page, and therefore in safe hand:
@@ -750,7 +755,7 @@ void memory_failure(unsigned long pfn, i
* In fact it's dangerous to directly bump up page count from 0,
* that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
*/
- if (!get_page_unless_zero(compound_head(p))) {
+ if (!get_page_unless_zero(p)) {
action_result(&hpc, "free or high order kernel", IGNORED);
return;
}
--
Wu Fengguang
2009-06-15 02:45:42 UTC
Permalink
From: Andi Kleen <***@linux.intel.com>

Normally the memory-failure.c code is enabled by the architecture; to make
testing easier, enable it unconditionally, independent of architecture changes.

This should not be merged into mainline.

Signed-off-by: Andi Kleen <***@linux.intel.com>

---
mm/Kconfig | 2 ++
1 file changed, 2 insertions(+)

--- sound-2.6.orig/mm/Kconfig
+++ sound-2.6/mm/Kconfig
@@ -241,6 +241,8 @@ config KSM

config MEMORY_FAILURE
bool
+ default y
+ depends on MMU

config HWPOISON_INJECT
tristate "Poison pages injector"
--
Wu Fengguang
2009-06-15 02:45:39 UTC
Permalink
Free pages in the buddy system may well have PG_buddy unset (only the
head page of each free block carries it). Introduce is_free_buddy_page()
for detecting them reliably.

Signed-off-by: Wu Fengguang <***@intel.com>
---
mm/internal.h | 3 +++
mm/memory-failure.c | 24 +++++++++---------------
mm/page_alloc.c | 16 ++++++++++++++++
3 files changed, 28 insertions(+), 15 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -324,15 +324,15 @@ struct hwpoison_control {
* Do nothing, try to be lucky and not touch this instead. For a few cases we
* could be more sophisticated.
*/
-static int me_kernel(struct hwpoison_control *hpc)
+static int me_slab(struct hwpoison_control *hpc)
{
return DELAYED;
}

/*
- * Already poisoned page.
+ * Reserved kernel page.
*/
-static int me_ignore(struct hwpoison_control *hpc)
+static int me_reserved(struct hwpoison_control *hpc)
{
return IGNORED;
}
@@ -347,14 +347,6 @@ static int me_unknown(struct hwpoison_co
}

/*
- * Free memory
- */
-static int me_free(struct hwpoison_control *hpc)
-{
- return DELAYED;
-}
-
-/*
* Clean (or cleaned) page cache page.
*/
static int me_pagecache_clean(struct hwpoison_control *hpc)
@@ -539,15 +531,14 @@ static struct page_state {
char *msg;
int (*action)(struct hwpoison_control *hpc);
} error_states[] = {
- { reserved, reserved, "reserved kernel", me_ignore },
- { buddy, buddy, "free kernel", me_free },
+ { reserved, reserved, "reserved kernel", me_reserved },

/*
* Could in theory check if slab page is free or if we can drop
* currently unused objects without touching them. But just
* treat it as standard kernel for now.
*/
- { slab, slab, "kernel slab", me_kernel },
+ { slab, slab, "kernel slab", me_slab },

#ifdef CONFIG_PAGEFLAGS_EXTENDED
{ head, head, "huge", me_huge_page },
@@ -756,7 +747,10 @@ void memory_failure(unsigned long pfn, i
* that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
*/
if (!get_page_unless_zero(p)) {
- action_result(&hpc, "free or high order kernel", IGNORED);
+ if (is_free_buddy_page(p))
+ action_result(&hpc, "free buddy", DELAYED);
+ else
+ action_result(&hpc, "high order kernel", IGNORED);
return;
}

--- sound-2.6.orig/mm/internal.h
+++ sound-2.6/mm/internal.h
@@ -49,6 +49,9 @@ extern void putback_lru_page(struct page
extern unsigned long highest_memmap_pfn;
extern void __free_pages_bootmem(struct page *page, unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
+#ifdef CONFIG_MEMORY_FAILURE
+extern bool is_free_buddy_page(struct page *page);
+#endif


/*
--- sound-2.6.orig/mm/page_alloc.c
+++ sound-2.6/mm/page_alloc.c
@@ -4966,3 +4966,19 @@ __offline_isolated_pages(unsigned long s
spin_unlock_irqrestore(&zone->lock, flags);
}
#endif
+
+#ifdef CONFIG_MEMORY_FAILURE
+bool is_free_buddy_page(struct page *page)
+{
+ int pfn = page_to_pfn(page);
+ int order;
+
+ for (order = 0; order < MAX_ORDER; order++) {
+ struct page *page_head = page - (pfn & ((1 << order) - 1));
+
+ if (PageBuddy(page_head) && page_order(page_head) >= order)
+ return true;
+ }
+ return false;
+}
+#endif
--
Wu Fengguang
2009-06-15 02:45:31 UTC
Permalink
From: Nick Piggin <***@suse.de>

Extract truncate_inode_page() out of the truncate path so that
it can be used by memory-failure.c.

[AK: description, headers, fix typos]
v2: Some white space changes from Fengguang Wu
v3: add comments

Signed-off-by: Nick Piggin <***@suse.de>
Signed-off-by: Andi Kleen <***@linux.intel.com>
---
include/linux/mm.h | 2 ++
mm/truncate.c | 34 ++++++++++++++++++++++------------
2 files changed, 24 insertions(+), 12 deletions(-)

--- sound-2.6.orig/mm/truncate.c
+++ sound-2.6/mm/truncate.c
@@ -135,6 +135,26 @@ invalidate_complete_page(struct address_
return ret;
}

+/*
+ * Remove one page from its pagecache mapping. The page must be locked.
+ * This does not truncate the file on disk, it performs the pagecache
+ * side of the truncate operation. Dirty data will be discarded, and
+ * concurrent page references are ignored.
+ *
+ * Generic mm/fs code cannot call this on filesystem metadata mappings
+ * because those can assume that a page reference is enough to pin the
+ * page to its mapping.
+ */
+void truncate_inode_page(struct address_space *mapping, struct page *page)
+{
+ if (page_mapped(page)) {
+ unmap_mapping_range(mapping,
+ (loff_t)page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+ }
+ truncate_complete_page(mapping, page);
+}
+
/**
* truncate_inode_pages - truncate range of pages specified by start & end byte offsets
* @mapping: mapping to truncate
@@ -196,12 +216,7 @@ void truncate_inode_pages_range(struct a
unlock_page(page);
continue;
}
- if (page_mapped(page)) {
- unmap_mapping_range(mapping,
- (loff_t)page_index<<PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE, 0);
- }
- truncate_complete_page(mapping, page);
+ truncate_inode_page(mapping, page);
unlock_page(page);
}
pagevec_release(&pvec);
@@ -238,15 +253,10 @@ void truncate_inode_pages_range(struct a
break;
lock_page(page);
wait_on_page_writeback(page);
- if (page_mapped(page)) {
- unmap_mapping_range(mapping,
- (loff_t)page->index<<PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE, 0);
- }
+ truncate_inode_page(mapping, page);
if (page->index > next)
next = page->index;
next++;
- truncate_complete_page(mapping, page);
unlock_page(page);
}
pagevec_release(&pvec);
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -814,6 +814,8 @@ static inline void unmap_shared_mapping_
extern int vmtruncate(struct inode * inode, loff_t offset);
extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);

+void truncate_inode_page(struct address_space *mapping, struct page *page);
+
#ifdef CONFIG_MMU
extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access);
--
Wu Fengguang
2009-06-15 02:45:37 UTC
Permalink
Code cleanups to allow passing around more parameters and states.

Signed-off-by: Wu Fengguang <***@intel.com>
---
mm/memory-failure.c | 94 ++++++++++++++++++++++++------------------
1 file changed, 54 insertions(+), 40 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -298,26 +298,32 @@ static void collect_procs(struct page *p
* Error handlers for various types of pages.
*/

-enum outcome {
+enum hwpoison_outcome {
FAILED, /* Error handling failed */
DELAYED, /* Will be handled later */
IGNORED, /* Error safely ignored */
RECOVERED, /* Successfully recovered */
};

-static const char *action_name[] = {
+static const char *hwpoison_outcome_name[] = {
[FAILED] = "Failed",
[DELAYED] = "Delayed",
[IGNORED] = "Ignored",
[RECOVERED] = "Recovered",
};

+struct hwpoison_control {
+ unsigned long pfn;
+ struct page *page;
+ int outcome;
+};
+
/*
* Error hit kernel page.
* Do nothing, try to be lucky and not touch this instead. For a few cases we
* could be more sophisticated.
*/
-static int me_kernel(struct page *p, unsigned long pfn)
+static int me_kernel(struct hwpoison_control *hpc)
{
return DELAYED;
}
@@ -325,7 +331,7 @@ static int me_kernel(struct page *p, uns
/*
* Already poisoned page.
*/
-static int me_ignore(struct page *p, unsigned long pfn)
+static int me_ignore(struct hwpoison_control *hpc)
{
return IGNORED;
}
@@ -333,16 +339,16 @@ static int me_ignore(struct page *p, uns
/*
* Page in unknown state. Do nothing.
*/
-static int me_unknown(struct page *p, unsigned long pfn)
+static int me_unknown(struct hwpoison_control *hpc)
{
- printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn);
+ printk(KERN_ERR "MCE %#lx: Unknown page state\n", hpc->pfn);
return FAILED;
}

/*
* Free memory
*/
-static int me_free(struct page *p, unsigned long pfn)
+static int me_free(struct hwpoison_control *hpc)
{
return DELAYED;
}
@@ -350,9 +356,10 @@ static int me_free(struct page *p, unsig
/*
* Clean (or cleaned) page cache page.
*/
-static int me_pagecache_clean(struct page *p, unsigned long pfn)
+static int me_pagecache_clean(struct hwpoison_control *hpc)
{
struct address_space *mapping;
+ struct page *p = hpc->page;

if (!isolate_lru_page(p))
page_cache_release(p);
@@ -372,14 +379,14 @@ static int me_pagecache_clean(struct pag
!invalidate_complete_page(mapping, p)) {
printk(KERN_ERR
"MCE %#lx: failed to invalidate metadata page\n",
- pfn);
+ hpc->pfn);
return FAILED;
}

truncate_inode_page(mapping, p);
if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
pr_debug(KERN_ERR "MCE %#lx: failed to release buffers\n",
- pfn);
+ hpc->pfn);
return FAILED;
}
return RECOVERED;
@@ -390,11 +397,11 @@ static int me_pagecache_clean(struct pag
* Issues: when the error hit a hole page the error is not properly
* propagated.
*/
-static int me_pagecache_dirty(struct page *p, unsigned long pfn)
+static int me_pagecache_dirty(struct hwpoison_control *hpc)
{
- struct address_space *mapping = page_mapping(p);
+ struct address_space *mapping = page_mapping(hpc->page);

- SetPageError(p);
+ SetPageError(hpc->page);
/* TBD: print more information about the file. */
if (mapping) {
/*
@@ -434,7 +441,7 @@ static int me_pagecache_dirty(struct pag
mapping_set_error(mapping, EIO);
}

- return me_pagecache_clean(p, pfn);
+ return me_pagecache_clean(hpc);
}

/*
@@ -456,8 +463,10 @@ static int me_pagecache_dirty(struct pag
* Clean swap cache pages can be directly isolated. A later page fault will
* bring in the known good data from disk.
*/
-static int me_swapcache_dirty(struct page *p, unsigned long pfn)
+static int me_swapcache_dirty(struct hwpoison_control *hpc)
{
+ struct page *p = hpc->page;
+
ClearPageDirty(p);
/* Trigger EIO in shmem: */
ClearPageUptodate(p);
@@ -468,8 +477,10 @@ static int me_swapcache_dirty(struct pag
return DELAYED;
}

-static int me_swapcache_clean(struct page *p, unsigned long pfn)
+static int me_swapcache_clean(struct hwpoison_control *hpc)
{
+ struct page *p = hpc->page;
+
if (!isolate_lru_page(p))
page_cache_release(p);

@@ -489,7 +500,7 @@ static int me_swapcache_clean(struct pag
* Should handle free huge pages and dequeue them too, but this needs to
* handle huge page accounting correctly.
*/
-static int me_huge_page(struct page *p, unsigned long pfn)
+static int me_huge_page(struct hwpoison_control *hpc)
{
return FAILED;
}
@@ -525,7 +536,7 @@ static struct page_state {
unsigned long mask;
unsigned long res;
char *msg;
- int (*action)(struct page *p, unsigned long pfn);
+ int (*action)(struct hwpoison_control *hpc);
} error_states[] = {
{ reserved, reserved, "reserved kernel", me_ignore },
{ buddy, buddy, "free kernel", me_free },
@@ -567,24 +578,22 @@ static struct page_state {
{ 0, 0, "unknown page state", me_unknown },
};

-static void action_result(unsigned long pfn, char *msg, int result)
+static void action_result(struct hwpoison_control *hpc, char *msg, int result)
{
+ hpc->outcome = result;
printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
- pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
- msg, action_name[result]);
+ hpc->pfn, PageDirty(hpc->page) ? "dirty " : "",
+ msg, hwpoison_outcome_name[result]);
}

-static void page_action(struct page_state *ps, struct page *p,
- unsigned long pfn)
+static void page_action(struct page_state *ps, struct hwpoison_control *hpc)
{
- int result;
+ action_result(hpc, ps->msg, ps->action(hpc));

- result = ps->action(p, pfn);
- action_result(pfn, ps->msg, result);
- if (page_count(p) != 1)
+ if (page_count(hpc->page) != 1)
printk(KERN_ERR
"MCE %#lx: %s page still referenced by %d users\n",
- pfn, ps->msg, page_count(p) - 1);
+ hpc->pfn, ps->msg, page_count(hpc->page) - 1);

/* Could do more checks here if page looks ok */
atomic_long_add(1, &mce_bad_pages);
@@ -600,12 +609,12 @@ static void page_action(struct page_stat
* Do all that is necessary to remove user space mappings. Unmap
* the pages and send SIGBUS to the processes if the data was dirty.
*/
-static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
- int trapno)
+static void hwpoison_user_mappings(struct hwpoison_control *hpc, int trapno)
{
enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
int kill = sysctl_memory_failure_early_kill;
struct address_space *mapping;
+ struct page *p = hpc->page;
LIST_HEAD(tokill);
int ret;
int i;
@@ -625,7 +634,8 @@ static void hwpoison_user_mappings(struc

if (PageSwapCache(p)) {
printk(KERN_ERR
- "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+ "MCE %#lx: keeping poisoned page in swap cache\n",
+ hpc->pfn);
ttu |= TTU_IGNORE_HWPOISON;
}

@@ -642,7 +652,7 @@ static void hwpoison_user_mappings(struc
ttu |= TTU_IGNORE_HWPOISON;
printk(KERN_INFO
"MCE %#lx: corrupted page was clean: dropped without side effects\n",
- pfn);
+ hpc->pfn);
}
}

@@ -670,12 +680,13 @@ static void hwpoison_user_mappings(struc
ret = try_to_unmap(p, ttu);
if (ret == SWAP_SUCCESS)
break;
- pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
+ pr_debug("MCE %#lx: try_to_unmap retry needed %d\n",
+ hpc->pfn, ret);
}

if (ret != SWAP_SUCCESS)
printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
- pfn, page_mapcount(p));
+ hpc->pfn, page_mapcount(p));

/*
* Now that the dirty bit has been propagated to the
@@ -687,7 +698,7 @@ static void hwpoison_user_mappings(struc
* any accesses to the poisoned memory.
*/
kill_procs_ao(&tokill, !!PageDirty(p), trapno,
- ret != SWAP_SUCCESS, pfn);
+ ret != SWAP_SUCCESS, hpc->pfn);
}

/**
@@ -711,6 +722,7 @@ void memory_failure(unsigned long pfn, i
{
struct page_state *ps;
struct page *p;
+ struct hwpoison_control hpc;

if (!pfn_valid(pfn)) {
printk(KERN_ERR
@@ -720,8 +732,10 @@ void memory_failure(unsigned long pfn, i
}

p = pfn_to_page(pfn);
+ hpc.pfn = pfn;
+ hpc.page = p;
if (TestSetPageHWPoison(p)) {
- action_result(pfn, "already hardware poisoned", IGNORED);
+ action_result(&hpc, "already hardware poisoned", IGNORED);
return;
}

@@ -737,7 +751,7 @@ void memory_failure(unsigned long pfn, i
* that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
*/
if (!get_page_unless_zero(compound_head(p))) {
- action_result(pfn, "free or high order kernel", IGNORED);
+ action_result(&hpc, "free or high order kernel", IGNORED);
return;
}

@@ -752,19 +766,19 @@ void memory_failure(unsigned long pfn, i
/*
* Now take care of user space mappings.
*/
- hwpoison_user_mappings(p, pfn, trapno);
+ hwpoison_user_mappings(&hpc, trapno);

/*
* Torn down by someone else?
*/
if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
- action_result(pfn, "already truncated LRU", IGNORED);
+ action_result(&hpc, "already truncated LRU", IGNORED);
goto out;
}

for (ps = error_states;; ps++) {
if ((p->flags & ps->mask) == ps->res) {
- page_action(ps, p, pfn);
+ page_action(ps, &hpc);
break;
}
}
--
Wu Fengguang
2009-06-15 02:45:29 UTC
Permalink
From: Andi Kleen <***@linux.intel.com>

When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.

Also add a new flag to not do that (needed for the memory-failure handler
later)

Reviewed-by: Wu Fengguang <***@intel.com>
Signed-off-by: Andi Kleen <***@linux.intel.com>

---
include/linux/rmap.h | 1 +
mm/rmap.c | 9 ++++++++-
2 files changed, 9 insertions(+), 1 deletion(-)

--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);

- if (PageAnon(page)) {
+ if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+ if (PageAnon(page))
+ dec_mm_counter(mm, anon_rss);
+ else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+ dec_mm_counter(mm, file_rss);
+ set_pte_at(mm, address, pte,
+ swp_entry_to_pte(make_hwpoison_entry(page)));
+ } else if (PageAnon(page)) {
swp_entry_t entry = { .val = page_private(page) };

if (PageSwapCache(page)) {
--- sound-2.6.orig/include/linux/rmap.h
+++ sound-2.6/include/linux/rmap.h
@@ -94,6 +94,7 @@ enum ttu_flags {

TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
+ TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
};
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
--
Minchan Kim
2009-06-15 13:09:03 UTC
Permalink
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry?
--
Kinds regards,
Minchan Kim

Wu Fengguang
2009-06-15 15:26:12 UTC
Permalink
Post by Minchan Kim
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry ?
Good catch! It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that
the !is_migration_entry() and !is_hwpoison_entry() tests will both be TRUE.

So let's do this?

-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else


Thanks,
Fengguang

Minchan Kim
2009-06-16 00:03:08 UTC
Permalink
On Mon, 15 Jun 2009 23:26:12 +0800
Post by Wu Fengguang
Post by Minchan Kim
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry ?
Good catch! It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that
the !is_migration_entry() and !is_hwpoison_entry() tests will both be TRUE.
So let's do this?
It seems you have broadened my view :)

I don't know migration well.
How does page_check_address() guarantee it's not a migration entry?

In addition, if the page is poisoned while we are going through migration
((PAGE_MIGRATION && migration) == TRUE), should we decrease file_rss?
Post by Wu Fengguang
-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else
Thanks,
Fengguang
--
Kinds Regards
Minchan Kim

Wu Fengguang
2009-06-16 13:49:44 UTC
Permalink
Post by Minchan Kim
On Mon, 15 Jun 2009 23:26:12 +0800
Post by Wu Fengguang
Post by Minchan Kim
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry ?
Good catch! It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that
the !is_migration_entry() and !is_hwpoison_entry() tests will both be TRUE.
So let's do this?
It seems you expand my sight :)
I don't know migration well.
How does page_check_address() guarantee it's not a migration entry?
page_check_address() calls pte_present() which returns the
(_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines

#define __swp_entry(type, offset) ((swp_entry_t) { \
((type) << (_PAGE_BIT_PRESENT + 1)) \
| ((offset) << SWP_OFFSET_SHIFT) })

where SWP_OFFSET_SHIFT is defined to the bigger one of
max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.

So __swp_entry(type, offset) := (type << 1) | (offset << 9)

We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
_PAGE_PROTNONE will all be zero for swap entries.
Post by Minchan Kim
In addition, if the page is poisoned while we are going through migration
((PAGE_MIGRATION && migration) == TRUE), should we decrease file_rss?
Migration will die on trying to migrate the poisoned page, so we don't care
about the accounting. But normally the poisoned page should already be
isolated, so we don't care about that die either.

Thanks,
Fengguang
Post by Minchan Kim
Post by Wu Fengguang
-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else
Thanks,
Fengguang
--
Kinds Regards
Minchan Kim
Minchan Kim
2009-06-17 00:28:26 UTC
Permalink
On Tue, 16 Jun 2009 21:49:44 +0800
Post by Wu Fengguang
Post by Minchan Kim
On Mon, 15 Jun 2009 23:26:12 +0800
Post by Wu Fengguang
Post by Minchan Kim
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry ?
Good catch! It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that
!is_migration_entry() or !is_hwpoison_entry() tests will all be TRUE.
So let's do this?
It seems you expand my sight :)
I don't know migration well.
How page_check_address guarantee it's not migration entry ?
page_check_address() calls pte_present() which returns the
(_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
#define __swp_entry(type, offset) ((swp_entry_t) { \
((type) << (_PAGE_BIT_PRESENT + 1)) \
| ((offset) << SWP_OFFSET_SHIFT) })
where SWP_OFFSET_SHIFT is defined to the bigger one of
max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
So __swp_entry(type, offset) := (type << 1) | (offset << 9)
We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
_PAGE_PROTNONE will all be zero for swap entries.
Thanks for kind explanation :)
Post by Wu Fengguang
Post by Minchan Kim
In addition, if the page is poisoned while we are going to migrate
((PAGE_MIGRATION && migration) == TRUE), should we decrease
file_rss ?
It will die on trying to migrate the poisoned page so we don't care
the accounting. But normally the poisoned page shall already be
Okay. then, how about this ?
We should not increase file_rss on trying to migrate the poisoned page

-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else if (!(PAGE_MIGRATION && migration))
Post by Wu Fengguang
isolated so we don't care that die either.
Thanks,
Fengguang
Post by Minchan Kim
Post by Wu Fengguang
-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else
Thanks,
Fengguang
--
Kinds Regards
Minchan Kim
--
Kinds Regards
Minchan Kim

Wu Fengguang
2009-06-17 07:23:19 UTC
Permalink
Post by Minchan Kim
On Tue, 16 Jun 2009 21:49:44 +0800
Post by Wu Fengguang
Post by Minchan Kim
On Mon, 15 Jun 2009 23:26:12 +0800
Post by Wu Fengguang
Post by Minchan Kim
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry ?
Good catch! It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that
!is_migration_entry() or !is_hwpoison_entry() tests will all be TRUE.
So let's do this?
It seems you expand my sight :)
I don't know migration well.
How page_check_address guarantee it's not migration entry ?
page_check_address() calls pte_present() which returns the
(_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
#define __swp_entry(type, offset) ((swp_entry_t) { \
((type) << (_PAGE_BIT_PRESENT + 1)) \
| ((offset) << SWP_OFFSET_SHIFT) })
where SWP_OFFSET_SHIFT is defined to the bigger one of
max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
So __swp_entry(type, offset) := (type << 1) | (offset << 9)
We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
_PAGE_PROTNONE will all be zero for swap entries.
Thanks for kind explanation :)
You are welcome~
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
In addition, if the page is poisoned while we are going to migrate
((PAGE_MIGRATION && migration) == TRUE), should we decrease
file_rss ?
It will die on trying to migrate the poisoned page so we don't care
the accounting. But normally the poisoned page shall already be
Okay. then, how about this ?
We should not increase file_rss on trying to migrate the poisoned page
-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else if (!(PAGE_MIGRATION && migration))
This is good if we are going to stop the hwpoison page from being
consumed by move_to_new_page(), but I highly doubt we'll ever add
PageHWPoison() checks into the migration code.

Because this race window is small enough:

TestSetPageHWPoison(p);
lock_page(page);
try_to_unmap(page, TTU_MIGRATION|...);
lock_page_nosync(p);

such small race windows can be found all over the kernel, it's just
insane to try to fix any of them.

For example, if the newly allocated page gets corrupted, this kind of code,
which assumes it is the only user of the page (but memory_failure() comes in
between like a ghost), will go BUG():

	/*
	 * Block others from accessing the page when we get around to
	 * establishing additional references. We are the only one
	 * holding a reference to the new page at this point.
	 */
	if (!trylock_page(newpage))
		BUG();

Thanks,
Fengguang

Minchan Kim
2009-06-17 13:27:36 UTC
Permalink
Post by Wu Fengguang
Post by Minchan Kim
On Tue, 16 Jun 2009 21:49:44 +0800
Post by Wu Fengguang
Post by Minchan Kim
On Mon, 15 Jun 2009 23:26:12 +0800
Post by Minchan Kim
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry ?
Good catch!  It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that
!is_migration_entry() or !is_hwpoison_entry() tests will all be TRUE.
So let's do this?
It seems you expand my sight :)
I don't know migration well.
How page_check_address guarantee it's not migration entry ?
page_check_address() calls pte_present() which returns the
(_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
#define __swp_entry(type, offset)       ((swp_entry_t) { \
                                         ((type) << (_PAGE_BIT_PRESENT + 1)) \
                                         | ((offset) << SWP_OFFSET_SHIFT) })
where SWP_OFFSET_SHIFT is defined to the bigger one of
max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
So __swp_entry(type, offset) := (type << 1) | (offset << 9)
We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
_PAGE_PROTNONE will all be zero for swap entries.
Thanks for kind explanation :)
You are welcome~
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
In addtion, If the page is poison while we are going to
migration((PAGE_MIGRATION && migration) == TRUE), we should decrease
file_rss ?
It will die on trying to migrate the poisoned page so we don't care
the accounting. But normally the poisoned page shall already be
Okay. then, how about this ?
We should not increase file_rss on trying to migrate the poisoned page
-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else if (!(PAGE_MIGRATION && migration))
This is good if we are going to stop the hwpoison page from being
consumed by move_to_new_page(), but I highly doubt we'll ever add
PageHWPoison() checks into the migration code.
       TestSetPageHWPoison(p);
                                  lock_page(page);
                                  try_to_unmap(page, TTU_MIGRATION|...);
       lock_page_nosync(p);
such small race windows can be found all over the kernel, it's just
insane to try to fix any of them.
Sorry for the late response.

I see your point.
My opinion is that at least we must be notified when such a situation happens.
So I think it would be better to add some warning so we can fix it up when it
happens, even though it is a small race window.
--
Kinds regards,
Minchan Kim

Wu Fengguang
2009-06-17 13:37:08 UTC
Permalink
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
On Tue, 16 Jun 2009 21:49:44 +0800
Post by Wu Fengguang
Post by Minchan Kim
On Mon, 15 Jun 2009 23:26:12 +0800
Post by Minchan Kim
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry ?
Good catch!  It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that
!is_migration_entry() or !is_hwpoison_entry() tests will all be TRUE.
So let's do this?
It seems you expand my sight :)
I don't know migration well.
How page_check_address guarantee it's not migration entry ?
page_check_address() calls pte_present() which returns the
(_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
#define __swp_entry(type, offset)       ((swp_entry_t) { \
                                         ((type) << (_PAGE_BIT_PRESENT + 1)) \
                                         | ((offset) << SWP_OFFSET_SHIFT) })
where SWP_OFFSET_SHIFT is defined to the bigger one of
max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
So __swp_entry(type, offset) := (type << 1) | (offset << 9)
We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
_PAGE_PROTNONE will all be zero for swap entries.
Thanks for kind explanation :)
You are welcome~
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
In addition, if the page is poisoned while we are going to migrate
((PAGE_MIGRATION && migration) == TRUE), should we decrease
file_rss ?
It will die on trying to migrate the poisoned page so we don't care
the accounting. But normally the poisoned page shall already be
Okay. then, how about this ?
We should not increase file_rss on trying to migrate the poisoned page
-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else if (!(PAGE_MIGRATION && migration))
This is good if we are going to stop the hwpoison page from being
consumed by move_to_new_page(), but I highly doubt we'll ever add
PageHWPoison() checks into the migration code.
       TestSetPageHWPoison(p);
                                  lock_page(page);
                                  try_to_unmap(page, TTU_MIGRATION|...);
       lock_page_nosync(p);
such small race windows can be found all over the kernel, it's just
insane to try to fix any of them.
Sorry for the late response.
I see your point.
My opinion is that at least we must be notified when such a situation happens.
So I think it would be better to add some warning so we can fix it up when it
happens, even though it is a small race window.
Notification is also pointless here: we'll die hard on
accessing/consuming the poisoned page anyway :(

Thanks,
Fengguang
Minchan Kim
2009-06-17 13:43:29 UTC
Permalink
Post by Wu Fengguang
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
On Tue, 16 Jun 2009 21:49:44 +0800
Post by Wu Fengguang
Post by Minchan Kim
On Mon, 15 Jun 2009 23:26:12 +0800
Post by Minchan Kim
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry ?
Good catch!  It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that
!is_migration_entry() or !is_hwpoison_entry() tests will all be TRUE.
So let's do this?
It seems you expand my sight :)
I don't know migration well.
How page_check_address guarantee it's not migration entry ?
page_check_address() calls pte_present() which returns the
(_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
#define __swp_entry(type, offset)       ((swp_entry_t) { \
                                         ((type) << (_PAGE_BIT_PRESENT + 1)) \
                                         | ((offset) << SWP_OFFSET_SHIFT) })
where SWP_OFFSET_SHIFT is defined to the bigger one of
max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
So __swp_entry(type, offset) := (type << 1) | (offset << 9)
We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
_PAGE_PROTNONE will all be zero for swap entries.
Thanks for kind explanation :)
You are welcome~
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
In addition, if the page is poisoned while we are going to migrate
((PAGE_MIGRATION && migration) == TRUE), should we decrease
file_rss ?
It will die on trying to migrate the poisoned page so we don't care
the accounting. But normally the poisoned page shall already be
Okay. then, how about this ?
We should not increase file_rss on trying to migrate the poisoned page
-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else if (!(PAGE_MIGRATION && migration))
This is good if we are going to stop the hwpoison page from being
consumed by move_to_new_page(), but I highly doubt we'll ever add
PageHWPoison() checks into the migration code.
       TestSetPageHWPoison(p);
                                  lock_page(page);
                                  try_to_unmap(page, TTU_MIGRATION|...);
       lock_page_nosync(p);
such small race windows can be found all over the kernel, it's just
insane to try to fix any of them.
Sorry for the late response.
I see your point.
My opinion is that at least we must be notified when such a situation happens.
So I think it would be better to add some warning so we can fix it up when it
happens, even though it is a small race window.
Notification is also pointless here: we'll die hard on
accessing/consuming the poisoned page anyway :(
My intention wasn't to recover it.

It would just add something like WARN_ON.
You said the window is small enough, but I think it can happen with
higher probability under a migration workload. (At the moment, I don't know
what kinds of apps.)
In such a case, if we hear reports of the warning, at that time we
can consider migration handling for HWPoison.
Post by Wu Fengguang
Thanks,
Fengguang
--
Kinds regards,
Minchan Kim

Wu Fengguang
2009-06-17 14:03:34 UTC
Permalink
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
On Tue, 16 Jun 2009 21:49:44 +0800
Post by Wu Fengguang
Post by Minchan Kim
On Mon, 15 Jun 2009 23:26:12 +0800
Post by Minchan Kim
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry ?
Good catch!  It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that
!is_migration_entry() or !is_migration_entry() tests will all be TRUE.
So let's do this?
It seems you expand my sight :)
I don't know migration well.
How page_check_address guarantee it's not migration entry ?
page_check_address() calls pte_present() which returns the
(_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
#define __swp_entry(type, offset)       ((swp_entry_t) { \
                                         ((type) << (_PAGE_BIT_PRESENT + 1)) \
                                         | ((offset) << SWP_OFFSET_SHIFT) })
where SWP_OFFSET_SHIFT is defined to the bigger one of
max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
So __swp_entry(type, offset) := (type << 1) | (offset << 9)
We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
_PAGE_PROTNONE will all be zero for swap entries.
Thanks for kind explanation :)
You are welcome~
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
In addition, if the page is poisoned while we are going to migrate
((PAGE_MIGRATION && migration) == TRUE), should we decrease
file_rss ?
It will die on trying to migrate the poisoned page so we don't care
the accounting. But normally the poisoned page shall already be
Okay. then, how about this ?
We should not increase file_rss on trying to migrate the poisoned page
-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else if (!(PAGE_MIGRATION && migration))
This is good if we are going to stop the hwpoison page from being
consumed by move_to_new_page(), but I highly doubt we'll ever add
PageHWPoison() checks into the migration code.
       TestSetPageHWPoison(p);
                                  lock_page(page);
                                  try_to_unmap(page, TTU_MIGRATION|...);
       lock_page_nosync(p);
such small race windows can be found all over the kernel, it's just
insane to try to fix any of them.
Sorry for the late response.
I see your point.
My opinion is that at least we must be notified when such a situation happens.
So I think it would be better to add some warning so we can fix it up when it
happens, even though it is a small race window.
Notification is also pointless here: we'll die hard on
accessing/consuming the poisoned page anyway :(
My intention wasn't to recover it.
Yes, that's not the point.
Post by Minchan Kim
It would just add something like WARN_ON.
You said the window is small enough, but I think it can happen with
higher probability under a migration workload. (At the moment, I don't know
what kinds of apps.)
In such a case, if we hear reports of the warning, at that time we
can consider migration handling for HWPoison.
The point is, any page can go corrupted at any time. We don't need to add
1000 PageHWPoison() tests in the kernel like this. We don't aim for
100% protection; that's impossible. I'd be very content if it can
ever reach 80% coverage :)

Thanks,
Fengguang

Minchan Kim
2009-06-17 14:08:13 UTC
Permalink
Post by Wu Fengguang
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
On Tue, 16 Jun 2009 21:49:44 +0800
Post by Wu Fengguang
Post by Minchan Kim
On Mon, 15 Jun 2009 23:26:12 +0800
Post by Minchan Kim
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry ?
Good catch!  It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that
!is_migration_entry() or !is_hwpoison_entry() tests will all be TRUE.
So let's do this?
It seems you expand my sight :)
I don't know migration well.
How page_check_address guarantee it's not migration entry ?
page_check_address() calls pte_present() which returns the
(_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
#define __swp_entry(type, offset)       ((swp_entry_t) { \
                                         ((type) << (_PAGE_BIT_PRESENT + 1)) \
                                         | ((offset) << SWP_OFFSET_SHIFT) })
where SWP_OFFSET_SHIFT is defined to the bigger one of
max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
So __swp_entry(type, offset) := (type << 1) | (offset << 9)
We know that the swap type is 5 bits. So the bit 0 _PAGE_PRESENT and bit 8
_PAGE_PROTNONE will all be zero for swap entries.
Thanks for kind explanation :)
You are welcome~
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
In addition, if the page is poisoned while we are going to migrate
((PAGE_MIGRATION && migration) == TRUE), should we decrease
file_rss ?
It will die on trying to migrate the poisoned page so we don't care
the accounting. But normally the poisoned page shall already be
Okay. then, how about this ?
We should not increase file_rss on trying to migrate the poisoned page
-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else if (!(PAGE_MIGRATION && migration))
This is good if we are going to stop the hwpoison page from being
consumed by move_to_new_page(), but I highly doubt we'll ever add
PageHWPoison() checks into the migration code.
       TestSetPageHWPoison(p);
                                  lock_page(page);
                                  try_to_unmap(page, TTU_MIGRATION|...);
       lock_page_nosync(p);
such small race windows can be found all over the kernel, it's just
insane to try to fix any of them.
Sorry for the late response.
I see your point.
My opinion is that at least we must be notified when such a situation happens.
So I think it would be better to add some warning so we can fix it up when it
happens, even though it is a small race window.
Notification is also pointless here: we'll die hard on
accessing/consuming the poisoned page anyway :(
My intention wasn't to recover it.
Yes, that's not the point.
Post by Minchan Kim
It would just add something like WARN_ON.
You said the window is small enough, but I think it can happen with
higher probability under a migration workload. (At the moment, I don't know
what kinds of apps.)
In such a case, if we hear reports of the warning, at that time we
can consider migration handling for HWPoison.
The point is, any page can go corrupted at any time. We don't need to add
1000 PageHWPoison() tests in the kernel like this. We don't aim for
100% protection; that's impossible. I'd be very content if it can
ever reach 80% coverage :)
Okay.
If that is your goal, I also think the migration portion of it all is very small.
Thanks for the kind reply to my boring discussion.
Post by Wu Fengguang
Thanks,
Fengguang
--
Kinds regards,
Minchan Kim

Wu Fengguang
2009-06-17 14:12:37 UTC
Permalink
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
On Tue, 16 Jun 2009 21:49:44 +0800
Post by Wu Fengguang
Post by Minchan Kim
On Mon, 15 Jun 2009 23:26:12 +0800
Post by Minchan Kim
Post by Wu Fengguang
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later)
---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)
--- sound-2.6.orig/mm/rmap.c
+++ sound-2.6/mm/rmap.c
@@ -958,7 +958,14 @@ static int try_to_unmap_one(struct page
       /* Update high watermark before we lower rss */
       update_hiwater_rss(mm);
-       if (PageAnon(page)) {
+       if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+               if (PageAnon(page))
+                       dec_mm_counter(mm, anon_rss);
+               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
Isn't it straightforward to use !is_hwpoison_entry ?
Good catch!  It looks like a redundant check: the
page_check_address() at the beginning of the function guarantees that
!is_migration_entry() or !is_hwpoison_entry() tests will all be TRUE.
So let's do this?
It seems you expand my sight :)
I don't know migration well.
How page_check_address guarantee it's not migration entry ?
page_check_address() calls pte_present() which returns the
(_PAGE_PRESENT | _PAGE_PROTNONE) bits. While x86-64 defines
#define __swp_entry(type, offset)       ((swp_entry_t) { \
                                         ((type) << (_PAGE_BIT_PRESENT + 1)) \
                                         | ((offset) << SWP_OFFSET_SHIFT) })
where SWP_OFFSET_SHIFT is defined to the bigger one of
max(_PAGE_BIT_PROTNONE + 1, _PAGE_BIT_FILE + 1) = max(8+1, 6+1) = 9.
So __swp_entry(type, offset) := (type << 1) | (offset << 9).
We know that the swap type is 5 bits, so bit 0 (_PAGE_PRESENT) and bit 8
(_PAGE_PROTNONE) will both be zero for swap entries.
Thanks for the kind explanation :)
You are welcome~
Post by Minchan Kim
Post by Wu Fengguang
Post by Minchan Kim
In addition, if the page is poisoned while we are going to
migrate it ((PAGE_MIGRATION && migration) == TRUE), shouldn't we decrease
file_rss?
It will die on trying to migrate the poisoned page, so we don't care
about the accounting. But normally the poisoned page shall already be
Okay, then how about this?
We should not increase file_rss when trying to migrate the poisoned page:
-               else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+               else if (!(PAGE_MIGRATION && migration))
This is good if we are going to stop the hwpoison page from being
consumed by move_to_new_page(), but I highly doubt we'll ever add
PageHWPoison() checks into the migration code.
       TestSetPageHWPoison(p);
                                  lock_page(page);
                                  try_to_unmap(page, TTU_MIGRATION|...);
       lock_page_nosync(p);
such small race windows can be found all over the kernel, it's just
insane to try to fix any of them.
Sorry for the late response.
I see your point.
My opinion is that at least we should be notified when such a situation happens.
So I think it would be better to add some warning, so we can fix it up when it
happens, even though it is a small race window.
Notification is also pointless here: we'll die hard on
accessing/consuming the poisoned page anyway :(
My intention wasn't to recover it.
Yes, that's not the point.
Post by Minchan Kim
It would just add something like a WARN_ON.
You said the window is small enough, but I think it can happen with
higher probability in a migration-heavy workload (at the moment, I don't
know what kind of app that would be).
For such a case, if we see the warning reported, we can then
consider migration handling for HWPoison.
The point is, any page can become corrupted at any time. We don't need to add
1000 PageHWPoison() tests in the kernel like this. We don't aim for
100% protection; that's impossible. I'd be very content if it can ever
reach 80% coverage :)
Okay.
If that is your goal, I also think the migration portion of it all is very small.
Thanks for the kind reply to my boring discussion.
Thank you, I'll add comments to clearly state that goal and its rationale :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: ***@kvack.org
Wu Fengguang
2009-06-18 12:14:30 UTC
Permalink
This is a private mail for my question.
I don't want to make noise on LKML.
And I don't want to disturb your progress in merging HWPoison.
Post by Wu Fengguang
       TestSetPageHWPoison(p);
                                  lock_page(page);
                                  try_to_unmap(page, TTU_MIGRATION|...);
       lock_page_nosync(p);
such small race windows can be found all over the kernel, it's just
insane to try to fix any of them.
I didn't know there are intentional small race windows in the kernel until you said so.
I thought kernel code was perfect, so it wouldn't allow a race window,
even a very small one. But you pointed out otherwise; until now, my thinking
was wrong.
Do you know of other small race windows left by intention?
If you know of any, please tell me. It can expand my sight. :)
The memory failure code does not aim to rescue 100% of page corruptions.
That's an unreasonable goal - the kernel pages and slab pages (including the
big dcache/icache) are almost impossible to isolate.
Compared to the big slab pools, the migration and other race windows are
really too small to care about :)
Also, if you mention this in a code comment, I will add my
review sign-off.
Good suggestion. Here is a patch for comment updates.
Thanks for kind reply for my boring discussion.
Boring? Not at all :)

Thanks,
Fengguang

---
mm/memory-failure.c | 76 +++++++++++++++++++++++++-----------------
1 file changed, 47 insertions(+), 29 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -1,4 +1,8 @@
/*
+ * linux/mm/memory-failure.c
+ *
+ * High level machine check handler.
+ *
* Copyright (C) 2008, 2009 Intel Corporation
* Authors: Andi Kleen, Fengguang Wu
*
@@ -6,29 +10,36 @@
* the GNU General Public License ("GPL") version 2 only as published by the
* Free Software Foundation.
*
- * High level machine check handler. Handles pages reported by the
- * hardware as being corrupted usually due to a 2bit ECC memory or cache
- * failure.
- *
- * This focuses on pages detected as corrupted in the background.
- * When the current CPU tries to consume corruption the currently
- * running process can just be killed directly instead. This implies
- * that if the error cannot be handled for some reason it's safe to
- * just ignore it because no corruption has been consumed yet. Instead
- * when that happens another machine check will happen.
- *
- * Handles page cache pages in various states. The tricky part
- * here is that we can access any page asynchronous to other VM
- * users, because memory failures could happen anytime and anywhere,
- * possibly violating some of their assumptions. This is why this code
- * has to be extremely careful. Generally it tries to use normal locking
- * rules, as in get the standard locks, even if that means the
- * error handling takes potentially a long time.
- *
- * The operation to map back from RMAP chains to processes has to walk
- * the complete process list and has non linear complexity with the number
- * mappings. In short it can be quite slow. But since memory corruptions
- * are rare we hope to get away with this.
+ * Pages are reported by the hardware as being corrupted usually due to a
+ * 2bit ECC memory or cache failure. Machine check can either be raised when
+ * corruption is found in background memory scrubbing, or when someone tries to
+ * consume the corruption. This code focuses on the former case. If it cannot
+ * handle the error for some reason it's safe to just ignore it because no
+ * corruption has been consumed yet. Instead when that happens another (deadly)
+ * machine check will happen.
+ *
+ * The tricky part here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere, possibly
+ * violating some of their assumptions. This is why this code has to be
+ * extremely careful. Generally it tries to use normal locking rules, as in get
+ * the standard locks, even if that means the error handling takes potentially
+ * a long time.
+ *
+ * We don't aim to rescue 100% corruptions. That's unreasonable goal - the
+ * kernel text and slab pages (including the big dcache/icache) are almost
+ * impossible to isolate. We also try to keep the code clean by ignoring the
+ * other thousands of small corruption windows.
+ *
+ * When the corrupted page data is not recoverable, the tasks mapped the page
+ * have to be killed. We offer two kill options:
+ * - early kill with SIGBUS.BUS_MCEERR_AO (optional)
+ * - late kill with SIGBUS.BUS_MCEERR_AR (mandatory)
+ * A task will be early killed as soon as corruption is found in its virtual
+ * address space, if it has called prctl(PR_MEMORY_FAILURE_EARLY_KILL, 1, ...);
+ * Any task will be late killed when it tries to access its corrupted virtual
+ * address. The early kill option offers KVM or other apps with large caches an
+ * opportunity to isolate the corrupted page from its internal cache, so as to
+ * avoid being late killed.
*/

/*
@@ -275,6 +286,12 @@ static void collect_procs_file(struct pa

vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
pgoff)
+ /*
+ * Send early kill signal to tasks whose vma covers
+ * the page but not necessarily mapped it in its pte.
+ * Applications who requested early kill normally want
+ * to be informed of such data corruptions.
+ */
if (vma->vm_mm == tsk->mm)
add_to_kill(tsk, page, vma, to_kill, tkc);
}
@@ -284,6 +301,12 @@ static void collect_procs_file(struct pa

/*
* Collect the processes who have the corrupted page mapped to kill.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non linear complexity with the number
+ * mappings. In short it can be quite slow. But since memory corruptions
+ * are rare and only tasks flagged PF_EARLY_KILL will be searched, we hope to
+ * get away with this.
*/
static void collect_procs(struct page *page, struct list_head *tokill)
{
@@ -439,7 +462,7 @@ static int me_pagecache_dirty(struct pag
* Dirty swap cache page is tricky to handle. The page could live both in page
* cache and swap cache(ie. page is freshly swapped in). So it could be
* referenced concurrently by 2 types of PTEs:
- * normal PTEs and swap PTEs. We try to handle them consistently by calling u
+ * normal PTEs and swap PTEs. We try to handle them consistently by calling
* try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
* and then
* - clear dirty bit to prevent IO
@@ -647,11 +670,6 @@ static void hwpoison_user_mappings(struc
* mapped. This has to be done before try_to_unmap,
* because ttu takes the rmap data structures down.
*
- * This also has the side effect to propagate the dirty
- * bit from PTEs into the struct page. This is needed
- * to actually decide if something needs to be killed
- * or errored, or if it's ok to just drop the page.
- *
* Error handling: We ignore errors here because
* there's nothing that can be done.
*/

Minchan Kim
2009-06-18 13:31:52 UTC
Permalink
Post by Wu Fengguang
It is private mail for my question.
I don't want to make noise in LKML.
And I don't want to disturb your progress to merge HWPoison.
Post by Wu Fengguang
       TestSetPageHWPoison(p);
                                  lock_page(page);
                                  try_to_unmap(page, TTU_MIGRATION|...);
       lock_page_nosync(p);
such small race windows can be found all over the kernel, it's just
insane to try to fix any of them.
I don't know there are intentional small race windows in kernel until you said.
I thought kernel code is perfect so it wouldn't allow race window
although it is very small. But you pointed out. Until now, My thought
is wrong.
Do you know else small race windows by intention ?
If you know it, tell me, please. It can expand my sight. :)
The memory failure code does not aim to rescue 100% page corruptions.
That's unreasonable goal - the kernel pages, slab pages (including the
big dcache/icache) are almost impossible to isolate.
Comparing to the big slab pools, the migration and other race windows are
really too small to care about :)
Also, If you will mention this contents as annotation, I will add my
review sign.
Good suggestion. Here is a patch for comment updates.
Thanks for kind reply for my boring discussion.
Boring? Not at all :)
Thanks,
Fengguang
---
 mm/memory-failure.c |   76 +++++++++++++++++++++++++-----------------
 1 file changed, 47 insertions(+), 29 deletions(-)
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -1,4 +1,8 @@
 /*
+ * linux/mm/memory-failure.c
+ *
+ * High level machine check handler.
+ *
 * Copyright (C) 2008, 2009 Intel Corporation
 * Authors: Andi Kleen, Fengguang Wu
 *
@@ -6,29 +10,36 @@
 * the GNU General Public License ("GPL") version 2 only as published by the
 * Free Software Foundation.
 *
- * High level machine check handler. Handles pages reported by the
- * hardware as being corrupted usually due to a 2bit ECC memory or cache
- * failure.
- *
- * This focuses on pages detected as corrupted in the background.
- * When the current CPU tries to consume corruption the currently
- * running process can just be killed directly instead. This implies
- * that if the error cannot be handled for some reason it's safe to
- * just ignore it because no corruption has been consumed yet. Instead
- * when that happens another machine check will happen.
- *
- * Handles page cache pages in various states. The tricky part
- * here is that we can access any page asynchronous to other VM
- * users, because memory failures could happen anytime and anywhere,
- * possibly violating some of their assumptions. This is why this code
- * has to be extremely careful. Generally it tries to use normal locking
- * rules, as in get the standard locks, even if that means the
- * error handling takes potentially a long time.
- *
- * The operation to map back from RMAP chains to processes has to walk
- * the complete process list and has non linear complexity with the number
- * mappings. In short it can be quite slow. But since memory corruptions
- * are rare we hope to get away with this.
+ * Pages are reported by the hardware as being corrupted usually due to a
+ * 2bit ECC memory or cache failure. Machine check can either be raised when
+ * corruption is found in background memory scrubbing, or when someone tries to
+ * consume the corruption. This code focuses on the former case.  If it cannot
+ * handle the error for some reason it's safe to just ignore it because no
+ * corruption has been consumed yet. Instead when that happens another (deadly)
+ * machine check will happen.
+ *
+ * The tricky part here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere, possibly
+ * violating some of their assumptions. This is why this code has to be
+ * extremely careful. Generally it tries to use normal locking rules, as in get
+ * the standard locks, even if that means the error handling takes potentially
+ * a long time.
+ *
+ * We don't aim to rescue 100% corruptions. That's unreasonable goal - the
+ * kernel text and slab pages (including the big dcache/icache) are almost
+ * impossible to isolate. We also try to keep the code clean by ignoring the
+ * other thousands of small corruption windows.
other thousands of small corruption windows (e.g. migration, ...)
Please write down the ones you know of.

Anyway, I already added my sign-off.
Thanks for your effort, which never gets exhausted. :)
--
Kind regards,
Minchan Kim

Wu Fengguang
2009-06-19 01:58:38 UTC
Permalink
Post by Minchan Kim
Post by Wu Fengguang
It is private mail for my question.
I don't want to make noise in LKML.
And I don't want to disturb your progress to merge HWPoison.
Post by Wu Fengguang
       TestSetPageHWPoison(p);
                                  lock_page(page);
                                  try_to_unmap(page, TTU_MIGRATION|...);
       lock_page_nosync(p);
such small race windows can be found all over the kernel, it's just
insane to try to fix any of them.
I don't know there are intentional small race windows in kernel until you said.
I thought kernel code is perfect so it wouldn't allow race window
although it is very small. But you pointed out. Until now, My thought
is wrong.
Do you know else small race windows by intention ?
If you know it, tell me, please. It can expand my sight. :)
The memory failure code does not aim to rescue 100% page corruptions.
That's unreasonable goal - the kernel pages, slab pages (including the
big dcache/icache) are almost impossible to isolate.
Comparing to the big slab pools, the migration and other race windows are
really too small to care about :)
Also, If you will mention this contents as annotation, I will add my
review sign.
Good suggestion. Here is a patch for comment updates.
Thanks for kind reply for my boring discussion.
Boring? Not at all :)
Thanks,
Fengguang
---
 mm/memory-failure.c |   76 +++++++++++++++++++++++++-----------------
 1 file changed, 47 insertions(+), 29 deletions(-)
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -1,4 +1,8 @@
 /*
+ * linux/mm/memory-failure.c
+ *
+ * High level machine check handler.
+ *
 * Copyright (C) 2008, 2009 Intel Corporation
 * Authors: Andi Kleen, Fengguang Wu
 *
@@ -6,29 +10,36 @@
 * the GNU General Public License ("GPL") version 2 only as published by the
 * Free Software Foundation.
 *
- * High level machine check handler. Handles pages reported by the
- * hardware as being corrupted usually due to a 2bit ECC memory or cache
- * failure.
- *
- * This focuses on pages detected as corrupted in the background.
- * When the current CPU tries to consume corruption the currently
- * running process can just be killed directly instead. This implies
- * that if the error cannot be handled for some reason it's safe to
- * just ignore it because no corruption has been consumed yet. Instead
- * when that happens another machine check will happen.
- *
- * Handles page cache pages in various states. The tricky part
- * here is that we can access any page asynchronous to other VM
- * users, because memory failures could happen anytime and anywhere,
- * possibly violating some of their assumptions. This is why this code
- * has to be extremely careful. Generally it tries to use normal locking
- * rules, as in get the standard locks, even if that means the
- * error handling takes potentially a long time.
- *
- * The operation to map back from RMAP chains to processes has to walk
- * the complete process list and has non linear complexity with the number
- * mappings. In short it can be quite slow. But since memory corruptions
- * are rare we hope to get away with this.
+ * Pages are reported by the hardware as being corrupted usually due to a
+ * 2bit ECC memory or cache failure. Machine check can either be raised when
+ * corruption is found in background memory scrubbing, or when someone tries to
+ * consume the corruption. This code focuses on the former case.  If it cannot
+ * handle the error for some reason it's safe to just ignore it because no
+ * corruption has been consumed yet. Instead when that happens another (deadly)
+ * machine check will happen.
+ *
+ * The tricky part here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere, possibly
+ * violating some of their assumptions. This is why this code has to be
+ * extremely careful. Generally it tries to use normal locking rules, as in get
+ * the standard locks, even if that means the error handling takes potentially
+ * a long time.
+ *
+ * We don't aim to rescue 100% corruptions. That's unreasonable goal - the
+ * kernel text and slab pages (including the big dcache/icache) are almost
+ * impossible to isolate. We also try to keep the code clean by ignoring the
+ * other thousands of small corruption windows.
other thousands of small corruption windows(ex, migration, ...)
As far as you know , please write down them.
Like this:

new_page = alloc_page();
<small corruption window>
write to new_page
<small corruption window>
read from new_page
Post by Minchan Kim
Anyway, I already added my sign.
Thanks for your effort never get exhausted. :)
You are welcome :)

Thanks,
Fengguang

Balbir Singh
2009-06-15 03:18:18 UTC
Permalink
Hi all,
Comments are warmly welcome on the newly introduced uevent code :)
I hope we can reach consensus in this round and then be able to post
a final version for .31 inclusion.
Isn't that too aggressive? .31 is already in the merge window.
--
Balbir

Wu Fengguang
2009-06-15 04:27:53 UTC
Permalink
Post by Balbir Singh
Hi all,
Comments are warmly welcome on the newly introduced uevent code :)
I hope we can reach consensus in this round and then be able to post
a final version for .31 inclusion.
Isn't that too aggressive? .31 is already in the merge window.
Yes, a bit aggressive. This is a new feature that involves complex logic.
However it is basically a no-op when there are no memory errors,
and when memory corruption does occur, it's better to (possibly) panic
in this code than to panic unconditionally in the absence of this
feature (as said by Rik).

So IMHO it's OK for .31 as long as we agree on the user interfaces,
ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.

It comes a long way through numerous reviews, and I believe all the
important issues and concerns have been addressed. Nick, Rik, Hugh,
Ingo, ... what are your opinions? Is the uevent good enough to meet
your request to "die hard" or "die gracefully" or whatever on memory
failure events?

Thanks,
Fengguang

Nick Piggin
2009-06-15 06:44:47 UTC
Permalink
Post by Wu Fengguang
Post by Balbir Singh
Hi all,
Comments are warmly welcome on the newly introduced uevent code :)
I hope we can reach consensus in this round and then be able to post
a final version for .31 inclusion.
Isn't that too aggressive? .31 is already in the merge window.
Yes, a bit aggressive. This is a new feature that involves complex logics.
However it is basically a no-op when there are no memory errors,
and when memory corruption does occur, it's better to (possibly) panic
in this code than to panic unconditionally in the absence of this
feature (as said by Rik).
So IMHO it's OK for .31 as long as we agree on the user interfaces,
ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
It comes a long way through numerous reviews, and I believe all the
important issues and concerns have been addressed. Nick, Rik, Hugh,
Ingo, ... what are your opinions? Is the uevent good enough to meet
your request to "die hard" or "die gracefully" or whatever on memory
failure events?
Uevent? As in, send a message to userspace? I don't think this
would be ideal for a fail-stop/failover situation.

I can't see a good reason to rush to merge it.

IMO the userspace-visible changes have maybe not been considered
too thoroughly, which is what I'd be most worried about. I probably
missed seeing documentation of exact semantics and situations
where admins should tune things one way or the other.

Did we verify with filesystem maintainers (eg. btrfs) that the
!ISREG test will be enough to prevent oopses?

I hope it is going to be merged with an easy-to-use fault injector,
because that is the only way Joe kernel developer is ever going to
test it.


Andi Kleen
2009-06-15 07:09:14 UTC
Permalink
Post by Nick Piggin
Post by Wu Fengguang
So IMHO it's OK for .31 as long as we agree on the user interfaces,
ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
It comes a long way through numerous reviews, and I believe all the
important issues and concerns have been addressed. Nick, Rik, Hugh,
Ingo, ... what are your opinions? Is the uevent good enough to meet
your request to "die hard" or "die gracefully" or whatever on memory
failure events?
Uevent? As in, send a message to userspace? I don't think this
would be ideal for a fail-stop/failover situation.
Agreed.

For failover you typically want an application level heartbeat anyway
to guard against user space software problems, and if there's a kill it
would catch it. Also, if you want to check against all corruptions you
have to do it in the low level handler, or better, watch corrected
events too to predict failures (but the latter is quite hard to do generally).
To some extent the first is already implemented on x86, e.g. setting
the tolerance level to 0 will give more aggressive panics.
Post by Nick Piggin
I can't see a good reason to rush to merge it.
The low level x86 code for MCA recovery is in; just this high level
part, which kills the correct process, is missing. I think it would be good to merge
the core now. The basic code is also as well tested as we can manage
right now, and exposing it to more users would be good. It's undoubtedly not
perfect yet, but that's not a requirement for merging.

There's a lot of fancy stuff that could be done in addition,
but that's not really needed right now and for a lot of the fancy
ideas (I have enough on my own :) it's dubious they are actually
worth it.
Post by Nick Piggin
IMO the userspace-visible changes have maybe not been considered
too thoroughly, which is what I'd be most worried about. I probably
missed seeing documentation of exact semantics and situations
where admins should tune things one way or the other.
There's only a single tunable anyway: early kill vs late kill.

For KVM you need early kill, for the others it remains to be seen.
Post by Nick Piggin
I hope it is going to be merged with an easy-to-use fault injector,
because that is the only way Joe kernel developer is ever going to
test it.
See patches 13 and 14. In addition there's another low level x86
injector too.

There's also a test suite available (mce-test on kernel.org git)

-Andi
--
***@linux.intel.com -- Speaking for myself only.

Nick Piggin
2009-06-15 07:19:07 UTC
Permalink
Post by Andi Kleen
Post by Nick Piggin
Post by Wu Fengguang
So IMHO it's OK for .31 as long as we agree on the user interfaces,
ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
It comes a long way through numerous reviews, and I believe all the
important issues and concerns have been addressed. Nick, Rik, Hugh,
Ingo, ... what are your opinions? Is the uevent good enough to meet
your request to "die hard" or "die gracefully" or whatever on memory
failure events?
Uevent? As in, send a message to userspace? I don't think this
would be ideal for a fail-stop/failover situation.
Agreed.
For failover you typically want a application level heartbeat anyways
to guard against user space software problems and if there's a kill then it
would catch it. Also again in you want to check against all corruptions you
have to do it in the low level handler or better watch corrected
events too to predict failures (but the later is quite hard to do generally).
To some extent the first is already implemented on x86, e.g. set
the tolerance level to 0 will give more aggressive panics.
Post by Nick Piggin
I can't see a good reason to rush to merge it.
The low level x86 code for MCA recovery is in, just this high level
part is missing to kill the correct process. I think it would be good to merge
a core now. The basic code seems to be also as well tested as we can do it
right now and exposing it to more users would be good. It's undoubtedly not
perfect yet, but that's not a requirement for merge.
There's a lot of fancy stuff that could be done in addition,
but that's not really needed right now and for a lot of the fancy
ideas (I have enough on my own :) it's dubious they are actually
worth it.
Just my opinion. Normally it takes a lot longer for VM patches
like this to go through, but it's not really up to me anyway.
If Andrew or Linus has it in their head to merge it in 2.6.31,
it's going to get merged ;)
Post by Andi Kleen
Post by Nick Piggin
IMO the userspace-visible changes have maybe not been considered
too thoroughly, which is what I'd be most worried about. I probably
missed seeing documentation of exact semantics and situations
where admins should tune things one way or the other.
There's only a single tunable anyways, early kill vs late kill.
For KVM you need early kill, for the others it remains to be seen.
Right. It's almost like you need to do a per-process thing, and
those that can handle things (such as the new SIGBUS or the new
EIO) could get those, and others could be killed.

Early-kill for KVM does seem like reasonable justification on the
surface, but when I think more about it, I wonder whether the guest
actually stands any better chance of correcting the error if it is
reported at time T rather than T+delta? (Who knows what the page
will be used for at any given time.)
Post by Andi Kleen
Post by Nick Piggin
I hope it is going to be merged with an easy-to-use fault injector,
because that is the only way Joe kernel developer is ever going to
test it.
See patches 13 and 14. In addition there's another low level x86
injector too.
There's also a test suite available (mce-test on kernel.org git)
OK, I probably saw some "FOR TESTING" things and wrongly assumed the
injector was not supposed to be merged.

Thanks,
Nick

Wu Fengguang
2009-06-15 12:10:01 UTC
Permalink
Post by Nick Piggin
Post by Andi Kleen
Post by Nick Piggin
Post by Wu Fengguang
So IMHO it's OK for .31 as long as we agree on the user interfaces,
ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
It comes a long way through numerous reviews, and I believe all the
important issues and concerns have been addressed. Nick, Rik, Hugh,
Ingo, ... what are your opinions? Is the uevent good enough to meet
your request to "die hard" or "die gracefully" or whatever on memory
failure events?
Uevent? As in, send a message to userspace? I don't think this
would be ideal for a fail-stop/failover situation.
Agreed.
For failover you typically want a application level heartbeat anyways
to guard against user space software problems and if there's a kill then it
would catch it. Also again in you want to check against all corruptions you
have to do it in the low level handler or better watch corrected
events too to predict failures (but the later is quite hard to do generally).
To some extent the first is already implemented on x86, e.g. set
the tolerance level to 0 will give more aggressive panics.
Post by Nick Piggin
I can't see a good reason to rush to merge it.
The low level x86 code for MCA recovery is in, just this high level
part is missing to kill the correct process. I think it would be good to merge
a core now. The basic code seems to be also as well tested as we can do it
right now and exposing it to more users would be good. It's undoubtedly not
perfect yet, but that's not a requirement for merge.
There's a lot of fancy stuff that could be done in addition,
but that's not really needed right now and for a lot of the fancy
ideas (I have enough on my own :) it's dubious they are actually
worth it.
Just my opinion. Normally it takes a lot longer for VM patches
like this to go through, but it's not really up to me anyway.
If Andrew or Linus has it in their head to merge it in 2.6.31,
it's going to get merged ;)
Post by Andi Kleen
Post by Nick Piggin
IMO the userspace-visible changes have maybe not been considered
too thoroughly, which is what I'd be most worried about. I probably
missed seeing documentation of exact semantics and situations
where admins should tune things one way or the other.
There's only a single tunable anyways, early kill vs late kill.
For KVM you need early kill, for the others it remains to be seen.
Right. It's almost like you need to do a per-process thing, and
those that can handle things (such as the new SIGBUS or the new
EIO) could get those, and others could be killed.
To send early SIGBUS kills to processes that have called
sigaction(SIGBUS, ...)? KVM will surely do that. For other apps we
don't mind whether they can understand that signal at all.
Post by Nick Piggin
Early-kill for KVM does seem like reasonable justification on the
surface, but when I think more about it, I wonder does the guest
actually stand any better chance to correct the error if it is
reported at time T rather than T+delta? (who knows what the page
will be used for at any given time).
Early kill makes a lot of difference for KVM. Think about the vast
amount of clean page cache pages. With early kill the page can be
trivially isolated. With late kill the whole virtual machine dies
hard.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: ***@kvack.org
Nick Piggin
2009-06-15 12:25:28 UTC
Permalink
Post by Wu Fengguang
Post by Nick Piggin
Post by Andi Kleen
For KVM you need early kill, for the others it remains to be seen.
Right. It's almost like you need to do a per-process thing, and
those that can handle things (such as the new SIGBUS or the new
EIO) could get those, and others could be killed.
To send early SIGBUS kills to processes who has called
sigaction(SIGBUS, ...)? KVM will sure do that. For other apps we
don't mind they can understand that signal at all.
For apps that hook into SIGBUS for some other means and
do not understand the new type of SIGBUS signal? What about
those?
Post by Wu Fengguang
Post by Nick Piggin
Early-kill for KVM does seem like reasonable justification on the
surface, but when I think more about it, I wonder does the guest
actually stand any better chance to correct the error if it is
reported at time T rather than T+delta? (who knows what the page
will be used for at any given time).
Early kill makes a lot of difference for KVM. Think about the vast
amount of clean page cache pages. With early kill the page can be
trivially isolated. With late kill the whole virtual machine dies
hard.
Why? In both cases it will enter the exception handler and
attempt to do something about it... in both cases I would
have thought there is some chance that the page error is not
recoverable and some chance it is recoverable. Or am I
missing something?

Anyway, I would like to see a basic analysis of those probabilities
to justify early kill. Not saying there is no justification, but
it would be helpful to see why.

Thanks,
Nick


Wu Fengguang
2009-06-15 14:22:25 UTC
Permalink
Post by Nick Piggin
Post by Wu Fengguang
Post by Nick Piggin
Post by Andi Kleen
For KVM you need early kill, for the others it remains to be seen.
Right. It's almost like you need to do a per-process thing, and
those that can handle things (such as the new SIGBUS or the new
EIO) could get those, and others could be killed.
To send early SIGBUS kills to processes who has called
sigaction(SIGBUS, ...)? KVM will sure do that. For other apps we
don't mind they can understand that signal at all.
For apps that hook into SIGBUS for some other means and
Yes I was referring to the sigaction(SIGBUS) apps, others will
be late killed anyway.
Post by Nick Piggin
do not understand the new type of SIGBUS signal? What about
those?
We introduced two new SIGBUS codes:
BUS_MCEERR_AO=5 for early kill
BUS_MCEERR_AR=4 for late kill
I'd assume a legacy application will handle them in the same way (both
are unexpected codes to the application).

We don't care whether the application can be killed by BUS_MCEERR_AO
or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
But (in the rare case) if the handler
- refuses to die on BUS_MCEERR_AR, it may create a busy loop and a
flood of SIGBUS signals, which is a bug in the application.
BUS_MCEERR_AO is one-time and won't lead to busy loops.
- does something that hurts itself (i.e. data safety) on BUS_MCEERR_AO,
it may well hurt the same way on BUS_MCEERR_AR. The latter is
unavoidable, so the application must be fixed anyway.
Post by Nick Piggin
Post by Wu Fengguang
Post by Nick Piggin
Early-kill for KVM does seem like reasonable justification on the
surface, but when I think more about it, I wonder does the guest
actually stand any better chance to correct the error if it is
reported at time T rather than T+delta? (who knows what the page
will be used for at any given time).
Early kill makes a lot of difference for KVM. Think about the vast
amount of clean page cache pages. With early kill the page can be
trivially isolated. With late kill the whole virtual machine dies
hard.
Why? In both cases it will enter the exception handler and
attempt to do something about it... in both cases I would
have thought there is some chance that the page error is not
recoverable and some chance it is recoverable. Or am I
missing something?
The early kill / late kill to KVM from the POV of host kernel matches
the MCE AO/AR events inside the KVM guest kernel. The key difference
between AO/AR is, whether the page is _being_ consumed.

It's a lot harder (if possible at all) to stop an active consumer.
For example, the clean cache pages can be consumed in many ways:
- be accessed by read()/write() or mapped read/write
- be reclaimed and then allocated for whatever new usage, for example,
be zeroed by __GFP_ZERO, or be inserted into another file and start
read/write IO and be accessed by the disk driver via DMA, or even be
allocated for kernel slabs.
Frankly speaking I don't know how to stop all the above consumers.
We now simply die on AR events.
Post by Nick Piggin
Anyway, I would like to see a basic analysis of those probabilities
to justify early kill. Not saying there is no justification, but
it would be helpful to see why.
That's fine. I'd be glad if the above explanation paves the way to
solutions for AR events :)

Thanks,
Fengguang

Wu Fengguang
2009-06-17 06:37:02 UTC
Permalink
Post by Wu Fengguang
Post by Nick Piggin
Post by Wu Fengguang
Post by Nick Piggin
Post by Andi Kleen
For KVM you need early kill, for the others it remains to be seen.
Right. It's almost like you need to do a per-process thing, and
those that can handle things (such as the new SIGBUS or the new
EIO) could get those, and others could be killed.
To send early SIGBUS kills to processes who has called
sigaction(SIGBUS, ...)? KVM will sure do that. For other apps we
don't mind they can understand that signal at all.
For apps that hook into SIGBUS for some other means and
Yes I was referring to the sigaction(SIGBUS) apps, others will
be late killed anyway.
Post by Nick Piggin
do not understand the new type of SIGBUS signal? What about
those?
BUS_MCEERR_AO=5 for early kill
BUS_MCEERR_AR=4 for late kill
I'd assume a legacy application will handle them in the same way (both
are unexpected code to the application).
We don't care whether the application can be killed by BUS_MCEERR_AO
or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
But (in the rare case) if the handler
- refused to die on BUS_MCEERR_AR, it may create a busy loop and
flooding of SIGBUS signals, which is a bug of the application.
BUS_MCEERR_AO is one time and won't lead to busy loops.
- does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
it may well hurt the same way on BUS_MCEERR_AR. The latter one is
unavoidable, so the application must be fixed anyway.
This patch materializes the automatic early kill idea.
It aims to remove the vm.memory_failure_early_kill sysctl parameter.

This is mainly a policy change, please comment.

Thanks,
Fengguang

---
HWPOISON: only early kill processes who installed SIGBUS handler

We want to send SIGBUS.BUS_MCEERR_AO signals to KVM ASAP, so that
it is able to take actions to isolate the corrupted page. In fact,
any application that does extensive internal caching (KVM, Oracle,
etc.) is advised to install a SIGBUS handler to get early notification
of corrupted memory, so that it has a good chance to find and remove
the page from its cache. If it doesn't, it will later receive the
SIGBUS.BUS_MCEERR_AR signal on accessing the corrupted memory, which
can be deadly (too hard to rescue).

For applications that don't care about the signal, let them continue to run
until they try to consume the corrupted data.

For applications that catch SIGBUS but don't understand
the new BUS_MCEERR_AO/BUS_MCEERR_AR codes, they may
- refuse to die on BUS_MCEERR_AR, creating a busy loop and a
flood of SIGBUS signals, which is a bug in the application.
BUS_MCEERR_AO is a one-shot event and won't lead to busy loops.
- do something that hurts itself (i.e. data safety) on BUS_MCEERR_AO;
it may well hurt the same way on BUS_MCEERR_AR. The latter is
unavoidable, so the application must be fixed anyway.


CC: Nick Piggin <***@suse.de>
Signed-off-by: Wu Fengguang <***@intel.com>
---
mm/memory-failure.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -205,6 +205,20 @@ static void kill_procs_ao(struct list_he
}
}

+static bool task_early_kill_elegible(struct task_struct *tsk)
+{
+ __sighandler_t handler;
+
+ if (!tsk->mm)
+ return false;
+
+ handler = tsk->sighand->action[SIGBUS-1].sa.sa_handler;
+ if (handler == SIG_DFL || handler == SIG_IGN)
+ return false;
+
+ return true;
+}
+
/*
* Collect processes when the error hit an anonymous page.
*/
@@ -222,7 +236,7 @@ static void collect_procs_anon(struct pa
goto out;

for_each_process (tsk) {
- if (!tsk->mm)
+ if (!task_early_kill_elegible(tsk))
continue;
list_for_each_entry (vma, &av->head, anon_vma_node) {
if (!page_mapped_in_vma(page, vma))
@@ -262,7 +276,7 @@ static void collect_procs_file(struct pa
for_each_process(tsk) {
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

- if (!tsk->mm)
+ if (!task_early_kill_elegible(tsk))
continue;

vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,

Nick Piggin
2009-06-17 08:04:04 UTC
Permalink
Post by Wu Fengguang
Post by Wu Fengguang
Post by Nick Piggin
Post by Wu Fengguang
Post by Nick Piggin
Post by Andi Kleen
For KVM you need early kill, for the others it remains to be seen.
Right. It's almost like you need to do a per-process thing, and
those that can handle things (such as the new SIGBUS or the new
EIO) could get those, and others could be killed.
To send early SIGBUS kills to processes who has called
sigaction(SIGBUS, ...)? KVM will sure do that. For other apps we
don't mind they can understand that signal at all.
For apps that hook into SIGBUS for some other means and
Yes I was referring to the sigaction(SIGBUS) apps, others will
be late killed anyway.
Post by Nick Piggin
do not understand the new type of SIGBUS signal? What about
those?
BUS_MCEERR_AO=5 for early kill
BUS_MCEERR_AR=4 for late kill
I'd assume a legacy application will handle them in the same way (both
are unexpected code to the application).
We don't care whether the application can be killed by BUS_MCEERR_AO
or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
But (in the rare case) if the handler
- refused to die on BUS_MCEERR_AR, it may create a busy loop and
flooding of SIGBUS signals, which is a bug of the application.
BUS_MCEERR_AO is one time and won't lead to busy loops.
- does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
it may well hurt the same way on BUS_MCEERR_AR. The latter one is
unavoidable, so the application must be fixed anyway.
This patch materializes the automatic early kill idea.
It aims to remove the vm.memory_failure_early_kill sysctl parameter.
This is mainly a policy change, please comment.
Well then you can still early-kill random apps that did not
want it, and you may still cause problems if its sigbus
handler does something nontrivial.

Can you use a prctl or something so it can explicitly
register interest in this?

Wu Fengguang
2009-06-17 09:55:32 UTC
Permalink
Post by Nick Piggin
Post by Wu Fengguang
Post by Wu Fengguang
Post by Nick Piggin
Post by Wu Fengguang
Post by Nick Piggin
Post by Andi Kleen
For KVM you need early kill, for the others it remains to be seen.
Right. It's almost like you need to do a per-process thing, and
those that can handle things (such as the new SIGBUS or the new
EIO) could get those, and others could be killed.
To send early SIGBUS kills to processes who has called
sigaction(SIGBUS, ...)? KVM will sure do that. For other apps we
don't mind they can understand that signal at all.
For apps that hook into SIGBUS for some other means and
Yes I was referring to the sigaction(SIGBUS) apps, others will
be late killed anyway.
Post by Nick Piggin
do not understand the new type of SIGBUS signal? What about
those?
BUS_MCEERR_AO=5 for early kill
BUS_MCEERR_AR=4 for late kill
I'd assume a legacy application will handle them in the same way (both
are unexpected code to the application).
We don't care whether the application can be killed by BUS_MCEERR_AO
or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
But (in the rare case) if the handler
- refused to die on BUS_MCEERR_AR, it may create a busy loop and
flooding of SIGBUS signals, which is a bug of the application.
BUS_MCEERR_AO is one time and won't lead to busy loops.
- does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
it may well hurt the same way on BUS_MCEERR_AR. The latter one is
unavoidable, so the application must be fixed anyway.
This patch materializes the automatic early kill idea.
It aims to remove the vm.memory_failure_early_kill sysctl parameter.
This is mainly a policy change, please comment.
Well then you can still early-kill random apps that did not
want it, and you may still cause problems if its sigbus
handler does something nontrivial.
Can you use a prctl or something so it can explicitly
register interest in this?
No I don't think prctl would be much better.

- if an application wants early/late kill, it can do so with a properly
written SIGBUS handler: the prctl call is redundant.
- if an admin wants to control early/late kill for an unmodified app,
prctl is as unhelpful as this patch(*).
- prctl can indeed help legacy apps whose SIGBUS handler has trouble
with the new SIGBUS codes, however such applications should be rare
and should be fixed (why would an app do something wrong with a
newly introduced code at all? Shall we stop introducing new codes
just because some random buggy app cannot handle them?)

So I still prefer this patch, until we come up with some solution that
allows both app and admin to change the setting.

(*) Or if prctl options can be inherited across exec(), we may write a
wrapper tool to set early kill preference for some legacy app:

# this_app_can_be_early_killed some_legacy_app --legacy-options

Thanks,
Fengguang

Nick Piggin
2009-06-17 10:00:06 UTC
Permalink
Post by Wu Fengguang
Post by Nick Piggin
Well then you can still early-kill random apps that did not
want it, and you may still cause problems if its sigbus
handler does something nontrivial.
Can you use a prctl or something so it can expclitly
register interest in this?
No I don't think prctl would be much better.
- if an application want early/late kill, it can do so with a proper
written SIGBUS handler: the prctl call is redundant.
s/proper written/is switched to new semantics based on the existence
of a/
Post by Wu Fengguang
- if an admin want to control early/late kill for an unmodified app,
prctl is as unhelpful as this patch(*).
Clearly you can execute a process with a given prctl.
Post by Wu Fengguang
- prctl can indeed help legacy apps whose SIGBUS handler has trouble
with the new SIGBUS codes, however such application should be rare
and the application should be fixed(why shall it do something wrong
on newly introduced code at all? Shall we stop introducing new codes
just because some random buggy app cannot handle new codes?)
Backwards compatibility? Kind of important.
Post by Wu Fengguang
So I still prefer this patch, until we come up with some solution that
allows both app and admin to change the setting.
Not only does it allow that, but it also provides backwards
compatibility. Your patch does not allow admin to change
anything nor does it guarantee 100% back compat so I can't
see how you think it is better.

Also it does not allow for an app with a SIGBUS handler to
use late kill. If late kill is useful to anyone, why would
it not be useful to some app with a SIGBUS handler (that is
not KVM)?

Wu Fengguang
2009-06-17 11:56:34 UTC
Permalink
Post by Nick Piggin
Post by Wu Fengguang
Post by Nick Piggin
Well then you can still early-kill random apps that did not
want it, and you may still cause problems if its sigbus
handler does something nontrivial.
Can you use a prctl or something so it can expclitly
register interest in this?
No I don't think prctl would be much better.
- if an application want early/late kill, it can do so with a proper
written SIGBUS handler: the prctl call is redundant.
s/proper written/is switched to new semantics based on the existence
of a/
Not necessarily so. If an application
- does not have a SIGBUS handler, and wants to be
  - early killed: it must install a handler; this is not a big problem
    because it may well want to rescue something on the event.
  - late killed: just do nothing.
  (here kill = 'notification')
- has a SIGBUS handler, and wants to
  - die early: call exit(0) in the handler.
  - die late: intercept and ignore the signal.
So if source code modification is viable, prctl is not necessary at all.
Post by Nick Piggin
Post by Wu Fengguang
- if an admin want to control early/late kill for an unmodified app,
prctl is as unhelpful as this patch(*).
Clearly you can execute a process with a given prctl.
OK, right.
Post by Nick Piggin
Post by Wu Fengguang
- prctl can indeed help legacy apps whose SIGBUS handler has trouble
with the new SIGBUS codes, however such application should be rare
and the application should be fixed(why shall it do something wrong
on newly introduced code at all? Shall we stop introducing new codes
just because some random buggy app cannot handle new codes?)
Backwards compatibility? Kind of important.
Maybe.
Post by Nick Piggin
Post by Wu Fengguang
So I still prefer this patch, until we come up with some solution that
allows both app and admin to change the setting.
Not only does it allow that, but it also provides backwards
compatibility. Your patch does not allow admin to change
anything nor does it guarantee 100% back compat so I can't
see how you think it is better.
I didn't say it is better, but clearly meant that prctl is not enough
better to warrant a new user interface, if(!adm_friendly). Now it's
obvious that adm_friendly=1, so I agree prctl is a good interface :)
Post by Nick Piggin
Also it does not allow for an app with a SIGBUS handler to
use late kill. If late kill is useful to anyone, why would
it not be useful to some app with a SIGBUS handler (that is
not KVM)?
Late kill will always be sent. Ignoring the early kill signal in the
SIGBUS handler does the trick (see the analysis above).

Thanks,
Fengguang

Wu Fengguang
2009-06-18 09:56:44 UTC
Permalink
Post by Nick Piggin
Post by Wu Fengguang
Post by Wu Fengguang
Post by Nick Piggin
Post by Wu Fengguang
Post by Nick Piggin
Post by Andi Kleen
For KVM you need early kill, for the others it remains to be seen.
Right. It's almost like you need to do a per-process thing, and
those that can handle things (such as the new SIGBUS or the new
EIO) could get those, and others could be killed.
To send early SIGBUS kills to processes who has called
sigaction(SIGBUS, ...)? KVM will sure do that. For other apps we
don't mind they can understand that signal at all.
For apps that hook into SIGBUS for some other means and
Yes I was referring to the sigaction(SIGBUS) apps, others will
be late killed anyway.
Post by Nick Piggin
do not understand the new type of SIGBUS signal? What about
those?
BUS_MCEERR_AO=5 for early kill
BUS_MCEERR_AR=4 for late kill
I'd assume a legacy application will handle them in the same way (both
are unexpected code to the application).
We don't care whether the application can be killed by BUS_MCEERR_AO
or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
But (in the rare case) if the handler
- refused to die on BUS_MCEERR_AR, it may create a busy loop and
flooding of SIGBUS signals, which is a bug of the application.
BUS_MCEERR_AO is one time and won't lead to busy loops.
- does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
it may well hurt the same way on BUS_MCEERR_AR. The latter one is
unavoidable, so the application must be fixed anyway.
This patch materializes the automatic early kill idea.
It aims to remove the vm.memory_failure_early_kill sysctl parameter.
This is mainly a policy change, please comment.
Well then you can still early-kill random apps that did not
want it, and you may still cause problems if its sigbus
handler does something nontrivial.
Can you use a prctl or something so it can explicitly
register interest in this?
OK, this patch allows one to request early kill by calling
prctl(PR_MEMORY_FAILURE_EARLY_KILL, 1, ...).

Now either app or admin can choose to enable/disable early kill on
a per-process basis. But still, an admin won't be able to change the
behavior of an application that calls prctl() to set the option itself.

Thanks,
Fengguang

---
include/linux/prctl.h | 6 ++++++
include/linux/sched.h | 1 +
kernel/sys.c | 6 ++++++
mm/memory-failure.c | 12 ++++++++++--
4 files changed, 23 insertions(+), 2 deletions(-)

--- sound-2.6.orig/include/linux/prctl.h
+++ sound-2.6/include/linux/prctl.h
@@ -88,4 +88,10 @@
#define PR_TASK_PERF_COUNTERS_DISABLE 31
#define PR_TASK_PERF_COUNTERS_ENABLE 32

+/*
+ * Send early SIGBUS.BUS_MCEERR_AO notification on memory corruption?
+ * Useful for KVM and mission critical apps.
+ */
+#define PR_MEMORY_FAILURE_EARLY_KILL 33
+
#endif /* _LINUX_PRCTL_H */
--- sound-2.6.orig/include/linux/sched.h
+++ sound-2.6/include/linux/sched.h
@@ -1666,6 +1666,7 @@ extern cputime_t task_gtime(struct task_
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_FLUSHER 0x00001000 /* responsible for disk writeback */
#define PF_USED_MATH 0x00002000 /* if unset the fpu must be initialized before use */
+#define PF_EARLY_KILL 0x00004000 /* early kill me on memory failure */
#define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */
#define PF_FROZEN 0x00010000 /* frozen for system suspend */
#define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
--- sound-2.6.orig/kernel/sys.c
+++ sound-2.6/kernel/sys.c
@@ -1545,6 +1545,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
current->timer_slack_ns = arg2;
error = 0;
break;
+ case PR_MEMORY_FAILURE_EARLY_KILL:
+ if (arg2)
+ me->flags |= PF_EARLY_KILL;
+ else
+ me->flags &= ~PF_EARLY_KILL;
+ break;
default:
error = -EINVAL;
break;
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -205,6 +205,14 @@ static void kill_procs_ao(struct list_he
}
}

+static bool task_early_kill_elegible(struct task_struct *tsk)
+{
+ if (!tsk->mm)
+ return false;
+
+ return tsk->flags & PF_EARLY_KILL;
+}
+
/*
* Collect processes when the error hit an anonymous page.
*/
@@ -222,7 +230,7 @@ static void collect_procs_anon(struct pa
goto out;

for_each_process (tsk) {
- if (!tsk->mm)
+ if (!task_early_kill_elegible(tsk))
continue;
list_for_each_entry (vma, &av->head, anon_vma_node) {
if (!page_mapped_in_vma(page, vma))
@@ -262,7 +270,7 @@ static void collect_procs_file(struct pa
for_each_process(tsk) {
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

- if (!tsk->mm)
+ if (!task_early_kill_elegible(tsk))
continue;

vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,

Nick Piggin
2009-06-15 08:14:53 UTC
Permalink
Post by Nick Piggin
Did we verify with filesystem maintainers (eg. btrfs) that the
!ISREG test will be enough to prevent oopses?
BTW. this is quite a significant change I think and not
really documented well enough. Previously a filesystem
will know exactly when and why pagecache in a mapping
under its control will be truncated (as opposed to
invalidated).

They even have opportunity to hold locks such as i_mutex.

And depending on what they do, they could do interesting
things even with ISREG files.

So, I really think this needs review by filesystem
maintainers and it would be far safer to use invalidate
until it is known to be safe.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Wu Fengguang
2009-06-15 10:09:54 UTC
Permalink
Post by Nick Piggin
Post by Nick Piggin
Did we verify with filesystem maintainers (eg. btrfs) that the
!ISREG test will be enough to prevent oopses?
BTW. this is quite a significant change I think and not
really documented well enough. Previously a filesystem
will know exactly when and why pagecache in a mapping
under its control will be truncated (as opposed to
invalidated).
They even have opportunity to hold locks such as i_mutex.
And depending on what they do, they could do interesting
things even with ISREG files.
So, I really think this needs review by filesystem
maintainers and it would be far safer to use invalidate
until it is known to be safe.
Nick, we are doing invalidate_complete_page() for !S_ISREG inodes now.
Do you mean to do invalidate_complete_page() for all inodes for now?
That's a good suggestion, it should be able to do the job for most
pages indeed.

To make things look clear, here is the complete memory-failure.c
without the uevent bits.

Thanks,
Fengguang
---

/*
* Copyright (C) 2008, 2009 Intel Corporation
* Authors: Andi Kleen, Fengguang Wu
*
* This software may be redistributed and/or modified under the terms of
* the GNU General Public License ("GPL") version 2 only as published by the
* Free Software Foundation.
*
* High level machine check handler. Handles pages reported by the
* hardware as being corrupted usually due to a 2bit ECC memory or cache
* failure.
*
* This focuses on pages detected as corrupted in the background.
* When the current CPU tries to consume corruption the currently
* running process can just be killed directly instead. This implies
* that if the error cannot be handled for some reason it's safe to
* just ignore it because no corruption has been consumed yet. Instead
* when that happens another machine check will happen.
*
* Handles page cache pages in various states. The tricky part
* here is that we can access any page asynchronously to other VM
* users, because memory failures could happen anytime and anywhere,
* possibly violating some of their assumptions. This is why this code
* has to be extremely careful. Generally it tries to use normal locking
* rules, as in get the standard locks, even if that means the
* error handling takes potentially a long time.
*
* The operation to map back from RMAP chains to processes has to walk
* the complete process list and has non-linear complexity in the number
* of mappings. In short it can be quite slow. But since memory corruptions
* are rare we hope to get away with this.
*/

/*
* Notebook:
* - hugetlb needs more code
* - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
* - pass bad pages to kdump next kernel
*/
#define DEBUG 1
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/sched.h>
#include <linux/rmap.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/backing-dev.h>
#include "internal.h"

int sysctl_memory_failure_early_kill __read_mostly = 1;

atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);

/*
* Send all the processes who have the page mapped an ``action optional''
* signal.
*/
static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
unsigned long pfn)
{
struct siginfo si;
int ret;

printk(KERN_ERR
"MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
pfn, t->comm, t->pid);
si.si_signo = SIGBUS;
si.si_errno = 0;
si.si_code = BUS_MCEERR_AO;
si.si_addr = (void *)addr;
#ifdef __ARCH_SI_TRAPNO
si.si_trapno = trapno;
#endif
si.si_addr_lsb = PAGE_SHIFT;
/*
* Don't use force here, it's convenient if the signal
* can be temporarily blocked.
* This could cause a loop when the user sets SIGBUS
* to SIG_IGN, but hopefully no one will do that?
*/
ret = send_sig_info(SIGBUS, &si, t); /* synchronous? */
if (ret < 0)
printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
t->comm, t->pid, ret);
return ret;
}

/*
* Kill all processes that have a poisoned page mapped and then isolate
* the page.
*
* General strategy:
* Find all processes having the page mapped and kill them.
* But we keep a page reference around so that the page is not
* actually freed yet.
* Then stash the page away
*
* There's no convenient way to get back to mapped processes
* from the VMAs. So do a brute-force search over all
* running processes.
*
* Remember that machine checks are not common (or rather
* if they are common you have other problems), so this shouldn't
* be a performance issue.
*
* Also there are some races possible while we get from the
* error detection to actually handle it.
*/

struct to_kill {
struct list_head nd;
struct task_struct *tsk;
unsigned long addr;
unsigned addr_valid:1;
};

/*
* Failure handling: if we can't find or can't kill a process there's
* not much we can do. We just print a message and ignore otherwise.
*/

/*
* Schedule a process for later kill.
*/
static void add_to_kill(struct task_struct *tsk, struct page *p,
struct vm_area_struct *vma,
struct list_head *to_kill,
struct to_kill **tkc)
{
struct to_kill *tk;

if (*tkc) {
tk = *tkc;
*tkc = NULL;
} else {
tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
if (!tk) {
printk(KERN_ERR
"MCE: Out of memory while machine check handling\n");
return;
}
}
tk->addr = page_address_in_vma(p, vma);
tk->addr_valid = 1;

/*
* In theory we don't have to kill when the page was
* munmapped. But it could also be a mremap. Since that's
* likely very rare kill anyways just out of paranoia, but use
* a SIGKILL because the error is not contained anymore.
*/
if (tk->addr == -EFAULT) {
pr_debug("MCE: Unable to find user space address %lx in %s\n",
page_to_pfn(p), tsk->comm);
tk->addr_valid = 0;
}
get_task_struct(tsk);
tk->tsk = tsk;
list_add_tail(&tk->nd, to_kill);
}

/*
* Kill the processes that have been collected earlier.
*
* Only do anything when DOIT is set, otherwise just free the list
* (this is used for clean pages which do not need killing)
* Also when FAIL is set do a force kill because something went
* wrong earlier.
*/
static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
int fail, unsigned long pfn)
{
struct to_kill *tk, *next;

list_for_each_entry_safe (tk, next, to_kill, nd) {
if (doit) {
/*
* In case something went wrong with munmapping,
* make sure the process doesn't catch the
* signal and then access the memory. Just kill it.
*/
if (fail || tk->addr_valid == 0) {
printk(KERN_ERR
"MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
pfn, tk->tsk->comm, tk->tsk->pid);
force_sig(SIGKILL, tk->tsk);
}

/*
* In theory the process could have mapped
* something else on the address in-between. We could
* check for that, but we need to tell the
* process anyways.
*/
else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
pfn) < 0)
printk(KERN_ERR
"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
pfn, tk->tsk->comm, tk->tsk->pid);
}
put_task_struct(tk->tsk);
kfree(tk);
}
}

/*
* Collect processes when the error hit an anonymous page.
*/
static void collect_procs_anon(struct page *page, struct list_head *to_kill,
struct to_kill **tkc)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
struct anon_vma *av;

read_lock(&tasklist_lock);

av = page_lock_anon_vma(page);
if (av == NULL) /* Not actually mapped anymore */
goto out;

for_each_process (tsk) {
if (!tsk->mm)
continue;
list_for_each_entry (vma, &av->head, anon_vma_node) {
if (!page_mapped_in_vma(page, vma))
continue;

if (vma->vm_mm == tsk->mm)
add_to_kill(tsk, page, vma, to_kill, tkc);
}
}
page_unlock_anon_vma(av);
out:
read_unlock(&tasklist_lock);
}

/*
* Collect processes when the error hit a file mapped page.
*/
static void collect_procs_file(struct page *page, struct list_head *to_kill,
struct to_kill **tkc)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
struct prio_tree_iter iter;
struct address_space *mapping = page->mapping;

/*
* A note on the locking order between the two locks.
* We don't rely on this particular order.
* If you have some other code that needs a different order
* feel free to switch them around. Or add a reverse link
* from mm_struct to task_struct, then this could be all
* done without taking tasklist_lock and looping over all tasks.
*/

read_lock(&tasklist_lock);
spin_lock(&mapping->i_mmap_lock);
for_each_process(tsk) {
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

if (!tsk->mm)
continue;

vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
pgoff)
if (vma->vm_mm == tsk->mm)
add_to_kill(tsk, page, vma, to_kill, tkc);
}
spin_unlock(&mapping->i_mmap_lock);
read_unlock(&tasklist_lock);
}

/*
* Collect the processes who have the corrupted page mapped to kill.
*/
static void collect_procs(struct page *page, struct list_head *tokill)
{
struct to_kill *tk;

/*
* First preallocate one to_kill structure outside the spin locks,
* so that we can kill at least one process reasonably reliably.
*/
tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);

if (PageAnon(page))
collect_procs_anon(page, tokill, &tk);
else
collect_procs_file(page, tokill, &tk);
kfree(tk);
}

/*
* Error handlers for various types of pages.
*/

enum outcome {
FAILED, /* Error handling failed */
DELAYED, /* Will be handled later */
IGNORED, /* Error safely ignored */
RECOVERED, /* Successfully recovered */
};

static const char *action_name[] = {
[FAILED] = "Failed",
[DELAYED] = "Delayed",
[IGNORED] = "Ignored",
[RECOVERED] = "Recovered",
};

/*
* Error hit kernel page.
* Do nothing; try to get lucky by not touching it. For a few cases we
* could be more sophisticated.
*/
static int me_kernel(struct page *p, unsigned long pfn)
{
return DELAYED;
}

/*
* Already poisoned page.
*/
static int me_ignore(struct page *p, unsigned long pfn)
{
return IGNORED;
}

/*
* Page in unknown state. Do nothing.
*/
static int me_unknown(struct page *p, unsigned long pfn)
{
printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn);
return FAILED;
}

/*
* Free memory
*/
static int me_free(struct page *p, unsigned long pfn)
{
return DELAYED;
}

/*
* Clean (or cleaned) page cache page.
*/
static int me_pagecache_clean(struct page *p, unsigned long pfn)
{
struct address_space *mapping;

if (!isolate_lru_page(p))
page_cache_release(p);

mapping = page_mapping(p);
if (mapping == NULL)
return RECOVERED;

/*
* Now truncate the page in the page cache. This is really
* more like a "temporary hole punch".
* Don't do this for block devices when someone else
* has a reference, because it could be file system metadata
* and that's not safe to truncate.
*/
if (!S_ISREG(mapping->host->i_mode) &&
!invalidate_complete_page(mapping, p)) {
printk(KERN_ERR
"MCE %#lx: failed to invalidate metadata page\n",
pfn);
return FAILED;
}

truncate_inode_page(mapping, p);
if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
printk(KERN_ERR "MCE %#lx: failed to release buffers\n",
pfn);
return FAILED;
}
return RECOVERED;
}

/*
* Dirty pagecache page.
* Issues: when the error hits a hole page, the error is not properly
* propagated.
*/
static int me_pagecache_dirty(struct page *p, unsigned long pfn)
{
struct address_space *mapping = page_mapping(p);

SetPageError(p);
/* TBD: print more information about the file. */
if (mapping) {
/*
* IO error will be reported by write(), fsync(), etc.
* who check the mapping.
* This way the application knows that something went
* wrong with its dirty file data.
*
* There's one open issue:
*
* The EIO will be only reported on the next IO
* operation and then cleared through the IO map.
* Normally Linux has two mechanisms to pass IO error
* first through the AS_EIO flag in the address space
* and then through the PageError flag in the page.
* Since we drop pages on memory failure handling the
* only mechanism open to use is through AS_EIO.
*
* This has the disadvantage that it gets cleared on
* the first operation that returns an error, while
* the PageError bit is more sticky and only cleared
* when the page is reread or dropped. If an
* application assumes it will always get error on
* fsync, but does other operations on the fd before
* and the page is dropped in between, then the error
* will not be properly reported.
*
* This can already happen even without hwpoisoned
* pages: first on metadata IO errors (which only
* report through AS_EIO) or when the page is dropped
* at the wrong time.
*
* So right now we assume that the application DTRT on
* the first EIO, but we're not worse than other parts
* of the kernel.
*/
mapping_set_error(mapping, EIO);
}

return me_pagecache_clean(p, pfn);
}

/*
* Clean and dirty swap cache.
*
* Dirty swap cache page is tricky to handle. The page could live both in page
* cache and swap cache (i.e. the page was freshly swapped in). So it could be
* referenced concurrently by two types of PTEs:
* normal PTEs and swap PTEs. We try to handle them consistently by calling
* try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
* and then
* - clear dirty bit to prevent IO
* - remove from LRU
* - but keep in the swap cache, so that when we return to it on
* a later page fault, we know the application is accessing
* corrupted data and shall be killed (we installed simple
* interception code in do_swap_page to catch it).
*
* Clean swap cache pages can be directly isolated. A later page fault will
* bring in the known good data from disk.
*/
static int me_swapcache_dirty(struct page *p, unsigned long pfn)
{
ClearPageDirty(p);
/* Trigger EIO in shmem: */
ClearPageUptodate(p);

if (!isolate_lru_page(p))
page_cache_release(p);

return DELAYED;
}

static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
if (!isolate_lru_page(p))
page_cache_release(p);

delete_from_swap_cache(p);

return RECOVERED;
}

/*
* Huge pages. Needs work.
* Issues:
* No rmap support so we cannot find the original mapper. In theory could walk
* all MMs and look for the mappings, but that would be non atomic and racy.
* Need rmap for hugepages for this. Alternatively we could employ a heuristic,
* like just walking the current process and hoping it has it mapped (that
* should be usually true for the common "shared database cache" case)
* Should handle free huge pages and dequeue them too, but this needs to
* handle huge page accounting correctly.
*/
static int me_huge_page(struct page *p, unsigned long pfn)
{
return FAILED;
}

/*
* Various page states we can handle.
*
* A page state is defined by its current page->flags bits.
* The table matches them in order and calls the right handler.
*
* This is quite tricky because we can access the page at any time
* in its life cycle, so all accesses have to be extremely careful.
*
* This is not complete. More states could be added.
* For any missing state don't attempt recovery.
*/

#define dirty (1UL << PG_dirty)
#define sc (1UL << PG_swapcache)
#define unevict (1UL << PG_unevictable)
#define mlock (1UL << PG_mlocked)
#define writeback (1UL << PG_writeback)
#define lru (1UL << PG_lru)
#define swapbacked (1UL << PG_swapbacked)
#define head (1UL << PG_head)
#define tail (1UL << PG_tail)
#define compound (1UL << PG_compound)
#define slab (1UL << PG_slab)
#define buddy (1UL << PG_buddy)
#define reserved (1UL << PG_reserved)

static struct page_state {
unsigned long mask;
unsigned long res;
char *msg;
int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
{ reserved, reserved, "reserved kernel", me_ignore },
{ buddy, buddy, "free kernel", me_free },

/*
* Could in theory check if slab page is free or if we can drop
* currently unused objects without touching them. But just
* treat it as standard kernel for now.
*/
{ slab, slab, "kernel slab", me_kernel },

#ifdef CONFIG_PAGEFLAGS_EXTENDED
{ head, head, "huge", me_huge_page },
{ tail, tail, "huge", me_huge_page },
#else
{ compound, compound, "huge", me_huge_page },
#endif

{ sc|dirty, sc|dirty, "swapcache", me_swapcache_dirty },
{ sc|dirty, sc, "swapcache", me_swapcache_clean },

#ifdef CONFIG_UNEVICTABLE_LRU
{ unevict|dirty, unevict|dirty, "unevictable LRU", me_pagecache_dirty},
{ unevict, unevict, "unevictable LRU", me_pagecache_clean},
#endif

#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
{ mlock|dirty, mlock|dirty, "mlocked LRU", me_pagecache_dirty },
{ mlock, mlock, "mlocked LRU", me_pagecache_clean },
#endif

{ lru|dirty, lru|dirty, "LRU", me_pagecache_dirty },
{ lru|dirty, lru, "clean LRU", me_pagecache_clean },
{ swapbacked, swapbacked, "anonymous", me_pagecache_clean },

/*
* Catchall entry: must be at end.
*/
{ 0, 0, "unknown page state", me_unknown },
};

static void action_result(unsigned long pfn, char *msg, int result)
{
printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
msg, action_name[result]);
}

static void page_action(struct page_state *ps, struct page *p,
unsigned long pfn)
{
int result;

result = ps->action(p, pfn);
action_result(pfn, ps->msg, result);
if (page_count(p) != 1)
printk(KERN_ERR
"MCE %#lx: %s page still referenced by %d users\n",
pfn, ps->msg, page_count(p) - 1);

/* Could do more checks here if page looks ok */
atomic_long_add(1, &mce_bad_pages);

/*
* Could adjust zone counters here to correct for the missing page.
*/
}

#define N_UNMAP_TRIES 5

/*
* Do all that is necessary to remove user space mappings. Unmap
* the pages and send SIGBUS to the processes if the data was dirty.
*/
static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
int trapno)
{
enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
int kill = sysctl_memory_failure_early_kill;
struct address_space *mapping;
LIST_HEAD(tokill);
int ret;
int i;

if (PageReserved(p) || PageCompound(p) || PageSlab(p))
return;

if (!PageLRU(p))
lru_add_drain();

/*
* This check implies we don't kill processes if their pages
* are in the swap cache early. Those are always late kills.
*/
if (!page_mapped(p))
return;

if (PageSwapCache(p)) {
printk(KERN_ERR
"MCE %#lx: keeping poisoned page in swap cache\n", pfn);
ttu |= TTU_IGNORE_HWPOISON;
}

/*
* Propagate the dirty bit from PTEs to struct page first, because we
* need this to decide if we should kill or just drop the page.
*/
mapping = page_mapping(p);
if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) {
if (page_mkclean(p))
SetPageDirty(p);
else {
kill = 0;
ttu |= TTU_IGNORE_HWPOISON;
printk(KERN_INFO
"MCE %#lx: corrupted page was clean: dropped without side effects\n",
pfn);
}
}

/*
* First collect all the processes that have the page
* mapped. This has to be done before try_to_unmap,
* because ttu takes the rmap data structures down.
*
* This also has the side effect to propagate the dirty
* bit from PTEs into the struct page. This is needed
* to actually decide if something needs to be killed
* or errored, or if it's ok to just drop the page.
*
* Error handling: We ignore errors here because
* there's nothing that can be done.
*/
if (kill && p->mapping)
collect_procs(p, &tokill);

/*
* try_to_unmap can fail temporarily due to races.
* Try a few times (RED-PEN better strategy?)
*/
for (i = 0; i < N_UNMAP_TRIES; i++) {
ret = try_to_unmap(p, ttu);
if (ret == SWAP_SUCCESS)
break;
pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
}

if (ret != SWAP_SUCCESS)
printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
pfn, page_mapcount(p));

/*
* Now that the dirty bit has been propagated to the
* struct page and all unmaps done we can decide if
* killing is needed or not. Only kill when the page
* was dirty, otherwise the tokill list is merely
* freed. When there was a problem unmapping earlier
* use a more forceful uncatchable kill to prevent
* any accesses to the poisoned memory.
*/
kill_procs_ao(&tokill, !!PageDirty(p), trapno,
ret != SWAP_SUCCESS, pfn);
}

/**
* memory_failure - Handle memory failure of a page.
* @pfn: Page Number of the corrupted page
* @trapno: Trap number reported in the signal to user space.
*
* This function is called by the low level machine check code
* of an architecture when it detects hardware memory corruption
* of a page. It tries its best to recover, which includes
* dropping pages, killing processes etc.
*
* The function is primarily of use for corruptions that
* happen outside the current execution context (e.g. when
* detected by a background scrubber)
*
* Must run in process context (e.g. a work queue) with interrupts
* enabled and no spinlocks held.
*/
void memory_failure(unsigned long pfn, int trapno)
{
struct page_state *ps;
struct page *p;

if (!pfn_valid(pfn)) {
printk(KERN_ERR
"MCE %#lx: memory outside kernel control: Ignored\n",
pfn);
return;
}

p = pfn_to_page(pfn);
if (TestSetPageHWPoison(p)) {
action_result(pfn, "already hardware poisoned", IGNORED);
return;
}

/*
* We need/can do nothing about count=0 pages.
* 1) it's a free page, and therefore in safe hand:
* prep_new_page() will be the gate keeper.
* 2) it's part of a non-compound high order page.
* Implies some kernel user: cannot stop them from
* R/W the page; let's pray that the page has been
* used and will be freed some time later.
* In fact it's dangerous to directly bump up page count from 0,
* that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
*/
if (!get_page_unless_zero(compound_head(p))) {
action_result(pfn, "free or high order kernel", IGNORED);
return;
}

/*
* Lock the page and wait for writeback to finish.
* It's very difficult to mess with pages currently under IO
* and in many cases impossible, so we just avoid it here.
*/
lock_page_nosync(p);
wait_on_page_writeback(p);

/*
* Now take care of user space mappings.
*/
hwpoison_user_mappings(p, pfn, trapno);

/*
* Torn down by someone else?
*/
if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
action_result(pfn, "already truncated LRU", IGNORED);
goto out;
}

for (ps = error_states;; ps++) {
if ((p->flags & ps->mask) == ps->res) {
page_action(ps, p, pfn);
break;
}
}
out:
unlock_page(p);
}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: ***@kvack.org
Nick Piggin
2009-06-15 10:36:19 UTC
Permalink
Post by Wu Fengguang
Post by Nick Piggin
Post by Nick Piggin
Did we verify with filesystem maintainers (eg. btrfs) that the
!ISREG test will be enough to prevent oopses?
BTW. this is quite a significant change I think and not
really documented well enough. Previously a filesystem
will know exactly when and why pagecache in a mapping
under its control will be truncated (as opposed to
invalidated).
They even have opportunity to hold locks such as i_mutex.
And depending on what they do, they could do interesting
things even with ISREG files.
So, I really think this needs review by filesystem
maintainers and it would be far safer to use invalidate
until it is known to be safe.
Nick, we are doing invalidate_complete_page() for !S_ISREG inodes now.
Do you mean to do invalidate_complete_page() for all inodes for now?
That would make me a lot happier. It is obviously correct because
that is basically what page reclaim and inode reclaim and drop
caches etc does.

Note that I still don't like exporting invalidate_complete_page
for the same reasons I don't like exporting truncate_complete_page,
so I will ask if you can do an invalidate_inode_page function
along the same lines of the truncate_inode_page one please.
Post by Wu Fengguang
That's a good suggestion, it shall be able to do the job for most
pages indeed.
Yes I think it will be far far safer while only introducing
another small class of pages which cannot be recovered (probably
a much smaller set most of the time than the size of the existing
set of pages which cannot be recovered).


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Wu Fengguang
2009-06-15 11:41:49 UTC
Permalink
Post by Nick Piggin
Post by Wu Fengguang
Post by Nick Piggin
Post by Nick Piggin
Did we verify with filesystem maintainers (eg. btrfs) that the
!ISREG test will be enough to prevent oopses?
BTW. this is quite a significant change I think and not
really documented well enough. Previously a filesystem
will know exactly when and why pagecache in a mapping
under its control will be truncated (as opposed to
invalidated).
They even have opportunity to hold locks such as i_mutex.
And depending on what they do, they could do interesting
things even with ISREG files.
So, I really think this needs review by filesystem
maintainers and it would be far safer to use invalidate
until it is known to be safe.
Nick, we are doing invalidate_complete_page() for !S_ISREG inodes now.
Do you mean to do invalidate_complete_page() for all inodes for now?
That would make me a lot happier. It is obviously correct because
that is basically what page reclaim and inode reclaim and drop
caches etc does.
Note that I still don't like exporting invalidate_complete_page
for the same reasons I don't like exporting truncate_complete_page,
so I will ask if you can do an invalidate_inode_page function
along the same lines of the truncate_inode_page one please.
Sure. I did something radical - don't try to isolate dirty/writeback
pages, to match the exact invalidate_mapping_pages() behavior.

Let's mess with the dirty/writeback pages some time later.

+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p, unsigned long pfn)
+{
+ struct address_space *mapping;
+
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ mapping = page_mapping(p);
+ if (mapping == NULL)
+ return RECOVERED;
+
+ /*
+ * Now remove it from page cache.
+ * Currently we only remove clean, unused page for the sake of safety.
+ */
+ if (!invalidate_inode_page(mapping, p)) {
+ printk(KERN_ERR
+ "MCE %#lx: failed to remove from page cache\n", pfn);
+ return FAILED;
+ }
+ return RECOVERED;
+}


--- sound-2.6.orig/mm/truncate.c
+++ sound-2.6/mm/truncate.c
@@ -135,6 +135,21 @@ invalidate_complete_page(struct address_
return ret;
}

+/*
+ * Safely invalidate one page from its pagecache mapping.
+ * It only drops clean, unused pages. The page must be locked.
+ *
+ * Returns 1 if the page is successfully invalidated, otherwise 0.
+ */
+int invalidate_inode_page(struct address_space *mapping, struct page *page)
+{
+ if (PageDirty(page) || PageWriteback(page))
+ return 0;
+ if (page_mapped(page))
+ return 0;
+ return invalidate_complete_page(mapping, page);
+}
+
/**
* truncate_inode_pages - truncate range of pages specified by start & end byte offsets
* @mapping: mapping to truncate
@@ -311,12 +326,8 @@ unsigned long invalidate_mapping_pages(s
if (lock_failed)
continue;

- if (PageDirty(page) || PageWriteback(page))
- goto unlock;
- if (page_mapped(page))
- goto unlock;
- ret += invalidate_complete_page(mapping, page);
-unlock:
+ ret += invalidate_inode_page(mapping, page);
+
unlock_page(page);
if (next > end)
break;
Post by Nick Piggin
Post by Wu Fengguang
That's a good suggestion, it shall be able to do the job for most
pages indeed.
Yes I think it will be far far safer while only introducing
another small class of pages which cannot be recovered (probably
a much smaller set most of the time than the size of the existing
set of pages which cannot be recovered).
Anyway, dirty pages are limited to 15% of the total memory by default.

Thanks,
Fengguang
Hugh Dickins
2009-06-15 12:51:43 UTC
Permalink
Post by Wu Fengguang
Post by Balbir Singh
I hope we can reach consensus in this round and then be able to post
a final version for .31 inclusion.
Isn't that too aggressive? .31 is already in the merge window.
Yes, a bit aggressive. This is a new feature that involves complex logic.
However it is basically a no-op when there are no memory errors,
and when memory corruption does occur, it's better to (possibly) panic
in this code than to panic unconditionally in the absence of this
feature (as said by Rik).
So IMHO it's OK for .31 as long as we agree on the user interfaces,
ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
It comes a long way through numerous reviews, and I believe all the
important issues and concerns have been addressed. Nick, Rik, Hugh,
Ingo, ... what are your opinions?
And for how long has this work been in linux-next or mmotm?

My opinion is that it's way too late for .31 - your only chance
is that Linus sometimes gets bored with playing safe, and decides
to break his rules and try something out regardless - but I'd hope
the bootmem business already sated his appetite for danger this time.

Hugh

Alan Cox
2009-06-15 13:00:19 UTC
Permalink
Post by Hugh Dickins
My opinion is that it's way too late for .31 - your only chance
is that Linus sometimes gets bored with playing safe, and decides
to break his rules and try something out regardless - but I'd hope
the bootmem business already sated his appetite for danger this time.
I see no consensus on it being worth merging, no testing, no upstream
integration shakedown, no builds on non-x86 boxes, no work with other
arch maintainers who have similar abilities and needs.

It belongs in next for a release or two while people work on it, while
things like PPC64 can plumb into it if they wish and while people work
out if it's actually useful, given that for most users it reduces the
reliability of the services they are providing rather than improving it.

Andi Kleen
2009-06-15 13:29:34 UTC
Permalink
I think you're wrong about killing processes decreasing
reliability. Traditionally we always tried to keep things running if possible
instead of panicking. That is why ext3 or the block layer does not default to panic
on each IO error, for example. Or an oops does not panic by default like
it does on the BSDs. Your argumentation would be good for a traditional early Unix
which likes to panic instead of handling errors, but that's not the
Linux way as I know it.

Also BTW in many cases (e.g. a lot of file cache pages) there
is actually no process kill involved; just a reload of the page from
disk.

Then for example in a cluster you typically have a application level
heartbeat on the application and restarting the app is faster
if you don't need to reboot the box too. If you don't have a cluster
with failover then graceful degradation is the best. In general
panic is a very nasty failure mode and should be avoided.

That said you can configure it anyways to panic if you want,
but it would be a very bad default.

See also Linus' or hpa's statement on the topic.
Post by Alan Cox
no testing,
There's an extensive test suite in mce-test.git

We did a lot of testing with these separate test suites and also
some other tests. For much more it needs actual users pounding on it, and that
can only be adequately done in mainline.

That said, the real tests of this can only be done with test suites,
really; these errors tend not to happen quickly.
Post by Alan Cox
integration shakedown, no builds on non-x86 boxes, no work with other
arch maintainers who have similar abilities and needs.
We did build tests on ia64 and power and it was reviewed by Tony for IA64.
The ia64 specific code is not quite ready yet, but will come at some point.

I don't think it's a requirement for merging to have PPC64 support.

-Andi
--
***@linux.intel.com -- Speaking for myself only.

H. Peter Anvin
2009-06-15 13:28:20 UTC
Permalink
Post by Andi Kleen
That said the real tests of this can be only done with test suites
really, these errors tend to not happen quickly.
What's far more important, quite frankly, is harmlessness regression
testing.

-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

Alan Cox
2009-06-15 14:48:32 UTC
Permalink
On Mon, 15 Jun 2009 15:29:34 +0200
Post by Andi Kleen
I think you're wrong about killing processes decreasing
reliability. Traditionally we always tried to keep things running if possible
instead of panicing. That is why ext3 or block does not default to panic
on each IO error for example. Or oops does not panic by default like
on BSDs. Your argumentation would be good for a traditional early Unix
which likes to panic instead of handling errors, but that's not the
Linux way as I know it.
Everyone I knew in the business end of deploying Linux turned on panics
for I/O errors, reboot on panic and all the rest of those.

Why ? because they don't want a system where the web server is running
but not logging transactions, or to find out the database is up but that
some other "must not fail" layer killed or stalled the backup server for
it last week ...

The I/O ones can really blow up on you in a reliable environment because
often the process still exists but isn't working, which fools much of the
monitoring software.
Post by Andi Kleen
That said you can configure it anyways to panic if you want,
but it would be a very bad default.
That depends for whom
Post by Andi Kleen
See also Linus' or hpa's statement on the topic.
Linus doesn't run big server systems. It's a really bad default for
developers. It's probably a bad default for desktop users.
Post by Andi Kleen
We did a lot of testing with these separate test suites and also
some other tests. For much more it needs actual users pounding on it, and that
can be only adequately done in mainline.
Thats why we have -next and -mm
Post by Andi Kleen
We did build tests on ia64 and power and it was reviewed by Tony for IA64.
The ia64 specific code is not quite ready yet, but will come at some point.
I don't think it's a requirement for merging to have PPC64 support.
Really - so if your design is wrong for the way PPC wants to work, what
are we going to do? It's not a requirement that PPC64 support is there
but it is most certainly a requirement that it's been in -next a while and
other arch maintainers have at least had time to say "works for me",
"irrelevant to my platform" or "Arghhh noooo.. ECC errors work like
[this] so we need ..."

I'd guess that zSeries has some rather different views on how ECC
failures propagate through the hypervisors for example, including the
fact that a failed page can be unfailed which you don't seem to allow for.

(You can unfail pages on x86 as well it appears by scrubbing them via DMA
- yes ?)


Alan

Andi Kleen
2009-06-15 15:24:28 UTC
Permalink
Post by Alan Cox
Everyone I knew in the business end of deploying Linux turned on panics
for I/O errors, reboot on panic and all the rest of those.
oops=panic already implies panic on all machine check exceptions, so they will
be fine then (assuming this is the best strategy for availability
for them, which I personally find quite doubtful, but we can discuss this some
other time)
Post by Alan Cox
Really - so if your design is wrong for the way PPC wants to work what
are we going to do ? It's not a requirement that PPC64 support is there
Then we change the code. Or if it's too difficult don't support their stuff.
After all it's not cast in stone. That said I doubt the PPC requirements will
be much different than what we have.
Post by Alan Cox
I'd guess that zSeries has some rather different views on how ECC
failures propagate through the hypervisors for example, including the
fact that a failed page can be unfailed which you don't seem to allow for.
That's correct.

That's because unpoisoning is quite hard -- you need some kind
of synchronization point for all the error handling, and that's
the poisoned page; if it unpoisons itself then you need
some very heavy weight synchronization to avoid handling errors
multiple times. I looked at it, but it's quite messy.

Also it's of somewhat dubious value.
Post by Alan Cox
(You can unfail pages on x86 as well it appears by scrubbing them via DMA
- yes ?)
Not architecturally. Also the other problem is not just unpoisoning them,
but finding out if the page is permanently bad or only temporarily so.

-Andi
--
***@linux.intel.com -- Speaking for myself only.

Alan Cox
2009-06-15 15:28:04 UTC
Post by Andi Kleen
oops=panic already implies panic on all machine check exceptions, so they will
be fine then (assuming this is the best strategy for availability
for them, which I personally find quite doubtful, but we can discuss this some
other time)
You can have the argument with all the people who deploy large systems.
Provided their boxes can be persuaded to panic, they don't care about the
masses.
Post by Andi Kleen
That's because unpoisoning is quite hard -- you need some kind
of synchronization point for all the error handling, and that's
the poisoned page; if it unpoisons itself, you need
some very heavyweight synchronization to avoid handling errors
multiple times. I looked at it, but it's quite messy.
Also it's of somewhat dubious value.
On a system running under a hypervisor or with hot-swappable memory it's
of rather higher value. In the hypervisor case the guest system can
acquire a new virtual page to replace the faulty one. In fact the
hypervisor case is even more complex as the guest may get migrated at
which point knowledge of "poisoned" memory is not at all connected to
information on hardware failings.
Post by Andi Kleen
Post by Alan Cox
(You can unfail pages on x86 as well it appears by scrubbing them via DMA
- yes ?)
Not architecturally. Also the other problem is not just unpoisoning them,
but finding out whether the page is permanently bad or just temporarily so.
Small detail you are overlooking: Hot swap mirrorable memory.

Second detail you are overlooking

curse a lot
suspend to disk
remove dirt from fans, clean/replace RAM
resume from disk

The very act of making the ECC error not take out the box creates the
environment whereby the underlying hardware error (if there was one) can
be cured.

In all these cases the fact you've got to shoot stuff because a page has
been lost becomes totally disconnected from the idea that the page is
somehow not recoverable and "contaminated" forever.

Which to me says your model is wrong.

Alan

Andi Kleen
2009-06-15 16:19:04 UTC
Post by Alan Cox
curse a lot
suspend to disk
remove dirt from fans, clean/replace RAM
resume from disk
The very act of making the ECC error not take out the box creates the
Ok so at least you agree now that handling these errors without
panic is the right thing to do. That's at least some progress.
Post by Alan Cox
environment whereby the underlying hardware error (if there was one) can
be cured.
These ECC errors are still somewhat rare (or rather if they become
common you should definitely service the system). That is why
losing a single page of memory for them isn't a big issue normally.

Sure you could spend effort making unpoisoning work,
but it would seem very dubious to me. After all it's just another
4K of memory for each error.
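
To put the "another 4K per error" point in numbers, a quick back-of-the-envelope
calculation (the error rate here is an assumed figure, not one from the thread):

```python
# Cost of retiring one 4 KiB page per uncorrectable error.
PAGE_SIZE = 4 * 1024        # bytes lost per poisoned page
errors_per_year = 100       # assumed rate -- already high for healthy hardware

lost = PAGE_SIZE * errors_per_year
print(lost // 1024, "KiB lost per year")  # 400 KiB
```

Even at a hundred errors a year, which would itself suggest the machine needs
servicing, the retired memory is a few hundred kilobytes.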

The only reasonably good use case I heard for unpoisoning was
if you have a lot of huge pages (you can't use a full huge page with one bad
small page), but that's also still relatively exotic.

-Andi

[1] mostly you need a new special form of RCU I think
--
***@linux.intel.com -- Speaking for myself only.

Alan Cox
2009-06-15 16:28:16 UTC
Post by Andi Kleen
Post by Alan Cox
curse a lot
suspend to disk
remove dirt from fans, clean/replace RAM
resume from disk
The very act of making the ECC error not take out the box creates the
Ok so at least you agree now that handling these errors without
panic is the right thing to do. That's at least some progress.
There are some situations where it may be useful - possibly. But then if you
can't sort the resulting mess out because your patches are too limited,
it's not useful yet, is it?

Even then this isn't 2.6.31 stuff - it's 2.6.31-mm or -next stuff, maybe
for 2.6.32.

Andi Kleen
2009-06-15 17:07:26 UTC
Post by Alan Cox
But then if you
can't sort the resulting mess out because your patches are too limited
its not useful yet is it.
With "too limited" you refer to unpoisoning?

Again very slowly:

- If you have a lot of errors you die eventually anyway.
- If you have a very low rate of errors (which is the normal case) you don't
need unpoisoning because the memory lost for each error is minuscule.
- In the case of a hypervisor it's actually not memory that is lost, but only
guest physical address space, which is plentiful on a 64-bit system. You can
eventually replace it by re-adding memory to a guest, but that's unlikely
to be needed.

-Andi
--
***@linux.intel.com -- Speaking for myself only.

Russ Anderson
2009-06-16 19:44:30 UTC
Post by Andi Kleen
I think you're wrong about killing processes decreasing
reliability. Traditionally we always tried to keep things running if possible
instead of panicking.
Customers love the ia64 feature of killing a user process instead of
panicking the system when a user process hits an uncorrectable memory
error. Avoiding a system panic is a very good thing.
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc ***@sgi.com

H. Peter Anvin
2009-06-16 20:28:54 UTC
Post by Russ Anderson
Post by Andi Kleen
I think you're wrong about killing processes decreasing
reliability. Traditionally we always tried to keep things running if possible
instead of panicking.
Customers love the ia64 feature of killing a user process instead of
panicking the system when a user process hits an uncorrectable memory
error. Avoiding a system panic is a very good thing.
Sometimes (and sometimes it's a very bad thing).

However, the more fundamental thing is that it is always trivial to
promote an error to a higher severity; the opposite is not true. As
such, it becomes an administrator-set policy, which is what it needs to be.
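
The promote-but-never-demote property described above can be sketched as a tiny
policy function; the severity names and levels here are illustrative, not an
actual kernel interface.

```python
# Higher number = more severe response. An admin policy can raise the
# response taken for a detected error, but never lower it, because a
# contained response (kill one process) can always be escalated to a
# panic, while a panic cannot be undone into containment.
SEVERITY = {"ignore": 0, "kill_process": 1, "panic": 2}

def effective_action(detected, admin_minimum):
    """Promote the detected error's response to at least admin_minimum."""
    if SEVERITY[admin_minimum] >= SEVERITY[detected]:
        return admin_minimum
    return detected

# A site that wants the old always-panic behaviour promotes everything:
assert effective_action("kill_process", "panic") == "panic"
# A permissive policy leaves the contained response untouched:
assert effective_action("kill_process", "ignore") == "kill_process"
```

The asymmetry is why exposing this as an administrator-set knob works: the
kernel implements the contained path, and policy only ever escalates from there.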

-hpa

Russ Anderson
2009-06-16 20:54:49 UTC
Post by H. Peter Anvin
Post by Russ Anderson
Post by Andi Kleen
I think you're wrong about killing processes decreasing
reliability. Traditionally we always tried to keep things running if possible
instead of panicking.
Customers love the ia64 feature of killing a user process instead of
panicking the system when a user process hits an uncorrectable memory
error. Avoiding a system panic is a very good thing.
Sometimes (sometimes it's a very bad thing.)
However, the more fundamental thing is that it is always trivial to
promote an error to a higher severity; the opposite is not true. As
such, it becomes an administrator-set policy, which is what it needs to be.
Good point. On ia64 the recovery code is implemented as a loadable
kernel module. Installing the module turns on the feature.

That is handy for customer demos. Install the module, inject a
memory error, have an application read the bad data and get killed.
Repeat a few times. Then uninstall the module, inject a
memory error, have an application read the bad data and watch
the system panic.

Then it is the customer's choice to have it on or off.
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc ***@sgi.com

H. Peter Anvin
2009-06-16 20:58:15 UTC
Post by Russ Anderson
Post by H. Peter Anvin
However, the more fundamental thing is that it is always trivial to
promote an error to a higher severity; the opposite is not true. As
such, it becomes an administrator-set policy, which is what it needs to be.
Good point. On ia64 the recovery code is implemented as a loadable
kernel module. Installing the module turns on the feature.
That is handy for customer demos. Install the module, inject a
memory error, have an application read the bad data and get killed.
Repeat a few times. Then uninstall the module, inject a
memory error, have an application read the bad data and watch
the system panic.
Then it is the customer's choice to have it on or off.
There are a number of ways to set escalation policy. A loadable module isn't
necessarily the best one, but it doesn't really matter what the exact
mechanism is.

-hpa
