Discussion:
kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
Paweł Sikora
2011-10-12 18:12:33 UTC
Hi Hugh,
i'm resending previous private email with larger cc list as you've requested.


in the last weekend my server died again (processes stuck for 22/23s!) but this time i have more logs for you.

on my dual-opteron machines i have non-standard settings:
- DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
  and 64GB ecc-ram is enough for my processing).
- vm.overcommit_memory = 2,
- vm.overcommit_ratio = 100.
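for reference, this configuration can be applied as follows (a sketch; all three commands require root, and the sysctl keys are the standard Linux vm knobs named above):

```shell
# Reproduce the reporter's memory configuration (root required).
swapoff -a                          # run with no swap space at all
sysctl -w vm.overcommit_memory=2    # strict accounting: never overcommit
sysctl -w vm.overcommit_ratio=100   # allow commit up to 100% of RAM
```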

after initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!'
(full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)

Oct 9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
Oct 9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
Oct 9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
Oct 9 08:06:43 hal kernel: [408578.629143] CPU 14
Oct 9 08:06:43 hal kernel: [408578.629143] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct 9 08:06:43 hal kernel: [408578.629143]
Oct 9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18 EFLAGS: 00010246
Oct 9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
Oct 9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
Oct 9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
Oct 9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
Oct 9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
Oct 9 08:06:43 hal kernel: [408578.629143] FS: 00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
Oct 9 08:06:43 hal kernel: [408578.629143] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
Oct 9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
Oct 9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
Oct 9 08:06:43 hal kernel: [408578.629143] Stack:
Oct 9 08:06:43 hal kernel: [408578.629143]  00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
Oct 9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
Oct 9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
Oct 9 08:06:43 hal kernel: [408578.629143] Call Trace:
Oct 9 08:06:43 hal kernel: [408578.629143]  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
Oct 9 08:06:43 hal kernel: [408578.629143]  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
Oct 9 08:06:43 hal kernel: [408578.629143]  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
Oct 9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
Oct 9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
Oct 9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
Oct 9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
Oct 9 08:06:43 hal kernel: [408578.629143] RIP [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP <ffff88021cee7d18>
Oct 9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
Oct 9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
Oct 9 08:07:10 hal kernel: [408605.283367] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct 9 08:07:10 hal kernel: [408605.285807] CPU 12
Oct 9 08:07:10 hal kernel: [408605.285807] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct 9 08:07:10 hal kernel: [408605.285807]
Oct 9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G D 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>] [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808 EFLAGS: 00000293
Oct 9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
Oct 9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
Oct 9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
Oct 9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
Oct 9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
Oct 9 08:07:10 hal kernel: [408605.285807] FS: 00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
Oct 9 08:07:10 hal kernel: [408605.285807] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
Oct 9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
Oct 9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
Oct 9 08:07:10 hal kernel: [408605.285807] Stack:
Oct 9 08:07:10 hal kernel: [408605.285807]  ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
Oct 9 08:07:10 hal kernel: [408605.285807]  ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
Oct 9 08:07:10 hal kernel: [408605.285807]  0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
Oct 9 08:07:10 hal kernel: [408605.285807] Call Trace:
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00
Oct 9 08:07:10 hal kernel: [408605.285807] Call Trace:
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct 9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30

BR,
Paweł.
Hugh Dickins
2011-10-13 23:16:01 UTC
[ Subject refers to a different, unexplained 3.0 bug from Pawel ]
Post by Paweł Sikora
Hi Hugh,
i'm resending previous private email with larger cc list as you've requested.
Thanks, yes, on this one I think I do have an answer;
and we ought to bring Mel and Andrea in too.
Post by Paweł Sikora
in the last weekend my server died again (processes stuck for 22/23s!) but this time i have more logs for you.
- DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
and 64GB ecc-ram is enough for my processing).
- vm.overcommit_memory = 2,
- vm.overcommit_ratio = 100.
after initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!'
Yes, those are just a tiresome consequence of exiting from a BUG
while holding the page table lock(s).
Post by Paweł Sikora
(full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)
Oct 9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
Oct 9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
Oct 9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
Oct 9 08:06:43 hal kernel: [408578.629143] CPU 14
[ I'm deleting that irrelevant long line list of modules ]
Post by Paweł Sikora
Oct 9 08:06:43 hal kernel: [408578.629143]
Oct 9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18 EFLAGS: 00010246
Oct 9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
Oct 9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
Oct 9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
Oct 9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
Oct 9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
Oct 9 08:06:43 hal kernel: [408578.629143] FS: 00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
Oct 9 08:06:43 hal kernel: [408578.629143] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
Oct 9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
Oct 9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
Oct 9 08:06:43 hal kernel: [408578.629143] 00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
Oct 9 08:06:43 hal kernel: [408578.629143] ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
Oct 9 08:06:43 hal kernel: [408578.629143] ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81106097>] ? vma_adjust+0x537/0x570
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
Oct 9 08:06:43 hal kernel: [408578.629143] RIP [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP <ffff88021cee7d18>
Oct 9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
Oct 9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
Oct 9 08:07:10 hal kernel: [408605.285807] CPU 12
Oct 9 08:07:10 hal kernel: [408605.285807]
Oct 9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G D 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>] [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808 EFLAGS: 00000293
Oct 9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
Oct 9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
Oct 9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
Oct 9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
Oct 9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
Oct 9 08:07:10 hal kernel: [408605.285807] FS: 00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
Oct 9 08:07:10 hal kernel: [408605.285807] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
Oct 9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
Oct 9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
Oct 9 08:07:10 hal kernel: [408605.285807] ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
Oct 9 08:07:10 hal kernel: [408605.285807] ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
Oct 9 08:07:10 hal kernel: [408605.285807] 0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00
I guess this is the only time you've seen this? In which case, ideally
I would try to devise a testcase to demonstrate the issue below instead;
but that may involve more ingenuity than I can find time for, so let's
see if people approve of this patch anyway (it applies to 3.1 or 3.0,
and earlier releases, except that i_mmap_mutex used to be i_mmap_lock).
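As a rough starting point for such a testcase (a sketch only, not a demonstrated reproducer: the `stress_mremap` helper is hypothetical and exercises only the mremap()/move_ptes() side of the race; the other side needs concurrent page migration, e.g. compaction driven by THP allocations or /proc/sys/vm/compact_memory, which is omitted here):

```c
#define _GNU_SOURCE          /* for mremap() declaration */
#include <string.h>
#include <sys/mman.h>

#define REGION (2 * 1024 * 1024)   /* 2MB: THP-sized, like the faulting maps */

/*
 * Repeatedly grow and shrink an anonymous mapping with MREMAP_MAYMOVE,
 * forcing the kernel to relocate its page tables via move_ptes().
 * Returns 1 if the mapping's contents survived every move, 0 on failure.
 */
static int stress_mremap(int iters)
{
	char *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 0;
	memset(p, 0x5a, REGION);

	while (iters-- > 0) {
		/* grow by one page, then shrink back; either may move it */
		char *q = mremap(p, REGION, REGION + 4096, MREMAP_MAYMOVE);
		if (q == MAP_FAILED)
			return 0;
		q = mremap(q, REGION + 4096, REGION, MREMAP_MAYMOVE);
		if (q == MAP_FAILED)
			return 0;
		p = q;
	}

	int ok = (p[0] == 0x5a && p[REGION - 1] == 0x5a);
	munmap(p, REGION);
	return ok;
}
```

On a healthy kernel this runs to completion with intact data; actually hitting the BUG would additionally require the migration/compaction side running against the same pages in parallel.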


[PATCH] mm: add anon_vma locking to mremap move

I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.

3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
kernel BUG at include/linux/swapops.h:105!
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30

mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.

It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.

Reported-by: Pawel Sikora <***@agmk.net>
Cc: ***@kernel.org
Signed-off-by: Hugh Dickins <***@google.com>
---

mm/mremap.c | 5 +++++
1 file changed, 5 insertions(+)

--- 3.1-rc9/mm/mremap.c	2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c	2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
 		unsigned long new_addr)
 {
 	struct address_space *mapping = NULL;
+	struct anon_vma *anon_vma = vma->anon_vma;
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
 		mapping = vma->vm_file->f_mapping;
 		mutex_lock(&mapping->i_mmap_mutex);
 	}
+	if (anon_vma)
+		anon_vma_lock(anon_vma);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
 	spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
+	if (anon_vma)
+		anon_vma_unlock(anon_vma);
 	if (mapping)
 		mutex_unlock(&mapping->i_mmap_mutex);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
Hugh Dickins
2011-10-13 23:30:09 UTC
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00
I guess this is the only time you've seen this? In which case, ideally
I would try to devise a testcase to demonstrate the issue below instead;
but that may involve more ingenuity than I can find time for; let's see
if people approve of this patch anyway (it applies to 3.1 or 3.0,
and earlier releases except that i_mmap_mutex used to be i_mmap_lock).


[PATCH] mm: add anon_vma locking to mremap move

I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.

3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
kernel BUG at include/linux/swapops.h:105!
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30

mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.

It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_page_tables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.

Reported-by: Pawel Sikora <***@agmk.net>
Cc: ***@kernel.org
Signed-off-by: Hugh Dickins <***@google.com>
---

mm/mremap.c | 5 +++++
1 file changed, 5 insertions(+)

--- 3.1-rc9/mm/mremap.c 2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c 2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
unsigned long new_addr)
{
struct address_space *mapping = NULL;
+ struct anon_vma *anon_vma = vma->anon_vma;
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
mapping = vma->vm_file->f_mapping;
mutex_lock(&mapping->i_mmap_mutex);
}
+ if (anon_vma)
+ anon_vma_lock(anon_vma);

/*
* We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
+ if (anon_vma)
+ anon_vma_unlock(anon_vma);
if (mapping)
mutex_unlock(&mapping->i_mmap_mutex);
mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Christoph Hellwig
2011-10-16 16:11:08 UTC
Permalink
Btw,

Anders Ossowicki reported a very similar soft lockup on 2.6.38 recently,
although without a bug on before.

Here is the pointer: https://lkml.org/lkml/2011/10/11/87

Andrea Arcangeli
2011-10-16 23:54:42 UTC
Permalink
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times, regardless
of whether it's at the old or new address.

This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.

copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.

Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.

There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.

Another thing could be the vma_merge branch of copy_vma succeeding
(returning non-NULL), but I doubt we risk falling into that one. For
the rmap_walk to keep working on both the src and dst ranges, the two
vmas' vm_pgoff must be different, so we can't possibly be ok if
there's just 1 vma covering the whole range. I exclude this could be
the case because the pgoff passed to copy_vma is different from the
vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed.

Yet another point to investigate is the point where we teardown the
old vma and we leave the new vma generated by copy_vma
established. That's apparently taken care of by do_munmap in move_vma
so that shall be safe too as munmap is safe in the first place.

Overall I don't think this patch is needed and it seems a noop.
Post by Hugh Dickins
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
I don't think this patch can help with that, the problem of execve vs
rmap_walk is that there's 1 single vma existing for src and dst
virtual ranges while execve runs move_page_tables. So there is no
possible way that rmap_walk will be guaranteed to find _all_ ptes
mapping a page if there's just one vma mapping either the src or dst
range while move_page_table runs. No addition of locking whatsoever
can fix that bug because we miss a vma (well modulo locking that
prevents rmap_walk to run at all, until we're finished with execve,
which is more or less what VM_STACK_INCOMPLETE_SETUP does...).

The only way is to fix this is prevent migrate (or any other rmap_walk
user that requires 100% reliability from the rmap layer, for example
swap doesn't require 100% reliability and can still run and gracefully
fail at finding the pte) while we're moving pagetables in execve. And
that's what Mel's above mentioned patch does.

The other way to fix that bug, which I implemented, was to do copy_vma
in execve, so that we still have both src and dst ranges of
move_page_tables covered by 2 (not 1) vmas, each with the proper
vma->vm_pgoff, so my approach fixed that bug as well (but it requires a
vma allocation in execve, so it was dropped in favor of Mel's patch,
which is totally fine with me as both approaches fix the bug equally
well, even if now we have to deal with this special case of rmap_walk
sometimes having false negatives while the vma flag is set; the
important thing is that after VM_STACK_INCOMPLETE_SETUP has been
cleared it won't ever be set again for the whole lifetime of the vma).

I may be missing something; I've only done a short review so far, just
so the patch doesn't get merged if it's not needed. I mean, I think it
needs a few more eyes on it... The fact that the i_mmap_mutex was taken
but the anon_vma lock was not (while in every other place they both are
needed) certainly makes the patch look correct, but I think that's just
a misleading coincidence.

Hugh Dickins
2011-10-17 18:51:00 UTC
Permalink
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
Thanks a lot for thinking it over. I _almost_ agree with you, except
there's one aspect that I forgot to highlight in the patch comment:
remove_migration_pte() behaves as page_check_address() does by default:
it peeks to see if what it wants is there _before_ taking ptlock.

And therefore, I think, it is possible that during mremap move, the swap
pte is in neither of the locations it tries at the instant it peeks there.

We could put a stop to that: see plausible alternative patch below.
Though I have dithered from one to the other and back, I think on the
whole I still prefer the anon_vma locking in move_ptes(): we don't care
too deeply about the speed of mremap, but we do care about the speed of
exec, and this does add another lock/unlock there, but it will always
be uncontended; whereas the patch at the migration end could be adding
a contended and unnecessary lock.

Oh, I don't know which, you vote - if you now agree there is a problem.
I'll sign off the migrate.c one if you prefer it. But no hurry.
Post by Andrea Arcangeli
Another thing could be the copy_vma vma_merge branch succeeding
(returning not NULL) but I doubt we risk to fall into that one. For
the rmap_walk to be always working on both the src and dst
vma->vma_pgoff the pgoff must be different so we can't possibly be ok
if there's just 1 vma covering the whole range. I exclude this could
be the case because the pgoff passed to copy_vma is different than the
vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed.
Yet another point to investigate is the point where we teardown the
old vma and we leave the new vma generated by copy_vma
established. That's apparently taken care of by do_munmap in move_vma
so that shall be safe too as munmap is safe in the first place.
Overall I don't think this patch is needed and it seems a noop.
Post by Hugh Dickins
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
I don't think this patch can help with that, the problem of execve vs
rmap_walk is that there's 1 single vma existing for src and dst
virtual ranges while execve runs move_page_tables. So there is no
possible way that rmap_walk will be guaranteed to find _all_ ptes
mapping a page if there's just one vma mapping either the src or dst
range while move_page_table runs. No addition of locking whatsoever
can fix that bug because we miss a vma (well modulo locking that
prevents rmap_walk to run at all, until we're finished with execve,
which is more or less what VM_STACK_INCOMPLETE_SETUP does...).
The only way is to fix this is prevent migrate (or any other rmap_walk
user that requires 100% reliability from the rmap layer, for example
swap doesn't require 100% reliability and can still run and gracefully
fail at finding the pte) while we're moving pagetables in execve. And
that's what Mel's above mentioned patch does.
Thanks for explaining, yes, you're right.
Post by Andrea Arcangeli
The other way to fix that bug that I implemented was to do copy_vma in
execve, so that we still have both src and dst ranges of
move_page_tables covered by 2 (not 1) vma, each with the proper
vma->vm_pgoff, so my approach fixed that bug as well (but requires a
vma allocation in execve so it was dropped in favor of Mel's patch
which is totally fine with as both approaches fixes the bug equally
well, even if now we've to deal with this special case of sometime
rmap_walk having false negatives if the vma_flags is set, and the
important thing is that after VM_STACK_INCOMPLETE_SETUP has been
cleared it won't ever be set again for the whole lifetime of the vma).
I think your two-vmas approach is more aesthetically pleasing (and
matches mremap), but can see that Mel's vmaflag hack^Htechnique ends up
more economical. It is a bit sad that we lose that all-pages-swappable
condition for unlimited args, for a brief moment, but I think no memory
allocations are made in that interval, so I guess it's fine.

Hugh
Post by Andrea Arcangeli
I may be missing something, I did a short review so far, just so the
patch doesn't get merged if not needed. I mean I think it needs a bit
more looks on it... The fact the i_mmap_mutex was taken but the
anon_vma lock was not taken (while in every other place they both are
needed) certainly makes the patch look correct, but that's just a
misleading coincidence I think.
--- 3.1-rc9/mm/migrate.c 2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/migrate.c 2011-10-17 11:21:48.923826334 -0700
@@ -119,12 +119,6 @@ static int remove_migration_pte(struct p
goto out;

ptep = pte_offset_map(pmd, addr);
-
- if (!is_swap_pte(*ptep)) {
- pte_unmap(ptep);
- goto out;
- }
-
ptl = pte_lockptr(mm, pmd);
}


Andrea Arcangeli
2011-10-17 22:05:34 UTC
Permalink
Post by Hugh Dickins
Thanks a lot for thinking it over. I _almost_ agree with you, except
remove_migration_pte() behaves as page_check_address() does by default,
it peeks to see if what it wants is there _before_ taking ptlock.
And therefore, I think, it is possible that during mremap move, the swap
pte is in neither of the locations it tries at the instant it peeks there.
I see what you mean, I didn't realize you were fixing that race.
During mremap, for a few CPU cycles (which may expand if interrupted by
an irq), the migration entry lives only in the kernel stack of the
process doing mremap. So the rmap_walk may just loop quickly, lockless,
not see it, and return while mremap holds both PT locks (src and dst
pte).

Hitting that exact window during a migrate cycle doesn't sound too
easy, but we still must fix this race.

Maybe whoever needs 100% reliability should not go lockless, looping
all over the vmas without taking the PT lock that serializes against
the pte "moving" functions, which normally do, in order:
ptep_clear_flush(src_ptep); set_pte_at(dst_ptep).

For example, I never thought of optimizing __split_huge_page_splitting:
that must be reliable, so I never felt it could be safe to go lockless
there.

So I think it's better to fix migrate, as there may be other places
like mremap. Walkers that can't afford failure should do the PT locking.

But maybe it's possible to find good reasons to fix the race in the
other way too.
Post by Hugh Dickins
We could put a stop to that: see plausible alternative patch below.
Though I have dithered from one to the other and back, I think on the
whole I still prefer the anon_vma locking in move_ptes(): we don't care
too deeply about the speed of mremap, but we do care about the speed of
exec, and this does add another lock/unlock there, but it will always
be uncontended; whereas the patch at the migration end could be adding
a contended and unnecessary lock.
Oh, I don't know which, you vote - if you now agree there is a problem.
I'll sign off the migrate.c one if you prefer it. But no hurry.
Adding the extra locking in migrate rather than in the mremap fast path
should be better performance-wise. Java GC uses mremap. migrate is
somewhat less performance critical, but I guess there may be other
workloads where migrate runs more often than mremap. But it also depends
on the false positive ratio of rmap_walk: if that's normally low, the
patch to migrate may actually result in an optimization, while the
mremap patch can't possibly speed anything up.

In short, I'm slightly more inclined to prefer the fix to migrate, and
to enforce that all rmap walkers which can't afford failure must not go
speculatively lockless on the ptes, but take the lock before checking
whether the pte they're searching for is there.

Mel Gorman
2011-10-19 07:43:36 UTC
Permalink
Post by Hugh Dickins
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
Thanks a lot for thinking it over. I _almost_ agree with you, except
remove_migration_pte() behaves as page_check_address() does by default,
it peeks to see if what it wants is there _before_ taking ptlock.
And therefore, I think, it is possible that during mremap move, the swap
pte is in neither of the locations it tries at the instant it peeks there.
I should have read the rest of the thread before responding :/ .

This makes more sense and is a relief in a sense. There is nothing known
wrong with the VMA locking or ordering. The correct PTE is found but it is
in the wrong state.
Post by Hugh Dickins
We could put a stop to that: see plausible alternative patch below.
Though I have dithered from one to the other and back, I think on the
whole I still prefer the anon_vma locking in move_ptes(): we don't care
too deeply about the speed of mremap, but we do care about the speed of
I still think the anon_vma lock serialises mremap and migration. If that
is correct, it could cause things like huge page collapsing to stall mremap
operations. That might cause slowdowns in JVMs during GC, which is undesirable.
Post by Hugh Dickins
exec, and this does add another lock/unlock there, but it will always
be uncontended; whereas the patch at the migration end could be adding
a contended and unnecessary lock.
Oh, I don't know which, you vote - if you now agree there is a problem.
I'll sign off the migrate.c one if you prefer it. But no hurry.
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd be loath to hurt it.

Thanks Hugh.
--
Mel Gorman
SUSE Labs

Linus Torvalds
2011-10-19 13:39:55 UTC
Permalink
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Paweł, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?

Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Paweł's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.

Linus

Hugh Dickins
2011-10-19 19:42:15 UTC
Permalink
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.

Here we go. I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.

I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.

I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem. I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.

Ah, I'd better send the patch separately as
"[PATCH] mm: fix race between mremap and removing migration entry":
Paweł's "ł" makes my old alpine setup choose quoted-printable when
I reply to your mail.

Hugh

Paweł Sikora
2011-10-20 06:30:21 UTC
Permalink
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs so I'd loathe to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go. I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem. I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
so please apply it to the upstream/stable git tree.

on the other hand, neither patch helps with the 3.0.4+vserver host soft
lockup, which dies within a few hours of stressing. iirc this lockup
started with 2.6.38. is there any major change in the memory management
area in 2.6.38 that i can bisect and test with vserver?

BR,
Paweł.

Linus Torvalds
2011-10-20 06:51:11 UTC
Permalink
Post by Paweł Sikora
the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
so please apply it to the upstream/stable git tree.
Ok, thanks, applied and pushed out.
Post by Paweł Sikora
from the other side, both patches don't help for 3.0.4+vserver host soft-lock
which dies in few hours of stressing. iirc this lock has started with 2.6.38.
is there any major change in memory managment area in 2.6.38 that i can bisect
and test with vserver?
I suspect you'd be best off simply just doing a full bisect. Yes, if
2.6.37 is the last known working kernel for you, and 38 breaks, that's
a lot of commits (about 10k, to be exact), and it will take an
annoying number of reboots and tests, but assuming you don't hit any
problems, it should still be "only" about 14 bisection points or so.

You could *try* to minimize the bisect by only looking at commits that
change mm/, but quite frankly, partial tree bisects tend to not be all
that reliable. But if you want to try, you could do basically

git bisect start mm/
git bisect good v2.6.37
git bisect bad v2.6.38

and go from there. That will try to do a more specific bisect, and you
should have fewer test points, but the end result really is much less
reliable. But it might help narrow things down a bit.

Linus

Nai Xia
2011-10-21 06:54:29 UTC
Permalink
Post by Paweł Sikora
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs, so I'd be loath to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go.  I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem.  I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
so please apply it to the upstream/stable git tree.
on the other hand, neither patch helps with the 3.0.4+vserver host soft-lockup
Hi Paweł,

Did your "both" mean that you applied each patch and ran the tests separately,
or that you applied both patches and ran them together?

Maybe there was more than one bug at play producing the same effect;
not fixing all of them wouldn't help at all.

Thanks,

Nai Xia
Post by Paweł Sikora
which dies within a few hours of stressing. iirc this lockup started with 2.6.38.
is there any major change in the memory management area in 2.6.38 that i can bisect
and test with vserver?
BR,
Paweł.
Pawel Sikora
2011-10-21 07:35:46 UTC
Permalink
Post by Nai Xia
Post by Paweł Sikora
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs, so I'd be loath to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go. I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem. I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
so please apply it to the upstream/stable git tree.
on the other hand, neither patch helps with the 3.0.4+vserver host soft-lockup
Hi Paweł,
Did your "both" mean that you applied each patch and ran the tests separately,
yes, i've tested Hugh's patches separately.
Post by Nai Xia
Maybe there was more than one bug at play producing the same effect;
not fixing all of them wouldn't help at all.
i suppose the vserver patch only exposes some tricky bug introduced in 2.6.38.

Nai Xia
2011-10-20 12:51:33 UTC
Permalink
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs, so I'd be loath to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go. I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem. I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
Hi Hugh,

Will you please look into my explanation in my reply to Andrea in this thread
and see if it's what you are seeking?


Thanks,

Nai Xia
Post by Hugh Dickins
Ah, I'd better send the patch separately as
Pawel's "l" makes my old alpine setup choose quoted printable when
I reply to your mail.
Hugh
Hugh Dickins
2011-10-20 18:36:06 UTC
Permalink
I'm travelling at the moment, my brain is not in gear, the source is not in
front of me, and I'm not used to typing on my phone much! Excuses, excuses

I flip between thinking you are right, and I'm a fool, and thinking you are
wrong, and I'm still a fool.

Please work it out with Linus, Andrea and Mel: I may not be able to reply
for a couple of days - thanks.

Hugh
Post by Paweł Sikora
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs, so I'd be loath to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go. I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem. I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
Hi Hugh,
Will you please look into my explanation in my reply to Andrea in this thread
and see if it's what you are seeking?
Thanks,
Nai Xia
Post by Hugh Dickins
Ah, I'd better send the patch separately as
Pawel's "l" makes my old alpine setup choose quoted printable when
I reply to your mail.
Hugh
Nai Xia
2011-10-21 06:22:37 UTC
Permalink
Post by Hugh Dickins
I'm travelling at the moment, my brain is not in gear, the source is not in
front of me, and I'm not used to typing on my phone much!  Excuses, excuses
I flip between thinking you are right, and I'm a fool, and thinking you are
wrong, and I'm still a fool.
Ha, well, human brains are all weak at thoroughly searching a racy state space,
while automated model checking is still far from applicable to complex real-world
code like the kernel source. Maybe some day someone will come up with a
human-guided, computer-aided tool to help us search the combinations of all
involved code paths to validate a specific high-level logic assertion.
Post by Hugh Dickins
Please work it out with Linus, Andrea and Mel: I may not be able to reply
for a couple of days - thanks.
OK.

And as a side note. Since I notice that Pawel's workload may include OOM,
I'd like to give an imaginary series of events that may trigger such a bug.

1. do_brk() wants to expand a vma; vma_merge() fails because of a
transient ENOMEM, but do_brk() succeeds in creating a new vma at the boundary.

vma_a vma_b
|----------------|---------------------|

2. a page fault in vma_b gives it an anon_vma; then a page fault in vma_a
reuses the anon_vma of vma_b.


3. vma_a is remapped to somewhere irrelevant; a new vma_c is created
and linked by anon_vma_clone(). In the anon_vma chain of vma_b,
vma_c is linked after vma_b:

vma_a vma_b vma_c
|----------------|---------------------| |==============|

vma_b vma_c
|---------------------| |==============|



4. vma_c is remapped back to the original place where vma_a was.
Ok, vma_merge() in copy_vma() says that this request can be merged
into vma_b, and it returns vma_b.

5. move_page_tables() moves from vma_c to vma_b, and races with rmap_walk.
The reverse ordering of vma_b and vma_c in the anon_vma chain makes
rmap_walk miss an entry in the way I explained.

Well, it seems a very tricky construction, but it also seems possible to me.

Will Linus, Andrea, Mel, or anyone else please look into my construction
and judge whether it's valid?

Thanks

Nai Xia
Post by Hugh Dickins
Hugh
Post by Paweł Sikora
Post by Hugh Dickins
Post by Linus Torvalds
Post by Mel Gorman
My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs, so I'd be loath to hurt it.
Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Pawel, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?
Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Pawel's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.
Yes, I'm glad to have that input from Andrea and Mel, thank you.
Here we go.  I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.
I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.
I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem.  I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.
Hi Hugh,
Will you please look into my explanation in my reply to Andrea in this thread
and see if it's what you are seeking?
Thanks,
Nai Xia
Post by Hugh Dickins
Ah, I'd better send the patch separately as
Pawel's "l" makes my old alpine setup choose quoted printable when
I reply to your mail.
Hugh
Pawel Sikora
2011-10-21 08:07:05 UTC
Permalink
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
my last tests on the patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-core opterons, as on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usually don't use more than half of the physical memory,
and the so-called "cache" on the htop bar doesn't reach 100%.

the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (new in 2.6.38)
died during the night, so in the next steps i'm going to also disable
CONFIG_COMPACTION/CONFIG_MIGRATION and stress this machine again...

Nai Xia
2011-10-21 09:07:56 UTC
Permalink
Post by Pawel Sikora
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
my last tests on the patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-core opterons, as on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usually don't use more than half of the physical memory,
and the so-called "cache" on the htop bar doesn't reach 100%.
OK, did you log any OOM killing if there was some memory usage burst?
But, well, my OOM reasoning above is a direct shortcut to one imagined
root cause, the "adjacent VMAs which should have been merged but in fact
not merged" case.
Maybe there are other cases that can lead to this, or maybe it's another bug entirely....

But I still think that if my reasoning is right, similar bad things will
happen again some time in the future,
even if it was not your case here...
Post by Pawel Sikora
the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (new in 2.6.38)
died during the night, so in the next steps i'm going to also disable
CONFIG_COMPACTION/CONFIG_MIGRATION and stress this machine again...
OK, it's smart to narrow down the range first....
Paweł Sikora
2011-10-21 21:36:46 UTC
Permalink
Post by Nai Xia
Post by Pawel Sikora
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
my last tests on the patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-core opterons, as on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usually don't use more than half of the physical memory,
and the so-called "cache" on the htop bar doesn't reach 100%.
OK, did you log any OOM killing if there was some memory usage burst?
But, well, my OOM reasoning above is a direct shortcut to one imagined
root cause, the "adjacent VMAs which should have been merged but in fact
not merged" case.
Maybe there are other cases that can lead to this, or maybe it's another bug entirely....
i don't see any OOM killing with my conservative settings
(vm.overcommit_memory=2, vm.overcommit_ratio=100).
Post by Nai Xia
But I still think that if my reasoning is right, similar bad things will
happen again some time in the future,
even if it was not your case here...
Post by Pawel Sikora
the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (new in 2.6.38)
died during the night, so in the next steps i'm going to also disable
CONFIG_COMPACTION/CONFIG_MIGRATION and stress this machine again...
OK, it's smart to narrow down the range first....
disabling hugepage/compaction didn't help, but disabling hugepage/compaction/migration
has kept the opterons stable for ~9h so far. userspace uses ~40GB (of 64) ram,
caches reach 100% on the htop bar, average load is ~16. i wonder if it will survive
the weekend...

Nai Xia
2011-10-22 06:21:23 UTC
Permalink
Post by Paweł Sikora
Post by Nai Xia
Post by Pawel Sikora
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
my last tests on the patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-core opterons, as on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usually don't use more than half of the physical memory,
and the so-called "cache" on the htop bar doesn't reach 100%.
OK, did you log any OOM killing if there was some memory usage burst?
But, well, my OOM reasoning above is a direct shortcut to one imagined
root cause, the "adjacent VMAs which should have been merged but in fact
not merged" case.
Maybe there are other cases that can lead to this, or maybe it's another bug entirely....
i don't see any OOM killing with my conservative settings
(vm.overcommit_memory=2, vm.overcommit_ratio=100).
OK, that does not matter now. Andrea showed us a simpler way to get to
this bug.
Post by Paweł Sikora
Post by Nai Xia
But I still think that if my reasoning is right, similar bad things will
happen again some time in the future,
even if it was not your case here...
Post by Pawel Sikora
the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (new in 2.6.38)
died during the night, so in the next steps i'm going to also disable
CONFIG_COMPACTION/CONFIG_MIGRATION and stress this machine again...
OK, it's smart to narrow down the range first....
disabling hugepage/compaction didn't help, but disabling hugepage/compaction/migration
has kept the opterons stable for ~9h so far. userspace uses ~40GB (of 64) ram,
caches reach 100% on the htop bar, average load is ~16. i wonder if it will survive
the weekend...
Maybe you should give Andrea's latest anon_vma_order_tail() patch another shot. :)

Paweł Sikora
2011-10-22 16:42:26 UTC
Permalink
Post by Nai Xia
Post by Paweł Sikora
Post by Nai Xia
Post by Pawel Sikora
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
my last tests on the patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-core opterons, as on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usually don't use more than half of the physical memory,
and the so-called "cache" on the htop bar doesn't reach 100%.
OK, did you log any OOM killing if there was some memory usage burst?
But, well, my OOM reasoning above is a direct shortcut to one imagined
root cause, the "adjacent VMAs which should have been merged but in fact
not merged" case.
Maybe there are other cases that can lead to this, or maybe it's another bug entirely....
i don't see any OOM killing with my conservative settings
(vm.overcommit_memory=2, vm.overcommit_ratio=100).
OK, that does not matter now. Andrea showed us a simpler way to get to
this bug.
Post by Paweł Sikora
Post by Nai Xia
But I still think that if my reasoning is right, similar bad things will
happen again some time in the future,
even if it was not your case here...
Post by Pawel Sikora
the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (new in 2.6.38)
died during the night, so in the next steps i'm going to also disable
CONFIG_COMPACTION/CONFIG_MIGRATION and stress this machine again...
OK, it's smart to narrow down the range first....
disabling hugepage/compaction didn't help, but disabling hugepage/compaction/migration
has kept the opterons stable for ~9h so far. userspace uses ~40GB (of 64) ram,
caches reach 100% on the htop bar, average load is ~16. i wonder if it will survive
the weekend...
Maybe you should give Andrea's latest anon_vma_order_tail() patch another shot. :)
all my attempts at disabling thp/compaction/migration failed (machine locked up).
now, i'm testing 3.0.7 + vserver + Hugh's + Andrea's patches, with a few kernel debug options enabled.

so far it has logged only something unrelated to the memory management subsystem:

[ 258.397014] =======================================================
[ 258.397209] [ INFO: possible circular locking dependency detected ]
[ 258.397311] 3.0.7-vs2.3.1-dirty #1
[ 258.397402] -------------------------------------------------------
[ 258.397503] slave_odra_g_00/19432 is trying to acquire lock:
[ 258.397603] (&(&sig->cputimer.lock)->rlock){-.....}, at: [<ffffffff8103adfc>] update_curr+0xfc/0x190
[ 258.397912]
[ 258.397912] but task is already holding lock:
[ 258.398090] (&rq->lock){-.-.-.}, at: [<ffffffff81041a8e>] scheduler_tick+0x4e/0x280
[ 258.398387]
[ 258.398388] which lock already depends on the new lock.
[ 258.398389]
[ 258.398652]
[ 258.398653] the existing dependency chain (in reverse order) is:
[ 258.398836]
[ 258.398837] -> #2 (&rq->lock){-.-.-.}:
[ 258.399178] [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[ 258.399336] [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[ 258.399495] [<ffffffff81040bd7>] wake_up_new_task+0x97/0x1c0
[ 258.399652] [<ffffffff81047db6>] do_fork+0x176/0x460
[ 258.399807] [<ffffffff8100999c>] kernel_thread+0x6c/0x70
[ 258.399964] [<ffffffff8144715d>] rest_init+0x21/0xc4
[ 258.400120] [<ffffffff818adbd2>] start_kernel+0x3d6/0x3e1
[ 258.400280] [<ffffffff818ad322>] x86_64_start_reservations+0x132/0x136
[ 258.400336] [<ffffffff818ad416>] x86_64_start_kernel+0xf0/0xf7
[ 258.400336]
[ 258.400336] -> #1 (&p->pi_lock){-.-.-.}:
[ 258.400336] [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[ 258.400336] [<ffffffff81466f5c>] _raw_spin_lock_irqsave+0x3c/0x60
[ 258.400336] [<ffffffff8106f328>] thread_group_cputimer+0x38/0x100
[ 258.400336] [<ffffffff8106f41d>] cpu_timer_sample_group+0x2d/0xa0
[ 258.400336] [<ffffffff8107080a>] set_process_cpu_timer+0x3a/0x110
[ 258.400336] [<ffffffff8107091a>] update_rlimit_cpu+0x3a/0x60
[ 258.400336] [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[ 258.400336] [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[ 258.400336] [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b
[ 258.400336]
[ 258.400336] -> #0 (&(&sig->cputimer.lock)->rlock){-.....}:
[ 258.400336] [<ffffffff810951e7>] __lock_acquire+0x1aa7/0x1cc0
[ 258.400336] [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[ 258.400336] [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[ 258.400336] [<ffffffff8103adfc>] update_curr+0xfc/0x190
[ 258.400336] [<ffffffff8103b22d>] task_tick_fair+0x2d/0x140
[ 258.400336] [<ffffffff81041b0f>] scheduler_tick+0xcf/0x280
[ 258.400336] [<ffffffff8105a439>] update_process_times+0x69/0x80
[ 258.400336] [<ffffffff8108e0cf>] tick_sched_timer+0x5f/0xc0
[ 258.400336] [<ffffffff81071339>] __run_hrtimer+0x79/0x1f0
[ 258.400336] [<ffffffff81071ce3>] hrtimer_interrupt+0xf3/0x220
[ 258.400336] [<ffffffff8101daa4>] smp_apic_timer_interrupt+0x64/0xa0
[ 258.400336] [<ffffffff8146f9d3>] apic_timer_interrupt+0x13/0x20
[ 258.400336] [<ffffffff8107092d>] update_rlimit_cpu+0x4d/0x60
[ 258.400336] [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[ 258.400336] [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[ 258.400336] [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b
[ 258.400336]
[ 258.400336] other info that might help us debug this:
[ 258.400336]
[ 258.400336] Chain exists of:
[ 258.400336] &(&sig->cputimer.lock)->rlock --> &p->pi_lock --> &rq->lock
[ 258.400336]
[ 258.400336] Possible unsafe locking scenario:
[ 258.400336]
[ 258.400336] CPU0 CPU1
[ 258.400336] ---- ----
[ 258.400336] lock(&rq->lock);
[ 258.400336] lock(&p->pi_lock);
[ 258.400336] lock(&rq->lock);
[ 258.400336] lock(&(&sig->cputimer.lock)->rlock);
[ 258.400336]
[ 258.400336] *** DEADLOCK ***
[ 258.400336]
[ 258.400336] 2 locks held by slave_odra_g_00/19432:
[ 258.400336] #0: (tasklist_lock){.+.+..}, at: [<ffffffff81062acd>] do_prlimit+0x5d/0x240
[ 258.400336] #1: (&rq->lock){-.-.-.}, at: [<ffffffff81041a8e>] scheduler_tick+0x4e/0x280
[ 258.400336]
[ 258.400336] stack backtrace:
[ 258.400336] Pid: 19432, comm: slave_odra_g_00 Not tainted 3.0.7-vs2.3.1-dirty #1
[ 258.400336] Call Trace:
[ 258.400336] <IRQ> [<ffffffff8145e204>] print_circular_bug+0x23d/0x24e
[ 258.400336] [<ffffffff810951e7>] __lock_acquire+0x1aa7/0x1cc0
[ 258.400336] [<ffffffff8109264d>] ? mark_lock+0x2dd/0x330
[ 258.400336] [<ffffffff81093bfd>] ? __lock_acquire+0x4bd/0x1cc0
[ 258.400336] [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[ 258.400336] [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[ 258.400336] [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[ 258.400336] [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[ 258.400336] [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[ 258.400336] [<ffffffff8103adfc>] update_curr+0xfc/0x190
[ 258.400336] [<ffffffff8103b22d>] task_tick_fair+0x2d/0x140
[ 258.400336] [<ffffffff81041b0f>] scheduler_tick+0xcf/0x280
[ 258.400336] [<ffffffff8105a439>] update_process_times+0x69/0x80
[ 258.400336] [<ffffffff8108e0cf>] tick_sched_timer+0x5f/0xc0
[ 258.400336] [<ffffffff81071339>] __run_hrtimer+0x79/0x1f0
[ 258.400336] [<ffffffff8108e070>] ? tick_nohz_handler+0x100/0x100
[ 258.400336] [<ffffffff81071ce3>] hrtimer_interrupt+0xf3/0x220
[ 258.400336] [<ffffffff8101daa4>] smp_apic_timer_interrupt+0x64/0xa0
[ 258.400336] [<ffffffff8146f9d3>] apic_timer_interrupt+0x13/0x20
[ 258.400336] <EOI> [<ffffffff814674e0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 258.400336] [<ffffffff8107092d>] update_rlimit_cpu+0x4d/0x60
[ 258.400336] [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[ 258.400336] [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[ 258.400336] [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b

Pawel Sikora
2011-10-25 07:33:50 UTC
Permalink
Post by Nai Xia
Post by Paweł Sikora
Post by Nai Xia
Post by Pawel Sikora
Post by Nai Xia
And as a side note. Since I notice that Pawel's workload may include OOM,
my last tests on the patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-core opterons, as on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usually don't use more than half of the physical memory,
and the so-called "cache" on the htop bar doesn't reach 100%.
OK, did you log any OOM killing if there was some memory usage burst?
But, well, my OOM reasoning above is a direct shortcut to one imagined
root cause, the "adjacent VMAs which should have been merged but in fact
not merged" case.
Maybe there are other cases that can lead to this, or maybe it's another bug entirely....
i don't see any OOM killing with my conservative settings
(vm.overcommit_memory=2, vm.overcommit_ratio=100).
OK, that does not matter now. Andrea showed us a simpler way to get to
this bug.
But I still think that if my reasoning is right, similar bad things will
happen again some time in the future, even if it was not your case here...
the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (new in 2.6.38)
died during the night, so in the next steps i'm going to also disable
CONFIG_COMPACTION/CONFIG_MIGRATION and stress this machine again...
OK, it's smart to narrow down the range first....
disabling hugepage/compaction didn't help, but disabling hugepage/compaction/migration
has kept the opterons stable for ~9h so far. userspace uses ~40GB (of 64) ram,
caches reach 100% on the htop bar, average load is ~16. i wonder if it will survive
the weekend...
Maybe you should give Andrea's latest anon_vma_order_tail() patch another shot. :)
all my attempts at disabling thp/compaction/migration failed (machine locked up).
now, i'm testing 3.0.7 + vserver + Hugh's + Andrea's patches, with a few kernel debug options enabled.
Have you got the result of this patch combination by now?
yes, this combination is working *stable* for ~2 days so far (with heavy stressing).

moreover, i've isolated/reported the faulty code in the vserver patch that causes cryptic
deadlocks on 2.6.38+ kernels: http://list.linux-vserver.org/archive?msp:5420:mdaibmimlbgoligkjdma
I have no clues about the locking below; indeed, it seems like another bug......
this might be fixed by 3.0.8 (https://lkml.org/lkml/2011/10/23/26), i'll test it soon...
so far it has logged only something unrelated to the memory management subsystem:
[  258.397014] =======================================================
[  258.397209] [ INFO: possible circular locking dependency detected ]
[  258.397311] 3.0.7-vs2.3.1-dirty #1
[  258.397402] -------------------------------------------------------
[  258.397603]  (&(&sig->cputimer.lock)->rlock){-.....}, at: [<ffffffff8103adfc>] update_curr+0xfc/0x190
Nai Xia
2011-10-20 09:11:28 UTC
Hi Andrea,
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
I happened to be reading this code last week.

And I do think this order matters; the reason is quite similar to why we
need i_mmap_lock in move_ptes():
If rmap_walk goes dst--->src, then when it first looks into dst, ok, the
pte is not there, and it happily skips it and releases the PTL.
Then just before it looks into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. And then when rmap_walk() looks
into src, it finds an empty pte again. The pte is still there,
but rmap_walk() missed it!

IMO, this can really happen in case of vma_merge() succeeding.
Imagine that the src vma is faulted late, and in anon_vma_prepare()
it gets the same anon_vma as an existing vma (call it evil_vma) through
find_mergeable_anon_vma(). This can potentially make the vma_merge() in
copy_vma() return evil_vma on some new relocation request. But src_vma
is really linked _after_ evil_vma/new_vma/dst_vma.
In this way, the ordering protocol of the anon_vma chain is broken.
This should be a rare case, because I think in most cases
if two VMAs pass reusable_anon_vma() they were already merged.

What do you think?

And if my reasoning is sound and this bug is really triggered by it,
Hugh's first patch should be the right fix :)


Regards,

Nai Xia
Post by Andrea Arcangeli
Another thing could be the copy_vma vma_merge branch succeeding
(returning not NULL) but I doubt we risk to fall into that one. For
the rmap_walk to be always working on both the src and dst
vma->vma_pgoff the pgoff must be different so we can't possibly be ok
if there's just 1 vma covering the whole range. I exclude this could
be the case because the pgoff passed to copy_vma is different than the
vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed.
Yet another point to investigate is the point where we teardown the
old vma and we leave the new vma generated by copy_vma
established. That's apparently taken care of by do_munmap in move_vma
so that shall be safe too as munmap is safe in the first place.
Overall I don't think this patch is needed and it seems a noop.
Post by Hugh Dickins
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
I don't think this patch can help with that, the problem of execve vs
rmap_walk is that there's 1 single vma existing for src and dst
virtual ranges while execve runs move_page_tables. So there is no
possible way that rmap_walk will be guaranteed to find _all_ ptes
mapping a page if there's just one vma mapping either the src or dst
range while move_page_table runs. No addition of locking whatsoever
can fix that bug because we miss a vma (well modulo locking that
prevents rmap_walk to run at all, until we're finished with execve,
which is more or less what VM_STACK_INCOMPLETE_SETUP does...).
The only way is to fix this is prevent migrate (or any other rmap_walk
user that requires 100% reliability from the rmap layer, for example
swap doesn't require 100% reliability and can still run and gracefully
fail at finding the pte) while we're moving pagetables in execve. And
that's what Mel's above mentioned patch does.
The other way to fix that bug that I implemented was to do copy_vma in
execve, so that we still have both src and dst ranges of
move_page_tables covered by 2 (not 1) vma, each with the proper
vma->vm_pgoff, so my approach fixed that bug as well (but requires a
vma allocation in execve so it was dropped in favor of Mel's patch
which is totally fine with as both approaches fixes the bug equally
well, even if now we've to deal with this special case of sometime
rmap_walk having false negatives if the vma_flags is set, and the
important thing is that after VM_STACK_INCOMPLETE_SETUP has been
cleared it won't ever be set again for the whole lifetime of the vma).
I may be missing something, I did a short review so far, just so the
patch doesn't get merged if not needed. I mean I think it needs a bit
more looks on it... The fact the i_mmap_mutex was taken but the
anon_vma lock was not taken (while in every other place they both are
needed) certainly makes the patch look correct, but that's just a
misleading coincidence I think.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Mel Gorman
2011-10-21 15:56:32 UTC
Post by Nai Xia
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
I happened to be reading these code last week.
And I do think this order matters, the reason is just quite similar why we
If rmap_walk goes dst--->src, then when it first look into dst, ok, the
You might be right in that the ordering matters. We do link new VMAs at
the end of the list in anon_vma_chain_list so remove_migrate_ptes should
be walking from src->dst.

If remove_migrate_pte finds src first, it will remove the pte and the
correct version will get copied. If move_ptes runs while
remove_migrate_ptes is moving from src to dst, then the PTE at dst will
still be correct.
Post by Nai Xia
pte is not there, and it happily skip it and release the PTL.
Then just before it look into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. And then when rmap_walk() look
into src, it will find an empty pte again. The pte is still there,
but rmap_walk() missed it !
I believe the ordering is correct though and protects us in this case.
Post by Nai Xia
IMO, this can really happen in case of vma_merge() succeeding.
Imagine that src vma is lately faulted and in anon_vma_prepare()
it got a same anon_vma with an existing vma ( named evil_vma )through
find_mergeable_anon_vma(). This can potentially make the vma_merge() in
copy_vma() return with evil_vma on some new relocation request. But src_vma
is really linked _after_ evil_vma/new_vma/dst_vma.
In this way, the ordering protocol of anon_vma chain is broken.
This should be a rare case because I think in most cases
if two VMAs can reusable_anon_vma() they were already merged.
How do you think ?
Despite the comments in anon_vma_compatible(), I would expect that VMAs
that can share an anon_vma from find_mergeable_anon_vma() will also get
merged. When the new VMA is created, it will be linked in the usual
manner and the oldest->newest ordering is what is required. That's not
that important though.

What is important is if mremap is moving src to a dst that is adjacent
to another anon_vma. If src has never been faulted, it's not an issue
because there are also no migration PTEs. If src has been faulted, then
is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
are not compatible. The ordering is preserved and we are still ok.

All that said, while I don't think there is a problem, I can't convince
myself 100% of it. Andrea, can you spot a flaw?
--
Mel Gorman
SUSE Labs

Nai Xia
2011-10-21 17:21:25 UTC
Post by Mel Gorman
Post by Nai Xia
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
I happened to be reading these code last week.
And I do think this order matters, the reason is just quite similar why we
If rmap_walk goes dst--->src, then when it first look into dst, ok, the
You might be right in that the ordering matters. We do link new VMAs at
the end of the list in anon_vma_chain_list so remove_migrate_ptes should
be walking from src->dst.
If remove_migrate_pte finds src first, it will remove the pte and the
correct version will get copied. If move_ptes runs between when
remove_migrate_ptes moves from src to dst, then the PTE at dst will
still be correct.
Post by Nai Xia
pte is not there, and it happily skip it and release the PTL.
Then just before it look into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. And then when rmap_walk() look
into src,  it will find an empty pte again. The pte is still there,
but rmap_walk() missed it !
I believe the ordering is correct though and protects us in this case.
Post by Nai Xia
IMO, this can really happen in case of vma_merge() succeeding.
Imagine that src vma is lately faulted and in anon_vma_prepare()
it got a same anon_vma with an existing vma ( named evil_vma )through
find_mergeable_anon_vma().  This can potentially make the vma_merge() in
copy_vma() return with evil_vma on some new relocation request. But src_vma
is really linked _after_  evil_vma/new_vma/dst_vma.
In this way, the ordering protocol  of anon_vma chain is broken.
This should be a rare case because I think in most cases
if two VMAs can reusable_anon_vma() they were already merged.
How do you think  ?
Despite the comments in anon_vma_compatible(), I would expect that VMAs
that can share an anon_vma from find_mergeable_anon_vma() will also get
merged. When the new VMA is created, it will be linked in the usual
manner and the oldest->newest ordering is what is required. That's not
that important though.
What is important is if mremap is moving src to a dst that is adjacent
to another anon_vma. If src has never been faulted, it's not an issue
because there are also no migration PTEs. If src has been faulted, then
is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
are not compatible. The ordering is preserved and we are still ok.
Hi Mel,

Thanks for the input. I agree with _almost_ all your reasoning above.

But there is a tricky series of events I mentioned in
https://lkml.org/lkml/2011/10/21/14
which, I think, can really lead to anon_vma1 == anon_vma2 in this case.
This series of events begins when do_brk() fails on vma_merge() due to
ENOMEM, rare though that may be. And I am still not sure whether there are
any other corner cases where "should be merged" VMAs just sit there
side by side for some reason -- normally, that does not trigger BUGs, so it
may be hard to detect in a real workload.

Please refer to my link; I think the construction is very clear, unless I
have missed something subtle.

Thanks,

Nai Xia
Post by Mel Gorman
All that said, while I don't think there is a problem, I can't convince
myself 100% of it. Andrea, can you spot a flaw?
--
Mel Gorman
SUSE Labs
Andrea Arcangeli
2011-10-21 17:41:20 UTC
Post by Mel Gorman
Post by Nai Xia
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
I happened to be reading these code last week.
And I do think this order matters, the reason is just quite similar why we
If rmap_walk goes dst--->src, then when it first look into dst, ok, the
You might be right in that the ordering matters. We do link new VMAs at
Yes, I also think ordering matters, as I mentioned in the previous email
that Nai answered.
Post by Mel Gorman
the end of the list in anon_vma_chain_list so remove_migrate_ptes should
be walking from src->dst.
Correct. Like I mentioned in that previous email that Nai answered,
the ordering would only not be ok if vma_merge succeeds, and I didn't
change my mind about that...

copy_vma is only called by mremap, so supposedly that path can
trigger. Looks like I was wrong about vma_merge not being able to succeed
in copy_vma, and if it does I still think it's a problem as we have no
ordering guarantee.

The only other place that depends on the anon_vma_chain order is fork,
and there, no vma_merge can happen, so that is safe.
Post by Mel Gorman
If remove_migrate_pte finds src first, it will remove the pte and the
correct version will get copied. If move_ptes runs between when
remove_migrate_ptes moves from src to dst, then the PTE at dst will
still be correct.
The problem is rmap_walk will search dst before src. So it will do
nothing on dst. Then mremap moves the pte from src to dst. When rmap
walk then checks "src" it finds nothing again.
Post by Mel Gorman
Post by Nai Xia
pte is not there, and it happily skip it and release the PTL.
Then just before it look into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. And then when rmap_walk() look
into src, it will find an empty pte again. The pte is still there,
but rmap_walk() missed it !
I believe the ordering is correct though and protects us in this case.
Normally it is, the only problem is vma_merge succeeding I think.
Post by Mel Gorman
Post by Nai Xia
IMO, this can really happen in case of vma_merge() succeeding.
Imagine that src vma is lately faulted and in anon_vma_prepare()
it got a same anon_vma with an existing vma ( named evil_vma )through
find_mergeable_anon_vma(). This can potentially make the vma_merge() in
copy_vma() return with evil_vma on some new relocation request. But src_vma
is really linked _after_ evil_vma/new_vma/dst_vma.
In this way, the ordering protocol of anon_vma chain is broken.
This should be a rare case because I think in most cases
if two VMAs can reusable_anon_vma() they were already merged.
How do you think ?
I tried to understand the above scenario yesterday, but with 12 hours
of travel on me I just couldn't.

Yesterday however I thought of another simpler case:

part of a vma is moved with mremap elsewhere. Then it is moved back to
its original place. So then vma_merge will succeed, and the "src" of
mremap is now queued last in anon_vma_chain, wrong ordering.

Today I read an email from Nai who showed apparently the same scenario
I was thinking, without evil vmas or stuff.

I have a hard time imagining a vma_merge succeeding on a vma that
isn't going back to its original place. The vm_pgoff + vma->anon_vma
checks should keep some linearity, so going back to the original place
sounds like the only way vma_merge can succeed in copy_vma. But it can
still happen in that case, I think (so I'm not sure how the above scenario
with an evil_vma could ever happen if it has a different anon_vma and
it's not a part of a vma that is going back to its original place like
in the second scenario Nai also posted about).

That Nai and I had the same scenario hypothesis independently (Nai's
second hypothesis, not the first quoted above), plus copy_vma doing
vma_merge and being called only by mremap, suggests it can really
happen.
Post by Mel Gorman
Despite the comments in anon_vma_compatible(), I would expect that VMAs
that can share an anon_vma from find_mergeable_anon_vma() will also get
merged. When the new VMA is created, it will be linked in the usual
manner and the oldest->newest ordering is what is required. That's not
that important though.
What is important is if mremap is moving src to a dst that is adjacent
to another anon_vma. If src has never been faulted, it's not an issue
because there are also no migration PTEs. If src has been faulted, then
is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
are not compatible. The ordering is preserved and we are still ok.
I was thinking along these lines, the only pitfall should be when
something is moved and put back into its original place. When it is
moved, a new vma is created and queued last. When it's put back to its
original location, vma_merge will succeed, and "src" is now the
previous "dst" so queued last and that breaks.
Post by Mel Gorman
All that said, while I don't think there is a problem, I can't convince
myself 100% of it. Andrea, can you spot a flaw?
I think Nai's correct, only the second hypothesis though.

We have two options:

1) we remove the vma_merge call from copy_vma and we do the vma_merge
manually after mremap succeeds (so then we're as safe as fork is and we
rely on the ordering). No locks, but we'll just do 1 more allocation
for one additional temporary vma that will be removed after mremap
has completed.

2) Hugh's original fix.

The first option is probably faster and preferable; the vma_merge there
should only trigger when putting things back to their origin, I suspect,
and never with random mremaps, though I'm not sure how common it is to put
things back to their origin. If we're in a hurry we can merge Hugh's patch
and optimize it later. We can still retain the migrate fix if we intend to
take route number 1 later. I didn't much like migrate doing speculative
accesses on ptes that it can't afford to miss, or it'll crash anyway.

That said, the fix merged upstream is 99% certain to fix things in
practice already, so I doubt we're in a hurry. And if things go wrong,
these issues don't go unnoticed and they shouldn't corrupt memory even
if they trigger. I'm 100% certain it can't do damage (other than a BUG_ON)
for split_huge_page, as I count the pmds encountered in the rmap_walk
when I set the splitting bit, and I compare that count with
page_mapcount and BUG_ON if they don't match; later I repeat the
same comparison in the second rmap_walk that establishes the ptes and
downgrades the hugepmd to pmd, and BUG_ON again if they don't match
the previous rmap_walk count. It may be possible to trigger the
BUG_ON with some malicious activity, but it won't be too easy either
because it's not an instant thing: a race still has to trigger, and
it's hard to reproduce.

The anon_vma lock is quite a wide lock, as it's shared by all parent
anon_vma_chains too; slab allocation from the local cpu may actually be
faster in some conditions (even when the slab allocation is
superfluous). But then I'm not sure. So I'm not against applying Hugh's
fix even for the long run. I wouldn't git revert the migration change,
but if we go with Hugh's fix it would probably be safe.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Andrea Arcangeli
2011-10-21 22:50:08 UTC
Post by Andrea Arcangeli
1) we remove the vma_merge call from copy_vma and we do the vma_merge
manually after mremap succeed (so then we're as safe as fork is and we
relay on the ordering). No locks but we'll just do 1 more allocation
for one addition temporary vma that will be removed after mremap
completed.
2) Hugh's original fix.
3) put the src vma at the tail if vma_merge succeeds and the src vma
and dst vma aren't the same

I tried to implement this, but I'm still wondering about the safety of
this with concurrent processes all calling mremap at the same time on
the same anon_vma's same_anon_vma list; the reasoning for why I think it
may be safe is in the comment. I ran a few mremaps with my benchmark,
where the THP-aware mremap in -mm gets a x10 boost and moves 5G, and it
didn't crash, but that's about it and not conclusive; if you review
please comment...

I've to pack luggage and prepare to fly to KS tomorrow so I may not be
responsive in the next few days.

===
From f2898ff06b5a9a14b9d957c7696137f42a2438e9 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <***@redhat.com>
Date: Sat, 22 Oct 2011 00:11:49 +0200
Subject: [PATCH] mremap: enforce rmap src/dst vma ordering in case of
vma_merge succeeding in copy_vma

migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_order_tail() function to force the dst vma
to the end of the list before mremap starts, solving the problem.

If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy for practically
the whole duration of mremap.
---
 include/linux/rmap.h |    1 +
 mm/mmap.c            |    8 ++++++++
 mm/rmap.c            |   43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);

diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..170cece 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,49 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}

/*
+ * An rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page), running concurrently
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()), depends on the anon_vma "same_anon_vma" list being in a
+ * certain order to be safe: the dst_vma must be placed after the
+ * src_vma in the list. This is always guaranteed by fork(), but
+ * mremap() needs to call this function to enforce it in case the
+ * dst_vma isn't newly allocated and chained with anon_vma_clone() but
+ * is just an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other processes
+ * while mremap runs because mremap doesn't hold the anon_vma mutex to
+ * prevent modifications to the list while it runs. All we need to
+ * enforce is that the relative order of this process vmas isn't
+ * changing (we don't care about other vmas order). Each vma
+ * corresponds to an anon_vma_chain structure so there's no risk that
+ * other processes calling anon_vma_order_tail() and changing the
+ * same_anon_vma list under mremap() will screw with the relative
+ * order of this process vmas in the list, because we won't alter the
+ * order of any vma that isn't belonging to this process. And there
+ * can't be another anon_vma_order_tail running concurrently with
+ * mremap() coming from this process because we hold the mmap_sem for
+ * the whole mremap(). fork() ordering dependency also shouldn't be
+ * affected because we only care that the parent vmas are placed in
+ * the list before the child vmas and anon_vma_order_tail won't reorder
+ * vmas from either the fork parent or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
+/*
* Attach vma to its own anon_vma, as well as to the anon_vmas that
* the corresponding VMA in the parent process is attached to.
* Returns 0 on success, non-zero on failure.

Nai Xia
2011-10-22 05:52:22 UTC
Post by Andrea Arcangeli
Post by Andrea Arcangeli
1) we remove the vma_merge call from copy_vma and we do the vma_merge
manually after mremap succeed (so then we're as safe as fork is and we
relay on the ordering). No locks but we'll just do 1 more allocation
for one addition temporary vma that will be removed after mremap
completed.
2) Hugh's original fix.
3) put the src vma at the tail if vma_merge succeeds and the src vma
and dst vma aren't the same
I tried to implement this but I'm still wondering about the safety of
this with concurrent processes all calling mremap at the same time on
the same anon_vma same_anon_vma list, the reasoning I think it may be
safe is in the comment. I run a few mremap with my benchmark where the
THP aware mremap in -mm gets a x10 boost and moves 5G and it didn't
BTW, I am curious: what benchmark did you run, and is the "x10 boost"
compared to Hugh's anon_vma locking fix?
Post by Andrea Arcangeli
crash, but that's about it and not conclusive; if you review, please
comment...
My comment is at the bottom of this post.
Post by Andrea Arcangeli
I have to pack my luggage and prepare to fly to KS tomorrow, so I may not be
responsive in the next few days.
===
From f2898ff06b5a9a14b9d957c7696137f42a2438e9 Mon Sep 17 00:00:00 2001
Date: Sat, 22 Oct 2011 00:11:49 +0200
Subject: [PATCH] mremap: enforce rmap src/dst vma ordering in case of
vma_merge succeeding in copy_vma
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to fail to serialize properly against
the mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds an anon_vma_order_tail() function that moves the dst vma
to the end of the list before mremap starts, to solve the problem.
If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
---
include/linux/rmap.h | 1 +
mm/mmap.c | 8 ++++++++
mm/rmap.c | 43 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 52 insertions(+), 0 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..170cece 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,49 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}
/*
+ * An rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page), running concurrently
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()), depends for safety on the anon_vma "same_anon_vma" list being
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other processes
+ * while mremap runs because mremap doesn't hold the anon_vma mutex to
+ * prevent modifications to the list while it runs. All we need to
+ * enforce is that the relative order of this process vmas isn't
+ * changing (we don't care about other vmas order). Each vma
+ * corresponds to an anon_vma_chain structure so there's no risk that
+ * other processes calling anon_vma_order_tail() and changing the
+ * same_anon_vma list under mremap() will screw with the relative
+ * order of this process vmas in the list, because we won't alter the
+ * order of any vma that doesn't belong to this process. And there
+ * can't be another anon_vma_order_tail running concurrently with
+ * mremap() coming from this process because we hold the mmap_sem for
+ * the whole mremap(). fork() ordering dependency also shouldn't be
+ * affected because we only care that the parent vmas are placed in
+ * the list before the child vmas and anon_vma_order_tail won't reorder
+ * vmas from either the fork parent or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
This patch, together with its reasoning, looks good to me.
But I wonder whether this patch makes the anon_vma chain ordering game more
complex and harder to play in the future.
However, if it does bring a significant performance benefit, I vote for this
patch, because it balances all three requirements here: bug freedom, performance,
and not leaving two VMAs unmerged for no good reason.

Our situation again gives me the strong feeling that we are really
in bad need of a computer-aided way to explore all possible state spaces.
There are some people around me who do research on automatic software testing,
but I am afraid our problem is too much "real world" for them... sigh...



Andrea Arcangeli
2011-10-31 17:14:41 UTC
Post by Nai Xia
BTW, I am curious: which benchmark did you run, and is the "x10 boost"
measured against Hugh's anon_vma locking fix?
I was referring to the mremap optimizations I pushed in -mm.
Post by Nai Xia
This patch, together with its reasoning, looks good to me.
But I wonder whether this patch makes the anon_vma chain ordering game more
complex and harder to play in the future.
Well, we don't know yet what the future will bring... at least this adds
some documentation of the fact that the order matters for
fork/mremap/migrate/split_huge_page. As far as I can tell they're the
4 pieces of the VM where the rmap_walk order matters. And
split_huge_page and migrate are the only two where if the rmap_walk
fails we can't safely continue and have to BUG_ON.
Post by Nai Xia
However, if it does bring a significant performance benefit, I vote for this
patch, because it balances all three requirements here: bug freedom, performance,
and not leaving two VMAs unmerged for no good reason.
I suppose it should bring an SMP performance benefit, as the critical
section is reduced, but we'll have to do more list_del/list_add_tail
operations than if we took the global lock...
Post by Nai Xia
Our situation again gives me the strong feeling that we are really
in bad need of a computer-aided way to explore all possible state spaces.
There are some people around me who do research on automatic software testing,
but I am afraid our problem is too much "real world" for them... sigh...
Also the code changes too fast for that...

I'll send the patch again with signoff.

Andrea Arcangeli
2011-10-31 17:27:20 UTC
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to fail to serialize properly against
the mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_order_tail() function that moves the dst vma
to the end of the list before mremap starts, to solve the problem.

If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.

Signed-off-by: Andrea Arcangeli <***@redhat.com>
---
include/linux/rmap.h | 1 +
mm/mmap.c | 8 ++++++++
mm/rmap.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);

diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..6dbc165 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}

/*
+ * An rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page), running concurrently
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()), depends for safety on the anon_vma "same_anon_vma" list being
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_order_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because we
+ * won't alter the order of any vma that doesn't belong to this
+ * process. And there can't be another anon_vma_order_tail running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because we only care that the parent
+ * vmas are placed in the list before the child vmas and
+ * anon_vma_order_tail won't reorder vmas from either the fork parent
+ * or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
+/*
* Attach vma to its own anon_vma, as well as to the anon_vmas that
* the corresponding VMA in the parent process is attached to.
* Returns 0 on success, non-zero on failure.

Mel Gorman
2011-11-01 12:07:26 UTC
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
For future reference, why? How about this as an explanation?

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That leads to a race
between migration and mremap whereby a migration PTE is left behind.

mremap                                  migration
create dst VMA
                                        rmap_walk
                                        finds dst, no ptes, releases PTL
move_ptes
copies src PTEs to dst
                                        finds src, ptes empty, releases PTL

The migration PTE is now left behind because the order of the VMAs matters.
Post by Andrea Arcangeli
This patch adds a anon_vma_order_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.
Document the alternative just in case?

"One fix would be to have mremap take the anon_vma lock which would
serialise migration and mremap but this would hurt scalability. Instead,
this patch adds....."

I would also prefer something like anon_vma_moveto_tail() but maybe
it's just me that sees "order" and thinks "high-order allocation".
Post by Andrea Arcangeli
If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
---
include/linux/rmap.h | 1 +
mm/mmap.c | 8 ++++++++
mm/rmap.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+), 0 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..6dbc165 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}
/*
+ * An rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page), running concurrently
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()), depends for safety on the anon_vma "same_anon_vma" list being
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_order_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because we
+ * won't alter the order of any vma that doesn't belong to this
+ * process. And there can't be another anon_vma_order_tail running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because we only care that the parent
+ * vmas are placed in the list before the child vmas and
+ * anon_vma_order_tail won't reorder vmas from either the fork parent
+ * or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
This is following the same rules as anon_vma_clone() and I didn't see a
flaw in your explanation as to why it's safe.

Acked-by: Mel Gorman <***@suse.de>
--
Mel Gorman
SUSE Labs

Nai Xia
2011-11-01 14:35:22 UTC
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds a anon_vma_order_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
---
 include/linux/rmap.h |    1 +
 mm/mmap.c            |    8 ++++++++
 mm/rmap.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 0 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void);   /* create anon_vma_cachep */
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
                */
               if (vma_start >= new_vma->vm_start &&
                   vma_start < new_vma->vm_end)
+                       /*
+                        * No need to call anon_vma_order_tail() in
+                        * this case because the same PT lock will
+                        * serialize the rmap_walk against both src
+                        * and dst vmas.
+                        */
                       *vmap = new_vma;
+               else
+                       anon_vma_order_tail(new_vma);
       } else {
               new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
               if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..6dbc165 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 }
 /*
+ * An rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page), running concurrently
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()), depends for safety on the anon_vma "same_anon_vma" list being
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_order_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because we
+ * won't alter the order of any vma that doesn't belong to this
+ * process. And there can't be another anon_vma_order_tail running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because we only care that the parent
+ * vmas are placed in the list before the child vmas and
+ * anon_vma_order_tail won't reorder vmas from either the fork parent
+ * or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+       struct anon_vma_chain *pavc;
+       struct anon_vma *root = NULL;
+
+       list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+               struct anon_vma *anon_vma = pavc->anon_vma;
+               VM_BUG_ON(pavc->vma != dst);
+               root = lock_anon_vma_root(root, anon_vma);
+               list_del(&pavc->same_anon_vma);
+               list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+       }
+       unlock_anon_vma_root(root);
+}
I think Paweł might want to add a "Tested-by"; he may have been running this
patch safely for quite a few days. :)

Reviewed-by: Nai Xia <***@gmail.com>

Hugh Dickins
2011-11-04 07:31:04 UTC
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
I do think that Nai Xia deserves special credit for thinking deeper
into this than the rest of us (before you came back): something like
Post by Andrea Arcangeli
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds a anon_vma_order_tail() function to force the dst vma
I agree with Mel that anon_vma_moveto_tail() would be a better name;
or even anon_vma_move_to_tail().
Post by Andrea Arcangeli
at the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
But I'm sorry to say that I'm actually not persuaded by the patch,
on three counts.
Post by Andrea Arcangeli
---
include/linux/rmap.h | 1 +
mm/mmap.c | 8 ++++++++
mm/rmap.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+), 0 deletions(-)
Post by Andrea Arcangeli
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
Really? Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
Post by Andrea Arcangeli
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
And if this puts new_vma in the right position for the normal
move_page_tables(), as anon_vma_clone() does in the block below,
aren't they both in exactly the wrong position for the abnormal
move_page_tables(), called to put ptes back where they were if
the original move_page_tables() fails?

It might be possible to argue that move_page_tables() can only
fail by failing to allocate memory for pud or pmd, and that (perhaps)
could only happen if the task was being OOM-killed and ran out of
reserves at this point, and if it's being OOM-killed then we don't
mind losing a migration entry for a moment... perhaps.

Certainly I'd agree that it's a very rare case. But it feels wrong
to be attempting to fix the already unlikely issue, while ignoring
this aspect, or relying on such unrelated implementation details.

Perhaps some further anon_vma_ordering could fix it up,
but that would look increasingly desperate.
Post by Andrea Arcangeli
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..6dbc165 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}
/*
+ * An rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page), running concurrently
+ * with operations that copy or move pagetables (like mremap() and
+ * fork()), depends for safety on the anon_vma "same_anon_vma" list being
+ * in a certain order: the dst_vma must be placed after the src_vma in
+ * the list. This is always guaranteed by fork() but mremap() needs to
+ * call this function to enforce it in case the dst_vma isn't newly
+ * allocated and chained with the anon_vma_clone() function but just
+ * an extension of a pre-existing vma through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process vmas isn't changing (we don't care about other vmas
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_order_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process vmas in the list, because we
+ * won't alter the order of any vma that doesn't belong to this
+ * process. And there can't be another anon_vma_order_tail running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because we only care that the parent
+ * vmas are placed in the list before the child vmas and
+ * anon_vma_order_tail won't reorder vmas from either the fork parent
+ * or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
I thought this was correct, but now I'm not so sure. You rightly
consider the question of interference between concurrent mremaps in
different mms in your comment above, but I'm still not convinced it
is safe. Oh, probably just my persistent failure to picture these
avcs properly.

If we were back in the days of the simple anon_vma list, I'd probably
share your enthusiasm for the list ordering solution; but now it looks
like a fragile and contorted way of avoiding the obvious... we just
need to use the anon_vma_lock (but perhaps there are some common and
easily tested conditions under which we can skip it e.g. when a single
pt lock covers src and dst?).

Sorry to be so negative! I may just be wrong on all counts.

Hugh

Nai Xia
2011-11-04 14:34:54 UTC
Post by Hugh Dickins
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
I do think that Nai Xia deserves special credit for thinking deeper
into this than the rest of us (before you came back): something like
Thanks! ;-)
Post by Hugh Dickins
Post by Andrea Arcangeli
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds a anon_vma_order_tail() function to force the dst vma
I agree with Mel that anon_vma_moveto_tail() would be a better name;
or even anon_vma_move_to_tail().
Post by Andrea Arcangeli
at the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
But I'm sorry to say that I'm actually not persuaded by the patch,
on three counts.
Post by Andrea Arcangeli
---
 include/linux/rmap.h |    1 +
 mm/mmap.c            |    8 ++++++++
 mm/rmap.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 0 deletions(-)
Post by Andrea Arcangeli
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
               */
              if (vma_start >= new_vma->vm_start &&
                  vma_start < new_vma->vm_end)
+                     /*
+                      * No need to call anon_vma_order_tail() in
+                      * this case because the same PT lock will
+                      * serialize the rmap_walk against both src
+                      * and dst vmas.
+                      */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
At first glance, rmap_walk does travel this merged VMA
once...
But, now, wait... I am actually really puzzled that this case can really
happen at all: you see that vma_merge() does not break the validness
between page->index and its VMA. So if this can really happen,
a page->index should be valid in both areas in a same VMA.
It's strange to imagine that a PTE is copied inside a _same_ VMA
and page->index is valid at both old and new places.

IMO, the only case where the src VMA can be merged with the new one
is that the src VMA hasn't been faulted yet and the pgoff
is recalculated. And if my reasoning is true, this place
does not need to be worried about.

How do you think?
Post by Hugh Dickins
Post by Andrea Arcangeli
                      *vmap = new_vma;
+             else
+                     anon_vma_order_tail(new_vma);
And if this puts new_vma in the right position for the normal
move_page_tables(), as anon_vma_clone() does in the block below,
aren't they both in exactly the wrong position for the abnormal
move_page_tables(), called to put ptes back where they were if
the original move_page_tables() fails?
OH, MY, at least six eyeballs missed another apparent case...
Now you know why I said "Human brains are all weak in...." :P
Post by Hugh Dickins
It might be possible to argue that move_page_tables() can only
fail by failing to allocate memory for pud or pmd, and that (perhaps)
could only happen if the task was being OOM-killed and ran out of
reserves at this point, and if it's being OOM-killed then we don't
mind losing a migration entry for a moment... perhaps.
Certainly I'd agree that it's a very rare case.  But it feels wrong
to be attempting to fix the already unlikely issue, while ignoring
this aspect, or relying on such unrelated implementation details.
Perhaps some further anon_vma_ordering could fix it up,
but that would look increasingly desperate.
Post by Andrea Arcangeli
      } else {
              new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
              if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..6dbc165 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 }
 /*
+ * A rmap walk that needs to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page) while running
+ * concurrently with operations that copy or move pagetables (like
+ * mremap() and fork()) depends, to be safe, on the anon_vma
+ * "same_anon_vma" list being in a certain order: the dst_vma must be
+ * placed after the src_vma in the list. This is always guaranteed by
+ * fork(), but mremap() needs to call this function to enforce it in
+ * case the dst_vma isn't newly allocated and chained via
+ * anon_vma_clone() but is just an extension of a pre-existing vma
+ * through vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs, because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process's vmas isn't changing (we don't care about the order of
+ * other vmas). Each vma corresponds to an anon_vma_chain structure, so
+ * there's no risk that other processes calling anon_vma_order_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process's vmas in the list, because we
+ * won't alter the order of any vma that doesn't belong to this
+ * process. And there can't be another anon_vma_order_tail() running
+ * concurrently with mremap() coming from this process, because we hold
+ * the mmap_sem for the whole mremap(). The fork() ordering dependency
+ * also shouldn't be affected, because we only care that the parent
+ * vmas are placed in the list before the child vmas, and
+ * anon_vma_order_tail() won't reorder vmas from either the fork parent
+ * or child.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+     struct anon_vma_chain *pavc;
+     struct anon_vma *root = NULL;
+
+     list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+             struct anon_vma *anon_vma = pavc->anon_vma;
+             VM_BUG_ON(pavc->vma != dst);
+             root = lock_anon_vma_root(root, anon_vma);
+             list_del(&pavc->same_anon_vma);
+             list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+     }
+     unlock_anon_vma_root(root);
+}
I thought this was correct, but now I'm not so sure.  You rightly
consider the question of interference between concurrent mremaps in
different mms in your comment above, but I'm still not convinced it
is safe.  Oh, probably just my persistent failure to picture these
avcs properly.
If we were back in the days of the simple anon_vma list, I'd probably
share your enthusiasm for the list ordering solution; but now it looks
like a fragile and contorted way of avoiding the obvious... we just
need to use the anon_vma_lock (but perhaps there are some common and
easily tested conditions under which we can skip it e.g. when a single
pt lock covers src and dst?).
Sorry to be so negative!  I may just be wrong on all counts.
Hugh
Pawel Sikora
2011-11-04 15:59:26 UTC
Post by Nai Xia
Post by Hugh Dickins
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
I do think that Nai Xia deserves special credit for thinking deeper
into this than the rest of us (before you came back): something like
Thanks! ;-)
hi all,

i'm still testing the anon_vma_order_tail() patch. 10 days of heavy processing
and the machine is still stable, but i've recorded some interesting things:

$ uname -a
Linux hal 3.0.8-vs2.3.1-dirty #6 SMP Tue Oct 25 10:07:50 CEST 2011 x86_64 AMD_Opteron(tm)_Processor_6128 PLD Linux
$ uptime
16:47:44 up 10 days, 4:21, 5 users, load average: 19.55, 19.15, 18.76
$ ps aux|grep migration
root         6  0.0  0.0      0     0 ?        S    Oct25   0:00 [migration/0]
root         8 68.0  0.0      0     0 ?        S    Oct25 9974:01 [migration/1]
root        13 35.4  0.0      0     0 ?        S    Oct25 5202:15 [migration/2]
root        17 71.4  0.0      0     0 ?        S    Oct25 10479:10 [migration/3]
root        21 70.7  0.0      0     0 ?        S    Oct25 10370:14 [migration/4]
root        25 66.1  0.0      0     0 ?        S    Oct25 9698:11 [migration/5]
root        29 70.1  0.0      0     0 ?        S    Oct25 10283:22 [migration/6]
root        33 62.6  0.0      0     0 ?        S    Oct25 9190:28 [migration/7]
root        37  0.0  0.0      0     0 ?        S    Oct25   0:00 [migration/8]
root        41 97.7  0.0      0     0 ?        S    Oct25 14338:30 [migration/9]
root        45 29.2  0.0      0     0 ?        S    Oct25 4290:00 [migration/10]
root        49 68.7  0.0      0     0 ?        S    Oct25 10081:38 [migration/11]
root        53 98.7  0.0      0     0 ?        S    Oct25 14477:25 [migration/12]
root        57 70.0  0.0      0     0 ?        S    Oct25 10272:57 [migration/13]
root        61 69.7  0.0      0     0 ?        S    Oct25 10232:29 [migration/14]
root        65 70.9  0.0      0     0 ?        S    Oct25 10403:09 [migration/15]

wow, 71..241 hours in migration processes after 10 days of uptime?
machine has 2 opteron nodes with 32GB ram paired with each processor.
i suppose that it spends a lot of time on migration (processes + memory pages).

BR,
Paweł.

Nai Xia
2011-11-05 02:21:15 UTC
Post by Pawel Sikora
Post by Nai Xia
Post by Hugh Dickins
Post by Andrea Arcangeli
migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
I do think that Nai Xia deserves special credit for thinking deeper
into this than the rest of us (before you came back): something like
Thanks! ;-)
hi all,
i'm still testing anon_vma_order_tail() patch. 10 days of heavy processing
$ uname -a
Linux hal 3.0.8-vs2.3.1-dirty #6 SMP Tue Oct 25 10:07:50 CEST 2011 x86_64 AMD_Opteron(tm)_Processor_6128 PLD Linux
$ uptime
 16:47:44 up 10 days,  4:21,  5 users,  load average: 19.55, 19.15, 18.76
$ ps aux|grep migration
root         6  0.0  0.0      0     0 ?        S    Oct25   0:00 [migration/0]
root         8 68.0  0.0      0     0 ?        S    Oct25 9974:01 [migration/1]
root        13 35.4  0.0      0     0 ?        S    Oct25 5202:15 [migration/2]
root        17 71.4  0.0      0     0 ?        S    Oct25 10479:10 [migration/3]
root        21 70.7  0.0      0     0 ?        S    Oct25 10370:14 [migration/4]
root        25 66.1  0.0      0     0 ?        S    Oct25 9698:11 [migration/5]
root        29 70.1  0.0      0     0 ?        S    Oct25 10283:22 [migration/6]
root        33 62.6  0.0      0     0 ?        S    Oct25 9190:28 [migration/7]
root        37  0.0  0.0      0     0 ?        S    Oct25   0:00 [migration/8]
root        41 97.7  0.0      0     0 ?        S    Oct25 14338:30 [migration/9]
root        45 29.2  0.0      0     0 ?        S    Oct25 4290:00 [migration/10]
root        49 68.7  0.0      0     0 ?        S    Oct25 10081:38 [migration/11]
root        53 98.7  0.0      0     0 ?        S    Oct25 14477:25 [migration/12]
root        57 70.0  0.0      0     0 ?        S    Oct25 10272:57 [migration/13]
root        61 69.7  0.0      0     0 ?        S    Oct25 10232:29 [migration/14]
root        65 70.9  0.0      0     0 ?        S    Oct25 10403:09 [migration/15]
wow, 71..241 hours in migration processes after 10 days of uptime?
machine has 2 opteron nodes with 32GB ram paired with each processor.
i suppose that it spends a lot of time on migration (processes + memory pages).
Hi Paweł, it seems to me an issue related to load balancing, but it might
not be directly related to this bug, or even related to abnormal page
migration at all.
Can this be a scheduler & interrupts issue?

But oh, well, actually I have never touched a 16-core machine
doing heavy processing. So I cannot tell if this result is normal or not.

Maybe you should ask for a broader range of people?

BR,
Nai
Post by Pawel Sikora
BR,
Paweł.
Hugh Dickins
2011-11-04 19:16:03 UTC
Post by Nai Xia
Post by Hugh Dickins
Post by Andrea Arcangeli
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
               */
              if (vma_start >= new_vma->vm_start &&
                  vma_start < new_vma->vm_end)
+                     /*
+                      * No need to call anon_vma_order_tail() in
+                      * this case because the same PT lock will
+                      * serialize the rmap_walk against both src
+                      * and dst vmas.
+                      */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
At the first glance that rmap_walk does travel this merged VMA
once...
But, Now, Wait...., I am actually really puzzled that this case can really
happen at all, you see that vma_merge() does not break the validness
between page->index and its VMA. So if this can really happen,
a page->index should be valid in both areas in a same VMA.
It's strange to imagine that a PTE is copy inside a _same_ VMA
and page->index is valid at both old and new places.
Yes, I think you are right, thank you for elucidating it.

That was a real case when we wrote copy_vma(), when rmap was using
pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
on anonymous mappings too, it became dead code. With linear vm_pgoff
matching, you cannot fit a range in two places within the same vma.
(And even the non-linear case relies upon vm_pgoff defaults.)

So we could simplify the copy_vma() interface a little now (get rid of
that nasty **vmap): I'm not quite sure whether we ought to do that,
but certainly Andrea's comment there should be updated (if he also
agrees with your analysis).
Post by Nai Xia
IMO, the only case that src VMA can be merged by the new
is that src VMA hasn't been faulted yet and the pgoff
is recalculated. And if my reasoning is true, this place
does not need to be worried about.
I don't see a place where "the pgoff is recalculated" (except in
the consistent way when expanding or splitting or merging vma), nor
where vma merge would allow for variable pgoff. I agree that we
could avoid finalizing vm_pgoff for an anonymous area until its
anon_vma is assigned: were you imagining doing that in future,
or am I overlooking something already there?

Hugh
Andrea Arcangeli
2011-11-04 20:54:40 UTC
Post by Hugh Dickins
Post by Nai Xia
Post by Hugh Dickins
Post by Andrea Arcangeli
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
               */
              if (vma_start >= new_vma->vm_start &&
                  vma_start < new_vma->vm_end)
+                     /*
+                      * No need to call anon_vma_order_tail() in
+                      * this case because the same PT lock will
+                      * serialize the rmap_walk against both src
+                      * and dst vmas.
+                      */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
At the first glance that rmap_walk does travel this merged VMA
once...
But, Now, Wait...., I am actually really puzzled that this case can really
happen at all, you see that vma_merge() does not break the validness
between page->index and its VMA. So if this can really happen,
a page->index should be valid in both areas in a same VMA.
It's strange to imagine that a PTE is copy inside a _same_ VMA
and page->index is valid at both old and new places.
Yes, I think you are right, thank you for elucidating it.
That was a real case when we wrote copy_vma(), when rmap was using
pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
on anonymous mappings too, it became dead code. With linear vm_pgoff
matching, you cannot fit a range in two places within the same vma.
(And even the non-linear case relies upon vm_pgoff defaults.)
So we could simplify the copy_vma() interface a little now (get rid of
that nasty **vmap): I'm not quite sure whether we ought to do that,
but certainly Andrea's comment there should be updated (if he also
agrees with your analysis).
The vmap should only trigger when the prev vma (prev relative to src
vma) is extended at the end to make space for the dst range. And by
extending it, we filled the hole between the prev vma and "src"
vma. So then the prev vma becomes the "src vma" and also the "dst
vma". So we can't keep working with the old "vma" pointer after that.

I doubt it can be removed without crashing in the above case.

I thought some more about it, and what I missed, I think, is the
anon_vma_merge in vma_adjust. With that anon_vma_merge, rmap_walk will
have to complete before we can start moving the ptes. And so rmap_walk,
when it starts again from scratch (after anon_vma_merge ran in
vma_adjust), will find all ptes even if vma_merge succeeded before.

In fact this may also work for fork. Fork will take the anon_vma root
lock somehow to queue the child vma in the same_anon_vma. Doing so it
will serialize against any running rmap_walk from all other cpus. The
ordering has never been an issue for fork anyway, but it would have
been an issue for mremap in case vma_merge succeeded and src_vma
!= dst_vma, if vma_merge didn't act as a serialization point against
rmap_walk (which I realized now).

What makes it safe is again taking both PT locks simultaneously. So it
doesn't matter what rmap_walk searches, as long as the anon_vma_chain
list cannot change by the time rmap_walk started.

What I thought before was rmap_walk checking vma1 and then vma_merge
succeed (where src vma is vma2 and dst vma is vma1, but vma1 is not a
new vma queued at the end of same_anon_vma), move_page_tables moves
the pte from vma2 to vma1, and then rmap_walk checks vma2. But again
vma_merge won't be allowed to complete in the middle of rmap_walk, and
so it cannot trigger and we can safely drop the patch. It wasn't
immediate to think of the locks taken within vma_adjust, sorry.

Nai Xia
2011-11-05 00:09:02 UTC
Post by Andrea Arcangeli
Post by Hugh Dickins
Post by Nai Xia
Post by Hugh Dickins
Post by Andrea Arcangeli
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
              */
             if (vma_start >= new_vma->vm_start &&
                 vma_start < new_vma->vm_end)
+                     /*
+                      * No need to call anon_vma_order_tail() in
+                      * this case because the same PT lock will
+                      * serialize the rmap_walk against both src
+                      * and dst vmas.
+                      */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
At first glance, rmap_walk does travel this merged VMA
once...
But, now, wait... I am actually really puzzled that this case can really
happen at all: you see that vma_merge() does not break the validness
between page->index and its VMA. So if this can really happen,
a page->index should be valid in both areas in a same VMA.
It's strange to imagine that a PTE is copied inside a _same_ VMA
and page->index is valid at both old and new places.
Yes, I think you are right, thank you for elucidating it.
That was a real case when we wrote copy_vma(), when rmap was using
pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
on anonymous mappings too, it became dead code.  With linear vm_pgoff
matching, you cannot fit a range in two places within the same vma.
(And even the non-linear case relies upon vm_pgoff defaults.)
So we could simplify the copy_vma() interface a little now (get rid of
that nasty **vmap): I'm not quite sure whether we ought to do that,
but certainly Andrea's comment there should be updated (if he also
agrees with your analysis).
The vmap should only trigger when the prev vma (prev relative to src
vma) is extended at the end to make space for the dst range. And by
extending it, we filled the hole between the prev vma and "src"
vma. So then the prev vma becomes the "src vma" and also the "dst
vma". So we can't keep working with the old "vma" pointer after that.
I doubt it can be removed without crashing in the above case.
Yes, this line itself should not be removed. As I explained,
pgoff adjustment at the top of the copy_vma() for non-faulted
vma will lead to this case. But we do not need to worry
about the move_page_tables() after this happens.
And so no lines need to be added here. But maybe the
documentation should be changed in your original patch
to clarify this. Reasoning with PTL locks for this case might
be somewhat misleading.

Furthermore, the move_page_tables() call following this case
might better be totally avoided for code readability, and it's
simple to judge with (vma == new_vma).

Do you agree? :)
Post by Andrea Arcangeli
I thought some more about it, and what I missed, I think, is the
anon_vma_merge in vma_adjust. With that anon_vma_merge, rmap_walk will
have to complete before we can start moving the ptes. And so rmap_walk,
when it starts again from scratch (after anon_vma_merge ran in
vma_adjust), will find all ptes even if vma_merge succeeded before.
In fact this may also work for fork. Fork will take the anon_vma root
lock somehow to queue the child vma in the same_anon_vma. Doing so it
will serialize against any running rmap_walk from all other cpus. The
ordering has never been an issue for fork anyway, but it would have
been an issue for mremap in case vma_merge succeeded and src_vma
!= dst_vma, if vma_merge didn't act as a serialization point against
rmap_walk (which I realized now).
What makes it safe is again taking both PT locks simultaneously. So it
doesn't matter what rmap_walk searches, as long as the anon_vma_chain
list cannot change by the time rmap_walk started.
What I thought before was rmap_walk checking vma1 and then vma_merge
succeed (where src vma is vma2 and dst vma is vma1, but vma1 is not a
new vma queued at the end of same_anon_vma), move_page_tables moves
the pte from vma2 to vma1, and then rmap_walk checks vma2. But again
vma_merge won't be allowed to complete in the middle of rmap_walk, and
so it cannot trigger and we can safely drop the patch. It wasn't
immediate to think of the locks taken within vma_adjust, sorry.
Oh, no, sorry. I think I was trying to clarify in the first reply on
that thread that we all agree that the anon_vma chain is 100% stable
when doing rmap_walk().
What is important, I think, is the relative order of these three events:
1. The time rmap_walk() scans the src
2. The time rmap_walk() scans the dst
3. The time move_page_tables() moves the PTE from src vma to dst.

rmap_walk() scans dst (taking dst PTL) ---> move_page_tables() with
both PTLs ---> rmap_walk() scans src (taking src PTL)

will trigger this bug. The racing is there even if rmap_walk() scans
src--->dst, but that racing does not harm. I think Mel explained why
it's safe for good ordering in his first reply to my post.

vma_merge() is only guilty for giving a wrong order of VMAs before
move_page_tables() and rmap_walk() begin to race, itself does not race
with rmap_walk().

You see, it seems this game might be really puzzling. Indeed, maybe
it's time to fall back on locks instead of playing with racing. Just
like the good old time, our classic OS text book told us that shared
variables deserve locks. :-)

Thanks,

Nai
Hugh Dickins
2011-11-05 02:21:28 UTC
Post by Nai Xia
Post by Andrea Arcangeli
Post by Hugh Dickins
Post by Nai Xia
Post by Hugh Dickins
Post by Andrea Arcangeli
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
               */
              if (vma_start >= new_vma->vm_start &&
                  vma_start < new_vma->vm_end)
+                     /*
+                      * No need to call anon_vma_order_tail() in
+                      * this case because the same PT lock will
+                      * serialize the rmap_walk against both src
+                      * and dst vmas.
+                      */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
At the first glance that rmap_walk does travel this merged VMA
once...
But, Now, Wait...., I am actually really puzzled that this case can really
happen at all, you see that vma_merge() does not break the validness
between page->index and its VMA. So if this can really happen,
a page->index should be valid in both areas in a same VMA.
It's strange to imagine that a PTE is copy inside a _same_ VMA
and page->index is valid at both old and new places.
Yes, I think you are right, thank you for elucidating it.
That was a real case when we wrote copy_vma(), when rmap was using
pte_chains; but once anon_vma came in, and imposed vm_pgoff matching
on anonymous mappings too, it became dead code.  With linear vm_pgoff
matching, you cannot fit a range in two places within the same vma.
(And even the non-linear case relies upon vm_pgoff defaults.)
So we could simplify the copy_vma() interface a little now (get rid of
that nasty **vmap): I'm not quite sure whether we ought to do that,
but certainly Andrea's comment there should be updated (if he also
agrees with your analysis).
The vmap should only trigger when the prev vma (prev relative to src
vma) is extended at the end to make space for the dst range. And by
extending it, we filled the hole between the prev vma and "src"
vma. So then the prev vma becomes the "src vma" and also the "dst
vma". So we can't keep working with the old "vma" pointer after that.
I doubt it can be removed without crashing in the above case.
Yes, this line itself should not be removed. As I explained,
pgoff adjustment at the top of the copy_vma() for non-faulted
vma will lead to this case.
Ah, thank you, that's what I was asking you to point me to, the place
I was missing that recalculates pgoff: at the head of copy_vma() itself.

Yes, if that adjustment remains (no reason why not), then we cannot
remove the *vmap = new_vma; but that is the only case that nowadays
can need the *vmap = new_vma (agreed?), which does deserve a comment.
Post by Nai Xia
But we do not need to worry
about the move_page_tables() after this happens.
And so no lines need to be added here. But maybe the
documentation should be changed in your original patch
to clarify this. Reasoning with PTL locks for this case might
be somewhat misleading.
Right, there are no ptes there yet, so we cannot miss any.
Post by Nai Xia
Furthermore, the move_page_tables() call following this case
might better be totally avoided for code readability and it's
simple to judge with (vma == new_vma)
Do you agree? :)
Well, it's true that looking at pagetables in this case is just
a waste of time; but personally I'd prefer to add more comment
than special case handling for this.
Post by Nai Xia
Post by Andrea Arcangeli
I thought some more about it and what I missed I think is the
anon_vma_merge in vma_adjust. What that anon_vma_merge, rmap_walk will
have to complete before we can start moving the ptes. And so rmap_walk
when starts again from scratch (after anon_vma_merge run in
vma_adjust) will find all ptes even if vma_merge succeeded before.
In fact this may also work for fork. Fork will take the anon_vma root
lock somehow to queue the child vma in the same_anon_vma. Doing so it
will serialize against any running rmap_walk from all other cpus. The
ordering has never been an issue for fork anyway, but it would have
have been an issue for mremap in case vma_merge succeeded and src_vma
!= dst_vma, if vma_merge didn't act as a serialization point against
rmap_walk (which I realized now).
What makes it safe is again taking both PT locks simultanously. So it
doesn't matter what rmap_walk searches, as long as the anon_vma_chain
list cannot change by the time rmap_walk started.
What I thought before was rmap_walk checking vma1 and then vma_merge
succeed (where src vma is vma2 and dst vma is vma1, but vma1 is not a
new vma queued at the end of same_anon_vma), move_page_tables moves
the pte from vma2 to vma1, and then rmap_walk checks vma2. But again
vma_merge won't be allowed to complete in the middle of rmap_walk, and
so it cannot trigger and we can safely drop the patch. It wasn't
immediate to think at the locks taken within vma_adjust sorry.
I found Andrea's "anon_vma_merge" reply very hard to understand; but
it looks like he now accepts that it was mistaken, or on the wrong
track at least...
Post by Nai Xia
Oh, no, sorry. I think I was trying to clarify in the first reply on
that thread that
we all agree that anon_vma chain is 100% stable when doing rmap_walk().
1. The time rmap_walk() scans the src
2. The time rmap_walk() scans the dst
3. The time move_page_tables() move PTE from src vma to dst.
... after you set us straight again with this.
Post by Nai Xia
rmap_walk() scans dst( taking dst PTL) ---> move_page_tables() with
both PTLs ---> rmap_walk() scans src(taking src PTL)
will trigger this bug. The racing is there even if rmap_walk() scans src--->dst
but that racing does not harm. I think Mel explained why it's safe for good
ordering in his first reply to my post.
vma_merge() is only guilty for giving a wrong order of VMAs before
move_page_tables() and rmap_walk() begin to race, itself does not race
with rmap_walk().
You see, it seems this game might be really puzzling. Indeed, maybe it's time
to fall back on locks instead of playing with racing. Just like the
good old time,
our classic OS text book told us that shared variables deserve locks. :-)
That's my preference, yes: this mail thread seems to cry out for that!

Hugh
Andrea Arcangeli
2011-11-05 03:07:18 UTC
Post by Hugh Dickins
I found Andrea's "anon_vma_merge" reply very hard to understand; but
it looks like he now accepts that it was mistaken, or on the wrong
track at least...
No matter how we get the order right, we still need to reverse the
order in case of error without taking the lock. So even allocating a
new vma every time wouldn't be enough to get out of the ordering
games (it would be enough in the non-error path of course...).

So there are a couple of ways:

1) Keep my patch (adjust comment) and add a second ordering call in
the error path. Clean up the *vmap case.

2) Always allocate a new vma, merge later, and still keep my patch for
reversing the order in the error path only (not a huge improvement
if we still have to reverse the order). So this now looks like the worst
option in light of the error path, which would give
trouble by going the opposite way... again.

3) Return to your fix that takes the anon_vma lock during the pte
moves

Fixing my patch requires just a one-liner to fix the error path; it's
not that the patch was wrong, in fact it reduced the window even more,
it just missed a one-liner in the error path.

But it's still doing reordering. Which I think is safe and not
fundamentally different in ordering terms from the old anon_vma logic
before _chain (which is why this bug could have triggered before
too). But certainly more complex than taking the anon_vma lock around
every pagetable move, that's for sure. fork will still rely on the
ordering, but fork has a super easy life compared to mremap, which goes
both ways and has vma_merge in it too, which makes the vma order
non-deterministic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Andrea Arcangeli
2011-11-05 17:06:22 UTC
Permalink
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge succeeds in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_moveto_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.

If the mremap is very large and there are a lot of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.

Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.

This program exercises the anon_vma_moveto_tail:

===

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef SIZE
/* the original left SIZE to the build; any 2M multiple works */
#define SIZE (256UL*1024*1024)
#endif

int main(void)
{
	char *p, *p2, *p3, *p4;

	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);

	memset(p, 0xff, SIZE);
	printf("%p\n", p);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	if (p4 != p3)
		perror("mremap"), exit(1);
	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
	if (p4 != p+SIZE/2)
		perror("mremap"), exit(1);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	printf("ok\n");

	return 0;
}
===

$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

You can now use it on all perf tools, such as:

perf record -e probe:anon_vma_moveto_tail -aR sleep 1

$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

Reported-by: Nai Xia <***@gmail.com>
Acked-by: Mel Gorman <***@suse.de>
Signed-off-by: Andrea Arcangeli <***@redhat.com>
---
include/linux/rmap.h | 1 +
mm/mmap.c | 22 ++++++++++++++++++++--
mm/mremap.c | 1 +
mm/rmap.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..1afb995 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_moveto_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);

diff --git a/mm/mmap.c b/mm/mmap.c
index 3c0061f..948513d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2322,13 +2322,16 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
struct vm_area_struct *new_vma, *prev;
struct rb_node **rb_link, *rb_parent;
struct mempolicy *pol;
+ bool faulted_in_anon_vma = true;

/*
* If anonymous vma has not yet been faulted, update new pgoff
* to match new location, to increase its chance of merging.
*/
- if (!vma->vm_file && !vma->anon_vma)
+ if (!vma->vm_file && !vma->anon_vma) {
pgoff = addr >> PAGE_SHIFT;
+ faulted_in_anon_vma = false;
+ }

find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
@@ -2338,8 +2341,23 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
* Source vma may have been merged into new_vma
*/
if (vma_start >= new_vma->vm_start &&
- vma_start < new_vma->vm_end)
+ vma_start < new_vma->vm_end) {
+ /*
+ * The only way we can get a vma_merge with
+ * self during an mremap is if the vma hasn't
+ * been faulted in yet and we were allowed to
+ * reset the dst vma->vm_pgoff to the
+ * destination address of the mremap to allow
+ * the merge to happen. mremap must change the
+ * vm_pgoff linearity between src and dst vmas
+ * (in turn preventing a vma_merge) to be
+ * safe. It is only safe to keep the vm_pgoff
+ * linear if there are no pages mapped yet.
+ */
+ VM_BUG_ON(faulted_in_anon_vma);
*vmap = new_vma;
+ } else
+ anon_vma_moveto_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/mremap.c b/mm/mremap.c
index d6959cb..d845537 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -225,6 +225,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
* which will succeed since page tables still there,
* and then proceed to unmap new area instead of old.
*/
+ anon_vma_moveto_tail(vma);
move_page_tables(new_vma, new_addr, vma, old_addr, moved_len);
vma = new_vma;
old_len = new_len;
diff --git a/mm/rmap.c b/mm/rmap.c
index 6541cf7..9832f03 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,51 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}

/*
+ * Some rmap walks (like migrate and split_huge_page) need to find
+ * all ptes/hugepmds without false negatives while running
+ * concurrently with operations that copy or move pagetables (like
+ * mremap() and fork()). They depend on the anon_vma "same_anon_vma"
+ * list to be in a certain order: the dst_vma must be placed after the
+ * src_vma in the list. This is always guaranteed by fork() but
+ * mremap() needs to call this function to enforce it in case the
+ * dst_vma isn't newly allocated and chained with the anon_vma_clone()
+ * function but just an extension of a pre-existing vma through
+ * vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process's vmas isn't changing (we don't care about other vmas'
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_moveto_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process's vmas in the list, because
+ * they can't alter the order of any vma that belongs to this
+ * process. And there can't be another anon_vma_moveto_tail() running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because fork() only cares that the
+ * parent vmas are placed in the list before the child vmas and
+ * anon_vma_moveto_tail() won't reorder vmas from either the fork()
+ * parent or child.
+ */
+void anon_vma_moveto_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
+/*
* Attach vma to its own anon_vma, as well as to the anon_vmas that
* the corresponding VMA in the parent process is attached to.
* Returns 0 on success, non-zero on failure.

David Rientjes
2011-12-08 03:24:59 UTC
Permalink
Post by Andrea Arcangeli
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds an anon_vma_moveto_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lot of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.
Is this still needed? It's missing in linux-next.

Andrea Arcangeli
2011-12-08 12:42:17 UTC
Permalink
Post by David Rientjes
Post by Andrea Arcangeli
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.
This patch adds an anon_vma_moveto_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lot of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.
Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.
Is this still needed? It's missing in linux-next.
Yes it's needed, either this or the anon_vma lock around
move_page_tables. Then we also need the i_mmap_mutex around fork or a
triple loop in vmtruncate (then we could remove i_mmap_mutex in
mremap).

Andrew Morton
2011-12-09 00:08:56 UTC
Permalink
On Sat, 5 Nov 2011 18:06:22 +0100
Post by Andrea Arcangeli
This patch adds a anon_vma_moveto_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.
It's not obvious to me that the patch which I merged is the one which
we want to merge, given the amount of subsequent discussion. Please
check this.

I'm thinking we merge this into 3.3-rc1, tagged for backporting into
3.2.x. To give us additional time to think about it and test it.

Or perhaps the bug just isn't serious enough to bother fixing it in 3.2
or earlier?



From: Andrea Arcangeli <***@redhat.com>
Subject: mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma()

migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serializing properly against mremap
PT locks. But a second problem remains in the order of vmas in the
same_anon_vma list used by the rmap_walk.

If vma_merge succeeds in copy_vma, the src vma could be placed after the
dst vma in the same_anon_vma list. That could still lead to migrate
missing some pte.

This patch adds an anon_vma_moveto_tail() function to force the dst vma at
the end of the list before mremap starts to solve the problem.

If the mremap is very large and there are a lot of parents or children
sharing the anon_vma root lock, this should still scale better than taking
the anon_vma root lock around every pte copy practically for the whole
duration of mremap.

Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.

This program exercises the anon_vma_moveto_tail:

===

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef SIZE
/* the original left SIZE to the build; any 2M multiple works */
#define SIZE (256UL*1024*1024)
#endif

int main(void)
{
	char *p, *p2, *p3, *p4;

	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);

	memset(p, 0xff, SIZE);
	printf("%p\n", p);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	if (p4 != p3)
		perror("mremap"), exit(1);
	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
	if (p4 != p+SIZE/2)
		perror("mremap"), exit(1);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	printf("ok\n");

	return 0;
}
===

$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

You can now use it on all perf tools, such as:

perf record -e probe:anon_vma_moveto_tail -aR sleep 1

$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

Signed-off-by: Andrea Arcangeli <***@redhat.com>
Reported-by: Nai Xia <***@gmail.com>
Acked-by: Mel Gorman <***@suse.de>
Cc: Hugh Dickins <***@google.com>
Cc: Pawel Sikora <***@agmk.net>
Cc: <***@vger.kernel.org>
Signed-off-by: Andrew Morton <***@linux-foundation.org>
---

include/linux/rmap.h | 1
mm/mmap.c | 22 ++++++++++++++++++--
mm/mremap.c | 1
mm/rmap.c | 45 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 67 insertions(+), 2 deletions(-)

diff -puN include/linux/rmap.h~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma include/linux/rmap.h
--- a/include/linux/rmap.h~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_moveto_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);

diff -puN mm/mmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma mm/mmap.c
--- a/mm/mmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/mm/mmap.c
@@ -2349,13 +2349,16 @@ struct vm_area_struct *copy_vma(struct v
struct vm_area_struct *new_vma, *prev;
struct rb_node **rb_link, *rb_parent;
struct mempolicy *pol;
+ bool faulted_in_anon_vma = true;

/*
* If anonymous vma has not yet been faulted, update new pgoff
* to match new location, to increase its chance of merging.
*/
- if (!vma->vm_file && !vma->anon_vma)
+ if (!vma->vm_file && !vma->anon_vma) {
pgoff = addr >> PAGE_SHIFT;
+ faulted_in_anon_vma = false;
+ }

find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
@@ -2365,8 +2368,23 @@ struct vm_area_struct *copy_vma(struct v
* Source vma may have been merged into new_vma
*/
if (vma_start >= new_vma->vm_start &&
- vma_start < new_vma->vm_end)
+ vma_start < new_vma->vm_end) {
+ /*
+ * The only way we can get a vma_merge with
+ * self during an mremap is if the vma hasn't
+ * been faulted in yet and we were allowed to
+ * reset the dst vma->vm_pgoff to the
+ * destination address of the mremap to allow
+ * the merge to happen. mremap must change the
+ * vm_pgoff linearity between src and dst vmas
+ * (in turn preventing a vma_merge) to be
+ * safe. It is only safe to keep the vm_pgoff
+ * linear if there are no pages mapped yet.
+ */
+ VM_BUG_ON(faulted_in_anon_vma);
*vmap = new_vma;
+ } else
+ anon_vma_moveto_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff -puN mm/mremap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma mm/mremap.c
--- a/mm/mremap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/mm/mremap.c
@@ -225,6 +225,7 @@ static unsigned long move_vma(struct vm_
* which will succeed since page tables still there,
* and then proceed to unmap new area instead of old.
*/
+ anon_vma_moveto_tail(vma);
move_page_tables(new_vma, new_addr, vma, old_addr, moved_len);
vma = new_vma;
old_len = new_len;
diff -puN mm/rmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma mm/rmap.c
--- a/mm/rmap.c~mremap-enforce-rmap-src-dst-vma-ordering-in-case-of-vma_merge-succeeding-in-copy_vma
+++ a/mm/rmap.c
@@ -272,6 +272,51 @@ int anon_vma_clone(struct vm_area_struct
}

/*
+ * Some rmap walks (like migrate and split_huge_page) need to find
+ * all ptes/hugepmds without false negatives while running
+ * concurrently with operations that copy or move pagetables (like
+ * mremap() and fork()). They depend on the anon_vma "same_anon_vma"
+ * list to be in a certain order: the dst_vma must be placed after the
+ * src_vma in the list. This is always guaranteed by fork() but
+ * mremap() needs to call this function to enforce it in case the
+ * dst_vma isn't newly allocated and chained with the anon_vma_clone()
+ * function but just an extension of a pre-existing vma through
+ * vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process's vmas isn't changing (we don't care about other vmas'
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_moveto_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process's vmas in the list, because
+ * they can't alter the order of any vma that belongs to this
+ * process. And there can't be another anon_vma_moveto_tail() running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because fork() only cares that the
+ * parent vmas are placed in the list before the child vmas and
+ * anon_vma_moveto_tail() won't reorder vmas from either the fork()
+ * parent or child.
+ */
+void anon_vma_moveto_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
+/*
* Attach vma to its own anon_vma, as well as to the anon_vmas that
* the corresponding VMA in the parent process is attached to.
* Returns 0 on success, non-zero on failure.
_

Andrea Arcangeli
2011-12-09 01:55:06 UTC
Permalink
Post by Andrew Morton
It's not obvious to me that the patch which I merged is the one which
we want to merge, given the amount of subsequent discussion. Please
check this.
That's not the last version.
Post by Andrew Morton
I'm thinking we merge this into 3.3-rc1, tagged for backporting into
3.2.x. To give us additional time to think about it and test it.
Or perhaps the bug just isn't serious enough to bother fixing it in 3.2
or earlier?
Probably not serious enough, I'm not aware of anybody reproducing it.

Then we've also to think what to do about the i_mmap_mutex, if to
remove it from mremap it too, or if to add it to fork too.

The problem with the i_mmap_mutex is that the prio tree, being a tree,
gives us no way to ensure that the order of the range "walk" matches
the order of "insertion". So a solution like the one below can't work
for the prio tree (it only works for the anon_vma_chain _list_).

Either we loop twice in the rmap_walk (adding a third loop to
vmtruncate) or we add the i_mmap_mutex to fork (where it looks missing,
and the page_mapped check in __delete_from_page_cache can probably fire
if such a race triggers; otherwise it looks like a fairly innocent race,
but clearly the implications aren't obvious or there would be no BUG_ON
in __delete_from_page_cache).

For file mappings the only rmap walk that has to be exact and not
miss any pte is the vmtruncate path. That's why only vmtruncate would
need a third loop (third because we need a first loop before the
pagecache truncation, and two more loops to catch all ptes, or a
temporary, but only temporary, pte can still be mapped and fire the
BUG_ON in __delete_from_page_cache).

For anon pages it's only split_huge_page and remove_migration_ptes
that shouldn't miss ptes/hugepmds.

===
From: Andrea Arcangeli <***@redhat.com>
Subject: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma

migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge succeeds in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_moveto_tail() function to force the dst vma
at the end of the list before mremap starts to solve the problem.

If the mremap is very large and there are a lot of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy practically for
the whole duration of mremap.

Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.

This program exercises the anon_vma_moveto_tail:

===

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef SIZE
/* the original left SIZE to the build; any 2M multiple works */
#define SIZE (256UL*1024*1024)
#endif

int main(void)
{
	char *p, *p2, *p3, *p4;

	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);

	memset(p, 0xff, SIZE);
	printf("%p\n", p);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	if (p4 != p3)
		perror("mremap"), exit(1);
	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
	if (p4 != p+SIZE/2)
		perror("mremap"), exit(1);
	if (memcmp(p, p2, SIZE))
		printf("error\n");
	printf("ok\n");

	return 0;
}
===

$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

You can now use it on all perf tools, such as:

perf record -e probe:anon_vma_moveto_tail -aR sleep 1

$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

Reported-by: Nai Xia <***@gmail.com>
Acked-by: Mel Gorman <***@suse.de>
Signed-off-by: Andrea Arcangeli <***@redhat.com>
---
include/linux/rmap.h | 1 +
mm/mmap.c | 24 +++++++++++++++++++++---
mm/mremap.c | 9 +++++++++
mm/rmap.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..1afb995 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
int anon_vma_prepare(struct vm_area_struct *);
void unlink_anon_vmas(struct vm_area_struct *);
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_moveto_tail(struct vm_area_struct *);
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);

diff --git a/mm/mmap.c b/mm/mmap.c
index eae90af..adea3b8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2322,13 +2322,16 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
struct vm_area_struct *new_vma, *prev;
struct rb_node **rb_link, *rb_parent;
struct mempolicy *pol;
+ bool faulted_in_anon_vma = true;

/*
* If anonymous vma has not yet been faulted, update new pgoff
* to match new location, to increase its chance of merging.
*/
- if (!vma->vm_file && !vma->anon_vma)
+ if (unlikely(!vma->vm_file && !vma->anon_vma)) {
pgoff = addr >> PAGE_SHIFT;
+ faulted_in_anon_vma = false;
+ }

find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
@@ -2337,9 +2340,24 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
/*
* Source vma may have been merged into new_vma
*/
- if (vma_start >= new_vma->vm_start &&
- vma_start < new_vma->vm_end)
+ if (unlikely(vma_start >= new_vma->vm_start &&
+ vma_start < new_vma->vm_end)) {
+ /*
+ * The only way we can get a vma_merge with
+ * self during an mremap is if the vma hasn't
+ * been faulted in yet and we were allowed to
+ * reset the dst vma->vm_pgoff to the
+ * destination address of the mremap to allow
+ * the merge to happen. mremap must change the
+ * vm_pgoff linearity between src and dst vmas
+ * (in turn preventing a vma_merge) to be
+ * safe. It is only safe to keep the vm_pgoff
+ * linear if there are no pages mapped yet.
+ */
+ VM_BUG_ON(faulted_in_anon_vma);
*vmap = new_vma;
+ } else
+ anon_vma_moveto_tail(new_vma);
} else {
new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (new_vma) {
diff --git a/mm/mremap.c b/mm/mremap.c
index d6959cb..87bb839 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -221,6 +221,15 @@ static unsigned long move_vma(struct vm_area_struct *vma,
moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);
if (moved_len < old_len) {
/*
+ * Before moving the page tables from the new vma to
+ * the old vma, we need to be sure the old vma is
+ * queued after new vma in the same_anon_vma list to
+ * prevent SMP races with rmap_walk (that could lead
+ * rmap_walk to miss some page table).
+ */
+ anon_vma_moveto_tail(vma);
+
+ /*
* On error, move entries back from new area to old,
* which will succeed since page tables still there,
* and then proceed to unmap new area instead of old.
diff --git a/mm/rmap.c b/mm/rmap.c
index a4fd368..a2e5ce1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,51 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
}

/*
+ * Some rmap walks (like migrate and split_huge_page) need to find
+ * all ptes/hugepmds without false negatives while running
+ * concurrently with operations that copy or move pagetables (like
+ * mremap() and fork()). They depend on the anon_vma "same_anon_vma"
+ * list to be in a certain order: the dst_vma must be placed after the
+ * src_vma in the list. This is always guaranteed by fork() but
+ * mremap() needs to call this function to enforce it in case the
+ * dst_vma isn't newly allocated and chained with the anon_vma_clone()
+ * function but just an extension of a pre-existing vma through
+ * vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list while it
+ * runs. All we need to enforce is that the relative order of this
+ * process's vmas isn't changing (we don't care about other vmas'
+ * order). Each vma corresponds to an anon_vma_chain structure so
+ * there's no risk that other processes calling anon_vma_moveto_tail()
+ * and changing the same_anon_vma list under mremap() will screw with
+ * the relative order of this process's vmas in the list, because
+ * they can't alter the order of any vma that belongs to this
+ * process. And there can't be another anon_vma_moveto_tail() running
+ * concurrently with mremap() coming from this process because we hold
+ * the mmap_sem for the whole mremap(). fork() ordering dependency
+ * also shouldn't be affected because fork() only cares that the
+ * parent vmas are placed in the list before the child vmas and
+ * anon_vma_moveto_tail() won't reorder vmas from either the fork()
+ * parent or child.
+ */
+void anon_vma_moveto_tail(struct vm_area_struct *dst)
+{
+ struct anon_vma_chain *pavc;
+ struct anon_vma *root = NULL;
+
+ list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = pavc->anon_vma;
+ VM_BUG_ON(pavc->vma != dst);
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&pavc->same_anon_vma);
+ list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+ }
+ unlock_anon_vma_root(root);
+}
+
+/*
* Attach vma to its own anon_vma, as well as to the anon_vmas that
* the corresponding VMA in the parent process is attached to.
* Returns 0 on success, non-zero on failure.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: ***@kvack.org
Andrea Arcangeli
2011-11-04 23:56:03 UTC
Permalink
Post by Hugh Dickins
Post by Andrea Arcangeli
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
if (vma_start >= new_vma->vm_start &&
vma_start < new_vma->vm_end)
+ /*
+ * No need to call anon_vma_order_tail() in
+ * this case because the same PT lock will
+ * serialize the rmap_walk against both src
+ * and dst vmas.
+ */
Really? Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
Right, the vma being the same for src/dst doesn't mean the PT lock is
the same; it might be if the source and destination pte entries fit in
the same page table, but maybe not if the vma is >2M (the max a single
page table can point to).
Post by Hugh Dickins
Post by Andrea Arcangeli
*vmap = new_vma;
+ else
+ anon_vma_order_tail(new_vma);
And if this puts new_vma in the right position for the normal
move_page_tables(), as anon_vma_clone() does in the block below,
aren't they both in exactly the wrong position for the abnormal
move_page_tables(), called to put ptes back where they were if
the original move_page_tables() fails?
Failure paths. Good point, they'd need to be reversed again in that
case.
Post by Hugh Dickins
It might be possible to argue that move_page_tables() can only
fail by failing to allocate memory for pud or pmd, and that (perhaps)
could only happen if the task was being OOM-killed and ran out of
reserves at this point, and if it's being OOM-killed then we don't
mind losing a migration entry for a moment... perhaps.
Hmm no it wouldn't be ok, or I wouldn't want to risk that.
Post by Hugh Dickins
Certainly I'd agree that it's a very rare case. But it feels wrong
to be attempting to fix the already unlikely issue, while ignoring
this aspect, or relying on such unrelated implementation details.
Agreed.
Post by Hugh Dickins
Perhaps some further anon_vma_ordering could fix it up,
but that would look increasingly desperate.
I think what Nai didn't consider in explaining this theoretical race
that I noticed now is the anon_vma root lock taken by vma_adjust.

If the merge succeeds, vma_adjust will take the lock and flush away
from all other CPUs any sign of rmap_walk before the move_page_tables
can start.

So it can't happen that you do rmap_walk, check vma1, mremap moves
stuff from vma2 to vma1 (wrong order), and then rmap_walk continues
checking vma2 where the pte won't be there anymore. It can't happen
because mremap would block in vma_merge waiting for the rmap_walk to
complete before proceeding to move any pte, thanks to the anon_vma
lock already taken by vma_adjust.

So the real fix for the real bug is the one already merged in kernel
v3.1 and we don't need to make any more changes because there is no
race left.

The only bug was the lack of PT lock before checking the pte that
could read the ptes while move_ptes transferred the pte from src_ptep
to kernel stack, and before writing it to dst_ptep. That is closed by
taking the PT lock in migrate before checking if the pte could be a
migrate pte (so flushing move_ptes away from all other CPUs while
migrate checks if a migrate-pte is mapped in the pte).

I don't think the ordering matters anymore. Nai's theory sounded good;
there was just one small detail he missed in the vma_merge internal
locking that prevents the race from triggering.
Post by Hugh Dickins
If we were back in the days of the simple anon_vma list, I'd probably
share your enthusiasm for the list ordering solution; but now it looks
like a fragile and contorted way of avoiding the obvious... we just
need to use the anon_vma_lock (but perhaps there are some common and
easily tested conditions under which we can skip it e.g. when a single
pt lock covers src and dst?).
Actually I thought about this one when I hadn't yet noticed the
vma_merge internal locking that prevents Nai's remaining race from
triggering. And my conclusion is that the anon_vma_chains don't
actually change anything with regard to ordering. It becomes a bit
multidimensional to think about, which complicates things incredibly,
but the ordering issue could have happened before too, and the fix
would have worked for both.

The old anon_vma is like three dimensional (vma, anon_vma, page). Now
it's (vma, chain, anon_vma, page). But if you consider just a single
process execve'd without any child, it returns to three dimensions.
And the moment you add children, you can imagine the old "three
dimension" anon_vma logic to be that of the parent. And if the parent
is safe with all the children's vmas in the same_anon_vma list, then
the children are surely safe to reorder that way too. But hey, it's
not needed, so we're faster, we don't have to do those list searches
during mremap, and it's simpler too :).

Nai Xia
2011-11-05 00:21:03 UTC
Permalink
Post by Andrea Arcangeli
Post by Hugh Dickins
Post by Andrea Arcangeli
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
             */
            if (vma_start >= new_vma->vm_start &&
                vma_start < new_vma->vm_end)
+                   /*
+                    * No need to call anon_vma_order_tail() in
+                    * this case because the same PT lock will
+                    * serialize the rmap_walk against both src
+                    * and dst vmas.
+                    */
Really?  Please convince me: I just do not see what ensures that
the same pt lock covers both src and dst areas in this case.
Right, vma being the same for src/dst doesn't mean the PT lock is the
same, it might be if source pte entry fit in the same pagetable but
maybe not if the vma is >2M (the max a single pagetable can point to).
Post by Hugh Dickins
Post by Andrea Arcangeli
                    *vmap = new_vma;
+           else
+                   anon_vma_order_tail(new_vma);
And if this puts new_vma in the right position for the normal
move_page_tables(), as anon_vma_clone() does in the block below,
aren't they both in exactly the wrong position for the abnormal
move_page_tables(), called to put ptes back where they were if
the original move_page_tables() fails?
Failure paths. Good point, they'd need to be reversed again in that
case.
Post by Hugh Dickins
It might be possible to argue that move_page_tables() can only
fail by failing to allocate memory for pud or pmd, and that (perhaps)
could only happen if the task was being OOM-killed and ran out of
reserves at this point, and if it's being OOM-killed then we don't
mind losing a migration entry for a moment... perhaps.
Hmm no it wouldn't be ok, or I wouldn't want to risk that.
Post by Hugh Dickins
Certainly I'd agree that it's a very rare case.  But it feels wrong
to be attempting to fix the already unlikely issue, while ignoring
this aspect, or relying on such unrelated implementation details.
Agreed.
Post by Hugh Dickins
Perhaps some further anon_vma_ordering could fix it up,
but that would look increasingly desperate.
I think what Nai didn't consider in explaining this theoretical race
that I noticed now is the anon_vma root lock taken by adjust_vma.
If the merge succeeds adjust_vma will take the lock and flush away
from all others CPUs any sign of rmap_walk before the move_page_tables
can start.
So it can't happen that you do rmap_walk, check vma1, mremap moves
stuff from vma2 to vma1 (wrong order), and then rmap_walk continues
checking vma2 where the pte won't be there anymore. It can't happen
because mremap would block in vma_merge waiting the rmap_walk to
complete. Before proceeding moving any pte. Thanks to the anon_vma
lock already taken by adjust_vma.
Still, I think it's not rmap_walk() ---> mremap() ---> rmap_walk() that
triggers the bug, but this sequence of events would:

copy_vma() ---> rmap_walk() scan dst VMA --> move_page_tables() moves src to dst
---> rmap_walk() scan src VMA. :D

I might be wrong. But thank you all for the time and patience in
playing this racing game with me. It's really an honor to exhaust my
mind on a daunting thing with you. :)


Best Regards,

Nai

Nai Xia
2011-11-05 00:59:02 UTC
Permalink
Post by Andrea Arcangeli
I think what Nai didn't consider in explaining this theoretical race
that I noticed now is the anon_vma root lock taken by adjust_vma.
If the merge succeeds adjust_vma will take the lock and flush away
from all others CPUs any sign of rmap_walk before the move_page_tables
can start.
So it can't happen that you do rmap_walk, check vma1, mremap moves
stuff from vma2 to vma1 (wrong order), and then rmap_walk continues
checking vma2 where the pte won't be there anymore. It can't happen
because mremap would block in vma_merge waiting the rmap_walk to
complete. Before proceeding moving any pte. Thanks to the anon_vma
lock already taken by adjust_vma.
Still, I think it's not rmap_walk() ---> mremap() ---> rmap_walk() that
triggers the bug, but this sequence of events would:
copy_vma() ---> rmap_walk() scan dst VMA ---> move_page_tables() moves src to dst
---> rmap_walk() scan src VMA. :D
OK, I think I need to be more concise: your last reasoning only
ensures that mremap as a whole entity cannot interleave with
rmap_walk(). But I think nothing can prevent move_page_tables() from
doing this. As long as copy_vma() gives a wrong ordering, the racing
between rmap_walk() & move_page_tables() afterwards may trigger the
bug.

Do you agree?
I might be wrong. But thank you all for the time and patience for
playing this racing game
with me. It's really an honor to exhaust my mind on a daunting thing
with you. :)
Best Regards,
Nai
Andrea Arcangeli
2011-11-05 01:33:17 UTC
Permalink
Post by Nai Xia
copy_vma() ---> rmap_walk() scan dst VMA --> move_page_tables() moves src to dst
---> rmap_walk() scan src VMA. :D
Hmm yes. I think I got on the wrong track because I focused too much
on that line you started talking about, the *vmap = new_vma; you said
I had to reorder stuff there too, and that didn't make sense.

The reason it doesn't make sense is that it can't be ok to reorder
stuff when *vmap = new_vma (i.e. new_vma = old_vma). So if I didn't
need to reorder in that case I thought I could extrapolate it was
always ok.

But the opposite is true: that case can't be solved.

Can it really happen that vma_merge will pack (prev_vma, new_range,
old_vma) together in a single vma? (i.e. prev_vma extended to
old_vma->vm_end)

Even if there's no prev_vma in the picture (but that's the extreme
case) it cannot be safe: i.e. a (new_range, old_vma) or (old_vma,
new_range).

1 single "vma" for the src and dst virtual ranges means 1 single
vma->vm_pgoff. But we have two virtual addresses and two ptes. So the
same page->index can't work for both if the vma->vm_pgoff is the
same.

So regardless of the ordering here we're dealing with something more
fundamental.

If rmap_walk runs immediately after vma_merge completes and releases
the anon_vma_lock, it won't find any pte in the vma anymore. No matter
the order.

I thought about this before and didn't mention it, but in light of
the above issue I start to think this is the only possible correct
solution to the problem: we should just never call vma_merge before
move_page_tables, and do the merge by hand later after mremap is
complete.

The only safe way to do it is to have _two_ different vmas, with two
different ->vm_pgoff. Then it will work. And by always creating a new
vma we'll always have it queued at the end, and it'll be safe for the
same reasons fork is safe.

Always allocate a new vma, and then after the whole vma copy is
complete, look if we can merge and free some vma. After the fact, so
it means we can't use vma_merge anymore. vma_merge assumes the
new_range is "virtual" and no vma is mapped there I think. Anyway
that's an implementation issue. In some unlikely case we'll allocate 1
more vma than before, and we'll free it once mremap is finished, but
that's small problem compared to solving this once and for all.

And that will fix it without ordering games, and it'll fix the
*vmap = new_vma case too. That case really tripped me up as I was
assuming *that* was correct.

Nai Xia
2011-11-05 02:00:52 UTC
Permalink
Post by Andrea Arcangeli
Post by Nai Xia
copy_vma() ---> rmap_walk() scan dst VMA --> move_page_tables() moves src to dst
--->  rmap_walk() scan src VMA.  :D
Hmm yes. I think I got in the wrong track because I focused too much
on that line you started talking about, the *vmap = new_vma, you said
I had to reorder stuff there too, and that didn't make sense.
Oh, I think you misunderstood me on that. I was just saying:

if (*vmap = new_vma), then _NO_ PTEs need to be moved afterwards,
because the vma has not been faulted at all. Otherwise, it breaks the
page->index semantics in the way I explained in my reply to Hugh.

So nothing needs to be added there, but the reason is the above
reasoning, not the same PT lock covering both...

And for this case alone, I think the proper place to solve it is
outside move_vma() but inside do_mremap(), using only vma_adjust()
and vma_merge() like stuff, because it really does not involve
move_page_tables().
Post by Andrea Arcangeli
The reason it doesn't make sense is that it can't be ok to reorder
stuff when *vmap = new_vma (i.e. new_vma = old_vma). So if I didn't
need to reorder in that case I thought I could extrapolate it was
always ok.
But the opposite is true: that case can't be solved.
Can it really happen that vma_merge will pack (prev_vma, new_range,
old_vma) together in a single vma? (i.e. prev_vma extended to
old_vma->vm_end)
Even if there's no prev_vma in the picture (but that's the extreme
case) it cannot be safe: i.e. a (new_range, old_vma) or (old_vma,
new_range).
1 single "vma" for src and dst virtual ranges, means 1 single
vma->vm_pgoff. But we've two virtual addresses and two ptes. So the
same page->index can't work for both if the vma->vm_pgoff is the
same.
So regardless of the ordering here we're dealing with something more
fundamental.
If rmap_walk runs immediately after vma_merge completes and releases
the anon_vma_lock, it won't find any pte in the vma anymore. No matter
the order.
I thought at this before and I didn't mention it but at the light of
the above issue I start to think this is the only possible correct
solution to the problem. We should just never call vma_merge before
move_page_tables. And do the merge by hand later after mremap is
complete.
The only safe way to do it is to have _two_ different vmas, with two
different ->vm_pgoff. Then it will work. And by always creating a new
vma we'll always have it queued at the end, and it'll be safe for the
same reasons fork is safe.
Always allocate a new vma, and then after the whole vma copy is
complete, look if we can merge and free some vma. After the fact, so
it means we can't use vma_merge anymore. vma_merge assumes the
new_range is "virtual" and no vma is mapped there I think. Anyway
that's an implementation issue. In some unlikely case we'll allocate 1
more vma than before, and we'll free it once mremap is finished, but
that's small problem compared to solving this once and for all.
And that will fix it without ordering games and it'll fix the *vmap=
new_vma case too. That case really tripped on me as I was assuming
*that* was correct.
Yes. "Allocating a new vma, copy first and merge later" seems
another solution without the tricky reordering. But you know,
I now share some of Hugh's feeling that maybe we are too
desperate in playing racing games in places where locks are simpler
and guaranteed to be safe.

But I think Mel indicated that anon_vma locking might be
harmful to JVM SMP performance.
How severe do you expect this to be, Mel?


Thanks

Nai

Mel Gorman
2011-11-07 13:14:13 UTC
Permalink
Post by Nai Xia
<SNIP>
The only safe way to do it is to have _two_ different vmas, with two
different ->vm_pgoff. Then it will work. And by always creating a new
vma we'll always have it queued at the end, and it'll be safe for the
same reasons fork is safe.
Always allocate a new vma, and then after the whole vma copy is
complete, look if we can merge and free some vma. After the fact, so
it means we can't use vma_merge anymore. vma_merge assumes the
new_range is "virtual" and no vma is mapped there I think. Anyway
that's an implementation issue. In some unlikely case we'll allocate 1
more vma than before, and we'll free it once mremap is finished, but
that's small problem compared to solving this once and for all.
And that will fix it without ordering games and it'll fix the *vmap=
new_vma case too. That case really tripped on me as I was assuming
*that* was correct.
Yes. "Allocating a new vma, copy first and merge later " seems
another solution without the tricky reordering. But you know,
I now share some of Hugh's feeling that maybe we are too
desperate using racing in places where locks are simpler
and guaranteed to be safe.
I'm tending to agree. The number of cases that must be kept in mind
is getting too tricky. Taking the anon_vma lock may be slower but at
the risk of sounding chicken, it's easier to understand.
Post by Nai Xia
But I think Mel indicated that anon_vma_locking might be
harmful to JVM SMP performance.
How severe you expect this to be, Mel ?
I would only expect it to be a problem during garbage collection when
there is a greater likelihood that mremap is heavily used. While it
would have been nice to avoid additional overhead in mremap, I don't
think the JVM GC case on its own is sufficient justification to avoid
taking the anon_vma lock.
--
Mel Gorman
SUSE Labs

Andrea Arcangeli
2011-11-07 15:42:35 UTC
Permalink
Post by Mel Gorman
I'm tending to agree. The number of cases that must be kept in mind
is getting too tricky. Taking the anon_vma lock may be slower but at
the risk of sounding chicken, it's easier to understand.
Post by Nai Xia
But I think Mel indicated that anon_vma_locking might be
harmful to JVM SMP performance.
How severe you expect this to be, Mel ?
I would only expect it to be a problem during garbage collection when
there is a greater likelihood that mremap is heavily used. While it
would have been nice to avoid additional overhead in mremap, I don't
think the JVM GC case on its own is sufficient justification to avoid
taking the anon_vma lock.
Adding a one-liner in the error path and a bugcheck in the *vmap case
doesn't seem the end of the world compared to my previous fix that you
acked. I suspect last Friday I was probably confused for a little
while because I was recovering from some flu I picked up with the cold
weather, and the confusion around the vmap case, which I assumed was
safe (not only when no page was faulted in yet), also didn't help.

BTW, with regard to those comments about the human brain being weak,
well, I doubt a monkey brain would work better, so in the absence of
some alien brain which may work better than ours, we should
concentrate and handle it :). The ordering constraint isn't going away
no matter what we do in mremap; fork has the exact same issue, except
it won't require reordering, but my patch documents that.

NOTE: if we could remove _all_ the ordering dependencies between the
vmas pointed to by the anon_vma_chains queued in the same_anon_vma
list and all the rmap_walks, then I would be more inclined to agree on
keeping the simpler way, because then we would stop playing the
ordering games altogether. But regardless of mremap, we'll still be
playing ordering games with fork vs rmap_walk, so we can exploit that
to run a bit faster in mremap too and play the same ordering game
(though I admit it's more complex to play the ordering games in mremap
as it requires 2 more function calls for the vma_merge case), but not
fundamentally different.

Mel Gorman
2011-11-07 16:28:08 UTC
Permalink
Post by Andrea Arcangeli
Post by Mel Gorman
I'm tending to agree. The number of cases that must be kept in mind
is getting too tricky. Taking the anon_vma lock may be slower but at
the risk of sounding chicken, it's easier to understand.
Post by Nai Xia
But I think Mel indicated that anon_vma_locking might be
harmful to JVM SMP performance.
How severe you expect this to be, Mel ?
I would only expect it to be a problem during garbage collection when
there is a greater likelihood that mremap is heavily used. While it
would have been nice to avoid additional overhead in mremap, I don't
think the JVM GC case on its own is sufficient justification to avoid
taking the anon_vma lock.
Adding one liner in the error path and a bugcheck in the *vmap case,
doesn't seem the end of the world compared to my previous fix that you
acked.
Note that I didn't suddenly turn that ack into a nack, although:

1) A small comment on why we need to call anon_vma_moveto_tail in the
error path would be nice

2) It is unfortunate that we need the faulted_in_anon_vma just
for a VM_BUG_ON check that only exists for CONFIG_DEBUG_VM,
but that's not earth shattering

What I said was that taking the anon_vma lock may be slower but it was
generally easier to understand. I'm happy with the new patch too,
particularly as it keeps the "ordering game" consistent for fork
and mremap, but I previously missed move_page_tables in the error
path, so I was worried there was something else I managed to miss,
particularly in light of the "Allocating a new vma, copy first and
merge later" direction.

I'm also perfectly happy with my human meat brain and do not expect
to replace it with an alien's.
--
Mel Gorman
SUSE Labs

Andrea Arcangeli
2011-11-09 01:25:42 UTC
Permalink
Post by Mel Gorman
Note that I didn't suddenly turn that ack into a nack although
:)
Post by Mel Gorman
1) A small comment on why we need to call anon_vma_moveto_tail in the
error path would be nice
I can add that.
Post by Mel Gorman
2) It is unfortunate that we need the faulted_in_anon_vma just
for a VM_BUG_ON check that only exists for CONFIG_DEBUG_VM
but not earth shatting
It should be optimized away at build time. I thought it was better
not to leave that path without a VM_BUG_ON. It should be a slow path
in the first place (probably we should even mark it unlikely). And
it's obscure enough that I think a check will clarify things. In the
common case (i.e. some pte faulted in), if that vma_merge on self
succeeds, it couldn't possibly be safe because the vma->vm_pgoff vs
page->index linearity couldn't be valid for the same vma and the same
page at two different virtual addresses. So checking for it I think is
sane. Especially given at some point it was mentioned we could
optimize away the check altogether, it's a bit of an obscure path
that the VM_BUG_ON I think will help document (and verify).
Post by Mel Gorman
What I said was taking the anon_vma lock may be slower but it was
generally easier to understand. I'm happy with the new patch too
particularly as it keeps the "ordering game" consistent for fork
and mremap but I previously missed move_page_tables in the error
path so was worried if there was something else I managed to miss
particularly in light of the "Allocating a new vma, copy first and
merge later" direction.
I liked that direction a lot. I thought with that we could stick to
the exact same behavior as fork and not need to reorder stuff. But the
error path is still in the way, and we'd have to undo the move in
place without tearing down the vmas. Plus it would have required
writing more code, and the allocation path wouldn't have necessarily
been faster than a reordering if the list is not huge.
Post by Mel Gorman
I'm also prefectly happy with my human meat brain and do not expect
to replace it with an aliens.
8-)

On a totally different but related topic, unmap_mapping_range_tree
walks the prio tree the same way try_to_unmap_file walks it, and if
truncate can truncate "dst" before "src" then supposedly
try_to_unmap_file could miss a migration entry copied into the "child"
ptep while fork runs too... But I think there is no risk there because
we don't establish migration ptes there, and we just unmap the
pagecache, so worst case we'll abort migration if the race triggers and
we'll retry later. But I wonder what happens if truncate runs against
fork: if truncate can drop ptes from dst before src (like the mremap
comment says), we could still end up with some pte mapped to the file
in the ptes of the child, even if the pte was correctly truncated in
the parent...

Overall I think fork/mremap vs fully_reliable_rmap_walk/truncate
aren't fundamentally different in this relation. If we rely on
ordering for anon pages in fork, it's not adding too much mess to also
rely on ordering for mremap. If we take the i_mmap_mutex in mremap
because we can't enforce an order in the prio tree, then we need the
i_mmap_mutex in fork too (and that's missing). But nothing prevents us
from using a lock in mremap and ordering in fork. I think the decision
should be based more on performance expectations.

So we could add the ordering to mremap (patch posted), and add the
i_mmap_mutex to fork, or we add the anon_vma lock in both mremap and
fork, and the i_mmap_lock to fork.

Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.

Keeping the anon and file cases separated is better though. I think
the patch I posted should close the race and be ok, pending your
requested changes. If you think taking the lock is faster it's fine
with me, but I think taking the anon_vma lock once per VMA (plus the
anon_vma_chain list walk) and reducing the per-pagetable locking
overhead is better. Ideally the anon_vma_chain lists won't be long
anyway. And if they are long and lots of processes do mremap at the
same time it should still work better. The anon_vma root lock is not
such a small lock to take and better not to take it repeatedly. I also
recall Andi's patches trying to avoid doing lock/unlock in a tight
loop; if we take it and do some work with it held, that is likely
better than bouncing it at high freq across CPUs for each pmd.

Nai Xia
2011-11-11 09:14:51 UTC
Permalink
Post by Andrea Arcangeli
Post by Mel Gorman
Note that I didn't suddenly turn that ack into a nack although
:)
Post by Mel Gorman
  1) A small comment on why we need to call anon_vma_moveto_tail in the
     error path would be nice
I can add that.
Post by Mel Gorman
  2) It is unfortunate that we need the faulted_in_anon_vma just
     for a VM_BUG_ON check that only exists for CONFIG_DEBUG_VM
     but not earth shatting
It should be optimized away at build time. I thought it was better
not to leave that path without a VM_BUG_ON. It should be a slow path
in the first place (probably we should even mark it unlikely). And
it's obscure enough that I think a check will clarify things. In the
common case (i.e. some pte faulted in), if that vma_merge on self
succeeds, it couldn't possibly be safe, because the vma->vm_pgoff vs
page->index linearity couldn't be valid for the same vma and the same
page at two different virtual addresses. So checking for it I think is
sane. Especially given that at some point it was mentioned we could
optimize away the check altogether, it's a bit of an obscure path
that the VM_BUG_ON I think will help document (and verify).
Post by Mel Gorman
What I said was taking the anon_vma lock may be slower but it was
generally easier to understand. I'm happy with the new patch too
particularly as it keeps the "ordering game" consistent for fork
and mremap but I previously missed move_page_tables in the error
path so was worried if there was something else I managed to miss
particularly in light of the "Allocating a new vma, copy first and
merge later" direction.
I liked that direction a lot. I thought with that we could stick to
the exact same behavior as fork and not need to reorder stuff. But the
error path is still in the way, and we'd have to undo the move in
place without tearing down the vmas. Plus it would have required
writing more code, and the allocation path wouldn't necessarily have
been faster than a reordering if the list is not huge.
Post by Mel Gorman
I'm also perfectly happy with my human meat brain and do not expect
to replace it with an alien's.
8-)
On a totally different but related topic: unmap_mapping_range_tree
walks the prio tree the same way try_to_unmap_file walks it, and if
truncate can truncate "dst" before "src" then supposedly
try_to_unmap_file could miss a migration entry copied into the "child"
ptep while fork runs too... But I think there is no risk there because
we don't establish migration ptes there, and we just unmap the
pagecache, so worst case we'll abort migration if the race triggers
and we'll retry later. But I wonder what happens if truncate runs
against fork: if truncate can drop ptes from dst before src (like the
mremap comment says), we could still end up with some pte mapped to
the file in the ptes of the child, even if the pte was correctly
truncated in the parent...
Overall I think fork/mremap vs fully_reliable_rmap_walk/truncate
aren't fundamentally different in this respect. If we rely on ordering
for anon pages in fork, it's not adding too much mess to also rely on
ordering for mremap. If we take the i_mmap_mutex in mremap because we
can't enforce an order in the prio tree, then we need the i_mmap_mutex
in fork too (and that's missing). But nothing prevents us from using a
lock in mremap and ordering in fork. I think the decision should be
based more on performance expectations.
So we could add the ordering to mremap (patch posted), and add the
i_mmap_mutex to fork, or we add the anon_vma lock in both mremap and
fork, and the i_mmap_lock to fork.
Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.
Oh, well, I had thought that for a partial remap the src and dst VMA
are inserted as different prio tree nodes, instead of being
list_add_tail linked, which means they cannot be reordered back and
forth at all...

Andrea Arcangeli
2011-11-16 14:00:42 UTC
Permalink
Post by Andrea Arcangeli
Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.
I'm not optimistic we can enforce ordering there. Being a tree it's
walked in range order.

I thought of another solution that would avoid having to reorder the
list in mremap and avoid the i_mmap_mutex being added to fork (and
then we can remove it from mremap too). The solution is to rmap_walk
twice: I mean two loops over the same_anon_vma for those rmap walks
that must be reliable (that includes the two calls of
unmap_mapping_range). For both same_anon_vma and the prio tree.

Reading truncate_pagecache I see two loops already and a comment
saying it's for fork(), to avoid leaking ptes in the child. So fork is
probably ok already without having to take the i_mmap_mutex, but then
I wonder why that also doesn't fix mremap if we do two loops there and
why that i_mmap_mutex is really needed in mremap considering those two
calls already present in truncate_pagecache. I wonder if that was a
"theoretical" fix that missed the fact that truncate already walks the
prio tree twice, so it doesn't matter if the rmap_walk goes in the
opposite direction of move_page_tables? That i_mmap_lock in mremap
(now i_mmap_mutex) has been there since the start of git history. The
double loop was introduced in d00806b183152af6d24f46f0c33f14162ca1262a.
So it's very possible that i_mmap_mutex is now useless (after
d00806b183152af6d24f46f0c33f14162ca1262a), the fix for fork was
already taking care of mremap too, and that i_mmap_mutex can now be
removed.

Hugh Dickins
2011-11-17 00:16:57 UTC
Permalink
Post by Andrea Arcangeli
Post by Andrea Arcangeli
Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.
I'm not optimistic we can enforce ordering there. Being a tree it's
walked in range order.
I thought of another solution that would avoid having to reorder the
list in mremap and avoid the i_mmap_mutex to be added to fork (and
then we can remove it from mremap too). The solution is to rmap_walk
twice. I mean two loops over the same_anon_vma for those rmap walks
that must be reliable (that includes two calls of
unmap_mapping_range). For both same_anon_vma and prio tree.
Reading truncate_pagecache I see two loops already and a comment
saying it's for fork(), to avoid leaking ptes in the child. So fork is
probably ok already without having to take the i_mmap_mutex, but then
I wonder why that also doesn't fix mremap if we do two loops there and
why that i_mmap_mutex is really needed in mremap considering those two
calls already present in truncate_pagecache. I wonder if that was a
"theoretical" fix that missed the fact truncate already walks the prio
tree twice, so it doesn't matter if the rmap_walk goes in the opposite
direction of move_page_tables? That i_mmap_lock in mremap (now
i_mmap_mutex) is there since start of git history. The double loop was
introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's very
possible that i_mmap_mutex is now useless (after
d00806b183152af6d24f46f0c33f14162ca1262a) and the fix for fork, was
already taking care of mremap too and that i_mmap_mutex can now be
removed.
As you found, the mremap locking long predates truncation's double unmap.

That's an interesting point, and you may be right - though, what about
the *very* unlikely case where unmap_mapping_range looks at new vma
when pte is in old, then at old vma when pte is in new, then
move_page_tables runs out of memory and cannot complete, then the
second unmap_mapping_range looks at old vma while pte is still in new
(I guess this needs some other activity to have jumbled the prio_tree,
and may just be impossible), then at new (to be abandoned) vma after
pte has moved back to old.

Probably not an everyday occurrence :)

But, setting that aside, I've always thought of that second call to
unmap_mapping_range() as a regrettable expedient that we should try
to eliminate e.g. by checking for private mappings in the first pass,
and skipping the second call if there were none.

But since nobody ever complained about that added overhead, I never
got around to bothering; and you may consider the i_mmap_mutex in
move_ptes a more serious unnecessary overhead.

By the way, you mention "a comment saying it's for fork()": I don't
find "fork" anywhere in mm/truncate.c, my understanding is in this
comment (probably mine) from truncate_pagecache():

/*
* unmap_mapping_range is called twice, first simply for
* efficiency so that truncate_inode_pages does fewer
* single-page unmaps. However after this first call, and
* before truncate_inode_pages finishes, it is possible for
* private pages to be COWed, which remain after
* truncate_inode_pages finishes, hence the second
* unmap_mapping_range call must be made for correctness.
*/

The second call was not (I think) necessary when we relied upon
truncate_count, but became necessary once Nick relied upon page lock
(the page lock on the file page providing no guarantee for the COWed
page).

Hugh

Nai Xia
2011-11-17 02:49:24 UTC
Permalink
Post by Hugh Dickins
Post by Andrea Arcangeli
Post by Andrea Arcangeli
Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.
I'm not optimistic we can enforce ordering there. Being a tree it's
walked in range order.
I thought of another solution that would avoid having to reorder the
list in mremap and avoid the i_mmap_mutex to be added to fork (and
then we can remove it from mremap too). The solution is to rmap_walk
twice. I mean two loops over the same_anon_vma for those rmap walks
that must be reliable (that includes two calls of
unmap_mapping_range). For both same_anon_vma and prio tree.
Reading truncate_pagecache I see two loops already and a comment
saying it's for fork(), to avoid leaking ptes in the child. So fork is
probably ok already without having to take the i_mmap_mutex, but then
I wonder why that also doesn't fix mremap if we do two loops there and
why that i_mmap_mutex is really needed in mremap considering those two
calls already present in truncate_pagecache. I wonder if that was a
"theoretical" fix that missed the fact truncate already walks the prio
tree twice, so it doesn't matter if the rmap_walk goes in the opposite
direction of move_page_tables? That i_mmap_lock in mremap (now
i_mmap_mutex) is there since start of git history. The double loop was
introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's very
possible that i_mmap_mutex is now useless (after
d00806b183152af6d24f46f0c33f14162ca1262a) and the fix for fork, was
already taking care of mremap too and that i_mmap_mutex can now be
removed.
As you found, the mremap locking long predates truncation's double unmap.
That's an interesting point, and you may be right - though, what about
the *very* unlikely case where unmap_mapping_range looks at new vma
when pte is in old, then at old vma when pte is in new, then
move_page_tables runs out of memory and cannot complete, then the
second unmap_mapping_range looks at old vma while pte is still in new
(I guess this needs some other activity to have jumbled the prio_tree,
and may just be impossible), then at new (to be abandoned) vma after
pte has moved back to old.
Probably not an everyday occurrence :)
But, setting that aside, I've always thought of that second call to
unmap_mapping_range() as a regrettable expedient that we should try
to eliminate e.g. by checking for private mappings in the first pass,
and skipping the second call if there were none.
But since nobody ever complained about that added overhead, I never
got around to bothering; and you may consider the i_mmap_mutex in
move_ptes a more serious unnecessary overhead.
By the way, you mention "a comment saying it's for fork()": I don't
find "fork" anywhere in mm/truncate.c, my understanding is in this
I think you guys are talking about two different COWs:

Andrea's question is whether a new VMA created by fork() between
the two loops could have PTEs copied into it.

And you are referring to new PTEs getting COWed by __do_fault() in
the same VMA before the cache pages are really dropped.

From my point of view, the two loops there are really fork()
irrelevant; as you said, they are only for COWed ptes missed in the
same VMA before a cache page becomes invisible to find_get_page().

As for Andrea's reasoning, I see the racing story as follows:

1. fork() is safe without the tree lock/mutex after the second loop,
for the same reason it's safe for try_to_unmap_file: the new VMA is
linked at the list tail in the *same* tree node as the old VMA in the
vma prio_tree. The old and new are traversed by vma_prio_tree_foreach()
in a proper order. And fork() does not include an error path requiring
a backward page table copy, which would need a reverse order.

2. A partial mremap is not safe for this without the tree lock/mutex,
because the src and dst VMA are different prio_tree nodes, and their
relative order is not guaranteed.



Nai
Post by Hugh Dickins
/*
* unmap_mapping_range is called twice, first simply for
* efficiency so that truncate_inode_pages does fewer
* single-page unmaps. However after this first call, and
* before truncate_inode_pages finishes, it is possible for
* private pages to be COWed, which remain after
* truncate_inode_pages finishes, hence the second
* unmap_mapping_range call must be made for correctness.
*/
The second call was not (I think) necessary when we relied upon
truncate_count, but became necessary once Nick relied upon page lock
(the page lock on the file page providing no guarantee for the COWed
page).
Hugh
Nai Xia
2011-11-17 06:21:56 UTC
Permalink
Post by Hugh Dickins
Post by Andrea Arcangeli
Post by Andrea Arcangeli
Also note, if we find a way to enforce orderings in the prio tree (not
sure if it's possible, apparently it's already using list_add_tail
so..), then we could also remove the i_mmap_lock from mremap and fork.
I'm not optimistic we can enforce ordering there. Being a tree it's
walked in range order.
I thought of another solution that would avoid having to reorder the
list in mremap and avoid the i_mmap_mutex to be added to fork (and
then we can remove it from mremap too). The solution is to rmap_walk
twice. I mean two loops over the same_anon_vma for those rmap walks
that must be reliable (that includes two calls of
unmap_mapping_range). For both same_anon_vma and prio tree.
Reading truncate_pagecache I see two loops already and a comment
saying it's for fork(), to avoid leaking ptes in the child. So fork is
probably ok already without having to take the i_mmap_mutex, but then
I wonder why that also doesn't fix mremap if we do two loops there and
why that i_mmap_mutex is really needed in mremap considering those two
calls already present in truncate_pagecache. I wonder if that was a
"theoretical" fix that missed the fact truncate already walks the prio
tree twice, so it doesn't matter if the rmap_walk goes in the opposite
direction of move_page_tables? That i_mmap_lock in mremap (now
i_mmap_mutex) is there since start of git history. The double loop was
introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's very
possible that i_mmap_mutex is now useless (after
d00806b183152af6d24f46f0c33f14162ca1262a) and the fix for fork, was
already taking care of mremap too and that i_mmap_mutex can now be
removed.
As you found, the mremap locking long predates truncation's double unmap.
That's an interesting point, and you may be right - though, what about
the *very* unlikely case where unmap_mapping_range looks at new vma
when pte is in old, then at old vma when pte is in new, then
move_page_tables runs out of memory and cannot complete, then the
second unmap_mapping_range looks at old vma while pte is still in new
(I guess this needs some other activity to have jumbled the prio_tree,
and may just be impossible), then at new (to be abandoned) vma after
pte has moved back to old.
I think this cannot happen either with proper ordering or with the
tree lock, and Andrea was asking whether the two-loop setup can avoid
taking the tree lock in mremap().

So a simple answer would be: no, the two-loop setup does not aim at
solving the PTE copy race in fork() (it's lucky there, though), so it
cannot solve the problem for mremap either.
Post by Hugh Dickins
Probably not an everyday occurrence :)
But, setting that aside, I've always thought of that second call to
unmap_mapping_range() as a regrettable expedient that we should try
to eliminate e.g. by checking for private mappings in the first pass,
and skipping the second call if there were none.
Don't you think this is only a partial solution? Given that
truncate_inode_page() does not shoot down COWed ptes, the zap of the
ptes and of the cache pages is not atomic anyway, so the second pass
seems unavoidable in the general case....

Of course, if you gave truncate_inode_page() an option to unmap COWed
ptes, the second pass might not be needed, but then you'd worry about
the performance.... a real dilemma, isn't it? :)
Post by Hugh Dickins
But since nobody ever complained about that added overhead, I never
got around to bothering; and you may consider the i_mmap_mutex in
move_ptes a more serious unnecessary overhead.
By the way, you mention "a comment saying it's for fork()": I don't
find "fork" anywhere in mm/truncate.c, my understanding is in this
       /*
        * unmap_mapping_range is called twice, first simply for
        * efficiency so that truncate_inode_pages does fewer
        * single-page unmaps.  However after this first call, and
        * before truncate_inode_pages finishes, it is possible for
        * private pages to be COWed, which remain after
        * truncate_inode_pages finishes, hence the second
        * unmap_mapping_range call must be made for correctness.
        */
The second call was not (I think) necessary when we relied upon
truncate_count, but became necessary once Nick relied upon page lock
(the page lock on the file page providing no guarantee for the COWed
page).
Hmm, yes, do_wp_page() does not take the page lock when doing COW
(only the PTE lock), but I think another critical reason for the
second pass is that nothing can prevent a just-zapped pte from taking
a write fault again and getting COWed in __do_fault(), just *before*
truncate_inode_pages() can take its page lock... so even if we bring
do_wp_page() under control of the page lock, the second pass is still
needed, right?

Nai
Post by Hugh Dickins
Hugh
Andrea Arcangeli
2011-11-17 18:42:52 UTC
Permalink
Hi Hugh,
Post by Hugh Dickins
As you found, the mremap locking long predates truncation's double unmap.
That's an interesting point, and you may be right - though, what about
the *very* unlikely case where unmap_mapping_range looks at new vma
when pte is in old, then at old vma when pte is in new, then
move_page_tables runs out of memory and cannot complete, then the
second unmap_mapping_range looks at old vma while pte is still in new
(I guess this needs some other activity to have jumbled the prio_tree,
and may just be impossible), then at new (to be abandoned) vma after
pte has moved back to old.
I tend to think it should still work fine. The second loop is needed
to take care of the "reverse" order. If the first move_page_tables is
not in order, the second move_page_tables will be in order, so it
should catch it. If the first move_page_tables is in order, the double
loop will catch any skip in the second move_page_tables.

Well, if I'm missing something, worst case we'd need a dummy
mutex_lock/unlock of the i_mmap_mutex before running the rolling-back
move_page_tables; no big deal, still out of the fast path.
Post by Hugh Dickins
But since nobody ever complained about that added overhead, I never
got around to bothering; and you may consider the i_mmap_mutex in
move_ptes a more serious unnecessary overhead.
The point is that if there's no solution to fix truncate by removing
the double loop for the other reasons, so we could take advantage of
the double loop in mremap too (adding proper comment to truncate.c of
course).
Post by Hugh Dickins
By the way, you mention "a comment saying it's for fork()": I don't
find "fork" anywhere in mm/truncate.c, my understanding is in this
/*
* unmap_mapping_range is called twice, first simply for
* efficiency so that truncate_inode_pages does fewer
* single-page unmaps. However after this first call, and
* before truncate_inode_pages finishes, it is possible for
* private pages to be COWed, which remain after
* truncate_inode_pages finishes, hence the second
* unmap_mapping_range call must be made for correctness.
*/
The second call was not (I think) necessary when we relied upon
truncate_count, but became necessary once Nick relied upon page lock
(the page lock on the file page providing no guarantee for the COWed
page).
I see. Truncate locks down the page while it shoots down the pte, so
no new mapping can be established, while the COWs still can happen
because they don't take the lock on the old page. But do_wp_page takes
the lock for anon pages and MAP_SHARED. It's a little weird that it
doesn't take it for MAP_PRIVATE (i.e. VM_SHARED not set). MAP_SHARED
already does the check for page->mapping being NULL after the lock is
obtained.

The double loop happens to make fork safe too, or the inverse ordering
between truncate and fork would lead to the same issue, and that will
also map pagecache (not just anon COWs). I don't see lock_page in
fork; it just copies the pte and doesn't mangle the page lock.

Note however that for a tiny window, with the current truncate code
that does unmap+truncate+unmap, there can still be a pte in the fork
child that points to an orphaned pagecache page (before the second
call of unmap_mapping_range starts). It'd be a transient pte; it'll be
dropped as soon as the second unmap_mapping_range runs. Not sure how
bad that is. To avoid it we'd need to run unmap+unmap+truncate. That
way no pte in fork could map an orphaned pagecache page anymore. But
then the second unmap wouldn't take down the COWs generated by
do_wp_page in MAP_PRIVATE areas anymore.

So it boils down to whether we are ok with a transient pte mapping an
orphaned pagecache page for a little while. The only problem I can see
is that writes would then be discarded without triggering SIGBUS
beyond the end of i_size on MAP_SHARED. But if the write from the
other process (or thread) had happened a millisecond before, it would
be discarded anyway. So I guess it's not a problem, and it's mostly an
implementation issue whether there could be any code that won't like a
pte pointing to an orphaned pagecache page for a little while. I'm
optimistic it can be made safe and we can just drop the i_mmap_mutex
completely from mremap after checking that those transient ptes
mapping orphaned pagecache won't trigger asserts.

As for the anon_vma my ordering patch (last version I posted) fixes it
already. The other way is to add double loops. Or the anon_vma->lock
of course!

If we go with double loops for anon_vma, with split_huge_page I could
unlink any anon_vma_chain where the address range matches but the
pte/pmd is not found, and re-check in the second loop _only_ those
anon_vma_chains where we failed to find a mapping. I've only thought
about it, not actually attempted to implement it. Even rmap_walk could
do that, but it requires changes to the callers (i.e. migrate.c),
while for split_huge_page it'd be a simpler local change. Then I would
relink the re-checked anon_vma_chains with list_splice. The whole list
is protected by the root anon_vma lock, which is held for the whole
duration of split_huge_page, so I guess it should be doable.

The rmap_walks of filebacked mappings won't need any double loop
(only migrate and split_huge_page will need it), because neither
remove_migration_ptes nor split_huge_page runs on filebacked mappings:
migration ptes and hugepage splits only happen for anon memory. And
nothing would prevent adding double loops there too if we extend
split_huge_page to pagecache (we already double loop in truncate).

Nai, if the prio tree could guarantee ordering, 1) there would be no
i_mmap_lock I guess, or there would be a comment that it's only for
the vma_merge case and the error path that goes in reverse order; 2)
if you were right that list_add_tail in the prio tree, with both src
and dst vmas in the same node, guarantees ordering, it would imply the
prio tree works in O(N), and that can't be or we'd use a list instead
of a prio tree. The whole idea of any structure smarter than a list is
to insert things in some "order" that depends on the index (the index
is the vm_start,vm_end range in the prio tree case) and do some "work"
at insert time so the walk can be faster, but that practically
guarantees the walk won't be in the same order as the insertions.

If the prio tree could guarantee ordering, then I could also reorder
the prio tree, extending my patch that already fixes the anon_vma
case, and still avoid the i_mmap_mutex without requiring double loops.

So in short.

1) for anon I'm not sure whether it's better to use my current patch,
which fixes the anon case just fine, or to go with double loops in
split_huge_page/migrate, or to add the anon_vma lock around
move_page_tables.

2) for filebacked, if we can deal with the transient pte on orphaned
pagecache, we can just add a comment to truncate.c and drop the
i_mmap_mutex.

Nai Xia
2011-11-18 01:42:05 UTC
Permalink
Post by Paweł Sikora
Hi Hugh,
Post by Hugh Dickins
As you found, the mremap locking long predates truncation's double unmap.
That's an interesting point, and you may be right - though, what about
the *very* unlikely case where unmap_mapping_range looks at new vma
when pte is in old, then at old vma when pte is in new, then
move_page_tables runs out of memory and cannot complete, then the
second unmap_mapping_range looks at old vma while pte is still in new
(I guess this needs some other activity to have jumbled the prio_tree,
and may just be impossible), then at new (to be abandoned) vma after
pte has moved back to old.
I tend to think it should still work fine. The second loop is needed
to take care of the "reverse" order. If the first move_page_tables is
not in order the second move_page_tables will be in order. So it
should catch it. If the first move_page_tables is in order, the double
loop will catch any skip in the second move_page_tables.
First of all, I believe that at the POSIX level it's OK for
truncate_inode_page() not to scan COWed pages, since we basically do
not provide any guarantee for privately mapped file pages in this
respect. But missing a file-mapped pte after its cache page is already
removed from the page cache is a fundamental malfunction for a shared
mapping: some threads see that the file cache page is gone while some
thread is still reading/writing from/to it! No matter how short the
gap between truncate_inode_page() and the second loop, this is wrong.

Second, even if we don't care about the POSIX flaw this may introduce,
a pte can still be missed by the second loop. mremap can happen
several times during these non-atomic firstpass-trunc-secondpass
operations; the right sequence of events can happily produce the wrong
order for every scan, and miss them all -- that's just what was in
Hugh's mind in the post you just replied to. Without a lock and proper
ordering (which a partial mremap cannot provide), this *will* happen.

You may disagree with me and have that locking removed; I already have
that one-line patch prepared, waiting for the bug to bump up again --
what a cheap patch submission!

:P


Thanks,

Nai

Andrea Arcangeli
2011-11-18 02:17:14 UTC
Permalink
Post by Nai Xia
First of all, I believe that at the POSIX level, it's ok for
truncate_inode_page()
not scanning COWed pages, since basically we do not provide any guarantee
for privately mapped file pages for this behavior. But missing a file
mapped pte after its
cache page is already removed from the page cache is a
I also rule out that there is a case that would break, but it's safer
to keep things as is, in case somebody depends on segfault trapping.
Post by Nai Xia
fundamental malfunction for
a shared mapping when some threads see the file cache page is gone
while some thread
is still r/w from/to it! No matter how short the gap between
truncate_inode_page() and
the second loop, this is wrong.
Truncate will destroy the info on disk too... so if somebody is
writing to a mapping which points beyond the end of i_size
concurrently with truncate, the result is undefined. The write may
well reach the page, but then the page is discarded. Or you may get
SIGBUS before the write.
Post by Nai Xia
Second, even if we don't care about the POSIX flaw this may
introduce, a pte can still be
missed by the second loop. mremap can happen several times during
these non-atomic
firstpass-trunc-secondpass operations; the right sequence of events
can happily produce the wrong order
for every scan, and miss them all -- that's just what was in Hugh's
mind in the post you just
replied to. Without a lock and proper ordering (which a partial mremap
cannot provide), this *will* happen.
There won't be more than one mremap running concurrently in the same
process (we must enforce it by making sure the anon_vma lock and
i_mmap_lock are both taken at least once in copy_vma; they're already
both taken in fork, and they should already be taken in all common
cases in copy_vma, so in those cases it's going to be an L1-exclusive
cacheline already). I don't exclude that there may be some case that
won't take the locks in vma_adjust though; we should check that if we
decide to rely on the double loop, but it'd be a simple addition if
needed.

I'm more concerned about the pte pointing to the orphaned pagecache
that would materialize for a little while because of
unmap+truncate+unmap instead of unmap+unmap+truncate (but the former
order is needed for the COWs).
Post by Nai Xia
You may disagree with me and have that locking removed, and I am
already have that
one line patch prepared waiting fora bug bumpping up again, what a
cheap patch submission!
Well I'm not yet sure it's good idea to remove the i_mmap_mutex, or if
we should just add the anon_vma lock in mremap and add the i_mmap_lock
in fork (to avoid the orphaned pagecache left mapped in the child
which already may happen unless there's some i_mmap_lock belonging to
the same inode taken after copy_page_range returns until we return to
userland and child can run, and I don't think we can relay on the
order of the prio tree in fork. Fork is safe for anon pages because
there we can relay on the order of the same_anon_vma list.

I think clearing up whether this orphaned pagecache is dangerous would be a
good start. If too complex, we just add the i_mmap_lock around
copy_page_range in fork if vma->vm_file is set. If you instead think
we can deal with the orphaned pagecache, we can add a dummy lock/unlock
of i_mmap_mutex in the copy_vma vma_merge-succeeding case (a short critical
section and not a common case) and remove the i_mmap_mutex around
move_page_tables (the common case), overall speeding up mremap and not
degrading fork.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Nai Xia
2011-11-19 09:15:10 UTC
Permalink
Post by Andrea Arcangeli
Post by Nai Xia
First of all, I believe that at the POSIX level it's ok for
truncate_inode_page()
not to scan COWed pages, since basically we do not provide any guarantee
for privately mapped file pages for this behavior. But missing a file
mapped pte after its
cache page is already removed from the page cache is a
I also exclude there is a case that would break, but it's safer to
keep things as is, in case somebody depends on segfault trapping.
Post by Nai Xia
fundamental malfunction for
a shared mapping when some threads see the file cache page gone
while some thread
is still reading/writing from/to it! No matter how short the gap between
truncate_inode_page() and
the second loop, this is wrong.
Truncate will destroy the info on disk too... so if somebody is
writing to a mapping which points beyond the end of the i_size
concurrently with truncate, the result is undefined. The write may
well reach the page but then the page is discarded. Or you may get
SIGBUS before the write.
Post by Nai Xia
Second, even if we don't care about the POSIX flaw this may
introduce, a pte can still be
missed by the second loop. mremap can happen several times during
these non-atomic
firstpass-truncate-secondpass operations; an adverse ordering of events can
happily occur
on every scan and miss them all -- that's just what was in Hugh's mind
in the post you just
replied to. Without locks and proper ordering (which partial mremap cannot provide),
this *will* happen.
There won't be more than one mremap running concurrently from the same
process (we must enforce it by making sure anon_vma lock and
i_mmap_lock are both taken at least once in copy_vma, they're already
both taken in fork, they should already be taken in all common cases
in copy_vma so for all cases it's going to be a L1 exclusive cacheline
already). I don't exclude there may be some case that won't take the
locks in vma_adjust though; we should check it if we decide to rely
on the double loop, but it'd be a simple addition if needed.
I mean it's not concurrent mremap; it's that mremap() can be done several
times between these 3-stage scans. Since we don't take the mmap_sem
of the scanned VMAs, it is valid to do so. And without proper ordering
and locks/mutexes it's possible for these 3-stage scans to race with these
mremap()s, and a ghost PTE just jumps back and forth and misses all
these scans.
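The interleaving described above can be sketched with a toy model (plain Python, not kernel code; the two addresses, the mremap_move helper, and the interleaving points are all illustrative assumptions): a pte that hops between two addresses while an unlocked scan walks them in order is missed by every pass.

```python
def mremap_move(pagetable, old, new):
    # Toy stand-in for move_ptes(): relocate the pte if it sits at `old`.
    if old in pagetable:
        pagetable[new] = pagetable.pop(old)

def scan_pass(pagetable, addrs, interleave):
    # Unlocked scan: visit addresses in order; a concurrent mremap
    # (modeled by `interleave`) may run just before each check.
    found = []
    for addr in addrs:
        interleave(addr)
        if addr in pagetable:
            found.append(addr)
    return found

pagetable = {"A": "pte"}

def ghost(addr):
    # The "ghost" pte: hop away from the address about to be checked.
    if addr == "A":
        mremap_move(pagetable, "A", "B")
    else:
        mremap_move(pagetable, "B", "A")

first_pass = scan_pass(pagetable, ["A", "B"], ghost)
# truncate_inode_page() would run between the two passes in the real code.
second_pass = scan_pass(pagetable, ["A", "B"], ghost)
assert first_pass == []               # pass 1 misses the pte
assert second_pass == []              # pass 2 misses it too
assert pagetable == {"A": "pte"}      # the pte survived both passes
```

With the pte pinned in place for the duration of a pass (the effect of holding the appropriate lock across the scan), either pass would have found it.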
Post by Andrea Arcangeli
I'm more concerned about the pte pointing to the orphaned pagecache
that would materialize for a little while because of
unmap+truncate+unmap instead of unmap+unmap+truncate (but the latter
order is needed for the COWs).
Post by Nai Xia
You may disagree with me and have that locking removed, and I
already have that
one-line patch prepared, waiting for the bug to bump up again: what a
cheap patch submission!
Well I'm not yet sure it's a good idea to remove the i_mmap_mutex, or if
we should just add the anon_vma lock in mremap and add the i_mmap_lock
in fork (to avoid the orphaned pagecache left mapped in the child,
which already may happen unless there's some i_mmap_lock belonging to
the same inode taken after copy_page_range returns, until we return to
userland and the child can run), and I don't think we can rely on the
order of the prio tree in fork. Fork is safe for anon pages because
there we can rely on the order of the same_anon_vma list.
I think clearing up whether this orphaned pagecache is dangerous would be a
good start. If too complex, we just add the i_mmap_lock around
copy_page_range in fork if vma->vm_file is set. If you instead think
we can deal with the orphaned pagecache, we can add a dummy lock/unlock
of i_mmap_mutex in the copy_vma vma_merge-succeeding case (a short critical
section and not a common case) and remove the i_mmap_mutex around
move_page_tables (the common case), overall speeding up mremap and not
degrading fork.
I actually feel comfortable with either direction you take :)

But I do think orphaned pagecache is not a good idea,
don't you see there is a "BUG_ON(page_mapped(page))"
in __delete_from_page_cache()? Do you really plan to
remove this line?

Nai

Nai Xia
2011-10-22 05:07:11 UTC
Permalink
Post by Andrea Arcangeli
Post by Mel Gorman
Post by Nai Xia
Post by Andrea Arcangeli
Post by Hugh Dickins
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
For things like migrate and split_huge_page, the anon_vma layer must
guarantee the page is reachable by rmap walk at all times regardless
if it's at the old or new address.
This shall be guaranteed by the copy_vma called by move_vma well
before move_page_tables/move_ptes can run.
copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chains structures (vma_link does that). That before
any pte can be moved.
Because we keep two vmas mapped on both src and dst range, with
different vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index) the page should always find _all_ its pte at
any given time.
There may be other variables at play like the order of insertion in
the anon_vma chain matches our direction of copy and removal of the
old pte. But I think the double locking of the PT lock should make the
order in the anon_vma chain absolutely irrelevant (the rmap_walk
obviously takes the PT lock too), and furthermore likely the
anon_vma_chain insertion is favorable (the dst vma is inserted last
and checked last). But it shouldn't matter.
I happened to be reading this code last week.
And I do think this order matters; the reason is quite similar to why we
If rmap_walk goes dst--->src, then when it first looks into dst, ok, the
You might be right in that the ordering matters. We do link new VMAs at
Yes I also think ordering matters as I mentioned in the previous email
that Nai answered to.
Post by Mel Gorman
the end of the list in anon_vma_chain_list so remove_migrate_ptes should
be walking from src->dst.
Correct. Like I mentioned in that previous email that Nai answered,
that wouldn't be ok only if vma_merge succeeds and I didn't change my mind
about that...
copy_vma is only called by mremap so supposedly that path can
trigger. Looks like I was wrong about vma_merge being able to succeed
in copy_vma, and if it does I still think it's a problem as we have no
ordering guarantee.
The only other place that depends on the anon_vma_chain order is fork,
and there, no vma_merge can happen, so that is safe.
Post by Mel Gorman
If remove_migrate_pte finds src first, it will remove the pte and the
correct version will get copied. If move_ptes runs between when
remove_migrate_ptes moves from src to dst, then the PTE at dst will
still be correct.
The problem is rmap_walk will search dst before src. So it will do
nothing on dst. Then mremap moves the pte from src to dst. When rmap
walk then checks "src" it finds nothing again.
Post by Mel Gorman
Post by Nai Xia
pte is not there, and it happily skips it and releases the PTL.
Then just before it looks into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. And then when rmap_walk() looks
into src, it will find an empty pte again. The pte is still there,
but rmap_walk() missed it!
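This order dependence can be sketched the same way in plain Python (a toy model; the chain list standing in for anon_vma_chain, the "migration_entry" marker, and the single interleaving point are illustrative assumptions): if the walk visits dst before src, one well-timed move_ptes() hides the pte from both checks.

```python
def move_ptes(pagetable):
    # Toy mremap: move the pte (here, a migration entry) from src to dst.
    pagetable["dst"] = pagetable.pop("src")

def rmap_walk(chain, pagetable):
    # Visit the vmas in chain order, looking for the migration entry.
    found = []
    for vma in chain:
        if pagetable.get(vma) == "migration_entry":
            found.append(vma)
        if vma == "dst" and "src" in pagetable:
            # Interleaving point: mremap slips in right after dst is checked.
            move_ptes(pagetable)
    return found

# Bad ordering (dst walked before src): the entry is missed entirely.
assert rmap_walk(["dst", "src"], {"src": "migration_entry"}) == []
# Good ordering (src walked first): the entry is found before it can move.
assert rmap_walk(["src", "dst"], {"src": "migration_entry"}) == ["src"]
```

This is why the append-last linking of new vmas normally protects the walk: the mremap source is older and therefore visited first.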
I believe the ordering is correct though and protects us in this case.
Normally it is, the only problem is vma_merge succeeding I think.
Post by Mel Gorman
Post by Nai Xia
IMO, this can really happen in the case of vma_merge() succeeding.
Imagine that the src vma is faulted late and in anon_vma_prepare()
it gets the same anon_vma as an existing vma (named evil_vma) through
find_mergeable_anon_vma(). This can potentially make the vma_merge() in
copy_vma() return evil_vma on some new relocation request. But src_vma
is really linked _after_ evil_vma/new_vma/dst_vma.
In this way, the ordering protocol of the anon_vma chain is broken.
This should be a rare case, because I think in most cases
if two VMAs pass reusable_anon_vma() they were already merged.
What do you think?
I tried to understand the above scenario yesterday but with 12 hours
of travel on me I just couldn't.
Oh yes, the first hypothesis was actually a vague feeling that things
might go wrong in that direction. The details in it were somewhat
misleading. But following that direction, I found the 2nd clear
hypothesis that leads to this bug step by step.
Post by Andrea Arcangeli
part of a vma is moved with mremap elsewhere. Then it is moved back to
its original place. So then vma_merge will succeed, and the "src" of
mremap is now queued last in anon_vma_chain, wrong ordering.
Oh, yes, partial mremapping will do the trick. I was too absorbed in finding
a case where two VMAs missed a normal merge chance but would merge later
on. The only thing I can find by now is the ENOMEM path in vma_adjust().

Partial mremapping is a simpler case and definitely more likely to happen.
Post by Andrea Arcangeli
Today I read an email from Nai who showed apparently the same scenario
I was thinking, without evil vmas or stuff.
I have a hard time imagining a vma_merge succeeding on a vma that
isn't going back to its original place. The vm_pgoff + vma->anon_vma
checks should keep some linearity, so going back to the original place
sounds like the only way vma_merge can succeed in copy_vma. But still it
can happen in that case I think (so I'm not sure how the above scenario
with an evil_vma could ever happen if it has a different anon_vma and
it's not part of a vma that is going back to its original place like
in the second scenario Nai also posted about).
That Nai and I had the same scenario hypothesis independently (Nai's
second hypothesis, not the first quoted above), plus copy_vma doing
vma_merge and being only called by mremap, sounds like it can really
happen.
Post by Mel Gorman
Despite the comments in anon_vma_compatible(), I would expect that VMAs
that can share an anon_vma from find_mergeable_anon_vma() will also get
merged. When the new VMA is created, it will be linked in the usual
manner and the oldest->newest ordering is what is required. That's not
that important though.
What is important is if mremap is moving src to a dst that is adjacent
to another anon_vma. If src has never been faulted, it's not an issue
because there are also no migration PTEs. If src has been faulted, then
is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
are not compatible. The ordering is preserved and we are still ok.
I was thinking along these lines, the only pitfall should be when
something is moved and put back into its original place. When it is
moved, a new vma is created and queued last. When it's put back to its
original location, vma_merge will succeed, and "src" is now the
previous "dst" so queued last and that breaks.
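The there-and-back sequence can be sketched as a toy model of the chain ordering (plain Python; the vma names and the vma_link append, standing in for anon_vma_chain linking in copy_vma, are illustrative assumptions):

```python
chain = []

def vma_link(vma):
    # New vmas are appended, i.e. linked last in the anon_vma chain.
    chain.append(vma)

vma_link("orig")    # the vma faults in its anon pages: linked first
vma_link("moved")   # mremap away: copy_vma creates and links a fresh dst vma
# chain == ["orig", "moved"]: the mremap src precedes its dst -- safe.

# mremap back to the original place: vma_merge succeeds against "orig",
# so no new vma is linked. The src of this second move is "moved".
src, dst = "moved", "orig"
assert chain.index(dst) < chain.index(src)  # dst now walked before src
```

An rmap walk that must not miss a migration entry relies on visiting the move's source before its destination; after the merge, that invariant no longer holds.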
Post by Mel Gorman
All that said, while I don't think there is a problem, I can't convince
myself 100% of it. Andrea, can you spot a flaw?
I think Nai's correct, only second hypothesis though.
1) we remove the vma_merge call from copy_vma and we do the vma_merge
manually after mremap succeeds (so then we're as safe as fork is and we
rely on the ordering). No locks, but we'll just do 1 more allocation
for one additional temporary vma that will be removed after mremap
completes.
2) Hugh's original fix.
The first option is probably faster and preferable; the vma_merge there
should only trigger when putting things back to their origin I suspect, and
never with random mremaps. I'm not sure how common it is to put things
back to the origin. If we're in a hurry we can merge Hugh's patch and
optimize it later. We can still retain the migrate fix if we intend to
take route number 1 later. I didn't like migrate doing
speculative access on ptes that it can't miss or it'll crash anyway.
Me too; I think it's error-prone, or at least we must be very careful
that it's not doing something evil. If the speculative access does not save
much time, we need not bother wasting our mind power
over it.
Post by Andrea Arcangeli
Said that the fix merged upstream is 99% certain to fix things in
practice already so I doubt we're in hurry. And if things go wrong
these issues don't go unnoticed and they shouldn't corrupt memory even
if they trigger. 100% certain it can't do damage (other than a BUG_ON)
for split_huge_page as I count the pmds encountered in the rmap_walk
when I set the splitting bit, and I compare that count with
page_mapcount and BUG_ON if they don't match, and later I repeat the
same comparison in the second rmap_walk that establishes the pte and
downgrades the hugepmd to pmd, and BUG_ON again if they don't match
with the previous rmap_walk count. It may be possible to trigger the
BUG_ON with some malicious activity but it won't be too easy either
because it's not an instant thing; a race still has to trigger and
it's hard to reproduce.
The anon_vma lock is quite a wide lock, as it's shared by all parent
anon_vma_chains too; slab allocation from the local cpu may actually be
faster in some conditions (even when the slab allocation is
superfluous). But then I'm not sure. So I'm not against applying Hugh's
fix even for the long run. I wouldn't git revert the migration change,
but then if we go with Hugh's fix it'd probably be safe.
Yeah, the anon_vma root lock is a big lock. And JFYI, I am actually doing
some very nasty hacking on anon_vma, and one of the side effects is
breaking the root lock into pieces. But this area is pretty
convoluted with many race conditions. I hope some day I will finally make
my patch work and have your precious review of it. :-)

Andrea Arcangeli
2011-10-31 16:34:22 UTC
Permalink
Hi Nai,
Post by Nai Xia
Yeah, the anon_vma root lock is a big lock. And JFYI, I am actually doing
some very nasty hacking on anon_vma, and one of the side effects is
breaking the root lock into pieces. But this area is pretty
convoluted with many race conditions. I hope some day I will finally make
my patch work and have your precious review of it. :-)
:) It's going to be non-trivial. Initially it was not a shared lock,
but it wasn't safe that way (especially with migrate requiring a
reliable rmap_walk), and using a shared lock across all
same_anon_vma/same_vma lists was the only way to be safe and solve the
races.

Linus Torvalds
2011-10-16 22:37:46 UTC
Permalink
What's the status of this thing? Is it stable/3.1 material? Do we have
ack/nak's for it? Anybody?

Linus
Post by Hugh Dickins
[PATCH] mm: add anon_vma locking to mremap move
I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.
 kernel BUG at include/linux/swapops.h:105!
 RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
                      migration_entry_wait+0x156/0x160
 [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
 [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
 [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
 [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
 [<ffffffff81106097>] ? vma_adjust+0x537/0x570
 [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
 [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
 [<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
---
 mm/mremap.c |    5 +++++
 1 file changed, 5 insertions(+)
--- 3.1-rc9/mm/mremap.c 2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c   2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
               unsigned long new_addr)
 {
       struct address_space *mapping = NULL;
+       struct anon_vma *anon_vma = vma->anon_vma;
       struct mm_struct *mm = vma->vm_mm;
       pte_t *old_pte, *new_pte, pte;
       spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
               mapping = vma->vm_file->f_mapping;
               mutex_lock(&mapping->i_mmap_mutex);
       }
+       if (anon_vma)
+               anon_vma_lock(anon_vma);
       /*
        * We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
               spin_unlock(new_ptl);
       pte_unmap(new_pte - 1);
       pte_unmap_unlock(old_pte - 1, old_ptl);
+       if (anon_vma)
+               anon_vma_unlock(anon_vma);
       if (mapping)
               mutex_unlock(&mapping->i_mmap_mutex);
       mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
Hugh Dickins
2011-10-17 03:02:16 UTC
Permalink
I've not read through and digested Andrea's reply yet, but I'd say
this is not something we need to rush into 3.1 at the last moment,
before it's been fully considered: the bug here is hard to hit,
ancient, made more likely in 2.6.35 by compaction and in 2.6.38 by
THP's reliance on compaction, but not a regression in 3.1 at all - let
it wait until stable.

Hugh

On Sun, Oct 16, 2011 at 3:37 PM, Linus Torvalds
What's the status of this thing? Is it stable/3.1 material? Do we have
ack/nak's for it? Anybody?

                              Linus
Post by Hugh Dickins
[PATCH] mm: add anon_vma locking to mremap move
I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.
 kernel BUG at include/linux/swapops.h:105!
 RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
                      migration_entry_wait+0x156/0x160
 [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
 [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
 [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
 [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
 [<ffffffff81106097>] ? vma_adjust+0x537/0x570
 [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
 [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
 [<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
---
 mm/mremap.c |    5 +++++
 1 file changed, 5 insertions(+)
--- 3.1-rc9/mm/mremap.c 2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c   2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
                unsigned long new_addr)
 {
        struct address_space *mapping = NULL;
+       struct anon_vma *anon_vma = vma->anon_vma;
        struct mm_struct *mm = vma->vm_mm;
        pte_t *old_pte, *new_pte, pte;
        spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
                mapping = vma->vm_file->f_mapping;
                mutex_lock(&mapping->i_mmap_mutex);
        }
+       if (anon_vma)
+               anon_vma_lock(anon_vma);
        /*
         * We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
                spin_unlock(new_ptl);
        pte_unmap(new_pte - 1);
        pte_unmap_unlock(old_pte - 1, old_ptl);
+       if (anon_vma)
+               anon_vma_unlock(anon_vma);
        if (mapping)
                mutex_unlock(&mapping->i_mmap_mutex);
        mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
Linus Torvalds
2011-10-17 03:09:12 UTC
Permalink
Post by Hugh Dickins
I've not read through and digested Andrea's reply yet, but I'd say
this is not something we need to rush into 3.1 at the last moment,
before it's been fully considered: the bug here is hard to hit,
ancient, made more likely in 2.6.35 by compaction and in 2.6.38 by
THP's reliance on compaction, but not a regression in 3.1 at all - let
it wait until stable.
Ok, thanks. Just wanted to check.

Linus

Paweł Sikora
2011-10-18 19:17:27 UTC
Permalink
Post by Hugh Dickins
[ Subject refers to a different, unexplained 3.0 bug from Pawel ]
Post by Paweł Sikora
Hi Hugh,
i'm resending previous private email with larger cc list as you've requested.
Thanks, yes, on this one I think I do have an answer;
and we ought to bring Mel and Andrea in too.
Post by Paweł Sikora
in the last weekend my server died again (processes stuck for 22/23s!) but this time i have more logs for you.
- DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
and 64GB ecc-ram is enough for my processing).
- vm.overcommit_memory = 2,
- vm.overcommit_ratio = 100.
after initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!'
Yes, those are just a tiresome consequence of exiting from a BUG
while holding the page table lock(s).
Post by Paweł Sikora
(full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)
Oct 9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
Oct 9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
Oct 9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
Oct 9 08:06:43 hal kernel: [408578.629143] CPU 14
[ I'm deleting that irrelevant long line list of modules ]
Post by Paweł Sikora
Oct 9 08:06:43 hal kernel: [408578.629143]
Oct 9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18 EFLAGS: 00010246
Oct 9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
Oct 9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
Oct 9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
Oct 9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
Oct 9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
Oct 9 08:06:43 hal kernel: [408578.629143] FS: 00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
Oct 9 08:06:43 hal kernel: [408578.629143] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
Oct 9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
Oct 9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
Oct 9 08:06:43 hal kernel: [408578.629143] 00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
Oct 9 08:06:43 hal kernel: [408578.629143] ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
Oct 9 08:06:43 hal kernel: [408578.629143] ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81106097>] ? vma_adjust+0x537/0x570
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
Oct 9 08:06:43 hal kernel: [408578.629143] [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
Oct 9 08:06:43 hal kernel: [408578.629143] RIP [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct 9 08:06:43 hal kernel: [408578.629143] RSP <ffff88021cee7d18>
Oct 9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
Oct 9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
Oct 9 08:07:10 hal kernel: [408605.285807] CPU 12
Oct 9 08:07:10 hal kernel: [408605.285807]
Oct 9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G D 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct 9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>] [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808 EFLAGS: 00000293
Oct 9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
Oct 9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
Oct 9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
Oct 9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
Oct 9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
Oct 9 08:07:10 hal kernel: [408605.285807] FS: 00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
Oct 9 08:07:10 hal kernel: [408605.285807] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
Oct 9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
Oct 9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
Oct 9 08:07:10 hal kernel: [408605.285807] ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
Oct 9 08:07:10 hal kernel: [408605.285807] ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
Oct 9 08:07:10 hal kernel: [408605.285807] 0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct 9 08:07:10 hal kernel: [408605.285807] [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct 9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00
I guess this is the only time you've seen this? In which case, ideally
I would try to devise a testcase to demonstrate the issue below instead;
but that may involve more ingenuity than I can find time for; let's see
if people approve of this patch anyway (it applies to 3.1 or 3.0,
and earlier releases except that i_mmap_mutex used to be i_mmap_lock).
[PATCH] mm: add anon_vma locking to mremap move
I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.
kernel BUG at include/linux/swapops.h:105!
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
---
mm/mremap.c | 5 +++++
1 file changed, 5 insertions(+)
--- 3.1-rc9/mm/mremap.c 2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c 2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
unsigned long new_addr)
{
struct address_space *mapping = NULL;
+ struct anon_vma *anon_vma = vma->anon_vma;
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
mapping = vma->vm_file->f_mapping;
mutex_lock(&mapping->i_mmap_mutex);
}
+ if (anon_vma)
+ anon_vma_lock(anon_vma);
/*
* We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
+ if (anon_vma)
+ anon_vma_unlock(anon_vma);
if (mapping)
mutex_unlock(&mapping->i_mmap_mutex);
mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
Hi,

1).
with this patch applied to a vanilla 3.0.6 kernel my Opterons have been running stably for ~4 days so far.
nice :)

2).
with this patch I can't reproduce the soft-lockup described at https://lkml.org/lkml/2011/8/30/112
nice :)

3).
now I've started more tests with this patch + 3.0.4 + vserver 2.3.1 to check possibly related lockups
described on the vserver mailing list http://list.linux-vserver.org/archive?mss:5264:201108:odomikkjgoemcaomgidl
and in the lkml archive https://lkml.org/lkml/2011/5/23/398

1h uptime and still going...
Mel Gorman
2011-10-19 07:30:36 UTC
Permalink
<SNIP>
I guess this is the only time you've seen this? In which case, ideally
I would try to devise a testcase to demonstrate the issue below instead;
Considering that mremap workloads have been tested fairly heavily and
this hasn't triggered before (or at least not reported), I would not be
confident it can be easily reproduced. Maybe reproducing is easier if
interrupts are also high.
but that may involve more ingenuity than I can find time for; let's see
if people approve of this patch anyway (it applies to 3.1 or 3.0,
and earlier releases except that i_mmap_mutex used to be i_mmap_lock).
[PATCH] mm: add anon_vma locking to mremap move
I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.
kernel BUG at include/linux/swapops.h:105!
This check is triggered if migration PTEs are left behind. In the few
cases I saw this during compaction development, it was because a VMA was
unreachable during remove_migration_pte. With the anon_vma changes, the
locking during VMA insertion is meant to protect it and the order VMAs
are linked is important so the right anon_vma lock is found.

I don't think it is an unreachable VMA problem because if it was, the
problem would trigger much more frequently and not be exclusive to
mremap.
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
migration holds the anon_vma lock while it unmaps the pages and keeps holding
it until after remove_migration_ptes is called. There are two anon vmas
that should exist during mremap that were created for the move. They
should not be able to disappear while migration runs and right now, I'm
not seeing how the VMA can get lost :(

I think a consequence of this patch is that migration and mremap are now
serialised by anon_vma lock. As a result, it might still fix the problem
if there is some race between mremap and migration simply by stopping
them playing with each other.
It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
The problem was that there was only one VMA for two page table
ranges. The neater fix was to create a second VMA but that required a
kmalloc and additional VMA work during exec which was considered too
heavy. VM_STACK_INCOMPLETE_SETUP is less clean but it is faster.
Mel Gorman
2011-10-21 12:44:15 UTC
Permalink
Post by Mel Gorman
Post by Paweł Sikora
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem. But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.
migration holds the anon_vma lock while it unmaps the pages and keeps holding
it until after remove_migration_ptes is called.
I reread this today and realised I was sloppy with my writing. migration
holds the anon_vma lock while it unmaps the pages. It also holds the
anon_vma lock during remove_migration_ptes. For the migration operation,
a reference count is held on anon_vma but not the lock itself.
Post by Mel Gorman
There are two anon vmas
that should exist during mremap that were created for the move. They
should not be able to disappear while migration runs and right now,
And what is preventing them from disappearing is not the lock but the
reference count.