  1. May 24, 2024
    • mm/memory-failure: fix handling of dissolved but not taken off from buddy pages · 8cf360b9
      Miaohe Lin authored
      When I did memory failure tests recently, the panic below occurred:
      
      page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
      flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
      raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000
      page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page))
      ------------[ cut here ]------------
      kernel BUG at include/linux/page-flags.h:1009!
      invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      RIP: 0010:__del_page_from_free_list+0x151/0x180
      RSP: 0018:ffffa49c90437998 EFLAGS: 00000046
      RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8
      RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0
      RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69
      R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80
      R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009
      FS:  00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0
      Call Trace:
       <TASK>
       __rmqueue_pcplist+0x23b/0x520
       get_page_from_freelist+0x26b/0xe40
       __alloc_pages_noprof+0x113/0x1120
       __folio_alloc_noprof+0x11/0xb0
       alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130
       __alloc_fresh_hugetlb_folio+0xe7/0x140
       alloc_pool_huge_folio+0x68/0x100
       set_max_huge_pages+0x13d/0x340
       hugetlb_sysctl_handler_common+0xe8/0x110
       proc_sys_call_handler+0x194/0x280
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xc2/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7ff916114887
      RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887
      RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003
      RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0
      R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004
      R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00
       </TASK>
      Modules linked in: mce_inject hwpoison_inject
      ---[ end trace 0000000000000000 ]---
      
      And before the panic, there was a warning about bad page state:
      
      BUG: Bad page state in process page-types  pfn:8cee00
      page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
      flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
      page_type: 0xffffff7f(buddy)
      raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000
      raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000
      page dumped because: nonzero mapcount
      Modules linked in: mce_inject hwpoison_inject
      CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22
      Call Trace:
       <TASK>
       dump_stack_lvl+0x83/0xa0
       bad_page+0x63/0xf0
       free_unref_page+0x36e/0x5c0
       unpoison_memory+0x50b/0x630
       simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
       debugfs_attr_write+0x42/0x60
       full_proxy_write+0x5b/0x80
       vfs_write+0xcd/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xc2/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7f189a514887
      RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887
      RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003
      RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8
      R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040
       </TASK>
      
      The root cause should be the following race:
      
       memory_failure
        try_memory_failure_hugetlb
         me_huge_page
          __page_handle_poison
           dissolve_free_hugetlb_folio
           drain_all_pages -- Buddy page can be isolated e.g. for compaction.
           take_page_off_buddy -- Failed as page is not in the buddy list.
      	     -- Page can be put back into buddy after compaction.
          page_ref_inc -- Leads to buddy page with refcnt = 1.
      
      Then unpoison_memory() can unpoison the page and send the buddy page
      back into the buddy list again, leading to the above bad page state
      warning.  And bad_page() will call page_mapcount_reset() to remove
      PageBuddy from the buddy page, leading to the later
      VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to allocate this page.
      
      Fix this issue by only treating __page_handle_poison() as successful when
      it returns 1.
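
      A minimal sketch of the corrected call site, assuming the caller logic
      in me_huge_page() (illustrative, not the literal diff):

      ```c
      /*
       * Sketch only. __page_handle_poison() returns a negative errno when
       * dissolving failed, 0 when the hugetlb folio was dissolved but the
       * page could not be taken off the buddy list (it may be isolated,
       * e.g. for compaction, and put back later), and 1 only when the page
       * was really removed from the buddy list. Treating the 0 case as
       * success is what allowed page_ref_inc() to create a buddy page with
       * refcount 1.
       */
      if (__page_handle_poison(p) > 0) {	/* was: >= 0 */
      	page_ref_inc(p);
      	res = MF_RECOVERED;
      }
      ```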
      
      Link: https://lkml.kernel.org/r/20240523071217.1696196-1-linmiaohe@huawei.com
      Fixes: ceaf8fbe ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/huge_memory: don't unpoison huge_zero_folio · fe6f86f4
      Miaohe Lin authored
      When I did memory failure tests recently, the panic below occurred:
      
       kernel BUG at include/linux/mm.h:1135!
       invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 9 PID: 137 Comm: kswapd1 Not tainted 6.9.0-rc4-00491-gd5ce28f156fe-dirty #14
       RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0
       RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246
       RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8
       RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0
       RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492
       R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00
       FS:  0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0
       Call Trace:
        <TASK>
        do_shrink_slab+0x14f/0x6a0
        shrink_slab+0xca/0x8c0
        shrink_node+0x2d0/0x7d0
        balance_pgdat+0x33a/0x720
        kswapd+0x1f3/0x410
        kthread+0xd5/0x100
        ret_from_fork+0x2f/0x50
        ret_from_fork_asm+0x1a/0x30
        </TASK>
       Modules linked in: mce_inject hwpoison_inject
       ---[ end trace 0000000000000000 ]---
       RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0
       RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246
       RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8
       RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0
       RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492
       R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00
       FS:  0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0
      
      The root cause is that the HWPoison flag is set for huge_zero_folio
      without increasing the folio refcnt.  But then unpoison_memory() will
      decrease the folio refcnt unexpectedly, as the folio looks like a
      successfully hwpoisoned one, leading to
      VM_BUG_ON_PAGE(page_ref_count(page) == 0) when releasing huge_zero_folio.
      
      Skip unpoisoning huge_zero_folio in unpoison_memory() to fix this issue. 
      We're not prepared to unpoison huge_zero_folio yet.
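
      The fix amounts to an early bail-out in unpoison_memory(); below is a
      sketch assuming the unpoison_pr_info() helper and the error-path label
      used in mm/memory-failure.c (exact wording may differ):

      ```c
      /* huge_zero_folio is never refcounted by hwpoison, so "unpoisoning"
       * it would drop a reference that was never taken. */
      if (is_huge_zero_folio(folio)) {
      	unpoison_pr_info("Unpoison: huge zero page is not supported %#lx\n",
      			 pfn, &unpoison_rs);
      	ret = -EOPNOTSUPP;
      	goto unlock_mutex;
      }
      ```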
      
      Link: https://lkml.kernel.org/r/20240516122608.22610-1-linmiaohe@huawei.com
      Fixes: 478d134e ("mm/huge_memory: do not overkill when splitting huge_zero_page")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Xu Yu <xuyu@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. Apr 16, 2024
    • mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled · 1983184c
      Miaohe Lin authored
      When I did a hard offline test with hugetlb pages, the deadlock below occurred:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.8.0-11409-gf6cef5f8c37f #1 Not tainted
      ------------------------------------------------------
      bash/46904 is trying to acquire lock:
      ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60
      
      but task is already holding lock:
      ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
             __mutex_lock+0x6c/0x770
             page_alloc_cpu_online+0x3c/0x70
             cpuhp_invoke_callback+0x397/0x5f0
             __cpuhp_invoke_callback_range+0x71/0xe0
             _cpu_up+0xeb/0x210
             cpu_up+0x91/0xe0
             cpuhp_bringup_mask+0x49/0xb0
             bringup_nonboot_cpus+0xb7/0xe0
             smp_init+0x25/0xa0
             kernel_init_freeable+0x15f/0x3e0
             kernel_init+0x15/0x1b0
             ret_from_fork+0x2f/0x50
             ret_from_fork_asm+0x1a/0x30
      
      -> #0 (cpu_hotplug_lock){++++}-{0:0}:
             __lock_acquire+0x1298/0x1cd0
             lock_acquire+0xc0/0x2b0
             cpus_read_lock+0x2a/0xc0
             static_key_slow_dec+0x16/0x60
             __hugetlb_vmemmap_restore_folio+0x1b9/0x200
             dissolve_free_huge_page+0x211/0x260
             __page_handle_poison+0x45/0xc0
             memory_failure+0x65e/0xc70
             hard_offline_page_store+0x55/0xa0
             kernfs_fop_write_iter+0x12c/0x1d0
             vfs_write+0x387/0x550
             ksys_write+0x64/0xe0
             do_syscall_64+0xca/0x1e0
             entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(pcp_batch_high_lock);
                                     lock(cpu_hotplug_lock);
                                     lock(pcp_batch_high_lock);
        rlock(cpu_hotplug_lock);
      
       *** DEADLOCK ***
      
      5 locks held by bash/46904:
       #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
       #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
       #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
       #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
       #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
      
      stack backtrace:
      CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x68/0xa0
       check_noncircular+0x129/0x140
       __lock_acquire+0x1298/0x1cd0
       lock_acquire+0xc0/0x2b0
       cpus_read_lock+0x2a/0xc0
       static_key_slow_dec+0x16/0x60
       __hugetlb_vmemmap_restore_folio+0x1b9/0x200
       dissolve_free_huge_page+0x211/0x260
       __page_handle_poison+0x45/0xc0
       memory_failure+0x65e/0xc70
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xca/0x1e0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      RIP: 0033:0x7fc862314887
      Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
      RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
      RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00
      
      In short, the scenario below breaks the lock dependency chain:
      
       memory_failure
        __page_handle_poison
         zone_pcp_disable -- lock(pcp_batch_high_lock)
         dissolve_free_huge_page
          __hugetlb_vmemmap_restore_folio
           static_key_slow_dec
            cpus_read_lock -- rlock(cpu_hotplug_lock)
      
      Fix this by calling drain_all_pages() instead.
      
      This issue could not occur before commit a6b40850 ("mm: hugetlb: replace
      hugetlb_free_vmemmap_enabled with a static_key"), which introduced
      rlock(cpu_hotplug_lock) in the dissolve_free_huge_page() code path while
      lock(pcp_batch_high_lock) is already held in __page_handle_poison().
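
      A sketch of the reworked helper (assuming the pre-rework shape of
      __page_handle_poison(); details may differ from the actual diff):

      ```c
      static bool __page_handle_poison(struct page *page)
      {
      	int ret;

      	/*
      	 * drain_all_pages() replaces zone_pcp_disable()/zone_pcp_enable():
      	 * it flushes the per-cpu lists without holding pcp_batch_high_lock,
      	 * so dissolve_free_huge_page() -> static_key_slow_dec() can take
      	 * cpu_hotplug_lock without inverting the order established at CPU
      	 * online time (page_alloc_cpu_online() runs with cpu_hotplug_lock
      	 * held and takes pcp_batch_high_lock).
      	 */
      	ret = dissolve_free_huge_page(page);
      	if (!ret) {
      		drain_all_pages(page_zone(page));
      		ret = take_page_off_buddy(page);
      	}

      	return ret > 0;
      }
      ```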
      
      [linmiaohe@huawei.com: extend comment per Oscar]
      [akpm@linux-foundation.org: reflow block comment]
      Link: https://lkml.kernel.org/r/20240407085456.2798193-1-linmiaohe@huawei.com
      Fixes: a6b40850 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. Feb 08, 2024
    • mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page · 2fde9e7f
      Miaohe Lin authored
      When I did a soft offline stress test, a machine was observed to crash with
      the following message:
      
        kernel BUG at include/linux/memcontrol.h:554!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        RIP: 0010:folio_memcg+0xaf/0xd0
        Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
        RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
        RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
        RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
        RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
        R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
        R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
        FS:  00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
        Call Trace:
         <TASK>
         split_huge_page_to_list+0x4d/0x1380
         try_to_split_thp_page+0x3a/0xf0
         soft_offline_page+0x1ea/0x8a0
         soft_offline_page_store+0x52/0x90
         kernfs_fop_write_iter+0x118/0x1b0
         vfs_write+0x30b/0x430
         ksys_write+0x5e/0xe0
         do_syscall_64+0xb0/0x1b0
         entry_SYSCALL_64_after_hwframe+0x6d/0x75
        RIP: 0033:0x7f6c60d14697
        Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
        RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
        RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
        RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
        R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
        R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00
      
      The problem is that page->mapping is overloaded with the slab->slab_list
      or slabs fields now, so slab pages can be mistaken for non-LRU movable
      pages if the slabs field contains PAGE_MAPPING_MOVABLE or slab_list->prev
      is set to LIST_POISON2.  These slab pages will later be treated as THP,
      leading to the crash in split_huge_page_to_list().
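
      The defensive idea, as a sketch (the helper name below is hypothetical;
      the actual fix lives in the hwpoison page-grabbing path): reject slab
      pages before consulting __PageMovable(), since a slab page's reuse of
      the page->mapping word can make that test pass by accident.

      ```c
      /* Hypothetical helper illustrating the check, not the actual diff. */
      static bool hwpoison_page_handlable(struct page *page)
      {
      	if (PageSlab(page))
      		return false;	/* mapping word holds slab_list/slabs */
      	return PageLRU(page) || is_free_buddy_page(page) ||
      	       __PageMovable(page);
      }
      ```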
      
      Link: https://lkml.kernel.org/r/20240126065837.2100184-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20240124084014.1772906-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Fixes: 130d4df5 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. Dec 07, 2023
    • mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind · fa422b35
      Shiyang Ruan authored
      Now, if we suddenly remove a PMEM device (by calling unbind) which
      contains FSDAX while programs are still accessing data on this device,
      e.g.:
      ```
       $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
       # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
       echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
      ```
      the system can end up in an unacceptable state:
        1. the device has gone but the mount point still exists, and umount
             will fail with "target is busy"
        2. programs will hang and cannot be killed
        3. the kernel may crash with a NULL pointer dereference
      
      To fix this, we introduce an MF_MEM_PRE_REMOVE flag to signal that we
      are going to remove the whole device, and make sure all related
      processes are notified so that they can end gracefully.
      
      This patch is inspired by Dan's "mm, dax, pmem: Introduce
      dev_pagemap_failure()"[1].  With the help of the dax_holder and
      ->notify_failure() mechanism, the pmem driver is able to ask the
      filesystem on it to unmap all files in use and to notify the processes
      that are using those files.
      
      Call trace:
      trigger unbind
       -> unbind_store()
        -> ... (skip)
         -> devres_release_all()
          -> kill_dax()
           -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
            -> xfs_dax_notify_failure()
            `-> freeze_super()             // freeze (kernel call)
            `-> do xfs rmap
            ` -> mf_dax_kill_procs()
            `  -> collect_procs_fsdax()    // all associated processes
            `  -> unmap_and_kill()
            ` -> invalidate_inode_pages2_range() // drop file's cache
            `-> thaw_super()               // thaw (both kernel & user call)
      
      Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
      event.  Use the exclusive freeze/thaw[2] to lock the filesystem and
      prevent new dax mappings from being created.  Do not shut down the
      filesystem directly if the configuration is not supported, or if the
      failure range includes the metadata area.  Make sure all files and
      processes (not only the current process) are handled correctly.  Also
      drop the cache of associated files before the pmem is removed.
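
      A hedged sketch of how a dax holder's ->notify_failure() handler can
      branch on the new flag (the function name is made up, the signature is
      simplified to take the super_block directly, and the rmap walk is
      elided):

      ```c
      /* Sketch, loosely modeled on the xfs handler; not the actual code. */
      static int example_dax_notify_failure(struct super_block *sb,
      				      u64 offset, u64 len, int mf_flags)
      {
      	int error = 0;

      	if (mf_flags & MF_MEM_PRE_REMOVE) {
      		/* Exclusive kernel freeze: no new dax mappings can be
      		 * created while we unmap files and notify their users. */
      		error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
      		if (error)
      			return error;
      	}

      	/*
      	 * ... walk the filesystem rmap over [offset, offset + len),
      	 * calling mf_dax_kill_procs() for each affected file and
      	 * invalidate_inode_pages2_range() to drop its page cache ...
      	 */

      	if (mf_flags & MF_MEM_PRE_REMOVE)
      		thaw_super(sb, FREEZE_HOLDER_KERNEL);
      	return error;
      }
      ```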
      
      [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
      [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/

      Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
  5. Sep 05, 2023
    • mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs() · d256d1cd
      Tong Tiangen authored
      We found a softlockup issue in our test, analyzed the logs, and found
      that the relevant CPU call traces are as follows:
      
      CPU0:
        _do_fork
          -> copy_process()
            -> write_lock_irq(&tasklist_lock)  // Disable irq, waiting for tasklist_lock
      
      CPU1:
        wp_page_copy()
          ->pte_offset_map_lock()
            -> spin_lock(&page->ptl);        //Hold page->ptl
          -> ptep_clear_flush()
            -> flush_tlb_others() ...
              -> smp_call_function_many()
                -> arch_send_call_function_ipi_mask()
                -> csd_lock_wait()         // Waiting for other CPUs to respond to the IPI
      
      CPU2:
        collect_procs_anon()
          -> read_lock(&tasklist_lock)       //Hold tasklist_lock
            ->for_each_process(tsk)
              -> page_mapped_in_vma()
                -> page_vma_mapped_walk()
      	    -> map_pte()
                    ->spin_lock(&page->ptl)  //Waiting for page->ptl
      
      We can see that CPU1 is waiting for CPU0 to respond to the IPI, CPU0 is
      waiting for CPU2 to unlock tasklist_lock, and CPU2 is waiting for CPU1
      to unlock page->ptl.  As a result, a softlockup is triggered.
      
      For collect_procs_anon(), what we're doing is task list iteration.
      During the iteration, with the help of call_rcu(), the task_struct
      object is freed only after one or more grace periods elapse.  The logic
      is as follows:
      
      release_task()
        -> __exit_signal()
          -> __unhash_process()
            -> list_del_rcu()
      
        -> put_task_struct_rcu_user()
          -> call_rcu(&task->rcu, delayed_put_task_struct)
      
      delayed_put_task_struct()
        -> put_task_struct()
        -> if (refcount_sub_and_test())
           	__put_task_struct()
                -> free_task()
      
      Therefore, under the protection of the RCU read lock, we can safely use
      get_task_struct() to take a reference on the task_struct during the
      iteration.
      
      By removing the use of tasklist_lock in the task list iteration, we can
      break the softlockup chain above.
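
      The resulting iteration pattern, as a sketch (the filter and list
      helpers are placeholders, not kernel functions):

      ```c
      struct task_struct *tsk;

      /* tasklist_lock replaced by an RCU read-side critical section.
       * release_task() unhashes tasks with list_del_rcu() and frees the
       * task_struct via call_rcu(), so the list walk is safe under RCU,
       * and get_task_struct() pins any task we keep past the unlock. */
      rcu_read_lock();			/* was: read_lock(&tasklist_lock) */
      for_each_process(tsk) {
      	if (!task_maps_poisoned_page(tsk))	/* placeholder filter */
      		continue;
      	get_task_struct(tsk);	/* ref stays valid after rcu_read_unlock() */
      	/* ... record tsk in the to-kill list; put_task_struct() later ... */
      }
      rcu_read_unlock();		/* was: read_unlock(&tasklist_lock) */
      ```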
      
      The same logic can also be applied to:
       - collect_procs_file()
       - collect_procs_fsdax()
       - collect_procs_ksm()
      
      Link: https://lkml.kernel.org/r/20230828022527.241693-1-tongtiangen@huawei.com
      Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>