  1. May 24, 2024
    • Chengming Zhou's avatar
      mm/ksm: fix possible UAF of stable_node · 90e82349
      Chengming Zhou authored
      Commit 2c653d0e ("ksm: introduce ksm_max_page_sharing per page
      deduplication limit") introduced a possible failure case in
      stable_tree_insert(): we may free the newly allocated stable_node_dup
      if we fail to prepare the missing chain node.
      
      The kfolio is then returned and unlocked with a freed stable_node
      still set in its mapping, so any subsequent MM activity that accesses
      kfolio->mapping triggers a use-after-free.
      
      Fix it by moving folio_set_stable_node() to the end after stable_node
      is inserted successfully.
      
      Link: https://lkml.kernel.org/r/20240513-b4-ksm-stable-node-uaf-v1-1-f687de76f452@linux.dev
      
      
      Fixes: 2c653d0e ("ksm: introduce ksm_max_page_sharing per page deduplication limit")
      Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      90e82349
    • Miaohe Lin's avatar
      mm/memory-failure: fix handling of dissolved but not taken off from buddy pages · 8cf360b9
      Miaohe Lin authored
      When I did memory failure tests recently, below panic occurs:
      
      page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
      flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
      raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000
      page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page))
      ------------[ cut here ]------------
      kernel BUG at include/linux/page-flags.h:1009!
      invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      RIP: 0010:__del_page_from_free_list+0x151/0x180
      RSP: 0018:ffffa49c90437998 EFLAGS: 00000046
      RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8
      RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0
      RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69
      R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80
      R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009
      FS:  00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0
      Call Trace:
       <TASK>
       __rmqueue_pcplist+0x23b/0x520
       get_page_from_freelist+0x26b/0xe40
       __alloc_pages_noprof+0x113/0x1120
       __folio_alloc_noprof+0x11/0xb0
       alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130
       __alloc_fresh_hugetlb_folio+0xe7/0x140
       alloc_pool_huge_folio+0x68/0x100
       set_max_huge_pages+0x13d/0x340
       hugetlb_sysctl_handler_common+0xe8/0x110
       proc_sys_call_handler+0x194/0x280
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xc2/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7ff916114887
      RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887
      RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003
      RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0
      R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004
      R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00
       </TASK>
      Modules linked in: mce_inject hwpoison_inject
      ---[ end trace 0000000000000000 ]---
      
      Before the panic, there was a warning about bad page state:
      
      BUG: Bad page state in process page-types  pfn:8cee00
      page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
      flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
      page_type: 0xffffff7f(buddy)
      raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000
      raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000
      page dumped because: nonzero mapcount
      Modules linked in: mce_inject hwpoison_inject
      CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22
      Call Trace:
       <TASK>
       dump_stack_lvl+0x83/0xa0
       bad_page+0x63/0xf0
       free_unref_page+0x36e/0x5c0
       unpoison_memory+0x50b/0x630
       simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
       debugfs_attr_write+0x42/0x60
       full_proxy_write+0x5b/0x80
       vfs_write+0xcd/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xc2/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7f189a514887
      RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887
      RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003
      RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8
      R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040
       </TASK>
      
      The root cause should be the below race:
      
       memory_failure
        try_memory_failure_hugetlb
         me_huge_page
          __page_handle_poison
           dissolve_free_hugetlb_folio
           drain_all_pages -- Buddy page can be isolated e.g. for compaction.
           take_page_off_buddy -- Failed as page is not in the buddy list.
      	     -- Page can be putback into buddy after compaction.
          page_ref_inc -- Leads to buddy page with refcnt = 1.
      
      Then unpoison_memory() can unpoison the page and send the buddy page
      back into the buddy list again, leading to the bad page state warning
      above.  bad_page() then calls page_mapcount_reset() to remove PageBuddy
      from the buddy page, leading to the later
      VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to allocate this page.
      
      Fix this issue by only treating __page_handle_poison() as successful when
      it returns 1.
      
      Link: https://lkml.kernel.org/r/20240523071217.1696196-1-linmiaohe@huawei.com
      
      
      Fixes: ceaf8fbe ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8cf360b9
    • Miaohe Lin's avatar
      mm/huge_memory: don't unpoison huge_zero_folio · fe6f86f4
      Miaohe Lin authored
      When I did memory failure tests recently, below panic occurs:
      
       kernel BUG at include/linux/mm.h:1135!
       invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 9 PID: 137 Comm: kswapd1 Not tainted 6.9.0-rc4-00491-gd5ce28f156fe-dirty #14
       RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0
       RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246
       RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8
       RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0
       RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492
       R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00
       FS:  0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0
       Call Trace:
        <TASK>
        do_shrink_slab+0x14f/0x6a0
        shrink_slab+0xca/0x8c0
        shrink_node+0x2d0/0x7d0
        balance_pgdat+0x33a/0x720
        kswapd+0x1f3/0x410
        kthread+0xd5/0x100
        ret_from_fork+0x2f/0x50
        ret_from_fork_asm+0x1a/0x30
        </TASK>
       Modules linked in: mce_inject hwpoison_inject
       ---[ end trace 0000000000000000 ]---
       RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0
       RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246
       RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8
       RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0
       RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492
       R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00
       FS:  0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0
      
      The root cause is that the HWPoison flag is set for huge_zero_folio
      without increasing the folio refcnt.  But then unpoison_memory()
      decreases the folio refcnt unexpectedly, as the folio looks like a
      successfully hwpoisoned one, leading to
      VM_BUG_ON_PAGE(page_ref_count(page) == 0) when releasing
      huge_zero_folio.
      
      Skip unpoisoning huge_zero_folio in unpoison_memory() to fix this issue. 
      We're not prepared to unpoison huge_zero_folio yet.
      
      Link: https://lkml.kernel.org/r/20240516122608.22610-1-linmiaohe@huawei.com
      
      
      Fixes: 478d134e ("mm/huge_memory: do not overkill when splitting huge_zero_page")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Xu Yu <xuyu@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fe6f86f4
    • Hailong.Liu's avatar
      mm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAIL · 8e0545c8
      Hailong.Liu authored
      commit a421ef30 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
      includes support for __GFP_NOFAIL, but it presents a conflict with commit
      dd544141 ("vmalloc: back off when the current task is OOM-killed").  A
      possible scenario is as follows:
      
      process-a
      __vmalloc_node_range(GFP_KERNEL | __GFP_NOFAIL)
          __vmalloc_area_node()
              vm_area_alloc_pages()
      		--> oom-killer send SIGKILL to process-a
              if (fatal_signal_pending(current)) break;
      --> return NULL;
      
      To fix this, do not check fatal_signal_pending() in
      vm_area_alloc_pages() if __GFP_NOFAIL is set.
      
      This issue occurred during the OPLUS KASAN TEST.  Below is part of the log:
      -> oom-killer sends signal to process
      [65731.222840] [ T1308] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/apps/uid_10198,task=gs.intelligence,pid=32454,uid=10198
      
      [65731.259685] [T32454] Call trace:
      [65731.259698] [T32454]  dump_backtrace+0xf4/0x118
      [65731.259734] [T32454]  show_stack+0x18/0x24
      [65731.259756] [T32454]  dump_stack_lvl+0x60/0x7c
      [65731.259781] [T32454]  dump_stack+0x18/0x38
      [65731.259800] [T32454]  mrdump_common_die+0x250/0x39c [mrdump]
      [65731.259936] [T32454]  ipanic_die+0x20/0x34 [mrdump]
      [65731.260019] [T32454]  atomic_notifier_call_chain+0xb4/0xfc
      [65731.260047] [T32454]  notify_die+0x114/0x198
      [65731.260073] [T32454]  die+0xf4/0x5b4
      [65731.260098] [T32454]  die_kernel_fault+0x80/0x98
      [65731.260124] [T32454]  __do_kernel_fault+0x160/0x2a8
      [65731.260146] [T32454]  do_bad_area+0x68/0x148
      [65731.260174] [T32454]  do_mem_abort+0x151c/0x1b34
      [65731.260204] [T32454]  el1_abort+0x3c/0x5c
      [65731.260227] [T32454]  el1h_64_sync_handler+0x54/0x90
      [65731.260248] [T32454]  el1h_64_sync+0x68/0x6c
      
      [65731.260269] [T32454]  z_erofs_decompress_queue+0x7f0/0x2258
      --> be->decompressed_pages = kvcalloc(be->nr_pages, sizeof(struct page *), GFP_KERNEL | __GFP_NOFAIL);
	kernel panic by NULL pointer dereference:
	erofs assumes kvmalloc() with __GFP_NOFAIL never returns NULL.
      [65731.260293] [T32454]  z_erofs_runqueue+0xf30/0x104c
      [65731.260314] [T32454]  z_erofs_readahead+0x4f0/0x968
      [65731.260339] [T32454]  read_pages+0x170/0xadc
      [65731.260364] [T32454]  page_cache_ra_unbounded+0x874/0xf30
      [65731.260388] [T32454]  page_cache_ra_order+0x24c/0x714
      [65731.260411] [T32454]  filemap_fault+0xbf0/0x1a74
      [65731.260437] [T32454]  __do_fault+0xd0/0x33c
      [65731.260462] [T32454]  handle_mm_fault+0xf74/0x3fe0
      [65731.260486] [T32454]  do_mem_abort+0x54c/0x1b34
      [65731.260509] [T32454]  el0_da+0x44/0x94
      [65731.260531] [T32454]  el0t_64_sync_handler+0x98/0xb4
      [65731.260553] [T32454]  el0t_64_sync+0x198/0x19c
      
      Link: https://lkml.kernel.org/r/20240510100131.1865-1-hailong.liu@oppo.com
      
      
      Fixes: 9376130c ("mm/vmalloc: add support for __GFP_NOFAIL")
      Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Barry Song <21cnbao@gmail.com>
      Reported-by: Oven <liyangouwen1@oppo.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8e0545c8
    • Jeff Xu's avatar
      mseal: add mseal syscall · 8be7258a
      Jeff Xu authored
      The new mseal() is a syscall available on 64-bit CPUs, with the
      following signature:

      int mseal(void *addr, size_t len, unsigned long flags)
      addr/len: memory range.
      flags: reserved.
      
      mseal() blocks the following operations on the given memory range.
      
      1> Unmapping, moving to another location, and shrinking the size,
         via munmap() and mremap(): these can leave an empty space that can
         then be replaced with a VMA with a new set of attributes.
      
      2> Moving or expanding a different VMA into the current location,
         via mremap().
      
      3> Modifying a VMA via mmap(MAP_FIXED).
      
      4> Size expansion, via mremap(), does not appear to pose any specific
         risks to sealed VMAs. It is included anyway because the use case is
         unclear. In any case, users can rely on merging to expand a sealed VMA.
      
      5> mprotect() and pkey_mprotect().
      
      6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for
         anonymous memory, when users don't have write permission to the
         memory.  Those behaviors can alter region contents by discarding
         pages, effectively a memset(0) for anonymous memory.
      
      The following input during the RFC was incorporated into this patch:
      
      Jann Horn: raising awareness and providing valuable insights on the
      destructive madvise operations.
      Linus Torvalds: assisting in defining system call signature and scope.
      Liam R. Howlett: perf optimization.
      Theo de Raadt: sharing the experiences and insight gained from
        implementing mimmutable() in OpenBSD.
      
      Finally, the idea that inspired this patch comes from Stephen Röttger's
      work in Chrome V8 CFI.
      
      [jeffxu@chromium.org: add branch prediction hint, per Pedro]
        Link: https://lkml.kernel.org/r/20240423192825.1273679-2-jeffxu@chromium.org
      Link: https://lkml.kernel.org/r/20240415163527.626541-3-jeffxu@chromium.org
      
      
      Signed-off-by: Jeff Xu <jeffxu@chromium.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Pedro Falcato <pedro.falcato@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <groeck@chromium.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jeff Xu <jeffxu@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Pedro Falcato <pedro.falcato@gmail.com>
      Cc: Stephen Röttger <sroettger@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Amer Al Shanawany <amer.shanawany@gmail.com>
      Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8be7258a
  2. May 22, 2024
    • Linus Torvalds's avatar
      mm: simplify and improve print_vma_addr() output · de7e71ef
      Linus Torvalds authored
      
      Use '%pD' to print out the filename, and print out the actual offset
      within the file too, rather than just what the virtual address of the
      mapping is (which doesn't tell you anything about any mapping offsets).
      
      Also, use the exact vma_lookup() instead of find_vma() - the latter
      looks up any vma _after_ the address, which is of questionable value
      (yes, maybe you fell off the beginning, but you'd be more likely to fall
      off the end).
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de7e71ef
  3. May 19, 2024
    • Dave Chinner's avatar
      mm/page-owner: use gfp_nested_mask() instead of open coded masking · 99b80ac4
      Dave Chinner authored
      The page-owner tracking code records stack traces during page allocation. 
      To do this, it must do a memory allocation for the stack information from
      inside an existing memory allocation context.  This internal allocation
      must obey the high level caller allocation constraints to avoid generating
      false positive warnings that have nothing to do with the code they are
      instrumenting/tracking (e.g. through lockdep reclaim state tracking).
      
      We also don't want recording stack traces to deplete emergency memory
      reserves - debug code is useless if it creates new issues that can't be
      replicated when the debug code is disabled.
      
      Switch the stack tracking allocation masking to use gfp_nested_mask() to
      address these issues.  gfp_nested_mask() naturally strips GFP_ZONEMASK,
      too, which greatly simplifies this code.
      
      Link: https://lkml.kernel.org/r/20240430054604.4169568-4-david@fromorbit.com
      
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      99b80ac4
    • Dave Chinner's avatar
      mm: lift gfp_kmemleak_mask() to gfp.h · 1c00f936
      Dave Chinner authored
      Patch series "mm: fix nested allocation context filtering".
      
      This patchset is the followup to the comment I made earlier today:
      
      https://lore.kernel.org/linux-xfs/ZjAyIWUzDipofHFJ@dread.disaster.area/
      
      Tl;dr: Memory allocations that are done inside the public memory
      allocation API need to obey the reclaim recursion constraints placed on
      the allocation by the original caller, including the "don't track
      recursion for this allocation" case defined by __GFP_NOLOCKDEP.
      
      These nested allocations are generally in debug code that is tracking
      something about the allocation (kmemleak, KASAN, etc) and so are
      allocating private kernel objects that only that debug system will use.
      
      Neither the page-owner code nor the stack depot code get this right.  They
      also clear GFP_ZONEMASK as a separate operation, which is completely
      redundant because the constraint filter applied immediately afterwards
      guarantees that the GFP_ZONEMASK bits are cleared.
      
      kmemleak gets this filtering right.  It preserves the allocation
      constraints for deadlock prevention and clears all other context flags
      whilst also ensuring that the nested allocation will fail quickly,
      silently and without depleting emergency kernel reserves if there is no
      memory available.
      
      This can be made much more robust, immune to whack-a-mole games and the
      code greatly simplified by lifting gfp_kmemleak_mask() to
      include/linux/gfp.h and using that everywhere.  Also document it so that
      there is no excuse for not knowing about it when writing new debug code
      that nests allocations.
      
      Tested with lockdep, KASAN + page_owner=on and kmemleak=on over multiple
      fstests runs with XFS.
      
      
      This patch (of 3):
      
      Any "internal" nested allocation done from within an allocation context
      needs to obey the high level allocation gfp_mask constraints.  This is
      necessary for debug code like KASAN, kmemleak, lockdep, etc that allocate
      memory for saving stack traces and other information during memory
      allocation.  If they don't obey things like __GFP_NOLOCKDEP or
      __GFP_NOWARN, they produce false positive failure detections.
      
      kmemleak gets this right by using gfp_kmemleak_mask() to pass through the
      relevant context flags to the nested allocation to ensure that the
      allocation follows the constraints of the caller context.
      
      KASAN was recently found to be missing __GFP_NOLOCKDEP in its stack
      depot allocations, and even more recently the page-owner tracking code
      was also found to be missing __GFP_NOLOCKDEP support.
      
      We also don't want KASAN or lockdep to drive the system into OOM-kill
      territory by exhausting emergency reserves.  This is something that
      kmemleak also gets right, by adding (__GFP_NORETRY | __GFP_NOMEMALLOC |
      __GFP_NOWARN) to the allocation mask.
      
      Hence it is clear that we need to define a common nested allocation filter
      mask for these sorts of third party nested allocations used in debug code.
      So to start this process, lift gfp_kmemleak_mask() to gfp.h and rename it
      to gfp_nested_mask(), and convert the kmemleak callers to use it.
      
      Link: https://lkml.kernel.org/r/20240430054604.4169568-1-david@fromorbit.com
      Link: https://lkml.kernel.org/r/20240430054604.4169568-2-david@fromorbit.com
      
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1c00f936
  4. May 14, 2024
    • Mike Rapoport (IBM)'s avatar
      mm/execmem, arch: convert remaining overrides of module_alloc to execmem · 223b5e57
      Mike Rapoport (IBM) authored
      
      Extend execmem parameters to accommodate more complex overrides of
      module_alloc() by architectures.
      
      This includes specification of a fallback range required by arm, arm64
      and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for
      allocation of KASAN shadow required by s390 and x86, and support for
      late initialization of execmem required by arm64.
      
      The core implementation of execmem_alloc() takes care of suppressing
      warnings when the initial allocation fails but there is a fallback range
      defined.
      
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Will Deacon <will@kernel.org>
      Acked-by: Song Liu <song@kernel.org>
      Tested-by: Liviu Dudau <liviu@dudau.co.uk>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      223b5e57
    • Mike Rapoport (IBM)'s avatar
      mm/execmem, arch: convert simple overrides of module_alloc to execmem · f6bec26c
      Mike Rapoport (IBM) authored
      
      Several architectures override module_alloc() only to define an address
      range for code allocations different from the VMALLOC address space.
      
      Provide a generic implementation in execmem that uses the parameters for
      address space ranges, required alignment and page protections provided
      by architectures.
      
      The architectures must fill an execmem_info structure and implement
      execmem_arch_setup() that returns a pointer to that structure.  This way
      the execmem initialization won't be called from every architecture, but
      rather from a central place, namely a core_initcall() in execmem.
      
      execmem provides an execmem_alloc() API that wraps __vmalloc_node_range()
      with the parameters defined by the architectures.  If an architecture does
      not implement execmem_arch_setup(), execmem_alloc() will fall back to
      module_alloc().
      
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Song Liu <song@kernel.org>
      Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      f6bec26c
    • Mike Rapoport (IBM)'s avatar
      mm: introduce execmem_alloc() and execmem_free() · 12af2b83
      Mike Rapoport (IBM) authored
      
      module_alloc() is used everywhere as a means to allocate memory for code.
      
      Beside being semantically wrong, this unnecessarily ties all subsystems
      that need to allocate code, such as ftrace, kprobes and BPF to modules and
      puts the burden of code allocation to the modules code.
      
      Several architectures override module_alloc() because of various
      constraints where the executable memory can be located and this causes
      additional obstacles for improvements of code allocation.
      
      Start splitting code allocation from modules by introducing execmem_alloc()
      and execmem_free() APIs.
      
      Initially, execmem_alloc() is a wrapper for module_alloc() and
      execmem_free() is a replacement of module_memfree() to allow updating all
      call sites to use the new APIs.
      
      Since architectures define different restrictions on placement,
      permissions, alignment and other parameters for memory that can be used by
      different subsystems that allocate executable memory, execmem_alloc() takes
      a type argument that will be used to identify the calling subsystem and
      to allow architectures to define parameters for ranges suitable for that
      subsystem.
      
      No functional changes.
      
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Acked-by: Song Liu <song@kernel.org>
      Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      12af2b83