  1. Sep 15, 2023
    • Lukas Wunner
      panic: Reenable preemption in WARN slowpath · cccd3281
      Lukas Wunner authored
      
      Commit:
      
        5a5d7e9b ("cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG")
      
      amended warn_slowpath_fmt() to disable preemption until the WARN splat
      has been emitted.
      
      However, the commit neglected to reenable preemption in the !fmt codepath,
      i.e. when a WARN splat is emitted without an additional format string.
      
      One consequence is that users may see more splats than intended.  E.g. a
      WARN splat emitted in a work item results in at least two extra splats:
      
        BUG: workqueue leaked lock or atomic
        (emitted by process_one_work())
      
        BUG: scheduling while atomic
        (emitted by worker_thread() -> schedule())
      
      Ironically the point of the commit was to *avoid* extra splats. ;)
      
      Fix it.
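      
      A minimal sketch of the shape of the fix (simplified; the real code in
      kernel/panic.c uses the helpers introduced by the commit above rather than
      bare preempt calls):
      
        /* simplified sketch, not the actual kernel/panic.c code */
        void warn_slowpath_fmt(const char *file, int line, unsigned taint,
                               const char *fmt, ...)
        {
                struct warn_args args;

                preempt_disable();              /* done on entry since 5a5d7e9b */

                if (!fmt) {
                        __warn(file, line, __builtin_return_address(0), taint,
                               NULL, NULL);
                        preempt_enable();       /* the reenable this fix adds */
                        return;
                }

                args.fmt = fmt;
                va_start(args.args, fmt);
                __warn(file, line, __builtin_return_address(0), taint, NULL, &args);
                va_end(args.args);
                preempt_enable();
        }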
      
      Fixes: 5a5d7e9b ("cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG")
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Link: https://lore.kernel.org/r/3ec48fde01e4ee6505f77908ba351bad200ae3d1.1694763684.git.lukas@wunner.de
      cccd3281
  2. Sep 13, 2023
  3. Sep 12, 2023
    • Jinjie Ruan
      eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec() · c8414dab
      Jinjie Ruan authored
      Injecting a fault while probing btrfs.ko showed that if kstrdup() fails in
      eventfs_prepare_ef(), called from eventfs_add_dir(), an ERR_PTR is returned
      and assigned to file->ef. However, eventfs_remove(), called from
      trace_module_remove_events(), only checks for NULL, which causes the NULL
      pointer dereference below.
      
      As both Masami and Steven suggested, the allocator side should handle the
      error carefully and remove it, so fix the places where it failed.
      
       Could not create tracefs 'raid56_write' directory
       Btrfs loaded, zoned=no, fsverity=no
       Unable to handle kernel NULL pointer dereference at virtual address 000000000000001c
       Mem abort info:
         ESR = 0x0000000096000004
         EC = 0x25: DABT (current EL), IL = 32 bits
         SET = 0, FnV = 0
         EA = 0, S1PTW = 0
         FSC = 0x04: level 0 translation fault
       Data abort info:
         ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
         CM = 0, WnR = 0, TnD = 0, TagAccess = 0
         GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
       user pgtable: 4k pages, 48-bit VAs, pgdp=0000000102544000
       [000000000000001c] pgd=0000000000000000, p4d=0000000000000000
       Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
       Dumping ftrace buffer:
          (ftrace buffer empty)
       Modules linked in: btrfs(-) libcrc32c xor xor_neon raid6_pq cfg80211 rfkill 8021q garp mrp stp llc ipv6 [last unloaded: btrfs]
       CPU: 15 PID: 1343 Comm: rmmod Tainted: G                 N 6.5.0+ #40
       Hardware name: linux,dummy-virt (DT)
       pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
       pc : eventfs_remove_rec+0x24/0xc0
       lr : eventfs_remove+0x68/0x1d8
       sp : ffff800082d63b60
       x29: ffff800082d63b60 x28: ffffb84b80ddd00c x27: ffffb84b3054ba40
       x26: 0000000000000002 x25: ffff800082d63bf8 x24: ffffb84b8398e440
       x23: ffffb84b82af3000 x22: dead000000000100 x21: dead000000000122
       x20: ffff800082d63bf8 x19: fffffffffffffff4 x18: ffffb84b82508820
       x17: 0000000000000000 x16: 0000000000000000 x15: 000083bc876a3166
       x14: 000000000000006d x13: 000000000000006d x12: 0000000000000000
       x11: 0000000000000001 x10: 00000000000017e0 x9 : 0000000000000001
       x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffffb84b84289804
       x5 : 0000000000000000 x4 : 9696969696969697 x3 : ffff33a5b7601f38
       x2 : 0000000000000000 x1 : ffff800082d63bf8 x0 : fffffffffffffff4
       Call trace:
        eventfs_remove_rec+0x24/0xc0
        eventfs_remove+0x68/0x1d8
        remove_event_file_dir+0x88/0x100
        event_remove+0x140/0x15c
        trace_module_notify+0x1fc/0x230
        notifier_call_chain+0x98/0x17c
        blocking_notifier_call_chain+0x4c/0x74
        __arm64_sys_delete_module+0x1a4/0x298
        invoke_syscall+0x44/0x100
        el0_svc_common.constprop.1+0x68/0xe0
        do_el0_svc+0x1c/0x28
        el0_svc+0x3c/0xc4
        el0t_64_sync_handler+0xa0/0xc4
        el0t_64_sync+0x174/0x178
       Code: 5400052c a90153b3 aa0003f3 aa0103f4 (f9401400)
       ---[ end trace 0000000000000000 ]---
       Kernel panic - not syncing: Oops: Fatal exception
       SMP: stopping secondary CPUs
       Dumping ftrace buffer:
          (ftrace buffer empty)
       Kernel Offset: 0x384b00c00000 from 0xffff800080000000
       PHYS_OFFSET: 0xffffcc5b80000000
       CPU features: 0x88000203,3c020000,1000421b
       Memory Limit: none
       Rebooting in 1 seconds..
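      
      A hedged sketch of the allocator-side handling this moves to (names and the
      call site are simplified, not the actual diff):
      
        struct eventfs_file *ef;

        ef = eventfs_add_dir(name, ef_parent);
        if (IS_ERR(ef)) {
                pr_warn("Failed to create eventfs '%s' directory\n", name);
                file->ef = NULL;        /* never leave an ERR_PTR behind */
                return PTR_ERR(ef);
        }
        file->ef = ef;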
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230912134752.1838524-1-ruanjinjie@huawei.com
      Link: https://lore.kernel.org/all/20230912025808.668187-1-ruanjinjie@huawei.com/
      Link: https://lore.kernel.org/all/20230911052818.1020547-1-ruanjinjie@huawei.com/
      Link: https://lore.kernel.org/all/20230909072817.182846-1-ruanjinjie@huawei.com/
      Link: https://lore.kernel.org/all/20230908074816.3724716-1-ruanjinjie@huawei.com/
      
      
      
      Cc: Ajay Kaher <akaher@vmware.com>
      Fixes: 5bdcd5f5 ("eventfs: Implement removal of meta data from eventfs")
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      c8414dab
    • Chen Yu
      PM: hibernate: Fix the exclusive get block device in test_resume mode · 148b6f4c
      Chen Yu authored
      Commit 5904de0d ("PM: hibernate: Do not get block device exclusively
      in test_resume mode") fixes a hibernation issue under test_resume mode.
      That commit is supposed to open the block device in non-exclusive mode
      when in test_resume. However, the code does the opposite, which
      contradicts its description.
      
      In summary, the swap device is only opened exclusively by swsusp_check()
      with its corresponding *close(), and only when not in test_resume mode.
      This avoids the race condition where different processes scribble on the
      device at the same time. All other cases should use non-exclusive mode.
      
      Fix it by really disabling exclusive mode under test_resume.
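      
      A hedged sketch of the intended logic (hib_open_swap_device() is a
      hypothetical wrapper, not the real kernel/power API):
      
        /* before (buggy): exclusive open happened only in test_resume mode */
        err = hib_open_swap_device(/* exclusive = */ snapshot_test);

        /* after (fixed): exclusive open only in the normal resume path */
        err = hib_open_swap_device(/* exclusive = */ !snapshot_test);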
      
      Fixes: 5904de0d ("PM: hibernate: Do not get block device exclusively in test_resume mode")
      Closes: https://lore.kernel.org/lkml/000000000000761f5f0603324129@google.com/
      
      
      Reported-by: Pengfei Xu <pengfei.xu@intel.com>
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Tested-by: Chenzhou Feng <chenzhoux.feng@intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      148b6f4c
    • Chen Yu
      PM: hibernate: Rename function parameter from snapshot_test to exclusive · 40d84e19
      Chen Yu authored
      
      Several functions rely on snapshot_test to decide whether to
      open the resume device exclusively. However, there is no strict
      connection between snapshot_test and the open mode. Rename
      the 'snapshot_test' input parameter to 'exclusive' to better reflect
      the use case.
      
      No functional change is expected.
      
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      40d84e19
  4. Sep 11, 2023
  5. Sep 09, 2023
  6. Sep 08, 2023
    • Zhenhua Huang
      Revert "dma-contiguous: check for memory region overlap" · f875db4f
      Zhenhua Huang authored
      
      This reverts commit 3fa6456e.
      
      The commit broke CMA region creation through DT on arm64,
      as shown by the following logs with "memblock=debug":
      [    0.000000] memblock_phys_alloc_range: 41943040 bytes align=0x200000
      from=0x0000000000000000 max_addr=0x00000000ffffffff
      early_init_dt_alloc_reserved_memory_arch+0x34/0xa0
      [    0.000000] memblock_reserve: [0x00000000fd600000-0x00000000ffdfffff]
      memblock_alloc_range_nid+0xc0/0x19c
      [    0.000000] Reserved memory: overlap with other memblock reserved region
      
      From the call flow, the region we defined in DT was always reserved before
      entering rmem_cma_setup(). Also, rmem_cma_setup() has one routine,
      cma_init_reserved_mem(), to ensure the region was reserved. Checking that
      the region is not reserved here does not seem correct.
      
      early_init_fdt_scan_reserved_mem:
          fdt_scan_reserved_mem
              __reserved_mem_reserve_reg
                  early_init_dt_reserve_memory
                      memblock_reserve (using "reg" prop case)
              fdt_init_reserved_mem
                  __reserved_mem_alloc_size
                      *early_init_dt_alloc_reserved_memory_arch*
                          memblock_reserve (dynamic alloc case)
              __reserved_mem_init_node
                  rmem_cma_setup (region overlap check here should always fail)
      
      Example DT can be used to reproduce issue:
      
          dump_mem: mem_dump_region {
                  compatible = "shared-dma-pool";
                  alloc-ranges = <0x0 0x00000000 0x0 0xffffffff>;
                  reusable;
                  size = <0 0x2800000>;
          };
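      
      For reference, the reverted check had roughly this shape (a reconstruction,
      not the verbatim revert): it rejects a region that memblock already lists
      as reserved, which is always the case for DT-declared regions at this
      point.
      
        /* reconstruction of the check removed from rmem_cma_setup() */
        if (memblock_is_region_reserved(rmem->base, rmem->size)) {
                pr_info("Reserved memory: overlap with other memblock reserved region\n");
                return -EBUSY;
        }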
      
      Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
      f875db4f
  7. Sep 07, 2023
  8. Sep 06, 2023
    • Puranjay Mohan
      bpf: make bpf_prog_pack allocator portable · 20e490ad
      Puranjay Mohan authored
      
      The bpf_prog_pack allocator currently uses module_alloc() and
      module_memfree() to allocate and free memory. This is not portable
      because different architectures use different methods for allocating
      memory for BPF programs; for example, arm64 and riscv use
      vmalloc()/vfree().
      
      Use bpf_jit_alloc_exec() and bpf_jit_free_exec() for memory management
      in the bpf_prog_pack allocator. Other architectures can override these
      with their own implementations and will be able to use bpf_prog_pack
      directly.
      
      On architectures that don't override bpf_jit_alloc/free_exec() this is
      basically a NOP.
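      
      The generic fallbacks are weak symbols in kernel/bpf/core.c, roughly:
      
        void *__weak bpf_jit_alloc_exec(unsigned long size)
        {
                return module_alloc(size);
        }

        void __weak bpf_jit_free_exec(void *addr)
        {
                module_memfree(addr);
        }
      
      so architectures that do not provide their own versions keep the existing
      module_alloc()/module_memfree() behaviour.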
      
      Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
      Acked-by: Song Liu <song@kernel.org>
      Acked-by: Björn Töpel <bjorn@kernel.org>
      Tested-by: Björn Töpel <bjorn@rivosinc.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/r/20230831131229.497941-2-puranjay12@gmail.com
      
      
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
      20e490ad
    • Martin KaFai Lau
      bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc · 55d49f75
      Martin KaFai Lau authored
      
      Commit c83597fa ("bpf: Refactor some inode/task/sk storage functions
      for reuse") refactored bpf_{sk,task,inode}_storage_free() into
      bpf_local_storage_unlink_nolock(), which was later renamed to
      bpf_local_storage_destroy(). The commit accidentally passed the
      "bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock(),
      which then stopped the uncharge from happening to sk->sk_omem_alloc.
      
      This missing uncharge only happens when the sk is going away (during
      __sk_destruct).
      
      This patch fixes it by always passing "uncharge_mem = true". It is a
      noop to the task/inode/cgroup storage because they do not have the
      map_local_storage_(un)charge enabled in the map_ops. A followup patch
      will be done in bpf-next to remove the uncharge_mem argument.
      
      A selftest is added in the next patch.
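      
      A hedged sketch of the change's shape (the function name below is made up
      and the argument list is simplified from the real bpf_local_storage code):
      
        /* sketch only: sk_storage_free_sketch() is illustrative */
        static void sk_storage_free_sketch(struct bpf_local_storage *local_storage,
                                           struct bpf_local_storage_elem *selem)
        {
                /*
                 * uncharge_mem = true restores the sk->sk_omem_alloc uncharge
                 * that the refactoring accidentally dropped.
                 */
                bpf_selem_unlink_storage_nolock(local_storage, selem,
                                                true /* uncharge_mem */,
                                                true /* reuse_now */);
        }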
      
      Fixes: c83597fa ("bpf: Refactor some inode/task/sk storage functions for reuse")
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev
      55d49f75
    • Martin KaFai Lau
      bpf: bpf_sk_storage: Fix invalid wait context lockdep report · a96a44ab
      Martin KaFai Lau authored
      
      './test_progs -t test_local_storage' reported a splat:
      
      [   27.137569] =============================
      [   27.138122] [ BUG: Invalid wait context ]
      [   27.138650] 6.5.0-03980-gd11ae1b16b0a #247 Tainted: G           O
      [   27.139542] -----------------------------
      [   27.140106] test_progs/1729 is trying to lock:
      [   27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130
      [   27.141834] other info that might help us debug this:
      [   27.142437] context-{5:5}
      [   27.142856] 2 locks held by test_progs/1729:
      [   27.143352]  #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40
      [   27.144492]  #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0
      [   27.145855] stack backtrace:
      [   27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G           O       6.5.0-03980-gd11ae1b16b0a #247
      [   27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [   27.149127] Call Trace:
      [   27.149490]  <TASK>
      [   27.149867]  dump_stack_lvl+0x130/0x1d0
      [   27.152609]  dump_stack+0x14/0x20
      [   27.153131]  __lock_acquire+0x1657/0x2220
      [   27.153677]  lock_acquire+0x1b8/0x510
      [   27.157908]  local_lock_acquire+0x29/0x130
      [   27.159048]  obj_cgroup_charge+0xf4/0x3c0
      [   27.160794]  slab_pre_alloc_hook+0x28e/0x2b0
      [   27.161931]  __kmem_cache_alloc_node+0x51/0x210
      [   27.163557]  __kmalloc+0xaa/0x210
      [   27.164593]  bpf_map_kzalloc+0xbc/0x170
      [   27.165147]  bpf_selem_alloc+0x130/0x510
      [   27.166295]  bpf_local_storage_update+0x5aa/0x8e0
      [   27.167042]  bpf_fd_sk_storage_update_elem+0xdb/0x1a0
      [   27.169199]  bpf_map_update_value+0x415/0x4f0
      [   27.169871]  map_update_elem+0x413/0x550
      [   27.170330]  __sys_bpf+0x5e9/0x640
      [   27.174065]  __x64_sys_bpf+0x80/0x90
      [   27.174568]  do_syscall_64+0x48/0xa0
      [   27.175201]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      [   27.175932] RIP: 0033:0x7effb40e41ad
      [   27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8
      [   27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad
      [   27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002
      [   27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788
      [   27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000
      [   27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000
      [   27.184958]  </TASK>
      
      It complains about acquiring a local_lock while holding a raw_spin_lock.
      It means it should not allocate memory while holding a raw_spin_lock
      since it is not safe for RT.
      
      raw_spin_lock is needed because bpf_local_storage supports tracing
      context. In particular for task local storage, it is easy to
      get a "current" task PTR_TO_BTF_ID in tracing bpf prog.
      However, task (and cgroup) local storage has already been moved to
      bpf mem allocator which can be used after raw_spin_lock.
      
      The splat is for the sk storage. The sk (and inode) storage has not yet
      been moved to the bpf mem allocator. Whether raw_spin_lock is used or not,
      kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context.
      However, the local storage helper requires a verifier-accepted sk pointer
      (PTR_TO_BTF_ID), so it is hypothetical whether that situation (running a
      bpf prog in a kzalloc-unsafe context while also holding a verifier-accepted
      sk pointer) could happen.
      
      This patch avoids kzalloc after raw_spin_lock to silence the splat.
      There is an existing kzalloc before the raw_spin_lock. At that point,
      a kzalloc is very likely required because a lookup has just been done
      beforehand. Thus, this patch always does the kzalloc before acquiring
      the raw_spin_lock and removes the later kzalloc usage after the
      raw_spin_lock. After this change, there will be a charge and then an
      uncharge during the syscall bpf_map_update_elem() code path.
      This patch opts for simplicity and does not continue the old
      optimization that saved one charge and uncharge.
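      
      A hedged sketch of the reordering (trimmed and simplified from
      bpf_local_storage_update(); error unwinding omitted):
      
        struct bpf_local_storage_elem *alloc_selem;
        unsigned long flags;

        /* GFP_ATOMIC allocation happens before the raw_spin_lock ...      */
        alloc_selem = bpf_map_kzalloc(&smap->map, smap->elem_size, GFP_ATOMIC);
        if (!alloc_selem)
                return ERR_PTR(-ENOMEM);

        raw_spin_lock_irqsave(&local_storage->lock, flags);
        /* ... so nothing is allocated (and no local_lock taken) under it; */
        /* if the pre-allocated element ends up unused, it is freed later. */
        raw_spin_unlock_irqrestore(&local_storage->lock, flags);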
      
      This issue dates back to the very first commit of bpf_sk_storage,
      which has since been refactored multiple times to create task, inode, and
      cgroup storage. This patch uses a Fixes tag with a more recent
      commit that should be easier to backport.
      
      Fixes: b00fa38a ("bpf: Enable non-atomic allocations in local storage")
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev
      a96a44ab
    • Sebastian Andrzej Siewior
      bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check. · 6764e767
      Sebastian Andrzej Siewior authored
      
      __bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx before
      performing the recursion check which means in case of a recursion
      __bpf_prog_exit_recur() uses the previously set bpf_tramp_run_ctx::saved_run_ctx
      value.
      
      __bpf_prog_enter_sleepable_recur() assigns bpf_tramp_run_ctx::saved_run_ctx
      after the recursion check which means in case of a recursion
      __bpf_prog_exit_sleepable_recur() uses an uninitialized value. This does not
      look right. If I read the entry trampoline code right, then bpf_tramp_run_ctx
      isn't initialized upfront.
      
      Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur() and
      set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is made.
      Remove the assignment of saved_run_ctx in kern_sys_bpf() since it happens
      a few cycles later.
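      
      The resulting ordering in __bpf_prog_enter_sleepable_recur() is roughly
      (body trimmed):
      
        run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);

        if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
                bpf_prog_inc_misses_counter(prog);
                return 0;       /* the exit path now sees a valid saved_run_ctx */
        }
        return bpf_prog_start_time();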
      
      Fixes: e384c7b7 ("bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/bpf/20230830080405.251926-3-bigeasy@linutronix.de
      6764e767
    • Sebastian Andrzej Siewior
      bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf(). · 7645629f
      Sebastian Andrzej Siewior authored
      
      If __bpf_prog_enter_sleepable_recur() detects recursion then it returns
      0 without undoing rcu_read_lock_trace(), migrate_disable() or
      decrementing the recursion counter. This is fine in the JIT case because
      the JIT code will jump in the 0 case to the end and invoke the matching
      exit trampoline (__bpf_prog_exit_sleepable_recur()).
      
      This is not the case in kern_sys_bpf() which returns directly to the
      caller with an error code.
      
      Add a call to __bpf_prog_exit_sleepable_recur() as cleanup in the
      recursion case.
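      
      A hedged sketch of the resulting code in kern_sys_bpf() (surrounding
      context trimmed):
      
        if (!__bpf_prog_enter_sleepable_recur(prog, &run_ctx)) {
                /* recursion detected: undo the enter before bailing out */
                __bpf_prog_exit_sleepable_recur(prog, 0, &run_ctx);
                bpf_prog_put(prog);
                return -EBUSY;
        }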
      
      Fixes: b1d18a75 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/bpf/20230830080405.251926-2-bigeasy@linutronix.de
      7645629f
  9. Sep 03, 2023
  10. Sep 02, 2023
    • Linus Torvalds
      cgroup: fix build when CGROUP_SCHED is not enabled · 76be05d4
      Linus Torvalds authored
      
      Sudip Mukherjee reports that the mips sb1250_swarm_defconfig build fails
      with the current kernel.  It isn't actually MIPS-specific, it's just
      that that defconfig does not have CGROUP_SCHED enabled like most configs
      do, and as such shows this error:
      
        kernel/cgroup/cgroup.c: In function 'cgroup_local_stat_show':
        kernel/cgroup/cgroup.c:3699:15: error: implicit declaration of function 'cgroup_tryget_css'; did you mean 'cgroup_tryget'? [-Werror=implicit-function-declaration]
         3699 |         css = cgroup_tryget_css(cgrp, ss);
              |               ^~~~~~~~~~~~~~~~~
              |               cgroup_tryget
        kernel/cgroup/cgroup.c:3699:13: warning: assignment to 'struct cgroup_subsys_state *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
         3699 |         css = cgroup_tryget_css(cgrp, ss);
              |             ^
      
      because cgroup_tryget_css() only exists when CGROUP_SCHED is enabled,
      and the cgroup_local_stat_show() function should similarly be guarded by
      that config option.
      
      Move things around a bit to fix this all.
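      
      A hedged sketch of the shape of the fix (signatures simplified): keep the
      function and the helper it uses under the same config guard.
      
        #ifdef CONFIG_CGROUP_SCHED
        /* cgroup_tryget_css() is defined here ... */

        static void cgroup_local_stat_show(struct seq_file *seq,
                                           struct cgroup *cgrp, int ssid)
        {
                /* ... and its only user stays inside the same #ifdef */
        }
        #endif /* CONFIG_CGROUP_SCHED */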
      
      Fixes: d1d4ff5d ("cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED")
      Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76be05d4
    • Shrikanth Hegde
      sched/fair: Optimize should_we_balance() for large SMT systems · f8858d96
      Shrikanth Hegde authored
      
      should_we_balance() is called in load_balance() to find out if the CPU that
      is trying to do the load balance is the right one or not.
      
      With commit:
      
        b1bfeab9 ("sched/fair: Consider the idle state of the whole core for load balance")
      
      the code tries to find an idle core to do the load balancing
      and falls back on an idle sibling CPU if there is no idle core.
      
      However, on larger SMT systems, it could needlessly keep iterating to find
      an idle CPU by scanning all the CPUs of a non-idle core. If the core is not
      idle, and the first idle SMT sibling has already been found, then there is
      no need to check the other SMT siblings for idleness.
      
      Let's say in SMT4, Core0 has CPUs 0,2,4,6, CPU0 is busy and the rest are
      idle, and the balancing domain is MC/DIE. CPU2 will be set as the first
      idle_smt and the same process would be repeated for CPU4 and CPU6, which is
      unnecessary. Since calling is_core_idle() loops through all CPUs in the SMT
      mask, the effect is multiplied by the weight of smt_mask. For example, when
      just 1 CPU is busy, we would skip the loop for 2 CPUs and avoid iterating
      over 8 CPUs. The effect would be larger in the DIE/NUMA domains where there
      are more cores.
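      
      A hedged sketch of the optimization, trimmed from the scan loop in
      should_we_balance() (the swb_cpus scratch-mask name is illustrative):
      
        for_each_cpu_and(cpu, swb_cpus, env->cpus) {
                if (!idle_cpu(cpu))
                        continue;

                if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
                        if (idle_smt == -1)
                                idle_smt = cpu;
                        /*
                         * The core is busy and one idle sibling is already
                         * recorded: drop its remaining siblings from the scan.
                         */
                        cpumask_andnot(swb_cpus, swb_cpus, cpu_smt_mask(cpu));
                        continue;
                }

                /* an idle CPU on a fully idle core: balance from here */
                return cpu == env->dst_cpu;
        }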
      
      Testing and performance evaluation
      ==================================
      
      The test has been done on a system which has 12 cores, i.e. 24 small
      cores with SMT=4:
      
        lscpu
        Architecture:            ppc64le
          Byte Order:            Little Endian
        CPU(s):                  96
          On-line CPU(s) list:   0-95
        Model name:              POWER10 (architected), altivec supported
          Thread(s) per core:    8
      
      The funclatency bcc tool was used to evaluate the time taken by
      should_we_balance(). For the base tip/sched/core, the time was collected
      by making should_we_balance() noinline. Time is in nanoseconds. The values
      were collected by running the funclatency tracer for 60 seconds and are the
      average of 3 such runs. This represents the expected reduction in time with
      the patch.
      
      tip/sched/core was at commit:
      
        2f88c8e8 ("sched/eevdf/doc: Modify the documented knob to base_slice_ns as well")
      
      Results:
      
      	------------------------------------------------------------------------------
      	workload			   tip/sched/core	with_patch(%gain)
      	------------------------------------------------------------------------------
      	idle system				 809.3		 695.0(16.45)
      	stress ng – 12 threads -l 100		1013.5		 893.1(13.49)
      	stress ng – 24 threads -l 100		1073.5		 980.0(9.54)
      	stress ng – 48 threads -l 100		 683.0		 641.0(6.55)
      	stress ng – 96 threads -l 100		2421.0		2300(5.26)
      	stress ng – 96 threads -l 15		 375.5		 377.5(-0.53)
      	stress ng – 96 threads -l 25		 635.5		 637.5(-0.31)
      	stress ng – 96 threads -l 35		 934.0		 891.0(4.83)
      
      schbench (old), hackbench and stress_ng were run to evaluate the workload
      performance between tip/sched/core and the patched kernel, with no
      modification to tip/sched/core.
      
      TL;DR:
      
      A good improvement is seen with schbench, and when hackbench and stress_ng
      run for longer, a good improvement is seen as well.
      
      	------------------------------------------------------------------------------
      	schbench(old)		            tip		+patch(%gain)
      	10 iterations			sched/core
      	------------------------------------------------------------------------------
      	1 Threads
      	50.0th:		      		    8.00       9.00(-12.50)
      	75.0th:   			    9.60       9.00(6.25)
      	90.0th:   			   11.80      10.20(13.56)
      	95.0th:   			   12.60      10.40(17.46)
      	99.0th:   			   13.60      11.90(12.50)
      	99.5th:   			   14.10      12.60(10.64)
      	99.9th:   			   15.90      14.60(8.18)
      	2 Threads
      	50.0th:   			    9.90       9.20(7.07)
      	75.0th:   			   12.60      10.10(19.84)
      	90.0th:   			   15.50      12.00(22.58)
      	95.0th:   			   17.70      14.00(20.90)
      	99.0th:   			   21.20      16.90(20.28)
      	99.5th:   			   22.60      17.50(22.57)
      	99.9th:   			   30.40      19.40(36.18)
      	4 Threads
      	50.0th:   			   12.50      10.60(15.20)
      	75.0th:   			   15.30      12.00(21.57)
      	90.0th:   			   18.60      14.10(24.19)
      	95.0th:   			   21.30      16.20(23.94)
      	99.0th:   			   26.00      20.70(20.38)
      	99.5th:   			   27.60      22.50(18.48)
      	99.9th:   			   33.90      31.40(7.37)
      	8 Threads
      	50.0th:   			   16.30      14.30(12.27)
      	75.0th:   			   20.20      17.40(13.86)
      	90.0th:   			   24.50      21.90(10.61)
      	95.0th:   			   27.30      24.70(9.52)
      	99.0th:   			   35.00      31.20(10.86)
      	99.5th:   			   46.40      33.30(28.23)
      	99.9th:   			   89.30      57.50(35.61)
      	16 Threads
      	50.0th:   			   22.70      20.70(8.81)
      	75.0th:   			   30.10      27.40(8.97)
      	90.0th:   			   36.00      32.80(8.89)
      	95.0th:   			   39.60      36.40(8.08)
      	99.0th:   			   49.20      44.10(10.37)
      	99.5th:   			   64.90      50.50(22.19)
      	99.9th:   			  143.50     100.60(29.90)
      	32 Threads
      	50.0th:   			   34.60      35.50(-2.60)
      	75.0th:   			   48.20      50.50(-4.77)
      	90.0th:   			   59.20      62.40(-5.41)
      	95.0th:   			   65.20      69.00(-5.83)
      	99.0th:   			   80.40      83.80(-4.23)
      	99.5th:   			  102.10      98.90(3.13)
      	99.9th:   			  727.10     506.80(30.30)
      
      schbench does improve in general. There is some run-to-run variation with
      schbench; a validation run was done to confirm that the trend is similar.
      
      	------------------------------------------------------------------------------
      	hackbench				tip	   +patch(%gain)
      	20 iterations, 50000 loops	     sched/core
      	------------------------------------------------------------------------------
      	Process 10 groups                :      11.74      11.70(0.34)
      	Process 20 groups                :      22.73      22.69(0.18)
      	Process 30 groups                :      33.39      33.40(-0.03)
      	Process 40 groups                :      43.73      43.61(0.27)
      	Process 50 groups                :      53.82      54.35(-0.98)
      	Process 60 groups                :      64.16      65.29(-1.76)
      	thread 10 Time                   :      12.81      12.79(0.16)
      	thread 20 Time                   :      24.63      24.47(0.65)
      	Process(Pipe) 10 Time            :       6.40       6.34(0.94)
      	Process(Pipe) 20 Time            :      10.62      10.63(-0.09)
      	Process(Pipe) 30 Time            :      15.09      14.84(1.66)
      	Process(Pipe) 40 Time            :      19.42      19.01(2.11)
      	Process(Pipe) 50 Time            :      24.04      23.34(2.91)
      	Process(Pipe) 60 Time            :      28.94      27.51(4.94)
      	thread(Pipe) 10 Time             :       6.96       6.87(1.29)
      	thread(Pipe) 20 Time             :      11.74      11.73(0.09)
      
      hackbench shows a slight improvement with pipes and a slight degradation
      with processes.
      
      	------------------------------------------------------------------------------
      	stress_ng				tip        +patch(%gain)
      	10 iterations 100000 cpu_ops	     sched/core
      	------------------------------------------------------------------------------
      
      	--cpu=96 -util=100 Time taken  	 :       5.30,       5.01(5.47)
      	--cpu=48 -util=100 Time taken    :       7.94,       6.73(15.24)
      	--cpu=24 -util=100 Time taken    :      11.67,       8.75(25.02)
      	--cpu=12 -util=100 Time taken    :      15.71,      15.02(4.39)
      	--cpu=96 -util=10 Time taken     :      22.71,      22.19(2.29)
      	--cpu=96 -util=20 Time taken     :      12.14,      12.37(-1.89)
      	--cpu=96 -util=30 Time taken     :       8.76,       8.86(-1.14)
      	--cpu=96 -util=40 Time taken     :       7.13,       7.14(-0.14)
      	--cpu=96 -util=50 Time taken     :       6.10,       6.13(-0.49)
      	--cpu=96 -util=60 Time taken     :       5.42,       5.41(0.18)
      	--cpu=96 -util=70 Time taken     :       4.94,       4.94(0.00)
      	--cpu=96 -util=80 Time taken     :       4.56,       4.53(0.66)
      	--cpu=96 -util=90 Time taken     :       4.27,       4.26(0.23)
      
      A good improvement is seen with 24 CPUs; in this case only one CPU is busy
      and no core is idle. There is a decent improvement in the 100% utilization
      case and no difference at other utilization levels.
      
      Fixes: b1bfeab9 ("sched/fair: Consider the idle state of the whole core for load balance")
      Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20230902081204.232218-1-sshegde@linux.vnet.ibm.com
      f8858d96
    • Valentin Schneider
      tracing/filters: Fix coding style issues · cbb557ba
      Valentin Schneider authored
      Recent commits have introduced some coding style issues, fix those up.
      
      Link: https://lkml.kernel.org/r/20230901151039.125186-5-vschneid@redhat.com
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Valentin Schneider <vschneid@redhat.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      cbb557ba
    • Valentin Schneider
      tracing/filters: Change parse_pred() cpulist ternary into an if block · 2900bcbe
      Valentin Schneider authored
      Review comments noted that an if block would be clearer than a ternary, so
      swap it out.
      
      No change in behaviour is intended.
      
      Link: https://lkml.kernel.org/r/20230901151039.125186-4-vschneid@redhat.com
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Valentin Schneider <vschneid@redhat.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      2900bcbe
    • Valentin Schneider
      tracing/filters: Fix double-free of struct filter_pred.mask · 1caf7adb
      Valentin Schneider authored
      When a cpulist filter is found to contain a single CPU, that CPU is saved
      as a scalar and the backing cpumask storage is freed.
      
      Also NULL the mask to avoid a double-free once we get down to
      free_predicate().
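      
      A minimal sketch of the pattern (surrounding parse code trimmed):
      
        /* a single CPU in the cpulist: keep it as a scalar value ...    */
        pred->val = cpumask_first(pred->mask);

        /* ... and free the mask, NULLing it so that free_predicate()    */
        /* does not kfree() it a second time.                            */
        kfree(pred->mask);
        pred->mask = NULL;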
      
      Link: https://lkml.kernel.org/r/20230901151039.125186-3-vschneid@redhat.com
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Valentin Schneider <vschneid@redhat.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      1caf7adb
    • Valentin Schneider
      tracing/filters: Fix error-handling of cpulist parsing buffer · 9af40584
      Valentin Schneider authored
      parse_pred() allocates a string buffer to parse the user-provided cpulist,
      but doesn't check the allocation result nor does it free the buffer once it
      is no longer needed.
      
      Add an allocation check, and free the buffer as soon as it is no longer
      needed.
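      
      A hedged sketch of the intended handling (simplified from parse_pred();
      the cpulist_start/cpulist_len names are illustrative):
      
        char *tmp;
        int ret;

        /* copy the user-provided cpulist into a NUL-terminated buffer */
        tmp = kstrndup(cpulist_start, cpulist_len, GFP_KERNEL);
        if (!tmp)
                return -ENOMEM;         /* the missing allocation check */

        ret = cpulist_parse(tmp, pred->mask);
        kfree(tmp);                     /* free as soon as it is parsed */
        if (ret)
                return -EINVAL;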
      
      Link: https://lkml.kernel.org/r/20230901151039.125186-2-vschneid@redhat.com
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Valentin Schneider <vschneid@redhat.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      9af40584
    • Brian Foster
      tracing: Zero the pipe cpumask on alloc to avoid spurious -EBUSY · 3d07fa1d
      Brian Foster authored
      The pipe cpumask used to serialize opens between the main and percpu
      trace pipes is not zeroed or initialized. This can result in
      spurious -EBUSY returns if underlying memory is not fully zeroed.
      This has been observed by immediate failure to read the main
      trace_pipe file on an otherwise newly booted and idle system:
      
       # cat /sys/kernel/debug/tracing/trace_pipe
       cat: /sys/kernel/debug/tracing/trace_pipe: Device or resource busy
      
      Zero the allocation of pipe_cpumask to avoid the problem.
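      
      A minimal sketch of the fix's shape (allocation site simplified):
      
        /* zeroed, so no CPU appears spuriously "busy" right after boot */
        if (!zalloc_cpumask_var(&tr->pipe_cpumask, GFP_KERNEL))
                return -ENOMEM;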
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230831125500.986862-1-bfoster@redhat.com
      
      
      
      Cc: stable@vger.kernel.org
      Fixes: c2489bb7 ("tracing: Introduce pipe_cpumask to avoid race on trace_pipes")
      Reviewed-by: Zheng Yejian <zhengyejian1@huawei.com>
      Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      3d07fa1d
    • Ruan Jinjie
      ftrace: Use LIST_HEAD to initialize clear_hash · 2a30dbcb
      Ruan Jinjie authored
      Use LIST_HEAD() to initialize clear_hash instead of open-coding it.
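      
      For illustration:
      
        /*
         * before:
         *   struct list_head clear_hash;
         *   INIT_LIST_HEAD(&clear_hash);
         */

        /* after: declare and initialize in one step */
        LIST_HEAD(clear_hash);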
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230809071551.913041-1-ruanjinjie@huawei.com
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      2a30dbcb
    • Levi Yun
      ftrace: Use within_module to check rec->ip within specified module. · 13511489
      Levi Yun authored
      The within_module_core() && within_module_init() condition is equivalent to
      within_module(), but the latter is more readable.
      
      Use within_module() instead of the former condition to check whether
      rec->ip is within the specified module area.
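      
      The shape of the change (trimmed; roughly the negated form used in
      ftrace.c):
      
        /* before: both ranges checked explicitly */
        if (!within_module_core(rec->ip, mod) &&
            !within_module_init(rec->ip, mod))
                break;

        /* after: within_module() covers the core and init ranges */
        if (!within_module(rec->ip, mod))
                break;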
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230803205236.32201-1-ppbuk5246@gmail.com
      
      
      
      Signed-off-by: Levi Yun <ppbuk5246@gmail.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      13511489
    • Zheng Yejian
      tracing: Fix race issue between cpu buffer write and swap · 3163f635
      Zheng Yejian authored
      A warning happened in rb_end_commit() at this code:
      	if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing)))
      
        WARNING: CPU: 0 PID: 139 at kernel/trace/ring_buffer.c:3142
      	rb_commit+0x402/0x4a0
        Call Trace:
         ring_buffer_unlock_commit+0x42/0x250
         trace_buffer_unlock_commit_regs+0x3b/0x250
         trace_event_buffer_commit+0xe5/0x440
         trace_event_buffer_reserve+0x11c/0x150
         trace_event_raw_event_sched_switch+0x23c/0x2c0
         __traceiter_sched_switch+0x59/0x80
         __schedule+0x72b/0x1580
         schedule+0x92/0x120
         worker_thread+0xa0/0x6f0
      
      It is caused by a race between writing an event into the cpu buffer and
      swapping the cpu buffer through the file per_cpu/cpu0/snapshot:
      
        Write on CPU 0             Swap buffer by per_cpu/cpu0/snapshot on CPU 1
        --------                   --------
                                   tracing_snapshot_write()
                                     [...]
      
        ring_buffer_lock_reserve()
          cpu_buffer = buffer->buffers[cpu]; // 1. Suppose find 'cpu_buffer_a';
          [...]
          rb_reserve_next_event()
            [...]
      
                                     ring_buffer_swap_cpu()
                                       if (local_read(&cpu_buffer_a->committing))
                                           goto out_dec;
                                       if (local_read(&cpu_buffer_b->committing))
                                           goto out_dec;
                                       buffer_a->buffers[cpu] = cpu_buffer_b;
                                       buffer_b->buffers[cpu] = cpu_buffer_a;
                                       // 2. cpu_buffer has swapped here.
      
            rb_start_commit(cpu_buffer);
            if (unlikely(READ_ONCE(cpu_buffer->buffer)
                != buffer)) { // 3. This check passed due to 'cpu_buffer->buffer'
              [...]           //    has not changed here.
              return NULL;
            }
                                       cpu_buffer_b->buffer = buffer_a;
                                       cpu_buffer_a->buffer = buffer_b;
                                       [...]
      
            // 4. Reserve event from 'cpu_buffer_a'.
      
        ring_buffer_unlock_commit()
          [...]
          cpu_buffer = buffer->buffers[cpu]; // 5. Now find 'cpu_buffer_b' !!!
          rb_commit(cpu_buffer)
            rb_end_commit()  // 6. WARN for the wrong 'committing' state !!!
      
      Based on the above analysis, we can easily reproduce it with the following
      test case:
        ``` bash
        #!/bin/bash
      
        dmesg -n 7
        sysctl -w kernel.panic_on_warn=1
        TR=/sys/kernel/tracing
        echo 7 > ${TR}/buffer_size_kb
        echo "sched:sched_switch" > ${TR}/set_event
        while [ true ]; do
                echo 1 > ${TR}/per_cpu/cpu0/snapshot
        done &
        while [ true ]; do
                echo 1 > ${TR}/per_cpu/cpu0/snapshot
        done &
        while [ true ]; do
                echo 1 > ${TR}/per_cpu/cpu0/snapshot
        done &
        ```
      
      To fix it, IIUC, we can use smp_call_function_single() to do the swap on
      the target cpu where the buffer is located, so that the above race would
      be avoided.
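      
      A hedged sketch of that approach (struct and function names here are
      illustrative, not the actual patch):
      
        struct rb_swap_params {
                struct trace_buffer     *buffer_a;
                struct trace_buffer     *buffer_b;
                int                     cpu;
                int                     ret;
        };

        static void rb_swap_on_cpu(void *data)
        {
                struct rb_swap_params *p = data;

                /* runs on p->cpu itself, so no writer there is mid-commit */
                p->ret = ring_buffer_swap_cpu(p->buffer_a, p->buffer_b, p->cpu);
        }

        /* caller: perform the swap on the CPU that owns the buffer */
        smp_call_function_single(cpu, rb_swap_on_cpu, &params, /* wait = */ 1);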
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230831132739.4070878-1-zhengyejian1@huawei.com
      
      
      
      Cc: <mhiramat@kernel.org>
      Fixes: f1affcaa ("tracing: Add snapshot in the per_cpu trace directories")
      Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      3163f635
    • Mikhail Kobuk
      tracing: Remove extra space at the end of hwlat_detector/mode · 2cf0dee9
      Mikhail Kobuk authored
      A space is printed after each mode value, including the last one:
      $ echo \"$(sudo cat /sys/kernel/tracing/hwlat_detector/mode)\"
      "none [round-robin] per-cpu "
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
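      
      A minimal sketch of the usual fix for this pattern (illustrative, not the
      exact diff; MODE_MAX and thread_mode_str follow the hwlat naming but are
      assumptions here): print the separator between entries rather than after
      each one.
      
        for (i = 0; i < MODE_MAX; i++) {
                if (i)
                        seq_putc(s, ' ');       /* separator, not a trailer */
                if (i == mode)
                        seq_printf(s, "[%s]", thread_mode_str[i]);
                else
                        seq_printf(s, "%s", thread_mode_str[i]);
        }
        seq_putc(s, '\n');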
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230825103432.7750-1-m.kobuk@ispras.ru
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Fixes: 8fa826b7 ("trace/hwlat: Implement the mode config option")
      Signed-off-by: Mikhail Kobuk <m.kobuk@ispras.ru>
      Reviewed-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      2cf0dee9
  11. Aug 31, 2023
  12. Aug 30, 2023