  1. May 24, 2024
    • genirq/irqdesc: Prevent use-after-free in irq_find_at_or_after() · b84a8aba
      dicken.ding authored
      
      irq_find_at_or_after() dereferences the interrupt descriptor which is
      returned by mt_find() while holding neither sparse_irq_lock nor the RCU
      read lock, which means the descriptor can be freed between mt_find() and
      the dereference:
      
          CPU0                            CPU1
          desc = mt_find()
                                          delayed_free_desc(desc)
          irq_desc_get_irq(desc)
      
      The use-after-free is reported by KASAN:
      
          Call trace:
           irq_get_next_irq+0x58/0x84
           show_stat+0x638/0x824
           seq_read_iter+0x158/0x4ec
           proc_reg_read_iter+0x94/0x12c
           vfs_read+0x1e0/0x2c8
      
          Freed by task 4471:
           slab_free_freelist_hook+0x174/0x1e0
           __kmem_cache_free+0xa4/0x1dc
           kfree+0x64/0x128
           irq_kobj_release+0x28/0x3c
           kobject_put+0xcc/0x1e0
           delayed_free_desc+0x14/0x2c
           rcu_do_batch+0x214/0x720
      
      Guard the access with an RCU read lock section.
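
      As a rough illustration of why the read-side section matters, the
      following is a single-threaded userspace model, not kernel code.  All
      names (model_read_lock(), model_remove(), and so on) are hypothetical
      stand-ins; in the kernel the corresponding primitives would be
      rcu_read_lock()/rcu_read_unlock() and an RCU-deferred free.  The model
      only shows that a removed descriptor stays valid while a reader is still
      inside its section.

```c
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct desc { unsigned int irq; };

static struct desc *slot;     /* stands in for the entry found by the lookup */
static struct desc *pending;  /* removed, waiting for readers to leave */
static int readers;           /* readers currently inside a read-side section */

static void model_read_lock(void) { readers++; }

static void model_read_unlock(void)
{
	/* When the last reader leaves, the deferred free may run. */
	if (--readers == 0 && pending) {
		free(pending);
		pending = NULL;
	}
}

/* Writer: unpublish the descriptor and defer the free past all readers. */
static void model_remove(void)
{
	pending = slot;
	slot = NULL;
	if (readers == 0 && pending) {
		free(pending);
		pending = NULL;
	}
}

/* Reader: lookup and dereference happen inside one read-side section. */
static unsigned int model_find(unsigned int none)
{
	unsigned int irq;

	model_read_lock();
	irq = slot ? slot->irq : none;
	model_read_unlock();
	return irq;
}
```

      Without the lock/unlock pair around the lookup and the dereference,
      nothing in the model would stop model_remove() from freeing the
      descriptor between the two steps, which is the window shown in the
      CPU0/CPU1 diagram.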
      
      Fixes: 721255b9 ("genirq: Use a maple tree for interrupt descriptor management")
      Signed-off-by: dicken.ding <dicken.ding@mediatek.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240524091739.31611-1-dicken.ding@mediatek.com
    • mseal: wire up mseal syscall · ff388fe5
      Jeff Xu authored
      Patch series "Introduce mseal", v10.
      
      This patchset proposes a new mseal() syscall for the Linux kernel.
      
      In a nutshell, mseal() protects the VMAs of a given virtual memory range
      against modifications, such as changes to their permission bits.
      
      Modern CPUs support memory permissions, such as the read/write (RW) and
      no-execute (NX) bits.  Linux has supported NX since the release of kernel
      version 2.6.8 in August 2004 [1].  The memory permission feature improves
      the security stance on memory corruption bugs, as an attacker cannot
      simply write to arbitrary memory and point the code to it.  The memory
      must be marked with the X bit, or else an exception will occur. 
      Internally, the kernel maintains the memory permissions in a data
      structure called VMA (vm_area_struct).  mseal() additionally protects the
      VMA itself against modifications of the selected seal type.
      
      Memory sealing is useful to mitigate memory corruption issues where a
      corrupted pointer is passed to a memory management system.  For example,
      such an attacker primitive can break control-flow integrity guarantees
      since read-only memory that is supposed to be trusted can become writable
      or .text pages can get remapped.  Memory sealing can automatically be
      applied by the runtime loader to seal .text and .rodata pages and
      applications can additionally seal security critical data at runtime.  A
      similar feature already exists in the XNU kernel with the
      VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall
      [4].  Also, Chrome wants to adopt this feature for their CFI work [2] and
      this patchset has been designed to be compatible with the Chrome use case.
      
      Two system calls are involved in sealing the map:  mmap() and mseal().
      
      The new mseal() is a syscall on 64-bit CPUs, with the following signature:
      
      int mseal(void *addr, size_t len, unsigned long flags)
      addr/len: memory range.
      flags: reserved.
      
      mseal() blocks the following operations for the given memory range.
      
      1> Unmapping, moving to another location, and shrinking the size,
         via munmap() and mremap(), which can leave an empty space that can
         then be replaced with a VMA with a new set of attributes.
      
      2> Moving or expanding a different VMA into the current location,
         via mremap().
      
      3> Modifying a VMA via mmap(MAP_FIXED).
      
      4> Size expansion, via mremap(), does not appear to pose any specific
         risks to sealed VMAs. It is included anyway because the use case is
         unclear. In any case, users can rely on merging to expand a sealed VMA.
      
      5> mprotect() and pkey_mprotect().
      
      6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
         memory, when users don't have write permission to the memory. Those
         behaviors can alter region contents by discarding pages, effectively a
         memset(0) for anonymous memory.
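
      A minimal sketch of how the syscall might be exercised from userspace
      follows.  It assumes there is no libc wrapper yet and calls syscall()
      directly; the fallback number 462 is an assumption for headers that do
      not define __NR_mseal, and seal_demo() is a hypothetical helper name.
      On kernels without mseal() the call simply fails and the helper reports
      that.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462	/* assumption: number used when the headers lack it */
#endif

/*
 * Returns 1 if the kernel rejected mseal() (e.g. ENOSYS on older kernels),
 * 0 if the sealed page refused mprotect() and munmap() with EPERM,
 * -1 on any other outcome.
 */
static int seal_demo(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, (size_t)page, PROT_READ,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;

	/* flags is reserved, so pass 0 */
	if (syscall(__NR_mseal, p, (size_t)page, 0UL) != 0) {
		munmap(p, (size_t)page);
		return 1;
	}

	/* Both calls should now be refused; the mapping stays until exit. */
	if (mprotect(p, (size_t)page, PROT_READ | PROT_WRITE) != -1 ||
	    errno != EPERM)
		return -1;
	if (munmap(p, (size_t)page) != -1 || errno != EPERM)
		return -1;
	return 0;
}
```

      Note that a sealed mapping cannot be unmapped again, so the sketch
      deliberately leaves the page in place for the lifetime of the process.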
      
      The idea that inspired this patch comes from Stephen Röttger’s work in
      V8 CFI [5].  Chrome browser in ChromeOS will be the first user of this
      API.
      
      Indeed, the Chrome browser has very specific requirements for sealing,
      which are distinct from those of most applications.  For example, in the
      case of libc, sealing is only applied to read-only (RO) or read-execute
      (RX) memory segments (such as .text and .RELRO) to prevent them from
      becoming writable; the lifetime of those mappings is tied to the lifetime
      of the process.
      
      Chrome wants to seal two large address space reservations that are managed
      by different allocators.  The memory is mapped RW- and RWX respectively
      but write access to it is restricted using pkeys (or in the future ARM
      permission overlay extensions).  The lifetime of those mappings is not
      tied to the lifetime of the process; therefore, while the memory is
      sealed, the allocators still need to free or discard the unused memory,
      for example with madvise(DONTNEED).
      
      However, always allowing madvise(DONTNEED) on this range poses a security
      risk.  For example if a jump instruction crosses a page boundary and the
      second page gets discarded, it will overwrite the target bytes with zeros
      and change the control flow.  Checking write-permission before the discard
      operation allows us to control when the operation is valid.  In this case,
      the madvise will only succeed if the executing thread has PKEY write
      permissions and PKRU changes are protected in software by control-flow
      integrity.
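
      The "overwrite with zeros" effect can be observed from userspace on a
      private anonymous mapping.  The sketch below is only an illustration
      (byte_after_discard() is a hypothetical helper, and the 0xEB value just
      stands in for an instruction byte); it does not involve sealing or
      pkeys.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Writes a byte into an anonymous page, discards the page with
 * MADV_DONTNEED and returns what the byte reads as afterwards
 * (-1 on error).
 */
static int byte_after_discard(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int value;

	if (p == MAP_FAILED)
		return -1;

	p[0] = 0xEB;	/* pretend this is an instruction byte */
	if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}
	value = p[0];	/* the page is faulted back in zero-filled */
	munmap(p, (size_t)page);
	return value;
}
```

      The helper returns 0 rather than 0xEB, which is why an unchecked discard
      on a sealed executable range could change control flow.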
      
      Although the initial version of this patch series is targeting the Chrome
      browser as its first user, it became evident during upstream discussions
      that we would also want to ensure that the patch set eventually is a
      complete solution for memory sealing and compatible with other use cases. 
      The specific scenario currently in mind is glibc's use case of loading and
      sealing ELF executables.  To this end, Stephen is working on a change to
      glibc to add sealing support to the dynamic linker, which will seal all
      non-writable segments at startup.  Once this work is completed, all
      applications will be able to automatically benefit from these new
      protections.
      
      In closing, I would like to formally acknowledge the valuable
      contributions received during the RFC process, which were instrumental in
      shaping this patch:
      
      Jann Horn: raising awareness and providing valuable insights on the
        destructive madvise operations.
      Liam R. Howlett: perf optimization.
      Linus Torvalds: assisting in defining system call signature and scope.
      Theo de Raadt: sharing the experiences and insight gained from
        implementing mimmutable() in OpenBSD.
      
      MM perf benchmarks
      ==================
      This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to
      check the VMAs’ sealing flag, so that no partial update can be made,
      when any segment within the given memory range is sealed.
      
      To measure the performance impact of this loop, two tests were developed
      [8].
      
      The first measures the time taken for a particular system call,
      using clock_gettime(CLOCK_MONOTONIC). The second uses
      PERF_COUNT_HW_REF_CPU_CYCLES (excluding user space). Both tests give
      similar results.
      
      The tests roughly follow the sequence below:
      for (i = 0; i < 1000; i++)
          create 1000 mappings (1 page per VMA)
          start the sampling
          for (j = 0; j < 1000; j++)
              mprotect one mapping
          stop and save the sample
          delete 1000 mappings
      calculate all samples.
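
      A scaled-down sketch of one sample of the first (time-based) test could
      look as follows.  This is not the test program from [8]; the helper name
      time_mprotect_ns() and the layout (changing every other page of one
      region so that each call leaves its own VMA) are assumptions made for
      illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Times n single-page mprotect() calls and returns the elapsed
 * nanoseconds for all of them, or -1 on error.
 */
static long long time_mprotect_ns(int n)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len = 2 * (size_t)n * (size_t)page;
	struct timespec t0, t1;
	long long ns = -1;
	char *base;
	int i;

	base = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &t0);	/* start the sampling */
	for (i = 0; i < n; i++) {
		/* every other page, so neighbours cannot merge back */
		if (mprotect(base + 2 * (size_t)i * (size_t)page,
			     (size_t)page, PROT_READ) != 0)
			goto out;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);	/* stop the sample */

	ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
	     (t1.tv_nsec - t0.tv_nsec);
out:
	munmap(base, len);
	return ns;
}
```

      Dividing the result by n gives a per-call figure; repeating the sample
      many times and averaging, as in the sequence above, reduces noise.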
      
      The tests below were performed on an Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
      4G memory, Chromebook.
      
      Based on the latest upstream code:
      The first test (measuring time)
      syscall__	vmas	t	t_mseal	delta_ns	per_vma	%
      munmap__  	1	909	944	35	35	104%
      munmap__  	2	1398	1502	104	52	107%
      munmap__  	4	2444	2594	149	37	106%
      munmap__  	8	4029	4323	293	37	107%
      munmap__  	16	6647	6935	288	18	104%
      munmap__  	32	11811	12398	587	18	105%
      mprotect	1	439	465	26	26	106%
      mprotect	2	1659	1745	86	43	105%
      mprotect	4	3747	3889	142	36	104%
      mprotect	8	6755	6969	215	27	103%
      mprotect	16	13748	14144	396	25	103%
      mprotect	32	27827	28969	1142	36	104%
      madvise_	1	240	262	22	22	109%
      madvise_	2	366	442	76	38	121%
      madvise_	4	623	751	128	32	121%
      madvise_	8	1110	1324	215	27	119%
      madvise_	16	2127	2451	324	20	115%
      madvise_	32	4109	4642	534	17	113%
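
      The derived columns are not defined in the text; judging from the
      numbers, delta_ns appears to be the difference of the two measurements,
      per_vma that difference divided by the number of VMAs, and % the sealed
      time relative to the baseline.  The sketch below encodes that reading
      (the struct and function names are hypothetical).  Some rows differ by
      one from this arithmetic, presumably because the table was computed from
      unrounded averages.

```c
/* One table row: number of VMAs, baseline and sealed measurement. */
struct row {
	int vmas;
	long base;
	long sealed;
};

/* Extra cost of the sealing check for the whole call. */
static long delta(struct row r)
{
	return r.sealed - r.base;
}

/* Extra cost per VMA, rounded to nearest (non-negative deltas only). */
static long per_vma(struct row r)
{
	return (delta(r) + r.vmas / 2) / r.vmas;
}

/* Sealed measurement as a percentage of the baseline, rounded to nearest. */
static long percent(struct row r)
{
	return (100 * r.sealed + r.base / 2) / r.base;
}
```

      For the row "munmap__ 2 1398 1502" this yields a delta of 104, 52 per
      VMA and 107%, matching the table.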
      
      The second test (measuring cpu cycle)
      syscall__	vmas	cpu	cmseal	delta_cpu	per_vma	%
      munmap__	1	1790	1890	100	100	106%
      munmap__	2	2819	3033	214	107	108%
      munmap__	4	4959	5271	312	78	106%
      munmap__	8	8262	8745	483	60	106%
      munmap__	16	13099	14116	1017	64	108%
      munmap__	32	23221	24785	1565	49	107%
      mprotect	1	906	967	62	62	107%
      mprotect	2	3019	3203	184	92	106%
      mprotect	4	6149	6569	420	105	107%
      mprotect	8	9978	10524	545	68	105%
      mprotect	16	20448	21427	979	61	105%
      mprotect	32	40972	42935	1963	61	105%
      madvise_	1	434	497	63	63	115%
      madvise_	2	752	899	147	74	120%
      madvise_	4	1313	1513	200	50	115%
      madvise_	8	2271	2627	356	44	116%
      madvise_	16	4312	4883	571	36	113%
      madvise_	32	8376	9319	943	29	111%
      
      Based on the results, for the 6.8 kernel, the sealing check adds
      20-40 nanoseconds, or around 50-100 CPU cycles, per VMA.
      
      In addition, I applied the sealing to the 5.10 kernel:
      The first test (measuring time)
      syscall__	vmas	t	tmseal	delta_ns	per_vma	%
      munmap__	1	357	390	33	33	109%
      munmap__	2	442	463	21	11	105%
      munmap__	4	614	634	20	5	103%
      munmap__	8	1017	1137	120	15	112%
      munmap__	16	1889	2153	263	16	114%
      munmap__	32	4109	4088	-21	-1	99%
      mprotect	1	235	227	-7	-7	97%
      mprotect	2	495	464	-30	-15	94%
      mprotect	4	741	764	24	6	103%
      mprotect	8	1434	1437	2	0	100%
      mprotect	16	2958	2991	33	2	101%
      mprotect	32	6431	6608	177	6	103%
      madvise_	1	191	208	16	16	109%
      madvise_	2	300	324	24	12	108%
      madvise_	4	450	473	23	6	105%
      madvise_	8	753	806	53	7	107%
      madvise_	16	1467	1592	125	8	108%
      madvise_	32	2795	3405	610	19	122%
      					
      The second test (measuring cpu cycle)
      syscall__	nbr_vma	cpu	cmseal	delta_cpu	per_vma	%
      munmap__	1	684	715	31	31	105%
      munmap__	2	861	898	38	19	104%
      munmap__	4	1183	1235	51	13	104%
      munmap__	8	1999	2045	46	6	102%
      munmap__	16	3839	3816	-23	-1	99%
      munmap__	32	7672	7887	216	7	103%
      mprotect	1	397	443	46	46	112%
      mprotect	2	738	788	50	25	107%
      mprotect	4	1221	1256	35	9	103%
      mprotect	8	2356	2429	72	9	103%
      mprotect	16	4961	4935	-26	-2	99%
      mprotect	32	9882	10172	291	9	103%
      madvise_	1	351	380	29	29	108%
      madvise_	2	565	615	49	25	109%
      madvise_	4	872	933	61	15	107%
      madvise_	8	1508	1640	132	16	109%
      madvise_	16	3078	3323	245	15	108%
      madvise_	32	5893	6704	811	25	114%
      
      For the 5.10 kernel, the sealing check adds 0-15 ns in time, or 10-30
      CPU cycles; there is even a decrease in some cases.
      
      It might be interesting to compare the 5.10 and 6.8 kernels:
      The first test (measuring time)
      syscall__	vmas	t_5_10	t_6_8	delta_ns	per_vma	%
      munmap__	1	357	909	552	552	254%
      munmap__	2	442	1398	956	478	316%
      munmap__	4	614	2444	1830	458	398%
      munmap__	8	1017	4029	3012	377	396%
      munmap__	16	1889	6647	4758	297	352%
      munmap__	32	4109	11811	7702	241	287%
      mprotect	1	235	439	204	204	187%
      mprotect	2	495	1659	1164	582	335%
      mprotect	4	741	3747	3006	752	506%
      mprotect	8	1434	6755	5320	665	471%
      mprotect	16	2958	13748	10790	674	465%
      mprotect	32	6431	27827	21397	669	433%
      madvise_	1	191	240	49	49	125%
      madvise_	2	300	366	67	33	122%
      madvise_	4	450	623	173	43	138%
      madvise_	8	753	1110	357	45	147%
      madvise_	16	1467	2127	660	41	145%
      madvise_	32	2795	4109	1314	41	147%
      
      The second test (measuring cpu cycle)
      syscall__	vmas	cpu_5_10	c_6_8	delta_cpu	per_vma	%
      munmap__	1	684	1790	1106	1106	262%
      munmap__	2	861	2819	1958	979	327%
      munmap__	4	1183	4959	3776	944	419%
      munmap__	8	1999	8262	6263	783	413%
      munmap__	16	3839	13099	9260	579	341%
      munmap__	32	7672	23221	15549	486	303%
      mprotect	1	397	906	509	509	228%
      mprotect	2	738	3019	2281	1140	409%
      mprotect	4	1221	6149	4929	1232	504%
      mprotect	8	2356	9978	7622	953	423%
      mprotect	16	4961	20448	15487	968	412%
      mprotect	32	9882	40972	31091	972	415%
      madvise_	1	351	434	82	82	123%
      madvise_	2	565	752	186	93	133%
      madvise_	4	872	1313	442	110	151%
      madvise_	8	1508	2271	763	95	151%
      madvise_	16	3078	4312	1234	77	140%
      madvise_	32	5893	8376	2483	78	142%
      
      From 5.10 to 6.8
      munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma.
      mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma.
      madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma.
      
      In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
      increase from 5.10 to 6.8 is significantly larger, approximately ten times
      greater for munmap and mprotect.
      
      When I discussed the mm performance with Brian Makin, an engineer who worked
      on performance, it was brought to my attention that such performance
      benchmarks, which measure millions of mm syscalls in a tight loop, may
      not accurately reflect real-world scenarios, such as that of a database
      service.  Also, this was tested using a single HW and ChromeOS; the data
      from another HW or distribution might be different.  It might be best to
      take this data with a grain of salt.
      
      
      This patch (of 5):
      
      Wire up mseal syscall for all architectures.
      
      Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org
      Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org
      
      
      Signed-off-by: Jeff Xu <jeffxu@chromium.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <groeck@chromium.org>
      Cc: Jann Horn <jannh@google.com> [Bug #2]
      Cc: Jeff Xu <jeffxu@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Pedro Falcato <pedro.falcato@gmail.com>
      Cc: Stephen Röttger <sroettger@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Amer Al Shanawany <amer.shanawany@gmail.com>
      Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. May 23, 2024
    • genirq/cpuhotplug, x86/vector: Prevent vector leak during CPU offline · a6c11c0a
      Dongli Zhang authored
      
      Without IRQD_MOVE_PCNTXT, an interrupt affinity change made via procfs
      does not take effect immediately. Instead, the change is deferred until
      the next instance of the interrupt being triggered on the original CPU.
      
      When the interrupt next triggers on the original CPU, the new affinity is
      enforced within __irq_move_irq(). A vector is allocated from the new CPU,
      but the old vector on the original CPU remains and is not immediately
      reclaimed. Instead, apicd->move_in_progress is flagged, and the reclaiming
      process is delayed until the next trigger of the interrupt on the new CPU.
      
      Upon the subsequent triggering of the interrupt on the new CPU,
      irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
      remains online. Subsequently, the timer on the old CPU iterates over its
      vector_cleanup list, reclaiming old vectors.
      
      However, a rare scenario arises if the old CPU is outgoing before the
      interrupt triggers again on the new CPU.
      
      In that case irq_force_complete_move() is not invoked on the outgoing CPU
      to reclaim the old apicd->prev_vector because the interrupt isn't currently
      affine to the outgoing CPU, and irq_needs_fixup() returns false. Even
      though __vector_schedule_cleanup() is later called on the new CPU, it
      doesn't reclaim apicd->prev_vector; instead, it simply resets both
      apicd->move_in_progress and apicd->prev_vector to 0.
      
      As a result, the vector remains unreclaimed in vector_matrix, leading to a
      CPU vector leak.
      
      To address this issue, move the invocation of irq_force_complete_move()
      before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the
      interrupt is currently or used to be affine to the outgoing CPU.
      
      Additionally, reclaim the vector in __vector_schedule_cleanup() as well,
      following a warning message, although theoretically it should never see
      apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU.
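
      The ordering problem can be pictured with a small userspace model.  This
      is only a sketch under simplifying assumptions: the struct, the
      per-CPU counter and all model_*() functions are hypothetical, the field
      names merely echo the text above, and the real code paths
      (irq_needs_fixup(), irq_force_complete_move(), the vector matrix) are
      considerably more involved.

```c
#include <stdbool.h>

#define MODEL_CPUS 2

/* Hypothetical model of one interrupt's vector state. */
struct model_irq {
	int cpu;		/* CPU the interrupt is currently affine to */
	int prev_cpu;		/* CPU still holding the old vector */
	bool move_in_progress;
};

static int vectors_in_use[MODEL_CPUS];

/* Affinity change: allocate on the new CPU, keep the old vector for now. */
static void model_move(struct model_irq *irq, int new_cpu)
{
	vectors_in_use[new_cpu]++;
	irq->prev_cpu = irq->cpu;
	irq->cpu = new_cpu;
	irq->move_in_progress = true;
}

/* Reclaim the old vector if a move away from this CPU is still pending. */
static void model_force_complete_move(struct model_irq *irq, int cpu)
{
	if (irq->move_in_progress && irq->prev_cpu == cpu) {
		vectors_in_use[cpu]--;
		irq->move_in_progress = false;
	}
}

/* CPU offline path; "fixed" selects the ordering proposed above. */
static void model_cpu_offline(struct model_irq *irq, int cpu, bool fixed)
{
	if (fixed)
		model_force_complete_move(irq, cpu);	/* before the check */
	if (irq->cpu != cpu)	/* interrupt not affine here: no fixup */
		return;
	if (!fixed)
		model_force_complete_move(irq, cpu);
	/* migrating the interrupt away is omitted in this model */
}
```

      With the old ordering, an interrupt that has already moved to the new
      CPU returns early and the old CPU's count never drops; with the forced
      completion placed first, the old vector is reclaimed before the early
      return.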
      
      Fixes: f0383c24 ("genirq/cpuhotplug: Add support for cleaning up move in progress")
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240522220218.162423-1-dongli.zhang@oracle.com
    • tracing/treewide: Remove second parameter of __assign_str() · 2c92ca84
      Steven Rostedt (Google) authored
      With the rework of how __string() handles dynamic strings, where it
      saves the source string in a field of the helper structure [1], the
      value to assign to the trace event field is already stored in the helper
      structure and does not need to be passed in again.
      
      This means that with:
      
        __string(field, mystring)
      
      which used to be assigned with __assign_str(field, mystring), the
      assignment no longer needs the second parameter, which is now unused.
      With this, __assign_str() will now only get a single parameter.
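
      The idea can be sketched in plain C with stand-in macros.  DEMO_STRING()
      and DEMO_ASSIGN_STR() below are hypothetical and are not the kernel's
      trace event macros; they only show why the assignment no longer needs
      the source once the declaration has recorded it.

```c
#include <string.h>

/* Hypothetical stand-in for the helper structure; not the kernel's. */
struct demo_str {
	const char *src;	/* source pointer saved at declaration time */
	char buf[32];		/* space reserved for the copied string */
};

/* The declaration remembers where the string comes from. */
#define DEMO_STRING(field, source) \
	struct demo_str field = { .src = (source) }

/* The assignment needs only the field; the source is already recorded. */
#define DEMO_ASSIGN_STR(field)						\
	do {								\
		strncpy((field).buf,					\
			(field).src ? (field).src : "(null)",		\
			sizeof((field).buf) - 1);			\
		(field).buf[sizeof((field).buf) - 1] = '\0';		\
	} while (0)
```

      A caller would write DEMO_STRING(field, mystring) once and later just
      DEMO_ASSIGN_STR(field), mirroring the one-parameter form described
      above.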
      
      There are over 700 users of __assign_str() and because coccinelle does not
      handle the TRACE_EVENT() macro, I ended up using the following sed script:
      
        git grep -l __assign_str | while read a ; do
            sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
            mv /tmp/test-file $a;
        done
      
      I then searched for __assign_str() that did not end with ';' as those
      were multi line assignments that the sed script above would fail to catch.
      
      Note, the same updates will need to be done for:
      
        __assign_str_len()
        __assign_rel_str()
        __assign_rel_str_len()
      
      I tested this with both an allmodconfig and an allyesconfig (build only for both).
      
      [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Acked-by: Jani Nikula <jani.nikula@intel.com>
      Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
      Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
      Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
      Acked-by: Takashi Iwai <tiwai@suse.de>
      Acked-by: Darrick J. Wong <djwong@kernel.org>	# xfs
      Tested-by: Guenter Roeck <linux@roeck-us.net>
  3. May 22, 2024
  4. May 21, 2024
    • rv: Update rv_en(dis)able_monitor doc to match kernel-doc · 1e8b7b3d
      Yang Li authored
      The patch updates the function documentation comment for
      rv_en(dis)able_monitor to adhere to the kernel-doc specification.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240520054239.61784-1-yang.lee@linux.alibaba.com
      
      
      
      Fixes: 102227b9 ("rv: Add Runtime Verification (RV) interface")
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing: Add MODULE_DESCRIPTION() to preemptirq_delay_test · 23748e3e
      Jeff Johnson authored
      Fix the 'make W=1' warning:
      
      WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/trace/preemptirq_delay_test.o
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240518-md-preemptirq_delay_test-v1-1-387d11b30d85@quicinc.com
      
      
      
      Cc: stable@vger.kernel.org
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: f96e8577 ("lib: Add module for testing preemptoff/irqsoff latency tracers")
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Fix a race between readers and resize checks · c2274b90
      Petr Pavlu authored
      The reader code in rb_get_reader_page() swaps a new reader page into the
      ring buffer by doing cmpxchg on old->list.prev->next to point it to the
      new page. Following that, if the operation is successful,
      old->list.next->prev gets updated too. This means the underlying
      doubly-linked list is temporarily inconsistent: page->prev->next or
      page->next->prev might not point back to page for some page in the
      ring buffer.
      
      The resize operation in ring_buffer_resize() can be invoked in parallel.
      It calls rb_check_pages() which can detect the described inconsistency
      and stop further tracing:
      
      [  190.271762] ------------[ cut here ]------------
      [  190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 rb_check_pages.isra.0+0x6a/0xa0
      [  190.271789] Modules linked in: [...]
      [  190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 skx_edac(E):1
      [  190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G            E      6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f
      [  190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014
      [  190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0
      [  190.272023] Code: [...]
      [  190.272028] RSP: 0018:ffff9c37463abb70 EFLAGS: 00010206
      [  190.272034] RAX: ffff8eba04b6cb80 RBX: 0000000000000007 RCX: ffff8eba01f13d80
      [  190.272038] RDX: ffff8eba01f130c0 RSI: ffff8eba04b6cd00 RDI: ffff8eba0004c700
      [  190.272042] RBP: ffff8eba0004c700 R08: 0000000000010002 R09: 0000000000000000
      [  190.272045] R10: 00000000ffff7f52 R11: ffff8eba7f600000 R12: ffff8eba0004c720
      [  190.272049] R13: ffff8eba00223a00 R14: 0000000000000008 R15: ffff8eba067a8000
      [  190.272053] FS:  00007f1bd64752c0(0000) GS:ffff8eba7f680000(0000) knlGS:0000000000000000
      [  190.272057] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  190.272061] CR2: 00007f1bd6662590 CR3: 000000010291e001 CR4: 0000000000370ef0
      [  190.272070] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  190.272073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  190.272077] Call Trace:
      [  190.272098]  <TASK>
      [  190.272189]  ring_buffer_resize+0x2ab/0x460
      [  190.272199]  __tracing_resize_ring_buffer.part.0+0x23/0xa0
      [  190.272206]  tracing_resize_ring_buffer+0x65/0x90
      [  190.272216]  tracing_entries_write+0x74/0xc0
      [  190.272225]  vfs_write+0xf5/0x420
      [  190.272248]  ksys_write+0x67/0xe0
      [  190.272256]  do_syscall_64+0x82/0x170
      [  190.272363]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
      [  190.272373] RIP: 0033:0x7f1bd657d263
      [  190.272381] Code: [...]
      [  190.272385] RSP: 002b:00007ffe72b643f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  190.272391] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1bd657d263
      [  190.272395] RDX: 0000000000000002 RSI: 0000555a6eb538e0 RDI: 0000000000000001
      [  190.272398] RBP: 0000555a6eb538e0 R08: 000000000000000a R09: 0000000000000000
      [  190.272401] R10: 0000555a6eb55190 R11: 0000000000000246 R12: 00007f1bd6662500
      [  190.272404] R13: 0000000000000002 R14: 00007f1bd6667c00 R15: 0000000000000002
      [  190.272412]  </TASK>
      [  190.272414] ---[ end trace 0000000000000000 ]---
      
      Note that ring_buffer_resize() calls rb_check_pages() only if the parent
      trace_buffer has recording disabled. Recent commit d78ab792
      ("tracing: Stop current tracer when resizing buffer") makes this always
      the case, which makes the issue more likely to occur.
      
      The window to hit this race is nonetheless very small. To help
      reproduce it, one can add a delay loop in rb_get_reader_page():
      
       ret = rb_head_page_replace(reader, cpu_buffer->reader_page);
       if (!ret)
       	goto spin;
       for (unsigned i = 0; i < 1U << 26; i++)  /* inserted delay loop */
       	__asm__ __volatile__ ("" : : : "memory");
       rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list;
      
      .. and then run the following commands on the target system:
      
       echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
       while true; do
       	echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
       	echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
       done &
       while true; do
       	for i in /sys/kernel/tracing/per_cpu/*; do
       		timeout 0.1 cat $i/trace_pipe; sleep 0.2
       	done
       done
      
      To fix the problem, make sure ring_buffer_resize() doesn't invoke
      rb_check_pages() concurrently with a reader operating on the same
      ring_buffer_per_cpu by taking its cpu_buffer->reader_lock.
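
      The transient state that the check can trip over is easy to reproduce
      in a standalone model of the list.  The sketch below is not the ring
      buffer code: struct page_node and the ring_*()/swap_*() helpers are
      hypothetical, and the cmpxchg is replaced by a plain store since the
      model is single-threaded.

```c
#include <stdbool.h>
#include <stddef.h>

struct page_node {
	struct page_node *next, *prev;
};

/* Link n nodes into a ring. */
static void ring_init(struct page_node *nodes, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		nodes[i].next = &nodes[(i + 1) % n];
		nodes[i].prev = &nodes[(i + n - 1) % n];
	}
}

/* Integrity check: every neighbour must point back at the node. */
static bool ring_consistent(struct page_node *head)
{
	struct page_node *p = head;

	do {
		if (p->next->prev != p || p->prev->next != p)
			return false;
		p = p->next;
	} while (p != head);
	return true;
}

/* First half of the swap: old->prev->next now points at the new page. */
static void swap_step1(struct page_node *old, struct page_node *new_page)
{
	new_page->next = old->next;
	new_page->prev = old->prev;
	old->prev->next = new_page;
}

/* Second half: old->next->prev is updated, restoring consistency. */
static void swap_step2(struct page_node *old, struct page_node *new_page)
{
	old->next->prev = new_page;
}
```

      A check that runs between swap_step1() and swap_step2() reports an
      inconsistent list even though nothing is actually corrupted, which is
      why the check and the swap need to be serialized by the same lock.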
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240517134008.24529-3-petr.pavlu@suse.com
      
      
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: 659f451f ("ring-buffer: Add integrity check at end of iter read")
      Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
      [ Fixed whitespace ]
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Correct stale comments related to non-consuming readers · ea70a962
      Petr Pavlu authored
      Adjust the following code documentation:
      
      * Kernel-doc comments for ring_buffer_read_prepare() and
        ring_buffer_read_finish() mention that recording to the ring buffer is
        disabled when the read is active. Remove mention of this restriction
        because it was already lifted in commit 1039221c ("ring-buffer: Do
        not disable recording when there is an iterator").
      
      * Function ring_buffer_read_finish() performs a self-check of the
        ring-buffer by locking cpu_buffer->reader_lock and then calling
        rb_check_pages(). The preceding comment explains that the lock is
        needed because rb_check_pages() clears the HEAD flag required by
        readers which might be running in parallel. Remove this explanation
        because commit 8843e06f ("ring-buffer: Handle race between
        rb_move_tail and rb_check_pages") simplified the function so it no
        longer resets the mentioned flag. Nonetheless, the lock is still
        needed because a reader swapping a page into the ring buffer can make
        the underlying doubly-linked list temporarily inconsistent.
      
      This is a non-functional change.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240517134008.24529-2-petr.pavlu@suse.com
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  5. May 18, 2024
    • kprobe/ftrace: fix build error due to bad function definition · 4b377b48
      Linus Torvalds authored
      
      Commit 1a7d0890 ("kprobe/ftrace: bail out if ftrace was killed")
      introduced a bad K&R function definition, which we haven't accepted in a
      long long time.
      
      Gcc seems to let it slide, but clang notices with the appropriate error:
      
        kernel/kprobes.c:1140:24: error: a function declaration without a prototype is deprecated in all >
         1140 | void kprobe_ftrace_kill()
              |                        ^
              |                         void
      
      but this commit was apparently never in linux-next before it was sent
      upstream, so it didn't get the appropriate build test coverage.
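
      For reference, the difference between the two spellings looks like this
      in a standalone example.  The names demo_kill() and
      demo_ftrace_disabled are hypothetical; only the form of the parameter
      list matters.

```c
#include <stdbool.h>

static bool demo_ftrace_disabled;

/* Declaration with a real prototype: the function takes no arguments. */
void demo_kill(void);

/*
 * Before C23, writing "void demo_kill()" here would define the function
 * without a prototype, so the compiler could not check arguments at call
 * sites. Spelling out "(void)" avoids that.
 */
void demo_kill(void)
{
	demo_ftrace_disabled = true;
}
```

      With the (void) form, a stray call such as demo_kill(1) is rejected at
      compile time instead of being silently accepted.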
      
      Fixes: 1a7d0890 kprobe/ftrace: bail out if ftrace was killed
      Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. May 17, 2024
  7. May 15, 2024
  8. May 14, 2024
  9. May 13, 2024
  10. May 12, 2024
  11. May 09, 2024