- May 24, 2024
-
-
dicken.ding authored
irq_find_at_or_after() dereferences the interrupt descriptor which is returned by mt_find() while neither holding sparse_irq_lock nor RCU read lock, which means the descriptor can be freed between mt_find() and the dereference: CPU0 CPU1 desc = mt_find() delayed_free_desc(desc) irq_desc_get_irq(desc) The use-after-free is reported by KASAN: Call trace: irq_get_next_irq+0x58/0x84 show_stat+0x638/0x824 seq_read_iter+0x158/0x4ec proc_reg_read_iter+0x94/0x12c vfs_read+0x1e0/0x2c8 Freed by task 4471: slab_free_freelist_hook+0x174/0x1e0 __kmem_cache_free+0xa4/0x1dc kfree+0x64/0x128 irq_kobj_release+0x28/0x3c kobject_put+0xcc/0x1e0 delayed_free_desc+0x14/0x2c rcu_do_batch+0x214/0x720 Guard the access with a RCU read lock section. Fixes: 721255b9 ("genirq: Use a maple tree for interrupt descriptor management") Signed-off-by:
dicken.ding <dicken.ding@mediatek.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240524091739.31611-1-dicken.ding@mediatek.com
-
Jeff Xu authored
Patch series "Introduce mseal", v10. This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. Two system calls are involved in sealing the map: mmap() and mseal(). The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. MM perf benchmarks ================== This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs’ sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed. To measure the performance impact of this loop, two tests are developed. [8] The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results. The tests have roughly below sequence: for (i = 0; i < 1000, i++) create 1000 mappings (1 page per VMA) start the sampling for (j = 0; j < 1000, j++) mprotect one mapping stop and save the sample delete 1000 mappings calculates all samples. Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook. Based on the latest upstream code: The first test (measuring time) syscall__ vmas t t_mseal delta_ns per_vma % munmap__ 1 909 944 35 35 104% munmap__ 2 1398 1502 104 52 107% munmap__ 4 2444 2594 149 37 106% munmap__ 8 4029 4323 293 37 107% munmap__ 16 6647 6935 288 18 104% munmap__ 32 11811 12398 587 18 105% mprotect 1 439 465 26 26 106% mprotect 2 1659 1745 86 43 105% mprotect 4 3747 3889 142 36 104% mprotect 8 6755 6969 215 27 103% mprotect 16 13748 14144 396 25 103% mprotect 32 27827 28969 1142 36 104% madvise_ 1 240 262 22 22 109% madvise_ 2 366 442 76 38 121% madvise_ 4 623 751 128 32 121% madvise_ 8 1110 1324 215 27 119% madvise_ 16 2127 2451 324 20 115% madvise_ 32 4109 4642 534 17 113% The second test (measuring cpu cycle) syscall__ vmas cpu cmseal delta_cpu per_vma % munmap__ 1 1790 1890 100 100 106% munmap__ 2 2819 3033 214 107 108% munmap__ 4 4959 5271 312 78 106% munmap__ 8 8262 8745 483 60 106% munmap__ 16 13099 14116 1017 64 108% munmap__ 32 23221 24785 1565 49 107% mprotect 1 906 967 62 62 107% mprotect 2 3019 3203 184 92 106% mprotect 4 6149 6569 420 105 107% mprotect 8 9978 10524 545 68 105% mprotect 16 20448 21427 979 61 105% mprotect 32 40972 42935 1963 61 105% madvise_ 1 434 497 63 63 115% madvise_ 2 752 899 147 74 120% madvise_ 4 1313 1513 200 50 115% madvise_ 8 2271 2627 356 44 116% madvise_ 16 4312 4883 571 36 113% madvise_ 32 8376 9319 943 29 111% Based on the result, for 6.8 kernel, sealing check adds 20-40 nano seconds, or around 50-100 CPU cycles, per VMA. In addition, I applied the sealing to 5.10 kernel: The first test (measuring time) syscall__ vmas t tmseal delta_ns per_vma % munmap__ 1 357 390 33 33 109% munmap__ 2 442 463 21 11 105% munmap__ 4 614 634 20 5 103% munmap__ 8 1017 1137 120 15 112% munmap__ 16 1889 2153 263 16 114% munmap__ 32 4109 4088 -21 -1 99% mprotect 1 235 227 -7 -7 97% mprotect 2 495 464 -30 -15 94% mprotect 4 741 764 24 6 103% mprotect 8 1434 1437 2 0 100% mprotect 16 2958 2991 33 2 101% mprotect 32 6431 6608 177 6 103% madvise_ 1 191 208 16 16 109% madvise_ 2 300 324 24 12 108% madvise_ 4 450 473 23 6 105% madvise_ 8 753 806 53 7 107% madvise_ 16 1467 1592 125 8 108% madvise_ 32 2795 3405 610 19 122% The second test (measuring cpu cycle) syscall__ nbr_vma cpu cmseal delta_cpu per_vma % munmap__ 1 684 715 31 31 105% munmap__ 2 861 898 38 19 104% munmap__ 4 1183 1235 51 13 104% munmap__ 8 1999 2045 46 6 102% munmap__ 16 3839 3816 -23 -1 99% munmap__ 32 7672 7887 216 7 103% mprotect 1 397 443 46 46 112% mprotect 2 738 788 50 25 107% mprotect 4 1221 1256 35 9 103% mprotect 8 2356 2429 72 9 103% mprotect 16 4961 4935 -26 -2 99% mprotect 32 9882 10172 291 9 103% madvise_ 1 351 380 29 29 108% madvise_ 2 565 615 49 25 109% madvise_ 4 872 933 61 15 107% madvise_ 8 1508 1640 132 16 109% madvise_ 16 3078 3323 245 15 108% madvise_ 32 5893 6704 811 25 114% For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30 CPU cycles, there is even decrease in some cases. It might be interesting to compare 5.10 and 6.8 kernel The first test (measuring time) syscall__ vmas t_5_10 t_6_8 delta_ns per_vma % munmap__ 1 357 909 552 552 254% munmap__ 2 442 1398 956 478 316% munmap__ 4 614 2444 1830 458 398% munmap__ 8 1017 4029 3012 377 396% munmap__ 16 1889 6647 4758 297 352% munmap__ 32 4109 11811 7702 241 287% mprotect 1 235 439 204 204 187% mprotect 2 495 1659 1164 582 335% mprotect 4 741 3747 3006 752 506% mprotect 8 1434 6755 5320 665 471% mprotect 16 2958 13748 10790 674 465% mprotect 32 6431 27827 21397 669 433% madvise_ 1 191 240 49 49 125% madvise_ 2 300 366 67 33 122% madvise_ 4 450 623 173 43 138% madvise_ 8 753 1110 357 45 147% madvise_ 16 1467 2127 660 41 145% madvise_ 32 2795 4109 1314 41 147% The second test (measuring cpu cycle) syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma % munmap__ 1 684 1790 1106 1106 262% munmap__ 2 861 2819 1958 979 327% munmap__ 4 1183 4959 3776 944 419% munmap__ 8 1999 8262 6263 783 413% munmap__ 16 3839 13099 9260 579 341% munmap__ 32 7672 23221 15549 486 303% mprotect 1 397 906 509 509 228% mprotect 2 738 3019 2281 1140 409% mprotect 4 1221 6149 4929 1232 504% mprotect 8 2356 9978 7622 953 423% mprotect 16 4961 20448 15487 968 412% mprotect 32 9882 40972 31091 972 415% madvise_ 1 351 434 82 82 123% madvise_ 2 565 752 186 93 133% madvise_ 4 872 1313 442 110 151% madvise_ 8 1508 2271 763 95 151% madvise_ 16 3078 4312 1234 77 140% madvise_ 32 5893 8376 2483 78 142% From 5.10 to 6.8 munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma. mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma. madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma. In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the increase from 5.10 to 6.8 is significantly larger, approximately ten times greater for munmap and mprotect. When I discuss the mm performance with Brian Makin, an engineer who worked on performance, it was brought to my attention that such performance benchmarks, which measuring millions of mm syscall in a tight loop, may not accurately reflect real-world scenarios, such as that of a database service. Also this is tested using a single HW and ChromeOS, the data from another HW or distribution might be different. It might be best to take this data with a grain of salt. This patch (of 5): Wire up mseal syscall for all architectures. Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org Signed-off-by:
Jeff Xu <jeffxu@chromium.org> Reviewed-by:
Kees Cook <keescook@chromium.org> Reviewed-by:
Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> [Bug #2] Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- May 23, 2024
-
-
Dongli Zhang authored
The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of interrupt affinity reconfiguration via procfs. Instead, the change is deferred until the next instance of the interrupt being triggered on the original CPU. When the interrupt next triggers on the original CPU, the new affinity is enforced within __irq_move_irq(). A vector is allocated from the new CPU, but the old vector on the original CPU remains and is not immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the reclaiming process is delayed until the next trigger of the interrupt on the new CPU. Upon the subsequent triggering of the interrupt on the new CPU, irq_complete_move() adds a task to the old CPU's vector_cleanup list if it remains online. Subsequently, the timer on the old CPU iterates over its vector_cleanup list, reclaiming old vectors. However, a rare scenario arises if the old CPU is outgoing before the interrupt triggers again on the new CPU. In that case irq_force_complete_move() is not invoked on the outgoing CPU to reclaim the old apicd->prev_vector because the interrupt isn't currently affine to the outgoing CPU, and irq_needs_fixup() returns false. Even though __vector_schedule_cleanup() is later called on the new CPU, it doesn't reclaim apicd->prev_vector; instead, it simply resets both apicd->move_in_progress and apicd->prev_vector to 0. As a result, the vector remains unreclaimed in vector_matrix, leading to a CPU vector leak. To address this issue, move the invocation of irq_force_complete_move() before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the interrupt is currently or used to be affine to the outgoing CPU. Additionally, reclaim the vector in __vector_schedule_cleanup() as well, following a warning message, although theoretically it should never see apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU. Fixes: f0383c24 ("genirq/cpuhotplug: Add support for cleaning up move in progress") Signed-off-by:
Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240522220218.162423-1-dongli.zhang@oracle.com
-
Steven Rostedt (Google) authored
With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by:
Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by:
Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by:
Guenter Roeck <linux@roeck-us.net>
-
- May 22, 2024
-
-
Mike Christie authored
This removes the signal/coredump hacks added for vhost_tasks in: Commit f9010dbd ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") When that patch was added vhost_tasks did not handle SIGKILL and would try to ignore/clear the signal and continue on until the device's close function was called. In the previous patches vhost_tasks and the vhost drivers were converted to support SIGKILL by cleaning themselves up and exiting. The hacks are no longer needed so this removes them. Signed-off-by:
Mike Christie <michael.christie@oracle.com> Message-Id: <20240316004707.45557-10-michael.christie@oracle.com> Signed-off-by:
Michael S. Tsirkin <mst@redhat.com>
-
Mike Christie authored
Instead of lingering until the device is closed, this has us handle SIGKILL by: 1. marking the worker as killed so we no longer try to use it with new virtqueues and new flush operations. 2. setting the virtqueue to worker mapping so no new works are queued. 3. running all the exiting works. Suggested-by:
Edward Adam Davis <eadavis@qq.com> Reported-and-tested-by:
<syzbot+98edc2df894917b3431f@syzkaller.appspotmail.com> Message-Id: <tencent_546DA49414E876EEBECF2C78D26D242EE50A@qq.com> Signed-off-by:
Mike Christie <michael.christie@oracle.com> Message-Id: <20240316004707.45557-9-michael.christie@oracle.com> Signed-off-by:
Michael S. Tsirkin <mst@redhat.com>
-
- May 21, 2024
-
-
Yang Li authored
The patch updates the function documentation comment for rv_en(dis)able_monitor to adhere to the kernel-doc specification. Link: https://lore.kernel.org/linux-trace-kernel/20240520054239.61784-1-yang.lee@linux.alibaba.com Fixes: 102227b9 ("rv: Add Runtime Verification (RV) interface") Signed-off-by:
Yang Li <yang.lee@linux.alibaba.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Jeff Johnson authored
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/trace/preemptirq_delay_test.o Link: https://lore.kernel.org/linux-trace-kernel/20240518-md-preemptirq_delay_test-v1-1-387d11b30d85@quicinc.com Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: f96e8577 ("lib: Add module for testing preemptoff/irqsoff latency tracers") Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Petr Pavlu authored
The reader code in rb_get_reader_page() swaps a new reader page into the ring buffer by doing cmpxchg on old->list.prev->next to point it to the new page. Following that, if the operation is successful, old->list.next->prev gets updated too. This means the underlying doubly-linked list is temporarily inconsistent, page->prev->next or page->next->prev might not be equal back to page for some page in the ring buffer. The resize operation in ring_buffer_resize() can be invoked in parallel. It calls rb_check_pages() which can detect the described inconsistency and stop further tracing: [ 190.271762] ------------[ cut here ]------------ [ 190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 rb_check_pages.isra.0+0x6a/0xa0 [ 190.271789] Modules linked in: [...] [ 190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 skx_edac(E):1 [ 190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G E 6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f [ 190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014 [ 190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0 [ 190.272023] Code: [...] [ 190.272028] RSP: 0018:ffff9c37463abb70 EFLAGS: 00010206 [ 190.272034] RAX: ffff8eba04b6cb80 RBX: 0000000000000007 RCX: ffff8eba01f13d80 [ 190.272038] RDX: ffff8eba01f130c0 RSI: ffff8eba04b6cd00 RDI: ffff8eba0004c700 [ 190.272042] RBP: ffff8eba0004c700 R08: 0000000000010002 R09: 0000000000000000 [ 190.272045] R10: 00000000ffff7f52 R11: ffff8eba7f600000 R12: ffff8eba0004c720 [ 190.272049] R13: ffff8eba00223a00 R14: 0000000000000008 R15: ffff8eba067a8000 [ 190.272053] FS: 00007f1bd64752c0(0000) GS:ffff8eba7f680000(0000) knlGS:0000000000000000 [ 190.272057] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 190.272061] CR2: 00007f1bd6662590 CR3: 000000010291e001 CR4: 0000000000370ef0 [ 190.272070] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 190.272073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 190.272077] Call Trace: [ 190.272098] <TASK> [ 190.272189] ring_buffer_resize+0x2ab/0x460 [ 190.272199] __tracing_resize_ring_buffer.part.0+0x23/0xa0 [ 190.272206] tracing_resize_ring_buffer+0x65/0x90 [ 190.272216] tracing_entries_write+0x74/0xc0 [ 190.272225] vfs_write+0xf5/0x420 [ 190.272248] ksys_write+0x67/0xe0 [ 190.272256] do_syscall_64+0x82/0x170 [ 190.272363] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 190.272373] RIP: 0033:0x7f1bd657d263 [ 190.272381] Code: [...] [ 190.272385] RSP: 002b:00007ffe72b643f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 190.272391] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1bd657d263 [ 190.272395] RDX: 0000000000000002 RSI: 0000555a6eb538e0 RDI: 0000000000000001 [ 190.272398] RBP: 0000555a6eb538e0 R08: 000000000000000a R09: 0000000000000000 [ 190.272401] R10: 0000555a6eb55190 R11: 0000000000000246 R12: 00007f1bd6662500 [ 190.272404] R13: 0000000000000002 R14: 00007f1bd6667c00 R15: 0000000000000002 [ 190.272412] </TASK> [ 190.272414] ---[ end trace 0000000000000000 ]--- Note that ring_buffer_resize() calls rb_check_pages() only if the parent trace_buffer has recording disabled. Recent commit d78ab792 ("tracing: Stop current tracer when resizing buffer") causes that it is now always the case which makes it more likely to experience this issue. The window to hit this race is nonetheless very small. To help reproducing it, one can add a delay loop in rb_get_reader_page(): ret = rb_head_page_replace(reader, cpu_buffer->reader_page); if (!ret) goto spin; for (unsigned i = 0; i < 1U << 26; i++) /* inserted delay loop */ __asm__ __volatile__ ("" : : : "memory"); rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list; .. and then run the following commands on the target system: echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable while true; do echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1 echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1 done & while true; do for i in /sys/kernel/tracing/per_cpu/*; do timeout 0.1 cat $i/trace_pipe; sleep 0.2 done done To fix the problem, make sure ring_buffer_resize() doesn't invoke rb_check_pages() concurrently with a reader operating on the same ring_buffer_per_cpu by taking its cpu_buffer->reader_lock. Link: https://lore.kernel.org/linux-trace-kernel/20240517134008.24529-3-petr.pavlu@suse.com Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 659f451f ("ring-buffer: Add integrity check at end of iter read") Signed-off-by:
Petr Pavlu <petr.pavlu@suse.com> [ Fixed whitespace ] Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Petr Pavlu authored
Adjust the following code documentation: * Kernel-doc comments for ring_buffer_read_prepare() and ring_buffer_read_finish() mention that recording to the ring buffer is disabled when the read is active. Remove mention of this restriction because it was already lifted in commit 1039221c ("ring-buffer: Do not disable recording when there is an iterator"). * Function ring_buffer_read_finish() performs a self-check of the ring-buffer by locking cpu_buffer->reader_lock and then calling rb_check_pages(). The preceding comment explains that the lock is needed because rb_check_pages() clears the HEAD flag required by readers which might be running in parallel. Remove this explanation because commit 8843e06f ("ring-buffer: Handle race between rb_move_tail and rb_check_pages") simplified the function so it no longer resets the mentioned flag. Nonetheless, the lock is still needed because a reader swapping a page into the ring buffer can make the underlying doubly-linked list temporarily inconsistent. This is a non-functional change. Link: https://lore.kernel.org/linux-trace-kernel/20240517134008.24529-2-petr.pavlu@suse.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by:
Petr Pavlu <petr.pavlu@suse.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- May 18, 2024
-
-
Linus Torvalds authored
Commit 1a7d0890 ("kprobe/ftrace: bail out if ftrace was killed") introduced a bad K&R function definition, which we haven't accepted in a long long time. Gcc seems to let it slide, but clang notices with the appropriate error: kernel/kprobes.c:1140:24: error: a function declaration without a prototype is deprecated in all > 1140 | void kprobe_ftrace_kill() | ^ | void but this commit was apparently never in linux-next before it was sent upstream, so it didn't get the appropriate build test coverage. Fixes: 1a7d0890 kprobe/ftrace: bail out if ftrace was killed Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- May 17, 2024
-
-
Cheng Yu authored
In the cgroup v2 CPU subsystem, assuming we have a cgroup named 'test', and we set cpu.max and cpu.max.burst: # echo 1000000 > /sys/fs/cgroup/test/cpu.max # echo 1000000 > /sys/fs/cgroup/test/cpu.max.burst then we check cpu.max and cpu.max.burst: # cat /sys/fs/cgroup/test/cpu.max 1000000 100000 # cat /sys/fs/cgroup/test/cpu.max.burst 1000000 Next we set cpu.max again and check cpu.max and cpu.max.burst: # echo 2000000 > /sys/fs/cgroup/test/cpu.max # cat /sys/fs/cgroup/test/cpu.max 2000000 100000 # cat /sys/fs/cgroup/test/cpu.max.burst 1000 ... we find that the cpu.max.burst value changed unexpectedly. In cpu_max_write(), the unit of the burst value returned by tg_get_cfs_burst() is microseconds, while in cpu_max_write(), the burst unit used for calculation should be nanoseconds, which leads to the bug. To fix it, get the burst value directly from tg->cfs_bandwidth.burst. Fixes: f4183717 ("sched/fair: Introduce the burstable CFS controller") Reported-by:
Qixin Liao <liaoqixin@huawei.com> Signed-off-by:
Cheng Yu <serein.chengyu@huawei.com> Signed-off-by:
Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Tested-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240424132438.514720-1-serein.chengyu@huawei.com
-
Christian Loehle authored
On 05/03/2024 15:05, Vincent Guittot wrote: I'm fine with either and that was my first thought here, too, but it did seem like the comment was mostly placed there to justify the 'unexpected' high utilization when explicitly passing FREQUENCY_UTIL and the need to clamp it then. So removing did feel slightly more natural to me anyway. So alternatively: From: Christian Loehle <christian.loehle@arm.com> Date: Tue, 5 Mar 2024 09:34:41 +0000 Subject: [PATCH] sched/fair: Remove stale FREQUENCY_UTIL mention effective_cpu_util() flags were removed, so remove mentioning of the flag. commit 9c0b4bb7 ("sched/cpufreq: Rework schedutil governor performance estimation") reworked effective_cpu_util() removing enum cpu_util_type. Modify the comment accordingly. Signed-off-by:
Christian Loehle <christian.loehle@arm.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/0e2833ee-0939-44e0-82a2-520a585a0153@arm.com
-
Dawei Li authored
Change se->load.weight to se_weight(se) in the calculation for the initial util_avg to avoid unnecessarily inflating the util_avg by 1024 times. The reason is that se->load.weight has the unit/scale as the scaled-up load, while cfs_rg->avg.load_avg has the unit/scale as the true task weight (as mapped directly from the task's nice/priority value). With CONFIG_32BIT, the scaled-up load is equal to the true task weight. With CONFIG_64BIT, the scaled-up load is 1024 times the true task weight. Thus, the current code may inflate the util_avg by 1024 times. The follow-up capping will not allow the util_avg value to go wild. But the calculation should have the correct logic. Signed-off-by:
Dawei Li <daweilics@gmail.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by:
Vishal Chourasia <vishalc@linux.ibm.com> Link: https://lore.kernel.org/r/20240315015916.21545-1-daweilics@gmail.com
-
Vitalii Bursov authored
Knowing domain's level exactly can be useful when setting relax_domain_level or cpuset.sched_relax_domain_level Usage: cat /debug/sched/domains/cpu0/domain1/level to dump cpu0 domain1's level. SDM macro is not used because sd->level is 'int' and it would hide the type mismatch between 'int' and 'u32'. Signed-off-by:
Vitalii Bursov <vitaly@bursov.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Tested-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by:
Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by:
Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/9489b6475f6dd6fbc67c617752d4216fa094da53.1714488502.git.vitaly@bursov.com
-
Vitalii Bursov authored
Change relax_domain_level checks so that it would be possible to include or exclude all domains from newidle balancing. This matches the behavior described in the documentation: -1 no request. use system default or follow request of others. 0 no search. 1 search siblings (hyperthreads in a core). "2" enables levels 0 and 1, level_max excludes the last (level_max) level, and level_max+1 includes all levels. Fixes: 1d3504fc ("sched, cpuset: customize sched domains, core") Signed-off-by:
Vitalii Bursov <vitaly@bursov.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Tested-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by:
Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
-
- May 15, 2024
-
-
Stephen Brennan authored
If an error happens in ftrace, ftrace_kill() will prevent disarming kprobes. Eventually, the ftrace_ops associated with the kprobes will be freed, yet the kprobes will still be active, and when triggered, they will use the freed memory, likely resulting in a page fault and panic. This behavior can be reproduced quite easily, by creating a kprobe and then triggering a ftrace_kill(). For simplicity, we can simulate an ftrace error with a kernel module like [1]: [1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer sudo perf probe --add commit_creds sudo perf trace -e probe:commit_creds # In another terminal make sudo insmod ftrace_killer.ko # calls ftrace_kill(), simulating bug # Back to perf terminal # ctrl-c sudo perf probe --del commit_creds After a short period, a page fault and panic would occur as the kprobe continues to execute and uses the freed ftrace_ops. While ftrace_kill() is supposed to be used only in extreme circumstances, it is invoked in FTRACE_WARN_ON() and so there are many places where an unexpected bug could be triggered, yet the system may continue operating, possibly without the administrator noticing. If ftrace_kill() does not panic the system, then we should do everything we can to continue operating, rather than leave a ticking time bomb. Link: https://lore.kernel.org/all/20240501162956.229427-1-stephen.s.brennan@oracle.com/ Signed-off-by:
Stephen Brennan <stephen.s.brennan@oracle.com> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by:
Guo Ren <guoren@kernel.org> Reviewed-by:
Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org>
-
Andrii Nakryiko authored
ARRAY_OF_MAPS and HASH_OF_MAPS map types have special logic to save a few extra fields required for correct operations of ARRAY maps, when they are used as inner maps. PERCPU_ARRAY maps have similar requirements as they now support generating inline element lookup logic. So make sure that both classes of maps are handled correctly. Reported-by:
Jakub Kicinski <kuba@kernel.org> Fixes: db69718b ("bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps") Signed-off-by:
Andrii Nakryiko <andrii@kernel.org> Acked-by:
Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20240515062440.846086-1-andrii@kernel.org Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Steven Rostedt (Google) authored
The sub-buffer pages are held in an unsigned long array, and when it is passed to virt_to_page() a cast is needed. Link: https://lore.kernel.org/all/20240515124808.06279d04@canb.auug.org.au/ Link: https://lore.kernel.org/linux-trace-kernel/20240515010558.4abaefdd@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 117c3920 ("ring-buffer: Introducing ring-buffer mapping functions") Reported-by:
Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- May 14, 2024
-
-
Jesper Dangaard Brouer authored
This closely resembles helpers added for the global cgroup_rstat_lock in commit fc29e04a ("cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints"). This is for the per CPU lock cgroup_rstat_cpu_lock. Based on production workloads, we observe the fast-path "update" function cgroup_rstat_updated() is invoked around 3 million times per sec, while the "flush" function cgroup_rstat_flush_locked(), walking each possible CPU, can see periodic spikes of 700 invocations/sec. For this reason, the tracepoints are split into normal and fastpath versions for this per-CPU lock. Making it feasible for production to continuously monitor the non-fastpath tracepoint to detect lock contention issues. The reason for monitoring is that lock disables IRQs which can disturb e.g. softirq processing on the local CPUs involved. When the global cgroup_rstat_lock stops disabling IRQs (e.g converted to a mutex), this per CPU lock becomes the next bottleneck that can introduce latency variations. A practical bpftrace script for monitoring contention latency: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_cpu_lock_contended { @start[tid]=nsecs; @cnt[probe]=count()} tracepoint:cgroup:cgroup_rstat_cpu_locked { if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);} @cnt[probe]=count()} interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt); clear(@cnt);}' Signed-off-by:
Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by:
Tejun Heo <tj@kernel.org>
-
Zheng Yejian authored
KASAN reports a bug: BUG: KASAN: use-after-free in ftrace_location+0x90/0x120 Read of size 8 at addr ffff888141d40010 by task insmod/424 CPU: 8 PID: 424 Comm: insmod Tainted: G W 6.9.0-rc2+ [...] Call Trace: <TASK> dump_stack_lvl+0x68/0xa0 print_report+0xcf/0x610 kasan_report+0xb5/0xe0 ftrace_location+0x90/0x120 register_kprobe+0x14b/0xa40 kprobe_init+0x2d/0xff0 [kprobe_example] do_one_initcall+0x8f/0x2d0 do_init_module+0x13a/0x3c0 load_module+0x3082/0x33d0 init_module_from_file+0xd2/0x130 __x64_sys_finit_module+0x306/0x440 do_syscall_64+0x68/0x140 entry_SYSCALL_64_after_hwframe+0x71/0x79 The root cause is that, in lookup_rec(), ftrace record of some address is being searched in ftrace pages of some module, but those ftrace pages at the same time is being freed in ftrace_release_mod() as the corresponding module is being deleted: CPU1 | CPU2 register_kprobes() { | delete_module() { check_kprobe_address_safe() { | arch_check_ftrace_location() { | ftrace_location() { | lookup_rec() // USE! | ftrace_release_mod() // Free! To fix this issue: 1. Hold rcu lock as accessing ftrace pages in ftrace_location_range(); 2. Use ftrace_location_range() instead of lookup_rec() in ftrace_location(); 3. Call synchronize_rcu() before freeing any ftrace pages both in ftrace_process_locs()/ftrace_release_mod()/ftrace_free_mem(). Link: https://lore.kernel.org/linux-trace-kernel/20240509192859.1273558-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Cc: <mhiramat@kernel.org> Cc: <mark.rutland@arm.com> Cc: <mathieu.desnoyers@efficios.com> Fixes: ae6aa16f ("kprobes: introduce ftrace based optimization") Suggested-by:
Steven Rostedt <rostedt@goodmis.org> Signed-off-by:
Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Mike Rapoport (IBM) authored
BPF just-in-time compiler depended on CONFIG_MODULES because it used module_alloc() to allocate memory for the generated code. Since code allocations are now implemented with execmem, drop dependency of CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM. Suggested-by:
Björn Töpel <bjorn@kernel.org> Signed-off-by:
Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by:
Luis Chamberlain <mcgrof@kernel.org>
-
Mike Rapoport (IBM) authored
kprobes depended on CONFIG_MODULES because it has to allocate memory for code. Since code allocations are now implemented with execmem, kprobes can be enabled in non-modular kernels. Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the dependency of CONFIG_KPROBES on CONFIG_MODULES. Signed-off-by:
Mike Rapoport (IBM) <rppt@kernel.org> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> [mcgrof: rebase in light of NEED_TASKS_RCU ] Signed-off-by:
Luis Chamberlain <mcgrof@kernel.org>
-
Mike Rapoport (IBM) authored
Extend execmem parameters to accommodate more complex overrides of module_alloc() by architectures. This includes specification of a fallback range required by arm, arm64 and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for allocation of KASAN shadow required by s390 and x86 and support for late initialization of execmem required by arm64. The core implementation of execmem_alloc() takes care of suppressing warnings when the initial allocation fails but there is a fallback range defined. Signed-off-by:
Mike Rapoport (IBM) <rppt@kernel.org> Acked-by:
Will Deacon <will@kernel.org> Acked-by:
Song Liu <song@kernel.org> Tested-by:
Liviu Dudau <liviu@dudau.co.uk> Signed-off-by:
Luis Chamberlain <mcgrof@kernel.org>
-
Mike Rapoport (IBM) authored
module_alloc() is used everywhere as a mean to allocate memory for code. Beside being semantically wrong, this unnecessarily ties all subsystems that need to allocate code, such as ftrace, kprobes and BPF to modules and puts the burden of code allocation to the modules code. Several architectures override module_alloc() because of various constraints where the executable memory can be located and this causes additional obstacles for improvements of code allocation. Start splitting code allocation from modules by introducing execmem_alloc() and execmem_free() APIs. Initially, execmem_alloc() is a wrapper for module_alloc() and execmem_free() is a replacement of module_memfree() to allow updating all call sites to use the new APIs. Since architectures define different restrictions on placement, permissions, alignment and other parameters for memory that can be used by different subsystems that allocate executable memory, execmem_alloc() takes a type argument, that will be used to identify the calling subsystem and to allow architectures define parameters for ranges suitable for that subsystem. No functional changes. Signed-off-by:
Mike Rapoport (IBM) <rppt@kernel.org> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by:
Song Liu <song@kernel.org> Acked-by:
Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by:
Luis Chamberlain <mcgrof@kernel.org>
-
Mike Rapoport (IBM) authored
Move the logic related to the memory allocation and freeing into module_memory_alloc() and module_memory_free(). Signed-off-by:
Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by:
Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by:
Song Liu <song@kernel.org> Signed-off-by:
Luis Chamberlain <mcgrof@kernel.org>
-
Justin Stitt authored
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. The goal is to remove its use completely [2]. namebuf is eventually cleaned of any trailing llvm suffixes using strstr(). This hints that namebuf should be NUL-terminated. static void cleanup_symbol_name(char *s) { char *res; ... res = strstr(s, ".llvm."); ... } Due to this, use strscpy() over strncpy() as it guarantees NUL-termination on the destination buffer. Drop the -1 from the length calculation as it is no longer needed to ensure NUL-termination. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html Link: https://github.com/KSPP/linux/issues/90 [2] Cc: linux-hardening@vger.kernel.org Signed-off-by:
Justin Stitt <justinstitt@google.com> Signed-off-by:
Luis Chamberlain <mcgrof@kernel.org>
-
Yifan Hong authored
If UNUSED_KSYMS_WHITELIST is a file generated before Kbuild runs, and the source tree is in a read-only filesystem, the developer must put the file somewhere and specify an absolute path to UNUSED_KSYMS_WHITELIST. This worked, but if IKCONFIG=y, an absolute path is embedded into .config and eventually into vmlinux, causing the build to be less reproducible when building on a different machine. This patch makes the handling of UNUSED_KSYMS_WHITELIST to be similar to MODULE_SIG_KEY. First, check if UNUSED_KSYMS_WHITELIST is an absolute path, just as before this patch. If so, use the path as is. If it is a relative path, use wildcard to check the existence of the file below objtree first. If it does not exist, fall back to the original behavior of adding $(srctree)/ before the value. After this patch, the developer can put the generated file in objtree, then use a relative path against objtree in .config, eradicating any absolute paths that may be evaluated differently on different machines. Signed-off-by:
Yifan Hong <elsk@google.com> Reviewed-by:
Elliot Berman <quic_eberman@quicinc.com> Signed-off-by:
Luis Chamberlain <mcgrof@kernel.org>
-
Dr. David Alan Gilbert authored
Commit 8788ca16 ("ftrace: Remove the legacy _ftrace_direct API") stopped setting the 'ftrace_direct_func_count' variable, but left it around. Clean it up. Link: https://lore.kernel.org/linux-trace-kernel/20240506233305.215735-1-linux@treblig.org Signed-off-by:
Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Dr. David Alan Gilbert authored
Commit 8788ca16 ("ftrace: Remove the legacy _ftrace_direct API") stopped using 'ftrace_direct_funcs' (and the associated struct ftrace_direct_func). Remove them. Build tested only (on x86-64 with FTRACE and DYNAMIC_FTRACE enabled) Link: https://lore.kernel.org/linux-trace-kernel/20240504132303.67538-1-linux@treblig.org Signed-off-by:
Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Thorsten Blum authored
Partially revert commit d6cb38e1 ("tracing: Use div64_u64() instead of do_div()") and use do_div() again to utilize its faster 64-by-32 division compared to the 64-by-64 division done by div64_u64(). Explicitly cast the divisor bm_cnt to u32 to prevent a Coccinelle warning reported by do_div.cocci. The warning was removed with commit d6cb38e1 ("tracing: Use div64_u64() instead of do_div()"). Using the faster 64-by-32 division and casting bm_cnt to u32 is safe because we return early from trace_do_benchmark() if bm_cnt > UINT_MAX. This approach is already used twice in trace_do_benchmark() when calculating the standard deviation: do_div(stddev, (u32)bm_cnt); do_div(stddev, (u32)bm_cnt - 1); Link: https://lore.kernel.org/linux-trace-kernel/20240329160229.4874-2-thorsten.blum@toblux.com Signed-off-by:
Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- May 13, 2024
-
-
Steven Rostedt (Google) authored
While testing libtracefs on the mmapped ring buffer, the test that checks if missed events are accounted for failed when using the mapped buffer. This is because the mapped page does not update the missed events that were dropped because the writer filled up the ring buffer before the reader could catch it. Add the missed events to the reader page/sub-buffer when the IOCTL is done and a new reader page is acquired. Note that all accesses to the reader_page via rb_page_commit() had to be switched to rb_page_size(), and rb_page_size() which was just a copy of rb_page_commit() but now it masks out the RB_MISSED bits. This is needed as the mapped reader page is still active in the ring buffer code and where it reads the commit field of the bpage for the size, it now must mask it otherwise the missed bits that are now set will corrupt the size returned. Link: https://lore.kernel.org/linux-trace-kernel/20240312175405.12fb6726@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Paul E. McKenney authored
When running heavy test workloads with KASAN enabled, RCU Tasks grace periods can extend for many tens of seconds, significantly slowing trace registration. Therefore, make the registration-side RCU Tasks grace period be asynchronous via call_rcu_tasks(). Link: https://lore.kernel.org/linux-trace-kernel/ac05be77-2972-475b-9b57-56bef15aa00a@paulmck-laptop Reported-by:
Jakub Kicinski <kuba@kernel.org> Reported-by:
Alexei Starovoitov <ast@kernel.org> Reported-by:
Chris Mason <clm@fb.com> Reviewed-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Paul E. McKenney <paulmck@kernel.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Yuran Pereira authored
The function simple_strtoul performs no error checking in scenarios where the input value overflows the intended output variable. This results in this function successfully returning, even when the output does not match the input string (aka the function returns successfully even when the result is wrong). Or as it was mentioned [1], "...simple_strtol(), simple_strtoll(), simple_strtoul(), and simple_strtoull() functions explicitly ignore overflows, which may lead to unexpected results in callers." Hence, the use of those functions is discouraged. This patch replaces all uses of the simple_strtoul with the safer alternatives kstrtoul and kstruint. Callers affected: - add_rec_by_index - set_graph_max_depth_function Side effects of this patch: - Since `fgraph_max_depth` is an `unsigned int`, this patch uses kstrtouint instead of kstrtoul to avoid any compiler warnings that could originate from calling the latter. - This patch ensures that the callers of kstrtou* return accordingly when kstrtoul and kstruint fail for some reason. In this case, both callers this patch is addressing return 0 on error. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#simple-strtol-simple-strtoll-simple-strtoul-simple-strtoull Link: https://lore.kernel.org/linux-trace-kernel/GV1PR10MB656333529A8D7B8AFB28D238E8B4A@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM Signed-off-by:
Yuran Pereira <yuran.pereira@hotmail.com> Reviewed-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Vincent Donnefort authored
Currently, user-space extracts data from the ring-buffer via splice, which is handy for storage or network sharing. However, due to splice limitations, it is imposible to do real-time analysis without a copy. A solution for that problem is to let the user-space map the ring-buffer directly. The mapping is exposed via the per-CPU file trace_pipe_raw. The first element of the mapping is the meta-page. It is followed by each subbuffer constituting the ring-buffer, ordered by their unique page ID: * Meta-page -- include/uapi/linux/trace_mmap.h for a description * Subbuf ID 0 * Subbuf ID 1 ... It is therefore easy to translate a subbuf ID into an offset in the mapping: reader_id = meta->reader->id; reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; When new data is available, the mapper must call a newly introduced ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to point to the next reader containing unread data. Mapping will prevent snapshot and buffer size modifications. Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-4-vdonnefort@google.com CC: <linux-mm@kvack.org> Signed-off-by:
Vincent Donnefort <vdonnefort@google.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Vincent Donnefort authored
In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-3-vdonnefort@google.com CC: <linux-mm@kvack.org> Signed-off-by:
Vincent Donnefort <vdonnefort@google.com> Acked-by:
David Hildenbrand <david@redhat.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Vincent Donnefort authored
In preparation for the ring-buffer memory mapping, allocate compound pages for the ring-buffer sub-buffers to enable us to map them to user-space with vm_insert_pages(). Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-2-vdonnefort@google.com Acked-by:
David Hildenbrand <david@redhat.com> Signed-off-by:
Vincent Donnefort <vdonnefort@google.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Beau Belgrave authored
When the ABI was updated to prevent same name w/different args, it missed an important corner case when fields don't end with a space. Typically, space is used for fields to help separate them, like "u8 field1; u8 field2". If no spaces are used, like "u8 field1;u8 field2", then the parsing works for the first time. However, the match check fails on a subsequent register, leading to confusion. This is because the match check uses argv_split() and assumes that all fields will be split upon the space. When spaces are used, we get back { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }. This causes a mismatch, and the user program gets back -EADDRINUSE. Add a method to detect this case before calling argv_split(). If found force a space after the field separator character ';'. This ensures all cases work properly for matching. With this fix, the following are all treated as matching: u8 field1;u8 field2 u8 field1; u8 field2 u8 field1;\tu8 field2 u8 field1;\nu8 field2 Link: https://lore.kernel.org/linux-trace-kernel/20240423162338.292-2-beaub@linux.microsoft.com Fixes: ba470eeb ("tracing/user_events: Prevent same name but different args event") Signed-off-by:
Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- May 12, 2024
-
-
Puranjay Mohan authored
Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit. RISCV saves the pointer to the CPU's task_struct in the TP (thread pointer) register. This makes it trivial to get the CPU's processor id. As thread_info is the first member of task_struct, we can read the processor id from TP + offsetof(struct thread_info, cpu). RISCV64 JIT output for `call bpf_get_smp_processor_id` ====================================================== Before After -------- ------- auipc t1,0x848c ld a5,32(tp) jalr 604(t1) mv a5,a0 Benchmark using [1] on Qemu. ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+------------------+------------------+--------------+ | Name | Before | After | % change | |---------------+------------------+------------------+--------------| | glob-arr-inc | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s | + 24.04% | | arr-inc | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s | + 23.56% | | hash-inc | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s | + 32.18% | +---------------+------------------+------------------+--------------+ NOTE: This benchmark includes changes from this patch and the previous patch that implemented the per-cpu insn. [1] https://github.com/anakryiko/linux/commit/8dec900975ef Signed-off-by:
Puranjay Mohan <puranjay@kernel.org> Acked-by:
Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by:
Andrii Nakryiko <andrii@kernel.org> Acked-by:
Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/r/20240502151854.9810-3-puranjay@kernel.org Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
- May 09, 2024
-
-
Alexander Lobakin authored
There are several reports that the DMA sync shortcut broke non-coherent devices. dev->dma_need_sync is false after the &device allocation and if a driver didn't call dma_set_mask*(), it will still be false even if the device is not DMA-coherent and thus needs synchronizing. Due to historical reasons, there's still a lot of drivers not calling it. Invert the boolean, so that the sync will be performed by default and the shortcut will be enabled only when calling dma_set_mask*(). Reported-by:
Steven Price <steven.price@arm.com> Closes: https://lore.kernel.org/lkml/010686f5-3049-46a1-8230-7752a1b433ff@arm.com Reported-by:
Marek Szyprowski <m.szyprowski@samsung.com> Closes: https://lore.kernel.org/lkml/46160534-5003-4809-a408-6b3a3f4921e9@samsung.com Fixes: f406c8e4. ("dma: avoid redundant calls for sync operations") Signed-off-by:
Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Tested-by:
Steven Price <steven.price@arm.com> Tested-by:
Marek Szyprowski <m.szyprowski@samsung.com>
-