- Apr 10, 2020
-
Vasily Averin authored
If a seq_file .next function does not change the position index, a read after some lseek can generate unexpected output. https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by:
Vasily Averin <vvs@virtuozzo.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Peter Oberparleiter <oberpar@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: NeilBrown <neilb@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Waiman Long <longman@redhat.com> Link: http://lkml.kernel.org/r/f65c6ee7-bd00-f910-2f8a-37cc67e4ff88@virtuozzo.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
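A minimal sketch of the rule this fix enforces; the handler and lookup names here are hypothetical:

    /* A seq_file .next must advance *pos even when iteration ends,
     * otherwise a read after lseek() can replay or skip records. */
    static void *example_next(struct seq_file *s, void *v, loff_t *pos)
    {
            ++*pos;                          /* always move the index forward */
            return example_lookup(s, *pos);  /* hypothetical: NULL when done */
    }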
-
Eric Biggers authored
Patch series "module autoloading fixes and cleanups", v5. This series fixes a bug where request_module() was reporting success to kernel code when module autoloading had been completely disabled via 'echo > /proc/sys/kernel/modprobe'. It also addresses the issues raised on the original thread (https://lkml.kernel.org/lkml/20200310223731.126894-1-ebiggers@kernel.org/T/#u ) bydocumenting the modprobe sysctl, adding a self-test for the empty path case, and downgrading a user-reachable WARN_ONCE(). This patch (of 4): It's long been possible to disable kernel module autoloading completely (while still allowing manual module insertion) by setting /proc/sys/kernel/modprobe to the empty string. This can be preferable to setting it to a nonexistent file since it avoids the overhead of an attempted execve(), avoids potential deadlocks, and avoids the call to security_kernel_module_request() and thus on SELinux-based systems eliminates the need to write SELinux rules to dontaudit module_request. However, when module autoloading is disabled in this way, request_module() returns 0. This is broken because callers expect 0 to mean that the module was successfully loaded. Apparently this was never noticed because this method of disabling module autoloading isn't used much, and also most callers don't use the return value of request_module() since it's always necessary to check whether the module registered its functionality or not anyway. But improperly returning 0 can indeed confuse a few callers, for example get_fs_type() in fs/filesystems.c where it causes a WARNING to be hit: if (!fs && (request_module("fs-%.*s", len, name) == 0)) { fs = __get_fs_type(name, len); WARN_ONCE(!fs, "request_module fs-%.*s succeeded, but still no fs?\n", len, name); } This is easily reproduced with: echo > /proc/sys/kernel/modprobe mount -t NONEXISTENT none / It causes: request_module fs-NONEXISTENT succeeded, but still no fs? WARNING: CPU: 1 PID: 1106 at fs/filesystems.c:275 get_fs_type+0xd6/0xf0 [...] This should actually use pr_warn_once() rather than WARN_ONCE(), since it's also user-reachable if userspace immediately unloads the module. Regardless, request_module() should correctly return an error when it fails. So let's make it return -ENOENT, which matches the error when the modprobe binary doesn't exist. I've also sent patches to document and test this case. Signed-off-by:
Eric Biggers <ebiggers@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Kees Cook <keescook@chromium.org> Reviewed-by:
Jessica Yu <jeyu@kernel.org> Acked-by:
Luis Chamberlain <mcgrof@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Ben Hutchings <benh@debian.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200310223731.126894-1-ebiggers@kernel.org Link: http://lkml.kernel.org/r/20200312202552.241885-1-ebiggers@kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
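A sketch of the resulting behavior; the exact placement of the check inside kernel/kmod.c may differ:

    /* Autoloading disabled via 'echo > /proc/sys/kernel/modprobe':
     * fail with the same error as a missing modprobe binary. */
    if (!modprobe_path[0])
            return -ENOENT;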
-
Sergey Senozhatsky authored
printk_deferred(), similarly to printk_safe/printk_nmi, does not immediately attempt to print a new message on the consoles, avoiding calls into non-reentrant kernel paths, e.g. scheduler or timekeeping, which potentially can deadlock the system. Those printk() flavors, instead, rely on per-CPU flush irq_work to print messages from safer contexts. For the same reasons (recursive scheduler or timekeeping calls) printk() uses per-CPU irq_work in order to wake up user space syslog/kmsg readers. However, only printk_safe/printk_nmi make sure that per-CPU areas have been initialized and that it's safe to modify per-CPU irq_work. This means that, for instance, should printk_deferred() be invoked "too early", that is before per-CPU areas are initialized, printk_deferred() will perform an illegal per-CPU access. Lech Perczak [0] reports that after commit 1b710b1b ("char/random: silence a lockdep splat with printk()") user-space syslog/kmsg readers are not able to read new kernel messages. The reason is printk_deferred() being called too early (as was pointed out by Petr and John). Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU areas are initialized. Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/ Reported-by:
Lech Perczak <l.perczak@camlintechnologies.com> Signed-off-by:
Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Tested-by:
Jann Horn <jannh@google.com> Reviewed-by:
Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: John Ogness <john.ogness@linutronix.de> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
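A sketch of the added guard; the flag name is illustrative of the mechanism, not a promise of the exact identifier:

    static bool printk_percpu_data_ready;   /* set once per-CPU areas are set up */

    void defer_console_output(void)
    {
            if (!printk_percpu_data_ready)
                    return;         /* too early: per-CPU irq_work unusable */

            preempt_disable();
            __this_cpu_or(printk_pending, PRINTK_PENDING_OUTPUT);
            irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));
            preempt_enable();
    }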
-
- Apr 09, 2020
-
Eric W. Biederman authored
syzbot wrote:

    ========================================================
    WARNING: possible irq lock inversion dependency detected
    5.6.0-syzkaller #0 Not tainted
    --------------------------------------------------------
    swapper/1/0 just changed the state of lock:
    ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
    but this lock took another, SOFTIRQ-unsafe lock in the past:
     (&pid->wait_pidfd){+.+.}-{2:2}

    and interrupts could create inverse lock ordering between them.

    other info that might help us debug this:
     Possible interrupt unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&pid->wait_pidfd);
                                   local_irq_disable();
                                   lock(tasklist_lock);
                                   lock(&pid->wait_pidfd);
      <Interrupt>
        lock(tasklist_lock);

     *** DEADLOCK ***

    4 locks held by swapper/1/0:

The problem is that wait_pidfd.lock is taken under tasklist_lock. It must always be taken with irqs disabled, as tasklist_lock can be taken from interrupt context, and if wait_pidfd.lock was already taken this would create a lock order inversion. Oleg suggested just disabling irqs where I have added extra calls to wait_pidfd.lock. That should be safe, and I think the code will eventually do that. It was rightly pointed out by Christian that sharing the wait_pidfd.lock was a premature optimization. It is also true that my pre-merge window testing was insufficient. So remove the premature optimization and give struct pid a dedicated lock of its own for struct pid things. I have verified that lockdep sees all 3 paths where we take the new pid->lock and lockdep does not complain. It is my current daydream that one day pid->lock can be used to guard the task lists as well, and then the tasklist_lock won't need to be held to deliver signals. That will require taking pid->lock with irqs disabled. Acked-by:
Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/ Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Reported-by:
<syzbot+343f75cdeea091340956@syzkaller.appspotmail.com> Reported-by:
<syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com> Reported-by:
<syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com> Reported-by:
<syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com> Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc") Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com>
-
- Apr 08, 2020
-
Grygorii Strashko authored
The commit 2e05ea5c ("dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs") removed the "dma_debug_page" enum, but missed updating the type2name string table. This causes the DMA allocation type to be displayed incorrectly. Fix it by removing the "page" string from the type2name string table and switching to named initializers.

Before (dma_alloc_coherent()):

    k3-ringacc 4b800000.ringacc: scather-gather idx 2208 P=d1140000 N=d114 D=d1140000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable
    k3-ringacc 4b800000.ringacc: scather-gather idx 2216 P=d1150000 N=d115 D=d1150000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable

After:

    k3-ringacc 4b800000.ringacc: coherent idx 2208 P=d1140000 N=d114 D=d1140000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable
    k3-ringacc 4b800000.ringacc: coherent idx 2216 P=d1150000 N=d115 D=d1150000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable

Fixes: 2e05ea5c ("dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs") Signed-off-by:
Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
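A sketch of the table after the change, using named initializers so an enum/table mismatch can no longer go unnoticed (the historical "scather-gather" spelling is kept):

    static const char *type2name[] = {
            [dma_debug_single]   = "single",
            [dma_debug_sg]       = "scather-gather",
            [dma_debug_coherent] = "coherent",
            [dma_debug_resource] = "resource",
    };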
-
Kishon Vijay Abraham I authored
The upper 32-bit physical address gets truncated inadvertently when dma_direct_get_required_mask() invokes phys_to_dma_direct(). This results in dma_addressing_limited() returning an incorrect value when used on platforms with LPAE enabled. Fix it by explicitly casting 'max_pfn' to phys_addr_t in order to prevent overflow of the intermediate value while evaluating '(max_pfn - 1) << PAGE_SHIFT'. Signed-off-by:
Kishon Vijay Abraham I <kishon@ti.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
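The overflow in a nutshell: on 32-bit ARM with LPAE, max_pfn is an unsigned long, so the shift must be widened first. A sketch:

    /* (max_pfn - 1) << PAGE_SHIFT overflows unsigned long on LPAE;
     * widen to phys_addr_t before shifting. */
    phys_addr_t phys = (phys_addr_t)(max_pfn - 1) << PAGE_SHIFT;
    u64 max_dma = phys_to_dma_direct(dev, phys);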
-
Peter Zijlstra authored
The 'invalid wait context' splat doesn't print all the information required to reconstruct / validate the error, specifically the irq-context state is missing. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Qian Cai authored
The following commit: 7f26482a ("locking/percpu-rwsem: Remove the embedded rwsem") introduced task_struct memory leaks due to messing up the task_struct refcount. At the beginning of percpu_rwsem_wake_function(), it calls get_task_struct(), but if the trylock fails, the waiter remains in the waitqueue. percpu_rwsem_wake_function() will then run again, calling get_task_struct() each time to increase the refcount, while put_task_struct() is only called once, after the trylock finally succeeds. Fix it by adjusting percpu_rwsem_wake_function() a bit to guard against percpu_rwsem_wait() observing !private, terminating the wait and doing a quick exit(), which would turn the subsequent wake_up_process(p) in percpu_rwsem_wake_function() into a use-after-free. Fixes: 7f26482a ("locking/percpu-rwsem: Remove the embedded rwsem") Suggested-by:
Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Qian Cai <cai@lca.pw> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200330213002.2374-1-cai@lca.pw
-
Valentin Schneider authored
Requested and effective uclamp values can be a bit tricky to decipher when playing with cgroup hierarchies. Add them to a task's procfs when SCHED_DEBUG is enabled. Reviewed-by:
Qais Yousef <qais.yousef@arm.com> Signed-off-by:
Valentin Schneider <valentin.schneider@arm.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200226124543.31986-4-valentin.schneider@arm.com
-
Valentin Schneider authored
The printing macros in debug.c keep redefining the same output format. Collect each output format in a single definition, and reuse that definition in the other macros. While at it, add a layer of parentheses and replace printf's with the newly introduced macros. Reviewed-by:
Qais Yousef <qais.yousef@arm.com> Signed-off-by:
Valentin Schneider <valentin.schneider@arm.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200226124543.31986-3-valentin.schneider@arm.com
-
Valentin Schneider authored
Most printing macros for procfs are defined globally in debug.c, and they are re-defined (to the exact same thing) within proc_sched_show_task(). Get rid of the duplicate defines. Reviewed-by:
Qais Yousef <qais.yousef@arm.com> Signed-off-by:
Valentin Schneider <valentin.schneider@arm.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200226124543.31986-2-valentin.schneider@arm.com
-
Vincent Donnefort authored
The following commit: 5e83eafb ("sched/fair: Remove the rq->cpu_load[] update code") eliminated the last use case for rq->last_load_update_tick, so remove the field as well. Reviewed-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by:
Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/1584710495-308969-1-git-send-email-vincent.donnefort@arm.com
-
Sebastian Andrzej Siewior authored
The kernel test robot triggered a warning with the following race:

    task-ctx A                             interrupt-ctx B
    worker
      -> process_one_work()
        -> work_item()
          -> schedule();
            -> sched_submit_work()
              -> wq_worker_sleeping()
                -> ->sleeping = 1
                   atomic_dec_and_test(nr_running)
            __schedule();                  *interrupt*
                                           async_page_fault()
                                           -> local_irq_enable();
                                           -> schedule();
                                             -> sched_submit_work()
                                               -> wq_worker_sleeping()
                                                 -> if (WARN_ON(->sleeping)) return
                                             -> __schedule()
                                               -> sched_update_worker()
                                                 -> wq_worker_running()
                                                   -> atomic_inc(nr_running);
                                                   -> ->sleeping = 0;

      -> sched_update_worker()
        -> wq_worker_running()
             if (!->sleeping) return

In this context the warning is pointless; everything is fine. An interrupt before wq_worker_sleeping() will perform the ->sleeping assignment (0 -> 1 -> 0) twice. An interrupt after wq_worker_sleeping() will trigger the warning, and nr_running will be decremented (by A) and incremented once (only by B; A will skip it). This is the case until ->sleeping is zeroed again in wq_worker_running(). Remove the WARN statement because this condition may happen. Document that preemption around wq_worker_sleeping() needs to be disabled to protect ->sleeping, and not just as an optimisation. Fixes: 6d25be57 ("sched/core, workqueues: Distangle worker accounting from rq lock") Reported-by:
kernel test robot <lkp@intel.com> Signed-off-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian
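A sketch of the documented requirement at the scheduler call site; the surrounding function is abridged:

    static inline void sched_submit_work(struct task_struct *tsk)
    {
            if (!tsk->state)
                    return;

            /*
             * ->sleeping is only stable with preemption disabled;
             * otherwise an interrupt may schedule and run the
             * sleeping/running callbacks on its own.
             */
            if (tsk->flags & PF_WQ_WORKER) {
                    preempt_disable();
                    wq_worker_sleeping(tsk);
                    preempt_enable_no_resched();
            }
    }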
-
Aubrey Li authored
A negative imbalance value was observed after imbalance calculation; this happens when the local sched group type is group_fully_busy and the average load of the local group is greater than that of the selected busiest group. Fix this problem by comparing the average load of the local and busiest groups before the imbalance calculation formula. Suggested-by:
Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by:
Phil Auld <pauld@redhat.com> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Acked-by:
Mel Gorman <mgorman@suse.de> Signed-off-by:
Aubrey Li <aubrey.li@linux.intel.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/1585201349-70192-1-git-send-email-aubrey.li@intel.com
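A sketch of the added guard in calculate_imbalance(), assuming the sg_lb_stats bookkeeping the load balancer already uses:

    /*
     * If the local group is already more loaded than the selected
     * busiest group on average, pulling tasks would only make
     * things worse: declare balance.
     */
    if (local->avg_load >= busiest->avg_load) {
            env->imbalance = 0;
            return;
    }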
-
Huaixin Chang authored
Currently, there is a potential race between distribute_cfs_runtime() and assign_cfs_rq_runtime(). The race happens when cfs_b->runtime is read and distributed without holding the lock, and it then turns out that there is not enough runtime to charge against after distribution, because assign_cfs_rq_runtime() may be called during the distribution and consume cfs_b->runtime at the same time. Fibtest is the tool used to test this race. Assume all gcfs_rq are throttled and the cfs period timer runs: slow threads might run and sleep, returning unused cfs_rq runtime and keeping min_cfs_rq_runtime in their local pool. If all this happens sufficiently quickly, cfs_b->runtime will drop a lot. If the runtime distributed is large too, over-use of runtime happens. Runtime over-use of about 70 percent of quota is seen when we test fibtest on a 96-core machine. We run fibtest with 1 fast thread and 95 slow threads in the test group, configure a 10ms quota for this group, and see that the CPU usage of fibtest is 17.0%, which is far more than the expected 10%. On a smaller machine with 32 cores, we also run fibtest with 96 threads; CPU usage is more than 12%, which is also more than the expected 10%. This shows that on similar workloads this race does affect CPU bandwidth control. Solve this by holding the lock inside distribute_cfs_runtime(). Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop") Reviewed-by:
Ben Segall <bsegall@google.com> Signed-off-by:
Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/lkml/20200325092602.22471-1-changhuaixin@linux.alibaba.com/
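A sketch of the locking change; the point is that cfs_b->runtime is now consumed under the same lock that assign_cfs_rq_runtime() takes:

    static void distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
    {
            lockdep_assert_held(&cfs_b->lock);  /* caller holds cfs_b->lock now */

            /* walk the throttled list, charging cfs_b->runtime directly
             * instead of a stale snapshot read before the lock was taken */
    }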
-
Valentin Schneider authored
sched/core.c uses update_avg() for rq->avg_idle and sched/fair.c uses an open-coded version (with the exact same decay factor) for rq->avg_scan_cost. On top of that, select_idle_cpu() expects to be able to compare these two fields. The only difference between the two is that rq->avg_scan_cost is computed using a pure division rather than a shift. Turns out it actually matters, first of all because the shifted value can be negative, and the standard has this to say about it:

    """
    The result of E1 >> E2 is E1 right-shifted E2 bit positions. [...]
    If E1 has a signed type and a negative value, the resulting value is
    implementation-defined.
    """

Not only this, but (arithmetic) right shifting a negative value (using 2's complement) is *not* equivalent to dividing it by the corresponding power of 2. Let's look at a few examples:

    -4      -> 0xF..FC
    -4 >> 3 -> 0xF..FF == -1 != -4 / 8

    -8      -> 0xF..F8
    -8 >> 3 -> 0xF..FF == -1 == -8 / 8

    -9      -> 0xF..F7
    -9 >> 3 -> 0xF..FE == -2 != -9 / 8

Make update_avg() use a division, and export it to the private scheduler header to reuse it where relevant. Note that this still lets compilers use a shift here, but should prevent any unwanted surprise. The disassembly of select_idle_cpu() remains unchanged on arm64, and ttwu_do_wakeup() gains 2 instructions; the diff sort of looks like this:

    -       sub     x1, x1, x0
    +       subs    x1, x1, x0        // set condition codes
    +       add     x0, x1, #0x7
    +       csel    x0, x0, x1, mi    // x0 = x1 < 0 ? x0 : x1
            add     x0, x3, x0, asr #3

which does the right thing (i.e. gives us the expected result while still using an arithmetic shift). Signed-off-by:
Valentin Schneider <valentin.schneider@arm.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200330090127.16294-1-valentin.schneider@arm.com
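The helper itself, as exported to the private scheduler header; the compiler may still lower the division to a shift when it can prove the value non-negative:

    static inline void update_avg(u64 *avg, u64 sample)
    {
            s64 diff = sample - *avg;

            *avg += diff / 8;   /* well-defined for negative diff, unlike >> 3 */
    }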
-
Jiri Olsa authored
We hit the following warning when running tests on a kernel compiled with CONFIG_DEBUG_ATOMIC_SLEEP=y:

    WARNING: CPU: 19 PID: 4472 at mm/gup.c:2381 __get_user_pages_fast+0x1a4/0x200
    CPU: 19 PID: 4472 Comm: dummy Not tainted 5.6.0-rc6+ #3
    RIP: 0010:__get_user_pages_fast+0x1a4/0x200
    ...
    Call Trace:
     perf_prepare_sample+0xff1/0x1d90
     perf_event_output_forward+0xe8/0x210
     __perf_event_overflow+0x11a/0x310
     __intel_pmu_pebs_event+0x657/0x850
     intel_pmu_drain_pebs_nhm+0x7de/0x11d0
     handle_pmi_common+0x1b2/0x650
     intel_pmu_handle_irq+0x17b/0x370
     perf_event_nmi_handler+0x40/0x60
     nmi_handle+0x192/0x590
     default_do_nmi+0x6d/0x150
     do_nmi+0x2f9/0x3c0
     nmi+0x8e/0xd7

While __get_user_pages_fast() is IRQ-safe, it calls access_ok(), which warns on:

    WARN_ON_ONCE(!in_task() && !pagefault_disabled())

Peter suggested disabling page faults around __get_user_pages_fast(), which gets rid of the warning in the access_ok() call. Suggested-by:
Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Jiri Olsa <jolsa@kernel.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200407141427.3184722-1-jolsa@kernel.org
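A sketch of the fix in perf_virt_to_phys(), bracketing the fast GUP call so access_ok() sees pagefaults disabled:

    pagefault_disable();
    if (__get_user_pages_fast(virt, 1, 0, &p) == 1)
            phys_addr = page_to_phys(p) + virt % PAGE_SIZE;
    pagefault_enable();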
-
Ian Rogers authored
The void * passed to perf_less_group_idx() points to a member of the heap array, which in turn points at a perf_event; as such it is a perf_event **. Reported-By:
John Sperbeck <jsperbeck@google.com> Fixes: 6eef8a71 ("perf/core: Use min_heap in visit_groups_merge()") Signed-off-by:
Ian Rogers <irogers@google.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200321164331.107337-1-irogers@google.com
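The corrected comparator, dereferencing the double pointer before looking at group_index:

    static bool perf_less_group_idx(const void *l, const void *r)
    {
            const struct perf_event *le = *(const struct perf_event **)l;
            const struct perf_event *re = *(const struct perf_event **)r;

            return le->group_index < re->group_index;
    }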
-
Peter Zijlstra authored
Song reports that installing cgroup events is broken since: db0503e4 ("perf/core: Optimize perf_install_in_event()") The problem being that cgroup events try to track cpuctx->cgrp even for disabled events, which is pointless and actively harmful since the above commit. Rework the code to have explicit enable/disable hooks for cgroup events, such that we can limit cgroup tracking to active events. More specifically, since the above commit disabled events are no longer added to their context from the 'right' CPU, and we can't access things like the current cgroup for a remote CPU. Cc: <stable@vger.kernel.org> # v5.5+ Fixes: db0503e4 ("perf/core: Optimize perf_install_in_event()") Reported-by:
Song Liu <songliubraving@fb.com> Tested-by:
Song Liu <songliubraving@fb.com> Reviewed-by:
Song Liu <songliubraving@fb.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200318193337.GB20760@hirez.programming.kicks-ass.net
-
- Apr 07, 2020
-
Jan Kara authored
Commit 769071ac "ns: Introduce Time Namespace" broke reporting of inotify ucounts (max_inotify_instances, max_inotify_watches) in /proc/sys/user because it added UCOUNT_TIME_NAMESPACES to enum ucount_type but didn't update the reporting in kernel/ucount.c:setup_userns_sysctls(). This problem was fixed in commit eeec26d5 "time/namespace: Add max_time_namespaces ucount". Add a BUILD_BUG_ON to catch similar problems in the future. Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Acked-by:
Andrei Vagin <avagin@gmail.com> Link: https://lkml.kernel.org/r/20200407154643.10102-1-jack@suse.cz
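A sketch of the added compile-time check; user_table[] carries one extra terminating entry, hence the +1:

    bool setup_userns_sysctls(struct user_namespace *ns)
    {
            /* every enum ucount_type member needs a user_table[] row */
            BUILD_BUG_ON(ARRAY_SIZE(user_table) != UCOUNT_COUNTS + 1);
            /* ... */
    }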
-
Gustavo A. R. Silva authored
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kinds of undefined behavior bugs from being inadvertently introduced[3] into the codebase from now on. Also, notice that dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by:
Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Cc: Peter Oberparleiter <oberpar@linux.ibm.com> Link: http://lkml.kernel.org/r/20200302224851.GA26467@embeddedor Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
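The change pattern in brief, using the commit's own example structure:

    /* before: C90 zero-length array extension */
    struct foo {
            int stuff;
            struct boo array[0];
    };

    /* after: C99 flexible array member */
    struct foo {
            int stuff;
            struct boo array[];
    };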
-
Gustavo A. R. Silva authored
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kinds of undefined behavior bugs from being inadvertently introduced[3] into the codebase from now on. Also, notice that dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by:
Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Cc: Peter Oberparleiter <oberpar@linux.ibm.com> Link: http://lkml.kernel.org/r/20200302224501.GA14175@embeddedor Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Gustavo A. R. Silva authored
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kinds of undefined behavior bugs from being inadvertently introduced[3] into the codebase from now on. Also, notice that dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by:
Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Peter Oberparleiter <oberpar@linux.ibm.com> Link: http://lkml.kernel.org/r/20200213152241.GA877@embeddedor Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Qiujun Huang authored
There is a typo in a comment. Fix it. s/assuems/assumes/ Signed-off-by:
Qiujun Huang <hqjagain@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Luis Chamberlain <mcgrof@kernel.org> Link: http://lkml.kernel.org/r/1585891029-6450-1-git-send-email-hqjagain@gmail.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Will Deacon authored
kallsyms_lookup_name() and kallsyms_on_each_symbol() are exported to modules despite having no in-tree users and being wide open to abuse by out-of-tree modules that can use them as a method to invoke arbitrary non-exported kernel functions. Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol(). Signed-off-by:
Will Deacon <will@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by:
Quentin Perret <qperret@google.com> Acked-by:
Alexei Starovoitov <ast@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Petr Mladek <pmladek@suse.com> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: http://lkml.kernel.org/r/20200221114404.14641-4-will@kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Masahiro Yamada authored
Commit ac7c3e4f ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly") made this an always-on option. We released v5.4 and v5.5 including that commit. Remove the CONFIG option and clean up the code now. Signed-off-by:
Masahiro Yamada <masahiroy@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Reviewed-by:
Nathan Chancellor <natechancellor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: David Miller <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200220110807.32534-2-masahiroy@kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Nathan Chancellor authored
Clang warns:

    ../kernel/extable.c:37:52: warning: array comparison always evaluates to a constant [-Wtautological-compare]
            if (main_extable_sort_needed && __stop___ex_table > __start___ex_table) {
                                                              ^
    1 warning generated.

These are not true arrays, they are linker-defined symbols, which are just addresses. Using the address-of operator silences the warning and does not change the resulting assembly with either clang/ld.lld or gcc/ld (tested with diff + objdump -Dr). Suggested-by:
Nick Desaulniers <ndesaulniers@google.com> Signed-off-by:
Nathan Chancellor <natechancellor@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Andrew Morton <akpm@linux-foundation.org> Link: https://github.com/ClangBuiltLinux/linux/issues/892 Link: http://lkml.kernel.org/r/20200219202036.45702-1-natechancellor@gmail.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
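The silenced comparison; the symbols are declared as arrays, but only their addresses are meaningful:

    extern struct exception_table_entry __start___ex_table[];
    extern struct exception_table_entry __stop___ex_table[];

    /* compare symbol addresses explicitly: no array contents involved */
    if (main_extable_sort_needed && &__stop___ex_table > &__start___ex_table) {
            pr_notice("Sorting __ex_table...\n");
            sort_extable(__start___ex_table, __stop___ex_table);
    }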
-
Alexey Dobriyan authored
Now that "struct proc_ops" exist we can start putting there stuff which could not fly with VFS "struct file_operations"... Most of fs/proc/inode.c file is dedicated to make open/read/.../close reliable in the event of disappearing /proc entries which usually happens if module is getting removed. Files like /proc/cpuinfo which never disappear simply do not need such protection. Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such "permanent" files. Enable "permanent" flag for /proc/cpuinfo /proc/kmsg /proc/modules /proc/slabinfo /proc/stat /proc/sysvipc/* /proc/swaps More will come once I figure out foolproof way to prevent out module authors from marking their stuff "permanent" for performance reasons when it is not. This should help with scalability: benchmark is "read /proc/cpuinfo R times by N threads scattered over the system". N R t, s (before) t, s (after) ----------------------------------------------------- 64 4096 1.582458 1.530502 -3.2% 256 4096 6.371926 6.125168 -3.9% 1024 4096 25.64888 24.47528 -4.6% Benchmark source: #include <chrono> #include <iostream> #include <thread> #include <vector> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN); int N; const char *filename; int R; int xxx = 0; int glue(int n) { cpu_set_t m; CPU_ZERO(&m); CPU_SET(n, &m); return sched_setaffinity(0, sizeof(cpu_set_t), &m); } void f(int n) { glue(n % NR_CPUS); while (*(volatile int *)&xxx == 0) { } for (int i = 0; i < R; i++) { int fd = open(filename, O_RDONLY); char buf[4096]; ssize_t rv = read(fd, buf, sizeof(buf)); asm volatile ("" :: "g" (rv)); close(fd); } } int main(int argc, char *argv[]) { if (argc < 4) { std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R "; return 1; } N = atoi(argv[1]); filename = argv[2]; R = atoi(argv[3]); for (int i = 0; i < NR_CPUS; i++) { if (glue(i) == 0) break; } std::vector<std::thread> T; T.reserve(N); for (int i = 0; i < N; i++) { T.emplace_back(f, i); } auto t0 = std::chrono::system_clock::now(); { *(volatile int *)&xxx = 1; for (auto& t: T) { t.join(); } } auto t1 = std::chrono::system_clock::now(); std::chrono::duration<double> dt = t1 - t0; std::cout << dt.count() << ' '; return 0; } P.S.: Explicit randomization marker is added because adding non-function pointer will silently disable structure layout randomization. [akpm@linux-foundation.org: coding style fixes] Reported-by:
kbuild test robot <lkp@intel.com> Reported-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2 Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
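A sketch of opting a file in, assuming the proc_flags field this series adds to struct proc_ops:

    static const struct proc_ops cpuinfo_proc_ops = {
            .proc_flags   = PROC_ENTRY_PERMANENT, /* /proc/cpuinfo never goes away */
            .proc_open    = cpuinfo_open,
            .proc_read    = seq_read,
            .proc_lseek   = seq_lseek,
            .proc_release = seq_release,
    };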
-
Anshuman Khandual authored
This replaces all remaining open encodings with is_vm_hugetlb_page(). Signed-off-by:
Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Will Deacon <will@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Paul Burton <paulburton@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Anshuman Khandual authored
Let's move the vma_is_accessible() helper to include/linux/mm.h, which makes it available for general use. While here, this replaces all remaining open encodings of the VMA access check with vma_is_accessible(). Signed-off-by:
Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Geert Uytterhoeven <geert@linux-m68k.org> Acked-by:
Guo Ren <guoren@kernel.org> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Guo Ren <guoren@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paulburton@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Nick Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
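The helper being moved; open-coded flag tests collapse into one readable call:

    static inline bool vma_is_accessible(struct vm_area_struct *vma)
    {
            return vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
    }

    /* callers then write */
    if (!vma_is_accessible(vma))
            return false;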
-
Li Xinhai authored
Set ->vm_next and ->vm_prev to NULL to prevent potential misuse of the newly duplicated vma. Currently, only the fork path misuses them when handling anon_vma; no other bugs have been revealed with this patch applied. Signed-off-by:
Li Xinhai <lixinhai.lxh@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/r/1581150928-3214-4-git-send-email-lixinhai.lxh@gmail.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
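A sketch of where the clearing lands, in vm_area_dup():

    struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
    {
            struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);

            if (new) {
                    *new = *orig;
                    INIT_LIST_HEAD(&new->anon_vma_chain);
                    /* don't let the copy reach the parent's neighbors */
                    new->vm_next = new->vm_prev = NULL;
            }
            return new;
    }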
-
Li Xinhai authored
Patch series "mm: Fix misuse of parent anon_vma in dup_mmap path". This patchset fixes the misuse of parenet anon_vma, which mainly caused by child vma's vm_next and vm_prev are left same as its parent after duplicate vma. Finally, code reached parent vma's neighbor by referring pointer of child vma and executed wrong logic. The first two patches fix relevant issues, and the third patch sets vm_next and vm_prev to NULL when duplicate vma to prevent potential misuse in future. Effects of the first bug is that causes rmap code to check both parent and child's page table, although a page couldn't be mapped by both parent and child, because child vma has WIPEONFORK so all pages mapped by child are 'new' and not relevant to parent. Effects of the second bug is that the relationship of anon_vma of parent and child are totallyconvoluted. It would cause 'son', 'grandson', ..., etc, to share 'parent' anon_vma, which disobey the design rule of reusing anon_vma (the rule to be followed is that reusing should among vma of same process, and vma should not gone through fork). So, both issues should cause unnecessary rmap walking and have unexpected complexity. These two issues would not be directly visible, I used debugging code to check the anon_vma pointers of parent and child when inspecting the suspicious implementation of issue #2, then find the problem. This patch (of 3): In dup_mmap(), anon_vma_prepare() is called for vma has VM_WIPEONFORK, and parameter 'tmp' (i.e., the new vma of child) has same ->vm_next and ->vm_prev as its parent vma. That allows anon_vma used by parent been mistakenly shared by child (find_mergeable_anon_vma() will do this reuse work). Besides this issue, call anon_vma_prepare() should be avoided because we don't copy page for this vma. Preparing anon_vma will be handled during fault. Fixes: d2cd9ede ("mm,fork: introduce MADV_WIPEONFORK") Signed-off-by:
Li Xinhai <lixinhai.lxh@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Link: http://lkml.kernel.org/r/1581150928-3214-2-git-send-email-lixinhai.lxh@gmail.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Dmitry Safonov authored
Michael noticed that the userns limit for the number of time namespaces is missing. Furthermore, the time namespace work introduced UCOUNT_TIME_NAMESPACES, but didn't introduce a corresponding array member in user_table[]. That would have made the array's initialisation an out-of-bounds write, but by luck the user_table array has an extra empty member (all accesses to the array are limited by UCOUNT_COUNTS), so it silently reused the last free member. This fixes a user-visible regression: because of the missing UCOUNT_ENTRY(), max_inotify_instances limited the maximum number of time namespaces instead of the number of inotify instances. Fixes: 769071ac ("ns: Introduce Time Namespace") Reported-by:
Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> Signed-off-by:
Dmitry Safonov <dima@arista.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Acked-by:
Andrei Vagin <avagin@gmail.com> Acked-by:
Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: stable@kernel.org Link: https://lkml.kernel.org/r/20200406171342.128733-1-dima@arista.com
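A sketch of the missing row, mirroring the other namespace limits in user_table[]:

    static struct ctl_table user_table[] = {
            UCOUNT_ENTRY("max_user_namespaces"),
            /* ... other UCOUNT_ENTRY() rows ... */
    #ifdef CONFIG_TIME_NS
            UCOUNT_ENTRY("max_time_namespaces"),    /* the entry that was missing */
    #endif
            { }
    };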
-
Michael Kerrisk (man-pages) authored
Looking at the contents of the /proc/PID/ns/time_for_children symlink shows an anomaly:

    $ ls -l /proc/self/ns/* | awk '{print $9, $10, $11}'
    ...
    /proc/self/ns/pid -> pid:[4026531836]
    /proc/self/ns/pid_for_children -> pid:[4026531836]
    /proc/self/ns/time -> time:[4026531834]
    /proc/self/ns/time_for_children -> time_for_children:[4026531834]
    /proc/self/ns/user -> user:[4026531837]
    ...

The reference for 'time_for_children' should be a 'time' namespace, just as the reference for 'pid_for_children' is a 'pid' namespace. In other words, the above time_for_children link should read:

    /proc/self/ns/time_for_children -> time:[4026531834]

Fixes: 769071ac ("ns: Introduce Time Namespace") Signed-off-by:
Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Reviewed-by:
Dmitry Safonov <dima@arista.com> Acked-by:
Christian Brauner <christian.brauner@ubuntu.com> Acked-by:
Andrei Vagin <avagin@gmail.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/a2418c48-ed80-3afe-116e-6611cb799557@gmail.com
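A sketch of the fix, assuming the symlink text comes from the same real_ns_name mechanism that pid_for_children already uses:

    const struct proc_ns_operations timens_for_children_operations = {
            .name         = "time_for_children",
            .real_ns_name = "time",  /* symlink target now reads time:[...] */
            /* .type, .get, .put, .install, .owner as before */
    };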
-
- Apr 06, 2020
-
Christoph Hellwig authored
Use in_compat_syscall to copy directly from the 32-bit ABI structure. Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-
Christoph Hellwig authored
Move the handling of the SNAPSHOT_SET_SWAP_AREA ioctl from the main ioctl helper into a helper function. Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- Apr 03, 2020
-
Waiman Long authored
The cpuset in cgroup v1 accepts a special "cpuset_v2_mode" mount option that makes cpuset.cpus and cpuset.mems behave more like those in cgroup v2. Document it to make other people more aware of this feature, which can be useful in some circumstances. Signed-off-by:
Waiman Long <longman@redhat.com> Signed-off-by:
Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
This reverts commit a49e4629 ("cpuset: Make cpuset hotplug synchronous") as it may deadlock with the cpu hotplug path. Link: http://lkml.kernel.org/r/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw Signed-off-by:
Tejun Heo <tj@kernel.org> Reported-by:
Qian Cai <cai@lca.pw> Cc: Prateek Sood <prsood@codeaurora.org>
-
Steven Rostedt (VMware) authored
When dumping out the trace data in latency format, a check is made to peek at the next event to compare its timestamp to the current one, and if the delta is large enough, a marker is added to show it. But to do this, the current event needs to be saved; otherwise peeking at the next event will remove the current one. To save the event, a temp buffer is used, and if the event is bigger than the temp buffer, the temp buffer is freed and a bigger buffer is allocated. This allocation is a problem when called in atomic context. The only way this gets called in atomic context is via ftrace_dump(). Thus, use a static buffer of 128 bytes (which covers most events), and if the event is bigger than that, simply return NULL. The callers of trace_find_next_entry() need to handle a NULL case, as that's what would happen if the allocation failed. Link: https://lore.kernel.org/r/20200326091256.GR11705@shao2-debian Fixes: ff895103 ("tracing: Save off entry when peeking at next entry") Reported-by:
kernel test robot <rong.a.chen@intel.com> Signed-off-by:
Steven Rostedt (VMware) <rostedt@goodmis.org>
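A sketch of the fallback logic, with field names approximating the tracing iterator:

    #define STATIC_TEMP_BUF_SIZE    128
    static char static_temp_buf[STATIC_TEMP_BUF_SIZE];

    /* In atomic context (ftrace_dump()) use the static buffer; events
     * too large for it simply aren't peeked at. */
    if (!iter->temp) {
            iter->temp = static_temp_buf;
            iter->temp_size = STATIC_TEMP_BUF_SIZE;
    }
    if (iter->temp == static_temp_buf && iter->temp_size < iter->ent_size)
            return NULL;    /* callers must handle NULL */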
-
- Apr 02, 2020
-
Sebastian Andrzej Siewior authored
Since commit 5bbe3547 ("mm: allow compaction of unevictable pages") it is allowed to examine mlocked pages and compact them by default. On -RT even minor pagefaults are problematic because it may take a few hundred microseconds to resolve them and until then the task is blocked. Make compact_unevictable_allowed = 0 the default and issue a warning on RT if it is changed. [bigeasy@linutronix.de: v5] Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de Signed-off-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Mel Gorman <mgorman@techsingularity.net> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-