  1. May 24, 2024
• arm64: patching: fix handling of execmem addresses · b1480ed2
      Will Deacon authored
      Klara Modin reported warnings for a kernel configured with BPF_JIT but
      without MODULES:
      
      [   44.131296] Trying to vfree() bad address (000000004a17c299)
      [   44.138024] WARNING: CPU: 1 PID: 193 at mm/vmalloc.c:3189 remove_vm_area (mm/vmalloc.c:3189 (discriminator 1))
      [   44.146675] CPU: 1 PID: 193 Comm: kworker/1:2 Tainted: G      D W          6.9.0-01786-g2c9e5d4a0082 #25
      [   44.158229] Hardware name: Raspberry Pi 3 Model B (DT)
      [   44.164433] Workqueue: events bpf_prog_free_deferred
      [   44.170492] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [   44.178601] pc : remove_vm_area (mm/vmalloc.c:3189 (discriminator 1))
      [   44.183705] lr : remove_vm_area (mm/vmalloc.c:3189 (discriminator 1))
      [   44.188772] sp : ffff800082a13c70
      [   44.193112] x29: ffff800082a13c70 x28: 0000000000000000 x27: 0000000000000000
      [   44.201384] x26: 0000000000000000 x25: ffff00003a44efa0 x24: 00000000d4202000
      [   44.209658] x23: ffff800081223dd0 x22: ffff00003a198a40 x21: ffff8000814dd880
      [   44.217924] x20: 00000000d4202000 x19: ffff8000814dd880 x18: 0000000000000006
      [   44.226206] x17: 0000000000000000 x16: 0000000000000020 x15: 0000000000000002
      [   44.234460] x14: ffff8000811a6370 x13: 0000000020000000 x12: 0000000000000000
      [   44.242710] x11: ffff8000811a6370 x10: 0000000000000144 x9 : ffff8000811fe370
      [   44.250959] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000811fe370
      [   44.259206] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
      [   44.267457] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000002203240
      [   44.275703] Call trace:
      [   44.279158] remove_vm_area (mm/vmalloc.c:3189 (discriminator 1))
      [   44.283858] vfree (mm/vmalloc.c:3322)
      [   44.287835] execmem_free (mm/execmem.c:70)
      [   44.292347] bpf_jit_free_exec+0x10/0x1c
      [   44.297283] bpf_prog_pack_free (kernel/bpf/core.c:1006)
      [   44.302457] bpf_jit_binary_pack_free (kernel/bpf/core.c:1195)
      [   44.307951] bpf_jit_free (include/linux/filter.h:1083 arch/arm64/net/bpf_jit_comp.c:2474)
      [   44.312342] bpf_prog_free_deferred (kernel/bpf/core.c:2785)
      [   44.317785] process_one_work (kernel/workqueue.c:3273)
      [   44.322684] worker_thread (kernel/workqueue.c:3342 (discriminator 2) kernel/workqueue.c:3429 (discriminator 2))
      [   44.327292] kthread (kernel/kthread.c:388)
      [   44.331342] ret_from_fork (arch/arm64/kernel/entry.S:861)
      
The problem is that bpf_arch_text_copy() silently fails to write to the
read-only area: patch_map() faults and the resulting -EFAULT is
discarded.
      
      Update patch_map() to use CONFIG_EXECMEM instead of
      CONFIG_STRICT_MODULE_RWX to check for vmalloc addresses.
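A minimal sketch of the idea, assuming the pre-existing shape of
patch_map() in arch/arm64/kernel/patching.c (helper names approximate
the real code; the IS_ENABLED() switch is the point of the fix):

	static void __kprobes *patch_map(void *addr, int fixmap)
	{
		unsigned long uintaddr = (uintptr_t) addr;
		bool image = is_image_text(uintaddr);
		struct page *page;

		if (image)
			page = phys_to_page(__pa_symbol(addr));
		else if (IS_ENABLED(CONFIG_EXECMEM))	/* was CONFIG_STRICT_MODULE_RWX */
			page = vmalloc_to_page(addr);
		else
			return addr;

		BUG_ON(!page);
		return (void *)set_fixmap_offset(fixmap, page_to_phys(page) +
				(uintaddr & ~PAGE_MASK));
	}

With CONFIG_BPF_JIT=y and CONFIG_MODULES=n, the JIT memory still comes
from execmem (vmalloc), so gating on CONFIG_STRICT_MODULE_RWX made
patch_map() return the read-only address unmapped and the subsequent
write faulted.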
      
      Link: https://lkml.kernel.org/r/20240521213813.703309-1-rppt@kernel.org
      
      
      Fixes: 2c9e5d4a ("bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of")
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reported-by: Klara Modin <klarasmodin@gmail.com>
      Closes: https://lore.kernel.org/all/7983fbbf-0127-457c-9394-8d6e4299c685@gmail.com
      
      
Tested-by: Klara Modin <klarasmodin@gmail.com>
      Cc: Björn Töpel <bjorn@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• mseal: wire up mseal syscall · ff388fe5
      Jeff Xu authored
      Patch series "Introduce mseal", v10.
      
      This patchset proposes a new mseal() syscall for the Linux kernel.
      
      In a nutshell, mseal() protects the VMAs of a given virtual memory range
      against modifications, such as changes to their permission bits.
      
      Modern CPUs support memory permissions, such as the read/write (RW) and
      no-execute (NX) bits.  Linux has supported NX since the release of kernel
      version 2.6.8 in August 2004 [1].  The memory permission feature improves
      the security stance on memory corruption bugs, as an attacker cannot
      simply write to arbitrary memory and point the code to it.  The memory
      must be marked with the X bit, or else an exception will occur. 
      Internally, the kernel maintains the memory permissions in a data
      structure called VMA (vm_area_struct).  mseal() additionally protects the
      VMA itself against modifications of the selected seal type.
      
      Memory sealing is useful to mitigate memory corruption issues where a
      corrupted pointer is passed to a memory management system.  For example,
      such an attacker primitive can break control-flow integrity guarantees
      since read-only memory that is supposed to be trusted can become writable
      or .text pages can get remapped.  Memory sealing can automatically be
      applied by the runtime loader to seal .text and .rodata pages and
      applications can additionally seal security critical data at runtime.  A
      similar feature already exists in the XNU kernel with the
      VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall
      [4].  Also, Chrome wants to adopt this feature for their CFI work [2] and
      this patchset has been designed to be compatible with the Chrome use case.
      
      Two system calls are involved in sealing the map:  mmap() and mseal().
      
The new mseal() is a syscall for 64-bit CPUs, with the following signature:
      
int mseal(void *addr, size_t len, unsigned long flags)
      addr/len: memory range.
      flags: reserved.
      
mseal() blocks the following operations for the given memory range (a usage sketch follows this list).
      
1> Unmapping, moving to another location, and shrinking the size, via
   munmap() and mremap(): these can leave an empty space that can then
   be filled by a VMA with a new set of attributes.
      
      2> Moving or expanding a different VMA into the current location,
         via mremap().
      
      3> Modifying a VMA via mmap(MAP_FIXED).
      
      4> Size expansion, via mremap(), does not appear to pose any specific
         risks to sealed VMAs. It is included anyway because the use case is
         unclear. In any case, users can rely on merging to expand a sealed VMA.
      
      5> mprotect() and pkey_mprotect().
      
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
   memory, when users don't have write permission to the memory. Those
   behaviors can alter region contents by discarding pages, effectively a
   memset(0) for anonymous memory.
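
To make the interface concrete, here is a hedged userspace sketch;
mseal() has no libc wrapper yet, so it is invoked via syscall(), and
__NR_mseal is assumed to be provided by the installed kernel headers:

	#define _GNU_SOURCE
	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 4096;
		void *p = mmap(NULL, len, PROT_READ,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		/* Seal the mapping; flags is reserved and must be 0. */
		if (syscall(__NR_mseal, p, len, 0) != 0) {
			perror("mseal");
			return 1;
		}

		/* Permission changes on the sealed range now fail. */
		if (mprotect(p, len, PROT_READ | PROT_WRITE) != 0)
			printf("mprotect blocked: %s\n", strerror(errno));

		return 0;
	}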
      
      The idea that inspired this patch comes from Stephen Röttger’s work in
      V8 CFI [5].  Chrome browser in ChromeOS will be the first user of this
      API.
      
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications.  For example, in the
case of libc, sealing is only applied to read-only (RO) or read-execute
(RX) memory segments (such as .text and .RELRO) to prevent them from
becoming writable; the lifetime of those mappings is tied to the
lifetime of the process.
      
      Chrome wants to seal two large address space reservations that are managed
      by different allocators.  The memory is mapped RW- and RWX respectively
      but write access to it is restricted using pkeys (or in the future ARM
permission overlay extensions).  The lifetime of those mappings is not
tied to the lifetime of the process; therefore, while the memory is
sealed, the allocators still need to free or discard the unused memory,
for example with madvise(DONTNEED).
      
      However, always allowing madvise(DONTNEED) on this range poses a security
      risk.  For example if a jump instruction crosses a page boundary and the
      second page gets discarded, it will overwrite the target bytes with zeros
      and change the control flow.  Checking write-permission before the discard
      operation allows us to control when the operation is valid.  In this case,
      the madvise will only succeed if the executing thread has PKEY write
      permissions and PKRU changes are protected in software by control-flow
      integrity.
      
      Although the initial version of this patch series is targeting the Chrome
      browser as its first user, it became evident during upstream discussions
      that we would also want to ensure that the patch set eventually is a
      complete solution for memory sealing and compatible with other use cases. 
      The specific scenario currently in mind is glibc's use case of loading and
      sealing ELF executables.  To this end, Stephen is working on a change to
      glibc to add sealing support to the dynamic linker, which will seal all
      non-writable segments at startup.  Once this work is completed, all
      applications will be able to automatically benefit from these new
      protections.
      
      In closing, I would like to formally acknowledge the valuable
      contributions received during the RFC process, which were instrumental in
      shaping this patch:
      
      Jann Horn: raising awareness and providing valuable insights on the
        destructive madvise operations.
      Liam R. Howlett: perf optimization.
      Linus Torvalds: assisting in defining system call signature and scope.
      Theo de Raadt: sharing the experiences and insight gained from
        implementing mimmutable() in OpenBSD.
      
      MM perf benchmarks
      ==================
This patch adds a loop in mprotect/munmap/madvise(DONTNEED) to check
the VMAs' sealing flag, so that no partial update is made when any
segment within the given memory range is sealed.
      
To measure the performance impact of this loop, two tests were
developed [8].
      
The first measures the time taken by a particular system call using
clock_gettime(CLOCK_MONOTONIC). The second uses
PERF_COUNT_HW_REF_CPU_CYCLES (excluding user space). Both tests give
similar results.
      
The tests roughly follow the sequence below:
for (i = 0; i < 1000; i++) {
    create 1000 mappings (1 page per VMA)
    start the sampling
    for (j = 0; j < 1000; j++)
        mprotect one mapping
    stop and save the sample
    delete 1000 mappings
}
compute statistics over all samples
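
For reference, a minimal sketch of the first measurement method, with
illustrative names rather than the actual test code from [8]:

	#include <time.h>
	#include <sys/mman.h>

	/* Time 1000 mprotect() calls, one per single-page VMA. */
	static long long sample_mprotect(void *addr[1000])
	{
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (int j = 0; j < 1000; j++)
			mprotect(addr[j], 4096, PROT_READ);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		return (t1.tv_sec - t0.tv_sec) * 1000000000LL +
		       (t1.tv_nsec - t0.tv_nsec);
	}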
      
The tests below were performed on an Intel(R) Pentium(R) Gold 7505 @
2.00GHz Chromebook with 4G memory.
      
      Based on the latest upstream code:
      The first test (measuring time)
      syscall__	vmas	t	t_mseal	delta_ns	per_vma	%
      munmap__  	1	909	944	35	35	104%
      munmap__  	2	1398	1502	104	52	107%
      munmap__  	4	2444	2594	149	37	106%
      munmap__  	8	4029	4323	293	37	107%
      munmap__  	16	6647	6935	288	18	104%
      munmap__  	32	11811	12398	587	18	105%
      mprotect	1	439	465	26	26	106%
      mprotect	2	1659	1745	86	43	105%
      mprotect	4	3747	3889	142	36	104%
      mprotect	8	6755	6969	215	27	103%
      mprotect	16	13748	14144	396	25	103%
      mprotect	32	27827	28969	1142	36	104%
      madvise_	1	240	262	22	22	109%
      madvise_	2	366	442	76	38	121%
      madvise_	4	623	751	128	32	121%
      madvise_	8	1110	1324	215	27	119%
      madvise_	16	2127	2451	324	20	115%
      madvise_	32	4109	4642	534	17	113%
      
      The second test (measuring cpu cycle)
      syscall__	vmas	cpu	cmseal	delta_cpu	per_vma	%
      munmap__	1	1790	1890	100	100	106%
      munmap__	2	2819	3033	214	107	108%
      munmap__	4	4959	5271	312	78	106%
      munmap__	8	8262	8745	483	60	106%
      munmap__	16	13099	14116	1017	64	108%
      munmap__	32	23221	24785	1565	49	107%
      mprotect	1	906	967	62	62	107%
      mprotect	2	3019	3203	184	92	106%
      mprotect	4	6149	6569	420	105	107%
      mprotect	8	9978	10524	545	68	105%
      mprotect	16	20448	21427	979	61	105%
      mprotect	32	40972	42935	1963	61	105%
      madvise_	1	434	497	63	63	115%
      madvise_	2	752	899	147	74	120%
      madvise_	4	1313	1513	200	50	115%
      madvise_	8	2271	2627	356	44	116%
      madvise_	16	4312	4883	571	36	113%
      madvise_	32	8376	9319	943	29	111%
      
Based on these results, for the 6.8 kernel, the sealing check adds
20-40 nanoseconds, or around 50-100 CPU cycles, per VMA.
      
In addition, I applied the sealing to the 5.10 kernel:
      The first test (measuring time)
syscall__	vmas	t	t_mseal	delta_ns	per_vma	%
      munmap__	1	357	390	33	33	109%
      munmap__	2	442	463	21	11	105%
      munmap__	4	614	634	20	5	103%
      munmap__	8	1017	1137	120	15	112%
      munmap__	16	1889	2153	263	16	114%
      munmap__	32	4109	4088	-21	-1	99%
      mprotect	1	235	227	-7	-7	97%
      mprotect	2	495	464	-30	-15	94%
      mprotect	4	741	764	24	6	103%
      mprotect	8	1434	1437	2	0	100%
      mprotect	16	2958	2991	33	2	101%
      mprotect	32	6431	6608	177	6	103%
      madvise_	1	191	208	16	16	109%
      madvise_	2	300	324	24	12	108%
      madvise_	4	450	473	23	6	105%
      madvise_	8	753	806	53	7	107%
      madvise_	16	1467	1592	125	8	108%
madvise_	32	2795	3405	610	19	122%

The second test (measuring cpu cycle)
syscall__	vmas	cpu	cmseal	delta_cpu	per_vma	%
      munmap__	1	684	715	31	31	105%
      munmap__	2	861	898	38	19	104%
      munmap__	4	1183	1235	51	13	104%
      munmap__	8	1999	2045	46	6	102%
      munmap__	16	3839	3816	-23	-1	99%
      munmap__	32	7672	7887	216	7	103%
      mprotect	1	397	443	46	46	112%
      mprotect	2	738	788	50	25	107%
      mprotect	4	1221	1256	35	9	103%
      mprotect	8	2356	2429	72	9	103%
      mprotect	16	4961	4935	-26	-2	99%
      mprotect	32	9882	10172	291	9	103%
      madvise_	1	351	380	29	29	108%
      madvise_	2	565	615	49	25	109%
      madvise_	4	872	933	61	15	107%
      madvise_	8	1508	1640	132	16	109%
      madvise_	16	3078	3323	245	15	108%
      madvise_	32	5893	6704	811	25	114%
      
For the 5.10 kernel, the sealing check adds 0-15 ns in time, or 10-30
CPU cycles; in some cases there is even a decrease.
      
It might be interesting to compare the 5.10 and 6.8 kernels:
      The first test (measuring time)
      syscall__	vmas	t_5_10	t_6_8	delta_ns	per_vma	%
      munmap__	1	357	909	552	552	254%
      munmap__	2	442	1398	956	478	316%
      munmap__	4	614	2444	1830	458	398%
      munmap__	8	1017	4029	3012	377	396%
      munmap__	16	1889	6647	4758	297	352%
      munmap__	32	4109	11811	7702	241	287%
      mprotect	1	235	439	204	204	187%
      mprotect	2	495	1659	1164	582	335%
      mprotect	4	741	3747	3006	752	506%
      mprotect	8	1434	6755	5320	665	471%
      mprotect	16	2958	13748	10790	674	465%
      mprotect	32	6431	27827	21397	669	433%
      madvise_	1	191	240	49	49	125%
      madvise_	2	300	366	67	33	122%
      madvise_	4	450	623	173	43	138%
      madvise_	8	753	1110	357	45	147%
      madvise_	16	1467	2127	660	41	145%
      madvise_	32	2795	4109	1314	41	147%
      
      The second test (measuring cpu cycle)
syscall__	vmas	cpu_5_10	cpu_6_8	delta_cpu	per_vma	%
      munmap__	1	684	1790	1106	1106	262%
      munmap__	2	861	2819	1958	979	327%
      munmap__	4	1183	4959	3776	944	419%
      munmap__	8	1999	8262	6263	783	413%
      munmap__	16	3839	13099	9260	579	341%
      munmap__	32	7672	23221	15549	486	303%
      mprotect	1	397	906	509	509	228%
      mprotect	2	738	3019	2281	1140	409%
      mprotect	4	1221	6149	4929	1232	504%
      mprotect	8	2356	9978	7622	953	423%
      mprotect	16	4961	20448	15487	968	412%
      mprotect	32	9882	40972	31091	972	415%
      madvise_	1	351	434	82	82	123%
      madvise_	2	565	752	186	93	133%
      madvise_	4	872	1313	442	110	151%
      madvise_	8	1508	2271	763	95	151%
      madvise_	16	3078	4312	1234	77	140%
      madvise_	32	5893	8376	2483	78	142%
      
From 5.10 to 6.8:
munmap: added 250-550 ns in time, or 500-1100 CPU cycles, per VMA.
mprotect: added 200-750 ns in time, or 500-1200 CPU cycles, per VMA.
madvise: added 33-50 ns in time, or 70-110 CPU cycles, per VMA.
      
      In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
      increase from 5.10 to 6.8 is significantly larger, approximately ten times
      greater for munmap and mprotect.
      
When I discussed mm performance with Brian Makin, an engineer who has
worked on performance, he pointed out that such benchmarks, which
measure millions of mm syscalls in a tight loop, may not accurately
reflect real-world scenarios, such as that of a database service.
Also, this was tested on a single piece of hardware running ChromeOS;
data from other hardware or distributions might differ.  It might be
best to take this data with a grain of salt.
      
      
      This patch (of 5):
      
      Wire up mseal syscall for all architectures.
      
      Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org
      Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org
      
      
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <groeck@chromium.org>
      Cc: Jann Horn <jannh@google.com> [Bug #2]
      Cc: Jeff Xu <jeffxu@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Pedro Falcato <pedro.falcato@gmail.com>
      Cc: Stephen Röttger <sroettger@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Amer Al Shanawany <amer.shanawany@gmail.com>
      Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. May 23, 2024
• genirq/cpuhotplug, x86/vector: Prevent vector leak during CPU offline · a6c11c0a
      Dongli Zhang authored
      
      The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
      interrupt affinity reconfiguration via procfs. Instead, the change is
      deferred until the next instance of the interrupt being triggered on the
      original CPU.
      
      When the interrupt next triggers on the original CPU, the new affinity is
      enforced within __irq_move_irq(). A vector is allocated from the new CPU,
      but the old vector on the original CPU remains and is not immediately
      reclaimed. Instead, apicd->move_in_progress is flagged, and the reclaiming
      process is delayed until the next trigger of the interrupt on the new CPU.
      
      Upon the subsequent triggering of the interrupt on the new CPU,
      irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
      remains online. Subsequently, the timer on the old CPU iterates over its
      vector_cleanup list, reclaiming old vectors.
      
      However, a rare scenario arises if the old CPU is outgoing before the
      interrupt triggers again on the new CPU.
      
      In that case irq_force_complete_move() is not invoked on the outgoing CPU
      to reclaim the old apicd->prev_vector because the interrupt isn't currently
      affine to the outgoing CPU, and irq_needs_fixup() returns false. Even
      though __vector_schedule_cleanup() is later called on the new CPU, it
      doesn't reclaim apicd->prev_vector; instead, it simply resets both
      apicd->move_in_progress and apicd->prev_vector to 0.
      
      As a result, the vector remains unreclaimed in vector_matrix, leading to a
      CPU vector leak.
      
To address this issue, move the invocation of irq_force_complete_move()
before the irq_needs_fixup() call, so that apicd->prev_vector is
reclaimed if the interrupt is currently, or was previously, affine to
the outgoing CPU.
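
A hedged sketch of the reordering in migrate_one_irq()
(kernel/irq/cpuhotplug.c, heavily abridged; the surrounding logic is
more involved than shown):

	static bool migrate_one_irq(struct irq_desc *desc)
	{
		struct irq_data *d = irq_desc_get_irq_data(desc);

		/*
		 * Complete a pending move first, so that a stale
		 * apicd->prev_vector on the outgoing CPU is reclaimed
		 * even when the interrupt is no longer affine to this
		 * CPU and irq_needs_fixup() returns false below.
		 */
		irq_force_complete_move(desc);

		if (!irq_needs_fixup(d))
			return false;

		/* ... migrate the interrupt to an online CPU ... */
		return true;
	}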
      
      Additionally, reclaim the vector in __vector_schedule_cleanup() as well,
      following a warning message, although theoretically it should never see
      apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU.
      
      Fixes: f0383c24 ("genirq/cpuhotplug: Add support for cleaning up move in progress")
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240522220218.162423-1-dongli.zhang@oracle.com
• riscv: Fix early ftrace nop patching · 6ca445d8
      Alexandre Ghiti authored
      
Commit c97bf629 ("riscv: Fix text patching when IPI are used")
converted ftrace_make_nop() to use patch_insn_write(), which does not
emit any icache flush, relying entirely on __ftrace_modify_code() to do
that.
      
But we missed that ftrace_make_nop() is also called very early, when
directly converting the mcount calls into nops (on riscv it actually
converts the 2B nops emitted by the compiler into 4B nops).
      
This caused crashes on multiple HW, as reported by Conor and Björn,
since the booting core could have half-patched instructions in its
icache, which would trigger an illegal instruction trap. Fix this by
emitting a local icache flush when patching the nops early.
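
A hedged sketch of the fix in arch/riscv/kernel/ftrace.c (details
approximate the patch):

	int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
	{
		int out;

		mutex_lock(&text_mutex);
		out = ftrace_make_nop(mod, rec, MCOUNT_ADDR);
		mutex_unlock(&text_mutex);

		/*
		 * The nops are patched on the booting core before any
		 * other core runs, so a local icache flush suffices.
		 */
		if (!mod)
			local_flush_icache_range(rec->ip,
						 rec->ip + MCOUNT_INSN_SIZE);

		return out;
	}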
      
      Fixes: c97bf629 ("riscv: Fix text patching when IPI are used")
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reported-by: Conor Dooley <conor.dooley@microchip.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
      Link: https://lore.kernel.org/r/20240523115134.70380-1-alexghiti@rivosinc.com
      
      
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
• tracing/treewide: Remove second parameter of __assign_str() · 2c92ca84
      Steven Rostedt (Google) authored
With the rework of how __string() handles dynamic strings, where it
saves off the source string in a field in the helper structure [1], the
assignment of that value to the trace event field is stored in the
helper value and does not need to be passed in again.
      This means that with:
      
        __string(field, mystring)
      
which used to be assigned with __assign_str(field, mystring), no longer
needs the second parameter, as it is unused. With this change,
__assign_str() now takes only a single parameter.
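
For illustration, a minimal before/after on a hypothetical event (the
event name and fields are made up):

	TRACE_EVENT(sample_event,
		TP_PROTO(const char *name),
		TP_ARGS(name),
		TP_STRUCT__entry(
			__string(str, name)
		),
		TP_fast_assign(
			/* Before: __assign_str(str, name); */
			__assign_str(str);
		),
		TP_printk("%s", __get_str(str))
	);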
      
There are over 700 users of __assign_str(), and because coccinelle does
not handle the TRACE_EVENT() macro, I ended up using the following sed
script:
      
        git grep -l __assign_str | while read a ; do
            sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
            mv /tmp/test-file $a;
        done
      
I then searched for __assign_str() calls that did not end with ';', as
those were multi-line assignments that the sed script above would fail
to catch.
      
      Note, the same updates will need to be done for:
      
        __assign_str_len()
        __assign_rel_str()
        __assign_rel_str_len()
      
      I tested this with both an allmodconfig and an allyesconfig (build only for both).
      
      [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
      Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
      Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
      Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
Acked-by: Takashi Iwai <tiwai@suse.de>
      Acked-by: Darrick J. Wong <djwong@kernel.org>	# xfs
Tested-by: Guenter Roeck <linux@roeck-us.net>
3. May 21, 2024
• arm64: asm-bug: Add .align 2 to the end of __BUG_ENTRY · ffbf4fb9
      Jiangfeng Xiao authored
      
      When CONFIG_DEBUG_BUGVERBOSE=n, we fail to add necessary padding bytes
      to bug_table entries, and as a result the last entry in a bug table will
      be ignored, potentially leading to an unexpected panic(). All prior
      entries in the table will be handled correctly.
      
The arm64 ABI requires that struct fields of up to 8 bytes are
naturally aligned, with padding added within a struct such that structs
are suitably aligned within arrays.
      
When CONFIG_DEBUG_BUGVERBOSE=y, the layout of a bug_entry is:
      
      	struct bug_entry {
      		signed int      bug_addr_disp;	// 4 bytes
      		signed int      file_disp;	// 4 bytes
      		unsigned short  line;		// 2 bytes
      		unsigned short  flags;		// 2 bytes
      	}
      
      ... with 12 bytes total, requiring 4-byte alignment.
      
      When CONFIG_DEBUG_BUGVERBOSE=n, the layout of a bug_entry is:
      
      	struct bug_entry {
      		signed int      bug_addr_disp;	// 4 bytes
      		unsigned short  flags;		// 2 bytes
      		< implicit padding >		// 2 bytes
      	}
      
... with 8 bytes total, with 6 bytes of data and 2 bytes of trailing
padding, requiring 4-byte alignment.
      
      When we create a bug_entry in assembly, we align the start of the entry
      to 4 bytes, which implicitly handles padding for any prior entries.
      However, we do not align the end of the entry, and so when
      CONFIG_DEBUG_BUGVERBOSE=n, the final entry lacks the trailing padding
      bytes.
      
      For the main kernel image this is not a problem as find_bug() doesn't
      depend on the trailing padding bytes when searching for entries:
      
      	for (bug = __start___bug_table; bug < __stop___bug_table; ++bug)
      		if (bugaddr == bug_addr(bug))
      			return bug;
      
      However for modules, module_bug_finalize() depends on the trailing
      bytes when calculating the number of entries:
      
      	mod->num_bugs = sechdrs[i].sh_size / sizeof(struct bug_entry);
      
      ... and as the last bug_entry lacks the necessary padding bytes, this entry
      will not be counted, e.g. in the case of a single entry:
      
      	sechdrs[i].sh_size == 6
      	sizeof(struct bug_entry) == 8;
      
      	sechdrs[i].sh_size / sizeof(struct bug_entry) == 0;
      
      Consequently module_find_bug() will miss the last bug_entry when it does:
      
      	for (i = 0; i < mod->num_bugs; ++i, ++bug)
      		if (bugaddr == bug_addr(bug))
      			goto out;
      
... which can lead to a kernel panic due to an unhandled bug.
      
      This can be demonstrated with the following module:
      
      	static int __init buginit(void)
      	{
      		WARN(1, "hello\n");
      		return 0;
      	}
      
      	static void __exit bugexit(void)
      	{
      	}
      
      	module_init(buginit);
      	module_exit(bugexit);
      	MODULE_LICENSE("GPL");
      
      ... which will trigger a kernel panic when loaded:
      
      	------------[ cut here ]------------
      	hello
      	Unexpected kernel BRK exception at EL1
      	Internal error: BRK handler: 00000000f2000800 [#1] PREEMPT SMP
      	Modules linked in: hello(O+)
      	CPU: 0 PID: 50 Comm: insmod Tainted: G           O       6.9.1 #8
      	Hardware name: linux,dummy-virt (DT)
      	pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      	pc : buginit+0x18/0x1000 [hello]
      	lr : buginit+0x18/0x1000 [hello]
      	sp : ffff800080533ae0
      	x29: ffff800080533ae0 x28: 0000000000000000 x27: 0000000000000000
      	x26: ffffaba8c4e70510 x25: ffff800080533c30 x24: ffffaba8c4a28a58
      	x23: 0000000000000000 x22: 0000000000000000 x21: ffff3947c0eab3c0
      	x20: ffffaba8c4e3f000 x19: ffffaba846464000 x18: 0000000000000006
      	x17: 0000000000000000 x16: ffffaba8c2492834 x15: 0720072007200720
      	x14: 0720072007200720 x13: ffffaba8c49b27c8 x12: 0000000000000312
      	x11: 0000000000000106 x10: ffffaba8c4a0a7c8 x9 : ffffaba8c49b27c8
      	x8 : 00000000ffffefff x7 : ffffaba8c4a0a7c8 x6 : 80000000fffff000
      	x5 : 0000000000000107 x4 : 0000000000000000 x3 : 0000000000000000
      	x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff3947c0eab3c0
      	Call trace:
      	 buginit+0x18/0x1000 [hello]
      	 do_one_initcall+0x80/0x1c8
      	 do_init_module+0x60/0x218
      	 load_module+0x1ba4/0x1d70
      	 __do_sys_init_module+0x198/0x1d0
      	 __arm64_sys_init_module+0x1c/0x28
      	 invoke_syscall+0x48/0x114
      	 el0_svc_common.constprop.0+0x40/0xe0
      	 do_el0_svc+0x1c/0x28
      	 el0_svc+0x34/0xd8
      	 el0t_64_sync_handler+0x120/0x12c
      	 el0t_64_sync+0x190/0x194
      	Code: d0ffffe0 910003fd 91000000 9400000b (d4210000)
      	---[ end trace 0000000000000000 ]---
      	Kernel panic - not syncing: BRK handler: Fatal exception
      
      Fix this by always aligning the end of a bug_entry to 4 bytes, which is
      correct regardless of CONFIG_DEBUG_BUGVERBOSE.
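
A hedged sketch of the change in arch/arm64/include/asm/asm-bug.h (the
macro body is abridged; the added directive is the trailing .align,
which pads the entry to 4 bytes):

	#define __BUG_ENTRY(flags)				\
			".pushsection __bug_table,\"aw\"\n\t"	\
			".align 2\n\t"				\
		"0:	.long 1f - 0b\n\t"			\
	_BUGVERBOSE_LOCATION(__FILE__, __LINE__)		\
			".short " #flags "\n\t"			\
			".align 2\n\t"	/* pad the entry's tail */ \
			".popsection\n"				\
		"1:	"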
      
      Fixes: 9fb7410f ("arm64/BUG: Use BRK instruction for generic BUG traps")
      
Signed-off-by: Yuanbin Xie <xieyuanbin1@huawei.com>
Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Link: https://lore.kernel.org/r/1716212077-43826-1-git-send-email-xiaojiangfeng@huawei.com
      
      
Signed-off-by: Will Deacon <will@kernel.org>
• x86/topology: Handle bogus ACPI tables correctly · 9d22c963
      Thomas Gleixner authored
      
      The ACPI specification clearly states how the processors should be
      enumerated in the MADT:
      
       "To ensure that the boot processor is supported post initialization,
        two guidelines should be followed. The first is that OSPM should
        initialize processors in the order that they appear in the MADT. The
        second is that platform firmware should list the boot processor as the
        first processor entry in the MADT.
        ...
        Failure of OSPM implementations and platform firmware to abide by
        these guidelines can result in both unpredictable and non optimal
        platform operation."
      
The kernel relies on that ordering to detect the real BSP on crash
kernels, which is important to avoid sending an INIT IPI to it, as that
would cause a full machine reset.
      
      On a Dell XPS 16 9640 the BIOS ignores this rule and enumerates the CPUs in
      the wrong order. As a consequence the kernel falsely detects a crash kernel
      and disables the corresponding CPU.
      
      Prevent this by checking the IA32_APICBASE MSR for the BSP bit on the boot
      CPU. If that bit is set, then the MADT based BSP detection can be safely
      ignored. If the kernel detects a mismatch between the BSP bit and the first
      enumerated MADT entry then emit a firmware bug message.
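
A hedged sketch of the MSR check (MSR_IA32_APICBASE_BSP is bit 8 of the
IA32_APICBASE MSR; the function name is illustrative, not the actual
topology code):

	static bool __init boot_cpu_has_bsp_bit(void)
	{
		u64 msr;

		rdmsrl(MSR_IA32_APICBASE, msr);
		return !!(msr & MSR_IA32_APICBASE_BSP);
	}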
      
      This obviously also has to be taken into account when the boot APIC ID and
      the first enumerated APIC ID match. If the boot CPU does not have the BSP
      bit set in the APICBASE MSR then there is no way for the boot CPU to
      determine which of the CPUs is the real BSP. Sending an INIT to the real
      BSP would reset the machine so the only sane way to deal with that is to
      limit the number of CPUs to one and emit a corresponding warning message.
      
      Fixes: 5c5682b9 ("x86/cpu: Detect real BSP on crash kernels")
Reported-by: Carsten Tolkmit <ctolkmit@ennit.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Carsten Tolkmit <ctolkmit@ennit.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/87le48jycb.ffs@tglx
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218837