  1. Nov 03, 2023
    • io_uring/net: ensure socket is marked connected on connect retry · f8f9ab2d
      Jens Axboe authored
      io_uring does non-blocking connection attempts, which can yield some
      unexpected results if a connect request is re-attempted by an
      application. This is equivalent to the following sync syscall sequence:
      
      sock = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
      connect(sock, &addr, sizeof(addr));
      
      ret == -1 and errno == EINPROGRESS expected here. Now poll for POLLOUT
      on sock, and when that returns, we expect the socket to be connected.
      But if we follow that procedure with:
      
      connect(sock, &addr, sizeof(addr));
      
      you'd expect ret == -1 and errno == EISCONN here, but you actually get
      ret == 0. If we attempt the connection one more time, then we get EISCONN
      as expected.
      
      io_uring used to do this, but it turns out that bluetooth fails with
      EBADFD if you attempt to re-connect. It also looks like EISCONN _could_
      occur with this sequence.
      
      Retain the ->in_progress logic, but work around a potential EISCONN or
      EBADFD error and only in those cases look at the sock_error(). This
      should work in general and avoid the odd sequence of a repeated connect
      request returning success when the socket is already connected.
      
      This is all a side effect of the socket state being in a CONNECTING
      state when we get EINPROGRESS, and only a re-connect or other related
      operation will turn that into CONNECTED.
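
      For reference, a minimal liburing sketch of the io_uring side of this
      (a hypothetical helper, error handling omitted; io_uring handles the
      EINPROGRESS/poll retry internally, so the application only sees the
      completion result):

      #include <liburing.h>
      #include <netinet/in.h>

      /* submit one IORING_OP_CONNECT and return the CQE result */
      static int do_connect(struct io_uring *ring, int sock,
                            struct sockaddr_in *addr)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              struct io_uring_cqe *cqe;
              int res;

              io_uring_prep_connect(sqe, sock, (struct sockaddr *) addr,
                                    sizeof(*addr));
              io_uring_submit(ring);
              io_uring_wait_cqe(ring, &cqe);
              res = cqe->res;
              io_uring_cqe_seen(ring, cqe);
              return res;
      }

      Calling do_connect() a second time on an already-connected socket is
      expected to fail with -EISCONN, mirroring the synchronous sequence
      above, rather than returning 0.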
      
      Cc: stable@vger.kernel.org
      Fixes: 3fb1bd68 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
      Link: https://github.com/axboe/liburing/issues/980
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/rw: don't attempt to allocate async data if opcode doesn't need it · 0df96fb7
      Jens Axboe authored
      The new read multishot method doesn't need to allocate async data ever,
      as it doesn't do vectored IO and it must only be used with provided
      buffers. While it doesn't have ->prep_async() set, it also sets
      ->async_size to 0, which is different from any other read/write type we
      otherwise support.
      
      If it's used on a file type that isn't pollable, we do try to allocate
      this async data, and then try to use that data. But since we passed in
      a size of 0 for the data, we get a NULL back on data allocation. We then
      proceed to dereference that to copy state, and that obviously won't end
      well.
      
      Add a check in io_setup_async_rw() for this condition, and avoid copying
      state. Also add a check for whether or not buffer selection is specified
      in prep while at it.
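
      Schematically, the added guard boils down to an early return before the
      async data is touched (a sketch with made-up names, not the actual
      io_uring source):

      /* sketch: this opcode declares no async data (async_size == 0), so
       * there is nothing to allocate or copy iterator state into -- bail
       * out instead of dereferencing the NULL allocation */
      if (!opcode_def->async_size)
              return 0;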
      
      Fixes: fc68fcda ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=218101
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Oct 25, 2023
    • io_uring/rw: disable IOCB_DIO_CALLER_COMP · 838b35bb
      Jens Axboe authored
      
      If an application does O_DIRECT writes with io_uring and the file system
      supports IOCB_DIO_CALLER_COMP, then completion of the dio write side is
      done from the same task_work that will post the completion event for
      said write.
      
      Whenever a dio write is done against a file, the inode i_dio_count is
      elevated. This enables other callers to use inode_dio_wait() to wait for
      previous writes to complete. If we defer the full dio completion to
      task_work, we are dependent on that task_work being run before the
      inode i_dio_count can be decremented.
      
      If the same task that issues io_uring dio writes with
      IOCB_DIO_CALLER_COMP performs a synchronous system call that calls
      inode_dio_wait(), then we can deadlock as we're blocked sleeping on
      the event to become true, but not processing the completions that will
      result in the inode i_dio_count being decremented.
      
      Until we can guarantee that this deadlock cannot happen, disable the
      deferred caller completions.
      
      Fixes: 099ada2c ("io_uring/rw: add write support for IOCB_DIO_CALLER_COMP")
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid · 7644b1a1
      Jens Axboe authored
      We could race with SQ thread exit, and if we do, we'll hit a NULL pointer
      dereference when the thread is cleared. Grab the SQPOLL data lock before
      attempting to get the task cpu and pid for fdinfo, this ensures we have a
      stable view of it.
      
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=218032
      
      
      Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. Oct 18, 2023
    • io_uring: fix crash with IORING_SETUP_NO_MMAP and invalid SQ ring address · 8b51a395
      Jens Axboe authored
      
      If we specify a valid CQ ring address but an invalid SQ ring address,
      we'll correctly spot this and free the allocated pages and clear them
      to NULL. However, we don't clear the ring page count, and hence will
      attempt to free the pages again. We've already cleared the address of
      the page array when freeing them, but we don't check for that. This
      causes the following crash:
      
      Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      Oops [#1]
      Modules linked in:
      CPU: 0 PID: 20 Comm: kworker/u2:1 Not tainted 6.6.0-rc5-dirty #56
      Hardware name: ucbbar,riscvemu-bare (DT)
      Workqueue: events_unbound io_ring_exit_work
      epc : io_pages_free+0x2a/0x58
       ra : io_rings_free+0x3a/0x50
       epc : ffffffff808811a2 ra : ffffffff80881406 sp : ffff8f80000c3cd0
       status: 0000000200000121 badaddr: 0000000000000000 cause: 000000000000000d
       [<ffffffff808811a2>] io_pages_free+0x2a/0x58
       [<ffffffff80881406>] io_rings_free+0x3a/0x50
       [<ffffffff80882176>] io_ring_exit_work+0x37e/0x424
       [<ffffffff80027234>] process_one_work+0x10c/0x1f4
       [<ffffffff8002756e>] worker_thread+0x252/0x31c
       [<ffffffff8002f5e4>] kthread+0xc4/0xe0
       [<ffffffff8000332a>] ret_from_fork+0xa/0x1c
      
      Check for a NULL array in io_pages_free(), but also clear the page counts
      when we free them, to be on the safe side.
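
      As a sketch of the shape of that guard (illustrative, not a verbatim
      copy of the kernel function; the callers additionally reset their page
      counts to 0 so a second free becomes a no-op):

      static void io_pages_free(struct page ***pages, int npages)
      {
              struct page **page_array;
              int i;

              if (!pages)
                      return;
              page_array = *pages;
              if (!page_array)        /* already freed and cleared */
                      return;
              for (i = 0; i < npages; i++)
                      unpin_user_page(page_array[i]);
              kvfree(page_array);
              *pages = NULL;
      }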
      
      Reported-by: <rtm@csail.mit.edu>
      Fixes: 03d89a2d ("io_uring: support for user allocated memory for rings/sqes")
      Cc: stable@vger.kernel.org
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. Oct 05, 2023
    • io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls() · 0f8baa3c
      Jeff Moyer authored
      
      I received a bug report with the following signature:
      
      [ 1759.937637] BUG: unable to handle page fault for address: ffffffffffffffe8
      [ 1759.944564] #PF: supervisor read access in kernel mode
      [ 1759.949732] #PF: error_code(0x0000) - not-present page
      [ 1759.954901] PGD 7ab615067 P4D 7ab615067 PUD 7ab617067 PMD 0
      [ 1759.960596] Oops: 0000 [#1] PREEMPT SMP PTI
      [ 1759.964804] CPU: 15 PID: 109 Comm: cpuhp/15 Kdump: loaded Tainted: G X ------- --- 5.14.0-362.3.1.el9_3.x86_64 #1
      [ 1759.976609] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018
      [ 1759.985181] RIP: 0010:io_wq_for_each_worker.isra.0+0x24/0xa0
      [ 1759.990877] Code: 90 90 90 90 90 90 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d 6f 78 53 48 8b 47 78 48 39 c5 74 4f 49 89 f5 49 89 d4 48 8d 58 e8 <8b> 13 85 d2 74 32 8d 4a 01 89 d0 f0 0f b1 0b 75 5c 09 ca 78 3d 48
      [ 1760.009758] RSP: 0000:ffffb6f403603e20 EFLAGS: 00010286
      [ 1760.015013] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: 0000000000000000
      [ 1760.022188] RDX: ffffb6f403603e50 RSI: ffffffffb11e95b0 RDI: ffff9f73b09e9400
      [ 1760.029362] RBP: ffff9f73b09e9478 R08: 000000000000000f R09: 0000000000000000
      [ 1760.036536] R10: ffffffffffffff00 R11: ffffb6f403603d80 R12: ffffb6f403603e50
      [ 1760.043712] R13: ffffffffb11e95b0 R14: ffffffffb28531e8 R15: ffff9f7a6fbdf548
      [ 1760.050887] FS: 0000000000000000(0000) GS:ffff9f7a6fbc0000(0000) knlGS:0000000000000000
      [ 1760.059025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1760.064801] CR2: ffffffffffffffe8 CR3: 00000007ab610002 CR4: 00000000007706e0
      [ 1760.071976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1760.079150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1760.086325] PKRU: 55555554
      [ 1760.089044] Call Trace:
      [ 1760.091501] <TASK>
      [ 1760.093612] ? show_trace_log_lvl+0x1c4/0x2df
      [ 1760.097995] ? show_trace_log_lvl+0x1c4/0x2df
      [ 1760.102377] ? __io_wq_cpu_online+0x54/0xb0
      [ 1760.106584] ? __die_body.cold+0x8/0xd
      [ 1760.110356] ? page_fault_oops+0x134/0x170
      [ 1760.114479] ? kernelmode_fixup_or_oops+0x84/0x110
      [ 1760.119298] ? exc_page_fault+0xa8/0x150
      [ 1760.123247] ? asm_exc_page_fault+0x22/0x30
      [ 1760.127458] ? __pfx_io_wq_worker_affinity+0x10/0x10
      [ 1760.132453] ? __pfx_io_wq_worker_affinity+0x10/0x10
      [ 1760.137446] ? io_wq_for_each_worker.isra.0+0x24/0xa0
      [ 1760.142527] __io_wq_cpu_online+0x54/0xb0
      [ 1760.146558] cpuhp_invoke_callback+0x109/0x460
      [ 1760.151029] ? __pfx_io_wq_cpu_offline+0x10/0x10
      [ 1760.155673] ? __pfx_smpboot_thread_fn+0x10/0x10
      [ 1760.160320] cpuhp_thread_fun+0x8d/0x140
      [ 1760.164266] smpboot_thread_fn+0xd3/0x1a0
      [ 1760.168297] kthread+0xdd/0x100
      [ 1760.171457] ? __pfx_kthread+0x10/0x10
      [ 1760.175225] ret_from_fork+0x29/0x50
      [ 1760.178826] </TASK>
      [ 1760.181022] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common isst_if_common ipmi_ssif nfit libnvdimm mgag200 i2c_algo_bit ioatdma drm_shmem_helper drm_kms_helper acpi_ipmi syscopyarea x86_pkg_temp_thermal sysfillrect ipmi_si intel_powerclamp sysimgblt ipmi_devintf coretemp acpi_power_meter ipmi_msghandler rapl pcspkr dca intel_pch_thermal intel_cstate ses lpc_ich intel_uncore enclosure hpilo mei_me mei acpi_tad fuse drm xfs sd_mod sg bnx2x nvme nvme_core crct10dif_pclmul crc32_pclmul nvme_common ghash_clmulni_intel smartpqi tg3 t10_pi mdio uas libcrc32c crc32c_intel scsi_transport_sas usb_storage hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod
      [ 1760.248623] CR2: ffffffffffffffe8
      
      A cpu hotplug callback was issued before wq->all_list was initialized.
      This results in a null pointer dereference.  The fix is to fully set up
      the io_wq before calling cpuhp_state_add_instance_nocalls().
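
      In schematic form (condensed, not the exact io_wq_create() code), the
      point is simply that everything the hotplug callback may touch has to
      be initialized before the instance is registered:

      /* initialize the lists and locks the callback walks ... */
      raw_spin_lock_init(&wq->lock);
      INIT_LIST_HEAD(&wq->all_list);
      INIT_LIST_HEAD(&wq->free_list);

      /* ... and only then register for CPU hotplug notifications */
      cpuhp_state_add_instance_nocalls(io_wq_online, &wq->cpuhp_node);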
      
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Link: https://lore.kernel.org/r/x49y1ghnecs.fsf@segfault.boston.devel.redhat.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/kbuf: Use slab for struct io_buffer objects · b3a4dbc8
      Gabriel Krisman Bertazi authored
      
      The allocation of struct io_buffer for metadata of provided buffers is
      done through a custom allocator that directly gets pages and
      fragments them.  But slab would do just fine, as this is not a hot path
      (in fact, it is a deprecated feature) and, by keeping a custom allocator
      implementation, we lose benefits like tracking, poisoning and
      sanitizers. Finally, the custom code is more complex and requires
      keeping the list of pages in struct ctx for no good reason.  This patch
      cleans this path up and just uses slab.
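
      A condensed sketch of what "just use slab" means here (illustrative
      only; the names follow the usual kmem_cache pattern rather than the
      exact patch):

      static struct kmem_cache *io_buf_cachep;

      static int io_buf_cache_init(void)
      {
              io_buf_cachep = kmem_cache_create("io_buffer",
                                                sizeof(struct io_buffer), 0,
                                                SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
                                                NULL);
              return io_buf_cachep ? 0 : -ENOMEM;
      }

      static struct io_buffer *io_buf_alloc(void)
      {
              /* replaces grabbing whole pages and fragmenting them by hand */
              return kmem_cache_alloc(io_buf_cachep, GFP_KERNEL_ACCOUNT);
      }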
      
      I microbenchmarked it by forcing the allocation of a large number of
      objects with the least number of io_uring commands possible (keeping
      nbufs=USHRT_MAX), with and without the patch.  There is a slight
      increase in time spent in the allocation with slab, of course, but even
      when allocating up to the point of system resource exhaustion, which is
      not very realistic and happened around 1/2 billion provided buffers for
      me, it wasn't a significant hit in system time.  Especially if we think of a
      real-world scenario, an application doing register/unregister of
      provided buffers will hit ctx->io_buffers_cache more often than actually
      going to slab.
      
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Link: https://lore.kernel.org/r/20231005000531.30800-4-krisman@suse.de
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/kbuf: Allow the full buffer id space for provided buffers · f74c746e
      Gabriel Krisman Bertazi authored
      
      nbufs tracks the number of buffers and not the last bgid. With 16 bits,
      we have 2^16 valid buffers, but the check mistakenly rejects the last
      bid. Let's fix it to make the interface consistent with the
      documentation.
      
      Fixes: ddf0322d ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Link: https://lore.kernel.org/r/20231005000531.30800-3-krisman@suse.de
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/kbuf: Fix check of BID wrapping in provided buffers · ab69838e
      Gabriel Krisman Bertazi authored
      
      Commit 3851d25c ("io_uring: check for rollover of buffer ID when
      providing buffers") introduced a check to prevent wrapping the BID
      counter when sqe->off is provided, but it is off by one and too
      restrictive, rejecting the last possible BID (65534).
      
      i.e., the following fails with -EINVAL.
      
           io_uring_prep_provide_buffers(sqe, addr, size, 0xFFFF, 0, 0);
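
      The arithmetic, schematically (an illustrative bounds check, not the
      literal kernel source): bid 0 with nbufs 0xFFFF ends at BID 65534,
      which is still inside the 16-bit ID space and must be accepted.

      /* sketch: only reject ranges that actually overflow the 2^16 IDs */
      if ((unsigned long) bid + nbufs > 65536)
              return -EINVAL;
      /* bid = 0, nbufs = 0xFFFF: 65535 <= 65536, accepted; a check of the
       * ">= USHRT_MAX" form wrongly rejected this case */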
      
      Fixes: 3851d25c ("io_uring: check for rollover of buffer ID when providing buffers")
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Link: https://lore.kernel.org/r/20231005000531.30800-2-krisman@suse.de
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. Oct 03, 2023
    • io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages · 223ef474
      Jens Axboe authored
      
      On at least arm32, but presumably any arch with highmem, if the
      application passes in memory that resides in highmem for the rings,
      then we should fail that ring creation. We fail it with -EINVAL, which
      is what kernels that don't support IORING_SETUP_NO_MMAP will do as well.
      
      Cc: stable@vger.kernel.org
      Fixes: 03d89a2d ("io_uring: support for user allocated memory for rings/sqes")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: ensure io_lockdep_assert_cq_locked() handles disabled rings · 1658633c
      Jens Axboe authored
      
      io_lockdep_assert_cq_locked() checks that locking is correctly done when
      a CQE is posted. If the ring is setup in a disabled state with
      IORING_SETUP_R_DISABLED, then ctx->submitter_task isn't assigned until
      the ring is later enabled. We generally don't post CQEs in this state,
      as no SQEs can be submitted. However it is possible to generate a CQE
      if tagged resources are being updated. If this happens and PROVE_LOCKING
      is enabled, then the locking check helper will dereference
      ctx->submitter_task, which hasn't been set yet.
      
      Fixup io_lockdep_assert_cq_locked() to handle this case correctly. While
      at it, convert it to a static inline as well, so that generated line
      offsets will actually reflect which condition failed, rather than just
      the line offset for io_lockdep_assert_cq_locked() itself.
      
      Reported-and-tested-by: <syzbot+efc45d4e7ba6ab4ef1eb@syzkaller.appspotmail.com>
      Fixes: f26cc959 ("io_uring: lockdep annotate CQ locking")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/kbuf: don't allow registered buffer rings on highmem pages · f8024f1f
      Jens Axboe authored
      syzbot reports that registering a mapped buffer ring on arm32 can
      trigger an OOPS. Registered buffer rings have two modes, one of them
      is the application passing in the memory that the buffer ring should
      reside in. Once those pages are mapped, we use page_address() to get
      a virtual address. This will obviously fail on highmem pages, which
      aren't mapped.
      
      Add a check if we have any highmem pages after mapping, and fail the
      attempt to register a provided buffer ring if we do. This will return
      the same error as kernels that don't support provided buffer rings to
      begin with.
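
      A sketch of that kind of check (illustrative, not the exact patch):
      after pinning the user pages, refuse any page that page_address()
      cannot reach. On 64-bit kernels there is no highmem, so the loop is
      effectively a no-op there.

      /* provided ring memory must be addressable via page_address() */
      for (i = 0; i < nr_pages; i++) {
              if (PageHighMem(pages[i]))
                      goto err_unpin;
      }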
      
      Link: https://lore.kernel.org/io-uring/000000000000af635c0606bcb889@google.com/
      
      
      Fixes: c56e022c ("io_uring: add support for user mapped provided buffer ring")
      Cc: stable@vger.kernel.org
      Reported-by: <syzbot+2113e61b8848fa7951d8@syzkaller.appspotmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/rsrc: cleanup io_pin_pages() · 922a2c78
      Jens Axboe authored
      
      This function is overly convoluted, with a goto error path and checks
      under the mmap_read_lock() that don't need to be there at all. Rearrange it
      a bit so the checks and errors fall out naturally, rather than needing
      to jump around for it.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. Sep 29, 2023
    • io_uring/fs: remove sqe->rw_flags checking from LINKAT · a52d4f65
      Jens Axboe authored
      
      This field is in a union with the actual link flags, so they can of
      course be set, and they will be evaluated further down. If we reject
      them here, we fail any LINKAT that has to set option flags.
      
      Fixes: cf30da90 ("io_uring: add support for IORING_OP_LINKAT")
      Cc: stable@vger.kernel.org
      Reported-by: Thomas Leonard <talex5@gmail.com>
      Link: https://github.com/axboe/liburing/issues/955
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for vectored futex waits · 8f350194
      Jens Axboe authored
      
      This adds support for IORING_OP_FUTEX_WAITV, which allows registering a
      notification for a number of futexes at once. If one of the futexes is
      woken, then the request will complete with the index of the futex that got
      woken as the result. This is identical to what the normal vectored futex
      waitv operation does.
      
      Use like IORING_OP_FUTEX_WAIT, except sqe->addr must now contain a
      pointer to a struct futex_waitv array, and sqe->off must now contain the
      number of elements in that array. As flags are passed in the futex_vector
      array, and likewise for the value and futex address(es), sqe->addr2
      and sqe->addr3 are also reserved for IORING_OP_FUTEX_WAITV.
      
      For cancelations, FUTEX_WAITV does not rely on the futex_unqueue()
      return value as we're dealing with multiple futexes. Instead, a separate
      per io_uring request atomic is used to claim ownership of the request.
      
      Waiting on N futexes could be done with IORING_OP_FUTEX_WAIT as well,
      but that punts a lot of the work to the application:
      
      1) Application would need to submit N IORING_OP_FUTEX_WAIT requests,
         rather than just a single IORING_OP_FUTEX_WAITV.
      
      2) When one futex is woken, application would need to cancel the
         remaining N-1 requests that didn't trigger.
      
      While this is of course doable, having a single vectored futex wait
      makes for much simpler application code.
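
      A rough userspace sketch of the SQE layout described above (raw SQE
      setup; no liburing prep helper is assumed, and the per-futex address,
      value and FUTEX2_* flags live in the futex_waitv entries):

      #include <linux/futex.h>
      #include <linux/io_uring.h>
      #include <string.h>

      static void prep_futex_waitv(struct io_uring_sqe *sqe,
                                   struct futex_waitv *waiters,
                                   unsigned int nr_waiters)
      {
              memset(sqe, 0, sizeof(*sqe));
              sqe->opcode = IORING_OP_FUTEX_WAITV;
              sqe->addr = (unsigned long) waiters;   /* futex_waitv array */
              sqe->off = nr_waiters;                 /* number of elements */
              /* addr2/addr3 stay zero: they are reserved for this opcode */
      }

      When one of the futexes in the array is woken, the posted CQE carries
      the index of that futex in cqe->res.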
      
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for futex wake and wait · 194bb58c
      Jens Axboe authored
      
      Add support for FUTEX_WAKE/WAIT primitives.
      
      IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
      it does support passing in a bitset.

      Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
      FUTEX_WAIT_BITSET.
      
      Both of them use the futex2 interface.
      
      FUTEX_WAKE is straightforward, as those can always be done directly from
      the io_uring submission without needing async handling. For FUTEX_WAIT,
      things are a bit more complicated. If the futex isn't ready, then we
      rely on a callback via futex_queue->wake() when someone wakes up the
      futex. From that callback, we queue up task_work with the original task,
      which will post a CQE and wake it, if necessary.
      
      Cancelations are supported, both from the application point of view and
      also to be able to cancel pending waits if the ring exits before all
      events have occurred. The return value of futex_unqueue() is used to
      gate who wins the potential race between cancelation and futex wakeups.
      Whoever gets a 'ret == 1' return from that claims ownership of the
      io_uring futex request.
      
      This is just the barebones wait/wake support. PI or REQUEUE support is
      not added at this point; it's unclear if we might look into that later.
      
      Likewise, explicit timeouts are not supported either. It is expected
      that users that need timeouts would do so via the usual io_uring
      mechanism to do that using linked timeouts.
      
      The SQE format is as follows:
      
      `addr`		Address of futex
      `fd`		futex2(2) FUTEX2_* flags
      `futex_flags`	io_uring specific command flags. None valid now.
      `addr2`		Value of futex
      `addr3`		Mask to wake/wait
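
      Putting that table into a raw-SQE sketch (illustrative; no liburing
      prep helper is assumed, and FUTEX2_SIZE_U32 / FUTEX_BITSET_MATCH_ANY
      are the usual futex2 constants for a plain 32-bit wait):

      #include <linux/futex.h>
      #include <linux/io_uring.h>
      #include <stdint.h>
      #include <string.h>

      static void prep_futex_wait(struct io_uring_sqe *sqe, uint32_t *futex,
                                  uint64_t expected_val, uint64_t mask)
      {
              memset(sqe, 0, sizeof(*sqe));
              sqe->opcode = IORING_OP_FUTEX_WAIT;
              sqe->addr = (unsigned long) futex;   /* address of futex */
              sqe->fd = FUTEX2_SIZE_U32;           /* futex2(2) FUTEX2_* flags */
              sqe->futex_flags = 0;                /* io_uring flags: none yet */
              sqe->addr2 = expected_val;           /* value of futex */
              sqe->addr3 = mask;                   /* mask to wake/wait */
      }

      A plain wait would pass FUTEX_BITSET_MATCH_ANY as the mask;
      IORING_OP_FUTEX_WAKE reuses the same field layout.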
      
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. Sep 21, 2023
    • io_uring: add IORING_OP_WAITID support · f31ecf67
      Jens Axboe authored
      
      This adds support for a fully async version of waitid(2). If an event
      isn't immediately available, wait for a callback
      to trigger a retry.
      
      The format of the sqe is as follows:
      
      sqe->len		The 'which', the idtype being queried/waited for.
      sqe->fd			The 'pid' (or id) being waited for.
      sqe->file_index		The 'options' being set.
      sqe->addr2		A pointer to siginfo_t, if any, being filled in.
      
      buf_index, addr3, and waitid_flags are reserved/unused for now.
      waitid_flags will be used for options for this request type. One
      interesting use case may be to add multi-shot support, so that the
      request stays armed and posts a notification every time a monitored
      process state change occurs.
      
      Note that this does not support rusage, on Arnd's recommendation.
      
      See the waitid(2) man page for details on the arguments.
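
      The same format as a raw-SQE sketch (illustrative; a liburing prep
      helper, if used, may differ):

      #include <linux/io_uring.h>
      #include <signal.h>
      #include <string.h>
      #include <sys/wait.h>

      static void prep_waitid(struct io_uring_sqe *sqe, idtype_t idtype,
                              id_t id, siginfo_t *infop, int options)
      {
              memset(sqe, 0, sizeof(*sqe));
              sqe->opcode = IORING_OP_WAITID;
              sqe->len = idtype;                   /* the 'which' / idtype */
              sqe->fd = id;                        /* the pid (or id) */
              sqe->file_index = options;           /* e.g. WEXITED */
              sqe->addr2 = (unsigned long) infop;  /* siginfo_t to fill in */
              /* buf_index, addr3 and waitid_flags stay zero (reserved) */
      }

      For example, prep_waitid(sqe, P_PID, child_pid, &si, WEXITED) arms an
      async equivalent of waitid(P_PID, child_pid, &si, WEXITED).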
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/rw: add support for IORING_OP_READ_MULTISHOT · fc68fcda
      Jens Axboe authored
      
      This behaves like IORING_OP_READ, except:
      
      1) It only supports pollable files (eg pipes, sockets, etc). Note that
         for sockets, you probably want to use recv/recvmsg with multishot
         instead.
      
      2) It supports multishot mode, meaning it will repeatedly trigger a
         read and fill a buffer when data is available. This allows similar
         use to recv/recvmsg but on non-sockets, where a single request will
         repeatedly post a CQE whenever data is read from it.
      
      3) Because of #2, it must be used with provided buffers. This is
         uniformly true across any request type that supports multishot and
         transfers data, with the reason being that it's obviously not
         possible to pass in a single buffer for the data, as multiple reads
         may very well trigger before an application has a chance to process
         previous CQEs and the data passed from them.
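
      A minimal usage sketch, assuming liburing's io_uring_prep_read_multishot()
      helper, an initialized ring, a pollable pipe_fd, and a provided-buffer
      group (bgid 0) that has already been registered:

      struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

      /* multishot read from a pollable file, picking buffers from group 0;
       * a length of 0 is assumed here to mean "use the selected buffer's
       * size" */
      io_uring_prep_read_multishot(sqe, pipe_fd, 0, 0, 0);
      io_uring_submit(&ring);

      Each time data is read, a CQE is posted with the byte count in cqe->res,
      the selected buffer indicated via IORING_CQE_F_BUFFER in cqe->flags, and
      IORING_CQE_F_MORE set for as long as the request stays armed.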
      
      Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/rw: mark readv/writev as vectored in the opcode definition · d2d778fb
      Jens Axboe authored
      
      This is cleaner than gating on the opcode type, particularly as more
      read/write type opcodes may be added.
      
      Then we can use that for the data import, and for __io_read() on
      whether or not we need to copy state.
      
      Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/rw: split io_read() into a helper · a08d195b
      Jens Axboe authored
      
      Add __io_read() which does the grunt of the work, leaving the completion
      side to the new io_read(). No functional changes in this patch.
      
      Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. Sep 07, 2023
    • Revert "io_uring: fix IO hang in io_wq_put_and_exit from do_exit()" · 023464fe
      Jens Axboe authored
      
      This reverts commit b484a40d.
      
      This commit cancels all requests with io-wq, not just the ones from the
      originating task. This breaks use cases that have thread pools, or just
      multiple tasks issuing requests on the same ring. The liburing
      regression test for this also shows that problem:
      
      $ test/thread-exit.t
      cqe->res=-125, Expected 512
      
      where an IO thread gets its request canceled rather than completed
      successfully.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix unprotected iopoll overflow · 27122c07
      Pavel Begunkov authored
      
      [   71.490669] WARNING: CPU: 3 PID: 17070 at io_uring/io_uring.c:769
      io_cqring_event_overflow+0x47b/0x6b0
      [   71.498381] Call Trace:
      [   71.498590]  <TASK>
      [   71.501858]  io_req_cqe_overflow+0x105/0x1e0
      [   71.502194]  __io_submit_flush_completions+0x9f9/0x1090
      [   71.503537]  io_submit_sqes+0xebd/0x1f00
      [   71.503879]  __do_sys_io_uring_enter+0x8c5/0x2380
      [   71.507360]  do_syscall_64+0x39/0x80
      
      We decoupled CQ locking from ->task_complete but haven't fixed up places
      forcing locking for CQ overflows.
      
      Fixes: ec26c225 ("io_uring: merge iopoll and normal completion paths")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: break out of iowq iopoll on teardown · 45500dc4
      Pavel Begunkov authored
      
      io-wq will retry iopoll even when it failed with -EAGAIN. If that
      races with task exit, which sets TIF_NOTIFY_SIGNAL for all its workers,
      such workers might potentially spin forever retrying iopoll, each time
      failing on some allocation / waiting / etc. Don't
      keep spinning if io-wq is dying.
      
      Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>