  1. Jul 10, 2019
    • block: Disable write plugging for zoned block devices · b49773e7
      Damien Le Moal authored
      
      Simultaneously writing to a sequential zone of a zoned block device
      from multiple contexts requires mutual exclusion for BIO issuing to
      ensure that writes happen sequentially. However, even for a
      well-behaved user correctly implementing such synchronization, BIO
      plugging may interfere and result in BIOs from the different contexts
      being reordered if plugging is done outside of the mutual exclusion
      section, e.g. if the plug was started by a function higher in the call
      chain than the function issuing BIOs.
      
               Context A                     Context B
      
         | blk_start_plug()
         | ...
         | seq_write_zone()
           | mutex_lock(zone)
           | bio-0->bi_iter.bi_sector = zone->wp
           | zone->wp += bio_sectors(bio-0)
           | submit_bio(bio-0)
           | bio-1->bi_iter.bi_sector = zone->wp
           | zone->wp += bio_sectors(bio-1)
           | submit_bio(bio-1)
           | mutex_unlock(zone)
           | return
   | -----------------------> | seq_write_zone()
                              | mutex_lock(zone)
                              | bio-2->bi_iter.bi_sector = zone->wp
                              | zone->wp += bio_sectors(bio-2)
                              | submit_bio(bio-2)
                              | mutex_unlock(zone)
   | <------------------------ |
         | blk_finish_plug()
      
      In the above example, despite the mutex synchronization ensuring the
      correct BIO issuing order 0, 1, 2, context A's BIOs 0 and 1 end up
      being issued after context B's BIO 2 when the plug is released with
      blk_finish_plug().
      
      While this problem can be addressed using the blk_flush_plug_list()
      function (in the above example, the call would have to be inserted
      before the zone mutex is released), a simple generic solution in the
      block layer avoids the need for such additional code in every zoned
      block device user. The solution implemented with this patch is to
      introduce the internal helper function blk_mq_plug() to access the
      current context plug on BIO submission. This helper returns the
      current plug only if the target device is not a zoned block device or
      if the BIO to be plugged is not a write operation. Otherwise, the
      caller context plug is ignored and NULL is returned, so that writes to
      zoned block devices are never plugged.
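
      A minimal sketch of such a helper, assuming only the behavior
      described above (the exact in-tree implementation may differ):

              static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
                                                         struct bio *bio)
              {
                      /*
                       * For a regular block device, or for a non-write BIO,
                       * use the plug set up by the caller, if any.
                       */
                      if (!blk_queue_is_zoned(q) || !op_is_write(bio_op(bio)))
                              return current->plug;

                      /* Write to a zoned device: never plug, to preserve order. */
                      return NULL;
              }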
      
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b49773e7
    • blk-throttle: fix zero wait time for iops throttled group · 3a10f999
      Konstantin Khlebnikov authored
      
      After commit 991f61fe ("Blk-throttle: reduce tail io latency when
      iops limit is enforced"), the wait time could be zero even if the
      group is throttled and cannot issue requests right now. As a result,
      throtl_select_dispatch() turns into a busy loop under the irq-safe
      queue spinlock.
      
      Fix is simple: always round up target time to the next throttle slice.
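
      A minimal illustration of the round-up, assuming a slice length in
      jiffies (names are illustrative, not the exact blk-throttle code):

              /* Round the elapsed time up to the next slice boundary so the
               * computed wait is never zero while the group is throttled. */
              static unsigned long next_slice_edge(unsigned long jiffy_elapsed,
                                                   unsigned long throtl_slice)
              {
                      return roundup(jiffy_elapsed + 1, throtl_slice);
              }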
      
      Fixes: 991f61fe ("Blk-throttle: reduce tail io latency when iops limit is enforced")
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3a10f999
    • block: Fix potential overflow in blk_report_zones() · 113ab72e
      Damien Le Moal authored
      
      For large values of the number of zones reported and/or large zone
      sizes, the sector increment calculated with

      blk_queue_zone_sectors(q) * n

      in the blk_report_zones() loop can overflow the unsigned int type used
      for the calculation, as both "n" and the value returned by
      blk_queue_zone_sectors() are unsigned int. E.g. for a device with
      256 MB zones (524288 sectors), the overflow happens with 8192 or more
      zones reported.
      
      Changing the return type of blk_queue_zone_sectors() to sector_t fixes
      this problem and avoids the overflow for all other callers of this
      helper too. The same change is also applied to the bdev_zone_sectors()
      helper.
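
      An illustrative sketch of the difference (not the exact
      blk_report_zones() code):

              unsigned int zone_sectors = 524288;  /* 256 MB zones */
              unsigned int n = 8192;

              /* Buggy: the 32-bit multiply wraps before the assignment,
               * 524288 * 8192 = 2^32, which truncates to 0. */
              sector_t bad = zone_sectors * n;

              /* Fixed: with blk_queue_zone_sectors() returning sector_t,
               * the multiplication is done in 64 bits. */
              sector_t good = (sector_t)zone_sectors * n;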
      
      Fixes: e76239a3 ("block: add a report_zones method")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      113ab72e
    • blkcg: implement REQ_CGROUP_PUNT · d3f77dfd
      Tejun Heo authored
      
      When a shared kthread needs to issue a bio for a cgroup, doing so
      synchronously can lead to priority inversions as the kthread can be
      trapped waiting for that cgroup.  This patch implements the
      REQ_CGROUP_PUNT flag, which makes submit_bio() punt the actual issuing
      to a dedicated per-blkcg work item to avoid such priority inversions.
      
      This will be used to fix priority inversions in btrfs compression and
      should be generally useful as we grow filesystem support for
      comprehensive IO control.
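
      A hedged usage sketch from the point of view of a filesystem issuing a
      cgroup-owned bio (assuming the flag introduced here):

              /* Instead of issuing synchronously from a shared kthread, ask
               * the block layer to punt the actual submission to the bio's
               * per-blkcg work item. */
              bio->bi_opf |= REQ_CGROUP_PUNT;
              submit_bio(bio);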
      
      Cc: Chris Mason <clm@fb.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d3f77dfd
    • cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages · 9b0eb69b
      Tejun Heo authored
      
      btrfs is going to use css_put() and wbc helpers to improve cgroup
      writeback support.  Add dummy css_get() definition and export wbc
      helpers to prepare for module and !CONFIG_CGROUP builds.
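
      For reference, the dummy-definition pattern for !CONFIG_CGROUPS builds
      looks roughly like this (illustrative sketch):

              #ifndef CONFIG_CGROUPS
              /* No-op stub so callers compile unchanged with cgroups disabled. */
              static inline void css_get(struct cgroup_subsys_state *css) { }
              #endif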
      
      Reported-by: kbuild test robot <lkp@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9b0eb69b
    • blk-cgroup: turn on psi memstall stuff · fd112c74
      Josef Bacik authored
      
      With the psi stuff in place we can use the memstall flag to indicate
      pressure that happens from throttling.
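
      A minimal sketch of the idea, wrapping a throttling sleep with the
      existing psi helpers (placement here is illustrative):

              unsigned long pflags;

              psi_memstall_enter(&pflags);
              /* ... sleep while the task is throttled ... */
              io_schedule();
              psi_memstall_leave(&pflags);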
      
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fd112c74
    • block: init flush rq ref count to 1 · b554db14
      Josef Bacik authored
      
      We discovered a problem in newer kernels where a disconnect of an NBD
      device while the flush request was pending would result in a hang. This
      is because the blk-mq timeout handler does
      
              if (!refcount_inc_not_zero(&rq->ref))
                      return true;
      
      to determine if it's ok to run the timeout handler for the request.
      Flush requests don't have their ref count set, so we'd skip running the
      timeout handler for them and the request would just sit there in limbo
      forever.
      
      Fix this by always setting the refcount of any request going through
      blk_rq_init() to 1.  I tested this with an nbd-server that dropped
      flush requests to verify that it hung, and then tested with this patch
      to verify I got the timeout as expected and the error handling kicked
      in.
      Thanks,
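
      For reference, the change boils down to giving the request a live
      reference at init time, roughly:

              /* Start every request with one reference so that
               * refcount_inc_not_zero() in the timeout handler can succeed. */
              refcount_set(&rq->ref, 1);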
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b554db14
  2. Jul 05, 2019
    • blk-iolatency: fix STS_AGAIN handling · c9b3007f
      Dennis Zhou authored
      
      The iolatency controller is based on rq_qos. It increments on
      rq_qos_throttle() and decrements on either rq_qos_cleanup() or
      rq_qos_done_bio(). Commit a3fb01ba fixes the double accounting issue
      where blk_mq_make_request() may call both rq_qos_cleanup() and
      rq_qos_done_bio() on REQ_NOWAIT. So checking STS_AGAIN prevents the
      double decrement.
      
      The above works upstream, as the only way we can get STS_AGAIN is from
      blk_mq_get_request() failing. The STS_AGAIN handling isn't a real
      problem, as skipping bio_endio() only happens on reserved tag
      allocation failures, which can only be caused by driver bugs and
      already trigger a WARN.
      
      However, the fix creates a not so great dependency on how STS_AGAIN
      can be propagated. Internally, we (Facebook) carry a patch that kills
      readahead if a cgroup is io congested or a fatal signal is pending.
      Combined with chained bios propagating their bi_status to the parent
      if it is not already set, this can cause the parent bio to not clean
      up properly even though it was successful. This consequently leaks the
      inflight counter and can hang all IOs under that blkg.
      
      To nip the adverse interaction early, this removes the rq_qos_cleanup()
      callback in iolatency in favor of cleaning up always on the
      rq_qos_done_bio() path.
      
      Fixes: a3fb01ba ("blk-iolatency: only account submitted bios")
      Debugged-by: Tejun Heo <tj@kernel.org>
      Debugged-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c9b3007f
  3. Jul 01, 2019
    • block: fix .bi_size overflow · 79d08f89
      Ming Lei authored
      
      'bio->bi_iter.bi_size' is 'unsigned int', which can hold at most
      4G - 1 bytes.

      Before 07173c3e ("block: enable multipage bvecs"), one bio could
      include only a limited number of pages, usually at most 256, so the fs
      bio size wouldn't be bigger than 1M bytes most of the time.

      Since we support multi-page bvecs, in theory more than 1M pages can
      really be added to one fs bio, especially in the case of hugepages or
      big writeback with too many dirty pages. Then there is a chance that
      .bi_size overflows.

      Fix this issue by using bio_full() to check whether the added segment
      may overflow .bi_size.
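
      A hedged sketch of the size part of such a check (field names as in
      struct bio; simplified):

              /* Refuse to add a segment when it would push bi_size past
               * UINT_MAX. */
              static inline bool bio_size_would_overflow(struct bio *bio,
                                                         unsigned int len)
              {
                      return bio->bi_iter.bi_size > UINT_MAX - len;
              }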
      
      Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com>
      Cc: kernel test robot <rong.a.chen@intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: linux-xfs@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 07173c3e ("block: enable multipage bvecs")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      79d08f89
  4. Jun 28, 2019
    • block, bfq: NULL out the bic when it's no longer valid · dbc3117d
      Douglas Anderson authored
      
      In reboot tests on several devices we were seeing a "use after free"
      when slub_debug or KASAN was enabled.  The kernel complained about:
      
        Unable to handle kernel paging request at virtual address 6b6b6c2b
      
      ...which is a classic sign of use after free under slub_debug.  The
      stack crawl in kgdb looked like:
      
       0  test_bit (addr=<optimized out>, nr=<optimized out>)
       1  bfq_bfqq_busy (bfqq=<optimized out>)
       2  bfq_select_queue (bfqd=<optimized out>)
       3  __bfq_dispatch_request (hctx=<optimized out>)
       4  bfq_dispatch_request (hctx=<optimized out>)
       5  0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
       6  0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
       7  0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
       8  0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>)
       9  0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
       10 0xc024cff4 in worker_thread (__worker=0xec6d4640)
      
      Digging in kgdb, it could be found that, though bfqq looked fine,
      bfqq->bic had been freed.
      
      Through further digging, I postulated that perhaps it is illegal to
      access a "bic" (AKA an "icq") after bfq_exit_icq() had been called
      because the "bic" can be freed at some point in time after this call
      is made.  I confirmed that there certainly were cases where the exact
      crashing code path would access the "bic" after bfq_exit_icq() had
      been called.  Specifically, I set the "bfqq->bic" to (void *)0x7 and
      saw that the bic was 0x7 at the time of the crash.
      
      To understand a bit more about why this crash was fairly uncommon (I
      saw it only once in a few hundred reboots), you can see that much of
      the time bfq_exit_icq_bfqq() fully frees the bfqq and thus it can't
      access the ->bic anymore.  The only case it doesn't is if
      bfq_put_queue() sees a reference still held.
      
      However, even in the case when bfqq isn't freed, the crash is still
      rare.  Why?  I tracked what happened to the "bic" after the exit
      routine.  It doesn't get freed right away.  Rather,
      put_io_context_active() eventually called put_io_context() which
      queued up freeing on a workqueue.  The freeing then actually happened
      later than that through call_rcu().  Despite all these delays, some
      extra debugging showed that all the hoops could be jumped through in
      time and the memory could be freed causing the original crash.  Phew!
      
      To make a long story short, assuming it truly is illegal to access an
      icq after the "exit_icq" callback is finished, this patch is needed.
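
      The fix named in the subject amounts to clearing the cached pointer
      when the icq exits, roughly:

              /* In bfq's exit_icq path: drop the queue's reference to the
               * bic so later scheduler code cannot dereference freed memory. */
              bfqq->bic = NULL;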
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dbc3117d
  5. Jun 25, 2019
    • block, bfq: fix operator in BFQQ_TOTALLY_SEEKY · e6feaf21
      Paolo Valente authored
      
      By mistake, there is a '&' instead of a '==' in the definition of the
      macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with
      the correct one.
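
      For reference, the corrected macro shape (assuming the seek_history
      field used by bfq):

              /* Totally seeky only when every bit of the history says "seeky". */
              #define BFQQ_TOTALLY_SEEKY(bfqq)   ((bfqq)->seek_history == -1)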
      
      Fixes: 7074f076 ("block, bfq: do not tag totally seeky queues as soft rt")
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e6feaf21
    • block, bfq: re-schedule empty queues if they deserve I/O plugging · 3726112e
      Paolo Valente authored
      
      Consider, on one side, a bfq_queue Q that remains empty while in
      service, and, on the other side, the pending I/O of bfq_queues that,
      according to their timestamps, have to be served after Q.  If an
      uncontrolled amount of I/O from the latter bfq_queues were dispatched
      while Q is waiting for its new I/O to arrive, then Q's bandwidth
      guarantees would be violated. To prevent this, I/O dispatch is plugged
      until Q receives new I/O (except for a properly controlled amount of
      injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
      for the following reason.
      
      Preemption is performed in two steps. First, Q is expired and
      re-scheduled. Second, the new bfq_queue to serve is chosen. The first
      step is needed by the second, as the second can be performed only
      after Q's timestamps have been properly updated (done in the
      expiration step), and Q has been re-queued for service. This
      dependency is a consequence of the way BFQ's scheduling algorithm is
      currently implemented.
      
      But Q is not re-scheduled at all in the first step, because Q is
      empty. As a consequence, an uncontrolled amount of I/O may be
      dispatched until Q becomes non-empty again. This breaks Q's service
      guarantees.

      This commit addresses this issue by re-scheduling Q even if it is
      empty. This in turn breaks the assumption that all scheduled queues
      are non-empty, so a few extra checks are now needed.
      
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3726112e
    • block, bfq: preempt lower-weight or lower-priority queues · 96a291c3
      Paolo Valente authored
      
      BFQ enqueues the I/O coming from each process into a separate
      bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
      served for at most timeout_sync milliseconds (default: 125 ms). This
      service scheme is prone to the following inaccuracy.
      
      While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
      receive I/O, and, according to BFQ's scheduling policy, may become the
      right bfq_queue to serve, in place of the currently in-service
      bfq_queue. In this respect, postponing the service of Q2 to after the
      service of Q1 finishes may delay the completion of Q2's I/O, compared
      with an ideal service in which all non-empty bfq_queues are served in
      parallel, and every non-empty bfq_queue is served at a rate
      proportional to the bfq_queue's weight. This additional delay is at
      most equal to the time Q1 may unjustly remain in service before
      switching to Q2.
      
      If Q1 and Q2 have the same weight, then this time is most likely
      negligible compared with the completion time to be guaranteed to Q2's
      I/O. In addition, first, one of the reasons why BFQ may want to serve
      Q1 for a while is that this boosts throughput and, second, serving Q1
      longer reduces BFQ's overhead. As a conclusion, it is usually better
      not to preempt Q1 if both Q1 and Q2 have the same weight.
      
      In contrast, as Q2's weight or priority becomes higher and higher
      compared with that of Q1, the above delay becomes larger and larger,
      compared with the I/O completion times that have to be guaranteed to
      Q2 according to Q2's weight. So reducing this delay may be more
      important than avoiding the costs of preempting Q1.
      
      Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
      higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
      triggers a new choice of the next bfq_queue to serve. If Q2 really is
      the next bfq_queue to serve, then Q2 will be set in service
      immediately.
      
      This change reduces the component of the I/O latency caused by the
      above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
      SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
      a process doing sporadic random reads while another process is doing
      continuous sequential reads.
      
      Signed-off-by: Nicola Bottura <bottura.nicola95@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      96a291c3
    • block, bfq: detect wakers and unconditionally inject their I/O · 13a857a4
      Paolo Valente authored
      
      A bfq_queue Q may happen to be synchronized with another
      bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
      receive new I/O. We call Q2 "waker queue".
      
      If I/O plugging is being performed for Q, and Q is not receiving any
      more I/O because of the above synchronization, then, thanks to BFQ's
      injection mechanism, the waker queue is likely to get served before
      the I/O-plugging timeout fires.
      
      Unfortunately, this fact may not be sufficient to guarantee a high
      throughput during the I/O plugging, because the inject limit for Q may
      be too low to guarantee a lot of injected I/O. In addition, the
      duration of the plugging, i.e., the time before Q finally receives new
      I/O, may not be minimized, because the waker queue may happen to be
      served only after other queues.
      
      To address these issues, this commit introduces the explicit detection
      of the waker queue, and the unconditional injection of a pending I/O
      request of the waker queue on each invocation of
      bfq_dispatch_request().
      
      One may be concerned that this systematic injection of I/O from the
      waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
      the contrary, Q's next I/O is brought forward dramatically, because it
      is no longer blocked for milliseconds.
      
      Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      13a857a4
    • block, bfq: bring forward seek&think time update · a3f9bce3
      Paolo Valente authored
      
      Until the base value for request service times gets finally computed
      for a bfq_queue, the inject limit for that queue does depend on the
      think-time state (short|long) of the queue. A timely update of the
      think time then guarantees a quicker activation or deactivation of the
      injection. Fortunately, the think time of a bfq_queue is updated in
      the same code path as the inject limit, but after it.

      This commit moves the update of the think time before the update of
      the inject limit. For coherence, it moves the update of the seek time
      too.
      
      Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a3f9bce3
    • block, bfq: update base request service times when possible · 24792ad0
      Paolo Valente authored
      
      I/O injection gets reduced if it increases the request service times
      of the victim queue beyond a certain threshold.  The threshold, in its
      turn, is computed as a function of the base service time enjoyed by
      the queue when it undergoes no injection.
      
      As a consequence, for injection to work properly, the above base value
      has to be accurate. In this respect, such a value may vary over
      time. For example, it varies if the size or the spatial locality of
      the I/O requests in the queue changes. It is then important to update
      this value whenever possible. This commit performs this update.
      
      Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      24792ad0
    • block, bfq: fix rq_in_driver check in bfq_update_inject_limit · db599f9e
      Paolo Valente authored
      
      One of the cases where the parameters for injection may be updated is
      when there are no more in-flight I/O requests. The number of in-flight
      requests is stored in the field bfqd->rq_in_driver of the descriptor
      bfqd of the device. So, the controlled condition is
      bfqd->rq_in_driver == 0.
      
      Unfortunately, this is wrong because the instruction that checks this
      condition is in the code path that handles the completion of a
      request, and in particular it is executed before bfqd->rq_in_driver is
      decremented in that code path.
      
      This commit fixes this issue by just replacing 0 with 1 in the
      comparison.
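
      Illustratively, with the check running in the completion path before
      bfqd->rq_in_driver is decremented for the completing request, "no
      other request in flight" corresponds to:

              bool no_other_rq_in_flight = bfqd->rq_in_driver == 1;  /* was == 0 */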
      
      Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      db599f9e