  Sep 26, 2023
    • badblocks: improve badblocks_set() for multiple ranges handling · 1726c774
      Coly Li authored
      
      Recently I received a bug report that the current badblocks code does
      not properly handle multiple ranges. For example,
              badblocks_set(bb, 32, 1, true);
              badblocks_set(bb, 34, 1, true);
              badblocks_set(bb, 36, 1, true);
              badblocks_set(bb, 32, 12, true);
      Then indeed badblocks_show() reports,
              32 3
              36 1
      But the expected bad blocks table should be,
              32 12
      Obviously, only the first two ranges are merged; badblocks_set()
      returns early and ignores the rest of the setting range.
      
      This behavior is improper: if the caller of badblocks_set() wants to
      set a range of blocks in the bad blocks table, all of the blocks in
      the range should be handled even if an earlier part of the range
      encounters a failure.
      
      The desired way for badblocks_set() to set a bad blocks range is,
      - Set as many blocks of the setting range as possible in the bad
        blocks table.
      - Merge the bad blocks ranges so that they occupy as few slots as
        possible in the bad blocks table.
      - Be fast.
      A minimal model of these semantics is sketched right below.
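      For illustration only, here is a compilable userspace model of that
      behavior (the table layout, names and merging loop are my own
      simplifications, not kernel code; the acknowledged flag and the
      table-size limit are ignored for brevity):

        #include <stdio.h>

        /* toy bad blocks table: sorted, non-overlapping (start, len) ranges */
        struct range { unsigned long long s; int len; };

        static struct range tbl[16];
        static int nr;

        /* model of the desired badblocks_set() semantics: absorb every
         * existing range the setting range touches, then insert the merged
         * result while keeping the table sorted */
        static void model_set(unsigned long long s, int len)
        {
                unsigned long long e = s + len;
                int i = 0, j;

                while (i < nr) {
                        if (tbl[i].s + tbl[i].len >= s && tbl[i].s <= e) {
                                if (tbl[i].s < s)
                                        s = tbl[i].s;
                                if (tbl[i].s + tbl[i].len > e)
                                        e = tbl[i].s + tbl[i].len;
                                for (j = i; j < nr - 1; j++)
                                        tbl[j] = tbl[j + 1];
                                nr--;
                        } else {
                                i++;
                        }
                }
                for (i = 0; i < nr && tbl[i].s < s; i++)
                        ;
                for (j = nr; j > i; j--)
                        tbl[j] = tbl[j - 1];
                tbl[i].s = s;
                tbl[i].len = (int)(e - s);
                nr++;
        }

        int main(void)
        {
                model_set(32, 1);
                model_set(34, 1);
                model_set(36, 1);
                model_set(32, 12);
                for (int i = 0; i < nr; i++)
                        printf("%llu %d\n", tbl[i].s, tbl[i].len);
                return 0;
        }

      With the four calls from the bug report, this model prints the single
      expected range "32 12".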
      
      Indeed the above proposal is complicated, especially with the
      following restrictions,
      - The setting bad blocks range can be acknowledged or unacknowledged.
      - The bad blocks table size is limited.
      - Memory allocation should be avoided.
      
      The basic idea of the patch is to categorize all possible bad blocks
      range setting combinations into a much smaller set of simplified
      special conditions. Inside badblocks_set() there is an implicit loop
      formed by jumping between the labels 're_insert' and 'update_sectors'.
      No matter how large the setting bad blocks range is, each iteration
      handles only a minimal range at its head with the pre-defined behavior
      of one of the categorized conditions. The logic is simple and the code
      flow is manageable; a sketch of the control-flow shape follows.
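      As a compilable illustration of just that loop shape (userspace C;
      handle_one_condition() is a stand-in of my own, not a kernel
      function):

        #include <stdio.h>

        /* stand-in for one categorized condition: consume up to 8 sectors
         * from the head of the remaining setting range */
        static int handle_one_condition(unsigned long long s, int sectors)
        {
                return sectors > 8 ? 8 : sectors;
        }

        static void model_badblocks_set(unsigned long long s, int sectors)
        {
                int len;

        re_insert:
                len = handle_one_condition(s, sectors);
                goto update_sectors;    /* every condition branch ends here */

        update_sectors:
                s += len;               /* advance past the handled head */
                sectors -= len;
                if (sectors > 0)
                        goto re_insert; /* handle the next head range */
                printf("all sectors handled, next sector %llu\n", s);
        }

        int main(void)
        {
                model_badblocks_set(32, 12);
                return 0;
        }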
      
      The different relative layouts between the setting range and the
      existing bad block ranges are checked and handled (merge, combine,
      overwrite, insert) by the helpers from the previous patch. This patch
      makes all those helpers work together following the above idea.
      
      This patch only contains the algorithm improvement for
      badblocks_set(). Following patches contain the improvements for
      badblocks_clear() and badblocks_check(). But the algorithm in
      badblocks_set() is fundamental and typical; the improvements in the
      clear and check routines are based on the helpers and ideas of this
      patch.
      
      In order to make the change clearer for code review, this patch does
      not directly modify the existing badblocks_set(); it just adds a new
      routine named _badblocks_set(). A later patch will remove the current
      badblocks_set() code and turn it into a wrapper of _badblocks_set(),
      so the newly added code is not mixed with deleted code and the review
      is easier. The eventual wrapper arrangement is sketched right below.
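      A hedged sketch of that wrapper (the signature is assumed to match
      include/linux/badblocks.h of the time; an illustration, not the patch
      itself):

        /* badblocks_set() reduced to a thin wrapper once the later patch
         * lands; _badblocks_set() carries the new algorithm */
        int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
                          int acknowledged)
        {
                return _badblocks_set(bb, s, sectors, acknowledged);
        }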
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Geliang Tang <geliang.tang@suse.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Cc: Wols Lists <antlists@youngman.org.uk>
      Cc: Xiao Ni <xni@redhat.com>
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Acked-by: Geliang Tang <geliang.tang@suse.com>
      Link: https://lore.kernel.org/r/20230811170513.2300-4-colyli@suse.de
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • badblocks: add helper routines for badblock ranges handling · c3c6a86e
      Coly Li authored
      
      This patch adds several helper routines to improve badblock ranges
      handling. These helper routines will be used later in the improved
      version of badblocks_set()/badblocks_clear()/badblocks_check().
      
      - Helpers prev_by_hint() and prev_badblocks() are used to find the
        bad block range in the bad table which the searching range starts
        at or after.
      
      - The following helpers decide the relative layout between the
        manipulating range and an existing bad block range from the bad
        table.
        - can_merge_behind()
          Return 'true' if the manipulating range can backward merge with the
          bad block range.
        - can_merge_front()
          Return 'true' if the manipulating range can forward merge with the
          bad block range.
        - can_combine_front()
          Return 'true' if two adjacent bad block ranges before the
          manipulating range can be merged.
        - overlap_front()
          Return 'true' if the manipulating range exactly overlaps with the
          bad block range in front of its range.
        - overlap_behind()
          Return 'true' if the manipulating range exactly overlaps with the
          bad block range behind its range.
        - can_front_overwrite()
          Return 'true' if the manipulating range can forward overwrite the
          bad block range in front of its range.
      
      - The following helpers add the manipulating range into the bad block
        table. A different routine is called for each specific relative
        layout between the manipulating range and the other bad block
        ranges in the bad block table.
        - behind_merge()
          Merge the manipulating range with the bad block range behind its
          range, and return the merged length in sectors.
        - front_merge()
          Merge the manipulating range with the bad block range in front of
          its range, and return the merged length in sectors.
        - front_combine()
          Combine the two adjacent bad block ranges before the manipulating
          range into a larger one.
        - front_overwrite()
          Overwrite part or all of the bad block range which is in front of
          the manipulating range. The overwrite may split the existing bad
          block range and generate more bad block ranges in the bad block
          table.
        - insert_at()
          Insert the manipulating range at a specific location in the bad
          block table.
      
      All the above helpers are used in later patches to improve the bad
      block ranges handling for badblocks_set()/badblocks_clear()/
      badblocks_check(). A simplified model of a couple of them is sketched
      right below.
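      For illustration, a rough userspace model of two of these helpers
      (the struct is my own simplification; the real bad table packs the
      start, length and ack state of each range into a single u64 entry):

        #include <stdbool.h>
        #include <string.h>

        /* simplified stand-in for one bad table entry */
        struct bb_range {
                unsigned long long start;
                int len;        /* in sectors */
                bool ack;
        };

        /* model of can_merge_front(): the setting range starts inside or
         * immediately behind the front range, with the same ack state */
        static bool model_can_merge_front(const struct bb_range *front,
                                          unsigned long long s, bool ack)
        {
                return front->ack == ack && s <= front->start + front->len;
        }

        /* model of insert_at(): open slot 'at' in a table with 'nr' used
         * slots and place the new range there, keeping the table sorted */
        static void model_insert_at(struct bb_range *tbl, int nr, int at,
                                    unsigned long long s, int len, bool ack)
        {
                memmove(&tbl[at + 1], &tbl[at],
                        (nr - at) * sizeof(tbl[0]));
                tbl[at].start = s;
                tbl[at].len = len;
                tbl[at].ack = ack;
        }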
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Geliang Tang <geliang.tang@suse.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Cc: Xiao Ni <xni@redhat.com>
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Acked-by: Geliang Tang <geliang.tang@suse.com>
      Link: https://lore.kernel.org/r/20230811170513.2300-3-colyli@suse.de
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  Aug 18, 2023
    • blk-crypto: dynamically allocate fallback profile · c984ff14
      Sweet Tea Dorminy authored
      
      blk_crypto_profile_init() calls lockdep_register_key(), which warns and
      does not register if the provided memory is a static object.
      blk-crypto-fallback currently has a static blk_crypto_profile and calls
      blk_crypto_profile_init() thereupon, resulting in the warning and
      failure to register.
      
      Fortunately it is simple enough to use a dynamically allocated
      profile and make lockdep function correctly; the shape of the fix is
      sketched right below.
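      A hedged sketch of that shape (kernel-style C; error handling is
      trimmed, and blk_crypto_num_keyslots stands for however the fallback
      sizes its keyslot count):

        /* allocate the fallback profile dynamically so that
         * lockdep_register_key() sees a dynamic object and registers it */
        static struct blk_crypto_profile *blk_crypto_fallback_profile;

        static int blk_crypto_fallback_alloc_profile(void)
        {
                blk_crypto_fallback_profile =
                        kzalloc(sizeof(*blk_crypto_fallback_profile),
                                GFP_KERNEL);
                if (!blk_crypto_fallback_profile)
                        return -ENOMEM;

                /* init now operates on heap memory; lockdep is happy */
                return blk_crypto_profile_init(blk_crypto_fallback_profile,
                                               blk_crypto_num_keyslots);
        }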
      
      Fixes: 2fb48d88 ("blk-crypto: use dynamic lock class for blk_crypto_profile::lock")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20230817141615.15387-1-sweettea-kernel@dorminy.me
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-cgroup: hold queue_lock when removing blkg->q_node · c164c7bc
      Ming Lei authored
      
      When a blkg is removed from q->blkg_list in blkg_free_workfn(),
      queue_lock has to be held; otherwise all kinds of bugs (list
      corruption, hard lockup, ...) can be triggered from
      blkg_destroy_all(). The locking shape is sketched right below.
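      A hedged sketch of the locking shape (kernel-style C; field names are
      assumed from block/blk-cgroup.c of the time, and the rest of the
      teardown is elided):

        static void blkg_free_workfn(struct work_struct *work)
        {
                struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
                                                     free_work);
                struct request_queue *q = blkg->q;

                /* blkg_destroy_all() walks q->blkg_list under queue_lock,
                 * so unlinking q_node must hold the same lock */
                if (q) {
                        spin_lock_irq(&q->queue_lock);
                        list_del_init(&blkg->q_node);
                        spin_unlock_irq(&q->queue_lock);
                }
                /* ... remaining per-policy and blkcg teardown ... */
        }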
      
      Fixes: f1c006f1 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
      Cc: Yu Kuai <yukuai3@huawei.com>
      Cc: xiaoli feng <xifeng@redhat.com>
      Cc: Chunyu Hu <chuhu@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20230817141751.1128970-1-ming.lei@redhat.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init · ec14a87e
      Tejun Heo authored
      
      blk-iocost sometimes causes the following crash:
      
        BUG: kernel NULL pointer dereference, address: 00000000000000e0
        ...
        RIP: 0010:_raw_spin_lock+0x17/0x30
        Code: be 01 02 00 00 e8 79 38 39 ff 31 d2 89 d0 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 65 ff 05 48 d0 34 7e b9 01 00 00 00 31 c0 <f0> 0f b1 0f 75 02 5d c3 89 c6 e8 ea 04 00 00 5d c3 0f 1f 84 00 00
        RSP: 0018:ffffc900023b3d40 EFLAGS: 00010046
        RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 0000000000000001
        RDX: ffffc900023b3d20 RSI: ffffc900023b3cf0 RDI: 00000000000000e0
        RBP: ffffc900023b3d40 R08: ffffc900023b3c10 R09: 0000000000000003
        R10: 0000000000000064 R11: 000000000000000a R12: ffff888102337000
        R13: fffffffffffffff2 R14: ffff88810af408c8 R15: ffff8881070c3600
        FS:  00007faaaf364fc0(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000e0 CR3: 00000001097b1000 CR4: 0000000000350ea0
        Call Trace:
         <TASK>
         ioc_weight_write+0x13d/0x410
         cgroup_file_write+0x7a/0x130
         kernfs_fop_write_iter+0xf5/0x170
         vfs_write+0x298/0x370
         ksys_write+0x5f/0xb0
         __x64_sys_write+0x1b/0x20
         do_syscall_64+0x3d/0x80
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This happens because iocg->ioc is NULL. The field is initialized by
      ioc_pd_init() and never cleared. The NULL deref is caused by
      blkcg_activate_policy() installing blkg_policy_data before initializing it.
      
      blkcg_activate_policy() was doing the following:
      
      1. Allocate pd's for all existing blkg's and install them in blkg->pd[].
      2. Initialize all pd's.
      3. Online all pd's.
      
      blkcg_activate_policy() only grabs the queue_lock, and it may release
      and re-acquire the lock because allocation may need to sleep.
      ioc_weight_write() grabs blkcg->lock and iterates all of the blkcg's
      blkgs. The two can race, and if ioc_weight_write() runs during #1 or
      between #1 and #2, it can encounter a pd which is not initialized
      yet, leading to the crash.
      
      The crash can be reproduced with the following script:
      
        #!/bin/bash
      
        echo +io > /sys/fs/cgroup/cgroup.subtree_control
        systemd-run --unit touch-sda --scope dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct
        echo 100 > /sys/fs/cgroup/system.slice/io.weight
        bash -c "echo '8:0 enable=1' > /sys/fs/cgroup/io.cost.qos" &
        sleep .2
        echo 100 > /sys/fs/cgroup/system.slice/io.weight
      
      with the following patch applied:
      
      > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
      > index fc49be622e05..38d671d5e10c 100644
      > --- a/block/blk-cgroup.c
      > +++ b/block/blk-cgroup.c
      > @@ -1553,6 +1553,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
      > 		pd->online = false;
      > 	}
      >
      > +       if (system_state == SYSTEM_RUNNING) {
      > +               spin_unlock_irq(&q->queue_lock);
      > +               ssleep(1);
      > +               spin_lock_irq(&q->queue_lock);
      > +       }
      > +
      > 	/* all allocated, init in the same order */
      > 	if (pol->pd_init_fn)
      > 		list_for_each_entry_reverse(blkg, &q->blkg_list, q_node)
      
      I don't see a reason why all pd's should be allocated, initialized
      and onlined together. The only ordering requirement is that parent
      blkgs be initialized and onlined before children, which is guaranteed
      by the walking order. Let's fix the bug by allocating, initializing
      and onlining the pd for each blkg in turn, holding blkcg->lock across
      the initialization and onlining. This ensures that an installed blkg
      is always fully initialized and onlined, removing the race window. A
      sketch of this per-blkg shape follows.
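      In sketch form the reworked loop does roughly this per blkg
      (kernel-style C; the allocation fallback and rollback paths are
      elided, and the names and signatures are assumed from
      block/blk-cgroup.c of the time):

        list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
                struct blkg_policy_data *pd;

                pd = pol->pd_alloc_fn(disk, blkg->blkcg,
                                      GFP_NOWAIT | __GFP_NOWARN);
                /* ... on failure, drop queue_lock and retry sleeping ... */

                /* install + init + online under blkcg->lock, so walkers
                 * like ioc_weight_write() never see a half-set-up pd */
                spin_lock(&blkg->blkcg->lock);
                blkg->pd[pol->plid] = pd;
                pd->blkg = blkg;
                pd->plid = pol->plid;
                if (pol->pd_init_fn)
                        pol->pd_init_fn(pd);
                if (pol->pd_online_fn)
                        pol->pd_online_fn(pd);
                pd->online = true;
                spin_unlock(&blkg->blkcg->lock);
        }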
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Breno Leitao <leitao@debian.org>
      Fixes: 9d179b86 ("blkcg: Fix multiple bugs in blkcg_activate_policy()")
      Link: https://lore.kernel.org/r/ZN0p5_W-Q9mAHBVY@slm.duckdns.org
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>