- Oct 28, 2023
-
-
Jan Kara authored
The implementation of bdev holder operations such as fs_bdev_mark_dead() and fs_bdev_sync() grabs the sb->s_umount semaphore under bdev->bd_holder_lock. This is problematic because it leads to a disk->open_mutex -> sb->s_umount lock ordering, which is counterintuitive (usually we grab higher-level (e.g. filesystem) locks first and lower-level (e.g. block layer) locks later), and indeed makes lockdep complain about possible locking cycles whenever we open a block device while holding the sb->s_umount semaphore. Implement a function bdev_super_lock_shared() which safely transitions from holding bdev->bd_holder_lock to holding sb->s_umount on an alive superblock without introducing the problematic lock dependency. Use this function in fs_bdev_sync() and fs_bdev_mark_dead(). Signed-off-by:
Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231018152924.3858-1-jack@suse.cz Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Christian Brauner <brauner@kernel.org>
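A minimal sketch of the shape such a helper can take, assuming the superblock is pinned via sb->s_count under sb_lock before bd_holder_lock is dropped (the SB_BORN recheck and the reference helpers are assumptions, not the literal upstream code):

    static struct super_block *bdev_super_lock_shared(struct block_device *bdev)
    {
            struct super_block *sb = bdev->bd_holder;

            lockdep_assert_held(&bdev->bd_holder_lock);
            if (!sb)
                    return NULL;

            /* Pin the sb so it cannot be freed once bd_holder_lock is dropped. */
            spin_lock(&sb_lock);
            sb->s_count++;
            spin_unlock(&sb_lock);

            /*
             * Drop bd_holder_lock *before* taking s_umount, so no
             * disk->open_mutex -> sb->s_umount dependency is created.
             */
            mutex_unlock(&bdev->bd_holder_lock);

            down_read(&sb->s_umount);
            if (!(sb->s_flags & SB_BORN)) {
                    /* the superblock died while no lock was held */
                    up_read(&sb->s_umount);
                    sb = NULL;
            }
            return sb;      /* caller drops s_count again, e.g. via put_super() */
    }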
-
Jan Kara authored
Convert disk_scan_partitions() and blkdev_bszset() to use bdev_open_by_dev(). Acked-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Christian Brauner <brauner@kernel.org> Signed-off-by:
Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-3-jack@suse.cz Signed-off-by:
Christian Brauner <brauner@kernel.org>
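The conversion is mechanical in both call sites; a hedged before/after sketch with error handling elided (struct bdev_handle itself is introduced by the first patch of this series, two entries below):

    /* before */
    struct block_device *bdev =
            blkdev_get_by_dev(disk_devt(disk), mode, NULL, NULL);
    /* ... scan/use the device ... */
    blkdev_put(bdev, NULL);

    /* after */
    struct bdev_handle *handle =
            bdev_open_by_dev(disk_devt(disk), mode, NULL, NULL);
    /* ... scan/use handle->bdev ... */
    bdev_release(handle);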
-
Jan Kara authored
Convert blkdev_open() to use bdev_open_by_dev(). To be able to propagate the handle from blkdev_open() to blkdev_release() we need to stop using the existence of file->private_data to determine exclusive block device opens. Use bdev_handle->mode for this purpose since file->f_flags isn't usable for this (O_EXCL is cleared from the flags during open). Acked-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Christian Brauner <brauner@kernel.org> Signed-off-by:
Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-2-jack@suse.cz Signed-off-by:
Christian Brauner <brauner@kernel.org>
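Roughly, the open/release pair then looks like this (a sketch; file_to_blk_mode() and BLK_OPEN_EXCL are the existing block-layer helpers for converting file flags, everything else is illustrative):

    static int blkdev_open(struct inode *inode, struct file *filp)
    {
            struct bdev_handle *handle;
            blk_mode_t mode;

            /*
             * O_EXCL is stripped from f_flags during open, so capture
             * exclusivity in blk_mode_t before that information is lost.
             */
            mode = file_to_blk_mode(filp);
            handle = bdev_open_by_dev(inode->i_rdev, mode,
                                      mode & BLK_OPEN_EXCL ? filp : NULL,
                                      NULL);
            if (IS_ERR(handle))
                    return PTR_ERR(handle);

            /* always set now; no longer doubles as an "exclusive" flag */
            filp->private_data = handle;
            return 0;
    }

    static int blkdev_release(struct inode *inode, struct file *filp)
    {
            bdev_release(filp->private_data);
            return 0;
    }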
-
Jan Kara authored
Create struct bdev_handle that contains all parameters that need to be passed to blkdev_put() and provide bdev_open_* functions that return this structure instead of plain bdev pointer. This will eventually allow us to pass one more argument to blkdev_put() (renamed to bdev_release()) without too much hassle. Acked-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Christian Brauner <brauner@kernel.org> Signed-off-by:
Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-1-jack@suse.cz Signed-off-by:
Christian Brauner <brauner@kernel.org>
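In rough outline, the new API shape described above (a sketch; the mode field is what the follow-up blkdev_open() conversion relies on):

    struct bdev_handle {
            struct block_device *bdev;
            void *holder;
            blk_mode_t mode;
    };

    struct bdev_handle *bdev_open_by_dev(dev_t dev, blk_mode_t mode,
                                         void *holder,
                                         const struct blk_holder_ops *hops);
    struct bdev_handle *bdev_open_by_path(const char *path, blk_mode_t mode,
                                          void *holder,
                                          const struct blk_holder_ops *hops);
    /* replaces blkdev_put(); any extra state travels in the handle */
    void bdev_release(struct bdev_handle *handle);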
-
- Oct 13, 2023
-
-
Milan Broz authored
The commit 3bfeb612 introduced the use of a keyring for sed-opal. Unfortunately, it missed the possibility that the Opal key to use was already saved (cached) in opal_lock_unlock(). This patch switches the order of operations, so the cached key is used instead of failing in opal_get_key(). The problem was found by the cryptsetup Opal test recently added to the cryptsetup tree. Fixes: 3bfeb612 ("block: sed-opal: keyring support for SED keys") Tested-by:
Ondrej Kozina <okozina@redhat.com> Signed-off-by:
Milan Broz <gmazyland@gmail.com> Link: https://lore.kernel.org/r/20231003100209.380037-1-gmazyland@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Oct 11, 2023
-
-
Sarthak Kukreti authored
Only call truncate_bdev_range() if the fallocate mode is supported. This fixes a bug where data in the pagecache could be invalidated if fallocate() was called on the block device with an invalid mode. Fixes: 25f4c414 ("block: implement (some of) fallocate for block devices") Cc: stable@vger.kernel.org Reported-by:
"Darrick J. Wong" <djwong@kernel.org> Signed-off-by:
Sarthak Kukreti <sarthakkukreti@chromium.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
"Darrick J. Wong" <djwong@kernel.org> Signed-off-by:
Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20231011201230.750105-1-sarthakkukreti@chromium.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
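The shape of the fix, sketched: validate the mode before touching the page cache (the mode combinations mirror the ones blkdev_fallocate() supports; the actual zeroout/discard calls are elided):

    static long blkdev_fallocate(struct file *file, int mode, loff_t start,
                                 loff_t len)
    {
            struct block_device *bdev = I_BDEV(file->f_mapping->host);
            loff_t end = start + len - 1;
            int error;

            /*
             * Fail unsupported modes *before* truncate_bdev_range(),
             * which invalidates the page cache over [start, end].
             */
            switch (mode) {
            case FALLOC_FL_ZERO_RANGE:
            case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
            case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
            case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE |
                 FALLOC_FL_NO_HIDE_STALE:
                    break;
            default:
                    return -EOPNOTSUPP;
            }

            error = truncate_bdev_range(bdev, file_to_blk_mode(file),
                                        start, end);
            if (error)
                    return error;
            /* ... issue the zeroout/discard for the chosen mode ... */
            return 0;
    }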
-
- Sep 26, 2023
-
-
Randy Dunlap authored
Drop one function parameter's kernel-doc comment since the parameter was removed. This prevents a kernel-doc warning: block/disk-events.c:300: warning: Excess function parameter 'events' description in 'disk_force_media_change' Fixes: ab6860f6 ("block: simplify the disk_force_media_change interface") Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Reported-by:
kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202309060957.vfl0mUur-lkp@intel.com Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230926005232.23666-1-rdunlap@infradead.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
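For illustration only, the warning points at a kernel-doc block of roughly this shape; the fix is simply dropping the stale @events line so the comment matches the simplified signature (a sketch, not the exact upstream comment):

    /**
     * disk_force_media_change - force a media change event
     * @disk: the disk which will raise the event
     *
     * (The former "@events: the events to raise" line is gone, since
     * the parameter was removed by the interface simplification.)
     */
    bool disk_force_media_change(struct gendisk *disk);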
-
- Sep 18, 2023
-
-
Kemeng Shi authored
rq_qos_wait() calls the common wake-up function rq_qos_wake_function() to get a token. Just replace the stale reference to wbt_wake_function with rq_qos_wake_function in the comment. Signed-off-by:
Kemeng Shi <shikemeng@huaweicloud.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230914091508.36232-1-shikemeng@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Sep 11, 2023
-
-
Chengming Zhou authored
When nr_hw_queues shrinks, we free the excess tags before realloc'ing hw_ctxs for each queue. During that resize, we may still need to access those tags; for example, blk_mq_tag_idle(hctx) will access the queue's shared tags. This can cause a slab use-after-free, as reported by KASAN. Fix it by moving the releasing of excess tags to the end. Fixes: e1dd7bc9 ("blk-mq: fix tags leak when shrink nr_hw_queues") Reported-by:
Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs_CK63uoDpGBGZ6DN4OCTpzkR3UaVgK=LX8Owr8ej2ieQ@mail.gmail.com/ Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by:
Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230908005702.2183908-1-chengming.zhou@linux.dev Signed-off-by:
Jens Axboe <axboe@kernel.dk>
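A sketch of the reordering, reusing the helper names from the original leak fix (see the companion sketch under the Aug 22 entries below); the excess tags are freed only after every queue's hctxs have been reallocated:

    static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
                                             int nr_hw_queues)
    {
            struct request_queue *q;
            int prev_nr_hw_queues = set->nr_hw_queues;
            int i;

            /* ... adjust set->tags[] and set->nr_hw_queues ... */

            /*
             * Reallocating hctxs may still touch the shared tags of the
             * queues being removed (e.g. via blk_mq_tag_idle()), so the
             * excess tags must remain valid across this loop.
             */
            list_for_each_entry(q, &set->tag_list, tag_set_list)
                    blk_mq_realloc_hw_ctxs(set, q);

            /* ... remap queues, switch IO schedulers back ... */

            /* Only now is it safe to free the excess tags. */
            for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++)
                    __blk_mq_free_map_and_rqs(set, i);
    }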
-
- Sep 06, 2023
-
-
Christoph Hellwig authored
There is no need to unpin the added page when adding it to the bio fails as that is done by the loop below. Instead we want to unpin it when adding a single page to the bio more than once as bio_release_pages will only unpin it once. Fixes: d1916c86 ("block: move same page handling from __bio_add_pc_page to the callers") Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230905124731.328255-1-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 31, 2023
-
-
Li Lingfeng authored
Commit a33df75c ("block: use an xarray for disk->part_tbl") removed disk_expand_part_tbl() in add_partition(), which means all kinds of devices will support extended dynamic `dev_t`. However, some devices with GENHD_FL_NO_PART are not expected to add or resize partitions. Fix this by checking GENHD_FL_NO_PART before adding or resizing a partition. Fixes: a33df75c ("block: use an xarray for disk->part_tbl") Signed-off-by:
Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230831075900.1725842-1-lilingfeng@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
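A hedged sketch of the added check (the same test guards the resize path; the surrounding overlap checks are elided):

    int bdev_add_partition(struct gendisk *disk, int partno, sector_t start,
                           sector_t length)
    {
            int ret;

            mutex_lock(&disk->open_mutex);
            /*
             * Devices that opted out of partition support must not gain
             * partitions just because the xarray could hold them.
             */
            if (disk->flags & GENHD_FL_NO_PART) {
                    ret = -EINVAL;
                    goto out;
            }
            /* ... existing checks and the add_partition() call ... */
            ret = 0;
    out:
            mutex_unlock(&disk->open_mutex);
            return ret;
    }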
-
Christoph Hellwig authored
file_remove_privs instantly returns 0 when not called for regular files, so don't bother. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230831121911.280155-1-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 30, 2023
-
-
Yu Kuai authored
Currently, 'carryover_ios/bytes' is not handled in throtl_trim_slice(); as a consequence, 'carryover_ios/bytes' will be used to throttle bios multiple times, for example:
1) set iops limit to 100, and slice start is 0, slice end is 100ms;
2) current time is 0, and 10 ios are dispatched; those ios won't be throttled and io_disp is 10;
3) still at current time 0, update iops limit to 1000; carryover_ios is updated to (0 - 10) = -10;
4) in this slice (0 - 100ms), io_allowed = 100 + (-10) = 90, which means only 90 ios can be dispatched without waiting;
5) assume that io is throttled in slice (0 - 100ms), and throtl_trim_slice() updates the slice to (100ms - 200ms); in this case, 'carryover_ios/bytes' is not cleared and still only 90 ios can be dispatched between 100ms - 200ms.
Fix this problem by updating 'carryover_ios/bytes' in throtl_trim_slice(). Fixes: a880ae93 ("blk-throttle: fix io hung due to configuration updates") Reported-by:
zhuxiaohui <zhuxiaohui.400@bytedance.com> Link: https://lore.kernel.org/all/20230812072116.42321-1-zhuxiaohui.400@bytedance.com/ Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-5-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
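A sketch of the trim logic with the fix applied, reusing the calculate_*_allowed() helpers named elsewhere in this series (the elided parts compute the expired slice time and advance slice_start):

    static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
    {
            unsigned long time_elapsed;
            long long bytes_trim;
            int io_trim;

            /* ... compute time_elapsed for the expired part of the slice ... */

            /*
             * Fold the carryover into the trimmed budget exactly once,
             * then clear it so the next slice cannot apply it again.
             */
            bytes_trim = calculate_bytes_allowed(tg_bps_limit(tg, rw),
                                                 time_elapsed) +
                         tg->carryover_bytes[rw];
            io_trim = calculate_io_allowed(tg_iops_limit(tg, rw),
                                           time_elapsed) +
                      tg->carryover_ios[rw];
            tg->carryover_bytes[rw] = 0;
            tg->carryover_ios[rw] = 0;

            /*
             * ... subtract bytes_trim/io_trim from bytes_disp/io_disp
             * and advance slice_start ...
             */
    }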
-
Yu Kuai authored
There are no functional changes, just make the code cleaner. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-4-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
carryover_ios/bytes[] can be negative in the case that ios are dispatched in the slice in advance, and then the configuration is updated. For example:
1) set iops limit to 1000, and slice start is 0, slice end is 100ms;
2) current time is 0, and 100 ios are dispatched; those ios will not be throttled, hence io_disp is 100;
3) still at current time 0, update iops limit to 100; then carryover_ios is (0 - 100) = -100;
4) then, dispatch a new io at time 0; the expected result is that this io will wait for 1s. The calculation in tg_within_iops_limit: io_disp = 0; io_allowed = calculate_io_allowed + carryover_ios = 10 + (-100) = -90; the io won't be throttled if (io_disp + 1 < io_allowed) passes.
Before this patch, in step 4) (io_disp + 1 < io_allowed) passes, because -90 as an unsigned value is huge, and such an io won't be throttled. Fix this problem by checking if 'io/bytes_allowed' is negative first. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-3-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
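The essence of the fix, sketched for the iops side (the bps side is symmetric): io_allowed becomes signed and is checked for being positive before the comparison.

    long long io_allowed;   /* signed: carryover may push it below 0 */

    io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd) +
                 tg->carryover_ios[rw];

    /*
     * A negative allowance must throttle; compared as an unsigned
     * value, -90 would instead look like a huge allowance.
     */
    if (io_allowed > 0 && tg->io_disp[rw] + 1 <= io_allowed)
            return 0;       /* within limit, no wait needed */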
-
Yu Kuai authored
'carryover_bytes/ios' can be negative, indicating that some bios were dispatched in advance within the slice while the configuration was updated. Printing a huge wrapped-around unsigned value is not user-friendly, so print the values as signed. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-2-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 23, 2023
-
-
Xu Panda authored
The implementation of strscpy() is more robust and safer. It is now the recommended way to copy NUL-terminated strings. Signed-off-by:
Xu Panda <xu.panda@zte.com.cn> Signed-off-by:
Yang Yang <yang.yang29@zte.com> Reviewed-by:
Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/r/202212031422587503771@zte.com.cn Signed-off-by:
Jens Axboe <axboe@kernel.dk>
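A generic illustration of the swap, given a source string src (buffer and variable names are hypothetical):

    char name[32];
    ssize_t ret;

    /*
     * strncpy() neither guarantees NUL-termination nor reports
     * truncation; strscpy() always terminates the destination and
     * returns -E2BIG when the source did not fit.
     */
    ret = strscpy(name, src, sizeof(name));
    if (ret < 0)
            pr_warn("name truncated\n");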
-
- Aug 22, 2023
-
-
Greg Joyce authored
Extend the SED block driver so it can alternatively obtain a key from a sed-opal kernel keyring. The SED ioctls will indicate the source of the key, either directly in the ioctl data or from the keyring. This allows the use of SED commands in scripts such as udev scripts so that drives may be automatically unlocked as they become available. Signed-off-by:
Greg Joyce <gjoyce@linux.vnet.ibm.com> Reviewed-by:
Jonathan Derrick <jonathan.derrick@linux.dev> Acked-by:
Jarkko Sakkinen <jarkko@kernel.org> Link: https://lore.kernel.org/r/20230721211534.3437070-4-gjoyce@linux.vnet.ibm.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Greg Joyce authored
This is used in conjunction with IOC_OPAL_REVERT_TPR to return a drive to Original Factory State without erasing the data. If IOC_OPAL_REVERT_LSP is called with opal_revert_lsp.options bit OPAL_PRESERVE set prior to calling IOC_OPAL_REVERT_TPR, the drive global locking range will not be erased. Signed-off-by:
Greg Joyce <gjoyce@linux.vnet.ibm.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jonathan Derrick <jonathan.derrick@linux.dev> Acked-by:
Jarkko Sakkinen <jarkko@kernel.org> Link: https://lore.kernel.org/r/20230721211534.3437070-3-gjoyce@linux.vnet.ibm.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Greg Joyce authored
Add IOC_OPAL_DISCOVERY ioctl to return raw discovery data to a SED Opal application. This allows the application to display drive capabilities and state. Signed-off-by:
Greg Joyce <gjoyce@linux.vnet.ibm.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jonathan Derrick <jonathan.derrick@linux.dev> Acked-by:
Jarkko Sakkinen <jarkko@kernel.org> Link: https://lore.kernel.org/r/20230721211534.3437070-2-gjoyce@linux.vnet.ibm.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
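From userspace the new ioctl can be used roughly like this, given an open fd on the drive (a sketch; see include/uapi/linux/sed-opal.h for the authoritative struct opal_discovery layout):

    #include <linux/sed-opal.h>
    #include <sys/ioctl.h>
    #include <stdint.h>

    char buf[4096];
    struct opal_discovery discv = {
            .data = (uintptr_t)buf, /* buffer for the raw Level 0 discovery response */
            .size = sizeof(buf),
    };

    if (ioctl(fd, IOC_OPAL_DISCOVERY, &discv) < 0)
            perror("IOC_OPAL_DISCOVERY");
    /* on success, buf holds the raw discovery data to parse and display */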
-
Chengming Zhou authored
Just like blk_mq_alloc_tag_set(), it's better to prepare all tags before using them to map to queue ctxs in blk_mq_map_swqueue(), which currently has to consider an empty set->tags[]. The good point is that we can fall back easily if increasing nr_hw_queues fails, instead of just mapping to hctx[0] when blk_mq_map_swqueue() fails. And the fallback path already has tags free & clean handling, so all is good. Signed-off-by:
Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230821095602.70742-3-chengming.zhou@linux.dev Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Chengming Zhou authored
When increasing nr_hw_queues fails, the fallback path will use blk_mq_update_queue_map() to clear and update all maps. Obviously, the earlier update of the HCTX_TYPE_DEFAULT map alone is not needed, so delete it. Signed-off-by:
Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230821095602.70742-2-chengming.zhou@linux.dev Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Chengming Zhou authored
Although we don't need to realloc set->tags[] when shrinking nr_hw_queues, we do need to free the excess tags, or they are leaked. How to reproduce:
1. mount -t configfs configfs /mnt
2. modprobe null_blk nr_devices=0 submit_queues=8
3. mkdir /mnt/nullb/nullb0
4. echo 1 > /mnt/nullb/nullb0/power
5. echo 4 > /mnt/nullb/nullb0/submit_queues
6. rmdir /mnt/nullb/nullb0
In step 4, 9 tags are allocated (8 submit queues and 1 poll queue); then in step 5, new_nr_hw_queues = 5 (4 submit queues and 1 poll queue). At last in step 6, only these 5 tags are freed; the other 4 tags are leaked. Signed-off-by:
Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230821095602.70742-1-chengming.zhou@linux.dev Signed-off-by:
Jens Axboe <axboe@kernel.dk>
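A sketch of the shrink path with the fix (helper names as in blk-mq; note the Sep 11 follow-up above later moves this freeing to the end of __blk_mq_update_nr_hw_queues() to avoid a use-after-free):

    static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
                                           int new_nr_hw_queues)
    {
            int i;

            if (new_nr_hw_queues < set->nr_hw_queues) {
                    /*
                     * The tags of the dropped hw queues must be freed
                     * here, otherwise they are simply leaked.
                     */
                    for (i = new_nr_hw_queues; i < set->nr_hw_queues; i++)
                            __blk_mq_free_map_and_rqs(set, i);
                    goto done;
            }

            /* ... grow path: krealloc set->tags[] and zero the tail ... */
    done:
            set->nr_hw_queues = new_nr_hw_queues;
            return 0;
    }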
-
- Aug 21, 2023
-
-
Christoph Hellwig authored
BLKFLSBUF is a historic ioctl that is called on a file handle to a block device and syncs either the file system mounted on that block device if there is one, or otherwise just the data on the block device. Replace the get_super based syncing with a holder operation to remove the last usage of get_super, and to also support syncing the file system if the block device is not the main block device stored in s_dev. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-16-hch@lst.de> Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Christoph Hellwig authored
Combine the newly merged bdev_mark_dead helper with the existing mark_dead holder operation so that all operations that invalidate a device that is dead or being removed now go through the holder ops. This allows file systems to explicitly shutdown either ASAP (for a surprise removal) or after writing back data (for an orderly removal), and do so not only for the main device. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-15-hch@lst.de> Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Christoph Hellwig authored
We currently have two interfaces that take a block_device and then find a mounted file system to flush or invalidate data on it. Both are a bit problematic because they only work for the "main" block device that is used as s_dev for the super_block, and because they don't call into the file system at all. Merge the two into a new bdev_mark_dead helper that does both the syncing and invalidation and which is properly documented. This is in preparation for merging the functionality into the ->mark_dead holder operation so that it will work on additional block devices used by a file system and give us a single entry point for invalidation of dead devices or media. Note that a single standalone fsync_bdev call for an obscure ioctl remains for now, but that one will also be dealt with in a bit. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-14-hch@lst.de> Signed-off-by:
Christian Brauner <brauner@kernel.org>
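The merged helper, sketched along the lines the message describes: call into the holder's ->mark_dead if one is registered, otherwise fall back to a plain sync, then invalidate.

    void bdev_mark_dead(struct block_device *bdev, bool surprise)
    {
            mutex_lock(&bdev->bd_holder_lock);
            if (bdev->bd_holder_ops && bdev->bd_holder_ops->mark_dead)
                    bdev->bd_holder_ops->mark_dead(bdev, surprise);
            else
                    sync_blockdev(bdev);
            mutex_unlock(&bdev->bd_holder_lock);

            invalidate_bdev(bdev);
    }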
-
Christoph Hellwig authored
This message isn't exactly helpful, and file systems already print way more useful messages when shut down while active. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-13-hch@lst.de> Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Christoph Hellwig authored
Hard code the events to DISK_EVENT_MEDIA_CHANGE as that is the only useful use case, and drop the superfluous return value. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-9-hch@lst.de> Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
- Aug 19, 2023
-
-
Chengming Zhou authored
Chuck reported [1] an IO hang problem on NFS exports that reside on SATA devices and bisected it to commit 615939a2 ("blk-mq: defer to the normal submission path for post-flush requests"). We analysed the IO hang problem and found two postflush requests waiting for each other. The first postflush request completed the REQ_FSEQ_DATA sequence, so it moved on to the REQ_FSEQ_POSTFLUSH sequence and was added to the flush pending list, but blk_kick_flush() failed because of the second postflush request, which was inflight waiting in the scheduler queue. The second postflush waiting in the scheduler queue couldn't be dispatched because the first postflush hadn't released its scheduler resource even though it had completed by itself. Fix it by releasing the scheduler resource when the first postflush request completes, so the second postflush can be dispatched and completed, which then makes blk_kick_flush() succeed. While at it, remove the check for e->ops.finish_request, as all schedulers set that. Reaffirm this requirement by adding a WARN_ON_ONCE() at scheduler registration time, just like we do for insert_requests and dispatch_request. [1] https://lore.kernel.org/all/7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com/ Link: https://lore.kernel.org/linux-block/20230819031206.2744005-1-chengming.zhou@linux.dev/ Reported-by:
kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202308172100.8ce4b853-oliver.sang@intel.com Fixes: 615939a2 ("blk-mq: defer to the normal submission path for post-flush requests") Reported-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Chengming Zhou <zhouchengming@bytedance.com> Tested-by:
Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20230813152325.3017343-1-chengming.zhou@linux.dev [axboe: folded in incremental fix and added tags] Signed-off-by:
Jens Axboe <axboe@kernel.dk>
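The core of the fix, sketched: release the scheduler resource at completion time, and clear the flag so a postflush that completes twice does not call finish_request() twice.

    static inline void blk_mq_finish_request(struct request *rq)
    {
            struct request_queue *q = rq->q;

            if (rq->rq_flags & RQF_USE_SCHED) {
                    q->elevator->type->ops.finish_request(rq);
                    /*
                     * A postflush request may be completed twice; clear
                     * the flag to avoid a double finish_request().
                     */
                    rq->rq_flags &= ~RQF_USE_SCHED;
            }
    }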
-
- Aug 18, 2023
-
-
Sweet Tea Dorminy authored
blk_crypto_profile_init() calls lockdep_register_key(), which warns and does not register if the provided memory is a static object. blk-crypto-fallback currently has a static blk_crypto_profile and calls blk_crypto_profile_init() thereupon, resulting in the warning and failure to register. Fortunately it is simple enough to use a dynamically allocated profile and make lockdep function correctly. Fixes: 2fb48d88 ("blk-crypto: use dynamic lock class for blk_crypto_profile::lock") Cc: stable@vger.kernel.org Signed-off-by:
Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by:
Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230817141615.15387-1-sweettea-kernel@dorminy.me Signed-off-by:
Jens Axboe <axboe@kernel.dk>
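The gist of the fix, a sketch of the fallback's init path (error unwinding elided):

    static struct blk_crypto_profile *blk_crypto_fallback_profile;

    static int blk_crypto_fallback_init(void)
    {
            /*
             * lockdep_register_key() refuses static objects, so the
             * profile must live in dynamically allocated memory.
             */
            blk_crypto_fallback_profile =
                    kzalloc(sizeof(*blk_crypto_fallback_profile), GFP_KERNEL);
            if (!blk_crypto_fallback_profile)
                    return -ENOMEM;

            return blk_crypto_profile_init(blk_crypto_fallback_profile, 1);
    }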
-
Ming Lei authored
When blkg is removed from q->blkg_list from blkg_free_workfn(), queue_lock has to be held; otherwise, all kinds of bugs (list corruption, hard lockup, ...) can be triggered from blkg_destroy_all(). Fixes: f1c006f1 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") Cc: Yu Kuai <yukuai3@huawei.com> Cc: xiaoli feng <xifeng@redhat.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by:
Ming Lei <ming.lei@redhat.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230817141751.1128970-1-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
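The fix boils down to taking queue_lock around the list removal; a sketch of the relevant part of blkg_free_workfn():

    static void blkg_free_workfn(struct work_struct *work)
    {
            struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
                                                 free_work);
            struct request_queue *q = blkg->q;

            /* ... free policy data, drop the parent reference ... */

            /*
             * q->blkg_list is also walked by blkg_destroy_all() under
             * queue_lock; removal must hold the same lock.
             */
            spin_lock_irq(&q->queue_lock);
            list_del_init(&blkg->q_node);
            spin_unlock_irq(&q->queue_lock);

            /* ... put the queue reference and free the blkg ... */
    }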
-
Tejun Heo authored
blk-iocost sometimes causes the following crash:

    BUG: kernel NULL pointer dereference, address: 00000000000000e0
    ...
    RIP: 0010:_raw_spin_lock+0x17/0x30
    Code: be 01 02 00 00 e8 79 38 39 ff 31 d2 89 d0 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 65 ff 05 48 d0 34 7e b9 01 00 00 00 31 c0 <f0> 0f b1 0f 75 02 5d c3 89 c6 e8 ea 04 00 00 5d c3 0f 1f 84 00 00
    RSP: 0018:ffffc900023b3d40 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 0000000000000001
    RDX: ffffc900023b3d20 RSI: ffffc900023b3cf0 RDI: 00000000000000e0
    RBP: ffffc900023b3d40 R08: ffffc900023b3c10 R09: 0000000000000003
    R10: 0000000000000064 R11: 000000000000000a R12: ffff888102337000
    R13: fffffffffffffff2 R14: ffff88810af408c8 R15: ffff8881070c3600
    FS: 00007faaaf364fc0(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000000000e0 CR3: 00000001097b1000 CR4: 0000000000350ea0
    Call Trace:
    <TASK>
    ioc_weight_write+0x13d/0x410
    cgroup_file_write+0x7a/0x130
    kernfs_fop_write_iter+0xf5/0x170
    vfs_write+0x298/0x370
    ksys_write+0x5f/0xb0
    __x64_sys_write+0x1b/0x20
    do_syscall_64+0x3d/0x80
    entry_SYSCALL_64_after_hwframe+0x46/0xb0

This happens because iocg->ioc is NULL. The field is initialized by ioc_pd_init() and never cleared. The NULL deref is caused by blkcg_activate_policy() installing blkg_policy_data before initializing it. blkcg_activate_policy() was doing the following:
1. Allocate pd's for all existing blkg's and install them in blkg->pd[].
2. Initialize all pd's.
3. Online all pd's.
blkcg_activate_policy() only grabs the queue_lock and may release and re-acquire the lock as allocation may need to sleep. ioc_weight_write() grabs blkcg->lock and iterates all its blkg's. The two can race, and if ioc_weight_write() runs during #1 or between #1 and #2, it can encounter a pd which is not initialized yet, leading to a crash. The crash can be reproduced with the following script:

    #!/bin/bash
    echo +io > /sys/fs/cgroup/cgroup.subtree_control
    systemd-run --unit touch-sda --scope dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct
    echo 100 > /sys/fs/cgroup/system.slice/io.weight
    bash -c "echo '8:0 enable=1' > /sys/fs/cgroup/io.cost.qos" &
    sleep .2
    echo 100 > /sys/fs/cgroup/system.slice/io.weight

with the following patch applied:

> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index fc49be622e05..38d671d5e10c 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1553,6 +1553,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>               pd->online = false;
>       }
>
> +     if (system_state == SYSTEM_RUNNING) {
> +             spin_unlock_irq(&q->queue_lock);
> +             ssleep(1);
> +             spin_lock_irq(&q->queue_lock);
> +     }
> +
>       /* all allocated, init in the same order */
>       if (pol->pd_init_fn)
>               list_for_each_entry_reverse(blkg, &q->blkg_list, q_node)

I don't see a reason why all pd's should be allocated, initialized and onlined together. The only ordering requirement is that parent blkgs be initialized and onlined before children, which is guaranteed from the walking order. Let's fix the bug by allocating, initializing and onlining pd for each blkg and holding blkcg->lock over initialization and onlining. This ensures that an installed blkg is always fully initialized and onlined, removing the race window. Signed-off-by:
Tejun Heo <tj@kernel.org> Reported-by:
Breno Leitao <leitao@debian.org> Fixes: 9d179b86 ("blkcg: Fix multiple bugs in blkcg_activate_policy()") Link: https://lore.kernel.org/r/ZN0p5_W-Q9mAHBVY@slm.duckdns.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 14, 2023
-
-
Kent Overstreet authored
This reverts 6f822e1b - this helper is used by bcachefs. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Link: https://lore.kernel.org/r/20230813182636.2966159-4-kent.overstreet@linux.dev Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Kent Overstreet authored
bio_iov_iter_get_pages() trims the IO based on the block size of the block device the IO will be issued to. However, bcachefs is a multi device filesystem; when we're creating the bio we don't yet know which block device the bio will be submitted to - we have to handle the alignment checks elsewhere. Thus this is needed to avoid a null ptr deref. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Link: https://lore.kernel.org/r/20230813182636.2966159-3-kent.overstreet@linux.dev Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Kent Overstreet authored
- bio_set_pages_dirty(), bio_check_pages_dirty() - dio path
- blk_status_to_str() - error messages
- bio_add_folio() - this should definitely be exported for everyone, it's the modern version of bio_add_page()
Signed-off-by:
Kent Overstreet <kent.overstreet@gmail.com> Cc: linux-block@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20230813182636.2966159-2-kent.overstreet@linux.dev Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 10, 2023
-
-
Jens Axboe authored
A previous commit added a lockdep annotation, but botched it. Use the right type. Fixes: 4eb44d10 ("block: remove init_mutex and open-code blk_iolatency_try_init") Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Li Lingfeng authored
Commit a13696b8 ("blk-iolatency: Make initialization lazy") adds a mutex named "init_mutex" in blk_iolatency_try_init for the race condition of initializing RQ_QOS_LATENCY. Now a new lock has been added to struct request_queue by commit a13bd91b ("block/rq_qos: protect rq_qos apis with a new lock"), and it is held in blkg_conf_open_bdev before calling blk_iolatency_init. So it's not necessary to keep init_mutex in blk_iolatency_try_init; just remove it. Since init_mutex has been removed, blk_iolatency_try_init can be open-coded back into iolatency_set_limit(), like ioc_qos_write(). Signed-off-by:
Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by:
Michal Koutný <mkoutny@suse.com> Link: https://lore.kernel.org/r/20230810035111.2236335-1-lilingfeng@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 09, 2023
-
-
Jinyoung Choi authored
In general, a bvec describes a run of physically contiguous pages. But in the bvec configuration for bip, each physically contiguous integrity page gets its own bvec. Allow bio_integrity_add_page() to create multi-page bvecs, just like the bio payloads. This simplifies adding larger payloads, and fixes support for non-tiny workloads with nvme, which stopped using scatterlist for metadata a while ago. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Fixes: 783b94bd ("nvme-pci: do not build a scatterlist to map metadata") Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jinyoung Choi <j-young.choi@samsung.com> Tested-by:
"Martin K. Petersen" <martin.petersen@oracle.com> Reviewed-by:
"Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230803025202epcms2p82f57cbfe32195da38c776377b55aed59@epcms2p8 Signed-off-by:
Jens Axboe <axboe@kernel.dk>
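A condensed sketch of the change: before appending a new bvec, try to merge into the previous one the same way the data path does (bvec_try_merge_hw_page() is the helper the regular bio path uses; the max-sectors limit checks are elided):

    int bio_integrity_add_page(struct bio *bio, struct page *page,
                               unsigned int len, unsigned int offset)
    {
            struct bio_integrity_payload *bip = bio_integrity(bio);
            struct request_queue *q = bdev_get_queue(bio->bi_bdev);

            if (bip->bip_vcnt > 0) {
                    struct bio_vec *bv = &bip->bip_vec[bip->bip_vcnt - 1];
                    bool same_page = false;

                    /* physically contiguous? extend the last bvec */
                    if (bvec_try_merge_hw_page(q, bv, page, len, offset,
                                               &same_page)) {
                            bip->bip_iter.bi_size += len;
                            return len;
                    }
            }

            if (bip->bip_vcnt >= bip->bip_max_vcnt)
                    return 0;

            bvec_set_page(&bip->bip_vec[bip->bip_vcnt], page, len, offset);
            bip->bip_vcnt++;
            bip->bip_iter.bi_size += len;
            return len;
    }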
-
Jinyoung Choi authored
bio_integrity_add_page() returns the added length if successful, else 0, just like bio_add_page(). Simplify the return value checking in bio_integrity_prep() so it does not deal with a "> 0 but < len" case that can't happen. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jinyoung Choi <j-young.choi@samsung.com> Tested-by:
"Martin K. Petersen" <martin.petersen@oracle.com> Reviewed-by:
"Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230803025058epcms2p5a4d0db5da2ad967668932d463661c633@epcms2p5 Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jinyoung Choi authored
Previously, the bip's bi_size was set before the integrity pages were added. If a problem occurred while adding pages for the bip, the resulting bi_size mismatch had to be dealt with. Now, bi_size is updated only when a page is successfully added to the bvec. The parts affected by this change are also updated in this commit. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jinyoung Choi <j-young.choi@samsung.com> Tested-by:
"Martin K. Petersen" <martin.petersen@oracle.com> Reviewed-by:
"Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230803024956epcms2p38186a17392706650c582d38ef3dbcd32@epcms2p3 Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-