- Jun 26, 2023
-
-
Yu Kuai authored
sysfs entry /sys/block/[device]/queue/wbt_lat_usec will be created even if CONFIG_BLK_WBT is disabled, while read and write will always fail. It doesn't make sense to create a sysfs entry that can't be accessed, so don't create such entry. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230527010644.647900-2-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 25, 2023
-
-
Ming Lei authored
Request allocated from sched tags can't be issued via ->queue_rqs() directly, since driver tag isn't allocated yet. This is the 1st misuse of RQF_USE_SCHED for figuring out plug->has_elevator. Request allocated from sched tags can't be ended by blk_mq_end_request_batch() too, fix the 2nd RQF_USE_SCHED misuse in blk_mq_add_to_batch(). Without this patch, NVMe uring cmd passthrough IO workload can run into hang easily with real io scheduler. Fixes: dd6216bb ("blk-mq: make sure elevator callbacks aren't called for passthrough request") Reported-by:
Guangwu Zhang <guazhang@redhat.com> Closes: https://lore.kernel.org/linux-block/CAGS2=YrBjpLPOKa-gzcKuuOG60AGth5794PNCDwatdnnscB9ug@mail.gmail.com/ Cc: Christoph Hellwig <hch@lst.de> Signed-off-by:
Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230624130105.1443879-1-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jinke Han authored
After commit f382fb0b ("block: remove legacy IO schedulers"), blkio.throttle.io_serviced and blkio.throttle.io_service_bytes become the only stable io stats interface of cgroup v1, and these statistics are done in the blk-throttle code. But the current code only counts the bios that are actually throttled. When the user does not add the throttle limit, the io stats for cgroup v1 has nothing. I fix it according to the statistical method of v2, and made it count all ios accurately. Fixes: a7b36ee6 ("block: move blk-throtl fast path inline") Tested-by:
Andrea Righi <andrea.righi@canonical.com> Signed-off-by:
Jinke Han <hanjinke.666@bytedance.com> Acked-by:
Muchun Song <songmuchun@bytedance.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230507170631.89607-1-hanjinke.666@bytedance.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 22, 2023
-
-
Christoph Hellwig authored
When we didn't find a device and didn't guess it might be a partition, it might still show up later, so don't disable rootwait for it by returning -EINVAL. Fixes: 079caa35 ("init: clear root_wait on all invalid root= strings") Reported-by:
Guenter Roeck <linux@roeck-us.net> Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230622150644.600327-1-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 21, 2023
-
-
Bart Van Assche authored
Fix the documentation of the devt_from_partuuid() return value. Fix the following two recently introduced kernel-doc warnings: block/bdev.c:570: warning: Function parameter or member 'hops' not described in 'bd_finish_claiming' block/early-lookup.c:46: warning: Function parameter or member 'devt' not described in 'devt_from_partuuid' Cc: Christoph Hellwig <hch@lst.de> Fixes: 0718afd4 ("block: introduce holder ops") Fixes: cf056a43 ("init: improve the name_to_dev_t interface") Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230621165054.743815-1-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
In case of real io scheduler, q->elevator is set, so blk_mq_run_hw_queue() may just check if scheduler queue has request to dispatch, see __blk_mq_sched_dispatch_requests(). Then IO hang may be caused because all passthorugh requests may stay in sw queue. And any passthrough request should have been inserted to hctx->dispatch always. Reported-by:
Guangwu Zhang <guazhang@redhat.com> Fixes: d97217e7 ("blk-mq: don't queue plugged passthrough requests into scheduler") Signed-off-by:
Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230621132208.1142318-1-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Ivan Orlov authored
Now that the driver core allows for struct class to be in read-only memory, move the bsg_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-scsi@vger.kernel.org Cc: linux-block@vger.kernel.org Suggested-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Ivan Orlov <ivan.orlov0322@gmail.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20230620180129.645646-8-gregkh@linuxfoundation.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
FMODE_EXEC has nothing to do with exclusive opens, and even is of the wrong type. We need to check for BLK_OPEN_EXCL here. Fixes: 985958b8 ("block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()") Reported-by:
kernel test robot <lkp@intel.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230621124914.185992-1-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 20, 2023
-
-
Michael Schmitz authored
The Amiga partition parser module uses signed int for partition sector address and count, which will overflow for disks larger than 1 TB. Use u64 as type for sector address and size to allow using disks up to 2 TB without LBD support, and disks larger than 2 TB with LBD. The RBD format allows to specify disk sizes up to 2^128 bytes (though native OS limitations reduce this somewhat, to max 2^68 bytes), so check for u64 overflow carefully to protect against overflowing sector_t. Bail out if sector addresses overflow 32 bits on kernels without LBD support. This bug was reported originally in 2012, and the fix was created by the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been discussed and reviewed on linux-m68k at that time but never officially submitted (now resubmitted as patch 1 in this series). This patch adds additional error checking and warning messages. Reported-by:
Martin Steigerwald <Martin@lichtvoll.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511 Fixes: 1da177e4 ("Linux-2.6.12-rc2") Message-ID: <201206192146.09327.Martin@lichtvoll.de> Cc: <stable@vger.kernel.org> # 5.2 Signed-off-by:
Michael Schmitz <schmitzmic@gmail.com> Reviewed-by:
Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by:
Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/r/20230620201725.7020-4-schmitzmic@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Michael Schmitz authored
The Amiga partition parser module uses signed int for partition sector address and count, which will overflow for disks larger than 1 TB. Use sector_t as type for sector address and size to allow using disks up to 2 TB without LBD support, and disks larger than 2 TB with LBD. This bug was reported originally in 2012, and the fix was created by the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been discussed and reviewed on linux-m68k at that time but never officially submitted. This patch differs from Joanne's patch only in its use of sector_t instead of unsigned int. No checking for overflows is done (see patch 3 of this series for that). Reported-by:
Martin Steigerwald <Martin@lichtvoll.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511 Fixes: 1da177e4 ("Linux-2.6.12-rc2") Message-ID: <201206192146.09327.Martin@lichtvoll.de> Cc: <stable@vger.kernel.org> # 5.2 Signed-off-by:
Michael Schmitz <schmitzmic@gmail.com> Tested-by:
Martin Steigerwald <Martin@lichtvoll.de> Reviewed-by:
Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620201725.7020-2-schmitzmic@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Min Li authored
In the function bdev_add_partition(),there is no check that the start and end sectors exceed the size of the disk before calling add_partition. When we call the block's ioctl interface directly to add a partition, and the capacity of the disk is set to 0 by driver,the command will continue to execute. Signed-off-by:
Min Li <min15.li@samsung.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230619091214.31615-1-min15.li@samsung.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jingbo Xu authored
Allow of unprivileged Persistent Reservation operations on devices if the write permission check on the device node has passed. brw-rw---- 1 root disk 259, 0 Jun 13 07:09 /dev/nvme0n1 In the example above, the "disk" group of nvme0n1 is also allowed to make reservations on the device even without CAP_SYS_ADMIN. Signed-off-by:
Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230613084008.93795-3-jefflexu@linux.alibaba.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jingbo Xu authored
Refuse Persistent Reservation operations on partitions as reservation on partitions doesn't make sense. Besides, introduce blkdev_pr_allowed() helper, where more policies could be placed here later. Signed-off-by:
Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230613084008.93795-2-jefflexu@linux.alibaba.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
After commit 2736e8ee ("block: use the holder as indication for exclusive opens"), blkdev_get_by_dev() will warn if holder is NULL and mode contains 'FMODE_EXCL'. holder from blkdev_get_by_dev() from disk_scan_partitions() is always NULL, hence it should not use 'FMODE_EXCL', which is broben by the commit. For consequence, WARN_ON_ONCE() will be triggered from blkdev_get_by_dev() if user scan partitions with device opened exclusively. Fix this problem by removing 'FMODE_EXCL' from disk_scan_partitions(), as it used to be. Reported-by:
<syzbot+00cd27751f78817f167b@syzkaller.appspotmail.com> Link: https://syzkaller.appspot.com/bug?extid=00cd27751f78817f167b Fixes: 2736e8ee ("block: use the holder as indication for exclusive opens") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Reviewed-by:
Christian Brauner <brauner@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230618140402.7556-1-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Reported-by:
Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620043536.707249-1-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Demi Marie Obenour authored
Currently, associating a loop device with a different file descriptor does not increment its diskseq. This allows the following race condition: 1. Program X opens a loop device 2. Program X gets the diskseq of the loop device. 3. Program X associates a file with the loop device. 4. Program X passes the loop device major, minor, and diskseq to something. 5. Program X exits. 6. Program Y detaches the file from the loop device. 7. Program Y attaches a different file to the loop device. 8. The opener finally gets around to opening the loop device and checks that the diskseq is what it expects it to be. Even though the diskseq is the expected value, the result is that the opener is accessing the wrong file. From discussions with Christoph Hellwig, it appears that disk_force_media_change() was supposed to call inc_diskseq(), but in fact it does not. Adding a Fixes: tag to indicate this. Christoph's Reported-by is because he stated that disk_force_media_change() calls inc_diskseq(), which is what led me to discover that it should but does not. Reported-by:
Christoph Hellwig <hch@infradead.org> Signed-off-by:
Demi Marie Obenour <demi@invisiblethingslab.com> Fixes: e6138dc1 ("block: add a helper to raise a media changed event") Cc: stable@vger.kernel.org # 5.15+ Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230607170837.1559-1-demi@invisiblethingslab.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 16, 2023
-
-
Ming Lei authored
After grabbing q->sysfs_lock, q->elevator may become NULL because of elevator switch. Fix the NULL dereference on q->elevator by checking it with lock. Reported-by:
Guangwu Zhang <guazhang@redhat.com> Signed-off-by:
Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Now that all block direct I/O helpers use page pinning, this flag is unused. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Christian Brauner <brauner@kernel.org> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-4-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 15, 2023
-
-
Yu Kuai authored
Commit 99d055b4 ("block: remove per-disk debugfs files in blk_unregister_queue") moves blk_trace_shutdown() from blk_release_queue() to blk_unregister_queue(), this is safe if blktrace is created through sysfs, however, there is a regression in corner case. blktrace can still be enabled after del_gendisk() through ioctl if the disk is opened before del_gendisk(), and if blktrace is not shutdown through ioctl before closing the disk, debugfs entries will be leaked. Fix this problem by shutdown blktrace in disk_release(), this is safe because blk_trace_remove() is reentrant. Fixes: 99d055b4 ("block: remove per-disk debugfs files in blk_unregister_queue") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230610022003.2557284-4-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 14, 2023
-
-
Ed Tsai authored
commit f168420c ("blk-mq: don't redirect completion for hctx withs only one ctx mapping") When nvme applies a 1:1 mapping of hctx and ctx, there will be no remote request. But for ufs, the submission and completion queues could be asymmetric. (e.g. Multiple SQs share one CQ) Therefore, 1:1 mapping of hctx and ctx won't complete request on the submission cpu. In this situation, this nr_ctx check could violate the QUEUE_FLAG_SAME_FORCE, as a result, check on cpu id when there is only one ctx mapping. Signed-off-by:
Ed Tsai <ed.tsai@mediatek.com> Signed-off-by:
Po-Wen Kao <powen.kao@mediatek.com> Suggested-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230614002529.6636-1-ed.tsai@mediatek.com [axboe: fixed up indentation] Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 12, 2023
-
-
Yu Kuai authored
In __blk_mq_tag_busy/idle(), updating 'active_queues' and calculating 'wake_batch' is not atomic: t1: t2: _blk_mq_tag_busy blk_mq_tag_busy inc active_queues // assume 1->2 inc active_queues // 2 -> 3 blk_mq_update_wake_batch // calculate based on 3 blk_mq_update_wake_batch /* calculate based on 2, while active_queues is actually 3. */ Fix this problem by protecting them wih 'tags->lock', this is not a hot path, so performance should not be concerned. And now that all writers are inside the lock, switch 'actives_queues' from atomic to unsigned int. Fixes: 180dccb0 ("blk-mq: fix tag_get wait task can't be awakened") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Reviewed-by:
Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230610023043.2559121-1-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Store the file struct used as the holder in file->private_data as an indicator that this file descriptor was opened exclusively to remove the last use of FMODE_EXCL. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230608110258.189493-30-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Always use I_BDEV(file->f_mapping->host) to find the bdev for a file to free up file->private_data for other uses. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-29-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by:
Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
A few ioctl handlers have fmode_t arguments that are entirely unused, remove them. Signed-off-by:
Christoph Hellwig <hch@lst.de> Acked-by:
Christian Brauner <brauner@kernel.org> Reviewed-by:
Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230608110258.189493-27-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
All these helpers are only used in core block code, so move them out of the public header. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-26-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Instead of passing a fmode_t and only checking it for FMODE_WRITE, pass a bool open_for_write to prepare for callers that won't have the fmode_t. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-21-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Make the function name match the method name. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-11-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The mode argument to the ->release block_device_operation is never used, so remove it. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
->open is only called on the whole device. Make that explicit by passing a gendisk instead of the block_device. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
bdev_check_media_change should only ever be called for the whole device. Pass a gendisk to make that explicit and rename the function to disk_check_media_change. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-8-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
For whole devices ->open is called for each open, but for partitions it is only called on the first open of a partition, e.g.: open("/dev/vdb", ...) open("/dev/vdb", ...) - 2 call to ->open open("/dev/vdb1", ...) open("/dev/vdb", ...) - 2 call to ->open open("/dev/vdb", ...) open("/dev/vdb", ...) - just open call to ->open This is problematic as various block drivers look at open flags and might not do all the required setup if the earlier open was with an odd flag like O_NDELAY or the magic 3 ioctl-only open mode. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Phillip Potter <phil@philpotter.co.uk> Reviewed-by:
Hannes Reinecke <hare@suse.de> Acked-by:
Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-2-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 09, 2023
-
-
Christoph Hellwig authored
The previous rootwait fix added an -EINVAL return to a completely bogus superflous branch, fix this. Fixes: 1341c7d2 ("block: fix rootwait=") Reported-by:
Mark Brown <broonie@kernel.org> Signed-off-by:
Christoph Hellwig <hch@lst.de> Tested-by:
Fabio Estevam <festevam@gmail.com> Tested-by:
Marek Szyprowski <m.szyprowski@samsung.com> Tested-by:
Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230609051737.328930-1-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 07, 2023
-
-
Christoph Hellwig authored
Failures to look up the gendisk must return -ENODEV so that rootwait retries the lookup instead of -EINVAL which exits early. Fixes: cf056a43 ("init: improve the name_to_dev_t interface") Reported-by:
Fabio Estevam <festevam@gmail.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Tested-by:
Fabio Estevam <festevam@gmail.com> Link: https://lore.kernel.org/r/20230607135746.92995-1-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Waiman Long authored
When blkg_alloc() is called to allocate a blkcg_gq structure with the associated blkg_iostat_set's, there are 2 fields within blkg_iostat_set that requires proper initialization - blkg & sync. The former field was introduced by commit 3b8cc629 ("blk-cgroup: Optimize blkcg_rstat_flush()") while the later one was introduced by commit f7331648 ("blk-cgroup: reimplement basic IO stats using cgroup rstat"). Unfortunately those fields in the blkg_iostat_set's are not properly re-initialized when they are cleared in v1's blkcg_reset_stats(). This can lead to a kernel panic due to NULL pointer access of the blkg pointer. The missing initialization of sync is less problematic and can be a problem in a debug kernel due to missing lockdep initialization. Fix these problems by re-initializing them after memory clearing. Fixes: 3b8cc629 ("blk-cgroup: Optimize blkcg_rstat_flush()") Fixes: f7331648 ("blk-cgroup: reimplement basic IO stats using cgroup rstat") Signed-off-by:
Waiman Long <longman@redhat.com> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230606180724.2455066-1-longman@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
Recursive spin_lock/unlock_irq() is not safe, because spin_unlock_irq() will enable irq unconditionally: spin_lock_irq queue_lock -> disable irq spin_lock_irq ioc->lock spin_unlock_irq ioc->lock -> enable irq /* * AA dead lock will be triggered if current context is preempted by irq, * and irq try to hold queue_lock again. */ spin_unlock_irq queue_lock Fix this problem by using spin_lock/unlock() directly for 'ioc->lock'. Fixes: 5a0ac57c ("blk-ioc: protect ioc_destroy_icq() by 'queue_lock'") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230606011438.3743440-1-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hou Tao authored
Since commit a78418e6 ("block: Always initialize bio IO priority on submit"), bio->bi_ioprio will never be IOPRIO_CLASS_NONE when calling blkcg_set_ioprio(), so there will be no way to promote the io-priority of one cgroup to IOPRIO_CLASS_RT, because bi_ioprio will always be greater than or equals to IOPRIO_CLASS_RT. It seems possible to call blkcg_set_ioprio() first then try to initialize bi_ioprio later in bio_set_ioprio(), but this doesn't work for bio in which bi_ioprio is already initialized (e.g., direct-io), so introduce a new promote-to-rt policy to promote the iopriority of bio to IOPRIO_CLASS_RT if the ioprio is not already RT. For none-to-rt policy, although it doesn't work now, but considering that its purpose was also to override the io-priority to RT and allowing for a smoother transition, just keep it and treat it as an alias of the promote-to-rt policy. Acked-by:
Tejun Heo <tj@kernel.org> Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Hou Tao <houtao1@huawei.com> Reviewed-by:
Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20230428074404.280532-1-houtao@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 05, 2023
-
-
Li Nan authored
adjust_inuse_and_calc_cost() use spin_lock_irq() and IRQ will be enabled when unlock. DEADLOCK might happen if we have held other locks and disabled IRQ before invoking it. Fix it by using spin_lock_irqsave() instead, which can keep IRQ state consistent with before when unlock. ================================ WARNING: inconsistent lock state 5.10.0-02758-g8e5f91fd772f #26 Not tainted -------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. kworker/2:3/388 [HC0[0]:SC0[0]:HE0:SE1] takes: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390 {IN-HARDIRQ-W} state was registered at: __lock_acquire+0x3d7/0x1070 lock_acquire+0x197/0x4a0 __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x3b/0x60 bfq_idle_slice_timer_body bfq_idle_slice_timer+0x53/0x1d0 __run_hrtimer+0x477/0xa70 __hrtimer_run_queues+0x1c6/0x2d0 hrtimer_interrupt+0x302/0x9e0 local_apic_timer_interrupt __sysvec_apic_timer_interrupt+0xfd/0x420 run_sysvec_on_irqstack_cond sysvec_apic_timer_interrupt+0x46/0xa0 asm_sysvec_apic_timer_interrupt+0x12/0x20 irq event stamp: 837522 hardirqs last enabled at (837521): [<ffffffff84b9419d>] __raw_spin_unlock_irqrestore hardirqs last enabled at (837521): [<ffffffff84b9419d>] _raw_spin_unlock_irqrestore+0x3d/0x40 hardirqs last disabled at (837522): [<ffffffff84b93fa3>] __raw_spin_lock_irq hardirqs last disabled at (837522): [<ffffffff84b93fa3>] _raw_spin_lock_irq+0x43/0x50 softirqs last enabled at (835852): [<ffffffff84e00558>] __do_softirq+0x558/0x8ec softirqs last disabled at (835845): [<ffffffff84c010ff>] asm_call_irq_on_stack+0xf/0x20 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&bfqd->lock); <Interrupt> lock(&bfqd->lock); *** DEADLOCK *** 3 locks held by kworker/2:3/388: #0: ffff888107af0f38 ((wq_completion)kthrotld){+.+.}-{0:0}, at: process_one_work+0x742/0x13f0 #1: ffff8881176bfdd8 ((work_completion)(&td->dispatch_work)){+.+.}-{0:0}, at: process_one_work+0x777/0x13f0 #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390 stack backtrace: CPU: 2 PID: 388 Comm: kworker/2:3 Not tainted 5.10.0-02758-g8e5f91fd772f #26 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Workqueue: kthrotld blk_throtl_dispatch_work_fn Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x107/0x167 print_usage_bug valid_state mark_lock_irq.cold+0x32/0x3a mark_lock+0x693/0xbc0 mark_held_locks+0x9e/0xe0 __trace_hardirqs_on_caller lockdep_hardirqs_on_prepare.part.0+0x151/0x360 trace_hardirqs_on+0x5b/0x180 __raw_spin_unlock_irq _raw_spin_unlock_irq+0x24/0x40 spin_unlock_irq adjust_inuse_and_calc_cost+0x4fb/0x970 ioc_rqos_merge+0x277/0x740 __rq_qos_merge+0x62/0xb0 rq_qos_merge bio_attempt_back_merge+0x12c/0x4a0 blk_mq_sched_try_merge+0x1b6/0x4d0 bfq_bio_merge+0x24a/0x390 __blk_mq_sched_bio_merge+0xa6/0x460 blk_mq_sched_bio_merge blk_mq_submit_bio+0x2e7/0x1ee0 __submit_bio_noacct_mq+0x175/0x3b0 submit_bio_noacct+0x1fb/0x270 blk_throtl_dispatch_work_fn+0x1ef/0x2b0 process_one_work+0x83e/0x13f0 process_scheduled_works worker_thread+0x7e3/0xd80 kthread+0x353/0x470 ret_from_fork+0x1f/0x30 Fixes: b0853ab4 ("blk-iocost: revamp in-period donation snapbacks") Signed-off-by:
Li Nan <linan122@huawei.com> Acked-by:
Tejun Heo <tj@kernel.org> Reviewed-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230527091904.3001833-1-linan666@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
early_lookup_bdev is now only used during the early boot code as it should, so mark it __init to not waste run time memory on it. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230531125535.676098-25-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-