- Jul 15, 2019
-
-
Mauro Carvalho Chehab authored
There are lots of documents that belong to the admin-guide but are in random places (most under the Documentation root dir). Move them to the admin guide. Signed-off-by:
Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by:
Alexandre Belloni <alexandre.belloni@bootlin.com> Acked-by:
Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
-
Mauro Carvalho Chehab authored
Those files belong to the admin guide, so add them. Signed-off-by:
Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
-
Mauro Carvalho Chehab authored
Rename the block documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by:
Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
-
- Jul 12, 2019
-
-
Damien Le Moal authored
Limit the size of the struct blk_zone array used in blk_revalidate_disk_zones() to avoid memory allocation failures leading to disk revalidation failure. Also further reduce the likelihood of such failures by using kvcalloc() (which can fall back to vmalloc()) instead of allocating contiguous pages with alloc_pages(). Fixes: 515ce606 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: e76239a3 ("block: add a report_zones method") Cc: stable@vger.kernel.org Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
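A minimal sketch of the allocation pattern this change describes, using kvcalloc() so that a large zone array can transparently fall back to vmalloc(); the cap value and function name here are illustrative, not the actual blk_revalidate_disk_zones() code:

    #include <linux/kernel.h>
    #include <linux/blkdev.h>
    #include <linux/mm.h>

    /* Allocate a zone array without requiring physically contiguous
     * pages: kvcalloc() tries kmalloc first and falls back to vmalloc
     * for large sizes, so big arrays no longer fail outright. */
    static struct blk_zone *alloc_zone_array(unsigned int nr_zones)
    {
            /* Cap the array size (value illustrative) to bound memory use. */
            nr_zones = min_t(unsigned int, nr_zones, 4096);
            return kvcalloc(nr_zones, sizeof(struct blk_zone), GFP_NOIO);
    }

The array is later released with kvfree(), which handles both the kmalloc and vmalloc cases.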
-
Damien Le Moal authored
Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In preparation for using vmalloc() for the large report buffer and zone array allocations used by this function, remove its "gfp_t gfp_mask" argument and rely on the caller context to use memalloc_noio_save/restore() where necessary (block layer zone revalidation and the dm-zoned I/O error path). Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
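A hedged sketch of the caller-side pattern this change relies on: instead of passing a gfp_mask, the caller scopes GFP_NOIO with memalloc_noio_save()/memalloc_noio_restore() around the call. The blkdev_report_zones() signature without the gfp argument reflects this patch; the wrapper itself is illustrative:

    #include <linux/blkdev.h>
    #include <linux/sched/mm.h>

    static int report_zones_noio(struct block_device *bdev, sector_t sector,
                                 struct blk_zone *zones, unsigned int *nr_zones)
    {
            unsigned int noio_flag;
            int ret;

            /* All allocations below implicitly behave as GFP_NOIO. */
            noio_flag = memalloc_noio_save();
            ret = blkdev_report_zones(bdev, sector, zones, nr_zones);
            memalloc_noio_restore(noio_flag);

            return ret;
    }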
-
Damien Le Moal authored
To allow the SCSI subsystem scsi_execute_req() function to issue requests using large buffers that are better allocated with vmalloc() rather than kmalloc(), modify bio_map_kern() to allow passing a buffer allocated with vmalloc(). To do so, detect vmalloc-ed buffers using is_vmalloc_addr(). For vmalloc-ed buffers, flush the buffer using flush_kernel_vmap_range(), use vmalloc_to_page() instead of virt_to_page() to obtain the pages of the buffer, and invalidate the buffer addresses with invalidate_kernel_vmap_range() on completion of read BIOs. This last point is executed using the function bio_invalidate_vmalloc_pages() which is defined only if the architecture defines ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE, that is, if the architecture actually needs the invalidation done. Fixes: 515ce606 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation") Fixes: e76239a3 ("block: add a report_zones method") Cc: stable@vger.kernel.org Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
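A simplified sketch of the vmalloc handling described above; the real bio_map_kern() also allocates the bio and adds the pages, so this only illustrates the address-type detection and cache maintenance (the helper name is illustrative):

    #include <linux/mm.h>
    #include <linux/highmem.h>

    /* Translate one page of a kernel buffer that may be kmalloc-ed or
     * vmalloc-ed. vmalloc-ed data must be flushed before its pages are
     * handed to a bio. */
    static struct page *kern_buf_to_page(void *data, unsigned int len)
    {
            if (is_vmalloc_addr(data)) {
                    flush_kernel_vmap_range(data, len);
                    return vmalloc_to_page(data);
            }
            return virt_to_page(data);
    }

On completion of a read bio built from a vmalloc-ed buffer, the mapping is invalidated with invalidate_kernel_vmap_range(data, len), which is what bio_invalidate_vmalloc_pages() does on architectures that need it.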
-
Wenwen Wang authored
In bio_integrity_prep(), a kernel buffer is allocated through kmalloc() to hold integrity metadata. Later on, the buffer will be attached to the bio structure through bio_integrity_add_page(), which returns the number of bytes of integrity metadata attached. Due to unexpected situations, bio_integrity_add_page() may return 0. As a result, bio_integrity_prep() needs to be terminated with 'false' returned to indicate this error. However, the allocated kernel buffer is not freed on this execution path, leading to a memory leak. To fix this issue, free the allocated buffer before returning from bio_integrity_prep(). Reviewed-by:
Ming Lei <ming.lei@redhat.com> Acked-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
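A reduced sketch of the error path being fixed; the variable names (buf, bytes, offset) are assumptions and the surrounding bio_integrity_prep() logic is elided:

    /* Attach the integrity metadata pages to the bio. If this
     * unexpectedly attaches nothing, the kmalloc-ed buffer must be
     * freed before returning false, otherwise it leaks. */
    if (bio_integrity_add_page(bio, virt_to_page(buf), bytes, offset) == 0) {
            kfree(buf);             /* the missing free added by this fix */
            return false;
    }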
-
- Jul 10, 2019
-
-
Damien Le Moal authored
Simultaneously writing to a sequential zone of a zoned block device from multiple contexts requires mutual exclusion for BIO issuing to ensure that writes happen sequentially. However, even for a well behaved user correctly implementing such synchronization, BIO plugging may interfere and result in BIOs from the different contexts being reordered if plugging is done outside of the mutual exclusion section, e.g. the plug was started by a function higher in the call chain than the function issuing BIOs.

    Context A                                   Context B
    | blk_start_plug()
    | ...
    | seq_write_zone()
      | mutex_lock(zone)
      | bio-0->bi_iter.bi_sector = zone->wp
      | zone->wp += bio_sectors(bio-0)
      | submit_bio(bio-0)
      | bio-1->bi_iter.bi_sector = zone->wp
      | zone->wp += bio_sectors(bio-1)
      | submit_bio(bio-1)
      | mutex_unlock(zone)
      | return
    | -----------------------> | seq_write_zone()
                               | mutex_lock(zone)
                               | bio-2->bi_iter.bi_sector = zone->wp
                               | zone->wp += bio_sectors(bio-2)
                               | submit_bio(bio-2)
                               | mutex_unlock(zone)
    | <-------------------------
    |
    | blk_finish_plug()

In the above example, despite the mutex synchronization ensuring the correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being issued after BIO 2 of context B, when the plug is released with blk_finish_plug(). While this problem can be addressed using the blk_flush_plug_list() function (in the above example, the call must be inserted before the zone mutex lock is released), a simple generic solution in the block layer avoids this additional code in all zoned block device user code. The simple generic solution implemented with this patch is to introduce the internal helper function blk_mq_plug() to access the current context plug on BIO submission. This helper returns the current plug only if the target device is not a zoned block device or if the BIO to be plugged is not a write operation. Otherwise, the caller context plug is ignored and NULL is returned, resulting in writes to zoned block devices never being plugged. Signed-off-by:
Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
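A sketch of what the described blk_mq_plug() helper amounts to, based only on the behavior stated above (exact implementation details are an assumption):

    #include <linux/blkdev.h>
    #include <linux/sched.h>

    static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
                                               struct bio *bio)
    {
            /* Plugging is fine for reads, and for any BIO on a regular
             * (non-zoned) device. */
            if (!blk_queue_is_zoned(q) || !op_is_write(bio_op(bio)))
                    return current->plug;

            /* Never plug writes to zoned block devices, so the issuing
             * order established under the zone mutex is preserved. */
            return NULL;
    }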
-
Konstantin Khlebnikov authored
After commit 991f61fe ("Blk-throttle: reduce tail io latency when iops limit is enforced") the wait time could be zero even if the group is throttled and cannot issue requests right now. As a result throtl_select_dispatch() turns into a busy-loop under the irq-safe queue spinlock. The fix is simple: always round up the target time to the next throttle slice. Fixes: 991f61fe ("Blk-throttle: reduce tail io latency when iops limit is enforced") Signed-off-by:
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: stable@vger.kernel.org # v4.19+ Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
For large values of the number of zones reported and/or large zone sizes, the sector increment calculated with blk_queue_zone_sectors(q) * n in the blk_report_zones() loop can overflow the unsigned int type used for the calculation, as both "n" and the blk_queue_zone_sectors() value are unsigned int. E.g. for a device with 256 MB zones (524288 sectors), overflow happens with 8192 or more zones reported. Changing the return type of blk_queue_zone_sectors() to sector_t fixes this problem and avoids the overflow for all other callers of this helper too. The same change is also applied to the bdev_zone_sectors() helper. Fixes: e76239a3 ("block: add a report_zones method") Cc: stable@vger.kernel.org Signed-off-by:
Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
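A small illustration of the overflow being fixed, using the 256 MB zone example above:

    #include <linux/types.h>

    static sector_t zone_offset_example(void)
    {
            unsigned int zone_sectors = 524288;     /* 256 MB zones */
            unsigned int n = 8192;                  /* zones reported */

            /* zone_sectors * n == 2^32, so a 32-bit product wraps to 0.
             * With blk_queue_zone_sectors() returning sector_t, the
             * multiplication below is done in 64 bits instead. */
            return (sector_t)zone_sectors * n;
    }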
-
Tejun Heo authored
When a shared kthread needs to issue a bio for a cgroup, doing so synchronously can lead to priority inversions as the kthread can be trapped waiting for that cgroup. This patch implements REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing to a dedicated per-blkcg work item to avoid such priority inversions. This will be used to fix priority inversions in btrfs compression and should be generally useful as we grow filesystem support for comprehensive IO control. Cc: Chris Mason <clm@fb.com> Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Tejun Heo authored
btrfs is going to use css_put() and wbc helpers to improve cgroup writeback support. Add dummy css_get() definition and export wbc helpers to prepare for module and !CONFIG_CGROUP builds. Reported-by:
kbuild test robot <lkp@intel.com> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Josef Bacik authored
With the psi stuff in place we can use the memstall flag to indicate pressure that happens from throttling. Signed-off-by:
Josef Bacik <jbacik@fb.com> Signed-off-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Josef Bacik authored
We discovered a problem in newer kernels where a disconnect of a NBD device while the flush request was pending would result in a hang. This is because the blk mq timeout handler does if (!refcount_inc_not_zero(&rq->ref)) return true; to determine if it's ok to run the timeout handler for the request. Flush_rq's don't have a ref count set, so we'd skip running the timeout handler for this request and it would just sit there in limbo forever. Fix this by always setting the refcount of any request going through blk_rq_init() to 1. I tested this with a nbd-server that dropped flush requests to verify that it hung, and then tested with this patch to verify I got the timeout as expected and the error handling kicked in. Thanks, Signed-off-by:
Josef Bacik <josef@toxicpanda.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
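A hedged sketch of the one-line fix described above; the placement inside blk_rq_init() and the elided initialization are assumptions:

    #include <linux/blkdev.h>
    #include <linux/refcount.h>

    void blk_rq_init(struct request_queue *q, struct request *rq)
    {
            memset(rq, 0, sizeof(*rq));
            /* ... other field initialization elided ... */

            /* Give every request, including the pre-allocated flush_rq,
             * a reference count of 1 so the blk-mq timeout handler's
             * refcount_inc_not_zero() check does not skip it. */
            refcount_set(&rq->ref, 1);
    }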
-
- Jul 06, 2019
-
-
Greg Kroah-Hartman authored
When the blk-mq debugfs file creation logic was "cleaned up" it was cleaned up too much, causing the queue file to not be created in the correct location. It turns out the check for the directory being present is needed: if that has not happened yet, the files should not be created, and the function will be called later on in the initialization code so that the files can be created in the correct location. Fixes: 6cfc0081 ("blk-mq: no need to check return value of debugfs_create functions") Reported-by:
Stephen Rothwell <sfr@canb.auug.org.au> Cc: linux-block@vger.kernel.org Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jul 05, 2019
-
-
Dennis Zhou authored
The iolatency controller is based on rq_qos. It increments on rq_qos_throttle() and decrements on either rq_qos_cleanup() or rq_qos_done_bio(). a3fb01ba fixes the double accounting issue where blk_mq_make_request() may call both rq_qos_cleanup() and rq_qos_done_bio() on REQ_NO_WAIT. So checking STS_AGAIN prevents the double decrement. The above works upstream as the only way we can get STS_AGAIN is from blk_mq_get_request() failing. The STS_AGAIN handling isn't a real problem as bio_endio() skipping only happens on reserved tag allocation failures which can only be caused by driver bugs and already triggers WARN. However, the fix creates a not so great dependency on how STS_AGAIN can be propagated. Internally, we (Facebook) carry a patch that kills read ahead if a cgroup is io congested or a fatal signal is pending. This, combined with chained bios propagating their bi_status to the parent if it is not already set, can cause the parent bio to not clean up properly even though it was successful. This consequently leaks the inflight counter and can hang all IOs under that blkg. To nip the adverse interaction early, this removes the rq_qos_cleanup() callback in iolatency in favor of cleaning up always on the rq_qos_done_bio() path. Fixes: a3fb01ba ("blk-iolatency: only account submitted bios") Debugged-by:
Tejun Heo <tj@kernel.org> Debugged-by:
Josef Bacik <josef@toxicpanda.com> Signed-off-by:
Dennis Zhou <dennis@kernel.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jul 03, 2019
-
-
Christoph Hellwig authored
Fix a regression introduced when removing bi_phys_segments for Write Zeroes requests, which need to have a segment count of zero, as they don't have a payload. Fixes: 14ccb66b ("block: remove the bi_phys_segments field in struct bio") Reported-by:
Jens Axboe <axboe@kernel.dk> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Move the blk_mq_bio_to_request() call in front of the if-statement. Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Reviewed-by:
Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends on preemption being disabled for its correctness. Since removing the CPU preemption calls does not measurably affect performance, simplify the blk-mq code by removing the blk_mq_put_ctx() function and also by not disabling preemption in blk_mq_get_ctx(). Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jul 01, 2019
-
-
Ming Lei authored
'bio->bi_iter.bi_size' is 'unsigned int', which can hold at most 4G - 1 bytes. Before 07173c3e ("block: enable multipage bvecs"), one bio could include only a very limited number of pages, usually at most 256, so the fs bio size wouldn't be bigger than 1M bytes most of the time. Since we support multi-page bvecs, in theory more than 1M pages really can be added to one fs bio, especially in case of hugepages, or big writeback with too many dirty pages. Then there is a chance that .bi_size overflows. Fix this issue by using bio_full() to check whether the added segment may overflow .bi_size. Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com> Cc: kernel test robot <rong.a.chen@intel.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 07173c3e ("block: enable multipage bvecs") Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Ming Lei <ming.lei@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
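A sketch of the overflow guard described above, roughly mirroring what bio_full() checks; treat the helper body as an assumption:

    #include <linux/bio.h>
    #include <linux/kernel.h>

    /* Return true if adding 'len' more bytes would either run out of
     * bvec slots or overflow the 32-bit bi_size byte counter. */
    static inline bool bio_would_overflow(struct bio *bio, unsigned int len)
    {
            if (bio->bi_vcnt >= bio->bi_max_vecs)
                    return true;
            if (bio->bi_iter.bi_size > UINT_MAX - len)
                    return true;
            return false;
    }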
-
- Jun 29, 2019
-
-
Jonas Rabenstein authored
Check whether the shadow mbr fits in the provided space on the target. Although a proper firmware should handle this case and return an error, we may prevent problems or even damage with crappy firmwares. Signed-off-by:
Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by:
David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by:
Scott Bauer <sbauer@plzdonthack.me> Reviewed-by:
Jon Derrick <jonathan.derrick@intel.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jonas Rabenstein authored
Allow modification of the shadow mbr. If the shadow mbr is not marked as done, this data will be presented read only as the device content. Only after marking the shadow mbr as done and unlocking a locking range the actual content is accessible. Co-authored-by:
David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by:
Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by:
David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by:
Scott Bauer <sbauer@plzdonthack.me> Reviewed-by:
Jon Derrick <jonathan.derrick@intel.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jonas Rabenstein authored
Enable users to mark the shadow mbr as done without completely deactivating the shadow mbr feature. This may be useful on reboots, when the power to the disk is not disconnected in between and the shadow mbr stores the required boot files. Of course, this saves also the (few) commands required to enable the feature if it is already enabled and one only wants to mark the shadow mbr as done. Co-authored-by:
David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by:
Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by:
David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by:
Jon Derrick <jonathan.derrick@intel.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
If we pass pages through an iov_iter we always already have a reference in the caller. Thus remove the ITER_BVEC_FLAG_NO_REF and don't take reference to pages by default for bvec backed iov_iters. Reviewed-by:
Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by:
Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use bio_release_pages instead of open coding it. Reviewed-by:
Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use bio_release_pages instead of open coding it. Reviewed-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by:
Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
A lot of callers of bio_release_pages also want to mark the released pages as dirty. Add a mark_dirty parameter to avoid a second relatively expensive bio_for_each_segment_all loop. Reviewed-by:
Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
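A small usage sketch of the new mark_dirty parameter, in a hypothetical direct-I/O completion path:

    #include <linux/bio.h>

    static void dio_bio_complete_example(struct bio *bio)
    {
            /* Release the pinned pages; for reads, mark them dirty in the
             * same pass instead of walking the segments a second time. */
            bio_release_pages(bio, bio_data_dir(bio) == READ);
            bio_put(bio);
    }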
-
Christoph Hellwig authored
Move the BIO_NO_PAGE_REF check into bio_release_pages instead of duplicating it in both callers. Also make the function available outside of bio.c so that we can reuse it in other direct I/O implementations. Reviewed-by:
Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by:
Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Revanth Rajashekar authored
'who', an unsigned variable in structure opal_session_info, can never be less than zero. Hence, the condition "who < OPAL_ADMIN1" can never be true. Signed-off-by:
Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
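An illustration of why the removed check was dead code, assuming OPAL_ADMIN1 is the lowest (zero) enumerator as in the sed-opal UAPI; the validation shown is a reduced example:

    #include <linux/types.h>
    #include <linux/errno.h>
    #include <linux/sed-opal.h>

    static int check_who_example(u64 who)
    {
            /* 'who' is unsigned, so "who < OPAL_ADMIN1" (i.e. who < 0)
             * can never be true; only an upper-bound check has effect. */
            if (who > OPAL_USER9)
                    return -EINVAL;
            return 0;
    }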
-
Revanth Rajashekar authored
PSID is a 32 character password printed on the drive label, used to prove physical access to the drive. This PSID revert-TPer function is very useful to regain control over the drive when it is locked and the user can no longer access it because of some failures. However, *all the data on the drive is completely erased*. This method is advisable only when the user has exhausted all other recovery methods. PSID capabilities are described in: https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage-Opal_Feature_Set_PSID_v1.00_r1.00.pdf Signed-off-by:
Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 28, 2019
-
-
Douglas Anderson authored
In reboot tests on several devices we were seeing a "use after free" when slub_debug or KASAN was enabled. The kernel complained about: Unable to handle kernel paging request at virtual address 6b6b6c2b ...which is a classic sign of use after free under slub_debug. The stack crawl in kgdb looked like:

     0 test_bit (addr=<optimized out>, nr=<optimized out>)
     1 bfq_bfqq_busy (bfqq=<optimized out>)
     2 bfq_select_queue (bfqd=<optimized out>)
     3 __bfq_dispatch_request (hctx=<optimized out>)
     4 bfq_dispatch_request (hctx=<optimized out>)
     5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
     6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
     7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
     8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>)
     9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
    10 0xc024cff4 in worker_thread (__worker=0xec6d4640)

Digging in kgdb, it could be found that, though bfqq looked fine, bfqq->bic had been freed. Through further digging, I postulated that perhaps it is illegal to access a "bic" (AKA an "icq") after bfq_exit_icq() had been called because the "bic" can be freed at some point in time after this call is made. I confirmed that there certainly were cases where the exact crashing code path would access the "bic" after bfq_exit_icq() had been called. Specifically I set the "bfqq->bic" to (void *)0x7 and saw that the bic was 0x7 at the time of the crash. To understand a bit more about why this crash was fairly uncommon (I saw it only once in a few hundred reboots), you can see that much of the time bfq_exit_icq_bfqq() fully frees the bfqq and thus it can't access the ->bic anymore. The only case it doesn't is if bfq_put_queue() sees a reference still held. However, even in the case when bfqq isn't freed, the crash is still rare. Why? I tracked what happened to the "bic" after the exit routine. It doesn't get freed right away. Rather, put_io_context_active() eventually called put_io_context() which queued up freeing on a workqueue. The freeing then actually happened later than that through call_rcu(). Despite all these delays, some extra debugging showed that all the hoops could be jumped through in time and the memory could be freed causing the original crash. Phew! To make a long story short, assuming it truly is illegal to access an icq after the "exit_icq" callback is finished, this patch is needed. Cc: stable@vger.kernel.org Reviewed-by:
Paolo Valente <paolo.valente@unimore.it> Signed-off-by:
Douglas Anderson <dianders@chromium.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 27, 2019
-
-
Damien Le Moal authored
bio_flush_dcache_pages() is unused. Remove it. Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 26, 2019
-
-
Douglas Anderson authored
Some debug code suggested by Paolo was tripping when I did reboot stress tests. Specifically in bfq_bfqq_resume_state() "bic->saved_wr_start_at_switch_to_srt" was later than the current value of "jiffies". A bit of debugging showed that "bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more debugging showed that was because we had run through the "unlikely" case in the bfq_bfqq_save_state() function. Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to something sane. NOTE: this fixes no known real-world errors. Reviewed-by:
Paolo Valente <paolo.valente@linaro.org> Reviewed-by:
Guenter Roeck <groeck@chromium.org> Signed-off-by:
Douglas Anderson <dianders@chromium.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 25, 2019
-
-
Paolo Valente authored
By mistake, there is a '&' instead of a '==' in the definition of the macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with the correct one. Fixes: 7074f076 ("block, bfq: do not tag totally seeky queues as soft rt") Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
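A sketch of the one-character fix described above; the macro body and the seek_history field name are taken as assumptions about the BFQ code:

    /* Before: a bitwise AND with -1 is just seek_history itself, so the
     * macro was non-zero whenever any sample in the history was seeky. */
    /* #define BFQQ_TOTALLY_SEEKY(bfqq)     ((bfqq)->seek_history & -1) */

    /* After: an equality test, true only when every bit is set, i.e.
     * every sample in the history was seeky. */
    #define BFQQ_TOTALLY_SEEKY(bfqq)        ((bfqq)->seek_history == -1)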
-
Paolo Valente authored
Consider, on one side, a bfq_queue Q that remains empty while in service, and, on the other side, the pending I/O of bfq_queues that, according to their timestamps, have to be served after Q. If an uncontrolled amount of I/O from the latter bfq_queues were dispatched while Q is waiting for its new I/O to arrive, then Q's bandwidth guarantees would be violated. To prevent this, I/O dispatch is plugged until Q receives new I/O (except for a properly controlled amount of injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging, for the following reason. Preemption is performed in two steps. First, Q is expired and re-scheduled. Second, the new bfq_queue to serve is chosen. The first step is needed by the second, as the second can be performed only after Q's timestamps have been properly updated (done in the expiration step), and Q has been re-queued for service. This dependency is a consequence of the way how BFQ's scheduling algorithm is currently implemented. But Q is not re-scheduled at all in the first step, because Q is empty. As a consequence, an uncontrolled amount of I/O may be dispatched until Q becomes non empty again. This breaks Q's service guarantees. This commit addresses this issue by re-scheduling Q even if it is empty. This in turn breaks the assumption that all scheduled queues are non empty. Then a few extra checks are now needed. Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Paolo Valente authored
BFQ enqueues the I/O coming from each process into a separate bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be served for at most timeout_sync milliseconds (default: 125 ms). This service scheme is prone to the following inaccuracy. While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may receive I/O, and, according to BFQ's scheduling policy, may become the right bfq_queue to serve, in place of the currently in-service bfq_queue. In this respect, postponing the service of Q2 to after the service of Q1 finishes may delay the completion of Q2's I/O, compared with an ideal service in which all non-empty bfq_queues are served in parallel, and every non-empty bfq_queue is served at a rate proportional to the bfq_queue's weight. This additional delay is equal at most to the time Q1 may unjustly remain in service before switching to Q2. If Q1 and Q2 have the same weight, then this time is most likely negligible compared with the completion time to be guaranteed to Q2's I/O. In addition, first, one of the reasons why BFQ may want to serve Q1 for a while is that this boosts throughput and, second, serving Q1 longer reduces BFQ's overhead. As a conclusion, it is usually better not to preempt Q1 if both Q1 and Q2 have the same weight. In contrast, as Q2's weight or priority becomes higher and higher compared with that of Q1, the above delay becomes larger and larger, compared with the I/O completion times that have to be guaranteed to Q2 according to Q2's weight. So reducing this delay may be more important than avoiding the costs of preempting Q1. Accordingly, this commit preempts Q1 if Q2 has a higher weight or a higher priority than Q1. Preemption causes Q1 to be re-scheduled, and triggers a new choice of the next bfq_queue to serve. If Q2 really is the next bfq_queue to serve, then Q2 will be set in service immediately. This change reduces the component of the I/O latency caused by the above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5 SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for a process doing sporadic random reads while another process is doing continuous sequential reads. Signed-off-by:
Nicola Bottura <bottura.nicola95@gmail.com> Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Paolo Valente authored
A bfq_queue Q may happen to be synchronized with another bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to receive new I/O. We call Q2 "waker queue". If I/O plugging is being performed for Q, and Q is not receiving any more I/O because of the above synchronization, then, thanks to BFQ's injection mechanism, the waker queue is likely to get served before the I/O-plugging timeout fires. Unfortunately, this fact may not be sufficient to guarantee a high throughput during the I/O plugging, because the inject limit for Q may be too low to guarantee a lot of injected I/O. In addition, the duration of the plugging, i.e., the time before Q finally receives new I/O, may not be minimized, because the waker queue may happen to be served only after other queues. To address these issues, this commit introduces the explicit detection of the waker queue, and the unconditional injection of a pending I/O request of the waker queue on each invocation of bfq_dispatch_request(). One may be concerned that this systematic injection of I/O from the waker queue delays the service of Q's I/O. Fortunately, it doesn't. On the contrary, next Q's I/O is brought forward dramatically, for it is not blocked for milliseconds. Reported-by:
Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by:
Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Paolo Valente authored
Until the base value for request service times gets finally computed for a bfq_queue, the inject limit for that queue does depend on the think-time state (short|long) of the queue. A timely update of the think time then guarantees a quicker activation or deactivation of the injection. Fortunately, the think time of a bfq_queue is updated in the same code path as the inject limit; but after the inject limit. This commit moves the update of the think time before the update of the inject limit. For coherence, it moves the update of the seek time too. Reported-by:
Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by:
Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Paolo Valente authored
I/O injection gets reduced if it increases the request service times of the victim queue beyond a certain threshold. The threshold, in its turn, is computed as a function of the base service time enjoyed by the queue when it undergoes no injection. As a consequence, for injection to work properly, the above base value has to be accurate. In this respect, such a value may vary over time. For example, it varies if the size or the spatial locality of the I/O requests in the queue change. It is then important to update this value whenever possible. This commit performs this update. Reported-by:
Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by:
Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Paolo Valente authored
One of the cases where the parameters for injection may be updated is when there are no more in-flight I/O requests. The number of in-flight requests is stored in the field bfqd->rq_in_driver of the descriptor bfqd of the device. So, the controlled condition is bfqd->rq_in_driver == 0. Unfortunately, this is wrong because the instruction that checks this condition is in the code path that handles the completion of a request, and, in particular, the instruction is executed before bfqd->rq_in_driver is decremented in such a code path. This commit fixes this issue by just replacing 0 with 1 in the comparison. Reported-by:
Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by:
Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
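A hedged sketch of the corrected condition: in the completion path the counter has not yet been decremented for the completing request, so "no other request in flight" is rq_in_driver == 1 rather than 0 (the surrounding update call is a hypothetical helper name):

    /* Called while completing a request, before bfqd->rq_in_driver is
     * decremented for that request. */
    if (bfqd->rq_in_driver == 1)    /* the completing request is the only one in flight */
            update_injection_parameters(bfqd, bfqq);    /* hypothetical helper */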
-