- Dec 15, 2016
-
-
Jan Kara authored
Currently we never clear dirty tags in DAX mappings and thus address ranges to flush accumulate. Now that we have locking of radix tree entries, we have all the locking necessary to reliably clear the radix tree dirty tag when flushing caches for corresponding address range. Similarly to page_mkclean() we also have to write-protect pages to get a page fault when the page is next written to so that we can mark the entry dirty again. Link: http://lkml.kernel.org/r/1479460644-25076-21-git-send-email-jack@suse.cz Signed-off-by:
Jan Kara <jack@suse.cz> Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
Currently PTE gets updated in wp_pfn_shared() after dax_pfn_mkwrite() has released corresponding radix tree entry lock. When we want to writeprotect PTE on cache flush, we need PTE modification to happen under radix tree entry lock to ensure consistent updates of PTE and radix tree (standard faults use page lock to ensure this consistency). So move update of PTE bit into dax_pfn_mkwrite(). Link: http://lkml.kernel.org/r/1479460644-25076-20-git-send-email-jack@suse.cz Signed-off-by:
Jan Kara <jack@suse.cz> Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
Currently, flushing of caches for DAX mappings was ignoring entry lock. So far this was ok (modulo a bug that a difference in entry lock could cause cache flushing to be mistakenly skipped) but in the following patches we will write-protect PTEs on cache flushing and clear dirty tags. For that we will need more exclusion. So do cache flushing under an entry lock. This allows us to remove one lock-unlock pair of mapping->tree_lock as a bonus. Link: http://lkml.kernel.org/r/1479460644-25076-19-git-send-email-jack@suse.cz Signed-off-by:
Jan Kara <jack@suse.cz> Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
Move final handling of COW faults from generic code into DAX fault handler. That way generic code doesn't have to be aware of peculiarities of DAX locking so remove that knowledge and make locking functions private to fs/dax.c. Link: http://lkml.kernel.org/r/1479460644-25076-11-git-send-email-jack@suse.cz Signed-off-by:
Jan Kara <jack@suse.cz> Acked-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
Every single user of vmf->virtual_address typed that entry to unsigned long before doing anything with it so the type of virtual_address does not really provide us any additional safety. Just use masked vmf->address which already has the appropriate type. Link: http://lkml.kernel.org/r/1479460644-25076-3-git-send-email-jack@suse.cz Signed-off-by:
Jan Kara <jack@suse.cz> Acked-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Dec 13, 2016
-
-
Johannes Weiner authored
Support handing __radix_tree_replace() a callback that gets invoked for all leaf nodes that change or get freed as a result of the slot replacement, to assist users tracking nodes with node->private_list. This prepares for putting page cache shadow entries into the radix tree root again and drastically simplifying the shadow tracking. Link: http://lkml.kernel.org/r/20161117193134.GD23430@cmpxchg.org Signed-off-by:
Johannes Weiner <hannes@cmpxchg.org> Suggested-by:
Jan Kara <jack@suse.cz> Reviewed-by:
Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
The bug in khugepaged fixed earlier in this series shows that radix tree slot replacement is fragile; and it will become more so when not only NULL<->!NULL transitions need to be caught but transitions from and to exceptional entries as well. We need checks. Re-implement radix_tree_replace_slot() on top of the sanity-checked __radix_tree_replace(). This requires existing callers to also pass the radix tree root, but it'll warn us when somebody replaces slots with contents that need proper accounting (transitions between NULL entries, real entries, exceptional entries) and where a replacement through the slot pointer would corrupt the radix tree node counts. Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org Signed-off-by:
Johannes Weiner <hannes@cmpxchg.org> Suggested-by:
Jan Kara <jack@suse.cz> Reviewed-by:
Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
The way the page cache is sneaking shadow entries of evicted pages into the radix tree past the node entry accounting and tracking them manually in the upper bits of node->count is fraught with problems. These shadow entries are marked in the tree as exceptional entries, which are a native concept to the radix tree. Maintain an explicit counter of exceptional entries in the radix tree node. Subsequent patches will switch shadow entry tracking over to that counter. DAX and shmem are the other users of exceptional entries. Since slot replacements that change the entry type from regular to exceptional must now be accounted, introduce a __radix_tree_replace() function that does replacement and accounting, and switch DAX and shmem over. The increase in radix tree node size is temporary. A followup patch switches the shadow tracking to this new scheme and we'll no longer need the upper bits in node->count and shrink that back to one byte. Link: http://lkml.kernel.org/r/20161117192945.GA23430@cmpxchg.org Signed-off-by:
Johannes Weiner <hannes@cmpxchg.org> Reviewed-by:
Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
Commit 642261ac: "dax: add struct iomap based DAX PMD support" has introduced unmapping of page tables if huge page needs to be split in grab_mapping_entry(). However the unmapping happens after radix_tree_preload() call which disables preemption and thus unmap_mapping_range() tries to acquire i_mmap_lock in atomic context which is a bug. Fix the problem by moving unmapping before radix_tree_preload() call. Fixes: 642261ac Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Theodore Ts'o <tytso@mit.edu>
-
- Nov 21, 2016
-
-
Jan Kara authored
No one uses functions using the get_block callback anymore. Rip them out and update documentation. Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Theodore Ts'o <tytso@mit.edu>
-
- Nov 09, 2016
-
-
Jan Kara authored
Introduce a flag telling iomap operations whether they are handling a fault or other IO. That may influence behavior wrt inode size and similar things. Signed-off-by:
Jan Kara <jack@suse.cz> Reviewed-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
- Nov 08, 2016
-
-
Ross Zwisler authored
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based locking. This patch allows DAX PMDs to participate in the DAX radix tree based locking scheme so that they can be re-enabled using the new struct iomap based fault handlers. There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX mappings that have an associated block allocation, and 4k DAX empty entries. The empty entries exist to provide locking for the duration of a given page fault. This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP) entries, PMD DAX entries that have associated block allocations, and 2 MiB DAX empty entries. Unlike the 4k case where we insert a struct page* into the radix tree for 4k zero pages, for HZP we insert a DAX exceptional entry with the new RADIX_DAX_HZP flag set. This is because we use a single 2 MiB zero page in every 2MiB hole mapping, and it doesn't make sense to have that same struct page* with multiple entries in multiple trees. This would cause contention on the single page lock for the one Huge Zero Page, and it would break the page->index and page->mapping associations that are assumed to be valid in many other places in the kernel. One difficult use case is when one thread is trying to use 4k entries in radix tree for a given offset, and another thread is using 2 MiB entries for that same offset. The current code handles this by making the 2 MiB user fall back to 4k entries for most cases. This was done because it is the simplest solution, and because the use of 2MiB pages is already opportunistic. If we were to try to upgrade from 4k pages to 2MiB pages for a given range, we run into the problem of how we lock out 4k page faults for the entire 2MiB range while we clean out the radix tree so we can insert the 2MiB entry. We can solve this problem if we need to, but I think that the cases where both 2MiB entries and 4K entries are being used for the same range will be rare enough and the gain small enough that it probably won't be worth the complexity. Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Ross Zwisler authored
No functional change. The static functions put_locked_mapping_entry() and put_unlocked_mapping_entry() will soon be used in error cases in grab_mapping_entry(), so move their definitions above this function. Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Ross Zwisler authored
The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in mm/filemap.c. When we add PMD support, though, mm/filemap.c will also need access to the RADIX_DAX_PTE type so it can properly construct a 4k sized empty entry. Instead of shifting the defines between dax.c and dax.h as they are individually used in other code, just move them wholesale to dax.h so they'll be available when we need them. Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Ross Zwisler authored
Currently iomap_end() doesn't do anything for DAX page faults for both ext2 and XFS. ext2_iomap_end() just checks for a write underrun, and xfs_file_iomap_end() checks to see if it needs to finish a delayed allocation. However, in the future iomap_end() calls might be needed to make sure we have balanced allocations, locks, etc. So, add calls to iomap_end() with appropriate error handling to dax_iomap_fault(). Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by:
Jan Kara <jack@suse.cz> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Ross Zwisler authored
To be able to correctly calculate the sector from a file position and a struct iomap there is a complex little bit of logic that currently happens in both dax_iomap_actor() and dax_iomap_fault(). This will need to be repeated yet again in the DAX PMD fault handler when it is added, so break it out into a helper function. Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Ross Zwisler authored
The recently added DAX functions that use the new struct iomap data structure were named iomap_dax_rw(), iomap_dax_fault() and iomap_dax_actor(). These are actually defined in fs/dax.c, though, so should be part of the "dax" namespace and not the "iomap" namespace. Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor() respectively. Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by:
Dave Chinner <david@fromorbit.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Ross Zwisler authored
dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX fault handler. This fault handler has been disabled for several kernel releases, and support for PMDs will be reintroduced using the struct iomap interface instead. Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Ross Zwisler authored
DAX radix tree locking currently locks entries based on the unique combination of the 'mapping' pointer and the pgoff_t 'index' for the entry. This works for PTEs, but as we move to PMDs we will need to have all the offsets within the range covered by the PMD to map to the same bit lock. To accomplish this, for ranges covered by a PMD entry we will instead lock based on the page offset of the beginning of the PMD entry. The 'mapping' pointer is still used in the same way. Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Ross Zwisler authored
No functional change. Consistently use the variable name 'entry' instead of 'ret' for DAX radix tree entries. This was already happening in most of the code, so update get_unlocked_mapping_entry(), grab_mapping_entry() and dax_unlock_mapping_entry(). Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Ross Zwisler authored
Don't take down the kernel if we get an invalid 'from' and 'length' argument pair. Just warn once and return an error. Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Ross Zwisler authored
The global 'wait_table' variable is only used within fs/dax.c, and generates the following sparse warning: fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static? Make it static so it has scope local to fs/dax.c, and to make sparse happy. Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Ross Zwisler authored
Now that ext4 properly sets bh.b_size when we call get_block() for a hole, rely on that value and remove the buffer_size_valid() sanity check. Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Jan Kara <jack@suse.cz> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
- Oct 08, 2016
-
-
Aaron Lu authored
The global zero page is used to satisfy an anonymous read fault. If THP(Transparent HugePage) is enabled then the global huge zero page is used. The global huge zero page uses an atomic counter for reference counting and is allocated/freed dynamically according to its counter value. CPU time spent on that counter will greatly increase if there are a lot of processes doing anonymous read faults. This patch proposes a way to reduce the access to the global counter so that the CPU load can be reduced accordingly. To do this, a new flag of the mm_struct is introduced: MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only need to touch the global counter in two cases: 1 The first time it uses the global huge zero page; 2 The time when mm_user of its mm_struct reaches zero. Note that right now, the huge zero page is eligible to be freed as soon as its last use goes away. With this patch, the page will not be eligible to be freed until the exit of the last process from which it was ever used. And with the use of mm_user, the kthread is not eligible to use huge zero page either. Since no kthread is using huge zero page today, there is no difference after applying this patch. But if that is not desired, I can change it to when mm_count reaches zero. Case used for test on Haswell EP: usemem -n 72 --readonly -j 0x200000 100G Which spawns 72 processes and each will mmap 100G anonymous space and then do read only access to that space sequentially with a step of 2MB. CPU cycles from perf report for base commit: 54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page CPU cycles from perf report for this commit: 0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page Performance(throughput) of the workload for base commit: 1784430792 Performance(throughput) of the workload for this commit: 4726928591 164% increase. Runtime of the workload for base commit: 707592 us Runtime of the workload for this commit: 303970 us 50% drop. Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com Signed-off-by:
Aaron Lu <aaron.lu@intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Sep 19, 2016
-
-
Christoph Hellwig authored
Very similar to the existing dax_fault function, but instead of using the get_block callback we rely on the iomap_ops vector from iomap.c. That also avoids having to do two calls into the file system for write faults. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Christoph Hellwig authored
This is a much simpler implementation of the DAX read/write path that makes use of the iomap infrastructure. It does not try to mirror the direct I/O calling conventions and thus doesn't have to deal with i_dio_count or the end_io handler, but instead leaves locking and filesystem-specific I/O completion to the caller. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Christoph Hellwig authored
This way we can use this helper for the iomap based DAX implementation as well. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
Christoph Hellwig authored
This way we can use this helper for the iomap based DAX implementation as well. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Dave Chinner <david@fromorbit.com>
-
- Jul 26, 2016
-
-
Ross Zwisler authored
Remove the unused wrappers dax_fault() and dax_pmd_fault(). After this removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and dax_pmd_fault() respectively, and update all callers. The dax_fault() and dax_pmd_fault() wrappers were initially intended to capture some filesystem independent functionality around page faults (calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime and ctime). However, the following commits: 5726b27b ("ext2: Add locking for DAX faults") ea3d7209 ("ext4: fix races between page faults and hole punching") added locking to the ext2 and ext4 filesystems after these common operations but before __dax_fault() and __dax_pmd_fault() were called. This means that these wrappers are no longer used, and are unlikely to be used in the future. XFS has had locking analogous to what was recently added to ext2 and ext4 since DAX support was initially introduced by: 6b698ede ("xfs: add DAX file operations support") Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by:
Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Jul 13, 2016
-
-
Dan Williams authored
The __pmem address space was meant to annotate codepaths that touch persistent memory and need to coordinate a call to wmb_pmem(). Now that wmb_pmem() is gone, there is little need to keep this annotation. Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com>
-
- Jul 12, 2016
-
-
Dan Williams authored
Flushing posted-write queues is now deferred to REQ_FLUSH context, or otherwise handled by an ADR event at the platform level. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com>
-
- Jun 27, 2016
-
-
Eric Sandeen authored
This isn't functionally apparent for some reason, but when we test io at extreme offsets at the end of the loff_t rang, such as in fstests xfs/071, the calculation of "max" in dax_io() can be wrong due to pos + size overflowing. For example, # xfs_io -c "pwrite 9223372036854771712 512" /mnt/test/file enters dax_io with: start 0x7ffffffffffff000 end 0x7ffffffffffff200 and the rounded up "size" variable is 0x1000. This yields: pos + size 0x8000000000000000 (overflows loff_t) end 0x7ffffffffffff200 Due to the overflow, the min() function picks the wrong value for the "max" variable, and when we send (max - pos) into i.e. copy_from_iter_pmem() it is also the wrong value. This somehow(tm) gets magically absorbed without incident, probably because iter->count is correct. But it seems best to fix it up properly by comparing the two values as unsigned. Signed-off-by:
Eric Sandeen <sandeen@redhat.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com>
-
- May 21, 2016
-
-
NeilBrown authored
These don't belong in radix-tree.h any more than PAGECACHE_TAG_* do. Let's try to maintain the idea that radix-tree simply implements an abstract data type. Signed-off-by:
NeilBrown <neilb@suse.com> Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- May 19, 2016
-
-
Jan Kara authored
Currently faults are protected against truncate by filesystem specific i_mmap_sem and page lock in case of hole page. Cow faults are protected DAX radix tree entry locking. So there's no need for i_mmap_lock in DAX code. Remove it. Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com>
-
Jan Kara authored
When doing cow faults, we cannot directly fill in PTE as we do for other faults as we rely on generic code to do proper accounting of the cowed page. We also have no page to lock to protect against races with truncate as other faults have and we need the protection to extend until the moment generic code inserts cowed page into PTE thus at that point we have no protection of fs-specific i_mmap_sem. So far we relied on using i_mmap_lock for the protection however that is completely special to cow faults. To make fault locking more uniform use DAX entry lock instead. Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com>
-
Jan Kara authored
Currently DAX page fault locking is racy. CPU0 (write fault) CPU1 (read fault) __dax_fault() __dax_fault() get_block(inode, block, &bh, 0) -> not mapped get_block(inode, block, &bh, 0) -> not mapped if (!buffer_mapped(&bh)) if (vmf->flags & FAULT_FLAG_WRITE) get_block(inode, block, &bh, 1) -> allocates blocks if (page) -> no if (!buffer_mapped(&bh)) if (vmf->flags & FAULT_FLAG_WRITE) { } else { dax_load_hole(); } dax_insert_mapping() And we are in a situation where we fail in dax_radix_entry() with -EIO. Another problem with the current DAX page fault locking is that there is no race-free way to clear dirty tag in the radix tree. We can always end up with clean radix tree and dirty data in CPU cache. We fix the first problem by introducing locking of exceptional radix tree entries in DAX mappings acting very similarly to page lock and thus synchronizing properly faults against the same mapping index. The same lock can later be used to avoid races when clearing radix tree dirty tag. Reviewed-by:
NeilBrown <neilb@suse.com> Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com>
-
Jan Kara authored
We will use lowest available bit in the radix tree exceptional entry for locking of the entry. Define it. Also clean up definitions of DAX entry type bits in DAX exceptional entries to use defined constants instead of hardcoding numbers and cleanup checking of these bits to not rely on how other bits in the entry are set. Reviewed-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com>
-
Jan Kara authored
Currently the handling of huge pages for DAX is racy. For example the following can happen: CPU0 (THP write fault) CPU1 (normal read fault) __dax_pmd_fault() __dax_fault() get_block(inode, block, &bh, 0) -> not mapped get_block(inode, block, &bh, 0) -> not mapped if (!buffer_mapped(&bh) && write) get_block(inode, block, &bh, 1) -> allocates blocks truncate_pagecache_range(inode, lstart, lend); dax_load_hole(); This results in data corruption since process on CPU1 won't see changes into the file done by CPU0. The race can happen even if two normal faults race however with THP the situation is even worse because the two faults don't operate on the same entries in the radix tree and we want to use these entries for serialization. So make THP support in DAX code depend on CONFIG_BROKEN for now. Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com>
-
Jan Kara authored
Currently dax_pmd_fault() decides to fill a PMD-sized hole only if returned buffer has BH_Uptodate set. However that doesn't get set for any mapping buffer so that branch is actually a dead code. The BH_Uptodate check doesn't make any sense so just remove it. Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Ross Zwisler <ross.zwisler@linux.intel.com>
-
- May 18, 2016
-
-
Vishal Verma authored
The distinction between PAGE_SIZE and PAGE_CACHE_SIZE was removed in 09cbfeaf mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros The comments for the above functions described a distinction between those, that is now redundant, so remove those paragraphs Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Vishal Verma <vishal.l.verma@intel.com>
-