  1. Dec 12, 2022
    • fsdax: invalidate pages when CoW · f80e1668
      Shiyang Ruan authored
      CoW changes the share state of a dax page, but the share count of the page
      isn't updated.  The next time this page is accessed, it should be treated
      as newly accessed, but the old association still exists.  So we need to
      clear the share state when CoW happens, in both dax_iomap_rw() and
      dax_zero_iter().
      
      Link: https://lkml.kernel.org/r/1669908538-55-3-git-send-email-ruansy.fnst@fujitsu.com
      
      
      Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fsdax: introduce page->share for fsdax in reflink mode · 16900426
      Shiyang Ruan authored
      Patch series "fsdax,xfs: fix warning messages", v2.
      
      Many test cases, such as generic/051, generic/075 and generic/127, failed
      in dax+reflink mode with a warning message in dmesg.  The warning looks
      like this:
      [  775.509337] ------------[ cut here ]------------
      [  775.509636] WARNING: CPU: 1 PID: 16815 at fs/dax.c:386 dax_insert_entry.cold+0x2e/0x69
      [  775.510151] Modules linked in: auth_rpcgss oid_registry nfsv4 algif_hash af_alg af_packet nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter ip_tables x_tables dax_pmem nd_pmem nd_btt sch_fq_codel configfs xfs libcrc32c fuse
      [  775.524288] CPU: 1 PID: 16815 Comm: fsx Kdump: loaded Tainted: G        W          6.1.0-rc4+ #164 eb34e4ee4200c7cbbb47de2b1892c5a3e027fd6d
      [  775.524904] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.0-3-3 04/01/2014
      [  775.525460] RIP: 0010:dax_insert_entry.cold+0x2e/0x69
      [  775.525797] Code: c7 c7 18 eb e0 81 48 89 4c 24 20 48 89 54 24 10 e8 73 6d ff ff 48 83 7d 18 00 48 8b 54 24 10 48 8b 4c 24 20 0f 84 e3 e9 b9 ff <0f> 0b e9 dc e9 b9 ff 48 c7 c6 a0 20 c3 81 48 c7 c7 f0 ea e0 81 48
      [  775.526708] RSP: 0000:ffffc90001d57b30 EFLAGS: 00010082
      [  775.527042] RAX: 000000000000002a RBX: 0000000000000000 RCX: 0000000000000042
      [  775.527396] RDX: ffffea000a0f6c80 RSI: ffffffff81dfab1b RDI: 00000000ffffffff
      [  775.527819] RBP: ffffea000a0f6c40 R08: 0000000000000000 R09: ffffffff820625e0
      [  775.528241] R10: ffffc90001d579d8 R11: ffffffff820d2628 R12: ffff88815fc98320
      [  775.528598] R13: ffffc90001d57c18 R14: 0000000000000000 R15: 0000000000000001
      [  775.528997] FS:  00007f39fc75d740(0000) GS:ffff88817bc80000(0000) knlGS:0000000000000000
      [  775.529474] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  775.529800] CR2: 00007f39fc772040 CR3: 0000000107eb6001 CR4: 00000000003706e0
      [  775.530214] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  775.530592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  775.531002] Call Trace:
      [  775.531230]  <TASK>
      [  775.531444]  dax_fault_iter+0x267/0x6c0
      [  775.531719]  dax_iomap_pte_fault+0x198/0x3d0
      [  775.532002]  __xfs_filemap_fault+0x24a/0x2d0 [xfs aa8d25411432b306d9554da38096f4ebb86bdfe7]
      [  775.532603]  __do_fault+0x30/0x1e0
      [  775.532903]  do_fault+0x314/0x6c0
      [  775.533166]  __handle_mm_fault+0x646/0x1250
      [  775.533480]  handle_mm_fault+0xc1/0x230
      [  775.533810]  do_user_addr_fault+0x1ac/0x610
      [  775.534110]  exc_page_fault+0x63/0x140
      [  775.534389]  asm_exc_page_fault+0x22/0x30
      [  775.534678] RIP: 0033:0x7f39fc55820a
      [  775.534950] Code: 00 01 00 00 00 74 99 83 f9 c0 0f 87 7b fe ff ff c5 fe 6f 4e 20 48 29 fe 48 83 c7 3f 49 8d 0c 10 48 83 e7 c0 48 01 fe 48 29 f9 <f3> a4 c4 c1 7e 7f 00 c4 c1 7e 7f 48 20 c5 f8 77 c3 0f 1f 44 00 00
      [  775.535839] RSP: 002b:00007ffc66a08118 EFLAGS: 00010202
      [  775.536157] RAX: 00007f39fc772001 RBX: 0000000000042001 RCX: 00000000000063c1
      [  775.536537] RDX: 0000000000006400 RSI: 00007f39fac42050 RDI: 00007f39fc772040
      [  775.536919] RBP: 0000000000006400 R08: 00007f39fc772001 R09: 0000000000042000
      [  775.537304] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
      [  775.537694] R13: 00007f39fc772000 R14: 0000000000006401 R15: 0000000000000003
      [  775.538086]  </TASK>
      [  775.538333] ---[ end trace 0000000000000000 ]---
      
      This also affects dax+noreflink mode if we run the test after a
      dax+reflink test.  So, the most urgent thing is solving the warning
      messages.
      
      With these fixes, most warning messages in dax_associate_entry() are gone.
      However, generic/388 still fails randomly with the warning.  That test
      repeatedly shuts down the XFS filesystem while fsstress is running.  I
      think the reason is that dax pages in use cannot be invalidated in time
      when the filesystem is shut down.  The next time such a dax page is
      associated, it still holds the mapping value set last time.  I'll keep
      working on it.
      
      The warning message in dax_writeback_one() can also be fixed because of
      the dax unshare.
      
      
      This patch (of 8):
      
      An fsdax page is used not only for CoW but also for mapread.  To make
      this easier to understand, use 'share' to indicate that the dax page is
      shared by more than one extent, and add helper functions to use it.
      
      Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.
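
The share-counting idea can be sketched in userspace C. This is only a rough model, not the kernel's code: `struct model_page`, `MODEL_DAX_SHARED`, and the `model_share_*` helpers are hypothetical names invented for illustration, and the rule that a page with an existing single owner starts at one share is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two struct page fields that matter here. */
struct model_page {
	const void *mapping;  /* owning mapping, or the shared marker */
	unsigned long share;  /* number of extents sharing this page */
};

/* Sentinel playing the role of a "shared" flag; its value is arbitrary. */
static const char model_shared_marker = 0;
#define MODEL_DAX_SHARED ((const void *)&model_shared_marker)

static bool model_page_is_shared(const struct model_page *page)
{
	return page->mapping == MODEL_DAX_SHARED;
}

/* Record one more extent referencing the page. */
static void model_share_get(struct model_page *page)
{
	if (!model_page_is_shared(page)) {
		/* Assumption: a page with a single prior owner counts as one share. */
		page->share = page->mapping ? 1 : 0;
		page->mapping = MODEL_DAX_SHARED;
	}
	page->share++;
}

/* Drop one reference and return how many remain. */
static unsigned long model_share_put(struct model_page *page)
{
	assert(page->share > 0);
	return --page->share;
}
```

In this model a caller would tear down the association only when `model_share_put()` returns 0, which is the point of counting shares instead of keeping a single flag.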
      
      [ruansy.fnst@fujitsu.com: rename several functions]
        Link: https://lkml.kernel.org/r/1669972991-246-1-git-send-email-ruansy.fnst@fujitsu.com
      [ruansy.fnst@fujitsu.com: v2.2]
        Link: https://lkml.kernel.org/r/1670381359-53-1-git-send-email-ruansy.fnst@fujitsu.com
      Link: https://lkml.kernel.org/r/1669908538-55-1-git-send-email-ruansy.fnst@fujitsu.com
      Link: https://lkml.kernel.org/r/1669908538-55-2-git-send-email-ruansy.fnst@fujitsu.com
      
      
      Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. Jul 26, 2022
    • fsdax: Fix infinite loop in dax_iomap_rw() · 17d9c15c
      Li Jinlin authored
      
      I got an infinite loop and a WARNING report when executing a tail command
      in virtiofs.
      
        WARNING: CPU: 10 PID: 964 at fs/iomap/iter.c:34 iomap_iter+0x3a2/0x3d0
        Modules linked in:
        CPU: 10 PID: 964 Comm: tail Not tainted 5.19.0-rc7
        Call Trace:
        <TASK>
        dax_iomap_rw+0xea/0x620
        ? __this_cpu_preempt_check+0x13/0x20
        fuse_dax_read_iter+0x47/0x80
        fuse_file_read_iter+0xae/0xd0
        new_sync_read+0xfe/0x180
        ? 0xffffffff81000000
        vfs_read+0x14d/0x1a0
        ksys_read+0x6d/0xf0
        __x64_sys_read+0x1a/0x20
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The tail command will call read() with a count of 0. In this case,
      iomap_iter() will report this WARNING and always return 1, causing
      an infinite loop in dax_iomap_rw().

      Fix this by checking whether count is 0 in dax_iomap_rw().
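
The failure mode and the guard can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel code: `model_iter_next()` is a toy iterator that deliberately reproduces the "always returns 1 for an empty request" behaviour described above, and `model_rw()`, `MODEL_CHUNK`, and the `max_steps` cap (used so the runaway loop can be observed instead of hanging) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CHUNK 4  /* hypothetical bytes handled per loop pass */

/* Hypothetical iterator state. */
struct model_iter {
	size_t len;        /* bytes still to process */
	size_t processed;  /* bytes handled in the previous pass */
};

/* Toy iterator: returns 1 while there is work, 0 when finished. */
static int model_iter_next(struct model_iter *it)
{
	if (it->len == 0 && it->processed == 0)
		return 1;  /* models the report: an empty request never finishes */
	it->len -= it->processed;
	it->processed = 0;
	return it->len > 0;
}

/*
 * Returns the number of bytes handled, or -1 if the loop ran for more than
 * max_steps passes (a stand-in for "never terminates").
 */
static long model_rw(size_t count, int guard, int max_steps)
{
	struct model_iter it = { count, 0 };
	long done = 0;
	int steps = 0;

	if (guard && it.len == 0)
		return 0;  /* the fix: nothing to do for an empty request */

	while (model_iter_next(&it) > 0) {
		if (++steps > max_steps)
			return -1;
		size_t chunk = it.len < MODEL_CHUNK ? it.len : MODEL_CHUNK;
		it.processed = chunk;
		done += (long)chunk;
	}
	return done;
}
```

With the guard disabled, a zero-length request hits the step cap. With it enabled, the call returns 0 before the loop is entered, and non-empty requests behave the same either way.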
      
      Fixes: ca289e0b ("fsdax: switch dax_iomap_rw to use iomap_iter")
      Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20220725032050.3873372-1-lijinlin3@huawei.com
      
      
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  3. Jul 18, 2022
  4. Jun 30, 2022
  5. May 16, 2022
  6. Apr 29, 2022
    • dax: fix missing writeprotect the pte entry · 06083a09
      Muchun Song authored
      Currently dax_mapping_entry_mkclean() fails to clean and write protect the
      pte entry within a DAX PMD entry during an *sync operation.  This can
      result in data loss in the following sequence:
      
        1) process A mmap write to DAX PMD, dirtying PMD radix tree entry and
           making the pmd entry dirty and writeable.
        2) process B mmap with the @offset (e.g. 4K) and @length (e.g. 4K)
           write to the same file, dirtying PMD radix tree entry (already
           done in 1)) and making the pte entry dirty and writeable.
        3) fsync, flushing out PMD data and cleaning the radix tree entry. We
           currently fail to mark the pte entry as clean and write protected
           since the vma of process B is not covered in dax_entry_mkclean().
        4) process B writes to the pte. These don't cause any page faults since
           the pte entry is dirty and writeable. The radix tree entry remains
           clean.
        5) fsync, which fails to flush the dirty PMD data because the radix tree
           entry was clean.
        6) crash - dirty data that should have been fsync'd as part of 5) could
           still have been in the processor cache, and is lost.
      
      Use pfn_mkclean_range() to clean the pfns and fix this issue.
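
The six-step sequence can be replayed with a small userspace model. It is only a sketch: `struct model_state` and the `model_*` functions are hypothetical, mapping 0 stands for process A and mapping 1 for process B, and the `covered` argument is an invented knob for how many mappings the clean step reaches (1 for the old behaviour, 2 for a range-based clean).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_MAPPINGS 2

struct model_state {
	bool entry_dirty;                      /* dirty tag on the tree entry */
	bool pte_writable[MODEL_NR_MAPPINGS];  /* per-process write permission */
	int cpu_data;                          /* latest value stored by a process */
	int media_data;                        /* value that reached the media */
};

/* A store through mapping m; only a write-protected mapping faults. */
static void model_write(struct model_state *s, int m, int value)
{
	if (!s->pte_writable[m]) {
		s->entry_dirty = true;     /* fault path re-dirties the entry */
		s->pte_writable[m] = true;
	}
	s->cpu_data = value;
}

/* Flush only if the entry is dirty, then write-protect `covered` mappings. */
static void model_fsync(struct model_state *s, int covered)
{
	if (!s->entry_dirty)
		return;
	s->media_data = s->cpu_data;
	s->entry_dirty = false;
	for (int m = 0; m < covered; m++)
		s->pte_writable[m] = false;
}

/* Replays steps 1-6; returns true if the last write survived the crash. */
static bool model_sequence(int covered)
{
	struct model_state s = { false, { false, false }, 0, 0 };

	model_write(&s, 0, 1);     /* 1) process A writes */
	model_write(&s, 1, 2);     /* 2) process B writes */
	model_fsync(&s, covered);  /* 3) fsync flushes and cleans */
	model_write(&s, 1, 3);     /* 4) process B writes again */
	model_fsync(&s, covered);  /* 5) second fsync */
	return s.media_data == s.cpu_data;  /* 6) state at crash time */
}
```

When only mapping 0 is write-protected, the write in step 4 does not fault, the entry stays clean, and the second fsync skips the flush. Covering both mappings forces the fault that re-dirties the entry.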
      
      Link: https://lkml.kernel.org/r/20220403053957.10770-6-songmuchun@bytedance.com
      
      
      Fixes: 4b4bb46d ("dax: clear dirty entry tags on cache flush")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dax: fix cache flush on PMD-mapped pages · e583b5c4
      Muchun Song authored
      flush_cache_page() only removes a PAGE_SIZE sized range from the cache.
      It therefore covers only the head page of a THP, not the full set of
      pages.  Replace it with flush_cache_range() to fix this issue.  This is
      just a documentation issue with respect to properly documenting the
      expected usage of cache flushing before modifying the pmd.  In practice
      this is not a problem because DAX is not available on architectures with
      virtually indexed caches per:
      
        commit d92576f1 ("dax: does not work correctly with virtual aliasing caches")
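
The size mismatch is simple arithmetic, sketched below. The constants are assumptions (4 KiB base pages and 2 MiB PMD mappings, as on x86-64), and the `model_*` helpers are hypothetical names that only count pages; they do not touch any cache.

```c
#include <assert.h>

/* Assumed sizes: 4 KiB base pages, 2 MiB PMD mappings. */
#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PMD_SIZE  (2UL * 1024 * 1024)

/* Number of base pages in the half-open byte range [start, end). */
static unsigned long model_pages_flushed(unsigned long start, unsigned long end)
{
	assert(end >= start);
	return (end - start) / MODEL_PAGE_SIZE;
}

/* Base pages of one PMD mapping left untouched by a flush of flushed_bytes. */
static unsigned long model_pages_missed(unsigned long flushed_bytes)
{
	unsigned long total = MODEL_PMD_SIZE / MODEL_PAGE_SIZE;
	unsigned long flushed = model_pages_flushed(0, flushed_bytes);

	return flushed >= total ? 0 : total - flushed;
}
```

Under these assumed sizes, a single-page flush leaves 511 of the 512 base pages uncovered, while a flush of the full PMD range leaves none.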
      
      Link: https://lkml.kernel.org/r/20220403053957.10770-3-songmuchun@bytedance.com
      
      
      Fixes: f729c8c9 ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  7. Feb 18, 2022
  8. Feb 02, 2022
  9. Dec 18, 2021
  10. Dec 04, 2021
  11. Aug 17, 2021
  12. Jul 08, 2021
  13. Jun 29, 2021
    • dax: fix ENOMEM handling in grab_mapping_entry() · 1a14e377
      Jan Kara authored
      grab_mapping_entry() has a bug in its handling of the ENOMEM condition.
      Suppose we have a PMD entry at index i which we are downgrading to a PTE
      entry.  grab_mapping_entry() will set pmd_downgrade to true, lock the
      entry, clear the entry in xarray, and decrement mapping->nrpages.  Then
      it will call:
      
      	entry = dax_make_entry(pfn_to_pfn_t(0), flags);
      	dax_lock_entry(xas, entry);
      
      which inserts a new PTE entry into the xarray.  However, this may fail
      to allocate the new node.  We handle this by:
      
      	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
      		goto retry;
      
      However, pmd_downgrade stays set to true even though 'entry' returned from
      get_unlocked_entry() will be NULL now, so we go through the downgrade
      branch again.  This is mostly harmless except that mapping->nrpages is
      decremented again and we temporarily have an invalid entry stored in
      xarray.  Fix the problem by setting pmd_downgrade to false each time we
      look up the entry we work with so that it matches the entry we found.
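
The double decrement can be reproduced with a userspace model of the retry loop. This is a sketch, not the kernel function: `struct model_mapping`, `model_grab_entry()`, the `alloc_failures` counter, and `MODEL_PMD_NR` (assumed to be 512 base pages per PMD entry) are all hypothetical, and the `reset_on_retry` flag is an invented switch between the old and the fixed behaviour.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PMD_NR 512  /* assumed base pages per PMD entry */

struct model_mapping {
	bool has_pmd_entry;  /* a PMD entry currently sits at the index */
	long nrpages;        /* page accounting for the mapping */
	int alloc_failures;  /* insert attempts that fail before one succeeds */
};

/*
 * Models a PTE-sized lookup that may have to downgrade a PMD entry.
 * Returns the resulting nrpages.
 */
static long model_grab_entry(struct model_mapping *m, bool reset_on_retry)
{
	bool pmd_downgrade = false;

retry:
	if (reset_on_retry)
		pmd_downgrade = false;  /* flag must match the entry found below */

	if (m->has_pmd_entry)
		pmd_downgrade = true;   /* lookup found a PMD entry to replace */

	if (pmd_downgrade) {
		m->has_pmd_entry = false;    /* clear the old entry */
		m->nrpages -= MODEL_PMD_NR;  /* drop its accounting */
	}

	if (m->alloc_failures > 0) {
		m->alloc_failures--;  /* node allocation failed, try again */
		goto retry;
	}

	m->nrpages += 1;  /* new PTE entry inserted */
	return m->nrpages;
}
```

Starting from one PMD entry (`nrpages` of 512) and one failed allocation, the stale flag walks the downgrade branch twice and leaves `nrpages` at -511. Resetting the flag on each pass yields the expected value of 1.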
      
      Link: https://lkml.kernel.org/r/20210622160015.18004-1-jack@suse.cz
      
      
      Fixes: b15cd800 ("dax: Convert page fault handlers to XArray")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. May 07, 2021
  15. May 05, 2021
  16. Feb 09, 2021
    • mm: provide a saner PTE walking API for modules · 9fd6dad1
      Paolo Bonzini authored
      
      Currently, the follow_pfn function is exported for modules but
      follow_pte is not.  However, follow_pfn is very easy to misuse,
      because it does not provide protections (so most of its callers
      assume the page is writable!) and because it returns after having
      already unlocked the page table lock.
      
      Provide instead a simplified version of follow_pte that does
      not have the pmdpp and range arguments.  The older version
      survives as follow_invalidate_pte() for use by fs/dax.c.
      
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  17. Dec 16, 2020