- Sep 01, 2023
-
-
Paolo Bonzini authored
KVM x86 MMU changes for 6.6: - Rip out the ancient MMU_DEBUG crud and replace the useful bits with CONFIG_KVM_PROVE_MMU - Overhaul KVM's page-track APIs, and KVMGT's usage, to reduce the API surface that is needed by external users (currently only KVMGT), and fix a variety of issues in the process - Fix KVM's handling of !visible guest roots to avoid premature triple fault injection by loading a dummy root backed by the zero page
-
- Aug 31, 2023
-
-
Sean Christopherson authored
Explicitly include mmu.h in spte.h instead of relying on the "parent" to include mmu.h. spte.h references a variety of macros and variables that are defined/declared in mmu.h, and so including spte.h before (or instead of) mmu.h will result in build errors, e.g. arch/x86/kvm/mmu/spte.h: In function ‘is_mmio_spte’: arch/x86/kvm/mmu/spte.h:242:23: error: ‘enable_mmio_caching’ undeclared 242 | likely(enable_mmio_caching); | ^~~~~~~~~~~~~~~~~~~ arch/x86/kvm/mmu/spte.h: In function ‘is_large_pte’: arch/x86/kvm/mmu/spte.h:302:22: error: ‘PT_PAGE_SIZE_MASK’ undeclared 302 | return pte & PT_PAGE_SIZE_MASK; | ^~~~~~~~~~~~~~~~~ arch/x86/kvm/mmu/spte.h: In function ‘is_dirty_spte’: arch/x86/kvm/mmu/spte.h:332:56: error: ‘PT_WRITABLE_MASK’ undeclared 332 | return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK; | ^~~~~~~~~~~~~~~~ Fixes: 5a9624af ("KVM: mmu: extract spte.h and spte.c") Link: https://lore.kernel.org/r/20230808224059.2492476-1-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When attempting to allocate a shadow root for a !visible guest root gfn, e.g. that resides in MMIO space, load a dummy root that is backed by the zero page instead of immediately synthesizing a triple fault shutdown (using the zero page ensures any attempt to translate memory will generate a !PRESENT fault and thus VM-Exit). Unless the vCPU is racing with memslot activity, KVM will inject a page fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e. the end result is mostly same, but critically KVM will inject a fault only *after* KVM runs the vCPU with the bogus root. Waiting to inject a fault until after running the vCPU fixes a bug where KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled and a !visible root. Even though a bad root will *probably* lead to shutdown, (a) it's not guaranteed and (b) the CPU won't read the underlying memory until after VM-Enter succeeds. E.g. if L1 runs L2 with a VMX preemption timer value of '0', then architecturally the preemption timer VM-Exit is guaranteed to occur before the CPU executes any instruction, i.e. before the CPU needs to translate a GPA to a HPA (so long as there are no injected events with higher priority than the preemption timer). If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because userspace created a memslot between installing the dummy root and handling the page fault, simply unload the MMU to allocate a new root and retry the instruction. Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and conceptually the dummy root has indeeed become obsolete. The only difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is that the root has become obsolete due to memslot *creation*, not memslot deletion or movement. Reported-by:
Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp> Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Explicitly inject a page fault if guest attempts to use a !visible gfn as a page table. kvm_vcpu_gfn_to_hva_prot() will naturally handle the case where there is no memslot, but doesn't catch the scenario where the gfn points at a KVM-internal memslot. Letting the guest backdoor its way into accessing KVM-internal memslots isn't dangerous on its own, e.g. at worst the guest can crash itself, but disallowing the behavior will simplify fixing how KVM handles !visible guest root gfns (immediately synthesizing a triple fault when loading the root is architecturally wrong). Link: https://lore.kernel.org/r/20230729005200.1057358-5-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Explicitly check that tdp_iter_start() is handed a valid shadow page to harden KVM against bugs, e.g. if KVM calls into the TDP MMU with an invalid or shadow MMU root (which would be a fatal KVM bug), the shadow page pointer will be NULL. Opportunistically stop the TDP MMU iteration instead of continuing on with garbage if the incoming root is bogus. Attempting to walk a garbage root is more likely to caused major problems than doing nothing. Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Link: https://lore.kernel.org/r/20230729005200.1057358-4-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Harden kvm_mmu_new_pgd() against NULL pointer dereference bugs by sanity checking that the target root has an associated shadow page prior to dereferencing said shadow page. The code in question is guaranteed to only see roots with shadow pages as fast_pgd_switch() explicitly frees the current root if it doesn't have a shadow page, i.e. is a PAE root, and that in turn prevents valid roots from being cached, but that's all very subtle. Link: https://lore.kernel.org/r/20230729005200.1057358-3-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add a dedicated helper for converting a root hpa to a shadow page in anticipation of using a "dummy" root to handle the scenario where KVM needs to load a valid shadow root (from hardware's perspective), but the guest doesn't have a visible root to shadow. Similar to PAE roots, the dummy root won't have an associated kvm_mmu_page and will need special handling when finding a shadow page given a root. Opportunistically retrieve the root shadow page in kvm_mmu_sync_roots() *after* verifying the root is unsync (the dummy root can never be unsync). Link: https://lore.kernel.org/r/20230729005200.1057358-2-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Open code gpa_to_gfn() in kvmgt_page_track_write() and drop KVMGT's dependency on kvm_host.h, i.e. include only kvm_page_track.h. KVMGT assumes "gfn == gpa >> PAGE_SHIFT" all over the place, including a few lines below in the same function with the same gpa, i.e. there's no reason to use KVM's helper for this one case. No functional change intended. Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-30-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Get/put references to KVM when a page-track notifier is (un)registered instead of relying on the caller to do so. Forcing the caller to do the bookkeeping is unnecessary and adds one more thing for users to get wrong, e.g. see commit 9ed1fdee ("drm/i915/gvt: Get reference to KVM iff attachment to VM is successful"). Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-29-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Refactor KVM's exported/external page-track, a.k.a. write-track, APIs to take only the gfn and do the required memslot lookup in KVM proper. Forcing users of the APIs to get the memslot unnecessarily bleeds KVM internals into KVMGT and complicates usage of the APIs. No functional change intended. Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-28-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Bug the VM if something attempts to write-track a gfn, but write-tracking isn't enabled. The VM is doomed (and KVM has an egregious bug) if KVM or KVMGT wants to shadow guest page tables but can't because write-tracking isn't enabled. Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-27-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When adding/removing gfns to/from write-tracking, assert that mmu_lock is held for write, and that either slots_lock or kvm->srcu is held. mmu_lock must be held for write to protect gfn_write_track's refcount, and SRCU or slots_lock must be held to protect the memslot itself. Tested-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-26-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Rename the page-track APIs to capture that they're all about tracking writes, now that the facade of supporting multiple modes is gone. Opportunstically replace "slot" with "gfn" in anticipation of removing the @slot param from the external APIs. No functional change intended. Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-25-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop "support" for multiple page-track modes, as there is no evidence that array-based and refcounted metadata is the optimal solution for other modes, nor is there any evidence that other use cases, e.g. for access-tracking, will be a good fit for the page-track machinery in general. E.g. one potential use case of access-tracking would be to prevent guest access to poisoned memory (from the guest's perspective). In that case, the number of poisoned pages is likely to be a very small percentage of the guest memory, and there is no need to reference count the number of access-tracking users, i.e. expanding gfn_track[] for a new mode would be grossly inefficient. And for poisoned memory, host userspace would also likely want to trap accesses, e.g. to inject #MC into the guest, and that isn't currently supported by the page-track framework. A better alternative for that poisoned page use case is likely a variation of the proposed per-gfn attributes overlay (linked), which would allow efficiently tracking the sparse set of poisoned pages, and by default would exit to userspace on access. Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com Cc: Ben Gardon <bgardon@google.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-24-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Disable the page-track notifier code at compile time if there are no external users, i.e. if CONFIG_KVM_EXTERNAL_WRITE_TRACKING=n. KVM itself now hooks emulated writes directly instead of relying on the page-track mechanism. Provide a stub for "struct kvm_page_track_notifier_node" so that including headers directly from the command line, e.g. for testing include guards, doesn't fail due to a struct having an incomplete type. Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-23-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Bury the declaration of the page-track helpers that are intended only for internal KVM use in a "private" header. In addition to guarding against unwanted usage of the internal-only helpers, dropping their definitions avoids exposing other structures that should be KVM-internal, e.g. for memslots. This is a baby step toward making kvm_host.h a KVM-internal header in the very distant future. Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-22-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Yan Zhao authored
Remove ->track_remove_slot(), there are no longer any users and it's unlikely a "flush" hook will ever be the correct API to provide to an external page-track user. Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-21-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Yan Zhao authored
Switch from the poorly named and flawed ->track_flush_slot() to the newly introduced ->track_remove_region(). From KVMGT's perspective, the two hooks are functionally equivalent, the only difference being that ->track_remove_region() is called only when KVM is 100% certain the memory region will be removed, i.e. is invoked slightly later in KVM's memslot modification flow. Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Yan Zhao <yan.y.zhao@intel.com> [sean: handle name change, massage changelog, rebase] Tested-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-20-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Yan Zhao authored
Add a new page-track hook, track_remove_region(), that is called when a memslot DELETE operation is about to be committed. The "remove" hook will be used by KVMGT and will effectively replace the existing track_flush_slot() altogether now that KVM itself doesn't rely on the "flush" hook either. The "flush" hook is flawed as it's invoked before the memslot operation is guaranteed to succeed, i.e. KVM might ultimately keep the existing memslot without notifying external page track users, a.k.a. KVMGT. In practice, this can't currently happen on x86, but there are no guarantees that won't change in the future, not to mention that "flush" does a very poor job of describing what is happening. Pass in the gfn+nr_pages instead of the slot itself so external users, i.e. KVMGT, don't need to exposed to KVM internals (memslots). This will help set the stage for additional cleanups to the page-track APIs. Opportunistically align the existing srcu_read_lock_held() usage so that the new case doesn't stand out like a sore thumb (and not aligning the new code makes bots unhappy). Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Signed-off-by:
Yan Zhao <yan.y.zhao@intel.com> Co-developed-by:
Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230729013535.1070024-19-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When handling a slot "flush", don't call back into KVM to drop write protection for gfns in the slot. Now that KVM rejects attempts to move memory slots while KVMGT is attached, the only time a slot is "flushed" is when it's being removed, i.e. the memslot and all its write-tracking metadata is about to be deleted. Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-18-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Disallow moving memslots if the VM has external page-track users, i.e. if KVMGT is being used to expose a virtual GPU to the guest, as KVMGT doesn't correctly handle moving memory regions. Note, this is potential ABI breakage! E.g. userspace could move regions that aren't shadowed by KVMGT without harming the guest. However, the only known user of KVMGT is QEMU, and QEMU doesn't move generic memory regions. KVM's own support for moving memory regions was also broken for multiple years (albeit for an edge case, but arguably moving RAM is itself an edge case), e.g. see commit edd4fa37 ("KVM: x86: Allocate new rmap and large page tracking when moving memslot"). Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-17-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop @vcpu from KVM's ->track_write() hook provided for external users of the page-track APIs now that KVM itself doesn't use the page-track mechanism. Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-16-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Don't use the generic page-track mechanism to handle writes to guest PTEs in KVM's MMU. KVM's MMU needs access to information that should not be exposed to external page-track users, e.g. KVM needs (for some definitions of "need") the vCPU to query the current paging mode, whereas external users, i.e. KVMGT, have no ties to the current vCPU and so should never need the vCPU. Moving away from the page-track mechanism will allow dropping use of the page-track mechanism for KVM's own MMU, and will also allow simplifying and cleaning up the page-track APIs. Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-15-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Call kvm_mmu_zap_all_fast() directly when flushing a memslot instead of bouncing through the page-track mechanism. KVM (unfortunately) needs to zap and flush all page tables on memslot DELETE/MOVE irrespective of whether KVM is shadowing guest page tables. This will allow changing KVM to register a page-track notifier on the first shadow root allocation, and will also allow deleting the misguided kvm_page_track_flush_slot() hook itself once KVM-GT also moves to a different method for reacting to memslot changes. No functional change intended. Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20221110014821.1548347-2-seanjc@google.com Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-14-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move x86's implementation of kvm_arch_flush_shadow_{all,memslot}() into mmu.c, and make kvm_mmu_zap_all() static as it was globally visible only for kvm_arch_flush_shadow_all(). This will allow refactoring kvm_arch_flush_shadow_memslot() to call kvm_mmu_zap_all() directly without having to expose kvm_mmu_zap_all_fast() outside of mmu.c. Keeping everything in mmu.c will also likely simplify supporting TDX, which intends to do zap only relevant SPTEs on memslot updates. No functional change intended. Suggested-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-13-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Use vgpu_lock instead of KVM's mmu_lock to protect accesses to the hash table used to track which gfns are write-protected when shadowing the guest's GTT, and hoist the acquisition of vgpu_lock from intel_vgpu_page_track_handler() out to its sole caller, kvmgt_page_track_write(). This fixes a bug where kvmgt_page_track_write(), which doesn't hold kvm->mmu_lock, could race with intel_gvt_page_track_remove() and trigger a use-after-free. Fixing kvmgt_page_track_write() by taking kvm->mmu_lock is not an option as mmu_lock is a r/w spinlock, and intel_vgpu_page_track_handler() might sleep when acquiring vgpu->cache_lock deep down the callstack: intel_vgpu_page_track_handler() | |-> page_track->handler / ppgtt_write_protection_handler() | |-> ppgtt_handle_guest_write_page_table_bytes() | |-> ppgtt_handle_guest_write_page_table() | |-> ppgtt_handle_guest_entry_removal() | |-> ppgtt_invalidate_pte() | |-> intel_gvt_dma_unmap_guest_page() | |-> mutex_lock(&vgpu->cache_lock); Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-12-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop intel_vgpu_reset_gtt() as it no longer has any callers. In addition to eliminating dead code, this eliminates the last possible scenario where __kvmgt_protect_table_find() can be reached without holding vgpu_lock. Requiring vgpu_lock to be held when calling __kvmgt_protect_table_find() will allow a protecting the gfn hash with vgpu_lock without too much fuss. No functional change intended. Fixes: ba25d977 ("drm/i915/gvt: Do not destroy ppgtt_mm during vGPU D3->D0.") Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-11-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Use an "unsigned long" instead of an "int" when iterating over the gfns in a memslot. The number of pages in the memslot is tracked as an "unsigned long", e.g. KVMGT could theoretically break if a KVM memslot larger than 16TiB were deleted (2^32 * 4KiB). Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-10-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Now that gvt_pin_guest_page() explicitly verifies the pinned PFN is a transparent hugepage page, don't use KVM's gfn_to_pfn() to pre-check if a 2MiB GTT entry is possible and instead just try to map the GFN with a 2MiB entry. Using KVM to query pfn that is ultimately managed through VFIO is odd, and KVM's gfn_to_pfn() is not intended for non-KVM consumption; it's exported only because of KVM vendor modules (x86 and PPC). Open code the check on 2MiB support instead of keeping is_2MB_gtt_possible() around for a single line of code. Move the call to intel_gvt_dma_map_guest_page() for a 4KiB entry into its case statement, i.e. fork the common path into the 4KiB and 2MiB "direct" shadow paths. Keeping the call in the "common" path is arguably more in the spirit of "one change per patch", but retaining the local "page_size" variable is silly, i.e. the call site will be changed either way, and jumping around the no-longer-common code is more subtle and rather odd, i.e. would just need to be immediately cleaned up. Drop the error message from gvt_pin_guest_page() when KVMGT attempts to shadow a 2MiB guest page that isn't backed by a compatible hugepage in the host. Dropping the pre-check on a THP makes it much more likely that the "error" will be encountered in normal operation. Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-9-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Bail from ppgtt_populate_shadow_entry() if an unexpected GTT entry type is encountered instead of subtly falling through to the common "direct shadow" path. Eliminating the default/error path's reliance on the common handling will allow hoisting intel_gvt_dma_map_guest_page() into the case statements so that the 2MiB case can try intel_gvt_dma_map_guest_page() and fallback to splitting the entry on failure. Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-8-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move the check that a vGPU is attached from is_2MB_gtt_possible() all the way up to shadow_ppgtt_mm() to avoid unnecessary work, and to make it more obvious that a future cleanup of is_2MB_gtt_possible() isn't introducing a bug. is_2MB_gtt_possible() has only one caller, ppgtt_populate_shadow_entry(), and all paths in ppgtt_populate_shadow_entry() eventually check for attachment by way of intel_gvt_dma_map_guest_page(). And of the paths that lead to ppgtt_populate_shadow_entry(), shadow_ppgtt_mm() is the only one that doesn't already check for INTEL_VGPU_STATUS_ACTIVE or INTEL_VGPU_STATUS_ATTACHED. workload_thread() <= pick_next_workload() => INTEL_VGPU_STATUS_ACTIVE | -> dispatch_workload() | |-> prepare_workload() | -> intel_vgpu_sync_oos_pages() | | | |-> ppgtt_set_guest_page_sync() | | | |-> sync_oos_page() | | | |-> ppgtt_populate_shadow_entry() | |-> intel_vgpu_flush_post_shadow() | 1: |-> ppgtt_handle_guest_write_page_table() | |-> ppgtt_handle_guest_entry_add() | 2: | -> ppgtt_populate_spt_by_guest_entry() | | | |-> ppgtt_populate_spt() | | | |-> ppgtt_populate_shadow_entry() | | | |-> ppgtt_populate_spt_by_guest_entry() [see 2] | |-> ppgtt_populate_shadow_entry() kvmgt_page_track_write() <= KVM callback => INTEL_VGPU_STATUS_ATTACHED | |-> intel_vgpu_page_track_handler() | |-> ppgtt_write_protection_handler() | |-> ppgtt_handle_guest_write_page_table_bytes() | |-> ppgtt_handle_guest_write_page_table() [see 1] Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yan Zhao <yan.y.zhao@intel.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Put the struct page reference acquired by gfn_to_pfn(), KVM's API is that the caller is ultimately responsible for dropping any reference. Note, kvm_release_pfn_clean() ensures the pfn is actually a refcounted struct page before trying to put any references. Fixes: b901b252 ("drm/i915/gvt: Add 2M huge gtt support") Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-6-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Yan Zhao authored
Attempt to unpin pages in the error path of gvt_pin_guest_page() if and only if at least one page was successfully pinned. Unpinning doesn't cause functional problems, but vfio_device_container_unpin_pages() rightfully warns about being asked to unpin zero pages. Signed-off-by:
Yan Zhao <yan.y.zhao@intel.com> [sean: write changelog] Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-5-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When shadowing a GTT entry with a 2M page, verify that the pfns are contiguous, not just that the struct page pointers are contiguous. The memory map is virtual contiguous if "CONFIG_FLATMEM=y || CONFIG_SPARSEMEM_VMEMMAP=y", but not for "CONFIG_SPARSEMEM=y && CONFIG_SPARSEMEM_VMEMMAP=n", so theoretically KVMGT could encounter struct pages that are virtually contiguous, but not physically contiguous. In practice, this flaw is likely a non-issue as it would cause functional problems iff a section isn't 2M aligned _and_ is directly adjacent to another section with discontiguous pfns. Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-4-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Yan Zhao authored
Currently intel_gvt_is_valid_gfn() is called in two places: (1) shadowing guest GGTT entry (2) shadowing guest PPGTT leaf entry, which was introduced in commit cc753fbe ("drm/i915/gvt: validate gfn before set shadow page entry"). However, now it's not necessary to call this interface any more, because a. GGTT partial write issue has been fixed by commit bc0686ff ("drm/i915/gvt: support inconsecutive partial gtt entry write") commit 510fe10b ("drm/i915/gvt: fix a bug of partially write ggtt enties") b. PPGTT resides in normal guest RAM and we only treat 8-byte writes as valid page table writes. Any invalid GPA found is regarded as an error, either due to guest misbehavior/attack or bug in host shadow code. So,rather than do GFN pre-checking and replace invalid GFNs with scratch GFN and continue silently, just remove the pre-checking and abort PPGTT shadowing on error detected. c. GFN validity check is still performed in intel_gvt_dma_map_guest_page() --> gvt_pin_guest_page(). It's more desirable to call VFIO interface to do both validity check and mapping. Calling intel_gvt_is_valid_gfn() to do GFN validity check from KVM side while later mapping the GFN through VFIO interface is unnecessarily fragile and confusing for unaware readers. Signed-off-by:
Yan Zhao <yan.y.zhao@intel.com> [sean: remove now-unused local variables] Acked-by:
Zhi Wang <zhi.a.wang@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-3-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Check that the pfn found by gfn_to_pfn() is actually backed by "struct page" memory prior to retrieving and dereferencing the page. KVM supports backing guest memory with VM_PFNMAP, VM_IO, etc., and so there is no guarantee the pfn returned by gfn_to_pfn() has an associated "struct page". Fixes: b901b252 ("drm/i915/gvt: Add 2M huge gtt support") Reviewed-by:
Yan Zhao <yan.y.zhao@intel.com> Tested-by:
Yongwei Ma <yongwei.ma@intel.com> Reviewed-by:
Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-2-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Introduce KVM_BUG_ON_DATA_CORRUPTION() and use it in the low-level rmap helpers to convert the existing BUG()s to WARN_ON_ONCE() when the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() on corruption of host kernel data structures. Environments that don't have infrastructure to automatically capture crash dumps, i.e. aren't likely to enable CONFIG_BUG_ON_DATA_CORRUPTION=y, are typically better served overall by WARN-and-continue behavior (for the kernel, the VM is dead regardless), as a BUG() while holding mmu_lock all but guarantees the _best_ case scenario is a panic(). Make the BUG()s conditional instead of removing/replacing them entirely as there's a non-zero chance (though by no means a guarantee) that the damage isn't contained to the target VM, e.g. if no rmap is found for a SPTE then KVM may be double-zapping the SPTE, i.e. has already freed the memory the SPTE pointed at and thus KVM is reading/writing memory that KVM no longer owns. Link: https://lore.kernel.org/all/20221129191237.31447-1-mizhang@google.com Suggested-by:
Mingwei Zhang <mizhang@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Jim Mattson <jmattson@google.com> Reviewed-by:
Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20230729004722.1056172-13-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Mingwei Zhang authored
Plumb "struct kvm" all the way to pte_list_remove() to allow the usage of KVM_BUG() and/or KVM_BUG_ON(). This will allow killing only the offending VM instead of doing BUG() if the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() if KVM's data structures (rmaps) appear to be corrupted. Signed-off-by:
Mingwei Zhang <mizhang@google.com> [sean: tweak changelog] Link: https://lore.kernel.org/r/20230729004722.1056172-12-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Use BUILD_BUG_ON_INVALID() instead of an empty do-while loop to stub out KVM_MMU_WARN_ON() when CONFIG_KVM_PROVE_MMU=n, that way _some_ build issues with the usage of KVM_MMU_WARN_ON() will be dected even if the kernel is using the stubs, e.g. basic syntax errors will be detected. Reviewed-by:
Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-11-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Replace MMU_DEBUG, which requires manually modifying KVM to enable the macro, with a proper Kconfig, KVM_PROVE_MMU. Now that pgprintk() and rmap_printk() are gone, i.e. the macro guards only KVM_MMU_WARN_ON() and won't flood the kernel logs, enabling the option for debug kernels is both desirable and feasible. Reviewed-by:
Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-10-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-