  1. Jun 22, 2023
    • KVM: Avoid illegal stage2 mapping on invalid memory slot · 2230f9e1
      Gavin Shan authored
      
      We run into a guest hang in the edk2 firmware when KSM is running on
      the host. The edk2 firmware is waiting for status 0x80 from QEMU's
      pflash device (TYPE_PFLASH_CFI01) during a sector erase or buffered
      write operation. The status is returned by reading the memory region
      of the pflash device, and the read request should have been forwarded
      to QEMU and emulated by it. Unfortunately, when the guest hang occurs,
      the read request is covered by an illegal stage2 mapping: the read
      completes with QEMU bypassed and a wrong status is fetched, so the
      edk2 firmware spins in an infinite loop on that wrong status.
      
      The illegal stage2 mapping is populated by KSM's same-page sharing at
      (C) even though the associated memory slot has already been marked
      invalid at (B) as part of the request to delete it. Note that the
      active and inactive memory slots can't be swapped while we're in the
      middle of kvm_mmu_notifier_change_pte(): kvm->mn_active_invalidate_count
      is elevated, so kvm_swap_active_memslots() will busy-loop until it
      reaches zero again. Besides, swapping from the active to the inactive
      memory slots is also prevented by holding &kvm->srcu in
      __kvm_handle_hva_range(), which pairs with synchronize_srcu_expedited()
      in kvm_swap_active_memslots().
      
        CPU-A                    CPU-B
        -----                    -----
                                 ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
                                 kvm_vm_ioctl_set_memory_region
                                 kvm_set_memory_region
                                 __kvm_set_memory_region
                                 kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
                                   kvm_invalidate_memslot
                                     kvm_copy_memslot
                                     kvm_replace_memslot
                                     kvm_swap_active_memslots        (A)
                                     kvm_arch_flush_shadow_memslot   (B)
        same page sharing by KSM
        kvm_mmu_notifier_invalidate_range_start
              :
        kvm_mmu_notifier_change_pte
          kvm_handle_hva_range
          __kvm_handle_hva_range
          kvm_set_spte_gfn            (C)
              :
        kvm_mmu_notifier_invalidate_range_end
      
      Fix the issue by skipping the invalid memory slot at (C) to avoid the
      illegal stage2 mapping, so that the read request for the pflash's
      status is forwarded to QEMU and emulated by it. This way, the correct
      pflash status is returned by QEMU, breaking the infinite loop in the
      edk2 firmware.
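
      A minimal sketch of the shape of the fix (not the verbatim patch),
      assuming the skip is done in the common change_pte() handler before
      kvm_set_spte_gfn() is reached:

      	static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
      	{
      		/*
      		 * An invalid slot is in the middle of being deleted; installing
      		 * a stage2 mapping for it would let guest reads bypass QEMU.
      		 */
      		if (range->slot->flags & KVM_MEMSLOT_INVALID)
      			return false;

      		return kvm_set_spte_gfn(kvm, range);
      	}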
      
      We tried a git-bisect and the first problematic commit is cd4c7183
      ("KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With
      that commit, clean_dcache_guest_page() is called after the memory
      slots are iterated in kvm_mmu_notifier_change_pte(); before it,
      clean_dcache_guest_page() was called prior to the iteration over the
      memory slots. The change effectively enlarges the race window between
      kvm_mmu_notifier_change_pte() and memory slot removal, which is what
      allows the issue to be reproduced in a practical test case. However,
      the issue has existed since commit d5d8184d ("KVM: ARM: Memory
      virtualization setup").
      
      Cc: stable@vger.kernel.org # v3.9+
      Fixes: d5d8184d ("KVM: ARM: Memory virtualization setup")
      Reported-by: Shuai Hu <hshuai@redhat.com>
      Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
      Message-Id: <20230615054259.14911-1-gshan@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. Jun 19, 2023
    • mm: ptep_get() conversion · c33c7948
      Ryan Roberts authored
      Convert all instances of direct pte_t* dereferencing to instead use
      the ptep_get() helper.  This means that, by default, the accesses
      change from a C dereference to a READ_ONCE().  This is technically
      the correct thing to do: where pgtables are modified by HW (for
      access/dirty) they are volatile, so we should always ensure
      READ_ONCE() semantics.
      
      But more importantly, by always using the helper, it can be overridden by
      the architecture to fully encapsulate the contents of the pte.  Arch code
      is deliberately not converted, as the arch code knows best.  It is
      intended that arch code (arm64) will override the default with its own
      implementation that can (e.g.) hide certain bits from the core code, or
      determine young/dirty status by mixing in state from another source.
      
      Conversion was done using Coccinelle:
      
      ----
      
      // $ make coccicheck \
      //          COCCI=ptepget.cocci \
      //          SPFLAGS="--include-headers" \
      //          MODE=patch
      
      virtual patch
      
      @ depends on patch @
      pte_t *v;
      @@
      
      - *v
      + ptep_get(v)
      
      ----
      
      Then reviewed and hand-edited to avoid multiple unnecessary calls to
      ptep_get(), instead opting to store the result of a single call in a
      variable, where it is correct to do so.  This aims to negate any cost of
      READ_ONCE() and will benefit arch-overrides that may be more complex.
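
      For illustration, a hypothetical before/after of the converted
      pattern (pte_present()/pte_dirty() stand in for arbitrary pte
      accessors, do_something() is a placeholder):

      	/* Before: two direct dereferences of the pte_t pointer. */
      	if (pte_present(*ptep) && pte_dirty(*ptep))
      		do_something();

      	/* After: a single ptep_get() read, stored and reused. */
      	pte_t pte = ptep_get(ptep);

      	if (pte_present(pte) && pte_dirty(pte))
      		do_something();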
      
      Included is a fix for an issue in an earlier version of this patch that
      was pointed out by kernel test robot.  The issue arose because config
      MMU=n elides definition of the ptep helper functions, including
      ptep_get().  HUGETLB_PAGE=n configs still define a simple
      huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
      So when both configs are disabled, this caused a build error because
      ptep_get() is not defined.  Fix by continuing to do a direct dereference
      when MMU=n.  This is safe because for this config the arch code cannot be
      trying to virtualize the ptes because none of the ptep helpers are
      defined.
      
      Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. Jun 09, 2023
    • mm/gup: remove vmas parameter from get_user_pages_remote() · ca5e8632
      Lorenzo Stoakes authored
      The only invocations of get_user_pages_remote() which used the vmas
      parameter were for a single page, and they can instead simply look up
      the VMA directly. In particular:
      
      - __update_ref_ctr() looked up the VMA but did nothing with it so we simply
        remove it.
      
      - __access_remote_vm() was already using vma_lookup() when the original
        lookup failed so by doing the lookup directly this also de-duplicates the
        code.
      
      We are able to perform these VMA operations as we already hold the
      mmap_lock in order to be able to call get_user_pages_remote().
      
      As part of this work we add get_user_page_vma_remote() which abstracts the
      VMA lookup, error handling and decrementing the page reference count should
      the VMA lookup fail.
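
      A hedged usage sketch of the new helper, assuming it returns the page
      (or an ERR_PTR) and hands back the VMA via an output parameter:

      	struct vm_area_struct *vma;
      	struct page *page;

      	/* The caller must already hold mmap_lock. */
      	page = get_user_page_vma_remote(mm, addr, FOLL_WRITE, &vma);
      	if (IS_ERR(page))
      		return PTR_ERR(page);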
      
      This forms part of a broader set of patches intended to eliminate the vmas
      parameter altogether.
      
      [akpm@linux-foundation.org: avoid passing NULL to PTR_ERR]
      Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64)
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390)
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jarkko Sakkinen <jarkko@kernel.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: remove unused vmas parameter from get_user_pages() · 54d02069
      Lorenzo Stoakes authored
      Patch series "remove the vmas parameter from GUP APIs", v6.
      
      (pin_/get)_user_pages[_remote]() each provide an optional output parameter
      for an array of VMA objects associated with each page in the input range.
      
      These provide the means for VMAs to be returned, as long as mm->mmap_lock
      is never released during the GUP operation (i.e.  the internal flag
      FOLL_UNLOCKABLE is not specified).
      
      In addition, these VMAs can only be accessed with the mmap_lock held and
      become invalidated the moment it is released.
      
      The vast majority of invocations do not use this functionality and of
      those that do, all but one case retrieve a single VMA to perform checks
      upon.
      
      It is not egregious in the single VMA cases to simply replace the
      operation with a vma_lookup().  In these cases we duplicate the (fast)
      lookup on a slow path already under the mmap_lock, abstracted to a new
      get_user_page_vma_remote() inline helper function which also performs
      error checking and reference count maintenance.
      
      The special case is io_uring, where io_pin_pages() specifically needs to
      assert that the VMAs underlying the range do not result in broken
      long-term GUP file-backed mappings.
      
      As GUP now internally asserts that FOLL_LONGTERM mappings are not
      file-backed in a broken fashion (i.e.  requiring dirty tracking) - as
      implemented in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to
      file-backed mappings" - this logic is no longer required and so we can
      simply remove it altogether from io_uring.
      
      Eliminating the vmas parameter eliminates an entire class of dangling
      pointer errors that might have occurred should the lock have been
      incorrectly released.
      
      In addition, the API is simplified and now clearly expresses what it is
      intended for - applying the specified GUP flags and (if pinning) returning
      pinned pages.
      
      This change additionally opens the door to further potential improvements
      in GUP and the possible marrying of disparate code paths.
      
      I have run this series against gup_test with no issues.
      
      Thanks to Matthew Wilcox for suggesting this refactoring!
      
      
      This patch (of 6):
      
      No invocation of get_user_pages() uses the vmas parameter, so remove it.
      
      The GUP API is confusing and caveated.  Recent changes have done much to
      improve that, however there is more we can do.  Exporting vmas is a prime
      target as the caller has to be extremely careful to preclude their use
      after the mmap_lock has expired or otherwise be left with dangling
      pointers.
      
      Removing the vmas parameter focuses the GUP functions upon their primary
      purpose - pinning (and outputting) pages as well as performing the actions
      implied by the input flags.
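
      For illustration, the shape of the signature change (a sketch, not
      the verbatim diff):

      	/* Before: */
      	long get_user_pages(unsigned long start, unsigned long nr_pages,
      			    unsigned int gup_flags, struct page **pages,
      			    struct vm_area_struct **vmas);

      	/* After: */
      	long get_user_pages(unsigned long start, unsigned long nr_pages,
      			    unsigned int gup_flags, struct page **pages);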
      
      This is part of a patch series aiming to remove the vmas parameter
      altogether.
      
      Link: https://lkml.kernel.org/r/cover.1684350871.git.lstoakes@gmail.com
      Link: https://lkml.kernel.org/r/589e0c64794668ffc799651e8d85e703262b1e9d.1684350871.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Acked-by: Christian König <christian.koenig@amd.com> (for radeon parts)
      Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Sean Christopherson <seanjc@google.com> (KVM)
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. May 19, 2023
    • KVM: Fix vcpu_array[0] races · afb2acb2
      Michal Luczaj authored
      
      In kvm_vm_ioctl_create_vcpu(), add the vcpu to vcpu_array iff it's safe
      to access the vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when
      there's no failure path requiring vcpu removal and destruction. This
      order is important because vcpu_array accessors may end up referencing
      the vcpu at vcpu_array[0] even before online_vcpus is set to 1.
      
      When online_vcpus=0, any call to kvm_get_vcpu() goes through
      array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):
      
      	int num_vcpus = atomic_read(&kvm->online_vcpus);
      	i = array_index_nospec(i, num_vcpus);
      	return xa_load(&kvm->vcpu_array, i);
      
      Similarly, when online_vcpus=0, kvm_for_each_vcpu() does not iterate
      over an "empty" range, but actually over [0, ULONG_MAX], because
      (atomic_read(&kvm->online_vcpus) - 1) underflows:
      
      	xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
      			  (atomic_read(&kvm->online_vcpus) - 1))
      
      In both cases, the online_vcpus=0 edge case, even if it leads to
      unnecessary calls to the XArray API, should not be an issue; requesting
      unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range().
      
      However, this means that when the first vCPU is created and inserted in
      vcpu_array *and* before online_vcpus is incremented, code calling
      kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.
      
      This should not pose a problem assuming that once a vcpu is stored in
      vcpu_array it will remain there, but that's not the case:
      kvm_vm_ioctl_create_vcpu() first inserts into vcpu_array, then requests
      a file descriptor. If create_vcpu_fd() fails, the newly inserted vcpu
      is removed from vcpu_array, then destroyed:
      
      	vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
      	r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
      	kvm_get_kvm(kvm);
      	r = create_vcpu_fd(vcpu);
      	if (r < 0) {
      		xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
      		kvm_put_kvm_no_destroy(kvm);
      		goto unlock_vcpu_destroy;
      	}
      	atomic_inc(&kvm->online_vcpus);
      
      This results in a possible race condition when a reference to a vcpu is
      acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said
      vcpu is destroyed.
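
      A hedged sketch of the resulting ordering, assuming the fix reserves
      the xarray entry up front and only publishes the vcpu once no failure
      path remains:

      	vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
      	r = xa_reserve(&kvm->vcpu_array, vcpu->vcpu_idx, GFP_KERNEL_ACCOUNT);
      	...
      	r = create_vcpu_fd(vcpu);
      	if (r < 0) {
      		xa_release(&kvm->vcpu_array, vcpu->vcpu_idx);
      		kvm_put_kvm_no_destroy(kvm);
      		goto unlock_vcpu_destroy;
      	}
      	/* Publish the vcpu only once there is no failure path left. */
      	WARN_ON_ONCE(xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, 0));
      	atomic_inc(&kvm->online_vcpus);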
      
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
      Cc: stable@vger.kernel.org
      Fixes: c5b07754 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Don't enable hardware after a restart/shutdown is initiated · e0ceec22
      Sean Christopherson authored
      
      Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
      been initiated to avoid re-enabling hardware between kvm_reboot() and
      machine_{halt,power_off,restart}().  The restart case is especially
      problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
      SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
      is unable to wake and rendezvous with APs.
      
      Note, this bug, and the original issue that motivated the addition of
      kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`.
      In a "normal" reboot, userspace is gracefully torn down before the
      kernel reboot is triggered (modulo bugs, errors, etc.), i.e. any process
      that might do ioctl(KVM_CREATE_VM) is long gone.
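
      A hedged sketch of the check's shape (assuming it lands in the common
      hardware-enabling path, with VM creation propagating the error):

      	/*
      	 * Refuse to enable hardware, i.e. reject VM creation, if a
      	 * reboot or shutdown is already in progress.
      	 */
      	if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
      	    system_state == SYSTEM_RESTART)
      		return -EBUSY;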
      
      Fixes: 8e1c1815 ("KVM: VMX: Disable VMX when system shutdown")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Acked-by: Marc Zyngier <maz@kernel.org>
      Message-Id: <20230512233127.804012-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown · 6735150b
      Sean Christopherson authored
      
      Use syscore_ops.shutdown to disable hardware virtualization during a
      reboot instead of using the dedicated reboot_notifier so that KVM disables
      virtualization _after_ system_state has been updated.  This will allow
      fixing a race in KVM's handling of a forced reboot where KVM can end up
      enabling hardware virtualization between kernel_restart_prepare() and
      machine_restart().
      
      Rename KVM's hook to match the syscore op to avoid any possible
      confusion from wiring up a "reboot" helper to a "shutdown" hook
      (neither "shutdown" nor "reboot" is completely accurate, as the hook
      handles both).
      
      Opportunistically rewrite kvm_shutdown()'s comment to make it less VMX
      specific, and to explain why kvm_rebooting exists.
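
      A hedged sketch of the wiring (the struct syscore_ops fields are real;
      the KVM callback names are assumptions based on the rename above):

      	static struct syscore_ops kvm_syscore_ops = {
      		.suspend  = kvm_suspend,
      		.resume   = kvm_resume,
      		.shutdown = kvm_shutdown,	/* runs after system_state is updated */
      	};

      	register_syscore_ops(&kvm_syscore_ops);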
      
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Oliver Upton <oliver.upton@linux.dev>
      Cc: James Morse <james.morse@arm.com>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Zenghui Yu <yuzenghui@huawei.com>
      Cc: kvmarm@lists.linux.dev
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
      Cc: Anup Patel <anup@brainfault.org>
      Cc: Atish Patra <atishp@atishpatra.org>
      Cc: kvm-riscv@lists.infradead.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Acked-by: Marc Zyngier <maz@kernel.org>
      Message-Id: <20230512233127.804012-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. Mar 27, 2023
    • KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking · fef8f2b9
      Dmytro Maluka authored
      
      KVM irqfd based emulation of level-triggered interrupts doesn't work
      quite correctly in some cases, particularly for interrupts that a
      Linux guest handles as oneshot interrupts (IRQF_ONESHOT). Such an
      interrupt is acked to the device in its threaded irq handler, i.e.
      later than it is acked to the interrupt controller (EOI at the end of
      the hardirq), not earlier.
      
      Linux keeps such interrupt masked until its threaded handler finishes,
      to prevent the EOI from re-asserting an unacknowledged interrupt.
      However, with KVM + vfio (or whatever is listening on the resamplefd)
      we always notify resamplefd at the EOI, so vfio prematurely unmasks the
      host physical IRQ, thus a new physical interrupt is fired in the host.
      This extra interrupt in the host is not a problem per se. The problem is
      that it is unconditionally queued for injection into the guest, so the
      guest sees an extra bogus interrupt. [*]
      
      At least two user-visible issues caused by those extra erroneous
      interrupts for a oneshot irq in the guest have been observed:
      
      1. System suspend aborted due to a pending wakeup interrupt from
         ChromeOS EC (drivers/platform/chrome/cros_ec.c).
      2. Annoying "invalid report id data" errors from ELAN0000 touchpad
         (drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg
         every time the touchpad is touched.
      
      The core issue here is that by the time when the guest unmasks the IRQ,
      the physical IRQ line is no longer asserted (since the guest has
      acked the interrupt to the device in the meantime), yet we
      unconditionally inject the interrupt queued into the guest by the
      previous resampling. So to fix the issue, we need a way to detect that
      the IRQ is no longer pending, and cancel the queued interrupt in this
      case.
      
      With IOAPIC we are not able to probe the physical IRQ line state
      directly (at least not if the underlying physical interrupt controller
      is an IOAPIC too), so in this patch we use irqfd resampler for that.
      Namely, instead of injecting the queued interrupt, we just notify the
      resampler that this interrupt is done. If the IRQ line is actually
      already deasserted, we are done. If it is still asserted, a new
      interrupt will be shortly triggered through irqfd and injected into the
      guest.
      
      If there is no irqfd resampler registered for the IRQ, we cannot fix
      the issue, so we keep the existing behavior: immediately and
      unconditionally inject the queued interrupt.
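
      A hedged pseudocode sketch of the new unmask-time logic (the helper
      names here are illustrative, not the actual KVM identifiers):

      	/* On guest unmask of a level-triggered IOAPIC IRQ: */
      	if (irq_was_queued_by_resampling(irq)) {
      		if (irq_has_irqfd_resampler(irq))
      			/* Re-fires via irqfd only if the line is still asserted. */
      			notify_resampler_done(irq);
      		else
      			/* No resampler: keep the old, unconditional injection. */
      			inject_queued_interrupt(irq);
      	}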
      
      This patch fixes the issue for x86 IOAPIC only. In the long run, we can
      fix it for other irqchips and other architectures too, possibly taking
      advantage of reading the physical state of the IRQ line, which is
      possible with some other irqchips (e.g. with arm64 GIC, maybe even with
      the legacy x86 PIC).
      
      [*] In this description we assume that the interrupt is a physical host
          interrupt forwarded to the guest e.g. by vfio. Potentially the same
          issue may occur also with a purely virtual interrupt from an
          emulated device, e.g. if the guest handles this interrupt, again, as
          a oneshot interrupt.
      
      Signed-off-by: Dmytro Maluka <dmy@semihalf.com>
      Link: https://lore.kernel.org/kvm/31420943-8c5f-125c-a5ee-d2fde2700083@semihalf.com/
      Link: https://lore.kernel.org/lkml/87o7wrug0w.wl-maz@kernel.org/
      Message-Id: <20230322204344.50138-3-dmy@semihalf.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: irqfd: Make resampler_list an RCU list · d583fbd7
      Dmytro Maluka authored
      
      It is useful to be able to do read-only traversal of the list of all the
      registered irqfd resamplers without locking the resampler_lock mutex.
      In particular, we are going to traverse it to search for a resampler
      registered for the given irq of an irqchip, and that will be done with
      an irqchip spinlock (ioapic->lock) held, so it is undesirable to lock a
      mutex in this context. So turn this list into an RCU list.
      
      For protecting the read side, reuse kvm->irq_srcu which is already used
      for protecting a number of irq related things (kvm->irq_routing,
      irqfd->resampler->list, kvm->irq_ack_notifier_list,
      kvm->arch.mask_notifier_list).
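
      A hedged sketch of the intended read side, assuming a struct
      kvm_kernel_irqfd_resampler iterator and traversal under kvm->irq_srcu
      as described above:

      	int idx;

      	idx = srcu_read_lock(&kvm->irq_srcu);
      	list_for_each_entry_srcu(resampler, &kvm->irqfds.resampler_list, link,
      				 srcu_read_lock_held(&kvm->irq_srcu)) {
      		/* Read-only inspection; safe with ioapic->lock held. */
      	}
      	srcu_read_unlock(&kvm->irq_srcu, idx);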
      
      Signed-off-by: Dmytro Maluka <dmy@semihalf.com>
      Message-Id: <20230322204344.50138-2-dmy@semihalf.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. Mar 16, 2023
    • KVM: Standardize on "int" return types instead of "long" in kvm_main.c · f15ba52b
      Thomas Huth authored
      
      KVM functions use "long" return values for functions that are wired up
      to "struct file_operations", but otherwise use "int" return values for
      functions that can return 0/-errno in order to avoid unintentional
      divergences between 32-bit and 64-bit kernels.
      Some code still uses "long" in unnecessary spots, though, which can
      cause a little bit of confusion and unnecessary size casts. Let's
      change these spots to use "int" types, too.
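
      An illustrative (hypothetical) instance of the conversion; any
      function that returns 0/-errno and is not wired into a struct
      file_operations gets the same treatment:

      	-static long kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm, struct kvm_enable_cap *cap)
      	+static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm, struct kvm_enable_cap *cap)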
      
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Message-Id: <20230208140105.655814-6-thuth@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. Feb 01, 2023
    • KVM: Destroy target device if coalesced MMIO unregistration fails · b1cb1fac
      Sean Christopherson authored
      
      Destroy and free the target coalesced MMIO device if unregistering said
      device fails.  As clearly noted in the code, kvm_io_bus_unregister_dev()
      does not destroy the target device.
      
        BUG: memory leak
        unreferenced object 0xffff888112a54880 (size 64):
          comm "syz-executor.2", pid 5258, jiffies 4297861402 (age 14.129s)
          hex dump (first 32 bytes):
            38 c7 67 15 00 c9 ff ff 38 c7 67 15 00 c9 ff ff  8.g.....8.g.....
            e0 c7 e1 83 ff ff ff ff 00 30 67 15 00 c9 ff ff  .........0g.....
          backtrace:
            [<0000000006995a8a>] kmalloc include/linux/slab.h:556 [inline]
            [<0000000006995a8a>] kzalloc include/linux/slab.h:690 [inline]
            [<0000000006995a8a>] kvm_vm_ioctl_register_coalesced_mmio+0x8e/0x3d0 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:150
            [<00000000022550c2>] kvm_vm_ioctl+0x47d/0x1600 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3323
            [<000000008a75102f>] vfs_ioctl fs/ioctl.c:46 [inline]
            [<000000008a75102f>] file_ioctl fs/ioctl.c:509 [inline]
            [<000000008a75102f>] do_vfs_ioctl+0xbab/0x1160 fs/ioctl.c:696
            [<0000000080e3f669>] ksys_ioctl+0x76/0xa0 fs/ioctl.c:713
            [<0000000059ef4888>] __do_sys_ioctl fs/ioctl.c:720 [inline]
            [<0000000059ef4888>] __se_sys_ioctl fs/ioctl.c:718 [inline]
            [<0000000059ef4888>] __x64_sys_ioctl+0x6f/0xb0 fs/ioctl.c:718
            [<000000006444fa05>] do_syscall_64+0x9f/0x4e0 arch/x86/entry/common.c:290
            [<000000009a4ed50b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        BUG: leak checking failed
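
      A hedged sketch of the shape of the fix in the unregister path (the
      comment paraphrases the commit's reasoning):

      	r = kvm_io_bus_unregister_dev(kvm,
      			zone->pio ? KVM_PIO_BUS : KVM_MMIO_BUS, &dev->dev);

      	/*
      	 * kvm_io_bus_unregister_dev() does not destroy the target device,
      	 * so destroy (and free) it manually regardless of whether
      	 * unregistration succeeded, lest it be leaked on failure.
      	 */
      	kvm_iodevice_destructor(&dev->dev);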
      
      Fixes: 5d3c4c79 ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed")
      Cc: stable@vger.kernel.org
      Reported-by: 柳菁峰 <liujingfeng@qianxin.com>
      Reported-by: Michal Luczaj <mhal@rbox.co>
      Link: https://lore.kernel.org/r/20221219171924.67989-1-seanjc@google.com
      Link: https://lore.kernel.org/all/20230118220003.1239032-1-mhal@rbox.co
      Signed-off-by: Sean Christopherson <seanjc@google.com>
  8. Jan 20, 2023
    • kvm/vfio: Fix potential deadlock on vfio group_lock · 51cdc8bc
      Yi Liu authored
      
      Currently it is possible that the final put of a KVM reference comes from
      vfio during its device close operation.  This occurs while the vfio group
      lock is held; however, if the vfio device is still in the kvm device list,
      then the following call chain could result in a deadlock:
      
      VFIO holds group->group_lock/group_rwsem
        -> kvm_put_kvm
         -> kvm_destroy_vm
          -> kvm_destroy_devices
           -> kvm_vfio_destroy
            -> kvm_vfio_file_set_kvm
             -> vfio_file_set_kvm
              -> try to hold group->group_lock/group_rwsem
      
      The key function is kvm_destroy_devices(), which triggers the destroy
      cb of kvm_device_ops. It calls back into vfio and tries to hold
      group_lock. So if this path doesn't call back into vfio, the deadlock
      is fixed. And there is a way to do that: KVM provides another point to
      free the kvm-vfio device, namely when the device file descriptor is
      closed. This can be achieved by providing the release cb instead of
      the destroy cb. Also rename kvm_vfio_destroy() to kvm_vfio_release().
      
      	/*
      	 * Destroy is responsible for freeing dev.
      	 *
      	 * Destroy may be called before or after destructors are called
      	 * on emulated I/O regions, depending on whether a reference is
      	 * held by a vcpu or other kvm component that gets destroyed
      	 * after the emulated I/O.
      	 */
      	void (*destroy)(struct kvm_device *dev);
      
      	/*
      	 * Release is an alternative method to free the device. It is
      	 * called when the device file descriptor is closed. Once
      	 * release is called, the destroy method will not be called
      	 * anymore as the device is removed from the device list of
      	 * the VM. kvm->lock is held.
      	 */
      	void (*release)(struct kvm_device *dev);
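
      A hedged sketch of the resulting ops change (the member list of
      kvm_vfio_ops is abbreviated here):

      	static struct kvm_device_ops kvm_vfio_ops = {
      		.name = "kvm-vfio",
      		.create = kvm_vfio_create,
      		.release = kvm_vfio_release,	/* was: .destroy = kvm_vfio_destroy */
      		/* ... attribute callbacks unchanged ... */
      	};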
      
      Fixes: 421cfe65 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
      Reported-by: Alex Williamson <alex.williamson@redhat.com>
      Suggested-by: Kevin Tian <kevin.tian@intel.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Yi Liu <yi.l.liu@intel.com>
      Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
      Link: https://lore.kernel.org/r/20230114000351.115444-1-mjrosato@linux.ibm.com
      Link: https://lore.kernel.org/r/20230120150528.471752-1-yi.l.liu@intel.com
      [aw: update comment as well, s/destroy/release/]
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  9. Dec 29, 2022
    • KVM: Clean up error labels in kvm_init() · 9f1a4c00
      Sean Christopherson authored
      
      Convert the last two "out" labels to "err" labels now that the dust
      has settled, i.e. now that there are no more planned changes to the
      order of things in kvm_init().
      
      Use "err" instead of "out" as it's easier to describe what failed than it
      is to describe what needs to be unwound, e.g. if allocating a per-CPU kick
      mask fails, KVM needs to free any masks that were allocated, and of course
      needs to unwind previous operations.
      
      Reported-by: Chao Gao <chao.gao@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-51-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Opt out of generic hardware enabling on s390 and PPC · 441f7bfa
      Sean Christopherson authored
      
      Allow architectures to opt out of the generic hardware enabling logic,
      and opt out on both s390 and PPC, which don't need to manually enable
      virtualization as it's always on (when available).
      
      In addition to letting s390 and PPC drop a bit of dead code, this will
      hopefully also allow ARM to clean up its related code, e.g. ARM has its
      own per-CPU flag to track which CPUs have enabled hardware due to the
      need to keep hardware enabled indefinitely when pKVM is enabled.
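
      A hedged sketch of how the opt-out plausibly looks in common code (the
      Kconfig symbol name is an assumption):

      	#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
      	/* hardware_enable_all() and friends live here; compiled out for
      	 * architectures that opt out, e.g. s390 and PPC. */
      	#endif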
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Acked-by: Anup Patel <anup@brainfault.org>
      Message-Id: <20221130230934.1014142-50-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Register syscore (suspend/resume) ops early in kvm_init() · 35774a9f
      Sean Christopherson authored
      
      Register the suspend/resume notifier hooks at the same time KVM
      registers its reboot notifier so that all the code in kvm_init() that
      deals with enabling/disabling hardware is bundled together.
      Opportunistically move KVM's implementations to reside near the reboot
      notifier code for the same reason.
      
      Bunching the code together will allow architectures to opt out of KVM's
      generic hardware enable/disable logic with minimal #ifdeffery.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-49-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Make hardware_enable_failed a local variable in the "enable all" path · e6fb7d6e
      Isaku Yamahata authored
      
      Rework detecting hardware enabling errors to use a local variable in the
      "enable all" path to track whether or not enabling was successful across
      all CPUs.  Using a global variable complicates paths that enable hardware
      only on the current CPU, e.g. kvm_resume() and kvm_online_cpu().
      
      Opportunistically add a WARN if hardware enabling fails during
      kvm_resume(); KVM is all kinds of hosed if CPU0 fails to enable
      hardware. The WARN is largely futile in the current code, as KVM
      BUG()s on spurious faults on VMX instructions, e.g. attempting to run
      a vCPU on a CPU where hardware enabling failed will explode.
      
        ------------[ cut here ]------------
        kernel BUG at arch/x86/kvm/x86.c:508!
        invalid opcode: 0000 [#1] SMP
        CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_spurious_fault+0xa/0x10
        Call Trace:
         vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel]
         vmx_vcpu_load+0x16/0x60 [kvm_intel]
         kvm_arch_vcpu_load+0x32/0x1f0
         vcpu_load+0x2f/0x40
         kvm_arch_vcpu_ioctl_run+0x19/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      But, the WARN may provide a breadcrumb to understand what went awry, and
      someday KVM may fix one or both of those bugs, e.g. by finding a way to
      eat spurious faults no matter the context (easier said than done due to
      side effects of certain operations, e.g. Intel's VMCLEAR).
      
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      [sean: rebase, WARN on failure in kvm_resume()]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-48-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Use a per-CPU variable to track which CPUs have enabled virtualization · 37d25881
      Sean Christopherson authored
      
      Use a per-CPU variable instead of a shared bitmap to track which CPUs
      have successfully enabled virtualization hardware.  Using a per-CPU bool
      avoids the need for an additional allocation, and arguably yields easier
      to read code.  Using a bitmap would be advantageous if KVM used it to
      avoid generating IPIs to CPUs that failed to enable hardware, but that's
      an extreme edge case and not worth optimizing, and the low level helpers
      would still want to keep their individual checks as attempting to enable
      virtualization hardware when it's already enabled can be problematic,
      e.g. Intel's VMXON will fault.
      
      Opportunistically change the order in hardware_enable_nolock() to set
      the flag if and only if hardware enabling is successful, instead of
      speculatively setting the flag and then clearing it on failure.
      
      Add a comment explaining that the check in hardware_disable_nolock()
      isn't simply paranoia.  Waaay back when, commit 1b6c0168 ("KVM: Keep
      track of which cpus have virtualization enabled") added the logic as a
      guard against CPU hotplug racing with hardware enable/disable.  Now that
      KVM has eliminated the race by taking cpu_hotplug_lock for read (via
      cpus_read_lock()) when enabling or disabling hardware, at first glance it
      appears that the check is now superfluous, i.e. it's tempting to remove
      the per-CPU flag entirely...
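
      A hedged sketch of the per-CPU tracking and the set-on-success
      ordering described above (the helper name is illustrative):

      	static DEFINE_PER_CPU(bool, hardware_enabled);

      	static int __hardware_enable_nolock(void)
      	{
      		if (__this_cpu_read(hardware_enabled))
      			return 0;

      		if (kvm_arch_hardware_enable())
      			return -EIO;

      		/* Set the flag if and only if enabling succeeded. */
      		__this_cpu_write(hardware_enabled, true);
      		return 0;
      	}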
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-47-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Remove on_each_cpu(hardware_disable_nolock) in kvm_exit() · 667a83bf
      Isaku Yamahata authored
      
      Drop the superfluous invocation of hardware_disable_nolock() during
      kvm_exit(), as it's nothing more than a glorified nop.
      
      KVM automatically disables hardware on all CPUs when the last VM is
      destroyed, and kvm_exit() cannot be called until the last VM goes
      away as the calling module is pinned by an elevated refcount of the fops
      associated with /dev/kvm.  This holds true even on x86, where the caller
      of kvm_exit() is not kvm.ko, but is instead a dependent module, kvm_amd.ko
      or kvm_intel.ko, as kvm_chardev_ops.owner is set to the module that calls
      kvm_init(), not hardcoded to the base kvm.ko module.
      
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      [sean: rework changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-46-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock · 0bf50497
      Isaku Yamahata authored
      
      Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now
      that KVM hooks CPU hotplug during the ONLINE phase, which can sleep.
      Previously, KVM hooked the STARTING phase, which is not allowed to sleep
      and thus could not take kvm_lock (a mutex).  This effectively allows the
      task that's initiating hardware enabling/disabling to be preempted and/or
      migrated.
      
      Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock
      is "raw" because hardware enabling/disabling needs to be atomic with
      respect to migration is wrong on multiple fronts.  First, while regular
      spinlocks can be preempted, the task holding the lock cannot be migrated.
      Second, preventing migration is not required.  on_each_cpu() disables
      preemption, which ensures that cpus_hardware_enabled correctly reflects
      hardware state.  The task may be preempted/migrated between bumping
      kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as
      kvm_usage_count is still protected, e.g. other tasks that call
      hardware_enable_all() will be blocked until the preempted/migrated owner
      exits its critical section.
      
      KVM does have lockless accesses to kvm_usage_count in the suspend/resume
      flows, but those are safe because all tasks must be frozen prior to
      suspending CPUs, and a task cannot be frozen while it holds one or more
      locks (userspace tasks are frozen via a fake signal).
      
      Preemption doesn't need to be explicitly disabled in the hotplug path.
      The hotplug thread is pinned to the CPU that's being hotplugged, and KVM
      only cares about having a stable CPU, i.e. to ensure hardware is enabled
      on the correct CPU.  Lockdep, i.e. check_preemption_disabled(), plays nice
      with this state too, as is_percpu_thread() is true for the hotplug thread.
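
      A hedged sketch of the resulting locking shape (kvm_lock is a mutex,
      which is fine to take in the sleepable ONLINE phase; the inner helper
      name is illustrative):

      	static int hardware_enable_all(void)
      	{
      		int r = 0;

      		cpus_read_lock();
      		mutex_lock(&kvm_lock);	/* now protects kvm_usage_count */

      		kvm_usage_count++;
      		if (kvm_usage_count == 1)
      			r = hardware_enable_all_cpus();

      		mutex_unlock(&kvm_lock);
      		cpus_read_unlock();
      		return r;
      	}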
      
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-45-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Ensure CPU is stable during low level hardware enable/disable · 2c106f29
      Sean Christopherson authored
      
      Use the non-raw smp_processor_id() in the low level hardware
      enable/disable helpers, as KVM absolutely relies on the CPU being
      stable, e.g. KVM would end up with incorrect state if the task were
      migrated between accessing cpus_hardware_enabled and actually
      enabling/disabling hardware.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-44-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Disable CPU hotplug during hardware enabling/disabling · e4aa7f88
      Chao Gao authored
      
      Disable CPU hotplug when enabling/disabling hardware to prevent the
      corner case where the following sequence occurs:

        1. A hotplugged CPU marks itself online in cpu_online_mask
        2. The hotplugged CPU enables interrupts before invoking KVM's ONLINE
           callback
        3. hardware_{en,dis}able_all() is invoked on another CPU

      In that case, the hotplugged CPU will be included in on_each_cpu() and
      thus get sent through hardware_{en,dis}able_nolock() before
      kvm_online_cpu() is called.
      
              start_secondary { ...
                      set_cpu_online(smp_processor_id(), true); <- 1
                      ...
                      local_irq_enable();  <- 2
                      ...
                      cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); <- 3
              }
      
      KVM currently fudges around this race by keeping track of which CPUs have
      done hardware enabling (see commit 1b6c0168 "KVM: Keep track of which
      cpus have virtualization enabled"), but that's an inefficient, convoluted,
      and hacky solution.
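
      A hedged sketch of the fix's shape: bracket the enable/disable paths
      with cpus_read_lock() so a CPU cannot come online mid-operation.

      	cpus_read_lock();	/* block CPU hotplug */
      	...
      	on_each_cpu(hardware_enable_nolock, NULL, 1);
      	...
      	cpus_read_unlock();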
      
      Signed-off-by: Chao Gao <chao.gao@intel.com>
      [sean: split to separate patch, write changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-43-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Rename and move CPUHP_AP_KVM_STARTING to ONLINE section · aaf12a7b
      Chao Gao authored
      
      The CPU STARTING section doesn't allow callbacks to fail. Move KVM's
      hotplug callback to the ONLINE section so that it can abort onlining a
      CPU in certain cases, e.g. when KVM fails to enable hardware
      virtualization on the hotplugged CPU, and thus avoid potentially
      breaking VMs running on existing CPUs.
      
      Place KVM's hotplug state before CPUHP_AP_SCHED_WAIT_EMPTY as it ensures
      when offlining a CPU, all user tasks and non-pinned kernel tasks have left
      the CPU, i.e. there cannot be a vCPU task around. So, it is safe for KVM's
      CPU offline callback to disable hardware virtualization at that point.
      Likewise, KVM's online callback can enable hardware virtualization before
      any vCPU task gets a chance to run on hotplugged CPUs.
      
      Drop kvm_x86_check_processor_compatibility()'s WARN that IRQs are
      disabled, as the ONLINE section runs with IRQs disabled.  The WARN wasn't
      intended to be a requirement, e.g. disabling preemption is sufficient,
      the IRQ thing was purely an aggressive sanity check since the helper was
      only ever invoked via SMP function call.
      
      Rename KVM's CPU hotplug callbacks accordingly.
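
      A hedged sketch of the registration after the move (the state and
      callback names follow the commit; the state-name string and call-site
      details are assumptions):

      	r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
      				      kvm_online_cpu, kvm_offline_cpu);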
      
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Chao Gao <chao.gao@intel.com>
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Yuan Yao <yuan.yao@intel.com>
      [sean: drop WARN that IRQs are disabled]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-42-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop kvm_arch_check_processor_compat() hook · 81a1cf9f
      Sean Christopherson authored
      
      Drop kvm_arch_check_processor_compat() and its support code now that all
      architecture implementations are nops.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>	# s390
      Acked-by: Anup Patel <anup@brainfault.org>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <20221130230934.1014142-33-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop kvm_arch_{init,exit}() hooks · a578a0a9
      Sean Christopherson authored
      
      Drop kvm_arch_init() and kvm_arch_exit() now that all implementations
      are nops.
      
      No functional change intended.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>	# s390
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Acked-by: Anup Patel <anup@brainfault.org>
      Message-Id: <20221130230934.1014142-30-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop arch hardware (un)setup hooks · 63a1bd8a
      Sean Christopherson authored
      
      Drop kvm_arch_hardware_setup() and kvm_arch_hardware_unsetup() now that
      all implementations are nops.
      
      No functional change intended.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>	# s390
      Acked-by: Anup Patel <anup@brainfault.org>
      Message-Id: <20221130230934.1014142-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Teardown VFIO ops earlier in kvm_exit() · 73b8dc04
      Sean Christopherson authored
      
      Move the call to kvm_vfio_ops_exit() further up kvm_exit() to try and
      bring some amount of symmetry to the setup order in kvm_init(), and more
      importantly so that the arch hooks are invoked dead last by kvm_exit().
      This will allow arch code to move away from the arch hooks without any
      change in ordering between arch code and common code in kvm_exit().
      
      That kvm_vfio_ops_exit() is called last appears to be 100% arbitrary.  It
      was bolted on after the fact by commit 571ee1b6 ("kvm: vfio: fix
      unregister kvm_device_ops of vfio").  The nullified kvm_device_ops_table
      is also local to kvm_main.c and is used only when there are active VMs,
      so unless arch code is doing something truly bizarre, nullifying the
      table earlier in kvm_exit() is little more than a nop.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>
      Message-Id: <20221130230934.1014142-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Allocate cpus_hardware_enabled after arch hardware setup · c9650228
      Sean Christopherson authored
      
      Allocate cpus_hardware_enabled after arch hardware setup so that arch
      "init" and "hardware setup" are called back-to-back and thus can be
      combined in a future patch.  cpus_hardware_enabled is never used before
      kvm_create_vm(), i.e. doesn't have a dependency with hardware setup and
      only needs to be allocated before /dev/kvm is exposed to userspace.
      
      Free the object before the arch hooks are invoked to maintain symmetry,
      and so that arch code can move away from the hooks without having to
      worry about ordering changes.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Yuan Yao <yuan.yao@intel.com>
      Message-Id: <20221130230934.1014142-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Initialize IRQ FD after arch hardware setup · 5910ccf0
      Sean Christopherson authored
      
      Move initialization of KVM's IRQ FD workqueue below arch hardware setup
      as a step towards consolidating arch "init" and "hardware setup", and
      eventually towards dropping the hooks entirely.  There is no dependency
      on the workqueue being created before hardware setup, the workqueue is
      used only when destroying VMs, i.e. only needs to be created before
      /dev/kvm is exposed to userspace.
      
      Move the destruction of the workqueue before the arch hooks to maintain
      symmetry, and so that arch code can move away from the hooks without
      having to worry about ordering changes.
      
      Reword the comment about kvm_irqfd_init() needing to come after
      kvm_arch_init() to call out that kvm_arch_init() must come before common
      KVM does _anything_, as x86 very subtly relies on that behavior to deal
      with multiple calls to kvm_init(), e.g. if userspace attempts to load
      kvm_amd.ko and kvm_intel.ko.  Tag the code with a FIXME, as x86's subtle
      requirement is gross, and invoking an arch callback as the very first
      action in a helper that is called only from arch code is silly.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>