aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/kvm/svm/nested.c
AgeCommit message (Collapse)Author
2022-11-18KVM: nSVM: hyper-v: Enable L2 TLB flushVitaly Kuznetsov
Implement Hyper-V L2 TLB flush for nSVM. The feature needs to be enabled both in extended 'nested controls' in VMCB and VP assist page. According to Hyper-V TLFS, synthetic vmexit to L1 is performed with - HV_SVM_EXITCODE_ENL exit_code. - HV_SVM_ENL_EXITCODE_TRAP_AFTER_FLUSH exit_info_1. Note: VP assist page is cached in 'struct kvm_vcpu_hv' so recalc_intercepts() doesn't need to read from guest's memory. KVM needs to update the case upon each VMRUN and after svm_set_nested_state (svm_get_nested_state_pages()) to handle the case when the guest got migrated while L2 was running. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-29-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18KVM: x86: Introduce .hv_inject_synthetic_vmexit_post_tlb_flush() nested hookVitaly Kuznetsov
Hyper-V supports injecting synthetic L2->L1 exit after performing L2 TLB flush operation but the procedure is vendor specific. Introduce .hv_inject_synthetic_vmexit_post_tlb_flush nested hook for it. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-22-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18KVM: nSVM: Keep track of Hyper-V hv_vm_id/hv_vp_idVitaly Kuznetsov
Similar to nSVM, KVM needs to know L2's VM_ID/VP_ID and Partition assist page address to handle L2 TLB flush requests. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-21-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18x86/hyperv: KVM: Rename "hv_enlightenments" to "hv_vmcb_enlightenments"Sean Christopherson
Now that KVM isn't littered with "struct hv_enlightenments" casts, rename the struct to "hv_vmcb_enlightenments" to highlight the fact that the struct is specifically for SVM's VMCB. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18KVM: SVM: Add a proper field for Hyper-V VMCB enlightenmentsSean Christopherson
Add a union to provide hv_enlightenments side-by-side with the sw_reserved bytes that Hyper-V's enlightenments overlay. Casting sw_reserved everywhere is messy, confusing, and unnecessarily unsafe. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18x86/hyperv: Move VMCB enlightenment definitions to hyperv-tlfs.hSean Christopherson
Move Hyper-V's VMCB enlightenment definitions to the TLFS header; the definitions come directly from the TLFS[*], not from KVM. No functional change intended. [*] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_svm_enlightened_vmcb_fields [vitaly: rename VMCB_HV_ -> HV_VMCB_ to match the rest of hyperv-tlfs.h, keep svm/hyperv.h] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20221101145426.251680-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17Merge branch 'kvm-svm-harden' into HEADPaolo Bonzini
This fixes three issues in nested SVM: 1) in the shutdown_interception() vmexit handler we call kvm_vcpu_reset(). However, if running nested and L1 doesn't intercept shutdown, the function resets vcpu->arch.hflags without properly leaving the nested state. This leaves the vCPU in inconsistent state and later triggers a kernel panic in SVM code. The same bug can likely be triggered by sending INIT via local apic to a vCPU which runs a nested guest. On VMX we are lucky that the issue can't happen because VMX always intercepts triple faults, thus triple fault in L2 will always be redirected to L1. Plus, handle_triple_fault() doesn't reset the vCPU. INIT IPI can't happen on VMX either because INIT events are masked while in VMX mode. Secondarily, KVM doesn't honour SHUTDOWN intercept bit of L1 on SVM. A normal hypervisor should always intercept SHUTDOWN, a unit test on the other hand might want to not do so. Finally, the guest can trigger a kernel non rate limited printk on SVM from the guest, which is fixed as well. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17KVM: x86: allow L1 to not intercept triple faultMaxim Levitsky
This is SVM correctness fix - although a sane L1 would intercept SHUTDOWN event, it doesn't have to, so we have to honour this. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17KVM: x86: add kvm_leave_nestedMaxim Levitsky
add kvm_leave_nested which wraps a call to nested_ops->leave_nested into a function. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while still in useMaxim Levitsky
Make sure that KVM uses vmcb01 before freeing nested state, and warn if that is not the case. This is a minimal fix for CVE-2022-3344 making the kernel print a warning instead of a kernel panic. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: compile out vendor-specific code if SMM is disabledPaolo Bonzini
Vendor-specific code that deals with SMI injection and saving/restoring SMM state is not needed if CONFIG_KVM_SMM is disabled, so remove the four callbacks smi_allowed, enter_smm, leave_smm and enable_smi_window. The users in svm/nested.c and x86.c also have to be compiled out; the amount of #ifdef'ed code is small and it's not worth moving it to smm.c. enter_smm is now used only within #ifdef CONFIG_KVM_SMM, and the stub can therefore be removed. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220929172016.319443-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: start moving SMM-related functions to new filesPaolo Bonzini
Create a new header and source with code related to system management mode emulation. Entry and exit will move there too; for now, opportunistically rename put_smstate to PUT_SMSTATE while moving it to smm.h, and adjust the SMM state saving code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220929172016.319443-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Rename inject_pending_events() to kvm_check_and_inject_events()Sean Christopherson
Rename inject_pending_events() to kvm_check_and_inject_events() in order to capture the fact that it handles more than just pending events, and to (mostly) align with kvm_check_nested_events(), which omits the "inject" for brevity. Add a comment above kvm_check_and_inject_events() to provide a high-level synopsis, and to document a virtualization hole (KVM erratum if you will) that exists due to KVM not strictly tracking instruction boundaries with respect to coincident instruction restarts and asynchronous events. No functional change inteded. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-25-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Morph pending exceptions to pending VM-Exits at queue timeSean Christopherson
Morph pending exceptions to pending VM-Exits (due to interception) when the exception is queued instead of waiting until nested events are checked at VM-Entry. This fixes a longstanding bug where KVM fails to handle an exception that occurs during delivery of a previous exception, KVM (L0) and L1 both want to intercept the exception (e.g. #PF for shadow paging), and KVM determines that the exception is in the guest's domain, i.e. queues the new exception for L2. Deferring the interception check causes KVM to esclate various combinations of injected+pending exceptions to double fault (#DF) without consulting L1's interception desires, and ends up injecting a spurious #DF into L2. KVM has fudged around the issue for #PF by special casing emulated #PF injection for shadow paging, but the underlying issue is not unique to shadow paging in L0, e.g. if KVM is intercepting #PF because the guest has a smaller maxphyaddr and L1 (but not L0) is using shadow paging. Other exceptions are affected as well, e.g. if KVM is intercepting #GP for one of SVM's workaround or for the VMware backdoor emulation stuff. The other cases have gone unnoticed because the #DF is spurious if and only if L1 resolves the exception, e.g. KVM's goofs go unnoticed if L1 would have injected #DF anyways. The hack-a-fix has also led to ugly code, e.g. bailing from the emulator if #PF injection forced a nested VM-Exit and the emulator finds itself back in L1. Allowing for direct-to-VM-Exit queueing also neatly solves the async #PF in L2 mess; no need to set a magic flag and token, simply queue a #PF nested VM-Exit. Deal with event migration by flagging that a pending exception was queued by userspace and check for interception at the next KVM_RUN, e.g. so that KVM does the right thing regardless of the order in which userspace restores nested state vs. event state. When "getting" events from userspace, simply drop any pending excpetion that is destined to be intercepted if there is also an injected exception to be migrated. Ideally, KVM would migrate both events, but that would require new ABI, and practically speaking losing the event is unlikely to be noticed, let alone fatal. The injected exception is captured, RIP still points at the original faulting instruction, etc... So either the injection on the target will trigger the same intercepted exception, or the source of the intercepted exception was transient and/or non-deterministic, thus dropping it is ok-ish. Fixes: a04aead144fd ("KVM: nSVM: fix running nested guests when npt=0") Fixes: feaf0c7dc473 ("KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2") Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-22-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Formalize blocking of nested pending exceptionsSean Christopherson
Capture nested_run_pending as block_pending_exceptions so that the logic of why exceptions are blocked only needs to be documented once instead of at every place that employs the logic. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-16-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Make kvm_queued_exception a properly named, visible structSean Christopherson
Move the definition of "struct kvm_queued_exception" out of kvm_vcpu_arch in anticipation of adding a second instance in kvm_vcpu_arch to handle exceptions that occur when vectoring an injected exception and are morphed to VM-Exit instead of leading to #DF. Opportunistically take advantage of the churn to rename "nr" to "vector". No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-15-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Print guest pgd in kvm_nested_vmenter()Mingwei Zhang
Print guest pgd in kvm_nested_vmenter() to enrich the information for tracing. When tdp is enabled, print the value of tdp page table (EPT/NPT); when tdp is disabled, print the value of non-nested CR3. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20220825225755.907001-4-mizhang@google.com [sean: print nested_cr3 vs. nested_eptp vs. guest_cr3] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Update trace function for nested VM entry to support VMXMingwei Zhang
Update trace function for nested VM entry to support VMX. Existing trace function only supports nested VMX and the information printed out is AMD specific. So, rename trace_kvm_nested_vmrun() to trace_kvm_nested_vmenter(), since 'vmenter' is generic. Add a new field 'isa' to recognize Intel and AMD; Update the output to print out VMX/SVM related naming respectively, eg., vmcb vs. vmcs; npt vs. ept. Opportunistically update the call site of trace_kvm_nested_vmenter() to make one line per parameter. Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20220825225755.907001-2-mizhang@google.com [sean: align indentation, s/update/rename in changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86: Split kvm_is_valid_cr4() and export only the non-vendor bitsSean Christopherson
Split the common x86 parts of kvm_is_valid_cr4(), i.e. the reserved bits checks, into a separate helper, __kvm_is_valid_cr4(), and export only the inner helper to vendor code in order to prevent nested VMX from calling back into vmx_is_valid_cr4() via kvm_is_valid_cr4(). On SVM, this is a nop as SVM doesn't place any additional restrictions on CR4. On VMX, this is also currently a nop, but only because nested VMX is missing checks on reserved CR4 bits for nested VM-Enter. That bug will be fixed in a future patch, and could simply use kvm_is_valid_cr4() as-is, but nVMX has _another_ bug where VMXON emulation doesn't enforce VMX's restrictions on CR0/CR4. The cleanest and most intuitive way to fix the VMXON bug is to use nested_host_cr{0,4}_valid(). If the CR4 variant routes through kvm_is_valid_cr4(), using nested_host_cr4_valid() won't do the right thing for the VMXON case as vmx_is_valid_cr4() enforces VMX's restrictions if and only if the vCPU is post-VMXON. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: nSVM: Pull CS.Base from actual VMCB12 for soft int/ex re-injectionMaciej S. Szmigiero
enter_svm_guest_mode() first calls nested_vmcb02_prepare_control() to copy control fields from VMCB12 to the current VMCB, then nested_vmcb02_prepare_save() to perform a similar copy of the save area. This means that nested_vmcb02_prepare_control() still runs with the previous save area values in the current VMCB so it shouldn't take the L2 guest CS.Base from this area. Explicitly pull CS.Base from the actual VMCB12 instead in enter_svm_guest_mode(). Granted, having a non-zero CS.Base is a very rare thing (and even impossible in 64-bit mode), having it change between nested VMRUNs is probably even rarer, but if it happens it would create a really subtle bug so it's better to fix it upfront. Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <4caa0f67589ae3c22c311ee0e6139496902f2edc.1658159083.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86: nSVM: always intercept x2apic msrsMaxim Levitsky
As a preparation for x2avic, this patch ensures that x2apic msrs are always intercepted for the nested guest. Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220519102709.24125-11-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09Merge branch 'kvm-5.20-early'Paolo Bonzini
s390: * add an interface to provide a hypervisor dump for secure guests * improve selftests to show tests x86: * Intel IPI virtualization * Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS * PEBS virtualization * Simplify PMU emulation by just using PERF_TYPE_RAW events * More accurate event reinjection on SVM (avoid retrying instructions) * Allow getting/setting the state of the speaker port data bit * Rewrite gfn-pfn cache refresh * Refuse starting the module if VM-Entry/VM-Exit controls are inconsistent * "Notify" VM exit
2022-06-09KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSEPaolo Bonzini
Commit 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") introduced passthrough support for nested pause filtering, (when the host doesn't intercept PAUSE) (either disabled with kvm module param, or disabled with '-overcommit cpu-pm=on') Before this commit, L1 KVM didn't intercept PAUSE at all; afterwards, the feature was exposed as supported by KVM cpuid unconditionally, thus if L1 could try to use it even when the L0 KVM can't really support it. In this case the fallback caused KVM to intercept each PAUSE instruction; in some cases, such intercept can slow down the nested guest so much that it can fail to boot. Instead, before the problematic commit KVM was already setting both thresholds to 0 in vmcb02, but after the first userspace VM exit shrink_ple_window was called and would reset the pause_filter_count to the default value. To fix this, change the fallback strategy - ignore the guest threshold values, but use/update the host threshold values unless the guest specifically requests disabling PAUSE filtering (either simple or advanced). Also fix a minor bug: on nested VM exit, when PAUSE filter counter were copied back to vmcb01, a dirty bit was not set. Thanks a lot to Suravee Suthikulpanit for debugging this! Fixes: 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") Reported-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Co-developed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220518072709.730031-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08KVM: x86: Introduce "struct kvm_caps" to track misc caps/settingsSean Christopherson
Add kvm_caps to hold a variety of capabilites and defaults that aren't handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce the amount of boilerplate code required to add a new feature. The vast majority (all?) of the caps interact with vendor code and are written only during initialization, i.e. should be tagged __read_mostly, declared extern in x86.h, and exported. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08KVM: nSVM: Transparently handle L1 -> L2 NMI re-injectionMaciej S. Szmigiero
A NMI that L1 wants to inject into its L2 should be directly re-injected, without causing L0 side effects like engaging NMI blocking for L1. It's also worth noting that in this case it is L1 responsibility to track the NMI window status for its L2 guest. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <f894d13501cd48157b3069a4b4c7369575ddb60e.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08KVM: SVM: Re-inject INTn instead of retrying the insn on "failure"Sean Christopherson
Re-inject INTn software interrupts instead of retrying the instruction if the CPU encountered an intercepted exception while vectoring the INTn, e.g. if KVM intercepted a #PF when utilizing shadow paging. Retrying the instruction is architecturally wrong e.g. will result in a spurious #DB if there's a code breakpoint on the INT3/O, and lack of re-injection also breaks nested virtualization, e.g. if L1 injects a software interrupt and vectoring the injected interrupt encounters an exception that is intercepted by L0 but not L1. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <1654ad502f860948e4f2d57b8bd881d67301f785.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08KVM: SVM: Re-inject INT3/INTO instead of retrying the instructionSean Christopherson
Re-inject INT3/INTO instead of retrying the instruction if the CPU encountered an intercepted exception while vectoring the software exception, e.g. if vectoring INT3 encounters a #PF and KVM is using shadow paging. Retrying the instruction is architecturally wrong, e.g. will result in a spurious #DB if there's a code breakpoint on the INT3/O, and lack of re-injection also breaks nested virtualization, e.g. if L1 injects a software exception and vectoring the injected exception encounters an exception that is intercepted by L0 but not L1. Due to, ahem, deficiencies in the SVM architecture, acquiring the next RIP may require flowing through the emulator even if NRIPS is supported, as the CPU clears next_rip if the VM-Exit is due to an exception other than "exceptions caused by the INT3, INTO, and BOUND instructions". To deal with this, "skip" the instruction to calculate next_rip (if it's not already known), and then unwind the RIP write and any side effects (RFLAGS updates). Save the computed next_rip and use it to re-stuff next_rip if injection doesn't complete. This allows KVM to do the right thing if next_rip was known prior to injection, e.g. if L1 injects a soft event into L2, and there is no backing INTn instruction, e.g. if L1 is injecting an arbitrary event. Note, it's impossible to guarantee architectural correctness given SVM's architectural flaws. E.g. if the guest executes INTn (no KVM injection), an exit occurs while vectoring the INTn, and the guest modifies the code stream while the exit is being handled, KVM will compute the incorrect next_rip due to "skipping" the wrong instruction. A future enhancement to make this less awful would be for KVM to detect that the decoded instruction is not the correct INTn and drop the to-be-injected soft event (retrying is a lesser evil compared to shoving the wrong RIP on the exception stack). Reported-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <65cb88deab40bc1649d509194864312a89bbe02e.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02Maciej S. Szmigiero
The next_rip field of a VMCB is *not* an output-only field for a VMRUN. This field value (instead of the saved guest RIP) in used by the CPU for the return address pushed on stack when injecting a software interrupt or INT3 or INTO exception. Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or loading a nested state and NRIPS is exposed to L1. If NRIPS is supported in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which saves RIP on the stack as-is). Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <c2e0a3d78db3ae30530f11d4e9254b452a89f42b.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07KVM: SVM: fix tsc scaling cache logicMaxim Levitsky
SVM uses a per-cpu variable to cache the current value of the tsc scaling multiplier msr on each cpu. Commit 1ab9287add5e2 ("KVM: X86: Add vendor callbacks for writing the TSC multiplier") broke this caching logic. Refactor the code so that all TSC scaling multiplier writes go through a single function which checks and updates the cache. This fixes the following scenario: 1. A CPU runs a guest with some tsc scaling ratio. 2. New guest with different tsc scaling ratio starts on this CPU and terminates almost immediately. This ensures that the short running guest had set the tsc scaling ratio just once when it was set via KVM_SET_TSC_KHZ. Due to the bug, the per-cpu cache is not updated. 3. The original guest continues to run, it doesn't restore the msr value back to its own value, because the cache matches, and thus continues to run with a wrong tsc scaling ratio. Fixes: 1ab9287add5e2 ("KVM: X86: Add vendor callbacks for writing the TSC multiplier") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86: Clean up and document nested #PF workaroundSean Christopherson
Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround with an explicit, common workaround in kvm_inject_emulated_page_fault(). Aside from being a hack, the current approach is brittle and incomplete, e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(), and nVMX fails to apply the workaround when VMX is intercepting #PF due to allow_smaller_maxphyaddr=1. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13KVM: x86: Drop WARNs that assert a triple fault never "escapes" from L2Sean Christopherson
Remove WARNs that sanity check that KVM never lets a triple fault for L2 escape and incorrectly end up in L1. In normal operation, the sanity check is perfectly valid, but it incorrectly assumes that it's impossible for userspace to induce KVM_REQ_TRIPLE_FAULT without bouncing through KVM_RUN (which guarantees kvm_check_nested_state() will see and handle the triple fault). The WARN can currently be triggered if userspace injects a machine check while L2 is active and CR4.MCE=0. And a future fix to allow save/restore of KVM_REQ_TRIPLE_FAULT, e.g. so that a synthesized triple fault isn't lost on migration, will make it trivially easy for userspace to trigger the WARN. Clearing KVM_REQ_TRIPLE_FAULT when forcibly leaving guest mode is tempting, but wrong, especially if/when the request is saved/restored, e.g. if userspace restores events (including a triple fault) and then restores nested state (which may forcibly leave guest mode). Ignoring the fact that KVM doesn't currently provide the necessary APIs, it's userspace's responsibility to manage pending events during save/restore. ------------[ cut here ]------------ WARNING: CPU: 7 PID: 1399 at arch/x86/kvm/vmx/nested.c:4522 nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel] Modules linked in: kvm_intel kvm irqbypass CPU: 7 PID: 1399 Comm: state_test Not tainted 5.17.0-rc3+ #808 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel] Call Trace: <TASK> vmx_leave_nested+0x30/0x40 [kvm_intel] vmx_set_nested_state+0xca/0x3e0 [kvm_intel] kvm_arch_vcpu_ioctl+0xf49/0x13e0 [kvm] kvm_vcpu_ioctl+0x4b9/0x660 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ---[ end trace 0000000000000000 ]--- Fixes: cb6a32c2b877 ("KVM: x86: Handle triple fault in L2 without killing L1") Cc: stable@vger.kernel.org Cc: Chenyi Qiang <chenyi.qiang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220407002315.78092-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: SVM: allow AVIC to co-exist with a nested guest runningMaxim Levitsky
Inhibit the AVIC of the vCPU that is running nested for the duration of the nested run, so that all interrupts arriving from both its vCPU siblings and from KVM are delivered using normal IPIs and cause that vCPU to vmexit. Note that unlike normal AVIC inhibition, there is no need to update the AVIC mmio memslot, because the nested guest uses its own set of paging tables. That also means that AVIC doesn't need to be inhibited VM wide. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322174050.241850-7-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: nSVM: implement nested vGIFMaxim Levitsky
In case L1 enables vGIF for L2, the L2 cannot affect L1's GIF, regardless of STGI/CLGI intercepts, and since VM entry enables GIF, this means that L1's GIF is always 1 while L2 is running. Thus in this case leave L1's vGIF in vmcb01, while letting L2 control the vGIF thus implementing nested vGIF. Also allow KVM to toggle L1's GIF during nested entry/exit by always using vmcb01. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322174050.241850-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSEMaxim Levitsky
Expose the pause filtering and threshold in the guest CPUID and support PAUSE filtering when possible: - If the L0 doesn't intercept PAUSE (cpu_pm=on), then allow L1 to have full control over PAUSE filtering. - if the L1 doesn't intercept PAUSE, use host values and update the adaptive count/threshold even when running nested. - Otherwise always exit to L1; it is not really possible to merge the fields correctly. It is expected that in this case, userspace will not enable this feature in the guest CPUID, to avoid having the guest update both fields pointlessly. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322174050.241850-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: nSVM: implement nested LBR virtualizationMaxim Levitsky
This was tested with kvm-unit-test that was developed for this purpose. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322174050.241850-3-mlevitsk@redhat.com> [Copy all of DEBUGCTL except for reserved bits. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is runningMaxim Levitsky
When L2 is running without LBR virtualization, we should ensure that L1's LBR msrs continue to update as usual. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322174050.241850-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02kvm: x86: SVM: use vmcb* instead of svm->vmcb where it makes senseMaxim Levitsky
This makes the code a bit shorter and cleaner. No functional change intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: nSVM: implement nested VMLOAD/VMSAVEMaxim Levitsky
This was tested by booting L1,L2,L3 (all Linux) and checking that no VMLOAD/VMSAVE vmexits happened. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220301143650.143749-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86/mmu: load new PGD after the shadow MMU is initializedPaolo Bonzini
Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and shadow_root_level anymore, pull the PGD load after the initialization of the shadow MMUs. Besides being more intuitive, this enables future simplifications and optimizations because it's not necessary anymore to compute the role outside kvm_init_mmu. In particular, kvm_mmu_reset_context was not attempting to use a cached PGD to avoid having to figure out the new role. With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing, and avoid unloading all the cached roots. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: nSVM: Implement Enlightened MSR-Bitmap featureVitaly Kuznetsov
Similar to nVMX commit 502d2bf5f2fd ("KVM: nVMX: Implement Enlightened MSR Bitmap feature"), add support for the feature for nSVM (Hyper-V on KVM). Notable differences from nVMX implementation: - As the feature uses SW reserved fields in VMCB control, KVM needs to make sure it's dealing with a Hyper-V guest (kvm_hv_hypercall_enabled()). - 'msrpm_base_pa' needs to be always be overwritten in nested_svm_vmrun_msrpm(), even when the update is skipped. As an optimization, nested_vmcb02_prepare_control() copies it from VMCB01 so when MSR-Bitmap feature for L2 is disabled nothing needs to be done. - 'struct vmcb_ctrl_area_cached' needs to be extended with clean fields/sw reserved data and __nested_copy_vmcb_control_to_cache() needs to copy it so nested_svm_vmrun_msrpm() can use it later. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220202095100.129834-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: nSVM: Track whether changes in L0 require MSR bitmap for L2 to be rebuiltVitaly Kuznetsov
Similar to nVMX commit ed2a4800ae9d ("KVM: nVMX: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt"), introduce a flag to keep track of whether MSR bitmap for L2 needs to be rebuilt due to changes in MSR bitmap for L1 or switching to a different L2. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220202095100.129834-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-08KVM: x86: nSVM: fix potential NULL derefernce on nested migrationMaxim Levitsky
Turns out that due to review feedback and/or rebases I accidentally moved the call to nested_svm_load_cr3 to be too early, before the NPT is enabled, which is very wrong to do. KVM can't even access guest memory at that point as nested NPT is needed for that, and of course it won't initialize the walk_mmu, which is main issue the patch was addressing. Fix this for real. Fixes: 232f75d3b4b5 ("KVM: nSVM: call nested_svm_load_cr3 on nested state load") Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220207155447.840194-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-26KVM: x86: Forcibly leave nested virt when SMM state is toggledSean Christopherson
Forcibly leave nested virtualization operation if userspace toggles SMM state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace forces the vCPU out of SMM while it's post-VMXON and then injects an SMI, vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both vmxon=false and smm.vmxon=false, but all other nVMX state allocated. Don't attempt to gracefully handle the transition as (a) most transitions are nonsencial, e.g. forcing SMM while L2 is running, (b) there isn't sufficient information to handle all transitions, e.g. SVM wants access to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede KVM_SET_NESTED_STATE during state restore as the latter disallows putting the vCPU into L2 if SMM is active, and disallows tagging the vCPU as being post-VMXON in SMM if SMM is not active. Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU in an architecturally impossible state. WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Modules linked in: CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00 Call Trace: <TASK> kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123 kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline] kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460 kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline] kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676 kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline] kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250 kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273 __fput+0x286/0x9f0 fs/file_table.c:311 task_work_run+0xdd/0x1a0 kernel/task_work.c:164 exit_task_work include/linux/task_work.h:32 [inline] do_exit+0xb29/0x2a30 kernel/exit.c:806 do_group_exit+0xd2/0x2f0 kernel/exit.c:935 get_signal+0x4b0/0x28c0 kernel/signal.c:2862 arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> Cc: stable@vger.kernel.org Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220125220358.2091737-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: X86: Remove mmu parameter from load_pdptrs()Lai Jiangshan
It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs, and nested NPT treats them like all other non-leaf page table levels instead of caching them. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: SVM: Remove references to VCPU_EXREG_CR3Lai Jiangshan
VCPU_EXREG_CR3 is never cleared from vcpu->arch.regs_avail or vcpu->arch.regs_dirty in SVM; therefore, marking CR3 as available is merely a NOP, and testing it will likewise always succeed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: nSVM: introduce struct vmcb_ctrl_area_cachedEmanuele Giuseppe Esposito
This structure will replace vmcb_control_area in svm_nested_state, providing only the fields that are actually used by the nested state. This avoids having and copying around uninitialized fields. The cost of this, however, is that all functions (in this case vmcb_is_intercept) expect the old structure, so they need to be duplicated. In addition, in svm_get_nested_state() user space expects a vmcb_control_area struct, so we need to copy back all fields in a temporary structure before copying it to userspace. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211103140527.752797-7-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: nSVM: split out __nested_vmcb_check_controlsPaolo Bonzini
Remove the struct vmcb_control_area parameter from nested_vmcb_check_controls, for consistency with the functions that operate on the save area. This way, VMRUN uses the version without underscores for both areas, while KVM_SET_NESTED_STATE uses the version with underscores. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: nSVM: use svm->nested.save to load vmcb12 registers and avoid TOC/TOU racesEmanuele Giuseppe Esposito
Use the already checked svm->nested.save cached fields (EFER, CR0, CR4, ...) instead of vmcb12's in nested_vmcb02_prepare_save(). This prevents from creating TOC/TOU races, since the guest could modify the vmcb12 fields. This also avoids the need of force-setting EFER_SVME in nested_vmcb02_prepare_save. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211103140527.752797-6-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: nSVM: use vmcb_save_area_cached in nested_vmcb_valid_sregs()Emanuele Giuseppe Esposito
Now that struct vmcb_save_area_cached contains the required vmcb fields values (done in nested_load_save_from_vmcb12()), check them to see if they are correct in nested_vmcb_valid_sregs(). While at it, rename nested_vmcb_valid_sregs in nested_vmcb_check_save. __nested_vmcb_check_save takes the additional @save parameter, so it is helpful when we want to check a non-svm save state, like in svm_set_nested_state. The reason for that is that save is the L1 state, not L2, so we check it without moving it to svm->nested.save. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20211103140527.752797-5-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: nSVM: rename nested_load_control_from_vmcb12 in ↵Emanuele Giuseppe Esposito
nested_copy_vmcb_control_to_cache Following the same naming convention of the previous patch, rename nested_load_control_from_vmcb12. In addition, inline copy_vmcb_control_area as it is only called by this function. __nested_copy_vmcb_control_to_cache() works with vmcb_control_area parameters and it will be useful in next patches, when we use local variables instead of svm cached state. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20211103140527.752797-4-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>