1
0
mirror of https://git.FreeBSD.org/src.git synced 2025-01-23 16:01:42 +00:00
Commit Graph

61 Commits

Author SHA1 Message Date
Neel Natu
b060ba5024 Simplify the assignment of memory to virtual machines by requiring a single
command line option "-m <memsize in MB>" to specify the memory size.

Prior to this change the user needed to explicitly specify the amount of
memory allocated below 4G (-m <lowmem>) and the amount above 4G (-M <highmem>).

The "-M" option is no longer supported by 'bhyveload' and 'bhyve'.

The start of the PCI hole is fixed at 3GB and cannot be directly changed
using command line options. However it is still possible to change this in
special circumstances via the 'vm_set_lowmem_limit()' API provided by
libvmmapi.

Submitted by:	Dinakar Medavaram (initial version)
Reviewed by:	grehan
Obtained from:	NetApp
2013-03-18 22:38:30 +00:00
Neel Natu
1e7d750c75 Change the type of 'ndesc' from 'int' to 'uint16_t' so that descriptor index
wraparound is handled correctly.

The gory details are available here:
http://lists.freebsd.org/pipermail/freebsd-virtualization/2013-March/001119.html

This fixes a regression introduced in r247871.

Pointed out by:	Bruce Evans, Chris Torek
2013-03-16 05:40:29 +00:00
Neel Natu
4b5e84f615 Convert the offset into the bar that contains the MSI-X table to an offset
into the MSI-X table before using it to calculate the table index.

In the common case where the MSI-X table is located at the begining of the
BAR these two offsets are identical and thus the code was working by accident.

This change will fix the case where the MSI-X table is located in the middle
or at the end of the BAR that contains it.

Obtained from:	NetApp
2013-03-11 17:36:37 +00:00
Peter Grehan
6be7c5e31c Simplify virtio ring num-available calculation.
Submitted by:	Chris Torek, torek at torek dot net
2013-03-06 07:28:20 +00:00
Peter Grehan
ba02487a0e Reorder code to avoid the stat buffer being used uninitialized.
Obtained from:	NetApp
2013-03-06 06:24:09 +00:00
Neel Natu
91039bb268 Specify the length of the mapping requested from 'paddr_guest2host()'.
This seems prudent to do in its own right but it also opens up the possibility
of not having to mmap the entire guest address space in the 'bhyve' process
context.

Discussed with:	grehan
Obtained from:	NetApp
2013-03-01 02:26:28 +00:00
Neel Natu
58a6b0338a Ignore the BARRIER flag in the virtio block header.
This capability is not advertised by the host so ignore it even if the guest
insists on setting the flag.

Reviewed by:	grehan
Obtained from:	NetApp
2013-02-26 20:02:17 +00:00
Neel Natu
42b4049c34 Get rid of unused struct member.
Pointed out by:	Gopakumar T
Obtained from:	NetApp
2013-02-25 20:31:47 +00:00
Peter Grehan
0ab13648f5 Add the ability to have a 'fallback' search for memory ranges.
These set of ranges will be looked at if a standard memory
range isn't found, and won't be installed in the cache.
Use this to implement the memory behaviour of the PCI hole on
x86 systems, where writes are ignored and reads always return -1.
This allows breakpoints to be set when issuing a 'boot -d', which
has the side effect of accessing the PCI hole when changing the
PTE protection on kernel code, since the pmap layer hasn't been
initialized (a bug, but present in existing FreeBSD releases so
has to be handled).

Reviewed by:	neel
Obtained from:	NetApp
2013-02-22 00:46:32 +00:00
Neel Natu
74f80b236d Advertise PCI-E capability in the hostbridge device presented to the guest.
FreeBSD wants to see this capability in at least one device in the PCI
hierarchy before it allows use of MSI or MSI-X.

Obtained from:	NetApp
2013-02-15 18:41:36 +00:00
Neel Natu
485b3300cc Implement guest vcpu pinning using 'pthread_setaffinity_np(3)'.
Prior to this change pinning was implemented via an ioctl (VM_SET_PINNING)
that called 'sched_bind()' on behalf of the user thread.

The ULE implementation of 'sched_bind()' bumps up 'td_pinned' which in turn
runs afoul of the assertion '(td_pinned == 0)' in userret().

Using the cpuset affinity to implement pinning of the vcpu threads works with
both 4BSD and ULE schedulers and has the happy side-effect of getting rid
of a bunch of code in vmm.ko.

Discussed with:	grehan
2013-02-11 20:36:07 +00:00
John Baldwin
0895e9c70c Install <dev/agp/agpreg.h> and <dev/pci/pcireg.h> as userland headers
in /usr/include.

MFC after:	2 weeks
2013-02-05 18:55:09 +00:00
Neel Natu
445e089e21 Add support for MSI-X interrupts in the virtio block device and make that
the default.

The current behavior of advertising a single MSI vector can be requested by
setting the environment variable "BHYVE_USE_MSI" to "yes". The use of MSI
is not compliant with the virtio specification and will be eventually phased
out.

Submitted by:	Gopakumar T
Obtained from:	NetApp
2013-02-01 16:58:59 +00:00
Neel Natu
2b89a04496 Fix a broken assumption in the passthru implementation that the MSI-X table
can only be located at the beginning or the end of the BAR.

If the MSI-table is located in the middle of a BAR then we will split the
BAR into two and create two mappings - one before the table and one after
the table - leaving a hole in place of the table so accesses to it can be
trapped and emulated.

Obtained from:	NetApp
2013-02-01 03:49:09 +00:00
Neel Natu
aa12663f49 Fix a bug in the passthru implementation where it would assume that all
devices are MSI-X capable. This in turn would lead it to treat bar 0 as
the MSI-X table bar even if the underlying device did not support MSI-X.

Fix this by providing an API to query the MSI-X table index of the emulated
device. If the underlying device does not support MSI-X then this API will
return -1.

Obtained from:	NetApp
2013-02-01 02:41:47 +00:00
Neel Natu
c9b4e98754 Add support for MSI-X interrupts in the virtio network device and make that
the default.

The current behavior of advertising a single MSI vector can be requested by
setting the environment variable "BHYVE_USE_MSI" to "true". The use of MSI
is not compliant with the virtio specification and will be eventually phased
out.

Submitted by:	Gopakumar T
Obtained from:	NetApp
2013-01-30 04:30:36 +00:00
Peter Grehan
2838f343c8 Improve correctness of rtc register implementation.
Submitted by:	tycho nightingale at pluribusnetworks com
2013-01-25 22:43:20 +00:00
Neel Natu
1caafc47a3 Use the correct type (uint64_t) to retrieve sysctl machdep.tsc_freq.
Simplify the function a bit by falling through after initialization and
return via the normal code path.

Reviewed by:	grehan
Obtained from:	NetApp
2013-01-25 06:27:03 +00:00
Neel Natu
2e81a7e8ab Allocate the memory for the MSI-X table dynamically instead of allocating 32KB
statically. In most cases the number of table entries will be far less than
the maximum of 2048 allowed by the PCI specification.

Reuse macros from pcireg.h to interpret the MSI-X capability instead of rolling
our own.

Obtained from:	NetApp
2013-01-21 22:07:05 +00:00
Neel Natu
c3cbaac942 Get rid of redundant 'table_size' field in struct pi_msix. If needed it can
always be calculated from the number of entries in the MSI-X table.

Obtained from:	NetApp
2013-01-21 08:12:59 +00:00
Neel Natu
24be8623c6 Use <vmname> in a consistent manner in usage messages output by 'bhyve',
'bhyveload' and 'bhyvectl'.

Pointed out by:	joel@
2013-01-20 03:47:13 +00:00
Peter Grehan
70d980fc34 Don't completely drain the read file descriptor. Instead, only
fill up to the uart's rx fifo size, and leave any remaining input
for when the rx fifo is read. This allows cut'n'paste of long lines
to be done into the bhyve console without truncation.

Also, introduce a mutex since the file input will run in the mevent
thread context and may corrupt state accessed by a vCPU thread.

Reviewed by:	neel
Approved by:	NetApp
2013-01-07 07:33:48 +00:00
Peter Grehan
5fdfc2b893 Use 64-bit arithmetic throughout, and lock accesses to globals.
With this change, dbench with >= 4 processes runs without getting
weird jumps forward in time when the APCI pmtimer is the default
timecounter.

Obtained from:	NetApp
2013-01-07 04:51:43 +00:00
Neel Natu
5f0677d392 The "unrestricted guest" capability is a feature of Intel VT-x that allows
the guest to execute real or unpaged protected mode code - bhyve relies on
this feature to execute the AP bootstrap code.

Get rid of the hack that allowed bhyve to support SMP guests on processors
that do not have the "unrestricted guest" capability. This hack was entirely
FreeBSD-specific and would not work with any other guest OS.

Instead, limit the number of vcpus to 1 when executing on processors without
"unrestricted guest" capability.

Suggested by:	grehan
Obtained from:	NetApp
2013-01-04 02:04:41 +00:00
Peter Grehan
4e8c7465ad Change thread name for the main kqueue event loop to "<vmname> mevent" so
it can be easily distinguished from other non-vCPU threads in forthcoming
changes.

Obtained from:	NetApp
2012-12-20 23:01:53 +00:00
Peter Grehan
e285ef8d28 Rename fbsdrun.* -> bhyverun.*
bhyve is intended to be a generic hypervisor, and not FreeBSD-specific.

(renaming internal routines will come later)

Reviewed by:	neel
Obtained from:	NetApp
2012-12-13 01:58:11 +00:00
Peter Grehan
31f78e49b3 Properly reset the tx/rx rings when a guest requests a device reset.
Obtained from:	NetApp
2012-12-12 19:45:36 +00:00
Peter Grehan
100ace48f3 Create unique MAC addresses for virtio devices that are
created with non-zero PCI function numbers.

Remove obsolete reference to CFE.

Obtained from:	NetApp
2012-12-12 19:25:48 +00:00
Peter Grehan
2e325b3328 Determine the correct length and sector size for raw devices.
Obtained from:	NetApp
Tested by:	Michael Dexter with iscsi LUNs
2012-12-08 02:37:18 +00:00
Peter Grehan
09538d411a - Add in an XSDT to stop acpidump from exiting with a
'XSDT corrupted' error
- Fix up OEMID/OEM Table ID string padding in the DSDT.

Output on a verbose boot now looks like

...
ACPI: RSDP 0xf0400 00024 (v02 BHYVE )
ACPI: XSDT 0xf0480 00034 (v01 BHYVE  BVXSDT   00000001 INTL 20120320)
ACPI: APIC 0xf0500 0004A (v01 BHYVE  BVMADT   00000001 INTL 20120320)
ACPI: FACP 0xf0600 0010C (v05 BHYVE  BVFACP   00000001 INTL 20120320)
ACPI: DSDT 0xf0800 000F2 (v02 BHYVE  BVDSDT   00000001 INTL 20120320)
ACPI: FACS 0xf0780 00040
...

Obtained from:	NetApp
2012-11-30 07:00:14 +00:00
Neel Natu
48a29f4e07 Cleanup the user-space paging exit handler now that the unified instruction
emulation is in place.

Obtained from:	NetApp
2012-11-28 13:34:44 +00:00
Neel Natu
ba9b7bf73a Revamp the x86 instruction emulation in bhyve.
On a nested page table fault the hypervisor will:
- fetch the instruction using the guest %rip and %cr3
- decode the instruction in 'struct vie'
- emulate the instruction in host kernel context for local apic accesses
- any other type of mmio access is punted up to user-space (e.g. ioapic)

The decoded instruction is passed as collateral to the user-space process
that is handling the PAGING exit.

The emulation code is fleshed out to include more addressing modes (e.g. SIB)
and more types of operands (e.g. imm8). The source code is unified into a
single file (vmm_instruction_emul.c) that is compiled into vmm.ko as well
as /usr/sbin/bhyve.

Reviewed by:	grehan
Obtained from:	NetApp
2012-11-28 00:02:17 +00:00
Neel Natu
a07896de6c MSI-X does not need to be enabled in the message control register for the
guest to access the MSI-x tables.

Obtained from:	NetApp
2012-11-22 04:17:32 +00:00
Neel Natu
4662939d55 Mask the %eax register properly based on whether the "out" instruction is
operating on 1, 2 or 4 bytes.

There could be garbage in the unused bytes so zero them off.

Obtained from:	NetApp
2012-11-21 00:14:03 +00:00
Peter Grehan
430a787224 ACPI support for bhyve.
The -A option will create the minimal set of required ACPI tables in
guest memory. Since ACPI mandates an IOAPIC, the -I option must also
be used.

Template ASL files are created, and then passed to the iasl compiler
to generate AML files. These are then loaded into guest physical mem.

In support of this, the ACPI PM timer is implemented, in 32-bit mode.

Tested on 7.4/8.*/9.*/10-CURRENT.

Reviewed by:	neel
Obtained from:	NetApp
Discussed with:	jhb (a long while back)
2012-11-20 07:01:26 +00:00
Neel Natu
a10c6f5544 IFC @ r242684 2012-11-11 03:26:14 +00:00
Peter Grehan
de06f9bdf5 Change the thread name of the vCPU threads to contain the
name of the VM and the vCPU number. This helps hugely
when using top -H to identify what a VM is doing.

Reviewed by:	neel
Obtained from:	NetApp
2012-10-31 19:17:55 +00:00
Peter Grehan
00c66a5a85 Exit if the requested num vCPUs exceeds the maximum rather
than waiting until AP bringup detects an out-of-range vCPU.

While here, fix all error output to use fprintf(stderr, ...

Reviewed by:	neel
Reported by:	@allanjude
2012-10-31 03:29:52 +00:00
Neel Natu
275eece835 Present the bvm dbgport to the guest only when explicitly requested via
the "-g" command line option.

Suggested by:	grehan
Obtained from:	NetApp
2012-10-27 22:58:02 +00:00
Neel Natu
e365fca6c8 Present the bvm console device to the guest only when explicitly requested via
the "-b" command line option.

Reviewed by:	grehan
Obtained from:	NetApp
2012-10-27 22:33:23 +00:00
Neel Natu
90415e0b4e Ignore PCI configuration accesses to all bus numbers other than PCI bus 0.
Obtained from:	NetApp
2012-10-27 02:39:08 +00:00
Peter Grehan
fbfc1c763c Remove mptable generation code from libvmmapi and move it to bhyve.
Firmware tables require too much knowledge of system configuration,
and it's difficult to pass that information in general terms to a library.
The upcoming ACPI work exposed this - it will also livein bhyve.

Also, remove code specific to NetApp from the mptable name, and remove
the -n option from bhyve.

Reviewed by:	neel
Obtained from:	NetApp
2012-10-26 13:40:12 +00:00
Peter Grehan
4d1e669cad Rework how guest MMIO regions are dealt with.
- New memory region interface. An RB tree holds the regions,
with a last-found per-vCPU cache to deal with the common case
of repeated guest accesses to MMIO registers in the same page.

- Support memory-mapped BARs in PCI emulation.

 mem.c/h - memory region interface

 instruction_emul.c/h - remove old region interface.
 Use gpa from EPT exit to avoid a tablewalk to
 determine operand address. Determine operand size
 and use when calling through to region handler.

 fbsdrun.c - call into region interface on paging
  exit. Distinguish between instruction emul error
  and region not found

 pci_emul.c/h - implement new BAR callback api.
 Split BAR alloc routine into routines that
 require/don't require the BAR phys address.

 ioapic.c
 pci_passthru.c
 pci_virtio_block.c
 pci_virtio_net.c
 pci_uart.c  - update to new BAR callback i/f

Reviewed by:	neel
Obtained from:	NetApp
2012-10-19 18:11:17 +00:00
Neel Natu
3019332f6d Deal with transient EBUSY error return from vm_run() by retrying the operation. 2012-10-12 18:49:07 +00:00
Neel Natu
73820fb0a4 Add an option "-a" to present the local apic in the XAPIC mode instead of the
default X2APIC mode to the guest.
2012-09-26 00:06:17 +00:00
Neel Natu
edf89256dd Add an explicit exit code 'SPINUP_AP' to tell the controlling process that an
AP needs to be activated by spinning up an execution context for it.

The local apic emulation is now completely done in the hypervisor and it will
detect writes to the ICR_LO register that try to bring up the AP. In response
to such writes it will return to userspace with an exit code of SPINUP_AP.

Reviewed by: grehan
2012-09-25 02:33:25 +00:00
Neel Natu
25d4944e06 Fix a bug in how a 64-bit bar in a pci passthru device would be presented to
the guest. Prior to the fix it was possible for such a bar to appear as a
32-bit bar as long as it was allocated from the region below 4GB.

This had the potential to confuse some drivers that were particular about
the size of the bars.

Obtained from:	NetApp
2012-08-06 07:20:25 +00:00
Neel Natu
99d653892b Add support for emulating PCI multi-function devices.
These function number is specified by an optional [:<func>] after the slot
number: -s 1:0,virtio-net,tap0

Ditto for the mptable naming: -n 1:0,e0a

Obtained from:	NetApp
2012-08-06 06:51:27 +00:00
Neel Natu
b0b53d3af9 Device model for ioapic emulation.
With this change the uart emulation is entirely interrupt driven.

Obtained from: NetApp
2012-08-05 00:00:52 +00:00
Neel Natu
da91bbd7a7 The displacement field in the decoded instruction should be treated as a 8-bit
or 32-bit signed integer.

Simplify the handling of indirect addressing with displacement by
unconditionally adding the 'instruction->disp' to the target address.
This is alright since 'instruction->disp' is non-zero only for the
addressing modes that specify a displacement.

Obtained from: NetApp
2012-08-04 23:51:21 +00:00