mlx4_en_stop_port() calls mlx4_en_put_qp() which can refer the link level
address of the network interface, which in turn will be freed by the
network interface detach function. Make sure the port is stopped
before detaching the network interface.
MFC after: 1 week
Sponsored by: Mellanox Technologies
It can happen during shutdown that the lock will recurse when the mlx4en(4)
instance is part of a lagg interface. Call ether_ifdetach() unlocked.
Backtrace:
panic(): _sx_xlock_hard: recursed on non-recursive sx &mdev->state_lock
_sx_xlock_hard()
_sx_xlock()
mlx4_en_ioctl()
if_setlladdr()
lagg_port_destroy()
lagg_port_ifdetach()
if_detach()
mlx4_en_destroy_netdev()
mlx4_en_remove()
mlx4_remove_device()
mlx4_unregister_device()
mlx4_unload_one()
mlx4_shutdown()
linux_pci_shutdown()
bus_generic_shutdown()
MFC after: 1 week
Sponsored by: Mellanox Technologies
Copy small packets like TCP ACKs into a new mbuf
reusing the existing mbuf to receive a new ethernet
frame. This avoids wasting buffer space for
small sized packets.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Adding an interface might be done outside the device_attach() routine
and will then cause a panic, due to the VNET not being defined.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The "priv->pkstats.rx_dropped" is written twice in a row.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Also when the MTU is greater than MCLBYTES.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The hardware queues are deep enough currently and using the DRBR and associated
callbacks only leads to more task switching in the TX path. The is also a race
setting the queue_state which can lead to hung TX rings.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
In last mlx4 update (r325841) we lost the sysctl to show the
firmware version for mlx4 devices.
Add both board identifier and firmware version under:
sys.device.mlx4_core0.hw sysctl node.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
When acting as a VF it is required to add steering rules for all unicast
addresses. Even if promiscious mode is selected. Else incoming data packets
will be dropped.
MFC after: 3 days
Approved by: re (gjb)
Sponsored by: Mellanox Technologies
When receiving a PCP change all GID entries are reloaded.
This ensures the relevant GID entries use prio tagging,
by setting VLAN present and VLAN ID to zero.
The priority for prio tagged traffic is set using the regular
rdma_set_service_type() function.
Fake the real network device to have a VLAN ID of zero
when prio tagging is enabled. This is logic is hidden inside
the rdma_vlan_dev_vlan_id() function which must always be used
to retrieve the VLAN ID throughout all of ibcore and the
infiniband network drivers.
The VLAN presence information then propagates through all
of ibcore and so incoming connections will have the VLAN
bit set. The incoming VLAN ID is then checked against the
return value of rdma_vlan_dev_vlan_id().
MFC after: 1 week
Sponsored by: Mellanox Technologies
This fixes 32-bit compat (no ioctl command defintions are required
as struct ifreq is the same size). This is believed to be sufficent to
fully support ifconfig on 32-bit systems.
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 week
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14900
Creating a UD address handle from user-space or from the kernel-space,
when the link layer is ethernet, requires resolving the remote L3
address into a L2 address. Doing this from the kernel is easy because
the required ARP(IPv4) and ND6(IPv6) address resolving APIs are readily
available. In userspace such an interface does not exist and kernel
help is required.
It should be noted that in an IP-based GID environment, the GID itself
does not contain all the information needed to resolve the destination
IP address. For example information like VLAN ID and SCOPE ID, is not
part of the GID and must be fetched from the GID attributes. Therefore
a source GID should always be referred to as a GID index. Instead of
going through various racy steps to obtain information about the
GID attributes from user-space, this is now all done by the kernel.
This patch optimises the L3 to L2 address resolving using the existing
create address handle uverbs interface, retrieving back the L2 address
as an additional user-space information structure.
This commit combines the following Linux upstream commits:
IB/core: Let create_ah return extended response to user
IB/core: Change ib_resolve_eth_dmac to use it in create AH
IB/mlx5: Make create/destroy_ah available to userspace
IB/mlx5: Use kernel driver to help userspace create ah
IB/mlx5: Report that device has udata response in create_ah
MFC after: 1 week
Sponsored by: Mellanox Technologies
These were made redundant in r325841.
Reviewed by: hselasky
MFC after: 1 week (hselasky will MFC)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13401
Background:
The coming ibcore update forces an update of mlx4ib(4) which in turn requires
an updated mlx4 core module. This also affects the mlx4en(4) module because
commonly used APIs are updated. This commit is a middle step updating the
mlx4 modules towards the new ibcore.
This change contains no major new features.
Changes in mlx4:
a) Improved error handling when mlx4 PCI devices are
detached inside VMs.
b) Major update of codebase towards Linux 4.9.
Changes in mlx4ib(4):
a) Minimal changes needed in order to compile using the
updated mlx4 core APIs.
Changes in mlx4en(4):
a) Update flow steering code in mlx4en to use new APIs for
registering MAC addresses and IP addresses.
b) Update all statistics counters to be 64-bit.
c) Minimal changes needed in order to compile using the
updated mlx4 core APIs.
Sponsored by: Mellanox Technologies
MFC after: 1 week
iWarp and RoCE in ibcore. The selection of RDMA_PS_TCP can not be used
to indicate iWarp protocol use. Backport the proper IB device
capabilities from Linux upstream to distinguish between iWarp and
RoCE. Only allocate the additional socket required for iWarp for RDMA
IDs when at least one iWarp device present. This resolves
interopability issues between iWarp and RoCE in ibcore
Reviewed by: np @
Differential Revision: https://reviews.freebsd.org/D12563
Sponsored by: Mellanox Technologies
MFC after: 3 days
Follow up from r324201 to fix compilation with gcc, which complains
about non-ICE case expressions.
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D12675
is in VF or SRIOV mode typically in a virtual machine environment.
Submitted by: Sepherosa Ziehau <sephe@dragonflybsd.org>
MFC after: 3 days
Sponsored by: Mellanox Technologies
The m_defrag() function can only defrag mbuf chains which have a valid
mbuf packet header. In r291699 when the mlx4en(4) driver was converted
into using BUSDMA(9), the call to m_defrag() was moved after the part
of the transmit routine which strips the header from the mbuf chain.
This effectivly disabled the mbuf defrag mechanism and such packets
simply got dropped.
This patch removes the stripping of mbufs from a chain and loads all
mbufs using busdma. If busdma finds there are no segments, unload
the DMA map and free the mbuf right away, because that means all
data in the mbuf has been inlined in the TX ring. Else proceed
as usual.
Add a per-ring rounter for the number of defrag attempts and
make sure the oversized_packets counter gets zeroed while at it.
The counters are per-ring to avoid excessive cache misses in the
TX path.
Submitted by: mjoras@
Differential Revision: https://reviews.freebsd.org/D11683
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add io_mapping_init_wc() and add a third (unused) parameter to
io_mapping_map_wc().
Reviewed by: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11286
mlx4en network interfaces. This prevents infinite unit number growth
typically when the mlx4en driver is used inside virtual machines which
support runtime PCI attach and detach.
MFC after: 3 days
Sponsored by: Mellanox Technologies
mlx4en(4) driver in SRIOV mode.
Place a copy of the destination MAC address in the send WQE only under
SRIOV/eSwitch configuration or when the device is in selftest. This
allows communication between functions on the same host.
PR: 216493
MFC after: 3 days
Sponsored by: Mellanox Technologies
driver. Else if the port is up the resource might still be busy and
the MTT free will fail.
PR: 216493
MFC after: 3 days
Sponsored by: Mellanox Technologies
When creating EQs to handle CQ completion events for the PF or for
VFs, we create enough EQE entries to handle completions for the max
number of CQs that can use that EQ.
When SRIOV is activated, the max number of CQs a VF (or the PF) can
obtain is its CQ quota (determined by the Hypervisor resource
tracker). Therefore, when creating an EQ, the number of EQE entries
that the VF should request for that EQ is the CQ quota value (and not
the total number of CQs available in the firmware).
Under SRIOV, the PF, also must use its CQ quota, because the resource
tracker also controls how many CQs the PF can obtain.
Using the firmware total CQs instead of the CQ quota when creating EQs
resulted wasting MTT entries, due to allocating more EQEs than were
needed.
MFC after: 3 days
Sponsored by: Mellanox Technologies
When using Blue-Flame, BF, the QPN overrides the VLAN, CV, and SV
fields in the WQE. Thus, BF may only be used for QPNs with bits 6,7
unset.
The current ethernet driver code reserves a TX QP range with 256b
alignment.
This is wrong because if there are more than 64 TX QPs in use, QPNs >=
base + 65 will have bits 6/7 set.
This problem is not specific for the Ethernet driver, any entity that
tries to reserve more than 64 BF-enabled QPs should fail. Also, using
ranges is not necessary here and is wasteful.
The new mechanism introduced here will support reservation for "Eth
QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs
(when hypervisors support WC in VMs). The flow we use is:
1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation,
and request "BF enabled QPs" if BF is supported for the function
2. In the ALLOC_RES FW command, change param1 to:
a. param1[23:0] - number of QPs
b. param1[31-24] - flags controlling QPs reservation
Bit 31 refers to Eth blueflame supported QPs. Those QPs must have bits
6 and 7 unset in order to be used in Ethernet.
Bits 24-30 of the flags are currently reserved.
When a function tries to allocate a QP, it states the required
attributes for this QP. Those attributes are considered "best-effort".
If an attribute, such as Ethernet BF enabled QP, is a must-have
attribute, the function has to check that attribute is supported
before trying to do the allocation.
In a lower layer of the code, mlx4_qp_reserve_range masks out the bits
which are unsupported. If SRIOV is used, the PF validates those
attributes and masks out unsupported attributes as well. In order to
notify VFs which attributes are supported, the VF uses QUERY_FUNC_CAP
command. This command's mailbox is filled by the PF, which notifies
which QP allocation attributes it supports.
Obtained from: Linux (dual BSD/GPLv2 licensed)
Submitted by: Dexuan Cui @ microsoft . com
Differential Revision: https://reviews.freebsd.org/D8868
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
Previously, the mlx4 driver queried the firmware in order to get the
number of supported EQs. Under SRIOV, since this was done before the
driver notified the firmware how many VFs it actually needs, the
firmware had to take into account a worst case scenario and always
allocated four EQs per VF, where one was used for events while the
others were used for completions. Now, when the firmware supports the
asymmetric allocation scheme, denoted by exposing num_sys_eqs > 0 (-->
MLX4_DEV_CAP_FLAG2_SYS_EQS), we use the QUERY_FUNC command to query
the firmware before enabling SRIOV. Thus we can get more EQs and MSI-X
vectors per function. Moreover, when running in the new
firmware/driver mode, the limitation that the number of EQs should be
a power of two is lifted.
Obtained from: Linux (dual BSD/GPLv2 licensed)
Submitted by: Dexuan Cui @ microsoft . com
Differential Revision: https://reviews.freebsd.org/D8867
MFC after: 2 weeks
Sponsored by: Mellanox Technologies