With clang 15, the following -Werror warning is produced:
sys/net/iflib.c:993:8: error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable]
u_int n;
^
The 'n' variable appears to have been a debugging aid that has never
been used for anything, so remove it.
MFC after: 3 days
Adds a new function pointer to struct if_txrx in order to allow
drivers to set their own function that will determine which queue
a packet should be sent on.
Since this includes a kernel ABI change, bump the __FreeBSD_version
as well.
(This motivation behind this is to allow the driver to examine the
UP in the VLAN tag and determine which queue to TX on based on
that, in support of HW TX traffic shaping.)
Signed-off-by: Eric Joyner <erj@FreeBSD.org>
Reviewed by: kbowling@, stallamr@netapp.com
Tested by: jeffrey.e.pieper@intel.com
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D31485
In iflib_device_register(), the CTX_LOCK is acquired first and then
IFNET_WLOCK is acquired by ether_ifattach(). However, in netmap_hw_reg()
we do the opposite: IFNET_RLOCK is acquired first, and then CTX_LOCK
is acquired by iflib_netmap_register(). Fix this LOR issue by wrapping
the CTX_LOCK/UNLOCK calls in iflib_device_register with an additional
IFNET_WLOCK. This is safe since the IFNET_WLOCK is recursive.
MFC after: 1 month
I've noticed relations between iflib_timer() vs ixl_admin_timer().
Both scheduled at the same 2Hz rate, but the second is rescheduling
the first each time, so if the first get any slower, it won't be
executed at all. Revert this until deeper investigation.
This reverts commit 90bc1cf657.
The introduction of <sched.h> improved compatibility with some 3rd
party software, but caused the configure scripts of some ports to
assume that they were run in a GLIBC compatible environment.
Parts of sched.h were made conditional on -D_WITH_CPU_SET_T being
added to ports, but there still were compatibility issues due to
invalid assumptions made in autoconfigure scripts.
The differences between the FreeBSD version of macros like CPU_AND,
CPU_OR, etc. and the GLIBC versions was in the number of arguments:
FreeBSD used a 2-address scheme (one source argument is also used as
the destination of the operation), while GLIBC uses a 3-adderess
scheme (2 source operands and a separately passed destination).
The GLIBC scheme provides a super-set of the functionality of the
FreeBSD macros, since it does not prevent passing the same variable
as source and destination arguments. In code that wanted to preserve
both source arguments, the FreeBSD macros required a temporary copy of
one of the source arguments.
This patch set allows to unconditionally provide functions and macros
expected by 3rd party software written for GLIBC based systems, but
breaks builds of externally maintained sources that use any of the
following macros: CPU_AND, CPU_ANDNOT, CPU_OR, CPU_XOR.
One contributed driver (contrib/ofed/libmlx5) has been patched to
support both the old and the new CPU_OR signatures. If this commit
is merged to -STABLE, the version test will have to be extended to
cover more ranges.
Ports that have added -D_WITH_CPU_SET_T to build on -CURRENT do
no longer require that option.
The FreeBSD version has been bumped to 1400046 to reflect this
incompatible change.
Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D33451
On SIOCSIFCAP, some bits in ifp->if_capenable may be toggled.
When this happens, apply the same change to isc_capenable, which
is the iflib private copy of if_capenable (for a subset of the
IFCAP_* bits). In this way the iflib drivers can check the bits
using isc_capenable rather than if_capenable. This is convenient
because the latter access requires an additional indirection
through the ifp, and it is also less likely to be in cache.
PR: 260068
Reviewed by: kbowling, gallatin
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33156
iflib_stop modifies iflib data structures that are used by _task_fn_rx,
most prominently the free lists. So, iflib_stop has to ensure that the
rx task threads are not active.
This should help to fix a crash seen when iflib_if_ioctl (e.g.,
SIOCSIFCAP) is called while there is already traffic flowing.
The crash has been seen on VMWare guests with vmxnet3 driver.
My guess is that on physical hardware the couple of 1ms delays that
iflib_stop has after disabling interrupts are enough for the queued work
to be completed before any iflib state is touched.
But on busy hypervisors the guests might not get enough CPU time to
complete the work, thus there can be a race between the taskqueue
threads and the work done to handle an ioctl, specifically in iflib_stop
and iflib_init_locked.
PR: 259458
Reviewed by: markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D32926
When iflib devices are in netmap mode the driver
counters are no longer updated making it look from
userspace tools that traffic has stopped.
Reported by: Franco Fichtner <franco@opnsense.org>
Reviewed by: vmaffione, iflib (erj, gallatin)
Obtained from: OPNsense
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D31550
ENETC NIC found in LS1028A has a bug where clearing TX pidx/cidx
causes the ring to hang after being re-enabled.
Add a new flag, if set iflib will preserve the indices during restart.
Submitted by: Kornel Duleba <mindal@semihalf.com>
Reviewed by: gallatin, erj
Obtained from: Semihalf
Sponsored by: Alstom Group
Differential Revision: https://reviews.freebsd.org/D30728
Drivers can specify padding of received frames with iri_pad field.
This can be used to enforce ip alignment by hardware.
Iflib ignored that padding when processing small frames,
which rendered this feature inoperable.
I found it while writing a driver for a NIC that can ip align
received packets. Note that this doesn't change behavior of existing
drivers as they all set iri_pad to 0.
Submitted by: Kornel Duleba <mindal@semihalf.com>
Reviewed by: gallatin
Obtained from: Semihalf
Sponsored by: Alstom Group
Differential Revision: https://reviews.freebsd.org/D30009
iflib now supports mapping each (TX,RX) queue pair to the same CPU
(default), to separate CPUs, or to a pair of physical and logical CPUs
that share the same L2 cache. The mapping mechanism supports unequal
numbers of TX and RX queues, with the excess queues always being
mapped to consecutive physical CPUs. When the platform cannot
distinguish between physical and logical CPUs, all are treated as
physical CPUs. See the comment on get_cpuid_for_queue() for the
entire matrix.
The following device-specific tunables influence the mapping process:
dev.<device>.<unit>.iflib.core_offset (existing)
dev.<device>.<unit>.iflib.separate_txrx (existing)
dev.<device>.<unit>.iflib.use_logical_cores (new)
The following new, read-only sysctls provide visibility of the mapping
results:
dev.<device>.<unit>.iflib.{t,r}xq<n>.cpu
When an iflib driver allocates TX softirqs without providing reference
RX IRQs, iflib now binds those TX softirqs to CPUs using the above
mapping mechanism (that is, treats them as if they were TX IRQs).
Previously, such bindings were left up to the grouptaskqueue code and
thus fell outside of the iflib CPU mapping strategy.
Reviewed by: kbowling
Tested by: olivier, pkelsey
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D24094
Changes to the LRO code have exposed a bug in iflib where devices
which are not capable of doing LRO are still calling
tcp_lro_flush_all(), even when they have not initialized the LRO
context. This used to be mostly harmless, but the LRO code now sets
the VNET based on the ifp in the lro context and will try to access it
through a NULL ifp resulting in a panic at boot.
To fix this, we unconditionally initializes LRO so that we have a
valid LRO context when calling tcp_lro_flush_all(). One alternative is
to check the device capabilities before calling tcp_lro_flush_all() or
adding a new state flag in the ctx. However, it seems unwise to add an
extra, mostly useless test for higher performance devices when we can
just initialize LRO for all devices.
Reviewed by: erj, hselasky, markj, olivier
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29928
This per-driver callback is invoked by netmap when it wants
to align the number of TX/RX netmap rings and/or the number of
TX/RX netmap slots to the actual state configured in the hardware.
The alignment happens when netmap mode is switched on (with no
active netmap file descriptors for that netmap port), or when
collecting netmap port information.
MFC after: 1 week
For interfaces with admin completion queues, introduce a new devmethod
IFDI_ADMIN_COMPLETION_HANDLE and a corresponding flag IFLIB_HAS_ADMINCQ.
This provides an option for handling any admin cq logic, which cannot be
run from an interrupt context.
Said method is called from within iflib's admin task, making it safe to
sleep.
Reviewed by: mmacy
Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Differential Revision: https://reviews.freebsd.org/D28708
Commit 6dd69f0064 ("iflib: introduce isc_dma_width")
failed to build on powerpc due to implicit type conversion
error. Fix that.
Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Some DMA controllers are unable to address the full host memory space
and are instead limited to a subset of address range (e.g. 48-bit).
Allow the driver to specify the maximum allowed DMA addressing width
(in bits) for the NIC hardware, by introducing a new field in
if_softc_ctx.
If said field is omitted (set to 0), the lowaddr of DMA window bounds
defaults to BUS_SPACE_MAXADDR.
Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Differential Revision: https://reviews.freebsd.org/D28706
iflib_rxeof() was counting everything twice. This was introduced when
pfil hooks were added to the iflib receive path. We want to count rx
packets/bytes before the pfil hooks are executed, so remove the counter
adjustments that are executed after.
PR: 253583
Reviewed by: gallatin, erj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28900
This fixes a warning from GCC for kernels without netmap since the
return value is never used.
Reviewed by: vmaffione, erj
Differential Revision: https://reviews.freebsd.org/D28598
iflib_init_locked() assumes that iflib_stop() has been called, however,
it is not called for media changes.
iflib_if_init_locked() calls stop then init, so fixes the problem.
PR: 253473
MFC after: 3 days
Reviewed by: markj
Sponsored by: Juniper Networks, Inc., Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D28667
Memory and PCI resources are freed with no particular order. This could
cause use-after-frees when detaching following a failed attach. For
instance, iflib_tx_structures_free() frees ctx->ifc_txqs[] but
iflib_tqg_detach() attempts to access this array. Similarly, adapter
queues gets freed by IFDI_QUEUES_FREE() but IFDI_DETACH() attempts to
access adapter queues to free PCI resources.
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D27634
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address. Then, this was extended to array of
such pointers. Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change. The new name should be generic enough to avoid further
renaming.
The NS_MOREFRAG flag can be set in a netmap slot to represent a
multi-fragment packet. Only the last fragment of a packet does
not have the flag set. On TX rings, the flag may be set by the
userspace application. The kernel will look at the flag and use it
to properly set up the NIC TX descriptors.
On RX rings, the kernel may set the flag if the packet received
was split across multiple netmap buffers. The userspace application
should look at the flag to know when the packet is complete.
Submitted by: rajesh1.kumar_amd.com
Reviewed by: vmaffione
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27799
rxd_frag_to_sd() have pf_rv parameter as NULL with the current
code. This patch fixes the NULL pointer dereference in that
case thus avoiding a possible panic.
Submitted by: rajesh1.kumar at amd.com
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D28115
Restore the hwofs functionality temporarily disabled by
7ba6ecf216 to prevent issues with iflib.
This patch brings the necessary changes to iflib to
enable howfs to allow interface restarts without
disrupting netmap applications actively using its
rings.
After this change, it becomes possible for multiple
non-cooperating netmap applications to use non-overlapping
subsets of the available netmap rings without clashing
with each other.
PR: 252453
MFC after: 1 week
The iflib_queues_alloc() allocates isc_nrxqs iflib_dma_info structs
for each rxqset, and links each struct to a different free list.
As a result, it must be isc_nrxqs >= isc_nfl (plus the completion
queue, if present).
Add an assertion to make this constraint explicit.
MFC after: 2 weeks
Since 1d238b07d5, krings are disabled before
a reinit cycle triggered by iflib_netmap_register.
However, this operation is actually necessary also for
any interface reinit triggered by other causes (i.e.,
ifconfig commands).
We achieve this goal by moving the krings enable/disable
operation inside iflib_stop() and iflib_init_locked().
Once here, this change also removes some redundant operations
from iflib_netmap_register(), that are already performed by
iflib_stop().
PR: 252453
MFC after: 1 week
When netmap_fl_refill() is called at initialization time (e.g.,
during netmap_iflib_register()), nic_i must be 0, since the
free list is reinitialized. At the end of the refill cycle, nic_i
must still be zero, because exactly N descriptors (N is the ring size)
are refilled.
This patch therefore fixes the assertions to check on nic_i rather
than on nm_i. The current netmap_reset() may in fact cause nm_i
to be != 0 while the device is resetting: this may happen when
multiple non-cooperating processes open different subsets of the
available netmap rings.
PR: 252518
MFC after: 1 week
When different processes open separate subsets of the
available rings of a same netmap interface, a device
reset may be performed while one of the processes
is actively using some rings (e.g., caused by another
process executing a nmport_open()).
With this patch, such situation will cause the
active process to get a POLLERR, so that it can
have a chance to detect the situation.
We also guarantee that no process is running a txsync
or rxsync (ioctl or poll) while an iflib device reset
is in progress.
PR: 252453
MFC after: 1 week
Doing a 'dd' over iscsi will reliably cause stalls. Tx
cleaning _should_ reliably happen as data is sent.
However, currently if the transmit queue fills it will
wait until the iflib timer (hz/2) runs.
This change causes the the tx taskq thread to be run
if there are completed descriptors.
While here:
- make timer interrupt delay a sysctl
- simplify txd_db_check handling
- comment on INTR types
Background on the change:
Initially doorbell updates were minimized by only writing to the register
on every fourth packet. If txq_drain would return without writing to the
doorbell it scheduled a callout on the next tick to do the doorbell write
to ensure that the write otherwise happened "soon". At that time a sysctl
was added for users to avoid the potential added latency by simply writing
to the doorbell register on every packet. This worked perfectly well for
e1000 and ixgbe ... and appeared to work well on ixl. However, as it
turned out there was a race to this approach that would lockup the ixl MAC.
It was possible for a lower producer index to be written after a higher one.
On e1000 and ixgbe this was harmless - on ixl it was fatal. My initial
response was to add a lock around doorbell writes - fixing the problem but
adding an unacceptable amount of lock contention.
The next iteration was to use transmit interrupts to drive delayed doorbell
writes. If there were no packets in the queue all doorbell writes would be
immediate as the queue started to fill up we could delay doorbell writes
further and further. At the start of drain if we've cleaned any packets we
know we've moved the state machine along and we write the doorbell (an
obvious missing optimization was to skip that doorbell write if db_pending
is zero). This change required that tx interrupts be scheduled periodically
as opposed to just when the hardware txq was full. However, that just leads
to our next problem.
Initially dedicated msix vectors were used for both tx and rx. However, it
was often possible to use up all available vectors before we set up all the
queues we wanted. By having rx and tx share a vector for a given queue we
could halve the number of vectors used by a given configuration. The problem
here is that with this change only e1000 passed the necessary value to have
the fast interrupt drive tx when appropriate.
Reported by: mav@
Tested by: mav@
Reviewed by: gallatin@
MFC after: 1 month
Sponsored by: iXsystems
Differential Revision: https://reviews.freebsd.org/D27683
In some error paths we would fail to detach from the iflib taskqueue
groups. Also move the detach code into its own subroutine instead of
duplicating it.
Submitted by: Sai Rajesh Tallamraju <stallamr@netapp.com>
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D27342
Code was supposed to call callout_reset_sbt_on() rather than
callout_reset_sbt(). This resulted into passing a "cpu" value
to a "flag" argument. A recipe for subtle errors.
PR: 248652
Reported by: sg@efficientip.com
MFC with: r367093
The way netmap TX is handled in iflib when TX interrupts are not
used (IFC_NETMAP_TX_IRQ not set) has some issues:
- The netmap_tx_irq() function gets called by iflib_timer(), which
gets scheduled with tick granularity (hz). This is not frequent
enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end
result is that the transmitting netmap application is not woken
up fast enough to saturate the link with small packets.
- The iflib_timer() functions also calls isc_txd_credits_update()
to ask for more TX completion updates. However, this violates
the netmap requirement that only txsync can access the TX queue
for datapath operations. Only netmap_tx_irq() may be called out
of the txsync context.
This change introduces per-tx-queue netmap timers, using microsecond
granularity to ensure that netmap_tx_irq() can be called often enough
to allow for maximum packet rate. The timer routine simply calls
netmap_tx_irq() to wake up the netmap application. The latter will
wake up and call txsync to collect TX completion updates.
This change brings back line rate speed with small packets for ixgbe.
For the time being, timer expiration is hardcoded to 90 microseconds,
in order to avoid introducing a new sysctl.
We may eventually implement an adaptive expiration period or use another
deferred work mechanism in place of timers.
Also, fix the timers usage to make sure that each queue is serviced
by a different CPU.
PR: 248652
Reported by: sg@efficientip.com
MFC after: 2 weeks
The pidx argument of isc_rxd_flush() indicates which is the last valid
receive descriptor to be used by the NIC. However, current code has
multiple issues:
- Intel drivers write pidx to their RDT register, which means that
NICs will only use the descriptors up to pidx-1 (modulo ring size N),
and won't actually use the one pointed by pidx. This does not break
reception, but it is anyway confusing and suboptimal (the NIC will
actually see only N-2 descriptors as available, rather than N-1).
Other drivers (if_vmx, if_bnxt, if_mgb) adhere to this semantic).
- The semantic used by Intel (RDT is one descriptor past the last
valid one) is used by most (if not all) NICs, and it is also used
on the TX side (also in iflib). Since iflib is not currently
using this semantic for RX, it must decrement fl->ifl_pidx
(modulo N) before calling isc_rxd_flush(), and then the
per-driver callback implementation must increment the index
again (to match the real semantic). This is confusing and suboptimal.
- The iflib refill function is also called at initialization.
However, in case the ring size is smaller than 128 (e.g. if_mgb),
the refill function will actually prepare all the receive
descriptors (N), without leaving one unused, as most of NICs assume
(e.g. to avoid RDT to overrun RDH). I can speculate that the code
looks like this right now because this issue showed up during
testing (e.g. with if_mgb), and it was easy to workaround by
decrementing pidx before isc_rxd_flush().
The goal of this change is to simplify the code (removing a bunch
of instructions from the RX fast path), and to make the semantic of
isc_rxd_flush() consistent across drivers. To achieve this, we:
- change the semantics of the pidx argument to the usual one (that
is the index one past the last valid one), so that both iflib and
drivers avoid the decrement/increment dance.
- fix the initialization code to prepare at most N-1 descriptors.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26191
At initialization time, the netmap RX refill function used to
prepare the NIC RX ring with N-1 buffers rather than N (with
N equal to the number of descriptors in the NIC RX ring).
This is not how netmap is supposed to work, as it would keep
kring->nr_hwcur not in sync with the NIC "next index to refill"
(i.e., fl->ifl_pidx). Instead we prepare N buffers, although we
still publish (with isc_rxd_flush()) only the first N-1 buffers,
to avoid the NIC producer pointer to overrun the NIC consumer
pointer (for NICs where this is a real issue, e.g. Intel ones).
MFC after: 2 weeks
The semantic of the pidx argument of isc_rxd_flush() is the
last valid index of in the free list, rather than the next
index to be published. However, netmap was still using the
old convention. While there, also refactor the netmap_fl_refill()
to simplify a little bit and add an assertion.
MFC after: 2 weeks
For drivers with IFLIB_HAS_RXCQ set, there is a separate completion
queue. In this case, the netmap rxsync routine needs to update
rxq->ifr_cq_cidx in the same way it is updated by iflib_rxeof().
This improves the situation for vmx(4) and bnxt(4) drivers, which
use iflib and have the IFLIB_HAS_RXCQ bit set.
PR: 248494
MFC after: 3 weeks
First, fix the initialization of the fl->ifl_rxd_idxs array,
which was affected by an off-by-one bug.
Once there, refactor the function to use better names for
local variables, optimize the variable assignments, and
merge the bus_dmamap_sync() inner loop with the outer one.
PR: 248494
MFC after: 3 weeks
Netmap only uses free list 0 to keep it consistent with its
one-to-one mapping between each netmap ring and a device RX
(or TX) queue.
However, the current iflib_netmap_rxsync() routine was
mistakenly updating the ifl_cidx field of both free lists.
PR: 248494
MFC after: 2 weeks