From Jake:
A user may set a sysctl to override the default number of Tx or Rx
descriptors. However, certain calculations in the iflib core expect the
number of descriptors to be a power of 2.
Update _iflib_assert to verify that all of the shared context parameters
for the number of descriptors are powers of 2.
Modify iflib_reset_qvalues to check that the provided isc_nrxd value is
a power of 2. If it's not, print a warning message and then use the
default value.
An alternative might be to try rounding the number down instead.
However, this creates problems in case the rounded down value is below
the minimum value that the driver would support.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: marius@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19880
MSI. Unlike as with iflib_fast_intr_ctx(), the former will also enqueue
_task_fn_tx() in addition to _task_fn_rx() if appropriate, bringing TCP
TX throughput of EM-class devices on par with the MSI-X case and, thus,
close to wirespeed/pre-iflib(4) times again. [1]
Note that independently of the interrupt type, the UDP performance with
these MACs still is abysmal and nowhere near to where it was before the
conversion of em(4) to iflib(4).
o In iflib_init_locked(), announce which free list failed to set up.
o In _task_fn_tx() when running netmap(4), issue ifdi_intr_enable instead
of the ifdi_tx_queue_intr_enable method in case of a "legacy" interrupt
as the latter is valid with MSI-X only.
o Instead of adding the missing - and apparently convoluted enough that a
DBG_COUNTER_INC was put into a wrong spot in _task_fn_rx() - checks for
ifdi_{r,t}x_queue_intr_enable being available in the MSI-X case also to
iflib_fast_intr_rxtx(), factor these out to iflib_device_register() and
make the checks fail gracefully rather than panic. This avoids invoking
the checks at runtime over and over again in iflib_fast_intr_rxtx() and
_task_fn_{r,t}x() - even if it's just in case of INVARIANTS - and makes
these functions more readable.
o In iflib_rx_structures_setup(), only initialize LRO resources if device
and driver have LRO capability in order to not waste memory. Also, free
the LRO resources again if setting them up fails for one of the queues.
However, don't bother invoking iflib_rx_sds_free() in that case because
iflib_rx_structures_setup() doesn't call iflib_rxsd_alloc() either (and
iflib_{device,pseudo}_register() will issue iflib_rx_sds_free() in case
of failure via iflib_rx_structures_free(), but there definitely is some
asymmetry left to be fixed, though).
o Similarly, free LRO resources again in iflib_rx_structures_free().
o In iflib_irq_set_affinity(), handle get_core_offset() errors gracefully
instead of panicing (but only in case of INVARIANTS). This is a follow-
up to r344132, as such driver bugs shouldn't be fatal.
o Likewise, handle unknown iflib_intr_type_t in iflib_irq_alloc_generic()
gracefully, too.
o Bring yet more sanity to iflib_msix_init():
- If the device doesn't provide enough MSI-X vectors or not all vectors
can be allocate so the expected number of queues in addition to admin
interrupts can't be supported, try MSI next (and then INTx) as proper
MSI-X vector distribution can't be assured in such cases. In essence,
this change brings r254008 forward to iflib(4). Also, this is the fix
alluded to in the commit message of r343934.
- If the MSI-X allocation has failed, don't prematurely announce MSI is
going to be used as the latter in fact may not be available either.
- When falling back to MSI, only release the MSI-X table resource again
if it was allocated in iflib_msix_init(), i. e. isn't supplied by the
driver, in the first place.
o In mp_ndesc_handler(), handle unknown type arguments gracefully, too.
PR: 235031 (likely) [1]
Reviewed by: shurd
Differential Revision: https://reviews.freebsd.org/D20175
- Remove the only ever written to ift_db_mtx_name member of struct iflib_txq.
- Remove the unused or only ever written to ifr_size, ifr_cq_pidx, ifr_cq_gen
and ifr_lro_enabled members of struct iflib_rxq.
- Consistently spell DMA, RX and TX uppercase in comments, messages etc.
instead of mixing with some lowercase variants.
- Consistently use if_t instead of a mix of if_t and struct ifnet pointers.
- Bring the function comments of _iflib_fl_refill(), iflib_rx_sds_free() and
iflib_fl_setup() in line with reality.
- Judging problem reports, people are wondering what on earth messages like:
"TX(0) desc avail = 1024, pidx = 0"
are trying to indicate. Thus, extend this string to be more like that of
non-iflib(4) Ethernet MAC drivers, notifying about a watchdog timeout due
to which the interface will be reset.
- Take advantage of the M_HAS_VLANTAG macro.
- Use false/true rather than FALSE/TRUE for variables of type bool.
- Use FALLTHROUGH as advocated by style(9).
It's atypical, but not invalid, for a driver to pass no capabilities.
Submitted by: Gerald Aryeetey <aryeeteygerald_rogers.com>
Reviewed by: shurd
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20142
By default, cores are now assigned to queues in a sequential
manner rather than all NICs starting at the first core. On a four-core
system with two NICs each using two queue pairs, the nic:queue -> core
mapping has changed from this:
0:0 -> 0, 0:1 -> 1
1:0 -> 0, 1:1 -> 1
To this:
0:0 -> 0, 0:1 -> 1
1:0 -> 2, 1:1 -> 3
Additionally, a device can now be configured to use separate cores for TX
and RX queues.
Two new tunables have been added, dev.X.Y.iflib.separate_txrx and
dev.X.Y.iflib.core_offset. If core_offset is set, the NIC is not part
of the auto-assigned sequence.
Reviewed by: marius
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D20029
As with mlx5en, the idea is to drop unwanted traffic as early
in receive as possible, before mbufs are allocated and anything
is passed up the stack. This can save considerable CPU time
when a machine is under a flooding style DOS attack.
The major change here is to remove the unneeded abstraction where
callers of rxd_frag_to_sd() get back a pointer to the mbuf ring, and
are responsible for NULL'ing that mbuf themselves. Now this happens
directly in rxd_frag_to_sd(), and it returns an mbuf. This allows us
to use the decision (and potentially mbuf) returned by the pfil
hooks. The driver can now recycle mbufs to avoid re-allocation when
packets are dropped.
Reviewed by: marius (shurd and erj also provided feedback)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19645
From Jake:
iflib_if_transmit returns ENOBUFS when the device is down, or when the
link isn't active.
This was changed in r308792 from return (0), so that the function
correctly reports an error that it was unable to transmit.
However, using ENOBUFS can cause some network applications to produce
the following or similar errors:
"ping: sendto: No buffer space available"
This is a bit confusing as the real cause of the issue is that the
network device is down.
Replace the ENOBUFS return with ENETDOWN to indicate more clearly that
the reason for the failure to send is due to the network device is
offline.
This will cause the error message to be reported as
"ping: sendto: Network is down"
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, sbruno@, bz@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19652
From Jake:
The iflib_device_register function takes the CTX lock before calling
IFDI_ATTACH_PRE, and releases it upon finishing the registration.
Mirror this process in iflib_pseudo_register, so that we always hold the
CTX lock during the attach process when registering a pseudo interface
or a regular interface.
This was caught by code inspection while attempting to analyze where the
CTX lock was held.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, erj@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19604
From Jake:
The iflib core never modifies the isc_driver_version string. Allow
drivers to safely assign pointers to constant buffers by marking this
parameter const.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@, jhb@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19577
From Jake:
iflib_fl_setup calculates a suitable buffer size for the Rx mbufs based
on the isc_max_frame_size value that drivers setup. This calculation is
repeated by drivers when programming their hardware with the size of
each Rx buffer.
This can lead to a mismatch where the iflib mbuf size is different from
the expected size of the buffer as programmed by the hardware. This can
lead to unexpected results.
If iflib ever wants to support mbuf sizes larger than one page, every
driver must be updated to account for the new possible buffer sizes.
Fix this by calculating the mbuf size prior to calling IFDI_INIT, and
adding the iflib_get_rx_mbuf_sz function which will expose this value to
drivers, so that they do not repeat the same calculation.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, erj@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19489
From Jake:
iflib_encap calls bus_dmamap_load_mbuf_sg. Upon it returning EFBIG, an
m_collapse and an m_defrag are attempted to shrink the mbuf cluster to
fit within the DMA segment limitations.
However, if we call m_defrag, and then bus_dmamap_load_mbuf_sg returns
EFBIG on the now defragmented mbuf, we will continuously re-call
bus_dmamap_load_mbuf_sg over and over.
This happens because m_head isn't NULL, and remap is >1, so we don't try
to m_collapse or m_defrag again. The only way we exit the loop is if
m_head is NULL. However, m_head can't be modified by the call to
bus_dmamap_load_mbuf_sg, because we don't pass it as a double pointer.
I believe this will be an incredibly rare occurrence, because it is
unlikely that bus_dmamap_load_mbuf_sg will actually fail on the second
defragment with an EFBIG error. However, it still seems like
a possibility that we should account for.
Fix the exit check to ensure that if remap is >1, we will also exit,
even if m_head is not NULL.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19468
From Jake:
"The iflib_fl_setup() function tries to pick various buffer sizes based
on the max_frame_size value defined by the parent driver. However, this
code was wrapped under CONTIGMALLOC_WORKS, which was never actually
defined anywhere.
This same code pattern was used in if_em.c, likely trying to match
what iflib uses.
Since CONTIGMALLOC_WORKS is not defined, remove this dead code from
iflib_fl_setup and if_em.c
Given that various iflib drivers appear to be using a similar
calculation, it might be worth making this buffer size a value that the
driver can peek at in the future."
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19199
So far, intr_{g,s}etaffinity(9) take a single int for identifying
a device interrupt. This approach doesn't work on all architectures
supported, as a single int isn't sufficient to globally specify a
device interrupt. In particular, with multiple interrupt controllers
in one system as found on e. g. arm and arm64 machines, an interrupt
number as returned by rman_get_start(9) may be only unique relative
to the bus and, thus, interrupt controller, a certain device hangs
off from.
In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to
the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}()
not work across architectures. Yet in turn, iflib(4) as gtaskqueue
consumer so far doesn't fit architectures where interrupt numbers
aren't globally unique.
However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as
employed by the gtaskqueue implementation to bind an interrupt to a
particular CPU, using bus_bind_intr(9) instead is equivalent from
a functional point of view, with bus_bind_intr(9) taking the device
and interrupt resource arguments required for uniquely specifying a
device interrupt.
Thus, change the gtaskqueue implementation to employ bus_bind_intr(9)
instead and intr_{g,s}etaffinity(9) to take the device and interrupt
resource arguments required respectively. This change also moves
struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps
struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL
as userland likes to include <sys/_task.h> or indirectly drags it
in - for better or worse also with _KERNEL defined -, which with
device_t and struct resource dependencies otherwise is no longer
as easily possible now.
The userland inclusion problem probably can be improved a bit by
introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the
existing _WANT_PRISON etc., which is orthogonal to this change,
though, and likely needs an exp-run.
While at it:
- Change the gt_cpu member in the grouptask structure to be of type
int as used elswhere for specifying CPUs (an int16_t may be too
narrow sooner or later),
- move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to
the gtaskqueue implementation as it's only used and needed there,
- change the GTASK_INIT macro to use "gtask" rather than "task" as
argument given that it actually operates on a struct gtask rather
than a struct task, and
- let subr_gtaskqueue.c consistently use __func__ to print functions
names.
Reported by: mmel
Reviewed by: mmel
Differential Revision: https://reviews.freebsd.org/D19139
o Correct the obvious bugs in the netmap(4) parts:
- No longer check for the existence of DMA maps as bus_dma(9)
is used unconditionally in iflib(4) since r341095.
- Supply the correct DMA tag and map pairs to bus_dma(9)
functions (see also the commit message of r343753).
- In iflib_netmap_timer_adjust(), add synchronization of the
TX descriptors before calling the ift_txd_credits_update
method as the latter evaluates the TX descriptors possibly
updated by the MAC.
- In _task_fn_tx(), wrap the netmap(4)-specific bits in
#ifdef DEV_NETMAP just as done in _task_fn_admin() and
_task_fn_rx() respectively.
o In iflib_fast_intr_rxtx(), synchronize the TX rather than
the RX descriptors before calling the ift_txd_credits_update
method (see also above).
o There's no need to synchronize an RX buffer that is going to
be recycled in iflib_rxd_pkt_get(), yet; it's sufficient to
do that as late as passing RX buffers to the MAC via the
ift_rxd_refill method. Hence, combine that synchronization
with the synchronization of new buffers into a common spot
in _iflib_fl_refill().
o There's no need to synchronize the RX descriptors of a free
list in preparation of the MAC updating their statuses with
every invocation of rxd_frag_to_sd(); it's enough to do this
once before handing control over to the MAC, i. e. before
calling ift_rxd_flush method in _iflib_fl_refill(), which
already performs the necessary synchronization.
o Given that the ift_rxd_available method evaluates the RX
descriptors which possibly have been altered by the MAC,
synchronize as appropriate beforehand. Most notably this
is now done in iflib_rxd_avail(), which in turn means that
we don't need to issue the same synchronization yet again
before calling the ift_rxd_pkt_get method in iflib_rxeof().
o In iflib_txd_db_check(), synchronize the TX descriptors
before handing them over to the MAC for transmission via
the ift_txd_flush method.
o In iflib_encap(), move the TX buffer synchronization after
the invocation of the ift_txd_encap() method. If the MAC
driver fails to encapsulate the packet and we retry with
a defragmented mbuf chain or finally fail, the cycles for
TX buffer synchronization have been wasted. Synchronizing
afterwards matches what non-iflib(4) drivers typically do
and is sufficient as the MAC will not actually start with
the transmission before - in this case - the ift_txd_flush
method is called.
Moreover, for the latter reason the synchronization of the
TX descriptors in iflib_encap() can go as it's enough to
synchronize them before passing control over to the MAC by
issuing the ift_txd_flush() method (see above).
o In iflib_txq_can_drain(), only synchronize TX descriptors
if the ift_txd_credits_update method accessing these is
actually called.
Differential Revision: https://reviews.freebsd.org/D19081
controller datasheet revision 3.3, in the context of Ethernet
MACs the control data describing the packet buffers typically
are named "descriptors". Each of these descriptors references
one buffer, multiple of which a packet can be composed of.
By contrast, in comments, messages and the names of structure
members, iflib(4) refers to DMA resources employed for RX and
TX buffers (rather than control data) as "desc(riptors)".
This odd naming convention of iflib(4) made reviewing r343085
and identifying wrong and missing bus_dmamap_sync(9) calls in
particular way harder than it already is. This convention may
also explain why the netmap(4) part of iflib(4) pairs the DMA
tags for control data with DMA maps of buffers and vice versa
in calls to bus_dma(9) functions.
Therefore, change iflib(4) to refer to buf(fers) when buffers
and not the usual understanding of descriptors is meant. This
change does not include corrections to the DMA resources used
in the netmap(4) parts. However, it revises error messages to
state which kind of allocation/creation failed. Specifically,
the "Unable to allocate tx_buffer (map) memory" copy & pasted
inappropriately on several occasions was replaced with proper
messages.
o Enhance some other error messages to indicate which half - RX
or TX - they apply to instead of using identical text in both
cases and generally canonicalize them.
o Correct the descriptions of iflib_{r,t}xsd_alloc() to reflect
reality; current code doesn't use {r,t}x_buffer structures.
o In iflib_queues_alloc():
- Remove redundant BUS_DMA_NOWAIT of iflib_dma_alloc() calls,
- change the M_WAITOK from malloc(9) calls into M_NOWAIT. The
return values are already checked, deferred DMA allocations
not being an option at this point, BUS_DMA_NOWAIT has to be
used anyway and prior malloc(9) calls in this function also
specify M_NOWAIT.
Reviewed by: shurd
Differential Revision: https://reviews.freebsd.org/D19067
bus_teardown_intr(9) before pci_release_msi(9).
- Ensure that iflib(4) and associated drivers pass correct RIDs to
bus_release_resource(9) by obtaining the RIDs via rman_get_rid(9)
on the corresponding resources instead of using the RIDs initially
passed to bus_alloc_resource_any(9) as the latter function may
change those RIDs. Solely em(4) for the ioport resource (but not
others) and bnxt(4) were using the correct RIDs by caching the ones
returned by bus_alloc_resource_any(9).
- Change the logic of iflib_msix_init() around to only map the MSI-X
BAR if MSI-X is actually supported, i. e. pci_msix_count(9) returns
> 0. Otherwise the "Unable to map MSIX table " message triggers for
devices that simply don't support MSI-X and the user may think that
something is wrong while in fact everything works as expected.
- Put some (mostly redundant) debug messages emitted by iflib(4)
and em(4) during attachment under bootverbose. The non-verbose
output of em(4) seen during attachment now is close to the one
prior to the conversion to iflib(4).
- Replace various variants of spelling "MSI-X" (several in messages)
with "MSI-X" as used in the PCI specifications.
- Remove some trailing whitespace from messages emitted by iflib(4)
and change them to consistently start with uppercase.
- Remove some obsolete comments about releasing interrupts from
drivers and correct a few others.
Reviewed by: erj, Jacob Keller, shurd
Differential Revision: https://reviews.freebsd.org/D18980
corresponding bitmap before adding an mbuf has actually succeeded.
Previously, m_gethdr(M_NOWAIT, ...) failing caused a "hole" in the
RX ring but not in its bitmap. One implication of such a hole was
that in a subsequent call to _iflib_fl_refill() with the RX buffer
accounting still indicating another reclaimable buffer, bit_ffc(3)
nevertheless returned -1 in frag_idx which in turn caused havoc
when used as an index. Thus, additionally assert that frag_idx is
0 or greater.
Another possible consequence of a hole in the RX ring was a NULL-
dereference when trying to use the unallocated mbuf, for example
in iflib_rxd_pkt_get().
While at it, make the variable declarations in _iflib_fl_refill()
conform to style(9) and remove redundant checks already performed
by bit_ffc{,_at}(3).
- In iflib_queues_alloc(), don't pass redundant M_ZERO to bit_alloc(3).
Reported and tested by: pho
The new loop to sync and unload descriptors was indexed
by "i", rather than "j". The panic was caused by "i"
being advanced rather than "j", and eventually becoming
out of bounds.
Reviewed by: kib
MFC after: 3 days
Sponsored by: Netflix
Also, expose IFLIB_MAX_RX_SEGS to iflib drivers and add
iflib_dma_alloc_align() to the iflib API.
Performance is generally better with the tunable/sysctl
dev.vmx.<index>.iflib.tx_abdicate=1.
Reviewed by: shurd
MFC after: 1 week
Relnotes: yes
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D18761
- Specify BUS_DMA_NOWAIT for bus_dmamap_load() on rx refill, since
callbacks are not supposed to be used.
- Match tso/non-tso tags to corresponding tx map operations. Create
separate tso maps for tx descriptors. In particular, do not use
non-tso tag to load, unload, or destroy a map created with tso tag.
- Add missed bus_dmamap_sync() calls.
Submitted by: marius.
Reported and tested by: pho
Reviewed by: marius
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
iflib_init_locked() assumes that iflib_stop() has been called, however,
it is not called for suspend. iflib_if_init_locked() calls stop then init,
so fixes the problem.
This was causing errors after a resume from suspend.
PR: 224059
Reported by: zeising
MFC after: 1 week
Sponsored by: Limelight Networks
r336560 was supposed to restore pre-r323954 behaviour when tx_abdicate is
not set (the default case). However, it appears that rather than the drainage
check being made conditional on tx_abdicate being set, it was duplicated
so it occured twice if tx_abdicate was set and once if it was not.
Now when !tx_abdicate, drainage is only checked if the doorbell isn't
pending.
Reported by: lev
MFC after: 1 week
Sponsored by: Limelight Networks
- Remove the complex mechanism to choose between using busdma
and raw pmap_kextract at runtime. The reduced complexity makes
the code easier to read and maintain.
- Fix a bug in the small packet receive path where clusters were
repeatedly mapped but never unmapped. We now store the cluster's
bus address and avoid re-mapping the cluster each time a small
packet is received.
This patch fixes bugs I've seen where ixl(4) will not even
respond to ping without seeing DMAR faults.
I see a small improvement (14%) on packet forwarding tests using
a Haswell based Xeon E5-2697 v3. Olivier sees a small
regression (-3% to -6%) with lower end hardware.
Reviewed by: mmacy
Not objected to by: sbruno
MFC after: 8 weeks
Sponsored by: Netflix, Inc
Differential Revision: https://reviews.freebsd.org/D17901
iflib_stop() was not resetting the rxq completion queue state variables.
This meant that for any driver that has receive completion queues, after a
reinit, iflib would start asking what's available on the rx side starting at
whatever the completion queue index was prior to the stop, instead of at 0.
Submitted by: pkelsey
Reported by: pkelsey
MFC after: 3 days
Sponsored by: Limelight Networks
Ensure that any time CSUM_IP_TSO or CSUM_IP6_TSO is set that the corresponding
CSUM_IP6?_TCP / CSUM_IP flags are also set.
Rather than requireing drivers to bake-in an understanding that TSO implies
checksum offloads, make it explicit.
This change requires us to move the IFLIB_NEED_ZERO_CSUM implementation to
ensure it's zeroed for TSO.
Reported by: Jacob Keller <jacob.e.keller@intel.com>
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17801
r333502 removed initialization of ifc_nhwtxqs, and it's not clear
there's a need to copy it into the struct iflib_ctx at all. Use
ctx->ifc_sctx->isc_ntxqs instead.
Further, iflib_stop() did not clear the last ring in the case where
isc_nfl != isc_nrxqs (such as when IFLIB_HAS_RXCQ is set). Use
ctx->ifc_sctx->isc_nrxqs here instead of isc_nfl.
Reported by: pkelsey
Reviewed by: pkelsey
MFC after: 3 days
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17979
r338838 attempted to fix issues with rxcsum and rxcsum6.
However, the rxcsum bits were set as though if_setcapenablebit() was
being called, not if_togglecapenable() which is in use. As a result,
it was not possible to disable rxcsum when rxcsum6 was supported.
PR: 233004
Reported by: lev
Reviewed by: lev
MFC after: 3 days
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17881
That commit is causing kernel panics in em(4), so this will be reverted
until those are fixed.
Reported by: ae@, pho@, et al
Sponsored by: Intel Corporation
The taskqgroup_detach function does not check if task is already enqueued when
detaching it. This may lead to kernel panic if enqueued task starts after
context state lock is destroyed. Ensure that the already enqueued admin tasks
are executed before detaching them.
The issue was discovered during validation of D16429. Unloading of if_ixlv
followed by immediate removal of VFs with iovctl -D may lead to panic on
NODEBUG kernel.
As well, check if iflib is in detach before enqueueing new admin or iov
tasks, to prevent new tasks from executing while the taskqgroup tasks
are being drained.
Submitted by: Krzysztof Galazka <krzysztof.galazka@intel.com>
Reviewed by: shurd@, erj@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D17404
Finishes the conversion of the 40Gb Intel Ethernet drivers to iflib(9) for
FreeBSD 12.0, and fixes numerous bugs in both ixl(4) and iavf(4).
This commit also re-adds the VF driver to GENERIC since it now compiles and
functions.
The VF driver name was changed from ixlv(4) to iavf(4) because the VF driver is
now intended to be used with future products, not just with Fortville/Fort Park
VFs.
A man page update that documents these drivers is forthcoming in a separate
commit.
Reviewed by: sbruno@, kbowling@
Tested by: jeffrey.e.pieper@intel.com
Approved by: re (gjb@)
Relnotes: yes
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D16429
Various capabilities were not being handled correctly in the
SIOCSIFCAP handler. Specifically:
IFCAP_RXCSUM and IFCAP_RXCSUM_IPV6 could be set even if not supported
It was impossible to disable IFCAP_RXCSUM and/or IFCAP_RXCSUM_IPV6 via
ifconfig since it does ioctl() per command-line flag rather than combine
them into a single call.
IFCAP_VLAN_HWCSUM could not be modified via the ioctl()
Setting any combination of the three IFCAP_WOL flags would set only
IFCAP_WOL_MCAST | IFCAP_WOL_MAGIC. For example, setting only
IFCAP_WOL_UCAST would result in both IFCAP_WOL_MCAST and IFCAP_WOL_MAGIC
being enabled, but IFCAP_WOL_UCAST would not be enabled.
Because if_vlancap() was called before if_togglecapenable(), vlan flags
were sometimes not applied correctly.
Interfaces were being unnecessarily stopped and restarted for WoL
PR: 231151
Submitted by: Kaho Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>
Reported by: Shirkdog <mshirk@daemon-security.com>
Reviewed by: galladin
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17158
Remove sysctls:
txq_drain_encapfail - now a duplicate of encap_txd_encap_fail
intr_link - was never incremented
intr_msix - was never incremented
rx_zero_len - was never incremented
The following were not incremented in all code-paths that apply:
m_pullups, mbuf_defrag, rxd_flush, tx_encap, rx_intr_enables, tx_frees,
encap_txd_encap_fail.
Fixes:
Replace the broken collapse_pkthdr() implementation with an MPASS().
fl_refills and fl_refills_large were not incremented when using netmap.
Reviewed by: gallatin
Approved by: re (marius)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16733
The MP ring may have txq pointers enqueued. Previously, these were
passed to m_free() when IFC_QFLUSH was set. This patch checks for
the value and doesn't call m_free().
Reviewed by: gallatin
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16882
Since r336313, TSO support for LEM-class devices is removed again as it
was before the conversion of {l,}em(4) to iflib(4) in r311849 and as a
result, isc_tx_tso_segments_max is 0 for LEM-class devices now. Thus,
inappropriate watermarks were used for this class.
This is really only a band-aid, though, because so far iflib(9) doesn't
fully take into account that DMA engines can support different maxima
of segments for transfers of TSO and non-TSO packets. For example, the
DESC_RECLAIMABLE macro is based on isc_tx_nsegments while MAX_TX_DESC
used isc_tx_tso_segments_max only. For most in-tree consumers that
doesn't make a difference as the maxima are the same for both kinds of
transfers (that is, apart from the fact that TSO may require up to 2
sentinel descriptors but also not with every MAC supported). However,
isc_tx_nsegments is 8 but isc_tx_tso_segments_max is 85 by default
with ixl(4).
tests for avail > 0, avail can never be 0 within that loop. Thus, move
decrementing avail and budget_left into the loop and before the code which
checks for additional descriptors having become available in case all the
previous ones have been processed but there still is budget left so the
latter code works as expected. [1]
- In iflib_{busdma_load_mbuf_sg,parse_header}(), remove dead stores to m
and n respectively. [2, 3]
- In collapse_pkthdr(), ensure that m_next isn't NULL before dereferencing
it. [4]
- Remove a duplicate assignment of segs in iflib_encap().
Reported by: Coverity
CID: 1356027 [1], 1356047 [2], 1368205 [3], 1356028 [4]
r323954 changed the mp ring behaviour when 64-bit atomics were
available to abdicate the TX ring rather than having one become a
consumer thereby running to completion on TX. The consumer of the mp
ring was then triggered in the tx task rather than blocking the TX call.
While this significantly lowered the number of RX drops in small-packet
forwarding, it also negatively impacts TX performance.
With this change, the default behaviour is reverted, causing one TX ring
to become a consumer during the enqueue call. A new sysctl,
dev.X.Y.iflib.tx_abdicate is added to control this behaviour.
Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16302
Use the timer to poll for TX completions when there are
outstanding TX slots. Track when the last driver timer was called
to prevent overcalling it. Also clean up some kring vs NIC ring
usage.
Reviewed by: marius, Johannes Lundberg <johalun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16300
- Ever since the workaround for the silicon bug of TSO4 causing MAC hangs
was committed in r295133, CSUM_TSO always got disabled unconditionally
by em(4) on the first invocation of em_init_locked(). However, even with
that problem fixed, it turned out that for at least e. g. 82579 not all
necessary TSO workarounds are in place, still causing MAC hangs even at
Gigabit speed. Thus, for stable/11, TSO usage was deliberately disabled
in r323292 (r323293 for stable/10) for the EM-class by default, allowing
users to turn it on if it happens to work with their particular EM MAC
in a Gigabit-only environment.
In head, the TSO workaround for speeds other than Gigabit was lost with
the conversion to iflib(9) in r311849 (possibly along with another one
or two TSO workarounds). Yet at the same time, for EM-class MACs TSO4
got enabled by default again, causing device hangs. Therefore, change the
default for this hardware class back to have TSO4 off, allowing users
to turn it on manually if it happens to work in their environment as
we do in stable/{10,11}. An alternative would be to add a whitelist of
EM-class devices where TSO4 actually is reliable with the workarounds in
place, but given that the advantage of TSO at Gigabit speed is rather
limited - especially with the overhead of these workarounds -, that's
really not worth it. [1]
This change includes the addition of an isc_capabilities to struct
if_softc_ctx so iflib(9) can also handle interface capabilities that
shouldn't be enabled by default which is used to handle the default-off
capabilities of e1000 as suggested by shurd@ and moving their handling
from em_setup_interface() to em_if_attach_pre() accordingly.
- Although 82543 support TSO4 in theory, the former lem(4) didn't have
support for TSO4, presumably because TSO4 is even more broken in the
LEM-class of MACs than the later EM ones. Still, TSO4 for LEM-class
devices was enabled as part of the conversion to iflib(9) in r311849,
causing device hangs. So revert back to the pre-r311849 behavior of
not supporting TSO4 for LEM-class at all, which includes not creating
a TSO DMA tag in iflib(9) for devices not having IFCAP_TSO4 set. [2]
- In fact, the FreeBSD TCP stack can handle a TSO size of IP_MAXPACKET
(65535) rather than FREEBSD_TSO_SIZE_MAX (65518). However, the TSO
DMA must have a maxsize of the maximum TSO size plus the size of a
VLAN header for software VLAN tagging. The iflib(9) converted em(4),
thus, first correctly sets scctx->isc_tx_tso_size_max to EM_TSO_SIZE
in em_if_attach_pre(), but later on overrides it with IP_MAXPACKET
in em_setup_interface() (apparently, left-over from pre-iflib(9)
times). So remove the later and correct iflib(9) to correctly cap
the maximum TSO size reported to the stack at IP_MAXPACKET. While at
it, let iflib(9) use if_sethwtsomax*().
This change includes the addition of isc_tso_max{seg,}size DMA engine
constraints for the TSO DMA tag to struct if_shared_ctx and letting
iflib_txsd_alloc() automatically adjust the maxsize of that tag in case
IFCAP_VLAN_MTU is supported as requested by shurd@.
- Move the if_setifheaderlen(9) call for adjusting the maximum Ethernet
header length from {ixgbe,ixl,ixlv,ixv,em}_setup_interface() to iflib(9)
so adjustment is automatically done in case IFCAP_VLAN_MTU is supported.
As a consequence, this adjustment now is also done in case of bnxt(4)
which missed it previously.
- Move the reduction of the maximum TSO segment count reported to the
stack by the number of m_pullup(9) calls (which in the worst case,
can add another mbuf and, thus, the requirement for another DMA
segment each) in the transmit path for performance reasons from
em_setup_interface() to iflib_txsd_alloc() as these pull-ups are now
done in iflib_parse_header() rather than in the no longer existing
em_xmit(). Moreover, this optimization applies to all drivers using
iflib(9) and not just em(4); all in-tree iflib(9) consumers still
have enough room to handle full size TSO packets. Also, reduce the
adjustment to the maximum number of m_pullup(9)'s now performed in
iflib_parse_header().
- Prior to the conversion of em(4)/igb(4)/lem(4) and ixl(4) to iflib(9)
in r311849 and r335338 respectively, these drivers didn't enable
IFCAP_VLAN_HWFILTER by default due to VLAN events not being passed
through by lagg(4). With iflib(9), IFCAP_VLAN_HWFILTER was turned on
by default but also lagg(4) was fixed in that regard in r203548. So
just remove the now redundant and defunct IFCAP_VLAN_HWFILTER handling
in {em,ixl,ixlv}_setup_interface().
- Nuke other redundant IFCAP_* setting in {em,ixl,ixlv}_setup_interface()
which is (more completely) already done in {em,ixl,ixlv}_if_attach_pre()
now.
- Remove some redundant/dead setting of scctx->isc_tx_csum_flags in
em_if_attach_pre().
- Remove some IFCAP_* duplicated either directly or indirectly (e. g.
via IFCAP_HWCSUM) in {EM,IGB,IXL}_CAPS.
- Don't bother to fiddle with IFCAP_HWSTATS in ixgbe(4)/ixgbev(4) as
iflib(9) adds that capability unconditionally.
- Remove some unused macros from em(4).
- Bump __FreeBSD_version as some of the above changes require the modules
of drivers using iflib(9) to be recompiled.
Okayed by: sbruno@ at 201806 DevSummit Transport Working Group [1]
Reviewed by: sbruno (earlier version), erj
PR: 219428 (part of; comment #10) [1], 220997 (part of; comment #3) [2]
Differential Revision: https://reviews.freebsd.org/D15720