Even though 64-bit atomics are supported on i386 there are panics
indicating that the code does not work correctly there. Switch
to mutex based variant (and fix that while we're here).
Reported by: pho, kib
There are risks associated with waiting on a preemptible epoch section.
Change the name to make them not be the default and document the issue
under CAVEATS.
Reported by: markj
adds:
- epoch_enter_critical() - can be called inside a different epoch,
starts a section that will acquire any MTX_DEF mutexes or do
anything that might sleep.
- epoch_exit_critical() - corresponding exit call
- epoch_wait_critical() - wait variant that is guaranteed that any
threads in a section are running.
- epoch_global_critical - an epoch_wait_critical safe epoch instance
Requested by: markj
Approved by: sbruno
When poll() is called via netmap, txsync is initially called,
and if there are no available buffers to reclaim, it waits for the driver
to notify of new buffers. Since the TX IRQ is generally not used in iflib
drivers, this ends up causing a timeout.
Work around this by having the reclaim DELAY(1) if it's initially unable
to reclaim anything, then schedule the tx task, which will spin by
continuously rescheduling itself until some buffers are reclaimed. In
general, the delay is enough to allow some buffers to be reclaimed, so
spinning is minimized.
Reported by: Johannes Lundberg <johalun0@gmail.com>
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15455
Use the new epoch based reclamation API. Now the hot paths will not
block at all, and the sx lock is used for the softc data. This fixes LORs
reported where the rwlock was obtained when the sxlock was held.
Submitted by: mmacy
Reported by: Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15355
Part 3 of many ...
The VPC framework relies heavily on cloning pseudo interfaces
(vmnics, vpc switch, vcpswitch port, hostif, vxlan if, etc).
This pulls in that piece. Some ancillary changes get pulled
in as a side effect.
Reviewed by: shurd@
Approved by: sbruno@
Sponsored by: Joyent, Inc.
Differential Revision: https://reviews.freebsd.org/D15347
It is guaranteed that if_ipsec(4) interface is used only for tunnel
mode IPsec, i.e. decrypted and decapsultaed packet has its own IP header.
Thus we can consider it as new packet and clear the protocols flags.
This allows ICMP/ICMPv6 properly handle errors that may cause this packet.
PR: 228108
MFC after: 1 week
if_bridge has a lot of limitations that make it scale poorly to higher data
rates. In my projects/VPC branch I leverage the bridge interface between
layers for my high speed soft switch as well as for purposes of stacking
in general.
Reviewed by: sbruno@
Approved by: sbruno@
Differential Revision: https://reviews.freebsd.org/D15344
Make if_printf() use vlog() instead of vprintf(). This means it can no
longer return the number of characters printed, as it used to, but every
single call to if_printf() in the entire kernel ignores the return value
anyway; just return 0 so we don't have to change the prototype.
Consistently use if_printf() throughout sys/net/if.c, instead of a
mixture of if_printf() and log().
In ifa_maintain_loopback_route(), don't needlessly log an error if we
either failed to add a route because it already existed or failed to
remove one because it did not. We still return an error code, though.
MFC after: 1 week
Print a message when iflib_tx_structures_setup fails, like we do for
iflib_rx_structures_setup.
Now that we always print a message from within
iflib_qset_structures_setup when it fails, stop printing one in
iflib_device_register() at the call site.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15300
The canonical check for whether or not a ring is drainable is
TXQ_AVAIL() > MAX_TX_DESC() + 2. Use this same construct here,
in order to avoid a potential off-by-one error where we might otherwise
fail to request an interrupt.
Reviewed by: mmacy
Sponsored by: Netflix
to sleep on commands to the NIC when updating multicast filters. More generally this permitted
driver's to use an sx as a softc lock. Unfortunately this change introduced a race whereby a
a multicast update would still be queued for deletion when ifconfig deleted the interface
thus calling down in to _purgemaddrs and synchronously deleting _all_ of the multicast addresses
on the interface.
Synchronously remove all external references to a multicast address before enqueueing for delete.
Reported by: lwhsu
Approved by: sbruno
em(4) and igb(4) were tested by me, and ixgbe(4) and bnxt(4) were
tested by sbruno.
Reviewed by: mmacy, shurd
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15262
This is a component of a system which lets the kernel dump core to
a remote host after a panic, rather than to a local storage device.
The server component is available in the ports tree. netdump is
particularly useful on diskless systems.
The netdump(4) man page contains some details describing the protocol.
Support for configuring netdump will be added to dumpon(8) in a future
commit. To use netdump, the kernel must have been compiled with the
NETDUMP option.
The initial revision of netdump was written by Darrell Anderson and
was integrated into Sandvine's OS, from which this version was derived.
Reviewed by: bdrewery, cem (earlier versions), julian, sbruno
MFC after: 1 month
X-MFC note: use a spare field in struct ifnet
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15253
In r301567, code was added to cleanup to prevent memory leaks for the
Tx and Rx ring structs. This code carefully tracked txq and rxq, and
made sure to free them properly during cleanup.
Because we assigned the txq and rxq pointers into the ctx->ifc_txqs and
ctx->ifc_rxqs, we carefully reset these pointers to NULL, so that
cleanup code would not accidentally free the memory twice.
This was changed by r304021 ("Update iflib to support more NIC designs"),
which removed this resetting of the pointers to NULL, because it re-used
the txq and rxq pointers as an index into the queue set array.
Unfortunately, the cleanup code was left alone. Thus, if we fail to
allocate DMA or fail to configure the queues using the drivers ifdi
methods, we will attempt to free txq and rxq. These variables would now
incorrectly point to the wrong location, resulting in a page fault.
There are a number of methods to correct this, but ultimately the root
cause was that we reuse the txq and rxq pointers for two different
purposes.
Instead, when allocating, store the returned pointer directly into
ctx->ifc_txqs and ctx->ifc_rxqs. Then, assign this to txq and rxq as
index pointers before starting the loop to allocate each queue.
Drop the cleanup code for txq and rxq, and only use ctx->ifc_txqs and
ctx->ifc_rxqs.
Thus, we no longer need to free txq or rxq under any error flow, and
intsead rely solely on the pointers stored in ctx->ifc_txqs and
ctx->ifc_rxqs. This prevents the invalid free(), and ensures that we
still properly cleanup after ourselves as before when failing to
allocate.
Submitted by: Jacob Keller
Reviewed by: gallatin, sbruno
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15285
This pointer was no longer written to as of r315217. Since nothing writes
to the variable, remove it.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: gallatin, kmacy, sbruno
Differential Revision: https://reviews.freebsd.org/D15284
Since the move to SMP NIC driver locking has had to go through serious
contortions using mtx around long running hardware operations. This moves
iflib past that.
Individual drivers may now sleep when appropriate.
Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14983
Multicast incorrectly calls in to drivers with a mutex held causing drivers
to have to go through all manner of contortions to use a non sleepable lock.
Serialize multicast updates instead.
Submitted by: mmacy <mmacy@mattmacy.io>
Reviewed by: shurd, sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14969
1) Don't give up if m_collapse() fails. Rather than giving up, try
m_defrag() immediately.
2) Fix a leak where, if the NIC driver rejected the defrag'ed chain
as having too many segments, we would fail to free the chain.
Reviewed by: Matthew Macy <mmacy@mattmacy.io> (this version of patch)
Submitted by: Matthew Macy <mmacy@mattmacy.io> (early version of leak fix)
When the PCP is changed for either a VLAN network interface or when
prio tagging is enabled for a regular ethernet network interface,
broadcast the IFNET_EVENT_PCP event so applications like ibcore can
update its GID tables accordingly.
MFC after: 3 days
Reviewed by: ae, kib
Differential Revision: https://reviews.freebsd.org/D15040
Sponsored by: Mellanox Technologies
We use transformation rather than accessors as virtually ever driver
implements SIOCGIFMEDIA and all would have to be touched.
Keep the code readable by always performing copies and (possiably no-op)
transforms.
Reviewed by: jhb, kib
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14996
This fixes media display for 802.11 wireless devices.
Software outside the base system that uses these media types and
defines should use #ifdef IFM_FDDI or IFM_TOKEN to include or remove
support.
Reported by: zeising
Reviewed by: emaste, kib, zeising
Tested by: zeising
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15170
during ifnet detach.
Since destroying interface is not atomic operation and due to the
lack of synhronization during destroy, it is possible, that in the
time between bpfdetach() and if_free() some queued on destroying
interface mbuf will be used by ether_input_internal() and
bpf_peers_present() can dereference NULL bpf_if pointer. To protect
from this, assign pointer to empty bpf_if_ext structure instead of
NULL pointer after bpfdetach().
Reviewed by: melifaro, eugen
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D15083
Previously, if there are no threads, all queues which targeted
cores that share an L2 cache were bound to a single core. The intent is
to distribute them across these cores.
Reported by: olivier
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15120
While Arcnet has some continued deployment in industrial controls, the
lack of drivers for any of the PCI, USB, or PCIe NICs on the market
suggests such users aren't running FreeBSD.
Evidence in the PR database suggests that the cm(4) driver (our sole
Arcnet NIC) was broken in 5.0 and has not worked since.
PR: 182297
Reviewed by: jhibbits, vangyzen
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15057
Add one extra lock initialization to iflib_register() that was missed
in the git<->phab conversion.
Split out flag manipulation from general context manipulation in iflib
To avoid blocking on the context lock in the swi thread and risk potential
deadlocks, this change protects lighter weight updates that only need to
be consistent with each other with their own lock.
Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14967
Changelist:
- Turn tx_rings and rx_rings arrays into arrays of pointers to kring
structs. This patch includes fixes for ixv, ixl, ix, re, cxgbe, iflib,
vtnet and ptnet drivers to cope with the change.
- Generalize the nm_config() callback to accept a struct containing many
parameters.
- Introduce NKR_FAKERING to support buffers sharing (used for netmap
pipes)
- Improved API for external VALE modules.
- Various bug fixes and improvements to the netmap memory allocator,
including support for externally (userspace) allocated memory.
- Refactoring of netmap pipes: now linked rings share the same netmap
buffers, with a separate set of kring pointers (rhead, rcur, rtail).
Buffer swapping does not need to happen anymore.
- Large refactoring of the control API towards an extensible solution;
the goal is to allow the addition of more commands and extension of
existing ones (with new options) without the need of hacks or the
risk of running out of configuration space.
A new NIOCCTRL ioctl has been added to handle all the requests of the
new control API, which cover all the functionalities so far supported.
The netmap API bumps from 11 to 12 with this patch. Full backward
compatibility is provided for the old control command (NIOCREGIF), by
means of a new netmap_legacy module. Many parts of the old netmap.h
header has now been moved to netmap_legacy.h (included by netmap.h).
Approved by: hrs (mentor)
Also, since ifc_nhwrxqs is only used in one place, remove it from the struct.
This was preventing iflib_dma_free() from being called via
iflib_device_detach().
Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks
Defines in net/if_media.h remain in case code copied from ifconfig is in
use elsewere (supporting non-existant media type is harmless).
Reviewed by: kib, jhb
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15017
To avoid blocking on the context lock in the swi thread and risk potential
deadlocks, this change protects lighter weight updates that only need to
be consistent with each other with their own lock.
Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14967
This allows NIC drivers to sleep on polling config operations.
Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14982
Changelist:
- remove unused nkr_slot_flags
- new nm_intr adapter callback to enable/disable interrupts
- remove unused sysctls and document the other sysctls
- new infrastructure to support NS_MOREFRAG for NIC ports
- support for external memory allocator (for now linux-only),
including linux-specific changes in common headers
- optimizations within netmap pipes datapath
- improvements on VALE control API
- new nm_parse() helper function in netmap_user.h
- various bug fixes and code clean up
Approved by: hrs (mentor)
Portable programs that use SIOCGIFCONF (e.g. traceroute) assume
that each pseudo ifreq is of length MAX(sizeof(struct ifreq),
sizeof(ifr_name) + ifr_addr.sa_len). For short sockaddrs we copied
too much from the source sockaddr resulting in a heap leak.
I believe only one such sockaddr exists (struct sockaddr_sco which
is 8 bytes) and it is unclear if such sockaddrs end up on interfaces
in practice. If it did, the result would be an 8 byte heap leak on
current architectures.
admbugs: 869
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 3 days
Security: kernel heap leak
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14981
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
These ioctls can process a number of items at a time, which puts us at
risk of overflow in mallocarray() and of impossibly large allocations
even if we don't overflow.
Limit the allocation to required size (or the user allocation, if that's
smaller). That does mean we need to do the allocation with the rules
lock held (so the number doesn't change while we're doing this), so it
can't M_WAITOK.
MFC after: 1 week
Use an accessor to access ifgr_group and ifgr_groups.
Use an macro CASE_IOC_IFGROUPREQ(cmd) in place of case statements such
as "case SIOCAIFGROUP:". This avoids poluting the switch statements
with large numbers of #ifdefs.
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14960
The previous split of zeroing ifr_name and ifr_addr seperately is safe
on current architectures, but would be unsafe if pointers were larger
than 8 bytes. Combining the zeroing adds no real cost (a few
instructions) and makes the security property easier to verify.
Reviewed by: kib, emaste
Obtained from: CheriBSD
MFC after: 3 days
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14912
The change upgrades the driver to use the split Communication Status
Block (CSB) format. In this way the variables written by the guest
and read by the host are allocated in a different cacheline than
the variables written by the host and read by the guest; this is
needed to avoid cache thrashing.
Approved by: hrs (mentor)
- The two types must be type-punnable for shared members of ifr_ifru.
This allows compatibility accessors to be shared.
- There must be no padding gap between ifr_name and ifr_ifru. This is
assumed in tcpdump's use of SIOCGIFFLAGS output which attempts to be
broadly portable. This is true for all current architectures, but very
large (256-bit) fat-pointers could violate this invariant.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14910
This fixes 32-bit compat (no ioctl command defintions are required
as struct ifreq is the same size). This is believed to be sufficent to
fully support ifconfig on 32-bit systems.
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 week
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14900
The original implementation used a reference to ifr_data and a cast to
do the equivalent of accessing ifr_addr. This was copied multiple
times since 1996.
Approved by: kib
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14873
Make all kernel accesses to ifru_buffer go via access functions
which take the process ABI into account and use an appropriate union
to access members in the correct place in struct ifreq.
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14846
According to 802.1Q-2014, VLAN tagged packets with VLAN id 0 should be
considered as untagged, and only PCP and DEI values from the VLAN tag
are meaningful. See for instance
https://www.cisco.com/c/en/us/td/docs/switches/connectedgrid/cg-switch-sw-master/software/configuration/guide/vlan0/b_vlan_0.html.
Make it possible to specify PCP value for outgoing packets on an
ethernet interface. When PCP is supplied, the tag is appended, VLAN
id set to 0, and PCP is filled by the supplied value. The code to do
VLAN tag encapsulation is refactored from the if_vlan.c and moved into
if_ethersubr.c.
Drivers might have issues with filtering VID 0 packets on
receive. This bug should be fixed for each driver.
Reviewed by: ae (previous version), hselasky, melifaro
Sponsored by: Mellanox Technologies
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D14702
If one has added fields to struct mbuf such that MHLEN is smaller than
this threshold (128), iflib_rxd_pkt_get() may otherwise overrun the
internal mbuf buffer while copying.
Reviewed by: mmacy
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14843
Forwarded packets passed through PFIL_OUT, which made it difficult for
firewalls to figure out if they were forwarding or producing packets. This in
turn is an issue for pf for IPv6 fragment handling: it needs to call
ip6_output() or ip6_forward() to handle the fragments. Figuring out which was
difficult (and until now, incorrect).
Having pfil distinguish the two removes an ugly piece of code from pf.
Introduce a new variant of the netpfil callbacks with a flags variable, which
has PFIL_FWD set for forwarded packets. This allows pf to reliably work out if
a packet is forwarded.
Reviewed by: ae, kevans
Differential Revision: https://reviews.freebsd.org/D13715
Currently each bfp descriptor uses u64 variables to maintain its counters.
On interfaces with high packet rate this leads to unnecessary contention
and inaccurate reporting.
PR: kern/205320
Reported by: elofu17 at hotmail.com
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D14726
Current arp/nd code relies on the feedback from the datapath indicating
that the entry is still used. This mechanism is incorporated into the
arpresolve()/nd6_resolve() routines. After the inpcb route cache
introduction, the packet path for the locally-originated packets changed,
passing cached lle pointer to the ether_output() directly. This resulted
in the arp/ndp entry expire each time exactly after the configured max_age
interval. During the small window between the ARP/NDP request and reply
from the router, most of the packets got lost.
Fix this behaviour by plugging datapath notification code to the packet
path used by route cache. Unify the notification code by using single
inlined function with the per-AF callbacks.
Reported by: sthaug at nethelp.no
Reviewed by: ae
MFC after: 2 weeks
If the user configures a states_hashsize or source_nodes_hashsize value we may
not have enough memory to allocate this. This used to lock up pf, because these
allocations used M_WAITOK.
Cope with this by attempting the allocation with M_NOWAIT and falling back to
the default sizes (with M_WAITOK) if these fail.
PR: 209475
Submitted by: Fehmi Noyan Isi <fnoyanisi AT yahoo.com>
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D14367
Only require a gateway to be specified on a route add request. On
a route change request that does not specify the gateway, the
gateway will remain the same. This allows changing other route
parameters without having to re-specifying the gateway, like in
"route change 10.0.0.0/8 -mtu 9000".
Update the route(8) manpage to explicitly call out this usage
as being supported.
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Reviewed By: eugen (rtsock.c change), rgrimes
Differential Revision: https://reviews.freebsd.org/D14291
Dmamap is created only on IFC attach. If we remove it on
buffer release, we won't be able to do ifconfig down&up. Only destroy
when in detach.
Reported by: wma
Reviewed by: wma
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14060
Sometimes 32 bit and 64 bit ioctls are represented by the same number.
It causes unnecessary switch to 32 bit commpatible mode.
This patch prevents switching when we are dealing with 64 bit executable.
It fixes issue mentioned here
Authored by: Patryk Duda <pdk@semihalf.com>
Submitted by: Wojciech Macek <wma@semihalf.com>
Reviewed by: andrew, wma
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14023
Added CTLFLAG_VNET to net.link.lagg.lacp.default_strict_mode which was missed
in r290450.
Reported by: julian@
MFC after: 1 week
Sponsored by: Multiplay
Increment the route table generation count after modifying a
route. This signals back to TCP connections that they need to
update their L2 caches as the gateway for their route may have
changed. This is a heavier hammer than is needed, strictly
speaking, but route changes will be unlikely enough that the
performance effects of invalidating all connection route caches
should be negligible.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13990
Reviewed by: karels
Add a new macro to clear both the L3 and L2 route caches, to
hopefully prevent future instances where only the L3 cache was
cleared when both should have been.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13989
Reviewed by: karels
When the inpcb route cache is invalidated after a change to the
routing tables, we need to invalidate the LLE cache as well.
Previous to this change packets for the connection would continue
to use the old L2 information from the old L3 gateway, and the
packets for the connection would likely be blackholed.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13988
Reviewed by: karels
Route messages are aligned to the host long type alignment, which
breaks 32bit.
Reported and tested by: lwhsu
Diagnosed by: Yuri Pankov <yuripv@icloud.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
X-Differential revision: https://reviews.freebsd.org/D13837
Disabled the use of RSS hash from the network card aka flowid for
lagg(4) interfaces by default as it's currently incompatible with
the lacp and loadbalance protocols.
The incompatibility is due to the fact that the flowid isn't know
for the first packet of a new outbound stream which can result in
the hash calculation method changing and hence a stream being
incorrectly split across multiple interfaces during normal
operation.
This can be re-enabled by setting the following in loader.conf:
net.link.lagg.default_use_flowid="1"
Discussed with: kmacy
Sponsored by: Multiplay
As everywhere else, we want to pass rman_get_start(irq->ii_res). This
caused set affinity errors when not using MSI-X vectors (legacy and MSI
interrupts).
Reported by: sbruno
Sponsored by: Limelight Networks
gtask->gt_taskqueue is NULL when EARLY_AP_STARTUP is not enabled.
Remove assertion to allow this config to work.
Reported by: oleg
Sponsored by: Limelight Networks
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.
Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
Sometimes caller is only interested in how many clones
are there and NULL pointer is passed for the destination
buffer. Do not pass it to copyout then.
if_clone_attach function will drop the reference on failure which will
free the if_clone structure. No need to do it second time.
Reviewed by: glebius, ae
Differential Revision: https://reviews.freebsd.org/D10386
It seems that tcp_lro_rx() doesn't verify TCP checksums, so
if there are bad checksums in the packets caused by invalid data, the
invalid data will pass through without errors.
This was noticed with the igb driver and a specific internet host:
fetch http://www.mpfr.org/mpfr-current/mpfr-3.1.6.tar.xz -o test.bin && sha256 test.bin
Would result in a different value sometimes.
This ends up making LRO require RXCSUM to be enabled, and RXCSUM to
support TCP and UDP checksums.
PR: 224346
Reported by: gjb
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13561
This will attempt to use a different thread/core on the same L2
cache when possible, or use the same cpu as the rx thread when not.
If SMP isn't enabled, don't go looking for cores to use. This is mostly
useful when using shared TX/RX queues.
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12446
Email address has changed, uses consistent name (Matthew, not Matt)
Reported by: Matthew Macy <mmacy@mattmacy.io>
Differential Revision: https://reviews.freebsd.org/D13537
vxlan_ftable entries are sorted in ascending order, due to wrong arguments
order it is possible to stop search before existing element will be found.
Then new element will be allocated in vxlan_ftable_update_locked() and can
be inserted in the list second time or trigger MPASS() assertion with
enabled INVARIANTS.
PR: 224371
MFC after: 1 week
If a route is modified in a way that changes the route's source
address (i.e. the address used to access the gateway), then a
reference on the ifaddr representing the old source address will
be leaked if the address type does not have an ifa_rtrequest
method defined. Plug the leak by releasing the reference in
all cases.
Differential Revision: https://reviews.freebsd.org/D13417
Reviewed by: ae
MFC after: 3 weeks
Sponsored by: Dell
Previously, the counter was only incremented when m_append() failed. Since
the function can also fail on m_dup() now, increment the counter there as
well.
Sponsored by: Limelight Networks
Fix memory leak where mbuf chain wasn't free()d if iflib_ether_pad()
has a failure in m_dup().
Reported by: "Ryan Stone" <rysto32@gmail.com>
Sponsored by: Limelight Networks
If ethernet padding is enabled, and a read-only mbuf is passed,
it would modify the mbuf using m_append(). Instead, call m_dup() and
append to the new packet.
Reported by: Pyun YongHyeon
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13414
Some bnxt devices do not correctly send frames smaller than
52 bytes (without CRC), so add a quirk that will pad frames to an
arbitrary size before passing off to the encap routine.
Reported by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13269
The LRO possible test was calling CURVNET_SET once for IPv4 or IPv6 for
each packet in a chain. Only call it once per chain instead.
Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: cem, ae
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13368
- Updates tables in affected files with new entries from newer spec
revisions of SFF-8472, SFF-8024, and SFF-8636
- Change ifconfig to read and display the extended compliance code for
SFP media if the extended compliance code is not 0. This was being displayed
for QSFP transceivers only, but SFP28 media uses this to report 25G
capability.
Reviewed by: melifaro, sbruno
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D13286
SIOCGIFXMEDIA is required for extended ethernet media types,
but iflib did not support it.
Reported by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13312
used inside "if" statements comparing with another value.
Detailed explanation:
"if (a ? b : c != 0)" is not the same like "if ((a ? b : c) != 0)"
which is the expected behaviour of a function macro.
Affects:
toecore, linuxkpi and ibcore.
Reviewed by: kib
MFC after: 3 days
Sponsored by: Mellanox Technologies
If a device didn't support MSI-X, ctx->ifc_cpus would not be initialized,
but the IRQ allocation routines still uses the value. Move the
initialization to common code.
Sponsored by: Limelight Networks
type to any value. This can cause page faults and panics due to accessing
uninitialized fields in the "struct ifnet" which are specific to the network
device type.
MFC after: 1 week
Found by: jau@iki.fi
PR: 223767
Sponsored by: Mellanox Technologies
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
No functional change intended.
bit_nclear() takes the bit numbers for the start and end bits, not the start
and a count. This was resulting in memory corruption past the end of the
bitstr_t.
Sponsored by: Limelight Networks
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Initially, only tag files that use BSD 4-Clause "Original" license.
RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133
The intent appears to be having one RX/TX queue set per core,
but since scctx->isc_n[tr]xqsets is set to max before calling
iflib_msix_init(), both end up being set to total number of cores.
Use ctx->ifc_sysctl_n[rt]xqs as the selected value and
scctx->isc_n[rt]xqsets as the max. This should result in what appears
to be the intended behaviour
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D13096
boot for the received packets.
The rcv_tstmp field overlaps the place of Ln header length indicators,
not used by received packets. The basic pkthdr rearrangement change
in sys/mbuf.h was provided by gallatin.
There are two accompanying M_ flags: M_TSTMP means that there is the
timestamp (and it was generated by hardware).
Another flag M_TSTMP_HPREC indicates that the timestamp is
high-precision. Practically M_TSTMP_HPREC means that hardware
provided additional precision comparing with the stamps when the flag
is not set. E.g., for ConnectX all packets are stamped by hardware
when PCIe transaction to write out the completion descriptor is
performed, but PTP packet are stamped on port. For Intel cards, when
PTP assist is enabled, only PTP packets are stamped in the limited
number of registers, so if Intel cards ever start support this
mechanism, they would always set M_TSTMP | M_TSTMP_HPREC if hardware
timestamp is present for the given packet.
Add IFCAP_HWRXTSTMP interface capability to indicate the support for
hardware rx timestamping, and ifconfig(8) command to toggle it.
Based on the patch by: gallatin
Reviewed by: gallatin (previous version), hselasky
Sponsored by: Mellanox Technologies
MFC after: 2 weeks (? mbuf KBI issue)
X-Differential revision: https://reviews.freebsd.org/D12638
Preserve packet order between tcp_lro_rx() and if_input() to avoid
creating extra corner cases. If no packets can be LROed, combine them
into one chain for submission via if_input(). If any packet can
potentially be LROed however, retain old behaviour and call if_input()
for each packet.
This should keep the 12% improvement for small packet forwarding intact,
but mostly avoids impacting the LRO case.
Reviewed by: cem, sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12876
even if kernel routing table already has a route to the address in question
installed by some routing daemon (PR 223129).
Also, allow loopback route deletion when stopping a VIMAGE jail (PR 222647).
PR: 222647, 223129
Reviewed by: gnn
Approved by: avg (mentor), mav (mentor)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12747
The VNET_SYSUNINIT() callback is executed after the MOD_UNLOAD. That means
that netisr_unregister() has already been called when
netisr_unregister_vnet() gets calls, leading to an assertion failure.
Restore the expected order of operations by performing everything that
was done in MOD_UNLOAD to a SYSUNINIT() (that will be called after the
VNET_SYSUNINIT()).
Differential Revision: https://reviews.freebsd.org/D12771
ifl_pidx and ifl_credits are going out of sync in _iflib_fl_refill() as they
use different update log. Use the same update logic for both, and add a
final call to isc_rxd_refill() to handle early exits from the loop.
PR: 221990
Reported by: pho
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12798
iru_init() was declared and used outside the DEV_NETMAP
conditional blocks, but was implemented inside one. Move the
implementation out of the DEV_NETMAP block to allow building with
netmap disabled.
Reported by: Andrew Turner <andrew@fubar.geek.nz>
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12842
Fix error when refilling netmap buffers that resulted in the first
buffer of the successive passes through ifl_bus_addrs[] leaving the
first value unset (tmp_pidx started at 1, not zero after the first time
through the loop).
Leave the one unused buffer required by some NICs visible in the netmap
ring rather than hidden. There will always be a buffer in use by the
kernel now when an iflib driver is used via netmap.
Always get the netmap slot index via netmap_idx_n2k() to account for
nkr_hwofs in a consistent way.
Split shared functionality into new functions.
iru_init(): shared by _iflib_fl_refill() and netmap_fl_refill()
netmap_fl_refill(): shared by iflib_netmap_rxsync() and
iflib_netmap_rxq_init()
PR: 222744
Reported by: Shirkdog <mshirk@daemon-security.com>
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12769
It was reported on the community call that with
hw.pci.enable_msix=0, iflib would enable MSI-X on the device and attempt
to use it, which caused issues. Test the sysctl explicitly and do not
enable MSI-X if it's disabled globally.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12805
1. prefetch 128 bytes of mbufs.
2. Re-order filling the pkt_info so cache stalls happen at the end
3. Define empty prefetch2cachelines() macro when the function isn't present.
Provides small performance improvments on some hardware
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12447
HEAD. Enable VIMAGE in GENERIC kernels and some others (where GENERIC does
not exist) on HEAD.
Disable building LINT-VIMAGE with VIMAGE being default.
This should give it a lot more exposure in the run-up to 12 to help
us evaluate whether to keep it on by default or not.
We are also hoping to get better performance testing.
The feature can be disabled using nooptions.
Requested by: many
Reviewed by: kristof, emaste, hiren
X-MFC after: never
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D12639
Allocate smallest unit number from pool via ifc_alloc_unit_next()
and exact unit number (if available) via ifc_alloc_unit_specific().
While here, address possible deadlock (mentioned in PR).
PR: 217401
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D12551
Improved logging added in r323879 exposed an error during
attach. We need the irq, not the rid to work correctly. em uses
shared irqs, so it will use the same irq for TX as RX. bnxt does
not use shared irqs, or TX irqs at all, so there's no need to set
the TX irq affinity.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12496
GPUs: radeonkms, i915kms
NICs: if_em, if_igb, if_bnxt
This metadata isn't used yet, but it will be handy to have later to
implement automatic module loading.
Reviewed by: imp, mmacy
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12488
Move TX out of the enqueue() path. As a result, we need
to have ifmp_ring_check_drainage() pick up from the abdicate state.
We also need to either enqueue the TX task, or check drainage
after calling ifmp_ring_enqueue() to ensure it's sent.
This change results in a 30% small packet forwarding improvement.
Reviewed by: olivier, sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12439
This allows tuning the rx budget for special load profiles
as well as more easily testing to determine sane defaults.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12445
Build a list of mbufs to pass to if_input() after LRO. Results in
12% small packet forwarding rate improvement.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12444
If the packet is smaller than MTU, disable the TSO flags.
Move TCP header parsing inside the IS_TSO?() test.
Add a new IFLIB_NEED_ZERO_CSUM flag to indicate the checksums need to be zeroed before TX.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12442
This ensures that the loader will not load the module if it's also built in to
the kernel.
PR: 220860
Submitted by: Eugene Grosbein <eugen@freebsd.org>
Reported by: Marie Helene Kvello-Aune <marieheleneka@gmail.com>
RXQ setup for netmap was broken because netmap_rxq_init was getting called
before IFDI_INIT - thus we ended up with ring tail pointer being reset to zero.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12140
indicating whether the page's wire count transitioned to zero. Use that
return value in zbuf_page_free() rather than checking the wire count.
MFC after: 1 week
This was really too big of a commit even if everything worked, but there
are multiple new issues introduced in the one huge commit, so it's not
worth keeping this until it's fixed.
I'll work on splitting this up into logical chunks and introduce them one
at a time over the next week or two.
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks