Fix two KASSERTs to catch the condition they are intended to,
add two asserts to ensure that the appropriate locking is in
place and fix some things related to style.
No functional change intended.
MFC after: 1 week
Sponsored by: Netflix, Inc.
In case of a failure of tcp_newtcpcb, where NULL is returned,
* call CC_ALGO(tp)->cb_destroy, after CC_ALGO(tp)->cb_init was called.
* call khelp_destroy_osd(), after khelp_init_osd() was called.
Reviewed by: glebius, rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45753
If syncache_socket() fails after calling tcp_newtcpcb(), the resources
allocated in tcp_newtcpcb() needs to be freed. Just call
tcp_discardcb() to do this.
Thanks to jtl for making me aware of the issue and proposing a fix.
Reviewed by: glebius, jtl, rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45749
Change 483fe9651 embedded struct inpcb into struct udpcb and updated the
intoudpcb macro to use __containerof to locate it. This change accidentally
introduced a dependency on the identifier inp being defined in the block the
macro is expanded in. This should have been the macro argument ip. This change
makes this simple correction.
No functional change intended.
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Use the v4/v6 union members rather than the uint32_t ones.
Export IN_ARE_MASKED_ADDR_EQUAL() in in_var.h and use it (and its IPv6
equivalent) for masked comparisons rather than hand-rolled code.
Event: Kitchener-Waterloo Hackathon 202406
Before this patch, a stack (tfb) accepts a tcpcb (tp), if the
tp->t_state is TCPS_CLOSED or tfb->tfb_tcp_handoff_ok is not NULL
and tfb->tfb_tcp_handoff_ok(tp) returns 0.
After this patch, the only check is tfb->tfb_tcp_handoff_ok(tp)
returns 0. tfb->tfb_tcp_handoff_ok must always be provided.
For existing TCP stacks (FreeBSD, RACK and BBR) there is no
functional change. However, the logic is simpler.
Reviewed by: lstewart, peter_lei_ieee_.org, rrs
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45253
pr_attach is only called on a socket (so) with so->so_listen != NULL
via sonewconn. However, sonewconn is not called from the TCP code.
The listening sockets are handled in tcp_syncache.c without using
sonewconn. Therefore, the code removed is never executed.
No functional change intended.
Reviewed by: rrs, peter.lei_ieee.org
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45412
When we first find an inp, we set also the tp. If then a second
lookup is necessary, the inp is recomputed. If this fails, the
tp is not cleared, which resulted in failing KASSERT.
Therefore, clear the tp when staring the inp lookup procedure.
Reported by: Jenkins
Fixes: 02d15215ce ("tcp: improve blackhole support")
MFC after: 1 week
Sponsored by: Netflix, Inc.
There are two improvements to the TCP blackhole support:
(1) If net.inet.tcp.blackhole is set to 2, also sent no RST whenever
a segment is received on an existing closed socket or if there is
a port mismatch when using UDP encapsulation.
(2) If net.inet.tcp.blackhole is set to 3, no RST segment is sent in
response to incoming segments on closed sockets or in response to
unexpected segments on listening sockets.
Thanks to gallatin@ for suggesting such an improvement.
Reviewed by: gallatin
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45304
Ensure that the inp is not dropped when starting a stack switch.
While there, clean-up the code by using INP_WLOCK_RECHECK, which
also re-assigns tp.
Reviewed by: glebius
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45241
With the commit of D44903 we no longer need the SAD option. Instead all stacks that
use the sack filter inherit its protection against sack-attack.
Reviewed by: tuexen@
Differential Revision:https://reviews.freebsd.org/D45216
The !DXR2 code corresponds to the original DXR encoding proposal from
2012 with a single direct-lookup stage, which is inferior to the more
recent (DXR2) variant with two-stage trie both in terms of memory
footprint of the lookup structures, and in terms of overall lookup
througput.
I'm axing the old code chunks to (hopefully) somewhat improve readability,
as well as to simplify future maintenance and updates.
MFC after: 1 week
DXR lookup table encoding has an inherent structural limit on the amount
of binary search ranges it can accomodate. With the current IPv4 BGP views
(circa 1 M prefixes) and default DXR encoding we are only at around 5% of
that limit, so far, far away from hitting it. Just in case it ever gets
hit, make sure we free the allocated structures, instead of leaking it.
MFC after: 1 week
When calling dxr_init(), the FIB_ALGO infrastructure may provide a
pointer to a previous dxr instance, which permits reuse of auxiliary
dxr structures, i.e. incremental lookup structure updates. For dxr this
is a crucial feature provided by FIB_ALGO, since dxr incremental updates
are typically several orders of magnitude faster than full lookup table
rebuilds.
However, the auxiliary dxr structure caches a pointer to struct fib_data and
relies upon it for performing incremental updates. Apparently, incremental
rebuild requests from FIB_ALGO, i.e. a calls to dxr_init() with a pointer
old_data set, may (under not yet fully understood circumstances) be invoked
within a different fib_data context than the one cached in the previous
version of dxr auxiliary structures. In such (rare) events, we ignore the
offered old dxr context, and proceed with a full lookup structure rebuild
instead of attempting an incremental one using a fib_data context which
may or may not no longer be valid, and thus lead to a system crash.
PR: 278422
MFC after: 1 week
When the RACK stack wants to send a FIN, but still has outstanding
or unsent data, it sends a challenge ack. Don't do this when the
TCP endpoint is still in the front states, since it does not
make sense.
Reviewed by: rrs
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45122
Right now, the code in process_chunk_drop() does not look the
the corresponding fields.
Therefore, no functional change intended.
Reported by: Coverity Scan
CID: 1472476
MFC after: 3 days
In this case uio is NULL, which needs to be checked and m must
be copied into the sctp_copy_all structure.
Reported by: Coverity Scan
CID: 1400449
MFC after: 3 days
Add a counter to track how frequently SACK has transmitted
more than one MSS using TSO. Instances when this will be
beneficial is the use of PRR, or when ACK thinning due to
GRO/LRO or ACK discards by the network are present.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D45070
Introduce net.inet.tcp.sack.tso for future use when TSO is ready
to be used during loss recovery.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D45068
While the SACK Scoreboard in the base stack limits
the number of holes by default to only 128 per connection
in order to prevent CPU load attacks by splitting SACKs,
filtering out SACK blocks of unusually small size can
further improve the actual processing of SACK loss recovery.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D45075
There is only one functional change here - we don't allow SIOCSVH (or
netlink request) to change sc->sc_version. I'm convinced that allowing
such a change doesn't brings any practical value, but creates enless
minefields in front of both developers and end users (sysadmins). If
you want to switch from VRRP to CARP or vice versa, you'd need to recreate
the VHID.
Oh, one tiny funtional change: carp_ioctl_set() won't modify any fields
if it returns EINVAL. Previously you could provide valid advbase with
invalid advskew - that used to modify advbase and return EINVAL.
All other changes is a sweep around not ever using CARP fields when
we are in VRRP mode and vice versa. Also adding assertions on sc_version
where necessary.
Do not send VRRP vars in CARP mode via NetLink and vice versa. However
in compat ioctl SIOCGVH for VRRP mode the CARP fields would be zeroes.
This allows to declare softc as union and thus prevent any future logic
deterioration wrt to mixing VRRP and CARP.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D45039
- Separate HMAC preparation (CARP specific) from tagging.
- In unicast mode (CARP specific) don't put tag at all.
- Don't put pointer to software context into the tag. Putting just vhid,
an integer value, is a safer design.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D45038
Provide inline send_ad_locked() that switches between protocol
specific sending function.
Rename carp_send_ad() to carp_callout() to avoid getting lost in
all these multiple foo_send_ad.
No functional change intended.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D45036
Allow carp(4) to use the VRRPv3 protocol (RFC 5798). We can distinguish carp and
VRRP based on the protocol version number (carp is 2, VRRPv3 is 3), and support
both from the carp(4) code.
Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D44774
Replace local PAD macro with PADTCPOLEN macro
No functional change.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D45076
Previously it was possible for dxr_build() to return with da->fd
unset in case of range_tbl or x_tbl malloc() failures. This
may have led to NULL ptr dereferencing in dxr_change_rib_batch().
MFC after: 1 week
PR: 278422
There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like:
ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)
This first sack looks fine but then the attacker sends
ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...
These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries.
To combat this we introduce three things.
when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing.
Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks.
The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway.
The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks).
After this set of changes is in rack can drop its SAD detection completely
Reviewed by:tuexen@, rscheff@
Differential Revision: <https://reviews.freebsd.org/D44903>
In the error path during allocating an in_pcb, the credentials
associated with the new struct get their reference count
increased early on, but not decremented when the allocation
fails.
Reported by: cmiller_netapp.com
MFC after: 3 days
Reviewed by: jhb, tuexen
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D45033
This is used by 802.3 Ethernet. (Also be used by 802.4 Token Bus and
802.5 Token Ring, but we don't support those.)
This was accidentally removed along with FDDI support in commit
0437c8e3b1, presumably because comments implied it was used only by
FDDI or Token Ring.
Fixes: 0437c8e3b1 ("Remove support for FDDI networks.")
Reviewed-by: emaste
Signed-off-by: Denny Page <dennypage@me.com>
Pull-request: https://github.com/freebsd/freebsd-src/pull/1166
Fix a typo, which resulted in missing r_ctl.gate_to_fs in the BBLog
event.
Reported by: Coverity Scan
CID: 1540024
Reviewed by: rrs, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44648
RFC 9293 describes the handling of data in the CLOSE-WAIT, CLOSING,
LAST-ACK, and TIME-WAIT states:
This should not occur since a FIN has been received from the remote
side. Ignore the segment text.
Therefore, implement this handling.
Reviewed by: rrs, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44746
struct tcpcb embeds a struct osd and a struct callout. Rather than
forcing all consumers to pull in the same headers, include the headers
directly.
No functional change intended.
Reviewed by: glebius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44685
When debugging network issues one common clue is an unexpectedly
incrementing error counter. This is helpful, in that it gives us an
idea of what might be going wrong, but often these counters may be
incremented in different functions.
Add a static probe point for them so that we can use dtrace to get
futher information (e.g. a stack trace).
For example:
dtrace -n 'mib:ip:count: { printf("%d", arg0); stack(); }'
This can be disabled by setting the following kernel option:
options KDTRACE_NO_MIB_SDT
Reviewed by: gallatin, tuexen (previous version), gnn (previous version)
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D43504
Also log, when dropping text or FIN after having received a FIN.
This is the intended behavior described in RFC 9293.
A follow-up patch will enforce this behavior for the base stack
and the RACK stack.
Reviewed by: rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44669
When in rack_output() jumping to the label out, don't write errno into
the log buffer, since the pointer is not initialized.
Reported by: Coverity Scan
CID: 1523773
Reviewed by: rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44647
In rack_output(), idle is used as a boolean variable. So don't use it
as an int and don't clear it afterwards.
This avoids setting idle to false, when it is not intended.
Reported by: olivier
Reviewed by: rrs, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44610
Ensure that tv.tv_sec is zero in all code paths.
Reported by: Coverity Scan
CID: 1527724
Reviewed by: rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44584
t_state is an unsigned variable, so no need for testing that it is
non-negative.
Reported by: Coverity Scan
CID: 1390885
Reviewed by: glebius
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44619
Make the comment consistent with the code.
Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44611
The target_slot argument of max_slots_available() can be NULL.
Therefore, check for this in all places.
Right now, all callers provide non-NULL pointer.
Reported by: Coverity Scan
CID: 1527732
Reviewed by: rrs
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44527
Before a protocol specific control block started to embed inpcb in self
(see 0aa120d52f, e68b379244, 483fe96511) this pointer used to point
at it.
Retain kf_sock_inpcb field in the struct kinfo_file in <sys/user.h>. The
exp-run detected a minimal use of the field in ports:
* sysutils/lsof - patched upstream
* net-mgmt/netdata - patch accepted upstream
* emulators/qemu-user-static - upstream master branch seems not using
the field anymore
We can keep the field around for some time, but eventually it may be
reused for something else.
PR: 277659 (exp-run)
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D44491
HPTS inserts a softclock for system call return that optimizes performance. However when
no HPTS threads need the help (i.e. when they have less than 100 or so connections) then
there should be little work done i.e. check the counter and return instead of running through
all the threads getting locks etc.ptimize HPTS so that little work is done until we have a hpts
thread that is over the connection threshold.
Reported by: eduardo
Reviewed by: gallatin, glebius, tuexen
Tested by: gallatin
Differential Revision: https://reviews.freebsd.org/D44420
The length of tldl_reason is TCP_LOG_REASON_LEN, not TCP_LOG_ID_LEN.
No functional change intended.
Reported by: Coverity Scan
CID: 1418074
CID: 1418276
Reviewed by: glebius, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44510
Most of them can be declared as static after the move out of in6_proto.c.
Keeping sysctl(9) declarations with their text descriptions next to the
variable declaration create self-documenting code. There should be no
functional changes.
Differential Revision: https://reviews.freebsd.org/D44481
Instead of fixing up invalid values set by a user in badport_bandlim()
which is a fast path function, provide a sysctl handler
sysctl_icmplim_and_jitter(), that will check that jitter is less than the
limit.
Provide jitter initilization function icmplim_new_jitter() used at boot,
in the sysctl handler and when we actually hit the limit. This also fixes
no jitter on a fresh booted system until first limit hit.
Instead of CVE number provide link the the actual paper that explains what
and why we are doing here. The CVE number isn't very informative, it will
just tell you what RedHat version you need to upgrade to.
Reviewed by: kp, tuexen, zlei
Differential Revision: https://reviews.freebsd.org/D44478
The limiting of the very last second has been done using certain jitter
value. We update the jitter for the next second. But the logging should
report the jitter before the change.
Reviewed by: kp, tuexen, zlei
Differential Revision: https://reviews.freebsd.org/D44477
The uninitialization may be executed only on a kernel with VIMAGE.
Reviewed by: kp, tuexen, zlei
Differential Revision: https://reviews.freebsd.org/D44476
We need per-VNET struct counter_rate, but we don't need per-VNET set of
const char *. Also, identical word "response" can go into the format
string instead of being stored 7 times.
Reviewed by: kp, zlei, tuexen
Differential Revision: https://reviews.freebsd.org/D44475
Ensure that there is no data on SYN segments unless doing TFO.
This check is already in RACK and BBR.
Reported by: glebius
Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44384
This socket option can be used by in-kernel consumers (like NFS) to
request a NIC to use optimized receive of large buffers for a
connection. The current use case is to support DDP by the TOE on
Chelsio NICs.
Reviewed by: rscheff, tuexen, glebius
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44000
Don't report a BACKUP CARP address as local. These two functions are used
only by source address validation for input packets, controlled by sysctls
net.inet.ip.source_address_validation and
net.inet6.ip6.source_address_validation. For this purpose we definitely
want to treat BACKUP addresses as non local.
This change is conservative and doesn't modify compat in_localip() and
in6_localip(). They are used more widely than the FIB-aware versions.
The change would modify the notion of ipfw(4) 'me' keyword. There might
be other consequences as in_localip() is used by various tunneling
protocols.
PR: 277349
When a TCP callout decides to disable self, e.g. tcp_timer_2msl() calling
tcp_close(), we must also clear all other possible timers. Otherwise,
upon return, the callout would be scheduled again in tcp_timer_enter().
Revert 57e27ff07a, which was a temporary partial revert of otherwise
correct 62d47d73b7, that exposed the problem being fixed now. Add an
extra assertion in tcp_timer_enter() to check we aren't arming callout for
a closed connection.
Reviewed by: rscheff
The macro is more obfuscating than helping as it just checks a single flag
of t_flags. All other t_flags bits are checked without a macro.
A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option. It is a bad practice to create such definitions in
installable headers.
Reviewed by: rscheff, tuexen, kib
Differential Revision: https://reviews.freebsd.org/D44362
These KPIs were added in dd0e6c383a and through 15 years had zero use.
They slightly remind what IfAPI does for struct ifnet. But IfAPI does
that for the sake of large collection of NIC drivers not being aware of
struct ifnet. For the sockets it is unclear what could be a large
collection of externally written kernel modules that need extensively use
sockets and not be aware of their internals at the same time. This
isolation of a structure knowledge requires a lot of work, and just
throwing in a few KPIs isn't helpful.
Reviewed by: kib, olce, markj
Differential Revision: https://reviews.freebsd.org/D44311
These KPIs were added in 9d29c635da and through 15 years had zero use.
They slightly remind what IfAPI does for struct ifnet. But IfAPI does
that for the sake of large collection of NIC drivers not being aware of
struct ifnet. For the inpcb it is unclear what could be a large
collection of externally written kernel modules that need extensively use
inpcb and not be aware of its internals at the same time. This isolation
of a structure knowledge requires a lot of work, and just throwing in a
few KPIs isn't helpful.
Reviewed by: kib, bz, markj
Differential Revision: https://reviews.freebsd.org/D44310
and drop the definition for userspace (which matched TCP_RFC7413) since
it depends on presence of the kernel option.
Reviewed by: glebius, rscheff
Sponsored by: NVIDIA networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D44349
Doing a deep copy of the keys early allows users of the
tls_enable structure to assume kernel memory.
This enables the socket options to be set by kernel threads.
Reviewed By: #transport, tuexen, jhb, rrs
Sponsored by: NetApp, Inc.
X-NetApp-PR: #79
Differential Revision: https://reviews.freebsd.org/D44250
This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.
It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.
With a bit of a struggle with git I finally got rack_pcm.c into place (apologies
for not noticing this error). The LINT kernel is running on my box now .. sigh.
Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986
This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.
Note there is a new file that I can't figure out how to get in rack_pcm.c
It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.
Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986
When doing mbuf queueing, the packet filter hooks in ether_demux(),
ip_input(), and ip6_input() are by-passed. This means that the packet
filters don't process incoming packets, which might result in
connection failures. For example bypassing the TCP sequence number
validation will result in dropping valid packets.
Please note that this patch is only disabling mbuf queueing, not LRO.
Reported by: Herbert J. Skuhra
Reviewed by: glebius, rrs, rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43769
Visibility into the contents of the buffer when a write(2) has failed
can be immensely useful in debugging IPC issues -- pushing this to
discuss the idea, or maybe an alternative where we can set a flag like
KTRFAC_ERRIO to enable it.
When a genio event is potentially raised after an error, currently we'll
just free the uio and return. However, such data can be useful when
debugging communication between processes to, e.g., understand what the
remote side should have grabbed before closing a pipe. Tap out the
entire buffer on failure rather than simply discarding it.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D43799
Ok lets fix up the tcp_in_hpts() so that it also says yes if you
are in the race state moving and you are scheduled to be put in.
This also requires changing the MPASS to be the old version non
inline function of tcp_in_hpts().
This change also adds a new inline macro so that a uint64_t timestamp can be
obtained by a transport (aka Rack will use this).
Reviewed by: glebius, tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D44157
This is a migitation to avoid sudden extreme jumps in
cwnd, as t_epoch can be very out of date after an RTO.
Per RFC9438, sec 4.8, t_epoch is to be reset whenever
cwnd grows beyond ssthresh (CC phase transitions from
slow start to congestion avoidance), to be fixed with
the upcoming cc_cubic changes.
MFC after: 3 days
Reviewed By: cc, #transport
Sponsored by: NetApp, Inc
Differential Revision: https://reviews.freebsd.org/D44023
Ensure that snd_fack holds a valid value when doing
the post_recovery CC processing, for preparation of
the cc_cubic update, so that local pipe calculations
can correctly refer to snd_fack during and after CC events.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43957
Facilitate easier troubleshooting by enumerating
all congestion control signals. Typecast the
enum to int, when a congestion control module uses
private signals.
No external change.
Reviewed By: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43838
Make sure the divident is at least one. While cwnd should
never be smaller than t_maxseg, this can happen during
Path MTU Discovery, or when TCP options are considered
in other parts of the stack.
PR: 276674
MFC after: 3 days
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43797
The FreeBSD TCP base stack handles them also the same way.
In case of packet filters dropping packets in the output path,
this avoids retranmitting the dropped packet every 10ms or so.
Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43773
That was lost in transition from one-for-all soshutdown() to protocol
specific methods. Only protocols that listen(2) were affected. This is
not a documented or specified feature, but some software relies on it. At
least the FreeSWITCH telephony software uses this behavior on
PF_INET/SOCK_STREAM.
Fixes: 5bba272807
Follow up on D43768 to properly deal with the non-default
pipe calculation. When CC_RTO is processed, the timeout
will have already pulled back snd_nxt. Further, snd_fack
is not pulled along with snd_una.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43876
Stop timers when in tcp_close() instead of doing that in tcp_discardcb().
A connection in CLOSED state shall not need any timers. Assert that no
timer is rescheduled after that in tcp_timer_activate() and verfiy that
this is also the expected state in tcp_discardcb().
PR: 276761
Reviewed By: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43792
When sending a RST control segment in tcp_output() it
means we are in TCPS_CLOSED state, called from tcp_drop().
Once the RST is sent, don't call tcp_timer_activate() or
update anything in tcpcb, since that will go away shortly.
PR: 276761
Provided by: glebius
Reviewed By: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43808
The SACK scoreboard is conceptually an extention of the socket
buffer. Remove it when the socket buffer goes away with
soisdisconnected(). Verify that this is also the expected
state in tcp_discardcb().
PR: 276761
Reviewed by: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43805
The implicit assumption of snd_nxt always being larger than
snd_recover is not true after RTO. In that case, cwnd
would get inflated to ssthresh, which may be much larger
than the current pipe (data in flight).
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43653
per RFC5681, only adjust ssthresh on the initital
retransmission timeout. Since RTO often happens
during loss recovery, while cwnd no longer tracks
all data in flight, calculcate pipe properly.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43768
tcp_fixed_maxseg() is the streamlined calculation of typical
tcp options and more suitable for heavy use in the congestion
control modules on every received packet.
No external functional change.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43779
This is a preparation for adding dtrace hooks in a follow-up commit,
which are missing in the code path, where packets are directly queued
to the tcpcb. The dtrace hooks expect the fields to be in host byte
order. This only applies when TCP HPTS is used.
No functional change intended.
Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43594
pfil hooks (i.e. firewalls) may pass, modify or free the mbuf passed
to them. (E.g. when rejecting a packet, or when gathering up packets
for reassembly).
If the hook returns PFIL_PASS the mbuf must still be present. Assert
this in pfil_mem_common() and ensure that ipfilter follows this
convention. pf and ipfw already did.
Similarly, if the hook returns PFIL_DROPPED or PFIL_CONSUMED the mbuf
must have been freed (or now be owned by the firewall for further
processing, like packet scheduling or reassembly).
This allows us to remove a few extraneous NULL checks.
Suggested by: tuexen
Reviewed by: tuexen, zlei
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D43617
The RFC6675 pipe calculation (sack.revised, enabled
by default since D28702), uses outdated information,
while the previous default calculated it correctly
with up-to-date information from the incoming ACK.
This difference can become as large as the receive
window (not the congestion window previously),
potentially triggering a massive burst of new packets.
MFC after: 1 week
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43520
Use SEQ_SUB instead of a plain subtraction, for an implict
type conversion and prevention of a possible overflow.
Use curly brackets in stacked if statements throughout.
Use of the ? operator to enhance readability when clearing
the FIN flag in tcp_output().
None of the above change the function.
Reviewed By: tuexen, cc, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43539
Shifting bits is quicker than checking header flag bits
one by one. Also improve readability by the use of switch
statements.
No change in behaviour.
Reviewed By: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43560
Account for SACK retransmitted bytes once the actual length
is known. This prevents a call to tcp_maxseg() and prepares
for TSO support when transmitting from the SACK scoreboard.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43447
Improve slowpath processing (reordering, retransmissions)
slightly by calculating maxseg only once. This typically
saves one of two calls to tcp_maxseg().
Reviewed By: glebius, tuexen, cc, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43536
This debugging code has been lingering for years with
no known use.
No functional change.
Reviewed by: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43511
This patch provides UDP encapsulation of ESP packets over IPv6.
Ports the IPv4 code to IPv6 and adds support for IPv6 in udpencap.c
As required by the RFC and unlike in IPv4 encapsulation,
UDP checksums are calculated.
Co-authored-by: Aurelien Cazuc <aurelien.cazuc.external@stormshield.eu>
Sponsored-by: Stormshield
Sponsored-by: Wiktel
Sponsored-by: Klara, Inc.
With removal of dom_dispose method the function boils down to two
meaningful function calls: socantrcvmore() and sbrelease(). The latter is
only relevant for protocols that use generic socket buffers.
The socket I/O sx(9) lock acquisition in sorflush() is not relevant for
shutdown(2) operation as it doesn't do any I/O that may interleave with
read(2) or write(2). The socket buffer mutex acquisition inside
sbrelease() is what guarantees thread safety. This sx(9) acquisition in
soshutdown() can be tracked down to 4.4BSD times, where it used to be
sblock(), and it was carried over through the years evolving together with
sockets with no reconsideration of why do we carry it over. I can't tell
if that sblock() made sense back then, but it doesn't make any today.
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43415
Disassemble a one-for-all soshutdown() into protocol specific methods.
This creates a small amount of copy & paste, but makes code a lot more
self documented, as protocol specific method would execute only the code
that is relevant to that protocol and nothing else. This also fixes a
couple recent regressions and reduces risk of future regressions. The
extended KPI for the new pr_shutdown removes need for the extra pr_flush
which was added for the sake of SCTP which could not perform its shutdown
properly with the old one. Particularly for SCTP this change streamlines
a lot of code.
Some notes on why certain parts of code were copied or were not to certain
protocols:
* The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is
needed only for those protocols that may be connected or disconnected.
* The above reduces into only SS_ISCONNECTED for those protocols that
always connect instantly.
* The ENOTCONN and continue processing hack is left only for datagram
protocols.
* The SOLISTENING(so) block is copied to those protocols that listen(2).
* sorflush() on SHUT_RD is copied almost to every protocol, but that
will be refactored later.
* wakeup(&so->so_timeo) is copied to protocols that can make a non-instant
connect(2), can SO_LINGER or can accept(2).
There are three protocols (netgraph(4), Bluetooth, SDP) that did not have
pr_shutdown, but old soshutdown() would still perform sorflush() on
SHUT_RD for them and also wakeup(9). Those protocols partially supported
shutdown(2) returning EOPNOTSUP for SHUT_WR/SHUT_RDWR, now they fully lost
shutdown(2) support. I'm pretty sure netgraph(4) and Bluetooth are okay
about that and SDP is almost abandoned anyway.
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43413
Since fdb987bebd it is not possible anymore to use inp_next
iterator for bound, but unconnected sockets. This applies
to TCP listening sockets. Therefore the metioned commit broke
tcpsso on listening sockets if the -i option was not used.
Fix this by iterating through all endpoints instead of only
through the bound, but unconnected ones.
Reviewed by: markj
Fixes: fdb987bebd ("inpcb: Split PCB hash tables")
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43353
PRR state was not properly reset on subsequent ECN CE
events. Clean up after local transmission failures too.
Reviewed by: tuexen, cc, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43170
Only try sending more data on pure ACKs when there is
more data available in the send buffer.
In the case of a retransmitted SYN not being sent due to
an internal error, the snd_una/snd_nxt accounting could
be off, leading to a panic. Pulling snd_nxt up to snd_una
prevents this from happening.
Reported by: fengdreamer@126.com
Reviewed by: cc, tuexen, #transport
MFC after: 1 week
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43343
Keeping the SACK scoreboard intact after the first RTO
and retransmitting all data anew only on subsequent RTOs
allows a more timely and efficient loss recovery under
many adverse cirumstances.
Reviewed By: tuexen, #transport
MFC after: 10 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42906
Put most symbols under __BSD_VISIBLE and limit the namespace of
tcp_[gs]et_flags.
Reviewed by: kib, karels, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43245
Define a mask for the code point used for ECN in the Traffic Class field
(2 bits) of an IPv6 header.
BE: 0 0 3 0 0 0 0 0
Bit: 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class | Flow Label |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ... |
For BE (Big Endian), or network-byte order, this corresponds to 0x00300000.
For Little Endian, it corresponds to 0x00003000.
Reviewed by: imp, markj
MFC after: 1 week
Pull Request: https://github.com/freebsd/freebsd-src/pull/879
Do not use the cached route if the destination isn't the same.
This fix a problem where an UDP packet will be sent via the wrong route
and interface if a previous one was sent via them.
PR: 275774
Reviewed by: glebius, tuexen
Sponsored by: Beckhoff Automation GmbH & Co. KG
inline is only support in C99 and newer. To support also C89, use
__inline instead as suggested by dim.
Reported by: eduardo
Reviewed by: rscheff, markj, dim, imp
Tested by: eduardo
Fixes: a8b70cf260 ("netpfil: Use accessor functions and named constants for all tcphdr flags")
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43231