1
0
mirror of https://git.FreeBSD.org/src.git synced 2024-11-23 07:31:31 +00:00
Commit Graph

8100 Commits

Author SHA1 Message Date
Michael Tuexen
af84665261 tcp: whitespace cleanup of enum tcp_log_events
No functional change intended.

MFC after:	1 week
Sponsored by:	Netflix, Inc.
2024-07-11 11:38:11 +02:00
Michael Tuexen
811d831050 tcp: minor cleanup
Fix two KASSERTs to catch the condition they are intended to,
add two asserts to ensure that the appropriate locking is in
place and fix some things related to style.
No functional change intended.

MFC after:	1 week
Sponsored by:	Netflix, Inc.
2024-06-29 11:06:35 +02:00
Ryan Libby
0d8da0df41 tcp_rack: avoid gcc -Werror=pointer-to-int-cast on 32-bit arch
Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D45752
2024-06-27 20:40:12 -07:00
Ryan Libby
c02a8caf50 tcp_bbr: avoid gcc -Werror=pointer-to-int-cast on 32-bit arch
Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D45751
2024-06-27 20:37:26 -07:00
Michael Tuexen
14fee5324a tcp: improve failure handling in tcp_newtcpcb()
In case of a failure of tcp_newtcpcb, where NULL is returned,
* call CC_ALGO(tp)->cb_destroy, after CC_ALGO(tp)->cb_init was called.
* call khelp_destroy_osd(), after khelp_init_osd() was called.

Reviewed by:		glebius, rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D45753
2024-06-27 08:26:34 +02:00
Michael Tuexen
3703e1a73e tcp: improve error handling in syncache_socket()
If syncache_socket() fails after calling tcp_newtcpcb(), the resources
allocated in tcp_newtcpcb() needs to be freed. Just call
tcp_discardcb() to do this.
Thanks to jtl for making me aware of the issue and proposing a fix.
Reviewed by:		glebius, jtl, rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D45749
2024-06-27 07:25:15 +02:00
Zhenlei Huang
08a98731dd ip_mroute: Use NET_EPOCH_WAIT() macro
This makes it easier to grep the usage.

Reviewed by:	kp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D45715
2024-06-24 17:57:14 +08:00
Reid Linnemann
de4bfd6b99 udp_var: correct intoudpcb macro unintended identifier dependency
Change 483fe9651 embedded struct inpcb into struct udpcb and updated the
intoudpcb macro to use __containerof to locate it. This change accidentally
introduced a dependency on the identifier inp being defined in the block the
macro is expanded in. This should have been the macro argument ip. This change
makes this simple correction.

No functional change intended.

Reviewed by:	kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2024-06-14 17:28:28 +02:00
Kristof Provost
8f04209d37 pf: simplify pf_addrcpy() and pf_match_addr()
Use the v4/v6 union members rather than the uint32_t ones.
Export IN_ARE_MASKED_ADDR_EQUAL() in in_var.h and use it (and its IPv6
equivalent) for masked comparisons rather than hand-rolled code.

Event:		Kitchener-Waterloo Hackathon 202406
2024-06-06 15:45:31 +02:00
Michael Tuexen
86c9325d34 tcp: simplify stack switching protocol
Before this patch, a stack (tfb) accepts a tcpcb (tp), if the
tp->t_state is TCPS_CLOSED or tfb->tfb_tcp_handoff_ok is not NULL
and tfb->tfb_tcp_handoff_ok(tp) returns 0.
After this patch, the only check is tfb->tfb_tcp_handoff_ok(tp)
returns 0. tfb->tfb_tcp_handoff_ok must always be provided.
For existing TCP stacks (FreeBSD, RACK and BBR) there is no
functional change. However, the logic is simpler.

Reviewed by:		lstewart, peter_lei_ieee_.org, rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D45253
2024-06-06 08:29:05 +02:00
Michael Tuexen
e7381521aa tcp: remove unused code in tcp_usr_attach
pr_attach is only called on a socket (so) with so->so_listen != NULL
via sonewconn. However, sonewconn is not called from the TCP code.
The listening sockets are handled in tcp_syncache.c without using
sonewconn. Therefore, the code removed is never executed.
No functional change intended.

Reviewed by:		rrs, peter.lei_ieee.org
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D45412
2024-05-30 21:23:45 +02:00
Michael Tuexen
df9de82f54 tcp: fix sending RST after second inp lookup
When we first find an inp, we set also the tp. If then a second
lookup is necessary, the inp is recomputed. If this fails, the
tp is not cleared, which resulted in failing KASSERT.
Therefore, clear the tp when staring the inp lookup procedure.
Reported by:	Jenkins
Fixes:		02d15215ce ("tcp: improve blackhole support")
MFC after:	1 week
Sponsored by:	Netflix, Inc.
2024-05-25 19:58:48 +02:00
Michael Tuexen
02d15215ce tcp: improve blackhole support
There are two improvements to the TCP blackhole support:
(1) If net.inet.tcp.blackhole is set to 2, also sent no RST whenever
    a segment is received on an existing closed socket or if there is
    a port mismatch when using UDP encapsulation.
(2) If net.inet.tcp.blackhole is set to 3, no RST segment is sent in
    response to incoming segments on closed sockets or in response to
    unexpected segments on listening sockets.
Thanks to gallatin@ for suggesting such an improvement.

Reviewed by:		gallatin
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D45304
2024-05-24 06:59:13 +02:00
Henrich Hartzer
674956e199 sys/netinet/cc: Switch from deprecated random() to prng32()
Related: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277655

Signed-off-by: henrichhartzer@tuta.io
Reviewed by: imp, mav
Pull Request: https://github.com/freebsd/freebsd-src/pull/1162
2024-05-23 15:10:09 -06:00
Cy Schubert
380ee9b3c0 sys/netinet/icmp6.h: Fix build
Fix stdint.h file not found.

Fixes: 		4b75afe885
2024-05-23 14:03:55 -07:00
Lexi Winter
4b75afe885 sys/netinet/icmp6.h: use C99 uintX_t constants for new PREF64 struct
Reviewed by: imp, glebius (prior suggetions done)
Pull Request: https://github.com/freebsd/freebsd-src/pull/1206
2024-05-23 14:40:48 -06:00
Lexi Winter
1e8eb413f6 netinet/icmp6: add PREF64 definitions (RFC 8781)
Reviewed by: imp, glebius (prior suggetions done)
Pull Request: https://github.com/freebsd/freebsd-src/pull/1206
2024-05-23 14:40:11 -06:00
Michael Tuexen
fe136aecc2 tcp: improve inp locking in setsockopt
Ensure that the inp is not dropped when starting a stack switch.
While there, clean-up the code by using INP_WLOCK_RECHECK, which
also re-assigns tp.

Reviewed by:		glebius
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D45241
2024-05-23 22:19:12 +02:00
Randall Stewart
ea916b6412 Remove TCP_SAD optional code now that the sack filter performs this function.
With the commit of D44903 we no longer need the SAD option. Instead all stacks that
use the sack filter inherit its protection against sack-attack.

Reviewed by: tuexen@
 Differential Revision:https://reviews.freebsd.org/D45216
2024-05-18 10:57:04 -04:00
Marko Zec
42b3c16e30 fib_dxr: code hygiene, prune old code, no functional changes
The !DXR2 code corresponds to the original DXR encoding proposal from
2012 with a single direct-lookup stage, which is inferior to the more
recent (DXR2) variant with two-stage trie both in terms of memory
footprint of the lookup structures, and in terms of overall lookup
througput.

I'm axing the old code chunks to (hopefully) somewhat improve readability,
as well as to simplify future maintenance and updates.

MFC after:	1 week
2024-05-17 18:57:25 +02:00
Marko Zec
19bd24caa4 fib_dxr: do not leak memory if FIB constellation hits structural limit
DXR lookup table encoding has an inherent structural limit on the amount
of binary search ranges it can accomodate.  With the current IPv4 BGP views
(circa 1 M prefixes) and default DXR encoding we are only at around 5% of
that limit, so far, far away from hitting it.  Just in case it ever gets
hit, make sure we free the allocated structures, instead of leaking it.

MFC after:	1 week
2024-05-17 18:46:41 +02:00
Marko Zec
4ab122e8ef fib_dxr: check if cached fib_data matches the new request in dxr_init()
When calling dxr_init(), the FIB_ALGO infrastructure may provide a
pointer to a previous dxr instance, which permits reuse of auxiliary
dxr structures, i.e. incremental lookup structure updates.  For dxr this
is a crucial feature provided by FIB_ALGO, since dxr incremental updates
are typically several orders of magnitude faster than full lookup table
rebuilds.

However, the auxiliary dxr structure caches a pointer to struct fib_data and
relies upon it for performing incremental updates.  Apparently, incremental
rebuild requests from FIB_ALGO, i.e. a calls to dxr_init() with a pointer
old_data set, may (under not yet fully understood circumstances) be invoked
within a different fib_data context than the one cached in the previous
version of dxr auxiliary structures.  In such (rare) events, we ignore the
offered old dxr context, and proceed with a full lookup structure rebuild
instead of attempting an incremental one using a fib_data context which
may or may not no longer be valid, and thus lead to a system crash.

PR:		278422
MFC after:	1 week
2024-05-17 18:21:54 +02:00
Gordon Bergling
78e4dbc345 ipfw: Fix a typo in a source code comment
- s/defaul/default/

MFC after:	3 days
2024-05-12 10:53:40 +02:00
Michael Tuexen
2f923a0ced tcp rack: improve handling of front states
When the RACK stack wants to send a FIN, but still has outstanding
or unsent data, it sends a challenge ack. Don't do this when the
TCP endpoint is still in the front states, since it does not
make sense.
Reviewed by:		rrs
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D45122
2024-05-11 16:28:45 +02:00
Michael Tuexen
5120ea0d88 sctp: improve heartbeat timer computation
PR:		278666
Reviewed by:	Albin Hellqvist
MFC after:	3 days
Pull Request:	https://reviews.freebsd.org/D45107
2024-05-10 21:02:56 +02:00
Michael Tuexen
b67716dd58 sctp: store heartbeat creation time as time_t
Reported by:	Coverity Scan
CID:		1493087
MFC after:	3 days
2024-05-10 20:40:15 +02:00
Michael Tuexen
42aeb8d490 sctp: store vtag expire time as time_t
Reported by:	Coverity Scan
CID:		1492525
CID:		1493239
MFC after:	3 days
2024-05-10 20:28:38 +02:00
Michael Tuexen
9d8a3718e2 sctp: store cookie secret change time as time_t
Reported by:	Coverity Scan
CID:		1492349
CID:		1493281
MFC after:	3 days
2024-05-10 20:14:16 +02:00
Michael Tuexen
0d15140d6d sctp: minor cleanup
No functional chnage intended.
MFC after:	3 days
2024-05-09 00:51:09 +02:00
Michael Tuexen
8c37094036 sctp: allow stcb == NULL in sctp_shutdown()
Consistently handle this case.
Reported by:	Coverity Scan
CID:		1533813
MFC after:	3 days
2024-05-09 00:43:28 +02:00
Michael Tuexen
83dcc7790b sctp: don't provide uninitialized memory to process_chunk_drop()
Right now, the code in process_chunk_drop() does not look the
the corresponding fields.
Therefore, no functional change intended.
Reported by:	Coverity Scan
CID:		1472476
MFC after:	3 days
2024-05-09 00:17:13 +02:00
Michael Tuexen
e187fa5690 sctp: fix sctp_sendall() when an mbuf chain is provided
In this case uio is NULL, which needs to be checked and m must
be copied into the sctp_copy_all structure.
Reported by:	Coverity Scan
CID:		1400449
MFC after:	3 days
2024-05-08 23:45:55 +02:00
Michael Tuexen
3d40cc7ab8 sctp: add missing check
If memory allocation fails, m is NULL. Since this is possible,
check for it.
Reported by:	Coverity Scan
CID:		1086866
MFC after:	3 days
2024-05-08 23:03:34 +02:00
Richard Scheffenegger
2a9aae9e5f tcp: add counter to track when SACK loss recovery uses TSO
Add a counter to track how frequently SACK has transmitted
more than one MSS using TSO. Instances when this will be
beneficial is the use of PRR, or when ACK thinning due to
GRO/LRO or ACK discards by the network are present.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D45070
2024-05-08 14:37:33 +02:00
Richard Scheffenegger
dcdfe44901 tcp: add sysctl to allow/disallow TSO during SACK loss recovery
Introduce net.inet.tcp.sack.tso for future use when TSO is ready
to be used during loss recovery.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D45068
2024-05-08 14:33:20 +02:00
Richard Scheffenegger
cbf3575aa3 tcp: filter small SACK blocks
While the SACK Scoreboard in the base stack limits
the number of holes by default to only 128 per connection
in order to prevent CPU load attacks by splitting SACKs,
filtering out SACK blocks of unusually small size can
further improve the actual processing of SACK loss recovery.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D45075
2024-05-08 14:00:10 +02:00
Gleb Smirnoff
a254d6870e carp: isolate VRRP from CARP
There is only one functional change here - we don't allow SIOCSVH (or
netlink request) to change sc->sc_version.  I'm convinced that allowing
such a change doesn't brings any practical value, but creates enless
minefields in front of both developers and end users (sysadmins).  If
you want to switch from VRRP to CARP or vice versa, you'd need to recreate
the VHID.

Oh, one tiny funtional change: carp_ioctl_set() won't modify any fields
if it returns EINVAL.  Previously you could provide valid advbase with
invalid advskew - that used to modify advbase and return EINVAL.

All other changes is a sweep around not ever using CARP fields when
we are in VRRP mode and vice versa.  Also adding assertions on sc_version
where necessary.

Do not send VRRP vars in CARP mode via NetLink and vice versa.  However
in compat ioctl SIOCGVH for VRRP mode the CARP fields would be zeroes.

This allows to declare softc as union and thus prevent any future logic
deterioration wrt to mixing VRRP and CARP.

Reviewed by:	kp
Differential Revision:	https://reviews.freebsd.org/D45039
2024-05-08 13:19:04 +02:00
Gleb Smirnoff
601438fbfa carp: refactor packet tagging for ether_output()
- Separate HMAC preparation (CARP specific) from tagging.
- In unicast mode (CARP specific) don't put tag at all.
- Don't put pointer to software context into the tag.  Putting just vhid,
  an integer value, is a safer design.

Reviewed by:	kp
Differential Revision:	https://reviews.freebsd.org/D45038
2024-05-08 13:19:04 +02:00
Gleb Smirnoff
cda57d955b carp: assert that we are calling correct input function. We are.
Reviewed by:	kp
Differential Revision:	https://reviews.freebsd.org/D45037
2024-05-08 13:19:04 +02:00
Gleb Smirnoff
5ee92cbd82 carp: don't chain call vrrp_send_ad via carp_send_ad
Provide inline send_ad_locked() that switches between protocol
specific sending function.

Rename carp_send_ad() to carp_callout() to avoid getting lost in
all these multiple foo_send_ad.

No functional change intended.

Reviewed by:	kp
Differential Revision:	https://reviews.freebsd.org/D45036
2024-05-08 13:19:04 +02:00
Kristof Provost
3711515467 carp: support VRRPv3
Allow carp(4) to use the VRRPv3 protocol (RFC 5798). We can distinguish carp and
VRRP based on the protocol version number (carp is 2, VRRPv3 is 3), and support
both from the carp(4) code.

Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D44774
2024-05-08 13:19:03 +02:00
Gleb Smirnoff
b6b4ac2faa tcp_hostcache: remove unnecessary socketvar.h 2024-05-07 14:15:49 -07:00
Richard Scheffenegger
59884aea8b tcp: clean up macro useage in tcp_fixed_maxseg()
Replace local PAD macro with PADTCPOLEN macro
No functional change.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D45076
2024-05-04 13:04:25 +02:00
Marko Zec
b24e353f9e fib_dxr: set fib_data field in struct dxr_aux early enough
Previously it was possible for dxr_build() to return with da->fd
unset in case of range_tbl or x_tbl malloc() failures.  This
may have led to NULL ptr dereferencing in dxr_change_rib_batch().

MFC after:	1 week

PR:		278422
2024-05-07 17:44:09 +02:00
Marko Zec
4aa275f12d fib_dxr: s/KASSERT/MPASS/
MFC after:	1 week
2024-05-07 17:33:23 +02:00
Marko Zec
7a5de1d4cc fib_dxr: KASSERTs for chasing NULL ptr and runaway refcount suspects
MFC after:	1 week
2024-05-07 17:22:00 +02:00
Marko Zec
ed541e201a fib_dxr: move the bulko of malloc() failure logging into dxr_build() 2024-05-07 17:11:30 +02:00
Marko Zec
5295e891d0 fib_dxr: update comment.
MFC after:	1 week
2024-05-06 20:42:31 +02:00
Marko Zec
858010643c fib_dxr: free() does nothing if arg is NULL, so remove a redundant check.
MFC after:	1 week
2024-05-06 20:37:44 +02:00
Marko Zec
308caa38cd fib_dxr: log malloc() failures.
MFC after:	1 week
2024-05-06 20:21:55 +02:00
Randall Stewart
fce03f85c5 TCP can be subject to Sack Attacks lets fix this issue.
There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like:

ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)
This first sack looks fine but then the attacker sends

ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...
These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries.

To combat this we introduce three things.

when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing.
Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks.
The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway.
The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks).

After this set of changes is in rack can drop its SAD detection completely

Reviewed by:tuexen@, rscheff@
 Differential Revision:	<https://reviews.freebsd.org/D44903>
2024-05-05 09:08:47 -04:00
Richard Scheffenegger
30cf0fbf26 in_pcb: don't leak credential refcounts on error
In the error path during allocating an in_pcb, the credentials
associated with the new struct get their reference count
increased early on, but not decremented when the allocation
fails.

Reported by:		cmiller_netapp.com
MFC after:		3 days
Reviewed by:		jhb, tuexen
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D45033
2024-05-01 08:41:26 +02:00
Gleb Smirnoff
c68eed82a3 accf_tls: accept filter that waits for TLS handshake header 2024-04-24 17:53:10 -07:00
Denny Page
fcdf9a1989 Support ARP for 802 networks
This is used by 802.3 Ethernet.  (Also be used by 802.4 Token Bus and
802.5 Token Ring, but we don't support those.)

This was accidentally removed along with FDDI support in commit
0437c8e3b1, presumably because comments implied it was used only by
FDDI or Token Ring.

Fixes: 0437c8e3b1 ("Remove support for FDDI networks.")
Reviewed-by: emaste
Signed-off-by: Denny Page <dennypage@me.com>
Pull-request: https://github.com/freebsd/freebsd-src/pull/1166
2024-04-23 12:30:53 -04:00
Michael Tuexen
1941914d3b tcp rack: improve BBR_LOG_CWND event
Fix a typo, which resulted in missing r_ctl.gate_to_fs in the BBLog
event.

Reported by:		Coverity Scan
CID:			1540024
Reviewed by:		rrs, rscheff
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44648
2024-04-18 21:57:44 +02:00
Michael Tuexen
c9cd686bd4 tcp: drop data received after a FIN has been processed
RFC 9293 describes the handling of data in the CLOSE-WAIT, CLOSING,
LAST-ACK, and TIME-WAIT states:
This should not occur since a FIN has been received from the remote
side. Ignore the segment text.
Therefore, implement this handling.

Reviewed by:		rrs, rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44746
2024-04-18 21:54:42 +02:00
Michael Tuexen
605a00660e tcp bbr: improve code consistency
Improve code consistency with the RACK stack.
Reviewed by:		gallatin, rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44800
2024-04-15 23:52:08 +02:00
Mark Johnston
1d14e88e53 tcp: Make tcp_var.h more self-contained
struct tcpcb embeds a struct osd and a struct callout.  Rather than
forcing all consumers to pull in the same headers, include the headers
directly.

No functional change intended.

Reviewed by:	glebius
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D44685
2024-04-10 08:53:49 -04:00
Gleb Smirnoff
f7c4d12bcd icmp: correct the assertion that checks limit + jitter
Fixes:	4399e055ea
2024-04-08 16:54:19 -07:00
Kristof Provost
60d8dbbef0 netinet: add a probe point for IP, IP6, ICMP, ICMP6, UDP and TCP stats counters
When debugging network issues one common clue is an unexpectedly
incrementing error counter. This is helpful, in that it gives us an
idea of what might be going wrong, but often these counters may be
incremented in different functions.

Add a static probe point for them so that we can use dtrace to get
futher information (e.g. a stack trace).

For example:
	dtrace -n 'mib:ip:count: { printf("%d", arg0); stack(); }'

This can be disabled by setting the following kernel option:
	options 	KDTRACE_NO_MIB_SDT

Reviewed by:	gallatin, tuexen (previous version), gnn (previous version)
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D43504
2024-04-08 17:29:59 +02:00
Michael Tuexen
e8c149ab85 tcp: add some debug output
Also log, when dropping text or FIN after having received a FIN.
This is the intended behavior described in RFC 9293.
A follow-up patch will enforce this behavior for the base stack
and the RACK stack.
Reviewed by:		rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44669
2024-04-07 22:41:24 +02:00
Michael Tuexen
3e1c8a35f7 tcp: improve consistency
No functional change intended.

Reported by:		Coverity Scan
CID:			1523781
Reviewed by:		rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44645
2024-04-06 10:02:06 +02:00
Michael Tuexen
d902c8f55b tcp rack: fix memory corruption
When in rack_output() jumping to the label out, don't write errno into
the log buffer, since the pointer is not initialized.

Reported by:		Coverity Scan
CID:			1523773
Reviewed by:		rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44647
2024-04-06 09:55:46 +02:00
Michael Tuexen
7df0ef5f48 tcp rack: fix sending
In rack_output(), idle is used as a boolean variable. So don't use it
as an int and don't clear it afterwards.
This avoids setting idle to false, when it is not intended.

Reported by:		olivier
Reviewed by:		rrs, rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44610
2024-04-05 17:47:03 +02:00
Michael Tuexen
60bc195745 tcp bblog: cleanup
Remove redundant checks and improve error checking.

Reported by:		Coverity Scan
CID:			1523780
Reviewed by:		rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44616
2024-04-05 17:36:40 +02:00
Michael Tuexen
aaaa01c0c8 tcp hpts: initialize variable
Ensure that  tv.tv_sec is zero in all code paths.

Reported by:		Coverity Scan
CID:			1527724
Reviewed by:		rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44584
2024-04-05 17:30:31 +02:00
Michael Tuexen
6b454da6bb tcp: address a warning
t_state is an unsigned variable, so no need for testing that it is
non-negative.

Reported by:		Coverity Scan
CID:			1390885
Reviewed by:		glebius
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44619
2024-04-04 00:14:59 +02:00
Michael Tuexen
e0bd180130 tcp: fix conversion of rttvar
A wrong variable and wrong scaling factors were used.

Reported by:		Coverity Scan
CID:			1508689
Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44612
2024-04-03 18:39:31 +02:00
Michael Tuexen
5a268d8688 tcp: fix comment
Make the comment consistent with the code.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44611
2024-04-03 18:26:04 +02:00
Michael Tuexen
b600644fdd tcp hpts: improve consistency
The target_slot argument of max_slots_available() can be NULL.
Therefore, check for this in all places.
Right now, all callers provide non-NULL pointer.

Reported by:		Coverity Scan
CID:			1527732
Reviewed by:		rrs
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44527
2024-04-01 21:51:59 +02:00
Gleb Smirnoff
1a8d176432 inpcb: fully retire inp_ppcb pointer
Before a protocol specific control block started to embed inpcb in self
(see 0aa120d52f, e68b379244, 483fe96511) this pointer used to point
at it.

Retain kf_sock_inpcb field in the struct kinfo_file in <sys/user.h>.  The
exp-run detected a minimal use of the field in ports:
  * sysutils/lsof - patched upstream
  * net-mgmt/netdata  - patch accepted upstream
  * emulators/qemu-user-static - upstream master branch seems not using
    the field anymore
We can keep the field around for some time, but eventually it may be
reused for something else.

PR:			277659 (exp-run)
Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D44491
2024-03-29 12:18:32 -07:00
Randall Stewart
b7b78c1c16 Optimize HPTS so that little work is done until we have a hpts thread that is over the connection threshold
HPTS inserts a softclock for system call return that optimizes performance. However when
no HPTS threads need the help (i.e. when they have less than 100 or so connections) then
there should be little work done i.e. check the counter and return instead of running through
all the threads getting locks etc.ptimize HPTS so that little work is done until we have a hpts
thread that is over the connection threshold.

Reported by:    eduardo
Reviewed by:    gallatin, glebius, tuexen
Tested by:      gallatin
Differential Revision: https://reviews.freebsd.org/D44420
2024-03-28 08:12:37 -04:00
Michael Tuexen
ed505f893a tcp bblog: use correct length
The length of tldl_reason is TCP_LOG_REASON_LEN, not TCP_LOG_ID_LEN.
No functional change intended.
Reported by:		Coverity Scan
CID:			1418074
CID:			1418276
Reviewed by:		glebius, rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44510
2024-03-27 14:31:48 +01:00
Gleb Smirnoff
4399e055ea icmp: allow zero value for ICMP limits
Zero means limit is disabled, so the value doesn't need to be checked
against jitter value.

Fixes:	ac44739fd8
Fixes:	a03aff88a1
2024-03-24 19:52:03 -07:00
Gleb Smirnoff
4f96be33fe icmp6: move ICMPv6 related tunables to the files where they are used
Most of them can be declared as static after the move out of in6_proto.c.
Keeping sysctl(9) declarations with their text descriptions next to the
variable declaration create self-documenting code.  There should be no
functional changes.

Differential Revision:	https://reviews.freebsd.org/D44481
2024-03-24 09:13:23 -07:00
Gleb Smirnoff
ac44739fd8 icmp: improve ICMP limit jitter
Instead of fixing up invalid values set by a user in badport_bandlim()
which is a fast path function, provide a sysctl handler
sysctl_icmplim_and_jitter(), that will check that jitter is less than the
limit.

Provide jitter initilization function icmplim_new_jitter() used at boot,
in the sysctl handler and when we actually hit the limit.  This also fixes
no jitter on a fresh booted system until first limit hit.

Instead of CVE number provide link the the actual paper that explains what
and why we are doing here.  The CVE number isn't very informative, it will
just tell you what RedHat version you need to upgrade to.

Reviewed by:		kp, tuexen, zlei
Differential Revision:	https://reviews.freebsd.org/D44478
2024-03-24 09:13:23 -07:00
Gleb Smirnoff
b508545ce0 icmp: when logging ICMP ratelimiting message use correct jitter value
The limiting of the very last second has been done using certain jitter
value.  We update the jitter for the next second.  But the logging should
report the jitter before the change.

Reviewed by:		kp, tuexen, zlei
Differential Revision:	https://reviews.freebsd.org/D44477
2024-03-24 09:13:23 -07:00
Gleb Smirnoff
9d7f17d746 icmp: hide icmp_bandlimit_uninit() under VIMAGE
The uninitialization may be executed only on a kernel with VIMAGE.

Reviewed by:		kp, tuexen, zlei
Differential Revision:	https://reviews.freebsd.org/D44476
2024-03-24 09:13:23 -07:00
Gleb Smirnoff
7142ab4790 icmp: do not store per-VNET identical array of strings
We need per-VNET struct counter_rate, but we don't need per-VNET set of
const char *.  Also, identical word "response" can go into the format
string instead of being stored 7 times.

Reviewed by:		kp, zlei, tuexen
Differential Revision:	https://reviews.freebsd.org/D44475
2024-03-24 09:13:23 -07:00
Michael Tuexen
af700f430f tcp: no data on SYN segments unless doing TFO
Ensure that there is no data on SYN segments unless doing TFO.
This check is already in RACK and BBR.

Reported by:		glebius
Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44384
2024-03-22 11:12:56 +01:00
John Baldwin
3d0a736796 tcp: Add a new kernel-only TCP_USE_DDP socket option
This socket option can be used by in-kernel consumers (like NFS) to
request a NIC to use optimized receive of large buffers for a
connection.  The current use case is to support DDP by the TOE on
Chelsio NICs.

Reviewed by:	rscheff, tuexen, glebius
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44000
2024-03-20 15:29:02 -07:00
Gleb Smirnoff
56f7860087 carp: check CARP status in in_localip_fib(), in6_localip_fib()
Don't report a BACKUP CARP address as local.  These two functions are used
only by source address validation for input packets, controlled by sysctls
net.inet.ip.source_address_validation and
net.inet6.ip6.source_address_validation.  For this purpose we definitely
want to treat BACKUP addresses as non local.

This change is conservative and doesn't modify compat in_localip() and
in6_localip().  They are used more widely than the FIB-aware versions.
The change would modify the notion of ipfw(4) 'me' keyword.  There might
be other consequences as in_localip() is used by various tunneling
protocols.

PR:			277349
2024-03-19 11:48:59 -07:00
Gleb Smirnoff
e34ea0196f tcp: clear all TCP timers in tcp_timer_stop() when in callout
When a TCP callout decides to disable self, e.g. tcp_timer_2msl() calling
tcp_close(), we must also clear all other possible timers.  Otherwise,
upon return, the callout would be scheduled again in tcp_timer_enter().

Revert 57e27ff07a, which was a temporary partial revert of otherwise
correct 62d47d73b7, that exposed the problem being fixed now.  Add an
extra assertion in tcp_timer_enter() to check we aren't arming callout for
a closed connection.

Reviewed by:	rscheff
2024-03-18 13:57:00 -07:00
Gleb Smirnoff
dd7b86e2a0 tcp: remove IS_FASTOPEN() macro
The macro is more obfuscating than helping as it just checks a single flag
of t_flags.  All other t_flags bits are checked without a macro.

A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option.  It is a bad practice to create such definitions in
installable headers.

Reviewed by:		rscheff, tuexen, kib
Differential Revision:	https://reviews.freebsd.org/D44362
2024-03-18 08:56:17 -07:00
Gleb Smirnoff
d62c4607e8 sockets: remove unused KPIs to manipulate sockets
These KPIs were added in dd0e6c383a and through 15 years had zero use.
They slightly remind what IfAPI does for struct ifnet.  But IfAPI does
that for the sake of large collection of NIC drivers not being aware of
struct ifnet.  For the sockets it is unclear what could be a large
collection of externally written kernel modules that need extensively use
sockets and not be aware of their internals at the same time. This
isolation of a structure knowledge requires a lot of work, and just
throwing in a few KPIs isn't helpful.

Reviewed by:		kib, olce, markj
Differential Revision:	https://reviews.freebsd.org/D44311
2024-03-18 08:50:30 -07:00
Gleb Smirnoff
027fda80fe inpcb: remove unused KPIs to manipulate inpcbs
These KPIs were added in 9d29c635da and through 15 years had zero use.
They slightly remind what IfAPI does for struct ifnet.  But IfAPI does
that for the sake of large collection of NIC drivers not being aware of
struct ifnet.  For the inpcb it is unclear what could be a large
collection of externally written kernel modules that need extensively use
inpcb and not be aware of its internals at the same time. This isolation
of a structure knowledge requires a lot of work, and just throwing in a
few KPIs isn't helpful.

Reviewed by:		kib, bz, markj
Differential Revision:	https://reviews.freebsd.org/D44310
2024-03-18 08:49:39 -07:00
Gleb Smirnoff
ab8f59ceaf rack: don't define TCPOUTFLAGS
as the code doesn't use tcp_outflags.  This should fix gcc builds.
2024-03-13 21:07:59 -07:00
Konstantin Belousov
220ee18f19 netinet/tcp_var.h: always define IS_FASTOPEN() for kernel compilation env
and drop the definition for userspace (which matched TCP_RFC7413) since
it depends on presence of the kernel option.

Reviewed by:	glebius, rscheff
Sponsored by:	NVIDIA networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D44349
2024-03-14 01:20:58 +02:00
Richard Scheffenegger
85df11a1de ktls: deep copy tls_enable struct for in-kernel tcp consumers
Doing a deep copy of the keys early allows users of the
tls_enable structure to assume kernel memory.
This enables the socket options to be set by kernel threads.

Reviewed By:	#transport, tuexen, jhb, rrs
Sponsored by:	NetApp, Inc.
X-NetApp-PR:	#79
Differential Revision:	https://reviews.freebsd.org/D44250
2024-03-13 13:23:13 +01:00
Gleb Smirnoff
e4315bbc85 tcp: move struct tcp_ifcap declaration under _KERNEL
Reviewed by:		rscheff, tuexen, kib
Differential Revision:	https://reviews.freebsd.org/D44340
2024-03-13 12:14:18 -07:00
Randall Stewart
e18b97bd63 Update to bring the rack stack with all its fixes in.
This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

With a bit of a struggle with git I finally got rack_pcm.c into place (apologies
for not noticing this error). The LINT kernel is running on my box now .. sigh.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986
2024-03-12 07:55:02 -04:00
Brooks Davis
c112243f6b Revert "Update to bring the rack stack with all its fixes in."
This commit was incomplete and breaks LINT kernels.  The tree has been
broken for 8+ hours.

This reverts commit f6d489f402.
2024-03-11 20:28:24 +00:00
Randall Stewart
f6d489f402 Update to bring the rack stack with all its fixes in.
This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

Note there is a new file that I can't figure out how to get in rack_pcm.c

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision:https://reviews.freebsd.org/D43986
2024-03-11 07:36:54 -04:00
Michael Tuexen
96ad640178 TCP LRO: add dtrace probe points
Add the IP, UDP, and TCP receive static probes to the code path,
which avoids if_input.

Reviewed by:		rrs, markj
MFC after:		1 week`
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43727
2024-03-08 10:21:09 +01:00
Michael Tuexen
d1ce01214a TCP LRO: disable mbuf queuing when packet filter hooks are in place
When doing mbuf queueing, the packet filter hooks in ether_demux(),
ip_input(), and ip6_input() are by-passed. This means that the packet
filters don't process incoming packets, which might result in
connection failures. For example bypassing the TCP sequence number
validation will result in dropping valid packets.
Please note that this patch is only disabling mbuf queueing, not LRO.

Reported by:		Herbert J. Skuhra
Reviewed by:		glebius, rrs, rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43769
2024-03-08 10:03:43 +01:00
Kyle Evans
47ad4f2d45 ktrace: log genio events on failed write
Visibility into the contents of the buffer when a write(2) has failed
can be immensely useful in debugging IPC issues -- pushing this to
discuss the idea, or maybe an alternative where we can set a flag like
KTRFAC_ERRIO to enable it.

When a genio event is potentially raised after an error, currently we'll
just free the uio and return.  However, such data can be useful when
debugging communication between processes to, e.g., understand what the
remote side should have grabbed before closing a pipe.  Tap out the
entire buffer on failure rather than simply discarding it.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D43799
2024-03-04 23:44:09 -06:00
Randall Stewart
638b5ae1c7 HTPS has actually three states not two so the macro needs to account for that.
Ok lets fix up the tcp_in_hpts() so that it also says yes if you
are in the race state moving and you are scheduled to be put in.
This also requires changing the MPASS to be the old version non
inline function of tcp_in_hpts().

This change also adds a new inline macro so that a uint64_t timestamp can be
obtained by a transport (aka Rack will use this).

Reviewed by: glebius, tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D44157
2024-03-01 15:21:15 -05:00
Gordon Bergling
6bce41a38e carp(4): Fix a typo in a source code comment
- s/successfull/successful/

MFC after:	3 days
2024-02-27 17:39:57 +01:00
Richard Scheffenegger
8917131e00 tcp: need default in switch statement for enum.
fix clang error after c9b6241e25

Reviewed By: imp
Differential Revision: https://reviews.freebsd.org/D44081
2024-02-25 08:24:13 +01:00
Richard Scheffenegger
c9b6241e25 tcp: address enum-int-mismatch
fix gcc13 error after f74352fbcf
2024-02-25 04:46:39 +01:00
Richard Scheffenegger
5e248c23d9 tcp: retain some CC signals outside of kernel scope
Summary: fix build error after f74352fbcf

Reviewers: #transport!

Subscribers: imp, melifaro, glebius

Differential Revision: https://reviews.freebsd.org/D44066
2024-02-24 21:01:54 +01:00
Michael Tuexen
644cffe67f sctp: improve sending of packets containing an INIT ACK chunk
If the peer announced support of zero checksums, do so when sending
packets containing an INIT ACK chunk.

MFC after:	1 week
2024-02-24 19:16:36 +01:00
Richard Scheffenegger
038699a8f1 tcp: cubic - restart epoch after RTO
This is a migitation to avoid sudden extreme jumps in
cwnd, as t_epoch can be very out of date after an RTO.
Per RFC9438, sec 4.8, t_epoch is to be reset whenever
cwnd grows beyond ssthresh (CC phase transitions from
slow start to congestion avoidance), to be fixed with
the upcoming cc_cubic changes.

MFC after:		3 days
Reviewed By:		cc, #transport
Sponsored by:		NetApp, Inc
Differential Revision:	https://reviews.freebsd.org/D44023
2024-02-24 17:07:46 +01:00
Richard Scheffenegger
40fdc6d25f tcp: provide correct snd_fack on post_recovery
Ensure that snd_fack holds a valid value when doing
the post_recovery CC processing, for preparation of
the cc_cubic update, so that local pipe calculations
can correctly refer to snd_fack during and after CC events.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43957
2024-02-24 16:55:31 +01:00
Richard Scheffenegger
f74352fbcf tcp: use enum for all congestion control signals
Facilitate easier troubleshooting by enumerating
all congestion control signals. Typecast the
enum to int, when a congestion control module uses
private signals.

No external change.

Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43838
2024-02-24 16:41:48 +01:00
Richard Scheffenegger
38983d40c1 tcp: prevent div by zero in cc_htcp
Make sure the divident is at least one. While cwnd should
never be smaller than t_maxseg, this can happen during
Path MTU Discovery, or when TCP options are considered
in other parts of the stack.

PR:			276674
MFC after:		3 days
Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43797
2024-02-24 16:35:59 +01:00
Michael Tuexen
533faf21c1 sctp: improve consistency
MFC after:	1 week
2024-02-23 21:40:46 +01:00
Gordon Bergling
2fb174d18a sctp(4): Fix a typo in a source code comment
- s/anthing/anything/

MFC after:	3 days
2024-02-18 13:01:04 +01:00
Michael Tuexen
2f4e46dfdd RACK, BBR: handle EACCES like EPERM for IP output handling
The FreeBSD TCP base stack handles them also the same way.
In case of packet filters dropping packets in the output path,
this avoids retranmitting the dropped packet every 10ms or so.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43773
2024-02-16 12:19:24 +01:00
Gleb Smirnoff
abe8379b4f sockets: repair wakeup of accept(2) by shutdown(2)
That was lost in transition from one-for-all soshutdown() to protocol
specific methods.  Only protocols that listen(2) were affected.  This is
not a documented or specified feature, but some software relies on it.  At
least the FreeSWITCH telephony software uses this behavior on
PF_INET/SOCK_STREAM.

Fixes:  5bba272807
2024-02-15 10:48:44 -08:00
Richard Scheffenegger
fcea1cc971 tcp: fix RTO ssthresh for non-6675 pipe calculation
Follow up on D43768 to properly deal with the non-default
pipe calculation. When CC_RTO is processed, the timeout
will have already pulled back snd_nxt. Further, snd_fack
is not pulled along with snd_una.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43876
2024-02-14 14:51:53 +01:00
Richard Scheffenegger
57e27ff07a tcp: partially undo D43792
At the destruction of the tcpcb, no timers are supposed to
be running. However, it turns out that stopping them in the
close() / shutdown() call does not have the desired effect
under all circumstances.

This partially reverts 62d47d73b7 to reduce the nuisance
caused.

PR:			277009
Reported-by:		syzbot+9a9aa434a14a2b35c3ba@syzkaller.appspotmail.com
Reported-by:		syzbot+e82856782410e895bae7@syzkaller.appspotmail.com
Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43855
2024-02-12 22:38:11 +01:00
Richard Scheffenegger
62d47d73b7 tcp: stop timers and clean scoreboard in tcp_close()
Stop timers when in tcp_close() instead of doing that in tcp_discardcb().
A connection in CLOSED state shall not need any timers. Assert that no
timer is rescheduled after that in tcp_timer_activate() and verfiy that
this is also the expected state in tcp_discardcb().

PR:			276761
Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43792
2024-02-10 10:30:00 +01:00
Richard Scheffenegger
a8e817cf5c tcp: stop doing superfluous work after sending RST
When sending a RST control segment in tcp_output() it
means we are in TCPS_CLOSED state, called from tcp_drop().
Once the RST is sent, don't call tcp_timer_activate() or
update anything in tcpcb, since that will go away shortly.

PR:			276761
Provided by:		glebius
Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43808
2024-02-10 10:25:02 +01:00
Richard Scheffenegger
3eeb22cb81 tcp: clean scoreboard when releasing the socket buffer
The SACK scoreboard is conceptually an extention of the socket
buffer. Remove it when the socket buffer goes away with
soisdisconnected(). Verify that this is also the expected
state in tcp_discardcb().

PR:			276761
Reviewed by:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43805
2024-02-10 10:20:00 +01:00
Richard Scheffenegger
23c4f23247 tcp: ensure tcp_sack_partialack does not inflate cwnd after RTO
The implicit assumption of snd_nxt always being larger than
snd_recover is not true after RTO. In that case, cwnd
would get inflated to ssthresh, which may be much larger
than the current pipe (data in flight).

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43653
2024-02-08 20:40:25 +01:00
Richard Scheffenegger
32a6df57df tcp: calculate ssthresh on RTO according to RFC5681
per RFC5681, only adjust ssthresh on the initital
retransmission timeout. Since RTO often happens
during loss recovery, while cwnd no longer tracks
all data in flight, calculcate pipe properly.

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43768
2024-02-08 19:18:26 +01:00
Richard Scheffenegger
1adab814e8 tcp: use tcp_fixed_maxseg instead of tcp_maxseg in cc modules
tcp_fixed_maxseg() is the streamlined calculation of typical
tcp options and more suitable for heavy use in the congestion
control modules on every received packet.

No external functional change.

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43779
2024-02-08 18:36:59 +01:00
Gleb Smirnoff
ce69e37369 Revert "sockets: retire sorflush()"
Provide a comment in sorflush() why the socket I/O sx(9) lock is actually
important.

This reverts commit 507f87a799.
2024-02-03 13:08:41 -08:00
Gleb Smirnoff
f79a8585bb sockets: garbage collect SS_ISCONFIRMING
Fixes:	8df32b19de
2024-01-30 10:38:33 -08:00
Michael Tuexen
f30c7d5654 TCP LRO: convert TCP header fields to host byte order earlier
This is a preparation for adding dtrace hooks in a follow-up commit,
which are missing in the code path, where packets are directly queued
to the tcpcb. The dtrace hooks expect the fields to be in host byte
order. This only applies when TCP HPTS is used.
No functional change intended.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43594
2024-01-29 18:52:17 +01:00
Kristof Provost
ffeab76b68 pfil: PFIL_PASS never frees the mbuf
pfil hooks (i.e. firewalls) may pass, modify or free the mbuf passed
to them. (E.g. when rejecting a packet, or when gathering up packets
for reassembly).

If the hook returns PFIL_PASS the mbuf must still be present. Assert
this in pfil_mem_common() and ensure that ipfilter follows this
convention. pf and ipfw already did.
Similarly, if the hook returns PFIL_DROPPED or PFIL_CONSUMED the mbuf
must have been freed (or now be owned by the firewall for further
processing, like packet scheduling or reassembly).

This allows us to remove a few extraneous NULL checks.

Suggested by:	tuexen
Reviewed by:	tuexen, zlei
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D43617
2024-01-29 14:10:19 +01:00
Richard Scheffenegger
0b3f9e435f tcp: move cc_post_recovery past snd_una update
The RFC6675 pipe calculation (sack.revised, enabled
by default since D28702), uses outdated information,
while the previous default calculated it correctly
with up-to-date information from the incoming ACK.

This difference can become as large as the receive
window (not the congestion window previously),
potentially triggering a massive burst of new packets.

MFC after:             1 week
Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43520
2024-01-28 00:18:51 +01:00
Mark Johnston
bbf86c65d0 netinet: Remove stale references to Giant from comments
MFC after:	1 week
2024-01-27 13:51:13 -05:00
Richard Scheffenegger
2d05a1c81b tcp: commonize check for more data to send, style changes
Use SEQ_SUB instead of a plain subtraction, for an implict
type conversion and prevention of a possible overflow.
Use curly brackets in stacked if statements throughout.
Use of the ? operator to enhance readability when clearing
the FIN flag in tcp_output().

None of the above change the function.

Reviewed By:           tuexen, cc, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43539
2024-01-26 01:20:35 +01:00
Richard Scheffenegger
fc262fd3dc tcp: AccECN access ACE field by shifting bits
Shifting bits is quicker than checking header flag bits
one by one. Also improve readability by the use of switch
statements.

No change in behaviour.

Reviewed By:           glebius, tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43560
2024-01-26 00:16:22 +01:00
Richard Scheffenegger
0932fb565a tcp: fix TCPSTAT accounting for SACK
Account for SACK retransmitted bytes once the actual length
is known. This prevents a call to tcp_maxseg() and prepares
for TSO support when transmitting from the SACK scoreboard.

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43447
2024-01-25 22:58:33 +01:00
Richard Scheffenegger
c7c325d01d tcp: pass maxseg around instead of calculating locally
Improve slowpath processing (reordering, retransmissions)
slightly by calculating maxseg only once. This typically
saves one of two calls to tcp_maxseg().

Reviewed By:           glebius, tuexen, cc, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43536
2024-01-24 16:43:29 +01:00
Gleb Smirnoff
90ad2dc287 tcp: remove 20+ year old disabled code from d912c694ee 2024-01-23 13:16:34 -08:00
Gleb Smirnoff
c809435b18 tcp: clear outdated comment mentioning T/TCP 2024-01-23 12:59:21 -08:00
Gleb Smirnoff
e21c668719 tcp: pass positive errno to tcp_drop()
Fixes:	446ccdd08e
2024-01-23 12:59:21 -08:00
Gordon Bergling
9b035689f1 tcp_fastopen: Fix a typo in a source code comment
- s/posession/possession/

MFC after:	3 days
2024-01-22 21:49:47 +01:00
Gleb Smirnoff
7f3184ba79 tcp: remove outdated comment
This paragraph should have been removed in 446ccdd08e.
2024-01-22 12:42:21 -08:00
Gordon Bergling
ef0ac0a1ad tcp_hpts: Fix a typo of a function name in a comment
- s/tcp_ouput/tcp_output/

MFC after:	3 days
2024-01-20 17:29:28 +01:00
Richard Scheffenegger
dfe30e4196 tcp: remove unused tcp_sack_output_debug() function
This debugging code has been lingering for years with
no known use.

No functional change.

Reviewed by:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43511
2024-01-19 14:48:32 +01:00
Gleb Smirnoff
a079c891c0 sctp: restore missing inpcb lock
Fixes:	5bba272807
Reported-by: syzbot+b8636c973dc20fea4a9b@syzkaller.appspotmail.com
Reported-by: syzbot+d76a18ee8bbe6f7d3056@syzkaller.appspotmail.com
2024-01-16 23:11:27 -08:00
Xavier Beaudouin
80044c785c Add UDP encapsulation of ESP in IPv6
This patch provides UDP encapsulation of ESP packets over IPv6.
Ports the IPv4 code to IPv6 and adds support for IPv6 in udpencap.c
As required by the RFC and unlike in IPv4 encapsulation,
UDP checksums are calculated.

Co-authored-by:	Aurelien Cazuc <aurelien.cazuc.external@stormshield.eu>
Sponsored-by:	Stormshield
Sponsored-by:	Wiktel
Sponsored-by:	Klara, Inc.
2024-01-16 20:44:34 +00:00
Gleb Smirnoff
507f87a799 sockets: retire sorflush()
With removal of dom_dispose method the function boils down to two
meaningful function calls: socantrcvmore() and sbrelease().  The latter is
only relevant for protocols that use generic socket buffers.

The socket I/O sx(9) lock acquisition in sorflush() is not relevant for
shutdown(2) operation as it doesn't do any I/O that may interleave with
read(2) or write(2).  The socket buffer mutex acquisition inside
sbrelease() is what guarantees thread safety.  This sx(9) acquisition in
soshutdown() can be tracked down to 4.4BSD times, where it used to be
sblock(), and it was carried over through the years evolving together with
sockets with no reconsideration of why do we carry it over.  I can't tell
if that sblock() made sense back then, but it doesn't make any today.

Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D43415
2024-01-16 10:30:49 -08:00
Gleb Smirnoff
5bba272807 sockets: make pr_shutdown fully protocol specific method
Disassemble a one-for-all soshutdown() into protocol specific methods.
This creates a small amount of copy & paste, but makes code a lot more
self documented, as protocol specific method would execute only the code
that is relevant to that protocol and nothing else.  This also fixes a
couple recent regressions and reduces risk of future regressions.  The
extended KPI for the new pr_shutdown removes need for the extra pr_flush
which was added for the sake of SCTP which could not perform its shutdown
properly with the old one.  Particularly for SCTP this change streamlines
a lot of code.

Some notes on why certain parts of code were copied or were not to certain
protocols:
* The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is
  needed only for those protocols that may be connected or disconnected.
* The above reduces into only SS_ISCONNECTED for those protocols that
  always connect instantly.
* The ENOTCONN and continue processing hack is left only for datagram
  protocols.
* The SOLISTENING(so) block is copied to those protocols that listen(2).
* sorflush() on SHUT_RD is copied almost to every protocol, but that
  will be refactored later.
* wakeup(&so->so_timeo) is copied to protocols that can make a non-instant
  connect(2), can SO_LINGER or can accept(2).

There are three protocols (netgraph(4), Bluetooth, SDP) that did not have
pr_shutdown, but old soshutdown() would still perform sorflush() on
SHUT_RD for them and also wakeup(9).  Those protocols partially supported
shutdown(2) returning EOPNOTSUP for SHUT_WR/SHUT_RDWR, now they fully lost
shutdown(2) support.  I'm pretty sure netgraph(4) and Bluetooth are okay
about that and SDP is almost abandoned anyway.

Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D43413
2024-01-16 10:30:37 -08:00
Gleb Smirnoff
d4033ebd05 divert: just return EOPNOTSUPP on shutdown(2)
Before this change we would always return ENOTCONN.  There is no
legitimate use of shutdown(2) on divert(4).
2024-01-12 02:04:04 -08:00
Michael Tuexen
13720136fb tcpsso: fix when used without -i option
Since fdb987bebd it is not possible anymore to use inp_next
iterator for bound, but unconnected sockets. This applies
to TCP listening sockets. Therefore the metioned commit broke
tcpsso on listening sockets if the -i option was not used.
Fix this by iterating through all endpoints instead of only
through the bound, but unconnected ones.

Reviewed by:		markj
Fixes:			fdb987bebd ("inpcb: Split PCB hash tables")
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43353
2024-01-10 08:33:09 +01:00
John Baldwin
8cb9b68f58 sys: Use mbufq_empty instead of comparing mbufq_len against 0
Reviewed by:	bz, emaste
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43338
2024-01-09 11:00:46 -08:00
Richard Scheffenegger
429f14f83a tcp: clean PRR state after ECN congestion recovery.
PRR state was not properly reset on subsequent ECN CE
events. Clean up after local transmission failures too.

Reviewed by:           tuexen, cc, #transport
MFC after:             3 days
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43170
2024-01-08 10:53:04 +01:00
Richard Scheffenegger
f4574e2dc5 tcp: prevent spurious empty segments and fix uncommon panic
Only try sending more data on pure ACKs when there is
more data available in the send buffer.

In the case of a retransmitted SYN not being sent due to
an internal error, the snd_una/snd_nxt accounting could
be off, leading to a panic. Pulling snd_nxt up to snd_una
prevents this from happening.

Reported by:           fengdreamer@126.com
Reviewed by:           cc, tuexen, #transport
MFC after:             1 week
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43343
2024-01-08 10:52:49 +01:00
Richard Scheffenegger
30409ecdb6 tcp: do not purge SACK scoreboard on first RTO
Keeping the SACK scoreboard intact after the first RTO
and retransmitting all data anew only on subsequent RTOs
allows a more timely and efficient loss recovery under
many adverse cirumstances.

Reviewed By:           tuexen, #transport
MFC after:             10 weeks
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42906
2024-01-06 20:25:38 +01:00
Richard Scheffenegger
893ed42eca tcp: Make use of enum for sack_changed
No functional change.

Reviewed By:           tuexen, #transport
MFC after:             3 days
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43346
2024-01-06 20:23:52 +01:00
Michael Tuexen
aa1223ac3a tcp: limit visibility of symbols
Put most symbols under __BSD_VISIBLE and limit the namespace of
tcp_[gs]et_flags.

Reviewed by:		kib, karels, rscheff
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43245
2024-01-06 12:00:38 +01:00
Jose Luis Duran
b0e13f785b netinet: Define IPv6 ECN mask
Define a mask for the code point used for ECN in the Traffic Class field
(2 bits) of an IPv6 header.

     BE:    0       0       3       0       0       0       0       0
    Bit: 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |Version| Traffic Class |           Flow Label                  |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                              ...                              |

For BE (Big Endian), or network-byte order, this corresponds to 0x00300000.
For Little Endian, it corresponds to 0x00003000.

Reviewed by:	imp, markj
MFC after:	1 week
Pull Request:	https://github.com/freebsd/freebsd-src/pull/879
2024-01-03 12:56:28 -05:00
Richard Kümmel
7df9da47e8 Fix udp IPv4-mapped address
Do not use the cached route if the destination isn't the same.
This fix a problem where an UDP packet will be sent via the wrong route
and interface if a previous one was sent via them.

PR:	275774
Reviewed by:	glebius, tuexen
Sponsored by:	Beckhoff Automation GmbH & Co. KG
2024-01-02 07:49:12 +01:00
Michael Tuexen
642ac6015b tcp: fix ports
inline is only support in C99 and newer. To support also C89, use
__inline instead as suggested by dim.

Reported by:		eduardo
Reviewed by:		rscheff, markj, dim, imp
Tested by:		eduardo
Fixes:			a8b70cf260 ("netpfil: Use accessor functions and named constants for all tcphdr flags")
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43231
2023-12-30 03:28:13 +01:00