freebsd

mirror of https://git.FreeBSD.org/src.git synced 2024-12-25 11:37:56 +00:00

Author	SHA1	Message	Date
Andre Oppermann	5628dd0893	When doing RFC3042 limited transmit on the first on second duplicate ACK make sure we actually have new data to send. This prevents us from sending unneccessary pure ACKs. Reported by: Matt Miller <matt@matthewjmiller.net> Tested by: Matt Miller <matt@matthewjmiller.net> MFC after: 2 weeks	2013-04-23 14:06:32 +00:00
Andre Oppermann	982c1675ff	Fix a race condition on tcp listen socket teardown with pending connections in the accept queue and contiguous new incoming SYNs. Compared to the original submitters patch I've moved the test next to the SYN handling to have it together in a logical unit and reworded the comment explaining the issue. Submitted by: Matt Miller <matt@matthewjmiller.net> Submitted by: Juan Mojica <jmojica@gmail.com> Reviewed by: Matt Miller (changes) Tested by: pho MFC after: 1 week	2013-04-09 20:52:26 +00:00
Gleb Smirnoff	4a21e86ec1	Fix VIMAGE build.	2013-04-09 09:15:26 +00:00
Gleb Smirnoff	5923c29332	Merge from projects/counters: TCP/IP stats. Convert 'struct ipstat' and 'struct tcpstat' to counter(9). This speeds up IP forwarding at extreme packet rates, and makes accounting more precise. Sponsored by: Nginx, Inc.	2013-04-08 19:57:21 +00:00
Ed Maste	ce7ad6640c	Keep fwd_tag around for subsequent pcb lookups For TIMEWAIT handling tcp_input may have to jump back for an additional pass through pcblookup. Prior to this change the fwd_tag had been discarded after the first lookup, so a new connection attempt delivered locally via 'ipfw fwd' would fail to find a match. As of r248886 the tag will be detached and freed when passed to the socket buffer.	2013-03-29 20:51:44 +00:00
Lawrence Stewart	5b648e797b	Simplify and fix a bug in cc_ack_received()'s "are we congestion window limited" logic (refer to [1] for associated discussion). snd_cwnd and snd_wnd are unsigned long and on 64 bit hosts, min() will truncate them to 32 bits and could therefore potentially corrupt the result (although under normal operation, neither variable should legitmately exceed 32 bits). [1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html Submitted by: jhb MFC after: 1 week	2013-01-22 09:44:21 +00:00
Gleb Smirnoff	b8056fae06	Fix !INET6 build after r244365.	2012-12-18 08:14:16 +00:00
Gleb Smirnoff	dd029d52fa	Clear correct flag in INET6 case.	2012-12-18 08:09:44 +00:00
Andrey V. Elsukov	f491274582	Since we use different flags to detect tcp forwarding, and we share the same code for IPv4 and IPv6 in tcp_input, we should check both M_IP_NEXTHOP and M_IP6_NEXTHOP flags. MFC after: 3 days	2012-12-17 20:55:33 +00:00
Gleb Smirnoff	78a7880f64	Fix a crash in tcp_input(), that happens when mbuf has a fwd_tag on it, but later after processing and freeing the tag, we need to jump back again to the findpcb label. Since the fwd_tag pointer wasn't NULL we tried to process and free the tag for second time. Reported & tested by: Pawel Tyll <ptyll nitronet.pl> MFC after: 3 days	2012-12-12 17:41:21 +00:00
Andre Oppermann	60ee3bb213	Back out r242262. The simplified window change/update logic wasn't complete and ready for production use. PR: kern/173309	2012-11-05 09:13:06 +00:00
Andrey V. Elsukov	ffdbf9da3b	Remove the recently added sysctl variable net.pfil.forward. Instead, add protocol specific mbuf flags M_IP_NEXTHOP and M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup only when this flag is set. Suggested by: andre	2012-11-02 01:20:55 +00:00
Andre Oppermann	09440655fe	Increase the initial CWND to 10 segments as defined in IETF TCPM draft-ietf-tcpm-initcwnd-05. It explains why the increased initial window improves the overall performance of many web services without risking congestion collapse. As long as it remains a draft it is placed under a sysctl marking it as experimental: net.inet.tcp.experimental.initcwnd10 = 1 When it becomes an official RFC soon the sysctl will be changed to the RFC number and moved to net.inet.tcp. This implementation differs from the RFC draft in that it is a bit more conservative in the case of packet loss on SYN or SYN\|ACK because we haven't reduced the default RTO to 1 second yet. Also the restart window isn't yet increased as allowed. Both will be adjusted with upcoming changes. Is is enabled by default. In Linux it is enabled since kernel 3.0. MFC after: 2 weeks	2012-10-28 19:47:46 +00:00
Andre Oppermann	79ce26a08c	Simplify and enhance the window change/update acceptance logic, especially in the presence of bi-directional data transfers. snd_wl1 tracks the right edge, including data in the reassembly queue, of valid incoming data. This makes it like rcv_nxt plus reassembly. It never goes backwards to prevent older, possibly reordered segments from updating the window. snd_wl2 tracks the left edge of sent data. This makes it a duplicate of snd_una. However joining them right now is difficult due to separate update dependencies in different places in the code flow. snd_wnd tracks the current advertized send window by the peer. In tcp_output() the effective window is calculated by subtracting the already in-flight data, snd_nxt less snd_una, from it. ACK's become the main clock of window updates and will always update the window when the left edge of what we sent is advanced. The ACK clock is the primary signaling mechanism in ongoing data transfers. This works reliably even in the presence of reordering, reassembly and retransmitted segments. The ACK clock is most important because it determines how much data we are allowed to inject into the network. Zero window updates get us out of persistence mode are crucial. Here a segment that neither moves ACK nor SEQ but enlarges WND is accepted. When the ACK clock is not active (that is we're not or no longer sending any data) any segment that moves the extended right SEQ edge, including out-of-order segments, updates the window. This gives us updates especially during ping-pong transfers where the peer isn't done consuming the already acknowledged data from the receive buffer while responding with data. The SSH protocol is a prime candidate to benefit from the improved bi-directional window update logic as it has its own windowing mechanism on top of TCP and is frequently sending back protocol ACK's. Tcpdump provided by: darrenr Tested by: darrenr MFC after: 2 weeks	2012-10-28 19:16:22 +00:00
Andre Oppermann	4faaea5505	Allow arbitrary MSS sizes and don't mind about the cluster size anymore. We've got more cluster sizes for quite some time now and the orginally imposed limits and the previously codified thoughts on efficiency gains are no longer true. MFC after: 2 weeks	2012-10-28 18:33:52 +00:00
Andre Oppermann	cf8f04f4c0	When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to reduce the initial CWND to one segment. This reduction got lost some time ago due to a change in initialization ordering. Additionally in tcp_timer_rexmt() avoid entering fast recovery when we're still in TCPS_SYN_SENT state. MFC after: 2 weeks	2012-10-28 17:25:08 +00:00
Andre Oppermann	22efabd40c	Adjust the initial default CWND upon connection establishment to the new and increased values specified by RFC5681 Section 3.1. The even larger initial CWND per RFC3390, if enabled, is not affected. MFC after: 2 weeks	2012-10-28 17:16:09 +00:00
Andrey V. Elsukov	c1de64a495	Remove the IPFIREWALL_FORWARD kernel option and make possible to turn on the related functionality in the runtime via the sysctl variable net.pfil.forward. It is turned off by default. Sponsored by: Yandex LLC Discussed with: net@ MFC after: 2 weeks	2012-10-25 09:39:14 +00:00
Gleb Smirnoff	8ad458a471	Do not reduce ip_len by size of IP header in the ip_input() before passing a packet to protocol input routines. For several protocols this mean that now protocol needs to do subtraction itself, and for another half this means that we do not need to add header length back to the packet. Make ip_stripoptions() to adjust ip_len, since now we enter this function with a packet header whose ip_len does represent length of entire packet, not payload only.	2012-10-23 08:33:13 +00:00
Gleb Smirnoff	8f134647ca	Switch the entire IPv4 stack to keep the IP packet header in network byte order. Any host byte order processing is done in local variables and host byte order values are never[1] written to a packet. After this change a packet processed by the stack isn't modified at all[2] except for TTL. After this change a network stack hacker doesn't need to scratch his head trying to figure out what is the byte order at the given place in the stack. [1] One exception still remains. The raw sockets convert host byte order before pass a packet to an application. Probably this would remain for ages for compatibility. [2] The ip_input() still subtructs header len from ip->ip_len, but this is planned to be fixed soon. Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru> Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>	2012-10-22 21:09:03 +00:00
Gleb Smirnoff	105bd2113b	In ip_stripoptions(): - Remove unused argument and incorrect comment. - Fixup ip_len after stripping.	2012-10-12 09:24:24 +00:00
Randall Stewart	ec03d5433f	This small change takes care of a race condition that can occur when both sides close at the same time. If that occurs, without this fix the connection enters FIN1 on both sides and they will forever send FIN\|ACK at each other until the connection times out. This is because we stopped processing the FIN\|ACK and thus did not advance the sequence and so never ACK'd each others FIN. This fix adjusts it so we do process the FIN properly and the race goes away ;-) MFC after: 1 month	2012-08-25 09:26:37 +00:00
Robert Watson	0989f56cff	Update some stale comments regarding tcbinfo locking in the TCP input path: read locks on tcbinfo are no longer used, so won't happen. No functional change. MFC after: 3 days	2012-07-22 17:31:36 +00:00
Navdeep Parhar	09fe63205c	- Updated TOE support in the kernel. - Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs. These are available as t3_tom and t4_tom modules that augment cxgb(4) and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as usual with or without these extra features. - iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the works and will follow soon. Build-tested with make universe. 30s overview ============ What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the capabilities of an interface: # ifconfig -m \| grep TOE Enable/disable TCP offload on an interface (just like any other ifnet capability): # ifconfig cxgbe0 toe # ifconfig cxgbe0 -toe Which connections are offloaded? Look for toe4 and/or toe6 in the output of netstat and sockstat: # netstat -np tcp \| grep toe # sockstat -46c \| grep toe Reviewed by: bz, gnn Sponsored by: Chelsio communications. MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)	2012-06-19 07:34:13 +00:00
Maksim Yevmenkin	77d396fd18	Plug more refcount leaks and possible NULL deref for interface address list. Submitted by: scottl@ MFC after: 3 days	2012-06-04 18:43:51 +00:00
Bjoern A. Zeeb	356ab07e2d	It turns out that too many drivers are not only parsing the L2/3/4 headers for TSO but also for generic checksum offloading. Ideally we would only have one common function shared amongst all drivers, and perhaps when updating them for IPv6 we should introduce that. Eventually we should provide the meta information along with mbufs to avoid (re-)parsing entirely. To not break IPv6 (checksums and offload) and to be able to MFC the changes without risking to hurt 3rd party drivers, duplicate the v4 framework, as other OSes have done as well. Introduce interface capability flags for TX/RX checksum offload with IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6 flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6 fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and add an alias for CSUM_DATA_VALID_IPV6. This pretty much brings IPv6 handling in line with IPv4. TSO is still handled in a different way and not via if_hwassist. Update ifconfig to allow (un)setting of the new capability flags. Update loopback to announce the new capabilities and if_hwassist flags. Individual driver updates will have to follow, as will SCTP. Reported by: gallatin, dim, .. Reviewed by: gallatin (glanced at?) MFC after: 3 days X-MFC with: r235961,235959,235958	2012-05-28 09:30:13 +00:00
Bjoern A. Zeeb	45747ba53c	MFp4 bz_ipv6_fast: Add code to handle pre-checked TCP checksums as indicated by mbuf flags to save the entire computation for validation if not needed. In the IPv6 TCP output path only compute the pseudo-header checksum, set the checksum offset in the mbuf field along the appropriate flag as done in IPv4. In tcp_respond() just initialize the IPv6 payload length to 0 as ip6_output() will properly set it. Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems Reviewed by: gnn (as part of the whole) MFC After: 3 days	2012-05-25 02:23:26 +00:00
Bjoern A. Zeeb	3a9391defb	MFp4 bz_ipv6_fast: Factor out the tcp_hc_getmtu() call. As the comments say it applies to both v4 and v6, so only write it once making it easier to read the protocol family specifc code. Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems Reviewed by: gnn (as part of the whole) MFC After: 3 days	2012-05-25 01:13:39 +00:00
Gleb Smirnoff	ef341ee1e3	When we receive an ICMP unreach need fragmentation datagram, we take proposed MTU value from it and update the TCP host cache. Then tcp_mss_update() is called on the corresponding tcpcb. It finds the just allocated entry in the TCP host cache and updates MSS on the tcpcb. And then we do a fast retransmit of what we have in the tcp send buffer. This sequence gets broken if the TCP host cache is exausted. In this case allocation fails, and later called tcp_mss_update() finds nothing in cache. The fast retransmit is done with not reduced MSS and is immidiately replied by remote host with new ICMP datagrams and the cycle repeats. This ping-pong can go up to wirespeed. To fix this: - tcp_mss_update() gets new parameter - mtuoffer, that is like offer, but needs to have min_protoh subtracted. - tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify(). - tcp_mtudisc() now accepts not a useless error argument, but proposed MTU value, that is passed to tcp_mss_update() as mtuoffer. Reported by: az Reported by: Andrey Zonov <andrey zonov.org> Reviewed by: andre (previous version of patch)	2012-04-16 13:49:03 +00:00
Bjoern A. Zeeb	d8951c8a2f	Fix PAWS (Protect Against Wrapped Sequence numbers) in cases when hz >> 1000 and thus getting outside the timestamp clock frequenceny of 1ms < x < 1s per tick as mandated by RFC1323, leading to connection resets on idle connections. Always use a granularity of 1ms using getmicrouptime() making all but relevant callouts independent of hz. Use getmicrouptime(), not getmicrotime() as the latter may make a jump possibly breaking TCP nfsroot mounts having our timestamps move forward for more than 24.8 days in a second without having been idle for that long. PR: kern/61404 Reviewed by: jhb, mav, rrs Discussed with: silby, lstewart Sponsored by: Sandvine Incorporated (originally in 2011) MFC after: 6 weeks	2012-02-15 16:09:56 +00:00
Gleb Smirnoff	9077f38738	Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT, that allow to control initial timeout, idle time, idle re-send interval and idle send count on a per-socket basis. Reviewed by: andre, bz, lstewart	2012-02-05 16:53:02 +00:00
John Baldwin	1e96ae8193	Remove the assertion from tcp_input() that rcv_nxt is always greater than or equal to rcv_adv and fix tcp_twstart() to handle this case by assuming the last window was zero rather than a negative value. The code in tcp_input() already safely handled this case. It can happen due to delayed ACKs along with a remote sender that sends data beyond the window we previously advertised. If we have room in our socket buffer for the extra data beyond the advertised window, we will accept it. However, if the ACK for that segment is delayed, then we will not effectively fixup rcv_adv to account for that extra data until the next segment arrives and forces out an ACK. When that next segment arrives, rcv_nxt will be beyond rcv_adv. Tested by: pjd MFC after: 1 week	2012-01-05 22:29:11 +00:00
Ed Schouten	6472ac3d8a	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.	2011-11-07 15:43:11 +00:00
Sergey Kandaurov	ddd0c4a969	Restore sysctl names for tcp_sendspace/tcp_recvspace. They seem to be changed unintentionally in r226437, and there were no any mentions of renaming in commit log message. Reported by: Anton Yuzhaninov <citrin citrin ru>	2011-11-02 20:58:47 +00:00
Bjoern A. Zeeb	fba0cea143	Add syntactic sugar missed in r226437 and then not added either when moving things around in r226448 but desperately needed to always make things compile successfully. MFC after: 1 week	2011-10-17 00:05:31 +00:00
Andre Oppermann	873789cb0f	Move the tcp_sendspace and tcp_recvspace sysctl's from the middle of tcp_usrreq.c to the top of tcp_output.c and tcp_input.c respectively next to the socket buffer autosizing controls. MFC after: 1 week	2011-10-16 20:18:39 +00:00
Andre Oppermann	9ec4a4cca5	Remove the ss_fltsz and ss_fltsz_local sysctl's which have long been superseded by the RFC3390 initial CWND sizing. Also remove the remnants of TCP_METRICS_CWND which used the TCP hostcache to set the initial CWND in a non-RFC compliant way. MFC after: 1 week	2011-10-16 20:06:44 +00:00
Andre Oppermann	e233e2acb3	VNET virtualize tcp_sendspace/tcp_recvspace and change the type to INT. A long is not necessary as the TCP window is limited to 2**30. A larger initial window isn't useful. MFC after: 1 week	2011-10-16 15:08:43 +00:00
Attilio Rao	4af309c810	For the INP_TIMEWAIT case, there is no valid tcpcb object tied to the inpcb object. Skip the TCP_SIGNATURE check in that case as it is consistent with the output path (no TCP_SIGNATURE for outcoming packets in TIMEWAIT state) and also because for TIMEWAIT state the verify may be less effective. Sponsored by: Sandvine Incorporated Reported by: rwatson No objections by: rwatson MFC after: 3 days	2011-10-06 14:29:38 +00:00
Bjoern A. Zeeb	b233773bb9	Increase the defaults for the maximum socket buffer limit, and the maximum TCP send and receive buffer limits from 256kB to 2MB. For sb_max_adj we need to add the cast as already used in the sysctl handler to not overflow the type doing the maths. Note that this is just the defaults. They will allow more memory to be consumed per socket/connection if needed but not change the default "idle" memory consumption. All values are still tunable by sysctls. Suggested by: gnn Discussed on: arch (Mar and Aug 2011) MFC after: 3 weeks Approved by: re (kib)	2011-08-25 09:20:13 +00:00
Bjoern A. Zeeb	6f69742441	Fix compilation in case of defined(INET) && defined(IPFIREWALL_FORWARD) but no INET6. Reported by: avg Tested by: avg MFC after: 4 weeks X-MFC with: r225044 Approved by: re (kib)	2011-08-20 18:45:38 +00:00
Bjoern A. Zeeb	8a006adb24	Add support for IPv6 to ipfw fwd: Distinguish IPv4 and IPv6 addresses and optional port numbers in user space to set the option for the correct protocol family. Add support in the kernel for carrying the new IPv6 destination address and port. Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change the address in the IP header. Add support for IPv6 forwarding to a non-local destination. Add a regession test uitilizing VIMAGE to check all 20 possible combinations I could think of. Obtained from: David Dolson at Sandvine Incorporated (original version for ipfw fwd IPv6 support) Sponsored by: Sandvine Incorporated PR: bin/117214 MFC after: 4 weeks Approved by: re (kib)	2011-08-20 17:05:11 +00:00
Robert Watson	d3c1f00350	Add _mbuf() variants of various inpcb-related interfaces, including lookup, hash install, etc. For now, these are arguments are unused, but as we add RSS support, we will want to use hashes extracted from mbufs, rather than manually calculated hashes of header fields, due to the expensive of the software version of Toeplitz (and similar hashes). Add notes that it would be nice to be able to pass mbufs into lookup routines in pf(4), optimising firewall lookup in the same way, but the code structure there doesn't facilitate that currently. (In principle there is no reason this couldn't be MFCed -- the change extends rather than modifies the KBI. However, it won't be useful without other previous possibly less MFCable changes.) Reviewed by: bz Sponsored by: Juniper Networks, Inc.	2011-06-04 16:33:06 +00:00
Robert Watson	fa046d8774	Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.	2011-05-30 09:43:55 +00:00
John Baldwin	5891ebd6cd	Oops, fix order of sequence numbers in KASSERT()'s to catch negative receive windows to match the labels in the panic message. Submitted by: trociny	2011-05-14 14:41:40 +00:00
John Baldwin	f701e30d7f	Handle a rare edge case with nearly full TCP receive buffers. If a TCP buffer fills up causing the remote sender to enter into persist mode, but there is still room available in the receive buffer when a window probe arrives (either due to window scaling, or due to the local application very slowing draining data from the receive buffer), then the single byte of data in the window probe is accepted. However, this can cause rcv_nxt to be greater than rcv_adv. This condition will only last until the next ACK packet is pushed out via tcp_output(), and since the previous ACK advertised a zero window, the ACK should be pushed out while the TCP pcb is write-locked. During the window while rcv_nxt is greather than rcv_adv, a few places would compute the remaining receive window via rcv_adv - rcv_nxt. However, this value was then (uint32_t)-1. On a 64 bit machine this could expand to a positive 2^32 - 1 when cast to a long. In particular, when calculating the receive window in tcp_output(), the result would be that the receive window was computed as 2^32 - 1 resulting in advertising a far larger window to the remote peer than actually existed. Fix various places that compute the remaining receive window to either assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the window as full if rcv_nxt is greather than rcv_adv. Reviewed by: bz MFC after: 1 month	2011-05-02 21:05:52 +00:00
Bjoern A. Zeeb	29bd2010d4	Fix a mismerge from p4 in that in_localaddr() is not available without INET. Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 4 days	2011-04-30 16:30:18 +00:00
Bjoern A. Zeeb	b287c6c70c	Make the TCP code compile without INET. Sort #includes and add #ifdef INETs. Add some comments at #endifs given more nestedness. To make the compiler happy, some default initializations were added in accordance with the style on the files. Reviewed by: gnn Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 4 days	2011-04-30 11:21:29 +00:00
John Baldwin	672dc4aea2	TCP reuses t_rxtshift to determine the backoff timer used for both the persist state and the retransmit timer. However, the code that implements "bad retransmit recovery" only checks t_rxtshift to see if an ACK has been received in during the first retransmit timeout window. As a result, if ticks has wrapped over to a negative value and a socket is in the persist state, it can incorrectly treat an ACK from the remote peer as a "bad retransmit recovery" and restore saved values such as snd_ssthresh and snd_cwnd. However, if the socket has never had a retransmit timeout, then these saved values will be zero, so snd_ssthresh and snd_cwnd will be set to 0. If the socket is in fast recovery (this can be caused by excessive duplicate ACKs such as those fixed by 220794), then each ACK that arrives triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd to be no larger than snd_ssthresh. In effect, the socket's send window is permamently stuck at 0 even though the remote peer is advertising a much larger window and pending data is only sent via TCP window probes (so one byte every few seconds). Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that the various snd_*_prev fields in the pcb are valid and only perform "bad retransmit recovery" if this flag is set in the pcb. The flag is set on the first retransmit timeout that occurs and is cleared on subsequent retransmit timeouts or when entering the persist state. Reviewed by: bz MFC after: 2 weeks	2011-04-29 15:40:12 +00:00
Attilio Rao	2903309aca	Add the possibility to verify MD5 hash of incoming TCP packets. As long as this is a costy function, even when compiled in (along with the option TCP_SIGNATURE), it can be disabled via the net.inet.tcp.signature_verify_input sysctl. Sponsored by: Sandvine Incorporated Reviewed by: emaste, bz MFC after: 2 weeks	2011-04-25 17:13:40 +00:00

1 2 3 4 5 ...

473 Commits