freebsd

mirror of https://git.FreeBSD.org/src.git synced 2024-12-27 11:55:06 +00:00

Author	SHA1	Message	Date
Hiren Panchasara	a934d06194	Add an option to use rfc6675 based pipe/inflight bytes calculation in newreno. MFC after: 3 weeks Sponsored by: Limelight Networks	2015-12-09 08:53:41 +00:00
Hiren Panchasara	f81bc34eac	Add an option to use rfc6675 based pipe/inflight bytes calculation in cubic. Reviewed by: gnn MFC after: 3 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D4205	2015-12-09 07:56:40 +00:00
Hiren Panchasara	021eaf7996	One of the ways to detect loss is to count duplicate acks coming back from the other end till it reaches predetermined threshold which is 3 for us right now. Once that happens, we trigger fast-retransmit to do loss recovery. Main problem with the current implementation is that we don't honor SACK information well to detect whether an incoming ack is a dupack or not. RFC6675 has latest recommendations for that. According to it, dupack is a segment that arrives carrying a SACK block that identifies previously unknown information between snd_una and snd_max even if it carries new data, changes the advertised window, or moves the cumulative acknowledgment point. With the prevalence of Selective ACK (SACK) these days, improper handling can lead to delayed loss recovery. With the fix, new behavior looks like following: 0) th_ack < snd_una --> ignore Old acks are ignored. 1) th_ack == snd_una, !sack_changed --> ignore Acks with SACK enabled but without any new SACK info in them are ignored. 2) th_ack == snd_una, window == old_window --> increment Increment on a good dupack. 3) th_ack == snd_una, window != old_window, sack_changed --> increment When SACK enabled, it's okay to have advertized window changed if the ack has new SACK info. 4) th_ack > snd_una --> reset to 0 Reset to 0 when left edge moves. 5) th_ack > snd_una, sack_changed --> increment Increment if left edge moves but there is new SACK info. Here, sack_changed is the indicator that incoming ack has previously unknown SACK info in it. Note: This fix is not fully compliant to RFC6675. That may require a few changes to current implementation in order to keep per-sackhole dupack counter and change to the way we mark/handle sack holes. PR: 203663 Reviewed by: jtl MFC after: 3 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D4225	2015-12-08 21:21:48 +00:00
Alexander V. Chernikov	65ff3638df	Merge helper fib* functions used for basic lookups. Vast majority of rtalloc(9) users require only basic info from route table (e.g. "does the rtentry interface match with the interface I have?", "what is the MTU?", "Give me the IPv4 source address to use", etc.). Instead of hand-rolling lookups, checking if rtentry is up, valid, dealing with IPv6 mtu, finding "address" ifp (almost never done right), provide easy-to-use API hiding all the complexity and returning the needed info into small on-stack structure. This change also helps hiding route subsystem internals (locking, direct rtentry accesses). Additionally, using this API improves lookup performance since rtentry is not locked. (This is safe, since all the rtentry changes happen under both radix WLOCK and rtentry WLOCK). Sponsored by: Yandex LLC	2015-12-08 10:50:03 +00:00
Michael Tuexen	c979034b18	Fix the allocation of outgoing streams: * When processing a cookie, use the number of streams announced in the INIT-ACK. * When sending an INIT-ACK for an existing association, use the value from the association, not from the end-point. MFC after: 1 week	2015-12-06 16:17:57 +00:00
Alexander V. Chernikov	f8aee88f0b	Remove LLE read lock from IPv4 fast path. LLE structure is mostly unchanged during its lifecycle. To be more specific, there are 2 things relevant for fast path lookup code: 1) link-level address change. Since r286722, these updates are performed under AFDATA WLOCK. 2) Some sort of feedback indicating that this particular entry is used so we re-send arp request to perform reachability verification instead of expiring entry. The only signal that is needed from fast path is something like binary yes/no. The latter is solved by the following changes: 1) introduce special r_skip_req field which is read lockless by fast path, but updated under (new) req_mutex mutex. If this field is non-zero, then fast path will acquire lock and set it back to 0. 2) introduce simple state machine: incomplete->reachable<->verify->deleted. Before that we implicitly had incomplete->reachable->deleted state machine, with V_arpt_keep between "reachable" and "deleted". Verification was performed in runtime 5 seconds before V_arpt_keep expire. This is changed to "change state to verify 5 seconds before V_arpt_keep, set r_skip_req to non-zero value and check it every second". If the value is zero - then send arp verification probe. These changes do not introduce any significant control plane overhead: typically lle callout timer would fire 1 time more each V_arpt_keep (1200s) for used lles and up to arp_maxtries (5) for dead lles. As a result, all packets towards "reachable" lle are handled by fast path without acquiring lle read lock. Additional "req_mutex" is needed because callout / arpresolve_slow() or eventhandler might keep LLE lock for significant amount of time, which might not be feasible for fast path locking (e.g. having rmlock as either AFDATA or lltable own lock). Differential Revision: https://reviews.freebsd.org/D3688	2015-12-05 09:50:37 +00:00
Michael Tuexen	a4889f2dd0	Fix a bug where a stream reset request wasn't retransmitted when the peer indicated "In progress". MFC after: 1 week	2015-12-04 08:49:27 +00:00
Michael Tuexen	d96bef9c77	Ensure that outgoing streams get reset when they run dry. MFC after: 1 week	2015-12-03 15:19:29 +00:00
Michael Tuexen	4821b41e21	Minor cleanup. No functional change. MFC after: 1 week	2015-12-02 22:44:42 +00:00
Michael Tuexen	60862d8e48	Adjust the MTU when accepting an SCTP association using UDP encapsulation. MFC after: 1 week	2015-12-02 16:29:36 +00:00
Andrey V. Elsukov	b4e63e2d15	In the same way fix the problem described in r291578 for IGMPv3. In case when router has a lot of multicast groups, the reply can take several packets due to MTU limitation. Also we have a limit IGMP_MAX_RESPONSE_BURST == 4, that limits the number of packets we send in one shot. Then we recalculate the timer value and schedule the remaining packets for sending. The problem is that when we call igmp_v3_dispatch_general_query() to send remaining packets, we queue new reply in the same mbuf queue. And when number of packets is bigger than IGMP_MAX_RESPONSE_BURST, we get endless reply of IGMPv3 reports. To fix this, add the check for remaining packets in the queue. MFC after: 1 week Sponsored by: Yandex LLC	2015-12-01 11:24:30 +00:00
Alexander V. Chernikov	c00c4e46e3	Remove in_setifarnh definition.	2015-11-30 06:02:35 +00:00
Alexander V. Chernikov	e8b0643eee	Add new rt_foreach_fib_walk_del() function for deleting route entries by filter function instead of picking into routing table details in each consumer. Remove now-unused rt_expunge() (eliminating last external RTF_RNH_LOCKED user). This simplifies future nexthops/multipath changes and rtrequest1_fib() locking refactoring. Actual changes: Add "rt_chain" field to permit rte grouping while doing batched delete from routing table (thus growing rte 200->208 on amd64). Add "rti_filter" / "rti_filterdata" / "rti_spare" fields to rt_addrinfo to pass filter function to various routing subsystems in standard way. Convert all rt_expunge() customers to new rt_addrinfo-based api and eliminate rt_expunge().	2015-11-30 05:51:14 +00:00
Michael Tuexen	c6d2bd4812	Take also the send queue and sent queue into account when triggering the sending of outgoing stream reset requests. MFC after: 3 days	2015-11-27 22:11:46 +00:00
Michael Tuexen	f0067f2251	When the sending of an SCTP outgoing stream reset request fails, don't report it to the user since all streams have been marked as pending. MFC after: 1 week	2015-11-26 23:12:41 +00:00
Michael Tuexen	52f175be70	When receiving an SCTP/UDP packet and the interface performed the UDP checksum computation and signals that it was OK, clear this bit when passing the packet to SCTP. Since the bits indicating a valid UDP checksum and a valid SCTP checksum are the same, the SCTP stack would assume that also an SCTP checksum check has been performed. MFC after: 1 week	2015-11-26 09:25:20 +00:00
Fabien Thomas	edd0e0b098	The r241129 description was wrong that the scenario is possible only for read locks on pcbs. The same race can happen with write lock semantics as well. The race scenario: - Two threads (1 and 2) locate pcb with writer semantics (INPLOOKUP_WLOCKPCB) and do in_pcbref() on it. - 1 and 2 both drop the inp hash lock. - Another thread (3) grabs the inp hash lock. Then it runs in_pcbfree(), which wlocks the pcb. They must happen faster than 1 or 2 reach INP_WLOCK()! - 1 and 2 congest in INP_WLOCK(). - 3 does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which doesn't free the pcb due to two references on it. Then it unlocks the pcb. - 1 (or 2) gets wlock on the pcb, runs in_pcbrele_wlocked(), which doesn't report inp as freed, due to 2 (or 1) still holding extra reference on it. The thread tries to do something with a disconnected pcb and crashes. Submitted by: emeric.poupon@stormshield.eu Reviewed by: gleb@ MFC after: 1 week Sponsored by: Stormshield Tested by: Cassiano Peixoto, Stormshield	2015-11-25 14:45:43 +00:00
Andrey V. Elsukov	ef91a9765d	Overhaul if_enc(4) and make it loadable in run-time. Use hhook(9) framework to achieve ability of loading and unloading if_enc(4) kernel module. INET and INET6 code on initialization registers two helper hooks points in the kernel. if_enc(4) module uses these helper hook points and registers its hooks. IPSEC code uses these hhook points to call helper hooks implemented in if_enc(4).	2015-11-25 07:31:59 +00:00
Michael Tuexen	3bf2363dca	Fix the handling of IPSec policies in the SCTP stack. At least make sure they are not leaked... MFC after: 1 week	2015-11-21 18:21:16 +00:00
Michael Tuexen	e5d23883bf	Revert part of r291137 which seems correct, but does not fix the resource problem I'm currently hunting down. MFC after: 1 week X-MFC with: 291137	2015-11-21 16:46:59 +00:00
Michael Tuexen	722281c070	Clear the so_pcb pointer in case of ipsec_init_policy() fails. MFC after: 1 week	2015-11-21 16:32:14 +00:00
Michael Tuexen	8ca16419bb	Don't send SHUTDOWN chunk when the association is in a front state and the application calls shutdown(..., SHUT_WR) or shutdown(..., SHUT_RDWR). MFC after: 1 week.	2015-11-21 16:25:09 +00:00
Michael Tuexen	bd5b567213	Fix a bug where an SCTP association was moved back to SHUTDOWN_SENT state when the user issued a shutdown() call. MFC after: 1 week	2015-11-19 16:46:00 +00:00
Conrad Meyer	3fbd30d495	in_getmulti: Fix recursion on if_addr_lock on malloc failure When the M_NOWAIT allocation fails, we recurse the if_addr_lock trying to clean up. Reorder the cleanup after dropping the if_addr_lock. The obvious race is already possible between if_addmulti and IF_ADDR_WLOCK above, so it must be ok. Submitted by: Ryan Libby <rlibby@gmail.com> Reviewed by: jhb Found with: M_NOWAIT failure injection testing Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4138	2015-11-18 23:53:13 +00:00
Steven Hartland	9925ac11da	Revert r290403 CARP rework invalidated this change.	2015-11-13 23:14:39 +00:00
Randall Stewart	7c4676ddee	This fixes several places where callout_stop's return is examined. The new return codes of -1 were mistakenly being considered "true". Callout_stop now returns -1 to indicate the callout had either already completed or was not running and 0 to indicate it could not be stopped. Also update the manual page to make it more consistent no non-zero in the callout_stop or callout_reset descriptions. MFC after: 1 Month with associated callout change.	2015-11-13 22:51:35 +00:00
Alexander V. Chernikov	1c302b58da	Decompose arp_ifinit() into arp_add_ifa_lle() and arp_announce_ifaddr(). Rename arp_ifinit2() into arp_announce_ifaddr(). Eliminate zeroing ifa_rtrequest: it was used for calling arp_rtrequest() which was responsible for handling route cloning requests. It became obsolete since r186119 (L2/L3 split).	2015-11-09 10:35:33 +00:00
Alexander V. Chernikov	b13c5b5db2	Use lladdr_event to propagate gratuitous arp. Differential Revision: https://reviews.freebsd.org/D4019	2015-11-09 10:11:14 +00:00
Alexander V. Chernikov	ddd208f7ad	Unify setting lladdr for AF_INET[6].	2015-11-07 11:12:00 +00:00
Michael Tuexen	0bfc52bea5	Use the correct length. The wrong one was too large. MFC after: 3 days	2015-11-06 22:08:05 +00:00
Michael Tuexen	179f731bb0	The field sinfo_timetolive should have been sinfo_pr_value. Thanks to Jens Hoelscher for making me aware of the bug. MFC after: 1 week	2015-11-06 14:00:26 +00:00
Michael Tuexen	b70b526d17	Fix typos in field names of struct sctp_extrcvinfo. Provide defines to allow applications to compile. Thanks to Jens Hoelscher for making me aware of the typos. MFC after: 1 week	2015-11-06 13:08:16 +00:00
Steven Hartland	ac19560a34	Add MTU support to carp interfaces MFC after: 2 weeks Sponsored by: Multiplay	2015-11-05 17:23:02 +00:00
George V. Neville-Neil	33872124a5	Replace the fastforward path with tryforward which does not require a sysctl and will always be on. The former split between default and fast forwarding is removed by this commit while preserving the ability to use all network stack features. Differential Revision: https://reviews.freebsd.org/D4042 Reviewed by: ae, melifaro, olivier, rwatson MFC after: 1 month Sponsored by: Rubicon Communications (Netgate)	2015-11-05 07:26:32 +00:00
Hiren Panchasara	054d38e38c	Improve the sysctl node name. X-MFC with: r290122 Sponsored by: Limelight Networks	2015-11-05 02:09:48 +00:00
Andrey V. Elsukov	5dc5a0e0aa	Implement `ipfw internal olist` command to list named objects. Reviewed by: melifaro Obtained from: Yandex LLC Sponsored by: Yandex LLC	2015-11-03 10:21:53 +00:00
George V. Neville-Neil	02b90dbf45	Set the proper direction to check for policies in this one case. Pointed out by: eri Sponsored by: Rubicon Communications (Netgate)	2015-10-29 21:26:32 +00:00
Hiren Panchasara	12eeb81fc1	Calculate the correct amount of bytes that are in-flight for a connection as suggested by RFC 6675. Currently different places in the stack try to guess this in suboptimal ways. The main problem is that current calculations don't take sacked bytes into account. Sacked bytes are the bytes receiver acked via SACK option. This is suboptimal because it assumes that network has more outstanding (unacked) bytes than the actual value and thus sends less data by setting congestion window lower than what's possible which in turn may cause slower recovery from losses. As an example, one of the current calculations looks something like this: snd_nxt - snd_fack + sackhint.sack_bytes_rexmit New proposal from RFC 6675 is: snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit which takes sacked bytes into account which is a new addition to the sackhint struct. Only thing we are missing from RFC 6675 is isLost() i.e. segment being considered lost and thus adjusting pipe based on that which makes this calculation a bit on conservative side. The approach is very simple. We already process each ack with sack info in tcp_sack_doack() and extract sack blocks/holes out of it. We'd now also track this new variable sacked_bytes which keeps track of total sacked bytes reported. One downside to this approach is that we may get incorrect count of sacked_bytes if the other end decides to drop sack info in the ack because of memory pressure or some other reasons. But in this (not very likely) case also the pipe calculation would be conservative which is okay as opposed to being aggressive in sending packets into the network. Next step is to use this more accurate pipe estimation to drive congestion window adjustments. In collaboration with: rrs Reviewed by: jason_eggnet dot com, rrs MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D3971	2015-10-28 22:57:51 +00:00
Hiren Panchasara	356c7958a4	Add sysctl tunable net.inet.tcp.initcwnd_segments to specify initial congestion window in number of segments on fly. It is set to 10 segments by default. Remove net.inet.tcp.experimental.initcwnd10 which is now redundant. Also remove the parent node net.inet.tcp.experimental as it's not needed anymore and also because it was not well thought out. Differential Revision: https://reviews.freebsd.org/D3858 In collaboration with: lstewart Reviewed by: gnn (prev version), rwatson, allanjude, wblock (man page) MFC after: 2 weeks Relnotes: yes Sponsored by: Limelight Networks	2015-10-27 09:43:05 +00:00
George V. Neville-Neil	26882b4239	Turning on IPSEC used to introduce a slight amount of performance degradation (7%) for host-to-host TCP connections over 10Gbps links, even when there were no security policies in place. There is no change in performance on 1Gbps network links. Testing GENERIC vs. GENERIC-NOIPSEC vs. GENERIC with this change shows that the new code removes any overhead introduced by having IPSEC always in the kernel. Differential Revision: D3993 MFC after: 1 month Sponsored by: Rubicon Communications (Netgate)	2015-10-27 00:42:15 +00:00
Michael Tuexen	3db4ea954e	When processing a cookie, any mismatch in port numbers or the vtag results in failing the check. This fixes https://github.com/nplab/ETSI-SCTP-Conformance-Testsuite/blob/master/sctp-imh-tests/sctp-imh-i-3-3.pkt MFC after: 1 week	2015-10-26 21:19:49 +00:00
Michael Tuexen	6e9c45e0ee	Use __func__ instead of __FUNCTION__. This allows to compile the userland stack without errors using gcc5. Thanks to saghul for making me aware and providing the patch. MFC after: 1 week	2015-10-19 11:17:54 +00:00
Alexander V. Chernikov	26a6057525	Fix deletion of ifaddr lle entries when deleting prefix from interface in down state. Regression appeared in r287789, where the "prefix has no corresponding installed route" case was forgotten. Additionally, lltable_delete_addr() was called with incorrect byte order (default is network for lltable code). While here, improve comments on given cases and byte order. PR: 203573 Submitted by: phk	2015-10-18 12:26:25 +00:00
Alexander V. Chernikov	f221bcaa06	Remove several compat functions from pre-fib era.	2015-10-17 17:26:44 +00:00
Bjoern A. Zeeb	962d02b00b	Hopefully also unbreak VIMAGE kernels replacing the &V_... with &VNET_NAME(...). Everything else is just a whitespace wrapping change.	2015-10-15 01:44:32 +00:00
Bjoern A. Zeeb	f87ec781ef	Properly define functions without argument and wrap for { for style purposes as followed in the rest of the file. This will hopefully make gcc more happy.	2015-10-14 18:30:04 +00:00
Hiren Panchasara	adf43a9279	Fix an unnecessarily aggressive behavior where mtu clamping begins on first retransmission timeout (rto) when blackhole detection is enabled. Make sure it only happens when the second attempt to send the same segment also fails with rto. Also make sure that each mtu probing stage (usually 1448 -> 1188 -> 524) follows the same pattern and gets 2 chances (rto) before further clamping down. Note: RFC4821 doesn't specify implementation details on how this situation should be handled. Differential Revision: https://reviews.freebsd.org/D3434 Reviewed by: sbruno, gnn (previous version) MFC after: 2 weeks Sponsored by: Limelight Networks	2015-10-14 06:57:28 +00:00
Hiren Panchasara	86a996e6bd	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren	2015-10-14 00:35:37 +00:00
Michael Tuexen	9372530827	Fix the timeout for INIT retransmissions in the case where RTO_MIN is smaller than RTO_INITIAL. MFC after: 1 week	2015-10-13 18:27:55 +00:00
Gleb Smirnoff	89bc042679	Fix regression from r287779, that bit me. If we call m_pullup() unconditionally, we end up with an mbuf chain of two mbufs, which later in in_arpreply() is rewritten from ARP request to ARP reply and is sent out. Looks like igb(4) (at least mine, and at least at my network) fails on such mbuf chain, so ARP reply doesn't go out on the wire. Thus, make the m_pullup() call conditional, as it is everywhere. Of course, the bug in igb(?) should be investigated, but better first fix the head. And unconditional m_pullup() was suboptimal, anyway.	2015-10-07 13:10:26 +00:00
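
The duplicate-ACK cases enumerated in commit 021eaf7996 above can be read as a small decision function. The following is a minimal sketch of that reading, not the code from the commit: the function name `dupack_update`, the `*_SKETCH` macros, and the `sack_enabled` / `window_changed` parameters are hypothetical, and the branch for a non-SACK connection whose window changed is not covered by the commit message, so leaving the counter untouched there is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe sequence number comparisons (hypothetical helper names). */
#define SEQ_LT_SKETCH(a, b) ((int32_t)((a) - (b)) < 0)
#define SEQ_GT_SKETCH(a, b) ((int32_t)((a) - (b)) > 0)

/*
 * Return the new duplicate-ACK count after one incoming ACK.
 * sack_changed means the ACK carried previously unknown SACK information.
 */
int
dupack_update(int dupacks, uint32_t th_ack, uint32_t snd_una,
    bool sack_enabled, bool sack_changed, bool window_changed)
{
	/* Case 0: old ACK, ignore. */
	if (SEQ_LT_SKETCH(th_ack, snd_una))
		return (dupacks);

	/* Cases 4 and 5: left edge moved; count only if new SACK info. */
	if (SEQ_GT_SKETCH(th_ack, snd_una))
		return (sack_changed ? dupacks + 1 : 0);

	/* From here on th_ack == snd_una. */
	if (sack_enabled) {
		/* Case 3 (window may differ) counts; case 1 is ignored. */
		return (sack_changed ? dupacks + 1 : dupacks);
	}

	/* Case 2: plain dupack with an unchanged window. */
	if (!window_changed)
		return (dupacks + 1);

	/* Not listed in the commit message; assumption: leave unchanged. */
	return (dupacks);
}
```

Per the commit message, fast retransmit is triggered once the counter reaches the threshold of 3; that check would sit in the caller and is left out of this sketch.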


5356 Commits
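
The in-flight ("pipe") estimate quoted in commit 12eeb81fc1 above, `snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit`, can be written out as a small function. This is a sketch under stated assumptions rather than the kernel code: `struct sackhint_sketch` and `pipe_rfc6675` are hypothetical names that only mirror the fields named in the commit message, and no guard is included for the case the message warns about, where `sacked_bytes` is miscounted.

```c
#include <stdint.h>

/* Hypothetical stand-in for the SACK bookkeeping named in the commit message. */
struct sackhint_sketch {
	uint32_t sacked_bytes;      /* total bytes the receiver reported via SACK */
	uint32_t sack_bytes_rexmit; /* bytes retransmitted during loss recovery */
};

/*
 * Estimate bytes in flight: everything sent but not cumulatively acked,
 * minus what the receiver already holds (SACKed), plus retransmissions.
 * Unsigned 32-bit arithmetic keeps snd_max - snd_una correct when the
 * sequence space wraps.
 */
uint32_t
pipe_rfc6675(uint32_t snd_max, uint32_t snd_una,
    const struct sackhint_sketch *sh)
{
	return (snd_max - snd_una - sh->sacked_bytes + sh->sack_bytes_rexmit);
}
```

For example, with `snd_una = 1000`, `snd_max = 11000`, 3000 SACKed bytes and 1000 retransmitted bytes, the estimate is 10000 - 3000 + 1000 = 8000 bytes.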