NetBSD, ported to FreeBSD by Pierre Beyssac <pb@fasterix.freenix.org> and
minorly tweaked by me.
This is a standard part of FreeBSD, but must be enabled with:
"sysctl -w net.inet.ip.fastforwarding=1" ...and of course forwarding must
also be enabled. This should probably be modified to use the zone
allocator for speed and space efficiency. The current algorithm also
appears to lose if the number of active paths exceeds IPFLOW_MAX (256),
in which case it wastes lots of time trying to figure out which cache
entry to drop.
Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.
Convert PF_LOCAL PCB structures to be type-stable and add a version number.
Define an external format for infomation about socket structures and use
it in several places.
Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.
It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).
---------
Make callers of namei() responsible for releasing references or locks
instead of having the underlying filesystems do it. This eliminates
redundancy in all terminal filesystems and makes it possible for stacked
transport layers such as umapfs or nullfs to operate correctly.
Quality testing was done with testvn, and lat_fs from the lmbench suite.
Some NFS client testing courtesy of Patrik Kudo.
vop_mknod and vop_symlink still release the returned vpp. vop_rename
still releases 4 vnode arguments before it returns. These remaining cases
will be corrected in the next set of patches.
---------
Submitted by: Michael Hancock <michaelh@cet.co.jp>
is believed to have been broken with the Brakmo/Peterson srtt
calculation changes. The result of this bug is that TCP connections
could time out extremely quickly (in 12 seconds).
Also backed out jdp's partial fix for this problem in rev 1.17 of
tcp_timer.c as it is obsoleted by this commit.
Bug was pointed out by Kevin Lehey <kml@roller.nas.nasa.gov>.
PR: 6068
ftp://ftp.isi.edu/in-notes/iana/assignments/port-numbers
port numbers are divided into three ranges:
0 - 1023 Well Known Ports
1024 - 49151 Registered Ports
49152 - 65535 Dynamic and/or Private Ports
This patch changes the "local port range" from 40000-44999
to the range shown above (plus fix the comment in in_pcb.c).
WARNING: This may have an impact on firewall configurations!
PR: 5402
Reviewed by: phk
Submitted by: Stephen J. Roznowski <sjr@home.net>
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.
Most uses of time.tv_sec now uses the new variable time_second instead.
gettime() changed to getmicrotime(0.
Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).
A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.
Add a new nfs_curusec() function.
Mark a couple of bogosities involving the now disappeard time variable.
Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.
Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.
Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.
Reviewed by: bde
all the LKM load/unload junk, and don't forget to register the SYSINIT
so that the cdevsw entry is attached.
BTW: I think the way it builds it's /dev nodes on the fly as an LKM with
vnode ops is kinda cute - I guess that'd be one way to solve the devfs
persistance problems.. :-) (ie: have the drivers make the nodes in /dev
on disk directly if they are missing, but leave them alone if present).
its own zone; this is used particularly by TCP which allocates both inpcb and
tcpcb in a single allocation. (Some hackery ensures that the tcpcb is
reasonably aligned.) Also keep track of the number of pcbs of each type
allocated, and keep a generation count (instance version number) for future
use.
connect. This check was added as part of the defense against the "land"
attack, to prevent attacks which guess the ISS from going into ESTABLISHED.
The "src == dst" check will still prevent the single-homed case of the
"land" attack, and guessing ISS's should be hard anyway.
Submitted by: David Borman <dab@bsdi.com>
since there might be permanent entries still left after
calls to DeleteLink (it will be nullified by DeleteLink
if all entries are deleted, won't it ?)
2) in PacketAliasSetAddress, set the aliasing address
even when PKT_ALIAS_RESET_ON_ADDR_CHANGE is in effect.
Just don't clean up links in this case.
Submitted by: Ari Suutari <ari@suutari.iki.fi>
via: Charles Mott <cmott@srv.net>
PR: 5041
It controls if the system is to accept source routed packets.
It used to be such that, no matter if the setting of net.inet.ip.sourceroute,
source routed packets destined at us would be accepted. Now it is
controllable with eth default set to NOT accept those.
offset is non-zero:
- Do not match fragmented packets if the rule specifies a port or
TCP flags
- Match fragmented packets if the rule does not specify a port and
TCP flags
Since ipfw cannot examine port numbers or TCP flags for such packets,
it is now illegal to specify the 'frag' option with either ports or
tcpflags. Both kernel and ipfw userland utility will reject rules
containing a combination of these options.
BEWARE: packets that were previously passed may now be rejected, and
vice versa.
Reviewed by: Archie Cobbs <archie@whistle.com>
This means that a FreeBSD will only forward source routed packets
when both net.inet.ip.forwarding and net.inet.ip.sourceroute are set
to 1.
You can hit me now ;-)
Submitted by: Thomas Ptacek
TCP and UDP port numbers in fragmented packets when IP offset != 0.
2.2.6 candidate.
Discovered by: Marc Slemko <marcs@znep.com>
Submitted by: Archie Cobbs <archie@whistle.com> w/fix from me
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
rev 1.66. This fix contains both belt and suspenders.
Belt: ignore packets where src == dst and srcport == dstport in TCPS_LISTEN.
These packets can only legitimately occur when connecting a socket to itself,
which doesn't go through TCPS_LISTEN (it goes CLOSED->SYN_SENT->SYN_RCVD->
ESTABLISHED). This prevents the "standard" "land" attack, although doesn't
prevent the multi-homed variation.
Suspenders: send a RST in response to a SYN/ACK in SYN_RECEIVED state.
The only packets we should get in SYN_RECEIVED are
1. A retransmitted SYN, or
2. An ack of our SYN/ACK.
The "land" attack depends on us accepting our own SYN/ACK as an ACK;
in SYN_RECEIVED state; this should prevent all "land" attacks.
We also move up the sequence number check for the ACK in SYN_RECEIVED.
This neither helps nor hurts with respect to the "land" attack, but
puts more of the validation checking in one spot.
PR: kern/5103
This will not make any of object files that LINT create change; there
might be differences with INET disabled, but hardly anything compiled
before without INET anyway. Now the 'obvious' things will give a
proper error if compiled without inet - ipx_ip, ipfw, tcp_debug. The
only thing that _should_ work (but can't be made to compile reasonably
easily) is sppp :-(
This commit move struct arpcom from <netinet/if_ether.h> to
<net/if_arp.h>.
consequence, ipfw's list command now adjusts its output at runtime
based on the largest packet/byte counter values.
NOTE:
o The ipfw struct has changed requiring a recompile of both kernel
and userland ipfw utility.
o This probably should not be brought into 2.2.
PR: 3738
fix PR#3618 weren't sufficient since malloc() can block - allowing the
net interrupts in and leading to the same problem mentioned in the
PR (a panic). The order of operations has been changed so that this
is no longer a problem.
Needs to be brought into the 2.2.x branch.
PR: 3618
where if you are using the "reset tcp" firewall command,
the kernel would write ethernet headers onto random kernel stack locations.
Fought to the death by: terry, julian, archie.
fix valid for 2.2 series as well.
The #ifdef IPXIP in netipx/ipx_if.h is OK (used from ipx_usrreq.c and
ifconfig.c only).
I also fixed a typo IPXTUNNEL -> IPTUNNEL (and #ifdef'ed out the code
inside, as it never could have compiled - doh.)
close small security hole where an atacker could sendpackets with
IPDIVERT protocol, and select how it would be diverted thus bypassing
the ipfirewall. Discovered by inspection rather than attack.
(you'd have to know how the firewall was configured (EXACTLY) to
make use of this but..)
hope i've found out all files that actually depend on this dependancy.
IMHO, it's not very good practice to change the size of internal
structs depending on kernel options.
Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.
A couple of finer points by: bde
represent in the TCP header. The old code did effectively:
win = min(win, MAX_ALLOWED);
win = max(win, what_i_think_i_advertised_last_time);
so if what_i_think_i_advertised_last_time is bigger than can be
represented in the header (e.g. large buffers and no window scaling)
then we stuff a too-big number into a short. This fix reverses the
order of the comparisons.
PR: kern/4712
RST's being ignored, keeping a connection around until it times out, and
thus has the opposite effect of what was intended (which is to make the
system more robust to DoS attacks).
you don't want this (and the documentation explains why), but if you
use ipfw as an as-needed casual filter as needed which normally runs as
'allow all' then having the kernel and /sbin/ipfw get out of sync is a
*MAJOR* pain in the behind.
PR: 4141
Submitted by: Heikki Suonsivu <hsu@mail.clinet.fi>
potential problems with other automatic-reply ICMPs, but some of them may
depend on broadcast/multicast to operate. (This code can simply be
moved to the `reflect' label to generalize it.)
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.
accommodate the expanded name, the ICMP types bitmap has been
reduced from 256 bits to 32.
A recompile of kernel and user level ipfw is required.
To be merged into 2.2 after a brief period in -current.
PR: bin/4209
Reviewed by: Archie Cobbs <archie@whistle.com>
be dropped when it has an unusual traffic pattern. For full details
as well as a test case that demonstrates the failure, see the
referenced PR.
Under certain circumstances involving the persist state, it is
possible for the receive side's tp->rcv_nxt to advance beyond its
tp->rcv_adv. This causes (tp->rcv_adv - tp->rcv_nxt) to become
negative. However, in the code affected by this fix, that difference
was interpreted as an unsigned number by max(). Since it was
negative, it was taken as a huge unsigned number. The effect was
to cause the receiver to believe that its receive window had negative
size, thereby rejecting all received segments including ACKs. As
the test case shows, this led to fruitless retransmissions and
eventually to a dropped connection. Even connections using the
loopback interface could be dropped. The fix substitutes the signed
imax() for the unsigned max() function.
PR: closes kern/3998
Reviewed by: davidg, fenner, wollman
these are quite extensive additions to the ipfw code.
they include a change to the API because the old method was
broken, but the user view is kept the same.
The new code allows a particular match to skip forward to a particular
line number, so that blocks of rules can be
used without checking all the intervening rules.
There are also many more ways of rejecting
connections especially TCP related, and
many many more ...
see the man page for a complete description.
switch. I needed 'LINT' to compile for other reasons so I kinda got the
blood on my hands. Note: I don't know how to test this, I don't know if
it works correctly.
Don't search for interface addresses matching interface "NULL"
it's likely to cause a page fault..
this can be triggered by the ipfw code rejecting a locally generated
packet (e.g. you decide to make some network unreachable by local users)
ppp (or will be shortly). Natd can now be updated to use
this library rather than carrying its own version of the code.
Submitted by: Charles Mott <cmott@srv.net>
in_setsockaddr and in_setpeeraddr.
Handle the case where the socket was disconnected before the network
interrupts were disabled.
Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
to fill in the nfs_diskless structure, at the cost of some kernel
bloat. The advantage is that this code works on a wider range of
network adapters than netboot. Several new kernel options are
documented in LINT.
Obtained from: parts of the code comes from NetBSD.
This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.
2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.
3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.
4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.
5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.
As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.
Use the name argument almost the same in all LKM types. Maintain
the current behavior for the external (e.g., modstat) name for DEV,
EXEC, and MISC types being #name ## "_mod" and SYCALL and VFS only
#name. This is a candidate for change and I vote just the name without
the "_mod".
Change the DISPATCH macro to MOD_DISPATCH for consistency with the
other macros.
Add an LKM_ANON #define to eliminate the magic -1 and associated
signed/unsigned warnings.
Add MOD_PRIVATE to support wcd.c's poking around in the lkm structure.
Change source in tree to use the new interface.
Reviewed by: Bruce Evans
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).
Discussed-with: wollman
the quality of the hash distribution. This does not fix a problem dealing
with poor distribution when using lots of IP aliases and listening
on the same port on every one of them...some other day perhaps; fixing
that requires significant code changes.
The use of xor was inspired by David S. Miller <davem@jenolan.rutgers.edu>
connect in TCP while sending urgent data. It is not clear what
purpose is served by doing this, but there's no good reason why it
shouldn't work.
Submitted by: tjevans@raleigh.ibm.com via wpaul
pr_usrreqs. Collapse duplicates with udp_usrreq.c and
tcp_usrreq.c (calling the generic routines in uipc_socket2.c and
in_pcb.c). Calling sockaddr()_ or peeraddr() on a detached
socket now traps, rather than harmlessly returning an error; this
should never happen. Allow the raw IP buffer sizes to be
controlled via sysctl.
in the route. This allows us to remove the unconditional setting of the
pipesize in the route, which should mean that SO_SNDBUF and SO_RCVBUF
should actually work again. While we're at it:
- Convert udp_usrreq from `mondo switch statement from Hell' to new-style.
- Delete old TCP mondo switch statement from Hell, which had previously
been diked out.
is administratively downed, all routes to that interface (including the
interface route itself) which are not static will be deleted. When
it comes back up, and addresses remaining will have their interface routes
re-added. This solves the problem where, for example, an Ethernet interface
is downed by traffic continues to flow by way of ARP entries.
affect programs that sit on top of divert(4) sockets. The
multicast routing code already unconditionally zeros the sum
before recalculating.
Any code that unconditionaly sums a packet without first zeroing
the sum (assuming that it's already zero'd) will break. No such
code seems to exist.
set it in the first place, independent of whether sin->sin_port
is set.
The result is that diverted packets that are being forwarded
will be diverted once and only once on the way in (ip_input())
and again, once and only once on the way out (ip_output()) -
twice in total. ICMP packets that don't contain a port will
now also be diverted.
to -current.
Thanks goes to Ulrike Nitzsche <ulrike@ifw-dresden.de> for giving me
a chance to test this. Only the PCI driver is tested though.
One final patch will follow in a separate commit. This is so that
everything up to here can be dragged into 2.2, if we decide so.
Reviewed by: joerg
Submitted by: Matt Thomas <matt@3am-software.com>
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
previous hackery involving struct in_ifaddr and arpcom. Get rid of the
abominable multi_kludge. Update all network interfaces to use the
new machanism. Distressingly few Ethernet drivers program the multicast
filter properly (assuming the hardware has one, which it usually does).
to TAILQs. Fix places which referenced these for no good reason
that I can see (the references remain, but were fixed to compile
again; they are still questionable).
The rest of the code was treating it as a header mbuf, but it was
allocated as a normal mbuf.
This fixes the panic: ip_output no HDR when you have a multicast
tunnel configured.
duplicate ip address 204.162.228.7! sent from ethernet address: 08:00:20:09:7b:1d
changed to
arp: 08:00:20:09:7b:1d is using my IP address 204.162.228.7!
and
arp info overwritten for 204.162.228.2 by 08:00:20:09:7b:1d
changed to
arp: 204.162.228.2 moved from 08:00:20:07:b6:a0 to 08:00:20:09:7b:1d
I think the new wordings are more clear and could save some support
questions.
using a sockaddr_dl.
Fix the other packet-information socket options (SO_TIMESTAMP, IP_RECVDSTADDR)
to work for multicast UDP and raw sockets as well. (They previously only
worked for unicast UDP).
"high" and "secure"), we can't use a single variable to track the most
recently used port in all three ranges.. :-] This caused the next
transient port to be allocated from the start of the range more often than
it should.
<net/if_arp.h> and fixed the things that depended on it. The nested
include just allowed unportable programs to compile and made my
simple #include checking program report that networking code doesn't
need to include <sys/socket.h>.
(yes I had tested the hell out of this).
I've also temporarily disabled the code so that it behaves as it previously
did (tail drop's the syns) pending discussion with fenner about some socket
state flags that I don't fully understand.
Submitted by: fenner
callers of it to take advantage of this. This reduces new connection
request overhead in the face of a large number of PCBs in the system.
Thanks to David Filo <filo@yahoo.com> for suggesting this and providing
a sample implementation (which wasn't used, but showed that it could be
done).
Reviewed by: wollman
drop the oldest entry in the queue.
There was a fair bit of discussion as to whether or not the
proper action is to drop a random entry in the queue. It's
my conclusion that a random drop is better than a head drop,
however profiling this section of code (done by John Capo)
shows that a head-drop results in a significant performance
increase.
There are scenarios where a random drop is more appropriate.
If I find one in reality, I'll add the random drop code under
a conditional.
Obtained from: discussions and code done by Vernon Schryver (vjs@sgi.com).
time, in seconds, that state for non-established TCP sessions stays about)
a sysctl modifyable variable.
[part 1 of two commits, I just realized I can't play with the indices as
I was typing this commit message.]
to "keepidle". this should not occur unless the connection has
been established via the 3-way handshake which requires an ACK
Submitted by: jmb
Obtained from: problem discussed in Stevens vol. 3
IPPORT_RESERVED that is used for selection when bind() is told to allocate
a reserved port.
Also, implement simple sanity checking for all the addresses set, to make
it a little harder for a user/sysadmin to shoot themselves in the feet.
pulled up already. This bug can cause the first packet from a source
to a group to be corrupted when it is delivered to a process listening
on the mrouter.
pr_usrreq mechanism which was poorly designed and error-prone. This
commit renames pr_usrreq to pr_ousrreq so that old code which depended on it
would break in an obvious manner. This commit also implements the new
interface for TCP, although the old function is left as an example
(#ifdef'ed out). This commit ALSO fixes a longstanding bug in the
TCP timer processing (introduced by davidg on 1995/04/12) which caused
timer processing on a TCB to always stop after a single timer had
expired (because it misinterpreted the return value from tcp_usrreq()
to indicate that the TCB had been deleted). Finally, some code
related to polling has been deleted from if.c because it is not
relevant t -current and doesn't look at all like my current code.
This stuff should not be too destructive if the IPDIVERT is not compiled in..
be aware that this changes the size of the ip_fw struct
so ipfw needs to be recompiled to use it.. more changes coming to clean this up.
- State when we've reached the limit on a particular rule in the kernel logfile
- State when a rule or all rules have been zero'd.
This gives a log of all actions that occur w/regard to the firewall
occurances, and can explain why a particular break-in attempt might not
get logged due to the limit being reached.
Reviewed by: alex
Reviewed by: phk
Reject the addition of rules that will never match (for example,
1.2.3.4:255.255.255.0). User level utilities specify the policy by either
masking the IP address for the user (as ipfw(8) does) or rejecting the
entry with an error. In either case, the kernel should not modify chain
entries to make them work.
LKM'ness. ACTUALLY_LKM_NOT_KERNEL is supposed to be so ugly that it
only gets used until <machine/conf.h> goes away. bsd.kmod.mk should
define a better-named general macro for this. Some places use
PSEUDO_LKM. This is another bad name.
Makefile:
Added IPFIREWALL_VERBOSE_LIMIT option (commented out).
broadcast and multicast routes, otherwise they will be expired by
arptimeout after a few minutes, reverting to " (incomplete)". This makes
the work done by rev 1.27 stay around until the route itself is deleted.
This is mainly cosmetic for 'arp' and 'netstat -r'.
for ARP requests.
The NetBSD version of this patch (see NetBSD PR kern/2381) has this change
already. This should close our PR kern/1140 .
Although it's not quite what he submitted, I got the idea from him so
Submitted by: Jin Guojun <jin@george.lbl.gov>
/kernel: in_rtqtimo: adjusted rtq_reallyold to 1066
/kernel: in_rtqtimo: adjusted rtq_reallyold to 710
inside of #ifdef DIAGNOSTIC to avoid the support questions from folks
asking what this means.
- Log ICMP type during verbose output.
- Added IPFIREWALL_VERBOSE_LIMIT option to prevent denial of service
attacks via syslog flooding.
- Filter based on ICMP type.
- Timestamp chain entries when they are matched.
- Interfaces can now be matched with a wildcard specification (i.e.
will match any interface unit for a given name).
- Prevent the firewall chain from being manipulated when securelevel
is greater than 2.
- Fixed bug that allowed the default policy to be deleted.
- Ability to zero individual accounting entries.
- Remove definitions of old_chk_ptr and old_ctl_ptr when compiling
ipfw as a lkm.
- Remove some redundant code shared between ip_fw_init and ipfw_load.
Closes PRs: 1192, 1219, and 1267.
gcc only inlines memcpy()'s whose count is constant and didn't inline
these. I want memcpy() in the kernel go away so that it's obvious that
it doesn't need to be optimized. Now it is only used for one struct
copy in si.c.