1
0
mirror of https://git.FreeBSD.org/src.git synced 2025-01-02 12:20:51 +00:00
Commit Graph

243 Commits

Author SHA1 Message Date
Andre Oppermann
62b36a7fc2 Style cleanups to the sctp_* syscall functions. 2006-11-07 21:28:12 +00:00
Andre Oppermann
bda8b1f3b8 Handle early errors in kern_sendfile() by introducing a new goto 'out'
label after the sbunlock() part.

This correctly handles calls to sendfile(2) without valid parameters
that was broken in rev. 1.240.

Coverity error:	272162
2006-11-06 21:53:19 +00:00
Randall Stewart
f8829a4a40 Ok, here it is, we finally add SCTP to current. Note that this
work is not just mine, but it is also the works of Peter Lei
and Michael Tuexen. They both are my two key other developers
working on the project.. and they need ata-boy's too:
****
peterlei@cisco.com
tuexen@fh-muenster.de
****
I did do a make sysent which updated the
syscall's and sysproto.. I hope that is correct... without
it you don't build since we have new syscalls for SCTP :-0

So go out and look at the NOTES, add
option SCTP (make sure inet and inet6 are present too)
and play with SCTP.

I will see about comitting some test tools I have after I
figure out where I should place them. I also have a
lib (libsctp.a) that adds some of the missing socketapi
functions that I need to put into lib's.. I will talk
to George about this :-)

There may still be some 64 bit issues in here, none of
us have a 64 bit processor to test with yet.. Michael
may have a MAC but thats another beast too..

If you have a mac and want to use SCTP contact Michael
he maintains a web site with a loadable module with
this code :-)

Reviewed by:	gnn
Approved by:	gnn
2006-11-03 15:23:16 +00:00
Andre Oppermann
5e20f43d31 Rename m_getm() to m_getm2() and rewrite it to allocate up to page sized
mbuf clusters.  Add a flags parameter to accept M_PKTHDR and M_EOR mbuf
chain flags.  Provide compatibility macro for m_getm() calling m_getm2()
with M_PKTHDR set.

Rewrite m_uiotombuf() to use m_getm2() for mbuf allocation and do the
uiomove() in a tight loop over the mbuf chain.  Add a flags parameter to
accept mbuf flags to be passed to m_getm2().  Adjust all callers for the
extra parameter.

Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 month
2006-11-02 17:37:22 +00:00
Andre Oppermann
d99b0dd2c5 Rewrite kern_sendfile() to work in two loops, the inner which turns as many
VM pages into mbufs as it can -- up to the free send socket buffer space.
The outer loop then drops the whole mbuf chain into the send socket buffer,
calls tcp_output() on it and then waits until 50% of the socket buffer are
free again to repeat the cycle. This way tcp_output() gets the full amount
of data to work with and can issue up to 64K sends for TSO to chop up in
the network adapter without using any CPU cycles. Thus it gets very efficient
especially with the readahead the VM and I/O system do.

The previous sendfile(2) code simply looped over the file, turned each 4K
page into an mbuf and sent it off. This had the effect that TSO could only
generate 2 packets per send instead of up to 44 at its maximum of 64K.

Add experimental SF_MNOWAIT flag to sendfile(2) to return ENOMEM instead of
sleeping on mbuf allocation failures.

Benchmarking shows significant improvements (95% confidence):
 45% less cpu (or 1.81 times better) with new sendfile vs. old sendfile (non-TSO)
 83% less cpu (or 5.7 times better) with new sendfile vs. old sendfile (TSO)

(Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver
DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back
at 1000Base-TX full duplex.)

Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 month
2006-11-02 16:53:26 +00:00
Robert Watson
aed5570872 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
Alan Cox
9af80719db Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.
2006-10-22 04:28:14 +00:00
Alan Cox
5786be7cc7 Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags.  Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().
2006-08-09 17:43:27 +00:00
Alan Cox
7c4b7ecc4c Reduce the scope of the page queues lock in kern_sendfile() now that
vm_page_sleep_if_busy() no longer requires the caller to hold the page
queues lock.
2006-08-06 01:00:09 +00:00
Alan Cox
10c09f3f61 The page queues lock is no longer required by vm_page_io_start(). Reduce
the scope of the page queues lock in kern_sendfile() accordingly.
2006-08-04 05:53:20 +00:00
John Baldwin
f30e89ced3 Fix a file descriptor race I reintroduced when I split accept1() up into
kern_accept() and accept1().  If another thread closed the new file
descriptor and the first thread later got an error trying to copyout the
socket address, then it would attempt to close the wrong file object.  To
fix, add a struct file ** argument to kern_accept().  If it is non-NULL,
then on success kern_accept() will store a pointer to the new file object
there and not release any of the references.  It is up to the calling code
to drop the references appropriately (including a call to fdclose() in case
of error to safely handle the aforementioned race).  While I'm at it, go
ahead and fix the svr4 streams code to not leak the accept fd if it gets an
error trying to copyout the streams structures.
2006-07-27 19:54:41 +00:00
Robert Watson
b0668f7151 soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod:	sam, gnn, wollman
2006-07-24 15:20:08 +00:00
John Baldwin
b33887ea31 Don't free the sockaddr in kern_bind() and kern_connect() as not all
callers pass a sockaddr allocated via malloc() from M_SONAME anymore.
Instead, free it in the callers when necessary.
2006-07-19 18:28:52 +00:00
John Baldwin
c870740e09 - Split out kern_accept(), kern_getpeername(), and kern_getsockname() for
use by ABI emulators.
- Alter the interface of kern_recvit() somewhat.  Specifically, go ahead
  and hard code UIO_USERSPACE in the uio as that's what all the callers
  specify.  In place, add a new uioseg to indicate what type of pointer
  is in mp->msg_name.  Previously it was always a userland address, but
  ABI emulators may pass in kernel-side sockaddrs.  Also, remove the
  namelenp field and instead require the two places that used it to
  explicitly copy mp->msg_namelen out to userland.
- Use the patched kern_recvit() to replace svr4_recvit() and the stock
  kern_sendit() to replace svr4_sendit().
- Use kern_bind() instead of stackgap use in ti_bind().
- Use kern_getpeername() and kern_getsockname() instead of stackgap in
  svr4_stream_ti_ioctl().
- Use kern_connect() instead of stackgap in svr4_do_putmsg().
- Use kern_getpeername() and kern_accept() instead of stackgap in
  svr4_do_getmsg().
- Retire the stackgap from SVR4 compat as it is no longer used.
2006-07-10 21:38:17 +00:00
George V. Neville-Neil
fb11be62a2 Properly cast the values of valsize (the size of the value passed in)
in setsockopt so that they can be compared correctly against negative
values.  Passing in a negative value had a rather negative effect
on our socket code, making it impossible to open new sockets.

PR:		98858
Submitted by:	James.Juran@baesystems.com
MFC after:	1 week
2006-06-20 12:36:40 +00:00
Robert Watson
b37ffd3189 Move some functions and definitions from uipc_socket2.c to uipc_socket.c:
- Move sonewconn(), which creates new sockets for incoming connections on
  listen sockets, so that all socket allocate code is together in
  uipc_socket.c.

- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
  socket allocation code.

- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
  to sysctl.h and remove lots of scattered implementations in various
  IPC modules.

- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
  reasons.  Statisticize soalloc() and sodealloc() as they are now
  required only in uipc_socket.c, and are internal to the socket
  implementation.

After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.

MFC after:	1 month
2006-06-10 14:34:07 +00:00
Robert Watson
20bdac8a4f Use getsock() and fput() instead of fgetsock() and fputsock() in
sendfile().  This causes sendfile() to use the file descriptor
reference to the socket instead of bumping the socket reference
count, which avoids an additional refcount operation, as well as a
potential expensive socket refcount drop, which can lead to
contention on the accept mutex.  This change also has the side
effect of further reducing the number of cases where an in-progress
I/O operation can occur on a socket after close, as using the file
descriptor refcount prevents the socket from closing while in use.

MFC after:	3 months
2006-05-25 15:10:13 +00:00
Robert Watson
102ea03373 Extend getsock() to return the struct file flags read while holding the
file lock, in the style of fgetsock().

Modify accept1() to use getsock() instead of fgetsock(), relying on the
file descriptor reference rather than an acquired socket reference to
prevent the listen socket from being destroyed during accept().  This
avoids additional reference count operations, which should improve
performance, and also avoids accept1() operating on a socket whose file
descriptor has been torn down, which may have resulted in protocol
shutdown starting.

MFC after:	3 months
2006-04-25 11:48:16 +00:00
Robert Watson
fa4c5373ce Add comment to accept1() that it should use getsock() instead of fgetsock()
to avoid additional mutex operations, and also to avoid use of soref/sorele
which are now not preferred.

MFC after:	3 months
2006-04-01 11:14:56 +00:00
Alan Cox
7c8dcf2def Use NET_LOCK_GIANT() and VFS_LOCK_GIANT() instead of unconditionally
acquiring Giant in kern_sendfile().

Guard against the forced reclamation of a vnode in kern_sendfile().

Discussed with: jeff
Reviewed by: tegge
MFC after: 3 weeks
2006-03-27 04:23:16 +00:00
Paul Saab
fa545f434c Fix 32bit sendfile by implementing kern_sendfile so that it takes
the header and trailers as iovec arguments instead of copying them
in inside of sendfile.

Reviewed by:	jhb
MFC after:	3 weeks
2006-02-28 19:39:18 +00:00
Paul Saab
ecc44de7a2 Reformat socket control messages on input/output for 32bit compatibility
on 64bit systems.

Submitted by:	ps, ups
Reviewed by:	jhb
2005-10-31 21:09:56 +00:00
Paul Saab
a372f8224c Implement the 32bit versions of recvmsg, recvfrom, sendmsg
Partially obtained from:	jhb
2005-10-15 05:57:06 +00:00
Robert Watson
6758f88ea4 Add MAC Framework and MAC policy entry point mac_check_socket_create(),
which is invoked from socket() and socketpair(), permitting MAC
policy modules to control the creation of sockets by domain, type, and
protocol.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA, SPAWAR
Approved by:	re (scottl)
Requested by:	SCC
2005-07-05 22:49:10 +00:00
Maksim Yevmenkin
75ae257016 Change m_uiotombuf so it will accept offset at which data should be copied
to the mbuf. Offset cannot exceed MHLEN bytes. This is currently used to
fix Ethernet header alignment problem on alpha and sparc64. Also change all
users of m_uiotombuf to pass proper offset.

Reviewed by:	jmg, sam
Tested by:	Sten Spans "sten AT blinkenlights DOT nl"
MFC after:	1 week
2005-05-04 18:55:03 +00:00
Robert Watson
7f53207b92 Introduce three additional MAC Framework and MAC Policy entry points to
control socket poll() (select()), fstat(), and accept() operations,
required for some policies:

        poll()          mac_check_socket_poll()
        fstat()         mac_check_socket_stat()
        accept()        mac_check_socket_accept()

Update mac_stub and mac_test policies to be aware of these entry points.
While here, add missing entry point implementations for:

        mac_stub.c      stub_check_socket_receive()
        mac_stub.c      stub_check_socket_send()
        mac_test.c      mac_test_check_socket_send()
        mac_test.c      mac_test_check_socket_visible()

Obtained from:	TrustedBSD Project
Sponsored by:	SPAWAR, SPARTA
2005-04-16 18:46:29 +00:00
Jeff Roberson
f247a5240d - LK_NOPAUSE is a nop now.
Sponsored by:   Isilon Systems, Inc.
2005-03-31 04:37:09 +00:00
Maxim Sobolev
8d6e40c3f1 Add kernel-only flag MSG_NOSIGNAL to be used in emulation layers to surpress
SIGPIPE signal for the duration of the sento-family syscalls. Use it to
replace previously added hack in Linux layer based on temporarily setting
SO_NOSIGPIPE flag.

Suggested by:	alfred
2005-03-08 16:11:41 +00:00
Robert Watson
29bdd01910 Remove now unused 'int s' from spl().
MFC after:	3 days
2005-02-18 21:39:55 +00:00
Robert Watson
d8d716bef5 De-spl kern_connect().
MFC after:	3 days
2005-02-18 19:37:36 +00:00
Robert Watson
1e8f89541e In accept1(), extend coverage of the socket lock from just covering
soref() to also covering the update of so_state.  While no other user
threads can update the socket state here as it's not yet hooked up to
the file descriptor array yet, the protocol could also frob the
socket state here, leading to a lost update to the so_state field.
No reported instances of this bug (as yet).

MFC after:      3 days
2005-02-17 13:00:23 +00:00
Maxim Sobolev
a6886ef173 Extend kern_sendit() to take another enum uio_seg argument, which specifies
where the buffer to send lies and use it to eliminate yet another stackgap
in linuxlator.

MFC after:	2 weeks
2005-01-30 07:20:36 +00:00
Poul-Henning Kamp
8516dd18e1 Don't use VOP_GETVOBJECT, use vp->v_object directly. 2005-01-25 00:40:01 +00:00
Poul-Henning Kamp
f6dc414a5c Save a line by unlocking before we test. 2005-01-24 14:13:24 +00:00
Warner Losh
9454b2d864 /* -> /*- for copyright notices, minor format tweaks as necessary 2005-01-06 23:35:40 +00:00
Poul-Henning Kamp
124e4c3be8 Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST.
Use this in all the places where sleeping with the lock held is not
an issue.

The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.
2004-11-13 11:53:02 +00:00
Alan Cox
d3cb0d99e0 Introduce two new options, "CPU private" and "no wait", to sf_buf_alloc().
Change the spelling of the "catch" option to be consistent with the new
options.  Implement the "no wait" option.  An implementation of the "CPU
private" for i386 will be committed at a later date.
2004-11-08 00:43:46 +00:00
Poul-Henning Kamp
ef11fbd7c4 Introduce fdclose() which will clean an entry in a filedesc.
Replace homerolled versions with call to fdclose().

Make fdunused() static to kern_descrip.c
2004-11-07 22:16:07 +00:00
Poul-Henning Kamp
b1fa752732 Use fget_locked() instead of homerolled 2004-11-07 16:09:56 +00:00
Alan Cox
d19ef81437 The synchronization provided by vm object locking has eliminated the
need for most calls to vm_page_busy().  Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.

This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set.  Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition.  Typically,
this is when it is undergoing I/O.
2004-11-03 20:17:31 +00:00
Robert Watson
ae1e5c5d8d Move from using the socket reference count to the file reference
count to prevent sockets from being garbage collected during
socket-specific system calls.  This is the same approach used in
most VFS-specific system calls, as well as generic file descriptor
system calls such as read() and write().

To do this, add a utility function getsock(), which is logically
identical to getvnode() used for the same purpose in VFS.  Unlike
fgetsock(), it returns with the file reference count elevated, but
no bump of the socket reference count.  Replace matching calls to
fputsock() with fdrop().

This change is made to all socket system calls other than
sendfile() and accept(), but the approach should be applicable to
those system calls also.

This shaves about four mutex operations off of each of these
system calls, including send() and recv() variants, adding about
1% to pps on minimal UDP packets for UP using netblast, and 4% on
SMP.

Reviewed by:	pjd
2004-10-24 23:45:01 +00:00
Alan Cox
01ad40dac5 Use VM_ALLOC_NOBUSY instead of calling vm_page_wakeup(). 2004-10-24 20:09:59 +00:00
Alan Cox
0f777d7d9b Modify the vm object locking in do_sendfile() so that the containing object
is locked when vm_page_io_finish() is called on a page.  This is to satisfy
a new, post-RELENG_5 assertion in vm_page_io_finish().  (I am in the
process of transitioning the responsibility for synchronizing access to
various fields/flags on the page from the global page queues lock to the
per-object lock.)

Tripped over by: obrien@
2004-10-20 17:44:40 +00:00
Alan Cox
86dac448f2 Add a SOCKBUF_LOCK() to a rarely executed path in do_sendfile(). 2004-10-02 05:37:47 +00:00
John-Mark Gurney
ad3b9257c2 Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers.  Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks.  Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by:	green, rwatson (both earlier versions)
2004-08-15 06:24:42 +00:00
David Malone
e140eb430c Add a kern_setsockopt and kern_getsockopt which can read the option
values from either user land or from the kernel. Use them for
[gs]etsockopt and to clean up some calls to [gs]etsockopt in the
Linux emulation code that uses the stackgap.
2004-07-17 21:06:36 +00:00
Poul-Henning Kamp
552afd9c12 Clean up and wash struct iovec and struct uio handling.
Add copyiniov() which copies a struct iovec array in from userland into
a malloc'ed struct iovec.  Caller frees.

Change uiofromiov() to malloc the uio (caller frees) and name it
copyinuio() which is more appropriate.

Add cloneuio() which returns a malloc'ed copy.  Caller frees.

Use them throughout.
2004-07-10 15:42:16 +00:00
Robert Watson
6ec70e64c6 Remove spl()'s from do_sendfile(). 2004-07-09 01:46:03 +00:00
Robert Watson
ad6b0efff5 Acquire socket lock in the "waiting for connection" loop in
kern_connect(), replacing tsleep() with msleep() with the socket
mutex.
2004-06-24 01:43:23 +00:00
Bruce M Simpson
a3146ff925 Fix an inconsistency in socket option propagation on accept(). Propagate
the SS_NBIO flag from the parent socket to the child socket during an
accept() operation.

The file descriptor O_NONBLOCK flag would have been propagated already
by the fflag assignment, and therefore would have been inconsistent
with the underlying socket's so_state member.

This makes accept() more closely adhere to the API contract we effectively
outline in the manual page. Note also that Linux continues to differ here;
O_NONBLOCK is not propagated. The other BSDs do propagate the flag, as
does Solaris. The Single UNIX Specification does not offer specific
advice on this issue.

PR:		kern/45733
Requested by:	Jayanth Vijayaraghavan
Reviewed by:	rwatson
2004-06-22 23:58:09 +00:00