some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.
Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.
We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.
Reviewed by: csjp
Obtained from: TrustedBSD Project
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.
Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)
as UF_OPENING. Disable closing of that entries. This should fix the crashes
caused by devfs_open() (and fifo_open()) dereferencing struct file * by
index, while the filedescriptor is closed by parallel thread.
Idea by: tegge
Reviewed by: tegge (previous version of patch)
Tested by: Peter Holm
Approved by: re (kensmith)
MFC after: 3 weeks
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.
- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.
- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.
- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).
- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.
In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).
Tested by: kris
Discussed with: jhb, kris, attilio, jeff
file descriptor is closed out from under us in kern_open(). This race
is already handled and the file will be closed when kern_open() does an
fdrop just before returning.
until after the call to fdclose(). This closes an obscure race that
could result in the later call to fdclose() actually closing a different
file descriptor if another thread close()'s the file descriptor being
opened before fdrop() is called, so the fdrop() in kern_open() frees the
file object, then the second thread (or a third) creates a new file
descriptor which reuses both the same index and the same file pointer
thus tricking fdclose() in the first thread into thinking that the
original file was still open.
MFC after: 1 week
- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.
As consequence, getdirentries() no longer needs to drop/reacquire
directory vnode lock, that would allow it to be reclaimed in between.
Reported and tested by: Peter Holm
Approved by: rodrigc (unionfs)
MFC after: 1 week
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.
Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.
VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.
Approved by: mckusick
Discussed with: many (on IRC)
Tested with: ufs, msdosfs, cd9660, nullfs and zfs
non-extattr functions from vfs_extattr.c, and extattr functions from
vfs_syscalls.c.
Change copyright/license on vfs_extattr.c to my copyright/license on
the extended attribute implementation (from extattr.h).
Clean up includes a bit.
Obtained from: TrustedBSD Project
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
to twice unlock the vnode. Check that ni_vp and ni_dvp are different before
doing second unlock.
Reviewed by: rwatson
Approved by: pjd (mentor)
MFC after: 1 week
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.
with other commonly used sysctl name spaces, rather than declaring them
all over the place.
MFC after: 1 month
Sponsored by: nCircle Network Security, Inc.
vfs_rel() on the mountpoint if the MAC checks fail in kern_statfs() and
kern_fstatfs(). Similarly, don't perform an extra vfs_rel() if we get
a doomed vnode in kern_fstatfs(), and handle the case of mp being NULL
(for some doomed vnodes) by conditionalizing the vfs_rel() in
kern_fstatfs() on mp != NULL.
CID: 1517
Found by: Coverity Prevent (tm) (kern_fstatfs())
Pointy hat to: jhb
kern_fstatfs() so that it is still held when prison_enforce_statfs() is
called (since that function likes to poke and prod at the mountpoint
structure).
MFC after: 3 days
ibcs2_getdents(), ibcs2_read(), ogetdirentries(), svr4_sys_getdents(),
and svr4_sys_getdents64() similar to that in getdirentries().
- Mark ibcs2_getdents(), ibcs2_read(), linux_getdents(), linux_getdents64(),
linux_readdir(), ogetdirentries(), svr4_sys_getdents(), and
svr4_sys_getdents64() MPSAFE.
and use that instead of testing fdidx against -1 to determine if it should
release Giant if Giant was locked due to the requested file residing on a
non-MPSAFE VFS.
Discussed with: jeff
VFS_LOCK_GIANT/VFS_UNLOCK_GIANT calls. This completely removes Giant
acquisition in the syscall path for ffs.
Bug fix to kern_fhstatfs from: Todd Miller <Todd.Miller@sparta.com>
Sponsored by: Isilon Systems, Inc.
the VFS_STATFS call to prevent the mount from disappearing while we're
stating.
- Convert these routines to use MPSAFE namei semantics.
MFC After: 1 week
vnode is from a file system that is not MPSAFE, as vrele() expects
Giant to be held when it is called on a non-MPSAFE vnode.
Spotted by: kris
Tested by: glebius
directory. vrele() may lock the passed vnode, which in these cases would
give an invalid lock order of child -> parent. These situations are
deadlock prone although do not typically deadlock because the vrele
is typically not releasing the last reference to the vnode. Users of
vrele must consider it as a call to vn_lock() and order it appropriately.
MFC After: 1 week
Sponsored by: Isilon Systems, Inc.
Tested by: kkenn
lookup() instead of EPERM when a DELETE or RENAME operation is
attempted on "..".
In kern_unlink(), remap EINVAL errors returned from namei() to EPERM
to match existing (and POSIX required) behaviour.
Discussed with: bde
MFC after: 3 days
that a file's atime and mtime are only set to correct fractional
second values (0-999999000ns with the current interface).
Prior to this change users could create files with values outside
that range. Moreover, on 32-bit machines tv_usec offsets larger than
4.3s would result in an unnormalized AND wrong timestamp value,
due to overflow.
MFC after: 1 week
The purpose of this change is consistency (not performance improvement:)),
as it was hard to tell if fdrop() is MPSAFE or not when I saw it sometimes
under the Giant and sometimes without it.
Glanced at by: ssouhlal, kan
remove the unconditional acquisition of Giant for extended attribute related
operations. If the file system is set as being MP safe and debug.mpsafevfs is
1, do not pickup Giant.
Mark the following system calls as being MP safe so we no longer pickup Giant
in the system call handler:
o extattrctl
o extattr_set_file
o extattr_get_file
o extattr_delete_file
o extattr_set_fd
o extattr_get_fd
o extattr_delete_fd
o extattr_set_link
o extattr_get_link
o extattr_delete_link
o extattr_list_file
o extattr_list_link
o extattr_list_fd
-Pass MPSAFE flags to namei(9) lookup and introduce vfslocked variable which
will keep track of any Giant acquisitions.
-Wrap any fd operations which manipulate vnodes in VFS_{UN}LOCK_GIANT
-Drop VFS_ASSERT_GIANT into function which operate on vnodes to ensure that
we are sufficiently protected.
I've tested these changes with various TrustedBSD MAC policies which use
extended attribute a lot on SMP and UP systems (thanks to Scott Long for
making some SMP hardware available to me for testing).
Discussed with: jeff
Requested by: jhb, rwatson
links and the execution of ELF binaries. Two problems were found:
1) The link path wasn't tagged as being MP safe and thus was not properly
protected.
2) The ELF interpreter vnode wasnt being locked in namei(9) and thus was
insufficiently protected.
This commit makes the following changes:
-Sets the MPSAFE flag in NDINIT for symbolic link paths
-Sets the MPSAFE flag in NDINIT and introduce a vfslocked variable which
will be used to instruct VFS_UNLOCK_GIANT to unlock Giant if it has been
picked up.
-Drop in an assertion into vfs_lookup which ensures that if the MPSAFE
flag is NOT set, that we have picked up giant. If not panic (if WITNESS
compiled into the kernel). This should help us find conditions where vnode
operations are in-sufficiently protected.
This is a RELENG_6 candidate.
Discussed with: jeff
MFC after: 4 days
used to ensure that we weren't exiting the syscall with a lock still
held. This wasn't safe, however, because we'd already executed a vput()
and on a loaded system the vnode may have been free'd by the time we
assert. This functionality is also handled by the td_locks assert in
userret, which doesn't tell you what the syscall was, but will at least
panic before you deadlock.
Sponsored by: Isilon Systems, Inc.
Discovred by: Peter Holm
Approved by: re (blanket vfs)