while it is on a queue with the queue lock and remove the per-task locks.
- Remove TASK_DESTROY now that it is no longer needed.
- Go back to inlining TASK_INIT now that it is short again.
Inspired by: dfr
userland. The per thread ucred reference is immutable and thus needs no
locks to be read. However, until all the proc locking associated with
writes to p_ucred are completed, it is still not safe to use the per-thread
reference.
Tested on: x86 (SMP), alpha, sparc64
queue, and a mutex to protect the global list of taskqueues. The only
visible change is that a TASK_DESTROY() macro has been added to mirror
the TASK_INIT() macro to destroy a task before it is free'd.
Submitted by: Andrew Reiter <awr@watson.org>
splhigh() before the mtx_unlock and tsleep(). The splhigh() was probably
correct in the original code using simplelocks but is not correct in
5.0-current.
Noticed by: Andrew Reiter <awr@FreeBSD.org>
real effect.
Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.
Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.
(more optimization work is needed on top of these fixes)
MFC after: 1 week
or the cluster will not be properly merged. Dup the code from
cluster_wbuild() and add some printf()s to see if bad cases are present.
MFC after: 2 weeks
returns a success/failure code rather than the actual value.
- Add getenv_string() which copies a string from the environment to another
string and returns true on success.
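A minimal userland sketch of the changed convention (hypothetical model only, not the kernel implementation): the value comes back through a pointer and the return value reports success, so a missing variable is distinguishable from a value of 0.

	#include <limits.h>
	#include <stdlib.h>

	/* Userland model only: parse an environment variable as an int,
	 * returning 1 on success and 0 on failure instead of the value. */
	static int
	getenv_int_model(const char *name, int *data)
	{
		const char *s = getenv(name);
		char *end;
		long v;

		if (s == NULL || *s == '\0')
			return (0);
		v = strtol(s, &end, 0);
		if (*end != '\0' || v < INT_MIN || v > INT_MAX)
			return (0);
		*data = (int)v;
		return (1);
	}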
all successful calls to VOP_OPEN() might be reflected in a call to
VOP_CLOSE(). For now, simply add a comment reflecting this problem;
this should be fixed at some point.
terminated and flushes pending dirty pages it is possible for the
object to be ref'd (0->1) and then deref'd (1->0) during termination.
We do not terminate the object a second time.
Document vop_stdgetvobject() to explicitly allow it to be called without
the vnode interlock held (for upcoming sync_msync() and ffs_sync()
performance optimizations)
MFC after: 3 days
the system load average. Previously, the load average measurement
was susceptible to synchronisation with processes that run at
regular intervals such as the system bufdaemon process.
Each interval is now chosen at random within the range of 4 to 6
seconds. This large variation is chosen so that over the shorter
5-minute load average timescale there is a good dispersion of
samples across the 5-second sample period (the time to perform 60
5-second samples now has a standard deviation of approx 4.5 seconds).
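A minimal sketch of the resampling idea (helper name and use of arc4random_uniform() from BSD <stdlib.h> are assumptions; the real callout code differs): each pass picks the next interval uniformly from 4 to 6 seconds worth of ticks.

	#include <stdlib.h>

	/* Sketch only: choose the next load-average sampling interval,
	 * in ticks, uniformly from [4*hz, 6*hz] so samples cannot stay
	 * synchronised with fixed-period daemons such as bufdaemon. */
	static int
	next_loadav_interval(int hz)
	{
		return (4 * hz + (int)arc4random_uniform(2 * hz + 1));
	}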
to kern_synch.c in preparation for adding some jitter to the
inter-sample time.
Note that the "vm.loadavg" sysctl still lives in vm_meter.c which
isn't the right place, but it is appropriate for the current (bad)
name of that sysctl.
Suggested by: jhb (some time ago)
Reviewed by: bde
off to witness_init() making the check for double initializing a lock by
testing the LO_INITIALIZED flag moot. Work around this by checking the
LO_INITIALIZED flag ourselves before we bzero the lock structure.
be so dangerous it isn't funny. eg: if you panic inside NFS or softdep,
and then try to sync, you run into held locks and cause either deadlocks,
recursive panics or other interesting chaos. Default is unchanged.
number, portable OpenAFS applications don't have to attempt to determine
what system call number was dynamically allocated. No system call
prototype or implementation is defined.
Requested by: Tom Maher <tardis@watson.org>
This stops panics on unloading modules which define their own sysctl sets.
However, this also removes the protection against somebody actually
defining a static sysctl with an oid in the range of the dynamic ones,
which would break badly if there is already a dynamic sysctl with
the requested oid.
Apparently, the algorithm for removing sysctl sets needs a bit more work.
For the present, the panic I introduced only leads to Bad Things (tm).
Submitted by: many users of -current :(
Pointy hat to: roam (myself) for not testing rev. 1.112 enough.
- Add proc locking to the jail() syscall. This mostly involved shuffling
a few things around so that blockable things like malloc and copyin
were performed before acquiring the lock and checking the existing
ucred and then updating the ucred as one "atomic" change under the proc
lock.
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.
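A compilable userland model of the three primitives as described above (field layout is illustrative; the real struct ucred and its locking are more involved):

	#include <stdio.h>

	struct ucred {
		int		cr_ref;		/* reference count */
		unsigned	cr_uid;
	};

	/* crhold() bumps the refcount and returns the reference it bumped. */
	static struct ucred *
	crhold(struct ucred *cr)
	{
		cr->cr_ref++;
		return (cr);
	}

	/* crcopy() just copies credential contents; no return value. */
	static void
	crcopy(struct ucred *dst, const struct ucred *src)
	{
		dst->cr_uid = src->cr_uid;
	}

	/* crshared() is true when more than one holder references the cred. */
	static int
	crshared(const struct ucred *cr)
	{
		return (cr->cr_ref > 1);
	}

	int
	main(void)
	{
		struct ucred a = { 1, 100 }, b = { 1, 0 };

		crhold(&a);
		crcopy(&b, &a);
		printf("a shared: %d, b shared: %d\n", crshared(&a), crshared(&b));
		return (0);
	}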
Updated by peter following KSE and Giant pushdown.
I've been running with this patch for two weeks with no ill side effects.
PR: kern/12014: Fix SysV Semaphore handling
Submitted by: Peter Jeremy <peter.jeremy@alcatel.com.au>
a single kern.security.seeotheruids_permitted, described as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().
Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.
Reviewed by: ps, billf
Obtained from: TrustedBSD Project
processes to attach debugging to themselves even though the
global kern_unprivileged_procdebug_permitted policy might disallow
this.
o Move the kern_unprivileged_procdebug_permitted check above the
(p1==p2) check.
Reviewed by: des
Until now, the ptrace syscall was implemented as a wrapper that called
various functions in procfs depending on which ptrace operation was
requested. Most of these functions were themselves wrappers around
procfs_{read,write}_{,db,fp}regs(), with only some extra error checks,
which weren't necessary in the ptrace case anyway.
This commit moves procfs_rwmem() from procfs_mem.c into sys_process.c
(renaming it to proc_rwmem() in the process), and implements ptrace()
directly in terms of procfs_{read,write}_{,db,fp}regs() instead of
having it fake up a struct uio and then call procfs_do{,db,fp}regs().
It also moves the prototypes for procfs_{read,write}_{,db,fp}regs()
and proc_rwmem() from proc.h to ptrace.h, and marks all procfs files
except procfs_machdep.c as "optional procfs" instead of "standard".
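A small userland illustration of the resulting interface (FreeBSD-style ptrace(2) requests; error handling trimmed, and the detach idiom is an assumption): reading one word of a stopped process' memory, the access that now ends up in proc_rwmem() rather than in procfs.

	#include <sys/types.h>
	#include <sys/ptrace.h>
	#include <sys/wait.h>
	#include <errno.h>

	/* Sketch: attach to a process, read one word of its memory with
	 * PT_READ_D, then detach. */
	static int
	peek_word(pid_t pid, void *addr, int *out)
	{
		int status, word;

		if (ptrace(PT_ATTACH, pid, NULL, 0) == -1)
			return (-1);
		waitpid(pid, &status, 0);	/* wait for the traced stop */
		errno = 0;
		word = ptrace(PT_READ_D, pid, (caddr_t)addr, 0);
		if (word == -1 && errno != 0) {
			ptrace(PT_DETACH, pid, (caddr_t)1, 0);
			return (-1);
		}
		*out = word;
		return (ptrace(PT_DETACH, pid, (caddr_t)1, 0));
	}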
confused. Since sa_sigaction and sa_handler alias each other in a
union, the bug was completely harmless. This had been fixed as part
of the SIGCHLD changes in revision 1.125, but it was reverted when
they were backed out in revision 1.126.
'regression.*'.
o Add 'regression.securelevel_nonmonotonic', conditional on 'options
REGRESSION', which allows the securelevel to be lowered for the purposes
of efficient regression testing of securelevel policy decisions.
Regression tests for securelevels will be committed shortly.
NOTE: 'options REGRESSION' should never be used on production machines, as
it permits violation of system invariants so as to improve the ability to
effectively test edge cases, and improve testing efficiency.
syscalls are of type NODEF but not in a way that fits the given
definition of that type. The exact difference of lkmressys and
lkmnosys is unclear, which makes it all the more confusing. A
reevaluation of what we have and what we really need is in order.
Spotted by: Maxime Henrion <mux@qualys.com>
Pointy hat: marcel
wait for both read AND write I/O to complete. Only NFS calls vinvalbuf()
on an active vnode (when the server indicates that the file is stale), so
this bug fix only affects NFS clients.
MFC after: 3 days
kern.ipc.showallsockets is set to 0.
Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks
missed in the previous commit; a line that exceeded 80 characters. No
functional changes, but the object file's md5 checksum changes because some
lines have been displaced.
1) Allow the sending of more than one control message at a time
over a unix domain socket. This should cover the PR 29499.
2) This requires that unp_{ex,in}ternalize and unp_scan understand
mbufs with more than one control message at a time.
3) Internalize and externalize used to work on the mbuf in-place.
This made life quite complicated and the code for sizeof(int) <
sizeof(file *) could end up doing the wrong thing. The patch now always
creates a new mbuf/cluster. This resulted in a change to the
prototype for the domain externalise function.
4) You can now send SCM_TIMESTAMP messages.
5) Always use CMSG_DATA(cm) to determine the start of the data
in unp_{ex,in}ternalize. It was using ((struct cmsghdr *)cm + 1)
in some places, which gives the wrong alignment on the alpha.
(NetBSD made this fix some time ago).
This results in an ABI change for descriptor passing and creds
passing on the alpha. (Probably on the IA64 and Sparc ports too).
6) Fix userland programs to use CMSG_* macros too (see the sketch below).
7) Be more careful about freeing mbufs containing (file *)s.
This is made possible by the prototype change of externalise.
PR: 29499
MFC after: 6 weeks
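A userland sketch of item 6 (hypothetical helper; descriptor passing over a PF_LOCAL socket), laying out the control buffer with the CMSG_* macros instead of raw pointer arithmetic:

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <sys/uio.h>
	#include <string.h>

	/* Hypothetical helper: send one data byte plus two descriptors in an
	 * SCM_RIGHTS control message.  CMSG_DATA() yields a correctly aligned
	 * payload pointer on every platform, unlike (struct cmsghdr *)cm + 1. */
	static ssize_t
	send_two_fds(int sock, int fd1, int fd2)
	{
		char data = '.';
		int fds[2] = { fd1, fd2 };
		struct iovec iov = { &data, sizeof(data) };
		union {
			struct cmsghdr hdr;
			char buf[CMSG_SPACE(sizeof(fds))];
		} cmsgbuf;
		struct msghdr msg;
		struct cmsghdr *cm;

		memset(&msg, 0, sizeof(msg));
		memset(&cmsgbuf, 0, sizeof(cmsgbuf));
		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		msg.msg_control = cmsgbuf.buf;
		msg.msg_controllen = sizeof(cmsgbuf.buf);

		cm = CMSG_FIRSTHDR(&msg);
		cm->cmsg_level = SOL_SOCKET;
		cm->cmsg_type = SCM_RIGHTS;
		cm->cmsg_len = CMSG_LEN(sizeof(fds));
		memcpy(CMSG_DATA(cm), fds, sizeof(fds));

		return (sendmsg(sock, &msg, 0));
	}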
in vfs_syscalls.c:
	if (mp->mnt_stat.f_owner != p->p_ucred->cr_uid &&
	    (error = suser_td(td)) != 0) {
		unwrap_lots_of_stuff();
		return (error);
	}
to:
	if (mp->mnt_stat.f_owner != p->p_ucred->cr_uid) {
		error = suser_td(td);
		if (error) {
			unwrap_lots_of_stuff();
			return (error);
		}
	}
This makes the code more readable when complex clauses are in use,
and minimizes conflicts for large outstanding patchsets modifying the
kernel authorization code (of which I have several), especially where
existing authorization and context code are combined in the same if()
conditional.
Obtained from: TrustedBSD Project
to avoid removing higher level directory vnodes from the namecache has
no perceivable effect and will be removed. This is especially true
when vmiodirenable is turned on, which it is by default now. ( vmiodirenable
makes a huge difference in directory caching ). The vfs.vmiodirenable and
vfs.nameileafonly sysctls have been left in to allow further testing, but
I expect to rip out vfs.nameileafonly soon too.
I have also determined through testing that the real problem with numvnodes
getting too large is due to the VM Page cache preventing the vnode from
being reclaimed. The directory stuff made only a tiny dent relative
to Poul's original code, enough so that some tests succeeded. But tests
with several million small files show that the bigger problem is the VM Page
cache. This will have to be addressed by a future commit.
MFC after: 3 days
when I changed the allocator bits. This implements per-CPU mbtypes
stats by keeping net number of decrements/increments of a given mbtype
per-CPU and then summing all of the per-CPU mbtypes to produce the total
net number of allocated mbufs of the given mbtype.
Counters are carefully balanced to avoid/prevent underflows/overflows.
mbtypes stats are re-enabled with the idea that we may occasionally
(although very rarely) observe slight inconsistencies in the stat
reporting. Most of the time, we should be fine, though.
Also make appropriate modifications to netstat(1) and systat(1) to do
the necessary reporting.
Submitted by: Jiangyi Liu <jyliu@163.net>
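A minimal model of the accounting scheme (sizes and names are illustrative): each CPU keeps a signed net count per mbuf type and the global figure is the sum over CPUs, which is why a reader can occasionally see a slightly inconsistent total.

	#include <stdint.h>

	#define	NCPU		4	/* illustrative */
	#define	MT_NTYPES	8	/* illustrative */

	/* Each CPU increments/decrements only its own slot, lock-free. */
	static int64_t mbtypes_pcpu[NCPU][MT_NTYPES];

	static int64_t
	mbtypes_total(int type)
	{
		int64_t total = 0;
		int cpu;

		for (cpu = 0; cpu < NCPU; cpu++)
			total += mbtypes_pcpu[cpu][type];
		return (total);	/* may lag individual CPUs slightly */
	}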
the static callout list allocated by the system.
Change malloc type from M_TEMP to M_KQUEUE to better track memory.
Add a kern.kq_calloutmax to globally limit the amount of kernel memory
that can be allocated by callouts.
Submitted by: iedowse (items 1, 2)
have its entry in the syscall table added. Nothing else is
done. This differs from type NOPROTO in that NOPROTO adds a
definition to syscall.h besides adding a sysent. A syscall can
now have multiple entries without conflict. Note that the
argssize is fixed and depends on the syscall name.
securelevel_gt(), determine first if a local securelevel exists --
if so, perform the check based on imax(local, global). Otherwise,
simply use the global value.
o Note: even though local securelevels might lag below the global one,
if the global value is updated to higher than local values, maximum
will still be used, making the global dominant even if there is local
lag.
Obtained from: TrustedBSD Project
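A sketch of the comparison described above (hypothetical signature; the real call takes a ucred): the effective securelevel is imax(local, global) when a per-jail level exists, otherwise the global one.

	#include <errno.h>

	static int
	imax(int a, int b)
	{
		return (a > b ? a : b);
	}

	/* Sketch only: fail with EPERM when the effective securelevel is
	 * greater than 'level', mirroring the securelevel_gt() semantics. */
	static int
	securelevel_gt_model(int global_level, int has_local, int local_level,
	    int level)
	{
		int effective;

		effective = has_local ? imax(local_level, global_level)
		    : global_level;
		return (effective > level ? EPERM : 0);
	}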
one is present in the current jail, otherwise, to return the global
securelevel.
o If the securelevel is being updated, require that it be greater than
the maximum of local and global, if a local securelevel exists,
otherwise, just maximum of the global. If there is a local
securelevel, update the local one instead of the global one.
o Note: this does allow local securelevels to lag behind the global one
as long as the local one is not updated following a global increase.
Obtained from: TrustedBSD Project
a time change, and callers so that they provide td->td_proc.
o Modify settime() to use securelevel_gt() for securelevel checking.
Obtained from: TrustedBSD Project
in vn_rdwr_inchunks(), allowing other processes to gain an exclusive
lock on the vnode. Specifically: directory scanning, to avoid a race to the
root directory, and multiple child processes coring simultaneously so they
can figure out that some other core'ing child has an exclusive adv lock and
just exit instead.
This completely fixes performance problems when large programs core. You
can have hundreds of copies (forked children) of the same binary core all
at once and not notice.
MFC after: 3 days
for securelevel_ge() and securelevel_gt(), I was a little surprised,
but fixed it. Turns out that it was the code that was inverted, during
a whitespace cleanup in my commit tree. This commit inverts the
checks, and restores the comment.
all the debugging code into the function versions of the mutex operations
in kern_mutex.c. This reduced the __mtx_* macros to simply wrappers of
the _{get,rel}_lock_* macros, so the __mtx_* macros were also abolished in
favor of just calling the _{get,rel}_lock_* macros. The tangled hairy mass
of macros calling macros is at least a bit more sane now.
selrecord() in ptcpoll(). The pre-KSE code used the passed in proc pointer
rather than curproc, and an earlier seltrue() call uses the passed in
thread and not curthread.
was locked by the proc lock and td_flags is locked by the sched_lock.
The places that read, set, and cleared TDF_SELECT weren't updated, so they
read and modified td_flags w/o holding the sched_lock, meaning that they
could corrupt the per-thread flags field. As an immediate band-aid,
grab sched_lock while reading and manipulating td_flags in relation to
TDF_SELECT. This will probably be cleaned up some later on.
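A userland model of the band-aid (pthread mutex standing in for sched_lock; flag value illustrative): because td_flags holds other scheduler-owned bits, even setting or clearing a single flag must happen under the lock, or concurrent read-modify-writes can lose updates.

	#include <pthread.h>

	#define	TDF_SELECT	0x0001	/* illustrative bit */

	struct thread_model {
		pthread_mutex_t	sched_lock;
		int		td_flags;
	};

	static void
	set_select(struct thread_model *td, int on)
	{
		pthread_mutex_lock(&td->sched_lock);
		if (on)
			td->td_flags |= TDF_SELECT;
		else
			td->td_flags &= ~TDF_SELECT;
		pthread_mutex_unlock(&td->sched_lock);
	}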
credentials rather than the real credentials. This is useful for
implementing GUIs which need to modify icons based on access rights,
but where use of open(2) is too expensive, use of stat(2) doesn't
reflect the file system's real protection model, and use of
access() suffers from real/effective credential confusion. This
implementation provides the same semantics as the call of the same
name on SCO OpenServer. Note: using this call improperly can
leave you subject to some of the same races present in the
access(2) call.
o To implement this, break out the basic logic of access(2) into
vpaccess(), which accepts a passed credential to perform the
invocation of VOP_ACCESS(). Add eaccess(2) to invoke vpaccess(),
and modify access(2) to use vpaccess().
Obtained from: TrustedBSD Project
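A short usage example (assumes the eaccess() prototype lands in <unistd.h>): check readability against the effective credentials without opening the file, bearing in mind the same TOCTOU races as access(2).

	#include <stdio.h>
	#include <unistd.h>

	int
	main(int argc, char **argv)
	{
		const char *path = (argc > 1) ? argv[1] : "/etc/motd";

		if (eaccess(path, R_OK) == 0)
			printf("%s: readable with effective credentials\n", path);
		else
			printf("%s: not readable\n", path);
		return (0);
	}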
as a physical atomic operation. That would require the code to use the
atomic API, which it does not. Instead, the operation is made pseudo
atomic (hence the quotes) by use of the lock to protect clearing all of the
flags in question.
abstract the securelevel implementation details from the checking
code. The call in -CURRENT accepts a struct ucred--in -STABLE, it
will accept struct proc. This facilitates the upcoming commit of
per-jail securelevel support. The calls will also generate a
kernel printf if the calls are made with NULL ucred/proc pointers:
generally speaking, there are few instances of this, and they should
be fixed.
o Update p_candebug() to use securelevel_gt(); future updates to the
remainder of the kernel tree will be committed soon.
Obtained from: TrustedBSD Project
transcription during the (pcred,ucred) merge; this was not used for
the kill() system call, so does not affect direct explicit process
signalling.
Pointed out by: fenner
This works if /dev exists, or if / is read/write (nfsroot). If it is
too hard, leave it up to init -d (which will probably fail if /dev does
not exist, but there isn't much else we can do short of making a union
mount on /).
This means we get a proper /dev if you boot a 5.x kernel on a 4.x world,
which I happen to do often (the ramdisks on our install netboot servers
have 4.x userland worlds on them).
Reviewed by: audit
Add tunables for the sem* and shm* sysctls so they can be tuned at boot
time until they become dynamic.
SAP R/3 doesn't like the compiled in defaults.
automatically change the code to add
struct proc *p = td->td_proc;
because now 'td' is probably capable of being NULL too.
I expect to see more of this kind of error during the 'weeding'
process. It's too easy to make. (junior hacker project.. look for these :-)
Submitted by: Mark Peek <mp@freebsd.org>
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doozy!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
dillon in an earlier e-mail.
- We don't need to test the console right before we vfprintf() the panicstr
message. The printing of the panic message is a fine console test by
itself and doesn't make useful messages scroll off the screen or tick
developers off in quite the same way.
Requested by: jlemon, imp, bmilekic, chris, gsutter, jake (2)
'struct tty' was out of sync in user and kernel due to dev_t/udev_t
mixups. This takes advantage of the fact that dev_t changes type in
userland, so it isn't too pretty.
overflow if uap->nsops (which is already unsigned) is over INT_MAX;
consequently, the bounds check below becomes valid. Previously, if a
value over INT_MAX was passed in uap->nsops, the bounds check wouldn't
catch it, and the value would be used to compute copyin()'s third
argument.
Obtained from: NetBSD
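A sketch of the hardened check (names hypothetical): with the count kept unsigned, an enormous nsops fails the limit test up front instead of overflowing the size computation that feeds copyin().

	#include <sys/sem.h>
	#include <errno.h>
	#include <stddef.h>

	static int
	check_nsops(size_t nsops, size_t semopm)
	{
		if (nsops > semopm)	/* unsigned compare: huge values fail here */
			return (E2BIG);
		/* now nsops * sizeof(struct sembuf) cannot overflow */
		return (0);
	}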
it to the MI area. KSE touched cpu_wait() which had the same change
replicated five ways for each platform. Now it can just do it once.
The only MD parts seemed to be dealing with fpu state cleanup and things
like vm86 cleanup on x86. The rest was identical.
XXX: ia64 and powerpc did not have cpu_throw(), so I've put a functional
stub in place.
Reviewed by: jake, tmm, dillon
me (addition of vn_rdwr_inchunks). The problem Yahoo is solving is that
if you have large process images core dumping, or you have a large number of
forked processes all core dumping at the same time, the original coredump code
would leave the vnode locked throughout. This can cause the directory vnode
to get locked up, which can cause the parent directory vnode to get locked
up, and so on all the way to the root node, locking the entire machine up
for extremely long periods of time.
This patch solves the problem in two ways. First it uses an advisory
non-blocking lock to abort multiple processes trying to core to the same
file. Second (my contribution) it chunks up the writes and uses bwillwrite()
to avoid holding the vnode locked while blocking in the buffer cache.
Submitted by: ps
Reviewed by: dillon
MFC after: 2 weeks
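A userland model of the chunking idea (chunk size illustrative): write the image in bounded pieces so no single write holds the vnode locked for the whole dump; in the kernel, bwillwrite()-style throttling sits between chunks.

	#include <sys/types.h>
	#include <unistd.h>

	#define	CORE_CHUNK	(1024 * 1024)	/* illustrative chunk size */

	static ssize_t
	write_in_chunks(int fd, const char *buf, size_t len)
	{
		size_t done = 0;

		while (done < len) {
			size_t n = len - done;
			ssize_t w;

			if (n > CORE_CHUNK)
				n = CORE_CHUNK;
			w = write(fd, buf + done, n);
			if (w < 0)
				return (-1);
			done += (size_t)w;
			/* in-kernel: bwillwrite() here between chunks */
		}
		return ((ssize_t)done);
	}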
has been forcibly unmounted. If the filesystem root vnode is reached
and it has no associated mountpoint (vp->v_mount == NULL), __getcwd
would return without freeing 'buf'. Add the missing free() call.
PR: kern/30306
Submitted by: Mike Potanin <potanin@mccme.ru>
MFC after: 1 week
and I still don't know why, this was not failing on the non-kse kernel.
It certainly should have since things were using linker_kernel_file
unconditionally. This has highlighted a different problem, though:
trying to do a kldload on a non-dynamic kernel will implode.