When a positively niced process requests a disk I/O, make
it wait for its nice value of ticks before scheduling its
I/O request if there are any other processes with I/O
requests in the disk queue. For all the gory details, see
the ``Running fsck in the Background'' paper in the Usenix
BSDCon 2002 Conference Proceedings, pages 55-64.
commit by Kirk also fixed a softupdates bug that could easily be triggered
by server side NFS.
* An edge case with shared R+W mmap()'s and truncate whereby
the system would inappropriately clear the dirty bits on
still-dirty data. (applicable to all filesystems)
THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING.
see vm/vm_page.c line 1641
* The straddle case for VM pages and buffer cache buffers when
truncating. (applicable to NFS client side)
* Possible SMP database corruption due to vm_pager_unmap_page()
not clearing the TLB for the other cpu's. (applicable to NFS
client side but could effect all filesystems). Note: not
considered serious since the corruption occurs beyond the file
EOF.
* When flusing a dirty buffer due to B_CACHE getting cleared,
we were accidently setting B_CACHE again (that is, bwrite() sets
B_CACHE), when we really want it to stay clear after the write
is complete. This resulted in a corrupt buffer. (applicable
to all filesystems but probably only triggered by NFS)
* We have to call vtruncbuf() when ftruncate()ing to remove
any buffer cache buffers. This is still tentitive, I may
be able to remove it due to the second bug fix. (applicable
to NFS client side)
* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
to set n_size before calling nfs_vinvalbuf or the NFS code
may recursively vnode_pager_setsize() to the original value
before the truncate. This is what was causing the user mmap
bus faults in the nfs tester program. (applicable to NFS
client side)
* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made
by Kirk).
Testing program written by: Avadis Tevanian, Jr.
Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step')
MFC after: 1 week
always deriving the credential for a newly accepted connection from
the listen socket. Previously, the selection of the credential
depended on the protocol: UNIX domain sockets would use the
connecting process's credential, and protocols supporting a creation
of the socket before the receiving end called accept() would use
the listening socket. After this change, it is always the listening
credential.
Reviewed by: green
a KTR log entry. Any KTR requests made while working on an entry are
ignored/discarded to prevent recursion. This is a better fix for the
hack to futz with the CPU mask and call getnanotime() if KTR_LOCK or
KTR_WITNESS was on. It also covers the actual formatting of the log entry
including dumping it to the display which the earlier hacks did not.
where our security related sysctl tuneables are located. Also, this
will help if/when we move _security node out from under _kern as to help
make _kern less cluttered.
Approved by: rwatson
Review by: rwatson
- The MI portions of struct globaldata have been consolidated into a MI
struct pcpu. The MD per-CPU data are specified via a macro defined in
machine/pcpu.h. A macro was chosen over a struct mdpcpu so that the
interface would be cleaner (PCPU_GET(my_md_field) vs.
PCPU_GET(md.md_my_md_field)).
- All references to globaldata are changed to pcpu instead. In a UP kernel,
this data was stored as global variables which is where the original name
came from. In an SMP world this data is per-CPU and ideally private to each
CPU outside of the context of debuggers. This also included combining
machine/globaldata.h and machine/globals.h into machine/pcpu.h.
- The pointer to the thread using the FPU on i386 was renamed from
npxthread to fpcurthread to be identical with other architectures.
- Make the show pcpu ddb command MI with a MD callout to display MD
fields.
- The globaldata_register() function was renamed to pcpu_init() and now
init's MI fields of a struct pcpu in addition to registering it with
the internal array and list.
- A pcpu_destroy() function was added to remove a struct pcpu from the
internal array and list.
Tested on: alpha, i386
Reviewed by: peter, jake
This flag adds a pausing utility. When ran with -p, during the kernel
probing phase, the kernel will pause after each line of output.
This pausing can be ended with the '.' key, and is automatically
suspended when entering ddb.
This flag comes in handy at systems without a serial port that either hang
during booting or reser.
Reviewed by: (partly by jlemon)
MFC after: 1 week
In this case, C99's __func__ is properly defined as:
static const char __func__[] = "function-name";
and GCC 3.1 will not allow it to be used in bogus string concatenation.
o The manual page for kevent says that EVFILT_AIO returns under the same
conditions as aio_error(). With that in mind, set the data field
of the returned struct kevent to the value that would be returned
by aio_error().
o Fix two compilation warnings.
mutable contents of struct prison (hostname, securelevel, refcount,
pr_linux, ...)
o Generally introduce mtx_lock()/mtx_unlock() calls throughout kern/
so as to enforce these protections, in particular, in kern_mib.c
protection sysctl access to the hostname and securelevel, as well as
kern_prot.c access to the securelevel for access control purposes.
o Rewrite linux emulator abstractions for accessing per-jail linux
mib entries (osname, osrelease, osversion) so that they don't return
a pointer to the text in the struct linux_prison, rather, a copy
to an array passed into the calls. Likewise, update linprocfs to
use these primitives.
o Update in_pcb.c to always use prison_getip() rather than directly
accessing struct prison.
Reviewed by: jhb
- uid's -> uids
- whitespace improvements, linewrap improvements
- reorder copyright more appropriately
- remove redundant MP SAFE comments, add one "NOT MPSAFE?"
for setgroups(), which seems to be the sole un-changed system
call in the file.
- clean up securelevel_g?() functions, improve comments.
Largely submitted by: bde
you run out of mbuf address space.
kern/subr_mbuf.c: print a warning message when mb_alloc fails, again
rate-limited to at most once per second. This covers other
cases of mbuf allocation failures. Probably it also overlaps the
one handled in vm/vm_kern.c, so maybe the latter should go away.
This warning will let us gradually remove the printf that are scattered
across most network drivers to report mbuf allocation failures.
Those are potentially dangerous, in that they are not rate-limited and
can easily cause systems to panic.
Unless there is disagreement (which does not seem to be the case
judging from the discussion on -net so far), and because this is
sort of a safety bugfix, I plan to commit a similar change to STABLE
during the weekend (it affects kern/uipc_mbuf.c there).
Discussed-with: jlemon, silby and -net
the administrator to restrict access to the kernel message buffer.
It defaults to '1', which permits access, but if set to '0', requires
that the process making the sysctl() have appropriate privilege.
o Note that for this to be effective, access to this data via system
logs derived from /dev/klog must also be limited.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
that new models can inhabit kern.security.<modelname>.
o While I'm there, shorten somewhat excessive variable names, and clean
things up a little.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
readability.
o Conditionalize only the SYSCTL definitions for the regression
tree, not the variables itself, decreasing the number of #ifdef
REGRESSIONs scattered in kern_mib.c, and making the code more
readable.
Sponsored by: DARPA, NAI Labs
For an object type, we maintain a variable mb_mapfull. It is 0 by default
and is only raised to 1 in one place: when an mb_pop_cont() fails for
the first time, on the assumption that the reason for the failure is
due to the underlying map for the object (e.g. clust_map, mbuf_map) being
exhausted.
Problem and Changes:
Change how we define "mb_mapfull." It now means: "set to 1 when the first
mb_pop_cont() fails only in the kmem_malloc()-ing of the object, and
only if the call was with the M_TRYWAIT flag." This is a more conservative
definition and should avoid odd [but theoretically possible] situations
from occuring. i.e. we had set mb_mapfull to 1 thinking the map for the
object was actually exhausted when we _actually_ failed in malloc()ing
the space for the bucket structure managing the objects in the page
we're allocating.
The kernel certainly doesn't use _PATH_DEV or even /dev/ to find the device.
It cannot, since "/" has not been mounted. Maybe the only affect of using
/dev/ is that it gets put in the mounted-from name for "/", so that mount(8),
etc., display an absolute path before "/" has been remounted. Many have
never bothered typing the full path, and code that constructs a path in
rootdevnames[] never bothered to construct a full path, so the example
shouldn't have it.
Submitted by: bde
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.
Replace uses of holdfp() with fget*() or fgetvp*() calls as appropriate
introduce fget(), fget_read(), fget_write() - these functions will take
a thread and file descriptor and return a file pointer with its ref
count bumped.
introduce fgetvp(), fgetvp_read(), fgetvp_write() - these functions will
take a thread and file descriptor and return a vref()'d vnode.
*_read() requires that the file pointer be FREAD, *_write that it be
FWRITE.
This continues the cleanup of struct filedesc and struct file access
routines which, when are all through with it, will allow us to then
make the API calls MP safe and be able to move Giant down into the fo_*
functions.
- Restore inferior() to being iterative rather than recursive.
- Assert that the proctree_lock is held in inferior() and change the one
caller to get a shared lock of it. This also ensures that we hold the
lock after performing the check so the check can't be made invalid out
from under us after the check but before we act on it.
Requested by: bde
one to perform a vn_open using temporary/other/fake credentials.
Modify the nfs client side locking code to use vn_open_cred() passing
proc0's ucred instead of the old way which was to temporary raise
privs while running vn_open(). This should close the race hopefully.
as being valid. Previously only the magic number and the virtual
address were checked, but it makes little sense to require that
the virtual address is the same (the message buffer is located at
the end of physical memory), and checks on the msg_bufx and msg_bufr
indices were missing.
Submitted by: Bodo Rueskamp <br@clabsms.de>
Tripped over during a kernel debugging tutorial given by: grog
Reviewed by: grog, dwmalone
MFC after: 1 week
sysctl_req', which describes in-progress sysctl requests. This permits
sysctl handlers to have access to the current thread, permitting work
on implementing td->td_ucred, migration of suser() to using struct
thread to derive the appropriate ucred, and allowing struct thread to be
passed down to other code, such as network code where td is not currently
available (and curproc is used).
o Note: netncp and netsmb are not updated to reflect this change, as they
are not currently KSE-adapted.
Reviewed by: julian
Obtained from: TrustedBSD Project
consistent with the one other function in the file, and prevents long
lines in up-coming changes. This nominally pulls kern_mib.c a little
further down the long path to style(9) compliance.
in wdrain during a write. This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.
Remove a vprintf() warning from MD when the backing vnode is found to be
in-use. The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.
MFC after: 1 week
structure changes now rather then piecemeal later on. mnt_nvnodelist
currently holds all the vnodes under the mount point. This will eventually
be split into a 'dirty' and 'clean' list. This way we only break kld's once
rather then twice. nvnodelist will eventually turn into the dirty list
and should remain compatible with the klds.
by one - see _SIG_IDX(). Revert part of my mis-correction in kern_sig.c
(but signal 0 still has to be allowed) and fix _SIG_VALID() (it was
rejecting ignal 128).
argument of 0. You cannot return EINVAL for signal 0. This broke
(in 5 minutes of testing) at least ssh-agent and screen.
However, there was a bug in the original code. Signal 128 is not
valid.
Pointy-hat to: des, jhb
as suser_td(td) works as well as suser_xxx(NULL, p->p_ucred, 0);
This simplifies upcoming changes to suser(), and causes this code
to use the right credential (well, largely) once the td->td_ucred
changes are complete. There remains some redundancy and oddness
in this code, which should be rethought after the next batch of
suser and credential changes.
in vfs_syscalls.c. Although it did save some indirection, many of
those savings will be obscured with the impending commit of suser()
changes, and the result is increased code complexity. Also, once
p->p_ucred and td->td_ucred are distinguished, this will make
vfs_mount() use the correct thread credential, rather than the
process credential.
debug another process based on their respective {effective,additional,
saved,real} gid's. p1 is only permitted to debug p2 if its effective
gids (egid + additional groups) are a strict superset of the gids of
p2. This implements properly the security test previously incorrectly
implemented in kern_ktrace.c, and is consistent with the kernel
security policy (although might be slightly confusing for those more
familiar with the userland policy).
o Restructure p_candebug() logic so that various results are generated
comparing uids, gids, credential changes, and then composed in a
single check before testing for privilege. These tests encapsulate
the "BSD" inter-process debugging policy. Other non-BSD checks remain
seperate. Additional comments are added.
Submitted by: tmm, rwatson
Obtained from: TrustedBSD Project
Reviewed by: petef, tmm, rwatson
really be moved elsewhere: p_candebug() encapsulates the security
policy decision, whereas the P_INEXEC check has to do with "correctness"
regarding race conditions, rather than security policy.
Example: even if no security protections were enforced (the "uids are
advisory" model), removing P_INEXEC could result in incorrect operation
due to races on credential evaluation and modification during execve().
Obtained from: TrustedBSD Project
o Reorder and synchronize #include's, including moving "opt_cap.h" to
above system includes.
o Introduce #ifdef'd kern.security.capabilities sysctl tree, including
kern.security.capabilities.enabled, which defaults to 0.
The rest of the file remains stubs for the time being.
Obtained from: TrustedBSD Project
o POSIX.1e capabilities authorize overriding of VEXEC for VDIR based
on CAP_DAC_READ_SEARCH, but of !VDIR based on CAP_DAC_EXECUTE. Add
appropriate conditionals to vaccess() to take that into account.
o Synchronization cap_check_xxx() -> cap_check() change.
Obtained from: TrustedBSD Project
o Add reference to TrustedBSD Project in license header.
o Update dated comments, including comment in extattr.h claiming that
no file systems support extended attributes.
o Improve comment consistency.
credential selection, rather than reference via a thread or process
pointer. This is part of a gradual migration to suser() accepting
a struct ucred instead of a struct proc, simplifying the reference
and locking semantics of suser().
Obtained from: TrustedBSD Project
- Now that apm loadable module can inform its existence to other kernel
components (e.g. i386/isa/clock.c:startrtclock()'s TCS hack).
- Exchange priority of SI_SUB_CPU and SI_SUB_KLD for above purpose.
- Add simple arbitration mechanism for APM vs. ACPI. This prevents
the kernel enables both of them.
- Remove obsolete `#ifdef DEV_APM' related code.
- Add abstracted interface for Powermanagement operations. Public apm(4)
functions, such as apm_suspend(), should be replaced new interfaces.
Currently only power_pm_suspend (successor of apm_suspend) is implemented.
Reviewed by: peter, arch@ and audit@
function symbols in the kernel in a list of C strings, with an extra
nul-termination at the end.
This sysctl requires addition of a new linker operation. Now,
linker_file_t's need to respond to "each_function_name" to export
their function symbols.
Note that the sysctl doesn't currently allow distinguishing multiple
symbols with the same name from different modules, but could quite
easily without a change to the linker operation. This will be a nicety
to have when it can be used.
Obtained from: NAI Labs CBOSS project
Funded by: DARPA
by the profiler on a running system. This is not done sparsely, as
memory is cheaper than processor speed and each gprof mcount() and
mexitcount() operation is already very expensive.
Obtained from: NAI Labs CBOSS project
Funded by: DARPA
number) instead of allocating next free unit for them. If someone needs
fixed place, he must specify it correctly. "Allocating next" is especially bad
because leads to double device detection and to "repeat make_dev panic" as
result. This can happens if the same devices present somewhere on PCI bus,
hints and ACPI. Making them present in one place only not always
possible, "sc" f.e. can't be removed from hints, it results to no console at
all.
2) In make_device(), detect when devclass_add_device() fails, free dev and
return. I.e. add missing error checking. This part needed to finish fix in 1),
but must be done this way in anycase, with old variant too.
non-existent disk in a legacy /dev on a DEVFS system would panic the system
if stat(2)'ed.
Do not whine about anonymous device nodes not having a si_devsw, they're
not supposed to.
Make it a panic to repeat make_dev() or destroy_dev(), this check
should maybe be neutered when -current goes -stable.
Whine if devsw() is called on anon dev_t's in a devfs system.
Make a hack to avoid our lazy-eval disk code triggering the above whine.
Fix the multiple make_dev() in disk code by making ${disk}${unit}s${slice}
an alias/symlink to ${disk}${unit}s${slice}c
it has not yet returned. Use this flag to deny debugging requests while
the process is execve()ing, and close once and for all any race conditions
that might occur between execve() and various debugging interfaces.
Reviewed by: jhb, rwatson
strip the space from '( struct thread *...', wrap long lines.
o Remove an unneeded comment on the topic of no lock being required as
part of the NDINIT() in __acl_get_file(), as it's really not required
there.
Obtained from: TrustedBSD Project
of Giant during the Giant unwinding phase, and start work on instrumenting
Giant for the file and proc mutexes.
These wrappers allow developers to turn on and off Giant around various
subsystems. DEVELOPERS SHOULD NEVER TURN OFF GIANT AROUND A SUBSYSTEM JUST
BECAUSE THE SYSCTL EXISTS! General developers should only considering
turning on Giant for a subsystem whos default is off (to help track down
bugs). Only developers working on particular subsystems who know what
they are doing should consider turning off Giant.
These wrappers will greatly improve our ability to unwind Giant and test
the kernel on a (mostly) subsystem by subsystem basis. They allow Giant
unwinding developers (GUDs) to emplace appropriate subsystem and structural
mutexes in the main tree and then request that the larger community test
the work by turning off Giant around the subsystem(s), without the larger
community having to mess around with patches. These wrappers also allow
GUDs to boot into a (more likely to be) working system in the midst of
their unwinding work and to test that work under more controlled
circumstances.
There is a master sysctl, kern.giant.all, which defaults to 0 (off). If
turned on it overrides *ALL* other kern.giant sysctls and forces Giant to
be turned on for all wrapped subsystems. If turned off then Giant around
individual subsystems are controlled by various other kern.giant.XXX sysctls.
Code which overlaps multiple subsystems must have all related subsystem Giant
sysctls turned off in order to run without Giant.
while it is on a queue with the queue lock and remove the per-task locks.
- Remove TASK_DESTROY now that it is no longer needed.
- Go back to inlining TASK_INIT now that it is short again.
Inspired by: dfr
userland. The per thread ucred reference is immutable and thus needs no
locks to be read. However, until all the proc locking associated with
writes to p_ucred are completed, it is still not safe to use the per-thread
reference.
Tested on: x86 (SMP), alpha, sparc64
queue, and a mutex to protect the global list of taskqueues. The only
visible change is that a TASK_DESTROY() macro has been added to mirror
the TASK_INIT() macro to destroy a task before it is free'd.
Submitted by: Andrew Reiter <awr@watson.org>
splhigh() before the mtx_unlock and tsleep(). The splhigh() was probably
correct in the original code using simplelocks but is not correct in
5.0-current.
Noticed by: Andrew Reiter <awr@FreeBSD.org>
real effect.
Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.
Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.
(more optimization work is needed on top of these fixes)
MFC after: 1 week
or the cluster will not be properly merged. Dup the code from
cluster_wbuild() and add some printf()s to see if bad cases are present.
MFC after: 2 weeks