emul code when compiling with "options KTRACE".
ktrsyscall() was expecting an array of integers, this was passing the
address of a structure containing an array of integers..
The cosmetic problem was that it was calling the "enter syscall"
trace hook twice - this looks like a cut/paste error/typo.
notebooks where a powerfail condition (external power drop; battery
state low) is signalled by an NMI. Makes it beep instead of panicing.
Reviewed by: davidg
made a change to NFS that caused buffers at EOF to be variable size. This
had the undesired side-effect of breaking delayed writes on NFS. This
fixes it.
Submitted by: John Dyson
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
that call vnode_pager_alloc() so that a failure return can be dealt with.
This fixes a panic seen on NFS clients when a file being opened is deleted
on the server before the open completes.
syscall to allow applications linked against their libc's uname() to
work. Netscape 1.1N being a prime example, which prints:
"uname() failed. cant tell what system we're running on".
This change is a little ugly, but that's mainly because of the "interesting"
semantics of the BSDI extension.
Since ogetkerninfo() is only enabled by COMPAT_43, Netscape will only
be affected on kernels with that option (eg: "GENERIC")
Reviewed by: davidg
might not be handled by the same FS as the directory (e.g. special device
files)...so it must be special-cased. This bug is seen when doing
"ln /dev/console /dev/foo" or equivilent and first appeared after I fixed
the argument order of VOP_LINK. YUCK! There really needs to be a way of
specifying what vp to use in the VCALL; doing this could fix the strategy
and bwrite special-cases, too.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.
VOP_CLOSE() takes `F' (file) flags, not `IO' flags. At least that's
what close() passes. I previously fixed ttylclose() to check
FNONBLOCK instead of IO_NDELAY. This broke the call from vclean()
and cleaning of ptys sometimes deadlocked.
when syscons stops mapping the console to minor MAXCONS. There is
usually no corresponding device in /dev, and the correct device has
minor 0.
cons.c:
Initialize cn_tty properly, so that CPU_CONSDEV can work.
Comment about too many variants of the console tty pointer.
machdep.c:
Return device NODEV and not error EFAULT when there is no console device.
in the wrong place. Blank padding in the right place or zero padding
would be inconsistent with user mode.
Put case 'p' in alphabetical order.
Implement %p in sprintf() too. I'd like only a single, more complete
printf() core, perhaps one based on vsnprintf().
in machdep.c (it should use the global nmbclusters). Moved the calculation
of nmbclusters into conf/param.c (same place where nmbclusters has always
been assigned), and made the calculation include an extra amount based
on "maxusers". NMBCLUSTERS can still be overrided in the kernel config
file as always, but this change will make that generally unnecessary. This
fixes the "bug" reports from people who have misconfigured kernels seeing
the network "hang" when the mbuf cluster pool runs out.
Reviewed by: John Dyson
device.
v_numoutput wasn't incremented to match the b_iodone nesting. It's still
fishy that vwakeup() clears B_WRITEINPROG before biodone() has finished;
however, B_WRITEINPROG seems to be never used.
Submitted by: Bruce Evans
1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.
Reviewed by: David Greenman
Submitted by: John Dyson
These changes solve the problem in a general way by moving the
initialization out of the individual fs_mountroot's and into swaponvp().
Submitted by: Poul-Henning Kamp
when the single user shell was terminated. These changes disallow mounting
or R/W upgrading filesystems that are dirty unless "-f" (force) option
is used with mount. /etc/rc has been modified to abort the startup if
one or more non-nfs partitions fail to mount.
Reviewed by: Poul-Henning Kamp, Rod Grimes
require specific partitions be mentioned in the kernel config
file ("swap on foo" is now obsolete).
From Poul-Henning:
The visible effect is this:
As default, unless
options "NSWAPDEV=23"
is in your config, you will have four swap-devices.
You can swapon(2) any block device you feel like, it doesn't have
to be in the kernel config.
There is a performance/resource win available by getting the NSWAPDEV right
(but only if you have just one swap-device ??), but using that as default
would be too restrictive.
The invisible effect is that:
Swap-handling disappears from the $arch part of the kernel.
It gets a lot simpler (-145 lines) and cleaner.
Reviewed by: John Dyson, David Greenman
Submitted by: Poul-Henning Kamp, with minor changes by me.
is more representative of worst case situations of 4 files/directory. (If
that last sentence doesn't make any sense, I'm not surprised. It's rather
compilcated how this all fits together....).
This should fix a problem that Ed Hudson has been complaining about where
directories with lots of symlinks could cause excessive disk I/O.
there may even be LKMs.) Also, change the internal name of `unixdomain'
to `localdomain' since AF_LOCAL is now the preferred name of this family.
Declare netisr correctly and in the right place.
Reopen the bdev for the raw partition and not the cdev if only the bdev
was open.
Don't use a bogus limit for the number of partitions to possibly reopen
(bug found by Julian).
Add function dssize() to help fix wdsize() and sdsize(). The slice
layer knows more about (un)open partitions and partition sizes than
the driver layer.
in read() and write(). FNONBLOCK is valid in ioctl() and close().
The bug caused hung ptys when a process talked to itself using nonblocking
i/o and exited while the slave pty had output to flush. ttywait() was
called and hung. Signals didn't work because the process was exiting.
`comcontrol /dev/ttyp0 drainwait 1' worked to terminate the wait. This
shows that comcontrol is not limited to hardware control. It has no i386
or driver dependencies and doesn't belong in src/sbin/i386.
Bruce
initializing proc0's frame base, too, using cpu_set_init_frame(). It's
a kludge because that macro is intended to be used only for init, but
does what we want nonetheless.
if an invalid ioctl is done on /dev/klog. logioctl() needs to return
an errno instead of -1 on a failed ioctl.
Submitted by: Mike Pritchard <mpp@mpp.com>
some (hopefully) less offensive stupidity:
If we detect that a user has loaded a module that fails to initialize
itself correctly, panic. There really isn't a safe way to recover from
something like this; we can't know that the module is bad until after
the entry point is called, by which time it's too late to do anything
about it.
- Add $Id$ string.
- Fix comment ("we might *not* be able to unload the
module afterwards without panicking...")
- Get rid of variable 'j' that I used in name checking
for(;;) loop and use 'i' instead (I thought there'd be
a problem with this, but there isn't).
if_tun_mod, etc...) from crashing the system. These modules are useful,
but because they don't yet have proper load()/unload() functions,
they can lead to panics: if, for example, you load the if_ppp module,
any user can panic the system by running modstat.
You can also hang the system outright if you try to unload the PPP
module too.
Changes are as follows:
- Save the name passed to us during the RESERVE stage for name matching
(we can't load if_ppp_mod twice: we've have two ppp0's and two ppp1's,
which is beyond strange). This makes the lkmexists() cheks somewhat
redundant, but there's no way around it that I can see.
- If we call the module entry point and find that we have no lkm_any
structure in our 'private' section, create a fake one. This keeps
modstat happy. We mark such modules as LM_UNKNOWN.
- Don't allow LM_UNLOAD modules to be unloaded: it just ain't
possible. (Unless someone wants to write a pppunattach() function. :( )
- In lkmunreserve(), mark private.lkm_any as NULL so we don't get
confused later. I think this is bogus, but I can't prove it.
XXX: the name matching used to keep the user from loading two
instances of the same module can easily be defeated simply by
changing the module name or, in the case of the oddball modules,
simply by renaming the module files. I haven't found a nice simple
way to tell one module from another.
is necessary in order for panic+sync to work. Will also gloss over a panic
that Jordan was having with the install floppies that remains unexplainable.
2) Handle "bogus_page" a little better.
3) Set page protection to VM_PROT_NONE if the entire page has become !valid.
Submitted by: John Dyson (2&3), me (1).
through a temporary buffer instead of one character at a time. The old
method takes about 6 usec/char on a 486DX2/66. This is larger than than
the combined interrupt and PIO overhead for a 16550!
This change was first implemented in 1.1.5. It was rewritten for 2.1.
The clist access functions allow a simpler implementation at some cost
in correctness and speed. There needs to be an ungetc() function to
recover from EFAULT, and it wastes time to copy through a temporary
buffer.
Don't snoop on single characters that weren't read due to EFAULT.
Rewrite a snoop comment in my approximation to English.
Undo bogus exportation of ttnread().
Fixed remaining known bugs in the buffer IO and VM system.
vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).
(various)
Replaced calls to clrbuf() with calls to an optimized routine
call vfs_bio_clrbuf().
(various FS sync)
Sync out modified vnode_pager backed pages.
ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.
vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.
vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().
vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.
vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.
drivers to protect DDB from being invoked while the console is in
process-controlled (i.e., graphics) mode.
Implement the logic to use this hook from within pcvt. (I'm sure
Søren will do the syscons part RSN).
I've still got one occasion where the system stalled, but my attempts
to trigger the situation artificially resulted int the expected
behaviour. It's hard to track bugs without the console and DDB
available. :-/
Added a new type to uiomove - "UIO_NOCOPY" which causes it to update
pointers and counts, but doesn't do any data copying. This is needed
for upcoming changes to the way that the vnode pager does its page
outs.
Added a new hash init function call "phashinit" that allocates and
initializes a prime number sized hash table.
vfs_cache.c:
Changed hashing algorithm to use the remainder of dividing by a prime
number to improve the distribution characteristcs. Uses new phashinit
function in kern_subr.c.
#179). The fix implements a ttyhalfclose() (sort of), resetting the
session and pgrp pointers when the physical device is about to be
closed.
Suggested by: bde