PCATCH, to indicate that thread shall not be stopped upon receipt of
SIGSTOP until it reaches the kernel->usermode boundary.
Also change thread_single(SINGLE_NO_EXIT) to only stop threads at
the user boundary unconditionally.
Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
process that still need to be suspended or exited from thread_single
into the new function calc_remaining().
Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
dependent memory attributes:
Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.
Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.
Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:
kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.
vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.
Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.
Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.
In collaboration with: jhb
Approved by: re (kib)
correctly checks for reclaimed vnode, possibly calling VOP_REVOKE for
such vnode. If the terminal is already revoked, or devfs mount was
forcibly unmounted, the revocation of doomed ctty vnode causes panic.
Reported and tested by: lstewart
Approved by: re (kensmith)
MFC after: 2 weeks
if the new file mode is the same as it was before; however, this
optimization must be disabled for filesystems that support NFSv4 ACLs.
Chmod uses pathconf(2) to determine whether this is the case - however,
pathconf(2) always follows symbolic links, while the 'chmod -h' doesn't.
This change adds lpathconf(3) to make it possible to solve that problem
in a clean way.
Reviewed by: rwatson (earlier version)
Approved by: re (kib)
As pointed out, POLLHUP should be generated, even if it hasn't been
specified on input. It is also not allowed to return both POLLOUT and
POLLHUP at the same time.
Reported by: jilles
Approved by: re (kib)
under VM environments, it's too slow for FreeBSD to work
properly. For example, ping at 10hz pings about every 600ms
instead of about every second.
Approved by: re (kib)
when all writers, observed by reader, exited. Use writer generation
counter for fifo, and store the snapshot of the fifo generation in the
f_seqcount field of struct file, that is otherwise unused for fifos.
Set FreeBSD-undocumented POLLINIGNEOF flag only when file f_seqcount is
equal to fifo' fi_wgen, and revert r89376.
Fix POLLINIGNEOF for sockets and pipes, and return POLLHUP for them.
Note that the patch does not fix not returning POLLHUP for fifos.
PR: kern/94772
Submitted by: bde (original version)
Reviewed by: rwatson, jilles
Approved by: re (kensmith)
MFC after: 6 weeks (might be)
around the sequence that drop vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.
Tested by: pho
Reviewed by: jeff
Approved by: re (kensmith)
MFC after: 2 weeks
system calls:
- Centralize generation of argument tokens for VM addresses in a macro,
ADDR_TOKEN(), and properly encode 64-bit addresses in 64-bit arguments.
- Fix up argument numbers across a large number of syscalls so that they
match the numeric argument into the system call.
- Don't audit the address argument to ioctl(2) or ptrace(2), but do keep
generating tokens for mmap(2), minherit(2), since they relate to passing
object access across execve(2).
Approved by: re (audit argument blanket)
Obtained from: TrustedBSD Project
MFC after: 1 week
using the size of the descriptor array.
- A lock is not needed to fetch fd_lastfile. The results are stale the
instant it is dropped.
- Use a private mutex pool for select since the pool mutex is not used
as a leaf.
- Fetch the si_mtx pointer first before resorting to hashing to compute
the mutex address.
Reviewed by: McKusick
Approved by: re (kib)
- For x86, change the interrupt source method to assign an interrupt source
to a specific CPU to return an error value instead of void, thus allowing
it to fail.
- If moving an interrupt to a CPU fails due to a lack of IDT vectors in the
destination CPU, fail the request with ENOSPC rather than panicing.
- For MSI interrupts on x86 (but not MSI-X), only allow cpuset to be used
on the first interrupt in a group. Moving the first interrupt in a group
moves the entire group.
- Use the icu_lock to protect intr_next_cpu() on x86 instead of the
intr_table_lock to fix a LOR introduced in the last set of MSI changes.
- Add a new privilege PRIV_SCHED_CPUSET_INTR for using cpuset with
interrupts. Previously, binding an interrupt to a CPU only performed a
privilege check if the interrupt had an interrupt thread. Interrupts
without a thread could be bound by non-root users as a result.
- If an interrupt event's assign_cpu method fails, then restore the original
cpuset mask for the associated interrupt thread.
Approved by: re (kib)
rather than as paths, which would lead to them being treated as relative
pathnames and hence confusingly converted into absolute pathnames.
Capture flags to unmount(2) via an argument token.
Approved by: re (audit argument blanket)
MFC after: 3 days
easily determine how much space is left in the send queue; they do not
need to know the send queue size.
NetBSD revisions:
sys_socket.c r1.41, 1.42
filio.h r1.9
Obtained from: NetBSD
Approved by: re (kensmith)
loading hwpmc, but calculate at runtime and allocate the necessary space.
Also the current logic is wrong as it can lead to an endless loop.
Sponsored by: Sandvine Incorporated
Reported by: Ryan Stone <rstone at sandvine dot com>
Tested by: Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
Approved by: re (kib)
Introduce the new flag KNF_NOKQLOCK to allow event callers to be called
without KQ_LOCK mtx held.
- Modify VFS knote calls to always use KNF_NOKQLOCK flag. This is required
for ZFS as its getattr implementation may sleep.
Approved by: re (rwatson)
Reviewed by: kib
MFC after: 2 weeks
inbound data waiting on a filedescriptor, such as a pipe or a socket,
for instance by using select(2), poll(2), kqueue(2), ioctl(FIONREAD)
etc.
But we have no way of finding out if written data have yet to be
disposed of, for instance, transmitted (and ack'ed!) to some remote
host, or read by the applicantion at the far end of the pipe.
The closest we get, is calling shutdown(2) on a TCP socket in
non-blocking mode, but this has the undesirable sideeffect of
preventing future communication.
Add a complement to FIONREAD, called FIONWRITE, which returns the
number of bytes not yet properly disposed of. Implement it for
all sockets.
Background:
A HTTP server will want to time out connections, if no new request
arrives within a certain period after the last transmitted response
has actually been sent (and ack'ed).
For a busy HTTP server, this timeout can be subsecond duration.
In order to signal to a load-balancer that the connection is truly
dead, TCP_RST will be the preferred method, as this avoids the need
for a RTT delay for FIN handshaking, with a client which, surprisingly
often, no longer at the remote IP number.
If a slow, distant client is being served a response which is big
enough to fill the window, but small enough to fit in the socket
buffer, the write(2) call will return immediately.
If the session timeout is armed at that time, all bytes in the
response may not have been transmitted by the time it fires.
FIONWRITE allows the timeout to check that no data is outstanding
on the connection, before it TCP_RST's it.
Input & Idea from: rwatson
Approved by: re (kib)
in the case of a file system with a block size that is less than the page
size, cluster_rbuild() looks at too many of the page's valid bits.
Consequently, it may terminate prematurely, resulting in poor performance.
Reported by: bde
Reviewed by: tegge
Approved by: re (kib)
specific macros for each audit argument type. This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).
In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.
Suggested by: brooks
Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 week
required by video card drivers. Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures. In addition, this changes adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig(). These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.
In collaboration with: jhb
Specifically, if a non-root user attempts to bind an interrupt the request
will now report failure with EPERM rather than silently failing with a
successful return code.
MFC after: 1 week
offset of the stat is not known until link time so we must emit a
function to call SYSCTL_ADD_PROC rather than using SYSCTL_PROC
directly.
- Eliminate the atomic from SCHED_STAT_INC now that it's using per-cpu
variables. Sched stats are always incremented while we're holding
a spinlock so no further protection is required.
Reviewed by: sam
additional privileges as well as not restricting the type of
sockets a user can open.
Note: the VIMAGE/vnet fetaure of of jails is still considered
experimental and cannot guarantee that privileged users
can be kept imprisoned if enabled.
Reviewed by: rwatson
Approved by: bz (mentor)
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
(this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
int. This removes the need for the shm_bsegsz member in struct
shmid_kernel and should allow for complete support of SYSV SHM regions
>= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
short.
- The shm_internal member of struct shmid_ds is now gone. The internal
VM object pointer for SHM regions has been moved into struct
shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
now marked COMPAT7 and new versions of those system calls which support
the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc. The
FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
symbol versions has been added to libc. Version tags are added to
system calls by adding an appropriate __sym_compat() entry to
src/lib/libc/incldue/compat.h. [1]
PR: kern/16195 kern/113218 bin/129855
Reviewed by: arch@, rwatson
Discussed with: kan, kib [1]
them under COMPAT_FREEBSD[4567]. Starting with FreeBSD 5.0 the SYSV IPC
API was implemented via direct system calls (e.g. msgctl(), msgget(), etc.)
rather than indirecting through the var-args *sys() system calls. The
shmsys() system call was already effectively deprecated for all but
COMPAT_FREEBSD4 already as its implementation for the !COMPAT_FREEBSD4 case
was to simply invoke nosys().