VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field.
However, fcntl allows you to set the seqcount in bytes to any nonnegative
31-bit value. The result can be a 16-bit overflow, which will be
sign-extended in functions like ffs_read. Fix this by sanitizing the
argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather
than INT16_MAX.
Also, fifos have overloaded the f_seqcount field for a completely different
purpose ever since r238936. Formalize that by using a union type.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20710
This ioctl exposes VOP_BMAP information to userland. It can be used by
programs like fragmentation analyzers and optimized cp implementations. But
I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name
distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD
and Linux. FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block
number instead of 32-bit, and it also returns runp and runb.
Reviewed by: mckusick
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20705
wakeup_one() and underlying sleepq_signal() spend additional time trying
to be fair, waking thread with highest priority, sleeping longest time.
But in case of taskqueue there are many absolutely identical threads, and
any fairness between them is quite pointless. It makes even worse, since
round-robin wakeups not only make previous CPU affinity in scheduler quite
useless, but also hide from user chance to see CPU bottlenecks, when
sequential workload with one request at a time looks evenly distributed
between multiple threads.
This change adds new SLEEPQ_UNFAIR flag to sleepq_signal(), making it wakeup
thread that went to sleep last, but no longer in context switch (to avoid
immediate spinning on the thread lock). On top of that new wakeup_any()
function is added, equivalent to wakeup_one(), but setting the flag.
On top of that taskqueue(9) is switchied to wakeup_any() to wakeup its
threads.
As result, on 72-core Xeon v4 machine sequential ZFS write to 12 ZVOLs
with 16KB block size spend 34% less time in wakeup_any() and descendants
then it was spending in wakeup_one(), and total write throughput increased
by ~10% with the same as before CPU usage.
Reviewed by: markj, mmacy
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D20669
When it comes to megabytes of text, difference between sbuf_printf() and
sbuf_cat() becomes substantial.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
On large systems those sysctls may generate megabytes of output. Before
this change sbuf(9) code was resizing buffer by 4KB each time many times,
generating tons of TLB shootdowns. Unfortunately in this case existing
sbuf_new_for_sysctl() mechanism, supposed to help with this issue, is not
applicable, since all the sbuf writes are done in different kernel thread.
This change improves situation in two ways:
- on first sysctl call, not providing any output buffer, it sets special
sbuf drain function, just counting the data and so not needing big buffer;
- on second sysctl call it uses as initial buffer size value saved on
previous call, so that in most cases there will be no reallocation, unless
GEOM topology changed significantly.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
rename the source to gsb_crc32.c.
This is a prerequisite of unifying kernel zlib instances.
PR: 229763
Submitted by: Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision: https://reviews.freebsd.org/D20193
Normally td_runtime is updated on context switch, but there are some kernel
threads that due to high absolute priority may run for many seconds without
context switches (yes, that is bad, but that is true), which means their
td_runtime was not updated all that time, that made them invisible for top
other then as some general CPU usage.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
asserted. Some development boards for example will reset on DTR,
and some radio interfaces will transmit on RTS.
This patch allows "stty -f /dev/ttyu9.init -rtsdtr" to prevent
RTS and DTR from being asserted on open(), allowing these devices
to be used without problems.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D20031
New sysctl/tunables can now set the interval (in seconds) between
rate-limited crypto warnings. The new sysctls are:
- kern.cryptodev_warn_interval for /dev/crypto
- net.inet.ipsec.crypto_warn_interval for IPsec
- kern.kgssapi_warn_interval for KGSSAPI
Reviewed by: cem
MFC after: 1 month
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D20555
Apply a linker script when linking i386 kernel modules to apply padding
to a set_pcpu or set_vnet section. The padding value is kind-of random
and is used to catch modules not compiled with the linker-script, so
possibly still having problems leading to kernel panics.
This is needed as the code generated on certain architectures for
non-simple-types, e.g., an array can generate an absolute relocation
on the edge (just outside) the section and thus will not be properly
relocated. Adding the padding to the end of the section will ensure
that even absolute relocations of complex types will be inside the
section, if they are the last object in there and hence relocation will
work properly and avoid panics such as observed with carp.ko or ipsec.ko.
There is a rather lengthy discussion of various options to apply in
the mentioned PRs and their depends/blocks, and the review.
There seems no best solution working across multiple toolchains and
multiple version of them, so I took the liberty of taking one,
as currently our users (and our CI system) are hitting this on
just i386 and we need some solution. I wish we would have a proper
fix rather than another "hack".
Also backout r340009 which manually, temporarily fixed CARP before 12.0-R
"by chance" after a lead-up of various other link-elf.c and related fixes.
PR: 230857,238012
With suggestions from: arichardson (originally last year)
Tested by: lwhsu
Event: Waterloo Hackathon 2019
Reported by: lwhsu, olivier
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D17512
Add the debug.try_reclaim_vnode sysctl. When a pathname is written to it, it
will be reclaimed, as long as it isn't already or doomed. The purpose is to
gain test coverage for vnode reclamation, which is otherwise hard to
achieve.
Add the debug.ftry_reclaim_vnode sysctl. It does the same thing, except
that its argument is a file descriptor instead of a pathname.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20519
After our migration (of certain architectures) to lld the kernel is built
with a unique build-ID. Make it available via a sysctl and uname(1) to
allow the user to identify their running kernel.
Submitted by: Ali Mashtizadeh <ali_mashtizadeh.com>
MFC after: 2 weeks
Relnotes: Yes
Event: Waterloo Hackathon 2019
Differential Revision: https://reviews.freebsd.org/D20326
We already handled the case when symstrindex < 0 at line 680.
Reported by: danfe using PVS-studio
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
syscalls.conf is included using "." which per the Open Group:
If file does not contain a <slash>, the shell shall use the search
path specified by PATH to find the directory containing file.
POSIX shells don't fall back to the current working directory.
Submitted by: Nathaniel Wesley Filardo <nwf20@cl.cam.ac.uk>
Reviewed by: bdrewery
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D20476
The order is correct, it is nullfs vnode interlock -> lower vnode
interlock. vop_stdadd_writecount() is called from nullfs
VOP_ADD_WRITECOUNT() and both take interlocks.
Requested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
struct xucred. Do not bump XUCRED_VERSION as struct layout is not changed.
PR: 215202
Reviewed by: tijl
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20415
PT_ATTACH was consumed.
In particular, do not clear TDP_FSTP in ptracestop() if td_wchan is
non-NULL. Leave it to sleepq_catch_signal() to clear and convert zero
return code to EINTR.
Otherwise, per submitter report, if the PT_ATTACH SIGSTOP was
delivered right after the thread was added to the sleepqueue but not
yet really sleep, and cursig() caused debugger attach, the thread
sleeps instead of returning to the userspace boundary with EINTR.
PR: 231445
Reported by: Efi Weiss <valmarelox@gmail.com>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D20381
This checks the alignment of a given pointer is sufficient for the
requested alignment asked for. This fixes the build with a recent
llvm/clang.
Sponsored by: DARPA, AFRL
Check the CTF magic number in big endian platforms. This lets DTrace FBT
handle types correctly on these platforms.
Submitted by: Brandon Bergren
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20413
The ixl.4 manual page has documented that the threshold falsely detects
interrupt storms on 40Gbit NICs as long ago as 2015, and we have seen
similar false positives with the ioat(4) DMA device (which can push GB/s).
For example, synthetic load can be generated with tools/tools/ioat
'ioatcontrol 0 200 8192 1 1000' (allocate 200x8kB buffers, generate an
interrupt for each one, and do this for 1000 milliseconds). With
storm-detection disabled, the Broadwell-EP version of this device is capable
of generating ~350k real interrupts per second.
The following historical context comes from jhb@: Originally, the threshold
worked around incorrect routing of PCI INTx interrupts on single-CPU systems
which would end up in a hard hang during boot. Since the threshold was
added, our PCI interrupt routing was improved, most PCI interrupts use
edge-triggered MSI instead of level-triggered INTx, and typical systems have
multiple CPUs available to service interrupts.
On the off chance that the threshold is useful in the future, it remains
available as a tunable and sysctl.
Reviewed by: jhb
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20401
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it. Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20377
Add Chacha20 mode to Encrypted Kernel Crash Dumps.
Chacha20 does not require messages to be multiples of block size, so it is
valid to use the cipher on non-block-sized messages without the explicit
padding AES-CBC would require. Therefore, allow use with simultaneous dump
compression. (Continue to disallow use of AES-CBC EKCD with compression.)
dumpon(8) gains a -C cipher flag to select between chacha and aes-cbc.
It defaults to chacha if no -C option is provided. The man page documents this
behavior.
Relnotes: sure
Sponsored by: Dell EMC Isilon
The sysctl provides the listing on named linked posix shared memory
segments existing in the system.
Reuse shm_fill_kinfo() for filling individual struct kinfo_file.
Remove unneeded lock around reading of shmfd->shm_mode.
Reviewed by: jilles, tmunro
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20258
Unless there are transient references to the object, the ref count is
equal to the number of the shared memory segment mappings plus one.
Reviewed by: jilles, tmunro
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20258
"." and ".." names are not maintained in the mqueuefs dirent datastructure and
cannot be opened as mqueues. Creating or removing them is invalid; return
EINVAL instead of crashing.
PR: 236836
Submitted by: Torbjørn Birch Moltu <t.b.moltu AT lyse.net>
Discussed with: jilles (earlier version)
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.
Like r348026, these CUs did not show up in tinderbox as missing the include.
Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
NDFREE() calculates unlock_dvp after ndp->ni_vp is unlocked and zeroed
out. This makes the comparision of ni_dvp with ni_vp always fail.
Move the calculation of unlock_dvp right after unlock_vp, so that the
code sees correct ni_vp value.
Reproduced by
chdir("/usr");
open("/..", O_BENEATH | O_RDONLY);
Reported by: syzkaller
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20304
are able to determine some virtual machines, but the vm_guest variable was
still only being set to VM_GUEST_VM.
Since we do know what some of them specifically are, we can set vm_guest
appropriately.
Also, if we see the CPUID has the HV flag, but we were unable to find a
definitive vendor in the Hypervisor CPUID Information Leaf, fall back to
the older detection methods, as they may be able to determine a specific
HV type.
Add VM_GUEST_PARALLELS value to VM_GUEST for Parallels.
Approved by: cem
Differential Revision: https://reviews.freebsd.org/D20305
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
This is described in the vmem paper: "directs vmem to use the next free
segment after the one previously allocated." The implementation adds a
new boundary tag type, M_CURSOR, which is linked into the segment list
and precedes the segment following the previous M_NEXTFIT allocation.
The cursor is used to locate the next free segment satisfying the
allocation constraints.
This implementation isn't O(1) since busy tags aren't coalesced, and we
may potentially scan the entire segment list during an M_NEXTFIT
allocation.
Reviewed by: alc
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D17226
type, use a table to make it easier to add more in the future, if needed.
Add VirtualBox detection to the table ("VBoxVBoxVBox" is the hypervisor
vendor string to look for.) Also add VM_GUEST_VBOX to the VM_GUEST
enumeration to indicate VirtualBox.
Save the CPUID base for the hypervisor entry that we detected. Driver code
may need to know about it in order to obtain additional CPUID features.
Approved by: bryanv, jhb
Differential Revision: https://reviews.freebsd.org/D16305
For machines having cmpxcgh16b instruction, i.e. everything but very
early Athlons, provide lockless implementation of delayed
invalidation.
The implementation maintains lock-less single-linked list with the
trick from the T.L. Harris article about volatile mark of the elements
being removed. Double-CAS is used to atomically update both link and
generation. New thread starting DI appends itself to the end of the
queue, setting the generation to the generation of the last element
+1. On DI finish, thread donates its generation to the previous
element. The generation of the fake head of the list is the last
passed DI generation. Basically, the implementation is a queued
spinlock but without spinlock.
Many thanks both to Peter Holm and Mark Johnson for keeping with me
while I produced intermediate versions of the patch.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
MFC note: td_md.md_invl_gen should go to the end of struct thread
Differential revision: https://reviews.freebsd.org/D19630
Code walks the list of contested turnstiles to calculate the priority
to unlend.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
We have a better, more comprehensive knob for this now:
kern.random.initial_seeding.bypass_before_seeding=1.
Requested by: delphij
Sponsored by: Dell EMC Isilon
If bumping over the counter goes over the limit we have to decrement it back.
Previous code would only bump the counter after adding the entry (thus allowing
the cache to go over the limit).
Sponsored by: The FreeBSD Foundation