In dsl_dataset_hold_obj() we used zap_contains(.., DS_FIELD_LARGE_BLOCKS)
to determine whether the extensible (zapifyed) dataset have large blocks.
The code expects the result be either 0 (found) or ENOENT (not found),
however reused the variable 'err' which later code expects to be 0.
Fix this by adopting similar code construct that is used later for
DS_FIELD_BOOKMARK_NAMES, which uses a temporary variable zaperr to catch
errors from zap_* rountines.
Reported by: Peter J. Creath (on FreeNAS; FreeNAS bug #6848)
Illumos issue: 5393 spurious failures from dsl_dataset_hold_obj()
Reviewed by: mahrens
Sponsored by: iXsystems, Inc.
X-MFC with: r274337
It is implemented for LUNs backed by ZVOLs in "dev" mode and files.
GEOM has no such API, so for LUNs backed by raw devices all LBAs will
be reported as mapped/unknown.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
In file included from cddl/contrib/opensolaris/common/ctf/ctf_create.c:31:
In file included from sys/cddl/contrib/opensolaris/uts/common/sys/sysmacros.h:34:
sys/cddl/contrib/opensolaris/uts/common/sys/isa_defs.h:334:9: warning: '_ILP32' macro redefined [-Wmacro-redefined]
#define _ILP32
^
<built-in>:26:9: note: previous definition is here
#define _ILP32 1
^
1 warning generated.
This is because clang 3.5.0 started predefining _ILP32 and __ILP32__ for
the i386 arch. (Earlier versions already predefined _LP64 and __LP64__
for the x86_64 arch.)
Reviewed by: emaste, avg, smh, delphij, markj
Differential Revision: https://reviews.freebsd.org/D1187
This rounding up was lost in a mismerge of illumos code.
See r268075 MFV r267565.
After that commit zio_compress_data() no longer performs any compressed
size adjustment, so it needs to be done externally. On FreeBSD we round
up the size using vdev_ashift rather than SPA_MINBLOCKSIZE so that 4KB
devices are properly supported.
Additionally, zero out the buffer tail only if compression succeeds.
The compression is considered successful if the size of compressed
data after rounding up to account for the vdev ashift is less than the
original data size. It does not make sense to have the data compressed
if all the savings are lost to rounding up.
With the new zio_compress_data() it could have been possible that the
rounded compressed size would be greater than the original size and thus
we could zero beyond the allocated buffer if the zeroing code was kept
at the original place.
Discussed with: delphij, gibbs
MFC after: 2 weeks
X-MFC with: r274627
Size of physical ZIOs must never be implicitly adjusted, it's
a responsibility of a caller to make sure that such a ZIO has proper offset
and size.
Discussed with: delphij, gibbs
MFC after: 2 weeks
After r265152 TRIM requests are ZIO_TYPE_FREE instead of ZIO_TYPE_IOCTL
this meant file backed vdevs to attempted to process the ZIO as a write
causing a panic.
We now disable TRIM on file backed vdevs and ASSERT the ZIO types supported
by each vdev type to ensure we explicity support the ZIO type being
processed.
Also ensure that TRIM on init is not procesed for devices which declare they
didn't support TRIM via vdev_notrim.
PR: 195061, 194976, 191573
Sponsored by: Multiplay
have both kern_open() and kern_openat(); change the callers to use
kern_openat().
This removes one (sometimes two) levels of indirection and
consolidates arguments checks.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
ZFS large block support.
Please note that booting from datasets that have recordsize greater
than 128KB is not supported (but it's Okay to enable the feature on
the pool). This *may* remain unchanged because of memory constraint.
Limited safety belt is provided for mounted root filesystem but use
caution is advised.
Illumos issue:
5027 zfs large block support
MFC after: 1 month
We have observed that arc_release() can be called concurrently with a
l2arc in-flight write.
Also, we have observed that arc_hdr_destroy() can be called from
arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE flag in similar
circumstances.
Previously the l2arc headers would be freed while leaking their
associated compression buffers. Now the buffers are placed on
l2arc_free_on_write list for delayed freeing. This is similar to what
was already done to arc buffers that were supposed to be freed
concurrently with in-flight writes of those buffers.
In addition to fixing the discovered leaks this change also adds some
protective code to assert that a compression buffer associated with a
l2arc header is never leaked.
A new kstat l2_cdata_free_on_write is added. It keeps a count of
delayed compression buffer frees which previously would have been leaks.
Tested by: Vitalij Satanivskij <satan@ukr.net> et al
Requested by: many
MFC after: 2 weeks
Sponsored by: HybridCluster / ClusterHQ
For ZVOL-backed LUNs this allows to inform initiators if storage's used or
available spaces get above/below the configured thresholds.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
crash.sh script attached to FreeNAS bug 4109:
https://bugs.freenas.org/issues/4109
Three are in the snapshot layer:
a) AVG explains in his notes: https://wiki.freebsd.org/AvgVfsSolarisVsFreeBSD
"VOP_INACTIVE must not do any destructive actions to a vnode
and its filesystem node, nor invalidate them in any way."
gfs_vop_inactive and zfsctl_snapshot_inactive did just that. In
OpenSolaris VOP_INACTIVE is much closer to FreeBSD's VOP_RECLAIM.
Rename & move them to gfs_vop_reclaim and zfsctl_snapshot_reclaim
and merge in the requisite vnode_destroy from zfsctl_common_reclaim.
b) gfs_lookup_dot and various zfsctl functions do not honor the
FreeBSD VFS convention of only locking from the root downward. When
looking up ".." the convention is to drop the current leaf vnode lock before
acquiring the directory vnode and then subsequently re-acquiring the lock on the
leaf vnode. This fixes that in all the places that our exercised by crash.sh.
c) The snapshot may already be unmounted when the directory vnode is reclaimed.
Check for this case and return.
One in the common layer:
d) Callers of traverse expect the reference to the vnode passed in to be
maintained. Don't release it.
This last one may be an unclear contract. There may in fact be some callers that
do expect the reference to be dropped on success in addition to callers that
expect it to be released. In this case a further audit of the callers is needed
and a consensus on the correct behavior.
PR: 184677
Submitted by: kmacy
Reviewed by: delphij, will, avg
MFC after: 2 weeks
Sponsored by: iXsystems
* Use a constant to define the number of stack frames in a probe exception.
* Only allow function symbols in powerpc64 ('.' prefixed)
* Set the fbtp_roffset for return probes, so the correct dtrace_probe call is
made.
MFC after: 1 week
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.
Submitted by: kmacy
Tested by: make universe
When processing async destroys ZFS would leak space every txg timeout
(5 seconds by default), if no writes occurred, until the pool is totally
full. At this point it would be unfixable without a pool recreation.
In addition if the machine was rebooted with the pool in this situation
would fail to import on boot, hanging indefinitely, as the import process
requires the ability to write data to the pool. Any attempts to query the
pool status during the hung import would not return as the import holds
the pool lock.
The only way to import such a pool would be to specify -o readonly=on
to the zpool import.
zdb -bb <pool> can be used to check for "deferred free" size which is where
this lost space will be counted.
MFC after: 3 days
Sponsored by: Multiplay
Without this change the accounting of L2ARC usage would be wrong and
give 16EB free space because the number became negative and overflows.
Obtained from: FreeNAS (issue #6239)
MFC after: 2 weeks
controls how much fraction, 1/2^arc_shrink_shift, should be reclaimed
when there is memory pressure.
Submitted by: Richard Kojedzinszky <krichy at tvnetwork.hu>
MFC after: 2 weeks
Refactor the code and stop restore_object from creating two transactions.
Illumos issue:
3693 restore_object uses at least two transactions to restore an object
MFC after: 2 weeks
FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS
data lists (currently both are 16) while illumos has just a single list
of each kind.
headroom determines how much data is scanned on a single list
during each run of the l2arc feed thread.
Because FreeBSD has more lists we proportionally decrease the limit.
Reviewed by: Brendan Gregg (earlier version)
MFC after: 2 weeks
Sponsored by: HybridCluster
FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS
data lists (currently both are 16) while illumos has just a single list
of each kind.
L2ARC_WRITE_SIZE determines the default value of l2arc_write_max which
defines limits on how much data is scanned and written to a cache device
during each run of the l2arc feed thread. The limits are applied on the
per buffer list basis.
Because FreeBSD has more lists we proportionally reduce the limits.
Reviewed by: Brendan Gregg (earlier version)
MFC after: 2 weeks
Sponsored by: HybridCluster
Don't inherit flags other than DS_FLAG_CI_DATASET and DS_FLAG_INCONSISTENT
when cloning. This prevents DS_FLAG_DEFER_DESTROY being inherited from a
clone that is marked for deferred destroy, which causes snapshots of the
clone being destroyed when getting a hold or clone.
Illumos issue:
5150 zfs clone of a defer_destroy snapshot causes strangeness
MFC after: 1 week
Add tunable for number of metaslabs per vdev
(vfs.zfs.vdev.metaslabs_per_vdev). The default remains
at 200.
Illumos issue:
5161 add tunable for number of metaslabs per vdev
MFC after: 2 weeks
In arc_kmem_reap_now(), reap range_seg_cache too to reclaim memory in
response of memory pressure.
Illumos issue:
5163 arc should reap range_seg_cache
MFC after: 1 week
Make space_map_truncate() always do space_map_reallocate(). Without
this, setting space_map_max_blksz would cause panic for existing pool,
as dmu_objset_set_blocksize would fail if the object have multiple blocks.
Illumos issues:
5164 space_map_max_blksz causes panic, does not work
5165 zdb fails assertion when run on pool with recently-enabled
spacemap_histogram feature
MFC after: 2 weeks
Remove previously added kmem methods in favour of defines which
allow diff minimisation between upstream code base.
Rebalance ARC free target to be vm_pageout_wakeup_thresh by default
which eliminates issue where ARC gets minimised instead of balancing
with VM pageout. The restores the target point prior to r270759.
Bring in missing upstream only changes which move unused code to
further eliminate code differences.
Add additional DTRACE probe to aid monitoring of ARC behaviour.
Enable upstream i386 code paths on platforms which don't define
UMA_MD_SMALL_ALLOC.
Fix mixture of byte an page values in arc_memory_throttle i386 code
path value assignment of available_memory.
PR: 187594
Review: D702
Reviewed by: avg
MFC after: 1 week
X-MFC-With: r270759 & r270861
Sponsored by: Multiplay
When performing snapshot renames we could deadlock due to the locking
in zvol_rename_minors. In order to avoid this use the same workaround
as zvol_open in zvol_rename_minors.
Add missing zvol_rename_minors to dsl_dataset_promote_sync.
Protect against invalid index into zv_name in zvol_remove_minors.
Replace zvol_remove_minor calls with zvol_remove_minors to ensure
any potential children are also renamed.
Don't fail zvol_create_minors if zvol_create_minor returns EEXIST.
Restore the valid pool check in zfs_ioc_destroy_snaps to ensure we
don't call zvol_remove_minors when zfs_unmount_snap fails.
PR: 193803
MFC after: 1 week
Sponsored by: Multiplay
This fix addresses only issues with the pynfs reports, none of these
issues are know to create problems for extant real clients.
Submitted by: Bart Hsiao <bart.hsiao@gmail.com>
Reworked by: myself
Reviewed by: rmacklem
Approved by: rmacklem
Sponsored by: QNAP Systems Inc.
This is checked for in the zfs_snapshot_004_neg STF/ATF test (currently
still in projects/zfsd rather than head).
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:
- zfsvfs_create(): Check whether the objset name fits into
statfs.f_mntfromname, and return ENAMETOOLONG if not. Although
the filesystem can be unmounted via the umount(8) command, any
interface that relies on iterating on statfs (e.g. libzfs) will
fail to find the filesystem by its objset name, and thus assume
it's not mounted. This causes "zfs unmount", "zfs destroy",
etc. to fail on these filesystems, whether or not -f is passed.
MFC after: 1 month
Sponsored by: Spectra Logic
MFSpectraBSD: 974872 on 2013/08/09
This is primarily only of interest to ZFS developers, but it makes it
easier to get additional debugging.
Submitted by: gibbs
MFC after: 1 month
Sponsored by: Spectra Logic
MFSpectraBSD: 517074 on 2011/12/15 (by will), 662343 on 2013/03/20 (by gibbs)
If bpobj_space() returned non-zero here, the sublist would have been
left open, along with the bonus buffer hold it requires. This call
does not invoke any calls to bpobj_close() itself.
This bug doesn't have any known vector, but was found on inspection.
MFC after: 1 week
Sponsored by: Spectra Logic
Affects: All ZFS versions starting 21 May 2010 (illumos cde58dbc)
MFSpectraBSD: r1050998 on 2014/03/26
Summary:
Fix the stack tracing for dtrace/powerpc by using the trapexit/asttrapexit
return address sentinels instead of checking within the kernel address space.
As part of this, I had to add new inline functions. FBT traces the kernel, so
we have to have special case handling for this, since a trap will create a full
new trap frame, and there's no way to pass around the 'real' stack. I handle
this by special-casing 'aframes == 0' with the trap frame. If aframes counts
out to the trap frame, then assume we're looking for the full kernel trap frame,
so switch to the real stack pointer.
Test Plan: Tested on powerpc64
Reviewers: rpaulo, markj, nwhitehorn
Reviewed By: markj, nwhitehorn
Differential Revision: https://reviews.freebsd.org/D788
MFC after: 3 week
Relnotes: Yes
* vfs.zfs.vdev.async_write_active_min_dirty_percent
* vfs.zfs.vdev.async_write_active_max_dirty_percent
Added validation of min / max for ZFS sysctl
* vfs.zfs.dirty_data_max_percent
MFC after: 3 days
Correctly report hole at end of file.
When asked to find a hole, the DMU sees that there are no holes in the
object, and returns ESRCH. The ZPL interprets this as "no holes before
the end of the file", and therefore inserts the "virtual hole" at the
end of the file. Because DMU and ZPL have different ideas of where the
end of an object/file is, we will end up returning the end of file,
which is generally larger, instead of returning the end of object.
The fix is to handle the "virtual hole" in the DMU. If no hole is found,
the DMU will return a hole at the end of the file, rather than an error.
Illumos issue:
5139 SEEK_HOLE failed to report a hole at end of file
MFC after: 1 week
In zil_claim, don't issue warning if we get EBUSY (inconsistent) when
opening an objset, instead, ignore it silently.
Illumos issue:
5140 message about "%recv could not be opened" is printed when booting after crash
MFC after: 1 week
Add a new tunable/sysctl, vfs.zfs.free_max_blocks, which can be used to
limit how many blocks can be free'ed before a new transaction group is
created. The default is no limit (infinite), but we should probably have
a lower default, e.g. 100,000.
With this limit, we can guard against the case where ZFS could run out of
memory when destroying large numbers of blocks in a single transaction
group, as the entire DDT needs to be brought into memory.
Illumos issue:
5138 add tunable for maximum number of blocks freed in one txg
MFC after: 2 weeks
Enforce 4K as smallest indirect block size (previously the smallest
indirect block size was 1K but that was never used).
This makes some space estimates more accurate and uses less memory
for some data structures.
Illumos issue:
5141 zfs minimum indirect block size is 4K
MFC after: 2 weeks
In dnode_sync(), do dnode_increase_indirection() before processing
the dn_next_nblkptr.
Illumos issue:
5117 space map reallocation can cause corruption
MFC after: 3 days
Also restore kmem_used() check for i386 as it has KVA limits that the raw
page counts above don't consider
PR: 187594
Reviewed by: peter
X-MFC-With: r270759
Review: D700
Sponsored by: Multiplay
Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).
This could lead large amounts of unused RAM e.g. on a 192GB machine with
ARC the only major RAM consumer, 40GB of RAM would remain unused.
The old method has also been seen to result in extreme RAM usage under
certain loads, causing poor performance and stalls.
We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.
Credit to Karl Denninger for the original patch on which this update was
based.
PR: 191510 and 187594
Tested by: dteske
MFC after: 1 week
Relnotes: yes
Sponsored by: Multiplay
tracepoints would continue to generate traps, which would be ignored but
could consume noticeable amounts of CPU if, say, all functions in the kernel
were instrumented.
X-MFC-With: r270067
duplicating the entire implementation for both x86 and powerpc. This makes
it easier to add support for other architectures and has no functional
impact.
Phabric: D613
Reviewed by: gnn, jhibbits, rpaulo
Tested by: jhibbits (powerpc)
MFC after: 2 weeks
In vdev_get_stats, check that the vdev is not a hole before computing the
fragmentation. This fixes a panic when removing log device.
Illumos issue:
5049 panic when removing log device
Author: Alex Reece <alex@delphix.com>
MFC after: 2 weeks
Encode CPU's number by XOR'ing the CPU ID against the 64-bit cpu_ticks().
Reviewed by: mav, gibbs
Differential Revision: https://phabric.freebsd.org/D521
MFC after: 2 weeks
In dnode_children_t, use C99's "[]" idiom for declaring the variable
sized array dnc_children at the end of the structure.
This prevents the compiler from mistakenly optimizing away accesses
beyond the array's defined size.
Illumos issue:
5038 Remove "old-style" flexible array usage in ZFS.
Author: Justin T. Gibbs <justing@spectralogic.com>
MFC after: 2 weeks
This prevents recursion of vdev_queue_io_done as per r265321 but
using a different method as recommended on the openzfs list.
We now use zio_interrupt(zio) and return ZIO_PIPELINE_STOP instead
of returning ZIO_PIPELINE_CONTINUE from vdev_*_io_start methods.
zio_vdev_io_start now ASSERTS the that vdev_op_io_start returns
ZIO_PIPELINE_STOP to ensure future changes don't reintroduce
ZIO_PIPELINE_CONTINUE returns.
Cleanup flow in vdev_geom_io_start while I'm here.
Also fix some cases not using SET_ERROR(..)
MFC after: 2 weeks
X-MFC-With: r265321
nanouptime() instead of getnanouptime(). nanouptime(9) provides more
precise result at expense of being slower.
In r269223, gethrtime() is used as creation time of dbuf, which in turn
acts as portion of lookup key to maintain AVL invariant where there can
not be duplicate items. Before this change, gethrtime() have preferred
better execution time by sacrificing precision, which may lead to panic
on busy systems with:
panic: avl_find() succeeded inside avl_add()
Reported by: allanjude, mav
PR: kern/192284
MFC after: 11 days
X-MFC-with: r269223
Increase default ARC buf_hash_table size. When typical block size is small,
the hash table could be too small, which would lead to long hash chains and
limit performance for cached reads.
A new loader tunable, vfs.zfs.arc_average_blocksize, have been added which
allows users to override the default assumption of average (typical) block
size. Old default was 65536 (64 KiB) and new default is 8192 (8 KiB).
Illumos issue:
5034 ARC's buf_hash_table is too small
MFC after: 2 weeks
Change dn->dn_dbufs from linked list to AVL tree.
Illumos issues:
4873 zvol unmap calls can take a very long time for larger datasets
MFC after: 2 weeks
code behave more like it is on Solaris.
Reported by: avg
Reviewed by: avg, mav (but bugs are mine)
Differential Revision: https://phabric.freebsd.org/D457
postponing it to zfs_vget(). zfs_root() returned vnode with the
default value of v_hash, which caused inconsistent v_hash value when
root vnode was obtained from zfs_vget().
Nullfs allocated two upper vnodes for the root zfs vnode due to
different hashes, causing consistency problems.
Reported and tested by: Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Import Illumos changes to address the following Illumos issues:
4976 zfs should only avoid writing to a failing non-redundant
top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature
is enabled
4984 device selection should use fragmentation metric
MFC after: 2 weeks
been transferred from zio_compress_data to its caller. Therefore, passing
the 'minblocksize' down will be a no-op.
Eliminate the parameter to reduce diff against upstream.
MFC after: 2 weeks
read workload by splitting the single teardown rrw lock into
RRM_NUM_LOCKS (17) of them.
Read acquisitions are randomly distributed among these locks based
on curthread pointer. Write acquisitions are going to all the
locks, which for the usage of this type of lock should be rare.
Illumos issue:
5008 lock contention (rrw_exit) while running a read only load
MFC after: 2 weeks
When a sync task is waiting for a txg to complete, we should hurry it along
by increasing the number of outstanding async writes (i.e. make
vdev_queue_max_async_writes() return a larger number).
Illumos issue:
4753 increase number of outstanding async writes when sync task is waiting
MFC after: 2 weeks
Change the interaction between the DMU and ARC so that when the DMU is
shutting down an objset, we do not evict the data from the ARC. Instead
we simply coordinate the destruction of the DMU's data with the ARC.
The only case where we actually need to explicitly evict from the ARC is
when dbuf_rele_and_unlock() determines that the administrator has requested
that it not be kept in memory, via the primarycache/secondarycache properties.
In this case, we evict the data from the ARC by its blkptr_t, the same way
as when a block is freed we explicitly evict it from the ARC.
Illumos issue:
4631 zvol_get_stats triggering too many reads
MFC after: 2 weeks
Instead of asserting all zio's be properly aligned, only assert
on the logical ones.
Cap uberblocks at 8k, otherwise with ashift=17, there would be
only one uberblock.
This fixes a problem that zdb would trip assert on pools with
ashift >= 0xe (8k).
While there, also change the code so it only attempt to condense
space map unless the uncondensed size consumes greater than
zfs_metaslab_condense_block_threshold blocks.
Illumos issue:
4958 zdb trips assert on pools with ashift >= 0xe
MFC after: 2 weeks
Improve extreme rewind import.
When doing an "extreme rewind" import ("zpool import -XF"), we attempt
to verify all data in the pool, essentially scrubbing the entire pool.
The problem is that spa_load_verify_cb() issues an unbounded number of
concurrent scrub i/os. This can lead to all of memory being used for
these zio's, wedging the system. Like normal scrub, we need to put a
cap on the number of outstanding i/os, and have the traverse thread
block when we reach this cap.
For this purpose the cap can be very large (10,000) to optimize the
elevator algorithm. Three kernel tunables have been added:
vfs.zfs.spa_load_verify_maxinflight
vfs.zfs.spa_load_verify_metadata
vfs.zfs.spa_load_verify_data
The latter two tunables controls whether metadata and/or user data
when doing extreme rewind.
Make 'zpool import -T' imply scrub.
Make zpool import -T <txg> accept hexadecimal values for the txg when
prefixed with 0x.
Skip txg's for which there is no uberblock when doing extreme rewind.
Skip reading all user data twice by skipping prefetches when doing
extreme rewinds as we do not access via the ARC.
Illumos issues:
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
MFC after: 2 weeks
the upstream implementation and helps ensure that a trap induced by tracing
fbt::trap:entry is handled without recursively generating another trap.
This makes it possible to run most (but not all) of the DTrace tests under
common/safety/ without triggering a kernel panic.
Submitted by: Anton Rang <anton.rang@isilon.com> (original version)
Phabric: D95
Use reserved space for ZFS administrative commands.
We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at
least) for system use. Most ZPL operations, e.g. write(2), creat(2), will
fail with ENOSPC if we fall below this.
Certain operations, e.g. file removal and most administrative actions,
still permitted until half of the slop space is used. This would allow
users to use these operations to free up space in the pool when pool is
close to full but half of slop space is still free.
A very restricted set of operations that frees up space or change quota
are always permitted, regardless of the amount of free space.
MFC after: 2 weeks
Explicitly mark file removal transactions as "presumed to result
in a net free of space" so they will not fail with ENOSPC.
Illumos issue: 4950 files sometimes can't be removed from a full
filesystem
MFC after: 2 weeks
While it is possible to create and write file, modify its permissions, etc.
without ever doing sync, it looks odd that it is required for setting
extended file attributes on ZFS. UFS does not do sync there too.
Samba uses those extended attributes to store some its data, and doing it
synchronously by many times reduces file creation performance for systems
without SLOG device.
Reviewed by: delphij, jpaetzel, silence on fs@
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation
This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h
Discussed at: BSDcan
6679140 asymmetric alloc/dealloc activity can induce dynamic variable drops
6679193 dtrace_dynvar walker produces flood of dtrace_dynhash_sink
This finishes a set of merges from the older OpenSolaris releases.
Still the FreeBSD port has many differences that are difficult to
account for but that seems normal given that the kernels are different.
MFC after: 1 week
6851093 system drops to kmdb with anonymous dtrace probes + kmdb
This has no effect on FreeBSD (code is ifdef'ed) but is useful as
reference for future merges.
MFC after: 1 week
Don't check ZVOL_WCE flag, used in Solaris to control device "write cache".
It is not applicable on FreeBSD and by default set to "disable".
MFC after: 3 days
These have no effect on FreeBSD, in fact they are ifdef'ed,
but make easier future merges:
6699767 panic in spec_open()
6718877 crgetzoneid() use can cause problems when forking processes with
USDT providers in a non global zone
MFC after: 3 days
MFV r260708
4427 pid provider rejects probes with valid UTF-8 names
Use of u8_textprep.c broke the build on powerpc.
Reported by: bz, rpaulo and tinderbox.
Pointyhat: me
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
2915 DTrace in a zone should see "cpu", "curpsinfo", et al
2916 DTrace in a zone should be able to access fds[]
2917 DTrace in a zone should have limited provider access
MFC after: 2 weeks
4427 pid provider rejects probes with valid UTF-8 names
This make use of Solaris' u8_validate() which we happen to
use since r185029 for ZFS.
Illumos Revision: 1444d846b126463eb1059a572ff114d51f7562e5
Reference:
https://www.illumos.org/issues/4427
Obtained from: Illumos
MFC after: 2 weeks
usage from dtrace. The dtrace code already uses cdevpriv(9) since FreeBSD
8, so this change should be quite harmless.
Reviewed by: markj
Approved by: markj
MFC after: never
the 4 byte-aligned dtrace_invop_callsite can be found and that it
immediately follows the call to dtrace_invop(). Secondly, fix some pointer
arithmetic to account for differences between struct i386_frame and illumos'
struct frame. Finally, ensure that dtrace_getarg() isn't inlined. It works
by following a fixed number of frame pointers to the probe site, so inlining
breaks it.
MFC after: 3 weeks
first five for probes entered through a UD fault (i.e. FBT probes).
Specifically, handle the fact that dtrace_invop_callsite must be
16 byte-aligned and thus may not immediately follow the call to
dtrace_invop() in dtrace_invop_start(). Also fetch register arguments and
the stack pointer through a struct trapframe instead of a struct reg.
PR: 191260
Submitted by: luke.tw@gmail.com
MFC after: 3 weeks
defined. This ensures that the sdt:zfs:: probes appear despite the fact
the sdt provider is defined in the kernel rather than in zfs.ko.
Reported by: hiren
Tested by: hiren
MFC after: 2 weeks
selection. gethrtime() in our port updated with HZ rate, so unusable for
this specific purpose, completely draining benefit of multiple taskqueues.
MFC after: 2 weeks
Add a new zfs property, "redundant_metadata" which can have values "all" or
"most". The default will be "all", which is the current behavior. When set
to all, ZFS stores an extra copy of all metadata. If a single on-disk block
is corrupt, at worst a single block of user data (which is recordsize bytes
long) can be lost.
Setting to "most" will cause us to only store 1 copy of level-1 indirect
blocks of user data files. This can improve performance of random writes,
because less metadata has to be written. In practice, at worst about
100 blocks (of recordsize bytes each) of user data can be lost if a single
on-disk block is corrupt.
The exact behavior of which metadata blocks are stored redundantly may change
in future releases.
Illumos issue: 3835 zfs need not store 2 copies of all metadata
MFC after: 2 weeks
This includes decodes of recent Intel instructions, in particular
VT-x and related instructions. This allows the FBT provider to
locate the exit points of routines that include these new
instructions.
Illumos issues:
3414 Need a new word of AT_SUN_HWCAP bits
3415 Add isainfo support for f16c and rdrand
3416 Need disassembler support for rdrand and f16c
3413 isainfo -v overflows 80 columns
3417 mdb disassembler confuses rdtscp for invlpg
1518 dis should support AMD SVM/AMD-V/Pacifica instructions
1096 i386 disassembler should understand complex nops
1362 add kvmstat for monitoring of KVM statistics
1363 add vmregs[] variable to DTrace
1364 need disassembler support for VMX instructions
1365 mdb needs 16-bit disassembler support
This corresponds to Illumos-gate (github) version
eb23829ff08a873c612ac45d191d559394b4b408
Reviewed by: markj
MFC after: 1 week
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Original author: George Wilson
MFC after: 3 days
cause stack overflow for IO's which return ZIO_PIPELINE_CONTINUE
from the zio_vdev_io_start stage and hence don't suspend and complete
in a different thread.
This prevents double fault panic on slow machines running ZFS on
GELI volumes which return EOPNOTSUPP directly to BIO_DELETE requests.
MFC after: 1 month
X-MFC-With: r265152
The changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority
instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it
translated the required geom values. This reduces the amount of changes
required for FREE requests to be supported by the new IO scheduler. This
also eliminates the need for a specific DKIOCTRIM.
Also fixed FREE vdev child IO's from running ZIO_STAGE_VDEV_IO_DONE as part
of their schedule.
As the new IO scheduler can result in a request to execute one type of IO to
actually run a different type of IO it requires that zio_trim requests are
processed without holding the trim map lock (tm->tm_lock), as the free request
execute call may result in write request running hence triggering a
trim_map_write_start call, which takes the trim map lock and hence would result
in recused on no-recursive sx lock.
This is based off avg's original work, so credit to him.
MFC after: 1 month
returns ZIO_PIPELINE_CONTINUE from vdev_op_io_start to zio_execute resulting
in the wrong ZIO continuing its pipeline.
This is a serious issue which could cause data loss / corruption but appears
to be limited to error handling such as when vdev_readable(vd) returns false.
MFC after: 2 days
4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed
illumos/illumos@b6240e830b
MFC after: 2 weeks
sites and installing a hook at the kernel's trap handler. The fasttrap code
will emulate the overwritten instruction in some common cases, but otherwise
copies it out into some scratch space in the traced process' address space
and ensures that it's executed after returning from the trap.
In Solaris and illumos, this (per-thread) scratch space comes from some
reserved space in TLS, accessible via the fs segment register. This
approach is somewhat unappealing on FreeBSD since it would require some
modifications to rtld and jemalloc (for static TLS) to ensure that TLS is
executable, and would thus introduce dependencies on their implementation
details. I think it would also be impossible to safely trace static binaries
compiled without these modifications.
This change implements the functionality in a different way, by having
fasttrap map pages into the target process' address space on demand. Each
page is divided into 64-byte chunks for use by individual threads, and
fasttrap's process descriptor struct has been extended to keep track of
any scratch space allocated for the corresponding process.
With this change it's possible to trace all libc functions in a program,
e.g. with
pid$target:libc.so.*::entry {@[probefunc] = count();}
Previously this would generally cause the victim process to crash, as
tracing memcpy on amd64 requires the functionality described above.
Tested by: Prashanth Kumar <pra_udupi@yahoo.co.in> (earlier version)
MFC after: 6 weeks
this is due to a wrong dereference of a vnode when it's not locked and
can be (potentially) recycled. 'sdvp' cannot be locked on zfs_rename()
entry point because the VFS can't be sure that this scenario is
LOR-free (it might violate the parent->child lock acquisition rule).
Dereference 'tdvp' instead, which is already locked on entry, and access
'sdvp' fields only when it's safe, i.e. under ZFS_ENTER scope.
While at it, remove the usage of VOP_REALVP, as long as this is a NOP
on FreeBSD.
Discussed with: avg
Reviewed by: pjd
New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL
between three modes:
geom -- existing fully functional behavior (default);
dev -- exposing volumes only as raw disk device file in devfs;
none -- not exposing volumes outside ZFS.
The "dev" mode is less functional (can't be partitioned, mounted, etc),
but it is faster, and in some scenarios with untrusted consumers safer.
It can be useful for NAS, VM block storages, etc.
The "none" mode may be convenient for backup servers, etc. that don't
need direct data access.
Due to the way ZVOL is integrated with main ZFS code, those property
and sysctl are checked only during pool import and volume creation.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
3580 Want zvols to return volblocksize when queried for physical block size
illumos/illumos-gate@a0b60564df
It is irrelevant for FreeBSD, just reducing diff.
It is an adapted merge from the vendor branch of:
701 UNMAP support for COMSTAR (in part related to ZFS)
2130 zvol DKIOCFREE uses nested DMU transactions
4248 dtrace(1M) should never create DOF with empty probes section
4249 Only probes from the first DTrace object file will be included
Illumos Revision: 4a20ab41aadcb81c53e72fc65886e964e9add59
Reference:
https://www.illumos.org/issues/4248https://www.illumos.org/issues/4249
Obtained from: Illumos
MFC after: 1 month
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
MFC after: 3 weeks
so there is no reason to assert that we won't hit an error. Instead,
just return that error to caller and have the upper layer handle it.
Obtained from: FreeNAS
Reported by: rodrigc
Reviewed by: Matthew Ahrens
MFC after: 2 weeks
that this is done for SDT probes. This fixes the syscall/tst.args.d test,
which was failing because mmap(2)'s sixth argument wasn't available to the
probe.
MFC after: 2 weeks
illumos/illumos-gate@6fb4854bed
This fixes the tst.resize1.d and tst.resize2.d DTrace tests, which have
been failing since r261122 since they were causing dtrace(1) to attempt to
allocate and use large amounts of memory, and get killed by the OOM killer
as a result.
MFC after: 1 month
upstream DTrace code. It indicates that the kernel memory allocator need not
attempt to satisfy non-blocking allocations in low-memory conditions. This
has no direct equivalent in the malloc(9) flags, so it is just defined to 0
for now.
4574 get_clones_stat does not call zap_count in non-debug kernel
zap_count(...) is never called in non-DEBUG kernel.
As result "count" variable is always 0, and "goto fail" is always
reached. This means get_clones_stat function never makes up list
of clones for "clones" properties.
MFC after: 2 weeks
This is done to ensure that visited object IDs are always increasing.
Also, pass correct object ID to prefetch_dnode_metadata for
os_groupused_dnode.
Without this change we would hit an assert if traversal was paused on
a GROUPUSED object, which is unlikely but possible.
Apparently the same change was independently developed by Deplhix.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
MFC after: 10 days
Sponsored by: HybridCluster
The bug was introduced in r256956 "Improve ZFS N-way mirror read
performance".
The code in vdev_mirror_dva_select erroneously considers already
tried DVAs for the next attempt. Thus, it is possible that a failing DVA
would be retried forever.
As a secondary effect, if the attempts fail with checksum error, then
checksum error reports are accumulated until the original request
ultimately fails or succeeds. But because retrying is going on indefinitely
the cheksum reports accumulation will effectively be a memory leak.
Reviewed by: gibbs
MFC after: 13 days
Sponsored by: HybridCluster
If we prematurely free the name buffer and it gets quickly recycled,
then zfs_rename may see data from another lookup or even unmapped memory
via cn_nameptr.
MFC after: 6 days
Sponsored by: HybridCluster
The bug was introduced in r256956 "Improve ZFS N-way mirror read
performance".
The code in vdev_mirror_dva_select erroneously considers already
tried DVAs for the next attempt. Thus, it is possible that a failing DVA
would be retried forever.
As a secondary effect, if the attempts fail with checksum error, then
checksum error reports are accumulated until the original request
ultimately fails or succeeds. But because retrying is going on indefinitely
the cheksum reports accumulation will effectively be a memory leak.
Reviewed by: gibbs
MFC after: 13 days
Sponsored by: HybridCluster
Otherwise we could run into the following deadlock.
A thread has a transaction open and assigned to a transaction group.
That would prevent the transaction group from be quiesced and synced.
The thread is blocked in getnewvnode_reserve waiting for a vnode to
a be reclaimed. vnlru thread is blocked trying to enter ZFS VOP because
a filesystem is suspended by an ongoing rollback or receive operation.
In its turn the operation is waiting for the current transaction group
to be synced.
zfs_zget is always used outside of active transactions, but zfs_mknode
is always used in a transaction context. Thus, we hoist
getnewvnode_reserve from zfs_mknode to its callers.
While there, assert that ZFS always calls getnewvnode while having
a vnode reserved.
Reported by: adrian
Tested by: adrian
MFC after: 17 days
Sponsored by: HybridCluster
When we encounter an I/O error on a piece of metadata while deleting
a file system or zvol, we don't update the bptree_entry_phys_t's
bookmark. This would lead to double free of bp's which will lead to
space map corruption.
Instead of tolerating and allowing the corruption, panic immediately.
See Illumos #4390 for more details.
4391 panic system rather than corrupting pool if we hit bug 4390
Illumos/illumos-gate@8b36997aa2
MFC after: 2 weeks
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
illumos/illumos-gate@43466aae47
NOTE: Make sure the boot code is updated if a zpool upgrade is
done on boot zpool.
MFC after: 2 weeks
(Note: this change is not applicable to FreeBSD and the file
is not included in build. It's integrated for completeness).
4128 disks in zpools never go away when pulled
illumos/illumos-gate@39cddb10a3
MFC after: 2 weeks
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man page
and help message
illumos/illumos-gate@31d7e8fa33
FreeBSD porting notes: the kernel part of this changeset depends
on Solaris buf(9S) interfaces and are not really applicable for
our use. vdev_disk.c is patched as-is to reduce diverge from
upstream, but vdev_file.c is left intact.
MFC after: 2 weeks
aggregation IDs, as is done in the upstream illumos code. This still
requires some FreeBSD-specific code, as our vmem API is not identical to the
one in illumos.
Submitted by: Mike Ma <mikemandarine@gmail.com>
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool
features
illumos/illumos-gate@2acef22db7
MFC after: 2 weeks
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfa
4170 zhack leaves pool in ACTIVE state
illumos/illumos-gate@7fdd916c47
MFC after: 2 weeks
child process that were inherited from its parent. However, this should
not be done in the case of a vfork, since the fork handler ends up removing
the tracepoints from the shared vm space, and userland DTrace probes in the
parent will no longer fire as a result.
Now the child of a vfork may trigger userland DTrace probes enabled in its
parent, so modify the fasttrap probe handler to handle this case and handle
the child process in the same way that it would handle the traced process.
In particular, if once traces function foo() in a process that vforks, and
the child calls foo(), fasttrap will treat this call as having come from the
parent. This is the behaviour of the upstream code.
While here, add #ifdef guards to some code that isn't present upstream.
MFC after: 1 month
When a da or ada device dissappears, outstanding IOs fail with
ENXIO, not EIO. The check for EIO was probably copied from Illumos,
where that is indeed the correct errno.
Without this change, pulling a busy drive from a zpool would usually
turn it into UNAVAIL, even though pulling an idle drive would turn
it into REMOVED. With this change, it is REMOVED every time.
Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that
results in devd getting two resource.fs.zfs.removed events. The
comment said that the event had to be sent directly instead of
through the async removal thread because "the DE engine is using
this information to discard prevoius I/O errors". However, the fact
that vdev_geom_io_intr was never actually sending the events until
now, and that vdev_geom_orphan never sent them at all, and that
vdev_geom_orphan usually gets called about 2 seconds after the
actual removal, means that FreeBSD's userland can cope with a late
event just fine.
Approved by: ken (mentor)
Sponsored by: Spectra Logic Corporation
MFC after: 4 weeks
(64MB). Even if we would find one somehow, ZFS kernel code rejects such
devices. It is funny to look on attempts to read 4 256K vdev labels from
1.44MB floppy, though it is not very practical and quite slow.
The freebsd variant of dmu_write_pages is hidden under _KERNEL
to avoid needlessly pulling in vm_page_t declaration.
Besides, this function seems to be useless for ZFS userland counterpart.
MFC after: 15 days
ZFS never partially validates or invalidates a page.
The higher level VM should not do that either.
mappedread_sf correct operation depends on a page being either fully
valid or invalid.
MFC after: 7 days
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4104 ::spa_space no longer works
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
illumos/illumos-gate@0713e232b7
Note that some tunables have been removed and some new tunables have
been added. Of particular note, FreeBSD-only knob
vfs.zfs.space_map_last_hope is removed as it was a nop for some time now
(after one of the previous merges from upstream).
MFC after: 11 days
Sponsored by: HybridCluster [merge]
In illumos all ioctl zio-s are "global" at the moment. That is they act
on a whole disk, e.g. a cache flush command, and thus do not need either
offset or size parameters.
FreeBSD, on the other hand, has support for TRIM command and that
command requires proper offset and size parameters.
Without this fix all TRIM commands act on the start of any disk or
partition used by ZFS destroying any data there.
Pointyhat to: avg
Tested by: sbruno
MFC after: 3 days
X-MFC with: r258632
Sponsored by: HybridCluster
4080 zpool clear fails to clear pool
4081 need zfs_mg_noalloc_threshold
illumos/illumos-gate@22e30981d8
MFC after: 10 days
Sponsored by: HybridCluster [merge]
illumos/illumos-gate@69962b5647
Please note the following changes:
- zio_ioctl has lost its priority parameter and now TRIM is executed
with 'now' priority
- some knobs are gone and some new knobs are added; not all of them are
exposed as tunables / sysctls yet
MFC after: 10 days
Sponsored by: HybridCluster [merge]
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
illumos/illumos-gate@5aeb94743e
Essentially FreeBSD taskqueues already operate in a mode that
was added to Illumos with taskq_dispatch_ent change.
We even exposed the superior FreeBSD interface as taskq_dispatch_safe.
Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and
struct struct ostask to taskq_ent_t, so that code differences will be
minimal.
After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no
longer needed.
Note that this commit is not an MFV because the upstream change was not
individually committed to the vendor area.
MFC after: 8 days
The new macros are implemented in terms of SDT_PROBE_DEFINE and SDT_PROBE.
Probes defined in this way will appear under SDT provider named "sdt".
Parameter types are exposed via SDT_PROBE_ARGTYPE.
This is something that illumos does not have by default.
This kind of SDT probes is already present in ZFS code, so those probes
will now be available if KDTRACE_HOOKS options is enabled.
A potential future illumos compatibility enhancement is to encode a provider
name as a prefix in a probe name.
Reviewed by: markj
MFC after: 3 weeks
X-MFC after: r258622
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).
Reviewed by: markj
MFC after: 3 weeks
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].
[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip
This is a fix for a regression introduced in r246293.
vm_page_clear_dirty expects the range to have DEV_BSIZE aligned boundaries,
otherwise it extends them. Thus it can happen that the whole page is
marked clean while actually having some small dirty region(s).
This commit makes the range properly aligned and ensures that only
the clean data is marked as such.
It would interesting to evaluate how much benefit clearing with DEV_BSIZE
granularity produces. Perhaps instead we should clear the whole page
when it is completely overwritten and don't bother clearing any bits
if only a portion a page is written.
Reported by: George Hartzell <hartzell@alerce.com>,
Richard Todd <rmtodd@servalan.servalan.com>
Tested by: George Hartzell <hartzell@alerce.com>,
Reviewed by: kib
MFC after: 5 days
On machines with seveal CPUs and enough RAM this can easily twice improve
ZFS performance or twice reduce CPU usage. It was disabled three years
ago due to memory and KVA exhaustion reports, but our VM subsystem got
improved a lot since that time, hopefully enough to make another try.
Fix several problems that can cause panics on kldload and kldunload.
* kproc_create(fasttrap_pid_cleanup_cb, ...) gets called before
fasttrap_provs.fth_table gets allocated. This can lead to a panic
on module load, because fasttrap_pid_cleanup_cb references
fasttrap_provs.fth_table. Move kproc_create down after the point
that fasttrap_provs.fth_table gets allocated, and modify the error
handling accordingly.
* dtrace_fasttrap_{fork,exec,exit} weren't getting NULLed until
after fasttrap_provs.fth_table got freed. That caused panics on
module unload because fasttrap_exec_exit calls
fasttrap_provider_retire, which references
fasttrap_provs.fth_table. NULL those function pointers earlier.
* There wasn't any code to destroy the
fasttrap_{tpoints,provs,procs}.fth_table mutexes on module unload,
leading to a resource leak when WITNESS is enabled. Destroy those
mutexes during fasttrap_unload().
Reviewed by: markj
Approved by: ken (mentor)
Sponsored by: Spectra Logic
MFC after: 4 weeks
It was not being correctly copied into the kernel on FreeBSD, and as a
result, probes with multiple probe sites were not being created properly.
To fix this, change the ioctl definition so that the fasttrap ioctl handler
is responsible for copying in userland data.
Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in>
MFC after: 1 month
VM subsystem twice for every written record.
Tests on 24-core system show double reduction of CPU time spent on copying
single large well-compressed file.
This patch is not really needed on illumos (while not harm either) since
their memory allocator by default uses caching for all requests up to 128K.
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
emulation of the call instruction caused by reversing the uaddr and kaddr
arguments when copying data out to userland: the suword* functions take the
uaddr as the first argument whereas copyout(9) takes the kaddr as the first
argument. This also partially undoes the fixes from r257143.
Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in> (original version)
MFC after: 1 month
corresponding x86 trap type. Userland DTrace probes are currently handled
by the other fasttrap hooks (dtrace_pid_probe_ptr and
dtrace_return_probe_ptr).
Discussed with: rpaulo
* Remove the unused sdt cdev.
* Don't bother keeping a list of probes in struct sdt_prov; it's not needed.
* Invoke sdt_load and sdt_unload from the module handler instead of
registering separate SYSINITs.
* Keep to within 80 columns.
* Check for errors from dtrace_unregister().
the code was trying to save the stack pointer rather than the frame pointer,
and the arguments to copyout(9) were reversed, so nothing ended up being
saved on the stack. This would cause process crashes when the pid provider
was being used to instrument calls of a function starting with this
instruction.
Reported by: symbolics@gmx.com
Tested by: symbolics@gmx.com (earlier version)
MFC after: 2 weeks
information.
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.
The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.
Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.
This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.
The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.
With pre-fetch disabled (vfs.zfs.prefetch_disable=1):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s
With pre-fetch enabled (vfs.zfs.prefetch_disable=0):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s
In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.
The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc
These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487
Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay
zio_compress_data(..) when compressing l2arc buffers.
This eliminates l2arc I/O errors, which resulted in very poor performance on
vdev's configured with block size greater than 512b due to compression
assuming a smaller min block size than the vdev supports.
MFC after: 2 days
When safety requirements are met, it allows to avoid passing I/O requests
to GEOM g_up/g_down thread, executing them directly in the caller context.
That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid
several context switches per I/O.
The defined now safety requirements are:
- caller should not hold any locks and should be reenterable;
- callee should not depend on GEOM dual-threaded concurency semantics;
- on the way down, if request is unmapped while callee doesn't support it,
the context should be sleepable;
- kernel thread stack usage should be below 50%.
To keep compatibility with GEOM classes not meeting above requirements
new provider and consumer flags added:
- G_CF_DIRECT_SEND -- consumer code meets caller requirements (request);
- G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done);
- G_PF_DIRECT_SEND -- provider code meets caller requirements (done);
- G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request).
Capable GEOM class can set them, allowing direct dispatch in cases where
it is safe. If any of requirements are not met, request is queued to
g_up or g_down thread same as before.
Such GEOM classes were reviewed and updated to support direct dispatch:
CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE,
VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL,
MAP, FLASHMAP, etc).
To declare direct completion capability disk(9) KPI got new flag equivalent
to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION. da(4) and ada(4) disk
drivers got it set now thanks to earlier CAM locking work.
This change more then twice increases peak block storage performance on
systems with manu CPUs, together with earlier CAM locking changes reaching
more then 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to
256 user-level threads).
Sponsored by: iXsystems, Inc.
MFC after: 2 months
the register based on the argument index rather than relying on the fields
in struct reg to be in the right order. This assumption is incorrect on
FreeBSD and generally led to bogus argument values for the sixth argument
of PID and USDT probes; the first five are passed directly to dtrace_probe()
via the fasttrap trap handler and so were correctly handled.
MFC after: 2 weeks
null-separated strings to a single string. This can be used to print the
full arguments of a process using execsnoop (from the DTrace toolkit) or
with the following one-liner:
dtrace -n 'syscall::execve:return {trace(curpsinfo->pr_psargs);}'
Note that this relies on the process arguments being cached via the struct
proc, which means that it will not work for argvs longer than
kern.ps_arg_cache_limit. However, the following rather non-portable
script can be used to extract any argv at exec time:
fbt::kern_execve:entry
{
printf("%s", memstr(args[1]->begin_argv, ' ',
args[1]->begin_envv - args[1]->begin_argv));
}
The debug.dtrace.memstr_max sysctl limits the maximum argument size to
memstr(). Thanks to Brendan Gregg for helpful comments on freebsd-dtrace.
Tested by: Fabian Keil (earlier version)
MFC after: 2 weeks
illumos change 14172:be36a38bac3d:
illumos ZFS issues:
4082 zfs receive gets EFBIG from dmu_tx_hold_free()
Please note that this change is slightly different from r255257, because
it is merged out of order with other (larger) upstream changes.
PR: kern/182570
Reported by: Keith White <kwhite@site.uottawa.ca>
Tested by: Keith White <kwhite@site.uottawa.ca>
Approved by: re (glebius)
MFC after: 1 week
X-MFC after: r254753
handlers rather than in the dtrace device open/close methods. The current
approach can cause a panic if the device is closed which the taskqueue
thread is active, or if a kernel module containing a provider is unloaded
while retained enablings are present and the dtrace device isn't opened.
Submitted by: gibbs (original version)
Reviewed by: gibbs
Approved by: re (glebius)
MFC after: 2 weeks
Add support of Illumos dumps on zvol over RAID-Z.
Note that this only adds the features. FreeBSD would
still need more work to support dumping on zvols.
Illumos ZFS issues:
2932 support crash dumps to raidz, etc. pools
MFC after: 1 month
Approved by: re (ZFS blanket)
where 'fdvp' is a directory, 'tvp' is an already existing directory
and they have different mount points.
Reported by: avg, pjd
Reviewed by: pjd
Approved by: re (rodrigc)
Illumos ZFS issues:
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states
Provide a compatibility shim for Solaris's cv_timedwait_hires
to help aid future porting.
Approved by: re (ZFS blanket)
dev_ref() in the clone handlers that still use it.
- Don't set SI_CHEAPCLONE flag, it's not used anywhere neither in devfs
(for anything real)
Reviewed by: kib
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
the passed vnode belongs to the same mount point (v_vfsp or also
known as v_mount in FreeBSD). This check prevents the code from
proceeding further on vnodes that do not belong to ZFS, for
instance, on UFS or NULLFS.
The recent change (merged as r254585) on upstream changes the
check of v_vfsp to instead check the znode's z_zfsvfs. On Illumos
this would work because when the vnode comes from lofs, the
VOP_REALVP() would give the right vnode, this is not true on
FreeBSD where our VOP_REALVP is a no-op, and as such tdvp is
not guaranteed to be a ZFS vnode, and will later trigger a
failed assertion when verifying the vnode.
This changeset modifies our local shims (zfs_freebsd_rename and
zfs_freebsd_link) to check if v_mount matches before proceeding
further.
Reported by: many
Diagnostic work by: avg
kld_unload event handler which gets invoked after a linker file has been
successfully unloaded. The kld_unload and kld_load event handlers are now
invoked with the shared linker lock held, while kld_unload_try is invoked
with the lock exclusively held.
Convert hwpmc(4) to use these event handlers instead of having
kern_kldload() and kern_kldunload() invoke hwpmc(4) hooks whenever files are
loaded or unloaded. This has no functional effect, but simplifes the linker
code somewhat.
Reviewed by: jhb
Don't hold dd_lock for long by breaking it when not doing dsl_dir
accounting. It is not necessary to hold the lock while manipulating
the parent's accounting, because there is no interface for userland
to see a consistent picture of both parent and child at the same
time anyway.
Illumos ZFS issues:
4046 dsl_dataset_t ds_dir->dd_lock is highly contended
Fix a panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive. This is a regression from FreeBSD r253821.
Illumos ZFS issues:
4047 panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive
Illumos DTrace issues:
3089 want ::typedef
3094 libctf should support removing a dynamic type
3095 libctf does not validate arrays correctly
3096 libctf does not validate function types correctly
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
and CIFS file attributes as BSD stat(2) flags.
This work is intended to be compatible with ZFS, the Solaris CIFS
server's interaction with ZFS, somewhat compatible with MacOS X,
and of course compatible with Windows.
The Windows attributes that are implemented were chosen based on
the attributes that ZFS already supports.
The summary of the flags is as follows:
UF_SYSTEM: Command line name: "system" or "usystem"
ZFS name: XAT_SYSTEM, ZFS_SYSTEM
Windows: FILE_ATTRIBUTE_SYSTEM
This flag means that the file is used by the
operating system. FreeBSD does not enforce any
special handling when this flag is set.
UF_SPARSE: Command line name: "sparse" or "usparse"
ZFS name: XAT_SPARSE, ZFS_SPARSE
Windows: FILE_ATTRIBUTE_SPARSE_FILE
This flag means that the file is sparse. Although
ZFS may modify this in some situations, there is
not generally any special handling for this flag.
UF_OFFLINE: Command line name: "offline" or "uoffline"
ZFS name: XAT_OFFLINE, ZFS_OFFLINE
Windows: FILE_ATTRIBUTE_OFFLINE
This flag means that the file has been moved to
offline storage. FreeBSD does not have any special
handling for this flag.
UF_REPARSE: Command line name: "reparse" or "ureparse"
ZFS name: XAT_REPARSE, ZFS_REPARSE
Windows: FILE_ATTRIBUTE_REPARSE_POINT
This flag means that the file is a Windows reparse
point. ZFS has special handling code for reparse
points, but we don't currently have the other
supporting infrastructure for them.
UF_HIDDEN: Command line name: "hidden" or "uhidden"
ZFS name: XAT_HIDDEN, ZFS_HIDDEN
Windows: FILE_ATTRIBUTE_HIDDEN
This flag means that the file may be excluded from
a directory listing if the application honors it.
FreeBSD has no special handling for this flag.
The name and bit definition for UF_HIDDEN are
identical to the definition in MacOS X.
UF_READONLY: Command line name: "urdonly", "rdonly", "readonly"
ZFS name: XAT_READONLY, ZFS_READONLY
Windows: FILE_ATTRIBUTE_READONLY
This flag means that the file may not written or
appended, but its attributes may be changed.
ZFS currently enforces this flag, but Illumos
developers have discussed disabling enforcement.
The behavior of this flag is different than MacOS X.
MacOS X uses UF_IMMUTABLE to represent the DOS
readonly permission, but that flag has a stronger
meaning than the semantics of DOS readonly permissions.
UF_ARCHIVE: Command line name: "uarch", "uarchive"
ZFS_NAME: XAT_ARCHIVE, ZFS_ARCHIVE
Windows name: FILE_ATTRIBUTE_ARCHIVE
The UF_ARCHIVED flag means that the file has changed and
needs to be archived. The meaning is same as
the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and
the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute.
msdosfs and ZFS have special handling for this flag.
i.e. they will set it when the file changes.
sys/param.h: Bump __FreeBSD_version to 1000047 for the
addition of new stat(2) flags.
chflags.1: Document the new command line flag names
(e.g. "system", "hidden") available to the
user.
ls.1: Reference chflags(1) for a list of file flags
and their meanings.
strtofflags.c: Implement the mapping between the new
command line flag names and new stat(2)
flags.
chflags.2: Document all of the new stat(2) flags, and
explain the intended behavior in a little
more detail. Explain how they map to
Windows file attributes.
Different filesystems behave differently
with respect to flags, so warn the
application developer to take care when
using them.
zfs_vnops.c: Add support for getting and setting the
UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN,
UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags.
All of these flags are implemented using
attributes that ZFS already supports, so
the on-disk format has not changed.
ZFS currently doesn't allow setting the
UF_REPARSE flag, and we don't really have
the other infrastructure to support reparse
points.
msdosfs_denode.c,
msdosfs_vnops.c: Add support for getting and setting
UF_HIDDEN, UF_SYSTEM and UF_READONLY
in MSDOSFS.
It supported SF_ARCHIVED, but this has been
changed to be UF_ARCHIVE, which has the same
semantics as the DOS archive attribute instead
of inverse semantics like SF_ARCHIVED.
After discussion with Bruce Evans, change
several things in the msdosfs behavior:
Use UF_READONLY to indicate whether a file
is writeable instead of file permissions, but
don't actually enforce it.
Refuse to change attributes on the root
directory, because it is special in FAT
filesystems, but allow most other attribute
changes on directories.
Don't set the archive attribute on a directory
when its modification time is updated.
Windows and DOS don't set the archive attribute
in that scenario, so we are now bug-for-bug
compatible.
smbfs_node.c,
smbfs_vnops.c: Add support for UF_HIDDEN, UF_SYSTEM,
UF_READONLY and UF_ARCHIVE in SMBFS.
This is similar to changes that Apple has
made in their version of SMBFS (as of
smb-583.8, posted on opensource.apple.com),
but not quite the same.
We map SMB_FA_READONLY to UF_READONLY,
because UF_READONLY is intended to match
the semantics of the DOS readonly flag.
The MacOS X code maps both UF_IMMUTABLE
and SF_IMMUTABLE to SMB_FA_READONLY, but
the immutable flags have stronger meaning
than the DOS readonly bit.
stat.h: Add definitions for UF_SYSTEM, UF_SPARSE,
UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY
and UF_HIDDEN.
The definition of UF_HIDDEN is the same as
the MacOS X definition.
Add commented-out definitions of
UF_COMPRESSED and UF_TRACKED. They are
defined in MacOS X (as of 10.8.2), but we
do not implement them (yet).
ufs_vnops.c: Add support for getting and setting
UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY,
UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS.
Alphabetize the flags that are supported.
These new flags are only stored, UFS does
not take any action if the flag is set.
Sponsored by: Spectra Logic
Reviewed by: bde (earlier version)
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
Add module lifetime functions to allocate and teardown
state data.
Report:
- Compression attempts.
- Buffers found to be empty.
- Compression calls that are skipped because
the data length is already less than or
equal to the minimum block length.
- Compression attempts that fail to yield a 12.5%
compression ratio.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:
Add calls to the zio_compress.c module's init and fini
functions.
Sponosred by: Spectra Logic Corporation
MFC after: 2 weeks
minimum allocation size for devices. Use this information to
automatically increase ZFS's minimum allocation size for new top-level
vdevs to a value that more closely matches the optimum device
allocation size.
Use GEOM's stripesize attribute, if set, as the physical sector
size of the GEOM.
Calculate the minimum blocksize of each metaslab class. Use the
calculated value instead of SPA_MINBLOCKSIZE (512b) when determining
the likelyhood of compression yeilding a reduction in physical space
usage.
Report devices with sub-optimal block size configuration in "zpool
status". Also properly fail attempts to attach devices with a
logical block size greater than 8kB, since this will cause corruption
to ZFS's label area.
Sponsored by: Spectra Logic Corporaion
MFC after: 2 weeks
Background
==========
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.
Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:
1) Existing pools created with devices that have different logical
and physical block sizes, but were configured to use the logical
block size (e.g. because the OS version used for pool construction
reported the logical block size instead of the physical block
size) will suddenly find that the vdev allocation size has
increased. This can be easily tolerated for active members of
the array, but ZFS would prevent replacement of a vdev with
another identical device because it now appears that the smaller
allocation size required by the pool is not supported by the new
device.
2) The device's physical block size may be too large to be supported
by ZFS. The optimal allocation size for the vdev may be quite
large. For example, a RAID controller may export a vdev that
requires read-modify-write cycles unless accessed using 64k
aligned/sized requests. ZFS currently has an 8k minimum block
size limit.
Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
Add the SPA_ASHIFT constant. ZFS currently has a hard upper
limit of 13 (8k) for ashift and this constant is used to
both document and enforce this limit.
sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h:
Add the VDEV_AUX_ASHIFT_TOO_BIG error code.
Add fields for exporting the configured, logical, and
physical ashift to the vdev_stat_t structure.
Add VDEV_STAT_VALID() macro which can be used to verify the
presence of required vdev_stat_t fields in nvlist data.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
Provide a SYSCTL_PROC handler for "max_auto_ashift". Since
the limit is only referenced long after boot when a create
operation occurs, there's no compelling need for it to be
a boot time configurable tunable. This also allows the
validation code for the max_auto_ashift value to be contained
within the sysctl handler.
Populate the new fields in the vdev_stat_t structure.
Fail vdev opens if the vdev reports an ashift larger than
SPA_MAXASHIFT.
Propogate vdev_logical_ashift and vdev_physical_ashift between
child and parent vdevs as is done for vdev_ashift.
In vdev_open(), restore code that fails opens for devices
where vdev_ashift grows. This can only happen now if the
device's logical ashift grows, which means it really isn't
safe to use the device.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c:
Update the vdev_open() API so that both logical (what was
just ashift before) and physical ashift are reported.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
Add two new fields, vdev_physical_ashift and vdev_logical_ashift,
to vdev_t.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
Add vdev_ashift_optimize(). Call it anytime a new top-level
vdev is allocated.
cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
Add text for the VDEV_AUX_ASHIFT_TOO_BIG error.
For each sub-optimally configured leaf vdev, report configured
and native block sizes.
cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT.
This status is reported on healthy pools containing vdevs
configured to use a block size smaller than their reported
physical block size.
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Update find_vdev_problem() and supporting functions to
provide the full vdev_stat_t structure to problem checking
routines, and to allow decent into replacing vdevs.
Add a vdev_non_native_ashift() validator which is used on
the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT.
cddl/contrib/opensolaris/lib/libzpool/common/kernel.c:
cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h:
Enhance sysctl userland stubs now that a SYSCTL_PROC handler
is used in vdev.c.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h:
When the group membership of a metaslab class changes (i.e.
when a vdev is added or removed from a pool), walk the group
list to determine the smallest block size currently available
and record this in the metaslab class.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
Add the metaslab_class_get_minblocksize() accessor.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In zio_compress_data(), take the minimum blocksize as an
input parameter instead of assuming SPA_MINBLOCKSIZE.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum
blocksize of the device. The l2arc code performs has it's own
code for deciding if compression is worth while, so this
effectively disables zio_compress_data() from second guessing
the original decision.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:
In zio_write_bp_init(), use the minimum blocksize of the
normal metaslab class when compressing data.
which allow one to define SDT probes that specify translated types. The idea
is to make it easy to write SDT probe definitions that can work across
multiple operating systems. In particular, this makes it possible to port
illumos SDT probes to FreeBSD without changing their argument types, so long
as the appropriate translators are defined. Then DTrace scripts written for
Solaris/illumos will work on FreeBSD without any changes.
MFC after: 1 week
probes declared in a kernel module when that module is unloaded. In
particular,
* Unloading a module with active SDT probes will cause a panic. [1]
* A module's (FBT/SDT) probes aren't destroyed when the module is unloaded;
trying to use them after the fact will generally cause a panic.
This change fixes both problems by porting the DTrace module load/unload
handlers from illumos and registering them with the corresponding
EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all
probes defined in a module when that module is unloaded, and to prevent a
module unload from proceeding if some of its probes are active. The latter
problem has already been fixed for FBT probes by checking lf->nenabled in
kern_kldunload(), but moving the check into the DTrace framework generalizes
it to all kernel providers and also fixes a race in the current
implementation (since a probe may be activated between the check and the
call to linker_file_unload()).
Additionally, the SDT implementation has been reworked to define SDT
providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT
to create and destroy SDT probes when a module is loaded or unloaded. This
simplifies things quite a bit since it means that pretty much all of the SDT
code can live in sdt.ko, and since it becomes easier to integrate SDT with
the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible
in that SDT providers spanning multiple modules can be created on the fly
when a module is loaded; at the moment it looks like illumos' SDT
implementation requires all SDT probes to be statically defined in a single
kernel table.
PR: 166927, 166926, 166928
Reported by: davide [1]
Reviewed by: avg, trociny (earlier version)
MFC after: 1 month
In the future, we'll need to come up with new proc_*() functions that accept
locked processes. For now, this prevents postgresql + DTrace from crashing the
system.
MFC after: 1 month
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.
Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag
The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl
Illumos ZFS issues:
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if
physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting
Fix a regression introduced by fix for Illumos bug #3834. Quote from
Matthew Ahrens on the Illumos issue:
ztest fails this assertion because ztest_dmu_read_write() does
dmu_tx_hold_free(tx, bigobj, bigoff, bigsize);
and then
dmu_object_set_checksum(os, bigobj,
(enum zio_checksum)ztest_random_dsl_prop(ZFS_PROP_CHECKSUM), tx);
If the region to free is past the end of the file, the DMU assumes that there
will be nothing to do for this object. However, ztest does set_checksum(),
which must modify the dnode. The fix is for ztest to also call
dmu_tx_hold_bonus(tx, bigobj);
so we can account for the dirty data associated with setting the checksum
Illumos ZFS issues:
3955 ztest failure: assertion refcount_count(&tx->tx_space_written)
+ delta <= tx->tx_space_towrite
Merge vendor bugfix for ZFS test suite that triggers false positives.
Illumos ZFS issues:
3949 ztest fault injection should avoid resilvering devices
3950 ztest: deadman fires when we're doing a scan
3951 ztest hang when running dedup test
3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
transparent layering and better fragmentation.
- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.
Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
Do this by forcing inclusion of
sys/cddl/compat/opensolaris/sys/debug_compat.h
via -include option into all source files from OpenSolaris.
Note that this -include option must always be after -include opt_global.h.
Additionally, remove forced definition of DEBUG for some modules and fix
their build without DEBUG.
Also, meaning of DEBUG was overloaded to enable WITNESS support for some
OpenSolaris (primarily ZFS) locks. Now this overloading is removed and
that use of DEBUG is replaced with a new option OPENSOLARIS_WITNESS.
MFC after: 17 days
Before executing any subcommand, zpool tool fetches pools configuration from
the kernel. Before features support was added, kernel was regenerating that
configuration based on data always present in memory. Unfortunately, pool
features list and activity counters are not such. They are stored in ZAP,
that normally resides in ARC, but under heavy memory pressure may be swapped
out. If pool is suspended at this point, there is no way to recover it back
since any zpool command will stuck.
This change has one predictable flaw: `zpool upgrade` always wish to upgrade
suspended pools, but fortunately it can't do it due to the suspension.
disable TRIM otherwise.
r252840 (illumos bug 3836) is based on assumption that zio_free_sync() has
no lock dependencies and should complete immediately. Unfortunately, with our
TRIM implementation that is not true due to ZIO_STAGE_VDEV_IO_START added
to the ZIO_FREE_PIPELINE, which, while not really accessing devices, still
acquires SCL_ZIO lock for read to be sure devices won't disappear.
When TRIM is disabled, this patch enables direct free execution from r252840
and removes ZIO_STAGE_VDEV_IO_START and ZIO_STAGE_VDEV_IO_ASSESS stages from
the pipeline to avoid lock acquisition. Otherwise it queues free request as
it was before r252840.
Existing async thread is running only on successfull spa_sync() completion,
that is impossible in case of pool loosing required (last) disk(s). That
indefinite delay of SPA_ASYNC_REMOVE processing made ZFS to not close the
lost disks, preventing GEOM/CAM from destroying devices and reusing names
on later disk reattach.
In earlier version of the patch I've tried to just run existing thread
immediately, unrelated to spa_sync() completion, but that exposed number
of situations where it could stuck due to locks held by stuck spa_sync(),
that are required for other kinds of async events.
Experiments with OpenIndiana snapshot confirmed that they also have this
issue with lost disks reattach.
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.
Before this patch is reinserted we need to break this ordering.
Sponsored by: EMC / Isilon storage division
Reported by: kib
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages
Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).
After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).
Fixing such primitive can bring to complete removal of the page hold
mechanism.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff
Tested by: pho
Skip eviction step of processing free records when doing ZFS
receive to avoid the expensive search operation of non-existent
dbufs in dn_dbufs.
Illumos ZFS issues:
3834 incremental replication of 'holey' file systems is slow
MFC after: 2 weeks
To quote Illumos issue #3888:
When 'zfs recv -F' is used with an incremental recv it rolls
back any changes made since the last snapshot in case new
changes were made to the file system while the recv is in
progress (without -F the recv would fail when it does it's
final check to commit the recv-ed data as the recv-ed data
conflicts with the newly written data).
However, if there is a snapshot taken after the recv began
rolling back to the 'latest' snapshot will not help and the
recv will still fail. 'zfs recv -F' should be extended to
destroy any snapshots created since the source snapshot when
finishing the recv (effectively rolling back through all
snapshots, instead of just to the latest snapshot).
Illumos ZFS issues:
3888 zfs recv -F should destroy any snapshots created since the
incremental source
MFC after: 2 weeks
To quote Illumos #3875:
The problem here is that if we ever end up in the error
path, we drop the locks protecting access to the zfsvfs_t
prior to forcibly unmounting the filesystem. Because z_os
is NULL, any thread that had already picked up the zfsvfs_t
and was sitting in ZFS_ENTER() when we dropped our locks
in zfs_resume_fs() will now acquire the lock, attempt to
use z_os, and panic.
Illumos ZFS issues:
3875 panic in zfs_root() after failed rollback
MFC after: 2 weeks
existed before IOCTL code refactoring merged change 4445fffb from illumos
at r248571.
This change allows `zpool clear` to be used again to recover suspended pool.
It seems the only was supposed by the code to restore pool operation after
reconnecting lost disks that were required for data completeness. There
are still cases where `zpool clear` command can just safely stuck due to
deadlocks inside ZFS kernel part, but probably that is better then having
no chances to recover at all.
OpenSolaris version is:
13108:33bb8a0301ab
6762020 Disassembly support for Intel Advanced Vector Extensions (AVX)
This corresponds to Illumos-gate (github) version
ab47273fedff893c8ae22ec39ffc666d4fa6fc8b
MFC after: 3 weeks
- move init and fini code into separate functions (like it is done upstream)
- invoke fini code via shutdown_post_sync event hook
This should make zfs close its underlying devices during shutdown,
which may be important for their drivers.
MFC after: 20 days
All other places where a znode is allocated do not need z_vnode at all.
These are:
- zfs_create_share_dir
- zfs_create_fs
This chnage ensures two things:
- VN_LOCK_ASHARE is not erroneously called for VFIFO vnodes
- vn_lock is called on a fully constructed vnode with correct v_ops
The change also allows to make zfs_znode_cache_constructor a normal
kmem_cache constructor again (as it is in upstream).
This allows to avoid a problem where zfs_znode_cache_destructor
may be called on un-constructed znodes.
MFC after: 17 days
Unconditionally freeing a page is not good, especially if it is the page
that was wired by the caller. The checks are picked up from
kern_sendfile.
MFC after: 3 weeks
Quoting illumos issue #3836:
Currently zio_free() always puts the zio on a list for subsequent
processing by zio_free_sync(). This is only necessary for frees that
might need to issue reads (gang and dedup blocks).
By processing the majority of the frees as we encounter them, we reduce
the amount of time that the spa_sync() thread spends burning CPU and
not doing any i/o, thus increasing the overall write throughput of the
system.
Illumos ZFS issues:
3836 zio_free() can be processed immediately in the common case
MFC after: 1 week
I missed to register zfs_ioc_jail and zfs_ioc_unjail as legacy ioctl's
with the new zfs_ioctl_register_legacy() function.
These operations do not modify pools or datasets so there is no need to
log them to pool history.
Reported by: Alexander Leidinger <ale@FreeBSD.org> and others on current@
MFC after: 3 days
This could happen if a thread doing a page-in loses a ZFS range lock
race to a thread writing to the same range
This fixes "panic: vm_page_alloc: pindex already allocated" in
http://docs.FreeBSD.org/cgi/mid.cgi?1372165971.96049.42.camel
Submitted by: avg
MFC after: 1 week
function name of its corresponding DTrace probes. These descriptions may
contain whitespace, but probe names cannot, so just replace any whitespace
with underscores when creating probes.
MFC after: 1 week
Restore a previous behavior before r251646, where when destructing
ZFS snapshot, the ioctl would return ENOENT when it hit any of
them in the errlist (the new behavior was only return ENOENT when
all returns error).
Illumos ZFS issues:
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
MFC after: 1 week
ZFS event processing should work on R/O root filesystems
Illumos ZFS issues:
3749 zfs event processing should work on R/O root filesystems
MFC after: 2 weeks
* Illumos ZFS issue #3805 arc shouldn't cache freed blocks
Quote from the Illumos issue:
ZFS should proactively evict freed blocks from the cache.
Even though these freed blocks will never be used again, and thus
will eventually be evicted, this causes us to use memory
inefficiently for 2 reasons:
1. A block that is freed has no chance of being accessed again, but
will be kept in memory preferentially to a block that was accessed
before it (and is thus older) but has not been freed and thus has
at least some chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the
primary control on how much memory is used to cache data vs metadata
is to simply try to keep the proportion the same as it has been in the
past (each bucket "evicts against" itself). The secondary control is
to evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has
more recently accessed, still-allocated blocks. Data in the useful
bucket (with still-allocated blocks) may be evicted in preference to
data in the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the
MFU data bucket was 27GB and the MRU metadata bucket was 256GB.
However, the vast majority of data in the MRU metadata bucket (256GB)
was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket
(230MB) was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
MFC after: 2 weeks
* Illumos zfs issue #3137 L2ARC compression
Whether or not to compress buffers entering the L2ARC is
controlled by "compression" setting on the dataset, when
compression is not "off", L2ARC compression is enabled.
The compress method is always LZ4 for L2ARC when enabled
because it works best for the scenario.
MFC after: 2 weeks
dtrace_probe(). Arguments beyond these five must be obtained in an
architecture-specific way; this can be done through the getargval provider
method, and through dtrace_getarg() if getargval isn't overridden.
This change fixes two off-by-one bugs in the way these arguments are fetched
in FreeBSD's DTrace implementation. First, the SDT provider must set the
aframes parameter to 1 when creating a probe. The aframes parameter controls
the number of frames that dtrace_getarg() will step over in order to find
the frame containing the extra arguments. On FreeBSD, dtrace_getarg() is
called in SDT probe context via
dtrace_probe()->dtrace_dif_emulate()->dtrace_dif_variable->dtrace_getarg()
so aframes must be 3 since the arguments are in dtrace_probe()'s frame; it
was previously being called with a value of 2 instead. illumos uses a
different aframes value for SDT probes, but this is because illumos SDT
probes fire by triggering the #UD fault handler rather than calling
dtrace_probe() directly.
The second bug has to do with the way arguments are grabbed out
dtrace_probe()'s frame on amd64. The code currently jumps over the first
stack argument and retrieves the rest of them using a pointer into the
stack. This works on i386 because all of dtrace_probe()'s arguments will be
on the stack and the first argument is the probe ID, which should be
ignored. However, it is incorrect to ignore the first stack argument on
amd64, so we correct the pointer used to access the arguments.
MFC after: 2 weeks
seven arguments.
The original test uses Solaris' uadmin system call to trigger the test
probe; this change adds a sysctl to the dtrace_test module and gets the test
program to trigger the test probe via the sysctl handler.
The test is currently failing on amd64 because of some bugs in the way that
probe arguments beyond the first five are obtained - these bugs will be
fixed in a separate change.
USDT probes exits. This was previously done with a callout; however, it is
possible to sleep while holding the DTrace mutexes, so a panic will occur
on INVARIANTS kernels if the callout handler can't immediately acquire one
of these mutexes. This panic will be frequently triggered on systems where
a USDT-enabled program (perl, for instance) is often run.
This revision changes the fasttrap cleanup mechanism so that a dedicated
thread is used instead of a callout. The old behaviour is otherwise
preserved.
Reviewed by: rpaulo
MFC after: 1 month