freebsd

mirror of https://git.FreeBSD.org/src.git synced 2024-12-17 10:26:15 +00:00

Author	SHA1	Message	Date
Xin LI	d4548c2e8e	MFC r275533: Sync with Illumos. This has no effect on FreeBSD. Illumos issue: 5100 sparc build failed after 5004 MFC after: 2 weeks	2014-12-06 09:11:13 +00:00
Xin LI	4603a0aeb2	Use %d instead of %u for error number. This way we see ERESTART as -1 not 4294967295 when doing DTrace. MFC after: 2 weeks	2014-12-05 22:56:10 +00:00
Xin LI	26f96d922b	Fix a regression introduced in r274337 (large block support) In dsl_dataset_hold_obj() we used zap_contains(.., DS_FIELD_LARGE_BLOCKS) to determine whether the extensible (zapified) dataset has large blocks. The code expects the result to be either 0 (found) or ENOENT (not found), but it reused the variable 'err', which later code expects to be 0. Fix this by adopting a code construct similar to the one used later for DS_FIELD_BOOKMARK_NAMES, which uses a temporary variable zaperr to catch errors from zap_* routines. Reported by: Peter J. Creath (on FreeNAS; FreeNAS bug #6848) Illumos issue: 5393 spurious failures from dsl_dataset_hold_obj() Reviewed by: mahrens Sponsored by: iXsystems, Inc. X-MFC with: r274337	2014-12-05 18:29:01 +00:00
Alexander Motin	ef8daf3fed	Add GET LBA STATUS command support to CTL. It is implemented for LUNs backed by ZVOLs in "dev" mode and files. GEOM has no such API, so for LUNs backed by raw devices all LBAs will be reported as mapped/unknown. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-12-04 11:34:19 +00:00
Andriy Gapon	782c06dfc8	zfs_putpages: actually update mtime and ctime Reported by: Paul Koch <paul.koch@akips.com> Tested by: Paul Koch <paul.koch@akips.com> MFC after: 2 weeks	2014-12-02 11:44:56 +00:00
Xin LI	d10e52d627	Revert r273060 per discussion with avg@ as we need to make L2ARC aware of 4K devices and this one is not the right fix anyway.	2014-11-26 02:20:25 +00:00
Dimitry Andric	c577699e4b	Fix the following -Werror warning from clang 3.5.0, while building cddl/lib/libctf: In file included from cddl/contrib/opensolaris/common/ctf/ctf_create.c:31: In file included from sys/cddl/contrib/opensolaris/uts/common/sys/sysmacros.h:34: sys/cddl/contrib/opensolaris/uts/common/sys/isa_defs.h:334:9: warning: '_ILP32' macro redefined [-Wmacro-redefined] #define _ILP32 ^ <built-in>:26:9: note: previous definition is here #define _ILP32 1 ^ 1 warning generated. This is because clang 3.5.0 started predefining _ILP32 and __ILP32__ for the i386 arch. (Earlier versions already predefined _LP64 and __LP64__ for the x86_64 arch.) Reviewed by: emaste, avg, smh, delphij, markj Differential Revision: https://reviews.freebsd.org/D1187	2014-11-19 07:44:21 +00:00
Xin LI	8c3d6a4ab2	Make vfs.zfs.max_recordsize read-write at runtime. MFC after: 2 weeks	2014-11-18 22:35:19 +00:00
Xin LI	8efcd876ca	Add a tunable for spa_slop_shift which controls how much space we would reserve by default. Tuning is not recommended. MFC after: 2 weeks	2014-11-18 18:52:38 +00:00
Xin LI	18144ab1a3	Allow tuning zfs_max_recordsize via loader tunable. Tuning is NOT recommended. Requested by: Slawa Olhovchenkov <slw zxy spb ru> MFC after: 2 weeks	2014-11-18 18:40:01 +00:00
Andriy Gapon	2c51c83bc8	l2arc: restore correct rounding up of asize of compressed data This rounding up was lost in a mismerge of illumos code. See r268075 MFV r267565. After that commit zio_compress_data() no longer performs any compressed size adjustment, so it needs to be done externally. On FreeBSD we round up the size using vdev_ashift rather than SPA_MINBLOCKSIZE so that 4KB devices are properly supported. Additionally, zero out the buffer tail only if compression succeeds. The compression is considered successful if the size of compressed data after rounding up to account for the vdev ashift is less than the original data size. It does not make sense to have the data compressed if all the savings are lost to rounding up. With the new zio_compress_data() it could have been possible that the rounded compressed size would be greater than the original size and thus we could zero beyond the allocated buffer if the zeroing code was kept at the original place. Discussed with: delphij, gibbs MFC after: 2 weeks X-MFC with: r274627	2014-11-17 14:45:42 +00:00
Andriy Gapon	0908b20b7e	Revert r269093 which introduced physical zio alignment transform Size of physical ZIOs must never be implicitly adjusted, it's a responsibility of a caller to make sure that such a ZIO has proper offset and size. Discussed with: delphij, gibbs MFC after: 2 weeks	2014-11-17 14:16:02 +00:00
Steven Hartland	a559adfbce	Disable TRIM on file backed ZFS vdevs and fix TRIM on init After r265152 TRIM requests are ZIO_TYPE_FREE instead of ZIO_TYPE_IOCTL; this meant file backed vdevs attempted to process the ZIO as a write, causing a panic. We now disable TRIM on file backed vdevs and ASSERT the ZIO types supported by each vdev type to ensure we explicitly support the ZIO type being processed. Also ensure that TRIM on init is not processed for devices which declare they don't support TRIM via vdev_notrim. PR: 195061, 194976, 191573 Sponsored by: Multiplay	2014-11-17 11:32:10 +00:00
Konstantin Belousov	6e646651d3	Remove the no-at variants of the kern_xx() syscall helpers. E.g., we have both kern_open() and kern_openat(); change the callers to use kern_openat(). This removes one (sometimes two) levels of indirection and consolidates arguments checks. Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-11-13 18:01:51 +00:00
Xin LI	8bcd603968	MFV r274273: ZFS large block support. Please note that booting from datasets that have recordsize greater than 128KB is not supported (but it's OK to enable the feature on the pool). This may remain unchanged because of memory constraints. A limited safety belt is provided for the mounted root filesystem, but caution is advised. Illumos issue: 5027 zfs large block support MFC after: 1 month	2014-11-10 08:20:21 +00:00
Xin LI	42350b6bde	MFV r274272 and diff reduction with upstream. Illumos issue: 5244 zio pipeline callers should explicitly invoke next stage Tested with: ztest plus ZFS over GELI configuration MFC after: 1 month	2014-11-09 07:37:00 +00:00
Xin LI	81f1255e58	MFV r274271: Improve zdb -b performance: - Reduce gethrtime() call to 1/100th of blkptr's; - Skip manipulating the size-ordered tree; - Issue more (10, previously 3) async reads; - Use lighter weight testing in traverse_visitbp(); Illumos issue: 5243 zdb -b could be much faster MFC after: 2 weeks	2014-11-08 07:30:40 +00:00
Andriy Gapon	2fd3cc0cb2	fix l2arc compression buffers leak We have observed that arc_release() can be called concurrently with a l2arc in-flight write. Also, we have observed that arc_hdr_destroy() can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE flag in similar circumstances. Previously the l2arc headers would be freed while leaking their associated compression buffers. Now the buffers are placed on l2arc_free_on_write list for delayed freeing. This is similar to what was already done to arc buffers that were supposed to be freed concurrently with in-flight writes of those buffers. In addition to fixing the discovered leaks this change also adds some protective code to assert that a compression buffer associated with a l2arc header is never leaked. A new kstat l2_cdata_free_on_write is added. It keeps a count of delayed compression buffer frees which previously would have been leaks. Tested by: Vitalij Satanivskij <satan@ukr.net> et al Requested by: many MFC after: 2 weeks Sponsored by: HybridCluster / ClusterHQ	2014-11-06 11:08:02 +00:00
Alexander Motin	c3e7ba3e6d	Add to CTL support for logical block provisioning threshold notifications. For ZVOL-backed LUNs this allows to inform initiators if storage's used or available spaces get above/below the configured thresholds. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-11-06 00:48:36 +00:00
Josh Paetzel	14127f5b21	This change addresses 4 bugs in ZFS exposed by Richard Kojedzinszky's crash.sh script attached to FreeNAS bug 4109: https://bugs.freenas.org/issues/4109 Three are in the snapshot layer: a) AVG explains in his notes: https://wiki.freebsd.org/AvgVfsSolarisVsFreeBSD "VOP_INACTIVE must not do any destructive actions to a vnode and its filesystem node, nor invalidate them in any way." gfs_vop_inactive and zfsctl_snapshot_inactive did just that. In OpenSolaris VOP_INACTIVE is much closer to FreeBSD's VOP_RECLAIM. Rename & move them to gfs_vop_reclaim and zfsctl_snapshot_reclaim and merge in the requisite vnode_destroy from zfsctl_common_reclaim. b) gfs_lookup_dot and various zfsctl functions do not honor the FreeBSD VFS convention of only locking from the root downward. When looking up ".." the convention is to drop the current leaf vnode lock before acquiring the directory vnode and then subsequently re-acquiring the lock on the leaf vnode. This fixes that in all the places that are exercised by crash.sh. c) The snapshot may already be unmounted when the directory vnode is reclaimed. Check for this case and return. One in the common layer: d) Callers of traverse expect the reference to the vnode passed in to be maintained. Don't release it. This last one may be an unclear contract. There may in fact be some callers that do expect the reference to be dropped on success in addition to callers that expect it to be released. In this case a further audit of the callers is needed and a consensus on the correct behavior. PR: 184677 Submitted by: kmacy Reviewed by: delphij, will, avg MFC after: 2 weeks Sponsored by: iXsystems	2014-10-25 17:42:44 +00:00
Justin Hibbits	3ff2096995	Whitespace X-MFC-with: r273570 MFC after: 1 week	2014-10-24 03:34:21 +00:00
Justin Hibbits	24d5dfb116	Three updates to PowerPC FBT: * Use a constant to define the number of stack frames in a probe exception. * Only allow function symbols in powerpc64 ('.' prefixed) * Set the fbtp_roffset for return probes, so the correct dtrace_probe call is made. MFC after: 1 week	2014-10-24 03:33:01 +00:00
Hans Petter Selasky	f0188618f2	Fix multiple incorrect SYSCTL arguments in the kernel: - Wrong integer type was specified. - Wrong or missing "access" specifier. The "access" specifier sometimes included the SYSCTL type, which it should not, except for procedural SYSCTL nodes. - Logical OR where binary OR was expected. - Properly assert the "access" argument passed to all SYSCTL macros, using the CTASSERT macro. This applies to both static- and dynamically created SYSCTLs. - Properly assert the data type for both static and dynamic SYSCTLs. In the case of static SYSCTLs we only assert that the data pointed to by the SYSCTL data pointer has the correct size, hence there is no easy way to assert types in the C language outside a C-function. - Rewrote some code which doesn't pass a constant "access" specifier when creating dynamic SYSCTL nodes, which is now a requirement. - Updated "EXAMPLES" section in SYSCTL manual page. MFC after: 3 days Sponsored by: Mellanox Technologies	2014-10-21 07:31:21 +00:00
Xin LI	78701de4b7	Add tunable vfs.zfs.space_map_blksz for space map's maximum block size. MFC after: 2 weeks	2014-10-18 22:11:10 +00:00
Davide Italiano	2be111bf7d	Follow up to r225617. In order to maximize the re-usability of kernel code in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv(). This fixes a namespace collision with libc symbols. Submitted by: kmacy Tested by: make universe	2014-10-16 18:04:43 +00:00
Steven Hartland	ca6505b818	Prevent ZFS leaking pool free space When processing async destroys ZFS would leak space every txg timeout (5 seconds by default), if no writes occurred, until the pool is totally full. At this point it would be unfixable without a pool recreation. In addition if the machine was rebooted with the pool in this situation would fail to import on boot, hanging indefinitely, as the import process requires the ability to write data to the pool. Any attempts to query the pool status during the hung import would not return as the import holds the pool lock. The only way to import such a pool would be to specify -o readonly=on to the zpool import. zdb -bb <pool> can be used to check for "deferred free" size which is where this lost space will be counted. MFC after: 3 days Sponsored by: Multiplay	2014-10-16 02:23:27 +00:00
Xin LI	ba6e85e0cf	Use write_psize instead of write_asize when doing vdev_space_update. Without this change the accounting of L2ARC usage would be wrong and give 16EB free space because the number became negative and overflows. Obtained from: FreeNAS (issue #6239) MFC after: 2 weeks	2014-10-13 20:39:51 +00:00
Xin LI	a4f5b8db9f	Add a tunable for arc_shrink_shift (vfs.zfs.arc_shrink_shift) that controls how much fraction, 1/2^arc_shrink_shift, should be reclaimed when there is memory pressure. Submitted by: Richard Kojedzinszky <krichy at tvnetwork.hu> MFC after: 2 weeks	2014-10-13 05:34:10 +00:00
Xin LI	eba15cf463	MFV r272804: Refactor the code and stop restore_object from creating two transactions. Illumos issue: 3693 restore_object uses at least two transactions to restore an object MFC after: 2 weeks	2014-10-09 07:52:51 +00:00
Xin LI	ce44f14b41	MFV r272803: Illumos issue: 5175 implement dmu_read_uio_dbuf() to improve cached read performance MFC after: 2 weeks	2014-10-09 07:18:40 +00:00
Andriy Gapon	c3d1d2e104	l2arc_write_buffers: reduce headroom value FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS data lists (currently both are 16) while illumos has just a single list of each kind. headroom determines how much data is scanned on a single list during each run of the l2arc feed thread. Because FreeBSD has more lists we proportionally decrease the limit. Reviewed by: Brendan Gregg (earlier version) MFC after: 2 weeks Sponsored by: HybridCluster	2014-10-07 16:08:21 +00:00
Andriy Gapon	9f96723ec5	revert r272702: wrong (earlier) change was committed	2014-10-07 16:06:10 +00:00
Andriy Gapon	4c3b02bfce	reduce L2ARC_WRITE_SIZE on FreeBSD FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS data lists (currently both are 16) while illumos has just a single list of each kind. L2ARC_WRITE_SIZE determines the default value of l2arc_write_max which defines limits on how much data is scanned and written to a cache device during each run of the l2arc feed thread. The limits are applied on the per buffer list basis. Because FreeBSD has more lists we proportionally reduce the limits. Reviewed by: Brendan Gregg (earlier version) MFC after: 2 weeks Sponsored by: HybridCluster	2014-10-07 14:30:24 +00:00
Andriy Gapon	ab26525af2	make userland __assfail from opensolaris compat honor 'aok' variable This should allow zdb -A option to actually make difference. MFC after: 2 weeks	2014-10-07 14:15:50 +00:00
Xin LI	1b5bcb8425	MFV r272591: Use loaned ARC buffer for zfs receive to avoid copy. Illumos issue: 5162 zfs recv should use loaned arc buffer to avoid copy MFC after: 2 weeks	2014-10-06 07:29:17 +00:00
Xin LI	8fb26f5aef	MFV r272585: Split the godfather zio into CPU number's to reduce lock contention. Illumos issue: 5176 lock contention on godfather zio MFC after: 2 weeks	2014-10-06 07:03:17 +00:00
Xin LI	dcb20006f0	MFV r272501: Illumos issue: 5177 remove dead code from dsl_scan.c MFC after: 2 weeks	2014-10-06 05:46:51 +00:00
Xin LI	00769ce74d	MFV r272500: Don't inherit flags other than DS_FLAG_CI_DATASET and DS_FLAG_INCONSISTENT when cloning. This prevents DS_FLAG_DEFER_DESTROY being inherited from a clone that is marked for deferred destroy, which causes snapshots of the clone being destroyed when getting a hold or clone. Illumos issue: 5150 zfs clone of a defer_destroy snapshot causes strangeness MFC after: 1 week	2014-10-06 05:42:20 +00:00
Xin LI	4bb264ae15	Don't make nested definition for range_seg_cache. Reported by: ian MFC after: 1 week X-MFC-With: r272506	2014-10-04 15:42:52 +00:00
Xin LI	4750c382a9	MFV r272499: Illumos issue: 5174 add sdt probe for blocked read in dbuf_read() MFC after: 2 weeks	2014-10-04 08:55:08 +00:00
Xin LI	eb0b70068c	Add a new sysctl, vfs.zfs.vol.unmap_enabled, which allows the system administrator to toggle whether ZFS should ignore UNMAP requests. Illumos issue: 5149 zvols need a way to ignore DKIOCFREE MFC after: 2 weeks	2014-10-04 08:51:57 +00:00
Xin LI	2d36d67c72	Diff reduction with upstream. The code change is not really applicable to FreeBSD. Illumos issue: 5148 zvol's DKIOCFREE holds zfsdev_state_lock too long MFC after: 1 month	2014-10-04 08:41:23 +00:00
Xin LI	523b4c7fdf	MFV r272496: Add tunable for number of metaslabs per vdev (vfs.zfs.vdev.metaslabs_per_vdev). The default remains at 200. Illumos issue: 5161 add tunable for number of metaslabs per vdev MFC after: 2 weeks	2014-10-04 08:29:48 +00:00
Xin LI	a8d7512709	MFV r272495: In arc_kmem_reap_now(), reap range_seg_cache too to reclaim memory in response of memory pressure. Illumos issue: 5163 arc should reap range_seg_cache MFC after: 1 week	2014-10-04 08:14:10 +00:00
Xin LI	8c20e2ff11	MFV r272494: Make space_map_truncate() always do space_map_reallocate(). Without this, setting space_map_max_blksz would cause panic for existing pool, as dmu_objset_set_blocksize would fail if the object have multiple blocks. Illumos issues: 5164 space_map_max_blksz causes panic, does not work 5165 zdb fails assertion when run on pool with recently-enabled spacemap_histogram feature MFC after: 2 weeks	2014-10-04 08:05:39 +00:00
Steven Hartland	14a0d74ea8	Refactor ZFS ARC reclaim checks and limits Remove previously added kmem methods in favour of defines which allow diff minimisation between upstream code base. Rebalance ARC free target to be vm_pageout_wakeup_thresh by default which eliminates issue where ARC gets minimised instead of balancing with VM pageout. This restores the target point prior to r270759. Bring in missing upstream only changes which move unused code to further eliminate code differences. Add additional DTRACE probe to aid monitoring of ARC behaviour. Enable upstream i386 code paths on platforms which don't define UMA_MD_SMALL_ALLOC. Fix mixture of byte and page values in arc_memory_throttle i386 code path value assignment of available_memory. PR: 187594 Review: D702 Reviewed by: avg MFC after: 1 week X-MFC-With: r270759 & r270861 Sponsored by: Multiplay	2014-10-03 20:34:55 +00:00
Steven Hartland	99140218aa	Fix various issues with zvols When performing snapshot renames we could deadlock due to the locking in zvol_rename_minors. In order to avoid this use the same workaround as zvol_open in zvol_rename_minors. Add missing zvol_rename_minors to dsl_dataset_promote_sync. Protect against invalid index into zv_name in zvol_remove_minors. Replace zvol_remove_minor calls with zvol_remove_minors to ensure any potential children are also renamed. Don't fail zvol_create_minors if zvol_create_minor returns EEXIST. Restore the valid pool check in zfs_ioc_destroy_snaps to ensure we don't call zvol_remove_minors when zfs_unmount_snap fails. PR: 193803 MFC after: 1 week Sponsored by: Multiplay	2014-10-03 14:49:48 +00:00
Marcelo Araujo	d8a5961f88	Fix failures and warnings reported by newpynfs20090424 test tool. This fix addresses only issues with the pynfs reports; none of these issues are known to create problems for extant real clients. Submitted by: Bart Hsiao <bart.hsiao@gmail.com> Reworked by: myself Reviewed by: rmacklem Approved by: rmacklem Sponsored by: QNAP Systems Inc.	2014-10-03 02:24:41 +00:00
Xin LI	43ac3722ac	Diff reduction with kernel code: instruct the compiler that the data of these types may be unaligned to their "normal" alignment and exercise caution when accessing them. PR: 194071 MFC after: 3 days	2014-10-02 00:13:08 +00:00
Will Andrews	fbce0221eb	zfsvfs_create(): Refuse to mount datasets whose names are too long. This is checked for in the zfs_snapshot_004_neg STF/ATF test (currently still in projects/zfsd rather than head). sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c: - zfsvfs_create(): Check whether the objset name fits into statfs.f_mntfromname, and return ENAMETOOLONG if not. Although the filesystem can be unmounted via the umount(8) command, any interface that relies on iterating on statfs (e.g. libzfs) will fail to find the filesystem by its objset name, and thus assume it's not mounted. This causes "zfs unmount", "zfs destroy", etc. to fail on these filesystems, whether or not -f is passed. MFC after: 1 month Sponsored by: Spectra Logic MFSpectraBSD: 974872 on 2013/08/09	2014-10-01 14:12:02 +00:00
Xin LI	0b66c7c514	Fix a mismerge in r260183 which prevents snapshot zvol devices being removed and re-instate the fix in r242862. Reported by: Leon Dang <ldang nahannisys com>, smh MFC after: 3 days	2014-09-30 18:50:45 +00:00
Steven Hartland	8caa3daf35	Remove sys/types.h include as per style (9) SDT requires sys/param.h due to use of NULL Reported by: Garrett Sponsored by: Multiplay	2014-09-18 20:38:18 +00:00
Steven Hartland	71f3caaf31	Add dtrace probe support for zfs SET_ERROR(..) MFC after: 1 week Sponsored by: Multiplay	2014-09-18 20:00:36 +00:00
Will Andrews	91dda985cc	Remove debug.zfs_flags in favor of the new vfs.zfs.debug_flags. Replace TUNABLE_INT with CTLFLAG_RWTUN. Submitted by: avg (debug.zfs_flags removal), smh (TUNABLE_INT replacement)	2014-09-18 18:46:38 +00:00
Will Andrews	f8c2f66a6c	Enable ZFS debug flags to be modified via vfs.zfs.debug_flags. This is primarily only of interest to ZFS developers, but it makes it easier to get additional debugging. Submitted by: gibbs MFC after: 1 month Sponsored by: Spectra Logic MFSpectraBSD: 517074 on 2011/12/15 (by will), 662343 on 2013/03/20 (by gibbs)	2014-09-18 16:55:41 +00:00
Will Andrews	cf0a1157d7	Reorder sysctls for spa.c global tunables; add sysctl for ccw_retry_interval. MFC after: 1 month Sponsored by: Spectra Logic	2014-09-18 16:38:03 +00:00
Will Andrews	cf7a096e72	bpobj_iterate_impl(): Close a refcount leak iterating on a sublist. If bpobj_space() returned non-zero here, the sublist would have been left open, along with the bonus buffer hold it requires. This call does not invoke any calls to bpobj_close() itself. This bug doesn't have any known vector, but was found on inspection. MFC after: 1 week Sponsored by: Spectra Logic Affects: All ZFS versions starting 21 May 2010 (illumos cde58dbc) MFSpectraBSD: r1050998 on 2014/03/26	2014-09-18 15:37:53 +00:00
Steven Hartland	d1d469e22b	Remove unused ZFS ARC functions * arc_data_buf_alloc * arc_data_buf_free MFC after: 1 week Sponsored by: Multiplay	2014-09-18 10:46:51 +00:00
Justin Hibbits	e40a5cd3ec	Fix the stack tracing for dtrace/powerpc. Summary: Fix the stack tracing for dtrace/powerpc by using the trapexit/asttrapexit return address sentinels instead of checking within the kernel address space. As part of this, I had to add new inline functions. FBT traces the kernel, so we have to have special case handling for this, since a trap will create a full new trap frame, and there's no way to pass around the 'real' stack. I handle this by special-casing 'aframes == 0' with the trap frame. If aframes counts out to the trap frame, then assume we're looking for the full kernel trap frame, so switch to the real stack pointer. Test Plan: Tested on powerpc64 Reviewers: rpaulo, markj, nwhitehorn Reviewed By: markj, nwhitehorn Differential Revision: https://reviews.freebsd.org/D788 MFC after: 3 week Relnotes: Yes	2014-09-17 02:43:47 +00:00
Steven Hartland	a889b18c52	Added missing ZFS sysctls * vfs.zfs.vdev.async_write_active_min_dirty_percent * vfs.zfs.vdev.async_write_active_max_dirty_percent Added validation of min / max for ZFS sysctl * vfs.zfs.dirty_data_max_percent MFC after: 3 days	2014-09-14 12:23:00 +00:00
Xin LI	f9290bc2c9	MFV r271518: Correctly report hole at end of file. When asked to find a hole, the DMU sees that there are no holes in the object, and returns ESRCH. The ZPL interprets this as "no holes before the end of the file", and therefore inserts the "virtual hole" at the end of the file. Because DMU and ZPL have different ideas of where the end of an object/file is, we will end up returning the end of file, which is generally larger, instead of returning the end of object. The fix is to handle the "virtual hole" in the DMU. If no hole is found, the DMU will return a hole at the end of the file, rather than an error. Illumos issue: 5139 SEEK_HOLE failed to report a hole at end of file MFC after: 1 week	2014-09-13 17:48:44 +00:00
Xin LI	dc147754b7	MFV r271517: In zil_claim, don't issue warning if we get EBUSY (inconsistent) when opening an objset, instead, ignore it silently. Illumos issue: 5140 message about "%recv could not be opened" is printed when booting after crash MFC after: 1 week	2014-09-13 17:36:34 +00:00
Xin LI	be1b14a063	MFV r271515: Add a new tunable/sysctl, vfs.zfs.free_max_blocks, which can be used to limit how many blocks can be free'ed before a new transaction group is created. The default is no limit (infinite), but we should probably have a lower default, e.g. 100,000. With this limit, we can guard against the case where ZFS could run out of memory when destroying large numbers of blocks in a single transaction group, as the entire DDT needs to be brought into memory. Illumos issue: 5138 add tunable for maximum number of blocks freed in one txg MFC after: 2 weeks	2014-09-13 17:24:56 +00:00
Xin LI	ff0fc48bde	MFV r271512: Illumos issue: 5136 fix write throttle comment in dsl_pool.c MFC after: 2 weeks	2014-09-13 16:51:23 +00:00
Xin LI	263f396e2b	MFV r271510: Enforce 4K as smallest indirect block size (previously the smallest indirect block size was 1K but that was never used). This makes some space estimates more accurate and uses less memory for some data structures. Illumos issue: 5141 zfs minimum indirect block size is 4K MFC after: 2 weeks	2014-09-13 16:26:14 +00:00
Steven Hartland	3cdd9138c3	Persist vdev_resilver_txg changes to avoid panic caused by validation vs a vdev_resilver_txg value from a previous resilver. MFC after: 1 week	2014-09-11 16:21:51 +00:00
Gleb Smirnoff	27ad26d8c7	Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES().	2014-09-10 12:36:41 +00:00
Alexander Motin	ee9534ed96	Make ZVOL writes in device mode support IO_SYNC flag. MFC after: 1 month	2014-09-09 11:29:55 +00:00
Xin LI	817d804595	MFV r271223: In dnode_sync(), do dnode_increase_indirection() before processing the dn_next_nblkptr. Illumos issue: 5117 space map reallocation can cause corruption MFC after: 3 days	2014-09-07 13:13:42 +00:00
Peter Wemm	d903c21a64	Move the restored #ifdef i386 test back inside the #ifdef _KERNEL block where it originally was.	2014-08-31 09:05:02 +00:00
Steven Hartland	92ac3eb59f	Ensure that ZFS ARC free memory checks include cached pages Also restore kmem_used() check for i386 as it has KVA limits that the raw page counts above don't consider PR: 187594 Reviewed by: peter X-MFC-With: r270759 Review: D700 Sponsored by: Multiplay	2014-08-30 21:44:32 +00:00
Mateusz Guzik	6662ce5aab	Add missing proctree locking to fill_kinfo_proc consumers. This fixes r270444. Pointy hat: mjg Reported by: many MFC after: 1 week	2014-08-30 03:10:55 +00:00
Steven Hartland	4d19f4ad1f	Refactor ZFS ARC reclaim logic to be more VM cooperative Prior to this change we triggered ARC reclaim when kmem usage passed 3/4 of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC). This could lead to large amounts of unused RAM, e.g. on a 192GB machine with ARC the only major RAM consumer, 40GB of RAM would remain unused. The old method has also been seen to result in extreme RAM usage under certain loads, causing poor performance and stalls. We now trigger ARC reclaim when the number of free pages drops below the value defined by the new sysctl vfs.zfs.arc_free_target, which defaults to the value of vm.v_free_target. Credit to Karl Denninger for the original patch on which this update was based. PR: 191510 and 187594 Tested by: dteske MFC after: 1 week Relnotes: yes Sponsored by: Multiplay	2014-08-28 19:50:08 +00:00
Mark Johnston	35127d3c0f	Restore the correct value when disabling probes. Otherwise the instrumented tracepoints would continue to generate traps, which would be ignored but could consume noticeable amounts of CPU if, say, all functions in the kernel were instrumented. X-MFC-With: r270067	2014-08-24 17:10:47 +00:00
Xin LI	ec1b564650	Instead of using timestamp in the AVL, use the memory address when comparing. Illumos issue: 5095 panic when adding a duplicate dbuf to dn_dbufs MFC after: 3 days	2014-08-22 23:13:53 +00:00
Xin LI	fa4484104c	MFV r270197: Illumos issue: 5066 remove support for non-ANSI compilation 5068 Remove SCCSID() macro from <macros.h> MFC after: 2 weeks	2014-08-22 22:13:36 +00:00
Xin LI	d291a3bd9c	Provide compatibility shim for atomic_dec_64_nv. X-MFC-with: r270247 MFC after: 13 days	2014-08-21 08:25:46 +00:00
Xin LI	7c1db36b28	MFV r270196: Illumos issue: 5047 don't use atomic_*_nv if you discard the return value MFC after: 2 weeks	2014-08-20 22:39:26 +00:00
Xin LI	249ddb42f6	MFC r270195: Illumos issue: 5045 use atomic_{inc,dec}_* instead of atomic_add_* MFC after: 2 weeks	2014-08-20 21:44:48 +00:00
Xin LI	2bcc37f99c	MFV r270193: Illumos issues: 5042 stop using deprecated atomic functions MFC after: 2 weeks	2014-08-20 18:29:18 +00:00
Mark Johnston	266b4a78c2	Factor out the common code for function boundary tracing instead of duplicating the entire implementation for both x86 and powerpc. This makes it easier to add support for other architectures and has no functional impact. Phabric: D613 Reviewed by: gnn, jhibbits, rpaulo Tested by: jhibbits (powerpc) MFC after: 2 weeks	2014-08-16 21:42:55 +00:00
Xin LI	60723bfe21	MFV r269542: In vdev_get_stats, check that the vdev is not a hole before computing the fragmentation. This fixes a panic when removing log device. Illumos issue: 5049 panic when removing log device Author: Alex Reece <alex@delphix.com> MFC after: 2 weeks	2014-08-05 00:07:21 +00:00
Mark Johnston	2661328745	Return 0 for the PPID of threads in process 0, as process 0 doesn't have a parent process. MFC after: 2 weeks	2014-08-04 19:02:30 +00:00
Xin LI	cd741a5e1d	Revert r269404 and use cpu_ticks() for dbuf allocation. Encode CPU's number by XOR'ing the CPU ID against the 64-bit cpu_ticks(). Reviewed by: mav, gibbs Differential Revision: https://phabric.freebsd.org/D521 MFC after: 2 weeks	2014-08-03 09:47:51 +00:00
Xin LI	1dcef10eac	MFV r269427: In dnode_children_t, use C99's "[]" idiom for declaring the variable sized array dnc_children at the end of the structure. This prevents the compiler from mistakenly optimizing away accesses beyond the array's defined size. Illumos issue: 5038 Remove "old-style" flexible array usage in ZFS. Author: Justin T. Gibbs <justing@spectralogic.com> MFC after: 2 weeks	2014-08-02 08:34:22 +00:00
Ian Lepore	c311f7078c	When arm 64-bit atomic ops are available, define ARM_HAVE_ATOMIC64. Use that symbol (which will be correct in both kernel and userland contexts) rather than just __arm__ to decide whether to use a local implementation.	2014-08-02 03:44:27 +00:00
Ian Lepore	814f4c5896	Use the 64-bit atomics now provided by arm machine/atomic.h instead of (conflicting) local versions.	2014-08-01 23:45:50 +00:00
Steven Hartland	6a369c018c	Don't return ZIO_PIPELINE_CONTINUE from vdev_op_io_start methods This prevents recursion of vdev_queue_io_done as per r265321 but using a different method as recommended on the openzfs list. We now use zio_interrupt(zio) and return ZIO_PIPELINE_STOP instead of returning ZIO_PIPELINE_CONTINUE from vdev_*_io_start methods. zio_vdev_io_start now ASSERTs that vdev_op_io_start returns ZIO_PIPELINE_STOP to ensure future changes don't reintroduce ZIO_PIPELINE_CONTINUE returns. Cleanup flow in vdev_geom_io_start while I'm here. Also fix some cases not using SET_ERROR(..) MFC after: 2 weeks X-MFC-With: r265321	2014-08-01 23:16:48 +00:00
Xin LI	125f68e708	Split gethrtime() and gethrtime_waitfree() and make the former use nanouptime() instead of getnanouptime(). nanouptime(9) provides a more precise result at the expense of being slower. In r269223, gethrtime() is used as the creation time of a dbuf, which in turn acts as a portion of the lookup key to maintain the AVL invariant that there cannot be duplicate items. Before this change, gethrtime() preferred better execution time by sacrificing precision, which may lead to a panic on busy systems with: panic: avl_find() succeeded inside avl_add() Reported by: allanjude, mav PR: kern/192284 MFC after: 11 days X-MFC-with: r269223	2014-08-01 22:33:23 +00:00
Rui Paulo	d18aa577d5	Copy strtolctype.h to sys/cddl/contrib/opensolaris/common/util to keep the kernel self-contained. Requested by: jhb	2014-07-31 08:07:23 +00:00
Xin LI	9b046b421f	MFV r269224: Increase default ARC buf_hash_table size. When typical block size is small, the hash table could be too small, which would lead to long hash chains and limit performance for cached reads. A new loader tunable, vfs.zfs.arc_average_blocksize, has been added which allows users to override the default assumption of average (typical) block size. Old default was 65536 (64 KiB) and new default is 8192 (8 KiB). Illumos issue: 5034 ARC's buf_hash_table is too small MFC after: 2 weeks	2014-07-29 09:36:48 +00:00
Xin LI	a3cbca537e	MFV r269223: Change dn->dn_dbufs from linked list to AVL tree. Illumos issues: 4873 zvol unmap calls can take a very long time for larger datasets MFC after: 2 weeks	2014-07-29 08:42:22 +00:00
Xin LI	343c95a24e	Reschedule the 'deadman' callout after handling, this makes our code behave more like it is on Solaris. Reported by: avg Reviewed by: avg, mav (but bugs are mine) Differential Revision: https://phabric.freebsd.org/D457	2014-07-29 06:57:13 +00:00
Konstantin Belousov	fe0e9a63e0	Initialize zfs vnode v_hash when the vnode is allocated, instead of postponing it to zfs_vget(). zfs_root() returned vnode with the default value of v_hash, which caused inconsistent v_hash value when root vnode was obtained from zfs_vget(). Nullfs allocated two upper vnodes for the root zfs vnode due to different hashes, causing consistency problems. Reported and tested by: Harald Schmalzbauer <h.schmalzbauer@omnilan.de> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-28 14:24:18 +00:00
Xin LI	50b74c6ef1	Add two sysctls for newly added tunables. MFC after: 2 weeks	2014-07-26 19:07:08 +00:00
Xin LI	7e37b1e609	MFV r269010: Import Illumos changes to address the following Illumos issues: 4976 zfs should only avoid writing to a failing non-redundant top-level vdev 4978 ztest fails in get_metaslab_refcount() 4979 extend free space histogram to device and pool 4980 metaslabs should have a fragmentation metric 4981 remove fragmented ops vector from block allocator 4982 space_map object should proactively upgrade when feature is enabled 4984 device selection should use fragmentation metric MFC after: 2 weeks	2014-07-26 10:20:48 +00:00
Alexander Motin	1bc04f6a8c	Make sysctls under vfs.zfs.zfetch writeable. I don't see any reason for them to be read-only, while tuning them without reboot is much more convenient for experiments. MFC after: 2 weeks	2014-07-26 09:09:14 +00:00
Xin LI	0aa4ce9b7d	Transform the I/O when vdev_physical_ashift is greater than SPA_MINBLOCKSHIFT. MFC after: 2 weeks	2014-07-25 18:41:56 +00:00
Xin LI	883d80c104	As of r268075, the responsibility of rounding up the buffer to the optimal size has been transferred from zio_compress_data to its caller. Therefore, passing the 'minblocksize' down will be a no-op. Eliminate the parameter to reduce diff against upstream. MFC after: 2 weeks	2014-07-25 06:53:20 +00:00
Xin LI	3d4d6b0883	Correct typo introduced with r268855. MFC after: 10 days X-MFC with: r268855	2014-07-22 08:37:01 +00:00
Mark Johnston	5a5f9d21dd	Use a C wrapper for trap() instead of checking and calling the DTrace trap hook in assembly. Suggested by: kib Reviewed by: kib (original version) X-MFC-With: r268600	2014-07-19 02:27:31 +00:00
Xin LI	b4bb49887b	Reduce lock contention on the z_teardown_lock under heavily cached read workload by splitting the single teardown rrw lock into RRM_NUM_LOCKS (17) of them. Read acquisitions are randomly distributed among these locks based on curthread pointer. Write acquisitions are going to all the locks, which for the usage of this type of lock should be rare. Illumos issue: 5008 lock contention (rrw_exit) while running a read only load MFC after: 2 weeks	2014-07-19 00:26:03 +00:00
Xin LI	82599d31fe	MFV r268851: When a sync task is waiting for a txg to complete, we should hurry it along by increasing the number of outstanding async writes (i.e. make vdev_queue_max_async_writes() return a larger number). Illumos issue: 4753 increase number of outstanding async writes when sync task is waiting MFC after: 2 weeks	2014-07-18 22:34:01 +00:00
Xin LI	f886b6e3bc	MFV r268850: Change the interaction between the DMU and ARC so that when the DMU is shutting down an objset, we do not evict the data from the ARC. Instead we simply coordinate the destruction of the DMU's data with the ARC. The only case where we actually need to explicitly evict from the ARC is when dbuf_rele_and_unlock() determines that the administrator has requested that it not be kept in memory, via the primarycache/secondarycache properties. In this case, we evict the data from the ARC by its blkptr_t, the same way as when a block is freed we explicitly evict it from the ARC. Illumos issue: 4631 zvol_get_stats triggering too many reads MFC after: 2 weeks	2014-07-18 22:04:21 +00:00
Xin LI	7882b61f60	MFV r268848: Instead of asserting all zio's be properly aligned, only assert on the logical ones. Cap uberblocks at 8k, otherwise with ashift=17, there would be only one uberblock. This fixes a problem that zdb would trip assert on pools with ashift >= 0xe (8k). While there, also change the code so it only attempts to condense the space map when the uncondensed size consumes more than zfs_metaslab_condense_block_threshold blocks. Illumos issue: 4958 zdb trips assert on pools with ashift >= 0xe MFC after: 2 weeks	2014-07-18 20:41:40 +00:00
Xin LI	7079d5877c	MFV r268714: Improve extreme rewind import. When doing an "extreme rewind" import ("zpool import -XF"), we attempt to verify all data in the pool, essentially scrubbing the entire pool. The problem is that spa_load_verify_cb() issues an unbounded number of concurrent scrub i/os. This can lead to all of memory being used for these zio's, wedging the system. Like normal scrub, we need to put a cap on the number of outstanding i/os, and have the traverse thread block when we reach this cap. For this purpose the cap can be very large (10,000) to optimize the elevator algorithm. Three kernel tunables have been added: vfs.zfs.spa_load_verify_maxinflight vfs.zfs.spa_load_verify_metadata vfs.zfs.spa_load_verify_data The latter two tunables control whether metadata and/or user data are verified when doing an extreme rewind. Make 'zpool import -T' imply scrub. Make zpool import -T <txg> accept hexadecimal values for the txg when prefixed with 0x. Skip txg's for which there is no uberblock when doing extreme rewind. Skip reading all user data twice by skipping prefetches when doing extreme rewinds as we do not access via the ARC. Illumos issues: 4970 need controls on i/o issued by zpool import -XF 4971 zpool import -T should accept hex values 4972 zpool import -T implies extreme rewind, and thus a scrub 4973 spa_load_retry retries the same txg 4974 spa_load_verify() reads all data twice MFC after: 2 weeks	2014-07-15 22:44:04 +00:00
Xin LI	eb75155228	MFV r268702: Add missing *_destroy() calls in various places with ZFS. Illumos issue: 4975 missing mutex_destroy() calls in zfs MFC after: 2 weeks	2014-07-15 20:32:23 +00:00
Mark Johnston	291624fdf6	Invoke the DTrace trap handler before calling trap() on amd64. This matches the upstream implementation and helps ensure that a trap induced by tracing fbt::trap:entry is handled without recursively generating another trap. This makes it possible to run most (but not all) of the DTrace tests under common/safety/ without triggering a kernel panic. Submitted by: Anton Rang <anton.rang@isilon.com> (original version) Phabric: D95	2014-07-14 04:38:17 +00:00
Xin LI	1b174fa1eb	MFV r268455: Use reserved space for ZFS administrative commands. We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at least) for system use. Most ZPL operations, e.g. write(2), creat(2), will fail with ENOSPC if we fall below this. Certain operations, e.g. file removal and most administrative actions, are still permitted until half of the slop space is used. This allows users to use these operations to free up space in the pool when the pool is close to full but half of the slop space is still free. A very restricted set of operations that free up space or change quota are always permitted, regardless of the amount of free space. MFC after: 2 weeks	2014-07-09 23:14:59 +00:00
Xin LI	fdc0ee2cf5	MFV r268452: Explicitly mark file removal transactions as "presumed to result in a net free of space" so they will not fail with ENOSPC. Illumos issue: 4950 files sometimes can't be removed from a full filesystem MFC after: 2 weeks	2014-07-09 18:32:40 +00:00
Alexander Motin	e327a057a7	Remove IO_SYNC flag when writing extended file attributes on ZFS. While it is possible to create and write a file, modify its permissions, etc. without ever doing sync, it looks odd that it is required for setting extended file attributes on ZFS. UFS does not do sync there either. Samba uses those extended attributes to store some of its data, and doing it synchronously reduces file creation performance many times over on systems without a SLOG device. Reviewed by: delphij, jpaetzel, silence on fs@ MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-07-08 17:26:08 +00:00
Marcel Moolenaar	e7d939bda2	Remove ia64. This includes: o All directories named ia64 o All files named ia64 o All ia64-specific code guarded by __ia64__ o All ia64-specific makefile logic o Mention of ia64 in comments and documentation This excludes: o Everything under contrib/ o Everything under crypto/ o sys/xen/interface o sys/sys/elf_common.h Discussed at: BSDcan	2014-07-07 00:27:09 +00:00
Pedro F. Giffuni	5f40879138	Merge from OpenSolaris (24-Jul-2010): 6679140 asymmetric alloc/dealloc activity can induce dynamic variable drops 6679193 dtrace_dynvar walker produces flood of dtrace_dynhash_sink This finishes a set of merges from the older OpenSolaris releases. Still the FreeBSD port has many differences that are difficult to account for but that seems normal given that the kernels are different. MFC after: 1 week	2014-07-05 15:36:17 +00:00
Pedro F. Giffuni	99d8c6efbd	Merge from OpenSolaris (30-Jun-2009): 6851093 system drops to kmdb with anonymous dtrace probes + kmdb This has no effect on FreeBSD (code is ifdef'ed) but is useful as reference for future merges. MFC after: 1 week	2014-07-03 19:25:24 +00:00
Pedro F. Giffuni	87e109c3e0	Merge from OpenSolaris (22-Apr-2008): 6823388 DTrace ioctl handlers must validate all structure members MFC after: 1 week	2014-07-03 19:07:37 +00:00
Pedro F. Giffuni	e099b3a948	Merge from OpenSolaris (20-Apr-2008): 6822482 DOF validation needs to handle loadable sections flagged as unloadable MFC after: 1 week	2014-07-03 17:36:59 +00:00
Alexander Motin	5a178afd41	Fix bug in sync control in new "dev" mode of ZVOL (r265678). Don't check ZVOL_WCE flag, used in Solaris to control device "write cache". It is not applicable on FreeBSD and by default set to "disable". MFC after: 3 days	2014-07-02 21:25:32 +00:00
Pedro F. Giffuni	0b8f286e83	Merge from OpenSolaris (15-Sep-2008): 6735480 race between probe enabling and provider registration MFC after: 1 week	2014-07-01 23:37:24 +00:00
Xin LI	30324e945a	MFV r268122: 4929 want prevsnap property illumos/illumos-gate@b461c7460e MFC after: 2 weeks	2014-07-01 22:42:53 +00:00
Xin LI	9cc8a15b2e	MFV r268121: 4924 LZ4 Compression for metadata illumos/illumos-gate@b8289d24d8 MFC after: 2 weeks	2014-07-01 22:31:09 +00:00
Pedro F. Giffuni	f384ec379c	Small merges from OpenSolaris: These have no effect on FreeBSD, in fact they are ifdef'ed, but make easier future merges: 6699767 panic in spec_open() 6718877 crgetzoneid() use can cause problems when forking processes with USDT providers in a non global zone MFC after: 3 days	2014-07-01 22:16:44 +00:00
Xin LI	aa882b9048	MFV r268119: 4914 zfs on-disk bookmark structure should be named *_phys_t illumos/illumos-gate@7802d7bf98 MFC after: 2 weeks	2014-07-01 21:51:30 +00:00
Xin LI	55f6421982	- Fix handling of "new" style of ioctl in compatibility mode [1]; - Reorganize code and reduce diff from upstream; - Improve forward compatibility shims for previous kernel; Reported by: sbruno [1] X-MFC-With: r268075	2014-07-01 20:57:39 +00:00
Pedro F. Giffuni	c6d712caf3	Revert r268007, and re-adapt MFV r260708: 4427 pid provider rejects probes with valid UTF-8 names Use of u8_textprep.c required -Wno-cast-qual for powerpc. MFC after: 2 weeks	2014-07-01 15:36:05 +00:00
Xin LI	be78a8db97	MFV r267570: 4756 metaslab_group_preload() could deadlock illumos/illumos-gate@30beaff42d MFC after: 2 weeks	2014-07-01 08:36:56 +00:00
Xin LI	3a0f8ff95e	MFV r267569: 4897 Space accounting mismatch in L2ARC/zpool illumos/illumos-dist@3038a2b421 MFC after: 2 weeks	2014-07-01 08:28:49 +00:00
Xin LI	93b8d53c09	MFV r267567: 4881 zfs send performance degradation when embedded block pointers are encountered illumos/illumos-gate@06315b795c MFC after: 2 weeks	2014-07-01 07:56:07 +00:00
Xin LI	71eaf0fda7	MFV r267566: 4390 i/o errors when deleting filesystem/zvol can lead to space map corruption MFC after: 2 weeks	2014-07-01 07:29:42 +00:00
Xin LI	29441ba3fa	MFV r267565: 4757 ZFS embedded-data block pointers ("zero block compression") 4913 zfs release should not be subject to space checks MFC after: 2 weeks	2014-07-01 06:43:15 +00:00
Pedro F. Giffuni	0135aadfc3	Reduce some warnings in the Solaris unicode support. Clean some warnings from parenthesis and minor style issues. MFC after: 3 days	2014-06-29 02:28:05 +00:00
Pedro F. Giffuni	f34dd28f7d	Revert r267869: MFV r260708 4427 pid provider rejects probes with valid UTF-8 names Use of u8_textprep.c broke the build on powerpc. Reported by: bz, rpaulo and tinderbox. Pointyhat: me	2014-06-28 19:59:12 +00:00
Hans Petter Selasky	af3b2549c4	Pull in r267961 and r267973 again. Fix for issues reported will follow.	2014-06-28 03:56:17 +00:00
Glen Barber	37a107a407	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory	2014-06-27 22:05:21 +00:00
Hans Petter Selasky	3da1cf1e88	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating children of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, hence some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading kludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, hence malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
Rui Paulo	a43f0be9fe	MFV illumos 4471 DTrace count() with histogram 4472 DTrace full width distribution histograms 4473 DTrace frequency trails MFC after: 2 weeks	2014-06-26 23:24:59 +00:00
Rui Paulo	8e648814b0	MFV illumos 4474 DTrace Userland CTF Support 4475 DTrace userland Keyword 4476 DTrace tests should be better citizens 4479 pid provider types 4480 dof emulation is missing checks MFC after: 2 weeks	2014-06-26 23:21:11 +00:00
Rui Paulo	b1f9167f94	MFV illumos 4477 DTrace should speak JSON MFC after: 2 weeks	2014-06-26 21:45:49 +00:00
Rui Paulo	0c2b601953	MFV illumos r266986: 2915 DTrace in a zone should see "cpu", "curpsinfo", et al 2916 DTrace in a zone should be able to access fds[] 2917 DTrace in a zone should have limited provider access MFC after: 2 weeks	2014-06-26 19:38:16 +00:00
Rui Paulo	dd9b2abed8	Revert r267898.	2014-06-26 17:34:42 +00:00
Rui Paulo	d8e37c5f72	Bring the following change from the illumos-joyent repository: commit 78e24ab6803bbe11ba37642624e1498ede5b239d Author: Bryan Cantrill <bryan@joyent.com> Date: Thu Oct 31 01:20:54 2013 OS-1688 DTrace count() with histogram OS-2360 DTrace full width distribution histograms OS-2361 DTrace frequency trails MFC after: 2 weeks	2014-06-26 07:06:43 +00:00
Pedro F. Giffuni	af8bd6e468	MFV r260708 4427 pid provider rejects probes with valid UTF-8 names This makes use of Solaris' u8_validate() which we happen to use since r185029 for ZFS. Illumos Revision: 1444d846b126463eb1059a572ff114d51f7562e5 Reference: https://www.illumos.org/issues/4427 Obtained from: Illumos MFC after: 2 weeks	2014-06-25 14:23:30 +00:00
Davide Italiano	a99098e2ba	Continue the crusade towards a dev_clone()-free kernel, removing its usage from dtrace. The dtrace code already uses cdevpriv(9) since FreeBSD 8, so this change should be quite harmless. Reviewed by: markj Approved by: markj MFC after: never	2014-06-25 03:54:02 +00:00
Mark Johnston	efa1aff675	Fix some bugs when fetching probe arguments in i386. Firstly ensure that the 4 byte-aligned dtrace_invop_callsite can be found and that it immediately follows the call to dtrace_invop(). Secondly, fix some pointer arithmetic to account for differences between struct i386_frame and illumos' struct frame. Finally, ensure that dtrace_getarg() isn't inlined. It works by following a fixed number of frame pointers to the probe site, so inlining breaks it. MFC after: 3 weeks	2014-06-23 02:00:14 +00:00
Mark Johnston	8382ec9e6a	Fix a couple of bugs on amd64 when fetching probe arguments beyond the first five for probes entered through a UD fault (i.e. FBT probes). Specifically, handle the fact that dtrace_invop_callsite must be 16 byte-aligned and thus may not immediately follow the call to dtrace_invop() in dtrace_invop_start(). Also fetch register arguments and the stack pointer through a struct trapframe instead of a struct reg. PR: 191260 Submitted by: luke.tw@gmail.com MFC after: 3 weeks	2014-06-23 01:10:56 +00:00
Mark Johnston	9338d20884	Allow creation of SDT probes from a module in which no providers are defined. This ensures that the sdt:zfs:: probes appear despite the fact the sdt provider is defined in the kernel rather than in zfs.ko. Reported by: hiren Tested by: hiren MFC after: 2 weeks	2014-06-21 19:29:40 +00:00
Steven Hartland	74ddec2b18	Removed stale comment about multi-vdev root pool config not working MFC after: 1 week	2014-06-09 13:04:58 +00:00
Bryan Drewery	f3a7518361	- Naively fix build by partially reverting r267029 to still use gethrtime() when building libzpool. X-MFC-With: 267029	2014-06-04 05:04:15 +00:00
Alexander Motin	4220ebcf71	Replace gethrtime() with cpu_ticks() as the source of randomness for taskqueue selection. gethrtime() in our port is updated at the HZ rate, which makes it unusable for this specific purpose and completely negates the benefit of multiple taskqueues. MFC after: 2 weeks	2014-06-03 21:06:03 +00:00
Xin LI	f4c7dd6dd0	MFV 266913+266914: 3897 zfs filesystem and snapshot limits (fix leak) 4901 zfs filesystem/snapshot limit leaks MFC after: 3 days	2014-05-31 01:00:22 +00:00
Xin LI	2bdf7f79bc	MFV r266766: Add a new zfs property, "redundant_metadata" which can have values "all" or "most". The default will be "all", which is the current behavior. When set to all, ZFS stores an extra copy of all metadata. If a single on-disk block is corrupt, at worst a single block of user data (which is recordsize bytes long) can be lost. Setting to "most" will cause us to only store 1 copy of level-1 indirect blocks of user data files. This can improve performance of random writes, because less metadata has to be written. In practice, at worst about 100 blocks (of recordsize bytes each) of user data can be lost if a single on-disk block is corrupt. The exact behavior of which metadata blocks are stored redundantly may change in future releases. Illumos issue: 3835 zfs need not store 2 copies of all metadata MFC after: 2 weeks	2014-05-27 19:46:11 +00:00
Allan Jude	ecd9567c1a	Improve sysctl descriptions for new ZFS sysctls: vfs.zfs.dirty_data_max vfs.zfs.dirty_data_max_max vfs.zfs.dirty_data_sync Reviewed by: smh Approved by: wblock (mentor)	2014-05-22 05:30:38 +00:00
Steven Hartland	df23182a62	Added sysctls / tunables for ZFS dirty data tuning Added the following new sysctls / tunables: * vfs.zfs.dirty_data_max * vfs.zfs.dirty_data_max_max * vfs.zfs.dirty_data_max_percent * vfs.zfs.dirty_data_sync * vfs.zfs.delay_min_dirty_percent * vfs.zfs.delay_scale PR: kern/189865 MFC after: 2 weeks	2014-05-21 13:36:04 +00:00
Peter Grehan	c3ddb60e2d	Update dis_tables.c to the latest Illumos version. This includes decodes of recent Intel instructions, in particular VT-x and related instructions. This allows the FBT provider to locate the exit points of routines that include these new instructions. Illumos issues: 3414 Need a new word of AT_SUN_HWCAP bits 3415 Add isainfo support for f16c and rdrand 3416 Need disassembler support for rdrand and f16c 3413 isainfo -v overflows 80 columns 3417 mdb disassembler confuses rdtscp for invlpg 1518 dis should support AMD SVM/AMD-V/Pacifica instructions 1096 i386 disassembler should understand complex nops 1362 add kvmstat for monitoring of KVM statistics 1363 add vmregs[] variable to DTrace 1364 need disassembler support for VMX instructions 1365 mdb needs 16-bit disassembler support This corresponds to Illumos-gate (github) version eb23829ff08a873c612ac45d191d559394b4b408 Reviewed by: markj MFC after: 1 week	2014-05-15 01:06:27 +00:00
Xin LI	b8cdcb8ad8	Import George Wilson's change for Illumos #4730 : 4730 metaslab group taskq should be destroyed in metaslab_group_destroy() Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Original author: George Wilson MFC after: 3 days	2014-05-06 19:03:04 +00:00
Steven Hartland	4f64781818	Use a zio flag to prevent recursion of vdev_queue_io_done which can cause stack overflow for IO's which return ZIO_PIPELINE_CONTINUE from the zio_vdev_io_start stage and hence don't suspend and complete in a different thread. This prevents double fault panic on slow machines running ZFS on GELI volumes which return EOPNOTSUPP directly to BIO_DELETE requests. MFC after: 1 month X-MFC-With: r265152	2014-05-04 14:05:14 +00:00
Steven Hartland	573621a6d6	Don't treat TRIM requests returning ENOTSUP as an unexpected error. MFC after: 1 month X-MFC-With: r265152	2014-05-03 02:30:01 +00:00
Steven Hartland	10138166cf	Removed pointless / duplicated call to trim_map_first. MFC after: 1 month X-MFC-With: r265152	2014-05-02 09:31:21 +00:00
Steven Hartland	82ce008538	Reintroduce priority for the TRIM ZIOs instead of using the "NOW" priority This changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it translated to the required geom values. This reduces the amount of changes required for FREE requests to be supported by the new IO scheduler. This also eliminates the need for a specific DKIOCTRIM. Also fixed FREE vdev child IO's from running ZIO_STAGE_VDEV_IO_DONE as part of their schedule. As the new IO scheduler can result in a request to execute one type of IO to actually run a different type of IO it requires that zio_trim requests are processed without holding the trim map lock (tm->tm_lock), as the free request execute call may result in a write request running, hence triggering a trim_map_write_start call, which takes the trim map lock and hence would result in recursing on a non-recursive sx lock. This is based off avg's original work, so credit to him. MFC after: 1 month	2014-04-30 17:46:29 +00:00
Steven Hartland	101dfa0ed4	Fix ZIO reordering done by vdev_queue_io causing panics when zio_vdev_io_start returns ZIO_PIPELINE_CONTINUE from vdev_op_io_start to zio_execute resulting in the wrong ZIO continuing its pipeline. This is a serious issue which could cause data loss / corruption but appears to be limited to error handling such as when vdev_readable(vd) returns false. MFC after: 2 days	2014-04-28 09:00:00 +00:00
Steven Hartland	c2b2c5fc76	Eliminate duplicate checks in vdev_geom_io_intr error handling MFC after: 1 month	2014-04-24 15:36:00 +00:00
Steven Hartland	5b245b8ae0	Add the ability to set a minimum ashift size for ZFS pool creation or root level vdev addition. Change max_auto_ashift sysctl to error when an invalid value is requested instead of silently limiting it.	2014-04-24 01:06:03 +00:00
Xin LI	754180f4ae	MFV r264830: 4745 fix AVL code misspellings MFC after: 2 weeks	2014-04-23 20:32:39 +00:00
Xin LI	f8587167e4	MFV r264829: 3897 zfs filesystem and snapshot limits MFC after: 2 weeks	2014-04-23 20:29:46 +00:00
Xin LI	18ab4bd8d9	MFV r264668: 4754 io issued to near-full luns even after setting noalloc threshold 4755 mg_alloc_failures is no longer needed illumos/illumos@b6240e830b MFC after: 2 weeks	2014-04-18 22:04:58 +00:00
Xin LI	d301d390a7	MFV r264667: 4752 fan out read zio taskqs illumos/illumos-gate@1b497ab83e	2014-04-18 21:35:23 +00:00
Xin LI	613074ec08	MFV r264666: 4374 dn_free_ranges should use range_tree_t illumos/illumos-gate@bf16b11e8d MFC after: 2 weeks	2014-04-18 21:15:12 +00:00
Mark Johnston	38e6967f04	Ensure that all eight syscall arguments are available to dtrace_probe(), rather than just the first five. This is done by calling dtrace_probe() through a function pointer, as in illumos. MFC after: 3 weeks	2014-04-14 00:23:18 +00:00
Mark Johnston	0626f3e435	DTrace's pid provider works by inserting breakpoint instructions at probe sites and installing a hook at the kernel's trap handler. The fasttrap code will emulate the overwritten instruction in some common cases, but otherwise copies it out into some scratch space in the traced process' address space and ensures that it's executed after returning from the trap. In Solaris and illumos, this (per-thread) scratch space comes from some reserved space in TLS, accessible via the fs segment register. This approach is somewhat unappealing on FreeBSD since it would require some modifications to rtld and jemalloc (for static TLS) to ensure that TLS is executable, and would thus introduce dependencies on their implementation details. I think it would also be impossible to safely trace static binaries compiled without these modifications. This change implements the functionality in a different way, by having fasttrap map pages into the target process' address space on demand. Each page is divided into 64-byte chunks for use by individual threads, and fasttrap's process descriptor struct has been extended to keep track of any scratch space allocated for the corresponding process. With this change it's possible to trace all libc functions in a program, e.g. with pid$target:libc.so.*::entry {@[probefunc] = count();} Previously this would generally cause the victim process to crash, as tracing memcpy on amd64 requires the functionality described above. Tested by: Prashanth Kumar <pra_udupi@yahoo.co.in> (earlier version) MFC after: 6 weeks	2014-04-14 00:22:42 +00:00
Davide Italiano	2f9e29745c	Fix a panic in zfs_rename(). This is due to a wrong dereference of a vnode when it's not locked and can be (potentially) recycled. 'sdvp' cannot be locked on zfs_rename() entry point because the VFS can't be sure that this scenario is LOR-free (it might violate the parent->child lock acquisition rule). Dereference 'tdvp' instead, which is already locked on entry, and access 'sdvp' fields only when it's safe, i.e. under ZFS_ENTER scope. While at it, remove the usage of VOP_REALVP, as this is a NOP on FreeBSD. Discussed with: avg Reviewed by: pjd	2014-04-13 01:15:37 +00:00
Alexander Motin	f6e1dc83c3	Create zvol devices on zfs clone. While big and shiny patch is not ready, it is better to have something. PR: kern/178999 MFC after: 1 week	2014-04-11 11:56:16 +00:00
Alexander Motin	a96fefe042	In addition to r264077, tell GEOM that we do support BIO_DELETE now.	2014-04-06 16:31:28 +00:00
Alexander Motin	537650f54d	Add property and sysctl to control how ZVOLs are exposed to OS. New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL between three modes: geom -- existing fully functional behavior (default); dev -- exposing volumes only as raw disk device file in devfs; none -- not exposing volumes outside ZFS. The "dev" mode is less functional (can't be partitioned, mounted, etc), but it is faster, and in some scenarios with untrusted consumers safer. It can be useful for NAS, VM block storages, etc. The "none" mode may be convenient for backup servers, etc. that don't need direct data access. Due to the way ZVOL is integrated with main ZFS code, those property and sysctl are checked only during pool import and volume creation. MFC after: 1 month Sponsored by: iXsystems, Inc.	2014-04-05 13:01:44 +00:00
Alexander Motin	89e84aead6	MFV r258922: 3580 Want zvols to return volblocksize when queried for physical block size illumos/illumos-gate@a0b60564df It is irrelevant for FreeBSD, just reducing diff.	2014-04-03 20:18:55 +00:00
Alexander Motin	4a03e8b64d	Add BIO_DELETE support to ZVOL. It is an adapted merge from the vendor branch of: 701 UNMAP support for COMSTAR (in part related to ZFS) 2130 zvol DKIOCFREE uses nested DMU transactions	2014-04-03 15:04:32 +00:00
Pedro F. Giffuni	23e4da439c	MFV r258379; 4248 dtrace(1M) should never create DOF with empty probes section 4249 Only probes from the first DTrace object file will be included Illumos Revision: 4a20ab41aadcb81c53e72fc65886e964e9add59 Reference: https://www.illumos.org/issues/4248 https://www.illumos.org/issues/4249 Obtained from: Illumos MFC after: 1 month	2014-04-02 15:32:44 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcpu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
Robert Watson	4a14441044	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
Alexander Motin	68d17718e0	Report ZVOL block size as GEOM stripesize. MFC after: 2 weeks	2014-03-13 19:26:26 +00:00
Xin LI	8e41e26f65	MFV r262983: 4638 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot! illumos/illumos-gate@2144b121c0	2014-03-11 00:23:50 +00:00
Xin LI	ba680558a0	All callers of the static method load_nvlist() in spa.c handle the error case, so there is no reason to assert that we won't hit an error. Instead, just return that error to the caller and have the upper layer handle it. Obtained from: FreeNAS Reported by: rodrigc Reviewed by: Matthew Ahrens MFC after: 2 weeks	2014-03-02 02:41:33 +00:00
Mark Johnston	b53bfbba65	Expose a few DTrace parameters as sysctls under kern.dtrace and add descriptions for several existing sysctls. PR: 187027 Submitted by: Fedor Indutny <fedor@indutny.com> (original version) MFC after: 2 weeks	2014-03-01 19:06:43 +00:00
Mark Johnston	ae520d3dc4	Fix emulation of call and jmp instructions on i386 and for 32-bit processes on amd64. Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in> MFC after: 2 weeks	2014-03-01 17:55:20 +00:00
Mark Johnston	ae9f1a185c	4478 dtrace_dof_maxsize is far too small illumos/illumos-gate@d339a29bb4 PR: 187027 MFC after: 1 week	2014-02-28 02:04:41 +00:00
Mark Johnston	c0c943de72	Fix the struct reg mappings for i386 and amd64, which differ between illumos and FreeBSD. Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in> MFC after: 2 weeks	2014-02-27 01:24:47 +00:00
Mark Johnston	0339a1c2b4	Move some files that are identical on i386 and amd64 to an x86 subdirectory rather than keeping duplicate copies. Discussed with: avg MFC after: 1 week	2014-02-27 01:04:35 +00:00
Mark Johnston	5bcd30f3b1	Revert r262466, as it does not compile on PowerPC. Reported by: jhibbits	2014-02-26 01:00:00 +00:00
Mark Johnston	68ac8d05d3	Make all 8 syscall arguments available to syscall probes in the same way that this is done for SDT probes. This fixes the syscall/tst.args.d test, which was failing because mmap(2)'s sixth argument wasn't available to the probe. MFC after: 2 weeks	2014-02-25 02:58:11 +00:00
Mark Johnston	33db01542c	1452 DTrace buffer autoscaling should be less violent illumos/illumos-gate@6fb4854bed This fixes the tst.resize1.d and tst.resize2.d DTrace tests, which have been failing since r261122 since they were causing dtrace(1) to attempt to allocate and use large amounts of memory, and get killed by the OOM killer as a result. MFC after: 1 month	2014-02-22 05:18:55 +00:00
Mark Johnston	dc0f030e51	Define the KM_NORMALPRI flag for kmem_alloc(), as it is used in some upstream DTrace code. It indicates that the kernel memory allocator need not attempt to satisfy non-blocking allocations in low-memory conditions. This has no direct equivalent in the malloc(9) flags, so it is just defined to 0 for now.	2014-02-22 05:13:35 +00:00
Xin LI	5f62f8cdcb	MFV r261619: 4574 get_clones_stat does not call zap_count in non-debug kernel zap_count(...) is never called in a non-DEBUG kernel. As a result, the "count" variable is always 0, and "goto fail" is always reached. This means the get_clones_stat function never builds the list of clones for the "clones" property. MFC after: 2 weeks	2014-02-08 05:35:36 +00:00
Xin LI	bea6313e6b	MFV r260834: Fix memory leak of compressed buffers in l2arc_write_done (Illumos #3995).	2014-01-18 01:45:39 +00:00
Andriy Gapon	6d03ca5789	traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT This is done to ensure that visited object IDs are always increasing. Also, pass correct object ID to prefetch_dnode_metadata for os_groupused_dnode. Without this change we would hit an assert if traversal was paused on a GROUPUSED object, which is unlikely but possible. Apparently the same change was independently developed by Delphix. Reviewed by: Matthew Ahrens <mahrens@delphix.com> MFC after: 10 days Sponsored by: HybridCluster	2014-01-17 10:23:46 +00:00
Andriy Gapon	fec721bc43	fix a build problem with INVARIANTS enabled introduced in r260704 Reported by: glebius MFC after: 5 days X-MFC with: r260704	2014-01-16 13:44:37 +00:00
Andriy Gapon	876fa2c17b	fix a bug in ZFS mirror code for handling multiple DVAs The bug was introduced in r256956 "Improve ZFS N-way mirror read performance". The code in vdev_mirror_dva_select erroneously considers already tried DVAs for the next attempt. Thus, it is possible that a failing DVA would be retried forever. As a secondary effect, if the attempts fail with checksum error, then checksum error reports are accumulated until the original request ultimately fails or succeeds. But because retrying is going on indefinitely the checksum reports accumulation will effectively be a memory leak. Reviewed by: gibbs MFC after: 13 days Sponsored by: HybridCluster	2014-01-16 13:24:10 +00:00
Andriy Gapon	00126789e6	Revert r260705: wrong patch committed by accident An earlier, less efficient version was committed by accident.	2014-01-16 13:20:20 +00:00
Andriy Gapon	19f5e9076b	zfs_deleteextattr: name buffer from namei is needed by zfs_rename If we prematurely free the name buffer and it gets quickly recycled, then zfs_rename may see data from another lookup or even unmapped memory via cn_nameptr. MFC after: 6 days Sponsored by: HybridCluster	2014-01-16 12:31:27 +00:00
Andriy Gapon	2f9a31944f	fix a bug in ZFS mirror code for handling multiple DVAs The bug was introduced in r256956 "Improve ZFS N-way mirror read performance". The code in vdev_mirror_dva_select erroneously considers already tried DVAs for the next attempt. Thus, it is possible that a failing DVA would be retried forever. As a secondary effect, if the attempts fail with checksum error, then checksum error reports are accumulated until the original request ultimately fails or succeeds. But because retrying is going on indefinitely the checksum reports accumulation will effectively be a memory leak. Reviewed by: gibbs MFC after: 13 days Sponsored by: HybridCluster	2014-01-16 12:26:54 +00:00
Andriy Gapon	b8ca4667ed	zfs: getnewvnode_reserve must be called outside of a zfs transaction Otherwise we could run into the following deadlock. A thread has a transaction open and assigned to a transaction group. That would prevent the transaction group from being quiesced and synced. The thread is blocked in getnewvnode_reserve waiting for a vnode to be reclaimed. vnlru thread is blocked trying to enter ZFS VOP because a filesystem is suspended by an ongoing rollback or receive operation. In its turn the operation is waiting for the current transaction group to be synced. zfs_zget is always used outside of active transactions, but zfs_mknode is always used in a transaction context. Thus, we hoist getnewvnode_reserve from zfs_mknode to its callers. While there, assert that ZFS always calls getnewvnode while having a vnode reserved. Reported by: adrian Tested by: adrian MFC after: 17 days Sponsored by: HybridCluster	2014-01-16 12:22:46 +00:00
Marcel Moolenaar	642ebd6a18	In atomic_or_8_nv() load 1 and not 8 bytes from the address given. Note that atomic_or_8_nv() is not used at this time.	2014-01-06 05:00:58 +00:00
Alexander Motin	77e2eaf5b8	Fix build after r260234 by converting ddi_get_lbolt64() from inline into a macro. Otherwise the compiler complains that the hz variable used there is either undefined or defined twice, thanks to the header mess caused by compat shims.	2014-01-05 19:07:42 +00:00
Alexander Motin	ce05e707c4	In dmu_zfetch_stream_reclaim() replace division with multiplication and move it out of the loop and lock.	2014-01-03 18:44:37 +00:00
Alexander Motin	99e2428636	Remove extra conversion to nanoseconds from ddi_get_lbolt64(). As a result, this uses one multiplication and shifts instead of one division and two multiplications.	2014-01-03 18:08:31 +00:00
Xin LI	7c88e58f46	MFV r260155: When we encounter an I/O error on a piece of metadata while deleting a file system or zvol, we don't update the bptree_entry_phys_t's bookmark. This would lead to double free of bp's which will lead to space map corruption. Instead of tolerating and allowing the corruption, panic immediately. See Illumos #4390 for more details. 4391 panic system rather than corrupting pool if we hit bug 4390 Illumos/illumos-gate@8b36997aa2 MFC after: 2 weeks	2014-01-02 08:10:35 +00:00
Xin LI	ab0b9f6b30	MFV r260154 + 260182: 4369 implement zfs bookmarks 4368 zfs send filesystems from readonly pools Illumos/illumos-gate@78f1710053 MFC after: 2 weeks	2014-01-02 07:34:36 +00:00
Xin LI	6f2791f53a	Fix build on platforms where atomic_swap_64 is not available.	2014-01-02 03:24:44 +00:00
Xin LI	647795d181	MFV r260153: 4121 vdev_label_init should treat request as succeeded when pool is read only Illumos/illumos-gate@973c78e94b MFC after: 2 weeks	2014-01-01 01:26:39 +00:00
Xin LI	f4c8ba8370	MFV r259170: 4370 avoid transmitting holes during zfs send 4371 DMU code clean up illumos/illumos-gate@43466aae47 NOTE: Make sure the boot code is updated if a zpool upgrade is done on boot zpool. MFC after: 2 weeks	2014-01-01 00:45:28 +00:00
Xin LI	cca1e7c623	MFV r258385: (Note: this change is not applicable to FreeBSD and the file is not included in build. It's integrated for completeness). 4128 disks in zpools never go away when pulled illumos/illumos-gate@39cddb10a3 MFC after: 2 weeks	2013-12-31 21:24:00 +00:00
Xin LI	db2aff5f8b	MFV r242733: 3306 zdb should be able to issue reads in parallel 3321 'zpool reopen' command should be documented in the man page and help message illumos/illumos-gate@31d7e8fa33 FreeBSD porting notes: the kernel part of this changeset depends on Solaris buf(9S) interfaces and are not really applicable for our use. vdev_disk.c is patched as-is to reduce diverge from upstream, but vdev_file.c is left intact. MFC after: 2 weeks	2013-12-31 19:39:15 +00:00
Mark Johnston	b69b2ff588	Allocate the probe ID unrhdr before the DTrace kld_* event handlers are registered. Otherwise there is a small window during which probe IDs may be allocated before the unrhdr is allocated. MFC after: 2 weeks	2013-12-31 15:41:16 +00:00
Mark Johnston	a333376bba	Revert r260091. The vmem calls seem to be slower than the *_unr() calls that they replaced, which is important considering that probe IDs are allocated during process startup for USDT probes.	2013-12-31 15:37:51 +00:00
Mark Johnston	b9c04b396a	Now that vmem(9) is available, use vmem arenas to allocate probe and aggregation IDs, as is done in the upstream illumos code. This still requires some FreeBSD-specific code, as our vmem API is not identical to the one in illumos. Submitted by: Mike Ma <mikemandarine@gmail.com>	2013-12-30 17:37:32 +00:00
Xin LI	1aaa945f67	MFV r258374: 4171 clean up spa_feature_*() interfaces 4172 implement extensible_dataset feature for use by other zpool features illumos/illumos-gate@2acef22db7 MFC after: 2 weeks	2013-12-24 07:14:25 +00:00
Xin LI	ec097c1634	MFV r258373: 4168 ztest assertion failure in dbuf_undirty 4169 verbatim import causes zdb to segfa 4170 zhack leaves pool in ACTIVE state illumos/illumos-gate@7fdd916c47 MFC after: 2 weeks	2013-12-24 06:56:17 +00:00
Justin Hibbits	ad5017b513	Fix a brain-o. I had misread the limit as a size, but it's a pointer. Submitted by: Howard Su MFC after: 2 weeks X-MFC-with: r259668	2013-12-21 00:37:32 +00:00
Justin Hibbits	a76f5d59f4	Fix a couple bugs in FBT PowerPC. Clamp the size to a 'instruction size' not 'byte size', and fix a typo. MFC after: 2 weeks	2013-12-20 23:18:14 +00:00
Pawel Jakub Dawidek	4106732882	MFV r258923: 4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0 illumos/illumos-gate@bb411a08b0 MFC after: 3 days	2013-12-18 21:45:46 +00:00
Mark Johnston	7159310fa6	The fasttrap fork handler is responsible for removing tracepoints in the child process that were inherited from its parent. However, this should not be done in the case of a vfork, since the fork handler ends up removing the tracepoints from the shared vm space, and userland DTrace probes in the parent will no longer fire as a result. Now the child of a vfork may trigger userland DTrace probes enabled in its parent, so modify the fasttrap probe handler to handle this case and handle the child process in the same way that it would handle the traced process. In particular, if one traces function foo() in a process that vforks, and the child calls foo(), fasttrap will treat this call as having come from the parent. This is the behaviour of the upstream code. While here, add #ifdef guards to some code that isn't present upstream. MFC after: 1 month	2013-12-18 01:41:52 +00:00
Alan Somers	cd730bd6b2	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c When a da or ada device disappears, outstanding IOs fail with ENXIO, not EIO. The check for EIO was probably copied from Illumos, where that is indeed the correct errno. Without this change, pulling a busy drive from a zpool would usually turn it into UNAVAIL, even though pulling an idle drive would turn it into REMOVED. With this change, it is REMOVED every time. Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that results in devd getting two resource.fs.zfs.removed events. The comment said that the event had to be sent directly instead of through the async removal thread because "the DE engine is using this information to discard previous I/O errors". However, the fact that vdev_geom_io_intr was never actually sending the events until now, and that vdev_geom_orphan never sent them at all, and that vdev_geom_orphan usually gets called about 2 seconds after the actual removal, means that FreeBSD's userland can cope with a late event just fine. Approved by: ken (mentor) Sponsored by: Spectra Logic Corporation MFC after: 4 weeks	2013-12-12 00:27:22 +00:00
Mark Johnston	e53c69c1f5	Correct the check for errors from proc_rwmem(). MFC after: 2 weeks	2013-12-11 04:31:40 +00:00
Alexander Motin	f192c4873d	Don't even try to read vdev labels from devices smaller than SPA_MINDEVSIZE (64MB). Even if we would find one somehow, ZFS kernel code rejects such devices. It is funny to look at attempts to read 4 256K vdev labels from a 1.44MB floppy, though it is not very practical and quite slow.	2013-12-10 12:36:44 +00:00
Xin LI	9b11826d3d	Expose spa_asize_inflation. X-MFC-With: r258632	2013-12-06 23:49:16 +00:00
Andriy Gapon	f77ffe1b22	zfs: add zfs_freebsd_putpages this should be more optimal than writing pages one-by-one via zfs_write -> update_pages in the case of multi-page putpages call MFC after: 16 days	2013-11-29 15:39:39 +00:00
Andriy Gapon	6c5b7fffce	zfs: add dmu_write_pages variant for freebsd The freebsd variant of dmu_write_pages is hidden under _KERNEL to avoid needlessly pulling in vm_page_t declaration. Besides, this function seems to be useless for ZFS userland counterpart. MFC after: 15 days	2013-11-29 15:34:43 +00:00
Andriy Gapon	fdbcc95a47	zfs: make zfs_map_page / zfs_unmap_page public MFC after: 15 days	2013-11-29 15:33:40 +00:00
Andriy Gapon	998a42756c	drop ZUT_OBJ, zfs unit testing driver never materialized in freebsd MFC after: 5 days	2013-11-29 15:32:53 +00:00
Andriy Gapon	ac79eedf85	zfs mappedread_sf: assert that a page is never partially valid ZFS never partially validates or invalidates a page. The higher level VM should not do that either. mappedread_sf correct operation depends on a page being either fully valid or invalid. MFC after: 7 days	2013-11-29 12:19:52 +00:00
Andriy Gapon	be3d0087dc	MFV r258665: 4347 ZPL can use dmu_tx_assign(TXG_WAIT) illumos/illumos-gate@e722410c49 MFC after: 9 days X-MFC after: r258632	2013-11-28 19:44:36 +00:00
Andriy Gapon	456a87bb3b	MFV r258371,r258372: 4101 metaslab_debug should allow for fine-grained control 4101 metaslab_debug should allow for fine-grained control 4102 space_maps should store more information about themselves 4103 space map object blocksize should be increased 4104 ::spa_space no longer works 4105 removing a mirrored log device results in a leaked object 4106 asynchronously load metaslab illumos/illumos-gate@0713e232b7 Note that some tunables have been removed and some new tunables have been added. Of particular note, FreeBSD-only knob vfs.zfs.space_map_last_hope is removed as it was a nop for some time now (after one of the previous merges from upstream). MFC after: 11 days Sponsored by: HybridCluster [merge]	2013-11-28 19:37:22 +00:00
Andriy Gapon	02727275c5	opensolaris compat: add taskq_wait emulation MFC after: 10 days	2013-11-28 19:17:11 +00:00
Andriy Gapon	7bc07f0575	fix a serious bug in r258632: offset parameter must be set in zio In illumos all ioctl zio-s are "global" at the moment. That is they act on a whole disk, e.g. a cache flush command, and thus do not need either offset or size parameters. FreeBSD, on the other hand, has support for TRIM command and that command requires proper offset and size parameters. Without this fix all TRIM commands act on the start of any disk or partition used by ZFS destroying any data there. Pointyhat to: avg Tested by: sbruno MFC after: 3 days X-MFC with: r258632 Sponsored by: HybridCluster	2013-11-28 08:48:49 +00:00
Andriy Gapon	2ac1eeec44	fix debug.zfs_flags sysctl description in r258638 Pointyhat to: avg MFC after: 3 days	2013-11-26 10:57:09 +00:00
Andriy Gapon	78affb8591	expose zfs_flags as debug.zfs_flags r/w tunable and sysctl This knob is purposefully hidden under debug. MFC after: 5 days Sponsored by: HybridCluster	2013-11-26 10:46:43 +00:00
Andriy Gapon	3761ac95f7	MFV r258376: 3964 L2ARC should always compress metadata buffers illumos/illumos-gate@e4be62a2b7 MFC after: 10 days Sponsored by: HybridCluster [merge]	2013-11-26 10:14:23 +00:00
Andriy Gapon	fd51e905e2	MFV r255256: 3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit 4080 zpool clear fails to clear pool 4081 need zfs_mg_noalloc_threshold illumos/illumos-gate@22e30981d8 MFC after: 10 days Sponsored by: HybridCluster [merge]	2013-11-26 10:02:02 +00:00
Andriy Gapon	2a4704ab01	MFV r255255: 4045 zfs write throttle & i/o scheduler performance work illumos/illumos-gate@69962b5647 Please note the following changes: - zio_ioctl has lost its priority parameter and now TRIM is executed with 'now' priority - some knobs are gone and some new knobs are added; not all of them are exposed as tunables / sysctls yet MFC after: 10 days Sponsored by: HybridCluster [merge]	2013-11-26 09:57:14 +00:00
Andriy Gapon	fb8171c240	MFV r247578: 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot illumos/illumos-gate@ec94d32216 MFC after: 9 days Sponsored by: HybridCluster [merge]	2013-11-26 09:45:48 +00:00
Andriy Gapon	34140e78ab	734 taskq_dispatch_prealloc() desired 943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP illumos/illumos-gate@5aeb94743e Essentially FreeBSD taskqueues already operate in a mode that was added to Illumos with taskq_dispatch_ent change. We even exposed the superior FreeBSD interface as taskq_dispatch_safe. Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and struct ostask to taskq_ent_t, so that code differences will be minimal. After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no longer needed. Note that this commit is not an MFV because the upstream change was not individually committed to the vendor area. MFC after: 8 days	2013-11-26 09:26:18 +00:00
Andriy Gapon	2dbdedbc46	opensolaris taskq: some cosmetic changes - drop trailing whitespace - remove redundant "extern" from function declarations - remove unused macro MFC after: 1 week	2013-11-26 09:10:01 +00:00
Andriy Gapon	a776a1c1c5	sdt: add support for solaris/illumos style DTRACE_PROBE macros The new macros are implemented in terms of SDT_PROBE_DEFINE and SDT_PROBE. Probes defined in this way will appear under SDT provider named "sdt". Parameter types are exposed via SDT_PROBE_ARGTYPE. This is something that illumos does not have by default. This kind of SDT probe is already present in ZFS code, so those probes will now be available if the KDTRACE_HOOKS option is enabled. A potential future illumos compatibility enhancement is to encode a provider name as a prefix in a probe name. Reviewed by: markj MFC after: 3 weeks X-MFC after: r258622	2013-11-26 08:49:53 +00:00
Andriy Gapon	d9fae5ab88	dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE In its stead use the Solaris / illumos approach of emulating '-' (dash) in probe names with '__' (two consecutive underscores). Reviewed by: markj MFC after: 3 weeks	2013-11-26 08:46:27 +00:00
Pawel Jakub Dawidek	1cef014007	When append-only, immutable or read-only flag is set don't allow for hard links creation. This matches UFS behaviour. Reported by: Oleg Ginzburg <olevole@olevole.ru> MFC after: 1 month	2013-11-25 21:17:14 +00:00
Attilio Rao	54366c0bd7	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invocation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip	2013-11-25 07:38:45 +00:00
Andriy Gapon	a7236350c3	MFV r258378: 4089 NULL pointer dereference in arc_read() illumos/illumos-gate@57815f6b95 Tested by: adrian MFC after: 4 days	2013-11-20 11:52:32 +00:00
Andriy Gapon	c5f4a0a2eb	MFV r258377: 4088 use after free in arc_release() illumos/illumos-gate@ccc22e1304 MFC after: 5 days	2013-11-20 11:47:50 +00:00
Justin Hibbits	de950c79f3	Fix the function search space. Submitted by: Howard Su	2013-11-20 01:33:13 +00:00
Andriy Gapon	3fd7f7bef7	zfs page_busy: fix the boundaries of the cleared range This is a fix for a regression introduced in r246293. vm_page_clear_dirty expects the range to have DEV_BSIZE aligned boundaries, otherwise it extends them. Thus it can happen that the whole page is marked clean while actually having some small dirty region(s). This commit makes the range properly aligned and ensures that only the clean data is marked as such. It would be interesting to evaluate how much benefit clearing with DEV_BSIZE granularity produces. Perhaps instead we should clear the whole page when it is completely overwritten and not bother clearing any bits if only a portion of a page is written. Reported by: George Hartzell <hartzell@alerce.com>, Richard Todd <rmtodd@servalan.servalan.com> Tested by: George Hartzell <hartzell@alerce.com>, Reviewed by: kib MFC after: 5 days	2013-11-19 18:43:47 +00:00
Alexander Motin	c5068af559	Reenable vfs.zfs.zio.use_uma for amd64, disabled at r209261. On machines with several CPUs and enough RAM this can easily double ZFS performance or halve CPU usage. It was disabled three years ago due to memory and KVA exhaustion reports, but our VM subsystem has improved a lot since that time, hopefully enough to make another try.	2013-11-19 11:19:07 +00:00
Alan Somers	1f9e80bcdb	opensolaris/uts/common/dtrace/fasttrap.c Fix several problems that can cause panics on kldload and kldunload. * kproc_create(fasttrap_pid_cleanup_cb, ...) gets called before fasttrap_provs.fth_table gets allocated. This can lead to a panic on module load, because fasttrap_pid_cleanup_cb references fasttrap_provs.fth_table. Move kproc_create down after the point that fasttrap_provs.fth_table gets allocated, and modify the error handling accordingly. * dtrace_fasttrap_{fork,exec,exit} weren't getting NULLed until after fasttrap_provs.fth_table got freed. That caused panics on module unload because fasttrap_exec_exit calls fasttrap_provider_retire, which references fasttrap_provs.fth_table. NULL those function pointers earlier. * There wasn't any code to destroy the fasttrap_{tpoints,provs,procs}.fth_table mutexes on module unload, leading to a resource leak when WITNESS is enabled. Destroy those mutexes during fasttrap_unload(). Reviewed by: markj Approved by: ken (mentor) Sponsored by: Spectra Logic MFC after: 4 weeks	2013-11-18 16:51:56 +00:00
Steven Hartland	8dfd07b976	Fix ZFS deadlock when sending a snapshot which is mounted. MFC after: 1 week Sponsored by: Multiplay	2013-11-18 11:28:19 +00:00
Mark Johnston	dd580326fe	The fasttrap ioctl used to create probes takes a variable-sized argument. It was not being correctly copied into the kernel on FreeBSD, and as a result, probes with multiple probe sites were not being created properly. To fix this, change the ioctl definition so that the fasttrap ioctl handler is responsible for copying in userland data. Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in> MFC after: 1 month	2013-11-18 03:24:50 +00:00
Alexander Motin	e5056f9882	Introduce allocation cache to store LZ4 compression contexts without kicking VM subsystem twice for every written record. Tests on a 24-core system show the CPU time spent on copying a single large well-compressed file cut in half. This patch is not really needed on illumos (though it does no harm either) since their memory allocator by default uses caching for all requests up to 128K. Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>	2013-11-14 15:54:54 +00:00
Mark Johnston	a4cbcb127c	Use suword32 and suword64 instead of copyout(9). This fixes a bug in the emulation of the call instruction caused by reversing the uaddr and kaddr arguments when copying data out to userland: the suword* functions take the uaddr as the first argument whereas copyout(9) takes the kaddr as the first argument. This also partially undoes the fixes from r257143. Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in> (original version) MFC after: 1 month	2013-11-05 06:13:46 +00:00
Mark Johnston	57170f49f2	Remove references to an unused fasttrap probe hook, and remove the corresponding x86 trap type. Userland DTrace probes are currently handled by the other fasttrap hooks (dtrace_pid_probe_ptr and dtrace_return_probe_ptr). Discussed with: rpaulo	2013-10-31 02:35:00 +00:00
Mark Johnston	9c06d5a051	Do some cleanup of the SDT code. In particular, * Remove the unused sdt cdev. * Don't bother keeping a list of probes in struct sdt_prov; it's not needed. * Invoke sdt_load and sdt_unload from the module handler instead of registering separate SYSINITs. * Keep to within 80 columns. * Check for errors from dtrace_unregister().	2013-10-26 06:23:51 +00:00
Mark Johnston	165de3f338	Fix a couple of bugs in the fasttrap emulation of a "push %rbp" instruction: the code was trying to save the stack pointer rather than the frame pointer, and the arguments to copyout(9) were reversed, so nothing ended up being saved on the stack. This would cause process crashes when the pid provider was being used to instrument calls of a function starting with this instruction. Reported by: symbolics@gmx.com Tested by: symbolics@gmx.com (earlier version) MFC after: 2 weeks	2013-10-26 03:21:54 +00:00
Justin Hibbits	594ce9ad6f	ELF PowerPC64 ABI puts the LR save word at 16 byte offset, not 8.	2013-10-25 00:17:12 +00:00
Steven Hartland	c28078e903	Improve ZFS N-way mirror read performance by using load and locality information. The existing algorithm selects a preferred leaf vdev based on offset of the zio request modulo the number of members in the mirror. It assumes the devices are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being under utilized. The new algorithm takes into account the following additional factors: * Load of the vdevs (number of outstanding I/O requests) * The locality of last queued I/O vs the new I/O request. Within the locality calculation additional knowledge about the underlying vdev is considered, such as whether the device backing the vdev is a rotating media device. This results in performance increases across the board as well as significant increases for predominantly streaming loads and for configurations which don't have evenly performing devices. The following are results from a setup with 3 Way Mirror with 2 x HD's and 1 x SSD from a basic test running multiple parallel dd's. With pre-fetch disabled (vfs.zfs.prefetch_disable=1): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s With pre-fetch enabled (vfs.zfs.prefetch_disable=0): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s In addition to the performance changes the code was also restructured, with the help of Justin Gibbs, to provide a more logical flow which also ensures vdev loads are only calculated from the set of valid candidates. The following additional sysctls were added to allow the administrator to tune the behaviour of the load algorithm: * vfs.zfs.vdev.mirror.rotating_inc * vfs.zfs.vdev.mirror.rotating_seek_inc * vfs.zfs.vdev.mirror.rotating_seek_offset * vfs.zfs.vdev.mirror.non_rotating_inc * vfs.zfs.vdev.mirror.non_rotating_seek_inc These changes were based on work started by the zfsonlinux developers: https://github.com/zfsonlinux/zfs/pull/1487 Reviewed by: gibbs, mav, will MFC after: 2 weeks Sponsored by: Multiplay	2013-10-23 09:54:58 +00:00
Steven Hartland	70c3432663	Use the vdev's ashift to calculate the supported min block size passed to zio_compress_data(..) when compressing l2arc buffers. This eliminates l2arc I/O errors, which resulted in very poor performance on vdev's configured with block size greater than 512b due to compression assuming a smaller min block size than the vdev supports. MFC after: 2 days	2013-10-22 13:31:36 +00:00
Alexander Motin	40ea77a036	Merge GEOM direct dispatch changes from the projects/camlock branch. When safety requirements are met, it avoids passing I/O requests to the GEOM g_up/g_down threads, executing them directly in the caller context. That avoids CPU bottlenecks in the g_up/g_down threads and saves several context switches per I/O. The currently defined safety requirements are: - caller should not hold any locks and should be reenterable; - callee should not depend on GEOM dual-threaded concurrency semantics; - on the way down, if request is unmapped while callee doesn't support it, the context should be sleepable; - kernel thread stack usage should be below 50%. To keep compatibility with GEOM classes not meeting the above requirements, new provider and consumer flags were added: - G_CF_DIRECT_SEND -- consumer code meets caller requirements (request); - G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done); - G_PF_DIRECT_SEND -- provider code meets caller requirements (done); - G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request). A capable GEOM class can set them, allowing direct dispatch in cases where it is safe. If any of the requirements are not met, the request is queued to the g_up or g_down thread same as before. The following GEOM classes were reviewed and updated to support direct dispatch: CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE, VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL, MAP, FLASHMAP, etc). To declare direct completion capability the disk(9) KPI got a new flag equivalent to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION. The da(4) and ada(4) disk drivers have it set now thanks to earlier CAM locking work. This change more than doubles peak block storage performance on systems with many CPUs, together with earlier CAM locking changes reaching more than 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to 256 user-level threads). Sponsored by: iXsystems, Inc. MFC after: 2 months	2013-10-22 08:22:19 +00:00
Mark Johnston	7e75d58610	When fetching function arguments out of a frame on amd64, explicitly select the register based on the argument index rather than relying on the fields in struct reg to be in the right order. This assumption is incorrect on FreeBSD and generally led to bogus argument values for the sixth argument of PID and USDT probes; the first five are passed directly to dtrace_probe() via the fasttrap trap handler and so were correctly handled. MFC after: 2 weeks	2013-10-21 04:15:55 +00:00
Mark Johnston	e572bc11ec	Add a function, memstr, which can be used to convert a buffer of null-separated strings to a single string. This can be used to print the full arguments of a process using execsnoop (from the DTrace toolkit) or with the following one-liner: dtrace -n 'syscall::execve:return {trace(curpsinfo->pr_psargs);}' Note that this relies on the process arguments being cached via the struct proc, which means that it will not work for argvs longer than kern.ps_arg_cache_limit. However, the following rather non-portable script can be used to extract any argv at exec time: fbt::kern_execve:entry { printf("%s", memstr(args[1]->begin_argv, ' ', args[1]->begin_envv - args[1]->begin_argv)); } The debug.dtrace.memstr_max sysctl limits the maximum argument size to memstr(). Thanks to Brendan Gregg for helpful comments on freebsd-dtrace. Tested by: Fabian Keil (earlier version) MFC after: 2 weeks	2013-10-16 01:39:26 +00:00
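The memstr behaviour described in the commit above (turning a buffer of null-separated strings into a single string) can be illustrated with a small userland sketch. This is not the in-kernel DTrace subroutine: the function name `join_nul_separated`, its signature, and the requirement that the source buffer ends with a terminating NUL are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userland helper (not the in-kernel DTrace subroutine):
 * copy `len` bytes of NUL-separated strings from `src` into `dst`,
 * replacing every interior NUL with `sep`. The last byte of `src` is
 * expected to be the final terminator, so `dst` ends up as one C string.
 * `dst` must have room for at least `len` bytes.
 */
static char *
join_nul_separated(char *dst, const char *src, size_t len, char sep)
{
	size_t i;

	assert(dst != NULL && src != NULL && len > 0);
	assert(src[len - 1] == '\0');

	memcpy(dst, src, len);
	/* Leave the final byte alone so the result stays terminated. */
	for (i = 0; i + 1 < len; i++) {
		if (dst[i] == '\0')
			dst[i] = sep;
	}
	return (dst);
}
```

Called on an argv-style buffer holding `ls`, `-l` and `/tmp` with a space as the separator, the helper produces the single string `ls -l /tmp`.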
Justin Hibbits	30b318b92f	Add fasttrap for PowerPC. This is the last piece of the dtrace/ppc puzzle. It's incomplete, it doesn't contain full instruction emulation, but it should be sufficient for most cases. MFC after: 1 month	2013-10-15 15:00:29 +00:00
Andriy Gapon	5d8fac897e	MFV r255257: 4082 zfs receive gets EFBIG from dmu_tx_hold_free() illumos change 14172:be36a38bac3d: illumos ZFS issues: 4082 zfs receive gets EFBIG from dmu_tx_hold_free() Please note that this change is slightly different from r255257, because it is merged out of order with other (larger) upstream changes. PR: kern/182570 Reported by: Keith White <kwhite@site.uottawa.ca> Tested by: Keith White <kwhite@site.uottawa.ca> Approved by: re (glebius) MFC after: 1 week X-MFC after: r254753	2013-10-10 09:53:46 +00:00
Mark Johnston	cb7320ce7b	Initialize and free the DTrace taskqueue in the dtrace module load/unload handlers rather than in the dtrace device open/close methods. The current approach can cause a panic if the device is closed while the taskqueue thread is active, or if a kernel module containing a provider is unloaded while retained enablings are present and the dtrace device isn't opened. Submitted by: gibbs (original version) Reviewed by: gibbs Approved by: re (glebius) MFC after: 2 weeks	2013-10-08 12:56:46 +00:00
Xin LI	6eb151f212	Improve lzjb decompress performance by reorganizing the code to tighten the copy loop. Submitted by: Denis Ahrens <denis h3q com> MFC after: 2 weeks Approved by: re (gjb)	2013-10-08 01:38:24 +00:00
Justin T. Gibbs	69d1b777e8	Optimize the block size used on ZFS cache devices as is already done for data and log devices. Reported by: Dmitryy Makarov Submitted by: smh Reviewed by: gibbs Approved by: re (delphij) MFC after: 2 weeks	2013-09-21 03:52:08 +00:00
Xin LI	253aa02fc3	MFV r254750: Add support of Illumos dumps on zvol over RAID-Z. Note that this only adds the features. FreeBSD would still need more work to support dumping on zvols. Illumos ZFS issues: 2932 support crash dumps to raidz, etc. pools MFC after: 1 month Approved by: re (ZFS blanket)	2013-09-21 00:17:26 +00:00
Davide Italiano	a25a7e386a	Fixup cross-device rename checks in ZFS. Add a check for the case where 'fdvp' is a directory, 'tvp' is an already existing directory and they have different mount points. Reported by: avg, pjd Reviewed by: pjd Approved by: re (rodrigc)	2013-09-20 23:22:00 +00:00
Xin LI	e8de677c74	MFV r247844 (illumos-gate 13975:ef6409bc370f) Illumos ZFS issues: 3582 zfs_delay() should support a variable resolution 3584 DTrace sdt probes for ZFS txg states Provide a compatibility shim for Solaris's cv_timedwait_hires to help aid future porting. Approved by: re (ZFS blanket)	2013-09-10 01:46:47 +00:00
Davide Italiano	d56b4cd4ac	- Use make_dev_credf(MAKEDEV_REF) instead of the race-prone make_dev()+ dev_ref() in the clone handlers that still use it. - Don't set SI_CHEAPCLONE flag, it's not used anywhere neither in devfs (for anything real) Reviewed by: kib	2013-09-07 13:45:44 +00:00
Pawel Jakub Dawidek	ab568de789	Handle cases where capability rights are not provided. Reported by: kib	2013-09-05 11:58:12 +00:00
Pawel Jakub Dawidek	7e473ea146	Add sysctl/tunables for various metaslab variables.	2013-09-05 00:53:01 +00:00
Pawel Jakub Dawidek	7008be5bd7	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. 
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine a few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t cap_rights_init(cap_rights_t rights, ...); void cap_rights_set(cap_rights_t rights, ...); void cap_rights_clear(cap_rights_t rights, ...); bool cap_rights_is_set(const cap_rights_t rights, ...); bool cap_rights_is_valid(const cap_rights_t rights); void cap_rights_merge(cap_rights_t dst, const cap_rights_t src); void cap_rights_remove(cap_rights_t dst, const cap_rights_t src); bool cap_rights_contains(const cap_rights_t big, const cap_rights_t little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL); Providing several rights that belong to the same array element this way is correct, but is not advised. It should only be used for alias definitions. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. 
This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
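The bit layout described in the commit above (two size bits at the top, five index bits below them, the rest for rights) can be checked with a short sketch. It assumes the index bits sit directly below the top two bits, which leaves 57 right bits per element and matches the quoted totals of 114 rights for two elements and 285 for five. All `SK_`/`sk_` names are hypothetical and exist only for this illustration; they are not the system header definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 63-62 array size, bits 61-57 index, bits 56-0 rights. */
#define SK_RIGHT_BITS	57
#define SK_INDEX_BITS	5
#define SK_INDEX_MASK	(((1ULL << SK_INDEX_BITS) - 1) << SK_RIGHT_BITS)
#define SK_CAPRIGHT(idx, bit)	((1ULL << (SK_RIGHT_BITS + (idx))) | (bit))

/* Values copied from the commit message, wrapped in the sketch macro. */
#define SK_CAP_LOOKUP	SK_CAPRIGHT(0, 0x0000000000000400ULL)
#define SK_CAP_FCHMOD	SK_CAPRIGHT(0, 0x0000000000002000ULL)
#define SK_CAP_PDKILL	SK_CAPRIGHT(1, 0x0000000000000800ULL)

/*
 * Return the array element a right (or an alias of rights) belongs to,
 * or -1 if the value mixes rights from different elements or carries
 * no index bit at all.
 */
static int
sk_right_index(uint64_t right)
{
	uint64_t idxbits = (right & SK_INDEX_MASK) >> SK_RIGHT_BITS;
	int idx = 0;

	/* Exactly one index bit must be set. */
	if (idxbits == 0 || (idxbits & (idxbits - 1)) != 0)
		return (-1);
	while ((idxbits & 1) == 0) {
		idxbits >>= 1;
		idx++;
	}
	return (idx);
}
```

An alias such as `SK_CAP_FCHMOD | SK_CAP_LOOKUP` still maps to element 0, while `SK_CAP_LOOKUP | SK_CAP_PDKILL` sets two index bits and is rejected, which is the kind of check the commit says the real functions can assert on.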
Justin Hibbits	7fb93a40c2	Whitespace cleanup.	2013-09-02 23:22:05 +00:00
Justin Hibbits	f0bd82a11b	Fixes for DTrace on PowerPC: - Implement dtrace_getarg() - Sync fbt with x86, and fix a typo. - Pull in the time synchronization code from amd64.	2013-08-31 16:30:20 +00:00
Xin LI	1c1075ed93	Previously, both zfs_rename and zfs_link checked whether the passed vnode belongs to the same mount point (v_vfsp or also known as v_mount in FreeBSD). This check prevents the code from proceeding further on vnodes that do not belong to ZFS, for instance, on UFS or NULLFS. The recent upstream change (merged as r254585) replaces the check of v_vfsp with a check of the znode's z_zfsvfs. On Illumos this works because when the vnode comes from lofs, VOP_REALVP() gives the right vnode. This is not true on FreeBSD, where our VOP_REALVP is a no-op, and as such tdvp is not guaranteed to be a ZFS vnode, and will later trigger a failed assertion when verifying the vnode. This changeset modifies our local shims (zfs_freebsd_rename and zfs_freebsd_link) to check if v_mount matches before proceeding further. Reported by: many Diagnostic work by: avg	2013-08-28 00:39:47 +00:00
Mark Johnston	29f4e216f2	Rename the kld_unload event handler to kld_unload_try, and add a new kld_unload event handler which gets invoked after a linker file has been successfully unloaded. The kld_unload and kld_load event handlers are now invoked with the shared linker lock held, while kld_unload_try is invoked with the lock exclusively held. Convert hwpmc(4) to use these event handlers instead of having kern_kldload() and kern_kldunload() invoke hwpmc(4) hooks whenever files are loaded or unloaded. This has no functional effect, but simplifies the linker code somewhat. Reviewed by: jhb	2013-08-24 21:13:38 +00:00
Xin LI	439024135c	MFV r254749: Don't hold dd_lock for long by breaking it when not doing dsl_dir accounting. It is not necessary to hold the lock while manipulating the parent's accounting, because there is no interface for userland to see a consistent picture of both parent and child at the same time anyway. Illumos ZFS issues: 4046 dsl_dataset_t ds_dir->dd_lock is highly contended	2013-08-24 00:42:37 +00:00
Xin LI	00e37ef129	MFV r254747: Fix a panic from dbuf_free_range() from dmu_free_object() while doing zfs receive. This is a regression from FreeBSD r253821. Illumos ZFS issues: 4047 panic from dbuf_free_range() from dmu_free_object() while doing zfs receive	2013-08-24 00:19:26 +00:00
Xin LI	3f0164abf3	MFV r254422: Illumos DTrace issues: 3089 want ::typedef 3094 libctf should support removing a dynamic type 3095 libctf does not validate arrays correctly 3096 libctf does not validate function types correctly	2013-08-23 23:21:24 +00:00
Andriy Gapon	2073a41a42	zfs: do not reject any operations on a pool just because it's a boot pool Unlike upstream, FreeBSD supports booting to all kinds of pools. Requested by: many Tested by: sbruno MFC after: 12 days	2013-08-23 14:43:32 +00:00
Andriy Gapon	17a9f2d4db	fbt: drop a local write-only variable Discovered with: gcc46 MFC after: 4 days	2013-08-23 14:41:27 +00:00
Andriy Gapon	05869c0ea7	zfs: inline and remove zfs_vnode_lock It didn't serve any useful purpose, but obscured file and line information useful for debugging. MFC after: 5 days X-MFC with: r254445	2013-08-23 14:40:09 +00:00
Konstantin Belousov	5944de8ecd	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-22 07:39:53 +00:00
Justin Hibbits	7ccb72b31f	Make dtrace_copy() actually work on PowerPC. Although unused currently, it may be used in the future by dtrace.	2013-08-22 02:54:20 +00:00
Kenneth D. Merry	7da1a731c6	Expand the use of stat(2) flags to allow storing some Windows/DOS and CIFS file attributes as BSD stat(2) flags. This work is intended to be compatible with ZFS, the Solaris CIFS server's interaction with ZFS, somewhat compatible with MacOS X, and of course compatible with Windows. The Windows attributes that are implemented were chosen based on the attributes that ZFS already supports. The summary of the flags is as follows: UF_SYSTEM: Command line name: "system" or "usystem" ZFS name: XAT_SYSTEM, ZFS_SYSTEM Windows: FILE_ATTRIBUTE_SYSTEM This flag means that the file is used by the operating system. FreeBSD does not enforce any special handling when this flag is set. UF_SPARSE: Command line name: "sparse" or "usparse" ZFS name: XAT_SPARSE, ZFS_SPARSE Windows: FILE_ATTRIBUTE_SPARSE_FILE This flag means that the file is sparse. Although ZFS may modify this in some situations, there is not generally any special handling for this flag. UF_OFFLINE: Command line name: "offline" or "uoffline" ZFS name: XAT_OFFLINE, ZFS_OFFLINE Windows: FILE_ATTRIBUTE_OFFLINE This flag means that the file has been moved to offline storage. FreeBSD does not have any special handling for this flag. UF_REPARSE: Command line name: "reparse" or "ureparse" ZFS name: XAT_REPARSE, ZFS_REPARSE Windows: FILE_ATTRIBUTE_REPARSE_POINT This flag means that the file is a Windows reparse point. ZFS has special handling code for reparse points, but we don't currently have the other supporting infrastructure for them. UF_HIDDEN: Command line name: "hidden" or "uhidden" ZFS name: XAT_HIDDEN, ZFS_HIDDEN Windows: FILE_ATTRIBUTE_HIDDEN This flag means that the file may be excluded from a directory listing if the application honors it. FreeBSD has no special handling for this flag. The name and bit definition for UF_HIDDEN are identical to the definition in MacOS X. 
UF_READONLY: Command line name: "urdonly", "rdonly", "readonly" ZFS name: XAT_READONLY, ZFS_READONLY Windows: FILE_ATTRIBUTE_READONLY This flag means that the file may not be written or appended, but its attributes may be changed. ZFS currently enforces this flag, but Illumos developers have discussed disabling enforcement. The behavior of this flag is different than MacOS X. MacOS X uses UF_IMMUTABLE to represent the DOS readonly permission, but that flag has a stronger meaning than the semantics of DOS readonly permissions. UF_ARCHIVE: Command line name: "uarch", "uarchive" ZFS_NAME: XAT_ARCHIVE, ZFS_ARCHIVE Windows name: FILE_ATTRIBUTE_ARCHIVE The UF_ARCHIVE flag means that the file has changed and needs to be archived. The meaning is the same as the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute. msdosfs and ZFS have special handling for this flag, i.e. they will set it when the file changes. sys/param.h: Bump __FreeBSD_version to 1000047 for the addition of new stat(2) flags. chflags.1: Document the new command line flag names (e.g. "system", "hidden") available to the user. ls.1: Reference chflags(1) for a list of file flags and their meanings. strtofflags.c: Implement the mapping between the new command line flag names and new stat(2) flags. chflags.2: Document all of the new stat(2) flags, and explain the intended behavior in a little more detail. Explain how they map to Windows file attributes. Different filesystems behave differently with respect to flags, so warn the application developer to take care when using them. zfs_vnops.c: Add support for getting and setting the UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN, UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags. All of these flags are implemented using attributes that ZFS already supports, so the on-disk format has not changed. ZFS currently doesn't allow setting the UF_REPARSE flag, and we don't really have the other infrastructure to support reparse points. 
msdosfs_denode.c, msdosfs_vnops.c: Add support for getting and setting UF_HIDDEN, UF_SYSTEM and UF_READONLY in MSDOSFS. It supported SF_ARCHIVED, but this has been changed to be UF_ARCHIVE, which has the same semantics as the DOS archive attribute instead of inverse semantics like SF_ARCHIVED. After discussion with Bruce Evans, change several things in the msdosfs behavior: Use UF_READONLY to indicate whether a file is writeable instead of file permissions, but don't actually enforce it. Refuse to change attributes on the root directory, because it is special in FAT filesystems, but allow most other attribute changes on directories. Don't set the archive attribute on a directory when its modification time is updated. Windows and DOS don't set the archive attribute in that scenario, so we are now bug-for-bug compatible. smbfs_node.c, smbfs_vnops.c: Add support for UF_HIDDEN, UF_SYSTEM, UF_READONLY and UF_ARCHIVE in SMBFS. This is similar to changes that Apple has made in their version of SMBFS (as of smb-583.8, posted on opensource.apple.com), but not quite the same. We map SMB_FA_READONLY to UF_READONLY, because UF_READONLY is intended to match the semantics of the DOS readonly flag. The MacOS X code maps both UF_IMMUTABLE and SF_IMMUTABLE to SMB_FA_READONLY, but the immutable flags have stronger meaning than the DOS readonly bit. stat.h: Add definitions for UF_SYSTEM, UF_SPARSE, UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY and UF_HIDDEN. The definition of UF_HIDDEN is the same as the MacOS X definition. Add commented-out definitions of UF_COMPRESSED and UF_TRACKED. They are defined in MacOS X (as of 10.8.2), but we do not implement them (yet). ufs_vnops.c: Add support for getting and setting UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY, UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS. Alphabetize the flags that are supported. These new flags are only stored, UFS does not take any action if the flag is set. 
Sponsored by: Spectra Logic Reviewed by: bde (earlier version)	2013-08-21 23:04:48 +00:00
Justin T. Gibbs	5119608387	Add kstat entries for ZFS compression statistics. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c: Add module lifetime functions to allocate and teardown state data. Report: - Compression attempts. - Buffers found to be empty. - Compression calls that are skipped because the data length is already less than or equal to the minimum block length. - Compression attempts that fail to yield a 12.5% compression ratio. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c: Add calls to the zio_compress.c module's init and fini functions. Sponsored by: Spectra Logic Corporation MFC after: 2 weeks	2013-08-21 19:40:43 +00:00
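The counters listed in the commit above distinguish three outcomes of a compression call. A minimal sketch of that classification follows, assuming the 12.5% rule is evaluated as "the compressed length must not exceed the source length minus one eighth". The enum, the function name `sk_classify_compression` and its parameters are hypothetical, and the real zio_compress_data() may round or order these checks differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome labels mirroring the reported statistics. */
enum sk_compress_outcome {
	SK_COMPRESS_SKIPPED,	/* source already <= minimum block length */
	SK_COMPRESS_REJECTED,	/* saved less than 12.5% */
	SK_COMPRESS_ACCEPTED	/* saved at least 12.5% */
};

/*
 * Classify one compression attempt.
 * s_len:    source length in bytes
 * c_len:    length the compressor produced
 * minblock: minimum block length of the target device or class
 */
static enum sk_compress_outcome
sk_classify_compression(size_t s_len, size_t c_len, size_t minblock)
{
	size_t d_len;

	assert(minblock > 0);

	if (s_len <= minblock)
		return (SK_COMPRESS_SKIPPED);

	/* Largest acceptable result: source length minus one eighth. */
	d_len = s_len - (s_len >> 3);
	if (c_len > d_len)
		return (SK_COMPRESS_REJECTED);
	return (SK_COMPRESS_ACCEPTED);
}
```

For a 4096-byte buffer the cut-off under this assumption is 3584 bytes: a 3584-byte result is accepted and a 3585-byte result is rejected.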
Justin T. Gibbs	439d30d121	Enhance the ZFS vdev layer to maintain both a logical and a physical minimum allocation size for devices. Use this information to automatically increase ZFS's minimum allocation size for new top-level vdevs to a value that more closely matches the optimum device allocation size. Use GEOM's stripesize attribute, if set, as the physical sector size of the GEOM. Calculate the minimum blocksize of each metaslab class. Use the calculated value instead of SPA_MINBLOCKSIZE (512b) when determining the likelihood of compression yielding a reduction in physical space usage. Report devices with sub-optimal block size configuration in "zpool status". Also properly fail attempts to attach devices with a logical block size greater than 8kB, since this will cause corruption to ZFS's label area. Sponsored by: Spectra Logic Corporation MFC after: 2 weeks Background ========== Many modern devices use physical allocation units that are much larger than the minimum logical allocation size accessible by external commands. Two prevalent examples of this are 512e disk drives (512b logical sector, 4K physical sector) and flash devices (512b logical sector, 4K or larger allocation block size, and 128k or larger erase block size). Operations that modify less than the physical sector size result in a costly read-modify-write or garbage collection sequence on these devices. Simply exporting the true physical sector of the device to ZFS would yield optimal performance, but has two serious drawbacks: 1) Existing pools created with devices that have different logical and physical block sizes, but were configured to use the logical block size (e.g. because the OS version used for pool construction reported the logical block size instead of the physical block size) will suddenly find that the vdev allocation size has increased. 
This can be easily tolerated for active members of the array, but ZFS would prevent replacement of a vdev with another identical device because it now appears that the smaller allocation size required by the pool is not supported by the new device. 2) The device's physical block size may be too large to be supported by ZFS. The optimal allocation size for the vdev may be quite large. For example, a RAID controller may export a vdev that requires read-modify-write cycles unless accessed using 64k aligned/sized requests. ZFS currently has an 8k minimum block size limit. Reporting both the logical and physical allocation sizes for vdevs solves these problems. A device may be used so long as the logical block size is compatible with the configuration. By comparing the logical and physical block sizes, new configurations can be optimized and administrators can be notified of any existing pools that are sub-optimal. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h: Add the SPA_ASHIFT constant. ZFS currently has a hard upper limit of 13 (8k) for ashift and this constant is used to both document and enforce this limit. sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h: Add the VDEV_AUX_ASHIFT_TOO_BIG error code. Add fields for exporting the configured, logical, and physical ashift to the vdev_stat_t structure. Add VDEV_STAT_VALID() macro which can be used to verify the presence of required vdev_stat_t fields in nvlist data. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c: Provide a SYSCTL_PROC handler for "max_auto_ashift". Since the limit is only referenced long after boot when a create operation occurs, there's no compelling need for it to be a boot time configurable tunable. This also allows the validation code for the max_auto_ashift value to be contained within the sysctl handler. Populate the new fields in the vdev_stat_t structure. Fail vdev opens if the vdev reports an ashift larger than SPA_MAXASHIFT. 
Propagate vdev_logical_ashift and vdev_physical_ashift between child and parent vdevs as is done for vdev_ashift. In vdev_open(), restore code that fails opens for devices where vdev_ashift grows. This can only happen now if the device's logical ashift grows, which means it really isn't safe to use the device. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c: Update the vdev_open() API so that both logical (what was just ashift before) and physical ashift are reported. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h: Add two new fields, vdev_physical_ashift and vdev_logical_ashift, to vdev_t. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c: Add vdev_ashift_optimize(). Call it anytime a new top-level vdev is allocated. cddl/contrib/opensolaris/cmd/zpool/zpool_main.c: Add text for the VDEV_AUX_ASHIFT_TOO_BIG error. For each sub-optimally configured leaf vdev, report configured and native block sizes. cddl/contrib/opensolaris/cmd/zpool/zpool_main.c: cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h: cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c: Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT. This status is reported on healthy pools containing vdevs configured to use a block size smaller than their reported physical block size. 
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c: Update find_vdev_problem() and supporting functions to provide the full vdev_stat_t structure to problem checking routines, and to allow descent into replacing vdevs. Add a vdev_non_native_ashift() validator which is used on the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT. cddl/contrib/opensolaris/lib/libzpool/common/kernel.c: cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h: Enhance sysctl userland stubs now that a SYSCTL_PROC handler is used in vdev.c. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h: When the group membership of a metaslab class changes (i.e. when a vdev is added or removed from a pool), walk the group list to determine the smallest block size currently available and record this in the metaslab class. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c: Add the metaslab_class_get_minblocksize() accessor. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c: In zio_compress_data(), take the minimum blocksize as an input parameter instead of assuming SPA_MINBLOCKSIZE. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c: In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum blocksize of the device. The l2arc code has its own code for deciding if compression is worthwhile, so this effectively disables zio_compress_data() from second guessing the original decision. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c: In zio_write_bp_init(), use the minimum blocksize of the normal metaslab class when compressing data.	2013-08-21 04:10:24 +00:00
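The relationship between block sizes and ashift values in the commit above (512b is ashift 9, 4K is 12, the 8k limit is 13) and the idea of raising a new vdev's ashift toward the physical value can be sketched as follows. The selection policy here is an assumption made for illustration, not a copy of vdev_ashift_optimize(); `SK_MAXASHIFT`, `sk_ashift_from_size` and `sk_pick_ashift` are hypothetical names, and `max_auto` stands in for the "max_auto_ashift" setting the commit mentions.

```c
#include <assert.h>
#include <stdint.h>

/* Upper limit named in the commit message: ashift 13, i.e. 8k blocks. */
#define SK_MAXASHIFT	13

/* Return log2(size) for a power-of-two block size, or -1 otherwise. */
static int
sk_ashift_from_size(uint64_t size)
{
	int shift = 0;

	if (size == 0 || (size & (size - 1)) != 0)
		return (-1);
	while ((size >> shift) != 1)
		shift++;
	return (shift);
}

/*
 * Pick an ashift for a new top-level vdev (assumed policy):
 * - refuse devices whose logical ashift exceeds SK_MAXASHIFT;
 * - otherwise start from the logical ashift and move toward the
 *   physical ashift, capped by max_auto, never below logical.
 */
static int
sk_pick_ashift(int logical, int physical, int max_auto)
{
	int chosen = logical;

	assert(logical >= 0 && physical >= 0 && max_auto >= 0);

	if (logical > SK_MAXASHIFT)
		return (-1);
	if (physical > logical) {
		chosen = physical < max_auto ? physical : max_auto;
		if (chosen < logical)
			chosen = logical;
	}
	return (chosen);
}
```

Under these assumptions a 512e drive (logical 9, physical 12) is configured with ashift 12, while a device reporting a 64k physical block (ashift 16) is capped at 13 and would be a candidate for the sub-optimal block size report.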
Xin LI	2640fb93f5	MFV r254421: Illumos ZFS issues: 3996 want a libzfs_core API to rollback to latest snapshot	2013-08-21 00:04:31 +00:00
Xin LI	c21d9cfe3d	MFV r254220: Illumos ZFS issues: 4039 zfs_rename()/zfs_link() needs stronger test for XDEV	2013-08-20 22:31:13 +00:00
Justin Hibbits	cc117e2773	Fix some ppc64 dtrace bugs, and enable systrace_freebsd32 for ppc64.	2013-08-19 05:10:46 +00:00
Mark Johnston	7bc992c037	Add a "translated type" argument to SDT_PROBE_ARGTYPE() and add some macros which allow one to define SDT probes that specify translated types. The idea is to make it easy to write SDT probe definitions that can work across multiple operating systems. In particular, this makes it possible to port illumos SDT probes to FreeBSD without changing their argument types, so long as the appropriate translators are defined. Then DTrace scripts written for Solaris/illumos will work on FreeBSD without any changes. MFC after: 1 week	2013-08-17 22:02:26 +00:00
Pawel Jakub Dawidek	2c40899ecc	Remove redundant variable.	2013-08-17 14:09:46 +00:00
Mark Johnston	12ede07ab8	Use kld_{load,unload} instead of mod_{load,unload} for the linker file load and unload event handlers added in r254266. Reported by: jhb X-MFC with: r254266	2013-08-14 00:42:21 +00:00
Mark Johnston	8776669b53	FreeBSD's DTrace implementation has a few problems with respect to handling probes declared in a kernel module when that module is unloaded. In particular, * Unloading a module with active SDT probes will cause a panic. [1] * A module's (FBT/SDT) probes aren't destroyed when the module is unloaded; trying to use them after the fact will generally cause a panic. This change fixes both problems by porting the DTrace module load/unload handlers from illumos and registering them with the corresponding EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all probes defined in a module when that module is unloaded, and to prevent a module unload from proceeding if some of its probes are active. The latter problem has already been fixed for FBT probes by checking lf->nenabled in kern_kldunload(), but moving the check into the DTrace framework generalizes it to all kernel providers and also fixes a race in the current implementation (since a probe may be activated between the check and the call to linker_file_unload()). Additionally, the SDT implementation has been reworked to define SDT providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT to create and destroy SDT probes when a module is loaded or unloaded. This simplifies things quite a bit since it means that pretty much all of the SDT code can live in sdt.ko, and since it becomes easier to integrate SDT with the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible in that SDT providers spanning multiple modules can be created on the fly when a module is loaded; at the moment it looks like illumos' SDT implementation requires all SDT probes to be statically defined in a single kernel table. PR: 166927, 166926, 166928 Reported by: davide [1] Reviewed by: avg, trociny (earlier version) MFC after: 1 month	2013-08-13 03:10:39 +00:00
Rui Paulo	e009490afc	fasttrap_fork(): unlock the processes before removing the tracepoints. In the future, we'll need to come up with new proc_*() functions that accept locked processes. For now, this prevents postgresql + DTrace from crashing the system. MFC after: 1 month	2013-08-11 00:57:01 +00:00
Attilio Rao	c7aebda8a1	The soft and hard busy mechanisms rely on the vm object lock to work. Unify the 2 concepts into a real, minimal sxlock where the shared acquisition represents the soft busy and the exclusive acquisition represents the hard busy. The old VPO_WANTED mechanism becomes the hard path for this new lock and it becomes per-page rather than per-object. The vm_object lock becomes an interlock for this functionality: it can be held in either read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared-busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag. The KPI is heavily changed, which is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl	2013-08-09 11:11:11 +00:00
Xin LI	43667c1f68	MFV r254079: Illumos ZFS issues: 3957 ztest should update the cachefile before killing itself 3958 multiple scans can lead to partial resilvering 3959 ddt entries are not always resilvered 3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth 3961 freed gang blocks are not resilvered and can cause pool to suspend 3962 ztest should print out zfs debug buffer before exiting	2013-08-08 23:38:31 +00:00
Xin LI	9d2f243aa6	MFV r254071: Fix a regression introduced by fix for Illumos bug #3834. Quote from Matthew Ahrens on the Illumos issue: ztest fails this assertion because ztest_dmu_read_write() does dmu_tx_hold_free(tx, bigobj, bigoff, bigsize); and then dmu_object_set_checksum(os, bigobj, (enum zio_checksum)ztest_random_dsl_prop(ZFS_PROP_CHECKSUM), tx); If the region to free is past the end of the file, the DMU assumes that there will be nothing to do for this object. However, ztest does set_checksum(), which must modify the dnode. The fix is for ztest to also call dmu_tx_hold_bonus(tx, bigobj); so we can account for the dirty data associated with setting the checksum Illumos ZFS issues: 3955 ztest failure: assertion refcount_count(&tx->tx_space_written) + delta <= tx->tx_space_towrite	2013-08-07 22:21:00 +00:00
Xin LI	4f7b34578b	MFV r254070: Merge vendor bugfix for ZFS test suite that triggers false positives. Illumos ZFS issues: 3949 ztest fault injection should avoid resilvering devices 3950 ztest: deadman fires when we're doing a scan 3951 ztest hang when running dedup test 3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together	2013-08-07 21:16:14 +00:00
Jeff Roberson	5df87b21d3	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-07 06:21:20 +00:00
Xin LI	c668ff330e	MFV r254011: This change has no effect on FreeBSD but is integrated for completeness. Illumos ZFS issues: 348 ZFS should handle DKIOCGMEDIAINFOEXT failure	2013-08-06 21:36:01 +00:00
Andriy Gapon	c319ea15f4	opensolaris code: translate INVARIANTS to DEBUG and ZFS_DEBUG Do this by forcing inclusion of sys/cddl/compat/opensolaris/sys/debug_compat.h via -include option into all source files from OpenSolaris. Note that this -include option must always be after -include opt_global.h. Additionally, remove forced definition of DEBUG for some modules and fix their build without DEBUG. Also, meaning of DEBUG was overloaded to enable WITNESS support for some OpenSolaris (primarily ZFS) locks. Now this overloading is removed and that use of DEBUG is replaced with a new option OPENSOLARIS_WITNESS. MFC after: 17 days	2013-08-06 15:51:56 +00:00
Alexander Motin	d9aca4ed74	Block reporting of ZFS features for suspended pools. Before executing any subcommand, zpool tool fetches pools configuration from the kernel. Before features support was added, kernel was regenerating that configuration based on data always present in memory. Unfortunately, pool features list and activity counters are not such. They are stored in ZAP, that normally resides in ARC, but under heavy memory pressure may be swapped out. If pool is suspended at this point, there is no way to recover it back since any zpool command will stuck. This change has one predictable flaw: `zpool upgrade` always wish to upgrade suspended pools, but fortunately it can't do it due to the suspension.	2013-08-06 14:41:41 +00:00
Alexander Motin	f8dcf872c4	Disable r252840 when ZFS TRIM is enabled (vfs.zfs.trim.enabled=1) and really disable TRIM otherwise. r252840 (illumos bug 3836) is based on assumption that zio_free_sync() has no lock dependencies and should complete immediately. Unfortunately, with our TRIM implementation that is not true due to ZIO_STAGE_VDEV_IO_START added to the ZIO_FREE_PIPELINE, which, while not really accessing devices, still acquires SCL_ZIO lock for read to be sure devices won't disappear. When TRIM is disabled, this patch enables direct free execution from r252840 and removes ZIO_STAGE_VDEV_IO_START and ZIO_STAGE_VDEV_IO_ASSESS stages from the pipeline to avoid lock acquisition. Otherwise it queues free request as it was before r252840.	2013-08-06 14:30:28 +00:00
Alexander Motin	526bb4af8a	Make `zpool clear` to reopen also reconnected cache and spare devices. Since `zpool status` reports about such kinds of errors, it is strange that they are not cleared by `zpool clear`.	2013-08-06 14:23:33 +00:00
Alexander Motin	ad727e8d64	Make ZFS to use separate thread to handle SPA_ASYNC_REMOVE async events. Existing async thread is running only on successful spa_sync() completion, that is impossible in case of pool losing required (last) disk(s). That indefinite delay of SPA_ASYNC_REMOVE processing made ZFS to not close the lost disks, preventing GEOM/CAM from destroying devices and reusing names on later disk reattach. In earlier version of the patch I've tried to just run existing thread immediately, unrelated to spa_sync() completion, but that exposed number of situations where it could stuck due to locks held by stuck spa_sync(), that are required for other kinds of async events. Experiments with OpenIndiana snapshot confirmed that they also have this issue with lost disks reattach.	2013-08-06 14:20:41 +00:00
Attilio Rao	be99683637	Revert r253939: We cannot busy a page before doing pagefaults. In fact, it can deadlock against vnode lock, as it tries to vget(). Other functions, right now, have an opposite lock ordering, like vm_object_sync(), which acquires the vnode lock first and then sleeps on the busy mechanism. Before this patch is reinserted we need to break this ordering. Sponsored by: EMC / Isilon storage division Reported by: kib	2013-08-05 08:55:35 +00:00
Attilio Rao	3b6714cacb	The page hold mechanism is fast but it has couple of fallouts: - It does not let pages respect the LRU policy - It bloats the active/inactive queues of few pages Try to avoid it as much as possible with the long-term target to completely remove it. Use the soft-busy mechanism to protect page content accesses during short-term operations (like uiomove_fromphys()). After this change only vm_fault_quick_hold_pages() is still using the hold mechanism for page content access. There is an additional complexity there as the quick path cannot immediately access the page object to busy the page and the slow path cannot however busy more than one page a time (to avoid deadlocks). Fixing such primitive can bring to complete removal of the page hold mechanism. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff Tested by: pho	2013-08-04 21:07:24 +00:00
Steven Hartland	e44e975c1b	zfs_ioc_rename should not leave the value of zc_name passed in via zc altered on return. MFC after: 1 week	2013-08-04 11:38:08 +00:00
Xin LI	bd3d1456a5	MFV r253783: Skip eviction step of processing free records when doing ZFS receive to avoid the expensive search operation of non-existent dbufs in dn_dbufs. Illumos ZFS issues: 3834 incremental replication of 'holey' file systems is slow MFC after: 2 weeks	2013-07-30 21:35:02 +00:00
Xin LI	1c4ead73c6	MFV r253782: To quote Illumos issue #3888: When 'zfs recv -F' is used with an incremental recv it rolls back any changes made since the last snapshot in case new changes were made to the file system while the recv is in progress (without -F the recv would fail when it does it's final check to commit the recv-ed data as the recv-ed data conflicts with the newly written data). However, if there is a snapshot taken after the recv began rolling back to the 'latest' snapshot will not help and the recv will still fail. 'zfs recv -F' should be extended to destroy any snapshots created since the source snapshot when finishing the recv (effectively rolling back through all snapshots, instead of just to the latest snapshot). Illumos ZFS issues: 3888 zfs recv -F should destroy any snapshots created since the incremental source MFC after: 2 weeks	2013-07-30 21:20:12 +00:00
Xin LI	d637247e1f	MFV r253781 + r253871: Illumos ZFS issues: 3894 zfs should not allow snapshot of inconsistent dataset MFC after: 2 weeks	2013-07-30 21:02:09 +00:00
Xin LI	44e362e207	MFV r253780: To quote Illumos #3875: The problem here is that if we ever end up in the error path, we drop the locks protecting access to the zfsvfs_t prior to forcibly unmounting the filesystem. Because z_os is NULL, any thread that had already picked up the zfsvfs_t and was sitting in ZFS_ENTER() when we dropped our locks in zfs_resume_fs() will now acquire the lock, attempt to use z_os, and panic. Illumos ZFS issues: 3875 panic in zfs_root() after failed rollback MFC after: 2 weeks	2013-07-30 20:37:32 +00:00
Alexander Motin	ec4d2e0d96	Allow three IOCTLs to be used on suspended pool, restoring state that existed before IOCTL code refactoring merged change 4445fffb from illumos at r248571. This change allows `zpool clear` to be used again to recover suspended pool. It seems the only way supposed by the code to restore pool operation after reconnecting lost disks that were required for data completeness. There are still cases where `zpool clear` command can just safely stuck due to deadlocks inside ZFS kernel part, but probably that is better than having no chances to recover at all.	2013-07-30 14:50:44 +00:00
Andriy Gapon	0f09691df8	dtrace disassembler: take the latest/last CDDL code from OpenSolaris OpenSolaris version is: 13108:33bb8a0301ab 6762020 Disassembly support for Intel Advanced Vector Extensions (AVX) This corresponds to Illumos-gate (github) version ab47273fedff893c8ae22ec39ffc666d4fa6fc8b MFC after: 3 weeks	2013-07-29 16:56:38 +00:00
Alexander Motin	698cd997d6	Partially close race between calls of orphan() method from GEOM and close() method from ZFS core, that reliably causes use-after-free panic if SSD vdev detached during initial erase.	2013-07-28 20:07:34 +00:00
Alexander Motin	ffacde9be5	Following r222950, revert unintentional change cls -> class in argument name in r245264. Aside from non-uniformity, that again confused C++ compilers.	2013-07-25 08:41:22 +00:00
Andriy Gapon	f66c1f6482	zfs module: perform cleanup during shutdown in addition to module unload - move init and fini code into separate functions (like it is done upstream) - invoke fini code via shutdown_post_sync event hook This should make zfs close its underlying devices during shutdown, which may be important for their drivers. MFC after: 20 days	2013-07-24 09:59:16 +00:00
Andriy Gapon	886dbd270f	zfs: move vnode creation from zfs_znode_cache_constructor to zfs_znode_alloc All other places where a znode is allocated do not need z_vnode at all. These are: - zfs_create_share_dir - zfs_create_fs This change ensures two things: - VN_LOCK_ASHARE is not erroneously called for VFIFO vnodes - vn_lock is called on a fully constructed vnode with correct v_ops The change also allows to make zfs_znode_cache_constructor a normal kmem_cache constructor again (as it is in upstream). This allows to avoid a problem where zfs_znode_cache_destructor may be called on un-constructed znodes. MFC after: 17 days	2013-07-24 09:15:59 +00:00
Xin LI	c92bc5e996	Manually merge part of vendor import r238583 from Illumos. Illumos changeset: 13680:2bd022a765e2 Illumos ZFS issue: 2671 zpool import should not fail if vdev ashift has increased MFC after: 3 days	2013-07-18 00:22:42 +00:00
Andriy Gapon	37b8b2d4d8	dtrace/fasttrap: install hook functions only after all data is initialized Sponsored by: HybridCluster MFC after: 7 days	2013-07-09 09:05:00 +00:00
Andriy Gapon	9c1f50af0a	zfs: try to properly handle i/o errors in mappedread_sf Unconditionally freeing a page is not good, especially if it is the page that was wired by the caller. The checks are picked up from kern_sendfile. MFC after: 3 weeks	2013-07-09 08:47:11 +00:00
Andriy Gapon	78ed7a7855	zfs: load zpool.cache after a root fs is mounted MFC after: 3 weeks	2013-07-09 08:37:42 +00:00
Mark Johnston	46d27dbb38	Hide references to mod_lock. In FreeBSD it is always acquired with the provider lock held, so its use has no effect.	2013-07-05 22:42:10 +00:00
Martin Matuska	12df7d65b0	MFV r252839: Quoting illumos issue #3836: Currently zio_free() always puts the zio on a list for subsequent processing by zio_free_sync(). This is only necessary for frees that might need to issue reads (gang and dedup blocks). By processing the majority of the frees as we encounter them, we reduce the amount of time that the spa_sync() thread spends burning CPU and not doing any i/o, thus increasing the overall write throughput of the system. Illumos ZFS issues: 3836 zio_free() can be processed immediately in the common case MFC after: 1 week	2013-07-05 21:29:59 +00:00
Mark Johnston	0022f867b4	Be sure to destroy the fasttrap cleanup mutex when unloading the fasttrap module. This should be MFCed with r250953.	2013-07-01 23:12:59 +00:00
Robert Millan	2592710c47	Enable kernel-specific code for FreeBSD also on other systems that use the kernel of FreeBSD. Reviewed by: pjd	2013-06-30 23:14:55 +00:00
Steven Hartland	3666c4917b	style(9) fixes MFC after: 2 days	2013-06-29 23:39:38 +00:00
Steven Hartland	baa0b41221	Remove invalid ASSERT which causes a panic on zfs renames when run with ASSERTS. Removal was missed in merge of illumos 3464 (r248571) MFC after: 2 days	2013-06-29 23:15:45 +00:00
Martin Matuska	f82ca5238a	Unbreak "zfs jail" and "zfs unjail" (broken since r248571) I missed to register zfs_ioc_jail and zfs_ioc_unjail as legacy ioctl's with the new zfs_ioctl_register_legacy() function. These operations do not modify pools or datasets so there is no need to log them to pool history. Reported by: Alexander Leidinger <ale@FreeBSD.org> and others on current@ MFC after: 3 days	2013-06-29 16:45:37 +00:00
Gavin Atkinson	af582854d8	Don't try to re-insert an already present but invalid page. This could happen if a thread doing a page-in loses a ZFS range lock race to a thread writing to the same range This fixes "panic: vm_page_alloc: pindex already allocated" in http://docs.FreeBSD.org/cgi/mid.cgi?1372165971.96049.42.camel Submitted by: avg MFC after: 1 week	2013-06-28 07:51:12 +00:00
Mark Johnston	837610eb04	The dtmalloc provider uses the short description of a malloc type as the function name of its corresponding DTrace probes. These descriptions may contain whitespace, but probe names cannot, so just replace any whitespace with underscores when creating probes. MFC after: 1 week	2013-06-28 03:14:40 +00:00
Xin LI	e33806a54a	MFV r252215: Restore a previous behavior before r251646, where when destructing ZFS snapshot, the ioctl would return ENOENT when it hit any of them in the errlist (the new behavior was only return ENOENT when all returns error). Illumos ZFS issues: 3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl MFC after: 1 week	2013-06-25 22:14:32 +00:00
Steven Hartland	43e695497c	Switch ZFS mutex_owner macro to use sx_xholder as it's now exported via sx.h MFC after: 1 week	2013-06-21 15:55:03 +00:00
Steven Hartland	9446debe6b	Fix intermittent ZFS lock panic when kernel is compiled with debugging caused by access of uninitialized smlock in mmutex_init. MFC after: 1 week	2013-06-21 15:47:10 +00:00
Steven Hartland	5f921c5911	Fixed import of destroyed ZFS pools failing due to vdev_geom incorrectly preventing config loads from devices associated with destroyed pools. Reviewed by: avg MFC after: 1 week	2013-06-21 12:02:09 +00:00
Xin LI	9625321547	MFV r251644: Poor ZFS send / receive performance due to snapshot hold / release processing (by smh@) Illumos ZFS issues: 3740 Poor ZFS send / receive performance due to snapshot hold / release processing MFC after: 2 weeks	2013-06-12 07:07:06 +00:00
Xin LI	ed8fd1989f	MFV r251626: ZFS event processing should work on R/O root filesystems Illumos ZFS issues: 3749 zfs event processing should work on R/O root filesystems MFC after: 2 weeks	2013-06-11 19:35:44 +00:00
Xin LI	3b245f3ee1	MFV r251624: txg commit callbacks don't work Illumos ZFS issues: 3747 txg commit callbacks don't work MFC after: 2 weeks	2013-06-11 19:29:31 +00:00
Xin LI	3f3a9cac29	MFV r251622: ZFS shouldn't ignore errors unmounting snapshots Illumos ZFS issues: 3744 zfs shouldn't ignore errors unmounting snapshots MFC after: 2 weeks	2013-06-11 19:22:20 +00:00
Xin LI	57e06a1a63	MFV r251621: ZFS needs a refcount audit Illumos ZFS issues: 3741 zfs needs a refcount audit MFC after: 2 weeks	2013-06-11 19:16:14 +00:00
Xin LI	a91afe8a8d	MFV r251620: ZFS comments need cleaner, more consistent style Illumos ZFS issues: 3741 zfs comments need cleaner, more consistent style MFC after: 2 weeks	2013-06-11 19:12:06 +00:00
Xin LI	4acaabea05	MFV r251619: ZFS needs better comments. Illumos ZFS issues: 3741 zfs needs better comments MFC after: 2 weeks	2013-06-11 19:02:36 +00:00
Xin LI	9e43a32a5c	MFV r251519: * Illumos ZFS issue #3805 arc shouldn't cache freed blocks Quote from the Illumos issue: ZFS should proactively evict freed blocks from the cache. Even though these freed blocks will never be used again, and thus will eventually be evicted, this causes us to use memory inefficiently for 2 reasons: 1. A block that is freed has no chance of being accessed again, but will be kept in memory preferentially to a block that was accessed before it (and is thus older) but has not been freed and thus has at least some chance of being accessed again. 2. We partition the ARC into several buckets: user data that has been accessed only once (MRU) metadata that has been accessed only once (MRU) user data that has been accessed more than once (MFU) metadata that has been accessed more than once (MFU) The user data vs metadata split is somewhat arbitrary, and the primary control on how much memory is used to cache data vs metadata is to simply try to keep the proportion the same as it has been in the past (each bucket "evicts against" itself). The secondary control is to evict data before evicting metadata. Because of this bucketing, we may end up with one bucket mostly containing freed blocks that are very old, while another bucket has more recently accessed, still-allocated blocks. Data in the useful bucket (with still-allocated blocks) may be evicted in preference to data in the useless bucket (with old, freed blocks). On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU data bucket was 27GB and the MRU metadata bucket was 256GB. However, the vast majority of data in the MRU metadata bucket (256GB) was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB) was constantly evicting useful blocks that will be soon needed. The problem of cache segmentation is a larger problem that needs more investigation. However, if we stop caching freed blocks, it should reduce the impact of this more fundamental issue. MFC after: 2 weeks	2013-06-08 09:11:20 +00:00
Xin LI	ca8a27d4b1	MFV r251474: * Illumos zfs issue #3137 L2ARC compression Whether or not to compress buffers entering the L2ARC is controlled by "compression" setting on the dataset, when compression is not "off", L2ARC compression is enabled. The compress method is always LZ4 for L2ARC when enabled because it works best for the scenario. MFC after: 2 weeks	2013-06-06 23:21:41 +00:00
Mark Johnston	f263e440d4	SDT probes can directly pass up to five arguments as arguments to dtrace_probe(). Arguments beyond these five must be obtained in an architecture-specific way; this can be done through the getargval provider method, and through dtrace_getarg() if getargval isn't overridden. This change fixes two off-by-one bugs in the way these arguments are fetched in FreeBSD's DTrace implementation. First, the SDT provider must set the aframes parameter to 1 when creating a probe. The aframes parameter controls the number of frames that dtrace_getarg() will step over in order to find the frame containing the extra arguments. On FreeBSD, dtrace_getarg() is called in SDT probe context via dtrace_probe()->dtrace_dif_emulate()->dtrace_dif_variable()->dtrace_getarg() so aframes must be 3 since the arguments are in dtrace_probe()'s frame; it was previously being called with a value of 2 instead. illumos uses a different aframes value for SDT probes, but this is because illumos SDT probes fire by triggering the #UD fault handler rather than calling dtrace_probe() directly. The second bug has to do with the way arguments are grabbed out of dtrace_probe()'s frame on amd64. The code currently jumps over the first stack argument and retrieves the rest of them using a pointer into the stack. This works on i386 because all of dtrace_probe()'s arguments will be on the stack and the first argument is the probe ID, which should be ignored. However, it is incorrect to ignore the first stack argument on amd64, so we correct the pointer used to access the arguments. MFC after: 2 weeks	2013-06-02 01:05:36 +00:00
Mark Johnston	18161786c6	Port the SDT test now that it's possible to create SDT probes that take seven arguments. The original test uses Solaris' uadmin system call to trigger the test probe; this change adds a sysctl to the dtrace_test module and gets the test program to trigger the test probe via the sysctl handler. The test is currently failing on amd64 because of some bugs in the way that probe arguments beyond the first five are obtained - these bugs will be fixed in a separate change.	2013-06-02 00:33:36 +00:00
Mark Johnston	427bc75e19	The fasttrap provider cleans up probes asynchronously when a process with USDT probes exits. This was previously done with a callout; however, it is possible to sleep while holding the DTrace mutexes, so a panic will occur on INVARIANTS kernels if the callout handler can't immediately acquire one of these mutexes. This panic will be frequently triggered on systems where a USDT-enabled program (perl, for instance) is often run. This revision changes the fasttrap cleanup mechanism so that a dedicated thread is used instead of a callout. The old behaviour is otherwise preserved. Reviewed by: rpaulo MFC after: 1 month	2013-05-24 03:29:32 +00:00

1455 Commits