1
0
mirror of https://git.FreeBSD.org/src.git synced 2024-12-29 12:03:03 +00:00
Commit Graph

867 Commits

Author SHA1 Message Date
Jeff Roberson
22a722605d - Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
 - Convert softdep's lk to rwlock to match the bufobj lock.
 - Move INFREECNT to b_flags and protect it with the buf lock.
 - Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	mckusick, kib, mdf
2013-05-31 00:43:41 +00:00
Jeff Roberson
f2cc1285c2 - Add a new general purpose path-compressed radix trie which can be used
with any structure containing a uint64_t index.  The tree code
   auto-generates type safe wrappers.
 - Eliminate the buf splay and replace it with pctrie.  This is not only
   significantly faster with large files but also allows for the possibility
   of shared locking.

Reviewed by:    alc, attilio
Sponsored by:   EMC / Isilon Storage Division
2013-05-12 04:05:01 +00:00
Konstantin Belousov
0fc6daa72d - Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The
null_hashget() obtains the reference on the nullfs vnode, which must
  be dropped.

- Fix a wart which existed from the introduction of the nullfs
  caching, do not unlock lower vnode in the nullfs_reclaim_lowervp().
  It should be innocent, but now it is also formally safe.  Inform the
  nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on
  nullfs inode.

- Add a callback to the upper filesystems for the lower vnode
  unlinking. When inactivating a nullfs vnode, check if the lower
  vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC
  on the lower vnode, and reclaim upper vnode if so.  This allows
  nullfs to purge cached vnodes for the unlinked lower vnode, avoiding
  excessive caching.

Reported by:	G??ran L??wkrantz <goran.lowkrantz@ismobile.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-05-11 11:17:44 +00:00
Marcel Moolenaar
e63091ea6c Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE
locks. To support this, VNODE locks are created with the LK_IS_VNODE
flag. This flag is propagated down using the LO_IS_VNODE flag.

Note that WITNESS still records the LOR. Only the printing and the
optional entering into the kernel debugger is bypassed with the
WITNESS_NO_VNODE option.
2013-05-09 16:28:18 +00:00
Matthew D Fleming
770b41b349 Add missing vdrop() in error case.
Submitted by:	Fahad (mohd.fahadullah@isilon.com)
MFC after:	1 week
2013-05-04 18:38:16 +00:00
Rick Macklem
64fa8df6e0 Allow the vnode to be unlocked for the weird case of
LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.

Reported and tested by:	pho
MFC after:	2 weeks
2013-04-16 14:22:16 +00:00
Jeff Roberson
26089666b6 Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
   No consumers need to find them there and it complicates the tree.
   These flags are all FFS specific and could be moved out of the buf
   cache.
 - Use pbgetvp() and pbrelvp() to associate the background and journal
   bufs with the vp.  Not only is this much cheaper it makes more sense
   for these transient bufs.
 - Fix the assertions in pbget* and pbrel*.  It's not safe to check list
   pointers which were never initialized.  Use the BX flags instead.  We
   also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with:	kib, mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 22:21:23 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Konstantin Belousov
10b4bb0b33 Add a trivial comment to record the proper commit log for r245407:
Set the v_hash for a new vnode in the getnewvnode() to the value
calculated based on the vnode structure address.  Filesystems using
vfs_hash_insert() override the v_hash using the standard formula of
(inode_number + mnt_hashseed).  For other filesystems, the
initialization allows the vfs_hash_index() to provide useful hash too.

Suggested, reviewed and tested by:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:52:23 +00:00
Konstantin Belousov
a41df84820 diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 7c243b6..0bdaf36 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
 #define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt)
 #define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt)

+static int vnsz2log;

 /*
  * Initialize the vnode management data structures.
@@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
 static void
 vntblinit(void *dummy __unused)
 {
+	u_int i;
 	int physvnodes, virtvnodes;

 	/*
@@ -332,6 +334,9 @@ vntblinit(void *dummy __unused)
 	syncer_maxdelay = syncer_mask + 1;
 	mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF);
 	cv_init(&sync_wakeup, "syncer");
+	for (i = 1; i <= sizeof(struct vnode); i <<= 1)
+		vnsz2log++;
+	vnsz2log--;
 }
 SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL);

@@ -1067,6 +1072,14 @@ alloc:
 	}
 	rangelock_init(&vp->v_rl);

+	/*
+	 * For the filesystems which do not use vfs_hash_insert(),
+	 * still initialize v_hash to have vfs_hash_index() useful.
+	 * E.g., nullfs uses vfs_hash_index() on the lower vnode for
+	 * its own hashing.
+	 */
+	vp->v_hash = (uintptr_t)vp >> vnsz2log;
+
 	*vpp = vp;
 	return (0);
 }
2013-01-14 05:42:54 +00:00
Attilio Rao
c92c859b7b Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1
case. There is no point in optimizing further the code and use a TRUE
litteral for a path that does heavyweight stuff anyway (like lock acq),
at the price of obfuscated code.

Use the appropriate check where necessary and remove a macro.

Sponsored by:	EMC / Isilon storage division
MFC after:	3 days
2012-12-26 15:20:32 +00:00
Attilio Rao
b1308d72c2 Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There are no evidences that consumers could bear with such change in
semantic and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.

Tested by:	pho
Reviewed by:	kib, mdf
MFC after:	1 week
2012-12-21 13:14:12 +00:00
Konstantin Belousov
14df601e47 When mnt_vnode_next_active iterator cannot lock the next vnode and
yields, specify the user priority for the yield.  Otherwise, a
higher-priority (kernel) thread could fall into the priority-inversion
with the thread owning the mutex lock.

On single-processor machines or UP kernels, do not loop adaptively
when the next vnode cannot be locked, instead yield unconditionally.

Restructure the iteration initializer and the iterator to remove code
duplication.  Put the code to fetch and lock a vnode next to the
current marker, into the mnt_vnode_next_active() function, and use it
instead of repeating the loop.

Reported by:	hrs, rmacklem
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2012-12-15 02:04:46 +00:00
Konstantin Belousov
686ffcaceb Do not yield while owning a mutex. The Giant reacquire in the
kern_yield() is problematic than.

The owned mutex is the mount interlock, and it is in fact not needed
to guarantee the stability of the mount list of active vnodes, so fix
the the issue by only taking the mount interlock for MNT_REF and
MNT_REL operations.

While there, augment the unconditional yield by some amount of
spinning [1].

Reported and tested by:	pho
Reviewed by:	attilio
Submitted by:	attilio [1]
MFC after:	3 days
2012-12-10 20:44:09 +00:00
Konstantin Belousov
07840861b1 The vnode_free_list_mtx is required unconditionally when iterating
over the active list. The mount interlock is not enough to guarantee
the validity of the tailq link pointers. The __mnt_vnode_next_active()
and __mnt_vnode_first_active() active lists iterators helper functions
did not provided the neccessary stability for the list, allowing the
iterators to pick garbage.

This was uncovered after the r243599 made the active list iterators
non-nop.

Since a vnode interlock is before the vnode_free_list_mtx, obtain the
vnode ilock in the non-blocking manner when under vnode_free_list_mtx,
and restart iteration after the yield if the lock attempt failed.

Assert that a vnode found on the list is active, and assert that the
helpers return the vnode with interlock owned.

Reported and tested by:	pho
MFC after:	1 week
2012-12-03 22:15:16 +00:00
David Xu
3da9ab75f4 Take first active vnode correctly.
Reviewed by:	kib
MFC after:	3 days
2012-11-27 06:07:58 +00:00
Andriy Gapon
6b991098a7 assert_vop_locked: make the assertion race-free and more efficient
this is really a minor improvement for the sake of correctness

MFC after:	6 days
2012-11-24 13:11:47 +00:00
Andriy Gapon
4f15bb6730 remove vop_lookup_pre and vop_lookup_post
Suggested by:	kib
MFC after:	5 days
2012-11-22 10:36:10 +00:00
Attilio Rao
973b795b64 insmntque() is always called with the lock held in exclusive mode,
then:
- assume the lock is held in exclusive mode and remove a moot check
  about the lock acquisition.
- in the destructor remove !MPSAFE specific chunk.

Reviewed by:	kib
MFC after:	2 weeks
2012-11-19 20:43:19 +00:00
Andriy Gapon
ab49c952d9 assert_vop_locked should treat LK_EXCLOTHER as the not locked case
... from a perspective of the current thread.

Spotted by:	mjg
Discussed with:	kib
MFC after:	18 days
2012-11-19 11:35:56 +00:00
Andriy Gapon
c496727c54 vnode_if: fix locking protocol description for lookup and cachedlookup
Also remove the checks from vop_lookup_pre and vop_lookup_post, which
are now completely redundant (before this change they were partially
redundant).

Discussed with:	kib
MFC after:	10 days
2012-11-19 11:32:56 +00:00
Attilio Rao
bc2258da88 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.
2012-11-09 18:02:25 +00:00
Konstantin Belousov
76fd782cd9 A clarification to the behaviour of the active vnode list management
regarding the vnode page cleaning.

In collaboration with:	pho
MFC after:	1 week
2012-11-05 16:40:42 +00:00
Konstantin Belousov
90af57930c Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command.
MFC after:	3 weeks
2012-11-04 13:33:13 +00:00
Konstantin Belousov
fb81941575 Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command.
MFC after:	3 days
2012-11-04 13:32:45 +00:00
Konstantin Belousov
df3161c7df Order the enumeration of the MNT_ flags to be the same as the order of
their definitions.

MFC after:	3 days
2012-11-04 13:31:41 +00:00
Konstantin Belousov
5050aa86cf Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
Konstantin Belousov
9b233e2307 Add a KPI to allow to reserve some amount of space in the numvnodes
counter, without actually allocating the vnodes. The supposed use of
the getnewvnode_reserve(9) is to reclaim enough free vnodes while the
code still does not hold any resources that might be needed during the
reclamation, and to consume the slack later for getnewvnode() calls
made from the innards. After the critical block is finished, the
caller shall free any reserve left, by getnewvnode_drop_reserve(9).

Reviewed by:	avg
Tested by:	pho
MFC after:	1 week
2012-10-14 19:43:37 +00:00
Attilio Rao
0a15e5d30d Remove all the checks on curthread != NULL with the exception of some MD
trap checks (eg. printtrap()).

Generally this check is not needed anymore, as there is not a legitimate
case where curthread != NULL, after pcpu 0 area has been properly
initialized.

Reviewed by:	bde, jhb
MFC after:	1 week
2012-09-13 22:26:22 +00:00
Konstantin Belousov
bcd5bb8e57 Add a facility for vgone() to inform the set of subscribed mounts
about vnode reclamation. Typical use is for the bypass mounts like
nullfs to get a notification about lower vnode going away.

Now, vgone() calls new VFS op vfs_reclaim_lowervp() with an argument
lowervp which is reclaimed. It is possible to register several
reclamation event listeners, to correctly handle the case of several
nullfs mounts over the same directory.

For the filesystem not having nullfs mounts over it, the overhead
added is a single mount interlock lock/unlock in the vnode reclamation
path.

In collaboration with:	pho
MFC after:	3 weeks
2012-09-09 19:17:15 +00:00
Konstantin Belousov
258f94423b Provide some compat32 shims for sysctl vfs.conflist. It is required
for getvfsbyname(3) operation when called from 32bit process, and
getvfsbyname(3) is used by recent bsdtar import.

Reported by:	many
Tested by:	David Naylor <naylor.b.david@gmail.com>
MFC after:	5 days
2012-08-22 20:05:34 +00:00
Andriy Gapon
7adc598a15 free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG
Those calls are useful with hardware watchdog drivers too.

MFC after:	3 weeks
2012-06-03 08:01:12 +00:00
Konstantin Belousov
8f0e91308a Add a rangelock implementation, intended to be used to range-locking
the i/o regions of the vnode data space. The implementation is quite
simple-minded, it uses the list of the lock requests, ordered by
arrival time. Each request may be for read or for write. The
implementation is fair FIFO.

MFC after:     2 month
2012-05-30 16:06:38 +00:00
Edward Tomasz Napierala
af6e6b87ad Remove unused thread argument to vrecycle().
Reviewed by:	kib
2012-04-23 14:10:34 +00:00
Edward Tomasz Napierala
c52fd858ae Remove unused thread argument from vtruncbuf().
Reviewed by:	kib
2012-04-23 13:21:28 +00:00
Kirk McKusick
dca5e0ec50 This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_lazy_sync and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-20 07:00:28 +00:00
Kirk McKusick
f257ebbb2e This change creates a new list of active vnodes associated with
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.

To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed
v_actfreelist).

This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-20 06:50:44 +00:00
Kirk McKusick
16165feec4 Delete a no longer useful VNASSERT missed during changes in 234400.
Suggested by: kib
2012-04-18 19:34:20 +00:00
Kirk McKusick
60005d66ab Fix a memory leak of M_VNODE_MARKER introduced in 234386.
Found by:  Peter Holm
2012-04-18 19:30:22 +00:00
Kirk McKusick
73305eb826 Drop export of vdestroy() function from kern/vfs_subr.c as it is
used only as a helper function in that file. Replace sole call to
vbusy() with inline code in vholdl(). Replace sole calls to vfree()
and vdestroy() with inline code in vdropl().

The Clang compiler already inlines these functions, so they do not
show up in a kernel backtrace which is confusing. Also you cannot
set their frame in kgdb which means that it is impossible to view
their local variables. So, while the produced code is unchanged,
the debugging should be easier.

Discussed with: kib
MFC after:      2 weeks
2012-04-17 21:46:59 +00:00
Kirk McKusick
71469bb38f Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-17 16:28:22 +00:00
Kirk McKusick
ecb6e528c5 Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after:   2 weeks
2012-04-11 23:01:11 +00:00
Konstantin Belousov
38ddb5725b Decomission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after:	1 week
2012-03-09 00:12:05 +00:00
Mikolaj Golub
662c901c54 When detaching an unix domain socket, uipc_detach() checks
unp->unp_vnode pointer to detect if there is a vnode associated with
(binded to) this socket and does necessary cleanup if there is.

The issue is that after forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.

To fix this provide a helper function that is called on a socket vnode
reclamation to do necessary cleanup.

Pointed by:	kib
Reviewed by:	kib
MFC after:	2 weeks
2012-02-25 10:15:41 +00:00
Konstantin Belousov
c480f781ea Current implementations of sync(2) and syncer vnode fsync() VOP uses
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return.  Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by:	mckusick
Tested by:	scottl
MFC after:	2 weeks
2012-02-06 11:04:36 +00:00
Konstantin Belousov
abc942b56c When doing vflush(WRITECLOSE), clean vnode pages.
Unmounts do vfs_msync() before calling VFS_UNMOUNT(), but there is
still a race allowing a process to dirty pages after msync
finished. Remounts rw->ro just left dirty pages in system.

Reviewed by:	alc, tegge (long time ago)
Tested by:	pho
MFC after:	2 weeks
2012-01-25 20:54:09 +00:00
Kirk McKusick
cc672d3599 Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks
2012-01-17 01:08:01 +00:00
John Baldwin
908cac07ce Use proper argument structure types for the extattr post-VOP hooks.
The wrong structure happened to work since the only argument used was
the vnode which is in the same place in both VOP_SETATTR() and the two
extattr VOPs.

MFC after:	3 days
2012-01-06 20:05:48 +00:00
John Baldwin
f0d6c5caf0 Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use
these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended
attributes of a vnode are changed.

Note that OS X already implements this behavior.

Reviewed by:	rwatson
MFC after:	2 weeks
2011-12-23 20:11:37 +00:00
John Baldwin
936c09ac0f Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region.  It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types.  The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE().  These modes are
thus filesystem independent.  Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used.  These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request.  This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf().  This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode.  This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue.  The second change adds
a new vm_object_page_cache() method.  This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by:	jilles, kib, arch@
Approved by:	re (kib)
MFC after:	1 month
2011-11-04 04:02:50 +00:00