freebsd

mirror of https://git.FreeBSD.org/src.git synced 2024-12-23 11:18:54 +00:00

Author	SHA1	Message	Date
Attilio Rao	dfd55c0c7b	In vm_object_split(), busy and consequently unbusy the pages only when swap_pager_copy() is invoked; otherwise there is no reason to do so. This eliminates the need to busy pages most of the time. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-06-04 22:47:01 +00:00
Alan Cox	da38420832	Update a comment.	2013-06-04 05:44:52 +00:00
Alan Cox	e23b0a193e	Relax the object locking in vm_pageout_map_deactivate_pages() and vm_pageout_object_deactivate_pages(). A read lock suffices. Sponsored by: EMC / Isilon Storage Division	2013-06-04 02:28:47 +00:00
Konstantin Belousov	be6ec55376	Remove irrelevant comments. Discussed with: alc MFC after: 3 days	2013-06-03 17:30:40 +00:00
Alan Cox	b417181250	Require that the page lock is held, instead of the object lock, when clearing the page's PGA_REFERENCED flag. Since we are typically manipulating the page's act_count field when we are clearing its PGA_REFERENCED flag, the page lock is already held everywhere that we clear the PGA_REFERENCED flag. So, in fact, this revision only changes some comments and an assertion. Nonetheless, it will enable later changes to object locking in the pageout code. Introduce vm_page_assert_locked(), which completely hides the implementation details of the page lock from the caller, and use it in vm_page_aflag_clear(). (The existing vm_page_lock_assert() could not be used in vm_page_aflag_clear().) Over the coming weeks, I expect that we'll either eliminate or replace the various uses of vm_page_lock_assert() with vm_page_assert_locked(). Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-06-03 01:22:54 +00:00
Alan Cox	b4e498071d	Now that access to the page's "act_count" field is synchronized by the page lock instead of the object lock, there is no reason for vm_page_activate() to assert that the object is locked for either read or write access. (The "VPO_UNMANAGED" flag never changes after page allocation.) Sponsored by: EMC / Isilon Storage Division	2013-06-01 20:32:34 +00:00
Alan Cox	ef5ba5a31d	Simplify the definition of vm_page_lock_assert(). There is no compelling reason to inline the implementation of vm_page_lock_assert() in the !KLD_MODULES case. Use the same implementation for both KLD_MODULES and !KLD_MODULES. Reviewed by: kib	2013-05-31 16:00:42 +00:00
Konstantin Belousov	7560005c41	After the object lock was dropped, the object's reference count could change. Retest the ref_count and, if it is no longer 1, return from the function so that the subsequent code, which assumes ref_count == 1, is not executed. Also, do not leak the vnode lock if another thread cleared the OBJ_TMPFS flag in the meantime. Reported by: bdrewery Tested by: bdrewery, pho Sponsored by: The FreeBSD Foundation	2013-05-30 20:00:19 +00:00
Konstantin Belousov	782d4a636b	Remove the capitalization in the assertion message. Print the address of the object to get useful information from optimized kernel dumps.	2013-05-30 19:53:31 +00:00
Attilio Rao	c25673ffd6	o Change the locking scheme for swp_bcount. It can now be accessed with a write lock on the object containing it OR with a read lock on the object containing it along with the swhash_mtx. o Remove some duplicate assertions for swap_pager_freespace() and swap_pager_unswapped() but keep the object locking references for documentation. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-05-28 22:07:23 +00:00
Attilio Rao	83b375ea16	Acquire read lock on the src object for vm_fault_copy_entry(). Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-05-22 15:11:00 +00:00
Attilio Rao	9af6d512f5	o Relax locking assertions for vm_page_find_least() o Relax locking assertions for pmap_enter_object() and add them also to architectures that currently don't have any o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade operation on the per-object rwlock o Use all the mechanisms above to make vm_map_pmap_enter() work with only read locks most of the time. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-05-21 20:38:19 +00:00
Konstantin Belousov	4fab678be2	Add ddb command 'show pginfo' which provides useful information about a vm page, denoted either by an address of the struct vm_page, or, if the '/p' modifier is specified, by a physical address of the corresponding frame. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-05-21 11:04:00 +00:00
Alan Cox	c141ae7f49	Relax the object locking in vm_fault_prefault(). A read lock suffices. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-05-17 19:02:36 +00:00
Alan Cox	767a6420bc	Relax the object locking assertion in vm_page_lookup(). Now that a radix tree is used to maintain the object's collection of resident pages, vm_page_lookup() no longer needs an exclusive lock. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-05-17 18:49:43 +00:00
Attilio Rao	7e226537c7	o Add accessor functions to add and remove pages from a specific freelist. o Split the pool of free pages queues really by domain and do not rely on the definition of VM_RAW_NFREELIST. o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific function that is called when calculating the allocation domain. The RR counter is kept, currently, per-thread. In the future it is expected that such function evolves into a real policy decision referee, based on specific information retrieved from per-thread and per-vm_object attributes. o Add the concept of "probed domains" under the form of vm_ndomains. It is the responsibility of every architecture willing to support multiple memory domains to correctly probe vm_ndomains along with mem_affinity segments attributes. Those two values are supposed to remain always consistent. Please also note that vm_ndomains and td_dom_rr_idx are both int because segments already store domains as int. Ideally u_int would make much more sense. Probably this should be cleaned up in the future. o Apply RR domain selection also to vm_phys_zero_pages_idle(). Sponsored by: EMC / Isilon storage division Partly obtained from: jeff Reviewed by: alc Tested by: jeff	2013-05-13 15:40:51 +00:00
Peter Wemm	df839389c5	Bandaid for compiling with gcc, which happens to be the default compiler for a number of platforms still.	2013-05-13 07:09:31 +00:00
Alan Cox	404eb1b3fd	Refactor vm_page_alloc()'s interactions with vm_reserv_alloc_page() and vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the free page queues lock is held and (2) vm_radix_lookup_le() is called at most once. This change reduces the average time that the free page queues lock is held by vm_page_alloc() as well as vm_page_alloc()'s average overall running time. Sponsored by: EMC / Isilon Storage Division	2013-05-12 16:50:18 +00:00
Alan Cox	9f2e600890	To reduce the amount of arithmetic performed in the various radix tree functions, reverse the numbering scheme for the levels. The highest numbered level in the tree now appears near the root instead of the leaves. Sponsored by: EMC / Isilon Storage Division	2013-05-11 18:01:41 +00:00
Attilio Rao	d0b5855eb2	Fix-up r250338 by completing the removal of VM_NDOMAIN in favor of MAXMEMDOM. This unbreaks builds. Sponsored by: EMC / Isilon storage division Reported by: adrian, jeli	2013-05-08 10:55:39 +00:00
Attilio Rao	941646f5ec	Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in order to match the MAXCPU concept. The change should also be useful for consolidation and consistency. Sponsored by: EMC / Isilon storage division Obtained from: jeff Reviewed by: alc	2013-05-07 22:46:24 +00:00
Alan Cox	bb0e1de4ab	Remove a redundant call to panic() from vm_radix_keydiff(). The assertion before the loop accomplishes the same thing. Sponsored by: EMC / Isilon Storage Division	2013-05-07 18:45:34 +00:00
Alan Cox	2d4b9a6438	Optimize vm_radix_lookup_ge() and vm_radix_lookup_le(). Specifically, change the way that these functions ascend the tree when the search for a matching leaf fails at an interior node. Rather than returning to the root of the tree and repeating the lookup with an updated key, maintain a stack of interior nodes that were visited during the descent and use that stack to resume the lookup at the closest ancestor that might have a matching descendant. Sponsored by: EMC / Isilon Storage Division Reviewed by: attilio Tested by: pho	2013-05-04 22:50:15 +00:00
John Baldwin	f5c4b077be	Fix two bugs in the current NUMA-aware allocation code: - vm_phys_alloc_freelist_pages() can be called by vm_page_alloc_freelist() to allocate a page from a specific freelist. In the NUMA case it did not properly map the public VM_FREELIST_* constants to the correct backing freelists, nor did it try all NUMA domains for allocations from VM_FREELIST_DEFAULT. - vm_phys_alloc_pages() did not pin the thread and each call to vm_phys_alloc_freelist_pages() fetched the current domain to choose which freelist to use. If a thread migrated domains during the loop in vm_phys_alloc_pages() it could skip one of the freelists. If the other freelists were out of memory then it is possible that vm_phys_alloc_pages() would fail to allocate a page even though pages were available resulting in a panic in vm_page_alloc(). Reviewed by: alc MFC after: 1 week	2013-05-03 18:58:37 +00:00
Konstantin Belousov	53f5f8a0e1	Add a hint suggesting why tmpfs does not need a special case there.	2013-05-02 18:35:12 +00:00
Konstantin Belousov	6f2af3fcf3	Rework the handling of the tmpfs node backing swap object and tmpfs vnode v_object to avoid double-buffering. Use the same object both as the backing store for tmpfs node and as the v_object. Besides reducing memory use by up to 2x when files from tmpfs are mapped, it also makes tmpfs read and write operations copy half as many bytes. VM subsystem was already slightly adapted to tolerate OBJT_SWAP object as v_object. Now the vm_object_deallocate() is modified to not reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle VV_TEXT flag on the last dereference of the tmpfs backing object. Reviewed by: alc Tested by: pho, bf MFC after: 1 month	2013-04-28 19:38:59 +00:00
Konstantin Belousov	e5f299ff76	Make vm_object_page_clean() and vm_mmap_vnode() tolerate a vnode's v_object of non-OBJT_VNODE type. For vm_object_page_clean(), simply do not assert that object type must be OBJT_VNODE, and add a comment explaining how the check for OBJ_MIGHTBEDIRTY prevents the rest of the function from operating on such objects. For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it to be for swap pager (or default), handle the bypass filesystems, and correctly acquire the object reference in this case. Reviewed by: alc Tested by: pho, bf MFC after: 1 week	2013-04-28 19:25:09 +00:00
Konstantin Belousov	9b8851faae	Assert that the object type for the vnode's non-NULL v_object, passed to vnode_pager_setsize(), is either OBJT_VNODE, or, if vnode was already reclaimed, OBJT_DEAD. Note that the latter is only possible because some filesystems, in particular nfsiods from nfs clients, call vnode_pager_setsize() with an unlocked vnode. Moreover, if the object is terminated, do not perform the resizing operation. Reviewed by: alc Tested by: pho, bf MFC after: 1 week	2013-04-28 19:19:26 +00:00
Konstantin Belousov	6ded84276d	Convert panic() into KASSERT(). Reviewed by: alc MFC after: 1 week	2013-04-28 18:40:55 +00:00
Alan Cox	82af926a57	Eliminate an unneeded call to vm_radix_trimkey() from vm_radix_lookup_le(). This call is clearing bits from the key that will be set again by the next line. Sponsored by: EMC / Isilon Storage Division	2013-04-28 08:29:00 +00:00
Alan Cox	40076ebc5c	Avoid some lookup restarts in vm_radix_lookup_{ge,le}(). Sponsored by: EMC / Isilon Storage Division	2013-04-27 16:44:59 +00:00
Gleb Smirnoff	08a3102c0b	Panic if UMA_ZONE_PCPU is created at early stages of boot, when mp_ncpus isn't yet initialized. Otherwise we will panic at first allocation later. Sponsored by: Nginx, Inc.	2013-04-22 09:02:23 +00:00
Alan Cox	384875a3a6	Simplify vm_radix_{add,dec}lev(). Sponsored by: EMC / Isilon Storage Division	2013-04-22 01:26:13 +00:00
Alan Cox	880659fe81	When calculating the number of reserved nodes, discount the pages that will be used to store the nodes. Sponsored by: EMC / Isilon Storage Division	2013-04-18 05:34:33 +00:00
Alan Cox	a08f2cf69e	Although we perform path compression to reduce the height of the trie and the number of interior nodes, we have previously created a level zero interior node at the root of every non-empty trie, even when that node is not strictly necessary, i.e., it has only one child. This change is the second (and final) step in eliminating those unnecessary level zero interior nodes. Specifically, it updates the deletion and insertion functions so that they do not require a level zero interior node at the root of the trie. For a "buildworld" workload, this change results in a 16.8% reduction in the number of interior nodes allocated and a similar reduction in the average execution time for lookup functions. For example, the average execution time for a call to vm_radix_lookup_ge() is reduced by 22.9%. Reviewed by: attilio, jeff (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-04-15 06:12:00 +00:00
Alan Cox	6f9c0b15bb	Although we perform path compression to reduce the height of the trie and the number of interior nodes, we always create a level zero interior node at the root of every non-empty trie, even when that node is not strictly necessary, i.e., it has only one child. This change is the first step in eliminating those unnecessary level zero interior nodes. Specifically, it updates all of the lookup functions so that they do not require a level zero interior node at the root. Reviewed by: attilio, jeff (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-04-12 20:21:28 +00:00
Gleb Smirnoff	85dcf349c1	Convert UMA code to C99 uintXX_t types.	2013-04-09 17:43:48 +00:00
Gleb Smirnoff	04fc5741e0	Swap us_freecount and us_flags, achieving same structure size as before previous commit. Submitted by: alc	2013-04-09 17:25:15 +00:00
Gleb Smirnoff	8cf455b8d9	Since now we support 256 items per slab, we need more bits for us_freecount. This grows uma_slab_head on 32-bit arches, but growth isn't significant. Taking kmem zones as example, only the 32 byte zone is affected, ipers is reduced from 113 to 112. In collaboration with: kib	2013-04-09 15:15:52 +00:00
Gleb Smirnoff	025071f2af	Fix KASSERTs: maximum number of items per slab is 256.	2013-04-09 12:20:44 +00:00
Konstantin Belousov	b9781cf650	Fix the assertions for the state of the object under the map entry with the MAP_ENTRY_VN_WRITECNT flag: - Move the assertion that verifies the state of the v_writecount and vnp.writecount, under the block where the object is locked. - Check that the object type is OBJT_VNODE before asserting. Reported by: avg Reviewed by: alc MFC after: 1 week	2013-04-09 10:04:10 +00:00
Attilio Rao	a15f7df5de	The per-page act_count can be made very-easily protected by the per-page lock rather than vm_object lock, without any further overhead. Make the formal switch. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-04-08 20:02:27 +00:00
Gleb Smirnoff	ad97af7ebd	Merge from projects/counters: UMA_ZONE_PCPU zones. These zones have slab size == sizeof(struct pcpu), but request from VM enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such zone would have a separate twin for each CPU in the system, and these twins are at a distance of sizeof(struct pcpu) from each other. This magic value of distance would allow us to make some optimizations later. To address private item from a CPU simple arithmetics should be used: item = (type *)((char *)base + sizeof(struct pcpu) * curcpu) These arithmetics are available as zpcpu_get() macro in pcpu.h. To introduce non-page size slabs a new field had been added to uma_keg uk_slabsize. This shifted some frequently used fields of uma_keg to the fourth cache line on amd64. To mitigate this pessimization, uma_keg fields were a bit rearranged and least frequently used uk_name and uk_link moved down to the fourth cache line. All other fields, that are dereferenced frequently fit into first three cache lines. Sponsored by: Nginx, Inc.	2013-04-08 19:10:45 +00:00
Alan Cox	2c899fede2	Micro-optimize the order of struct vm_radix_node's fields. Specifically, arrange for all of the fields to start at a short offset from the beginning of the structure. Eliminate unnecessary masking of VM_RADIX_FLAGS from the root pointer in vm_radix_getroot(). Sponsored by: EMC / Isilon Storage Division	2013-04-07 01:30:51 +00:00
Jeff Roberson	26089666b6	Prepare to replace the buf splay with a trie: - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases. Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division	2013-04-06 22:21:23 +00:00
Alan Cox	c1c82b36ad	Simplify vm_radix_keybarr(). Sponsored by: EMC / Isilon Storage Division	2013-04-06 18:04:35 +00:00
Alan Cox	72abda6466	Simplify vm_radix_insert(). Reviewed by: attilio Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-04-06 06:02:55 +00:00
Alan Cox	96f1a84272	Replace the remaining uses of vm_radix_node_page() by vm_radix_isleaf() and vm_radix_topage(). This transformation eliminates some unnecessary conditional branches from the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(), and vm_radix_remove(). Simplify the control flow of vm_radix_lookup_{ge,le}(). Reviewed by: attilio (an earlier version) Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-04-03 06:37:25 +00:00
Konstantin Belousov	bafa6cfc93	Release the v_writecount reference on the vnode in case of error, before the vnode is vput() in vm_mmap_vnode(). Error return means that there is no use reference on the vnode from the vm object reference, and failing to restore v_writecount breaks the invariant that v_writecount is less than or equal to the usecount. The situation was observed when the nfs client returned ESTALE for VOP_GETATTR() after the open. In collaboration with: pho MFC after: 1 week	2013-03-28 06:39:27 +00:00
Alan Cox	3fc10b7363	Introduce vm_radix_isleaf() and use it in a couple places. As compared to using vm_radix_node_page() == NULL, the compiler is able to generate one less conditional branch when vm_radix_isleaf() is used. More use cases involving the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(), and vm_radix_remove() will follow. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division	2013-03-26 17:30:40 +00:00
Alan Cox	652615dcb7	Micro-optimize the control flow in a few places. Eliminate a panic call that could never be reached in vm_radix_insert(). (If the pointer being checked by the panic call were ever NULL, the immediately preceding loop would have already crashed on a NULL pointer dereference.) Reviewed by: attilio (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-03-24 16:43:07 +00:00
Konstantin Belousov	7db07e1c85	Only size and create the bio_transient_map when unmapped buffers are enabled. Now, disabling the unmapped buffers should result in the kernel memory map identical to pre-r248550. Sponsored by: The FreeBSD Foundation	2013-03-21 07:28:15 +00:00
Konstantin Belousov	6991ee13a6	Fix the logic inversion in the r248512. Noted by: mckay	2013-03-20 09:44:23 +00:00
Konstantin Belousov	2cc718a11c	Do not map the swap i/o pbufs if the geom provider for the swap partition accepts unmapped requests. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:39:27 +00:00
Konstantin Belousov	6ce697dc73	Pass unmapped buffers for page in requests if the filesystem indicated support for the unmapped i/o. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:36:28 +00:00
Konstantin Belousov	ee75e7de7b	Implement the concept of the unmapped VMIO buffers, i.e. buffers which do not map the b_pages pages into buffer_map KVA. The use of the unmapped buffers eliminates the need to perform TLB shootdown for mapping on the buffer creation and reuse, greatly reducing the amount of IPIs for shootdown on big-SMP machines and eliminating up to 25-30% of the system time on i/o intensive workloads. The unmapped buffer should be explicitly requested by the GB_UNMAPPED flag by the consumer. For unmapped buffer, no KVA reservation is performed at all. The consumer might request unmapped buffer which does have a KVA reserve, to manually map it without recursing into buffer cache and blocking, with the GB_KVAALLOC flag. When the mapped buffer is requested and unmapped buffer already exists, the cache performs an upgrade, possibly reusing the KVA reservation. Unmapped buffer is translated into unmapped bio in g_vfs_strategy(). Unmapped bio carry a pointer to the vm_page_t array, offset and length instead of the data pointer. The provider which processes the bio should explicitly specify a readiness to accept unmapped bio, otherwise g_down geom thread performs the transient upgrade of the bio request by mapping the pages into the new bio_transient_map KVA submap. The bio_transient_map submap claims up to 10% of the buffer map, and the total buffer_map + bio_transient_map KVA usage stays the same. Still, it could be manually tuned by kern.bio_transient_maxcnt tunable, in the units of the transient mappings. Eventually, the bio_transient_map could be removed after all geom classes and drivers can accept unmapped i/o requests. Unmapped support can be turned off by the vfs.unmapped_buf_allowed tunable, disabling which makes the buffer (or cluster) creation requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are only enabled by default on the architectures where pmap_copy_page() was implemented and tested. 
In the rework, filesystem metadata is no longer subject to the maxbufspace limit. Since the metadata buffers are always mapped, the buffers still have to fit into the buffer map, which provides a reasonable (but practically unreachable) upper bound on it. The non-metadata buffer allocations, both mapped and unmapped, are accounted against maxbufspace, as before. Effectively, this means that the maxbufspace is enforced on mapped and unmapped buffers separately. The pre-patch bufspace limiting code did not work, because buffer_map fragmentation does not allow the limit to be reached. At Jeff Roberson's request, the getnewbuf() function was split into smaller single-purpose functions. Sponsored by: The FreeBSD Foundation Discussed with: jeff (previous version) Tested by: pho, scottl (previous version), jhb, bf MFC after: 2 weeks	2013-03-19 14:13:12 +00:00
Attilio Rao	774d251d99	Sync back vmcontention branch into HEAD: Replace the per-object resident and cached pages splay tree with a path-compressed multi-digit radix trie. Along with this, switch also the x86-specific handling of idle page tables to using the radix trie. This change is supposed to do the following: - Allowing the acquisition of read locking for lookup operations of the resident/cached pages collections as the per-vm_page_t splay iterators are now removed. - Increase the scalability of the operations on the page collections. The radix trie does rely on the consumers locking to ensure atomicity of its operations. In order to avoid deadlocks the bisection nodes are pre-allocated in the UMA zone. This can be done safely because the algorithm needs at maximum one new node per insert which means the maximum number of the desired nodes is the number of available physical frames themselves. However, not all the times a new bisection node is really needed. The radix trie implements path-compression because UFS indirect blocks can lead to several objects with a very sparse trie, increasing the number of levels to usually scan. It also helps in the nodes pre-fetching by introducing the single node per-insert property. This code is not generalized (yet) because of the possible loss of performance by having much of the sizes in play configurable. However, efforts to make this code more general and then reusable in further different consumers might be really done. The only KPI change is the removal of the function vm_page_splay() which is now reaped. The only KBI change, instead, is the removal of the left/right iterators from struct vm_page, which are now reaped. 
Further technical notes, broken into smaller pieces, can be retrieved from the svn branch: http://svn.freebsd.org/base/user/attilio/vmcontention/ Sponsored by: EMC / Isilon storage division In collaboration with: alc, jeff Tested by: flo, pho, jhb, davide Tested by: ian (arm) Tested by: andreast (powerpc)	2013-03-18 00:25:02 +00:00
Konstantin Belousov	70e198dd07	Some style fixes. Sponsored by: The FreeBSD Foundation	2013-03-14 20:31:39 +00:00
Konstantin Belousov	e8a4a618cf	Add pmap function pmap_copy_pages(), which copies the content of the pages around, taking array of vm_page_t both for source and destination. Starting offsets and total transfer size are specified. The function implements optimal algorithm for copying using the platform-specific optimizations. For instance, on the architectures where the direct map is available, no transient mappings are created, for i386 the per-cpu ephemeral page frame is used. The code was typically borrowed from the pmap_copy_page() for the same architecture. Only i386/amd64, powerpc aim and arm/arm-v6 implementations were tested at the time of commit. High-level code, not committed yet to the tree, ensures that the use of the function is only allowed after explicit enablement. For sparc64, the existing code has known issues and a stub is added instead, to allow the kernel linking. Sponsored by: The FreeBSD Foundation Tested by: pho (i386, amd64), scottl (amd64), ian (arm and arm-v6) MFC after: 2 weeks	2013-03-14 20:18:12 +00:00
Konstantin Belousov	e7788a47e3	Remove excessive and inconsistent initializers for the various kernel maps and submaps. MFC after: 2 weeks	2013-03-14 19:50:09 +00:00
Attilio Rao	4bc80a3402	Simplify vm_page_is_valid(). Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-03-12 12:20:49 +00:00
Alan Cox	34496b53ee	Update a comment: The object lock is no longer a mutex.	2013-03-09 21:32:24 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follows: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. For this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI is heavily broken by this commit. Third-party ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Attilio Rao	c934116100	Merge from vmc-playground: Introduce a new KPI that verifies if the page cache is empty for a specified vm_object. This KPI does not make assumptions about the locking in order to be used also for building assertions at init and destroy time. It is mostly used to hide implementation details of the page cache. Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: alc (vm_radix based version) Tested by: flo, pho, jhb, davide	2013-03-09 02:05:29 +00:00
Andre Oppermann	15ae0c9af9	Move the callout subsystem initialization to its own SYSINIT() from being indirectly called via cpu_startup()+vm_ksubmap_init(). The boot order position remains the same at SI_SUB_CPU. Allocation of the callout array is changed to standard kernel malloc from a slightly obscure direct kernel_map allocation. kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init() to better describe its purpose. kern_timeout_callwheel_init() is removed simplifying the per-cpu initialization. Reviewed by: davide	2013-03-08 10:37:17 +00:00
Attilio Rao	198da1b2fa	Merge from vmcontention: As vm objects are type-stable there is no need to initialize the resident splay tree pointer and the cache splay tree pointer in _vm_object_allocate() but this could be done in the init UMA zone handler. The destructor UMA zone handler, will further check if the condition is retained at every destruction and catch for bugs. Sponsored by: EMC / Isilon storage division Submitted by: alc	2013-03-04 13:10:59 +00:00
Alan Cox	55f33f2caf	The value held by the vm object's field pg_color is only considered valid if the flag OBJ_COLORED is set. Since _vm_object_allocate() doesn't set this flag, it needn't initialize pg_color. Sponsored by: EMC / Isilon Storage Division	2013-03-02 18:07:29 +00:00
Pawel Jakub Dawidek	2609222ab4	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrieved with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrieve them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavily modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and even though I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. 
- Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ \| PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ \| PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE \| PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ \| PROT_WRITE \| PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNOD to CAP_MKNODAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). 
Added convenient defines: #define CAP_PREAD (CAP_SEEK \| CAP_READ) #define CAP_PWRITE (CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP \| CAP_SEEK \| CAP_READ) #define CAP_MMAP_W (CAP_MMAP \| CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP \| CAP_SEEK \| 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R \| CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R \| CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W \| CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R \| CAP_MMAP_W \| CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| CAP_GETSOCKOPT \| \ CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| CAP_SETSOCKOPT \| CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT \| CAP_BIND \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| \ CAP_GETSOCKOPT \| CAP_LISTEN \| CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| \ CAP_SETSOCKOPT \| CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT \| CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib	2013-03-02 00:53:12 +00:00
Attilio Rao	dc1558d1cd	Merge from vmobj-rwlock: VM_OBJECT_LOCKED() macro is only used to implement a custom version of lock assertions right now (which likely spread out thanks to copy and paste). Remove it and implement actual assertions. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-02-27 18:12:13 +00:00
Attilio Rao	a4915c21d9	Merge from vmc-playground branch: Replace the sub-optimal uma_zone_set_obj() primitive with more modern uma_zone_reserve_kva(). The new primitive reserves beforehand the necessary KVA space to cater the zone allocations and allocates pages with ALLOC_NOOBJ. More specifically: - uma_zone_reserve_kva() does not need an object to cater the backend allocator. - uma_zone_reserve_kva() can cater M_WAITOK requests, in order to serve zones which need to do uma_prealloc() too. - When possible, uma_zone_reserve_kva() uses directly the direct-mapping by uma_small_alloc() rather than relying on the KVA / offset combination. The removal of the object attribute allows 2 further changes: 1) _vm_object_allocate() becomes static within vm_object.c 2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by direct calls to mtx_init() as there is no need to export it anymore and the calls aren't either homogeneous anymore: there are now small differences between arguments passed to mtx_init(). Sponsored by: EMC / Isilon storage division Reviewed by: alc (which also offered almost all the comments) Tested by: pho, jhb, davide	2013-02-26 23:35:27 +00:00
Attilio Rao	64a3476f0c	Remove white spaces. Sponsored by: EMC / Isilon storage division	2013-02-26 20:35:40 +00:00
Attilio Rao	0dde287b20	Wrap the sleeps synchronized by the vm_object lock into the specific macro VM_OBJECT_SLEEP(). This hides some implementation details like the usage of the msleep() primitive and the necessity to access to the lock address directly. For this reason VM_OBJECT_MTX() macro is now retired. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-02-26 17:22:08 +00:00
Alan Cox	fc23011bc3	On arm, like sparc64, the end of the kernel map varies from one type of machine to another. Therefore, VM_MAX_KERNEL_ADDRESS can't be a constant. Instead, #define it to be a variable, vm_max_kernel_address, just like we do on sparc64. Reviewed by: kib Tested by: ian	2013-02-18 01:02:48 +00:00
John Baldwin	174b5f3850	Make VM_NDOMAIN a kernel option so that it can be enabled from a kernel config file. Requested by: phk (ages ago) MFC after: 1 month	2013-02-14 19:38:04 +00:00
Marius Strobl	94bfd5b1a0	Try to improve r242655 take III: move these SYSCTLs describing the kernel map, which is defined and initialized in vm/vm_kern.c, to the latter. Submitted by: alc	2013-02-04 09:35:48 +00:00
Gleb Smirnoff	3caae6ca60	Fix typo in debug printf.	2013-01-29 19:06:16 +00:00
Andrey Zonov	b3a01bdf1f	- Add system wide page faults requiring I/O counter. Reviewed by: alc MFC after: 2 weeks	2013-01-28 12:54:53 +00:00
Andrey Zonov	536368691a	- Add sysctls to show number of stats scans. MFC after: 2 weeks	2013-01-28 12:20:20 +00:00
Andrey Zonov	4a36532940	- Style. MFC after: 2 weeks	2013-01-28 12:08:29 +00:00
Andrey Zonov	1cc20081df	- Get rid of unused function vmspace_wired_count(). Reviewed by: alc Approved by: kib (mentor) MFC after: 1 week	2013-01-14 12:12:56 +00:00
Andrey Zonov	cde4a72547	- Improve readability of sys_obreak(). Suggested by: alc Reviewed by: alc Approved by: kib (mentor) MFC after: 1 week	2013-01-11 09:58:35 +00:00
Andrey Zonov	3ac7d29722	- Reduce kernel size by removing unnecessary pointer indirections. GENERIC kernel size reduced by 16 bytes and RACCT kernel by 336 bytes. Suggested by: alc Reviewed by: alc Approved by: kib (mentor) MFC after: 1 week	2013-01-10 12:43:58 +00:00
Kenneth D. Merry	43ab9660c5	Fix a bug in the device pager code that can trigger an assertion in devfs if a particular race condition is hit in the device pager code. This was a side effect of change 227530 which changed the device pager interface to call a new destructor routine for the cdev. That destructor routine, old_dev_pager_dtor(), takes a VM object handle. The object handle is cast to a struct cdev *, and passed into dev_rel(). That works in most cases, except the case in cdev_pager_allocate() where there is a race condition between two threads allocating an object backed by the same device. The loser of the race deallocates its object at the end of the function. The problem is that before inserting the object into the dev_pager_object_list, the object's handle is changed from the struct cdev pointer to the object's own address. This is to avoid conflicts with the winner of the race, which already inserted an object in the list with a handle that is a pointer to the same cdev structure. The object is then passed to vm_object_deallocate(), and eventually makes its way down to old_dev_pager_dtor(). That function passes the handle pointer (which is actually a VM object, not a struct cdev as usual) into dev_rel(). dev_rel() decrements the reference count in the assumed struct cdev (which happens to be 0), and that triggers the assertion in dev_rel() that the reference count is greater than or equal to 0. The fix is to add a cdev pointer to the VM object, and use that pointer when calling the cdev_pg_dtor() routine. vm_object.h: Add a struct cdev pointer to the VM object structure. device_pager.c: In cdev_pager_allocate(), populate the new cdev pointer. In dev_pager_dealloc(), use the new cdev pointer when calling the object's cdev_pg_dtor() routine. Reviewed by: kib Sponsored by: Spectra Logic Corporation MFC after: 1 week	2013-01-09 16:48:38 +00:00
Gleb Smirnoff	936c747be0	Comment fix: there is no ub_ptr, instead explain meaning of uz_count field verbally.	2012-12-21 10:09:45 +00:00
Andrey Zonov	7e19eda4aa	- Fix locked memory accounting for maps with MAP_WIREFUTURE flag. - Add sysctl vm.old_mlock which may turn such accounting off. Reviewed by: avg, trasz Approved by: kib (mentor) MFC after: 1 week	2012-12-18 07:35:01 +00:00
Alan Cox	2863482058	In the past four years, we've added two new vm object types. Each time, similar changes had to be made in various places throughout the machine- independent virtual memory layer to support the new vm object type. However, in most of these places, it's actually not the type of the vm object that matters to us but instead certain attributes of its pages. For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain fictitious pages. In other words, in most of these places, we were testing the vm object's type to determine if it contained fictitious (or unmanaged) pages. To both simplify the code in these places and make the addition of future vm object types easier, this change introduces two new vm object flags that describe attributes of the vm object's pages, specifically, whether they are fictitious or unmanaged. Reviewed and tested by: kib	2012-12-09 00:32:38 +00:00
Pawel Jakub Dawidek	b0ae014466	White-space cleanups.	2012-12-08 09:23:05 +00:00
Pawel Jakub Dawidek	2f891cd504	Implemented uma_zone_set_warning(9) function that sets a warning, which will be printed once the given zone becomes full and cannot allocate an item. The warning will not be printed more often than every five minutes. All UMA warnings can be globally turned off by setting sysctl/tunable vm.zone_warnings to 0. Discussed on: arch Obtained from: WHEEL Systems MFC after: 2 weeks	2012-12-07 22:27:13 +00:00
Alan Cox	96b0b92ac1	Add support for the (relatively) new object type OBJT_MGTDEVICE to vm_object_set_memattr(). Also, add a "safety belt" so that vm_object_set_memattr() doesn't silently modify undefined object types. Reviewed by: kib MFC after: 10 days	2012-11-28 18:29:34 +00:00
Alan Cox	a922d312b0	Make a few small changes to vm_map_pmap_enter(): Add detail to the comment describing this function. In particular, describe what MAP_PREFAULT_PARTIAL does. Eliminate the abrupt change in behavior when the specified address range grows from MAX_INIT_PT pages to MAX_INIT_PT plus one pages. Instead of doing nothing, i.e., preloading no mappings whatsoever, map any resident pages that fall within the start of the specified address range, i.e., [addr, addr + ulmin(size, ptoa(MAX_INIT_PT))). Long ago, the vm object's list of resident pages was not ordered, so this function had to choose between probing the global hash table of all resident pages and iterating over the vm object's unordered list of resident pages. Now, the list is ordered, so there is no reason for MAP_PREFAULT_PARTIAL to be concerned with the vm object's count of resident changes. MFC after: 14 days	2012-11-25 19:42:36 +00:00
Alan Cox	0d69690e8f	Correct an error in r230623. When both VM_ALLOC_NODUMP and VM_ALLOC_ZERO were specified to vm_page_alloc(), PG_NODUMP wasn't being set on the allocated page when it happened to be pre-zeroed. MFC after: 5 days	2012-11-21 06:26:18 +00:00
Jaakko Heinonen	02c62349c9	- Don't pass geom and provider names as format strings. - Add __printflike() attributes. - Remove an extra argument for the g_new_geomf() call in swapongeom_ev(). Reviewed by: pjd	2012-11-20 12:32:18 +00:00
Alan Cox	969a0af09d	Update a comment to reflect the elimination of the hold queue in r242300.	2012-11-17 04:00:19 +00:00
Konstantin Belousov	43f48b65c0	Move the declaration of vm_phys_paddr_to_vm_page() from vm/vm_page.h to vm/vm_phys.h, where it belongs. Requested and reviewed by: alc MFC after: 2 weeks	2012-11-16 05:55:56 +00:00
Konstantin Belousov	962b064afe	Explicitly state that M_USE_RESERVE requires M_NOWAIT, using assertion. Reviewed by: alc MFC after: 2 weeks	2012-11-16 05:49:56 +00:00
Konstantin Belousov	b32ecf44bc	Flip the semantic of M_NOWAIT to only require the allocation to not sleep, and perform the page allocations with VM_ALLOC_SYSTEM class. Previously, the allocation was also allowed to completely drain the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT request class for vm_page_alloc() and similar functions. Allow the caller of malloc* to request the 'deep drain' semantic by providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM allocation class. Centralize the translation of the M_* malloc(9) flags in the single inline function malloc2vm_flags(). Discussion started by: "Sears, Steven" <Steven.Sears@netapp.com> Reviewed by: alc, mdf (previous version) Tested by: pho (previous version) MFC after: 2 weeks	2012-11-14 20:01:40 +00:00
Alan Cox	8d22020384	Replace the single, global page queues lock with per-queue locks on the active and inactive paging queues. Reviewed by: kib	2012-11-13 02:50:39 +00:00
Attilio Rao	2ebcd458e3	Fix DDB command "show map XXX": - Check that an argument is always available, otherwise current map printing before to recurse is garbage. - Spit out a message if an argument is not provided. - Remove unread nlines variable. - Use an explicit recursive function, disassociated from the DB_SHOW_COMMAND() body, in order to make clear prototype and recursion of the above mentioned function. The code results now much less obscure. Submitted by: gianni	2012-11-12 00:30:40 +00:00
Konstantin Belousov	140dedb81c	The r241025 fixed the case when a binary, executed from nullfs mount, was still possible to open for write from the lower filesystem. There is a symmetric situation where the binary could already have file descriptors opened for write, but it can be executed from the nullfs overlay. Handle the issue by passing one v_writecount reference to the lower vnode if nullfs vnode has non-zero v_writecount. Note that only one write reference can be donated, since nullfs only keeps one use reference on the lower vnode. Always use the lower vnode v_writecount for the checks. Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT to manipulate the v_writecount value, which manages a single bypass reference to the lower vnode. Calling the VOPs instead of directly accessing v_writecount provides the fix described in the previous paragraph. Tested by: pho MFC after: 3 weeks	2012-11-02 13:56:36 +00:00
Alan Cox	9fc4739d2a	In general, we call pmap_remove_all() before calling vm_page_cache(). So, the call to pmap_remove_all() within vm_page_cache() is usually redundant. This change eliminates that call to pmap_remove_all() and introduces a call to pmap_remove_all() before vm_page_cache() in the one place where it didn't already exist. When iterating over a paging queue, if the object containing the current page has a zero reference count, then the page can't have any managed mappings. So, a call to pmap_remove_all() is pointless. Change a panic() call in vm_page_cache() to a KASSERT(). MFC after: 6 weeks	2012-11-01 16:20:02 +00:00
Attilio Rao	4ceaf45de5	Rework the known mutexes to benefit from staying on their own cache line in order to avoid manual frobbing but using struct mtx_padalign. The sole exception being nvme and sfxge drivers, where the author redefined CACHE_LINE_SIZE manually, so they need to be analyzed and dealt with separately. Reviewed by: jimharris, alc	2012-10-31 18:07:18 +00:00
Alan Cox	081a488159	Replace the page hold queue, PQ_HOLD, by a new page flag, PG_UNHOLDFREE, because the queue itself serves no purpose. When a held page is freed, inserting the page into the hold queue has the side effect of setting the page's "queue" field to PQ_HOLD. Later, when the page is unheld, it will be freed because the "queue" field is PQ_HOLD. In other words, PQ_HOLD is used as a flag, not a queue. So, this change replaces it with a flag. To accommodate the new page flag, make the page's "flags" field wider and "oflags" field narrower. Reviewed by: kib	2012-10-29 06:15:04 +00:00
Edward Tomasz Napierala	a406d8c319	Remove useless check; vm_pindex_t is unsigned on all architectures. CID: 3701 Found with: Coverity Prevent	2012-10-28 20:03:57 +00:00
Matthew D Fleming	bb196eb480	Const-ify the zone name argument to uma_zcreate(9). MFC after: 3 days	2012-10-26 17:51:05 +00:00
Andre Oppermann	25c1e16409	Move the corresponding MTX_SYSINIT() next to their struct mtx declaration to make their relationship more obvious as done with the other such mutexes.	2012-10-26 17:31:35 +00:00
Konstantin Belousov	ef45823eba	Commit the actual text provided by Alan, instead of the wrong update in r242011. MFC after: 1 week	2012-10-24 18:32:37 +00:00
Konstantin Belousov	bc79b37f2c	Dirty the newly copied anonymous pages after the wired region is forked. Otherwise, pagedaemon might reclaim the page without saving its content into the swap file, resulting in the valid content replaced by zeroes. Reported and tested by: pho Reviewed and comment update by: alc MFC after: 1 week	2012-10-24 18:21:59 +00:00
Konstantin Belousov	5050aa86cf	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho	2012-10-22 17:50:54 +00:00
Eitan Adler	0b80c1e400	Print flags as hex instead of an integer. PR: kern/168210 Submitted by: linimon Reviewed by: alc Approved by: cperciva MFC after: 3 days	2012-10-22 02:11:57 +00:00
Alan Cox	7ecfabc7bb	Move vm_page_requeue() to the only file that uses it. MFC after: 3 weeks	2012-10-13 20:19:43 +00:00
Alan Cox	9af47af64a	Eliminate the conditional for releasing the page queues lock in vm_page_sleep(). vm_page_sleep() is no longer called with this lock held. Eliminate assertions that the page queues lock is NOT held. These assertions won't translate well to having distinct locks on the active and inactive page queues, and they really aren't that useful. MFC after: 3 weeks	2012-10-13 18:46:46 +00:00
Alan Cox	4db2c4b8c7	Tidy up a bit: Update some of the comments. In particular, use "sleep" in preference to "block" where appropriate. Eliminate some unnecessary casts. Make a few whitespace changes for consistency. Reviewed by: kib MFC after: 3 days	2012-10-03 05:06:45 +00:00
Konstantin Belousov	877d24ac8a	Fix the mis-handling of the VV_TEXT on the nullfs vnodes. If you have a binary on a filesystem which is also mounted over by nullfs, you could execute the binary from the lower filesystem, or from the nullfs mount. When executed from lower filesystem, the lower vnode gets VV_TEXT flag set, and the file cannot be modified while the binary is active. But, if executed as the nullfs alias, only the nullfs vnode gets VV_TEXT set, and you still can open the lower vnode for write. Add a set of VOPs for the VV_TEXT query, set and clear operations, which are correctly bypassed to lower vnode. Tested by: pho (previous version) MFC after: 2 weeks	2012-09-28 11:25:02 +00:00
Alan Cox	1f11f2bff4	Address a race condition that was introduced in r238212. Unless the page queues lock is acquired before the page lock is released, there is no guarantee that the page will still be in that same page queue when vm_page_requeue() is called. Reported by: pho In collaboration with: kib MFC after: 3 days	2012-09-23 17:42:39 +00:00
Konstantin Belousov	5f9c767b19	Plug the accounting leak for the wired pages when msync(MS_INVALIDATE) is performed on the vnode mapping which is wired in other address space. While there, explicitly assert that the page is unwired and zero the wire_count instead of subtracting. The condition is rechecked later in vm_page_free(_toq) already. Reported and tested by: zont Reviewed by: alc (previous version) MFC after: 1 week	2012-09-20 09:52:57 +00:00
Gleb Smirnoff	2864dbbfc1	If caller specifies UMA_ZONE_OFFPAGE explicitly, then do not waste memory in an allocation for a slab. Reviewed by: jeff	2012-09-18 20:28:55 +00:00
Eitan Adler	96240c89f0	Correct double "the the" Approved by: cperciva MFC after: 3 days	2012-09-14 21:28:56 +00:00
Andrey Zonov	c4e357e8d3	- Simplify VM code by using vmspace_wired_count() for counting wired memory of a process. Reviewed by: avg Approved by: kib (mentor) MFC after: 2 weeks	2012-09-05 18:19:54 +00:00
Dag-Erling Smørgrav	f379b823bc	Whitespace cleanup.	2012-09-05 12:24:50 +00:00
Dag-Erling Smørgrav	dc1b35b525	No memory barrier is required. This was pointed out by kib@ a while ago, but I got distracted by other matters. (for real this time)	2012-09-04 22:19:33 +00:00
Dag-Erling Smørgrav	22a5e6b972	Revert previous commit, which was performed in the wrong tree.	2012-09-04 21:06:53 +00:00
Dag-Erling Smørgrav	db0390e833	No memory barrier is required. This was pointed out by kib@ a while ago, but I got distracted by other matters.	2012-09-04 19:04:02 +00:00
Andrey Zonov	cfe52ecf0e	- After r240026 sgrowsiz should be used in a safer manner. Approved by: kib (mentor) MFC after: 1 week	2012-09-03 09:34:46 +00:00
Andrey Zonov	e145130e71	- Remove accounting of locked memory from vsunlock(9) that I missed in r239818. Approved by: kib (mentor)	2012-08-30 08:03:33 +00:00
Andrey Zonov	126a63ce6c	- Don't take an account of locked memory for current process in vslock(9). There are two consumers of vslock(9): sysctl code and drm driver. These consumers are using locked memory as transient memory, it doesn't belong to a process's memory. Suggested by: avg Reviewed by: alc Approved by: kib (mentor) MFC after: 2 weeks	2012-08-29 11:23:59 +00:00
Sergey Kandaurov	9462305cbe	Typo in previous change: print half the theoretical maximum as maximum recommended amount. Reported by: <site freebsd at orientalsensation com> Reviewed by: des	2012-08-27 10:59:49 +00:00
Gleb Smirnoff	42321809c4	Fix function name in keg_cachespread_init() assert.	2012-08-26 09:54:11 +00:00
Dag-Erling Smørgrav	3ff863f1aa	- When running out of swzone, instead of spewing an error message every tick until the situation is resolved (if ever), just print a single message when running out and another when space becomes available. - When adding more swap, warn if the total amount exceeds half the theoretical maximum we can handle.	2012-08-16 08:29:49 +00:00
Konstantin Belousov	ee4116b8f7	For old mmap syscall, when executing on amd64 or ia64, enforce the PROT_EXEC if prot is non-zero, process is 32bit and kern.elf32.i386_read_exec sysctl is enabled. This workaround is needed for old i386 a.out binaries, where dynamic linker did not specify PROT_EXEC for mapping of the text. The kern.elf32.i386_read_exec MIB name looks weird for a.out binaries, but I reused the existing knob which already has the needed semantic. MFC after: 1 week	2012-08-14 12:11:48 +00:00
Konstantin Belousov	7707ccabfb	Adjust the r205536, by allowing a non-zero offset for anonymous mappings for a.out binaries. Apparently, a.out ld.so from FreeBSD 1.1.5.1 can issue such requests. Reported and tested by: Dan Plassche <dplassche@gmail.com> MFC after: 1 week	2012-08-14 11:47:07 +00:00
Konstantin Belousov	b6c00483e9	Do not leave invalid pages in the object after the short read for network file systems (not only NFS proper). Short reads cause pages other than the requested one, which were not filled by read response, to stay invalid. Change the vm_page_readahead_finish() interface to not take the error code, but instead to make a decision to free or to (de)activate the page only by its validity. As a result, not requested invalid pages are freed even if the read RPC indicated success. Noted and reviewed by: alc MFC after: 1 week	2012-08-14 11:45:47 +00:00
Alan Cox	18f55b0171	Never sleep on busy pages in vm_pageout_launder(), always skip them. Long ago, sleeping on busy pages in vm_pageout_launder() made sense. The call to vm_pageout_flush() specified asynchronous I/O and sleeping on busy pages blocked vm_pageout_launder() until the flush had completed. However, in CVS revision 1.35 of vm/vm_contig.c, the call to vm_pageout_flush() was changed to request synchronous I/O, but the sleep on busy pages was not removed.	2012-08-07 04:48:14 +00:00
Konstantin Belousov	1c771f9222	After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason to pull vm_param.h was removed. Other big dependency of vm_page.h on vm_param.h are PA_LOCK* definitions, which are only needed for in-kernel code, because modules use KBI-safe functions to lock the pages. Stop including vm_param.h into vm_page.h. Include vm_param.h explicitly for the kernel code which needs it. Suggested and reviewed by: alc MFC after: 2 weeks	2012-08-05 14:11:42 +00:00
Konstantin Belousov	0055cbd3c5	Reduce code duplication and exposure of direct access to struct vm_page oflags by providing helper function vm_page_readahead_finish(), which handles completed reads for pages with indexes other than the requested one, for VOP_GETPAGES(). Reviewed by: alc MFC after: 1 week	2012-08-04 18:16:43 +00:00
Alan Cox	369763e31a	Inline vm_page_aflags_clear() and vm_page_aflags_set(). Add comments stating that neither these functions nor the flags that they are used to manipulate are part of the KBI.	2012-08-03 01:48:15 +00:00
Alan Cox	d26a90a8b2	Eliminate an unneeded declaration. (I should have removed this as part of r227568.)	2012-07-30 20:38:37 +00:00
Konstantin Belousov	311e34e260	Do not requeue held page or page for which locking failed, just leave them alone. Process the act_count updates for the held pages in the vm_pageout loop over the inactive queue, instead of refusing to do anything with such page. Clarify the intent of the addl_page_shortage counter and change its use for pages which are not processed in the loop according to the description. Reviewed by: alc MFC after: 2 weeks	2012-07-26 09:06:48 +00:00
Alan Cox	2cba1ccd0e	Addendum to r238604. If the inactive queue scan isn't restarted, then the variable "addl_page_shortage_init" isn't needed. X-MFC after: r238604	2012-07-24 02:35:30 +00:00
Konstantin Belousov	d4961bcb3a	Do not restart scan of the inactive queue when non-inactive page is found. Rather, we shall not find such pages on inactive queue at all. Requested and reviewed by: alc MFC after: 2 weeks	2012-07-18 21:47:50 +00:00
Alan Cox	85eeca35b9	Move what remains of vm/vm_contig.c into vm/vm_pageout.c, where similar code resides. Rename vm_contig_grow_cache() to vm_pageout_grow_cache(). Reviewed by: kib	2012-07-18 05:21:34 +00:00
Alan Cox	da1ab8a4a0	Correct vm_page_alloc_contig()'s implementation of VM_ALLOC_NODUMP.	2012-07-17 02:36:59 +00:00
Alan Cox	907e4524dc	Various improvements to vm_contig_grow_cache(). Most notably, even when it can't sleep, it can still move clean pages from the inactive queue to the cache. Also, when a page is cached, there is no need to restart the scan. The "next" page pointer held by vm_contig_launder() is still valid. Finally, add a comment summarizing what vm_contig_grow_cache() does based upon the value of "tries". MFC after: 3 weeks	2012-07-16 18:13:43 +00:00
Alan Cox	476f7f2423	Correct an off-by-one error in vm_reserv_alloc_contig() that resulted in the last reservation of a multi-reservation allocation not being initialized.	2012-07-15 21:46:19 +00:00
Matthew D Fleming	f806cdcf99	Fix a bug with memguard(9) on 32-bit architectures without a VM_KMEM_MAX_SIZE. The code was not taking into account the size of the kernel_map, which the kmem_map is allocated from, so it could produce a sub-map size too large to fit. The simplest solution is to ignore VM_KMEM_MAX entirely and base the memguard map's size off the kernel_map's size, since this is always relevant and always smaller. Found by: Justin Hibbits	2012-07-15 20:29:48 +00:00
Alan Cox	9757857c4f	If vm_contig_grow_cache() is allowed to sleep, then invoke the vm_lowmem handlers.	2012-07-14 20:14:03 +00:00
Alan Cox	0ff0fc84c2	Move kmem_alloc_{attr,contig}() to vm/vm_kern.c, where similarly named functions reside. Correct the comment describing kmem_alloc_contig().	2012-07-14 18:10:44 +00:00
Attilio Rao	571a1e92aa	Document the object type movements, related to swap_pager_copy(), in vm_object_collapse() and vm_object_split(). In collaboration with: alc MFC after: 3 days	2012-07-11 01:04:59 +00:00
Konstantin Belousov	a4156419ca	Avoid vm page queues lock leak after r238212. Reported and tested by: Michael Butler <imb protected-networks net> Reviewed by: alc Pointy hat to: kib MFC after: 20 days	2012-07-08 18:04:26 +00:00
Konstantin Belousov	48cc2fc774	Drop page queues mutex on each iteration of vm_pageout_scan over the inactive queue, unless busy page is found. Dropping the mutex often should allow the other lock acquires to proceed without waiting for whole inactive scan to finish. On machines with a lot of physical memory, the scan often needs to iterate a lot before it finishes or finds a page which requires laundering, causing high latency for other lock waiters. Suggested and reviewed by: alc MFC after: 3 weeks	2012-07-07 19:39:08 +00:00
Eitan Adler	c288b54837	Add missing sleep stat increase PR: kern/168211 Submitted by: linimon Reviewed by: alc Approved by: cperciva MFC after: 3 days	2012-07-07 17:46:11 +00:00
Konstantin Belousov	5d10ef2096	Style. Reviewed by: alc (previous version) MFC after: 1 week	2012-07-06 20:13:16 +00:00
John Baldwin	687c94aac9	Honor db_pager_quit in 'show uma' and 'show malloc'. MFC after: 1 month	2012-07-02 16:14:52 +00:00
Alan Cox	e30df26e7b	Add new pmap layer locks to the predefined lock order. Change the names of a few existing VM locks to follow a consistent naming scheme.	2012-06-27 03:45:25 +00:00
Attilio Rao	ffdd0c7db3	- Add a comment explaining the locking of the cached pages pool held by vm_objects. - Add flags for the per-object lock and free pages queue mutex lock. Use the newly added flags to mark the cache root within the vm_object structure. Please note that other vm_object members should be marked with correct locking but they are left for other commits. In collaboration with: alc MFC after: 3 days	2012-06-22 18:34:11 +00:00
Alan Cox	eddc92918e	Selectively inline vm_page_dirty().	2012-06-20 23:25:47 +00:00
John Baldwin	6fbe60fa8b	Move the per-thread deferred user map entries list into a private list in vm_map_process_deferred() which is then iterated to release map entries. This avoids having a nested vm map unlock operation called from the loop body attempt to recurse into vm_map_process_deferred(). This can happen if the vm_map_remove() triggers the OOM killer. Reviewed by: alc, kib MFC after: 1 week	2012-06-20 18:00:26 +00:00
Attilio Rao	db9ba57895	Do a more targeted check on the page cache and avoid to check the cache pointer directly in vnode_pager_setsize() by using newly introduced vm_page_is_cached() function. Reviewed by: alc MFC after: 2 weeks X-MFC: r234039,234064	2012-06-16 21:39:00 +00:00
Alan Cox	6031c68de4	The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap layer, but it is read directly by the MI VM layer. This change introduces pmap_page_is_write_mapped() in order to completely encapsulate all direct access to PGA_WRITEABLE in the pmap layer. Aesthetics aside, I am making this change because amd64 will likely begin using an alternative method to track write mappings, and having pmap_page_is_write_mapped() in place allows me to make such a change without further modification to the MI VM layer. As an added bonus, tidy up some nearby comments concerning page flags. Reviewed by: kib MFC after: 6 weeks	2012-06-16 18:56:19 +00:00
Konstantin Belousov	83ce08538a	Use the previous stack entry protection and max protection to correctly propagate the stack execution permissions when stack is grown down. First, curproc->p_sysent->sv_stackprot specifies maximum allowed stack protection for current ABI, so the new stack entry was typically marked executable always. Second, for non-main stack MAP_STACK mapping, the PROT_ flags should be used which were specified at the mmap(2) call time, and not sv_stackprot. MFC after: 1 week	2012-06-10 11:31:50 +00:00
Eitan Adler	0a4a2b8e62	Revert r236380 PR: kern/166780 Requested by: many Approved by: cperciva (implicit)	2012-06-01 18:58:50 +00:00
Eitan Adler	71ee98c97c	Add sysctl to query amount of swap space free PR: kern/166780 Submitted by: Radim Kolar <hsn@sendmail.cz> Approved by: cperciva MFC after: 1 week	2012-06-01 04:42:52 +00:00
Maksim Yevmenkin	251386b4b2	Tweak condition for disabling allocation from per-CPU buckets in low memory situation. I've observed a situation where per-CPU allocations were disabled while there were enough free cached pages. Basically, cnt.v_free_count was sitting stable at a value lower than cnt.v_free_min and that caused massive performance drop. Reviewed by: alc MFC after: 1 week	2012-05-23 18:56:29 +00:00
Konstantin Belousov	4d34e019c4	Calculate the count of per-process cow faults. Export the count to userspace using the obscure spare int field in struct kinfo_proc. Submitted by: Andrey Zonov <andrey zonov org> MFC after: 1 week	2012-05-23 18:10:54 +00:00
Andriy Gapon	b6062382be	vm_pager_object_lookup: small performance optimization do not needlessly lock an object if its handle doesn't match Reviewed by: kib, alc MFC after: 1 week	2012-05-23 12:51:49 +00:00
Andrew Turner	c415e17250	Fix booting on ARM. In PHYS_TO_VM_PAGE() when VM_PHYSSEG_DENSE is set the check if we are past the end of vm_page_array was incorrect causing it to return NULL. This value is then used in vm_phys_add_page causing a data abort. Reviewed by: alc, kib, imp Tested by: stas	2012-05-22 07:04:23 +00:00
Nathan Whitehorn	ccc4a5c761	Replace the list of PVOs owned by each PMAP with an RB tree. This simplifies range operations like pmap_remove() and pmap_protect() as well as allowing simple operations like pmap_extract() not to involve any global state. This substantially reduces lock coverages for the global table lock and improves concurrency.	2012-05-20 14:33:28 +00:00
Konstantin Belousov	df2f557df6	Do not double-reference the found vm object in cdev_pager_lookup(). vm_pager_object_lookup() already referenced the object. Note that there is no in-tree consumers of cdev_pager_lookup(). The only known user of the function is i915 gem driver, which is not yet imported. This should make the KPI change minor. Submitted by: avg MFC after: 1 week	2012-05-18 10:23:47 +00:00
Konstantin Belousov	b7ac5a8571	Add new pager type, OBJT_MGTDEVICE. It provides the device pager which carries fictitious managed pages. In particular, the consumers of the new object type can remove all mappings of the device page with pmap_remove_all(). The range of physical addresses used for fake page allocation shall be registered with vm_phys_fictitious_reg_range() interface to allow the PHYS_TO_VM_PAGE() to work in pmap. Most likely, only i386 and amd64 pmaps can handle fictitious managed pages right now. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:49:58 +00:00
Konstantin Belousov	b6de32bd9b	Add a facility to register a range of physical addresses to be used for allocation of fictitious pages, for which PHYS_TO_VM_PAGE() returns proper fictitious vm_page_t. The range should be de-registered after consumer stopped using it. De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate over registered ranges. A hash container might be developed instead of range registration interface, and fake pages could be put automatically into the hash, where PHYS_TO_VM_PAGE() could look them up later. This should be considered before the MFC of the commit is done. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:42:56 +00:00
Konstantin Belousov	e461aae747	Split the code from vm_page_getfake() to initialize the fake page struct vm_page into new interface vm_page_initfake(). Handle the case of fake page re-initialization with changed memattr. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:34:22 +00:00
Konstantin Belousov	116c213502	Assert that the page passed to vm_page_putfake() is unmanaged. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:27:51 +00:00
Konstantin Belousov	7900f95d88	Assert that fictitious or unmanaged pages do not appear on active/inactive lists. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:24:46 +00:00
Konstantin Belousov	13a0b7bcc4	Commit the change forgotten in r235356. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:10:18 +00:00
Konstantin Belousov	0c26bb71f6	Make the vm_page_array_size long. Remove redundant zero initialization for vm_page_array_size and nearby variables. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2012-05-12 20:03:06 +00:00
Alan Cox	13458803f4	Give vm_fault()'s sequential access optimization a makeover. There are two aspects to the sequential access optimization: (1) read ahead of pages that are expected to be accessed in the near future and (2) unmap and cache behind of pages that are not expected to be accessed again. This revision changes both aspects. The read ahead optimization is now more effective. It starts with the same initial read window as before, but arithmetically grows the window on sequential page faults. This can yield increased read bandwidth. For example, on one of my machines, a program using mmap() to read a file that is several times larger than the machine's physical memory takes about 17% less time to complete. The unmap and cache behind optimization is now more selectively applied. The read ahead window must grow to its maximum size before unmap and cache behind is performed. This significantly reduces the number of times that pages are unmapped and cached only to be reactivated a short time later. The unmap and cache behind optimization now clears each page's referenced flag. Previously, in the case of dirty pages, if the containing file was still mapped at the time that the page daemon examined the dirty pages, they would be reactivated. From a stylistic standpoint, this revision also cleanly separates the implementation of the read ahead and unmap/cache behind optimizations. Glanced at: kib MFC after: 2 weeks	2012-05-10 15:16:42 +00:00
Nathan Whitehorn	0b852c03eb	Avoid a lock order reversal in pmap_extract_and_hold() from relocking the page. This PMAP requires an additional lock besides the PMAP lock in pmap_extract_and_hold(), which vm_page_pa_tryrelock() did not release. Suggested by: kib MFC after: 4 days	2012-04-22 17:58:30 +00:00
Konstantin Belousov	1472f4f4b9	When MAP_STACK mapping is created, the map entry is created only to cover the initial stack size. For MCL_WIREFUTURE maps, the subsequent call to vm_map_wire() to wire the whole stack region fails due to VM_MAP_WIRE_NOHOLES flag. Use the VM_MAP_WIRE_HOLESOK to only wire mapped part of the stack. Reported and tested by: Sushanth Rai <sushanth_rai yahoo com> Reviewed by: alc MFC after: 1 week	2012-04-21 18:36:53 +00:00
Alan Cox	2aa163dc57	As documented in vm_page.h, updates to the vm_page's flags no longer require the page queues lock. MFC after: 1 week	2012-04-21 18:26:16 +00:00
Attilio Rao	a0f2c37b6f	- Introduce a cache-miss optimization for consistency with other accesses of the cache member of vm_object objects. - Use novel vm_page_is_cached() for checks outside of the vm subsystem. Reviewed by: alc MFC after: 2 weeks X-MFC: r234039	2012-04-09 17:05:18 +00:00
Alan Cox	1c8279e4e7	Fix mincore(2) so that it reports PG_CACHED pages as resident. MFC after: 2 weeks	2012-04-08 18:25:12 +00:00
Alan Cox	908e3da10e	If a page belonging to a reservation is cached, then mark the reservation so that it will be freed to the cache pool rather than the default pool. Otherwise, the cached pages within the reservation may be recycled sooner than necessary. Reported by: Andrey Zonov	2012-04-08 17:00:46 +00:00
Attilio Rao	d1aa86e151	Staticize vm_page_cache_remove(). Reviewed by: alc	2012-04-06 20:34:00 +00:00
Nathan Whitehorn	57bd5cce62	Reduce the frequency that the PowerPC/AIM pmaps invalidate instruction caches, by invalidating kernel icaches only when needed and not flushing user caches for shared pages. Suggested by: kib MFC after: 2 weeks	2012-04-06 16:03:38 +00:00
John Baldwin	35818d2e94	Add new ktrace records for the start and end of VM faults. This gives a pair of records similar to syscall entry and return that a user can use to determine how long page faults take. The new ktrace records are enabled via the 'p' trace type, and are enabled in the default set of trace points. Reviewed by: kib MFC after: 2 weeks	2012-04-05 17:13:14 +00:00
Kirk McKusick	1faacf5d09	Keep track of the mount point associated with a special device to enable the collection of counts of synchronous and asynchronous reads and writes for its associated filesystem. The counts are displayed using `mount -v'. Ensure that buffers used for paging indicate the vnode from which they are operating so that counts of paging I/O operations from the filesystem are collected. This checkin only adds the setting of the mount point for the UFS/FFS filesystem, but it would be trivial to add the setting and clearing of the mount point at filesystem mount/unmount time for other filesystems too. Reviewed by: kib	2012-03-28 20:49:11 +00:00
Alan Cox	5730afc9b6	Handle spurious page faults that may occur in no-fault sections of the kernel. When access restrictions are added to a page table entry, we flush the corresponding virtual address mapping from the TLB. In contrast, when access restrictions are removed from a page table entry, we do not flush the virtual address mapping from the TLB. This is exactly as recommended in AMD's documentation. In effect, when access restrictions are removed from a page table entry, AMD's MMUs will transparently refresh a stale TLB entry. In short, this saves us from having to perform potentially costly TLB flushes. In contrast, Intel's MMUs are allowed to generate a spurious page fault based upon the stale TLB entry. Usually, such spurious page faults are handled by vm_fault() without incident. However, when we are executing no-fault sections of the kernel, we are not allowed to execute vm_fault(). This change introduces special-case handling for spurious page faults that occur in no-fault sections of the kernel. In collaboration with: kib Tested by: gibbs (an earlier version) I would also like to acknowledge Hiroki Sato's assistance in diagnosing this problem. MFC after: 1 week	2012-03-22 04:52:51 +00:00
John Baldwin	d6e9b97b3f	Bah, just revert my earlier change entirely. (Missed alc's request to do this earlier.) Requested by: alc	2012-03-19 19:06:40 +00:00
John Baldwin	92a5994685	Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger than 4GB. Specifically, the inlined version of 'ptoa' of the 'int' count of pages overflowed on 64-bit platforms. While here, change vm_object_madvise() to accept two vm_pindex_t parameters (start and end) rather than a (start, count) tuple to match other VM APIs as suggested by alc@.	2012-03-19 18:47:34 +00:00
John Baldwin	8407f69657	Alter the previous commit to use vm_size_t instead of vm_pindex_t. vm_pindex_t is not a count of pages per se, it is more like vm_ooffset_t, but a page index instead of a byte offset.	2012-03-19 18:43:44 +00:00
Konstantin Belousov	126d60823a	In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag if the filesystem performed short write and we are skipping the page due to this. Propagate write error from the pager back to the callers of vm_pageout_flush(). Report the failure to write a page from the requested range as the FALSE return value from vm_object_page_clean(), and propagate it back to msync(2) to return EIO to usermode. While there, convert the clearobjflags variable in the vm_object_page_clean() and arguments of the helper functions to boolean. PR: kern/165927 Reviewed by: alc MFC after: 2 weeks	2012-03-17 23:00:32 +00:00
John Baldwin	df96bc9713	Pedantic nit: use vm_pindex_t instead of long for a count of pages.	2012-03-14 20:57:48 +00:00
John Baldwin	b47f624183	Add KTR_VFS traces to track modifications to a vnode's writecount.	2012-03-08 20:27:20 +00:00
Alan Cox	83cbe16ff4	Eliminate stale incorrect ARGSUSED comments. Submitted by: bde	2012-03-02 17:33:51 +00:00
Alan Cox	9ed54e79b5	Simplify kmem_alloc() by eliminating code that existed on account of external pagers in Mach. FreeBSD doesn't implement external pagers. Moreover, it doesn't page out the kernel object. So, the reasons for having the code don't hold. Reviewed by: kib MFC after: 6 weeks	2012-02-29 05:41:29 +00:00
Alan Cox	f9230ad6b8	Simplify vm_mmap()'s control flow. Add a comment describing what vm_mmap_to_errno() does. Reviewed by: kib MFC after: 3 weeks X-MFC after: r232071	2012-02-25 21:06:39 +00:00
Alan Cox	79e538388f	Simplify vmspace_fork()'s control flow by copying immutable data before the vm map locks are acquired. Also, eliminate redundant initialization of the new vm map's timestamp. Reviewed by: kib MFC after: 3 weeks	2012-02-25 17:49:59 +00:00
Konstantin Belousov	9d22083da8	Place the if() at the right location, to activate the v_writecount accounting for shared writeable mappings for all filesystems, not only for the bypass layers. Submitted by: alc Pointy hat to: kib MFC after: 20 days	2012-02-24 10:41:58 +00:00
Konstantin Belousov	84110e7e0b	Account the writeable shared mappings backed by file in the vnode v_writecount. Keep the amount of the virtual address space used by the mappings in the new vm_object un_pager.vnp.writemappings counter. The vnode v_writecount is incremented when writemappings gets non-zero value, and decremented when writemappings is returned to zero. Writeable shared vnode-backed mappings are accounted for in vm_mmap(), and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on the created map entry. During deferred map entry deallocation, vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECNT and decrements writemappings for the vm object. Now, the writeable mount cannot be demoted to read-only while writeable shared mappings of the vnodes from the mount point exist. Also, execve(2) fails for such files with ETXTBSY, as it should be. Noted by: tegge Reviewed by: tegge (long time ago, early version), alc Tested by: pho MFC after: 3 weeks	2012-02-23 21:07:16 +00:00
Konstantin Belousov	501f538675	Remove wrong comment. Discussed with: alc MFC after: 3 days	2012-02-22 20:01:38 +00:00
Alan Cox	a649296959	When vm_mmap() is used to map a vm object into a kernel vm_map, it makes no sense to check the size of the kernel vm_map against the user-level resource limits for the calling process. Reviewed by: kib	2012-02-16 06:45:51 +00:00
Konstantin Belousov	8211bd45bc	Close a race due to dropping of the map lock between creating map entry for a shared mapping and marking the entry for inheritance. Other thread might execute vmspace_fork() in between (e.g. by fork(2)), resulting in the mapping becoming private. Noted and reviewed by: alc MFC after: 1 week	2012-02-11 17:29:07 +00:00
Ed Schouten	7870adb640	Remove direct access to si_name. Code should just use the devtoname() function to obtain the name of a character device. Also add const keywords to pieces of code that need it to build properly. MFC after: 2 weeks	2012-02-10 12:35:57 +00:00
Alexander Motin	8f12d83ad9	Fix NULL dereference panic on attempt to turn off (on system shutdown) disconnected swap device. This is quick and imperfect solution, as swap device will still be opened and GEOM will not be able to destroy it. Proper solution would be to automatically turn off and close disconnected swap device, but with existing code it will cause panic if there is at least one page on device, even if it is unimportant page of the user-level process. It needs some work. Reviewed by: kib@ MFC after: 1 week	2012-02-01 20:12:44 +00:00
Kip Macy	263811f724	exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64 excluding other allocations including UMA now entails the addition of a single flag to kmem_alloc or uma zone create Reviewed by: alc, avg MFC after: 2 weeks	2012-01-27 20:18:31 +00:00
Nathan Whitehorn	8d01a3b281	Revert r212360 now that PowerPC can handle large sparse arguments to pmap_remove() (changed in r228412). MFC after: 2 weeks	2012-01-17 00:31:09 +00:00
Konstantin Belousov	5dda2db9c8	Change the type of the paging_in_progress refcounter from u_short to u_int. With the auto-sized buffer cache on the modern machines, UFS metadata can generate more than 65535 pages belonging to the buffers undergoing i/o, overflowing the counter. Reported and tested by: jimharris Reviewed by: alc MFC after: 1 week	2012-01-10 18:05:44 +00:00
Konstantin Belousov	e65919f9fc	Do not restart the scan in vm_object_page_clean() on the object generation change if requested mode is async. The object generation is only changed when the object is marked as OBJ_MIGHTBEDIRTY. For async mode it is enough to write each dirty page, not to make a guarantee that all pages are cleared after the vm_object_page_clean() returned. Diagnosed by: truckman Tested by: flo Reviewed by: alc, truckman MFC after: 2 weeks	2012-01-04 16:04:20 +00:00
Alan Cox	b5f359b7c3	Optimize vm_object_split()'s handling of reservations.	2011-12-28 20:27:18 +00:00
Konstantin Belousov	75ff604a78	Optimize the common case of msyncing the whole file mapping with MS_SYNC flag. The system must guarantee that all writes are finished before syscalls returned. Schedule the writes in async mode, which is much faster and allows the clustering to occur. Wait for writes using VOP_FSYNC(), since we are syncing the whole file mapping. Potentially, the restriction to only apply the optimization can be relaxed by not requiring that the mapping cover whole file, as it is done by other OSes. Reported and tested by: az Reviewed by: alc MFC after: 2 weeks	2011-12-23 09:09:42 +00:00
Konstantin Belousov	e878d99718	Move kstack_cache_entry into the private header, and make the stack cache list header accessible outside vm_glue.c. MFC after: 1 week	2011-12-16 10:56:16 +00:00
Eitan Adler	33fd7c5628	- The previous commit (r228449) accidentally moved the vm.stats.vm.* sysctls to vm.stats.sys. Move them back. Noticed by: pho Reviewed by: bde (earlier version) Approved by: bz MFC after: 1 week Pointy hat to: me	2011-12-14 13:25:00 +00:00
Eitan Adler	3eb9ab5255	Document a large number of currently undocumented sysctls. While here fix some style(9) issues and reduce redundancy. PR: kern/155491 PR: kern/155490 PR: kern/155489 Submitted by: Galimov Albert <wtfcrap@mail.ru> Approved by: bde Reviewed by: jhb MFC after: 1 week	2011-12-13 00:38:50 +00:00
Konstantin Belousov	134465d732	Fix printf. Submitted by: az MFC after: 1 week	2011-12-12 10:04:04 +00:00
Alan Cox	c68c35372e	Introduce vm_reserv_alloc_contig() and teach vm_page_alloc_contig() how to use superpage reservations. So, for the first time, kernel virtual memory that is allocated by contigmalloc(), kmem_alloc_attr(), and kmem_alloc_contig() can be promoted to superpages. In fact, even a series of small contigmalloc() allocations may collectively result in a promoted superpage. Eliminate some duplication of code in vm_reserv_alloc_page(). Change the type of vm_reserv_reclaim_contig()'s first parameter in order that it be consistent with other vm_*_contig() functions. Tested by: marius (sparc64)	2011-12-05 18:29:25 +00:00
Konstantin Belousov	dc874f9881	Rename vm_page_set_valid() to vm_page_set_valid_range(). The vm_page_set_valid() is the most reasonable name for the m->valid accessor. Reviewed by: attilio, alc	2011-11-30 17:39:00 +00:00
Konstantin Belousov	cf1911a9ad	Hide the internals of vm_page_lock(9) from the loadable modules. Since the address of vm_page lock mutex depends on the kernel options, it is easy for module to get out of sync with the kernel. No vm_page_lockptr() accessor is provided for modules. It can be added later if needed, unless proper KPI is developed to serve the needs. Reviewed by: attilio, alc MFC after: 3 weeks	2011-11-29 13:07:32 +00:00
Attilio Rao	9fde98bba3	Introduce the same mutex-wise fix in r227758 for sx locks. The functions that offer file and line specifications are: - sx_assert_ - sx_downgrade_ - sx_slock_ - sx_slock_sig_ - sx_sunlock_ - sx_try_slock_ - sx_try_xlock_ - sx_try_upgrade_ - sx_unlock_ - sx_xlock_ - sx_xlock_sig_ - sx_xunlock_ Now vm_map locking is fully converted and can avoid to know specifics about locking procedures. Reviewed by: kib MFC after: 1 month	2011-11-21 12:59:52 +00:00
Attilio Rao	ccdf233323	Introduce macro stubs in the mutex implementation that will be always defined and will allow consumers, willing to provide options, file and line to locking requests, to not worry about options redefining the interfaces. This is typically useful when there is the need to build another locking interface on top of the mutex one. The introduced functions that consumers can use are: - mtx_lock_flags_ - mtx_unlock_flags_ - mtx_lock_spin_flags_ - mtx_unlock_spin_flags_ - mtx_assert_ - thread_lock_flags_ Spare notes: - Likely we can get rid of all the 'INVARIANTS' specification in the ppbus code by using the same macro as done in this patch (but this is left to the ppbus maintainer) - all the other locking interfaces may require a similar cleanup, where the most notable case is sx which will allow a further cleanup of vm_map locking facilities - The patch should be fully compatible with older branches, thus a MFC is previewed (in fact it uses all the underlying mechanisms already present). Comments review by: eadler, Ben Kaduk Discussed with: kib, jhb MFC after: 1 month	2011-11-20 16:33:09 +00:00
Alan Cox	5ff276b7f4	Eliminate end-of-line white space.	2011-11-17 06:54:49 +00:00
Alan Cox	fbd80bd047	Refactor the code that performs physically contiguous memory allocation, yielding a new public interface, vm_page_alloc_contig(). This new function addresses some of the limitations of the current interfaces, contigmalloc() and kmem_alloc_contig(). For example, the physically contiguous memory that is allocated with those interfaces can only be allocated to the kernel vm object and must be mapped into the kernel virtual address space. It also provides functionality that vm_phys_alloc_contig() doesn't, such as wiring the returned pages. Moreover, unlike that function, it respects the low water marks on the paging queues and wakes up the page daemon when necessary. That said, at present, this new function can't be applied to all types of vm objects. However, that restriction will be eliminated in the coming weeks. From a design standpoint, this change also addresses an inconsistency between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions. Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew about vnodes and reservations. Now, vm_page_alloc_contig() is responsible for these things. Reviewed by: kib Discussed with: jhb	2011-11-16 16:46:09 +00:00
Konstantin Belousov	286790a7dd	Update the device pager interface, while keeping the compatibility layer for old KPI and KBI. New interface should be used together with d_mmap_single cdevsw method. Device pager can be allocated with the cdev_pager_allocate(9) function, which takes struct cdev_pager_ops, containing constructor/destructor and page fault handler methods supplied by driver. Constructor and destructor, called at the pager allocation and deallocation time, allow the driver to handle per-object private data. The pager handler is called to handle page fault on the vm map entry backed by the driver pager. Driver shall return either the vm_page_t which should be mapped, or error code (which does not cause kernel panic anymore). The page handler interface has a placeholder to specify the access mode causing the fault, but currently PROT_READ is always passed there. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month	2011-11-15 14:40:00 +00:00
Konstantin Belousov	bf277cf450	Remove the condition that is always true. Submitted by: alc MFC after: 1 week	2011-11-15 14:09:53 +00:00
Ed Schouten	6472ac3d8a	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.	2011-11-07 15:43:11 +00:00
Alan Cox	c835bd16a8	Wake up the page daemon in vm_page_alloc_freelist() if it couldn't allocate the requested page because too few pages are cached or free. Document the VM_ALLOC_COUNT() option to vm_page_alloc() and vm_page_alloc_freelist(). Make style changes to vm_page_alloc() and vm_page_alloc_freelist(), such as using a variable name that more closely corresponds to the comments.	2011-11-06 02:03:27 +00:00
Konstantin Belousov	7845becfe8	Remove redundant definitions. The chunk was missed from r227102. MFC after: 2 weeks	2011-11-05 09:03:18 +00:00
Konstantin Belousov	561cc9fcb5	Provide typedefs for the type of bit mask for the page bits. Use the defined types instead of int when manipulating masks. Supposedly, it could fix support for 32KB page size in the machine-independent VM layer. Reviewed by: alc MFC after: 2 weeks	2011-11-05 08:20:32 +00:00
Alan Cox	2614c5c47c	Simplify the implementation of the failure case in kmem_alloc_attr().	2011-11-04 04:41:58 +00:00
John Baldwin	936c09ac0f	Add the posix_fadvise(2) system call. It is somewhat similar to madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files. Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provide hints about data access patterns and are used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor. The second type of hints are used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible. To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files. Reviewed by: jilles, kib, arch@ Approved by: re (kib) MFC after: 1 month	2011-11-04 04:02:50 +00:00
Alan Cox	8393768074	Add support for VM_ALLOC_WIRED and VM_ALLOC_ZERO to vm_page_alloc_freelist() and use these new options in the mips pmap. Wake up the page daemon in vm_page_alloc_freelist() if the number of free and cached pages becomes too low. Tidy up vm_page_alloc_init(). In particular, add a comment about an important restriction on its use. Tested by: jchandra@	2011-11-02 05:42:51 +00:00
Alan Cox	5c1f2cc4c2	Eliminate vm_phys_bootstrap_alloc(). It was a failed attempt at eliminating duplicated code in the various pmap implementations. Micro-optimize vm_phys_free_pages(). Introduce vm_phys_free_contig(). It is a fast routine for freeing an arbitrary number of physically contiguous pages. In particular, it doesn't require the number of pages to be a power of two. Use "u_long" instead of "unsigned long". Bruce Evans (bde@) has convinced me that the "boundary" parameters to kmem_alloc_contig(), vm_phys_alloc_contig(), and vm_reserv_reclaim_contig() should be of type "vm_paddr_t" and not "u_long". Make this change.	2011-10-30 05:06:14 +00:00
Alan Cox	1933a67cf4	Use "u_long" instead of "unsigned long".	2011-10-28 22:36:15 +00:00
Alan Cox	125b695b6e	Tidy up the comment at the head of vm_page_alloc, and mention that the returned page has the flag VPO_BUSY set.	2011-10-27 17:29:19 +00:00
Alan Cox	703dec68bf	Eliminate vestiges of page coloring in VM_ALLOC_NOOBJ calls to vm_page_alloc(). While I'm here, for the sake of consistency, always specify the allocation class, such as VM_ALLOC_NORMAL, as the first of the flags.	2011-10-27 16:39:17 +00:00
Alan Cox	f346986b76	contigmalloc(9) and contigfree(9) are now implemented in terms of other more general VM system interfaces. So, their implementation can now reside in kern_malloc.c alongside the other functions that are declared in malloc.h.	2011-10-27 02:52:24 +00:00
Alan Cox	9c60ca3238	Speed up vm_page_cache() and vm_page_remove() by checking for a few common cases that can be handled in constant time. The insight being that a page's parent in the vm object's tree is very often its predecessor or successor in the vm object's ordered memq. Tested by: jhb MFC after: 10 days	2011-10-25 16:35:08 +00:00
Attilio Rao	2d5106600e	VM_NRESERVLEVEL is used in this file but opt_vm.h is not included thus the stub switch won't be correctly handled. Include opt_vm.h. Submitted by: jeff MFC after: 3 days	2011-10-22 22:00:35 +00:00
Konstantin Belousov	126b36a21e	Control the execution permission of the readable segments for i386 binaries on the amd64 and ia64 with the sysctl, instead of unconditionally enabling it. Reviewed by: marcel	2011-10-15 12:35:18 +00:00
John Baldwin	9860134635	Fix a typo in a comment.	2011-10-14 11:48:32 +00:00
Marcel Moolenaar	5f81660285	In sys_obreak() and when compiling for amd64 or ia64, when the process is ILP32 (i.e. i386) grant execute permissions by default. The JDK 1.4.x depends on being able to execute from the heap on i386.	2011-10-13 16:20:10 +00:00
Gleb Smirnoff	8d689e042f	Make memguard(9) capable to guard uma(9) allocations.	2011-10-12 18:08:28 +00:00
Konstantin Belousov	17514c1bd9	Style nit. Submitted by: jhb MFC after: 2 weeks	2011-09-29 00:44:34 +00:00
Konstantin Belousov	2042bb377a	Fix grammar. Submitted by: bf MFC after: 2 weeks	2011-09-28 16:12:15 +00:00
Konstantin Belousov	abb9b935ca	Use the trick of performing the atomic operation on the contained aligned word to handle the dirty mask updates in vm_page_clear_dirty_mask(). Remove the vm page queue lock around vm_page_dirty() call in vm_fault_hold() the sole purpose of which was to protect dirty on architectures which does not provide short or byte-wide atomics. Reviewed by: alc, attilio Tested by: flo (sparc64) MFC after: 2 weeks	2011-09-28 14:57:50 +00:00
Konstantin Belousov	005f609130	Use the explicitly-sized types for the dirty and valid masks. Requested by: attilio Reviewed by: alc MFC after: 2 weeks	2011-09-28 14:51:28 +00:00
Kip Macy	8451d0dd78	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)	2011-09-16 13:58:51 +00:00
Konstantin Belousov	3407fefef6	Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic flags field. Updates to the atomic flags are performed using the atomic ops on the containing word, do not require any vm lock to be held, and are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9) functions are provided to modify aflags. Document the changes to flags field to only require the page lock. Introduce vm_page_reference(9) function to provide a stable KPI and KBI for filesystems like tmpfs and zfs which need to mark a page as referenced. Reviewed by: alc, attilio Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64) Approved by: re (bz)	2011-09-06 10:30:11 +00:00
Konstantin Belousov	15523cf799	Update some comments in swap_pager.c. Reviewed and most wording by: alc MFC after: 1 week Approved by: re (bz)	2011-08-22 20:44:18 +00:00
Konstantin Belousov	6e903bd0d6	Apply the limit to avoid the overflows in the radix tree subr_blist.c after the conversion of the swap device size to the page size units, not before. That lifts the limit on the usable swap partition size from 32GB to 256GB, that is less depressing for the modern systems. Submitted by: Alexander V. Chernikov <melifaro ipfw ru> Reviewed by: alc Approved by: re (bz) MFC after: 2 weeks	2011-08-22 11:18:47 +00:00
Robert Watson	a9d2f8d84f	Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc	2011-08-11 12:30:23 +00:00
Konstantin Belousov	d98d0ce27a	- Move the PG_UNMANAGED flag from m->flags to m->oflags, renaming the flag to VPO_UNMANAGED (and also making the flag protected by the vm object lock, instead of vm page queue lock). - Mark the fake pages with both PG_FICTITIOUS (as it is now) and VPO_UNMANAGED. As a consequence, pmap code now can use just VPO_UNMANAGED to decide whether the page is unmanaged. Reviewed by: alc Tested by: pho (x86, previous version), marius (sparc64), marcel (arm, ia64, powerpc), ray (mips) Sponsored by: The FreeBSD Foundation Approved by: re (bz)	2011-08-09 21:01:36 +00:00
Alan Cox	12f4b65fa6	Fix an error in kmem_alloc_attr(). Unless "tries" is updated, kmem_alloc_attr() could get stuck in a loop. Approved by: re (kib) MFC after: 3 days	2011-08-07 00:11:39 +00:00
Konstantin Belousov	dda4f96087	Implement the linprocfs swaps file, providing information about the configured swap devices in the Linux-compatible format. Based on the submission by: Robert Millan <rmh debian org> PR: kern/159281 Reviewed by: bde Approved by: re (kensmith) MFC after: 2 weeks	2011-08-01 19:12:15 +00:00
Konstantin Belousov	339772b003	Fix a race in the device pager allocation. If another thread won and allocated the device pager for the given handle, then the object fictitious pages list and the object membership in the global object list still need to be initialized. Otherwise, dev_pager_dealloc() will traverse uninitialized pointers. Reported and tested by: pho Reviewed by: jhb Approved by: re (kensmith) MFC after: 1 week	2011-07-30 14:13:57 +00:00
Konstantin Belousov	2e32165ce0	Extract the code to translate VM error into errno, into an exported function vm_mmap_to_errno(). It is useful for the drivers that implement mmap(2)-like functionality, to be able to return error codes consistent with mmap(2). Sponsored by: The FreeBSD Foundation No objections from: alc MFC after: 1 week	2011-07-10 20:49:13 +00:00
Konstantin Belousov	3103730c82	Style. MFC after: 3 days	2011-07-10 20:45:13 +00:00
Konstantin Belousov	2801687d56	Add a facility to disable processing page faults. When activated, uiomove generates EFAULT if any accessed address is not mapped, as opposed to handling the fault. Sponsored by: The FreeBSD Foundation Reviewed by: alc (previous version)	2011-07-09 15:21:10 +00:00
Edward Tomasz Napierala	afcc55f318	All the racct_*() calls need to happen with the proc locked. Fixing this won't happen before 9.0. This commit adds "#ifdef RACCT" around all the "PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order to avoid useless locking/unlocking in kernels built without "options RACCT".	2011-07-06 20:06:44 +00:00
Attilio Rao	91a1929f07	Handle a race between device_pager and devsw in a more graceful manner: return an error code rather than panic the kernel. Sponsored by: Sandvine Incorporated Reviewed by: kib Tested by: pho MFC after: 2 weeks	2011-07-06 15:09:52 +00:00
Alan Cox	a8229fa37c	Initialize marker pages as held rather than fictitious/wired. Marking the page as held is more useful as a safety precaution in case someone forgets to check for PG_MARKER. Reviewed by: kib	2011-07-02 23:34:47 +00:00
Alan Cox	6bbee8e28a	Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this option to vm_object_page_remove() asserts that the specified range of pages is not mapped, or more precisely that none of these pages have any managed mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on the pages. This change not only saves time by eliminating pointless calls to pmap_remove_all(), but it also eliminates an inconsistency in the use of pmap_remove_all() versus related functions, like pmap_remove_write(). It eliminates harmless but pointless calls to pmap_remove_all() that were being performed on PG_UNMANAGED pages. Update all of the existing assertions on pmap_remove_all() to reflect this change. Reviewed by: kib	2011-06-29 16:40:41 +00:00
Alan Cox	1bfec3dfb6	Revert to using the page queues lock in vm_page_clear_dirty_mask() on MIPS. (At present, although atomic_clear_char() is defined by atomic.h on MIPS, it is not actually implemented by support.S.)	2011-06-23 05:23:59 +00:00
Alan Cox	3c76db4c64	Precisely document the synchronization rules for the page's dirty field. (Saying that the lock on the object that the page belongs to must be held only represents one aspect of the rules.) Eliminate the use of the page queues lock for atomically performing read-modify-write operations on the dirty field when the underlying architecture supports atomic operations on char and short types. Document the fact that 32KB pages aren't really supported. Reviewed by: attilio, kib	2011-06-19 19:13:24 +00:00
Konstantin Belousov	3b1025d200	Assert that page is VPO_BUSY or page owner object is locked in vm_page_undirty(). The assert is not precise due to VPO_BUSY owner not being tracked, so assertion does not catch the case when VPO_BUSY is owned by other thread. Reviewed by: alc	2011-06-11 20:15:19 +00:00
Konstantin Belousov	9d17da3bef	Fix a bug in r222586. Lock the page owner object around the modification of the m->dirty. Reported and tested by: nwhitehorn Reviewed by: alc	2011-06-11 20:13:28 +00:00
Konstantin Belousov	031ec8c10a	In the VOP_PUTPAGES() implementations, change the default error from VM_PAGER_AGAIN to VM_PAGER_ERROR for the unwritten pages. Return VM_PAGER_AGAIN for the partially written page. Always forward at least one page in the loop of vm_object_page_clean(). VM_PAGER_ERROR causes the page reactivation and does not clear the page dirty state, so the write is not lost. The change fixes an infinite loop in vm_object_page_clean() when the filesystem returns permanent errors for some page writes. Reported and tested by: gavin Reviewed by: alc, rmacklem MFC after: 1 week	2011-06-01 21:00:28 +00:00
Alan Cox	8cd02d00be	Correct an error in r222163. Unless UMA_MD_SMALL_ALLOC is defined, startup_alloc() must be used until uma_startup2() is called. Reported by: jh	2011-05-22 17:46:16 +00:00
Alan Cox	342f1793ba	1. Prior to r214782, UMA did not support multipage allocations before uma_startup2() was called. Thus, setting the variable "booted" to true in uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because any allocations made after uma_startup() but before uma_startup2() could be satisfied by uma_small_alloc(). Now, however, some multipage allocations are necessary before uma_startup2() just to allocate zone structures on machines with a large number of processors. Thus, a Boolean can no longer effectively describe the state of the UMA allocator. Instead, make "booted" have three values to describe how far initialization has progressed. This allows multipage allocations to continue using startup_alloc() until uma_startup2(), but single-page allocations may begin using uma_small_alloc() after uma_startup(). 2. With the aforementioned change, only a modest increase in boot pages is necessary to boot UMA on a large number of processors. 3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM. It has only been used between r182028 and r204128. Reviewed by: attilio [1], nwhitehorn [3] Tested by: sbruno	2011-05-21 17:43:43 +00:00
Alan Cox	59d7277f4a	Fix spelling errors.	2011-05-20 17:28:00 +00:00
Alan Cox	df1bc9de7c	Eliminate a redundant #include. ("vm/vm_param.h" already includes "machine/vmparam.h".)	2011-05-20 15:26:31 +00:00
Matthew D Fleming	cfb00e5aa7	Move the ZERO_REGION_SIZE to a machine-dependent file, as on many architectures (i386, for example) the virtual memory space may be constrained enough that 2MB is a large chunk. Use 64K for arches other than amd64 and ia64, with special handling for sparc64 due to differing hardware. Also commit the comment changes to kmem_init_zero_region() that I missed due to not saving the file. (Darn the unfamiliar development environment). Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you see fit. Requested by: alc MFC after: 1 week MFC with: r221853	2011-05-13 19:35:01 +00:00
Matthew D Fleming	89cb2a19ec	Use a globally visible region of zeros for both /dev/zero and the md device. There are likely other kernel uses of "blob of zeros" that can be converted. Reviewed by: alc MFC after: 1 week	2011-05-13 18:48:00 +00:00
Max Laier	e18cc7bf3e	Another long standing vm bug found at Isilon: Fix a race between vm_object_collapse and vm_fault. Reviewed by: alc@ MFC after: 3 days	2011-05-09 20:27:49 +00:00
David E. O'Brien	cec9f109bb	Reap old SPL comments. Reviewed by: alc	2011-04-26 22:18:53 +00:00
Konstantin Belousov	86769ac0a4	Fix two bugs in r218670. Hold the vnode around the region where object lock is dropped, until vnode lock is acquired. Do not drop the vnode reference for a case when the object was deallocated during unlock. Note that in this case, VV_TEXT is cleared by vnode_pager_dealloc(). Reported and tested by: pho Reviewed by: alc MFC after: 3 days	2011-04-23 21:38:21 +00:00
John Baldwin	e806d352d2	Fix several places to ignore processes that are not yet fully constructed. MFC after: 1 week	2011-04-06 17:47:22 +00:00
Edward Tomasz Napierala	f497cda257	In vm_daemon(), do not skip processes stopped with SIGSTOP.	2011-04-06 16:27:04 +00:00
Edward Tomasz Napierala	099e7e950f	Add RACCT_RSS. Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)	2011-04-06 16:24:24 +00:00
Edward Tomasz Napierala	1ba5ad4210	Add accounting for most of the memory-related resources. Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)	2011-04-05 20:23:59 +00:00
Konstantin Belousov	af32c4196f	Handle the corner case in vm_fault_quick_hold_pages(). If supplied length is zero, and user address is invalid, function might return -1, due to the truncation and rounding of the address. The callers interpret the situation as EFAULT. Instead of handling the zero length in caller, filter it in vm_fault_quick_hold_pages(). Sponsored by: The FreeBSD Foundation Reviewed by: alc	2011-03-25 16:38:10 +00:00
John Baldwin	8e6fa660f2	Fix some locking nits with the p_state field of struct proc: - Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL in fork to honor the locking requirements. While here, expand the scope of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously the code was locking the new child process (p2) after it had locked the parent process (p1). However, when locking two processes, the safe order is to lock the child first, then the parent. - Fix various places that were checking p_state against PRS_NEW without having the process locked to use PROC_LOCK(). Every place was already locking the process, just after the PRS_NEW check. - Remove or reduce the use of PROC_SLOCK() for places that were checking p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading the current state. - Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once. MFC after: 1 week	2011-03-24 18:40:11 +00:00
Jeff Roberson	e4cd31dd3c	- Merge changes to the base system to support OFED. These include a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND, and other miscellaneous small features.	2011-03-21 09:40:01 +00:00
Edward Tomasz Napierala	3fccbe4397	In vm_daemon(), when iterating over all processes in the system, skip those which are not yet fully initialized (i.e. ones with p_state == PRS_NEW). Without it, we could panic in _thread_lock_flags(). Note that there may be other instances of FOREACH_PROC_IN_SYSTEM() that require similar fix. Reported by: pho, keramida Discussed with: kib	2011-03-18 06:47:23 +00:00
Alan Cox	10cf256074	Eliminate duplication of the fake page code and zone by the device and sg pagers. Reviewed by: jhb	2011-03-11 07:07:48 +00:00
Rebecca Cran	2860553a86	Change the return type of vmspace_swap_count to a long to match the other vmspace_*_count functions. MFC after: 3 days	2011-03-01 11:04:30 +00:00
Sergey Kandaurov	7ec9c8d170	Remove sysctl vm.max_proc_mmap used to protect from KVA space exhaustion. As it was pointed out by Alan Cox, that no longer serves its purpose with the modern UMA allocator compared to the old one used in 4.x days. The removal of sysctl eliminates max_proc_mmap type overflow leading to the broken mmap(2) seen with large amount of physical memory on arches with factually unbound KVA space (such as amd64). It was found that slightly less than 256GB of physmem was enough to trigger the overflow. Reviewed by: alc, kib Approved by: avg (mentor) MFC after: 2 months	2011-02-24 09:22:56 +00:00
Rebecca Cran	65d8409cee	Calculate and return the count in vmspace_swap_count as a vm_offset_t instead of an int to avoid overflow. While here, clean up some style(9) issues. PR: kern/152200 Reviewed by: kib MFC after: 2 weeks	2011-02-23 10:28:37 +00:00
Alan Cox	e6ffa21488	Remove pmap fields that are either unused or not fully implemented. Discussed with: kib	2011-02-17 15:36:29 +00:00
Konstantin Belousov	56bdf2dbc2	Since r218070 reenabled the call to vm_map_simplify_entry() from vm_map_insert(), the kmem_back() assumption about newly inserted entry might be broken due to interference of two factors. In the low memory condition, when vm_page_alloc() returns NULL, supplied map is unlocked. If another thread performs kmem_malloc() meantime, and its map entry is placed right next to our thread map entry in the map, both entries wire count is still 0 and entries are coalesced due to vm_map_simplify_entry(). Mark new entry with MAP_ENTRY_IN_TRANSITION to prevent coalesce. Fix some style issues, tighten the assertions to account for MAP_ENTRY_IN_TRANSITION state. Reported and tested by: pho Reviewed by: alc	2011-02-15 09:03:58 +00:00
Konstantin Belousov	03fa5b34a0	Lock the vnode around clearing of VV_TEXT flag. Remove mp_fixme() note mentioning that vnode lock is needed. Reviewed by: alc Tested by: pho MFC after: 1 week	2011-02-13 21:52:26 +00:00
Juli Mallett	6edf6104a9	Use CPU_FOREACH rather than expecting CPUs 0 through mp_ncpus-1 to be present. Don't micro-optimize the uniprocessor case; use the same loop there. Submitted by: Bhanu Prakash Reviewed by: kib, jhb	2011-02-12 02:10:08 +00:00
Alan Cox	d7b20e4b45	Retire VFS_BIO_DEBUG. Convert those checks that were still valid into KASSERT()s and eliminate the rest. Replace excessive printf()s and a panic() in bufdone_finish() with a KASSERT() in vm_page_io_finish(). Reviewed by: kib	2011-02-12 01:00:00 +00:00
Alan Cox	17f3095d1a	Unless "cnt" exceeds MAX_COMMIT_COUNT, nfsrv_commit() and nfsvno_fsync() are incorrectly calling vm_object_page_clean(). They are passing the length of the range rather than the ending offset of the range. Perform the OFF_TO_IDX() conversion in vm_object_page_clean() rather than the callers. Reviewed by: kib MFC after: 3 weeks	2011-02-05 21:21:27 +00:00
Alan Cox	0cc74f144e	Since the last parameter to vm_object_shadow() is a vm_size_t and not a vm_pindex_t, it makes no sense for its callers to perform atop(). Let vm_object_shadow() do that instead.	2011-02-04 21:49:24 +00:00
Alan Cox	3d05198e23	Release the free page queues lock earlier in vm_page_alloc(). Discussed with: kib@	2011-01-30 23:55:48 +00:00
Alan Cox	d2a444c0da	Reenable the call to vm_map_simplify_entry() from vm_map_insert() for non-MAP_STACK_* entries. (See r71983 and r74235.) In some cases, performing this call to vm_map_simplify_entry() halves the number of vm map entries used by the Sun JDK.	2011-01-29 15:23:02 +00:00
Matthew D Fleming	00f0e671ff	Explicitly wire the user buffer rather than doing it implicitly in sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT drain for extremely large amounts of data where the caller knows that appropriate references are held, and sleeping is not an issue. Inspired by: rwatson	2011-01-27 00:34:12 +00:00
Sergey Kandaurov	4053b05b91	Make MSGBUF_SIZE kernel option a loader tunable kern.msgbufsize. Submitted by: perryh pluto.rain.com (previous version) Reviewed by: jhb Approved by: kib (mentor) Tested by: universe	2011-01-21 10:26:26 +00:00
Alan Cox	2c4992db70	Move the definition of M_VMPGDATA to the swap pager, where the only remaining uses are.	2011-01-18 04:54:43 +00:00
Alan Cox	44e46b9e53	Explicitly initialize the page's queue field to PQ_NONE instead of relying on PQ_NONE being zero. Redefine PQ_NONE and PQ_COUNT so that a page queue isn't allocated for PQ_NONE. Reviewed by: kib@	2011-01-17 19:17:26 +00:00
Alan Cox	9454f82862	Sort function prototypes.	2011-01-16 20:40:50 +00:00

3327 Commits