mirror of
https://git.FreeBSD.org/src.git
synced 2025-01-15 15:06:42 +00:00
4518d9ae4c
MFC after: 2 weeks
1069 lines
38 KiB
Groff
1069 lines
38 KiB
Groff
.\" Copyright (C) 2001 Matthew Dillon. All rights reserved.
|
|
.\"
|
|
.\" Redistribution and use in source and binary forms, with or without
|
|
.\" modification, are permitted provided that the following conditions
|
|
.\" are met:
|
|
.\" 1. Redistributions of source code must retain the above copyright
|
|
.\" notice, this list of conditions and the following disclaimer.
|
|
.\" 2. Redistributions in binary form must reproduce the above copyright
|
|
.\" notice, this list of conditions and the following disclaimer in the
|
|
.\" documentation and/or other materials provided with the distribution.
|
|
.\"
|
|
.\" THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
|
|
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
.\" ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
|
|
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
.\" SUCH DAMAGE.
|
|
.\"
|
|
.\" $FreeBSD$
|
|
.\"
|
|
.Dd October 16, 2010
|
|
.Dt TUNING 7
|
|
.Os
|
|
.Sh NAME
|
|
.Nm tuning
|
|
.Nd performance tuning under FreeBSD
|
|
.Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP
|
|
When using
|
|
.Xr bsdlabel 8
|
|
or
|
|
.Xr sysinstall 8
|
|
to lay out your file systems on a hard disk it is important to remember
|
|
that hard drives can transfer data much more quickly from outer tracks
|
|
than they can from inner tracks.
|
|
To take advantage of this you should
|
|
try to pack your smaller file systems and swap closer to the outer tracks,
|
|
follow with the larger file systems, and end with the largest file systems.
|
|
It is also important to size system standard file systems such that you
|
|
will not be forced to resize them later as you scale the machine up.
|
|
I usually create, in order, a 128M root, 1G swap, 128M
|
|
.Pa /var ,
|
|
128M
|
|
.Pa /var/tmp ,
|
|
3G
|
|
.Pa /usr ,
|
|
and use any remaining space for
|
|
.Pa /home .
|
|
.Pp
|
|
You should typically size your swap space to approximately 2x main memory
|
|
for systems with less than 2GB of RAM, or approximately 1x main memory
|
|
if you have more.
|
|
If you do not have a lot of RAM, though, you will generally want a lot
|
|
more swap.
|
|
It is not recommended that you configure any less than
|
|
256M of swap on a system and you should keep in mind future memory
|
|
expansion when sizing the swap partition.
|
|
The kernel's VM paging algorithms are tuned to perform best when there is
|
|
at least 2x swap versus main memory.
|
|
Configuring too little swap can lead
|
|
to inefficiencies in the VM page scanning code as well as create issues
|
|
later on if you add more memory to your machine.
|
|
Finally, on larger systems
|
|
with multiple SCSI disks (or multiple IDE disks operating on different
|
|
controllers), we strongly recommend that you configure swap on each drive.
|
|
The swap partitions on the drives should be approximately the same size.
|
|
The kernel can handle arbitrary sizes but
|
|
internal data structures scale to 4 times the largest swap partition.
|
|
Keeping
|
|
the swap partitions near the same size will allow the kernel to optimally
|
|
stripe swap space across the N disks.
|
|
Do not worry about overdoing it a
|
|
little, swap space is the saving grace of
|
|
.Ux
|
|
and even if you do not normally use much swap, it can give you more time to
|
|
recover from a runaway program before being forced to reboot.
|
|
.Pp
|
|
How you size your
|
|
.Pa /var
|
|
partition depends heavily on what you intend to use the machine for.
|
|
This
|
|
partition is primarily used to hold mailboxes, the print spool, and log
|
|
files.
|
|
Some people even make
|
|
.Pa /var/log
|
|
its own partition (but except for extreme cases it is not worth the waste
|
|
of a partition ID).
|
|
If your machine is intended to act as a mail
|
|
or print server,
|
|
or you are running a heavily visited web server, you should consider
|
|
creating a much larger partition \(en perhaps a gig or more.
|
|
It is very easy
|
|
to underestimate log file storage requirements.
|
|
.Pp
|
|
Sizing
|
|
.Pa /var/tmp
|
|
depends on the kind of temporary file usage you think you will need.
|
|
128M is
|
|
the minimum we recommend.
|
|
Also note that sysinstall will create a
|
|
.Pa /tmp
|
|
directory.
|
|
Dedicating a partition for temporary file storage is important for
|
|
two reasons: first, it reduces the possibility of file system corruption
|
|
in a crash, and second it reduces the chance of a runaway process that
|
|
fills up
|
|
.Oo Pa /var Oc Ns Pa /tmp
|
|
from blowing up more critical subsystems (mail,
|
|
logging, etc).
|
|
Filling up
|
|
.Oo Pa /var Oc Ns Pa /tmp
|
|
is a very common problem to have.
|
|
.Pp
|
|
In the old days there were differences between
|
|
.Pa /tmp
|
|
and
|
|
.Pa /var/tmp ,
|
|
but the introduction of
|
|
.Pa /var
|
|
(and
|
|
.Pa /var/tmp )
|
|
led to massive confusion
|
|
by program writers so today programs haphazardly use one or the
|
|
other and thus no real distinction can be made between the two.
|
|
So it makes sense to have just one temporary directory and
|
|
softlink to it from the other
|
|
.Pa tmp
|
|
directory locations.
|
|
However you handle
|
|
.Pa /tmp ,
|
|
the one thing you do not want to do is leave it sitting
|
|
on the root partition where it might cause root to fill up or possibly
|
|
corrupt root in a crash/reboot situation.
|
|
.Pp
|
|
The
|
|
.Pa /usr
|
|
partition holds the bulk of the files required to support the system and
|
|
a subdirectory within it called
|
|
.Pa /usr/local
|
|
holds the bulk of the files installed from the
|
|
.Xr ports 7
|
|
hierarchy.
|
|
If you do not use ports all that much and do not intend to keep
|
|
system source
|
|
.Pq Pa /usr/src
|
|
on the machine, you can get away with
|
|
a 1 gigabyte
|
|
.Pa /usr
|
|
partition.
|
|
However, if you install a lot of ports
|
|
(especially window managers and Linux-emulated binaries), we recommend
|
|
at least a 2 gigabyte
|
|
.Pa /usr
|
|
and if you also intend to keep system source
|
|
on the machine, we recommend a 3 gigabyte
|
|
.Pa /usr .
|
|
Do not underestimate the
|
|
amount of space you will need in this partition, it can creep up and
|
|
surprise you!
|
|
.Pp
|
|
The
|
|
.Pa /home
|
|
partition is typically used to hold user-specific data.
|
|
I usually size it to the remainder of the disk.
|
|
.Pp
|
|
Why partition at all?
|
|
Why not create one big
|
|
.Pa /
|
|
partition and be done with it?
|
|
Then I do not have to worry about undersizing things!
|
|
Well, there are several reasons this is not a good idea.
|
|
First,
|
|
each partition has different operational characteristics and separating them
|
|
allows the file system to tune itself to those characteristics.
|
|
For example,
|
|
the root and
|
|
.Pa /usr
|
|
partitions are read-mostly, with very little writing, while
|
|
a lot of reading and writing could occur in
|
|
.Pa /var
|
|
and
|
|
.Pa /var/tmp .
|
|
By properly
|
|
partitioning your system fragmentation introduced in the smaller more
|
|
heavily write-loaded partitions will not bleed over into the mostly-read
|
|
partitions.
|
|
Additionally, keeping the write-loaded partitions closer to
|
|
the edge of the disk (i.e., before the really big partitions instead of after
|
|
in the partition table) will increase I/O performance in the partitions
|
|
where you need it the most.
|
|
Now it is true that you might also need I/O
|
|
performance in the larger partitions, but they are so large that shifting
|
|
them more towards the edge of the disk will not lead to a significant
|
|
performance improvement whereas moving
|
|
.Pa /var
|
|
to the edge can have a huge impact.
|
|
Finally, there are safety concerns.
|
|
Having a small neat root partition that
|
|
is essentially read-only gives it a greater chance of surviving a bad crash
|
|
intact.
|
|
.Pp
|
|
Properly partitioning your system also allows you to tune
|
|
.Xr newfs 8 ,
|
|
and
|
|
.Xr tunefs 8
|
|
parameters.
|
|
Tuning
|
|
.Xr newfs 8
|
|
requires more experience but can lead to significant improvements in
|
|
performance.
|
|
There are three parameters that are relatively safe to tune:
|
|
.Em blocksize , bytes/i-node ,
|
|
and
|
|
.Em cylinders/group .
|
|
.Pp
|
|
.Fx
|
|
performs best when using 8K or 16K file system block sizes.
|
|
The default file system block size is 16K,
|
|
which provides best performance for most applications,
|
|
with the exception of those that perform random access on large files
|
|
(such as database server software).
|
|
Such applications tend to perform better with a smaller block size,
|
|
although modern disk characteristics are such that the performance
|
|
gain from using a smaller block size may not be worth consideration.
|
|
Using a block size larger than 16K
|
|
can cause fragmentation of the buffer cache and
|
|
lead to lower performance.
|
|
.Pp
|
|
The defaults may be unsuitable
|
|
for a file system that requires a very large number of i-nodes
|
|
or is intended to hold a large number of very small files.
|
|
Such a file system should be created with an 8K or 4K block size.
|
|
This also requires you to specify a smaller
|
|
fragment size.
|
|
We recommend always using a fragment size that is 1/8
|
|
the block size (less testing has been done on other fragment size factors).
|
|
The
|
|
.Xr newfs 8
|
|
options for this would be
|
|
.Dq Li "newfs -f 1024 -b 8192 ..." .
|
|
.Pp
|
|
If a large partition is intended to be used to hold fewer, larger files, such
|
|
as database files, you can increase the
|
|
.Em bytes/i-node
|
|
ratio which reduces the number of i-nodes (maximum number of files and
|
|
directories that can be created) for that partition.
|
|
Decreasing the number
|
|
of i-nodes in a file system can greatly reduce
|
|
.Xr fsck 8
|
|
recovery times after a crash.
|
|
Do not use this option
|
|
unless you are actually storing large files on the partition, because if you
|
|
overcompensate you can wind up with a file system that has lots of free
|
|
space remaining but cannot accommodate any more files.
|
|
Using 32768, 65536, or 262144 bytes/i-node is recommended.
|
|
You can go higher but
|
|
it will have only incremental effects on
|
|
.Xr fsck 8
|
|
recovery times.
|
|
For example,
|
|
.Dq Li "newfs -i 32768 ..." .
|
|
.Pp
|
|
.Xr tunefs 8
|
|
may be used to further tune a file system.
|
|
This command can be run in
|
|
single-user mode without having to reformat the file system.
|
|
However, this is possibly the most abused program in the system.
|
|
Many people attempt to
|
|
increase available file system space by setting the min-free percentage to 0.
|
|
This can lead to severe file system fragmentation and we do not recommend
|
|
that you do this.
|
|
Really the only
|
|
.Xr tunefs 8
|
|
option worthwhile here is turning on
|
|
.Em softupdates
|
|
with
|
|
.Dq Li "tunefs -n enable /filesystem" .
|
|
(Note: in
|
|
.Fx 4.5
|
|
and later, softupdates can be turned on using the
|
|
.Fl U
|
|
option to
|
|
.Xr newfs 8 ,
|
|
and
|
|
.Xr sysinstall 8
|
|
will typically enable softupdates automatically for non-root file systems).
|
|
Softupdates drastically improves meta-data performance, mainly file
|
|
creation and deletion.
|
|
We recommend enabling softupdates on most file systems; however, there
|
|
are two limitations to softupdates that you should be aware of when
|
|
determining whether to use it on a file system.
|
|
First, softupdates guarantees file system consistency in the
|
|
case of a crash but could very easily be several seconds (even a minute!\&)
|
|
behind on pending write to the physical disk.
|
|
If you crash you may lose more work
|
|
than otherwise.
|
|
Secondly, softupdates delays the freeing of file system
|
|
blocks.
|
|
If you have a file system (such as the root file system) which is
|
|
close to full, doing a major update of it, e.g.\&
|
|
.Dq Li "make installworld" ,
|
|
can run it out of space and cause the update to fail.
|
|
For this reason, softupdates will not be enabled on the root file system
|
|
during a typical install.
|
|
There is no loss of performance since the root
|
|
file system is rarely written to.
|
|
.Pp
|
|
A number of run-time
|
|
.Xr mount 8
|
|
options exist that can help you tune the system.
|
|
The most obvious and most dangerous one is
|
|
.Cm async .
|
|
Only use this option in conjunction with
|
|
.Xr gjournal 8 ,
|
|
as it is far too dangerous on a normal file system.
|
|
A less dangerous and more
|
|
useful
|
|
.Xr mount 8
|
|
option is called
|
|
.Cm noatime .
|
|
.Ux
|
|
file systems normally update the last-accessed time of a file or
|
|
directory whenever it is accessed.
|
|
This operation is handled in
|
|
.Fx
|
|
with a delayed write and normally does not create a burden on the system.
|
|
However, if your system is accessing a huge number of files on a continuing
|
|
basis the buffer cache can wind up getting polluted with atime updates,
|
|
creating a burden on the system.
|
|
For example, if you are running a heavily
|
|
loaded web site, or a news server with lots of readers, you might want to
|
|
consider turning off atime updates on your larger partitions with this
|
|
.Xr mount 8
|
|
option.
|
|
However, you should not gratuitously turn off atime
|
|
updates everywhere.
|
|
For example, the
|
|
.Pa /var
|
|
file system customarily
|
|
holds mailboxes, and atime (in combination with mtime) is used to
|
|
determine whether a mailbox has new mail.
|
|
You might as well leave
|
|
atime turned on for mostly read-only partitions such as
|
|
.Pa /
|
|
and
|
|
.Pa /usr
|
|
as well.
|
|
This is especially useful for
|
|
.Pa /
|
|
since some system utilities
|
|
use the atime field for reporting.
|
|
.Sh STRIPING DISKS
|
|
In larger systems you can stripe partitions from several drives together
|
|
to create a much larger overall partition.
|
|
Striping can also improve
|
|
the performance of a file system by splitting I/O operations across two
|
|
or more disks.
|
|
The
|
|
.Xr gstripe 8 ,
|
|
.Xr gvinum 8 ,
|
|
and
|
|
.Xr ccdconfig 8
|
|
utilities may be used to create simple striped file systems.
|
|
Generally
|
|
speaking, striping smaller partitions such as the root and
|
|
.Pa /var/tmp ,
|
|
or essentially read-only partitions such as
|
|
.Pa /usr
|
|
is a complete waste of time.
|
|
You should only stripe partitions that require serious I/O performance,
|
|
typically
|
|
.Pa /var , /home ,
|
|
or custom partitions used to hold databases and web pages.
|
|
Choosing the proper stripe size is also
|
|
important.
|
|
File systems tend to store meta-data on power-of-2 boundaries
|
|
and you usually want to reduce seeking rather than increase seeking.
|
|
This
|
|
means you want to use a large off-center stripe size such as 1152 sectors
|
|
so sequential I/O does not seek both disks and so meta-data is distributed
|
|
across both disks rather than concentrated on a single disk.
|
|
If
|
|
you really need to get sophisticated, we recommend using a real hardware
|
|
RAID controller from the list of
|
|
.Fx
|
|
supported controllers.
|
|
.Sh SYSCTL TUNING
|
|
.Xr sysctl 8
|
|
variables permit system behavior to be monitored and controlled at
|
|
run-time.
|
|
Some sysctls simply report on the behavior of the system; others allow
|
|
the system behavior to be modified;
|
|
some may be set at boot time using
|
|
.Xr rc.conf 5 ,
|
|
but most will be set via
|
|
.Xr sysctl.conf 5 .
|
|
There are several hundred sysctls in the system, including many that appear
|
|
to be candidates for tuning but actually are not.
|
|
In this document we will only cover the ones that have the greatest effect
|
|
on the system.
|
|
.Pp
|
|
The
|
|
.Va vm.overcommit
|
|
sysctl defines the overcommit behaviour of the vm subsystem.
|
|
The virtual memory system always does accounting of the swap space
|
|
reservation, both total for system and per-user.
|
|
Corresponding values
|
|
are available through sysctl
|
|
.Va vm.swap_total ,
|
|
that gives the total bytes available for swapping, and
|
|
.Va vm.swap_reserved ,
|
|
that gives number of bytes that may be needed to back all currently
|
|
allocated anonymous memory.
|
|
.Pp
|
|
Setting bit 0 of the
|
|
.Va vm.overcommit
|
|
sysctl causes the virtual memory system to return failure
|
|
to the process when allocation of memory causes
|
|
.Va vm.swap_reserved
|
|
to exceed
|
|
.Va vm.swap_total .
|
|
Bit 1 of the sysctl enforces
|
|
.Dv RLIMIT_SWAP
|
|
limit
|
|
(see
|
|
.Xr getrlimit 2 ) .
|
|
Root is exempt from this limit.
|
|
Bit 2 allows to count most of the physical
|
|
memory as allocatable, except wired and free reserved pages
|
|
(accounted by
|
|
.Va vm.stats.vm.v_free_target
|
|
and
|
|
.Va vm.stats.vm.v_wire_count
|
|
sysctls, respectively).
|
|
.Pp
|
|
The
|
|
.Va kern.ipc.maxpipekva
|
|
loader tunable is used to set a hard limit on the
|
|
amount of kernel address space allocated to mapping of pipe buffers.
|
|
Use of the mapping allows the kernel to eliminate a copy of the
|
|
data from writer address space into the kernel, directly copying
|
|
the content of mapped buffer to the reader.
|
|
Increasing this value to a higher setting, such as `25165824' might
|
|
improve performance on systems where space for mapping pipe buffers
|
|
is quickly exhausted.
|
|
This exhaustion is not fatal; however, and it will only cause pipes to
|
|
to fall back to using double-copy.
|
|
.Pp
|
|
The
|
|
.Va kern.ipc.shm_use_phys
|
|
sysctl defaults to 0 (off) and may be set to 0 (off) or 1 (on).
|
|
Setting
|
|
this parameter to 1 will cause all System V shared memory segments to be
|
|
mapped to unpageable physical RAM.
|
|
This feature only has an effect if you
|
|
are either (A) mapping small amounts of shared memory across many (hundreds)
|
|
of processes, or (B) mapping large amounts of shared memory across any
|
|
number of processes.
|
|
This feature allows the kernel to remove a great deal
|
|
of internal memory management page-tracking overhead at the cost of wiring
|
|
the shared memory into core, making it unswappable.
|
|
.Pp
|
|
The
|
|
.Va vfs.vmiodirenable
|
|
sysctl defaults to 1 (on).
|
|
This parameter controls how directories are cached
|
|
by the system.
|
|
Most directories are small and use but a single fragment
|
|
(typically 2K) in the file system and even less (typically 512 bytes) in
|
|
the buffer cache.
|
|
However, when operating in the default mode the buffer
|
|
cache will only cache a fixed number of directories even if you have a huge
|
|
amount of memory.
|
|
Turning on this sysctl allows the buffer cache to use
|
|
the VM Page Cache to cache the directories.
|
|
The advantage is that all of
|
|
memory is now available for caching directories.
|
|
The disadvantage is that
|
|
the minimum in-core memory used to cache a directory is the physical page
|
|
size (typically 4K) rather than 512 bytes.
|
|
We recommend turning this option off in memory-constrained environments;
|
|
however, when on, it will substantially improve the performance of services
|
|
that manipulate a large number of files.
|
|
Such services can include web caches, large mail systems, and news systems.
|
|
Turning on this option will generally not reduce performance even with the
|
|
wasted memory but you should experiment to find out.
|
|
.Pp
|
|
The
|
|
.Va vfs.write_behind
|
|
sysctl defaults to 1 (on).
|
|
This tells the file system to issue media
|
|
writes as full clusters are collected, which typically occurs when writing
|
|
large sequential files.
|
|
The idea is to avoid saturating the buffer
|
|
cache with dirty buffers when it would not benefit I/O performance.
|
|
However,
|
|
this may stall processes and under certain circumstances you may wish to turn
|
|
it off.
|
|
.Pp
|
|
The
|
|
.Va vfs.hirunningspace
|
|
sysctl determines how much outstanding write I/O may be queued to
|
|
disk controllers system-wide at any given time.
|
|
It is used by the UFS file system.
|
|
The default is self-tuned and
|
|
usually sufficient but on machines with advanced controllers and lots
|
|
of disks this may be tuned up to match what the controllers buffer.
|
|
Configuring this setting to match tagged queuing capabilities of
|
|
controllers or drives with average IO size used in production works
|
|
best (for example: 16 MiB will use 128 tags with IO requests of 128 KiB).
|
|
Note that setting too high a value
|
|
(exceeding the buffer cache's write threshold) can lead to extremely
|
|
bad clustering performance.
|
|
Do not set this value arbitrarily high!
|
|
Higher write queueing values may also add latency to reads occurring at
|
|
the same time.
|
|
.Pp
|
|
The
|
|
.Va vfs.read_max
|
|
sysctl governs VFS read-ahead and is expressed as the number of blocks
|
|
to pre-read if the heuristics algorithm decides that the reads are
|
|
issued sequentially.
|
|
It is used by the UFS, ext2fs and msdosfs file systems.
|
|
With the default UFS block size of 16 KiB, a setting of 32 will allow
|
|
speculatively reading up to 512 KiB.
|
|
This setting may be increased to get around disk I/O latencies, especially
|
|
where these latencies are large such as in virtual machine emulated
|
|
environments.
|
|
It may be tuned down in specific cases where the I/O load is such that
|
|
read-ahead adversely affects performance or where system memory is really
|
|
low.
|
|
.Pp
|
|
The
|
|
.Va vfs.ncsizefactor
|
|
sysctl defines how large VFS namecache may grow.
|
|
The number of currently allocated entries in namecache is provided by
|
|
.Va debug.numcache
|
|
sysctl and the condition
|
|
debug.numcache < kern.maxvnodes * vfs.ncsizefactor
|
|
is adhered to.
|
|
.Pp
|
|
The
|
|
.Va vfs.ncnegfactor
|
|
sysctl defines how many negative entries VFS namecache is allowed to create.
|
|
The number of currently allocated negative entries is provided by
|
|
.Va debug.numneg
|
|
sysctl and the condition
|
|
vfs.ncnegfactor * debug.numneg < debug.numcache
|
|
is adhered to.
|
|
.Pp
|
|
There are various other buffer-cache and VM page cache related sysctls.
|
|
We do not recommend modifying these values.
|
|
As of
|
|
.Fx 4.3 ,
|
|
the VM system does an extremely good job tuning itself.
|
|
.Pp
|
|
The
|
|
.Va net.inet.tcp.sendspace
|
|
and
|
|
.Va net.inet.tcp.recvspace
|
|
sysctls are of particular interest if you are running network intensive
|
|
applications.
|
|
They control the amount of send and receive buffer space
|
|
allowed for any given TCP connection.
|
|
The default sending buffer is 32K; the default receiving buffer
|
|
is 64K.
|
|
You can often
|
|
improve bandwidth utilization by increasing the default at the cost of
|
|
eating up more kernel memory for each connection.
|
|
We do not recommend
|
|
increasing the defaults if you are serving hundreds or thousands of
|
|
simultaneous connections because it is possible to quickly run the system
|
|
out of memory due to stalled connections building up.
|
|
But if you need
|
|
high bandwidth over a fewer number of connections, especially if you have
|
|
gigabit Ethernet, increasing these defaults can make a huge difference.
|
|
You can adjust the buffer size for incoming and outgoing data separately.
|
|
For example, if your machine is primarily doing web serving you may want
|
|
to decrease the recvspace in order to be able to increase the
|
|
sendspace without eating too much kernel memory.
|
|
Note that the routing table (see
|
|
.Xr route 8 )
|
|
can be used to introduce route-specific send and receive buffer size
|
|
defaults.
|
|
.Pp
|
|
As an additional management tool you can use pipes in your
|
|
firewall rules (see
|
|
.Xr ipfw 8 )
|
|
to limit the bandwidth going to or from particular IP blocks or ports.
|
|
For example, if you have a T1 you might want to limit your web traffic
|
|
to 70% of the T1's bandwidth in order to leave the remainder available
|
|
for mail and interactive use.
|
|
Normally a heavily loaded web server
|
|
will not introduce significant latencies into other services even if
|
|
the network link is maxed out, but enforcing a limit can smooth things
|
|
out and lead to longer term stability.
|
|
Many people also enforce artificial
|
|
bandwidth limitations in order to ensure that they are not charged for
|
|
using too much bandwidth.
|
|
.Pp
|
|
Setting the send or receive TCP buffer to values larger than 65535 will result
|
|
in a marginal performance improvement unless both hosts support the window
|
|
scaling extension of the TCP protocol, which is controlled by the
|
|
.Va net.inet.tcp.rfc1323
|
|
sysctl.
|
|
These extensions should be enabled and the TCP buffer size should be set
|
|
to a value larger than 65536 in order to obtain good performance from
|
|
certain types of network links; specifically, gigabit WAN links and
|
|
high-latency satellite links.
|
|
RFC1323 support is enabled by default.
|
|
.Pp
|
|
The
|
|
.Va net.inet.tcp.always_keepalive
|
|
sysctl determines whether or not the TCP implementation should attempt
|
|
to detect dead TCP connections by intermittently delivering
|
|
.Dq keepalives
|
|
on the connection.
|
|
By default, this is enabled for all applications; by setting this
|
|
sysctl to 0, only applications that specifically request keepalives
|
|
will use them.
|
|
In most environments, TCP keepalives will improve the management of
|
|
system state by expiring dead TCP connections, particularly for
|
|
systems serving dialup users who may not always terminate individual
|
|
TCP connections before disconnecting from the network.
|
|
However, in some environments, temporary network outages may be
|
|
incorrectly identified as dead sessions, resulting in unexpectedly
|
|
terminated TCP connections.
|
|
In such environments, setting the sysctl to 0 may reduce the occurrence of
|
|
TCP session disconnections.
|
|
.Pp
|
|
The
|
|
.Va net.inet.tcp.delayed_ack
|
|
TCP feature is largely misunderstood.
|
|
Historically speaking, this feature
|
|
was designed to allow the acknowledgement to transmitted data to be returned
|
|
along with the response.
|
|
For example, when you type over a remote shell,
|
|
the acknowledgement to the character you send can be returned along with the
|
|
data representing the echo of the character.
|
|
With delayed acks turned off,
|
|
the acknowledgement may be sent in its own packet, before the remote service
|
|
has a chance to echo the data it just received.
|
|
This same concept also
|
|
applies to any interactive protocol (e.g.\& SMTP, WWW, POP3), and can cut the
|
|
number of tiny packets flowing across the network in half.
|
|
The
|
|
.Fx
|
|
delayed ACK implementation also follows the TCP protocol rule that
|
|
at least every other packet be acknowledged even if the standard 100ms
|
|
timeout has not yet passed.
|
|
Normally the worst a delayed ACK can do is
|
|
slightly delay the teardown of a connection, or slightly delay the ramp-up
|
|
of a slow-start TCP connection.
|
|
While we are not sure we believe that
|
|
the several FAQs related to packages such as SAMBA and SQUID which advise
|
|
turning off delayed acks may be referring to the slow-start issue.
|
|
In
|
|
.Fx ,
|
|
it would be more beneficial to increase the slow-start flightsize via
|
|
the
|
|
.Va net.inet.tcp.slowstart_flightsize
|
|
sysctl rather than disable delayed acks.
|
|
.Pp
|
|
The
|
|
.Va net.inet.tcp.inflight.enable
|
|
sysctl turns on bandwidth delay product limiting for all TCP connections.
|
|
The system will attempt to calculate the bandwidth delay product for each
|
|
connection and limit the amount of data queued to the network to just the
|
|
amount required to maintain optimum throughput.
|
|
This feature is useful
|
|
if you are serving data over modems, GigE, or high speed WAN links (or
|
|
any other link with a high bandwidth*delay product), especially if you are
|
|
also using window scaling or have configured a large send window.
|
|
If you enable this option, you should also be sure to set
|
|
.Va net.inet.tcp.inflight.debug
|
|
to 0 (disable debugging), and for production use setting
|
|
.Va net.inet.tcp.inflight.min
|
|
to at least 6144 may be beneficial.
|
|
Note however, that setting high
|
|
minimums may effectively disable bandwidth limiting depending on the link.
|
|
The limiting feature reduces the amount of data built up in intermediate
|
|
router and switch packet queues as well as reduces the amount of data built
|
|
up in the local host's interface queue.
|
|
With fewer packets queued up,
|
|
interactive connections, especially over slow modems, will also be able
|
|
to operate with lower round trip times.
|
|
However, note that this feature
|
|
only affects data transmission (uploading / server-side).
|
|
It does not
|
|
affect data reception (downloading).
|
|
.Pp
|
|
Adjusting
|
|
.Va net.inet.tcp.inflight.stab
|
|
is not recommended.
|
|
This parameter defaults to 20, representing 2 maximal packets added
|
|
to the bandwidth delay product window calculation.
|
|
The additional
|
|
window is required to stabilize the algorithm and improve responsiveness
|
|
to changing conditions, but it can also result in higher ping times
|
|
over slow links (though still much lower than you would get without
|
|
the inflight algorithm).
|
|
In such cases you may
|
|
wish to try reducing this parameter to 15, 10, or 5, and you may also
|
|
have to reduce
|
|
.Va net.inet.tcp.inflight.min
|
|
(for example, to 3500) to get the desired effect.
|
|
Reducing these parameters
|
|
should be done as a last resort only.
|
|
.Pp
|
|
The
|
|
.Va net.inet.ip.portrange.*
|
|
sysctls control the port number ranges automatically bound to TCP and UDP
|
|
sockets.
|
|
There are three ranges: a low range, a default range, and a
|
|
high range, selectable via the
|
|
.Dv IP_PORTRANGE
|
|
.Xr setsockopt 2
|
|
call.
|
|
Most
|
|
network programs use the default range which is controlled by
|
|
.Va net.inet.ip.portrange.first
|
|
and
|
|
.Va net.inet.ip.portrange.last ,
|
|
which default to 49152 and 65535, respectively.
|
|
Bound port ranges are
|
|
used for outgoing connections, and it is possible to run the system out
|
|
of ports under certain circumstances.
|
|
This most commonly occurs when you are
|
|
running a heavily loaded web proxy.
|
|
The port range is not an issue
|
|
when running a server which handles mainly incoming connections, such as a
|
|
normal web server, or has a limited number of outgoing connections, such
|
|
as a mail relay.
|
|
For situations where you may run out of ports,
|
|
we recommend decreasing
|
|
.Va net.inet.ip.portrange.first
|
|
modestly.
|
|
A range of 10000 to 30000 ports may be reasonable.
|
|
You should also consider firewall effects when changing the port range.
|
|
Some firewalls
|
|
may block large ranges of ports (usually low-numbered ports) and expect systems
|
|
to use higher ranges of ports for outgoing connections.
|
|
By default
|
|
.Va net.inet.ip.portrange.last
|
|
is set at the maximum allowable port number.
|
|
.Pp
|
|
The
|
|
.Va kern.ipc.somaxconn
|
|
sysctl limits the size of the listen queue for accepting new TCP connections.
|
|
The default value of 128 is typically too low for robust handling of new
|
|
connections in a heavily loaded web server environment.
|
|
For such environments,
|
|
we recommend increasing this value to 1024 or higher.
|
|
The service daemon
|
|
may itself limit the listen queue size (e.g.\&
|
|
.Xr sendmail 8 ,
|
|
apache) but will
|
|
often have a directive in its configuration file to adjust the queue size up.
|
|
Larger listen queues also do a better job of fending off denial of service
|
|
attacks.
|
|
.Pp
|
|
The
|
|
.Va kern.maxfiles
|
|
sysctl determines how many open files the system supports.
|
|
The default is
|
|
typically a few thousand but you may need to bump this up to ten or twenty
|
|
thousand if you are running databases or large descriptor-heavy daemons.
|
|
The read-only
|
|
.Va kern.openfiles
|
|
sysctl may be interrogated to determine the current number of open files
|
|
on the system.
|
|
.Pp
|
|
The
|
|
.Va vm.swap_idle_enabled
|
|
sysctl is useful in large multi-user systems where you have lots of users
|
|
entering and leaving the system and lots of idle processes.
|
|
Such systems
|
|
tend to generate a great deal of continuous pressure on free memory reserves.
|
|
Turning this feature on and adjusting the swapout hysteresis (in idle
|
|
seconds) via
|
|
.Va vm.swap_idle_threshold1
|
|
and
|
|
.Va vm.swap_idle_threshold2
|
|
allows you to depress the priority of pages associated with idle processes
|
|
more quickly then the normal pageout algorithm.
|
|
This gives a helping hand
|
|
to the pageout daemon.
|
|
Do not turn this option on unless you need it,
|
|
because the tradeoff you are making is to essentially pre-page memory sooner
|
|
rather than later, eating more swap and disk bandwidth.
|
|
In a small system
|
|
this option will have a detrimental effect but in a large system that is
|
|
already doing moderate paging this option allows the VM system to stage
|
|
whole processes into and out of memory more easily.
|
|
.Sh LOADER TUNABLES
|
|
Some aspects of the system behavior may not be tunable at runtime because
|
|
memory allocations they perform must occur early in the boot process.
|
|
To change loader tunables, you must set their values in
|
|
.Xr loader.conf 5
|
|
and reboot the system.
|
|
.Pp
|
|
.Va kern.maxusers
|
|
controls the scaling of a number of static system tables, including defaults
|
|
for the maximum number of open files, sizing of network memory resources, etc.
|
|
As of
|
|
.Fx 4.5 ,
|
|
.Va kern.maxusers
|
|
is automatically sized at boot based on the amount of memory available in
|
|
the system, and may be determined at run-time by inspecting the value of the
|
|
read-only
|
|
.Va kern.maxusers
|
|
sysctl.
|
|
Some sites will require larger or smaller values of
|
|
.Va kern.maxusers
|
|
and may set it as a loader tunable; values of 64, 128, and 256 are not
|
|
uncommon.
|
|
We do not recommend going above 256 unless you need a huge number
|
|
of file descriptors; many of the tunable values set to their defaults by
|
|
.Va kern.maxusers
|
|
may be individually overridden at boot-time or run-time as described
|
|
elsewhere in this document.
|
|
Systems older than
|
|
.Fx 4.4
|
|
must set this value via the kernel
|
|
.Xr config 8
|
|
option
|
|
.Cd maxusers
|
|
instead.
|
|
.Pp
|
|
The
|
|
.Va kern.dfldsiz
|
|
and
|
|
.Va kern.dflssiz
|
|
tunables set the default soft limits for process data and stack size
|
|
respectively.
|
|
Processes may increase these up to the hard limits by calling
|
|
.Xr setrlimit 2 .
|
|
The
|
|
.Va kern.maxdsiz ,
|
|
.Va kern.maxssiz ,
|
|
and
|
|
.Va kern.maxtsiz
|
|
tunables set the hard limits for process data, stack, and text size
|
|
respectively; processes may not exceed these limits.
|
|
The
|
|
.Va kern.sgrowsiz
|
|
tunable controls how much the stack segment will grow when a process
|
|
needs to allocate more stack.
|
|
.Pp
|
|
.Va kern.ipc.nmbclusters
|
|
may be adjusted to increase the number of network mbufs the system is
|
|
willing to allocate.
|
|
Each cluster represents approximately 2K of memory,
|
|
so a value of 1024 represents 2M of kernel memory reserved for network
|
|
buffers.
|
|
You can do a simple calculation to figure out how many you need.
|
|
If you have a web server which maxes out at 1000 simultaneous connections,
|
|
and each connection eats a 16K receive and 16K send buffer, you need
|
|
approximately 32MB worth of network buffers to deal with it.
|
|
A good rule of
|
|
thumb is to multiply by 2, so 32MBx2 = 64MB/2K = 32768.
|
|
So for this case
|
|
you would want to set
|
|
.Va kern.ipc.nmbclusters
|
|
to 32768.
|
|
We recommend values between
|
|
1024 and 4096 for machines with moderates amount of memory, and between 4096
|
|
and 32768 for machines with greater amounts of memory.
|
|
Under no circumstances
|
|
should you specify an arbitrarily high value for this parameter, it could
|
|
lead to a boot-time crash.
|
|
The
|
|
.Fl m
|
|
option to
|
|
.Xr netstat 1
|
|
may be used to observe network cluster use.
|
|
Older versions of
|
|
.Fx
|
|
do not have this tunable and require that the
|
|
kernel
|
|
.Xr config 8
|
|
option
|
|
.Dv NMBCLUSTERS
|
|
be set instead.
|
|
.Pp
|
|
More and more programs are using the
|
|
.Xr sendfile 2
|
|
system call to transmit files over the network.
|
|
The
|
|
.Va kern.ipc.nsfbufs
|
|
sysctl controls the number of file system buffers
|
|
.Xr sendfile 2
|
|
is allowed to use to perform its work.
|
|
This parameter nominally scales
|
|
with
|
|
.Va kern.maxusers
|
|
so you should not need to modify this parameter except under extreme
|
|
circumstances.
|
|
See the
|
|
.Sx TUNING
|
|
section in the
|
|
.Xr sendfile 2
|
|
manual page for details.
|
|
.Sh KERNEL CONFIG TUNING
|
|
There are a number of kernel options that you may have to fiddle with in
|
|
a large-scale system.
|
|
In order to change these options you need to be
|
|
able to compile a new kernel from source.
|
|
The
|
|
.Xr config 8
|
|
manual page and the handbook are good starting points for learning how to
|
|
do this.
|
|
Generally the first thing you do when creating your own custom
|
|
kernel is to strip out all the drivers and services you do not use.
|
|
Removing things like
|
|
.Dv INET6
|
|
and drivers you do not have will reduce the size of your kernel, sometimes
|
|
by a megabyte or more, leaving more memory available for applications.
|
|
.Pp
|
|
.Dv SCSI_DELAY
|
|
may be used to reduce system boot times.
|
|
The defaults are fairly high and
|
|
can be responsible for 5+ seconds of delay in the boot process.
|
|
Reducing
|
|
.Dv SCSI_DELAY
|
|
to something below 5 seconds could work (especially with modern drives).
|
|
.Pp
|
|
There are a number of
|
|
.Dv *_CPU
|
|
options that can be commented out.
|
|
If you only want the kernel to run
|
|
on a Pentium class CPU, you can easily remove
|
|
.Dv I486_CPU ,
|
|
but only remove
|
|
.Dv I586_CPU
|
|
if you are sure your CPU is being recognized as a Pentium II or better.
|
|
Some clones may be recognized as a Pentium or even a 486 and not be able
|
|
to boot without those options.
|
|
If it works, great!
|
|
The operating system
|
|
will be able to better use higher-end CPU features for MMU, task switching,
|
|
timebase, and even device operations.
|
|
Additionally, higher-end CPUs support
|
|
4MB MMU pages, which the kernel uses to map the kernel itself into memory,
|
|
increasing its efficiency under heavy syscall loads.
|
|
.Sh IDE WRITE CACHING
|
|
.Fx 4.3
|
|
flirted with turning off IDE write caching.
|
|
This reduced write bandwidth
|
|
to IDE disks but was considered necessary due to serious data consistency
|
|
issues introduced by hard drive vendors.
|
|
Basically the problem is that
|
|
IDE drives lie about when a write completes.
|
|
With IDE write caching turned
|
|
on, IDE hard drives will not only write data to disk out of order, they
|
|
will sometimes delay some of the blocks indefinitely under heavy disk
|
|
load.
|
|
A crash or power failure can result in serious file system
|
|
corruption.
|
|
So our default was changed to be safe.
|
|
Unfortunately, the
|
|
result was such a huge loss in performance that we caved in and changed the
|
|
default back to on after the release.
|
|
You should check the default on
|
|
your system by observing the
|
|
.Va hw.ata.wc
|
|
sysctl variable.
|
|
If IDE write caching is turned off, you can turn it back
|
|
on by setting the
|
|
.Va hw.ata.wc
|
|
loader tunable to 1.
|
|
More information on tuning the ATA driver system may be found in the
|
|
.Xr ata 4
|
|
manual page.
|
|
If you need performance, go with SCSI.
|
|
.Sh CPU, MEMORY, DISK, NETWORK
|
|
The type of tuning you do depends heavily on where your system begins to
|
|
bottleneck as load increases.
|
|
If your system runs out of CPU (idle times
|
|
are perpetually 0%) then you need to consider upgrading the CPU or moving to
|
|
an SMP motherboard (multiple CPU's), or perhaps you need to revisit the
|
|
programs that are causing the load and try to optimize them.
|
|
If your system
|
|
is paging to swap a lot you need to consider adding more memory.
|
|
If your
|
|
system is saturating the disk you typically see high CPU idle times and
|
|
total disk saturation.
|
|
.Xr systat 1
|
|
can be used to monitor this.
|
|
There are many solutions to saturated disks:
|
|
increasing memory for caching, mirroring disks, distributing operations across
|
|
several machines, and so forth.
|
|
If disk performance is an issue and you
|
|
are using IDE drives, switching to SCSI can help a great deal.
|
|
While modern
|
|
IDE drives compare with SCSI in raw sequential bandwidth, the moment you
|
|
start seeking around the disk SCSI drives usually win.
|
|
.Pp
|
|
Finally, you might run out of network suds.
|
|
The first line of defense for
|
|
improving network performance is to make sure you are using switches instead
|
|
of hubs, especially these days where switches are almost as cheap.
|
|
Hubs
|
|
have severe problems under heavy loads due to collision back-off and one bad
|
|
host can severely degrade the entire LAN.
|
|
Second, optimize the network path
|
|
as much as possible.
|
|
For example, in
|
|
.Xr firewall 7
|
|
we describe a firewall protecting internal hosts with a topology where
|
|
the externally visible hosts are not routed through it.
|
|
Use 100BaseT rather
|
|
than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs.
|
|
Most bottlenecks occur at the WAN link (e.g.\&
|
|
modem, T1, DSL, whatever).
|
|
If expanding the link is not an option it may be possible to use the
|
|
.Xr dummynet 4
|
|
feature to implement peak shaving or other forms of traffic shaping to
|
|
prevent the overloaded service (such as web services) from affecting other
|
|
services (such as email), or vice versa.
|
|
In home installations this could
|
|
be used to give interactive traffic (your browser,
|
|
.Xr ssh 1
|
|
logins) priority
|
|
over services you export from your box (web services, email).
|
|
.Sh SEE ALSO
|
|
.Xr netstat 1 ,
|
|
.Xr systat 1 ,
|
|
.Xr sendfile 2 ,
|
|
.Xr ata 4 ,
|
|
.Xr dummynet 4 ,
|
|
.Xr login.conf 5 ,
|
|
.Xr rc.conf 5 ,
|
|
.Xr sysctl.conf 5 ,
|
|
.Xr firewall 7 ,
|
|
.Xr eventtimers 7 ,
|
|
.Xr hier 7 ,
|
|
.Xr ports 7 ,
|
|
.Xr boot 8 ,
|
|
.Xr bsdlabel 8 ,
|
|
.Xr ccdconfig 8 ,
|
|
.Xr config 8 ,
|
|
.Xr fsck 8 ,
|
|
.Xr gjournal 8 ,
|
|
.Xr gstripe 8 ,
|
|
.Xr gvinum 8 ,
|
|
.Xr ifconfig 8 ,
|
|
.Xr ipfw 8 ,
|
|
.Xr loader 8 ,
|
|
.Xr mount 8 ,
|
|
.Xr newfs 8 ,
|
|
.Xr route 8 ,
|
|
.Xr sysctl 8 ,
|
|
.Xr sysinstall 8 ,
|
|
.Xr tunefs 8
|
|
.Sh HISTORY
|
|
The
|
|
.Nm
|
|
manual page was originally written by
|
|
.An Matthew Dillon
|
|
and first appeared
|
|
in
|
|
.Fx 4.3 ,
|
|
May 2001.
|