mirror of
https://git.FreeBSD.org/src.git
synced 2024-12-22 11:17:19 +00:00
50d922a02e
PR: 167776 Submitted by: Nobuyuki Koganemaru (kogane!jp.freebsd.org) MFC after: 3 days
1107 lines
32 KiB
Groff
1107 lines
32 KiB
Groff
.\" Copyright (c) 2007 Seccuris Inc.
|
|
.\" All rights reserved.
|
|
.\"
|
|
.\" This software was developed by Robert N. M. Watson under contract to
|
|
.\" Seccuris Inc.
|
|
.\"
|
|
.\" Redistribution and use in source and binary forms, with or without
|
|
.\" modification, are permitted provided that the following conditions
|
|
.\" are met:
|
|
.\" 1. Redistributions of source code must retain the above copyright
|
|
.\" notice, this list of conditions and the following disclaimer.
|
|
.\" 2. Redistributions in binary form must reproduce the above copyright
|
|
.\" notice, this list of conditions and the following disclaimer in the
|
|
.\" documentation and/or other materials provided with the distribution.
|
|
.\"
|
|
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
|
|
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
|
|
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
.\" SUCH DAMAGE.
|
|
.\"
|
|
.\" Copyright (c) 1990 The Regents of the University of California.
|
|
.\" All rights reserved.
|
|
.\"
|
|
.\" Redistribution and use in source and binary forms, with or without
|
|
.\" modification, are permitted provided that: (1) source code distributions
|
|
.\" retain the above copyright notice and this paragraph in its entirety, (2)
|
|
.\" distributions including binary code include the above copyright notice and
|
|
.\" this paragraph in its entirety in the documentation or other materials
|
|
.\" provided with the distribution, and (3) all advertising materials mentioning
|
|
.\" features or use of this software display the following acknowledgement:
|
|
.\" ``This product includes software developed by the University of California,
|
|
.\" Lawrence Berkeley Laboratory and its contributors.'' Neither the name of
|
|
.\" the University nor the names of its contributors may be used to endorse
|
|
.\" or promote products derived from this software without specific prior
|
|
.\" written permission.
|
|
.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
|
|
.\" WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
|
|
.\" MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
|
|
.\"
|
|
.\" This document is derived in part from the enet man page (enet.4)
|
|
.\" distributed with 4.3BSD Unix.
|
|
.\"
|
|
.\" $FreeBSD$
|
|
.\"
|
|
.Dd June 15, 2010
|
|
.Dt BPF 4
|
|
.Os
|
|
.Sh NAME
|
|
.Nm bpf
|
|
.Nd Berkeley Packet Filter
|
|
.Sh SYNOPSIS
|
|
.Cd device bpf
|
|
.Sh DESCRIPTION
|
|
The Berkeley Packet Filter
|
|
provides a raw interface to data link layers in a protocol
|
|
independent fashion.
|
|
All packets on the network, even those destined for other hosts,
|
|
are accessible through this mechanism.
|
|
.Pp
|
|
The packet filter appears as a character special device,
|
|
.Pa /dev/bpf .
|
|
After opening the device, the file descriptor must be bound to a
|
|
specific network interface with the
|
|
.Dv BIOCSETIF
|
|
ioctl.
|
|
A given interface can be shared by multiple listeners, and the filter
|
|
underlying each descriptor will see an identical packet stream.
|
|
.Pp
|
|
A separate device file is required for each minor device.
|
|
If a file is in use, the open will fail and
|
|
.Va errno
|
|
will be set to
|
|
.Er EBUSY .
|
|
.Pp
|
|
Associated with each open instance of a
|
|
.Nm
|
|
file is a user-settable packet filter.
|
|
Whenever a packet is received by an interface,
|
|
all file descriptors listening on that interface apply their filter.
|
|
Each descriptor that accepts the packet receives its own copy.
|
|
.Pp
|
|
The packet filter will support any link level protocol that has fixed length
|
|
headers.
|
|
Currently, only Ethernet,
|
|
.Tn SLIP ,
|
|
and
|
|
.Tn PPP
|
|
drivers have been modified to interact with
|
|
.Nm .
|
|
.Pp
|
|
Since packet data is in network byte order, applications should use the
|
|
.Xr byteorder 3
|
|
macros to extract multi-byte values.
|
|
.Pp
|
|
A packet can be sent out on the network by writing to a
|
|
.Nm
|
|
file descriptor.
|
|
The writes are unbuffered, meaning only one packet can be processed per write.
|
|
Currently, only writes to Ethernets and
|
|
.Tn SLIP
|
|
links are supported.
|
|
.Sh BUFFER MODES
|
|
.Nm
|
|
devices deliver packet data to the application via memory buffers provided by
|
|
the application.
|
|
The buffer mode is set using the
|
|
.Dv BIOCSETBUFMODE
|
|
ioctl, and read using the
|
|
.Dv BIOCGETBUFMODE
|
|
ioctl.
|
|
.Ss Buffered read mode
|
|
By default,
|
|
.Nm
|
|
devices operate in the
|
|
.Dv BPF_BUFMODE_BUFFER
|
|
mode, in which packet data is copied explicitly from kernel to user memory
|
|
using the
|
|
.Xr read 2
|
|
system call.
|
|
The user process will declare a fixed buffer size that will be used both for
|
|
sizing internal buffers and for all
|
|
.Xr read 2
|
|
operations on the file.
|
|
This size is queried using the
|
|
.Dv BIOCGBLEN
|
|
ioctl, and is set using the
|
|
.Dv BIOCSBLEN
|
|
ioctl.
|
|
Note that an individual packet larger than the buffer size is necessarily
|
|
truncated.
|
|
.Ss Zero-copy buffer mode
|
|
.Nm
|
|
devices may also operate in the
|
|
.Dv BPF_BUFMODE_ZEROCOPY
|
|
mode, in which packet data is written directly into two user memory buffers
|
|
by the kernel, avoiding both system call and copying overhead.
|
|
Buffers are of fixed (and equal) size, page-aligned, and an even multiple of
|
|
the page size.
|
|
The maximum zero-copy buffer size is returned by the
|
|
.Dv BIOCGETZMAX
|
|
ioctl.
|
|
Note that an individual packet larger than the buffer size is necessarily
|
|
truncated.
|
|
.Pp
|
|
The user process registers two memory buffers using the
|
|
.Dv BIOCSETZBUF
|
|
ioctl, which accepts a
|
|
.Vt struct bpf_zbuf
|
|
pointer as an argument:
|
|
.Bd -literal
|
|
struct bpf_zbuf {
|
|
void *bz_bufa;
|
|
void *bz_bufb;
|
|
size_t bz_buflen;
|
|
};
|
|
.Ed
|
|
.Pp
|
|
.Vt bz_bufa
|
|
is a pointer to the userspace address of the first buffer that will be
|
|
filled, and
|
|
.Vt bz_bufb
|
|
is a pointer to the second buffer.
|
|
.Nm
|
|
will then cycle between the two buffers as they fill and are acknowledged.
|
|
.Pp
|
|
Each buffer begins with a fixed-length header to hold synchronization and
|
|
data length information for the buffer:
|
|
.Bd -literal
|
|
struct bpf_zbuf_header {
|
|
volatile u_int bzh_kernel_gen; /* Kernel generation number. */
|
|
volatile u_int bzh_kernel_len; /* Length of data in the buffer. */
|
|
volatile u_int bzh_user_gen; /* User generation number. */
|
|
/* ...padding for future use... */
|
|
};
|
|
.Ed
|
|
.Pp
|
|
The header structure of each buffer, including all padding, should be zeroed
|
|
before it is configured using
|
|
.Dv BIOCSETZBUF .
|
|
Remaining space in the buffer will be used by the kernel to store packet
|
|
data, laid out in the same format as with buffered read mode.
|
|
.Pp
|
|
The kernel and the user process follow a simple acknowledgement protocol via
|
|
the buffer header to synchronize access to the buffer: when the header
|
|
generation numbers,
|
|
.Vt bzh_kernel_gen
|
|
and
|
|
.Vt bzh_user_gen ,
|
|
hold the same value, the kernel owns the buffer, and when they differ,
|
|
userspace owns the buffer.
|
|
.Pp
|
|
While the kernel owns the buffer, the contents are unstable and may change
|
|
asynchronously; while the user process owns the buffer, its contents are
|
|
stable and will not be changed until the buffer has been acknowledged.
|
|
.Pp
|
|
Initializing the buffer headers to all 0's before registering the buffer has
|
|
the effect of assigning initial ownership of both buffers to the kernel.
|
|
The kernel signals that a buffer has been assigned to userspace by modifying
|
|
.Vt bzh_kernel_gen ,
|
|
and userspace acknowledges the buffer and returns it to the kernel by setting
|
|
the value of
|
|
.Vt bzh_user_gen
|
|
to the value of
|
|
.Vt bzh_kernel_gen .
|
|
.Pp
|
|
In order to avoid caching and memory re-ordering effects, the user process
|
|
must use atomic operations and memory barriers when checking for and
|
|
acknowledging buffers:
|
|
.Bd -literal
|
|
#include <machine/atomic.h>
|
|
|
|
/*
|
|
* Return ownership of a buffer to the kernel for reuse.
|
|
*/
|
|
static void
|
|
buffer_acknowledge(struct bpf_zbuf_header *bzh)
|
|
{
|
|
|
|
atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
|
|
}
|
|
|
|
/*
|
|
* Check whether a buffer has been assigned to userspace by the kernel.
|
|
* Return true if userspace owns the buffer, and false otherwise.
|
|
*/
|
|
static int
|
|
buffer_check(struct bpf_zbuf_header *bzh)
|
|
{
|
|
|
|
return (bzh->bzh_user_gen !=
|
|
atomic_load_acq_int(&bzh->bzh_kernel_gen));
|
|
}
|
|
.Ed
|
|
.Pp
|
|
The user process may force the assignment of the next buffer, if any data
|
|
is pending, to userspace using the
|
|
.Dv BIOCROTZBUF
|
|
ioctl.
|
|
This allows the user process to retrieve data in a partially filled buffer
|
|
before the buffer is full, such as following a timeout; the process must
|
|
recheck for buffer ownership using the header generation numbers, as the
|
|
buffer will not be assigned to userspace if no data was present.
|
|
.Pp
|
|
As in the buffered read mode,
|
|
.Xr kqueue 2 ,
|
|
.Xr poll 2 ,
|
|
and
|
|
.Xr select 2
|
|
may be used to sleep awaiting the availability of a completed buffer.
|
|
They will return a readable file descriptor when ownership of the next buffer
|
|
is assigned to user space.
|
|
.Pp
|
|
In the current implementation, the kernel may assign zero, one, or both
|
|
buffers to the user process; however, an earlier implementation maintained
|
|
the invariant that at most one buffer could be assigned to the user process
|
|
at a time.
|
|
In order to both ensure progress and high performance, user processes should
|
|
acknowledge a completely processed buffer as quickly as possible, returning
|
|
it for reuse, and not block waiting on a second buffer while holding another
|
|
buffer.
|
|
.Sh IOCTLS
|
|
The
|
|
.Xr ioctl 2
|
|
command codes below are defined in
|
|
.In net/bpf.h .
|
|
All commands require
|
|
these includes:
|
|
.Bd -literal
|
|
#include <sys/types.h>
|
|
#include <sys/time.h>
|
|
#include <sys/ioctl.h>
|
|
#include <net/bpf.h>
|
|
.Ed
|
|
.Pp
|
|
Additionally,
|
|
.Dv BIOCGETIF
|
|
and
|
|
.Dv BIOCSETIF
|
|
require
|
|
.In sys/socket.h
|
|
and
|
|
.In net/if.h .
|
|
.Pp
|
|
In addition to
|
|
.Dv FIONREAD
|
|
and
|
|
.Dv SIOCGIFADDR ,
|
|
the following commands may be applied to any open
|
|
.Nm
|
|
file.
|
|
The (third) argument to
|
|
.Xr ioctl 2
|
|
should be a pointer to the type indicated.
|
|
.Bl -tag -width BIOCGETBUFMODE
|
|
.It Dv BIOCGBLEN
|
|
.Pq Li u_int
|
|
Returns the required buffer length for reads on
|
|
.Nm
|
|
files.
|
|
.It Dv BIOCSBLEN
|
|
.Pq Li u_int
|
|
Sets the buffer length for reads on
|
|
.Nm
|
|
files.
|
|
The buffer must be set before the file is attached to an interface
|
|
with
|
|
.Dv BIOCSETIF .
|
|
If the requested buffer size cannot be accommodated, the closest
|
|
allowable size will be set and returned in the argument.
|
|
A read call will result in
|
|
.Er EIO
|
|
if it is passed a buffer that is not this size.
|
|
.It Dv BIOCGDLT
|
|
.Pq Li u_int
|
|
Returns the type of the data link layer underlying the attached interface.
|
|
.Er EINVAL
|
|
is returned if no interface has been specified.
|
|
The device types, prefixed with
|
|
.Dq Li DLT_ ,
|
|
are defined in
|
|
.In net/bpf.h .
|
|
.It Dv BIOCPROMISC
|
|
Forces the interface into promiscuous mode.
|
|
All packets, not just those destined for the local host, are processed.
|
|
Since more than one file can be listening on a given interface,
|
|
a listener that opened its interface non-promiscuously may receive
|
|
packets promiscuously.
|
|
This problem can be remedied with an appropriate filter.
|
|
.It Dv BIOCFLUSH
|
|
Flushes the buffer of incoming packets,
|
|
and resets the statistics that are returned by BIOCGSTATS.
|
|
.It Dv BIOCGETIF
|
|
.Pq Li "struct ifreq"
|
|
Returns the name of the hardware interface that the file is listening on.
|
|
The name is returned in the ifr_name field of
|
|
the
|
|
.Li ifreq
|
|
structure.
|
|
All other fields are undefined.
|
|
.It Dv BIOCSETIF
|
|
.Pq Li "struct ifreq"
|
|
Sets the hardware interface associate with the file.
|
|
This
|
|
command must be performed before any packets can be read.
|
|
The device is indicated by name using the
|
|
.Li ifr_name
|
|
field of the
|
|
.Li ifreq
|
|
structure.
|
|
Additionally, performs the actions of
|
|
.Dv BIOCFLUSH .
|
|
.It Dv BIOCSRTIMEOUT
|
|
.It Dv BIOCGRTIMEOUT
|
|
.Pq Li "struct timeval"
|
|
Set or get the read timeout parameter.
|
|
The argument
|
|
specifies the length of time to wait before timing
|
|
out on a read request.
|
|
This parameter is initialized to zero by
|
|
.Xr open 2 ,
|
|
indicating no timeout.
|
|
.It Dv BIOCGSTATS
|
|
.Pq Li "struct bpf_stat"
|
|
Returns the following structure of packet statistics:
|
|
.Bd -literal
|
|
struct bpf_stat {
|
|
u_int bs_recv; /* number of packets received */
|
|
u_int bs_drop; /* number of packets dropped */
|
|
};
|
|
.Ed
|
|
.Pp
|
|
The fields are:
|
|
.Bl -hang -offset indent
|
|
.It Li bs_recv
|
|
the number of packets received by the descriptor since opened or reset
|
|
(including any buffered since the last read call);
|
|
and
|
|
.It Li bs_drop
|
|
the number of packets which were accepted by the filter but dropped by the
|
|
kernel because of buffer overflows
|
|
(i.e., the application's reads are not keeping up with the packet traffic).
|
|
.El
|
|
.It Dv BIOCIMMEDIATE
|
|
.Pq Li u_int
|
|
Enable or disable
|
|
.Dq immediate mode ,
|
|
based on the truth value of the argument.
|
|
When immediate mode is enabled, reads return immediately upon packet
|
|
reception.
|
|
Otherwise, a read will block until either the kernel buffer
|
|
becomes full or a timeout occurs.
|
|
This is useful for programs like
|
|
.Xr rarpd 8
|
|
which must respond to messages in real time.
|
|
The default for a new file is off.
|
|
.It Dv BIOCSETF
|
|
.It Dv BIOCSETFNR
|
|
.Pq Li "struct bpf_program"
|
|
Sets the read filter program used by the kernel to discard uninteresting
|
|
packets.
|
|
An array of instructions and its length is passed in using
|
|
the following structure:
|
|
.Bd -literal
|
|
struct bpf_program {
|
|
int bf_len;
|
|
struct bpf_insn *bf_insns;
|
|
};
|
|
.Ed
|
|
.Pp
|
|
The filter program is pointed to by the
|
|
.Li bf_insns
|
|
field while its length in units of
|
|
.Sq Li struct bpf_insn
|
|
is given by the
|
|
.Li bf_len
|
|
field.
|
|
See section
|
|
.Sx "FILTER MACHINE"
|
|
for an explanation of the filter language.
|
|
The only difference between
|
|
.Dv BIOCSETF
|
|
and
|
|
.Dv BIOCSETFNR
|
|
is
|
|
.Dv BIOCSETF
|
|
performs the actions of
|
|
.Dv BIOCFLUSH
|
|
while
|
|
.Dv BIOCSETFNR
|
|
does not.
|
|
.It Dv BIOCSETWF
|
|
.Pq Li "struct bpf_program"
|
|
Sets the write filter program used by the kernel to control what type of
|
|
packets can be written to the interface.
|
|
See the
|
|
.Dv BIOCSETF
|
|
command for more
|
|
information on the
|
|
.Nm
|
|
filter program.
|
|
.It Dv BIOCVERSION
|
|
.Pq Li "struct bpf_version"
|
|
Returns the major and minor version numbers of the filter language currently
|
|
recognized by the kernel.
|
|
Before installing a filter, applications must check
|
|
that the current version is compatible with the running kernel.
|
|
Version numbers are compatible if the major numbers match and the application minor
|
|
is less than or equal to the kernel minor.
|
|
The kernel version number is returned in the following structure:
|
|
.Bd -literal
|
|
struct bpf_version {
|
|
u_short bv_major;
|
|
u_short bv_minor;
|
|
};
|
|
.Ed
|
|
.Pp
|
|
The current version numbers are given by
|
|
.Dv BPF_MAJOR_VERSION
|
|
and
|
|
.Dv BPF_MINOR_VERSION
|
|
from
|
|
.In net/bpf.h .
|
|
An incompatible filter
|
|
may result in undefined behavior (most likely, an error returned by
|
|
.Fn ioctl
|
|
or haphazard packet matching).
|
|
.It Dv BIOCSHDRCMPLT
|
|
.It Dv BIOCGHDRCMPLT
|
|
.Pq Li u_int
|
|
Set or get the status of the
|
|
.Dq header complete
|
|
flag.
|
|
Set to zero if the link level source address should be filled in automatically
|
|
by the interface output routine.
|
|
Set to one if the link level source
|
|
address will be written, as provided, to the wire.
|
|
This flag is initialized to zero by default.
|
|
.It Dv BIOCSSEESENT
|
|
.It Dv BIOCGSEESENT
|
|
.Pq Li u_int
|
|
These commands are obsolete but left for compatibility.
|
|
Use
|
|
.Dv BIOCSDIRECTION
|
|
and
|
|
.Dv BIOCGDIRECTION
|
|
instead.
|
|
Set or get the flag determining whether locally generated packets on the
|
|
interface should be returned by BPF.
|
|
Set to zero to see only incoming packets on the interface.
|
|
Set to one to see packets originating locally and remotely on the interface.
|
|
This flag is initialized to one by default.
|
|
.It Dv BIOCSDIRECTION
|
|
.It Dv BIOCGDIRECTION
|
|
.Pq Li u_int
|
|
Set or get the setting determining whether incoming, outgoing, or all packets
|
|
on the interface should be returned by BPF.
|
|
Set to
|
|
.Dv BPF_D_IN
|
|
to see only incoming packets on the interface.
|
|
Set to
|
|
.Dv BPF_D_INOUT
|
|
to see packets originating locally and remotely on the interface.
|
|
Set to
|
|
.Dv BPF_D_OUT
|
|
to see only outgoing packets on the interface.
|
|
This setting is initialized to
|
|
.Dv BPF_D_INOUT
|
|
by default.
|
|
.It Dv BIOCSTSTAMP
|
|
.It Dv BIOCGTSTAMP
|
|
.Pq Li u_int
|
|
Set or get format and resolution of the time stamps returned by BPF.
|
|
Set to
|
|
.Dv BPF_T_MICROTIME ,
|
|
.Dv BPF_T_MICROTIME_FAST ,
|
|
.Dv BPF_T_MICROTIME_MONOTONIC ,
|
|
or
|
|
.Dv BPF_T_MICROTIME_MONOTONIC_FAST
|
|
to get time stamps in 64-bit
|
|
.Vt struct timeval
|
|
format.
|
|
Set to
|
|
.Dv BPF_T_NANOTIME ,
|
|
.Dv BPF_T_NANOTIME_FAST ,
|
|
.Dv BPF_T_NANOTIME_MONOTONIC ,
|
|
or
|
|
.Dv BPF_T_NANOTIME_MONOTONIC_FAST
|
|
to get time stamps in 64-bit
|
|
.Vt struct timespec
|
|
format.
|
|
Set to
|
|
.Dv BPF_T_BINTIME ,
|
|
.Dv BPF_T_BINTIME_FAST ,
|
|
.Dv BPF_T_NANOTIME_MONOTONIC ,
|
|
or
|
|
.Dv BPF_T_BINTIME_MONOTONIC_FAST
|
|
to get time stamps in 64-bit
|
|
.Vt struct bintime
|
|
format.
|
|
Set to
|
|
.Dv BPF_T_NONE
|
|
to ignore time stamp.
|
|
All 64-bit time stamp formats are wrapped in
|
|
.Vt struct bpf_ts .
|
|
The
|
|
.Dv BPF_T_MICROTIME_FAST ,
|
|
.Dv BPF_T_NANOTIME_FAST ,
|
|
.Dv BPF_T_BINTIME_FAST ,
|
|
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
|
|
.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
|
|
and
|
|
.Dv BPF_T_BINTIME_MONOTONIC_FAST
|
|
are analogs of corresponding formats without _FAST suffix but do not perform
|
|
a full time counter query, so their accuracy is one timer tick.
|
|
The
|
|
.Dv BPF_T_MICROTIME_MONOTONIC ,
|
|
.Dv BPF_T_NANOTIME_MONOTONIC ,
|
|
.Dv BPF_T_BINTIME_MONOTONIC ,
|
|
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
|
|
.Dv BPF_T_NANOTIME_MONOTONIC_FAST ,
|
|
and
|
|
.Dv BPF_T_BINTIME_MONOTONIC_FAST
|
|
store the time elapsed since kernel boot.
|
|
This setting is initialized to
|
|
.Dv BPF_T_MICROTIME
|
|
by default.
|
|
.It Dv BIOCFEEDBACK
|
|
.Pq Li u_int
|
|
Set packet feedback mode.
|
|
This allows injected packets to be fed back as input to the interface when
|
|
output via the interface is successful.
|
|
When
|
|
.Dv BPF_D_INOUT
|
|
direction is set, injected outgoing packet is not returned by BPF to avoid
|
|
duplication. This flag is initialized to zero by default.
|
|
.It Dv BIOCLOCK
|
|
Set the locked flag on the
|
|
.Nm
|
|
descriptor.
|
|
This prevents the execution of
|
|
ioctl commands which could change the underlying operating parameters of
|
|
the device.
|
|
.It Dv BIOCGETBUFMODE
|
|
.It Dv BIOCSETBUFMODE
|
|
.Pq Li u_int
|
|
Get or set the current
|
|
.Nm
|
|
buffering mode; possible values are
|
|
.Dv BPF_BUFMODE_BUFFER ,
|
|
buffered read mode, and
|
|
.Dv BPF_BUFMODE_ZBUF ,
|
|
zero-copy buffer mode.
|
|
.It Dv BIOCSETZBUF
|
|
.Pq Li struct bpf_zbuf
|
|
Set the current zero-copy buffer locations; buffer locations may be
|
|
set only once zero-copy buffer mode has been selected, and prior to attaching
|
|
to an interface.
|
|
Buffers must be of identical size, page-aligned, and an integer multiple of
|
|
pages in size.
|
|
The three fields
|
|
.Vt bz_bufa ,
|
|
.Vt bz_bufb ,
|
|
and
|
|
.Vt bz_buflen
|
|
must be filled out.
|
|
If buffers have already been set for this device, the ioctl will fail.
|
|
.It Dv BIOCGETZMAX
|
|
.Pq Li size_t
|
|
Get the largest individual zero-copy buffer size allowed.
|
|
As two buffers are used in zero-copy buffer mode, the limit (in practice) is
|
|
twice the returned size.
|
|
As zero-copy buffers consume kernel address space, conservative selection of
|
|
buffer size is suggested, especially when there are multiple
|
|
.Nm
|
|
descriptors in use on 32-bit systems.
|
|
.It Dv BIOCROTZBUF
|
|
Force ownership of the next buffer to be assigned to userspace, if any data
|
|
present in the buffer.
|
|
If no data is present, the buffer will remain owned by the kernel.
|
|
This allows consumers of zero-copy buffering to implement timeouts and
|
|
retrieve partially filled buffers.
|
|
In order to handle the case where no data is present in the buffer and
|
|
therefore ownership is not assigned, the user process must check
|
|
.Vt bzh_kernel_gen
|
|
against
|
|
.Vt bzh_user_gen .
|
|
.El
|
|
.Sh BPF HEADER
|
|
One of the following structures is prepended to each packet returned by
|
|
.Xr read 2
|
|
or via a zero-copy buffer:
|
|
.Bd -literal
|
|
struct bpf_xhdr {
|
|
struct bpf_ts bh_tstamp; /* time stamp */
|
|
uint32_t bh_caplen; /* length of captured portion */
|
|
uint32_t bh_datalen; /* original length of packet */
|
|
u_short bh_hdrlen; /* length of bpf header (this struct
|
|
plus alignment padding) */
|
|
};
|
|
|
|
struct bpf_hdr {
|
|
struct timeval bh_tstamp; /* time stamp */
|
|
uint32_t bh_caplen; /* length of captured portion */
|
|
uint32_t bh_datalen; /* original length of packet */
|
|
u_short bh_hdrlen; /* length of bpf header (this struct
|
|
plus alignment padding) */
|
|
};
|
|
.Ed
|
|
.Pp
|
|
The fields, whose values are stored in host order, and are:
|
|
.Pp
|
|
.Bl -tag -compact -width bh_datalen
|
|
.It Li bh_tstamp
|
|
The time at which the packet was processed by the packet filter.
|
|
.It Li bh_caplen
|
|
The length of the captured portion of the packet.
|
|
This is the minimum of
|
|
the truncation amount specified by the filter and the length of the packet.
|
|
.It Li bh_datalen
|
|
The length of the packet off the wire.
|
|
This value is independent of the truncation amount specified by the filter.
|
|
.It Li bh_hdrlen
|
|
The length of the
|
|
.Nm
|
|
header, which may not be equal to
|
|
.\" XXX - not really a function call
|
|
.Fn sizeof "struct bpf_xhdr"
|
|
or
|
|
.Fn sizeof "struct bpf_hdr" .
|
|
.El
|
|
.Pp
|
|
The
|
|
.Li bh_hdrlen
|
|
field exists to account for
|
|
padding between the header and the link level protocol.
|
|
The purpose here is to guarantee proper alignment of the packet
|
|
data structures, which is required on alignment sensitive
|
|
architectures and improves performance on many other architectures.
|
|
The packet filter ensures that the
|
|
.Vt bpf_xhdr ,
|
|
.Vt bpf_hdr
|
|
and the network layer
|
|
header will be word aligned.
|
|
Currently,
|
|
.Vt bpf_hdr
|
|
is used when the time stamp is set to
|
|
.Dv BPF_T_MICROTIME ,
|
|
.Dv BPF_T_MICROTIME_FAST ,
|
|
.Dv BPF_T_MICROTIME_MONOTONIC ,
|
|
.Dv BPF_T_MICROTIME_MONOTONIC_FAST ,
|
|
or
|
|
.Dv BPF_T_NONE
|
|
for backward compatibility reasons. Otherwise,
|
|
.Vt bpf_xhdr
|
|
is used. However,
|
|
.Vt bpf_hdr
|
|
may be deprecated in the near future.
|
|
Suitable precautions
|
|
must be taken when accessing the link layer protocol fields on alignment
|
|
restricted machines.
|
|
(This is not a problem on an Ethernet, since
|
|
the type field is a short falling on an even offset,
|
|
and the addresses are probably accessed in a bytewise fashion).
|
|
.Pp
|
|
Additionally, individual packets are padded so that each starts
|
|
on a word boundary.
|
|
This requires that an application
|
|
has some knowledge of how to get from packet to packet.
|
|
The macro
|
|
.Dv BPF_WORDALIGN
|
|
is defined in
|
|
.In net/bpf.h
|
|
to facilitate
|
|
this process.
|
|
It rounds up its argument to the nearest word aligned value (where a word is
|
|
.Dv BPF_ALIGNMENT
|
|
bytes wide).
|
|
.Pp
|
|
For example, if
|
|
.Sq Li p
|
|
points to the start of a packet, this expression
|
|
will advance it to the next packet:
|
|
.Dl p = (char *)p + BPF_WORDALIGN(p->bh_hdrlen + p->bh_caplen)
|
|
.Pp
|
|
For the alignment mechanisms to work properly, the
|
|
buffer passed to
|
|
.Xr read 2
|
|
must itself be word aligned.
|
|
The
|
|
.Xr malloc 3
|
|
function
|
|
will always return an aligned buffer.
|
|
.Sh FILTER MACHINE
|
|
A filter program is an array of instructions, with all branches forwardly
|
|
directed, terminated by a
|
|
.Em return
|
|
instruction.
|
|
Each instruction performs some action on the pseudo-machine state,
|
|
which consists of an accumulator, index register, scratch memory store,
|
|
and implicit program counter.
|
|
.Pp
|
|
The following structure defines the instruction format:
|
|
.Bd -literal
|
|
struct bpf_insn {
|
|
u_short code;
|
|
u_char jt;
|
|
u_char jf;
|
|
u_long k;
|
|
};
|
|
.Ed
|
|
.Pp
|
|
The
|
|
.Li k
|
|
field is used in different ways by different instructions,
|
|
and the
|
|
.Li jt
|
|
and
|
|
.Li jf
|
|
fields are used as offsets
|
|
by the branch instructions.
|
|
The opcodes are encoded in a semi-hierarchical fashion.
|
|
There are eight classes of instructions:
|
|
.Dv BPF_LD ,
|
|
.Dv BPF_LDX ,
|
|
.Dv BPF_ST ,
|
|
.Dv BPF_STX ,
|
|
.Dv BPF_ALU ,
|
|
.Dv BPF_JMP ,
|
|
.Dv BPF_RET ,
|
|
and
|
|
.Dv BPF_MISC .
|
|
Various other mode and
|
|
operator bits are or'd into the class to give the actual instructions.
|
|
The classes and modes are defined in
|
|
.In net/bpf.h .
|
|
.Pp
|
|
Below are the semantics for each defined
|
|
.Nm
|
|
instruction.
|
|
We use the convention that A is the accumulator, X is the index register,
|
|
P[] packet data, and M[] scratch memory store.
|
|
P[i:n] gives the data at byte offset
|
|
.Dq i
|
|
in the packet,
|
|
interpreted as a word (n=4),
|
|
unsigned halfword (n=2), or unsigned byte (n=1).
|
|
M[i] gives the i'th word in the scratch memory store, which is only
|
|
addressed in word units.
|
|
The memory store is indexed from 0 to
|
|
.Dv BPF_MEMWORDS
|
|
- 1.
|
|
.Li k ,
|
|
.Li jt ,
|
|
and
|
|
.Li jf
|
|
are the corresponding fields in the
|
|
instruction definition.
|
|
.Dq len
|
|
refers to the length of the packet.
|
|
.Bl -tag -width BPF_STXx
|
|
.It Dv BPF_LD
|
|
These instructions copy a value into the accumulator.
|
|
The type of the source operand is specified by an
|
|
.Dq addressing mode
|
|
and can be a constant
|
|
.Pq Dv BPF_IMM ,
|
|
packet data at a fixed offset
|
|
.Pq Dv BPF_ABS ,
|
|
packet data at a variable offset
|
|
.Pq Dv BPF_IND ,
|
|
the packet length
|
|
.Pq Dv BPF_LEN ,
|
|
or a word in the scratch memory store
|
|
.Pq Dv BPF_MEM .
|
|
For
|
|
.Dv BPF_IND
|
|
and
|
|
.Dv BPF_ABS ,
|
|
the data size must be specified as a word
|
|
.Pq Dv BPF_W ,
|
|
halfword
|
|
.Pq Dv BPF_H ,
|
|
or byte
|
|
.Pq Dv BPF_B .
|
|
The semantics of all the recognized
|
|
.Dv BPF_LD
|
|
instructions follow.
|
|
.Bd -literal
|
|
BPF_LD+BPF_W+BPF_ABS A <- P[k:4]
|
|
BPF_LD+BPF_H+BPF_ABS A <- P[k:2]
|
|
BPF_LD+BPF_B+BPF_ABS A <- P[k:1]
|
|
BPF_LD+BPF_W+BPF_IND A <- P[X+k:4]
|
|
BPF_LD+BPF_H+BPF_IND A <- P[X+k:2]
|
|
BPF_LD+BPF_B+BPF_IND A <- P[X+k:1]
|
|
BPF_LD+BPF_W+BPF_LEN A <- len
|
|
BPF_LD+BPF_IMM A <- k
|
|
BPF_LD+BPF_MEM A <- M[k]
|
|
.Ed
|
|
.It Dv BPF_LDX
|
|
These instructions load a value into the index register.
|
|
Note that
|
|
the addressing modes are more restrictive than those of the accumulator loads,
|
|
but they include
|
|
.Dv BPF_MSH ,
|
|
a hack for efficiently loading the IP header length.
|
|
.Bd -literal
|
|
BPF_LDX+BPF_W+BPF_IMM X <- k
|
|
BPF_LDX+BPF_W+BPF_MEM X <- M[k]
|
|
BPF_LDX+BPF_W+BPF_LEN X <- len
|
|
BPF_LDX+BPF_B+BPF_MSH X <- 4*(P[k:1]&0xf)
|
|
.Ed
|
|
.It Dv BPF_ST
|
|
This instruction stores the accumulator into the scratch memory.
|
|
We do not need an addressing mode since there is only one possibility
|
|
for the destination.
|
|
.Bd -literal
|
|
BPF_ST M[k] <- A
|
|
.Ed
|
|
.It Dv BPF_STX
|
|
This instruction stores the index register in the scratch memory store.
|
|
.Bd -literal
|
|
BPF_STX M[k] <- X
|
|
.Ed
|
|
.It Dv BPF_ALU
|
|
The alu instructions perform operations between the accumulator and
|
|
index register or constant, and store the result back in the accumulator.
|
|
For binary operations, a source mode is required
|
|
.Dv ( BPF_K
|
|
or
|
|
.Dv BPF_X ) .
|
|
.Bd -literal
|
|
BPF_ALU+BPF_ADD+BPF_K A <- A + k
|
|
BPF_ALU+BPF_SUB+BPF_K A <- A - k
|
|
BPF_ALU+BPF_MUL+BPF_K A <- A * k
|
|
BPF_ALU+BPF_DIV+BPF_K A <- A / k
|
|
BPF_ALU+BPF_AND+BPF_K A <- A & k
|
|
BPF_ALU+BPF_OR+BPF_K A <- A | k
|
|
BPF_ALU+BPF_LSH+BPF_K A <- A << k
|
|
BPF_ALU+BPF_RSH+BPF_K A <- A >> k
|
|
BPF_ALU+BPF_ADD+BPF_X A <- A + X
|
|
BPF_ALU+BPF_SUB+BPF_X A <- A - X
|
|
BPF_ALU+BPF_MUL+BPF_X A <- A * X
|
|
BPF_ALU+BPF_DIV+BPF_X A <- A / X
|
|
BPF_ALU+BPF_AND+BPF_X A <- A & X
|
|
BPF_ALU+BPF_OR+BPF_X A <- A | X
|
|
BPF_ALU+BPF_LSH+BPF_X A <- A << X
|
|
BPF_ALU+BPF_RSH+BPF_X A <- A >> X
|
|
BPF_ALU+BPF_NEG A <- -A
|
|
.Ed
|
|
.It Dv BPF_JMP
|
|
The jump instructions alter flow of control.
|
|
Conditional jumps
|
|
compare the accumulator against a constant
|
|
.Pq Dv BPF_K
|
|
or the index register
|
|
.Pq Dv BPF_X .
|
|
If the result is true (or non-zero),
|
|
the true branch is taken, otherwise the false branch is taken.
|
|
Jump offsets are encoded in 8 bits so the longest jump is 256 instructions.
|
|
However, the jump always
|
|
.Pq Dv BPF_JA
|
|
opcode uses the 32 bit
|
|
.Li k
|
|
field as the offset, allowing arbitrarily distant destinations.
|
|
All conditionals use unsigned comparison conventions.
|
|
.Bd -literal
|
|
BPF_JMP+BPF_JA pc += k
|
|
BPF_JMP+BPF_JGT+BPF_K pc += (A > k) ? jt : jf
|
|
BPF_JMP+BPF_JGE+BPF_K pc += (A >= k) ? jt : jf
|
|
BPF_JMP+BPF_JEQ+BPF_K pc += (A == k) ? jt : jf
|
|
BPF_JMP+BPF_JSET+BPF_K pc += (A & k) ? jt : jf
|
|
BPF_JMP+BPF_JGT+BPF_X pc += (A > X) ? jt : jf
|
|
BPF_JMP+BPF_JGE+BPF_X pc += (A >= X) ? jt : jf
|
|
BPF_JMP+BPF_JEQ+BPF_X pc += (A == X) ? jt : jf
|
|
BPF_JMP+BPF_JSET+BPF_X pc += (A & X) ? jt : jf
|
|
.Ed
|
|
.It Dv BPF_RET
|
|
The return instructions terminate the filter program and specify the amount
|
|
of packet to accept (i.e., they return the truncation amount).
|
|
A return value of zero indicates that the packet should be ignored.
|
|
The return value is either a constant
|
|
.Pq Dv BPF_K
|
|
or the accumulator
|
|
.Pq Dv BPF_A .
|
|
.Bd -literal
|
|
BPF_RET+BPF_A accept A bytes
|
|
BPF_RET+BPF_K accept k bytes
|
|
.Ed
|
|
.It Dv BPF_MISC
|
|
The miscellaneous category was created for anything that does not
|
|
fit into the above classes, and for any new instructions that might need to
|
|
be added.
|
|
Currently, these are the register transfer instructions
|
|
that copy the index register to the accumulator or vice versa.
|
|
.Bd -literal
|
|
BPF_MISC+BPF_TAX X <- A
|
|
BPF_MISC+BPF_TXA A <- X
|
|
.Ed
|
|
.El
|
|
.Pp
|
|
The
|
|
.Nm
|
|
interface provides the following macros to facilitate
|
|
array initializers:
|
|
.Fn BPF_STMT opcode operand
|
|
and
|
|
.Fn BPF_JUMP opcode operand true_offset false_offset .
|
|
.Sh SYSCTL VARIABLES
|
|
A set of
|
|
.Xr sysctl 8
|
|
variables controls the behaviour of the
|
|
.Nm
|
|
subsystem
|
|
.Bl -tag -width indent
|
|
.It Va net.bpf.optimize_writers: No 0
|
|
Various programs use BPF to send (but not receive) raw packets
|
|
(cdpd, lldpd, dhcpd, dhcp relays, etc. are good examples of such programs).
|
|
They do not need incoming packets to be send to them. Turning this option on
|
|
makes new BPF users to be attached to write-only interface list until program
|
|
explicitly specifies read filter via
|
|
.Cm pcap_set_filter() .
|
|
This removes any performance degradation for high-speed interfaces.
|
|
.It Va net.bpf.stats:
|
|
Binary interface for retrieving general statistics.
|
|
.It Va net.bpf.zerocopy_enable: No 0
|
|
Permits zero-copy to be used with net BPF readers. Use with caution.
|
|
.It Va net.bpf.maxinsns: No 512
|
|
Maximum number of instructions that BPF program can contain. Use
|
|
.Xr tcpdump 1
|
|
-d option to determine approximate number of instruction for any filter.
|
|
.It Va net.bpf.maxbufsize: No 524288
|
|
Maximum buffer size to allocate for packets buffer.
|
|
.It Va net.bpf.bufsize: No 4096
|
|
Default buffer size to allocate for packets buffer.
|
|
.El
|
|
.Sh EXAMPLES
|
|
The following filter is taken from the Reverse ARP Daemon.
|
|
It accepts only Reverse ARP requests.
|
|
.Bd -literal
|
|
struct bpf_insn insns[] = {
|
|
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
|
|
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
|
|
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
|
|
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, REVARP_REQUEST, 0, 1),
|
|
BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
|
|
sizeof(struct ether_header)),
|
|
BPF_STMT(BPF_RET+BPF_K, 0),
|
|
};
|
|
.Ed
|
|
.Pp
|
|
This filter accepts only IP packets between host 128.3.112.15 and
|
|
128.3.112.35.
|
|
.Bd -literal
|
|
struct bpf_insn insns[] = {
|
|
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
|
|
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 8),
|
|
BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 26),
|
|
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 2),
|
|
BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
|
|
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 3, 4),
|
|
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 0, 3),
|
|
BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
|
|
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 1),
|
|
BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
|
|
BPF_STMT(BPF_RET+BPF_K, 0),
|
|
};
|
|
.Ed
|
|
.Pp
|
|
Finally, this filter returns only TCP finger packets.
|
|
We must parse the IP header to reach the TCP header.
|
|
The
|
|
.Dv BPF_JSET
|
|
instruction
|
|
checks that the IP fragment offset is 0 so we are sure
|
|
that we have a TCP header.
|
|
.Bd -literal
|
|
struct bpf_insn insns[] = {
|
|
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
|
|
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 10),
|
|
BPF_STMT(BPF_LD+BPF_B+BPF_ABS, 23),
|
|
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, IPPROTO_TCP, 0, 8),
|
|
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
|
|
BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, 0x1fff, 6, 0),
|
|
BPF_STMT(BPF_LDX+BPF_B+BPF_MSH, 14),
|
|
BPF_STMT(BPF_LD+BPF_H+BPF_IND, 14),
|
|
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 2, 0),
|
|
BPF_STMT(BPF_LD+BPF_H+BPF_IND, 16),
|
|
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 0, 1),
|
|
BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
|
|
BPF_STMT(BPF_RET+BPF_K, 0),
|
|
};
|
|
.Ed
|
|
.Sh SEE ALSO
|
|
.Xr tcpdump 1 ,
|
|
.Xr ioctl 2 ,
|
|
.Xr kqueue 2 ,
|
|
.Xr poll 2 ,
|
|
.Xr select 2 ,
|
|
.Xr byteorder 3 ,
|
|
.Xr ng_bpf 4 ,
|
|
.Xr bpf 9
|
|
.Rs
|
|
.%A McCanne, S.
|
|
.%A Jacobson V.
|
|
.%T "An efficient, extensible, and portable network monitor"
|
|
.Re
|
|
.Sh HISTORY
|
|
The Enet packet filter was created in 1980 by Mike Accetta and
|
|
Rick Rashid at Carnegie-Mellon University.
|
|
Jeffrey Mogul, at
|
|
Stanford, ported the code to
|
|
.Bx
|
|
and continued its development from
|
|
1983 on.
|
|
Since then, it has evolved into the Ultrix Packet Filter at
|
|
.Tn DEC ,
|
|
a
|
|
.Tn STREAMS
|
|
.Tn NIT
|
|
module under
|
|
.Tn SunOS 4.1 ,
|
|
and
|
|
.Tn BPF .
|
|
.Sh AUTHORS
|
|
.An -nosplit
|
|
.An Steven McCanne ,
|
|
of Lawrence Berkeley Laboratory, implemented BPF in
|
|
Summer 1990.
|
|
Much of the design is due to
|
|
.An Van Jacobson .
|
|
.Pp
|
|
Support for zero-copy buffers was added by
|
|
.An Robert N. M. Watson
|
|
under contract to Seccuris Inc.
|
|
.Sh BUGS
|
|
The read buffer must be of a fixed size (returned by the
|
|
.Dv BIOCGBLEN
|
|
ioctl).
|
|
.Pp
|
|
A file that does not request promiscuous mode may receive promiscuously
|
|
received packets as a side effect of another file requesting this
|
|
mode on the same hardware interface.
|
|
This could be fixed in the kernel with additional processing overhead.
|
|
However, we favor the model where
|
|
all files must assume that the interface is promiscuous, and if
|
|
so desired, must utilize a filter to reject foreign packets.
|
|
.Pp
|
|
Data link protocols with variable length headers are not currently supported.
|
|
.Pp
|
|
The
|
|
.Dv SEESENT ,
|
|
.Dv DIRECTION ,
|
|
and
|
|
.Dv FEEDBACK
|
|
settings have been observed to work incorrectly on some interface
|
|
types, including those with hardware loopback rather than software loopback,
|
|
and point-to-point interfaces.
|
|
They appear to function correctly on a
|
|
broad range of Ethernet-style interfaces.
|