mirror of
https://git.FreeBSD.org/src.git
synced 2025-01-26 16:18:31 +00:00
1052 lines
43 KiB
Plaintext
\input texinfo @c -*- texinfo -*-
|
|
@c %**start of header
|
|
@setfilename gperf.info
|
|
@settitle Perfect Hash Function Generator
|
|
@c @setchapternewpage odd
|
|
@c %**end of header
|
|
|
|
@c some day we should @include version.texi instead of defining
|
|
@c these values at hand.
|
|
@set UPDATED 26 September 2000
|
|
@set EDITION 2.7.2
|
|
@set VERSION 2.7.2
|
|
@c ---------------------
|
|
|
|
@c remove the black boxes generated in the GPL appendix.
|
|
@finalout
|
|
|
|
@c Merge functions into the concept index
|
|
@syncodeindex fn cp
|
|
@c @synindex pg cp
|
|
|
|
@dircategory Programming Tools
|
|
@direntry
|
|
* Gperf: (gperf). Perfect Hash Function Generator.
|
|
@end direntry
|
|
|
|
@ifinfo
|
|
This file documents the features of the GNU Perfect Hash Function
|
|
Generator @value{VERSION}.
|
|
|
|
Copyright @copyright{} 1989-2000 Free Software Foundation, Inc.
|
|
|
|
Permission is granted to make and distribute verbatim copies of this
|
|
manual provided the copyright notice and this permission notice are
|
|
preserved on all copies.
|
|
|
|
@ignore
|
|
Permission is granted to process this file through TeX and print the
|
|
results, provided the printed document carries a copying permission
|
|
notice identical to this one except for the removal of this paragraph
|
|
(this paragraph not being relevant to the printed manual).
|
|
|
|
@end ignore
|
|
|
|
Permission is granted to copy and distribute modified versions of this
|
|
manual under the conditions for verbatim copying, provided also that the
|
|
section entitled ``GNU General Public License'' is included exactly as
|
|
in the original, and provided that the entire resulting derived work is
|
|
distributed under the terms of a permission notice identical to this
|
|
one.
|
|
|
|
Permission is granted to copy and distribute translations of this manual
|
|
into another language, under the above conditions for modified versions,
|
|
except that the section entitled ``GNU General Public License'' and this
|
|
permission notice may be included in translations approved by the Free
|
|
Software Foundation instead of in the original English.
|
|
|
|
@end ifinfo
|
|
|
|
@titlepage
|
|
@title User's Guide to @code{gperf} @value{VERSION}
|
|
@subtitle The GNU Perfect Hash Function Generator
|
|
@subtitle Edition @value{EDITION}, @value{UPDATED}
|
|
@author Douglas C. Schmidt
|
|
|
|
@page
|
|
@vskip 0pt plus 1filll
|
|
Copyright @copyright{} 1989-2000 Free Software Foundation, Inc.
|
|
|
|
|
|
Permission is granted to make and distribute verbatim copies of
|
|
this manual provided the copyright notice and this permission notice
|
|
are preserved on all copies.
|
|
|
|
Permission is granted to copy and distribute modified versions of this
|
|
manual under the conditions for verbatim copying, provided also that the
|
|
section entitled ``GNU General Public License'' is included
|
|
exactly as in the original, and provided that the entire resulting
|
|
derived work is distributed under the terms of a permission notice
|
|
identical to this one.
|
|
|
|
Permission is granted to copy and distribute translations of this manual
|
|
into another language, under the above conditions for modified versions,
|
|
except that the section entitled ``GNU General Public License'' may be
|
|
included in a translation approved by the author instead of in the
|
|
original English.
|
|
@end titlepage
|
|
|
|
@ifinfo
|
|
@node Top, Copying, (dir), (dir)
|
|
@top Introduction
|
|
|
|
This manual documents the GNU @code{gperf} perfect hash function generator
|
|
utility, focusing on its features and how to use them, and how to report
|
|
bugs.
|
|
|
|
@menu
|
|
* Copying:: GNU @code{gperf} General Public License says
|
|
how you can copy and share @code{gperf}.
|
|
* Contributors:: People who have contributed to @code{gperf}.
|
|
* Motivation:: The purpose of GNU @code{gperf}.
|
|
* Search Structures:: Static search structures and GNU @code{gperf}
|
|
* Description:: High-level discussion of how GPERF functions.
|
|
* Options:: A description of options to the program.
|
|
* Bugs:: Known bugs and limitations with GPERF.
|
|
* Projects:: Things still left to do.
|
|
* Implementation:: Implementation Details for GNU GPERF.
|
|
* Bibliography:: Material Referenced in this Report.
|
|
|
|
* Concept Index::
|
|
|
|
@detailmenu --- The Detailed Node Listing ---
|
|
|
|
High-Level Description of GNU @code{gperf}
|
|
|
|
* Input Format:: Input Format to @code{gperf}
|
|
* Output Format:: Output Format for Generated C Code with @code{gperf}
|
|
* Binary Strings:: Use of NUL characters
|
|
|
|
Input Format to @code{gperf}
|
|
|
|
* Declarations:: @code{struct} Declarations and C Code Inclusion.
|
|
* Keywords:: Format for Keyword Entries.
|
|
* Functions:: Including Additional C Functions.
|
|
|
|
Invoking @code{gperf}
|
|
|
|
* Input Details:: Options that affect Interpretation of the Input File
|
|
* Output Language:: Specifying the Language for the Output Code
|
|
* Output Details:: Fine tuning Details in the Output Code
|
|
* Algorithmic Details:: Changing the Algorithms employed by @code{gperf}
|
|
* Verbosity:: Informative Output
|
|
|
|
@end detailmenu
|
|
@end menu
|
|
|
|
@end ifinfo
|
|
|
|
@node Copying, Contributors, Top, Top
|
|
@unnumbered GNU GENERAL PUBLIC LICENSE
|
|
@include gpl.texinfo
|
|
|
|
@node Contributors, Motivation, Copying, Top
|
|
@unnumbered Contributors to GNU @code{gperf} Utility
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@cindex Bugs
|
|
The GNU @code{gperf} perfect hash function generator utility was
|
|
originally written in GNU C++ by Douglas C. Schmidt. It is now also
|
|
available in a highly-portable ``old-style'' C version. The general
|
|
idea for the perfect hash function generator was inspired by Keith
|
|
Bostic's algorithm written in C, and distributed to net.sources around
|
|
1984. The current program is a heavily modified, enhanced, and extended
|
|
implementation of Keith's basic idea, created at the University of
|
|
California, Irvine. Bugs, patches, and suggestions should be reported
|
|
to both @code{<bug-gnu-utils@@gnu.org>} and
|
|
@code{<gperf-bugs@@lists.sourceforge.net>}.
|
|
|
|
@item
|
|
Special thanks are extended to Michael Tiemann and Doug Lea for
|
|
providing a useful compiler, and for giving me a forum to exhibit my
|
|
creation.
|
|
|
|
In addition, Adam de Boor and Nels Olson provided many tips and insights
|
|
that greatly helped improve the quality and functionality of @code{gperf}.
|
|
|
|
@item
|
|
A testsuite was added by Bruno Haible. He also rewrote the output
|
|
routines for better reliability.
|
|
@end itemize
|
|
|
|
@node Motivation, Search Structures, Contributors, Top
|
|
@chapter Introduction
|
|
|
|
@code{gperf} is a perfect hash function generator written in C++. It
|
|
transforms an @var{n} element user-specified keyword set @var{W} into a
|
|
perfect hash function @var{F}. @var{F} uniquely maps keywords in
|
|
@var{W} onto the range 0..@var{k}, where @var{k} >= @var{n}-1. If @var{k}
= @var{n}-1 then @var{F} is a @emph{minimal} perfect hash function.
|
|
@code{gperf} generates a 0..@var{k} element static lookup table and a
|
|
pair of C functions. These functions determine whether a given
|
|
character string @var{s} occurs in @var{W}, using at most one probe into
|
|
the lookup table.
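
As a minimal sketch of the idea, assume a hypothetical keyword set
@var{W} of four words whose lengths happen to be distinct, so that the
string length alone can serve as @var{F}. The names
@code{lookup_keyword}, @code{wordlist} and @code{MAX_HASH_VALUE} are
chosen for illustration only and are not actual @code{gperf} output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical keyword set W = { "if", "for", "while", "return" }.
   The lengths 2, 3, 5 and 6 are all distinct, so F(s) = strlen(s) is
   already a perfect (but not minimal) hash function for W. */
#define MAX_HASH_VALUE 6

static const char *const wordlist[MAX_HASH_VALUE + 1] = {
    "", "", "if", "for", "", "while", "return"
};

/* Returns the stored keyword if s is in W, otherwise NULL. */
const char *lookup_keyword(const char *s)
{
    size_t key;

    assert(s != NULL);
    key = strlen(s);                  /* F(s): the hash value */
    if (key == 0 || key > MAX_HASH_VALUE)
        return NULL;
    /* Exactly one probe into the table, then one string comparison. */
    return strcmp(s, wordlist[key]) == 0 ? wordlist[key] : NULL;
}
```

Here the table has seven slots for four keywords, so the function is
perfect without being minimal; the unused slots hold empty strings that
never match a non-empty argument.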
|
|
|
|
@code{gperf} currently generates the reserved keyword recognizer for
|
|
lexical analyzers in several production and research compilers and
|
|
language processing tools, including GNU C, GNU C++, GNU Pascal, GNU
|
|
Modula 3, and GNU indent. Complete C++ source code for @code{gperf} is
|
|
available via anonymous ftp from @code{ftp://ftp.gnu.org/pub/gnu/gperf/}.
|
|
A paper describing @code{gperf}'s design and implementation in greater
|
|
detail is available in the Second USENIX C++ Conference proceedings.
|
|
|
|
@node Search Structures, Description, Motivation, Top
|
|
@chapter Static search structures and GNU @code{gperf}
|
|
@cindex Static search structure
|
|
|
|
A @dfn{static search structure} is an Abstract Data Type with certain
|
|
fundamental operations, e.g., @emph{initialize}, @emph{insert},
|
|
and @emph{retrieve}. Conceptually, all insertions occur before any
|
|
retrievals. In practice, @code{gperf} generates a @code{static} array
|
|
containing search set keywords and any associated attributes specified
|
|
by the user. Thus, there is essentially no execution-time cost for the
|
|
insertions. It is a useful data structure for representing @emph{static
|
|
search sets}. Static search sets occur frequently in software system
|
|
applications. Typical static search sets include compiler reserved
|
|
words, assembler instruction opcodes, and built-in shell interpreter
|
|
commands. Search set members, called @dfn{keywords}, are inserted into
|
|
the structure only once, usually during program initialization, and are
|
|
not generally modified at run-time.
|
|
|
|
Numerous static search structure implementations exist, e.g.,
|
|
arrays, linked lists, binary search trees, digital search tries, and
|
|
hash tables. Different approaches offer trade-offs between space
|
|
utilization and search time efficiency. For example, an @var{n} element
|
|
sorted array is space efficient, though the average-case time
|
|
complexity for retrieval operations using binary search is
|
|
proportional to log @var{n}. Conversely, hash table implementations
|
|
often locate a table entry in constant time, but typically impose
|
|
additional memory overhead and exhibit poor worst case performance.
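
For comparison, the sorted-array approach mentioned above can be
sketched with the standard C @code{bsearch} function. The keyword list
and the names @code{sorted_words}, @code{compare_words} and
@code{is_keyword_bsearch} are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Keywords must be kept in strcmp() order for bsearch() to work. */
static const char *const sorted_words[] = {
    "default", "for", "if", "return", "sizeof", "switch", "while"
};

static int compare_words(const void *key, const void *elem)
{
    /* key is the string searched for; elem points at an array slot. */
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns 1 if s is a keyword, 0 otherwise.  The search needs a number
   of string comparisons proportional to log n. */
int is_keyword_bsearch(const char *s)
{
    assert(s != NULL);
    return bsearch(s, sorted_words,
                   sizeof sorted_words / sizeof sorted_words[0],
                   sizeof sorted_words[0], compare_words) != NULL;
}
```

The array carries no padding entries, which is the space advantage, but
every lookup performs several string comparisons instead of one.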
|
|
|
|
@cindex Minimal perfect hash functions
|
|
@emph{Minimal perfect hash functions} provide an optimal solution for a
|
|
particular class of static search sets. A minimal perfect hash
|
|
function is defined by two properties:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
It allows keyword recognition in a static search set using at most
|
|
@emph{one} probe into the hash table. This represents the ``perfect''
|
|
property.
|
|
@item
|
|
The actual memory allocated to store the keywords is precisely large
|
|
enough for the keyword set, and @emph{no larger}. This is the
|
|
``minimal'' property.
|
|
@end itemize
|
|
|
|
For most applications it is far easier to generate @emph{perfect} hash
|
|
functions than @emph{minimal perfect} hash functions. Moreover,
|
|
non-minimal perfect hash functions frequently execute faster than
|
|
minimal ones in practice. This phenomenon occurs because searching a
|
|
sparse keyword table increases the probability of locating a ``null''
|
|
entry, thereby reducing string comparisons. @code{gperf}'s default
|
|
behavior generates @emph{near-minimal} perfect hash functions for
|
|
keyword sets. However, @code{gperf} provides many options that permit
|
|
user control over the degree of minimality and perfection.
|
|
|
|
Static search sets often exhibit relative stability over time. For
|
|
example, Ada's 63 reserved words have remained constant for nearly a
|
|
decade. It is therefore frequently worthwhile to expend concerted
|
|
effort building an optimal search structure @emph{once}, if it
|
|
subsequently receives heavy use multiple times. @code{gperf} removes
|
|
the drudgery associated with constructing time- and space-efficient
|
|
search structures by hand. It has proven a useful and practical tool
|
|
for serious programming projects. Output from @code{gperf} is currently
|
|
used in several production and research compilers, including GNU C, GNU
|
|
C++, GNU Pascal, and GNU Modula 3. The latter two compilers are not yet
|
|
part of the official GNU distribution. Each compiler utilizes
|
|
@code{gperf} to automatically generate static search structures that
|
|
efficiently identify their respective reserved keywords.
|
|
|
|
@node Description, Options, Search Structures, Top
|
|
@chapter High-Level Description of GNU @code{gperf}
|
|
|
|
@menu
|
|
* Input Format:: Input Format to @code{gperf}
|
|
* Output Format:: Output Format for Generated C Code with @code{gperf}
|
|
* Binary Strings:: Use of NUL characters
|
|
@end menu
|
|
|
|
The perfect hash function generator @code{gperf} reads a set of
|
|
``keywords'' from a @dfn{keyfile} (or from the standard input by
|
|
default). It attempts to derive a perfect hashing function that
|
|
recognizes a member of the @dfn{static keyword set} with at most a
|
|
single probe into the lookup table. If @code{gperf} succeeds in
|
|
generating such a function it produces a pair of C source code routines
|
|
that perform hashing and table lookup recognition. All generated C code
|
|
is directed to the standard output. Command-line options described
|
|
below allow you to modify the input and output format to @code{gperf}.
|
|
|
|
By default, @code{gperf} attempts to produce time-efficient code, with
|
|
less emphasis on efficient space utilization. However, several options
|
|
exist that permit trading-off execution time for storage space and vice
|
|
versa. In particular, expanding the generated table size produces a
|
|
sparse search structure, generally yielding faster searches.
|
|
Conversely, you can direct @code{gperf} to utilize a C @code{switch}
|
|
statement scheme that minimizes data space storage size. Furthermore,
|
|
using a C @code{switch} may actually speed up the keyword retrieval time
|
|
somewhat. Actual results depend on your C compiler, of course.
|
|
|
|
In general, @code{gperf} assigns values to the characters it is using
|
|
for hashing until some set of values gives each keyword a unique value.
|
|
A helpful heuristic is that the larger the hash value range, the easier
|
|
it is for @code{gperf} to find and generate a perfect hash function.
|
|
Experimentation is the key to getting the most from @code{gperf}.
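
The assignment of values to characters can be illustrated with a small
hand-worked case. The sketch below assumes six hypothetical keywords
and hashes on the first and last character of each; the associated
values were chosen by hand, not produced by @code{gperf}, and the code
only imitates the general shape of generated output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MIN_WORD_LENGTH 2
#define MAX_WORD_LENGTH 4
#define MAX_HASH_VALUE  7

/* Hypothetical associated values, found by hand for the six keywords
   below.  Every other character gets a value large enough to push the
   hash past MAX_HASH_VALUE. */
static unsigned int asso_value(unsigned char c)
{
    switch (c) {
    case 'i': case 'f': case 'o': case 'c': return 0;
    case 'd': case 't':                     return 1;
    case 'r':                               return 2;
    case 'e':                               return 3;
    default:                                return MAX_HASH_VALUE + 1;
    }
}

/* len plus the associated values of the first and last characters. */
static unsigned int hash(const char *str, unsigned int len)
{
    return len + asso_value((unsigned char)str[0])
               + asso_value((unsigned char)str[len - 1]);
}

/* if=2, do=3, int=4, for=5, char=6, case=7: no two keywords collide. */
static const char *const wordlist[MAX_HASH_VALUE + 1] = {
    "", "", "if", "do", "int", "for", "char", "case"
};

const char *in_word_set(const char *str, unsigned int len)
{
    assert(str != NULL);
    if (len >= MIN_WORD_LENGTH && len <= MAX_WORD_LENGTH) {
        unsigned int key = hash(str, len);
        if (key <= MAX_HASH_VALUE && strcmp(str, wordlist[key]) == 0)
            return wordlist[key];
    }
    return NULL;
}
```

A non-keyword such as @code{of} hashes to the same slot as @code{if},
but the single string comparison rejects it.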
|
|
|
|
@node Input Format, Output Format, Description, Description
|
|
@section Input Format to @code{gperf}
|
|
@cindex Format
|
|
@cindex Declaration section
|
|
@cindex Keywords section
|
|
@cindex Functions section
|
|
You can control the input keyfile format by varying certain command-line
|
|
arguments, in particular the @samp{-t} option. The input's appearance
|
|
is similar to GNU utilities @code{flex} and @code{bison} (or UNIX
|
|
utilities @code{lex} and @code{yacc}). Here's an outline of the general
|
|
format:
|
|
|
|
@example
|
|
@group
|
|
declarations
|
|
%%
|
|
keywords
|
|
%%
|
|
functions
|
|
@end group
|
|
@end example
|
|
|
|
@emph{Unlike} @code{flex} or @code{bison}, all sections of
|
|
@code{gperf}'s input are optional. The following sections describe the
|
|
input format for each section.
|
|
|
|
@menu
|
|
* Declarations:: @code{struct} Declarations and C Code Inclusion.
|
|
* Keywords:: Format for Keyword Entries.
|
|
* Functions:: Including Additional C Functions.
|
|
@end menu
|
|
|
|
@node Declarations, Keywords, Input Format, Input Format
|
|
@subsection @code{struct} Declarations and C Code Inclusion
|
|
|
|
The keyword input file optionally contains a section for including
|
|
arbitrary C declarations and definitions, as well as provisions for
|
|
providing a user-supplied @code{struct}. If the @samp{-t} option
|
|
@emph{is} enabled, you @emph{must} provide a C @code{struct} as the last
|
|
component in the declaration section of the keyfile. The first
|
|
field in this struct must be a @code{char *} or @code{const char *}
|
|
identifier called @samp{name}, although it is possible to modify this
|
|
field's name with the @samp{-K} option described below.
|
|
|
|
Here is a simple example, using months of the year and their attributes as
|
|
input:
|
|
|
|
@example
|
|
@group
|
|
struct months @{ char *name; int number; int days; int leap_days; @};
|
|
%%
|
|
january, 1, 31, 31
|
|
february, 2, 28, 29
|
|
march, 3, 31, 31
|
|
april, 4, 30, 30
|
|
may, 5, 31, 31
|
|
june, 6, 30, 30
|
|
july, 7, 31, 31
|
|
august, 8, 31, 31
|
|
september, 9, 30, 30
|
|
october, 10, 31, 31
|
|
november, 11, 30, 30
|
|
december, 12, 31, 31
|
|
@end group
|
|
@end example
|
|
|
|
@cindex @samp{%%}
|
|
A pair of consecutive percent signs, @samp{%%}, appearing left justified
in the first column, separates the @code{struct} declaration from the
list of keywords and other fields, as in the UNIX utility @code{lex}.
|
|
|
|
@cindex @samp{%@{}
|
|
@cindex @samp{%@}}
|
|
Using a syntax similar to GNU utilities @code{flex} and @code{bison}, it
|
|
is possible to directly include C source text and comments verbatim into
|
|
the generated output file. This is accomplished by enclosing the region
|
|
inside left-justified surrounding @samp{%@{}, @samp{%@}} pairs. Here is
|
|
an input fragment based on the previous example that illustrates this
|
|
feature:
|
|
|
|
@example
|
|
@group
|
|
%@{
|
|
#include <assert.h>
|
|
/* This section of code is inserted directly into the output. */
|
|
int return_month_days (struct months *months, int is_leap_year);
|
|
%@}
|
|
struct months @{ char *name; int number; int days; int leap_days; @};
|
|
%%
|
|
january, 1, 31, 31
|
|
february, 2, 28, 29
|
|
march, 3, 31, 31
|
|
...
|
|
@end group
|
|
@end example
|
|
|
|
It is possible to omit the declaration section entirely. In this case
|
|
the keyfile begins directly with the first keyword line, e.g.:
|
|
|
|
@example
|
|
@group
|
|
january, 1, 31, 31
|
|
february, 2, 28, 29
|
|
march, 3, 31, 31
|
|
april, 4, 30, 30
|
|
...
|
|
@end group
|
|
@end example
|
|
|
|
@node Keywords, Functions, Declarations, Input Format
|
|
@subsection Format for Keyword Entries
|
|
|
|
The second keyfile format section contains lines of keywords and any
|
|
associated attributes you might supply. A line beginning with @samp{#}
|
|
in the first column is considered a comment. Everything following the
|
|
@samp{#} is ignored, up to and including the following newline.
|
|
|
|
The first field of each non-comment line is always the key itself. It
|
|
can be given in two ways: as a simple name, i.e., without surrounding
|
|
string quotation marks, or as a string enclosed in double-quotes, in
|
|
C syntax, possibly with backslash escapes like @code{\"} or @code{\234}
|
|
or @code{\xa8}. In either case, it must start right at the beginning
|
|
of the line, without leading whitespace.
|
|
In this context, a ``field'' is considered to extend up to, but
|
|
not include, the first blank, comma, or newline. Here is a simple
|
|
example taken from a partial list of C reserved words:
|
|
|
|
@example
|
|
@group
|
|
# These are a few C reserved words, see the c.gperf file
|
|
# for a complete list of ANSI C reserved words.
|
|
unsigned
|
|
sizeof
|
|
switch
|
|
signed
|
|
if
|
|
default
|
|
for
|
|
while
|
|
return
|
|
@end group
|
|
@end example
|
|
|
|
Note that unlike @code{flex} or @code{bison} the first @samp{%%} marker
|
|
may be elided if the declaration section is empty.
|
|
|
|
Additional fields may optionally follow the leading keyword. Fields
|
|
should be separated by commas, and terminate at the end of line. What
|
|
these fields mean is entirely up to you; they are used to initialize the
|
|
elements of the user-defined @code{struct} provided by you in the
|
|
declaration section. If the @samp{-t} option is @emph{not} enabled
|
|
these fields are simply ignored. All previous examples except the last
|
|
one contain keyword attributes.
|
|
|
|
@node Functions, , Keywords, Input Format
|
|
@subsection Including Additional C Functions
|
|
|
|
The optional third section also corresponds closely with conventions
|
|
found in @code{flex} and @code{bison}. All text in this section,
|
|
starting at the final @samp{%%} and extending to the end of the input
|
|
file, is included verbatim into the generated output file. Naturally,
|
|
it is your responsibility to ensure that the code contained in this
|
|
section is valid C.
|
|
|
|
@node Output Format, Binary Strings, Input Format, Description
|
|
@section Output Format for Generated C Code with @code{gperf}
|
|
@cindex hash table
|
|
|
|
Several options control how the generated C code appears on the standard
|
|
output. Two C functions are generated. They are called @code{hash} and
@code{in_word_set}, although you may modify their names with the
@samp{-H} and @samp{-N} options. Both functions require two arguments, a
string, @code{const char *} @var{str}, and a length parameter,
@code{unsigned int} @var{len}. Their default
|
|
function prototypes are as follows:
|
|
|
|
@deftypefun {unsigned int} hash (const char * @var{str}, unsigned int @var{len})
|
|
By default, the generated @code{hash} function returns an integer value
|
|
created by adding @var{len} to several user-specified @var{str} key
|
|
positions indexed into an @dfn{associated values} table stored in a
|
|
local static array. The associated values table is constructed
|
|
internally by @code{gperf} and later output as a static local C array
|
|
called @samp{hash_table}; its meaning and properties are described below
|
|
(@pxref{Implementation}). The relevant key positions are specified via
|
|
the @samp{-k} option when running @code{gperf}, as detailed in the
|
|
@emph{Options} section below (@pxref{Options}).
|
|
@end deftypefun
|
|
|
|
@deftypefun {} in_word_set (const char * @var{str}, unsigned int @var{len})
|
|
If @var{str} is in the keyword set, returns a pointer to that
|
|
keyword. More exactly, if the option @samp{-t} was given, it returns
|
|
a pointer to the matching keyword's structure. Otherwise it returns
|
|
@code{NULL}.
|
|
@end deftypefun
|
|
|
|
If the option @samp{-c} is not used, @var{str} must be a NUL terminated
|
|
string of exactly length @var{len}. If @samp{-c} is used, @var{str} must
|
|
simply be an array of @var{len} characters and does not need to be NUL
|
|
terminated.
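
A caller might use the lookup function along the following lines. This
is only a sketch: it assumes the months keyfile shown earlier was
processed with @samp{-t}, and it substitutes a hand-written linear scan
for the generated @code{in_word_set} so that the fragment compiles on
its own. The helper @code{month_days} is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct months { const char *name; int number; int days; int leap_days; };

/* Stand-in for the generated lookup function: a plain linear scan over
   three entries, used here only so the sketch compiles on its own.
   Real gperf output would replace this definition. */
struct months *in_word_set(const char *str, unsigned int len)
{
    static struct months table[] = {
        { "january",  1, 31, 31 },
        { "february", 2, 28, 29 },
        { "march",    3, 31, 31 }
    };
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strlen(table[i].name) == len
            && strncmp(str, table[i].name, len) == 0)
            return &table[i];
    return NULL;
}

/* Returns the number of days in the named month, or -1 if unknown. */
int month_days(const char *name, int is_leap_year)
{
    struct months *m;

    assert(name != NULL);
    /* len must be the exact length of the NUL-terminated string. */
    m = in_word_set(name, (unsigned int)strlen(name));
    if (m == NULL)
        return -1;
    return is_leap_year ? m->leap_days : m->days;
}
```

Passing @code{strlen (name)} as @var{len} satisfies the requirement that
@var{str} be NUL terminated and exactly @var{len} characters long.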
|
|
|
|
The code generated for these two functions is affected by the following
|
|
options:
|
|
|
|
@table @samp
|
|
@item -t
|
|
@itemx --struct-type
|
|
Make use of the user-defined @code{struct}.
|
|
|
|
@item -S @var{total-switch-statements}
|
|
@itemx --switch=@var{total-switch-statements}
|
|
@cindex @code{switch}
|
|
Generate one or more C @code{switch} statements rather than use a large
(and potentially sparse) static array. Although the exact time and
|
|
space savings of this approach vary according to your C compiler's
|
|
degree of optimization, this method often results in smaller and faster
|
|
code.
|
|
@end table
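
The general shape of a @code{switch}-based lookup can be sketched as
follows. The keyword set is hypothetical and was picked so that the
length alone is a perfect hash value; code actually emitted for
@samp{-S} is laid out differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

const char *in_word_set(const char *str, unsigned int len)
{
    const char *resword;

    assert(str != NULL);
    /* Hypothetical keyword set whose lengths are all distinct, so the
       length itself serves as the hash value. */
    switch (len) {
    case 2: resword = "if";     break;
    case 3: resword = "for";    break;
    case 5: resword = "while";  break;
    case 6: resword = "return"; break;
    default: return NULL;       /* no keyword has this hash value */
    }
    /* One string comparison confirms or rejects the candidate. */
    return strcmp(str, resword) == 0 ? resword : NULL;
}
```

No table slots are spent on unused hash values; whether this is faster
than an array probe depends on how the compiler implements the
@code{switch}.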
|
|
|
|
If the @samp{-t} and @samp{-S} options are omitted, the default action
|
|
is to generate a @code{char *} array containing the keys, together with
|
|
additional null strings used for padding the array. By experimenting
|
|
with the various input and output options, and timing the resulting C
|
|
code, you can determine the best option choices for different keyword
|
|
set characteristics.
|
|
|
|
@node Binary Strings, , Output Format, Description
|
|
@section Use of NUL characters
|
|
@cindex NUL
|
|
|
|
By default, the code generated by @code{gperf} operates on zero
|
|
terminated strings, the usual representation of strings in C. This means
|
|
that the keywords in the input file must not contain NUL characters,
|
|
and the @var{str} argument passed to @code{hash} or @code{in_word_set}
|
|
must be NUL terminated and have exactly length @var{len}.
|
|
|
|
If option @samp{-c} is used, then the @var{str} argument does not need
|
|
to be NUL terminated. The code generated by @code{gperf} will only
|
|
access the first @var{len}, not @var{len+1}, bytes starting at @var{str}.
|
|
However, the keywords in the input file still must not contain NUL
|
|
characters.
|
|
|
|
If option @samp{-l} is used, then the hash table performs binary
|
|
comparison. The keywords in the input file may contain NUL characters,
|
|
written in string syntax as @code{\000} or @code{\x00}, and the code
|
|
generated by @code{gperf} will treat NUL like any other character.
|
|
Also, in this case the @samp{-c} option is ignored.
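
The effect of @samp{-l} on keys with embedded NUL characters can be
sketched as below. The three keys, the trivial @code{hash} function and
the table names are assumptions made for illustration; the point is
only that lengths are compared first and the bytes are then compared
with @code{memcmp}, so NUL is not treated as a terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MIN_WORD_LENGTH 2
#define MAX_WORD_LENGTH 4
#define MAX_HASH_VALUE  2

/* Hypothetical keys containing NUL bytes, with their true lengths. */
static const unsigned char lengthtable[MAX_HASH_VALUE + 1] = { 3, 2, 4 };
static const char *const wordlist[MAX_HASH_VALUE + 1] = {
    "a\000b", "b\000", "c\000\000d"
};

/* Toy hash: the keys differ in their first byte, so that is enough. */
static unsigned int hash(const char *str, unsigned int len)
{
    (void)len;
    switch (str[0]) {
    case 'a': return 0;
    case 'b': return 1;
    case 'c': return 2;
    default:  return MAX_HASH_VALUE + 1;
    }
}

const char *in_word_set(const char *str, unsigned int len)
{
    assert(str != NULL);
    if (len >= MIN_WORD_LENGTH && len <= MAX_WORD_LENGTH) {
        unsigned int key = hash(str, len);
        /* Compare lengths first, then all len bytes including NULs. */
        if (key <= MAX_HASH_VALUE
            && len == lengthtable[key]
            && memcmp(str, wordlist[key], len) == 0)
            return wordlist[key];
    }
    return NULL;
}
```

Because the length test comes first, a candidate of the wrong length is
rejected without any byte comparison.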
|
|
|
|
@node Options, Bugs, Description, Top
|
|
@chapter Invoking @code{gperf}
|
|
|
|
There are @emph{many} options to @code{gperf}. They were added to make
|
|
the program more convenient for use with real applications. ``On-line''
|
|
help is readily available via the @samp{-h} option. Here is the
|
|
complete list of options.
|
|
|
|
@menu
|
|
* Input Details:: Options that affect Interpretation of the Input File
|
|
* Output Language:: Specifying the Language for the Output Code
|
|
* Output Details:: Fine tuning Details in the Output Code
|
|
* Algorithmic Details:: Changing the Algorithms employed by @code{gperf}
|
|
* Verbosity:: Informative Output
|
|
@end menu
|
|
|
|
@node Input Details, Output Language, Options, Options
|
|
@section Options that affect Interpretation of the Input File
|
|
|
|
@table @samp
|
|
@item -e @var{keyword-delimiter-list}
|
|
@itemx --delimiters=@var{keyword-delimiter-list}
|
|
@cindex Delimiters
|
|
Allows the user to provide a string containing delimiters used to
|
|
separate keywords from their attributes. The default is ",\n". This
|
|
option is essential if you want to use keywords that have embedded
|
|
commas or newlines. One useful trick is to use -e'TAB', where TAB is
|
|
the literal tab character.
|
|
|
|
@item -t
|
|
@itemx --struct-type
|
|
Allows you to include a @code{struct} type declaration for generated
|
|
code. Any text before a pair of consecutive @samp{%%} is considered
|
|
part of the type declaration. Keywords and additional fields may follow
|
|
this, one group of fields per line. A set of examples for generating
|
|
perfect hash tables and functions for Ada, C, C++, Pascal, Modula 2,
|
|
Modula 3 and JavaScript reserved words are distributed with this release.
|
|
@end table
|
|
|
|
@node Output Language, Output Details, Input Details, Options
|
|
@section Options to specify the Language for the Output Code
|
|
|
|
@table @samp
|
|
@item -L @var{generated-language-name}
|
|
@itemx --language=@var{generated-language-name}
|
|
Instructs @code{gperf} to generate code in the language specified by the
|
|
option's argument. Languages handled are currently:
|
|
|
|
@table @samp
|
|
@item KR-C
|
|
Old-style K&R C. This language is understood by old-style C compilers and
|
|
ANSI C compilers, but ANSI C compilers may flag warnings (or even errors)
|
|
because of lacking @samp{const}.
|
|
|
|
@item C
|
|
Common C. This language is understood by ANSI C compilers, and also by
|
|
old-style C compilers, provided that you @code{#define const} to empty
|
|
for compilers which don't know about this keyword.
|
|
|
|
@item ANSI-C
|
|
ANSI C. This language is understood by ANSI C compilers and C++ compilers.
|
|
|
|
@item C++
|
|
C++. This language is understood by C++ compilers.
|
|
@end table
|
|
|
|
The default is C.
|
|
|
|
@item -a
|
|
This option is supported for compatibility with previous releases of
|
|
@code{gperf}. It does not do anything.
|
|
|
|
@item -g
|
|
This option is supported for compatibility with previous releases of
|
|
@code{gperf}. It does not do anything.
|
|
@end table
|
|
|
|
@node Output Details, Algorithmic Details, Output Language, Options
|
|
@section Options for fine tuning Details in the Output Code
|
|
|
|
@table @samp
|
|
@item -K @var{key-name}
|
|
@itemx --slot-name=@var{key-name}
|
|
@cindex Slot name
|
|
This option is only useful when option @samp{-t} has been given.
|
|
By default, the program assumes the structure component identifier for
|
|
the keyword is @samp{name}. This option allows an arbitrary choice of
|
|
identifier for this component, although it still must occur as the first
|
|
field in your supplied @code{struct}.
|
|
|
|
@item -F @var{initializers}
|
|
@itemx --initializer-suffix=@var{initializers}
|
|
@cindex Initializers
|
|
This option is only useful when option @samp{-t} has been given.
|
|
It lets you specify initializers for the structure members following
|
|
@var{key name} in empty hash table entries. The list of initializers
|
|
should start with a comma. By default, the emitted code will
|
|
zero-initialize structure members following @var{key name}.
|
|
|
|
@item -H @var{hash-function-name}
|
|
@itemx --hash-fn-name=@var{hash-function-name}
|
|
Allows you to specify the name for the generated hash function. Default
|
|
name is @samp{hash}. This option permits the use of two hash tables in
|
|
the same file.
|
|
|
|
@item -N @var{lookup-function-name}
|
|
@itemx --lookup-fn-name=@var{lookup-function-name}
|
|
Allows you to specify the name for the generated lookup function.
|
|
Default name is @samp{in_word_set}. This option permits completely
|
|
automatic generation of perfect hash functions, especially when multiple
|
|
generated hash functions are used in the same application.
|
|
|
|
@item -Z @var{class-name}
|
|
@itemx --class-name=@var{class-name}
|
|
@cindex Class name
|
|
This option is only useful when option @samp{-L C++} has been given. It
|
|
allows you to specify the name of the generated C++ class. The default name is
|
|
@code{Perfect_Hash}.
|
|
|
|
@item -7
|
|
@itemx --seven-bit
|
|
This option specifies that all strings that will be passed as arguments
|
|
to the generated hash function and the generated lookup function will
|
|
solely consist of 7-bit ASCII characters (characters in the range 0..127).
|
|
(Note that the ANSI C functions @code{isalnum} and @code{isgraph} do
|
|
@emph{not} guarantee that a character is in this range. Only an explicit
|
|
test like @samp{c >= 'A' && c <= 'Z'} guarantees this.) This was the
|
|
default in versions of @code{gperf} earlier than 2.7; now the default is
|
|
to assume 8-bit characters.
|
|
|
|
@item -c
|
|
@itemx --compare-strncmp
|
|
Generates C code that uses the @code{strncmp} function to perform
|
|
string comparisons. The default action is to use @code{strcmp}.
|
|
|
|
@item -C
|
|
@itemx --readonly-tables
|
|
Makes the contents of all generated lookup tables constant, i.e.,
|
|
``readonly''. Many compilers can generate more efficient code for this
|
|
by putting the tables in readonly memory.
|
|
|
|
@item -E
|
|
@itemx --enum
|
|
Define constant values using an enum local to the lookup function rather
|
|
than with #defines. This also means that different lookup functions can
|
|
reside in the same file. Thanks to James Clark @code{<jjc@@ai.mit.edu>}.
|
|
|
|
@item -I
|
|
@itemx --includes
|
|
Include the necessary system include file, @code{<string.h>}, at the
|
|
beginning of the code. By default, this is not done; you must
include this header file yourself to allow compilation of the code.
|
|
|
|
@item -G
|
|
@itemx --global
|
|
Generate the static table of keywords as a static global variable,
|
|
rather than hiding it inside of the lookup function (which is the
|
|
default behavior).
|
|
|
|
@item -W @var{hash-table-array-name}
|
|
@itemx --word-array-name=@var{hash-table-array-name}
|
|
@cindex Array name
|
|
Allows you to specify the name for the generated array containing the
|
|
hash table. Default name is @samp{wordlist}. This option permits the
|
|
use of two hash tables in the same file, even when the option @samp{-G}
|
|
is given.
|
|
|
|
@item -S @var{total-switch-statements}
|
|
@itemx --switch=@var{total-switch-statements}
|
|
@cindex @code{switch}
|
|
Causes the generated C code to use a @code{switch} statement scheme,
|
|
rather than an array lookup table. This can lead to a reduction in both
|
|
time and space requirements for some keyfiles. The argument to this
|
|
option determines how many @code{switch} statements are generated. A
|
|
value of 1 generates 1 @code{switch} containing all the elements, a
value of 2 generates 2 @code{switch} statements with half the elements
in each, etc. This is useful since many C compilers cannot
|
|
correctly generate code for large @code{switch} statements. This option
|
|
was inspired in part by Keith Bostic's original C program.
|
|
|
|
@item -T
|
|
@itemx --omit-struct-type
|
|
Prevents the transfer of the type declaration to the output file. Use
|
|
this option if the type is already defined elsewhere.
|
|
|
|
@item -p
|
|
This option is supported for compatibility with previous releases of
|
|
@code{gperf}. It does not do anything.
|
|
@end table
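
As a rough illustration of the difference between the default array
lookup and the @samp{-S} scheme, the following hand-written C sketch
looks up four keywords both ways.  It is not actual @code{gperf} output:
the hash function, the keyword set, and the names @code{toy_hash},
@code{in_word_set_array} and @code{in_word_set_switch} are assumptions
chosen only for this example.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MIN_WORD_LENGTH 2
#define MAX_WORD_LENGTH 3
#define MAX_HASH_VALUE  5

/* Hypothetical hash: keyword length plus a hand-picked value for the
   first character, so that "do", "if", "int", "for" map to 2, 3, 4, 5. */
static unsigned int toy_hash(const char *str, size_t len)
{
    unsigned int asso;

    assert(str != NULL);
    switch (str[0]) {
    case 'd': asso = 0; break;
    case 'i': asso = 1; break;
    case 'f': asso = 2; break;
    default:  asso = 4; break;  /* other input is rejected by strcmp later */
    }
    return (unsigned int)len + asso;
}

/* Array scheme: the hash value indexes a table with empty filler slots. */
static const char *const wordlist[] = { "", "", "do", "if", "int", "for" };

const char *in_word_set_array(const char *str, size_t len)
{
    if (len >= MIN_WORD_LENGTH && len <= MAX_WORD_LENGTH) {
        unsigned int key = toy_hash(str, len);
        if (key <= MAX_HASH_VALUE && strcmp(str, wordlist[key]) == 0)
            return wordlist[key];
    }
    return NULL;
}

/* Switch scheme: the hash value selects a case; no filler slots needed. */
const char *in_word_set_switch(const char *str, size_t len)
{
    const char *candidate;

    if (len < MIN_WORD_LENGTH || len > MAX_WORD_LENGTH)
        return NULL;
    switch (toy_hash(str, len)) {
    case 2: candidate = "do";  break;
    case 3: candidate = "if";  break;
    case 4: candidate = "int"; break;
    case 5: candidate = "for"; break;
    default: return NULL;
    }
    return strcmp(str, candidate) == 0 ? candidate : NULL;
}
```
@end verbatim

Both functions accept the same keywords.  The array version spends table
slots on unused hash values, while the switch version spends code on the
case labels, which is the trade-off the @samp{-S} option exposes.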
|
|
|
|
@node Algorithmic Details, Verbosity, Output Details, Options
|
|
@section Options for changing the Algorithms employed by @code{gperf}
|
|
|
|
@table @samp
|
|
@item -k @var{keys}
|
|
@itemx --key-positions=@var{keys}
|
|
Allows selection of the character key positions used in the keywords'
|
|
hash function.  The allowable positions range from 1 to 126, inclusive.
|
|
The positions are separated by commas, e.g., @samp{-k 9,4,13,14};
|
|
ranges may be used, e.g., @samp{-k 2-7}; and positions may occur
|
|
in any order.  Furthermore, the meta-character @samp{*} causes the generated
hash function to consider @strong{all} character positions in each key,
whereas @samp{$} instructs the hash function to use the ``final character''
|
|
of a key (this is the only way to use a character position greater than
|
|
126, incidentally).
|
|
|
|
For instance, the option @samp{-k 1,2,4,6-10,'$'} generates a hash
|
|
function that considers positions 1, 2, 4, 6, 7, 8, 9, and 10, plus the last
|
|
character in each key (which may differ for each key, obviously). Keys
|
|
with length less than the indicated key positions work properly, since
|
|
selected key positions exceeding the key length are simply not
|
|
referenced in the hash function.
|
|
|
|
@item -l
|
|
@itemx --compare-strlen
|
|
Compare key lengths before trying a string comparison. This might cut
|
|
down on the number of string comparisons made during the lookup, since
|
|
keys with different lengths are never compared via @code{strcmp}.
|
|
However, using @samp{-l} might greatly increase the size of the
|
|
generated C code if the lookup table range is large (which implies that
|
|
the switch option @samp{-S} is not enabled), since the length table
|
|
contains as many elements as there are entries in the lookup table.
|
|
This option is mandatory for binary comparisons (@pxref{Binary Strings}).
|
|
|
|
@item -D
|
|
@itemx --duplicates
|
|
@cindex Duplicates
|
|
Handle keywords whose key position sets hash to duplicate values.
|
|
Duplicate hash values occur for two reasons:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Since @code{gperf} does not backtrack, it is possible for it to process
|
|
all your input keywords without finding a unique mapping for each word.
|
|
However, frequently only a very small number of duplicates occur, and
|
|
the majority of keys still require one probe into the table.
|
|
|
|
@item
|
|
Sometimes a set of keys may have the same names, but possess different
|
|
attributes.  With the @samp{-D} option, @code{gperf} treats all these keys as
|
|
part of an equivalence class and generates a perfect hash function with
|
|
multiple comparisons for duplicate keys. It is up to you to completely
|
|
disambiguate the keywords by modifying the generated C code. However,
|
|
@code{gperf} helps you out by organizing the output.
|
|
@end itemize
|
|
|
|
Option @samp{-D} is extremely useful for certain large or highly
|
|
redundant keyword sets, e.g., assembler instruction opcodes.
|
|
Using this option usually means that the generated hash function is no
|
|
longer perfect. On the other hand, it permits @code{gperf} to work on
|
|
keyword sets that it otherwise could not handle.
|
|
|
|
@item -f @var{iteration-amount}
|
|
@itemx --fast=@var{iteration-amount}
|
|
Generate the perfect hash function ``fast''.  This decreases
@code{gperf}'s running time, at the cost of a generated table that is
less likely to be minimal.  The iteration amount represents the number
of times to iterate when resolving a collision.  @samp{0} means iterate
by the number of
|
|
keywords. This option is probably most useful when used in conjunction
|
|
with options @samp{-D} and/or @samp{-S} for @emph{large} keyword sets.
|
|
|
|
@item -i @var{initial-value}
|
|
@itemx --initial-asso=@var{initial-value}
|
|
Provides the @var{initial-value} for the associated values array.  Default
|
|
is 0. Increasing the initial value helps inflate the final table size,
|
|
possibly leading to more time efficient keyword lookups. Note that this
|
|
option is not particularly useful when @samp{-S} is used. Also,
|
|
@samp{-i} is overridden when the @samp{-r} option is used.
|
|
|
|
@item -j @var{jump-value}
|
|
@itemx --jump=@var{jump-value}
|
|
@cindex Jump value
|
|
Affects the ``jump value'', i.e., how far to advance the associated
|
|
character value upon collisions.  @var{jump-value} is rounded up to an
odd number; the default is 5.  If @var{jump-value} is 0, @code{gperf}
jumps by random amounts.
|
|
|
|
@item -n
|
|
@itemx --no-strlen
|
|
Instructs the generator not to include the length of a keyword when
|
|
computing its hash value.  This may save a few assembly instructions in
the generated lookup function.
|
|
|
|
@item -o
|
|
@itemx --occurrence-sort
|
|
Reorders the keywords so that frequently
occurring key position set components appear first.  A second reordering
|
|
pass follows so that keys with ``already determined values'' are placed
|
|
towards the front of the keylist. This may decrease the time required
|
|
to generate a perfect hash function for many keyword sets, and also
|
|
produce more minimal perfect hash functions. The reason for this is
|
|
that the reordering helps prune the search time by handling inevitable
|
|
collisions early in the search process. On the other hand, if the
|
|
number of keywords is @emph{very} large, using @samp{-o} may
|
|
@emph{increase} @code{gperf}'s execution time, since collisions will
|
|
begin earlier and continue throughout the remainder of keyword
|
|
processing. See Cichelli's paper from the January 1980 Communications
|
|
of the ACM for details.
|
|
|
|
@item -r
|
|
@itemx --random
|
|
Utilizes randomness to initialize the associated values table. This
|
|
frequently generates solutions faster than using deterministic
|
|
initialization (which starts all associated values at 0). Furthermore,
|
|
using the randomization option generally increases the size of the
|
|
table.  If @code{gperf} has difficulty with a certain keyword set, try using
|
|
@samp{-r} or @samp{-D}.
|
|
|
|
@item -s @var{size-multiple}
|
|
@itemx --size-multiple=@var{size-multiple}
|
|
Affects the size of the generated hash table. The numeric argument for
|
|
this option indicates ``how many times larger or smaller'' the maximum
|
|
associated value range should be, in relationship to the number of keys.
|
|
If the @var{size-multiple} is negative, the maximum associated value is
|
|
calculated by @emph{dividing} it into the total number of keys. For
|
|
example, a value of 3 means ``allow the maximum associated value to be
|
|
about 3 times larger than the number of input keys''.
|
|
|
|
Conversely, a value of -3 means ``allow the maximum associated value to
|
|
be about 3 times smaller than the number of input keys''. Negative
|
|
values are useful for limiting the overall size of the generated hash
|
|
table, though this usually increases the number of duplicate hash
|
|
values.
|
|
|
|
If the ``generate switch'' option @samp{-S} is @emph{not} enabled, the maximum
|
|
associated value influences the static array table size, and a larger
|
|
table should decrease the time required for an unsuccessful search, at
|
|
the expense of extra table space.
|
|
|
|
The default value is 1; thus the default maximum associated value is about
|
|
the same size as the number of keys (for efficiency, the maximum
|
|
associated value is always rounded up to a power of 2). The actual
|
|
table size may vary somewhat, since this technique is essentially a
|
|
heuristic. In particular, setting this value too high slows down
|
|
@code{gperf}'s runtime, since it must search through a much larger range
|
|
of values. Judicious use of the @samp{-f} option helps alleviate this
|
|
overhead, however.
|
|
@end table
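
To make the @samp{-k} and @samp{-l} descriptions above more concrete,
here is a minimal hand-written C sketch.  It adds the key length to a
value for each selected position, skips positions that lie beyond the
end of the key, and treats a @samp{$}-style request as ``the last
character''.  A second helper compares lengths before calling
@code{strcmp}.  The names @code{positions_hash}, @code{same_keyword} and
@code{LAST_CHAR} are hypothetical, and raw character codes stand in for
the associated values table that @code{gperf} actually computes, so the
result is in no way a perfect hash.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical marker standing in for the '$' key position. */
#define LAST_CHAR 0

/* Sum the key length and the character codes found at the selected
   1-based positions.  Positions past the end of the key contribute
   nothing, so short keys are handled without special cases. */
unsigned int positions_hash(const char *key, size_t len,
                            const int *positions, size_t npositions)
{
    unsigned int h = (unsigned int)len;
    size_t i;

    assert(key != NULL && positions != NULL);
    for (i = 0; i < npositions; i++) {
        if (positions[i] == LAST_CHAR) {
            if (len > 0)
                h += (unsigned char)key[len - 1];
        } else if (positions[i] > 0 && (size_t)positions[i] <= len) {
            h += (unsigned char)key[positions[i] - 1];
        }
    }
    return h;
}

/* Length-first comparison in the spirit of -l: strings of different
   lengths are rejected without ever calling strcmp. */
int same_keyword(const char *a, size_t alen, const char *b, size_t blen)
{
    assert(a != NULL && b != NULL);
    return alen == blen && strcmp(a, b) == 0;
}
```
@end verbatim

With the positions 1, 3 and @code{LAST_CHAR}, the key @samp{if} uses only
its first and last characters, because position 3 exceeds its length,
whereas @samp{int} uses all three selections.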
|
|
|
|
@node Verbosity, , Algorithmic Details, Options
|
|
@section Informative Output
|
|
|
|
@table @samp
|
|
@item -h
|
|
@itemx --help
|
|
Prints a short summary of the meaning of each program option.  Aborts
|
|
further program execution.
|
|
|
|
@item -v
|
|
@itemx --version
|
|
Prints out the current version number.
|
|
|
|
@item -d
|
|
@itemx --debug
|
|
Enables the debugging option. This produces verbose diagnostics to
|
|
``standard error'' when @code{gperf} is executing. It is useful both for
|
|
maintaining the program and for determining whether a given set of
|
|
options is actually speeding up the search for a solution. Some useful
|
|
information is dumped at the end of the program when the @samp{-d}
|
|
option is enabled.
|
|
@end table
|
|
|
|
@node Bugs, Projects, Options, Top
|
|
@chapter Known Bugs and Limitations with @code{gperf}
|
|
|
|
The following are some limitations with the current release of
|
|
@code{gperf}:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{gperf} utility is tuned to execute quickly, and works quickly
|
|
for small to medium size data sets (around 1000 keywords). It is
|
|
extremely useful for maintaining perfect hash functions for compiler
|
|
keyword sets. Several recent enhancements now enable @code{gperf} to
|
|
work efficiently on much larger keyword sets (over 15,000 keywords).
|
|
When processing large keyword sets, it helps greatly to have over 8 megabytes
|
|
of RAM.
|
|
|
|
However, since @code{gperf} does not backtrack, it is not guaranteed to
find a solution on every run.  On the other hand, it is usually easy to obtain a
|
|
solution by varying the option parameters. In particular, try the
|
|
@samp{-r} option, and also try changing the default arguments to the
|
|
@samp{-s} and @samp{-j} options. To @emph{guarantee} a solution, use
|
|
the @samp{-D} and @samp{-S} options, although the final results are not
|
|
likely to be a @emph{perfect} hash function anymore! Finally, use the
|
|
@samp{-f} option if you want @code{gperf} to generate the perfect hash
|
|
function @emph{fast}, with less emphasis on making it minimal.
|
|
|
|
@item
|
|
The size of the generated static keyword array can get @emph{extremely}
|
|
large if the input keyword file is large or if the keywords are quite
|
|
similar. This tends to slow down the compilation of the generated C
|
|
code, and @emph{greatly} inflates the object code size. If this
|
|
situation occurs, consider using the @samp{-S} option to reduce data
|
|
size, potentially increasing keyword recognition time a negligible
|
|
amount.  Since many C compilers cannot correctly generate code for
large @code{switch} statements, it is important to qualify the @samp{-S} option
|
|
with an appropriate numerical argument that controls the number of
|
|
switch statements generated.
|
|
|
|
@item
|
|
The maximum number of key positions selected for a given key has an
|
|
arbitrary limit of 126.  This restriction should be removed; if
anyone considers this a problem, write me and let me know so I can remove
the constraint.
|
|
@end itemize
|
|
|
|
@node Projects, Implementation, Bugs, Top
|
|
@chapter Things Still Left to Do
|
|
|
|
It should be ``relatively'' easy to replace the current perfect hash
|
|
function algorithm with a more exhaustive approach; the perfect hash
|
|
module is essentially independent of other program modules.  Additional
|
|
worthwhile improvements include:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Make the algorithm more robust. At present, the program halts with an
|
|
error diagnostic if it can't find a direct solution and the @samp{-D}
|
|
option is not enabled. A more comprehensive, albeit computationally
|
|
expensive, approach would employ backtracking or enable alternative
|
|
options and retry. It's not clear how helpful this would be, in
|
|
general, since most search sets are rather small in practice.
|
|
|
|
@item
|
|
Another useful extension involves modifying the program to generate
|
|
``minimal'' perfect hash functions (under certain circumstances, the
|
|
current version can be rather extravagant in the generated table size).
|
|
Again, this is mostly of theoretical interest, since a sparse table
|
|
often produces faster lookups, and use of the @samp{-S} @code{switch}
|
|
option can minimize the data size, at the expense of slightly longer
|
|
lookups (note that the gcc compiler generally produces good code for
|
|
@code{switch} statements, reducing the need for more complex schemes).
|
|
|
|
@item
|
|
In addition to improving the algorithm, it would also be useful to
|
|
generate a C++ class or Ada package as the code output, in addition to
|
|
the current C routines.
|
|
@end itemize
|
|
|
|
@node Implementation, Bibliography, Projects, Top
|
|
@chapter Implementation Details of GNU @code{gperf}
|
|
|
|
A paper giving a high-level description of the data structures and
algorithms used to implement @code{gperf} will soon be available.  This
paper is useful not only from a maintenance and enhancement perspective,
but also because it demonstrates several clever and useful programming
|
|
techniques, e.g., ``Iteration Number'' boolean arrays, double
|
|
hashing, a ``safe'' and efficient method for reading arbitrarily long
|
|
input from a file, and a provably optimal algorithm for simultaneously
|
|
determining both the minimum and maximum elements in a list.
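
As an illustration of the last technique mentioned above, the following
C sketch shows the well-known pairwise method for finding the minimum
and maximum together.  It uses about 3n/2 comparisons for n elements,
instead of roughly 2n for two separate scans.  This is a textbook
version written for this manual; the function name @code{min_max} is an
assumption, and the code is not taken from the @code{gperf} sources.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Find the minimum and maximum of a[0..n-1], n >= 1, by handling the
   elements in pairs: one comparison orders the pair, then the smaller
   element is compared with the running minimum and the larger element
   with the running maximum (three comparisons per two elements). */
void min_max(const int *a, size_t n, int *min_out, int *max_out)
{
    size_t i;
    int lo, hi;

    assert(a != NULL && n >= 1 && min_out != NULL && max_out != NULL);

    if (n % 2 == 1) {
        /* Odd count: the first element seeds both results. */
        lo = hi = a[0];
        i = 1;
    } else {
        /* Even count: the first pair seeds the results. */
        if (a[0] < a[1]) { lo = a[0]; hi = a[1]; }
        else             { lo = a[1]; hi = a[0]; }
        i = 2;
    }

    /* The remaining element count is even, so pairs never run short. */
    for (; i + 1 < n; i += 2) {
        int small = a[i];
        int large = a[i + 1];

        if (small > large) {
            int tmp = small;
            small = large;
            large = tmp;
        }
        if (small < lo)
            lo = small;
        if (large > hi)
            hi = large;
    }

    *min_out = lo;
    *max_out = hi;
}
```
@end verbatim

For the array 7, -2, 9, 4, 0 the function stores -2 as the minimum and 9
as the maximum.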
|
|
|
|
@page
|
|
|
|
@node Bibliography, Concept Index, Implementation, Top
|
|
@chapter Bibliography
|
|
|
|
[1] Chang, C.C.: @i{A Scheme for Constructing Ordered Minimal Perfect
|
|
Hashing Functions} Information Sciences 39(1986), 187-195.
|
|
|
|
[2] Cichelli, Richard J. @i{Author's Response to ``On Cichelli's Minimal Perfect Hash
|
|
Functions Method''} Communications of the ACM, 23, 12(December 1980), 729.
|
|
|
|
[3] Cichelli, Richard J. @i{Minimal Perfect Hash Functions Made Simple}
|
|
Communications of the ACM, 23, 1(January 1980), 17-19.
|
|
|
|
[4] Cook, C. R. and Oldehoeft, R.R. @i{A Letter Oriented Minimal
|
|
Perfect Hashing Function} SIGPLAN Notices, 17, 9(September 1982), 18-27.
|
|
|
|
[5] Cormack, G. V. and Horspool, R. N. S. and Kaiserswerth, M.
|
|
@i{Practical Perfect Hashing} Computer Journal, 28, 1(January 1985), 54-58.
|
|
|
|
[6] Jaeschke, G. @i{Reciprocal Hashing: A Method for Generating Minimal
|
|
Perfect Hashing Functions} Communications of the ACM, 24, 12(December
|
|
1981), 829-833.
|
|
|
|
[7] Jaeschke, G. and Osterburg, G. @i{On Cichelli's Minimal Perfect
|
|
Hash Functions Method} Communications of the ACM, 23, 12(December 1980),
|
|
728-729.
|
|
|
|
[8] Sager, Thomas J. @i{A Polynomial Time Generator for Minimal Perfect
|
|
Hash Functions} Communications of the ACM, 28, 5(May 1985), 523-532.
|
|
|
|
[9] Schmidt, Douglas C. @i{GPERF: A Perfect Hash Function Generator}
|
|
Second USENIX C++ Conference Proceedings, April 1990.
|
|
|
|
[10] Sebesta, R.W. and Taylor, M.A. @i{Minimal Perfect Hash Functions
|
|
for Reserved Word Lists} SIGPLAN Notices, 20, 12(December 1985), 47-53.
|
|
|
|
[11] Sprugnoli, R. @i{Perfect Hashing Functions: A Single Probe
|
|
Retrieving Method for Static Sets} Communications of the ACM, 20,
|
|
11(November 1977), 841-850.
|
|
|
|
[12] Stallman, Richard M. @i{Using and Porting GNU CC} Free Software Foundation,
|
|
1988.
|
|
|
|
[13] Stroustrup, Bjarne @i{The C++ Programming Language.} Addison-Wesley, 1986.
|
|
|
|
[14] Tiemann, Michael D. @i{User's Guide to GNU C++} Free Software
|
|
Foundation, 1989.
|
|
|
|
@node Concept Index, , Bibliography, Top
|
|
@unnumbered Concept Index
|
|
|
|
@printindex cp
|
|
|
|
@contents
|
|
@bye
|