mirror of
https://git.FreeBSD.org/src.git
synced 2024-12-26 11:47:31 +00:00
73b83ffce6
management problem involving custom print formats. PR: 13615 Reported by: Scott Hazen Mueller <scott@zorch.sf-bay.org>
20917 lines
742 KiB
Plaintext
20917 lines
742 KiB
Plaintext
\input texinfo @c -*-texinfo-*-
|
|
@c %**start of header (This is for running Texinfo on a region.)
|
|
@setfilename gawk.info
|
|
@settitle The GNU Awk User's Guide
|
|
@c %**end of header (This is for running Texinfo on a region.)
|
|
|
|
@c inside ifinfo for older versions of texinfo.tex
|
|
@ifinfo
|
|
@c I hope this is the right category
|
|
@dircategory Programming Languages
|
|
@direntry
|
|
* Gawk: (gawk.info). A Text Scanning and Processing Language.
|
|
@end direntry
|
|
@end ifinfo
|
|
|
|
@c @set xref-automatic-section-title
|
|
@c @set DRAFT
|
|
|
|
@c The following information should be updated here only!
|
|
@c This sets the edition of the document, the version of gawk it
|
|
@c applies to, and when the document was updated.
|
|
@set TITLE Effective AWK Programming
|
|
@set SUBTITLE A User's Guide for GNU Awk
|
|
@set PATCHLEVEL 4
|
|
@set EDITION 1.0.@value{PATCHLEVEL}
|
|
@set VERSION 3.0
|
|
@set UPDATE-MONTH April, 1999
|
|
@iftex
|
|
@set DOCUMENT book
|
|
@end iftex
|
|
@ifinfo
|
|
@set DOCUMENT Info file
|
|
@end ifinfo
|
|
|
|
@ignore
|
|
Some comments on the layout for TeX.
|
|
1. Use at least texinfo.tex 2.159. It contains fixes that
|
|
are needed to get the footings for draft mode to not appear.
|
|
2. I have done A LOT of work to make this look good. There are `@page' commands
|
|
and use of `@group ... @end group' in a number of places. If you muck
|
|
with anything, it's your responsibility not to break the layout.
|
|
@end ignore
|
|
|
|
@c merge the function and variable indexes into the concept index
|
|
@ifinfo
|
|
@synindex fn cp
|
|
@synindex vr cp
|
|
@end ifinfo
|
|
@iftex
|
|
@syncodeindex fn cp
|
|
@syncodeindex vr cp
|
|
@end iftex
|
|
|
|
@c If "finalout" is commented out, the printed output will show
|
|
@c black boxes that mark lines that are too long. Thus, it is
|
|
@c unwise to comment it out when running a master in case there are
|
|
@c overfulls which are deemed okay.
|
|
|
|
@ifclear DRAFT
|
|
@iftex
|
|
@finalout
|
|
@end iftex
|
|
@end ifclear
|
|
|
|
@smallbook
|
|
@iftex
|
|
@c @cropmarks
|
|
@end iftex
|
|
|
|
@ifinfo
|
|
This file documents @code{awk}, a program that you can use to select
|
|
particular records in a file and perform operations upon them.
|
|
|
|
This is Edition @value{EDITION} of @cite{@value{TITLE}},
|
|
for the @value{VERSION}.@value{PATCHLEVEL} version of the GNU implementation of AWK.
|
|
|
|
Copyright (C) 1989, 1991, 92, 93, 96, 97, 98, 99 Free Software Foundation, Inc.
|
|
|
|
Permission is granted to make and distribute verbatim copies of
|
|
this manual provided the copyright notice and this permission notice
|
|
are preserved on all copies.
|
|
|
|
@ignore
|
|
Permission is granted to process this file through TeX and print the
|
|
results, provided the printed document carries copying permission
|
|
notice identical to this one except for the removal of this paragraph
|
|
(this paragraph not being relevant to the printed manual).
|
|
|
|
@end ignore
|
|
Permission is granted to copy and distribute modified versions of this
|
|
manual under the conditions for verbatim copying, provided that the entire
|
|
resulting derived work is distributed under the terms of a permission
|
|
notice identical to this one.
|
|
|
|
Permission is granted to copy and distribute translations of this manual
|
|
into another language, under the above conditions for modified versions,
|
|
except that this permission notice may be stated in a translation approved
|
|
by the Foundation.
|
|
@end ifinfo
|
|
|
|
@setchapternewpage odd
|
|
|
|
@titlepage
|
|
@title @value{TITLE}
|
|
@subtitle @value{SUBTITLE}
|
|
@subtitle Edition @value{EDITION}
|
|
@subtitle @value{UPDATE-MONTH}
|
|
@author Arnold D. Robbins
|
|
@ignore
|
|
@sp 1
|
|
@author Based on @cite{The GAWK Manual},
|
|
@author by Robbins, Close, Rubin, and Stallman
|
|
@end ignore
|
|
|
|
@c Include the Distribution inside the titlepage environment so
|
|
@c that headings are turned off. Headings on and off do not work.
|
|
|
|
@page
|
|
@vskip 0pt plus 1filll
|
|
@ifset LEGALJUNK
|
|
The programs and applications presented in this book have been
|
|
included for their instructional value. They have been tested with care,
|
|
but are not guaranteed for any particular purpose. The publisher does not
|
|
offer any warranties or representations, nor does it accept any
|
|
liabilities with respect to the programs or applications.
|
|
So there.
|
|
@sp 2
|
|
UNIX is a registered trademark of X/Open, Ltd. @*
|
|
Microsoft, MS, and MS-DOS are registered trademarks, and Windows is a
|
|
trademark of Microsoft Corporation in the United States and other
|
|
countries. @*
|
|
Atari, 520ST, 1040ST, TT, STE, Mega, and Falcon are registered trademarks
|
|
or trademarks of Atari Corporation. @*
|
|
DEC, Digital, OpenVMS, ULTRIX, and VMS, are trademarks of Digital Equipment
|
|
Corporation. @*
|
|
@end ifset
|
|
``To boldly go where no man has gone before'' is a
|
|
Registered Trademark of Paramount Pictures Corporation. @*
|
|
@c sorry, i couldn't resist
|
|
@sp 3
|
|
Copyright @copyright{} 1989, 1991, 92, 93, 96, 97, 98, 99 Free Software Foundation, Inc.
|
|
@sp 2
|
|
|
|
This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
|
|
for the @value{VERSION}.@value{PATCHLEVEL} (or later) version of the GNU implementation of AWK.
|
|
|
|
@sp 2
|
|
@center Published jointly by:
|
|
|
|
@multitable {Specialized Systems Consultants, Inc. (SSC)} {Boston, MA 02111-1307 USA}
|
|
@item Specialized Systems Consultants, Inc. (SSC) @tab Free Software Foundation
|
|
@item PO Box 55549 @tab 59 Temple Place --- Suite 330
|
|
@item Seattle, WA 98155 USA @tab Boston, MA 02111-1307 USA
|
|
@item Phone: +1-206-782-7733 @tab Phone: +1-617-542-5942
|
|
@item Fax: +1-206-782-7191 @tab Fax: +1-617-542-2652
|
|
@item E-mail: @code{sales@@ssc.com} @tab E-mail: @code{gnu@@gnu.org}
|
|
@item URL: @code{http://www.ssc.com/} @tab URL: @code{http://www.fsf.org/}
|
|
@end multitable
|
|
|
|
@sp 1
|
|
@c this ISBN can change! Check with SSC
|
|
@c This one is correct for gawk 3.0 and edition 1.0 from the FSF
|
|
ISBN 1-882114-26-4 @*
|
|
@c This one is correct for gawk 3.0.3 and edition 1.0.3 from SSC
|
|
@c ISBN 1-57831-000-8 @*
|
|
|
|
Permission is granted to make and distribute verbatim copies of
|
|
this manual provided the copyright notice and this permission notice
|
|
are preserved on all copies.
|
|
|
|
Permission is granted to copy and distribute modified versions of this
|
|
manual under the conditions for verbatim copying, provided that the entire
|
|
resulting derived work is distributed under the terms of a permission
|
|
notice identical to this one.
|
|
|
|
Permission is granted to copy and distribute translations of this manual
|
|
into another language, under the above conditions for modified versions,
|
|
except that this permission notice may be stated in a translation approved
|
|
by the Foundation.
|
|
@sp 2
|
|
@c Cover art by Etienne Suvasa.
|
|
Cover art by Amy Wells Wood.
|
|
@end titlepage
|
|
|
|
@c Thanks to Bob Chassell for directions on doing dedications.
|
|
@iftex
|
|
@headings off
|
|
@page
|
|
@w{ }
|
|
@sp 9
|
|
@center @i{To Miriam, for making me complete.}
|
|
@sp 1
|
|
@center @i{To Chana, for the joy you bring us.}
|
|
@sp 1
|
|
@center @i{To Rivka, for the exponential increase.}
|
|
@sp 1
|
|
@center @i{To Nachum, for the added dimension.}
|
|
@page
|
|
@w{ }
|
|
@page
|
|
@headings on
|
|
@end iftex
|
|
|
|
@iftex
|
|
@headings off
|
|
@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
|
|
@oddheading @| @| @strong{@thischapter}@ @ @ @thispage
|
|
@ifset DRAFT
|
|
@evenfooting @today{} @| @emph{DRAFT!} @| Please Do Not Redistribute
|
|
@oddfooting Please Do Not Redistribute @| @emph{DRAFT!} @| @today{}
|
|
@end ifset
|
|
@end iftex
|
|
|
|
@ifinfo
|
|
@node Top, Preface, (dir), (dir)
|
|
@top General Introduction
|
|
@c Preface or Licensing nodes should come right after the Top
|
|
@c node, in `unnumbered' sections, then the chapter, `What is gawk'.
|
|
|
|
This file documents @code{awk}, a program that you can use to select
|
|
particular records in a file and perform operations upon them.
|
|
|
|
This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
|
|
for the @value{VERSION}.@value{PATCHLEVEL} version of the GNU implementation @*
|
|
of AWK.
|
|
|
|
@end ifinfo
|
|
|
|
@menu
|
|
* Preface:: What this @value{DOCUMENT} is about; brief
|
|
history and acknowledgements.
|
|
* What Is Awk:: What is the @code{awk} language; using this
|
|
@value{DOCUMENT}.
|
|
* Getting Started:: A basic introduction to using @code{awk}. How
|
|
to run an @code{awk} program. Command line
|
|
syntax.
|
|
* One-liners:: Short, sample @code{awk} programs.
|
|
* Regexp:: All about matching things using regular
|
|
expressions.
|
|
* Reading Files:: How to read files and manipulate fields.
|
|
* Printing:: How to print using @code{awk}. Describes the
|
|
@code{print} and @code{printf} statements.
|
|
Also describes redirection of output.
|
|
* Expressions:: Expressions are the basic building blocks of
|
|
statements.
|
|
* Patterns and Actions:: Overviews of patterns and actions.
|
|
* Statements:: The various control statements are described
|
|
in detail.
|
|
* Built-in Variables:: Built-in Variables
|
|
* Arrays:: The description and use of arrays. Also
|
|
includes array-oriented control statements.
|
|
* Built-in:: The built-in functions are summarized here.
|
|
* User-defined:: User-defined functions are described in
|
|
detail.
|
|
* Invoking Gawk:: How to run @code{gawk}.
|
|
* Library Functions:: A Library of @code{awk} Functions.
|
|
* Sample Programs:: Many @code{awk} programs with complete
|
|
explanations.
|
|
* Language History:: The evolution of the @code{awk} language.
|
|
* Gawk Summary:: @code{gawk} Options and Language Summary.
|
|
* Installation:: Installing @code{gawk} under various operating
|
|
systems.
|
|
* Notes:: Something about the implementation of
|
|
@code{gawk}.
|
|
* Glossary:: An explanation of some unfamiliar terms.
|
|
* Copying:: Your right to copy and distribute @code{gawk}.
|
|
* Index:: Concept and Variable Index.
|
|
|
|
* History:: The history of @code{gawk} and @code{awk}.
|
|
* Manual History:: Brief history of the GNU project and this
|
|
@value{DOCUMENT}.
|
|
* Acknowledgements:: Acknowledgements.
|
|
* This Manual:: Using this @value{DOCUMENT}. Includes sample
|
|
input files that you can use.
|
|
* Conventions:: Typographical Conventions.
|
|
* Sample Data Files:: Sample data files for use in the @code{awk}
|
|
programs illustrated in this @value{DOCUMENT}.
|
|
* Names:: What name to use to find @code{awk}.
|
|
* Running gawk:: How to run @code{gawk} programs; includes
|
|
command line syntax.
|
|
* One-shot:: Running a short throw-away @code{awk} program.
|
|
* Read Terminal:: Using no input files (input from terminal
|
|
instead).
|
|
* Long:: Putting permanent @code{awk} programs in
|
|
files.
|
|
* Executable Scripts:: Making self-contained @code{awk} programs.
|
|
* Comments:: Adding documentation to @code{gawk} programs.
|
|
* Very Simple:: A very simple example.
|
|
* Two Rules:: A less simple one-line example with two rules.
|
|
* More Complex:: A more complex example.
|
|
* Statements/Lines:: Subdividing or combining statements into
|
|
lines.
|
|
* Other Features:: Other Features of @code{awk}.
|
|
* When:: When to use @code{gawk} and when to use other
|
|
things.
|
|
* Regexp Usage:: How to Use Regular Expressions.
|
|
* Escape Sequences:: How to write non-printing characters.
|
|
* Regexp Operators:: Regular Expression Operators.
|
|
* GNU Regexp Operators:: Operators specific to GNU software.
|
|
* Case-sensitivity:: How to do case-insensitive matching.
|
|
* Leftmost Longest:: How much text matches.
|
|
* Computed Regexps:: Using Dynamic Regexps.
|
|
* Records:: Controlling how data is split into records.
|
|
* Fields:: An introduction to fields.
|
|
* Non-Constant Fields:: Non-constant Field Numbers.
|
|
* Changing Fields:: Changing the Contents of a Field.
|
|
* Field Separators:: The field separator and how to change it.
|
|
* Basic Field Splitting:: How fields are split with single characters or
|
|
simple strings.
|
|
* Regexp Field Splitting:: Using regexps as the field separator.
|
|
* Single Character Fields:: Making each character a separate field.
|
|
* Command Line Field Separator:: Setting @code{FS} from the command line.
|
|
* Field Splitting Summary:: Some final points and a summary table.
|
|
* Constant Size:: Reading constant width data.
|
|
* Multiple Line:: Reading multi-line records.
|
|
* Getline:: Reading files under explicit program control
|
|
using the @code{getline} function.
|
|
* Getline Intro:: Introduction to the @code{getline} function.
|
|
* Plain Getline:: Using @code{getline} with no arguments.
|
|
* Getline/Variable:: Using @code{getline} into a variable.
|
|
* Getline/File:: Using @code{getline} from a file.
|
|
* Getline/Variable/File:: Using @code{getline} into a variable from a
|
|
file.
|
|
* Getline/Pipe:: Using @code{getline} from a pipe.
|
|
* Getline/Variable/Pipe:: Using @code{getline} into a variable from a
|
|
pipe.
|
|
* Getline Summary:: Summary Of @code{getline} Variants.
|
|
* Print:: The @code{print} statement.
|
|
* Print Examples:: Simple examples of @code{print} statements.
|
|
* Output Separators:: The output separators and how to change them.
|
|
* OFMT:: Controlling Numeric Output With @code{print}.
|
|
* Printf:: The @code{printf} statement.
|
|
* Basic Printf:: Syntax of the @code{printf} statement.
|
|
* Control Letters:: Format-control letters.
|
|
* Format Modifiers:: Format-specification modifiers.
|
|
* Printf Examples:: Several examples.
|
|
* Redirection:: How to redirect output to multiple files and
|
|
pipes.
|
|
* Special Files:: File name interpretation in @code{gawk}.
|
|
@code{gawk} allows access to inherited file
|
|
descriptors.
|
|
* Close Files And Pipes:: Closing Input and Output Files and Pipes.
|
|
* Constants:: String, numeric, and regexp constants.
|
|
* Scalar Constants:: Numeric and string constants.
|
|
* Regexp Constants:: Regular Expression constants.
|
|
* Using Constant Regexps:: When and how to use a regexp constant.
|
|
* Variables:: Variables give names to values for later use.
|
|
* Using Variables:: Using variables in your programs.
|
|
* Assignment Options:: Setting variables on the command line and a
|
|
summary of command line syntax. This is an
|
|
advanced method of input.
|
|
* Conversion:: The conversion of strings to numbers and vice
|
|
versa.
|
|
* Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-},
|
|
etc.)
|
|
* Concatenation:: Concatenating strings.
|
|
* Assignment Ops:: Changing the value of a variable or a field.
|
|
* Increment Ops:: Incrementing the numeric value of a variable.
|
|
* Truth Values:: What is ``true'' and what is ``false''.
|
|
* Typing and Comparison:: How variables acquire types, and how this
|
|
affects comparison of numbers and strings with
|
|
@samp{<}, etc.
|
|
* Boolean Ops:: Combining comparison expressions using boolean
|
|
operators @samp{||} (``or''), @samp{&&}
|
|
(``and'') and @samp{!} (``not'').
|
|
* Conditional Exp:: Conditional expressions select between two
|
|
subexpressions under control of a third
|
|
subexpression.
|
|
* Function Calls:: A function call is an expression.
|
|
* Precedence:: How various operators nest.
|
|
* Pattern Overview:: What goes into a pattern.
|
|
* Kinds of Patterns:: A list of all kinds of patterns.
|
|
* Regexp Patterns:: Using regexps as patterns.
|
|
* Expression Patterns:: Any expression can be used as a pattern.
|
|
* Ranges:: Pairs of patterns specify record ranges.
|
|
* BEGIN/END:: Specifying initialization and cleanup rules.
|
|
* Using BEGIN/END:: How and why to use BEGIN/END rules.
|
|
* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
|
|
* Empty:: The empty pattern, which matches every record.
|
|
* Action Overview:: What goes into an action.
|
|
* If Statement:: Conditionally execute some @code{awk}
|
|
statements.
|
|
* While Statement:: Loop until some condition is satisfied.
|
|
* Do Statement:: Do specified action while looping until some
|
|
condition is satisfied.
|
|
* For Statement:: Another looping statement, that provides
|
|
initialization and increment clauses.
|
|
* Break Statement:: Immediately exit the innermost enclosing loop.
|
|
* Continue Statement:: Skip to the end of the innermost enclosing
|
|
loop.
|
|
* Next Statement:: Stop processing the current input record.
|
|
* Nextfile Statement:: Stop processing the current file.
|
|
* Exit Statement:: Stop execution of @code{awk}.
|
|
* User-modified:: Built-in variables that you change to control
|
|
@code{awk}.
|
|
* Auto-set:: Built-in variables where @code{awk} gives you
|
|
information.
|
|
* ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}.
|
|
* Array Intro:: Introduction to Arrays
|
|
* Reference to Elements:: How to examine one element of an array.
|
|
* Assigning Elements:: How to change an element of an array.
|
|
* Array Example:: Basic Example of an Array
|
|
* Scanning an Array:: A variation of the @code{for} statement. It
|
|
loops through the indices of an array's
|
|
existing elements.
|
|
* Delete:: The @code{delete} statement removes an element
|
|
from an array.
|
|
* Numeric Array Subscripts:: How to use numbers as subscripts in
|
|
@code{awk}.
|
|
* Uninitialized Subscripts:: Using Uninitialized variables as subscripts.
|
|
* Multi-dimensional:: Emulating multi-dimensional arrays in
|
|
@code{awk}.
|
|
* Multi-scanning:: Scanning multi-dimensional arrays.
|
|
* Calling Built-in:: How to call built-in functions.
|
|
* Numeric Functions:: Functions that work with numbers, including
|
|
@code{int}, @code{sin} and @code{rand}.
|
|
* String Functions:: Functions for string manipulation, such as
|
|
@code{split}, @code{match}, and
|
|
@code{sprintf}.
|
|
* I/O Functions:: Functions for files and shell commands.
|
|
* Time Functions:: Functions for dealing with time stamps.
|
|
* Definition Syntax:: How to write definitions and what they mean.
|
|
* Function Example:: An example function definition and what it
|
|
does.
|
|
* Function Caveats:: Things to watch out for.
|
|
* Return Statement:: Specifying the value a function returns.
|
|
* Options:: Command line options and their meanings.
|
|
* Other Arguments:: Input file names and variable assignments.
|
|
* AWKPATH Variable:: Searching directories for @code{awk} programs.
|
|
* Obsolete:: Obsolete Options and/or features.
|
|
* Undocumented:: Undocumented Options and Features.
|
|
* Known Bugs:: Known Bugs in @code{gawk}.
|
|
* Portability Notes:: What to do if you don't have @code{gawk}.
|
|
* Nextfile Function:: Two implementations of a @code{nextfile}
|
|
function.
|
|
* Assert Function:: A function for assertions in @code{awk}
|
|
programs.
|
|
* Round Function:: A function for rounding if @code{sprintf} does
|
|
not do it correctly.
|
|
* Ordinal Functions:: Functions for using characters as numbers and
|
|
vice versa.
|
|
* Join Function:: A function to join an array into a string.
|
|
* Mktime Function:: A function to turn a date into a timestamp.
|
|
* Gettimeofday Function:: A function to get formatted times.
|
|
* Filetrans Function:: A function for handling data file transitions.
|
|
* Getopt Function:: A function for processing command line
|
|
arguments.
|
|
* Passwd Functions:: Functions for getting user information.
|
|
* Group Functions:: Functions for getting group information.
|
|
* Library Names:: How to best name private global variables in
|
|
library functions.
|
|
* Clones:: Clones of common utilities.
|
|
* Cut Program:: The @code{cut} utility.
|
|
* Egrep Program:: The @code{egrep} utility.
|
|
* Id Program:: The @code{id} utility.
|
|
* Split Program:: The @code{split} utility.
|
|
* Tee Program:: The @code{tee} utility.
|
|
* Uniq Program:: The @code{uniq} utility.
|
|
* Wc Program:: The @code{wc} utility.
|
|
* Miscellaneous Programs:: Some interesting @code{awk} programs.
|
|
* Dupword Program:: Finding duplicated words in a document.
|
|
* Alarm Program:: An alarm clock.
|
|
* Translate Program:: A program similar to the @code{tr} utility.
|
|
* Labels Program:: Printing mailing labels.
|
|
* Word Sorting:: A program to produce a word usage count.
|
|
* History Sorting:: Eliminating duplicate entries from a history
|
|
file.
|
|
* Extract Program:: Pulling out programs from Texinfo source
|
|
files.
|
|
* Simple Sed:: A Simple Stream Editor.
|
|
* Igawk Program:: A wrapper for @code{awk} that includes files.
|
|
* V7/SVR3.1:: The major changes between V7 and System V
|
|
Release 3.1.
|
|
* SVR4:: Minor changes between System V Releases 3.1
|
|
and 4.
|
|
* POSIX:: New features from the POSIX standard.
|
|
* BTL:: New features from the Bell Laboratories
|
|
version of @code{awk}.
|
|
* POSIX/GNU:: The extensions in @code{gawk} not in POSIX
|
|
@code{awk}.
|
|
* Command Line Summary:: Recapitulation of the command line.
|
|
* Language Summary:: A terse review of the language.
|
|
* Variables/Fields:: Variables, fields, and arrays.
|
|
* Fields Summary:: Input field splitting.
|
|
* Built-in Summary:: @code{awk}'s built-in variables.
|
|
* Arrays Summary:: Using arrays.
|
|
* Data Type Summary:: Values in @code{awk} are numbers or strings.
|
|
* Rules Summary:: Patterns and Actions, and their component
|
|
parts.
|
|
* Pattern Summary:: Quick overview of patterns.
|
|
* Regexp Summary:: Quick overview of regular expressions.
|
|
* Actions Summary:: Quick overview of actions.
|
|
* Operator Summary:: @code{awk} operators.
|
|
* Control Flow Summary:: The control statements.
|
|
* I/O Summary:: The I/O statements.
|
|
* Printf Summary:: A summary of @code{printf}.
|
|
* Special File Summary:: Special file names interpreted internally.
|
|
* Built-in Functions Summary:: Built-in numeric and string functions.
|
|
* Time Functions Summary:: Built-in time functions.
|
|
* String Constants Summary:: Escape sequences in strings.
|
|
* Functions Summary:: Defining and calling functions.
|
|
* Historical Features:: Some undocumented but supported ``features''.
|
|
* Gawk Distribution:: What is in the @code{gawk} distribution.
|
|
* Getting:: How to get the distribution.
|
|
* Extracting:: How to extract the distribution.
|
|
* Distribution contents:: What is in the distribution.
|
|
* Unix Installation:: Installing @code{gawk} under various versions
|
|
of Unix.
|
|
* Quick Installation:: Compiling @code{gawk} under Unix.
|
|
* Configuration Philosophy:: How it's all supposed to work.
|
|
* VMS Installation:: Installing @code{gawk} on VMS.
|
|
* VMS Compilation:: How to compile @code{gawk} under VMS.
|
|
* VMS Installation Details:: How to install @code{gawk} under VMS.
|
|
* VMS Running:: How to run @code{gawk} under VMS.
|
|
* VMS POSIX:: Alternate instructions for VMS POSIX.
|
|
* PC Installation:: Installing and Compiling @code{gawk} on MS-DOS
|
|
and OS/2
|
|
* Atari Installation:: Installing @code{gawk} on the Atari ST.
|
|
* Atari Compiling:: Compiling @code{gawk} on Atari
|
|
* Atari Using:: Running @code{gawk} on Atari
|
|
* Amiga Installation:: Installing @code{gawk} on an Amiga.
|
|
* Bugs:: Reporting Problems and Bugs.
|
|
* Other Versions:: Other freely available @code{awk}
|
|
implementations.
|
|
* Compatibility Mode:: How to disable certain @code{gawk} extensions.
|
|
* Additions:: Making Additions To @code{gawk}.
|
|
* Adding Code:: Adding code to the main body of @code{gawk}.
|
|
* New Ports:: Porting @code{gawk} to a new operating system.
|
|
* Future Extensions:: New features that may be implemented one day.
|
|
* Improvements:: Suggestions for improvements by volunteers.
|
|
|
|
@end menu
|
|
|
|
@c dedication for Info file
|
|
@ifinfo
|
|
@center To Miriam, for making me complete.
|
|
@sp 1
|
|
@center To Chana, for the joy you bring us.
|
|
@sp 1
|
|
@center To Rivka, for the exponential increase.
|
|
@sp 1
|
|
@center To Nachum, for the added dimension.
|
|
@end ifinfo
|
|
|
|
@node Preface, What Is Awk, Top, Top
|
|
@unnumbered Preface
|
|
|
|
@c I saw a comment somewhere that the preface should describe the book itself,
|
|
@c and the introduction should describe what the book covers.
|
|
|
|
This @value{DOCUMENT} teaches you about the @code{awk} language and
|
|
how you can use it effectively. You should already be familiar with basic
|
|
system commands, such as @code{cat} and @code{ls},@footnote{These commands
|
|
are available on POSIX compliant systems, as well as on traditional Unix
|
|
based systems. If you are using some other operating system, you still need to
|
|
be familiar with the ideas of I/O redirection and pipes.} and basic shell
|
|
facilities, such as Input/Output (I/O) redirection and pipes.
|
|
|
|
Implementations of the @code{awk} language are available for many different
|
|
computing environments. This @value{DOCUMENT}, while describing the @code{awk} language
|
|
in general, also describes a particular implementation of @code{awk} called
|
|
@code{gawk} (which stands for ``GNU Awk''). @code{gawk} runs on a broad range
|
|
of Unix systems, ranging from 80386 PC-based computers, up through large scale
|
|
systems, such as Crays. @code{gawk} has also been ported to MS-DOS and
|
|
OS/2 PC's, Atari and Amiga micro-computers, and VMS.
|
|
|
|
@menu
|
|
* History:: The history of @code{gawk} and @code{awk}.
|
|
* Manual History:: Brief history of the GNU project and this
|
|
@value{DOCUMENT}.
|
|
* Acknowledgements:: Acknowledgements.
|
|
@end menu
|
|
|
|
@node History, Manual History, Preface, Preface
|
|
@unnumberedsec History of @code{awk} and @code{gawk}
|
|
|
|
@cindex acronym
|
|
@cindex history of @code{awk}
|
|
@cindex Aho, Alfred
|
|
@cindex Weinberger, Peter
|
|
@cindex Kernighan, Brian
|
|
@cindex old @code{awk}
|
|
@cindex new @code{awk}
|
|
The name @code{awk} comes from the initials of its designers: Alfred V.@:
|
|
Aho, Peter J.@: Weinberger, and Brian W.@: Kernighan. The original version of
|
|
@code{awk} was written in 1977 at AT&T Bell Laboratories.
|
|
In 1985 a new version made the programming
|
|
language more powerful, introducing user-defined functions, multiple input
|
|
streams, and computed regular expressions.
|
|
This new version became generally available with Unix System V Release 3.1.
|
|
The version in System V Release 4 added some new features and also cleaned
|
|
up the behavior in some of the ``dark corners'' of the language.
|
|
The specification for @code{awk} in the POSIX Command Language
|
|
and Utilities standard further clarified the language based on feedback
|
|
from both the @code{gawk} designers, and the original Bell Labs @code{awk}
|
|
designers.
|
|
|
|
The GNU implementation, @code{gawk}, was written in 1986 by Paul Rubin
|
|
and Jay Fenlason, with advice from Richard Stallman. John Woods
|
|
contributed parts of the code as well. In 1988 and 1989, David Trueman, with
|
|
help from Arnold Robbins, thoroughly reworked @code{gawk} for compatibility
|
|
with the newer @code{awk}. Current development focuses on bug fixes,
|
|
performance improvements, standards compliance, and occasionally, new features.
|
|
|
|
@node Manual History, Acknowledgements, History, Preface
|
|
@unnumberedsec The GNU Project and This Book
|
|
|
|
@cindex Free Software Foundation
|
|
@cindex Stallman, Richard
|
|
The Free Software Foundation (FSF) is a non-profit organization dedicated
|
|
to the production and distribution of freely distributable software.
|
|
It was founded by Richard M.@: Stallman, the author of the original
|
|
Emacs editor. GNU Emacs is the most widely used version of Emacs today.
|
|
|
|
@cindex GNU Project
|
|
The GNU project is an on-going effort on the part of the Free Software
|
|
Foundation to create a complete, freely distributable, POSIX compliant
|
|
computing environment. (GNU stands for ``GNU's not Unix''.)
|
|
The FSF uses the ``GNU General Public License'' (or GPL) to ensure that
|
|
source code for their software is always available to the end user. A
|
|
copy of the GPL is included for your reference
|
|
(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
|
|
The GPL applies to the C language source code for @code{gawk}.
|
|
|
|
A shell, an editor (Emacs), highly portable optimizing C, C++, and
|
|
Objective-C compilers, a symbolic debugger, and dozens of large and
|
|
small utilities (such as @code{gawk}), have all been completed and are
|
|
freely available. As of this writing (early 1997), the GNU operating
|
|
system kernel (the HURD), has been released, but is still in an early
|
|
stage of development.
|
|
|
|
@cindex Linux
|
|
@cindex NetBSD
|
|
@cindex FreeBSD
|
|
Until the GNU operating system is more fully developed, you should
|
|
consider using Linux, a freely distributable, Unix-like operating
|
|
system for 80386, DEC Alpha, Sun SPARC and other systems. There are
|
|
many books on Linux. One freely available one is @cite{Linux
|
|
Installation and Getting Started}, by Matt Welsh.
|
|
Many Linux distributions are available, often in computer stores or
|
|
bundled on CD-ROM with books about Linux.
|
|
(There are three other freely available, Unix-like operating systems for
|
|
80386 and other systems, NetBSD, FreeBSD,and OpenBSD. All are based on the
|
|
4.4-Lite Berkeley Software Distribution, and they use recent versions
|
|
of @code{gawk} for their versions of @code{awk}.)
|
|
|
|
@iftex
|
|
This @value{DOCUMENT} you are reading now is actually free. The
|
|
information in it is freely available to anyone, the machine readable
|
|
source code for the @value{DOCUMENT} comes with @code{gawk}, and anyone
|
|
may take this @value{DOCUMENT} to a copying machine and make as many
|
|
copies of it as they like. (Take a moment to check the copying
|
|
permissions on the Copyright page.)
|
|
|
|
If you paid money for this @value{DOCUMENT}, what you actually paid for
|
|
was the @value{DOCUMENT}'s nice printing and binding, and the
|
|
publisher's associated costs to produce it. We have made an effort to
|
|
keep these costs reasonable; most people would prefer a bound book to
|
|
over 330 pages of photo-copied text that would then have to be held in
|
|
a loose-leaf binder (not to mention the time and labor involved in
|
|
doing the copying). The same is true of producing this
|
|
@value{DOCUMENT} from the machine readable source; the retail price is
|
|
only slightly more than the cost per page of printing it
|
|
on a laser printer.
|
|
@end iftex
|
|
|
|
This @value{DOCUMENT} itself has gone through several previous,
|
|
preliminary editions. I started working on a preliminary draft of
|
|
@cite{The GAWK Manual}, by Diane Close, Paul Rubin, and Richard
|
|
Stallman in the fall of 1988.
|
|
It was around 90 pages long, and barely described the original, ``old''
|
|
version of @code{awk}. After substantial revision, the first version of
|
|
the @cite{The GAWK Manual} to be released was Edition 0.11 Beta in
|
|
October of 1989. The manual then underwent more substantial revision
|
|
for Edition 0.13 of December 1991.
|
|
David Trueman, Pat Rankin, and Michal Jaegermann contributed sections
|
|
of the manual for Edition 0.13.
|
|
That edition was published by the
|
|
FSF as a bound book early in 1992. Since then there have been several
|
|
minor revisions, notably Edition 0.14 of November 1992 that was published
|
|
by the FSF in January of 1993, and Edition 0.16 of August 1993.
|
|
|
|
Edition 1.0 of @cite{@value{TITLE}} represents a significant re-working
|
|
of @cite{The GAWK Manual}, with much additional material.
|
|
The FSF and I agree that I am now the primary author.
|
|
I also felt that it needed a more descriptive title.
|
|
|
|
@cite{@value{TITLE}} will undoubtedly continue to evolve.
|
|
An electronic version
|
|
comes with the @code{gawk} distribution from the FSF.
|
|
If you find an error in this @value{DOCUMENT}, please report it!
|
|
@xref{Bugs, ,Reporting Problems and Bugs}, for information on submitting
|
|
problem reports electronically, or write to me in care of the FSF.
|
|
|
|
@node Acknowledgements, , Manual History, Preface
|
|
@unnumberedsec Acknowledgements
|
|
|
|
@cindex Stallman, Richard
|
|
I would like to acknowledge Richard M.@: Stallman, for his vision of a
|
|
better world, and for his courage in founding the FSF and starting the
|
|
GNU project.
|
|
|
|
The initial draft of @cite{The GAWK Manual} had the following acknowledgements:
|
|
|
|
@quotation
|
|
Many people need to be thanked for their assistance in producing this
|
|
manual. Jay Fenlason contributed many ideas and sample programs. Richard
|
|
Mlynarik and Robert Chassell gave helpful comments on drafts of this
|
|
manual. The paper @cite{A Supplemental Document for @code{awk}} by John W.@:
|
|
Pierce of the Chemistry Department at UC San Diego, pinpointed several
|
|
issues relevant both to @code{awk} implementation and to this manual, that
|
|
would otherwise have escaped us.
|
|
@end quotation
|
|
|
|
The following people provided many helpful comments on Edition 0.13 of
|
|
@cite{The GAWK Manual}: Rick Adams, Michael Brennan, Rich Burridge, Diane Close,
|
|
Christopher (``Topher'') Eliot, Michael Lijewski, Pat Rankin, Miriam Robbins,
|
|
and Michal Jaegermann.
|
|
|
|
The following people provided many helpful comments for Edition 1.0 of
|
|
@cite{@value{TITLE}}: Karl Berry, Michael Brennan, Darrel
|
|
Hankerson, Michal Jaegermann, Michael Lijewski, and Miriam Robbins.
|
|
Pat Rankin, Michal Jaegermann, Darrel Hankerson and Scott Deifik
|
|
updated their respective sections for Edition 1.0.
|
|
|
|
Robert J.@: Chassell provided much valuable advice on
|
|
the use of Texinfo. He also deserves special thanks for
|
|
convincing me @emph{not} to title this @value{DOCUMENT}
|
|
@cite{How To Gawk Politely}.
|
|
Karl Berry helped significantly with the @TeX{} part of Texinfo.
|
|
|
|
@cindex Trueman, David
|
|
David Trueman deserves special credit; he has done a yeoman job
|
|
of evolving @code{gawk} so that it performs well, and without bugs.
|
|
Although he is no longer involved with @code{gawk},
|
|
working with him on this project was a significant pleasure.
|
|
|
|
@cindex Deifik, Scott
|
|
@cindex Hankerson, Darrel
|
|
@cindex Rommel, Kai Uwe
|
|
@cindex Rankin, Pat
|
|
@cindex Jaegermann, Michal
|
|
Scott Deifik, Darrel Hankerson, Kai Uwe Rommel, Pat Rankin, and Michal
|
|
Jaegermann (in no particular order) are long time members of the
|
|
@code{gawk} ``crack portability team.'' Without their hard work and
|
|
help, @code{gawk} would not be nearly the fine program it is today. It
|
|
has been and continues to be a pleasure working with this team of fine
|
|
people.
|
|
|
|
@cindex Friedl, Jeffrey
|
|
Jeffrey Friedl provided invaluable help in tracking down a number
|
|
of last minute problems with regular expressions in @code{gawk} 3.0.
|
|
|
|
@cindex Kernighan, Brian
|
|
David and I would like to thank Brian Kernighan of Bell Labs for
|
|
invaluable assistance during the testing and debugging of @code{gawk}, and for
|
|
help in clarifying numerous points about the language. We could not have
|
|
done nearly as good a job on either @code{gawk} or its documentation without
|
|
his help.
|
|
|
|
@cindex Hughes, Phil
|
|
I would like to thank Marshall and Elaine Hartholz of Seattle, and Dr.@:
|
|
Bert and Rita Schreiber of Detroit for large amounts of quiet vacation
|
|
time in their homes, which allowed me to make significant progress on
|
|
this @value{DOCUMENT} and on @code{gawk} itself. Phil Hughes of SSC
|
|
contributed in a very important way by loaning me his laptop Linux
|
|
system, not once, but twice, allowing me to do a lot of work while
|
|
away from home.
|
|
|
|
@cindex Robbins, Miriam
|
|
Finally, I must thank my wonderful wife, Miriam, for her patience through
|
|
the many versions of this project, for her proof-reading,
|
|
and for sharing me with the computer.
|
|
I would like to thank my parents for their love, and for the grace with
|
|
which they raised and educated me.
|
|
I also must acknowledge my gratitude to G-d, for the many opportunities
|
|
He has sent my way, as well as for the gifts He has given me with which to
|
|
take advantage of those opportunities.
|
|
@sp 2
|
|
@noindent
|
|
Arnold Robbins @*
|
|
Atlanta, Georgia @*
|
|
February, 1997
|
|
|
|
@ignore
|
|
Stuff still not covered anywhere:
|
|
BASICS:
|
|
Integer vs. floating point
|
|
Hex vs. octal vs. decimal
|
|
Interpreter vs compiler
|
|
input/output
|
|
@end ignore
|
|
|
|
@node What Is Awk, Getting Started, Preface, Top
|
|
@chapter Introduction
|
|
|
|
If you are like many computer users, you would frequently like to make
|
|
changes in various text files wherever certain patterns appear, or
|
|
extract data from parts of certain lines while discarding the rest. To
|
|
write a program to do this in a language such as C or Pascal is a
|
|
time-consuming inconvenience that may take many lines of code. The job
|
|
may be easier with @code{awk}.
|
|
|
|
The @code{awk} utility interprets a special-purpose programming language
|
|
that makes it possible to handle simple data-reformatting jobs
|
|
with just a few lines of code.
|
|
|
|
The GNU implementation of @code{awk} is called @code{gawk}; it is fully
|
|
upward compatible with the System V Release 4 version of
|
|
@code{awk}. @code{gawk} is also upward compatible with the POSIX
|
|
specification of the @code{awk} language. This means that all
|
|
properly written @code{awk} programs should work with @code{gawk}.
|
|
Thus, we usually don't distinguish between @code{gawk} and other @code{awk}
|
|
implementations.
|
|
|
|
@cindex uses of @code{awk}
|
|
Using @code{awk} you can:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
manage small, personal databases
|
|
|
|
@item
|
|
generate reports
|
|
|
|
@item
|
|
validate data
|
|
|
|
@item
|
|
produce indexes, and perform other document preparation tasks
|
|
|
|
@item
|
|
even experiment with algorithms that can be adapted later to other computer
|
|
languages
|
|
@end itemize
|
|
|
|
@menu
|
|
* This Manual:: Using this @value{DOCUMENT}. Includes sample
|
|
input files that you can use.
|
|
* Conventions:: Typographical Conventions.
|
|
* Sample Data Files:: Sample data files for use in the @code{awk}
|
|
programs illustrated in this @value{DOCUMENT}.
|
|
@end menu
|
|
|
|
@node This Manual, Conventions, What Is Awk, What Is Awk
|
|
@section Using This Book
|
|
@cindex book, using this
|
|
@cindex using this book
|
|
@cindex language, @code{awk}
|
|
@cindex program, @code{awk}
|
|
@ignore
|
|
@cindex @code{awk} language
|
|
@cindex @code{awk} program
|
|
@end ignore
|
|
|
|
The term @code{awk} refers to a particular program, and to the language you
|
|
use to tell this program what to do. When we need to be careful, we call
|
|
the program ``the @code{awk} utility'' and the language ``the @code{awk}
|
|
language.'' The term @code{gawk} refers to a version of @code{awk} developed
|
|
as part the GNU project. The purpose of this @value{DOCUMENT} is to explain
|
|
both the @code{awk} language and how to run the @code{awk} utility.
|
|
|
|
The main purpose of the @value{DOCUMENT} is to explain the features
|
|
of @code{awk}, as defined in the POSIX standard. It does so in the context
|
|
of one particular implementation, @code{gawk}. While doing so, it will also
|
|
attempt to describe important differences between @code{gawk} and other
|
|
@code{awk} implementations. Finally, any @code{gawk} features that
|
|
are not in the POSIX standard for @code{awk} will be noted.
|
|
|
|
@iftex
|
|
This @value{DOCUMENT} has the difficult task of being both tutorial and reference.
|
|
If you are a novice, feel free to skip over details that seem too complex.
|
|
You should also ignore the many cross references; they are for the
|
|
expert user, and for the on-line Info version of the document.
|
|
@end iftex
|
|
|
|
The term @dfn{@code{awk} program} refers to a program written by you in
|
|
the @code{awk} programming language.
|
|
|
|
@xref{Getting Started, ,Getting Started with @code{awk}}, for the bare
|
|
essentials you need to know to start using @code{awk}.
|
|
|
|
Some useful ``one-liners'' are included to give you a feel for the
|
|
@code{awk} language (@pxref{One-liners, ,Useful One Line Programs}).
|
|
|
|
Many sample @code{awk} programs have been provided for you
|
|
(@pxref{Library Functions, ,A Library of @code{awk} Functions}; also
|
|
@pxref{Sample Programs, ,Practical @code{awk} Programs}).
|
|
|
|
The entire @code{awk} language is summarized for quick reference in
|
|
@ref{Gawk Summary, ,@code{gawk} Summary}. Look there if you just need
|
|
to refresh your memory about a particular feature.
|
|
|
|
If you find terms that you aren't familiar with, try looking them
|
|
up in the glossary (@pxref{Glossary}).
|
|
|
|
Most of the time complete @code{awk} programs are used as examples, but in
|
|
some of the more advanced sections, only the part of the @code{awk} program
|
|
that illustrates the concept being described is shown.
|
|
|
|
While this @value{DOCUMENT} is aimed principally at people who have not been
|
|
exposed
|
|
to @code{awk}, there is a lot of information here that even the @code{awk}
|
|
expert should find useful. In particular, the description of POSIX
|
|
@code{awk}, and the example programs in
|
|
@ref{Library Functions, ,A Library of @code{awk} Functions}, and
|
|
@ref{Sample Programs, ,Practical @code{awk} Programs},
|
|
should be of interest.
|
|
|
|
@c fakenode --- for prepinfo
|
|
@unnumberedsubsec Dark Corners
|
|
@display
|
|
@i{Who opened that window shade?!?}
|
|
Count Dracula
|
|
@end display
|
|
@sp 1
|
|
|
|
@cindex d.c., see ``dark corner''
|
|
@cindex dark corner
|
|
Until the POSIX standard (and @cite{The Gawk Manual}),
|
|
many features of @code{awk} were either poorly documented, or not
|
|
documented at all. Descriptions of such features
|
|
(often called ``dark corners'') are noted in this @value{DOCUMENT} with
|
|
``(d.c.)''.
|
|
They also appear in the index under the heading ``dark corner.''
|
|
|
|
@node Conventions, Sample Data Files, This Manual, What Is Awk
|
|
@section Typographical Conventions
|
|
|
|
This @value{DOCUMENT} is written using Texinfo, the GNU documentation formatting language.
|
|
A single Texinfo source file is used to produce both the printed and on-line
|
|
versions of the documentation.
|
|
@iftex
|
|
Because of this, the typographical conventions
|
|
are slightly different than in other books you may have read.
|
|
@end iftex
|
|
@ifinfo
|
|
This section briefly documents the typographical conventions used in Texinfo.
|
|
@end ifinfo
|
|
|
|
Examples you would type at the command line are preceded by the common
|
|
shell primary and secondary prompts, @samp{$} and @samp{>}.
|
|
Output from the command is preceded by the glyph ``@print{}''.
|
|
This typically represents the command's standard output.
|
|
Error messages, and other output on the command's standard error, are preceded
|
|
by the glyph ``@error{}''. For example:
|
|
|
|
@example
|
|
@group
|
|
$ echo hi on stdout
|
|
@print{} hi on stdout
|
|
$ echo hello on stderr 1>&2
|
|
@error{} hello on stderr
|
|
@end group
|
|
@end example
|
|
|
|
@iftex
|
|
In the text, command names appear in @code{this font}, while code segments
|
|
appear in the same font and quoted, @samp{like this}. Some things will
|
|
be emphasized @emph{like this}, and if a point needs to be made
|
|
strongly, it will be done @strong{like this}. The first occurrence of
|
|
a new term is usually its @dfn{definition}, and appears in the same
|
|
font as the previous occurrence of ``definition'' in this sentence.
|
|
File names are indicated like this: @file{/path/to/ourfile}.
|
|
@end iftex
|
|
|
|
Characters that you type at the keyboard look @kbd{like this}. In particular,
|
|
there are special characters called ``control characters.'' These are
|
|
characters that you type by holding down both the @kbd{CONTROL} key and
|
|
another key, at the same time. For example, a @kbd{Control-d} is typed
|
|
by first pressing and holding the @kbd{CONTROL} key, next
|
|
pressing the @kbd{d} key, and finally releasing both keys.
|
|
|
|
@node Sample Data Files, , Conventions, What Is Awk
|
|
@section Data Files for the Examples
|
|
|
|
@cindex input file, sample
|
|
@cindex sample input file
|
|
@cindex @file{BBS-list} file
|
|
Many of the examples in this @value{DOCUMENT} take their input from two sample
|
|
data files. The first, called @file{BBS-list}, represents a list of
|
|
computer bulletin board systems together with information about those systems.
|
|
The second data file, called @file{inventory-shipped}, contains
|
|
information about shipments on a monthly basis. In both files,
|
|
each line is considered to be one @dfn{record}.
|
|
|
|
In the file @file{BBS-list}, each record contains the name of a computer
|
|
bulletin board, its phone number, the board's baud rate(s), and a code for
|
|
the number of hours it is operational. An @samp{A} in the last column
|
|
means the board operates 24 hours a day. A @samp{B} in the last
|
|
column means the board operates evening and weekend hours, only. A
|
|
@samp{C} means the board operates only on weekends.
|
|
|
|
@c 2e: Update the baud rates to reflect today's faster modems
|
|
@example
|
|
@c system mkdir eg
|
|
@c system mkdir eg/lib
|
|
@c system mkdir eg/data
|
|
@c system mkdir eg/prog
|
|
@c system mkdir eg/misc
|
|
@c file eg/data/BBS-list
|
|
aardvark 555-5553 1200/300 B
|
|
alpo-net 555-3412 2400/1200/300 A
|
|
barfly 555-7685 1200/300 A
|
|
bites 555-1675 2400/1200/300 A
|
|
camelot 555-0542 300 C
|
|
core 555-2912 1200/300 C
|
|
fooey 555-1234 2400/1200/300 B
|
|
foot 555-6699 1200/300 B
|
|
macfoo 555-6480 1200/300 A
|
|
sdace 555-3430 2400/1200/300 A
|
|
sabafoo 555-2127 1200/300 C
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex @file{inventory-shipped} file
|
|
The second data file, called @file{inventory-shipped}, represents
|
|
information about shipments during the year.
|
|
Each record contains the month of the year, the number
|
|
of green crates shipped, the number of red boxes shipped, the number of
|
|
orange bags shipped, and the number of blue packages shipped,
|
|
respectively. There are 16 entries, covering the 12 months of one year
|
|
and four months of the next year.
|
|
|
|
@example
|
|
@c file eg/data/inventory-shipped
|
|
Jan 13 25 15 115
|
|
Feb 15 32 24 226
|
|
Mar 15 24 34 228
|
|
Apr 31 52 63 420
|
|
May 16 34 29 208
|
|
Jun 31 42 75 492
|
|
Jul 24 34 67 436
|
|
Aug 15 34 47 316
|
|
Sep 13 55 37 277
|
|
Oct 29 54 68 525
|
|
Nov 20 87 82 577
|
|
Dec 17 35 61 401
|
|
|
|
Jan 21 36 64 620
|
|
Feb 26 58 80 652
|
|
Mar 24 75 70 495
|
|
Apr 21 70 74 514
|
|
@c endfile
|
|
@end example
|
|
|
|
@ifinfo
|
|
If you are reading this in GNU Emacs using Info, you can copy the regions
|
|
of text showing these sample files into your own test files. This way you
|
|
can try out the examples shown in the remainder of this document. You do
|
|
this by using the command @kbd{M-x write-region} to copy text from the Info
|
|
file into a file for use with @code{awk}
|
|
(@xref{Misc File Ops, , Miscellaneous File Operations, emacs, GNU Emacs Manual},
|
|
for more information). Using this information, create your own
|
|
@file{BBS-list} and @file{inventory-shipped} files, and practice what you
|
|
learn in this @value{DOCUMENT}.
|
|
|
|
If you are using the stand-alone version of Info,
|
|
see @ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
|
|
for an @code{awk} program that will extract these data files from
|
|
@file{gawk.texi}, the Texinfo source file for this Info file.
|
|
@end ifinfo
|
|
|
|
@node Getting Started, One-liners, What Is Awk, Top
|
|
@chapter Getting Started with @code{awk}
|
|
@cindex script, definition of
|
|
@cindex rule, definition of
|
|
@cindex program, definition of
|
|
@cindex basic function of @code{awk}
|
|
|
|
The basic function of @code{awk} is to search files for lines (or other
|
|
units of text) that contain certain patterns. When a line matches one
|
|
of the patterns, @code{awk} performs specified actions on that line.
|
|
@code{awk} keeps processing input lines in this way until the end of the
|
|
input files are reached.
|
|
|
|
@cindex data-driven languages
|
|
@cindex procedural languages
|
|
@cindex language, data-driven
|
|
@cindex language, procedural
|
|
Programs in @code{awk} are different from programs in most other languages,
|
|
because @code{awk} programs are @dfn{data-driven}; that is, you describe
|
|
the data you wish to work with, and then what to do when you find it.
|
|
Most other languages are @dfn{procedural}; you have to describe, in great
|
|
detail, every step the program is to take. When working with procedural
|
|
languages, it is usually much
|
|
harder to clearly describe the data your program will process.
|
|
For this reason, @code{awk} programs are often refreshingly easy to both
|
|
write and read.
|
|
|
|
@cindex program, definition of
|
|
@cindex rule, definition of
|
|
When you run @code{awk}, you specify an @code{awk} @dfn{program} that
|
|
tells @code{awk} what to do. The program consists of a series of
|
|
@dfn{rules}. (It may also contain @dfn{function definitions},
|
|
an advanced feature which we will ignore for now.
|
|
@xref{User-defined, ,User-defined Functions}.) Each rule specifies one
|
|
pattern to search for, and one action to perform when that pattern is found.
|
|
|
|
Syntactically, a rule consists of a pattern followed by an action. The
|
|
action is enclosed in curly braces to separate it from the pattern.
|
|
Rules are usually separated by newlines. Therefore, an @code{awk}
|
|
program looks like this:
|
|
|
|
@example
|
|
@var{pattern} @{ @var{action} @}
|
|
@var{pattern} @{ @var{action} @}
|
|
@dots{}
|
|
@end example
|
|
|
|
@menu
|
|
* Names:: What name to use to find @code{awk}.
|
|
* Running gawk:: How to run @code{gawk} programs; includes
|
|
command line syntax.
|
|
* Very Simple:: A very simple example.
|
|
* Two Rules:: A less simple one-line example with two rules.
|
|
* More Complex:: A more complex example.
|
|
* Statements/Lines:: Subdividing or combining statements into
|
|
lines.
|
|
* Other Features:: Other Features of @code{awk}.
|
|
* When:: When to use @code{gawk} and when to use other
|
|
things.
|
|
@end menu
|
|
|
|
@node Names, Running gawk , Getting Started, Getting Started
|
|
@section A Rose By Any Other Name
|
|
|
|
@cindex old @code{awk} vs. new @code{awk}
|
|
@cindex new @code{awk} vs. old @code{awk}
|
|
The @code{awk} language has evolved over the years. Full details are
|
|
provided in @ref{Language History, ,The Evolution of the @code{awk} Language}.
|
|
The language described in this @value{DOCUMENT}
|
|
is often referred to as ``new @code{awk}.''
|
|
|
|
Because of this, many systems have multiple
|
|
versions of @code{awk}.
|
|
Some systems have an @code{awk} utility that implements the
|
|
original version of the @code{awk} language, and a @code{nawk} utility
|
|
for the new version. Others have an @code{oawk} for the ``old @code{awk}''
|
|
language, and plain @code{awk} for the new one. Still others only
|
|
have one version, usually the new one.@footnote{Often, these systems
|
|
use @code{gawk} for their @code{awk} implementation!}
|
|
|
|
All in all, this makes it difficult for you to know which version of
|
|
@code{awk} you should run when writing your programs. The best advice
|
|
we can give here is to check your local documentation. Look for @code{awk},
|
|
@code{oawk}, and @code{nawk}, as well as for @code{gawk}. Chances are, you
|
|
will have some version of new @code{awk} on your system, and that is what
|
|
you should use when running your programs. (Of course, if you're reading
|
|
this @value{DOCUMENT}, chances are good that you have @code{gawk}!)
|
|
|
|
Throughout this @value{DOCUMENT}, whenever we refer to a language feature
|
|
that should be available in any complete implementation of POSIX @code{awk},
|
|
we simply use the term @code{awk}. When referring to a feature that is
|
|
specific to the GNU implementation, we use the term @code{gawk}.
|
|
|
|
@node Running gawk, Very Simple, Names, Getting Started
|
|
@section How to Run @code{awk} Programs
|
|
|
|
@cindex command line formats
|
|
@cindex running @code{awk} programs
|
|
There are several ways to run an @code{awk} program. If the program is
|
|
short, it is easiest to include it in the command that runs @code{awk},
|
|
like this:
|
|
|
|
@example
|
|
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
where @var{program} consists of a series of patterns and actions, as
|
|
described earlier.
|
|
(The reason for the single quotes is described below, in
|
|
@ref{One-shot, ,One-shot Throw-away @code{awk} Programs}.)
|
|
|
|
When the program is long, it is usually more convenient to put it in a file
|
|
and run it with a command like this:
|
|
|
|
@example
|
|
awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{}
|
|
@end example
|
|
|
|
@menu
|
|
* One-shot:: Running a short throw-away @code{awk} program.
|
|
* Read Terminal:: Using no input files (input from terminal
|
|
instead).
|
|
* Long:: Putting permanent @code{awk} programs in
|
|
files.
|
|
* Executable Scripts:: Making self-contained @code{awk} programs.
|
|
* Comments:: Adding documentation to @code{gawk} programs.
|
|
@end menu
|
|
|
|
@node One-shot, Read Terminal, Running gawk, Running gawk
|
|
@subsection One-shot Throw-away @code{awk} Programs
|
|
|
|
Once you are familiar with @code{awk}, you will often type in simple
|
|
programs the moment you want to use them. Then you can write the
|
|
program as the first argument of the @code{awk} command, like this:
|
|
|
|
@example
|
|
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
where @var{program} consists of a series of @var{patterns} and
|
|
@var{actions}, as described earlier.
|
|
|
|
@cindex single quotes, why needed
|
|
This command format instructs the @dfn{shell}, or command interpreter,
|
|
to start @code{awk} and use the @var{program} to process records in the
|
|
input file(s). There are single quotes around @var{program} so that
|
|
the shell doesn't interpret any @code{awk} characters as special shell
|
|
characters. They also cause the shell to treat all of @var{program} as
|
|
a single argument for @code{awk} and allow @var{program} to be more
|
|
than one line long.
|
|
|
|
This format is also useful for running short or medium-sized @code{awk}
|
|
programs from shell scripts, because it avoids the need for a separate
|
|
file for the @code{awk} program. A self-contained shell script is more
|
|
reliable since there are no other files to misplace.
|
|
|
|
@ref{One-liners, , Useful One Line Programs}, presents several short,
|
|
self-contained programs.
|
|
|
|
As an interesting side point, the command
|
|
|
|
@example
|
|
awk '/foo/' @var{files} @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
is essentially the same as
|
|
|
|
@cindex @code{egrep}
|
|
@example
|
|
egrep foo @var{files} @dots{}
|
|
@end example
|
|
|
|
@node Read Terminal, Long, One-shot, Running gawk
|
|
@subsection Running @code{awk} without Input Files
|
|
|
|
@cindex standard input
|
|
@cindex input, standard
|
|
You can also run @code{awk} without any input files. If you type the
|
|
command line:
|
|
|
|
@example
|
|
awk '@var{program}'
|
|
@end example
|
|
|
|
@noindent
|
|
then @code{awk} applies the @var{program} to the @dfn{standard input},
|
|
which usually means whatever you type on the terminal. This continues
|
|
until you indicate end-of-file by typing @kbd{Control-d}.
|
|
(On other operating systems, the end-of-file character may be different.
|
|
For example, on OS/2 and MS-DOS, it is @kbd{Control-z}.)
|
|
|
|
For example, the following program prints a friendly piece of advice
|
|
(from Douglas Adams' @cite{The Hitchhiker's Guide to the Galaxy}),
|
|
to keep you from worrying about the complexities of computer programming
|
|
(@samp{BEGIN} is a feature we haven't discussed yet).
|
|
|
|
@example
|
|
$ awk "BEGIN @{ print \"Don't Panic!\" @}"
|
|
@print{} Don't Panic!
|
|
@end example
|
|
|
|
@cindex quoting, shell
|
|
@cindex shell quoting
|
|
This program does not read any input. The @samp{\} before each of the
|
|
inner double quotes is necessary because of the shell's quoting rules,
|
|
in particular because it mixes both single quotes and double quotes.
|
|
|
|
This next simple @code{awk} program
|
|
emulates the @code{cat} utility; it copies whatever you type at the
|
|
keyboard to its standard output. (Why this works is explained shortly.)
|
|
|
|
@example
|
|
$ awk '@{ print @}'
|
|
Now is the time for all good men
|
|
@print{} Now is the time for all good men
|
|
to come to the aid of their country.
|
|
@print{} to come to the aid of their country.
|
|
Four score and seven years ago, ...
|
|
@print{} Four score and seven years ago, ...
|
|
What, me worry?
|
|
@print{} What, me worry?
|
|
@kbd{Control-d}
|
|
@end example
|
|
|
|
@node Long, Executable Scripts, Read Terminal, Running gawk
|
|
@subsection Running Long Programs
|
|
|
|
@cindex running long programs
|
|
@cindex @code{-f} option
|
|
@cindex program file
|
|
@cindex file, @code{awk} program
|
|
Sometimes your @code{awk} programs can be very long. In this case it is
|
|
more convenient to put the program into a separate file. To tell
|
|
@code{awk} to use that file for its program, you type:
|
|
|
|
@example
|
|
awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{}
|
|
@end example
|
|
|
|
The @samp{-f} instructs the @code{awk} utility to get the @code{awk} program
|
|
from the file @var{source-file}. Any file name can be used for
|
|
@var{source-file}. For example, you could put the program:
|
|
|
|
@example
|
|
BEGIN @{ print "Don't Panic!" @}
|
|
@end example
|
|
|
|
@noindent
|
|
into the file @file{advice}. Then this command:
|
|
|
|
@example
|
|
awk -f advice
|
|
@end example
|
|
|
|
@noindent
|
|
does the same thing as this one:
|
|
|
|
@example
|
|
awk "BEGIN @{ print \"Don't Panic!\" @}"
|
|
@end example
|
|
|
|
@cindex quoting, shell
|
|
@cindex shell quoting
|
|
@noindent
|
|
which was explained earlier (@pxref{Read Terminal, ,Running @code{awk} without Input Files}).
|
|
Note that you don't usually need single quotes around the file name that you
|
|
specify with @samp{-f}, because most file names don't contain any of the shell's
|
|
special characters. Notice that in @file{advice}, the @code{awk}
|
|
program did not have single quotes around it. The quotes are only needed
|
|
for programs that are provided on the @code{awk} command line.
|
|
|
|
If you want to identify your @code{awk} program files clearly as such,
|
|
you can add the extension @file{.awk} to the file name. This doesn't
|
|
affect the execution of the @code{awk} program, but it does make
|
|
``housekeeping'' easier.
|
|
|
|
@node Executable Scripts, Comments, Long, Running gawk
|
|
@subsection Executable @code{awk} Programs
|
|
@cindex executable scripts
|
|
@cindex scripts, executable
|
|
@cindex self contained programs
|
|
@cindex program, self contained
|
|
@cindex @code{#!} (executable scripts)
|
|
|
|
Once you have learned @code{awk}, you may want to write self-contained
|
|
@code{awk} scripts, using the @samp{#!} script mechanism. You can do
|
|
this on many Unix systems@footnote{The @samp{#!} mechanism works on
|
|
Linux systems,
|
|
Unix systems derived from Berkeley Unix, System V Release 4, and some System
|
|
V Release 3 systems.} (and someday on the GNU system).
|
|
|
|
For example, you could update the file @file{advice} to look like this:
|
|
|
|
@example
|
|
#! /bin/awk -f
|
|
|
|
BEGIN @{ print "Don't Panic!" @}
|
|
@end example
|
|
|
|
@noindent
|
|
After making this file executable (with the @code{chmod} utility), you
|
|
can simply type @samp{advice}
|
|
at the shell, and the system will arrange to run @code{awk}@footnote{The
|
|
line beginning with @samp{#!} lists the full file name of an interpreter
|
|
to be run, and an optional initial command line argument to pass to that
|
|
interpreter. The operating system then runs the interpreter with the given
|
|
argument and the full argument list of the executed program. The first argument
|
|
in the list is the full file name of the @code{awk} program. The rest of the
|
|
argument list will either be options to @code{awk}, or data files,
|
|
or both.} as if you had typed @samp{awk -f advice}.
|
|
|
|
@example
|
|
@group
|
|
$ advice
|
|
@print{} Don't Panic!
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
Self-contained @code{awk} scripts are useful when you want to write a
|
|
program which users can invoke without their having to know that the program is
|
|
written in @code{awk}.
|
|
|
|
@strong{Caution:} You should not put more than one argument on the @samp{#!}
|
|
line after the path to @code{awk}. This will not work. The operating system
|
|
treats the rest of the line as a single agument, and passes it to @code{awk}.
|
|
Doing this will lead to confusing behavior: most likely a usage diagnostic
|
|
of some sort from @code{awk}.
|
|
|
|
@cindex shell scripts
|
|
@cindex scripts, shell
|
|
Some older systems do not support the @samp{#!} mechanism. You can get a
|
|
similar effect using a regular shell script. It would look something
|
|
like this:
|
|
|
|
@example
|
|
: The colon ensures execution by the standard shell.
|
|
awk '@var{program}' "$@@"
|
|
@end example
|
|
|
|
Using this technique, it is @emph{vital} to enclose the @var{program} in
|
|
single quotes to protect it from interpretation by the shell. If you
|
|
omit the quotes, only a shell wizard can predict the results.
|
|
|
|
The @code{"$@@"} causes the shell to forward all the command line
|
|
arguments to the @code{awk} program, without interpretation. The first
|
|
line, which starts with a colon, is used so that this shell script will
|
|
work even if invoked by a user who uses the C shell. (Not all older systems
|
|
obey this convention, but many do.)
|
|
@c 2e:
|
|
@c Someday: (See @cite{The Bourne Again Shell}, by ??.)
|
|
|
|
@node Comments, , Executable Scripts, Running gawk
|
|
@subsection Comments in @code{awk} Programs
|
|
@cindex @code{#} (comment)
|
|
@cindex comments
|
|
@cindex use of comments
|
|
@cindex documenting @code{awk} programs
|
|
@cindex programs, documenting
|
|
|
|
A @dfn{comment} is some text that is included in a program for the sake
|
|
of human readers; it is not really part of the program. Comments
|
|
can explain what the program does, and how it works. Nearly all
|
|
programming languages have provisions for comments, because programs are
|
|
typically hard to understand without their extra help.
|
|
|
|
In the @code{awk} language, a comment starts with the sharp sign
|
|
character, @samp{#}, and continues to the end of the line.
|
|
The @samp{#} does not have to be the first character on the line. The
|
|
@code{awk} language ignores the rest of a line following a sharp sign.
|
|
For example, we could have put the following into @file{advice}:
|
|
|
|
@example
|
|
# This program prints a nice friendly message. It helps
|
|
# keep novice users from being afraid of the computer.
|
|
BEGIN @{ print "Don't Panic!" @}
|
|
@end example
|
|
|
|
You can put comment lines into keyboard-composed throw-away @code{awk}
|
|
programs also, but this usually isn't very useful; the purpose of a
|
|
comment is to help you or another person understand the program at
|
|
a later time.
|
|
|
|
@strong{Caution:} As mentioned in
|
|
@ref{One-shot, ,One-shot Throw-away @code{awk} Programs},
|
|
you can enclose small to medium programs in single quotes, in order to keep
|
|
your shell scripts self-contained. When doing so, @emph{don't} put
|
|
an apostrophe (i.e., a single quote) into a comment (or anywhere else
|
|
in your program). The shell will interpret the quote as the closing
|
|
quote for the entire program. As a result, usually the shell will
|
|
print a message about mismatched quotes, and if @code{awk} actually
|
|
runs, it will probably print strange messages about syntax errors.
|
|
For example:
|
|
|
|
@example
|
|
awk 'BEGIN @{ print "hello" @} # let's be cute'
|
|
@end example
|
|
|
|
@node Very Simple, Two Rules, Running gawk, Getting Started
|
|
@section A Very Simple Example
|
|
|
|
The following command runs a simple @code{awk} program that searches the
|
|
input file @file{BBS-list} for the string of characters: @samp{foo}. (A
|
|
string of characters is usually called a @dfn{string}.
|
|
The term @dfn{string} is perhaps based on similar usage in English, such
|
|
as ``a string of pearls,'' or, ``a string of cars in a train.'')
|
|
|
|
@example
|
|
awk '/foo/ @{ print $0 @}' BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
When lines containing @samp{foo} are found, they are printed, because
|
|
@w{@samp{print $0}} means print the current line. (Just @samp{print} by
|
|
itself means the same thing, so we could have written that
|
|
instead.)
|
|
|
|
You will notice that slashes, @samp{/}, surround the string @samp{foo}
|
|
in the @code{awk} program. The slashes indicate that @samp{foo}
|
|
is a pattern to search for. This type of pattern is called a
|
|
@dfn{regular expression}, and is covered in more detail later
|
|
(@pxref{Regexp, ,Regular Expressions}).
|
|
The pattern is allowed to match parts of words.
|
|
There are
|
|
single-quotes around the @code{awk} program so that the shell won't
|
|
interpret any of it as special shell characters.
|
|
|
|
Here is what this program prints:
|
|
|
|
@example
|
|
@group
|
|
$ awk '/foo/ @{ print $0 @}' BBS-list
|
|
@print{} fooey 555-1234 2400/1200/300 B
|
|
@print{} foot 555-6699 1200/300 B
|
|
@print{} macfoo 555-6480 1200/300 A
|
|
@print{} sabafoo 555-2127 1200/300 C
|
|
@end group
|
|
@end example
|
|
|
|
@cindex action, default
|
|
@cindex pattern, default
|
|
@cindex default action
|
|
@cindex default pattern
|
|
In an @code{awk} rule, either the pattern or the action can be omitted,
|
|
but not both. If the pattern is omitted, then the action is performed
|
|
for @emph{every} input line. If the action is omitted, the default
|
|
action is to print all lines that match the pattern.
|
|
|
|
@cindex empty action
|
|
@cindex action, empty
|
|
Thus, we could leave out the action (the @code{print} statement and the curly
|
|
braces) in the above example, and the result would be the same: all
|
|
lines matching the pattern @samp{foo} would be printed. By comparison,
|
|
omitting the @code{print} statement but retaining the curly braces makes an
|
|
empty action that does nothing; then no lines would be printed.
|
|
|
|
@node Two Rules, More Complex, Very Simple, Getting Started
|
|
@section An Example with Two Rules
|
|
@cindex how @code{awk} works
|
|
|
|
The @code{awk} utility reads the input files one line at a
|
|
time. For each line, @code{awk} tries the patterns of each of the rules.
|
|
If several patterns match then several actions are run, in the order in
|
|
which they appear in the @code{awk} program. If no patterns match, then
|
|
no actions are run.
|
|
|
|
After processing all the rules (perhaps none) that match the line,
|
|
@code{awk} reads the next line (however,
|
|
@pxref{Next Statement, ,The @code{next} Statement},
|
|
and also @pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
|
|
This continues until the end of the file is reached.
|
|
|
|
For example, the @code{awk} program:
|
|
|
|
@example
|
|
/12/ @{ print $0 @}
|
|
/21/ @{ print $0 @}
|
|
@end example
|
|
|
|
@noindent
|
|
contains two rules. The first rule has the string @samp{12} as the
|
|
pattern and @samp{print $0} as the action. The second rule has the
|
|
string @samp{21} as the pattern and also has @samp{print $0} as the
|
|
action. Each rule's action is enclosed in its own pair of braces.
|
|
|
|
This @code{awk} program prints every line that contains the string
|
|
@samp{12} @emph{or} the string @samp{21}. If a line contains both
|
|
strings, it is printed twice, once by each rule.
|
|
|
|
This is what happens if we run this program on our two sample data files,
|
|
@file{BBS-list} and @file{inventory-shipped}, as shown here:
|
|
|
|
@example
|
|
$ awk '/12/ @{ print $0 @}
|
|
> /21/ @{ print $0 @}' BBS-list inventory-shipped
|
|
@print{} aardvark 555-5553 1200/300 B
|
|
@print{} alpo-net 555-3412 2400/1200/300 A
|
|
@print{} barfly 555-7685 1200/300 A
|
|
@print{} bites 555-1675 2400/1200/300 A
|
|
@print{} core 555-2912 1200/300 C
|
|
@print{} fooey 555-1234 2400/1200/300 B
|
|
@print{} foot 555-6699 1200/300 B
|
|
@print{} macfoo 555-6480 1200/300 A
|
|
@print{} sdace 555-3430 2400/1200/300 A
|
|
@print{} sabafoo 555-2127 1200/300 C
|
|
@print{} sabafoo 555-2127 1200/300 C
|
|
@print{} Jan 21 36 64 620
|
|
@print{} Apr 21 70 74 514
|
|
@end example
|
|
|
|
@noindent
|
|
Note how the line in @file{BBS-list} beginning with @samp{sabafoo}
|
|
was printed twice, once for each rule.
|
|
|
|
@node More Complex, Statements/Lines, Two Rules, Getting Started
|
|
@section A More Complex Example
|
|
|
|
@ignore
|
|
We have to use ls -lg here to get portable output across Unix systems.
|
|
The POSIX ls matches this behavior too. Sigh.
|
|
@end ignore
|
|
Here is an example to give you an idea of what typical @code{awk}
|
|
programs do. This example shows how @code{awk} can be used to
|
|
summarize, select, and rearrange the output of another utility. It uses
|
|
features that haven't been covered yet, so don't worry if you don't
|
|
understand all the details.
|
|
|
|
@example
|
|
ls -lg | awk '$6 == "Nov" @{ sum += $5 @}
|
|
END @{ print sum @}'
|
|
@end example
|
|
|
|
@cindex @code{csh}, backslash continuation
|
|
@cindex backslash continuation in @code{csh}
|
|
This command prints the total number of bytes in all the files in the
|
|
current directory that were last modified in November (of any year).
|
|
(In the C shell you would need to type a semicolon and then a backslash
|
|
at the end of the first line; in a POSIX-compliant shell, such as the
|
|
Bourne shell or Bash, the GNU Bourne-Again shell, you can type the example
|
|
as shown.)
|
|
@ignore
|
|
FIXME: how can users tell what shell they are running? Need a footnote
|
|
or something, but getting into this is a distraction.
|
|
@end ignore
|
|
|
|
The @w{@samp{ls -lg}} part of this example is a system command that gives
|
|
you a listing of the files in a directory, including file size and the date
|
|
the file was last modified. Its output looks like this:
|
|
|
|
@example
|
|
-rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile
|
|
-rw-r--r-- 1 arnold user 10809 Nov 7 13:03 gawk.h
|
|
-rw-r--r-- 1 arnold user 983 Apr 13 12:14 gawk.tab.h
|
|
-rw-r--r-- 1 arnold user 31869 Jun 15 12:20 gawk.y
|
|
-rw-r--r-- 1 arnold user 22414 Nov 7 13:03 gawk1.c
|
|
-rw-r--r-- 1 arnold user 37455 Nov 7 13:03 gawk2.c
|
|
-rw-r--r-- 1 arnold user 27511 Dec 9 13:07 gawk3.c
|
|
-rw-r--r-- 1 arnold user 7989 Nov 7 13:03 gawk4.c
|
|
@end example
|
|
|
|
@noindent
|
|
The first field contains read-write permissions, the second field contains
|
|
the number of links to the file, and the third field identifies the owner of
|
|
the file. The fourth field identifies the group of the file.
|
|
The fifth field contains the size of the file in bytes. The
|
|
sixth, seventh and eighth fields contain the month, day, and time,
|
|
respectively, that the file was last modified. Finally, the ninth field
|
|
contains the name of the file.
|
|
|
|
@cindex automatic initialization
|
|
@cindex initialization, automatic
|
|
The @samp{$6 == "Nov"} in our @code{awk} program is an expression that
|
|
tests whether the sixth field of the output from @w{@samp{ls -lg}}
|
|
matches the string @samp{Nov}. Each time a line has the string
|
|
@samp{Nov} for its sixth field, the action @samp{sum += $5} is
|
|
performed. This adds the fifth field (the file size) to the variable
|
|
@code{sum}. As a result, when @code{awk} has finished reading all the
|
|
input lines, @code{sum} is the sum of the sizes of files whose
|
|
lines matched the pattern. (This works because @code{awk} variables
|
|
are automatically initialized to zero.)
|
|
|
|
After the last line of output from @code{ls} has been processed, the
|
|
@code{END} rule is executed, and the value of @code{sum} is
|
|
printed. In this example, the value of @code{sum} would be 80600.
|
|
|
|
These more advanced @code{awk} techniques are covered in later sections
|
|
(@pxref{Action Overview, ,Overview of Actions}). Before you can move on to more
|
|
advanced @code{awk} programming, you have to know how @code{awk} interprets
|
|
your input and displays your output. By manipulating fields and using
|
|
@code{print} statements, you can produce some very useful and impressive
|
|
looking reports.
|
|
|
|
@node Statements/Lines, Other Features, More Complex, Getting Started
|
|
@section @code{awk} Statements Versus Lines
|
|
@cindex line break
|
|
@cindex newline
|
|
|
|
Most often, each line in an @code{awk} program is a separate statement or
|
|
separate rule, like this:
|
|
|
|
@example
|
|
awk '/12/ @{ print $0 @}
|
|
/21/ @{ print $0 @}' BBS-list inventory-shipped
|
|
@end example
|
|
|
|
However, @code{gawk} will ignore newlines after any of the following:
|
|
|
|
@example
|
|
, @{ ? : || && do else
|
|
@end example
|
|
|
|
@noindent
|
|
A newline at any other point is considered the end of the statement.
|
|
(Splitting lines after @samp{?} and @samp{:} is a minor @code{gawk}
|
|
extension. The @samp{?} and @samp{:} referred to here is the
|
|
three operand conditional expression described in
|
|
@ref{Conditional Exp, ,Conditional Expressions}.)
|
|
|
|
@cindex backslash continuation
|
|
@cindex continuation of lines
|
|
@cindex line continuation
|
|
If you would like to split a single statement into two lines at a point
|
|
where a newline would terminate it, you can @dfn{continue} it by ending the
|
|
first line with a backslash character, @samp{\}. The backslash must be
|
|
the final character on the line to be recognized as a continuation
|
|
character. This is allowed absolutely anywhere in the statement, even
|
|
in the middle of a string or regular expression. For example:
|
|
|
|
@example
|
|
awk '/This regular expression is too long, so continue it\
|
|
on the next line/ @{ print $1 @}'
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex portability issues
|
|
We have generally not used backslash continuation in the sample programs
|
|
in this @value{DOCUMENT}. Since in @code{gawk} there is no limit on the
|
|
length of a line, it is never strictly necessary; it just makes programs
|
|
more readable. For this same reason, as well as for clarity, we have
|
|
kept most statements short in the sample programs presented throughout
|
|
the @value{DOCUMENT}. Backslash continuation is most useful when your
|
|
@code{awk} program is in a separate source file, instead of typed in on
|
|
the command line. You should also note that many @code{awk}
|
|
implementations are more particular about where you may use backslash
|
|
continuation. For example, they may not allow you to split a string
|
|
constant using backslash continuation. Thus, for maximal portability of
|
|
your @code{awk} programs, it is best not to split your lines in the
|
|
middle of a regular expression or a string.
|
|
|
|
@cindex @code{csh}, backslash continuation
|
|
@cindex backslash continuation in @code{csh}
|
|
@strong{Caution: backslash continuation does not work as described above
|
|
with the C shell.} Continuation with backslash works for @code{awk}
|
|
programs in files, and also for one-shot programs @emph{provided} you
|
|
are using a POSIX-compliant shell, such as the Bourne shell or Bash, the
|
|
GNU Bourne-Again shell. But the C shell (@code{csh}) behaves
|
|
differently! There, you must use two backslashes in a row, followed by
|
|
a newline. Note also that when using the C shell, @emph{every} newline
|
|
in your awk program must be escaped with a backslash. To illustrate:
|
|
|
|
@example
|
|
% awk 'BEGIN @{ \
|
|
? print \\
|
|
? "hello, world" \
|
|
? @}'
|
|
@print{} hello, world
|
|
@end example
|
|
|
|
@noindent
|
|
Here, the @samp{%} and @samp{?} are the C shell's primary and secondary
|
|
prompts, analogous to the standard shell's @samp{$} and @samp{>}.
|
|
|
|
@code{awk} is a line-oriented language. Each rule's action has to
|
|
begin on the same line as the pattern. To have the pattern and action
|
|
on separate lines, you @emph{must} use backslash continuation---there
|
|
is no other way.
|
|
|
|
@cindex backslash continuation and comments
|
|
@cindex comments and backslash continuation
|
|
Note that backslash continuation and comments do not mix. As soon
|
|
as @code{awk} sees the @samp{#} that starts a comment, it ignores
|
|
@emph{everything} on the rest of the line. For example:
|
|
|
|
@example
|
|
@group
|
|
$ gawk 'BEGIN @{ print "dont panic" # a friendly \
|
|
> BEGIN rule
|
|
> @}'
|
|
@error{} gawk: cmd. line:2: BEGIN rule
|
|
@error{} gawk: cmd. line:2: ^ parse error
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
Here, it looks like the backslash would continue the comment onto the
|
|
next line. However, the backslash-newline combination is never even
|
|
noticed, since it is ``hidden'' inside the comment. Thus, the
|
|
@samp{BEGIN} is noted as a syntax error.
|
|
|
|
@cindex multiple statements on one line
|
|
When @code{awk} statements within one rule are short, you might want to put
|
|
more than one of them on a line. You do this by separating the statements
|
|
with a semicolon, @samp{;}.
|
|
|
|
This also applies to the rules themselves.
|
|
Thus, the previous program could have been written:
|
|
|
|
@example
|
|
/12/ @{ print $0 @} ; /21/ @{ print $0 @}
|
|
@end example
|
|
|
|
@noindent
|
|
@strong{Note:} the requirement that rules on the same line must be
|
|
separated with a semicolon was not in the original @code{awk}
|
|
language; it was added for consistency with the treatment of statements
|
|
within an action.
|
|
|
|
@node Other Features, When, Statements/Lines, Getting Started
|
|
@section Other Features of @code{awk}
|
|
|
|
The @code{awk} language provides a number of predefined, or built-in variables, which
|
|
your programs can use to get information from @code{awk}. There are other
|
|
variables your program can set to control how @code{awk} processes your
|
|
data.
|
|
|
|
In addition, @code{awk} provides a number of built-in functions for doing
|
|
common computational and string related operations.
|
|
|
|
As we develop our presentation of the @code{awk} language, we introduce
|
|
most of the variables and many of the functions. They are defined
|
|
systematically in @ref{Built-in Variables}, and
|
|
@ref{Built-in, ,Built-in Functions}.
|
|
|
|
@node When, , Other Features, Getting Started
|
|
@section When to Use @code{awk}
|
|
|
|
@cindex when to use @code{awk}
|
|
@cindex applications of @code{awk}
|
|
You might wonder how @code{awk} might be useful for you. Using
|
|
utility programs, advanced patterns, field separators, arithmetic
|
|
statements, and other selection criteria, you can produce much more
|
|
complex output. The @code{awk} language is very useful for producing
|
|
reports from large amounts of raw data, such as summarizing information
|
|
from the output of other utility programs like @code{ls}.
|
|
(@xref{More Complex, ,A More Complex Example}.)
|
|
|
|
Programs written with @code{awk} are usually much smaller than they would
|
|
be in other languages. This makes @code{awk} programs easy to compose and
|
|
use. Often, @code{awk} programs can be quickly composed at your terminal,
|
|
used once, and thrown away. Since @code{awk} programs are interpreted, you
|
|
can avoid the (usually lengthy) compilation part of the typical
|
|
edit-compile-test-debug cycle of software development.
|
|
|
|
Complex programs have been written in @code{awk}, including a complete
|
|
retargetable assembler for eight-bit microprocessors (@pxref{Glossary}, for
|
|
more information) and a microcode assembler for a special purpose Prolog
|
|
computer. However, @code{awk}'s capabilities are strained by tasks of
|
|
such complexity.
|
|
|
|
If you find yourself writing @code{awk} scripts of more than, say, a few
|
|
hundred lines, you might consider using a different programming
|
|
language. Emacs Lisp is a good choice if you need sophisticated string
|
|
or pattern matching capabilities. The shell is also good at string and
|
|
pattern matching; in addition, it allows powerful use of the system
|
|
utilities. More conventional languages, such as C, C++, and Lisp, offer
|
|
better facilities for system programming and for managing the complexity
|
|
of large programs. Programs in these languages may require more lines
|
|
of source code than the equivalent @code{awk} programs, but they are
|
|
easier to maintain and usually run more efficiently.
|
|
|
|
@node One-liners, Regexp, Getting Started, Top
|
|
@chapter Useful One Line Programs
|
|
|
|
@cindex one-liners
|
|
Many useful @code{awk} programs are short, just a line or two. Here is a
|
|
collection of useful, short programs to get you started. Some of these
|
|
programs contain constructs that haven't been covered yet. The description
|
|
of the program will give you a good idea of what is going on, but please
|
|
read the rest of the @value{DOCUMENT} to become an @code{awk} expert!
|
|
|
|
Most of the examples use a data file named @file{data}. This is just a
|
|
placeholder; if you were to use these programs yourself, you would substitute
|
|
your own file names for @file{data}.
|
|
|
|
@ifinfo
|
|
Since you are reading this in Info, each line of the example code is
|
|
enclosed in quotes, to represent text that you would type literally.
|
|
The examples themselves represent shell commands that use single quotes
|
|
to keep the shell from interpreting the contents of the program.
|
|
When reading the examples, focus on the text between the open and close
|
|
quotes.
|
|
@end ifinfo
|
|
|
|
@table @code
|
|
@item awk '@{ if (length($0) > max) max = length($0) @}
|
|
@itemx @ @ @ @ @ END @{ print max @}' data
|
|
This program prints the length of the longest input line.
|
|
|
|
@item awk 'length($0) > 80' data
|
|
This program prints every line that is longer than 80 characters. The sole
|
|
rule has a relational expression as its pattern, and has no action (so the
|
|
default action, printing the record, is used).
|
|
|
|
@item expand@ data@ |@ awk@ '@{ if (x < length()) x = length() @}
|
|
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "maximum line length is " x @}'
|
|
This program prints the length of the longest line in @file{data}. The input
|
|
is processed by the @code{expand} program to change tabs into spaces,
|
|
so the widths compared are actually the right-margin columns.
|
|
|
|
@item awk 'NF > 0' data
|
|
This program prints every line that has at least one field. This is an
|
|
easy way to delete blank lines from a file (or rather, to create a new
|
|
file similar to the old file but from which the blank lines have been
|
|
deleted).
|
|
|
|
@c Karl Berry points out that new users probably don't want to see
|
|
@c multiple ways to do things, just the `best' way. He's probably
|
|
@c right. At some point it might be worth adding something about there
|
|
@c often being multiple ways to do things in awk, but for now we'll
|
|
@c just take this one out.
|
|
@ignore
|
|
@item awk '@{ if (NF > 0) print @}' data
|
|
This program also prints every line that has at least one field. Here we
|
|
allow the rule to match every line, and then decide in the action whether
|
|
to print.
|
|
@end ignore
|
|
|
|
@item awk@ 'BEGIN@ @{@ for (i = 1; i <= 7; i++)
|
|
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ print int(101 * rand()) @}'
|
|
This program prints seven random numbers from zero to 100, inclusive.
|
|
|
|
@item ls -lg @var{files} | awk '@{ x += $5 @} ; END @{ print "total bytes: " x @}'
|
|
This program prints the total number of bytes used by @var{files}.
|
|
|
|
@item ls -lg @var{files} | awk '@{ x += $5 @}
|
|
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "total K-bytes: " (x + 1023)/1024 @}'
|
|
This program prints the total number of kilobytes used by @var{files}.
|
|
|
|
@item awk -F: '@{ print $1 @}' /etc/passwd | sort
|
|
This program prints a sorted list of the login names of all users.
|
|
|
|
@item awk 'END @{ print NR @}' data
|
|
This program counts lines in a file.
|
|
|
|
@item awk 'NR % 2 == 0' data
|
|
This program prints the even numbered lines in the data file.
|
|
If you were to use the expression @samp{NR % 2 == 1} instead,
|
|
it would print the odd numbered lines.
|
|
@end table
|
|
|
|
@node Regexp, Reading Files, One-liners, Top
|
|
@chapter Regular Expressions
|
|
@cindex pattern, regular expressions
|
|
@cindex regexp
|
|
@cindex regular expression
|
|
@cindex regular expressions as patterns
|
|
|
|
A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a
|
|
set of strings.
|
|
Because regular expressions are such a fundamental part of @code{awk}
|
|
programming, their format and use deserve a separate chapter.
|
|
|
|
A regular expression enclosed in slashes (@samp{/})
|
|
is an @code{awk} pattern that matches every input record whose text
|
|
belongs to that set.
|
|
|
|
The simplest regular expression is a sequence of letters, numbers, or
|
|
both. Such a regexp matches any string that contains that sequence.
|
|
Thus, the regexp @samp{foo} matches any string containing @samp{foo}.
|
|
Therefore, the pattern @code{/foo/} matches any input record containing
|
|
the three characters @samp{foo}, @emph{anywhere} in the record. Other
|
|
kinds of regexps let you specify more complicated classes of strings.
|
|
|
|
@iftex
|
|
Initially, the examples will be simple. As we explain more about how
|
|
regular expressions work, we will present more complicated examples.
|
|
@end iftex
|
|
|
|
@menu
|
|
* Regexp Usage:: How to Use Regular Expressions.
|
|
* Escape Sequences:: How to write non-printing characters.
|
|
* Regexp Operators:: Regular Expression Operators.
|
|
* GNU Regexp Operators:: Operators specific to GNU software.
|
|
* Case-sensitivity:: How to do case-insensitive matching.
|
|
* Leftmost Longest:: How much text matches.
|
|
* Computed Regexps:: Using Dynamic Regexps.
|
|
@end menu
|
|
|
|
@node Regexp Usage, Escape Sequences, Regexp, Regexp
|
|
@section How to Use Regular Expressions
|
|
|
|
A regular expression can be used as a pattern by enclosing it in
|
|
slashes. Then the regular expression is tested against the
|
|
entire text of each record. (Normally, it only needs
|
|
to match some part of the text in order to succeed.) For example, this
|
|
prints the second field of each record that contains the three
|
|
characters @samp{foo} anywhere in it:
|
|
|
|
@example
|
|
@group
|
|
$ awk '/foo/ @{ print $2 @}' BBS-list
|
|
@print{} 555-1234
|
|
@print{} 555-6699
|
|
@print{} 555-6480
|
|
@print{} 555-2127
|
|
@end group
|
|
@end example
|
|
|
|
@cindex regexp matching operators
|
|
@cindex string-matching operators
|
|
@cindex operators, string-matching
|
|
@cindex operators, regexp matching
|
|
@cindex regexp match/non-match operators
|
|
@cindex @code{~} operator
|
|
@cindex @code{!~} operator
|
|
Regular expressions can also be used in matching expressions. These
|
|
expressions allow you to specify the string to match against; it need
|
|
not be the entire current input record. The two operators, @samp{~}
|
|
and @samp{!~}, perform regular expression comparisons. Expressions
|
|
using these operators can be used as patterns or in @code{if},
|
|
@code{while}, @code{for}, and @code{do} statements.
|
|
@ifinfo
|
|
@c adding this xref in TeX screws up the formatting too much
|
|
(@xref{Statements, ,Control Statements in Actions}.)
|
|
@end ifinfo
|
|
|
|
@table @code
|
|
@item @var{exp} ~ /@var{regexp}/
|
|
This is true if the expression @var{exp} (taken as a string)
|
|
is matched by @var{regexp}. The following example matches, or selects,
|
|
all input records with the upper-case letter @samp{J} somewhere in the
|
|
first field:
|
|
|
|
@example
|
|
@group
|
|
$ awk '$1 ~ /J/' inventory-shipped
|
|
@print{} Jan 13 25 15 115
|
|
@print{} Jun 31 42 75 492
|
|
@print{} Jul 24 34 67 436
|
|
@print{} Jan 21 36 64 620
|
|
@end group
|
|
@end example
|
|
|
|
So does this:
|
|
|
|
@example
|
|
awk '@{ if ($1 ~ /J/) print @}' inventory-shipped
|
|
@end example
|
|
|
|
@item @var{exp} !~ /@var{regexp}/
|
|
This is true if the expression @var{exp} (taken as a character string)
|
|
is @emph{not} matched by @var{regexp}. The following example matches,
|
|
or selects, all input records whose first field @emph{does not} contain
|
|
the upper-case letter @samp{J}:
|
|
|
|
@example
|
|
@group
|
|
$ awk '$1 !~ /J/' inventory-shipped
|
|
@print{} Feb 15 32 24 226
|
|
@print{} Mar 15 24 34 228
|
|
@print{} Apr 31 52 63 420
|
|
@print{} May 16 34 29 208
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
@end table
|
|
|
|
@cindex regexp constant
|
|
When a regexp is written enclosed in slashes, like @code{/foo/}, we call it
|
|
a @dfn{regexp constant}, much like @code{5.27} is a numeric constant, and
|
|
@code{"foo"} is a string constant.
|
|
|
|
@node Escape Sequences, Regexp Operators, Regexp Usage, Regexp
|
|
@section Escape Sequences
|
|
|
|
@cindex escape sequence notation
|
|
Some characters cannot be included literally in string constants
|
|
(@code{"foo"}) or regexp constants (@code{/foo/}). You represent them
|
|
instead with @dfn{escape sequences}, which are character sequences
|
|
beginning with a backslash (@samp{\}).
|
|
|
|
One use of an escape sequence is to include a double-quote character in
|
|
a string constant. Since a plain double-quote would end the string, you
|
|
must use @samp{\"} to represent an actual double-quote character as a
|
|
part of the string. For example:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{ print "He said \"hi!\" to her." @}'
|
|
@print{} He said "hi!" to her.
|
|
@end example
|
|
|
|
The backslash character itself is another character that cannot be
|
|
included normally; you write @samp{\\} to put one backslash in the
|
|
string or regexp. Thus, the string whose contents are the two characters
|
|
@samp{"} and @samp{\} must be written @code{"\"\\"}.
|
|
|
|
Another use of backslash is to represent unprintable characters
|
|
such as tab or newline. While there is nothing to stop you from entering most
|
|
unprintable characters directly in a string constant or regexp constant,
|
|
they may look ugly.
|
|
|
|
Here is a table of all the escape sequences used in @code{awk}, and
|
|
what they represent. Unless noted otherwise, all of these escape
|
|
sequences apply to both string constants and regexp constants.
|
|
|
|
@c @cartouche
|
|
@table @code
|
|
@item \\
|
|
A literal backslash, @samp{\}.
|
|
|
|
@cindex @code{awk} language, V.4 version
|
|
@item \a
|
|
The ``alert'' character, @kbd{Control-g}, ASCII code 7 (BEL).
|
|
|
|
@item \b
|
|
Backspace, @kbd{Control-h}, ASCII code 8 (BS).
|
|
|
|
@item \f
|
|
Formfeed, @kbd{Control-l}, ASCII code 12 (FF).
|
|
|
|
@item \n
|
|
Newline, @kbd{Control-j}, ASCII code 10 (LF).
|
|
|
|
@item \r
|
|
Carriage return, @kbd{Control-m}, ASCII code 13 (CR).
|
|
|
|
@item \t
|
|
Horizontal tab, @kbd{Control-i}, ASCII code 9 (HT).
|
|
|
|
@cindex @code{awk} language, V.4 version
|
|
@item \v
|
|
Vertical tab, @kbd{Control-k}, ASCII code 11 (VT).
|
|
|
|
@item \@var{nnn}
|
|
The octal value @var{nnn}, where @var{nnn} are one to three digits
|
|
between @samp{0} and @samp{7}. For example, the code for the ASCII ESC
|
|
(escape) character is @samp{\033}.
|
|
|
|
@cindex @code{awk} language, V.4 version
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
@item \x@var{hh}@dots{}
|
|
The hexadecimal value @var{hh}, where @var{hh} are hexadecimal
|
|
digits (@samp{0} through @samp{9} and either @samp{A} through @samp{F} or
|
|
@samp{a} through @samp{f}). Like the same construct in ANSI C, the escape
|
|
sequence continues until the first non-hexadecimal digit is seen. However,
|
|
using more than two hexadecimal digits produces undefined results. (The
|
|
@samp{\x} escape sequence is not allowed in POSIX @code{awk}.)
|
|
|
|
@item \/
|
|
A literal slash (necessary for regexp constants only).
|
|
You use this when you wish to write a regexp
|
|
constant that contains a slash. Since the regexp is delimited by
|
|
slashes, you need to escape the slash that is part of the pattern,
|
|
in order to tell @code{awk} to keep processing the rest of the regexp.
|
|
|
|
@item \"
|
|
A literal double-quote (necessary for string constants only).
|
|
You use this when you wish to write a string
|
|
constant that contains a double-quote. Since the string is delimited by
|
|
double-quotes, you need to escape the quote that is part of the string,
|
|
in order to tell @code{awk} to keep processing the rest of the string.
|
|
@end table
|
|
@c @end cartouche
|
|
|
|
In @code{gawk}, there are additional two character sequences that begin
|
|
with backslash that have special meaning in regexps.
|
|
@xref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.
|
|
|
|
In a string constant,
|
|
what happens if you place a backslash before something that is not one of
|
|
the characters listed above? POSIX @code{awk} purposely leaves this case
|
|
undefined. There are two choices.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Strip the backslash out. This is what Unix @code{awk} and @code{gawk} both do.
|
|
For example, @code{"a\qc"} is the same as @code{"aqc"}.
|
|
|
|
@item
|
|
Leave the backslash alone. Some other @code{awk} implementations do this.
|
|
In such implementations, @code{"a\qc"} is the same as if you had typed
|
|
@code{"a\\qc"}.
|
|
@end itemize
|
|
|
|
In a regexp, a backslash before any character that is not in the above table,
|
|
and not listed in
|
|
@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}},
|
|
means that the next character should be taken literally, even if it would
|
|
normally be a regexp operator. E.g., @code{/a\+b/} matches the three
|
|
characters @samp{a+b}.
|
|
|
|
@cindex portability issues
|
|
For complete portability, do not use a backslash before any character not
|
|
listed in the table above.
|
|
|
|
Another interesting question arises. Suppose you use an octal or hexadecimal
|
|
escape to represent a regexp metacharacter
|
|
(@pxref{Regexp Operators, , Regular Expression Operators}).
|
|
Does @code{awk} treat the character as a literal character, or as a regexp
|
|
operator?
|
|
|
|
@cindex dark corner
|
|
It turns out that historically, such characters were taken literally (d.c.).
|
|
However, the POSIX standard indicates that they should be treated
|
|
as real metacharacters, and this is what @code{gawk} does.
|
|
However, in compatibility mode (@pxref{Options, ,Command Line Options}),
|
|
@code{gawk} treats the characters represented by octal and hexadecimal
|
|
escape sequences literally when used in regexp constants. Thus,
|
|
@code{/a\52b/} is equivalent to @code{/a\*b/}.
|
|
|
|
To summarize:
|
|
|
|
@enumerate 1
|
|
@item
|
|
The escape sequences in the table above are always processed first,
|
|
for both string constants and regexp constants. This happens very early,
|
|
as soon as @code{awk} reads your program.
|
|
|
|
@item
|
|
@code{gawk} processes both regexp constants and dynamic regexps
|
|
(@pxref{Computed Regexps, ,Using Dynamic Regexps}),
|
|
for the special operators listed in
|
|
@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.
|
|
|
|
@item
|
|
A backslash before any other character means to treat that character
|
|
literally.
|
|
@end enumerate
|
|
|
|
@node Regexp Operators, GNU Regexp Operators, Escape Sequences, Regexp
|
|
@section Regular Expression Operators
|
|
@cindex metacharacters
|
|
@cindex regular expression metacharacters
|
|
@cindex regexp operators
|
|
|
|
You can combine regular expressions with the following characters,
|
|
called @dfn{regular expression operators}, or @dfn{metacharacters}, to
|
|
increase the power and versatility of regular expressions.
|
|
|
|
The escape sequences described
|
|
@iftex
|
|
above
|
|
@end iftex
|
|
in @ref{Escape Sequences},
|
|
are valid inside a regexp. They are introduced by a @samp{\}. They
|
|
are recognized and converted into the corresponding real characters as
|
|
the very first step in processing regexps.
|
|
|
|
Here is a table of metacharacters. All characters that are not escape
|
|
sequences and that are not listed in the table stand for themselves.
|
|
|
|
@table @code
|
|
@item \
|
|
This is used to suppress the special meaning of a character when
|
|
matching. For example:
|
|
|
|
@example
|
|
\$
|
|
@end example
|
|
|
|
@noindent
|
|
matches the character @samp{$}.
|
|
|
|
@c NEEDED
|
|
@page
|
|
@cindex anchors in regexps
|
|
@cindex regexp, anchors
|
|
@item ^
|
|
This matches the beginning of a string. For example:
|
|
|
|
@example
|
|
^@@chapter
|
|
@end example
|
|
|
|
@noindent
|
|
matches the @samp{@@chapter} at the beginning of a string, and can be used
|
|
to identify chapter beginnings in Texinfo source files.
|
|
The @samp{^} is known as an @dfn{anchor}, since it anchors the pattern to
|
|
matching only at the beginning of the string.
|
|
|
|
It is important to realize that @samp{^} does not match the beginning of
|
|
a line embedded in a string. In this example the condition is not true:
|
|
|
|
@example
|
|
if ("line1\nLINE 2" ~ /^L/) @dots{}
|
|
@end example
|
|
|
|
@item $
|
|
This is similar to @samp{^}, but it matches only at the end of a string.
|
|
For example:
|
|
|
|
@example
|
|
p$
|
|
@end example
|
|
|
|
@noindent
|
|
matches a record that ends with a @samp{p}. The @samp{$} is also an anchor,
|
|
and also does not match the end of a line embedded in a string. In this
|
|
example the condition is not true:
|
|
|
|
@example
|
|
if ("line1\nLINE 2" ~ /1$/) @dots{}
|
|
@end example
|
|
|
|
@item .
|
|
The period, or dot, matches any single character,
|
|
@emph{including} the newline character. For example:
|
|
|
|
@example
|
|
.P
|
|
@end example
|
|
|
|
@noindent
|
|
matches any single character followed by a @samp{P} in a string. Using
|
|
concatenation we can make a regular expression like @samp{U.A}, which
|
|
matches any three-character sequence that begins with @samp{U} and ends
|
|
with @samp{A}.
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
In strict POSIX mode (@pxref{Options, ,Command Line Options}),
|
|
@samp{.} does not match the @sc{nul}
|
|
character, which is a character with all bits equal to zero.
|
|
Otherwise, @sc{nul} is just another character. Other versions of @code{awk}
|
|
may not be able to match the @sc{nul} character.
|
|
|
|
@ignore
|
|
2e: Add stuff that character list is the POSIX terminology. In other
|
|
literature known as character set or character class.
|
|
@end ignore
|
|
|
|
@cindex character list
|
|
@item [@dots{}]
|
|
This is called a @dfn{character list}. It matches any @emph{one} of the
|
|
characters that are enclosed in the square brackets. For example:
|
|
|
|
@example
|
|
[MVX]
|
|
@end example
|
|
|
|
@noindent
|
|
matches any one of the characters @samp{M}, @samp{V}, or @samp{X} in a
|
|
string.
|
|
|
|
Ranges of characters are indicated by using a hyphen between the beginning
|
|
and ending characters, and enclosing the whole thing in brackets. For
|
|
example:
|
|
|
|
@example
|
|
[0-9]
|
|
@end example
|
|
|
|
@noindent
|
|
matches any digit.
|
|
Multiple ranges are allowed. E.g., the list @code{@w{[A-Za-z0-9]}} is a
|
|
common way to express the idea of ``all alphanumeric characters.''
|
|
|
|
To include one of the characters @samp{\}, @samp{]}, @samp{-} or @samp{^} in a
|
|
character list, put a @samp{\} in front of it. For example:
|
|
|
|
@example
|
|
[d\]]
|
|
@end example
|
|
|
|
@noindent
|
|
matches either @samp{d}, or @samp{]}.
|
|
|
|
@cindex @code{egrep}
|
|
This treatment of @samp{\} in character lists
|
|
is compatible with other @code{awk}
|
|
implementations, and is also mandated by POSIX.
|
|
The regular expressions in @code{awk} are a superset
|
|
of the POSIX specification for Extended Regular Expressions (EREs).
|
|
POSIX EREs are based on the regular expressions accepted by the
|
|
traditional @code{egrep} utility.
|
|
|
|
@cindex character classes
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
@dfn{Character classes} are a new feature introduced in the POSIX standard.
|
|
A character class is a special notation for describing
|
|
lists of characters that have a specific attribute, but where the
|
|
actual characters themselves can vary from country to country and/or
|
|
from character set to character set. For example, the notion of what
|
|
is an alphabetic character differs in the USA and in France.
|
|
|
|
A character class is only valid in a regexp @emph{inside} the
|
|
brackets of a character list. Character classes consist of @samp{[:},
|
|
a keyword denoting the class, and @samp{:]}. Here are the character
|
|
classes defined by the POSIX standard.
|
|
|
|
@table @code
|
|
@item [:alnum:]
|
|
Alphanumeric characters.
|
|
|
|
@item [:alpha:]
|
|
Alphabetic characters.
|
|
|
|
@item [:blank:]
|
|
Space and tab characters.
|
|
|
|
@item [:cntrl:]
|
|
Control characters.
|
|
|
|
@item [:digit:]
|
|
Numeric characters.
|
|
|
|
@item [:graph:]
|
|
Characters that are printable and are also visible.
|
|
(A space is printable, but not visible, while an @samp{a} is both.)
|
|
|
|
@item [:lower:]
|
|
Lower-case alphabetic characters.
|
|
|
|
@item [:print:]
|
|
Printable characters (characters that are not control characters.)
|
|
|
|
@item [:punct:]
|
|
Punctuation characters (characters that are not letter, digits,
|
|
control characters, or space characters).
|
|
|
|
@item [:space:]
|
|
Space characters (such as space, tab, and formfeed, to name a few).
|
|
|
|
@item [:upper:]
|
|
Upper-case alphabetic characters.
|
|
|
|
@item [:xdigit:]
|
|
Characters that are hexadecimal digits.
|
|
@end table
|
|
|
|
For example, before the POSIX standard, to match alphanumeric
|
|
characters, you had to write @code{/[A-Za-z0-9]/}. If your
|
|
character set had other alphabetic characters in it, this would not
|
|
match them. With the POSIX character classes, you can write
|
|
@code{/[[:alnum:]]/}, and this will match @emph{all} the alphabetic
|
|
and numeric characters in your character set.
|
|
|
|
@cindex collating elements
|
|
Two additional special sequences can appear in character lists.
|
|
These apply to non-ASCII character sets, which can have single symbols
|
|
(called @dfn{collating elements}) that are represented with more than one
|
|
character, as well as several characters that are equivalent for
|
|
@dfn{collating}, or sorting, purposes. (E.g., in French, a plain ``e''
|
|
and a grave-accented ``@`e'' are equivalent.)
|
|
|
|
@table @asis
|
|
@cindex collating symbols
|
|
@item Collating Symbols
|
|
A @dfn{collating symbol} is a multi-character collating element enclosed in
|
|
@samp{[.} and @samp{.]}. For example, if @samp{ch} is a collating element,
|
|
then @code{[[.ch.]]} is a regexp that matches this collating element, while
|
|
@code{[ch]} is a regexp that matches either @samp{c} or @samp{h}.
|
|
|
|
@cindex equivalence classes
|
|
@item Equivalence Classes
|
|
An @dfn{equivalence class} is a locale-specific name for a list of
|
|
characters that are equivalent. The name is enclosed in
|
|
@samp{[=} and @samp{=]}.
|
|
For example, the name @samp{e} might be used to represent all of
|
|
``e,'' ``@`e,'' and ``@'e.'' In this case, @code{[[=e]]} is a regexp
|
|
that matches any of @samp{e}, @samp{@'e}, or @samp{@`e}.
|
|
@end table
|
|
|
|
These features are very valuable in non-English speaking locales.
|
|
|
|
@strong{Caution:} The library functions that @code{gawk} uses for regular
|
|
expression matching currently only recognize POSIX character classes;
|
|
they do not recognize collating symbols or equivalence classes.
|
|
@c maybe one day ...
|
|
|
|
@cindex complemented character list
|
|
@cindex character list, complemented
|
|
@item [^ @dots{}]
|
|
This is a @dfn{complemented character list}. The first character after
|
|
the @samp{[} @emph{must} be a @samp{^}. It matches any characters
|
|
@emph{except} those in the square brackets. For example:
|
|
|
|
@example
|
|
[^0-9]
|
|
@end example
|
|
|
|
@noindent
|
|
matches any character that is not a digit.
|
|
|
|
@item |
|
|
This is the @dfn{alternation operator}, and it is used to specify
|
|
alternatives. For example:
|
|
|
|
@example
|
|
^P|[0-9]
|
|
@end example
|
|
|
|
@noindent
|
|
matches any string that matches either @samp{^P} or @samp{[0-9]}. This
|
|
means it matches any string that starts with @samp{P} or contains a digit.
|
|
|
|
The alternation applies to the largest possible regexps on either side.
|
|
In other words, @samp{|} has the lowest precedence of all the regular
|
|
expression operators.
|
|
|
|
@item (@dots{})
|
|
Parentheses are used for grouping in regular expressions as in
|
|
arithmetic. They can be used to concatenate regular expressions
|
|
containing the alternation operator, @samp{|}. For example,
|
|
@samp{@@(samp|code)\@{[^@}]+\@}} matches both @samp{@@code@{foo@}} and
|
|
@samp{@@samp@{bar@}}. (These are Texinfo formatting control sequences.)
|
|
|
|
@item *
|
|
This symbol means that the preceding regular expression is to be
|
|
repeated as many times as necessary to find a match. For example:
|
|
|
|
@example
|
|
ph*
|
|
@end example
|
|
|
|
@noindent
|
|
applies the @samp{*} symbol to the preceding @samp{h} and looks for matches
|
|
of one @samp{p} followed by any number of @samp{h}s. This will also match
|
|
just @samp{p} if no @samp{h}s are present.
|
|
|
|
The @samp{*} repeats the @emph{smallest} possible preceding expression.
|
|
(Use parentheses if you wish to repeat a larger expression.) It finds
|
|
as many repetitions as possible. For example:
|
|
|
|
@example
|
|
awk '/\(c[ad][ad]*r x\)/ @{ print @}' sample
|
|
@end example
|
|
|
|
@noindent
|
|
prints every record in @file{sample} containing a string of the form
|
|
@samp{(car x)}, @samp{(cdr x)}, @samp{(cadr x)}, and so on.
|
|
Notice the escaping of the parentheses by preceding them
|
|
with backslashes.
|
|
|
|
@item +
|
|
This symbol is similar to @samp{*}, but the preceding expression must be
|
|
matched at least once. This means that:
|
|
|
|
@example
|
|
wh+y
|
|
@end example
|
|
|
|
@noindent
|
|
would match @samp{why} and @samp{whhy} but not @samp{wy}, whereas
|
|
@samp{wh*y} would match all three of these strings. This is a simpler
|
|
way of writing the last @samp{*} example:
|
|
|
|
@example
|
|
awk '/\(c[ad]+r x\)/ @{ print @}' sample
|
|
@end example
|
|
|
|
@item ?
|
|
This symbol is similar to @samp{*}, but the preceding expression can be
|
|
matched either once or not at all. For example:
|
|
|
|
@example
|
|
fe?d
|
|
@end example
|
|
|
|
@noindent
|
|
will match @samp{fed} and @samp{fd}, but nothing else.
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
@cindex interval expressions
|
|
@item @{@var{n}@}
|
|
@itemx @{@var{n},@}
|
|
@itemx @{@var{n},@var{m}@}
|
|
One or two numbers inside braces denote an @dfn{interval expression}.
|
|
If there is one number in the braces, the preceding regexp is repeated
|
|
@var{n} times.
|
|
If there are two numbers separated by a comma, the preceding regexp is
|
|
repeated @var{n} to @var{m} times.
|
|
If there is one number followed by a comma, then the preceding regexp
|
|
is repeated at least @var{n} times.
|
|
|
|
@table @code
|
|
@item wh@{3@}y
|
|
matches @samp{whhhy} but not @samp{why} or @samp{whhhhy}.
|
|
|
|
@item wh@{3,5@}y
|
|
matches @samp{whhhy} or @samp{whhhhy} or @samp{whhhhhy}, only.
|
|
|
|
@item wh@{2,@}y
|
|
matches @samp{whhy} or @samp{whhhy}, and so on.
|
|
@end table
|
|
|
|
Interval expressions were not traditionally available in @code{awk}.
|
|
As part of the POSIX standard they were added, to make @code{awk}
|
|
and @code{egrep} consistent with each other.
|
|
|
|
However, since old programs may use @samp{@{} and @samp{@}} in regexp
|
|
constants, by default @code{gawk} does @emph{not} match interval expressions
|
|
in regexps. If either @samp{--posix} or @samp{--re-interval} are specified
|
|
(@pxref{Options, , Command Line Options}), then interval expressions
|
|
are allowed in regexps.
|
|
@end table
|
|
|
|
@cindex precedence, regexp operators
|
|
@cindex regexp operators, precedence of
|
|
In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators,
|
|
as well as the braces @samp{@{} and @samp{@}},
|
|
have
|
|
the highest precedence, followed by concatenation, and finally by @samp{|}.
|
|
As in arithmetic, parentheses can change how operators are grouped.
|
|
|
|
If @code{gawk} is in compatibility mode
|
|
(@pxref{Options, ,Command Line Options}),
|
|
character classes and interval expressions are not available in
|
|
regular expressions.
|
|
|
|
The next
|
|
@ifinfo
|
|
node
|
|
@end ifinfo
|
|
@iftex
|
|
section
|
|
@end iftex
|
|
discusses the GNU-specific regexp operators, and provides
|
|
more detail concerning how command line options affect the way @code{gawk}
|
|
interprets the characters in regular expressions.
|
|
|
|
@node GNU Regexp Operators, Case-sensitivity, Regexp Operators, Regexp
|
|
@section Additional Regexp Operators Only in @code{gawk}
|
|
|
|
@c This section adapted from the regex-0.12 manual
|
|
|
|
@cindex regexp operators, GNU specific
|
|
GNU software that deals with regular expressions provides a number of
|
|
additional regexp operators. These operators are described in this
|
|
section, and are specific to @code{gawk}; they are not available in other
|
|
@code{awk} implementations.
|
|
|
|
@cindex word, regexp definition of
|
|
Most of the additional operators are for dealing with word matching.
|
|
For our purposes, a @dfn{word} is a sequence of one or more letters, digits,
|
|
or underscores (@samp{_}).
|
|
|
|
@table @code
|
|
@cindex @code{\w} regexp operator
|
|
@item \w
|
|
This operator matches any word-constituent character, i.e.@: any
|
|
letter, digit, or underscore. Think of it as a short-hand for
|
|
@c @w{@code{[A-Za-z0-9_]}} or
|
|
@w{@code{[[:alnum:]_]}}.
|
|
|
|
@cindex @code{\W} regexp operator
|
|
@item \W
|
|
This operator matches any character that is not word-constituent.
|
|
Think of it as a short-hand for
|
|
@c @w{@code{[^A-Za-z0-9_]}} or
|
|
@w{@code{[^[:alnum:]_]}}.
|
|
|
|
@cindex @code{\<} regexp operator
|
|
@item \<
|
|
This operator matches the empty string at the beginning of a word.
|
|
For example, @code{/\<away/} matches @samp{away}, but not
|
|
@samp{stowaway}.
|
|
|
|
@cindex @code{\>} regexp operator
|
|
@item \>
|
|
This operator matches the empty string at the end of a word.
|
|
For example, @code{/stow\>/} matches @samp{stow}, but not @samp{stowaway}.
|
|
|
|
@cindex @code{\y} regexp operator
|
|
@cindex word boundaries, matching
|
|
@item \y
|
|
This operator matches the empty string at either the beginning or the
|
|
end of a word (the word boundar@strong{y}). For example, @samp{\yballs?\y}
|
|
matches either @samp{ball} or @samp{balls} as a separate word.
|
|
|
|
@cindex @code{\B} regexp operator
|
|
@item \B
|
|
This operator matches the empty string within a word. In other words,
|
|
@samp{\B} matches the empty string that occurs between two
|
|
word-constituent characters. For example,
|
|
@code{/\Brat\B/} matches @samp{crate}, but it does not match @samp{dirty rat}.
|
|
@samp{\B} is essentially the opposite of @samp{\y}.
|
|
@end table
|
|
|
|
There are two other operators that work on buffers. In Emacs, a
|
|
@dfn{buffer} is, naturally, an Emacs buffer. For other programs, the
|
|
regexp library routines that @code{gawk} uses consider the entire
|
|
string to be matched as the buffer.
|
|
|
|
For @code{awk}, since @samp{^} and @samp{$} always work in terms
|
|
of the beginning and end of strings, these operators don't add any
|
|
new capabilities. They are provided for compatibility with other GNU
|
|
software.
|
|
|
|
@cindex buffer matching operators
|
|
@table @code
|
|
@cindex @code{\`} regexp operator
|
|
@item \`
|
|
This operator matches the empty string at the
|
|
beginning of the buffer.
|
|
|
|
@cindex @code{\'} regexp operator
|
|
@item \'
|
|
This operator matches the empty string at the
|
|
end of the buffer.
|
|
@end table
|
|
|
|
In other GNU software, the word boundary operator is @samp{\b}. However,
|
|
that conflicts with the @code{awk} language's definition of @samp{\b}
|
|
as backspace, so @code{gawk} uses a different letter.
|
|
|
|
An alternative method would have been to require two backslashes in the
|
|
GNU operators, but this was deemed to be too confusing, and the current
|
|
method of using @samp{\y} for the GNU @samp{\b} appears to be the
|
|
lesser of two evils.
|
|
|
|
@c NOTE!!! Keep this in sync with the same table in the summary appendix!
|
|
@cindex regexp, effect of command line options
|
|
The various command line options
|
|
(@pxref{Options, ,Command Line Options})
|
|
control how @code{gawk} interprets characters in regexps.
|
|
|
|
@table @asis
|
|
@item No options
|
|
In the default case, @code{gawk} provide all the facilities of
|
|
POSIX regexps and the GNU regexp operators described
|
|
@iftex
|
|
above.
|
|
@end iftex
|
|
@ifinfo
|
|
in @ref{Regexp Operators, ,Regular Expression Operators}.
|
|
@end ifinfo
|
|
However, interval expressions are not supported.
|
|
|
|
@item @code{--posix}
|
|
Only POSIX regexps are supported, the GNU operators are not special
|
|
(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions
|
|
are allowed.
|
|
|
|
@item @code{--traditional}
|
|
Traditional Unix @code{awk} regexps are matched. The GNU operators
|
|
are not special, interval expressions are not available, and neither
|
|
are the POSIX character classes (@code{[[:alnum:]]} and so on).
|
|
Characters described by octal and hexadecimal escape sequences are
|
|
treated literally, even if they represent regexp metacharacters.
|
|
|
|
@item @code{--re-interval}
|
|
Allow interval expressions in regexps, even if @samp{--traditional}
|
|
has been provided.
|
|
@end table
|
|
|
|
@node Case-sensitivity, Leftmost Longest, GNU Regexp Operators, Regexp
|
|
@section Case-sensitivity in Matching
|
|
|
|
@cindex case sensitivity
|
|
@cindex ignoring case
|
|
Case is normally significant in regular expressions, both when matching
|
|
ordinary characters (i.e.@: not metacharacters), and inside character
|
|
sets. Thus a @samp{w} in a regular expression matches only a lower-case
|
|
@samp{w} and not an upper-case @samp{W}.
|
|
|
|
The simplest way to do a case-independent match is to use a character
|
|
list: @samp{[Ww]}. However, this can be cumbersome if you need to use it
|
|
often; and it can make the regular expressions harder to
|
|
read. There are two alternatives that you might prefer.
|
|
|
|
One way to do a case-insensitive match at a particular point in the
|
|
program is to convert the data to a single case, using the
|
|
@code{tolower} or @code{toupper} built-in string functions (which we
|
|
haven't discussed yet;
|
|
@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
For example:
|
|
|
|
@example
|
|
tolower($1) ~ /foo/ @{ @dots{} @}
|
|
@end example
|
|
|
|
@noindent
|
|
converts the first field to lower-case before matching against it.
|
|
This will work in any POSIX-compliant implementation of @code{awk}.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
@cindex @code{~} operator
|
|
@cindex @code{!~} operator
|
|
@vindex IGNORECASE
|
|
Another method, specific to @code{gawk}, is to set the variable
|
|
@code{IGNORECASE} to a non-zero value (@pxref{Built-in Variables}).
|
|
When @code{IGNORECASE} is not zero, @emph{all} regexp and string
|
|
operations ignore case. Changing the value of
|
|
@code{IGNORECASE} dynamically controls the case sensitivity of your
|
|
program as it runs. Case is significant by default because
|
|
@code{IGNORECASE} (like most variables) is initialized to zero.
|
|
|
|
@example
|
|
@group
|
|
x = "aB"
|
|
if (x ~ /ab/) @dots{} # this test will fail
|
|
@end group
|
|
|
|
@group
|
|
IGNORECASE = 1
|
|
if (x ~ /ab/) @dots{} # now it will succeed
|
|
@end group
|
|
@end example
|
|
|
|
In general, you cannot use @code{IGNORECASE} to make certain rules
|
|
case-insensitive and other rules case-sensitive, because there is no way
|
|
to set @code{IGNORECASE} just for the pattern of a particular rule.
|
|
@ignore
|
|
This isn't quite true. Consider:
|
|
|
|
IGNORECASE=1 && /foObAr/ { .... }
|
|
IGNORECASE=0 || /foobar/ { .... }
|
|
|
|
But that's pretty bad style and I don't want to get into it at this
|
|
late date.
|
|
@end ignore
|
|
To do this, you must use character lists or @code{tolower}. However, one
|
|
thing you can do only with @code{IGNORECASE} is turn case-sensitivity on
|
|
or off dynamically for all the rules at once.
|
|
|
|
@code{IGNORECASE} can be set on the command line, or in a @code{BEGIN} rule
|
|
(@pxref{Other Arguments, ,Other Command Line Arguments}; also
|
|
@pxref{Using BEGIN/END, ,Startup and Cleanup Actions}).
|
|
Setting @code{IGNORECASE} from the command line is a way to make
|
|
a program case-insensitive without having to edit it.
|
|
|
|
Prior to version 3.0 of @code{gawk}, the value of @code{IGNORECASE}
|
|
only affected regexp operations. It did not affect string comparison
|
|
with @samp{==}, @samp{!=}, and so on.
|
|
Beginning with version 3.0, both regexp and string comparison
|
|
operations are affected by @code{IGNORECASE}.
|
|
|
|
@cindex ISO 8859-1
|
|
@cindex ISO Latin-1
|
|
Beginning with version 3.0 of @code{gawk}, the equivalences between upper-case
|
|
and lower-case characters are based on the ISO-8859-1 (ISO Latin-1)
|
|
character set. This character set is a superset of the traditional 128
|
|
ASCII characters, that also provides a number of characters suitable
|
|
for use with European languages.
|
|
@ignore
|
|
A pure ASCII character set can be used instead if @code{gawk} is compiled
|
|
with @samp{-DUSE_PURE_ASCII}.
|
|
@end ignore
|
|
|
|
The value of @code{IGNORECASE} has no effect if @code{gawk} is in
|
|
compatibility mode (@pxref{Options, ,Command Line Options}).
|
|
Case is always significant in compatibility mode.
|
|
|
|
@node Leftmost Longest, Computed Regexps, Case-sensitivity, Regexp
|
|
@section How Much Text Matches?
|
|
|
|
@cindex leftmost longest match
|
|
@cindex matching, leftmost longest
|
|
Consider the following example:
|
|
|
|
@example
|
|
echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
|
|
@end example
|
|
|
|
This example uses the @code{sub} function (which we haven't discussed yet,
|
|
@pxref{String Functions, ,Built-in Functions for String Manipulation})
|
|
to make a change to the input record. Here, the regexp @code{/a+/}
|
|
indicates ``one or more @samp{a} characters,'' and the replacement
|
|
text is @samp{<A>}.
|
|
|
|
The input contains four @samp{a} characters. What will the output be?
|
|
In other words, how many is ``one or more''---will @code{awk} match two,
|
|
three, or all four @samp{a} characters?
|
|
|
|
The answer is, @code{awk} (and POSIX) regular expressions always match
|
|
the leftmost, @emph{longest} sequence of input characters that can
|
|
match. Thus, in this example, all four @samp{a} characters are
|
|
replaced with @samp{<A>}.
|
|
|
|
@example
|
|
$ echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
|
|
@print{} <A>bcd
|
|
@end example
|
|
|
|
For simple match/no-match tests, this is not so important. But when doing
|
|
regexp-based field and record splitting, and
|
|
text matching and substitutions with the @code{match}, @code{sub}, @code{gsub},
|
|
and @code{gensub} functions, it is very important.
|
|
@ifinfo
|
|
@xref{String Functions, ,Built-in Functions for String Manipulation},
|
|
for more information on these functions.
|
|
@end ifinfo
|
|
Understanding this principle is also important for regexp-based record
|
|
and field splitting (@pxref{Records, ,How Input is Split into Records},
|
|
and also @pxref{Field Separators, ,Specifying How Fields are Separated}).
|
|
|
|
@node Computed Regexps, , Leftmost Longest, Regexp
|
|
@section Using Dynamic Regexps
|
|
|
|
@cindex computed regular expressions
|
|
@cindex regular expressions, computed
|
|
@cindex dynamic regular expressions
|
|
@cindex regexp, dynamic
|
|
@cindex @code{~} operator
|
|
@cindex @code{!~} operator
|
|
The right hand side of a @samp{~} or @samp{!~} operator need not be a
|
|
regexp constant (i.e.@: a string of characters between slashes). It may
|
|
be any expression. The expression is evaluated, and converted if
|
|
necessary to a string; the contents of the string are used as the
|
|
regexp. A regexp that is computed in this way is called a @dfn{dynamic
|
|
regexp}. For example:
|
|
|
|
@example
|
|
BEGIN @{ identifier_regexp = "[A-Za-z_][A-Za-z_0-9]+" @}
|
|
$0 ~ identifier_regexp @{ print @}
|
|
@end example
|
|
|
|
@noindent
|
|
sets @code{identifier_regexp} to a regexp that describes @code{awk}
|
|
variable names, and tests if the input record matches this regexp.
|
|
|
|
@strong{Caution:} When using the @samp{~} and @samp{!~}
|
|
operators, there is a difference between a regexp constant
|
|
enclosed in slashes, and a string constant enclosed in double quotes.
|
|
If you are going to use a string constant, you have to understand that
|
|
the string is in essence scanned @emph{twice}; the first time when
|
|
@code{awk} reads your program, and the second time when it goes to
|
|
match the string on the left-hand side of the operator with the pattern
|
|
on the right. This is true of any string valued expression (such as
|
|
@code{identifier_regexp} above), not just string constants.
|
|
|
|
@cindex regexp constants, difference between slashes and quotes
|
|
What difference does it make if the string is
|
|
scanned twice? The answer has to do with escape sequences, and particularly
|
|
with backslashes. To get a backslash into a regular expression inside a
|
|
string, you have to type two backslashes.
|
|
|
|
For example, @code{/\*/} is a regexp constant for a literal @samp{*}.
|
|
Only one backslash is needed. To do the same thing with a string,
|
|
you would have to type @code{"\\*"}. The first backslash escapes the
|
|
second one, so that the string actually contains the
|
|
two characters @samp{\} and @samp{*}.
|
|
|
|
@cindex common mistakes
|
|
@cindex mistakes, common
|
|
@cindex errors, common
|
|
Given that you can use both regexp and string constants to describe
|
|
regular expressions, which should you use? The answer is ``regexp
|
|
constants,'' for several reasons.
|
|
|
|
@enumerate 1
|
|
@item
|
|
String constants are more complicated to write, and
|
|
more difficult to read. Using regexp constants makes your programs
|
|
less error-prone. Not understanding the difference between the two
|
|
kinds of constants is a common source of errors.
|
|
|
|
@item
|
|
It is also more efficient to use regexp constants: @code{awk} can note
|
|
that you have supplied a regexp and store it internally in a form that
|
|
makes pattern matching more efficient. When using a string constant,
|
|
@code{awk} must first convert the string into this internal form, and
|
|
then perform the pattern matching.
|
|
|
|
@item
|
|
Using regexp constants is better style; it shows clearly that you
|
|
intend a regexp match.
|
|
@end enumerate
|
|
|
|
@node Reading Files, Printing, Regexp, Top
|
|
@chapter Reading Input Files
|
|
|
|
@cindex reading files
|
|
@cindex input
|
|
@cindex standard input
|
|
@vindex FILENAME
|
|
In the typical @code{awk} program, all input is read either from the
|
|
standard input (by default the keyboard, but often a pipe from another
|
|
command) or from files whose names you specify on the @code{awk} command
|
|
line. If you specify input files, @code{awk} reads them in order, reading
|
|
all the data from one before going on to the next. The name of the current
|
|
input file can be found in the built-in variable @code{FILENAME}
|
|
(@pxref{Built-in Variables}).
|
|
|
|
The input is read in units called @dfn{records}, and processed by the
|
|
rules of your program one record at a time.
|
|
By default, each record is one line. Each
|
|
record is automatically split into chunks called @dfn{fields}.
|
|
This makes it more convenient for programs to work on the parts of a record.
|
|
|
|
On rare occasions you will need to use the @code{getline} command.
|
|
The @code{getline} command is valuable, both because it
|
|
can do explicit input from any number of files, and because the files
|
|
used with it do not have to be named on the @code{awk} command line
|
|
(@pxref{Getline, ,Explicit Input with @code{getline}}).
|
|
|
|
@menu
|
|
* Records:: Controlling how data is split into records.
|
|
* Fields:: An introduction to fields.
|
|
* Non-Constant Fields:: Non-constant Field Numbers.
|
|
* Changing Fields:: Changing the Contents of a Field.
|
|
* Field Separators:: The field separator and how to change it.
|
|
* Constant Size:: Reading constant width data.
|
|
* Multiple Line:: Reading multi-line records.
|
|
* Getline:: Reading files under explicit program control
|
|
using the @code{getline} function.
|
|
@end menu
|
|
|
|
@node Records, Fields, Reading Files, Reading Files
|
|
@section How Input is Split into Records
|
|
|
|
@cindex record separator, @code{RS}
|
|
@cindex changing the record separator
|
|
@cindex record, definition of
|
|
@vindex RS
|
|
The @code{awk} utility divides the input for your @code{awk}
|
|
program into records and fields.
|
|
Records are separated by a character called the @dfn{record separator}.
|
|
By default, the record separator is the newline character.
|
|
This is why records are, by default, single lines.
|
|
You can use a different character for the record separator by
|
|
assigning the character to the built-in variable @code{RS}.
|
|
|
|
You can change the value of @code{RS} in the @code{awk} program,
|
|
like any other variable, with the
|
|
assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}).
|
|
The new record-separator character should be enclosed in quotation marks,
|
|
which indicate
|
|
a string constant. Often the right time to do this is at the beginning
|
|
of execution, before any input has been processed, so that the very
|
|
first record will be read with the proper separator. To do this, use
|
|
the special @code{BEGIN} pattern
|
|
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}). For
|
|
example:
|
|
|
|
@example
|
|
awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
changes the value of @code{RS} to @code{"/"}, before reading any input.
|
|
This is a string whose first character is a slash; as a result, records
|
|
are separated by slashes. Then the input file is read, and the second
|
|
rule in the @code{awk} program (the action with no pattern) prints each
|
|
record. Since each @code{print} statement adds a newline at the end of
|
|
its output, the effect of this @code{awk} program is to copy the input
|
|
with each slash changed to a newline. Here are the results of running
|
|
the program on @file{BBS-list}:
|
|
|
|
@example
|
|
@group
|
|
$ awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list
|
|
@print{} aardvark 555-5553 1200
|
|
@print{} 300 B
|
|
@print{} alpo-net 555-3412 2400
|
|
@print{} 1200
|
|
@print{} 300 A
|
|
@print{} barfly 555-7685 1200
|
|
@print{} 300 A
|
|
@print{} bites 555-1675 2400
|
|
@print{} 1200
|
|
@print{} 300 A
|
|
@print{} camelot 555-0542 300 C
|
|
@print{} core 555-2912 1200
|
|
@print{} 300 C
|
|
@print{} fooey 555-1234 2400
|
|
@print{} 1200
|
|
@print{} 300 B
|
|
@print{} foot 555-6699 1200
|
|
@print{} 300 B
|
|
@print{} macfoo 555-6480 1200
|
|
@print{} 300 A
|
|
@print{} sdace 555-3430 2400
|
|
@print{} 1200
|
|
@print{} 300 A
|
|
@print{} sabafoo 555-2127 1200
|
|
@print{} 300 C
|
|
@print{}
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
Note that the entry for the @samp{camelot} BBS is not split.
|
|
In the original data file
|
|
(@pxref{Sample Data Files, , Data Files for the Examples}),
|
|
the line looks like this:
|
|
|
|
@example
|
|
camelot 555-0542 300 C
|
|
@end example
|
|
|
|
@noindent
|
|
It only has one baud rate; there are no slashes in the record.
|
|
|
|
Another way to change the record separator is on the command line,
|
|
using the variable-assignment feature
|
|
(@pxref{Other Arguments, ,Other Command Line Arguments}).
|
|
|
|
@example
|
|
awk '@{ print $0 @}' RS="/" BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
This sets @code{RS} to @samp{/} before processing @file{BBS-list}.
|
|
|
|
Using an unusual character such as @samp{/} for the record separator
|
|
produces correct behavior in the vast majority of cases. However,
|
|
the following (extreme) pipeline prints a surprising @samp{1}. There
|
|
is one field, consisting of a newline. The value of the built-in
|
|
variable @code{NF} is the number of fields in the current record.
|
|
|
|
@example
|
|
$ echo | awk 'BEGIN @{ RS = "a" @} ; @{ print NF @}'
|
|
@print{} 1
|
|
@end example
|
|
|
|
@cindex dark corner
|
|
@noindent
|
|
Reaching the end of an input file terminates the current input record,
|
|
even if the last character in the file is not the character in @code{RS}
|
|
(d.c.).
|
|
|
|
@cindex empty string
|
|
The empty string, @code{""} (a string of no characters), has a special meaning
|
|
as the value of @code{RS}: it means that records are separated
|
|
by one or more blank lines, and nothing else.
|
|
@xref{Multiple Line, ,Multiple-Line Records}, for more details.
|
|
|
|
If you change the value of @code{RS} in the middle of an @code{awk} run,
|
|
the new value is used to delimit subsequent records, but the record
|
|
currently being processed (and records already processed) are not
|
|
affected.
|
|
|
|
@vindex RT
|
|
@cindex record terminator, @code{RT}
|
|
@cindex terminator, record
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
After the end of the record has been determined, @code{gawk}
|
|
sets the variable @code{RT} to the text in the input that matched
|
|
@code{RS}.
|
|
|
|
@cindex regular expressions as record separators
|
|
The value of @code{RS} is in fact not limited to a one-character
|
|
string. It can be any regular expression
|
|
(@pxref{Regexp, ,Regular Expressions}).
|
|
In general, each record
|
|
ends at the next string that matches the regular expression; the next
|
|
record starts at the end of the matching string. This general rule is
|
|
actually at work in the usual case, where @code{RS} contains just a
|
|
newline: a record ends at the beginning of the next matching string (the
|
|
next newline in the input) and the following record starts just after
|
|
the end of this string (at the first character of the following line).
|
|
The newline, since it matches @code{RS}, is not part of either record.
|
|
|
|
When @code{RS} is a single character, @code{RT} will
|
|
contain the same single character. However, when @code{RS} is a
|
|
regular expression, then @code{RT} becomes more useful; it contains
|
|
the actual input text that matched the regular expression.
|
|
|
|
The following example illustrates both of these features.
|
|
It sets @code{RS} equal to a regular expression that
|
|
matches either a newline, or a series of one or more upper-case letters
|
|
with optional leading and/or trailing white space
|
|
(@pxref{Regexp, , Regular Expressions}).
|
|
|
|
@example
|
|
$ echo record 1 AAAA record 2 BBBB record 3 |
|
|
> gawk 'BEGIN @{ RS = "\n|( *[[:upper:]]+ *)" @}
|
|
> @{ print "Record =", $0, "and RT =", RT @}'
|
|
@print{} Record = record 1 and RT = AAAA
|
|
@print{} Record = record 2 and RT = BBBB
|
|
@print{} Record = record 3 and RT =
|
|
@print{}
|
|
@end example
|
|
|
|
@noindent
|
|
The final line of output has an extra blank line. This is because the
|
|
value of @code{RT} is a newline, and then the @code{print} statement
|
|
supplies its own terminating newline.
|
|
|
|
@xref{Simple Sed, ,A Simple Stream Editor}, for a more useful example
|
|
of @code{RS} as a regexp and @code{RT}.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
The use of @code{RS} as a regular expression and the @code{RT}
|
|
variable are @code{gawk} extensions; they are not available in
|
|
compatibility mode
|
|
(@pxref{Options, ,Command Line Options}).
|
|
In compatibility mode, only the first character of the value of
|
|
@code{RS} is used to determine the end of the record.
|
|
|
|
@cindex number of records, @code{NR}, @code{FNR}
|
|
@vindex NR
|
|
@vindex FNR
|
|
The @code{awk} utility keeps track of the number of records that have
|
|
been read so far from the current input file. This value is stored in a
|
|
built-in variable called @code{FNR}. It is reset to zero when a new
|
|
file is started. Another built-in variable, @code{NR}, is the total
|
|
number of input records read so far from all data files. It starts at zero
|
|
but is never automatically reset to zero.
|
|
|
|
@node Fields, Non-Constant Fields, Records, Reading Files
|
|
@section Examining Fields
|
|
|
|
@cindex examining fields
|
|
@cindex fields
|
|
@cindex accessing fields
|
|
When @code{awk} reads an input record, the record is
|
|
automatically separated or @dfn{parsed} by the interpreter into chunks
|
|
called @dfn{fields}. By default, fields are separated by whitespace,
|
|
like words in a line.
|
|
Whitespace in @code{awk} means any string of one or more spaces,
|
|
tabs or newlines;@footnote{In POSIX @code{awk}, newlines are not
|
|
considered whitespace for separating fields.} other characters such as
|
|
formfeed, and so on, that are
|
|
considered whitespace by other languages are @emph{not} considered
|
|
whitespace by @code{awk}.
|
|
|
|
The purpose of fields is to make it more convenient for you to refer to
|
|
these pieces of the record. You don't have to use them---you can
|
|
operate on the whole record if you wish---but fields are what make
|
|
simple @code{awk} programs so powerful.
|
|
|
|
@cindex @code{$} (field operator)
|
|
@cindex field operator @code{$}
|
|
To refer to a field in an @code{awk} program, you use a dollar-sign,
|
|
@samp{$}, followed by the number of the field you want. Thus, @code{$1}
|
|
refers to the first field, @code{$2} to the second, and so on. For
|
|
example, suppose the following is a line of input:
|
|
|
|
@example
|
|
This seems like a pretty nice example.
|
|
@end example
|
|
|
|
@noindent
|
|
Here the first field, or @code{$1}, is @samp{This}; the second field, or
|
|
@code{$2}, is @samp{seems}; and so on. Note that the last field,
|
|
@code{$7}, is @samp{example.}. Because there is no space between the
|
|
@samp{e} and the @samp{.}, the period is considered part of the seventh
|
|
field.
|
|
|
|
@vindex NF
|
|
@cindex number of fields, @code{NF}
|
|
@code{NF} is a built-in variable whose value
|
|
is the number of fields in the current record.
|
|
@code{awk} updates the value of @code{NF} automatically, each time
|
|
a record is read.
|
|
|
|
No matter how many fields there are, the last field in a record can be
|
|
represented by @code{$NF}. So, in the example above, @code{$NF} would
|
|
be the same as @code{$7}, which is @samp{example.}. Why this works is
|
|
explained below (@pxref{Non-Constant Fields, ,Non-constant Field Numbers}).
|
|
If you try to reference a field beyond the last one, such as @code{$8}
|
|
when the record has only seven fields, you get the empty string.
|
|
@c the empty string acts like 0 in some contexts, but I don't want to
|
|
@c get into that here....
|
|
|
|
@code{$0}, which looks like a reference to the ``zeroth'' field, is
|
|
a special case: it represents the whole input record. @code{$0} is
|
|
used when you are not interested in fields.
|
|
|
|
Here are some more examples:
|
|
|
|
@example
|
|
@group
|
|
$ awk '$1 ~ /foo/ @{ print $0 @}' BBS-list
|
|
@print{} fooey 555-1234 2400/1200/300 B
|
|
@print{} foot 555-6699 1200/300 B
|
|
@print{} macfoo 555-6480 1200/300 A
|
|
@print{} sabafoo 555-2127 1200/300 C
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
This example prints each record in the file @file{BBS-list} whose first
|
|
field contains the string @samp{foo}. The operator @samp{~} is called a
|
|
@dfn{matching operator}
|
|
(@pxref{Regexp Usage, , How to Use Regular Expressions});
|
|
it tests whether a string (here, the field @code{$1}) matches a given regular
|
|
expression.
|
|
|
|
By contrast, the following example
|
|
looks for @samp{foo} in @emph{the entire record} and prints the first
|
|
field and the last field for each input record containing a
|
|
match.
|
|
|
|
@example
|
|
@group
|
|
$ awk '/foo/ @{ print $1, $NF @}' BBS-list
|
|
@print{} fooey B
|
|
@print{} foot B
|
|
@print{} macfoo A
|
|
@print{} sabafoo C
|
|
@end group
|
|
@end example
|
|
|
|
@node Non-Constant Fields, Changing Fields, Fields, Reading Files
|
|
@section Non-constant Field Numbers
|
|
|
|
The number of a field does not need to be a constant. Any expression in
|
|
the @code{awk} language can be used after a @samp{$} to refer to a
|
|
field. The value of the expression specifies the field number. If the
|
|
value is a string, rather than a number, it is converted to a number.
|
|
Consider this example:
|
|
|
|
@example
|
|
awk '@{ print $NR @}'
|
|
@end example
|
|
|
|
@noindent
|
|
Recall that @code{NR} is the number of records read so far: one in the
|
|
first record, two in the second, etc. So this example prints the first
|
|
field of the first record, the second field of the second record, and so
|
|
on. For the twentieth record, field number 20 is printed; most likely,
|
|
the record has fewer than 20 fields, so this prints a blank line.
|
|
|
|
Here is another example of using expressions as field numbers:
|
|
|
|
@example
|
|
awk '@{ print $(2*2) @}' BBS-list
|
|
@end example
|
|
|
|
@code{awk} must evaluate the expression @samp{(2*2)} and use
|
|
its value as the number of the field to print. The @samp{*} sign
|
|
represents multiplication, so the expression @samp{2*2} evaluates to four.
|
|
The parentheses are used so that the multiplication is done before the
|
|
@samp{$} operation; they are necessary whenever there is a binary
|
|
operator in the field-number expression. This example, then, prints the
|
|
hours of operation (the fourth field) for every line of the file
|
|
@file{BBS-list}. (All of the @code{awk} operators are listed, in
|
|
order of decreasing precedence, in
|
|
@ref{Precedence, , Operator Precedence (How Operators Nest)}.)
|
|
|
|
If the field number you compute is zero, you get the entire record.
|
|
Thus, @code{$(2-2)} has the same value as @code{$0}. Negative field
|
|
numbers are not allowed; trying to reference one will usually terminate
|
|
your running @code{awk} program. (The POSIX standard does not define
|
|
what happens when you reference a negative field number. @code{gawk}
|
|
will notice this and terminate your program. Other @code{awk}
|
|
implementations may behave differently.)
|
|
|
|
As mentioned in @ref{Fields, ,Examining Fields},
|
|
the number of fields in the current record is stored in the built-in
|
|
variable @code{NF} (also @pxref{Built-in Variables}). The expression
|
|
@code{$NF} is not a special feature: it is the direct consequence of
|
|
evaluating @code{NF} and using its value as a field number.
|
|
|
|
@node Changing Fields, Field Separators, Non-Constant Fields, Reading Files
|
|
@section Changing the Contents of a Field
|
|
|
|
@cindex field, changing contents of
|
|
@cindex changing contents of a field
|
|
@cindex assignment to fields
|
|
You can change the contents of a field as seen by @code{awk} within an
|
|
@code{awk} program; this changes what @code{awk} perceives as the
|
|
current input record. (The actual input is untouched; @code{awk} @emph{never}
|
|
modifies the input file.)
|
|
|
|
Consider this example and its output:
|
|
|
|
@example
|
|
@group
|
|
$ awk '@{ $3 = $2 - 10; print $2, $3 @}' inventory-shipped
|
|
@print{} 13 3
|
|
@print{} 15 5
|
|
@print{} 15 5
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
The @samp{-} sign represents subtraction, so this program reassigns
|
|
field three, @code{$3}, to be the value of field two minus ten,
|
|
@samp{$2 - 10}. (@xref{Arithmetic Ops, ,Arithmetic Operators}.)
|
|
Then field two, and the new value for field three, are printed.
|
|
|
|
In order for this to work, the text in field @code{$2} must make sense
|
|
as a number; the string of characters must be converted to a number in
|
|
order for the computer to do arithmetic on it. The number resulting
|
|
from the subtraction is converted back to a string of characters which
|
|
then becomes field three.
|
|
@xref{Conversion, ,Conversion of Strings and Numbers}.
|
|
|
|
When you change the value of a field (as perceived by @code{awk}), the
|
|
text of the input record is recalculated to contain the new field where
|
|
the old one was. Therefore, @code{$0} changes to reflect the altered
|
|
field. Thus, this program
|
|
prints a copy of the input file, with 10 subtracted from the second
|
|
field of each line.
|
|
|
|
@example
|
|
@group
|
|
$ awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped
|
|
@print{} Jan 3 25 15 115
|
|
@print{} Feb 5 32 24 226
|
|
@print{} Mar 5 24 34 228
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
You can also assign contents to fields that are out of range. For
|
|
example:
|
|
|
|
@example
|
|
$ awk '@{ $6 = ($5 + $4 + $3 + $2)
|
|
> print $6 @}' inventory-shipped
|
|
@print{} 168
|
|
@print{} 297
|
|
@print{} 301
|
|
@dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
We've just created @code{$6}, whose value is the sum of fields
|
|
@code{$2}, @code{$3}, @code{$4}, and @code{$5}. The @samp{+} sign
|
|
represents addition. For the file @file{inventory-shipped}, @code{$6}
|
|
represents the total number of parcels shipped for a particular month.
|
|
|
|
Creating a new field changes @code{awk}'s internal copy of the current
|
|
input record---the value of @code{$0}. Thus, if you do @samp{print $0}
|
|
after adding a field, the record printed includes the new field, with
|
|
the appropriate number of field separators between it and the previously
|
|
existing fields.
|
|
|
|
This recomputation affects and is affected by
|
|
@code{NF} (the number of fields; @pxref{Fields, ,Examining Fields}),
|
|
and by a feature that has not been discussed yet,
|
|
the @dfn{output field separator}, @code{OFS},
|
|
which is used to separate the fields (@pxref{Output Separators}).
|
|
For example, the value of @code{NF} is set to the number of the highest
|
|
field you create.
|
|
|
|
Note, however, that merely @emph{referencing} an out-of-range field
|
|
does @emph{not} change the value of either @code{$0} or @code{NF}.
|
|
Referencing an out-of-range field only produces an empty string. For
|
|
example:
|
|
|
|
@example
|
|
if ($(NF+1) != "")
|
|
print "can't happen"
|
|
else
|
|
print "everything is normal"
|
|
@end example
|
|
|
|
@noindent
|
|
should print @samp{everything is normal}, because @code{NF+1} is certain
|
|
to be out of range. (@xref{If Statement, ,The @code{if}-@code{else} Statement},
|
|
for more information about @code{awk}'s @code{if-else} statements.
|
|
@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions},
|
|
for more information about the @samp{!=} operator.)
|
|
|
|
It is important to note that making an assignment to an existing field
|
|
will change the
|
|
value of @code{$0}, but will not change the value of @code{NF},
|
|
even when you assign the empty string to a field. For example:
|
|
|
|
@example
|
|
@group
|
|
$ echo a b c d | awk '@{ OFS = ":"; $2 = ""
|
|
> print $0; print NF @}'
|
|
@print{} a::c:d
|
|
@print{} 4
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
The field is still there; it just has an empty value. You can tell
|
|
because there are two colons in a row.
|
|
|
|
This example shows what happens if you create a new field.
|
|
|
|
@example
|
|
$ echo a b c d | awk '@{ OFS = ":"; $2 = ""; $6 = "new"
|
|
> print $0; print NF @}'
|
|
@print{} a::c:d::new
|
|
@print{} 6
|
|
@end example
|
|
|
|
@noindent
|
|
The intervening field, @code{$5} is created with an empty value
|
|
(indicated by the second pair of adjacent colons),
|
|
and @code{NF} is updated with the value six.
|
|
|
|
Finally, decrementing @code{NF} will lose the values of the fields
|
|
after the new value of @code{NF}, and @code{$0} will be recomputed.
|
|
Here is an example:
|
|
|
|
@example
|
|
$ echo a b c d e f | ../gawk '@{ print "NF =", NF;
|
|
> NF = 3; print $0 @}'
|
|
@print{} NF = 6
|
|
@print{} a b c
|
|
@end example
|
|
|
|
@node Field Separators, Constant Size, Changing Fields, Reading Files
|
|
@section Specifying How Fields are Separated
|
|
|
|
This section is rather long; it describes one of the most fundamental
|
|
operations in @code{awk}.
|
|
|
|
@menu
|
|
* Basic Field Splitting:: How fields are split with single characters
|
|
or simple strings.
|
|
* Regexp Field Splitting:: Using regexps as the field separator.
|
|
* Single Character Fields:: Making each character a separate field.
|
|
* Command Line Field Separator:: Setting @code{FS} from the command line.
|
|
* Field Splitting Summary:: Some final points and a summary table.
|
|
@end menu
|
|
|
|
@node Basic Field Splitting, Regexp Field Splitting, Field Separators, Field Separators
|
|
@subsection The Basics of Field Separating
|
|
@vindex FS
|
|
@cindex fields, separating
|
|
@cindex field separator, @code{FS}
|
|
|
|
The @dfn{field separator}, which is either a single character or a regular
|
|
expression, controls the way @code{awk} splits an input record into fields.
|
|
@code{awk} scans the input record for character sequences that
|
|
match the separator; the fields themselves are the text between the matches.
|
|
|
|
In the examples below, we use the bullet symbol ``@bullet{}'' to represent
|
|
spaces in the output.
|
|
|
|
If the field separator is @samp{oo}, then the following line:
|
|
|
|
@example
|
|
moo goo gai pan
|
|
@end example
|
|
|
|
@noindent
|
|
would be split into three fields: @samp{m}, @samp{@bullet{}g} and
|
|
@samp{@bullet{}gai@bullet{}pan}.
|
|
Note the leading spaces in the values of the second and third fields.
|
|
|
|
@cindex common mistakes
|
|
@cindex mistakes, common
|
|
@cindex errors, common
|
|
The field separator is represented by the built-in variable @code{FS}.
|
|
Shell programmers take note! @code{awk} does @emph{not} use the name @code{IFS}
|
|
which is used by the POSIX compatible shells (such as the Bourne shell,
|
|
@code{sh}, or the GNU Bourne-Again Shell, Bash).
|
|
|
|
You can change the value of @code{FS} in the @code{awk} program with the
|
|
assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}).
|
|
Often the right time to do this is at the beginning of execution,
|
|
before any input has been processed, so that the very first record
|
|
will be read with the proper separator. To do this, use the special
|
|
@code{BEGIN} pattern
|
|
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
|
|
For example, here we set the value of @code{FS} to the string
|
|
@code{","}:
|
|
|
|
@example
|
|
awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}'
|
|
@end example
|
|
|
|
@noindent
|
|
Given the input line,
|
|
|
|
@example
|
|
John Q. Smith, 29 Oak St., Walamazoo, MI 42139
|
|
@end example
|
|
|
|
@noindent
|
|
this @code{awk} program extracts and prints the string
|
|
@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
|
|
|
|
@cindex field separator, choice of
|
|
@cindex regular expressions as field separators
|
|
Sometimes your input data will contain separator characters that don't
|
|
separate fields the way you thought they would. For instance, the
|
|
person's name in the example we just used might have a title or
|
|
suffix attached, such as @samp{John Q. Smith, LXIX}. From input
|
|
containing such a name:
|
|
|
|
@example
|
|
John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139
|
|
@end example
|
|
|
|
@noindent
|
|
@c careful of an overfull hbox here!
|
|
the above program would extract @samp{@bullet{}LXIX}, instead of
|
|
@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
|
|
If you were expecting the program to print the
|
|
address, you would be surprised. The moral is: choose your data layout and
|
|
separator characters carefully to prevent such problems.
|
|
|
|
@iftex
|
|
As you know, normally,
|
|
@end iftex
|
|
@ifinfo
|
|
Normally,
|
|
@end ifinfo
|
|
fields are separated by whitespace sequences
|
|
(spaces, tabs and newlines), not by single spaces: two spaces in a row do not
|
|
delimit an empty field. The default value of the field separator @code{FS}
|
|
is a string containing a single space, @w{@code{" "}}. If this value were
|
|
interpreted in the usual way, each space character would separate
|
|
fields, so two spaces in a row would make an empty field between them.
|
|
The reason this does not happen is that a single space as the value of
|
|
@code{FS} is a special case: it is taken to specify the default manner
|
|
of delimiting fields.
|
|
|
|
If @code{FS} is any other single character, such as @code{","}, then
|
|
each occurrence of that character separates two fields. Two consecutive
|
|
occurrences delimit an empty field. If the character occurs at the
|
|
beginning or the end of the line, that too delimits an empty field. The
|
|
space character is the only single character which does not follow these
|
|
rules.
|
|
|
|
@node Regexp Field Splitting, Single Character Fields, Basic Field Splitting, Field Separators
|
|
@subsection Using Regular Expressions to Separate Fields
|
|
|
|
The previous
|
|
@iftex
|
|
subsection
|
|
@end iftex
|
|
@ifinfo
|
|
node
|
|
@end ifinfo
|
|
discussed the use of single characters or simple strings as the
|
|
value of @code{FS}.
|
|
More generally, the value of @code{FS} may be a string containing any
|
|
regular expression. In this case, each match in the record for the regular
|
|
expression separates fields. For example, the assignment:
|
|
|
|
@example
|
|
FS = ", \t"
|
|
@end example
|
|
|
|
@noindent
|
|
makes every area of an input line that consists of a comma followed by a
|
|
space and a tab, into a field separator. (@samp{\t}
|
|
is an @dfn{escape sequence} that stands for a tab;
|
|
@pxref{Escape Sequences},
|
|
for the complete list of similar escape sequences.)
|
|
|
|
For a less trivial example of a regular expression, suppose you want
|
|
single spaces to separate fields the way single commas were used above.
|
|
You can set @code{FS} to @w{@code{"[@ ]"}} (left bracket, space, right
|
|
bracket). This regular expression matches a single space and nothing else
|
|
(@pxref{Regexp, ,Regular Expressions}).
|
|
|
|
There is an important difference between the two cases of @samp{FS = @w{" "}}
|
|
(a single space) and @samp{FS = @w{"[ \t\n]+"}} (left bracket, space,
|
|
backslash, ``t'', backslash, ``n'', right bracket, which is a regular
|
|
expression matching one or more spaces, tabs, or newlines). For both
|
|
values of @code{FS}, fields are separated by runs of spaces, tabs
|
|
and/or newlines. However, when the value of @code{FS} is @w{@code{"
|
|
"}}, @code{awk} will first strip leading and trailing whitespace from
|
|
the record, and then decide where the fields are.
|
|
|
|
For example, the following pipeline prints @samp{b}:
|
|
|
|
@example
|
|
$ echo ' a b c d ' | awk '@{ print $2 @}'
|
|
@print{} b
|
|
@end example
|
|
|
|
@noindent
|
|
However, this pipeline prints @samp{a} (note the extra spaces around
|
|
each letter):
|
|
|
|
@example
|
|
$ echo ' a b c d ' | awk 'BEGIN @{ FS = "[ \t]+" @}
|
|
> @{ print $2 @}'
|
|
@print{} a
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex null string
|
|
@cindex empty string
|
|
In this case, the first field is @dfn{null}, or empty.
|
|
|
|
The stripping of leading and trailing whitespace also comes into
|
|
play whenever @code{$0} is recomputed. For instance, study this pipeline:
|
|
|
|
@example
|
|
$ echo ' a b c d' | awk '@{ print; $2 = $2; print @}'
|
|
@print{} a b c d
|
|
@print{} a b c d
|
|
@end example
|
|
|
|
@noindent
|
|
The first @code{print} statement prints the record as it was read,
|
|
with leading whitespace intact. The assignment to @code{$2} rebuilds
|
|
@code{$0} by concatenating @code{$1} through @code{$NF} together,
|
|
separated by the value of @code{OFS}. Since the leading whitespace
|
|
was ignored when finding @code{$1}, it is not part of the new @code{$0}.
|
|
Finally, the last @code{print} statement prints the new @code{$0}.
|
|
|
|
@node Single Character Fields, Command Line Field Separator, Regexp Field Splitting, Field Separators
|
|
@subsection Making Each Character a Separate Field
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
@cindex single character fields
|
|
There are times when you may want to examine each character
|
|
of a record separately. In @code{gawk}, this is easy to do, you
|
|
simply assign the null string (@code{""}) to @code{FS}. In this case,
|
|
each individual character in the record will become a separate field.
|
|
Here is an example:
|
|
|
|
@example
|
|
@group
|
|
$ echo a b | gawk 'BEGIN @{ FS = "" @}
|
|
> @{
|
|
> for (i = 1; i <= NF; i = i + 1)
|
|
> print "Field", i, "is", $i
|
|
> @}'
|
|
@print{} Field 1 is a
|
|
@print{} Field 2 is
|
|
@print{} Field 3 is b
|
|
@end group
|
|
@end example
|
|
|
|
@cindex dark corner
|
|
Traditionally, the behavior for @code{FS} equal to @code{""} was not defined.
|
|
In this case, Unix @code{awk} would simply treat the entire record
|
|
as only having one field (d.c.). In compatibility mode
|
|
(@pxref{Options, ,Command Line Options}),
|
|
if @code{FS} is the null string, then @code{gawk} will also
|
|
behave this way.
|
|
|
|
@node Command Line Field Separator, Field Splitting Summary, Single Character Fields, Field Separators
|
|
@subsection Setting @code{FS} from the Command Line
|
|
@cindex @code{-F} option
|
|
@cindex field separator, on command line
|
|
@cindex command line, setting @code{FS} on
|
|
|
|
@code{FS} can be set on the command line. You use the @samp{-F} option to
|
|
do so. For example:
|
|
|
|
@example
|
|
awk -F, '@var{program}' @var{input-files}
|
|
@end example
|
|
|
|
@noindent
|
|
sets @code{FS} to be the @samp{,} character. Notice that the option uses
|
|
a capital @samp{F}. Contrast this with @samp{-f}, which specifies a file
|
|
containing an @code{awk} program. Case is significant in command line options:
|
|
the @samp{-F} and @samp{-f} options have nothing to do with each other.
|
|
You can use both options at the same time to set the @code{FS} variable
|
|
@emph{and} get an @code{awk} program from a file.
|
|
|
|
The value used for the argument to @samp{-F} is processed in exactly the
|
|
same way as assignments to the built-in variable @code{FS}. This means that
|
|
if the field separator contains special characters, they must be escaped
|
|
appropriately. For example, to use a @samp{\} as the field separator, you
|
|
would have to type:
|
|
|
|
@example
|
|
# same as FS = "\\"
|
|
awk -F\\\\ '@dots{}' files @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
Since @samp{\} is used for quoting in the shell, @code{awk} will see
|
|
@samp{-F\\}. Then @code{awk} processes the @samp{\\} for escape
|
|
characters (@pxref{Escape Sequences}), finally yielding
|
|
a single @samp{\} to be used for the field separator.
|
|
|
|
@cindex historical features
|
|
As a special case, in compatibility mode
|
|
(@pxref{Options, ,Command Line Options}), if the
|
|
argument to @samp{-F} is @samp{t}, then @code{FS} is set to the tab
|
|
character. This is because if you type @samp{-F\t} at the shell,
|
|
without any quotes, the @samp{\} gets deleted, so @code{awk} figures that you
|
|
really want your fields to be separated with tabs, and not @samp{t}s.
|
|
Use @samp{-v FS="t"} on the command line if you really do want to separate
|
|
your fields with @samp{t}s
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
For example, let's use an @code{awk} program file called @file{baud.awk}
|
|
that contains the pattern @code{/300/}, and the action @samp{print $1}.
|
|
Here is the program:
|
|
|
|
@example
|
|
/300/ @{ print $1 @}
|
|
@end example
|
|
|
|
Let's also set @code{FS} to be the @samp{-} character, and run the
|
|
program on the file @file{BBS-list}. The following command prints a
|
|
list of the names of the bulletin boards that operate at 300 baud and
|
|
the first three digits of their phone numbers:
|
|
|
|
@c tweaked to make the tex output look better in @smallbook
|
|
@example
|
|
@group
|
|
$ awk -F- -f baud.awk BBS-list
|
|
@print{} aardvark 555
|
|
@print{} alpo
|
|
@print{} barfly 555
|
|
@dots{}
|
|
@end group
|
|
@ignore
|
|
@print{} bites 555
|
|
@print{} camelot 555
|
|
@print{} core 555
|
|
@print{} fooey 555
|
|
@print{} foot 555
|
|
@print{} macfoo 555
|
|
@print{} sdace 555
|
|
@print{} sabafoo 555
|
|
@end ignore
|
|
@end example
|
|
|
|
@noindent
|
|
Note the second line of output. In the original file
|
|
(@pxref{Sample Data Files, ,Data Files for the Examples}),
|
|
the second line looked like this:
|
|
|
|
@example
|
|
alpo-net 555-3412 2400/1200/300 A
|
|
@end example
|
|
|
|
The @samp{-} as part of the system's name was used as the field
|
|
separator, instead of the @samp{-} in the phone number that was
|
|
originally intended. This demonstrates why you have to be careful in
|
|
choosing your field and record separators.
|
|
|
|
On many Unix systems, each user has a separate entry in the system password
|
|
file, one line per user. The information in these lines is separated
|
|
by colons. The first field is the user's logon name, and the second is
|
|
the user's encrypted password. A password file entry might look like this:
|
|
|
|
@example
|
|
arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
|
|
@end example
|
|
|
|
The following program searches the system password file, and prints
|
|
the entries for users who have no password:
|
|
|
|
@example
|
|
awk -F: '$2 == ""' /etc/passwd
|
|
@end example
|
|
|
|
@node Field Splitting Summary, , Command Line Field Separator, Field Separators
|
|
@subsection Field Splitting Summary
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
According to the POSIX standard, @code{awk} is supposed to behave
|
|
as if each record is split into fields at the time that it is read.
|
|
In particular, this means that you can change the value of @code{FS}
|
|
after a record is read, and the value of the fields (i.e.@: how they were split)
|
|
should reflect the old value of @code{FS}, not the new one.
|
|
|
|
@cindex dark corner
|
|
@cindex @code{sed} utility
|
|
@cindex stream editor
|
|
However, many implementations of @code{awk} do not work this way. Instead,
|
|
they defer splitting the fields until a field is actually
|
|
referenced. The fields will be split
|
|
using the @emph{current} value of @code{FS}! (d.c.)
|
|
This behavior can be difficult
|
|
to diagnose. The following example illustrates the difference
|
|
between the two methods.
|
|
(The @code{sed}@footnote{The @code{sed} utility is a ``stream editor.''
|
|
Its behavior is also defined by the POSIX standard.}
|
|
command prints just the first line of @file{/etc/passwd}.)
|
|
|
|
@example
|
|
sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}'
|
|
@end example
|
|
|
|
@noindent
|
|
will usually print
|
|
|
|
@example
|
|
root
|
|
@end example
|
|
|
|
@noindent
|
|
on an incorrect implementation of @code{awk}, while @code{gawk}
|
|
will print something like
|
|
|
|
@example
|
|
root:nSijPlPhZZwgE:0:0:Root:/:
|
|
@end example
|
|
|
|
The following table summarizes how fields are split, based on the
|
|
value of @code{FS}. (@samp{==} means ``is equal to.'')
|
|
|
|
@c @cartouche
|
|
@table @code
|
|
@item FS == " "
|
|
Fields are separated by runs of whitespace. Leading and trailing
|
|
whitespace are ignored. This is the default.
|
|
|
|
@item FS == @var{any other single character}
|
|
Fields are separated by each occurrence of the character. Multiple
|
|
successive occurrences delimit empty fields, as do leading and
|
|
trailing occurrences.
|
|
The character can even be a regexp metacharacter; it does not need
|
|
to be escaped.
|
|
|
|
@item FS == @var{regexp}
|
|
Fields are separated by occurrences of characters that match @var{regexp}.
|
|
Leading and trailing matches of @var{regexp} delimit empty fields.
|
|
|
|
@item FS == ""
|
|
Each individual character in the record becomes a separate field.
|
|
@end table
|
|
@c @end cartouche
|
|
|
|
@node Constant Size, Multiple Line, Field Separators, Reading Files
|
|
@section Reading Fixed-width Data
|
|
|
|
(This section discusses an advanced, experimental feature. If you are
|
|
a novice @code{awk} user, you may wish to skip it on the first reading.)
|
|
|
|
@code{gawk} version 2.13 introduced a new facility for dealing with
|
|
fixed-width fields with no distinctive field separator. Data of this
|
|
nature arises, for example, in the input for old FORTRAN programs where
|
|
numbers are run together; or in the output of programs that did not
|
|
anticipate the use of their output as input for other programs.
|
|
|
|
An example of the latter is a table where all the columns are lined up by
|
|
the use of a variable number of spaces and @emph{empty fields are just
|
|
spaces}. Clearly, @code{awk}'s normal field splitting based on @code{FS}
|
|
will not work well in this case. Although a portable @code{awk} program
|
|
can use a series of @code{substr} calls on @code{$0}
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
|
|
this is awkward and inefficient for a large number of fields.
|
|
|
|
The splitting of an input record into fixed-width fields is specified by
|
|
assigning a string containing space-separated numbers to the built-in
|
|
variable @code{FIELDWIDTHS}. Each number specifies the width of the field
|
|
@emph{including} columns between fields. If you want to ignore the columns
|
|
between fields, you can specify the width as a separate field that is
|
|
subsequently ignored.
|
|
|
|
The following data is the output of the Unix @code{w} utility. It is useful
|
|
to illustrate the use of @code{FIELDWIDTHS}.
|
|
|
|
@example
|
|
@group
|
|
10:06pm up 21 days, 14:04, 23 users
|
|
User tty login@ idle JCPU PCPU what
|
|
hzuo ttyV0 8:58pm 9 5 vi p24.tex
|
|
hzang ttyV3 6:37pm 50 -csh
|
|
eklye ttyV5 9:53pm 7 1 em thes.tex
|
|
dportein ttyV6 8:17pm 1:47 -csh
|
|
gierd ttyD3 10:00pm 1 elm
|
|
dave ttyD4 9:47pm 4 4 w
|
|
brent ttyp0 26Jun91 4:46 26:46 4:41 bash
|
|
dave ttyq4 26Jun9115days 46 46 wnewmail
|
|
@end group
|
|
@end example
|
|
|
|
The following program takes the above input, converts the idle time to
|
|
number of seconds and prints out the first two fields and the calculated
|
|
idle time. (This program uses a number of @code{awk} features that
|
|
haven't been introduced yet.)
|
|
|
|
@example
|
|
@group
|
|
BEGIN @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @}
|
|
NR > 2 @{
|
|
idle = $4
|
|
sub(/^ */, "", idle) # strip leading spaces
|
|
if (idle == "")
|
|
idle = 0
|
|
if (idle ~ /:/) @{
|
|
split(idle, t, ":")
|
|
idle = t[1] * 60 + t[2]
|
|
@}
|
|
if (idle ~ /days/)
|
|
idle *= 24 * 60 * 60
|
|
|
|
print $1, $2, idle
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
Here is the result of running the program on the data:
|
|
|
|
@example
|
|
hzuo ttyV0 0
|
|
hzang ttyV3 50
|
|
eklye ttyV5 0
|
|
dportein ttyV6 107
|
|
gierd ttyD3 1
|
|
dave ttyD4 0
|
|
brent ttyp0 286
|
|
dave ttyq4 1296000
|
|
@end example
|
|
|
|
Another (possibly more practical) example of fixed-width input data
|
|
would be the input from a deck of balloting cards. In some parts of
|
|
the United States, voters mark their choices by punching holes in computer
|
|
cards. These cards are then processed to count the votes for any particular
|
|
candidate or on any particular issue. Since a voter may choose not to
|
|
vote on some issue, any column on the card may be empty. An @code{awk}
|
|
program for processing such data could use the @code{FIELDWIDTHS} feature
|
|
to simplify reading the data. (Of course, getting @code{gawk} to run on
|
|
a system with card readers is another story!)
|
|
|
|
@ignore
|
|
Exercise: Write a ballot card reading program
|
|
@end ignore
|
|
|
|
Assigning a value to @code{FS} causes @code{gawk} to return to using
|
|
@code{FS} for field splitting. Use @samp{FS = FS} to make this happen,
|
|
without having to know the current value of @code{FS}.
|
|
|
|
This feature is still experimental, and may evolve over time.
|
|
Note that in particular, @code{gawk} does not attempt to verify
|
|
the sanity of the values used in the value of @code{FIELDWIDTHS}.
|
|
|
|
@node Multiple Line, Getline, Constant Size, Reading Files
|
|
@section Multiple-Line Records
|
|
|
|
@cindex multiple line records
|
|
@cindex input, multiple line records
|
|
@cindex reading files, multiple line records
|
|
@cindex records, multiple line
|
|
In some data bases, a single line cannot conveniently hold all the
|
|
information in one entry. In such cases, you can use multi-line
|
|
records.
|
|
|
|
The first step in doing this is to choose your data format: when records
|
|
are not defined as single lines, how do you want to define them?
|
|
What should separate records?
|
|
|
|
One technique is to use an unusual character or string to separate
|
|
records. For example, you could use the formfeed character (written
|
|
@samp{\f} in @code{awk}, as in C) to separate them, making each record
|
|
a page of the file. To do this, just set the variable @code{RS} to
|
|
@code{"\f"} (a string containing the formfeed character). Any
|
|
other character could equally well be used, as long as it won't be part
|
|
of the data in a record.
|
|
|
|
Another technique is to have blank lines separate records. By a special
|
|
dispensation, an empty string as the value of @code{RS} indicates that
|
|
records are separated by one or more blank lines. If you set @code{RS}
|
|
to the empty string, a record always ends at the first blank line
|
|
encountered. And the next record doesn't start until the first non-blank
|
|
line that follows---no matter how many blank lines appear in a row, they
|
|
are considered one record-separator.
|
|
|
|
@cindex leftmost longest match
|
|
@cindex matching, leftmost longest
|
|
You can achieve the same effect as @samp{RS = ""} by assigning the
|
|
string @code{"\n\n+"} to @code{RS}. This regexp matches the newline
|
|
at the end of the record, and one or more blank lines after the record.
|
|
In addition, a regular expression always matches the longest possible
|
|
sequence when there is a choice
|
|
(@pxref{Leftmost Longest, ,How Much Text Matches?}).
|
|
So the next record doesn't start until
|
|
the first non-blank line that follows---no matter how many blank lines
|
|
appear in a row, they are considered one record-separator.
|
|
|
|
@cindex dark corner
|
|
There is an important difference between @samp{RS = ""} and
|
|
@samp{RS = "\n\n+"}. In the first case, leading newlines in the input
|
|
data file are ignored, and if a file ends without extra blank lines
|
|
after the last record, the final newline is removed from the record.
|
|
In the second case, this special processing is not done (d.c.).
|
|
|
|
Now that the input is separated into records, the second step is to
|
|
separate the fields in the record. One way to do this is to divide each
|
|
of the lines into fields in the normal manner. This happens by default
|
|
as the result of a special feature: when @code{RS} is set to the empty
|
|
string, the newline character @emph{always} acts as a field separator.
|
|
This is in addition to whatever field separations result from @code{FS}.
|
|
|
|
The original motivation for this special exception was probably to provide
|
|
useful behavior in the default case (i.e.@: @code{FS} is equal
|
|
to @w{@code{" "}}). This feature can be a problem if you really don't
|
|
want the newline character to separate fields, since there is no way to
|
|
prevent it. However, you can work around this by using the @code{split}
|
|
function to break up the record manually
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
|
|
Another way to separate fields is to
|
|
put each field on a separate line: to do this, just set the
|
|
variable @code{FS} to the string @code{"\n"}. (This simple regular
|
|
expression matches a single newline.)
|
|
|
|
A practical example of a data file organized this way might be a mailing
|
|
list, where each entry is separated by blank lines. If we have a mailing
|
|
list in a file named @file{addresses}, that looks like this:
|
|
|
|
@example
|
|
Jane Doe
|
|
123 Main Street
|
|
Anywhere, SE 12345-6789
|
|
|
|
John Smith
|
|
456 Tree-lined Avenue
|
|
Smallville, MW 98765-4321
|
|
|
|
@dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
A simple program to process this file would look like this:
|
|
|
|
@example
|
|
@group
|
|
# addrs.awk --- simple mailing list program
|
|
|
|
# Records are separated by blank lines.
|
|
# Each line is one field.
|
|
BEGIN @{ RS = "" ; FS = "\n" @}
|
|
|
|
@{
|
|
print "Name is:", $1
|
|
print "Address is:", $2
|
|
print "City and State are:", $3
|
|
print ""
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
Running the program produces the following output:
|
|
|
|
@example
|
|
@group
|
|
$ awk -f addrs.awk addresses
|
|
@print{} Name is: Jane Doe
|
|
@print{} Address is: 123 Main Street
|
|
@print{} City and State are: Anywhere, SE 12345-6789
|
|
@print{}
|
|
@end group
|
|
@group
|
|
@print{} Name is: John Smith
|
|
@print{} Address is: 456 Tree-lined Avenue
|
|
@print{} City and State are: Smallville, MW 98765-4321
|
|
@print{}
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
@xref{Labels Program, ,Printing Mailing Labels}, for a more realistic
|
|
program that deals with address lists.
|
|
|
|
The following table summarizes how records are split, based on the
|
|
value of @code{RS}. (@samp{==} means ``is equal to.'')
|
|
|
|
@c @cartouche
|
|
@table @code
|
|
@item RS == "\n"
|
|
Records are separated by the newline character (@samp{\n}). In effect,
|
|
every line in the data file is a separate record, including blank lines.
|
|
This is the default.
|
|
|
|
@item RS == @var{any single character}
|
|
Records are separated by each occurrence of the character. Multiple
|
|
successive occurrences delimit empty records.
|
|
|
|
@item RS == ""
|
|
Records are separated by runs of blank lines. The newline character
|
|
always serves as a field separator, in addition to whatever value
|
|
@code{FS} may have. Leading and trailing newlines in a file are ignored.
|
|
|
|
@item RS == @var{regexp}
|
|
Records are separated by occurrences of characters that match @var{regexp}.
|
|
Leading and trailing matches of @var{regexp} delimit empty records.
|
|
@end table
|
|
@c @end cartouche
|
|
|
|
@vindex RT
|
|
In all cases, @code{gawk} sets @code{RT} to the input text that matched the
|
|
value specified by @code{RS}.
|
|
|
|
@node Getline, , Multiple Line, Reading Files
|
|
@section Explicit Input with @code{getline}
|
|
|
|
@findex getline
|
|
@cindex input, explicit
|
|
@cindex explicit input
|
|
@cindex input, @code{getline} command
|
|
@cindex reading files, @code{getline} command
|
|
So far we have been getting our input data from @code{awk}'s main
|
|
input stream---either the standard input (usually your terminal, sometimes
|
|
the output from another program) or from the
|
|
files specified on the command line. The @code{awk} language has a
|
|
special built-in command called @code{getline} that
|
|
can be used to read input under your explicit control.
|
|
|
|
@menu
|
|
* Getline Intro:: Introduction to the @code{getline} function.
|
|
* Plain Getline:: Using @code{getline} with no arguments.
|
|
* Getline/Variable:: Using @code{getline} into a variable.
|
|
* Getline/File:: Using @code{getline} from a file.
|
|
* Getline/Variable/File:: Using @code{getline} into a variable from a
|
|
file.
|
|
* Getline/Pipe:: Using @code{getline} from a pipe.
|
|
* Getline/Variable/Pipe:: Using @code{getline} into a variable from a
|
|
pipe.
|
|
* Getline Summary:: Summary Of @code{getline} Variants.
|
|
@end menu
|
|
|
|
@node Getline Intro, Plain Getline, Getline, Getline
|
|
@subsection Introduction to @code{getline}
|
|
|
|
This command is used in several different ways, and should @emph{not} be
|
|
used by beginners. It is covered here because this is the chapter on input.
|
|
The examples that follow the explanation of the @code{getline} command
|
|
include material that has not been covered yet. Therefore, come back
|
|
and study the @code{getline} command @emph{after} you have reviewed the
|
|
rest of this @value{DOCUMENT} and have a good knowledge of how @code{awk} works.
|
|
|
|
@vindex ERRNO
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
@cindex @code{getline}, return values
|
|
@code{getline} returns one if it finds a record, and zero if the end of the
|
|
file is encountered. If there is some error in getting a record, such
|
|
as a file that cannot be opened, then @code{getline} returns @minus{}1.
|
|
In this case, @code{gawk} sets the variable @code{ERRNO} to a string
|
|
describing the error that occurred.
|
|
|
|
In the following examples, @var{command} stands for a string value that
|
|
represents a shell command.
|
|
|
|
@node Plain Getline, Getline/Variable, Getline Intro, Getline
|
|
@subsection Using @code{getline} with No Arguments
|
|
|
|
The @code{getline} command can be used without arguments to read input
|
|
from the current input file. All it does in this case is read the next
|
|
input record and split it up into fields. This is useful if you've
|
|
finished processing the current record, but you want to do some special
|
|
processing @emph{right now} on the next record. Here's an
|
|
example:
|
|
|
|
@example
|
|
@group
|
|
awk '@{
|
|
if ((t = index($0, "/*")) != 0) @{
|
|
# value will be "" if t is 1
|
|
tmp = substr($0, 1, t - 1)
|
|
u = index(substr($0, t + 2), "*/")
|
|
while (u == 0) @{
|
|
if (getline <= 0) @{
|
|
m = "unexpected EOF or error"
|
|
m = (m ": " ERRNO)
|
|
print m > "/dev/stderr"
|
|
exit
|
|
@}
|
|
t = -1
|
|
u = index($0, "*/")
|
|
@}
|
|
@end group
|
|
@group
|
|
# substr expression will be "" if */
|
|
# occurred at end of line
|
|
$0 = tmp substr($0, t + u + 3)
|
|
@}
|
|
print $0
|
|
@}'
|
|
@end group
|
|
@end example
|
|
|
|
This @code{awk} program deletes all C-style comments, @samp{/* @dots{}
|
|
*/}, from the input. By replacing the @samp{print $0} with other
|
|
statements, you could perform more complicated processing on the
|
|
decommented input, like searching for matches of a regular
|
|
expression. This program has a subtle problem---it does not work if one
|
|
comment ends and another begins on the same line.
|
|
|
|
@ignore
|
|
Exercise,
|
|
write a program that does handle multiple comments on the line.
|
|
@end ignore
|
|
|
|
This form of the @code{getline} command sets @code{NF} (the number of
|
|
fields; @pxref{Fields, ,Examining Fields}), @code{NR} (the number of
|
|
records read so far; @pxref{Records, ,How Input is Split into Records}),
|
|
@code{FNR} (the number of records read from this input file), and the
|
|
value of @code{$0}.
|
|
|
|
@cindex dark corner
|
|
@strong{Note:} the new value of @code{$0} is used in testing
|
|
the patterns of any subsequent rules. The original value
|
|
of @code{$0} that triggered the rule which executed @code{getline}
|
|
is lost (d.c.).
|
|
By contrast, the @code{next} statement reads a new record
|
|
but immediately begins processing it normally, starting with the first
|
|
rule in the program. @xref{Next Statement, ,The @code{next} Statement}.
|
|
|
|
@node Getline/Variable, Getline/File, Plain Getline, Getline
|
|
@subsection Using @code{getline} Into a Variable
|
|
|
|
You can use @samp{getline @var{var}} to read the next record from
|
|
@code{awk}'s input into the variable @var{var}. No other processing is
|
|
done.
|
|
|
|
For example, suppose the next line is a comment, or a special string,
|
|
and you want to read it, without triggering
|
|
any rules. This form of @code{getline} allows you to read that line
|
|
and store it in a variable so that the main
|
|
read-a-line-and-check-each-rule loop of @code{awk} never sees it.
|
|
|
|
The following example swaps every two lines of input. For example, given:
|
|
|
|
@example
|
|
wan
|
|
tew
|
|
free
|
|
phore
|
|
@end example
|
|
|
|
@noindent
|
|
it outputs:
|
|
|
|
@example
|
|
tew
|
|
wan
|
|
phore
|
|
free
|
|
@end example
|
|
|
|
@noindent
|
|
Here's the program:
|
|
|
|
@example
|
|
@group
|
|
awk '@{
|
|
if ((getline tmp) > 0) @{
|
|
print tmp
|
|
print $0
|
|
@} else
|
|
print $0
|
|
@}'
|
|
@end group
|
|
@end example
|
|
|
|
The @code{getline} command used in this way sets only the variables
|
|
@code{NR} and @code{FNR} (and of course, @var{var}). The record is not
|
|
split into fields, so the values of the fields (including @code{$0}) and
|
|
the value of @code{NF} do not change.
|
|
|
|
@node Getline/File, Getline/Variable/File, Getline/Variable, Getline
|
|
@subsection Using @code{getline} from a File
|
|
|
|
@cindex input redirection
|
|
@cindex redirection of input
|
|
Use @samp{getline < @var{file}} to read
|
|
the next record from the file
|
|
@var{file}. Here @var{file} is a string-valued expression that
|
|
specifies the file name. @samp{< @var{file}} is called a @dfn{redirection}
|
|
since it directs input to come from a different place.
|
|
|
|
For example, the following
|
|
program reads its input record from the file @file{secondary.input} when it
|
|
encounters a first field with a value equal to 10 in the current input
|
|
file.
|
|
|
|
@example
|
|
@group
|
|
awk '@{
|
|
if ($1 == 10) @{
|
|
getline < "secondary.input"
|
|
print
|
|
@} else
|
|
print
|
|
@}'
|
|
@end group
|
|
@end example
|
|
|
|
Since the main input stream is not used, the values of @code{NR} and
|
|
@code{FNR} are not changed. But the record read is split into fields in
|
|
the normal manner, so the values of @code{$0} and other fields are
|
|
changed. So is the value of @code{NF}.
|
|
|
|
@c Thanks to Paul Eggert for initial wording here
|
|
According to POSIX, @samp{getline < @var{expression}} is ambiguous if
|
|
@var{expression} contains unparenthesized operators other than
|
|
@samp{$}; for example, @samp{getline < dir "/" file} is ambiguous
|
|
because the concatenation operator is not parenthesized, and you should
|
|
write it as @samp{getline < (dir "/" file)} if you want your program
|
|
to be portable to other @code{awk} implementations.
|
|
|
|
@node Getline/Variable/File, Getline/Pipe, Getline/File, Getline
|
|
@subsection Using @code{getline} Into a Variable from a File
|
|
|
|
Use @samp{getline @var{var} < @var{file}} to read input
|
|
the file
|
|
@var{file} and put it in the variable @var{var}. As above, @var{file}
|
|
is a string-valued expression that specifies the file from which to read.
|
|
|
|
In this version of @code{getline}, none of the built-in variables are
|
|
changed, and the record is not split into fields. The only variable
|
|
changed is @var{var}.
|
|
|
|
@ifinfo
|
|
@c Thanks to Paul Eggert for initial wording here
|
|
According to POSIX, @samp{getline @var{var} < @var{expression}} is ambiguous if
|
|
@var{expression} contains unparenthesized operators other than
|
|
@samp{$}; for example, @samp{getline < dir "/" file} is ambiguous
|
|
because the concatenation operator is not parenthesized, and you should
|
|
write it as @samp{getline < (dir "/" file)} if you want your program
|
|
to be portable to other @code{awk} implementations.
|
|
@end ifinfo
|
|
|
|
For example, the following program copies all the input files to the
|
|
output, except for records that say @w{@samp{@@include @var{filename}}}.
|
|
Such a record is replaced by the contents of the file
|
|
@var{filename}.
|
|
|
|
@example
|
|
@group
|
|
awk '@{
|
|
if (NF == 2 && $1 == "@@include") @{
|
|
while ((getline line < $2) > 0)
|
|
print line
|
|
close($2)
|
|
@} else
|
|
print
|
|
@}'
|
|
@end group
|
|
@end example
|
|
|
|
Note here how the name of the extra input file is not built into
|
|
the program; it is taken directly from the data, from the second field on
|
|
the @samp{@@include} line.
|
|
|
|
The @code{close} function is called to ensure that if two identical
|
|
@samp{@@include} lines appear in the input, the entire specified file is
|
|
included twice.
|
|
@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}.
|
|
|
|
One deficiency of this program is that it does not process nested
|
|
@samp{@@include} statements
|
|
(@samp{@@include} statements in included files)
|
|
the way a true macro preprocessor would.
|
|
@xref{Igawk Program, ,An Easy Way to Use Library Functions}, for a program
|
|
that does handle nested @samp{@@include} statements.
|
|
|
|
@node Getline/Pipe, Getline/Variable/Pipe, Getline/Variable/File, Getline
|
|
@subsection Using @code{getline} from a Pipe
|
|
|
|
@cindex input pipeline
|
|
@cindex pipeline, input
|
|
You can pipe the output of a command into @code{getline}, using
|
|
@samp{@var{command} | getline}. In
|
|
this case, the string @var{command} is run as a shell command and its output
|
|
is piped into @code{awk} to be used as input. This form of @code{getline}
|
|
reads one record at a time from the pipe.
|
|
|
|
For example, the following program copies its input to its output, except for
|
|
lines that begin with @samp{@@execute}, which are replaced by the output
|
|
produced by running the rest of the line as a shell command:
|
|
|
|
@example
|
|
@group
|
|
awk '@{
|
|
if ($1 == "@@execute") @{
|
|
tmp = substr($0, 10)
|
|
while ((tmp | getline) > 0)
|
|
print
|
|
close(tmp)
|
|
@} else
|
|
print
|
|
@}'
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
The @code{close} function is called to ensure that if two identical
|
|
@samp{@@execute} lines appear in the input, the command is run for
|
|
each one.
|
|
@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}.
|
|
@c Exercise!!
|
|
@c This example is unrealistic, since you could just use system
|
|
|
|
@c NEEDED
|
|
@page
|
|
Given the input:
|
|
|
|
@example
|
|
@group
|
|
foo
|
|
bar
|
|
baz
|
|
@@execute who
|
|
bletch
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
the program might produce:
|
|
|
|
@example
|
|
@group
|
|
foo
|
|
bar
|
|
baz
|
|
arnold ttyv0 Jul 13 14:22
|
|
miriam ttyp0 Jul 13 14:23 (murphy:0)
|
|
bill ttyp1 Jul 13 14:23 (murphy:0)
|
|
bletch
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
Notice that this program ran the command @code{who} and printed the result.
|
|
(If you try this program yourself, you will of course get different results,
|
|
showing you who is logged in on your system.)
|
|
|
|
This variation of @code{getline} splits the record into fields, sets the
|
|
value of @code{NF} and recomputes the value of @code{$0}. The values of
|
|
@code{NR} and @code{FNR} are not changed.
|
|
|
|
@c Thanks to Paul Eggert for initial wording here
|
|
According to POSIX, @samp{@var{expression} | getline} is ambiguous if
|
|
@var{expression} contains unparenthesized operators other than
|
|
@samp{$}; for example, @samp{"echo " "date" | getline} is ambiguous
|
|
because the concatenation operator is not parenthesized, and you should
|
|
write it as @samp{("echo " "date") | getline} if you want your program
|
|
to be portable to other @code{awk} implementations.
|
|
(It happens that @code{gawk} gets it right, but you should not
|
|
rely on this. Parentheses make it easier to read, anyway.)
|
|
|
|
@node Getline/Variable/Pipe, Getline Summary, Getline/Pipe, Getline
|
|
@subsection Using @code{getline} Into a Variable from a Pipe
|
|
|
|
When you use @samp{@var{command} | getline @var{var}}, the
|
|
output of the command @var{command} is sent through a pipe to
|
|
@code{getline} and into the variable @var{var}. For example, the
|
|
following program reads the current date and time into the variable
|
|
@code{current_time}, using the @code{date} utility, and then
|
|
prints it.
|
|
|
|
@example
|
|
@group
|
|
awk 'BEGIN @{
|
|
"date" | getline current_time
|
|
close("date")
|
|
print "Report printed on " current_time
|
|
@}'
|
|
@end group
|
|
@end example
|
|
|
|
In this version of @code{getline}, none of the built-in variables are
|
|
changed, and the record is not split into fields.
|
|
|
|
@ifinfo
|
|
@c Thanks to Paul Eggert for initial wording here
|
|
According to POSIX, @samp{@var{expression} | getline @var{var}} is ambiguous if
|
|
@var{expression} contains unparenthesized operators other than
|
|
@samp{$}; for example, @samp{"echo " "date" | getline @var{var}} is ambiguous
|
|
because the concatenation operator is not parenthesized, and you should
|
|
write it as @samp{("echo " "date") | getline @var{var}} if you want your
|
|
program to be portable to other @code{awk} implementations.
|
|
(It happens that @code{gawk} gets it right, but you should not
|
|
rely on this. Parentheses make it easier to read, anyway.)
|
|
@end ifinfo
|
|
|
|
@node Getline Summary, , Getline/Variable/Pipe, Getline
|
|
@subsection Summary of @code{getline} Variants
|
|
|
|
With all the forms of @code{getline}, even though @code{$0} and @code{NF},
|
|
may be updated, the record will not be tested against all the patterns
|
|
in the @code{awk} program, in the way that would happen if the record
|
|
were read normally by the main processing loop of @code{awk}. However
|
|
the new record is tested against any subsequent rules.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
@cindex limitations
|
|
@cindex implementation limits
|
|
Many @code{awk} implementations limit the number of pipelines an @code{awk}
|
|
program may have open to just one! In @code{gawk}, there is no such limit.
|
|
You can open as many pipelines as the underlying operating system will
|
|
permit.
|
|
|
|
@vindex FILENAME
|
|
@cindex dark corner
|
|
@cindex @code{getline}, setting @code{FILENAME}
|
|
@cindex @code{FILENAME}, being set by @code{getline}
|
|
An interesting side-effect occurs if you use @code{getline} (without a
|
|
redirection) inside a @code{BEGIN} rule. Since an unredirected @code{getline}
|
|
reads from the command line data files, the first @code{getline} command
|
|
causes @code{awk} to set the value of @code{FILENAME}. Normally,
|
|
@code{FILENAME} does not have a value inside @code{BEGIN} rules, since you
|
|
have not yet started to process the command line data files (d.c.).
|
|
(@xref{BEGIN/END, , The @code{BEGIN} and @code{END} Special Patterns},
|
|
also @pxref{Auto-set, , Built-in Variables that Convey Information}.)
|
|
|
|
The following table summarizes the six variants of @code{getline},
|
|
listing which built-in variables are set by each one.
|
|
|
|
@c @cartouche
|
|
@table @code
|
|
@item getline
|
|
sets @code{$0}, @code{NF}, @code{FNR}, and @code{NR}.
|
|
|
|
@item getline @var{var}
|
|
sets @var{var}, @code{FNR}, and @code{NR}.
|
|
|
|
@item getline < @var{file}
|
|
sets @code{$0}, and @code{NF}.
|
|
|
|
@item getline @var{var} < @var{file}
|
|
sets @var{var}.
|
|
|
|
@item @var{command} | getline
|
|
sets @code{$0}, and @code{NF}.
|
|
|
|
@item @var{command} | getline @var{var}
|
|
sets @var{var}.
|
|
@end table
|
|
@c @end cartouche
|
|
|
|
@node Printing, Expressions, Reading Files, Top
|
|
@chapter Printing Output
|
|
|
|
@cindex printing
|
|
@cindex output
|
|
One of the most common actions is to @dfn{print}, or output,
|
|
some or all of the input. You use the @code{print} statement
|
|
for simple output. You use the @code{printf} statement
|
|
for fancier formatting. Both are described in this chapter.
|
|
|
|
@menu
|
|
* Print:: The @code{print} statement.
|
|
* Print Examples:: Simple examples of @code{print} statements.
|
|
* Output Separators:: The output separators and how to change them.
|
|
* OFMT:: Controlling Numeric Output With @code{print}.
|
|
* Printf:: The @code{printf} statement.
|
|
* Redirection:: How to redirect output to multiple files and
|
|
pipes.
|
|
* Special Files:: File name interpretation in @code{gawk}.
|
|
@code{gawk} allows access to inherited file
|
|
descriptors.
|
|
* Close Files And Pipes:: Closing Input and Output Files and Pipes.
|
|
@end menu
|
|
|
|
@node Print, Print Examples, Printing, Printing
|
|
@section The @code{print} Statement
|
|
@cindex @code{print} statement
|
|
|
|
The @code{print} statement does output with simple, standardized
|
|
formatting. You specify only the strings or numbers to be printed, in a
|
|
list separated by commas. They are output, separated by single spaces,
|
|
followed by a newline. The statement looks like this:
|
|
|
|
@example
|
|
print @var{item1}, @var{item2}, @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
The entire list of items may optionally be enclosed in parentheses. The
|
|
parentheses are necessary if any of the item expressions uses the @samp{>}
|
|
relational operator; otherwise it could be confused with a redirection
|
|
(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}).
|
|
|
|
The items to be printed can be constant strings or numbers, fields of the
|
|
current record (such as @code{$1}), variables, or any @code{awk}
|
|
expressions.
|
|
Numeric values are converted to strings, and then printed.
|
|
|
|
The @code{print} statement is completely general for
|
|
computing @emph{what} values to print. However, with two exceptions,
|
|
you cannot specify @emph{how} to print them---how many
|
|
columns, whether to use exponential notation or not, and so on.
|
|
(For the exceptions, @pxref{Output Separators}, and
|
|
@ref{OFMT, ,Controlling Numeric Output with @code{print}}.)
|
|
For that, you need the @code{printf} statement
|
|
(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
|
|
|
|
The simple statement @samp{print} with no items is equivalent to
|
|
@samp{print $0}: it prints the entire current record. To print a blank
|
|
line, use @samp{print ""}, where @code{""} is the empty string.
|
|
|
|
To print a fixed piece of text, use a string constant such as
|
|
@w{@code{"Don't Panic"}} as one item. If you forget to use the
|
|
double-quote characters, your text will be taken as an @code{awk}
|
|
expression, and you will probably get an error. Keep in mind that a
|
|
space is printed between any two items.
|
|
|
|
Each @code{print} statement makes at least one line of output. But it
|
|
isn't limited to one line. If an item value is a string that contains a
|
|
newline, the newline is output along with the rest of the string. A
|
|
single @code{print} can make any number of lines this way.
|
|
|
|
@node Print Examples, Output Separators, Print, Printing
|
|
@section Examples of @code{print} Statements
|
|
|
|
Here is an example of printing a string that contains embedded newlines
|
|
(the @samp{\n} is an escape sequence, used to represent the newline
|
|
character; @pxref{Escape Sequences}):
|
|
|
|
@example
|
|
@group
|
|
$ awk 'BEGIN @{ print "line one\nline two\nline three" @}'
|
|
@print{} line one
|
|
@print{} line two
|
|
@print{} line three
|
|
@end group
|
|
@end example
|
|
|
|
Here is an example that prints the first two fields of each input record,
|
|
with a space between them:
|
|
|
|
@example
|
|
@group
|
|
$ awk '@{ print $1, $2 @}' inventory-shipped
|
|
@print{} Jan 13
|
|
@print{} Feb 15
|
|
@print{} Mar 15
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
@cindex common mistakes
|
|
@cindex mistakes, common
|
|
@cindex errors, common
|
|
A common mistake in using the @code{print} statement is to omit the comma
|
|
between two items. This often has the effect of making the items run
|
|
together in the output, with no space. The reason for this is that
|
|
juxtaposing two string expressions in @code{awk} means to concatenate
|
|
them. Here is the same program, without the comma:
|
|
|
|
@example
|
|
@group
|
|
$ awk '@{ print $1 $2 @}' inventory-shipped
|
|
@print{} Jan13
|
|
@print{} Feb15
|
|
@print{} Mar15
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
To someone unfamiliar with the file @file{inventory-shipped}, neither
|
|
example's output makes much sense. A heading line at the beginning
|
|
would make it clearer. Let's add some headings to our table of months
|
|
(@code{$1}) and green crates shipped (@code{$2}). We do this using the
|
|
@code{BEGIN} pattern
|
|
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
|
|
to force the headings to be printed only once:
|
|
|
|
@example
|
|
awk 'BEGIN @{ print "Month Crates"
|
|
print "----- ------" @}
|
|
@{ print $1, $2 @}' inventory-shipped
|
|
@end example
|
|
|
|
@noindent
|
|
Did you already guess what happens? When run, the program prints
|
|
the following:
|
|
|
|
@example
|
|
@group
|
|
Month Crates
|
|
----- ------
|
|
Jan 13
|
|
Feb 15
|
|
Mar 15
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
The headings and the table data don't line up! We can fix this by printing
|
|
some spaces between the two fields:
|
|
|
|
@example
|
|
awk 'BEGIN @{ print "Month Crates"
|
|
print "----- ------" @}
|
|
@{ print $1, " ", $2 @}' inventory-shipped
|
|
@end example
|
|
|
|
You can imagine that this way of lining up columns can get pretty
|
|
complicated when you have many columns to fix. Counting spaces for two
|
|
or three columns can be simple, but more than this and you can get
|
|
lost quite easily. This is why the @code{printf} statement was
|
|
created (@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing});
|
|
one of its specialties is lining up columns of data.
|
|
|
|
@cindex line continuation
|
|
As a side point,
|
|
you can continue either a @code{print} or @code{printf} statement simply
|
|
by putting a newline after any comma
|
|
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
|
|
|
|
@node Output Separators, OFMT, Print Examples, Printing
|
|
@section Output Separators
|
|
|
|
@cindex output field separator, @code{OFS}
|
|
@cindex output record separator, @code{ORS}
|
|
@vindex OFS
|
|
@vindex ORS
|
|
As mentioned previously, a @code{print} statement contains a list
|
|
of items, separated by commas. In the output, the items are normally
|
|
separated by single spaces. This need not be the case; a
|
|
single space is only the default. You can specify any string of
|
|
characters to use as the @dfn{output field separator} by setting the
|
|
built-in variable @code{OFS}. The initial value of this variable
|
|
is the string @w{@code{" "}}, that is, a single space.
|
|
|
|
The output from an entire @code{print} statement is called an
|
|
@dfn{output record}. Each @code{print} statement outputs one output
|
|
record and then outputs a string called the @dfn{output record separator}.
|
|
The built-in variable @code{ORS} specifies this string. The initial
|
|
value of @code{ORS} is the string @code{"\n"}, i.e.@: a newline
|
|
character; thus, normally each @code{print} statement makes a separate line.
|
|
|
|
You can change how output fields and records are separated by assigning
|
|
new values to the variables @code{OFS} and/or @code{ORS}. The usual
|
|
place to do this is in the @code{BEGIN} rule
|
|
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}), so
|
|
that it happens before any input is processed. You may also do this
|
|
with assignments on the command line, before the names of your input
|
|
files, or using the @samp{-v} command line option
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
@ignore
|
|
Exercise,
|
|
Rewrite the
|
|
@example
|
|
awk 'BEGIN @{ print "Month Crates"
|
|
print "----- ------" @}
|
|
@{ print $1, " ", $2 @}' inventory-shipped
|
|
@end example
|
|
program by using a new value of @code{OFS}.
|
|
@end ignore
|
|
|
|
The following example prints the first and second fields of each input
|
|
record separated by a semicolon, with a blank line added after each
|
|
line:
|
|
|
|
@example
|
|
@group
|
|
$ awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @}
|
|
> @{ print $1, $2 @}' BBS-list
|
|
@print{} aardvark;555-5553
|
|
@print{}
|
|
@print{} alpo-net;555-3412
|
|
@print{}
|
|
@print{} barfly;555-7685
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
If the value of @code{ORS} does not contain a newline, all your output
|
|
will be run together on a single line, unless you output newlines some
|
|
other way.
|
|
|
|
@node OFMT, Printf, Output Separators, Printing
|
|
@section Controlling Numeric Output with @code{print}
|
|
@vindex OFMT
|
|
@cindex numeric output format
|
|
@cindex format, numeric output
|
|
@cindex output format specifier, @code{OFMT}
|
|
When you use the @code{print} statement to print numeric values,
|
|
@code{awk} internally converts the number to a string of characters,
|
|
and prints that string. @code{awk} uses the @code{sprintf} function
|
|
to do this conversion
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
For now, it suffices to say that the @code{sprintf}
|
|
function accepts a @dfn{format specification} that tells it how to format
|
|
numbers (or strings), and that there are a number of different ways in which
|
|
numbers can be formatted. The different format specifications are discussed
|
|
more fully in
|
|
@ref{Control Letters, , Format-Control Letters}.
|
|
|
|
The built-in variable @code{OFMT} contains the default format specification
|
|
that @code{print} uses with @code{sprintf} when it wants to convert a
|
|
number to a string for printing.
|
|
The default value of @code{OFMT} is @code{"%.6g"}.
|
|
By supplying different format specifications
|
|
as the value of @code{OFMT}, you can change how @code{print} will print
|
|
your numbers. As a brief example:
|
|
|
|
@example
|
|
@group
|
|
$ awk 'BEGIN @{
|
|
> OFMT = "%.0f" # print numbers as integers (rounds)
|
|
> print 17.23 @}'
|
|
@print{} 17
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex dark corner
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
According to the POSIX standard, @code{awk}'s behavior will be undefined
|
|
if @code{OFMT} contains anything but a floating point conversion specification
|
|
(d.c.).
|
|
|
|
@node Printf, Redirection, OFMT, Printing
|
|
@section Using @code{printf} Statements for Fancier Printing
|
|
@cindex formatted output
|
|
@cindex output, formatted
|
|
|
|
If you want more precise control over the output format than
|
|
@code{print} gives you, use @code{printf}. With @code{printf} you can
|
|
specify the width to use for each item, and you can specify various
|
|
formatting choices for numbers (such as what radix to use, whether to
|
|
print an exponent, whether to print a sign, and how many digits to print
|
|
after the decimal point). You do this by supplying a string, called
|
|
the @dfn{format string}, which controls how and where to print the other
|
|
arguments.
|
|
|
|
@menu
|
|
* Basic Printf:: Syntax of the @code{printf} statement.
|
|
* Control Letters:: Format-control letters.
|
|
* Format Modifiers:: Format-specification modifiers.
|
|
* Printf Examples:: Several examples.
|
|
@end menu
|
|
|
|
@node Basic Printf, Control Letters, Printf, Printf
|
|
@subsection Introduction to the @code{printf} Statement
|
|
|
|
@cindex @code{printf} statement, syntax of
|
|
The @code{printf} statement looks like this:
|
|
|
|
@example
|
|
printf @var{format}, @var{item1}, @var{item2}, @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
The entire list of arguments may optionally be enclosed in parentheses. The
|
|
parentheses are necessary if any of the item expressions use the @samp{>}
|
|
relational operator; otherwise it could be confused with a redirection
|
|
(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}).
|
|
|
|
@cindex format string
|
|
The difference between @code{printf} and @code{print} is the @var{format}
|
|
argument. This is an expression whose value is taken as a string; it
|
|
specifies how to output each of the other arguments. It is called
|
|
the @dfn{format string}.
|
|
|
|
The format string is very similar to that in the ANSI C library function
|
|
@code{printf}. Most of @var{format} is text to be output verbatim.
|
|
Scattered among this text are @dfn{format specifiers}, one per item.
|
|
Each format specifier says to output the next item in the argument list
|
|
at that place in the format.
|
|
|
|
The @code{printf} statement does not automatically append a newline to its
|
|
output. It outputs only what the format string specifies. So if you want
|
|
a newline, you must include one in the format string. The output separator
|
|
variables @code{OFS} and @code{ORS} have no effect on @code{printf}
|
|
statements. For example:
|
|
|
|
@example
|
|
@group
|
|
BEGIN @{
|
|
ORS = "\nOUCH!\n"; OFS = "!"
|
|
msg = "Don't Panic!"; printf "%s\n", msg
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
This program still prints the familiar @samp{Don't Panic!} message.
|
|
|
|
@node Control Letters, Format Modifiers, Basic Printf, Printf
|
|
@subsection Format-Control Letters
|
|
@cindex @code{printf}, format-control characters
|
|
@cindex format specifier
|
|
|
|
A format specifier starts with the character @samp{%} and ends with a
|
|
@dfn{format-control letter}; it tells the @code{printf} statement how
|
|
to output one item. (If you actually want to output a @samp{%}, write
|
|
@samp{%%}.) The format-control letter specifies what kind of value to
|
|
print. The rest of the format specifier is made up of optional
|
|
@dfn{modifiers} which are parameters to use, such as the field width.
|
|
|
|
Here is a list of the format-control letters:
|
|
|
|
@table @code
|
|
@item c
|
|
This prints a number as an ASCII character. Thus, @samp{printf "%c",
|
|
65} outputs the letter @samp{A}. The output for a string value is
|
|
the first character of the string.
|
|
|
|
@item d
|
|
@itemx i
|
|
These are equivalent. They both print a decimal integer.
|
|
The @samp{%i} specification is for compatibility with ANSI C.
|
|
|
|
@item e
|
|
@itemx E
|
|
This prints a number in scientific (exponential) notation.
|
|
For example,
|
|
|
|
@example
|
|
printf "%4.3e\n", 1950
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{1.950e+03}, with a total of four significant figures of
|
|
which three follow the decimal point. The @samp{4.3} are modifiers,
|
|
discussed below. @samp{%E} uses @samp{E} instead of @samp{e} in the output.
|
|
|
|
@item f
|
|
This prints a number in floating point notation.
|
|
For example,
|
|
|
|
@example
|
|
printf "%4.3f", 1950
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{1950.000}, with a total of four significant figures of
|
|
which three follow the decimal point. The @samp{4.3} are modifiers,
|
|
discussed below.
|
|
|
|
@item g
|
|
@itemx G
|
|
This prints a number in either scientific notation or floating point
|
|
notation, whichever uses fewer characters. If the result is printed in
|
|
scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}.
|
|
|
|
@item o
|
|
This prints an unsigned octal integer.
|
|
(In octal, or base-eight notation, the digits run from @samp{0} to @samp{7};
|
|
the decimal number eight is represented as @samp{10} in octal.)
|
|
|
|
@item s
|
|
This prints a string.
|
|
|
|
@item x
|
|
@itemx X
|
|
This prints an unsigned hexadecimal integer.
|
|
(In hexadecimal, or base-16 notation, the digits are @samp{0} through @samp{9}
|
|
and @samp{a} through @samp{f}. The hexadecimal digit @samp{f} represents
|
|
the decimal number 15.) @samp{%X} uses the letters @samp{A} through @samp{F}
|
|
instead of @samp{a} through @samp{f}.
|
|
|
|
@item %
|
|
This isn't really a format-control letter, but it does have a meaning
|
|
when used after a @samp{%}: the sequence @samp{%%} outputs one
|
|
@samp{%}. It does not consume an argument, and it ignores any
|
|
modifiers.
|
|
@end table
|
|
|
|
@cindex dark corner
|
|
When using the integer format-control letters for values that are outside
|
|
the range of a C @code{long} integer, @code{gawk} will switch to the
|
|
@samp{%g} format specifier. Other versions of @code{awk} may print
|
|
invalid values, or do something else entirely (d.c.).
|
|
|
|
@node Format Modifiers, Printf Examples, Control Letters, Printf
|
|
@subsection Modifiers for @code{printf} Formats
|
|
|
|
@cindex @code{printf}, modifiers
|
|
@cindex modifiers (in format specifiers)
|
|
A format specification can also include @dfn{modifiers} that can control
|
|
how much of the item's value is printed and how much space it gets. The
|
|
modifiers come between the @samp{%} and the format-control letter.
|
|
In the examples below, we use the bullet symbol ``@bullet{}'' to represent
|
|
spaces in the output. Here are the possible modifiers, in the order in
|
|
which they may appear:
|
|
|
|
@table @code
|
|
@item -
|
|
The minus sign, used before the width modifier (see below),
|
|
says to left-justify
|
|
the argument within its specified width. Normally the argument
|
|
is printed right-justified in the specified width. Thus,
|
|
|
|
@example
|
|
printf "%-4s", "foo"
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{foo@bullet{}}.
|
|
|
|
@item @var{space}
|
|
For numeric conversions, prefix positive values with a space, and
|
|
negative values with a minus sign.
|
|
|
|
@item +
|
|
The plus sign, used before the width modifier (see below),
|
|
says to always supply a sign for numeric conversions, even if the data
|
|
to be formatted is positive. The @samp{+} overrides the space modifier.
|
|
|
|
@item #
|
|
Use an ``alternate form'' for certain control letters.
|
|
For @samp{%o}, supply a leading zero.
|
|
For @samp{%x}, and @samp{%X}, supply a leading @samp{0x} or @samp{0X} for
|
|
a non-zero result.
|
|
For @samp{%e}, @samp{%E}, and @samp{%f}, the result will always contain a
|
|
decimal point.
|
|
For @samp{%g}, and @samp{%G}, trailing zeros are not removed from the result.
|
|
|
|
@cindex dark corner
|
|
@item 0
|
|
A leading @samp{0} (zero) acts as a flag, that indicates output should be
|
|
padded with zeros instead of spaces.
|
|
This applies even to non-numeric output formats (d.c.).
|
|
This flag only has an effect when the field width is wider than the
|
|
value to be printed.
|
|
|
|
@item @var{width}
|
|
This is a number specifying the desired minimum width of a field. Inserting any
|
|
number between the @samp{%} sign and the format control character forces the
|
|
field to be expanded to this width. The default way to do this is to
|
|
pad with spaces on the left. For example,
|
|
|
|
@example
|
|
printf "%4s", "foo"
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{@bullet{}foo}.
|
|
|
|
The value of @var{width} is a minimum width, not a maximum. If the item
|
|
value requires more than @var{width} characters, it can be as wide as
|
|
necessary. Thus,
|
|
|
|
@example
|
|
printf "%4s", "foobar"
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{foobar}.
|
|
|
|
Preceding the @var{width} with a minus sign causes the output to be
|
|
padded with spaces on the right, instead of on the left.
|
|
|
|
@item .@var{prec}
|
|
This is a number that specifies the precision to use when printing.
|
|
For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the
|
|
number of digits you want printed to the right of the decimal point.
|
|
For the @samp{g}, and @samp{G} formats, it specifies the maximum number
|
|
of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u},
|
|
@samp{x}, and @samp{X} formats, it specifies the minimum number of
|
|
digits to print. For a string, it specifies the maximum number of
|
|
characters from the string that should be printed. Thus,
|
|
|
|
@example
|
|
printf "%.4s", "foobar"
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{foob}.
|
|
@end table
|
|
|
|
The C library @code{printf}'s dynamic @var{width} and @var{prec}
|
|
capability (for example, @code{"%*.*s"}) is supported. Instead of
|
|
supplying explicit @var{width} and/or @var{prec} values in the format
|
|
string, you pass them in the argument list. For example:
|
|
|
|
@example
|
|
w = 5
|
|
p = 3
|
|
s = "abcdefg"
|
|
printf "%*.*s\n", w, p, s
|
|
@end example
|
|
|
|
@noindent
|
|
is exactly equivalent to
|
|
|
|
@example
|
|
s = "abcdefg"
|
|
printf "%5.3s\n", s
|
|
@end example
|
|
|
|
@noindent
|
|
Both programs output @samp{@w{@bullet{}@bullet{}abc}}.
|
|
|
|
Earlier versions of @code{awk} did not support this capability.
|
|
If you must use such a version, you may simulate this feature by using
|
|
concatenation to build up the format string, like so:
|
|
|
|
@example
|
|
w = 5
|
|
p = 3
|
|
s = "abcdefg"
|
|
printf "%" w "." p "s\n", s
|
|
@end example
|
|
|
|
@noindent
|
|
This is not particularly easy to read, but it does work.
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
C programmers may be used to supplying additional @samp{l} and @samp{h}
|
|
flags in @code{printf} format strings. These are not valid in @code{awk}.
|
|
Most @code{awk} implementations silently ignore these flags.
|
|
If @samp{--lint} is provided on the command line
|
|
(@pxref{Options, ,Command Line Options}),
|
|
@code{gawk} will warn about their use. If @samp{--posix} is supplied,
|
|
their use is a fatal error.
|
|
|
|
@node Printf Examples, , Format Modifiers, Printf
|
|
@subsection Examples Using @code{printf}
|
|
|
|
Here is how to use @code{printf} to make an aligned table:
|
|
|
|
@example
|
|
awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
prints the names of bulletin boards (@code{$1}) of the file
|
|
@file{BBS-list} as a string of 10 characters, left justified. It also
|
|
prints the phone numbers (@code{$2}) afterward on the line. This
|
|
produces an aligned two-column table of names and phone numbers:
|
|
|
|
@example
|
|
@group
|
|
$ awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
|
|
@print{} aardvark 555-5553
|
|
@print{} alpo-net 555-3412
|
|
@print{} barfly 555-7685
|
|
@print{} bites 555-1675
|
|
@print{} camelot 555-0542
|
|
@print{} core 555-2912
|
|
@print{} fooey 555-1234
|
|
@print{} foot 555-6699
|
|
@print{} macfoo 555-6480
|
|
@print{} sdace 555-3430
|
|
@print{} sabafoo 555-2127
|
|
@end group
|
|
@end example
|
|
|
|
Did you notice that we did not specify that the phone numbers be printed
|
|
as numbers? They had to be printed as strings because the numbers are
|
|
separated by a dash.
|
|
If we had tried to print the phone numbers as numbers, all we would have
|
|
gotten would have been the first three digits, @samp{555}.
|
|
This would have been pretty confusing.
|
|
|
|
We did not specify a width for the phone numbers because they are the
|
|
last things on their lines. We don't need to put spaces after them.
|
|
|
|
We could make our table look even nicer by adding headings to the tops
|
|
of the columns. To do this, we use the @code{BEGIN} pattern
|
|
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
|
|
to force the header to be printed only once, at the beginning of
|
|
the @code{awk} program:
|
|
|
|
@example
|
|
@group
|
|
awk 'BEGIN @{ print "Name Number"
|
|
print "---- ------" @}
|
|
@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
|
|
@end group
|
|
@end example
|
|
|
|
Did you notice that we mixed @code{print} and @code{printf} statements in
|
|
the above example? We could have used just @code{printf} statements to get
|
|
the same results:
|
|
|
|
@example
|
|
@group
|
|
awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number"
|
|
printf "%-10s %s\n", "----", "------" @}
|
|
@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
By printing each column heading with the same format specification
|
|
used for the elements of the column, we have made sure that the headings
|
|
are aligned just like the columns.
|
|
|
|
The fact that the same format specification is used three times can be
|
|
emphasized by storing it in a variable, like this:
|
|
|
|
@example
|
|
@group
|
|
awk 'BEGIN @{ format = "%-10s %s\n"
|
|
printf format, "Name", "Number"
|
|
printf format, "----", "------" @}
|
|
@{ printf format, $1, $2 @}' BBS-list
|
|
@end group
|
|
@end example
|
|
|
|
@c !!! exercise
|
|
See if you can use the @code{printf} statement to line up the headings and
|
|
table data for our @file{inventory-shipped} example covered earlier in the
|
|
section on the @code{print} statement
|
|
(@pxref{Print, ,The @code{print} Statement}).
|
|
|
|
@node Redirection, Special Files, Printf, Printing
|
|
@section Redirecting Output of @code{print} and @code{printf}
|
|
|
|
@cindex output redirection
|
|
@cindex redirection of output
|
|
So far we have been dealing only with output that prints to the standard
|
|
output, usually your terminal. Both @code{print} and @code{printf} can
|
|
also send their output to other places.
|
|
This is called @dfn{redirection}.
|
|
|
|
A redirection appears after the @code{print} or @code{printf} statement.
|
|
Redirections in @code{awk} are written just like redirections in shell
|
|
commands, except that they are written inside the @code{awk} program.
|
|
|
|
There are three forms of output redirection: output to a file,
|
|
output appended to a file, and output through a pipe to another
|
|
command.
|
|
They are all shown for
|
|
the @code{print} statement, but they work identically for @code{printf}
|
|
also.
|
|
|
|
@table @code
|
|
@item print @var{items} > @var{output-file}
|
|
This type of redirection prints the items into the output file
|
|
@var{output-file}. The file name @var{output-file} can be any
|
|
expression. Its value is changed to a string and then used as a
|
|
file name (@pxref{Expressions}).
|
|
|
|
When this type of redirection is used, the @var{output-file} is erased
|
|
before the first output is written to it. Subsequent writes
|
|
to the same @var{output-file} do not
|
|
erase @var{output-file}, but append to it. If @var{output-file} does
|
|
not exist, then it is created.
|
|
|
|
For example, here is how an @code{awk} program can write a list of
|
|
BBS names to a file @file{name-list} and a list of phone numbers to a
|
|
file @file{phone-list}. Each output file contains one name or number
|
|
per line.
|
|
|
|
@example
|
|
@group
|
|
$ awk '@{ print $2 > "phone-list"
|
|
> print $1 > "name-list" @}' BBS-list
|
|
@end group
|
|
@group
|
|
$ cat phone-list
|
|
@print{} 555-5553
|
|
@print{} 555-3412
|
|
@dots{}
|
|
@end group
|
|
@group
|
|
$ cat name-list
|
|
@print{} aardvark
|
|
@print{} alpo-net
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
@item print @var{items} >> @var{output-file}
|
|
This type of redirection prints the items into the pre-existing output file
|
|
@var{output-file}. The difference between this and the
|
|
single-@samp{>} redirection is that the old contents (if any) of
|
|
@var{output-file} are not erased. Instead, the @code{awk} output is
|
|
appended to the file.
|
|
If @var{output-file} does not exist, then it is created.
|
|
|
|
@cindex pipes for output
|
|
@cindex output, piping
|
|
@item print @var{items} | @var{command}
|
|
It is also possible to send output to another program through a pipe
|
|
instead of into a
|
|
file. This type of redirection opens a pipe to @var{command} and writes
|
|
the values of @var{items} through this pipe, to another process created
|
|
to execute @var{command}.
|
|
|
|
The redirection argument @var{command} is actually an @code{awk}
|
|
expression. Its value is converted to a string, whose contents give the
|
|
shell command to be run.
|
|
|
|
For example, this produces two files, one unsorted list of BBS names
|
|
and one list sorted in reverse alphabetical order:
|
|
|
|
@example
|
|
awk '@{ print $1 > "names.unsorted"
|
|
command = "sort -r > names.sorted"
|
|
print $1 | command @}' BBS-list
|
|
@end example
|
|
|
|
Here the unsorted list is written with an ordinary redirection while
|
|
the sorted list is written by piping through the @code{sort} utility.
|
|
|
|
This example uses redirection to mail a message to a mailing
|
|
list @samp{bug-system}. This might be useful when trouble is encountered
|
|
in an @code{awk} script run periodically for system maintenance.
|
|
|
|
@example
|
|
report = "mail bug-system"
|
|
print "Awk script failed:", $0 | report
|
|
m = ("at record number " FNR " of " FILENAME)
|
|
print m | report
|
|
close(report)
|
|
@end example
|
|
|
|
The message is built using string concatenation and saved in the variable
|
|
@code{m}. It is then sent down the pipeline to the @code{mail} program.
|
|
|
|
We call the @code{close} function here because it's a good idea to close
|
|
the pipe as soon as all the intended output has been sent to it.
|
|
@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},
|
|
for more information
|
|
on this. This example also illustrates the use of a variable to represent
|
|
a @var{file} or @var{command}: it is not necessary to always
|
|
use a string constant. Using a variable is generally a good idea,
|
|
since @code{awk} requires you to spell the string value identically
|
|
every time.
|
|
@end table
|
|
|
|
Redirecting output using @samp{>}, @samp{>>}, or @samp{|} asks the system
|
|
to open a file or pipe only if the particular @var{file} or @var{command}
|
|
you've specified has not already been written to by your program, or if
|
|
it has been closed since it was last written to.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
@cindex limitations
|
|
@cindex implementation limits
|
|
@iftex
|
|
As mentioned earlier
|
|
(@pxref{Getline Summary, , Summary of @code{getline} Variants}),
|
|
many
|
|
@end iftex
|
|
@ifinfo
|
|
Many
|
|
@end ifinfo
|
|
@code{awk} implementations limit the number of pipelines an @code{awk}
|
|
program may have open to just one! In @code{gawk}, there is no such limit.
|
|
You can open as many pipelines as the underlying operating system will
|
|
permit.
|
|
|
|
@node Special Files, Close Files And Pipes , Redirection, Printing
|
|
@section Special File Names in @code{gawk}
|
|
@cindex standard input
|
|
@cindex standard output
|
|
@cindex standard error output
|
|
@cindex file descriptors
|
|
|
|
Running programs conventionally have three input and output streams
|
|
already available to them for reading and writing. These are known as
|
|
the @dfn{standard input}, @dfn{standard output}, and @dfn{standard error
|
|
output}. These streams are, by default, connected to your terminal, but
|
|
they are often redirected with the shell, via the @samp{<}, @samp{<<},
|
|
@samp{>}, @samp{>>}, @samp{>&} and @samp{|} operators. Standard error
|
|
is typically used for writing error messages; the reason we have two separate
|
|
streams, standard output and standard error, is so that they can be
|
|
redirected separately.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
In other implementations of @code{awk}, the only way to write an error
|
|
message to standard error in an @code{awk} program is as follows:
|
|
|
|
@example
|
|
print "Serious error detected!" | "cat 1>&2"
|
|
@end example
|
|
|
|
@noindent
|
|
This works by opening a pipeline to a shell command which can access the
|
|
standard error stream which it inherits from the @code{awk} process.
|
|
This is far from elegant, and is also inefficient, since it requires a
|
|
separate process. So people writing @code{awk} programs often
|
|
neglect to do this. Instead, they send the error messages to the
|
|
terminal, like this:
|
|
|
|
@example
|
|
@group
|
|
print "Serious error detected!" > "/dev/tty"
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
This usually has the same effect, but not always: although the
|
|
standard error stream is usually the terminal, it can be redirected, and
|
|
when that happens, writing to the terminal is not correct. In fact, if
|
|
@code{awk} is run from a background job, it may not have a terminal at all.
|
|
Then opening @file{/dev/tty} will fail.
|
|
|
|
@code{gawk} provides special file names for accessing the three standard
|
|
streams. When you redirect input or output in @code{gawk}, if the file name
|
|
matches one of these special names, then @code{gawk} directly uses the
|
|
stream it stands for.
|
|
|
|
@cindex @file{/dev/stdin}
|
|
@cindex @file{/dev/stdout}
|
|
@cindex @file{/dev/stderr}
|
|
@cindex @file{/dev/fd}
|
|
@c @cartouche
|
|
@table @file
|
|
@item /dev/stdin
|
|
The standard input (file descriptor 0).
|
|
|
|
@item /dev/stdout
|
|
The standard output (file descriptor 1).
|
|
|
|
@item /dev/stderr
|
|
The standard error output (file descriptor 2).
|
|
|
|
@item /dev/fd/@var{N}
|
|
The file associated with file descriptor @var{N}. Such a file must have
|
|
been opened by the program initiating the @code{awk} execution (typically
|
|
the shell). Unless you take special pains in the shell from which
|
|
you invoke @code{gawk}, only descriptors 0, 1 and 2 are available.
|
|
@end table
|
|
@c @end cartouche
|
|
|
|
The file names @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr}
|
|
are aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and @file{/dev/fd/2},
|
|
respectively, but they are more self-explanatory.
|
|
|
|
The proper way to write an error message in a @code{gawk} program
|
|
is to use @file{/dev/stderr}, like this:
|
|
|
|
@example
|
|
print "Serious error detected!" > "/dev/stderr"
|
|
@end example
|
|
|
|
@code{gawk} also provides special file names that give access to information
|
|
about the running @code{gawk} process. Each of these ``files'' provides
|
|
a single record of information. To read them more than once, you must
|
|
first close them with the @code{close} function
|
|
(@pxref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}).
|
|
The filenames are:
|
|
|
|
@cindex process information
|
|
@cindex @file{/dev/pid}
|
|
@cindex @file{/dev/pgrpid}
|
|
@cindex @file{/dev/ppid}
|
|
@cindex @file{/dev/user}
|
|
@c @cartouche
|
|
@table @file
|
|
@item /dev/pid
|
|
Reading this file returns the process ID of the current process,
|
|
in decimal, terminated with a newline.
|
|
|
|
@item /dev/ppid
|
|
Reading this file returns the parent process ID of the current process,
|
|
in decimal, terminated with a newline.
|
|
|
|
@item /dev/pgrpid
|
|
Reading this file returns the process group ID of the current process,
|
|
in decimal, terminated with a newline.
|
|
|
|
@item /dev/user
|
|
Reading this file returns a single record terminated with a newline.
|
|
The fields are separated with spaces. The fields represent the
|
|
following information:
|
|
|
|
@table @code
|
|
@item $1
|
|
The return value of the @code{getuid} system call
|
|
(the real user ID number).
|
|
|
|
@item $2
|
|
The return value of the @code{geteuid} system call
|
|
(the effective user ID number).
|
|
|
|
@item $3
|
|
The return value of the @code{getgid} system call
|
|
(the real group ID number).
|
|
|
|
@item $4
|
|
The return value of the @code{getegid} system call
|
|
(the effective group ID number).
|
|
@end table
|
|
|
|
If there are any additional fields, they are the group IDs returned by
|
|
@code{getgroups} system call.
|
|
(Multiple groups may not be supported on all systems.)
|
|
@end table
|
|
@c @end cartouche
|
|
|
|
These special file names may be used on the command line as data
|
|
files, as well as for I/O redirections within an @code{awk} program.
|
|
They may not be used as source files with the @samp{-f} option.
|
|
|
|
Recognition of these special file names is disabled if @code{gawk} is in
|
|
compatibility mode (@pxref{Options, ,Command Line Options}).
|
|
|
|
@strong{Caution}: Unless your system actually has a @file{/dev/fd} directory
|
|
(or any of the other above listed special files),
|
|
the interpretation of these file names is done by @code{gawk} itself.
|
|
For example, using @samp{/dev/fd/4} for output will actually write on
|
|
file descriptor 4, and not on a new file descriptor that was @code{dup}'ed
|
|
from file descriptor 4. Most of the time this does not matter; however, it
|
|
is important to @emph{not} close any of the files related to file descriptors
|
|
0, 1, and 2. If you do close one of these files, unpredictable behavior
|
|
will result.
|
|
|
|
The special files that provide process-related information may disappear
|
|
in a future version of @code{gawk}.
|
|
@xref{Future Extensions, ,Probable Future Extensions}.
|
|
|
|
@node Close Files And Pipes, , Special Files, Printing
|
|
@section Closing Input and Output Files and Pipes
|
|
@cindex closing input files and pipes
|
|
@cindex closing output files and pipes
|
|
@findex close
|
|
|
|
If the same file name or the same shell command is used with
|
|
@code{getline}
|
|
(@pxref{Getline, ,Explicit Input with @code{getline}})
|
|
more than once during the execution of an @code{awk}
|
|
program, the file is opened (or the command is executed) only the first time.
|
|
At that time, the first record of input is read from that file or command.
|
|
The next time the same file or command is used in @code{getline}, another
|
|
record is read from it, and so on.
|
|
|
|
Similarly, when a file or pipe is opened for output, the file name or command
|
|
associated with
|
|
it is remembered by @code{awk} and subsequent writes to the same file or
|
|
command are appended to the previous writes. The file or pipe stays
|
|
open until @code{awk} exits.
|
|
|
|
This implies that if you want to start reading the same file again from
|
|
the beginning, or if you want to rerun a shell command (rather than
|
|
reading more output from the command), you must take special steps.
|
|
What you must do is use the @code{close} function, as follows:
|
|
|
|
@example
|
|
close(@var{filename})
|
|
@end example
|
|
|
|
@noindent
|
|
or
|
|
|
|
@example
|
|
close(@var{command})
|
|
@end example
|
|
|
|
The argument @var{filename} or @var{command} can be any expression. Its
|
|
value must @emph{exactly} match the string that was used to open the file or
|
|
start the command (spaces and other ``irrelevant'' characters
|
|
included). For example, if you open a pipe with this:
|
|
|
|
@example
|
|
"sort -r names" | getline foo
|
|
@end example
|
|
|
|
@noindent
|
|
then you must close it with this:
|
|
|
|
@example
|
|
close("sort -r names")
|
|
@end example
|
|
|
|
Once this function call is executed, the next @code{getline} from that
|
|
file or command, or the next @code{print} or @code{printf} to that
|
|
file or command, will reopen the file or rerun the command.
|
|
|
|
Because the expression that you use to close a file or pipeline must
|
|
exactly match the expression used to open the file or run the command,
|
|
it is good practice to use a variable to store the file name or command.
|
|
The previous example would become
|
|
|
|
@example
|
|
sortcom = "sort -r names"
|
|
sortcom | getline foo
|
|
@dots{}
|
|
close(sortcom)
|
|
@end example
|
|
|
|
@noindent
|
|
This helps avoid hard-to-find typographical errors in your @code{awk}
|
|
programs.
|
|
|
|
Here are some reasons why you might need to close an output file:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
To write a file and read it back later on in the same @code{awk}
|
|
program. Close the file when you are finished writing it; then
|
|
you can start reading it with @code{getline}.
|
|
|
|
@item
|
|
To write numerous files, successively, in the same @code{awk}
|
|
program. If you don't close the files, eventually you may exceed a
|
|
system limit on the number of open files in one process. So close
|
|
each one when you are finished writing it.
|
|
|
|
@item
|
|
To make a command finish. When you redirect output through a pipe,
|
|
the command reading the pipe normally continues to try to read input
|
|
as long as the pipe is open. Often this means the command cannot
|
|
really do its work until the pipe is closed. For example, if you
|
|
redirect output to the @code{mail} program, the message is not
|
|
actually sent until the pipe is closed.
|
|
|
|
@item
|
|
To run the same program a second time, with the same arguments.
|
|
This is not the same thing as giving more input to the first run!
|
|
|
|
For example, suppose you pipe output to the @code{mail} program. If you
|
|
output several lines redirected to this pipe without closing it, they make
|
|
a single message of several lines. By contrast, if you close the pipe
|
|
after each line of output, then each line makes a separate message.
|
|
@end itemize
|
|
|
|
@vindex ERRNO
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
@code{close} returns a value of zero if the close succeeded.
|
|
Otherwise, the value will be non-zero.
|
|
In this case, @code{gawk} sets the variable @code{ERRNO} to a string
|
|
describing the error that occurred.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
@cindex portability issues
|
|
If you use more files than the system allows you to have open,
|
|
@code{gawk} will attempt to multiplex the available open files among
|
|
your data files. @code{gawk}'s ability to do this depends upon the
|
|
facilities of your operating system: it may not always work. It is
|
|
therefore both good practice and good portability advice to always
|
|
use @code{close} on your files when you are done with them.
|
|
|
|
@node Expressions, Patterns and Actions, Printing, Top
|
|
@chapter Expressions
|
|
@cindex expression
|
|
|
|
Expressions are the basic building blocks of @code{awk} patterns
|
|
and actions. An expression evaluates to a value, which you can print, test,
|
|
store in a variable or pass to a function. Additionally, an expression
|
|
can assign a new value to a variable or a field, with an assignment operator.
|
|
|
|
An expression can serve as a pattern or action statement on its own.
|
|
Most other kinds of
|
|
statements contain one or more expressions which specify data on which to
|
|
operate. As in other languages, expressions in @code{awk} include
|
|
variables, array references, constants, and function calls, as well as
|
|
combinations of these with various operators.
|
|
|
|
@menu
|
|
* Constants:: String, numeric, and regexp constants.
|
|
* Using Constant Regexps:: When and how to use a regexp constant.
|
|
* Variables:: Variables give names to values for later use.
|
|
* Conversion:: The conversion of strings to numbers and vice
|
|
versa.
|
|
* Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-},
|
|
etc.)
|
|
* Concatenation:: Concatenating strings.
|
|
* Assignment Ops:: Changing the value of a variable or a field.
|
|
* Increment Ops:: Incrementing the numeric value of a variable.
|
|
* Truth Values:: What is ``true'' and what is ``false''.
|
|
* Typing and Comparison:: How variables acquire types, and how this
|
|
affects comparison of numbers and strings with
|
|
@samp{<}, etc.
|
|
* Boolean Ops:: Combining comparison expressions using boolean
|
|
operators @samp{||} (``or''), @samp{&&}
|
|
(``and'') and @samp{!} (``not'').
|
|
* Conditional Exp:: Conditional expressions select between two
|
|
subexpressions under control of a third
|
|
subexpression.
|
|
* Function Calls:: A function call is an expression.
|
|
* Precedence:: How various operators nest.
|
|
@end menu
|
|
|
|
@node Constants, Using Constant Regexps, Expressions, Expressions
|
|
@section Constant Expressions
|
|
@cindex constants, types of
|
|
@cindex string constants
|
|
|
|
The simplest type of expression is the @dfn{constant}, which always has
|
|
the same value. There are three types of constants: numeric constants,
|
|
string constants, and regular expression constants.
|
|
|
|
@menu
|
|
* Scalar Constants:: Numeric and string constants.
|
|
* Regexp Constants:: Regular Expression constants.
|
|
@end menu
|
|
|
|
@node Scalar Constants, Regexp Constants, Constants, Constants
|
|
@subsection Numeric and String Constants
|
|
|
|
@cindex numeric constant
|
|
@cindex numeric value
|
|
A @dfn{numeric constant} stands for a number. This number can be an
|
|
integer, a decimal fraction, or a number in scientific (exponential)
|
|
notation.@footnote{The internal representation uses double-precision
|
|
floating point numbers. If you don't know what that means, then don't
|
|
worry about it.} Here are some examples of numeric constants, which all
|
|
have the same value:
|
|
|
|
@example
|
|
105
|
|
1.05e+2
|
|
1050e-1
|
|
@end example
|
|
|
|
A string constant consists of a sequence of characters enclosed in
|
|
double-quote marks. For example:
|
|
|
|
@example
|
|
"parrot"
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
represents the string whose contents are @samp{parrot}. Strings in
|
|
@code{gawk} can be of any length and they can contain any of the possible
|
|
eight-bit ASCII characters including ASCII NUL (character code zero).
|
|
Other @code{awk}
|
|
implementations may have difficulty with some character codes.
|
|
|
|
@node Regexp Constants, , Scalar Constants, Constants
|
|
@subsection Regular Expression Constants
|
|
|
|
@cindex @code{~} operator
|
|
@cindex @code{!~} operator
|
|
A regexp constant is a regular expression description enclosed in
|
|
slashes, such as @code{@w{/^beginning and end$/}}. Most regexps used in
|
|
@code{awk} programs are constant, but the @samp{~} and @samp{!~}
|
|
matching operators can also match computed or ``dynamic'' regexps
|
|
(which are just ordinary strings or variables that contain a regexp).
|
|
|
|
@node Using Constant Regexps, Variables, Constants, Expressions
|
|
@section Using Regular Expression Constants
|
|
|
|
When used on the right hand side of the @samp{~} or @samp{!~}
|
|
operators, a regexp constant merely stands for the regexp that is to be
|
|
matched.
|
|
|
|
@cindex dark corner
|
|
Regexp constants (such as @code{/foo/}) may be used like simple expressions.
|
|
When a
|
|
regexp constant appears by itself, it has the same meaning as if it appeared
|
|
in a pattern, i.e.@: @samp{($0 ~ /foo/)} (d.c.)
|
|
(@pxref{Expression Patterns, ,Expressions as Patterns}).
|
|
This means that the two code segments,
|
|
|
|
@example
|
|
if ($0 ~ /barfly/ || $0 ~ /camelot/)
|
|
print "found"
|
|
@end example
|
|
|
|
@noindent
|
|
and
|
|
|
|
@example
|
|
if (/barfly/ || /camelot/)
|
|
print "found"
|
|
@end example
|
|
|
|
@noindent
|
|
are exactly equivalent.
|
|
|
|
One rather bizarre consequence of this rule is that the following
|
|
boolean expression is valid, but does not do what the user probably
|
|
intended:
|
|
|
|
@example
|
|
# note that /foo/ is on the left of the ~
|
|
if (/foo/ ~ $1) print "found foo"
|
|
@end example
|
|
|
|
@noindent
|
|
This code is ``obviously'' testing @code{$1} for a match against the regexp
|
|
@code{/foo/}. But in fact, the expression @samp{/foo/ ~ $1} actually means
|
|
@samp{($0 ~ /foo/) ~ $1}. In other words, first match the input record
|
|
against the regexp @code{/foo/}. The result will be either zero or one,
|
|
depending upon the success or failure of the match. Then match that result
|
|
against the first field in the record.
|
|
|
|
Since it is unlikely that you would ever really wish to make this kind of
|
|
test, @code{gawk} will issue a warning when it sees this construct in
|
|
a program.
|
|
|
|
Another consequence of this rule is that the assignment statement
|
|
|
|
@example
|
|
matches = /foo/
|
|
@end example
|
|
|
|
@noindent
|
|
will assign either zero or one to the variable @code{matches}, depending
|
|
upon the contents of the current input record.
|
|
|
|
This feature of the language was never well documented until the
|
|
POSIX specification.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
@cindex dark corner
|
|
Constant regular expressions are also used as the first argument for
|
|
the @code{gensub}, @code{sub} and @code{gsub} functions, and as the
|
|
second argument of the @code{match} function
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
Modern implementations of @code{awk}, including @code{gawk}, allow
|
|
the third argument of @code{split} to be a regexp constant, while some
|
|
older implementations do not (d.c.).
|
|
|
|
This can lead to confusion when attempting to use regexp constants
|
|
as arguments to user defined functions
|
|
(@pxref{User-defined, , User-defined Functions}).
|
|
For example:
|
|
|
|
@example
|
|
@group
|
|
function mysub(pat, repl, str, global)
|
|
@{
|
|
if (global)
|
|
gsub(pat, repl, str)
|
|
else
|
|
sub(pat, repl, str)
|
|
return str
|
|
@}
|
|
@end group
|
|
|
|
@group
|
|
@{
|
|
@dots{}
|
|
text = "hi! hi yourself!"
|
|
mysub(/hi/, "howdy", text, 1)
|
|
@dots{}
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
In this example, the programmer wishes to pass a regexp constant to the
|
|
user-defined function @code{mysub}, which will in turn pass it on to
|
|
either @code{sub} or @code{gsub}. However, what really happens is that
|
|
the @code{pat} parameter will be either one or zero, depending upon whether
|
|
or not @code{$0} matches @code{/hi/}.
|
|
|
|
As it is unlikely that you would ever really wish to pass a truth value
|
|
in this way, @code{gawk} will issue a warning when it sees a regexp
|
|
constant used as a parameter to a user-defined function.
|
|
|
|
@node Variables, Conversion, Using Constant Regexps, Expressions
|
|
@section Variables
|
|
|
|
Variables are ways of storing values at one point in your program for
|
|
use later in another part of your program. You can manipulate them
|
|
entirely within your program text, and you can also assign values to
|
|
them on the @code{awk} command line.
|
|
|
|
@menu
|
|
* Using Variables:: Using variables in your programs.
|
|
* Assignment Options:: Setting variables on the command line and a
|
|
summary of command line syntax. This is an
|
|
advanced method of input.
|
|
@end menu
|
|
|
|
@node Using Variables, Assignment Options, Variables, Variables
|
|
@subsection Using Variables in a Program
|
|
|
|
@cindex variables, user-defined
|
|
@cindex user-defined variables
|
|
Variables let you give names to values and refer to them later. You have
|
|
already seen variables in many of the examples. The name of a variable
|
|
must be a sequence of letters, digits and underscores, but it may not begin
|
|
with a digit. Case is significant in variable names; @code{a} and @code{A}
|
|
are distinct variables.
|
|
|
|
A variable name is a valid expression by itself; it represents the
|
|
variable's current value. Variables are given new values with
|
|
@dfn{assignment operators}, @dfn{increment operators} and
|
|
@dfn{decrement operators}.
|
|
@xref{Assignment Ops, ,Assignment Expressions}.
|
|
|
|
A few variables have special built-in meanings, such as @code{FS}, the
|
|
field separator, and @code{NF}, the number of fields in the current
|
|
input record. @xref{Built-in Variables}, for a list of them. These
|
|
built-in variables can be used and assigned just like all other
|
|
variables, but their values are also used or changed automatically by
|
|
@code{awk}. All built-in variables names are entirely upper-case.
|
|
|
|
Variables in @code{awk} can be assigned either numeric or string
|
|
values. By default, variables are initialized to the empty string, which
|
|
is zero if converted to a number. There is no need to
|
|
``initialize'' each variable explicitly in @code{awk},
|
|
the way you would in C and in most other traditional languages.
|
|
|
|
@node Assignment Options, , Using Variables, Variables
|
|
@subsection Assigning Variables on the Command Line
|
|
|
|
You can set any @code{awk} variable by including a @dfn{variable assignment}
|
|
among the arguments on the command line when you invoke @code{awk}
|
|
(@pxref{Other Arguments, ,Other Command Line Arguments}). Such an assignment has
|
|
this form:
|
|
|
|
@example
|
|
@var{variable}=@var{text}
|
|
@end example
|
|
|
|
@noindent
|
|
With it, you can set a variable either at the beginning of the
|
|
@code{awk} run or in between input files.
|
|
|
|
If you precede the assignment with the @samp{-v} option, like this:
|
|
|
|
@example
|
|
-v @var{variable}=@var{text}
|
|
@end example
|
|
|
|
@noindent
|
|
then the variable is set at the very beginning, before even the
|
|
@code{BEGIN} rules are run. The @samp{-v} option and its assignment
|
|
must precede all the file name arguments, as well as the program text.
|
|
(@xref{Options, ,Command Line Options}, for more information about
|
|
the @samp{-v} option.)
|
|
|
|
Otherwise, the variable assignment is performed at a time determined by
|
|
its position among the input file arguments: after the processing of the
|
|
preceding input file argument. For example:
|
|
|
|
@example
|
|
awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
prints the value of field number @code{n} for all input records. Before
|
|
the first file is read, the command line sets the variable @code{n}
|
|
equal to four. This causes the fourth field to be printed in lines from
|
|
the file @file{inventory-shipped}. After the first file has finished,
|
|
but before the second file is started, @code{n} is set to two, so that the
|
|
second field is printed in lines from @file{BBS-list}.
|
|
|
|
@example
|
|
@group
|
|
$ awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
|
|
@print{} 15
|
|
@print{} 24
|
|
@dots{}
|
|
@print{} 555-5553
|
|
@print{} 555-3412
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
Command line arguments are made available for explicit examination by
|
|
the @code{awk} program in an array named @code{ARGV}
|
|
(@pxref{ARGC and ARGV, ,Using @code{ARGC} and @code{ARGV}}).
|
|
|
|
@cindex dark corner
|
|
@code{awk} processes the values of command line assignments for escape
|
|
sequences (d.c.) (@pxref{Escape Sequences}).
|
|
|
|
@node Conversion, Arithmetic Ops, Variables, Expressions
|
|
@section Conversion of Strings and Numbers
|
|
|
|
@cindex conversion of strings and numbers
|
|
Strings are converted to numbers, and numbers to strings, if the context
|
|
of the @code{awk} program demands it. For example, if the value of
|
|
either @code{foo} or @code{bar} in the expression @samp{foo + bar}
|
|
happens to be a string, it is converted to a number before the addition
|
|
is performed. If numeric values appear in string concatenation, they
|
|
are converted to strings. Consider this:
|
|
|
|
@example
|
|
two = 2; three = 3
|
|
print (two three) + 4
|
|
@end example
|
|
|
|
@noindent
|
|
This prints the (numeric) value 27. The numeric values of
|
|
the variables @code{two} and @code{three} are converted to strings and
|
|
concatenated together, and the resulting string is converted back to the
|
|
number 23, to which four is then added.
|
|
|
|
@cindex null string
|
|
@cindex empty string
|
|
@cindex type conversion
|
|
If, for some reason, you need to force a number to be converted to a
|
|
string, concatenate the empty string, @code{""}, with that number.
|
|
To force a string to be converted to a number, add zero to that string.
|
|
|
|
A string is converted to a number by interpreting any numeric prefix
|
|
of the string as numerals:
|
|
@code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1000, and @code{"25fix"}
|
|
has a numeric value of 25.
|
|
Strings that can't be interpreted as valid numbers are converted to
|
|
zero.
|
|
|
|
@vindex CONVFMT
|
|
The exact manner in which numbers are converted into strings is controlled
|
|
by the @code{awk} built-in variable @code{CONVFMT} (@pxref{Built-in Variables}).
|
|
Numbers are converted using the @code{sprintf} function
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation})
|
|
with @code{CONVFMT} as the format
|
|
specifier.
|
|
|
|
@code{CONVFMT}'s default value is @code{"%.6g"}, which prints a value with
|
|
at least six significant digits. For some applications you will want to
|
|
change it to specify more precision. Double precision on most modern
|
|
machines gives you 16 or 17 decimal digits of precision.
|
|
|
|
Strange results can happen if you set @code{CONVFMT} to a string that doesn't
|
|
tell @code{sprintf} how to format floating point numbers in a useful way.
|
|
For example, if you forget the @samp{%} in the format, all numbers will be
|
|
converted to the same constant string.
|
|
|
|
@cindex dark corner
|
|
As a special case, if a number is an integer, then the result of converting
|
|
it to a string is @emph{always} an integer, no matter what the value of
|
|
@code{CONVFMT} may be. Given the following code fragment:
|
|
|
|
@example
|
|
CONVFMT = "%2.2f"
|
|
a = 12
|
|
b = a ""
|
|
@end example
|
|
|
|
@noindent
|
|
@code{b} has the value @code{"12"}, not @code{"12.00"} (d.c.).
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
@vindex OFMT
|
|
Prior to the POSIX standard, @code{awk} specified that the value
|
|
of @code{OFMT} was used for converting numbers to strings. @code{OFMT}
|
|
specifies the output format to use when printing numbers with @code{print}.
|
|
@code{CONVFMT} was introduced in order to separate the semantics of
|
|
conversion from the semantics of printing. Both @code{CONVFMT} and
|
|
@code{OFMT} have the same default value: @code{"%.6g"}. In the vast majority
|
|
of cases, old @code{awk} programs will not change their behavior.
|
|
However, this use of @code{OFMT} is something to keep in mind if you must
|
|
port your program to other implementations of @code{awk}; we recommend
|
|
that instead of changing your programs, you just port @code{gawk} itself!
|
|
@xref{Print, ,The @code{print} Statement},
|
|
for more information on the @code{print} statement.
|
|
|
|
@node Arithmetic Ops, Concatenation, Conversion, Expressions
|
|
@section Arithmetic Operators
|
|
@cindex arithmetic operators
|
|
@cindex operators, arithmetic
|
|
@cindex addition
|
|
@cindex subtraction
|
|
@cindex multiplication
|
|
@cindex division
|
|
@cindex remainder
|
|
@cindex quotient
|
|
@cindex exponentiation
|
|
|
|
The @code{awk} language uses the common arithmetic operators when
|
|
evaluating expressions. All of these arithmetic operators follow normal
|
|
precedence rules, and work as you would expect them to.
|
|
|
|
Here is a file @file{grades} containing a list of student names and
|
|
three test scores per student (it's a small class):
|
|
|
|
@example
|
|
Pat 100 97 58
|
|
Sandy 84 72 93
|
|
Chris 72 92 89
|
|
@end example
|
|
|
|
@noindent
|
|
This programs takes the file @file{grades}, and prints the average
|
|
of the scores.
|
|
|
|
@example
|
|
$ awk '@{ sum = $2 + $3 + $4 ; avg = sum / 3
|
|
> print $1, avg @}' grades
|
|
@print{} Pat 85
|
|
@print{} Sandy 83
|
|
@print{} Chris 84.3333
|
|
@end example
|
|
|
|
This table lists the arithmetic operators in @code{awk}, in order from
|
|
highest precedence to lowest:
|
|
|
|
@c @cartouche
|
|
@table @code
|
|
@item - @var{x}
|
|
Negation.
|
|
|
|
@item + @var{x}
|
|
Unary plus. The expression is converted to a number.
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
@item @var{x} ^ @var{y}
|
|
@itemx @var{x} ** @var{y}
|
|
Exponentiation: @var{x} raised to the @var{y} power. @samp{2 ^ 3} has
|
|
the value eight. The character sequence @samp{**} is equivalent to
|
|
@samp{^}. (The POSIX standard only specifies the use of @samp{^}
|
|
for exponentiation.)
|
|
|
|
@item @var{x} * @var{y}
|
|
Multiplication.
|
|
|
|
@item @var{x} / @var{y}
|
|
Division. Since all numbers in @code{awk} are
|
|
real numbers, the result is not rounded to an integer: @samp{3 / 4}
|
|
has the value 0.75.
|
|
|
|
@item @var{x} % @var{y}
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
Remainder. The quotient is rounded toward zero to an integer,
|
|
multiplied by @var{y} and this result is subtracted from @var{x}.
|
|
This operation is sometimes known as ``trunc-mod.'' The following
|
|
relation always holds:
|
|
|
|
@example
|
|
b * int(a / b) + (a % b) == a
|
|
@end example
|
|
|
|
One possibly undesirable effect of this definition of remainder is that
|
|
@code{@var{x} % @var{y}} is negative if @var{x} is negative. Thus,
|
|
|
|
@example
|
|
-17 % 8 = -1
|
|
@end example
|
|
|
|
In other @code{awk} implementations, the signedness of the remainder
|
|
may be machine dependent.
|
|
@c !!! what does posix say?
|
|
|
|
@item @var{x} + @var{y}
|
|
Addition.
|
|
|
|
@item @var{x} - @var{y}
|
|
Subtraction.
|
|
@end table
|
|
@c @end cartouche
|
|
|
|
For maximum portability, do not use the @samp{**} operator.
|
|
|
|
Unary plus and minus have the same precedence,
|
|
the multiplication operators all have the same precedence, and
|
|
addition and subtraction have the same precedence.
|
|
|
|
@node Concatenation, Assignment Ops, Arithmetic Ops, Expressions
|
|
@section String Concatenation
|
|
@cindex Kernighan, Brian
|
|
@display
|
|
@i{It seemed like a good idea at the time.}
|
|
Brian Kernighan
|
|
@end display
|
|
@sp 1
|
|
|
|
@cindex string operators
|
|
@cindex operators, string
|
|
@cindex concatenation
|
|
There is only one string operation: concatenation. It does not have a
|
|
specific operator to represent it. Instead, concatenation is performed by
|
|
writing expressions next to one another, with no operator. For example:
|
|
|
|
@example
|
|
@group
|
|
$ awk '@{ print "Field number one: " $1 @}' BBS-list
|
|
@print{} Field number one: aardvark
|
|
@print{} Field number one: alpo-net
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
Without the space in the string constant after the @samp{:}, the line
|
|
would run together. For example:
|
|
|
|
@example
|
|
@group
|
|
$ awk '@{ print "Field number one:" $1 @}' BBS-list
|
|
@print{} Field number one:aardvark
|
|
@print{} Field number one:alpo-net
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
Since string concatenation does not have an explicit operator, it is
|
|
often necessary to insure that it happens where you want it to by
|
|
using parentheses to enclose
|
|
the items to be concatenated. For example, the
|
|
following code fragment does not concatenate @code{file} and @code{name}
|
|
as you might expect:
|
|
|
|
@example
|
|
@group
|
|
file = "file"
|
|
name = "name"
|
|
print "something meaningful" > file name
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
It is necessary to use the following:
|
|
|
|
@example
|
|
print "something meaningful" > (file name)
|
|
@end example
|
|
|
|
We recommend that you use parentheses around concatenation in all but the
|
|
most common contexts (such as on the right-hand side of @samp{=}).
|
|
|
|
@node Assignment Ops, Increment Ops, Concatenation, Expressions
|
|
@section Assignment Expressions
|
|
@cindex assignment operators
|
|
@cindex operators, assignment
|
|
@cindex expression, assignment
|
|
|
|
An @dfn{assignment} is an expression that stores a new value into a
|
|
variable. For example, let's assign the value one to the variable
|
|
@code{z}:
|
|
|
|
@example
|
|
z = 1
|
|
@end example
|
|
|
|
After this expression is executed, the variable @code{z} has the value one.
|
|
Whatever old value @code{z} had before the assignment is forgotten.
|
|
|
|
Assignments can store string values also. For example, this would store
|
|
the value @code{"this food is good"} in the variable @code{message}:
|
|
|
|
@example
|
|
thing = "food"
|
|
predicate = "good"
|
|
message = "this " thing " is " predicate
|
|
@end example
|
|
|
|
@noindent
|
|
(This also illustrates string concatenation.)
|
|
|
|
The @samp{=} sign is called an @dfn{assignment operator}. It is the
|
|
simplest assignment operator because the value of the right-hand
|
|
operand is stored unchanged.
|
|
|
|
@cindex side effect
|
|
Most operators (addition, concatenation, and so on) have no effect
|
|
except to compute a value. If you ignore the value, you might as well
|
|
not use the operator. An assignment operator is different; it does
|
|
produce a value, but even if you ignore the value, the assignment still
|
|
makes itself felt through the alteration of the variable. We call this
|
|
a @dfn{side effect}.
|
|
|
|
@cindex lvalue
|
|
@cindex rvalue
|
|
The left-hand operand of an assignment need not be a variable
|
|
(@pxref{Variables}); it can also be a field
|
|
(@pxref{Changing Fields, ,Changing the Contents of a Field}) or
|
|
an array element (@pxref{Arrays, ,Arrays in @code{awk}}).
|
|
These are all called @dfn{lvalues},
|
|
which means they can appear on the left-hand side of an assignment operator.
|
|
The right-hand operand may be any expression; it produces the new value
|
|
which the assignment stores in the specified variable, field or array
|
|
element. (Such values are called @dfn{rvalues}).
|
|
|
|
@cindex types of variables
|
|
It is important to note that variables do @emph{not} have permanent types.
|
|
The type of a variable is simply the type of whatever value it happens
|
|
to hold at the moment. In the following program fragment, the variable
|
|
@code{foo} has a numeric value at first, and a string value later on:
|
|
|
|
@example
|
|
@group
|
|
foo = 1
|
|
print foo
|
|
foo = "bar"
|
|
print foo
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
When the second assignment gives @code{foo} a string value, the fact that
|
|
it previously had a numeric value is forgotten.
|
|
|
|
String values that do not begin with a digit have a numeric value of
|
|
zero. After executing this code, the value of @code{foo} is five:
|
|
|
|
@example
|
|
foo = "a string"
|
|
foo = foo + 5
|
|
@end example
|
|
|
|
@noindent
|
|
(Note that using a variable as a number and then later as a string can
|
|
be confusing and is poor programming style. The above examples illustrate how
|
|
@code{awk} works, @emph{not} how you should write your own programs!)
|
|
|
|
An assignment is an expression, so it has a value: the same value that
|
|
is assigned. Thus, @samp{z = 1} as an expression has the value one.
|
|
One consequence of this is that you can write multiple assignments together:
|
|
|
|
@example
|
|
x = y = z = 0
|
|
@end example
|
|
|
|
@noindent
|
|
stores the value zero in all three variables. It does this because the
|
|
value of @samp{z = 0}, which is zero, is stored into @code{y}, and then
|
|
the value of @samp{y = z = 0}, which is zero, is stored into @code{x}.
|
|
|
|
You can use an assignment anywhere an expression is called for. For
|
|
example, it is valid to write @samp{x != (y = 1)} to set @code{y} to one
|
|
and then test whether @code{x} equals one. But this style tends to make
|
|
programs hard to read; except in a one-shot program, you should
|
|
not use such nesting of assignments.
|
|
|
|
Aside from @samp{=}, there are several other assignment operators that
|
|
do arithmetic with the old value of the variable. For example, the
|
|
operator @samp{+=} computes a new value by adding the right-hand value
|
|
to the old value of the variable. Thus, the following assignment adds
|
|
five to the value of @code{foo}:
|
|
|
|
@example
|
|
foo += 5
|
|
@end example
|
|
|
|
@noindent
|
|
This is equivalent to the following:
|
|
|
|
@example
|
|
foo = foo + 5
|
|
@end example
|
|
|
|
@noindent
|
|
Use whichever one makes the meaning of your program clearer.
|
|
|
|
There are situations where using @samp{+=} (or any assignment operator)
|
|
is @emph{not} the same as simply repeating the left-hand operand in the
|
|
right-hand expression. For example:
|
|
|
|
@cindex Rankin, Pat
|
|
@example
|
|
@group
|
|
# Thanks to Pat Rankin for this example
|
|
BEGIN @{
|
|
foo[rand()] += 5
|
|
for (x in foo)
|
|
print x, foo[x]
|
|
|
|
bar[rand()] = bar[rand()] + 5
|
|
for (x in bar)
|
|
print x, bar[x]
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
The indices of @code{bar} are guaranteed to be different, because
|
|
@code{rand} will return different values each time it is called.
|
|
(Arrays and the @code{rand} function haven't been covered yet.
|
|
@xref{Arrays, ,Arrays in @code{awk}},
|
|
and see @ref{Numeric Functions, ,Numeric Built-in Functions}, for more information).
|
|
This example illustrates an important fact about the assignment
|
|
operators: the left-hand expression is only evaluated @emph{once}.
|
|
|
|
It is also up to the implementation as to which expression is evaluated
|
|
first, the left-hand one or the right-hand one.
|
|
Consider this example:
|
|
|
|
@example
|
|
i = 1
|
|
a[i += 2] = i + 1
|
|
@end example
|
|
|
|
@noindent
|
|
The value of @code{a[3]} could be either two or four.
|
|
|
|
Here is a table of the arithmetic assignment operators. In each
|
|
case, the right-hand operand is an expression whose value is converted
|
|
to a number.
|
|
|
|
@c @cartouche
|
|
@table @code
|
|
@item @var{lvalue} += @var{increment}
|
|
Adds @var{increment} to the value of @var{lvalue} to make the new value
|
|
of @var{lvalue}.
|
|
|
|
@item @var{lvalue} -= @var{decrement}
|
|
Subtracts @var{decrement} from the value of @var{lvalue}.
|
|
|
|
@item @var{lvalue} *= @var{coefficient}
|
|
Multiplies the value of @var{lvalue} by @var{coefficient}.
|
|
|
|
@item @var{lvalue} /= @var{divisor}
|
|
Divides the value of @var{lvalue} by @var{divisor}.
|
|
|
|
@item @var{lvalue} %= @var{modulus}
|
|
Sets @var{lvalue} to its remainder by @var{modulus}.
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
@item @var{lvalue} ^= @var{power}
|
|
@itemx @var{lvalue} **= @var{power}
|
|
Raises @var{lvalue} to the power @var{power}.
|
|
(Only the @samp{^=} operator is specified by POSIX.)
|
|
@end table
|
|
@c @end cartouche
|
|
|
|
For maximum portability, do not use the @samp{**=} operator.
|
|
|
|
@node Increment Ops, Truth Values, Assignment Ops, Expressions
|
|
@section Increment and Decrement Operators
|
|
|
|
@cindex increment operators
|
|
@cindex operators, increment
|
|
@dfn{Increment} and @dfn{decrement operators} increase or decrease the value of
|
|
a variable by one. You could do the same thing with an assignment operator, so
|
|
the increment operators add no power to the @code{awk} language; but they
|
|
are convenient abbreviations for very common operations.
|
|
|
|
The operator to add one is written @samp{++}. It can be used to increment
|
|
a variable either before or after taking its value.
|
|
|
|
To pre-increment a variable @var{v}, write @samp{++@var{v}}. This adds
|
|
one to the value of @var{v} and that new value is also the value of this
|
|
expression. The assignment expression @samp{@var{v} += 1} is completely
|
|
equivalent.
|
|
|
|
Writing the @samp{++} after the variable specifies post-increment. This
|
|
increments the variable value just the same; the difference is that the
|
|
value of the increment expression itself is the variable's @emph{old}
|
|
value. Thus, if @code{foo} has the value four, then the expression @samp{foo++}
|
|
has the value four, but it changes the value of @code{foo} to five.
|
|
|
|
The post-increment @samp{foo++} is nearly equivalent to writing @samp{(foo
|
|
+= 1) - 1}. It is not perfectly equivalent because all numbers in
|
|
@code{awk} are floating point: in floating point, @samp{foo + 1 - 1} does
|
|
not necessarily equal @code{foo}. But the difference is minute as
|
|
long as you stick to numbers that are fairly small (less than 10e12).
|
|
|
|
Any lvalue can be incremented. Fields and array elements are incremented
|
|
just like variables. (Use @samp{$(i++)} when you wish to do a field reference
|
|
and a variable increment at the same time. The parentheses are necessary
|
|
because of the precedence of the field reference operator, @samp{$}.)
|
|
|
|
@cindex decrement operators
|
|
@cindex operators, decrement
|
|
The decrement operator @samp{--} works just like @samp{++} except that
|
|
it subtracts one instead of adding. Like @samp{++}, it can be used before
|
|
the lvalue to pre-decrement or after it to post-decrement.
|
|
|
|
Here is a summary of increment and decrement expressions.
|
|
|
|
@c @cartouche
|
|
@table @code
|
|
@item ++@var{lvalue}
|
|
This expression increments @var{lvalue} and the new value becomes the
|
|
value of the expression.
|
|
|
|
@item @var{lvalue}++
|
|
This expression increments @var{lvalue}, but
|
|
the value of the expression is the @emph{old} value of @var{lvalue}.
|
|
|
|
@item --@var{lvalue}
|
|
Like @samp{++@var{lvalue}}, but instead of adding, it subtracts. It
|
|
decrements @var{lvalue} and delivers the value that results.
|
|
|
|
@item @var{lvalue}--
|
|
Like @samp{@var{lvalue}++}, but instead of adding, it subtracts. It
|
|
decrements @var{lvalue}. The value of the expression is the @emph{old}
|
|
value of @var{lvalue}.
|
|
@end table
|
|
@c @end cartouche
|
|
|
|
@node Truth Values, Typing and Comparison, Increment Ops, Expressions
|
|
@section True and False in @code{awk}
|
|
@cindex truth values
|
|
@cindex logical true
|
|
@cindex logical false
|
|
|
|
Many programming languages have a special representation for the concepts
|
|
of ``true'' and ``false.'' Such languages usually use the special
|
|
constants @code{true} and @code{false}, or perhaps their upper-case
|
|
equivalents.
|
|
|
|
@cindex null string
|
|
@cindex empty string
|
|
@code{awk} is different. It borrows a very simple concept of true and
|
|
false from C. In @code{awk}, any non-zero numeric value, @emph{or} any
|
|
non-empty string value is true. Any other value (zero or the null
|
|
string, @code{""}) is false. The following program will print @samp{A strange
|
|
truth value} three times:
|
|
|
|
@example
|
|
@group
|
|
BEGIN @{
|
|
if (3.1415927)
|
|
print "A strange truth value"
|
|
if ("Four Score And Seven Years Ago")
|
|
print "A strange truth value"
|
|
if (j = 57)
|
|
print "A strange truth value"
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
@cindex dark corner
|
|
There is a surprising consequence of the ``non-zero or non-null'' rule:
|
|
The string constant @code{"0"} is actually true, since it is non-null (d.c.).
|
|
|
|
@node Typing and Comparison, Boolean Ops, Truth Values, Expressions
|
|
@section Variable Typing and Comparison Expressions
|
|
@cindex comparison expressions
|
|
@cindex expression, comparison
|
|
@cindex expression, matching
|
|
@cindex relational operators
|
|
@cindex operators, relational
|
|
@cindex regexp match/non-match operators
|
|
@cindex variable typing
|
|
@cindex types of variables
|
|
@c 2e: consider splitting this section into subsections
|
|
@display
|
|
@i{The Guide is definitive. Reality is frequently inaccurate.}
|
|
The Hitchhiker's Guide to the Galaxy
|
|
@end display
|
|
@sp 1
|
|
|
|
Unlike other programming languages, @code{awk} variables do not have a
|
|
fixed type. Instead, they can be either a number or a string, depending
|
|
upon the value that is assigned to them.
|
|
|
|
@cindex numeric string
|
|
The 1992 POSIX standard introduced
|
|
the concept of a @dfn{numeric string}, which is simply a string that looks
|
|
like a number, for example, @code{@w{" +2"}}. This concept is used
|
|
for determining the type of a variable.
|
|
|
|
The type of the variable is important, since the types of two variables
|
|
determine how they are compared.
|
|
|
|
In @code{gawk}, variable typing follows these rules.
|
|
|
|
@enumerate 1
|
|
@item
|
|
A numeric literal or the result of a numeric operation has the @var{numeric}
|
|
attribute.
|
|
|
|
@item
|
|
A string literal or the result of a string operation has the @var{string}
|
|
attribute.
|
|
|
|
@item
|
|
Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements,
|
|
@code{ENVIRON} elements and the
|
|
elements of an array created by @code{split} that are numeric strings
|
|
have the @var{strnum} attribute. Otherwise, they have the @var{string}
|
|
attribute.
|
|
Uninitialized variables also have the @var{strnum} attribute.
|
|
|
|
@item
|
|
Attributes propagate across assignments, but are not changed by
|
|
any use.
|
|
@c (Although a use may cause the entity to acquire an additional
|
|
@c value such that it has both a numeric and string value -- this leaves the
|
|
@c attribute unchanged.)
|
|
@c This is important but not relevant
|
|
@end enumerate
|
|
|
|
The last rule is particularly important. In the following program,
|
|
@code{a} has numeric type, even though it is later used in a string
|
|
operation.
|
|
|
|
@example
|
|
BEGIN @{
|
|
a = 12.345
|
|
b = a " is a cute number"
|
|
print b
|
|
@}
|
|
@end example
|
|
|
|
When two operands are compared, either string comparison or numeric comparison
|
|
may be used, depending on the attributes of the operands, according to the
|
|
following, symmetric, matrix:
|
|
|
|
@c thanks to Karl Berry, kb@cs.umb.edu, for major help with TeX tables
|
|
@tex
|
|
\centerline{
|
|
\vbox{\bigskip % space above the table (about 1 linespace)
|
|
% Because we have vertical rules, we can't let TeX insert interline space
|
|
% in its usual way.
|
|
\offinterlineskip
|
|
%
|
|
% Define the table template. & separates columns, and \cr ends the
|
|
% template (and each row). # is replaced by the text of that entry on
|
|
% each row. The template for the first column breaks down like this:
|
|
% \strut -- a way to make each line have the height and depth
|
|
% of a normal line of type, since we turned off interline spacing.
|
|
% \hfil -- infinite glue; has the effect of right-justifying in this case.
|
|
% # -- replaced by the text (for instance, `STRNUM', in the last row).
|
|
% \quad -- about the width of an `M'. Just separates the columns.
|
|
%
|
|
% The second column (\vrule#) is what generates the vertical rule that
|
|
% spans table rows.
|
|
%
|
|
% The doubled && before the next entry means `repeat the following
|
|
% template as many times as necessary on each line' -- in our case, twice.
|
|
%
|
|
% The template itself, \quad#\hfil, left-justifies with a little space before.
|
|
%
|
|
\halign{\strut\hfil#\quad&\vrule#&&\quad#\hfil\cr
|
|
&&STRING &NUMERIC &STRNUM\cr
|
|
% The \omit tells TeX to skip inserting the template for this column on
|
|
% this particular row. In this case, we only want a little extra space
|
|
% to separate the heading row from the rule below it. the depth 2pt --
|
|
% `\vrule depth 2pt' is that little space.
|
|
\omit &depth 2pt\cr
|
|
% This is the horizontal rule below the heading. Since it has nothing to
|
|
% do with the columns of the table, we use \noalign to get it in there.
|
|
\noalign{\hrule}
|
|
% Like above, this time a little more space.
|
|
\omit &depth 4pt\cr
|
|
% The remaining rows have nothing special about them.
|
|
STRING &&string &string &string\cr
|
|
NUMERIC &&string &numeric &numeric\cr
|
|
STRNUM &&string &numeric &numeric\cr
|
|
}}}
|
|
@end tex
|
|
@ifinfo
|
|
@display
|
|
+----------------------------------------------
|
|
| STRING NUMERIC STRNUM
|
|
--------+----------------------------------------------
|
|
|
|
|
STRING | string string string
|
|
|
|
|
NUMERIC | string numeric numeric
|
|
|
|
|
STRNUM | string numeric numeric
|
|
--------+----------------------------------------------
|
|
@end display
|
|
@end ifinfo
|
|
|
|
The basic idea is that user input that looks numeric, and @emph{only}
|
|
user input, should be treated as numeric, even though it is actually
|
|
made of characters, and is therefore also a string.
|
|
|
|
@dfn{Comparison expressions} compare strings or numbers for
|
|
relationships such as equality. They are written using @dfn{relational
|
|
operators}, which are a superset of those in C. Here is a table of
|
|
them:
|
|
|
|
@cindex relational operators
|
|
@cindex operators, relational
|
|
@cindex @code{<} operator
|
|
@cindex @code{<=} operator
|
|
@cindex @code{>} operator
|
|
@cindex @code{>=} operator
|
|
@cindex @code{==} operator
|
|
@cindex @code{!=} operator
|
|
@cindex @code{~} operator
|
|
@cindex @code{!~} operator
|
|
@cindex @code{in} operator
|
|
@c @cartouche
|
|
@table @code
|
|
@item @var{x} < @var{y}
|
|
True if @var{x} is less than @var{y}.
|
|
|
|
@item @var{x} <= @var{y}
|
|
True if @var{x} is less than or equal to @var{y}.
|
|
|
|
@item @var{x} > @var{y}
|
|
True if @var{x} is greater than @var{y}.
|
|
|
|
@item @var{x} >= @var{y}
|
|
True if @var{x} is greater than or equal to @var{y}.
|
|
|
|
@item @var{x} == @var{y}
|
|
True if @var{x} is equal to @var{y}.
|
|
|
|
@item @var{x} != @var{y}
|
|
True if @var{x} is not equal to @var{y}.
|
|
|
|
@item @var{x} ~ @var{y}
|
|
True if the string @var{x} matches the regexp denoted by @var{y}.
|
|
|
|
@item @var{x} !~ @var{y}
|
|
True if the string @var{x} does not match the regexp denoted by @var{y}.
|
|
|
|
@item @var{subscript} in @var{array}
|
|
True if the array @var{array} has an element with the subscript @var{subscript}.
|
|
@end table
|
|
@c @end cartouche
|
|
|
|
Comparison expressions have the value one if true and zero if false.
|
|
|
|
When comparing operands of mixed types, numeric operands are converted
|
|
to strings using the value of @code{CONVFMT}
|
|
(@pxref{Conversion, ,Conversion of Strings and Numbers}).
|
|
|
|
Strings are compared
|
|
by comparing the first character of each, then the second character of each,
|
|
and so on. Thus @code{"10"} is less than @code{"9"}. If there are two
|
|
strings where one is a prefix of the other, the shorter string is less than
|
|
the longer one. Thus @code{"abc"} is less than @code{"abcd"}.
|
|
|
|
@cindex common mistakes
|
|
@cindex mistakes, common
|
|
@cindex errors, common
|
|
It is very easy to accidentally mistype the @samp{==} operator, and
|
|
leave off one of the @samp{=}s. The result is still valid @code{awk}
|
|
code, but the program will not do what you mean:
|
|
|
|
@example
|
|
if (a = b) # oops! should be a == b
|
|
@dots{}
|
|
else
|
|
@dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
Unless @code{b} happens to be zero or the null string, the @code{if}
|
|
part of the test will always succeed. Because the operators are
|
|
so similar, this kind of error is very difficult to spot when
|
|
scanning the source code.
|
|
|
|
Here are some sample expressions, how @code{gawk} compares them, and what
|
|
the result of the comparison is.
|
|
|
|
@table @code
|
|
@item 1.5 <= 2.0
|
|
numeric comparison (true)
|
|
|
|
@item "abc" >= "xyz"
|
|
string comparison (false)
|
|
|
|
@item 1.5 != " +2"
|
|
string comparison (true)
|
|
|
|
@item "1e2" < "3"
|
|
string comparison (true)
|
|
|
|
@item a = 2; b = "2"
|
|
@itemx a == b
|
|
string comparison (true)
|
|
|
|
@item a = 2; b = " +2"
|
|
@itemx a == b
|
|
string comparison (false)
|
|
@end table
|
|
|
|
In this example,
|
|
|
|
@example
|
|
@group
|
|
$ echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}'
|
|
@print{} false
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
the result is @samp{false} since both @code{$1} and @code{$2} are numeric
|
|
strings and thus both have the @var{strnum} attribute,
|
|
dictating a numeric comparison.
|
|
|
|
The purpose of the comparison rules and the use of numeric strings is
|
|
to attempt to produce the behavior that is ``least surprising,'' while
|
|
still ``doing the right thing.''
|
|
|
|
@cindex comparisons, string vs. regexp
|
|
@cindex string comparison vs. regexp comparison
|
|
@cindex regexp comparison vs. string comparison
|
|
String comparisons and regular expression comparisons are very different.
|
|
For example,
|
|
|
|
@example
|
|
x == "foo"
|
|
@end example
|
|
|
|
@noindent
|
|
has the value of one, or is true, if the variable @code{x}
|
|
is precisely @samp{foo}. By contrast,
|
|
|
|
@example
|
|
x ~ /foo/
|
|
@end example
|
|
|
|
@noindent
|
|
has the value one if @code{x} contains @samp{foo}, such as
|
|
@code{"Oh, what a fool am I!"}.
|
|
|
|
The right hand operand of the @samp{~} and @samp{!~} operators may be
|
|
either a regexp constant (@code{/@dots{}/}), or an ordinary
|
|
expression, in which case the value of the expression as a string is used as a
|
|
dynamic regexp (@pxref{Regexp Usage, ,How to Use Regular Expressions}; also
|
|
@pxref{Computed Regexps, ,Using Dynamic Regexps}).
|
|
|
|
@cindex regexp as expression
|
|
In recent implementations of @code{awk}, a constant regular
|
|
expression in slashes by itself is also an expression. The regexp
|
|
@code{/@var{regexp}/} is an abbreviation for this comparison expression:
|
|
|
|
@example
|
|
$0 ~ /@var{regexp}/
|
|
@end example
|
|
|
|
One special place where @code{/foo/} is @emph{not} an abbreviation for
|
|
@samp{$0 ~ /foo/} is when it is the right-hand operand of @samp{~} or
|
|
@samp{!~}!
|
|
@xref{Using Constant Regexps, ,Using Regular Expression Constants},
|
|
where this is discussed in more detail.
|
|
|
|
@c This paragraph has been here since day 1, and has always bothered
|
|
@c me, especially since the expression doesn't really make a lot of
|
|
@c sense. So, just take it out.
|
|
@ignore
|
|
In some contexts it may be necessary to write parentheses around the
|
|
regexp to avoid confusing the @code{gawk} parser. For example,
|
|
@samp{(/x/ - /y/) > threshold} is not allowed, but @samp{((/x/) - (/y/))
|
|
> threshold} parses properly.
|
|
@end ignore
|
|
|
|
@node Boolean Ops, Conditional Exp, Typing and Comparison, Expressions
|
|
@section Boolean Expressions
|
|
@cindex expression, boolean
|
|
@cindex boolean expressions
|
|
@cindex operators, boolean
|
|
@cindex boolean operators
|
|
@cindex logical operations
|
|
@cindex operations, logical
|
|
@cindex short-circuit operators
|
|
@cindex operators, short-circuit
|
|
@cindex and operator
|
|
@cindex or operator
|
|
@cindex not operator
|
|
@cindex @code{&&} operator
|
|
@cindex @code{||} operator
|
|
@cindex @code{!} operator
|
|
|
|
A @dfn{boolean expression} is a combination of comparison expressions or
|
|
matching expressions, using the boolean operators ``or''
|
|
(@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with
|
|
parentheses to control nesting. The truth value of the boolean expression is
|
|
computed by combining the truth values of the component expressions.
|
|
Boolean expressions are also referred to as @dfn{logical expressions}.
|
|
The terms are equivalent.
|
|
|
|
Boolean expressions can be used wherever comparison and matching
|
|
expressions can be used. They can be used in @code{if}, @code{while},
|
|
@code{do} and @code{for} statements
|
|
(@pxref{Statements, ,Control Statements in Actions}).
|
|
They have numeric values (one if true, zero if false), which come into play
|
|
if the result of the boolean expression is stored in a variable, or
|
|
used in arithmetic.
|
|
|
|
In addition, every boolean expression is also a valid pattern, so
|
|
you can use one as a pattern to control the execution of rules.
|
|
|
|
Here are descriptions of the three boolean operators, with examples.
|
|
|
|
@c @cartouche
|
|
@table @code
|
|
@item @var{boolean1} && @var{boolean2}
|
|
True if both @var{boolean1} and @var{boolean2} are true. For example,
|
|
the following statement prints the current input record if it contains
|
|
both @samp{2400} and @samp{foo}.
|
|
|
|
@example
|
|
if ($0 ~ /2400/ && $0 ~ /foo/) print
|
|
@end example
|
|
|
|
The subexpression @var{boolean2} is evaluated only if @var{boolean1}
|
|
is true. This can make a difference when @var{boolean2} contains
|
|
expressions that have side effects: in the case of @samp{$0 ~ /foo/ &&
|
|
($2 == bar++)}, the variable @code{bar} is not incremented if there is
|
|
no @samp{foo} in the record.
|
|
|
|
@item @var{boolean1} || @var{boolean2}
|
|
True if at least one of @var{boolean1} or @var{boolean2} is true.
|
|
For example, the following statement prints all records in the input
|
|
that contain @emph{either} @samp{2400} or
|
|
@samp{foo}, or both.
|
|
|
|
@example
|
|
if ($0 ~ /2400/ || $0 ~ /foo/) print
|
|
@end example
|
|
|
|
The subexpression @var{boolean2} is evaluated only if @var{boolean1}
|
|
is false. This can make a difference when @var{boolean2} contains
|
|
expressions that have side effects.
|
|
|
|
@item ! @var{boolean}
|
|
True if @var{boolean} is false. For example, the following program prints
|
|
all records in the input file @file{BBS-list} that do @emph{not} contain the
|
|
string @samp{foo}.
|
|
|
|
@c A better example would be `if (! (subscript in array)) ...' but we
|
|
@c haven't done anything with arrays or `in' yet. Sigh.
|
|
@example
|
|
awk '@{ if (! ($0 ~ /foo/)) print @}' BBS-list
|
|
@end example
|
|
@end table
|
|
@c @end cartouche
|
|
|
|
The @samp{&&} and @samp{||} operators are called @dfn{short-circuit}
|
|
operators because of the way they work. Evaluation of the full expression
|
|
is ``short-circuited'' if the result can be determined part way through
|
|
its evaluation.
|
|
|
|
@cindex line continuation
|
|
You can continue a statement that uses @samp{&&} or @samp{||} simply
|
|
by putting a newline after them. But you cannot put a newline in front
|
|
of either of these operators without using backslash continuation
|
|
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
|
|
|
|
The actual value of an expression using the @samp{!} operator will be
|
|
either one or zero, depending upon the truth value of the expression it
|
|
is applied to.
|
|
|
|
The @samp{!} operator is often useful for changing the sense of a flag
|
|
variable from false to true and back again. For example, the following
|
|
program is one way to print lines in between special bracketing lines:
|
|
|
|
@example
|
|
$1 == "START" @{ interested = ! interested @}
|
|
interested == 1 @{ print @}
|
|
$1 == "END" @{ interested = ! interested @}
|
|
@end example
|
|
|
|
@noindent
|
|
The variable @code{interested}, like all @code{awk} variables, starts
|
|
out initialized to zero, which is also false. When a line is seen whose
|
|
first field is @samp{START}, the value of @code{interested} is toggled
|
|
to true, using @samp{!}. The next rule prints lines as long as
|
|
@code{interested} is true. When a line is seen whose first field is
|
|
@samp{END}, @code{interested} is toggled back to false.
|
|
@ignore
|
|
We should discuss using `next' in the two rules that toggle the
|
|
variable, to avoid printing the bracketing lines, but that's more
|
|
distraction than really needed.
|
|
@end ignore
|
|
|
|
@node Conditional Exp, Function Calls, Boolean Ops, Expressions
|
|
@section Conditional Expressions
|
|
@cindex conditional expression
|
|
@cindex expression, conditional
|
|
|
|
A @dfn{conditional expression} is a special kind of expression with
|
|
three operands. It allows you to use one expression's value to select
|
|
one of two other expressions.
|
|
|
|
The conditional expression is the same as in the C language:
|
|
|
|
@example
|
|
@var{selector} ? @var{if-true-exp} : @var{if-false-exp}
|
|
@end example
|
|
|
|
@noindent
|
|
There are three subexpressions. The first, @var{selector}, is always
|
|
computed first. If it is ``true'' (not zero and not null) then
|
|
@var{if-true-exp} is computed next and its value becomes the value of
|
|
the whole expression. Otherwise, @var{if-false-exp} is computed next
|
|
and its value becomes the value of the whole expression.
|
|
|
|
For example, this expression produces the absolute value of @code{x}:
|
|
|
|
@example
|
|
x > 0 ? x : -x
|
|
@end example
|
|
|
|
Each time the conditional expression is computed, exactly one of
|
|
@var{if-true-exp} and @var{if-false-exp} is computed; the other is ignored.
|
|
This is important when the expressions contain side effects. For example,
|
|
this conditional expression examines element @code{i} of either array
|
|
@code{a} or array @code{b}, and increments @code{i}.
|
|
|
|
@example
|
|
x == y ? a[i++] : b[i++]
|
|
@end example
|
|
|
|
@noindent
|
|
This is guaranteed to increment @code{i} exactly once, because each time
|
|
only one of the two increment expressions is executed,
|
|
and the other is not.
|
|
@xref{Arrays, ,Arrays in @code{awk}},
|
|
for more information about arrays.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
@cindex line continuation
|
|
As a minor @code{gawk} extension,
|
|
you can continue a statement that uses @samp{?:} simply
|
|
by putting a newline after either character.
|
|
However, you cannot put a newline in front
|
|
of either character without using backslash continuation
|
|
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
|
|
If @samp{--posix} is specified
|
|
(@pxref{Options, , Command Line Options}), then this extension is disabled.
|
|
|
|
@node Function Calls, Precedence, Conditional Exp, Expressions
|
|
@section Function Calls
|
|
@cindex function call
|
|
@cindex calling a function
|
|
|
|
A @dfn{function} is a name for a particular calculation. Because it has
|
|
a name, you can ask for it by name at any point in the program. For
|
|
example, the function @code{sqrt} computes the square root of a number.
|
|
|
|
A fixed set of functions are @dfn{built-in}, which means they are
|
|
available in every @code{awk} program. The @code{sqrt} function is one
|
|
of these. @xref{Built-in, ,Built-in Functions}, for a list of built-in
|
|
functions and their descriptions. In addition, you can define your own
|
|
functions for use in your program.
|
|
@xref{User-defined, ,User-defined Functions}, for how to do this.
|
|
|
|
@cindex arguments in function call
|
|
The way to use a function is with a @dfn{function call} expression,
|
|
which consists of the function name followed immediately by a list of
|
|
@dfn{arguments} in parentheses. The arguments are expressions which
|
|
provide the raw materials for the function's calculations.
|
|
When there is more than one argument, they are separated by commas. If
|
|
there are no arguments, write just @samp{()} after the function name.
|
|
Here are some examples:
|
|
|
|
@example
|
|
sqrt(x^2 + y^2) @i{one argument}
|
|
atan2(y, x) @i{two arguments}
|
|
rand() @i{no arguments}
|
|
@end example
|
|
|
|
@strong{Do not put any space between the function name and the
|
|
open-parenthesis!} A user-defined function name looks just like the name of
|
|
a variable, and space would make the expression look like concatenation
|
|
of a variable with an expression inside parentheses. Space before the
|
|
parenthesis is harmless with built-in functions, but it is best not to get
|
|
into the habit of using space to avoid mistakes with user-defined
|
|
functions.
|
|
|
|
Each function expects a particular number of arguments. For example, the
|
|
@code{sqrt} function must be called with a single argument, the number
|
|
to take the square root of:
|
|
|
|
@example
|
|
sqrt(@var{argument})
|
|
@end example
|
|
|
|
Some of the built-in functions allow you to omit the final argument.
|
|
If you do so, they use a reasonable default.
|
|
@xref{Built-in, ,Built-in Functions}, for full details. If arguments
|
|
are omitted in calls to user-defined functions, then those arguments are
|
|
treated as local variables, initialized to the empty string
|
|
(@pxref{User-defined, ,User-defined Functions}).
|
|
|
|
Like every other expression, the function call has a value, which is
|
|
computed by the function based on the arguments you give it. In this
|
|
example, the value of @samp{sqrt(@var{argument})} is the square root of
|
|
@var{argument}. A function can also have side effects, such as assigning
|
|
values to certain variables or doing I/O.
|
|
|
|
Here is a command to read numbers, one number per line, and print the
|
|
square root of each one:
|
|
|
|
@example
|
|
@group
|
|
$ awk '@{ print "The square root of", $1, "is", sqrt($1) @}'
|
|
1
|
|
@print{} The square root of 1 is 1
|
|
3
|
|
@print{} The square root of 3 is 1.73205
|
|
5
|
|
@print{} The square root of 5 is 2.23607
|
|
@kbd{Control-d}
|
|
@end group
|
|
@end example
|
|
|
|
@node Precedence, , Function Calls, Expressions
|
|
@section Operator Precedence (How Operators Nest)
|
|
@cindex precedence
|
|
@cindex operator precedence
|
|
|
|
@dfn{Operator precedence} determines how operators are grouped, when
|
|
different operators appear close by in one expression. For example,
|
|
@samp{*} has higher precedence than @samp{+}; thus, @samp{a + b * c}
|
|
means to multiply @code{b} and @code{c}, and then add @code{a} to the
|
|
product (i.e.@: @samp{a + (b * c)}).
|
|
|
|
You can overrule the precedence of the operators by using parentheses.
|
|
You can think of the precedence rules as saying where the
|
|
parentheses are assumed to be if you do not write parentheses yourself. In
|
|
fact, it is wise to always use parentheses whenever you have an unusual
|
|
combination of operators, because other people who read the program may
|
|
not remember what the precedence is in this case. You might forget,
|
|
too; then you could make a mistake. Explicit parentheses will help prevent
|
|
any such mistake.
|
|
|
|
When operators of equal precedence are used together, the leftmost
|
|
operator groups first, except for the assignment, conditional and
|
|
exponentiation operators, which group in the opposite order.
|
|
Thus, @samp{a - b + c} groups as @samp{(a - b) + c}, and
|
|
@samp{a = b = c} groups as @samp{a = (b = c)}.
|
|
|
|
The precedence of prefix unary operators does not matter as long as only
|
|
unary operators are involved, because there is only one way to interpret
|
|
them---innermost first. Thus, @samp{$++i} means @samp{$(++i)} and
|
|
@samp{++$x} means @samp{++($x)}. However, when another operator follows
|
|
the operand, then the precedence of the unary operators can matter.
|
|
Thus, @samp{$x^2} means @samp{($x)^2}, but @samp{-x^2} means
|
|
@samp{-(x^2)}, because @samp{-} has lower precedence than @samp{^}
|
|
while @samp{$} has higher precedence.
|
|
|
|
Here is a table of @code{awk}'s operators, in order from highest
|
|
precedence to lowest:
|
|
|
|
@c use @code in the items, looks better in TeX w/o all the quotes
|
|
@table @code
|
|
@item (@dots{})
|
|
Grouping.
|
|
|
|
@item $
|
|
Field.
|
|
|
|
@item ++ --
|
|
Increment, decrement.
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
@item ^ **
|
|
Exponentiation. These operators group right-to-left.
|
|
(The @samp{**} operator is not specified by POSIX.)
|
|
|
|
@item + - !
|
|
Unary plus, minus, logical ``not''.
|
|
|
|
@item * / %
|
|
Multiplication, division, modulus.
|
|
|
|
@item + -
|
|
Addition, subtraction.
|
|
|
|
@item @r{Concatenation}
|
|
No special token is used to indicate concatenation.
|
|
The operands are simply written side by side.
|
|
|
|
@item < <= == !=
|
|
@itemx > >= >> |
|
|
Relational, and redirection.
|
|
The relational operators and the redirections have the same precedence
|
|
level. Characters such as @samp{>} serve both as relationals and as
|
|
redirections; the context distinguishes between the two meanings.
|
|
|
|
Note that the I/O redirection operators in @code{print} and @code{printf}
|
|
statements belong to the statement level, not to expressions. The
|
|
redirection does not produce an expression which could be the operand of
|
|
another operator. As a result, it does not make sense to use a
|
|
redirection operator near another operator of lower precedence, without
|
|
parentheses. Such combinations, for example @samp{print foo > a ? b : c},
|
|
result in syntax errors.
|
|
The correct way to write this statement is @samp{print foo > (a ? b : c)}.
|
|
|
|
@item ~ !~
|
|
Matching, non-matching.
|
|
|
|
@item in
|
|
Array membership.
|
|
|
|
@item &&
|
|
Logical ``and''.
|
|
|
|
@item ||
|
|
Logical ``or''.
|
|
|
|
@item ?:
|
|
Conditional. This operator groups right-to-left.
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
@item = += -= *=
|
|
@itemx /= %= ^= **=
|
|
Assignment. These operators group right-to-left.
|
|
(The @samp{**=} operator is not specified by POSIX.)
|
|
@end table
|
|
|
|
@node Patterns and Actions, Statements, Expressions, Top
|
|
@chapter Patterns and Actions
|
|
@cindex pattern, definition of
|
|
|
|
As you have already seen, each @code{awk} statement consists of
|
|
a pattern with an associated action. This chapter describes how
|
|
you build patterns and actions.
|
|
|
|
@menu
|
|
* Pattern Overview:: What goes into a pattern.
|
|
* Action Overview:: What goes into an action.
|
|
@end menu
|
|
|
|
@node Pattern Overview, Action Overview, Patterns and Actions, Patterns and Actions
|
|
@section Pattern Elements
|
|
|
|
Patterns in @code{awk} control the execution of rules: a rule is
|
|
executed when its pattern matches the current input record. This
|
|
section explains all about how to write patterns.
|
|
|
|
@menu
|
|
* Kinds of Patterns:: A list of all kinds of patterns.
|
|
* Regexp Patterns:: Using regexps as patterns.
|
|
* Expression Patterns:: Any expression can be used as a pattern.
|
|
* Ranges:: Pairs of patterns specify record ranges.
|
|
* BEGIN/END:: Specifying initialization and cleanup rules.
|
|
* Empty:: The empty pattern, which matches every record.
|
|
@end menu
|
|
|
|
@node Kinds of Patterns, Regexp Patterns, Pattern Overview, Pattern Overview
|
|
@subsection Kinds of Patterns
|
|
@cindex patterns, types of
|
|
|
|
Here is a summary of the types of patterns supported in @code{awk}.
|
|
|
|
@table @code
|
|
@item /@var{regular expression}/
|
|
A regular expression as a pattern. It matches when the text of the
|
|
input record fits the regular expression.
|
|
(@xref{Regexp, ,Regular Expressions}.)
|
|
|
|
@item @var{expression}
|
|
A single expression. It matches when its value
|
|
is non-zero (if a number) or non-null (if a string).
|
|
(@xref{Expression Patterns, ,Expressions as Patterns}.)
|
|
|
|
@item @var{pat1}, @var{pat2}
|
|
A pair of patterns separated by a comma, specifying a range of records.
|
|
The range includes both the initial record that matches @var{pat1}, and
|
|
the final record that matches @var{pat2}.
|
|
(@xref{Ranges, ,Specifying Record Ranges with Patterns}.)
|
|
|
|
@item BEGIN
|
|
@itemx END
|
|
Special patterns for you to supply start-up or clean-up actions for your
|
|
@code{awk} program.
|
|
(@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.)
|
|
|
|
@item @var{empty}
|
|
The empty pattern matches every input record.
|
|
(@xref{Empty, ,The Empty Pattern}.)
|
|
@end table
|
|
|
|
@node Regexp Patterns, Expression Patterns, Kinds of Patterns, Pattern Overview
|
|
@subsection Regular Expressions as Patterns
|
|
|
|
We have been using regular expressions as patterns since our early examples.
|
|
This kind of pattern is simply a regexp constant in the pattern part of
|
|
a rule. Its meaning is @samp{$0 ~ /@var{pattern}/}.
|
|
The pattern matches when the input record matches the regexp.
|
|
For example:
|
|
|
|
@example
|
|
/foo|bar|baz/ @{ buzzwords++ @}
|
|
END @{ print buzzwords, "buzzwords seen" @}
|
|
@end example
|
|
|
|
@node Expression Patterns, Ranges, Regexp Patterns, Pattern Overview
|
|
@subsection Expressions as Patterns
|
|
|
|
Any @code{awk} expression is valid as an @code{awk} pattern.
|
|
Then the pattern matches if the expression's value is non-zero (if a
|
|
number) or non-null (if a string).
|
|
|
|
The expression is reevaluated each time the rule is tested against a new
|
|
input record. If the expression uses fields such as @code{$1}, the
|
|
value depends directly on the new input record's text; otherwise, it
|
|
depends only on what has happened so far in the execution of the
|
|
@code{awk} program, but that may still be useful.
|
|
|
|
A very common kind of expression used as a pattern is the comparison
|
|
expression, using the comparison operators described in
|
|
@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
|
|
|
|
Regexp matching and non-matching are also very common expressions.
|
|
The left operand of the @samp{~} and @samp{!~} operators is a string.
|
|
The right operand is either a constant regular expression enclosed in
|
|
slashes (@code{/@var{regexp}/}), or any expression, whose string value
|
|
is used as a dynamic regular expression
|
|
(@pxref{Computed Regexps, , Using Dynamic Regexps}).
|
|
|
|
The following example prints the second field of each input record
|
|
whose first field is precisely @samp{foo}.
|
|
|
|
@example
|
|
$ awk '$1 == "foo" @{ print $2 @}' BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
(There is no output, since there is no BBS site named ``foo''.)
|
|
Contrast this with the following regular expression match, which would
|
|
accept any record with a first field that contains @samp{foo}:
|
|
|
|
@example
|
|
@group
|
|
$ awk '$1 ~ /foo/ @{ print $2 @}' BBS-list
|
|
@print{} 555-1234
|
|
@print{} 555-6699
|
|
@print{} 555-6480
|
|
@print{} 555-2127
|
|
@end group
|
|
@end example
|
|
|
|
Boolean expressions are also commonly used as patterns.
|
|
Whether the pattern
|
|
matches an input record depends on whether its subexpressions match.
|
|
|
|
For example, the following command prints all records in
|
|
@file{BBS-list} that contain both @samp{2400} and @samp{foo}.
|
|
|
|
@example
|
|
$ awk '/2400/ && /foo/' BBS-list
|
|
@print{} fooey 555-1234 2400/1200/300 B
|
|
@end example
|
|
|
|
The following command prints all records in
|
|
@file{BBS-list} that contain @emph{either} @samp{2400} or @samp{foo}, or
|
|
both.
|
|
|
|
@example
|
|
@group
|
|
$ awk '/2400/ || /foo/' BBS-list
|
|
@print{} alpo-net 555-3412 2400/1200/300 A
|
|
@print{} bites 555-1675 2400/1200/300 A
|
|
@print{} fooey 555-1234 2400/1200/300 B
|
|
@print{} foot 555-6699 1200/300 B
|
|
@print{} macfoo 555-6480 1200/300 A
|
|
@print{} sdace 555-3430 2400/1200/300 A
|
|
@print{} sabafoo 555-2127 1200/300 C
|
|
@end group
|
|
@end example
|
|
|
|
The following command prints all records in
|
|
@file{BBS-list} that do @emph{not} contain the string @samp{foo}.
|
|
|
|
@example
|
|
@group
|
|
$ awk '! /foo/' BBS-list
|
|
@print{} aardvark 555-5553 1200/300 B
|
|
@print{} alpo-net 555-3412 2400/1200/300 A
|
|
@print{} barfly 555-7685 1200/300 A
|
|
@print{} bites 555-1675 2400/1200/300 A
|
|
@print{} camelot 555-0542 300 C
|
|
@print{} core 555-2912 1200/300 C
|
|
@print{} sdace 555-3430 2400/1200/300 A
|
|
@end group
|
|
@end example
|
|
|
|
The subexpressions of a boolean operator in a pattern can be constant regular
|
|
expressions, comparisons, or any other @code{awk} expressions. Range
|
|
patterns are not expressions, so they cannot appear inside boolean
|
|
patterns. Likewise, the special patterns @code{BEGIN} and @code{END},
|
|
which never match any input record, are not expressions and cannot
|
|
appear inside boolean patterns.
|
|
|
|
A regexp constant as a pattern is also a special case of an expression
|
|
pattern. @code{/foo/} as an expression has the value one if @samp{foo}
|
|
appears in the current input record; thus, as a pattern, @code{/foo/}
|
|
matches any record containing @samp{foo}.
|
|
|
|
@node Ranges, BEGIN/END, Expression Patterns, Pattern Overview
|
|
@subsection Specifying Record Ranges with Patterns
|
|
|
|
@cindex range pattern
|
|
@cindex pattern, range
|
|
@cindex matching ranges of lines
|
|
A @dfn{range pattern} is made of two patterns separated by a comma, of
|
|
the form @samp{@var{begpat}, @var{endpat}}. It matches ranges of
|
|
consecutive input records. The first pattern, @var{begpat}, controls
|
|
where the range begins, and the second one, @var{endpat}, controls where
|
|
it ends. For example,
|
|
|
|
@example
|
|
awk '$1 == "on", $1 == "off"'
|
|
@end example
|
|
|
|
@noindent
|
|
prints every record between @samp{on}/@samp{off} pairs, inclusive.
|
|
|
|
A range pattern starts out by matching @var{begpat}
|
|
against every input record; when a record matches @var{begpat}, the
|
|
range pattern becomes @dfn{turned on}. The range pattern matches this
|
|
record. As long as it stays turned on, it automatically matches every
|
|
input record read. It also matches @var{endpat} against
|
|
every input record; when that succeeds, the range pattern is turned
|
|
off again for the following record. Then it goes back to checking
|
|
@var{begpat} against each record.
|
|
|
|
The record that turns on the range pattern and the one that turns it
|
|
off both match the range pattern. If you don't want to operate on
|
|
these records, you can write @code{if} statements in the rule's action
|
|
to distinguish them from the records you are interested in.
|
|
|
|
It is possible for a pattern to be turned both on and off by the same
|
|
record, if the record satisfies both conditions. Then the action is
|
|
executed for just that record.
|
|
|
|
For example, suppose you have text between two identical markers (say
|
|
the @samp{%} symbol) that you wish to ignore. You might try to
|
|
combine a range pattern that describes the delimited text with the
|
|
@code{next} statement
|
|
(not discussed yet, @pxref{Next Statement, , The @code{next} Statement}),
|
|
which causes @code{awk} to skip any further processing of the current
|
|
record and start over again with the next input record. Such a program
|
|
would look like this:
|
|
|
|
@example
|
|
/^%$/,/^%$/ @{ next @}
|
|
@{ print @}
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex skipping lines between markers
|
|
This program fails because the range pattern is both turned on and turned off
|
|
by the first line with just a @samp{%} on it. To accomplish this task, you
|
|
must write the program this way, using a flag:
|
|
|
|
@example
|
|
/^%$/ @{ skip = ! skip; next @}
|
|
skip == 1 @{ next @} # skip lines with `skip' set
|
|
@end example
|
|
|
|
Note that in a range pattern, the @samp{,} has the lowest precedence
|
|
(is evaluated last) of all the operators. Thus, for example, the
|
|
following program attempts to combine a range pattern with another,
|
|
simpler test.
|
|
|
|
@example
|
|
echo Yes | awk '/1/,/2/ || /Yes/'
|
|
@end example
|
|
|
|
The author of this program intended it to mean @samp{(/1/,/2/) || /Yes/}.
|
|
However, @code{awk} interprets this as @samp{/1/, (/2/ || /Yes/)}.
|
|
This cannot be changed or worked around; range patterns do not combine
|
|
with other patterns.
|
|
|
|
@node BEGIN/END, Empty, Ranges, Pattern Overview
|
|
@subsection The @code{BEGIN} and @code{END} Special Patterns
|
|
|
|
@cindex @code{BEGIN} special pattern
|
|
@cindex pattern, @code{BEGIN}
|
|
@cindex @code{END} special pattern
|
|
@cindex pattern, @code{END}
|
|
@code{BEGIN} and @code{END} are special patterns. They are not used to
|
|
match input records. Rather, they supply start-up or
|
|
clean-up actions for your @code{awk} script.
|
|
|
|
@menu
|
|
* Using BEGIN/END:: How and why to use BEGIN/END rules.
|
|
* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
|
|
@end menu
|
|
|
|
@node Using BEGIN/END, I/O And BEGIN/END, BEGIN/END, BEGIN/END
|
|
@subsubsection Startup and Cleanup Actions
|
|
|
|
A @code{BEGIN} rule is executed, once, before the first input record
|
|
has been read. An @code{END} rule is executed, once, after all the
|
|
input has been read. For example:
|
|
|
|
@example
|
|
@group
|
|
$ awk '
|
|
> BEGIN @{ print "Analysis of \"foo\"" @}
|
|
> /foo/ @{ ++n @}
|
|
> END @{ print "\"foo\" appears " n " times." @}' BBS-list
|
|
@print{} Analysis of "foo"
|
|
@print{} "foo" appears 4 times.
|
|
@end group
|
|
@end example
|
|
|
|
This program finds the number of records in the input file @file{BBS-list}
|
|
that contain the string @samp{foo}. The @code{BEGIN} rule prints a title
|
|
for the report. There is no need to use the @code{BEGIN} rule to
|
|
initialize the counter @code{n} to zero, as @code{awk} does this
|
|
automatically (@pxref{Variables}).
|
|
|
|
The second rule increments the variable @code{n} every time a
|
|
record containing the pattern @samp{foo} is read. The @code{END} rule
|
|
prints the value of @code{n} at the end of the run.
|
|
|
|
The special patterns @code{BEGIN} and @code{END} cannot be used in ranges
|
|
or with boolean operators (indeed, they cannot be used with any operators).
|
|
|
|
An @code{awk} program may have multiple @code{BEGIN} and/or @code{END}
|
|
rules. They are executed in the order they appear, all the @code{BEGIN}
|
|
rules at start-up and all the @code{END} rules at termination.
|
|
@code{BEGIN} and @code{END} rules may be intermixed with other rules.
|
|
This feature was added in the 1987 version of @code{awk}, and is included
|
|
in the POSIX standard. The original (1978) version of @code{awk}
|
|
required you to put the @code{BEGIN} rule at the beginning of the
|
|
program, and the @code{END} rule at the end, and only allowed one of
|
|
each. This is no longer required, but it is a good idea in terms of
|
|
program organization and readability.
|
|
|
|
Multiple @code{BEGIN} and @code{END} rules are useful for writing
|
|
library functions, since each library file can have its own @code{BEGIN} and/or
|
|
@code{END} rule to do its own initialization and/or cleanup. Note that
|
|
the order in which library functions are named on the command line
|
|
controls the order in which their @code{BEGIN} and @code{END} rules are
|
|
executed. Therefore you have to be careful to write such rules in
|
|
library files so that the order in which they are executed doesn't matter.
|
|
@xref{Options, ,Command Line Options}, for more information on
|
|
using library functions.
|
|
@xref{Library Functions, ,A Library of @code{awk} Functions},
|
|
for a number of useful library functions.
|
|
|
|
@cindex dark corner
|
|
If an @code{awk} program only has a @code{BEGIN} rule, and no other
|
|
rules, then the program exits after the @code{BEGIN} rule has been run.
|
|
(The original version of @code{awk} used to keep reading and ignoring input
|
|
until end of file was seen.) However, if an @code{END} rule exists,
|
|
then the input will be read, even if there are no other rules in
|
|
the program. This is necessary in case the @code{END} rule checks the
|
|
@code{FNR} and @code{NR} variables (d.c.).
|
|
|
|
@code{BEGIN} and @code{END} rules must have actions; there is no default
|
|
action for these rules since there is no current record when they run.
|
|
|
|
@node I/O And BEGIN/END, , Using BEGIN/END, BEGIN/END
|
|
@subsubsection Input/Output from @code{BEGIN} and @code{END} Rules
|
|
|
|
@cindex I/O from @code{BEGIN} and @code{END}
|
|
There are several (sometimes subtle) issues involved when doing I/O
|
|
from a @code{BEGIN} or @code{END} rule.
|
|
|
|
The first has to do with the value of @code{$0} in a @code{BEGIN}
|
|
rule. Since @code{BEGIN} rules are executed before any input is read,
|
|
there simply is no input record, and therefore no fields, when
|
|
executing @code{BEGIN} rules. References to @code{$0} and the fields
|
|
yield a null string or zero, depending upon the context. One way
|
|
to give @code{$0} a real value is to execute a @code{getline} command
|
|
without a variable (@pxref{Getline, ,Explicit Input with @code{getline}}).
|
|
Another way is to simply assign a value to it.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
The second point is similar to the first, but from the other direction.
|
|
Inside an @code{END} rule, what is the value of @code{$0} and @code{NF}?
|
|
Traditionally, due largely to implementation issues, @code{$0} and
|
|
@code{NF} were @emph{undefined} inside an @code{END} rule.
|
|
The POSIX standard specified that @code{NF} was available in an @code{END}
|
|
rule, containing the number of fields from the last input record.
|
|
Due most probably to an oversight, the standard does not say that @code{$0}
|
|
is also preserved, although logically one would think that it should be.
|
|
In fact, @code{gawk} does preserve the value of @code{$0} for use in
|
|
@code{END} rules. Be aware, however, that Unix @code{awk}, and possibly
|
|
other implementations, do not.
|
|
|
|
The third point follows from the first two. What is the meaning of
|
|
@samp{print} inside a @code{BEGIN} or @code{END} rule? The meaning is
|
|
the same as always, @samp{print $0}. If @code{$0} is the null string,
|
|
then this prints an empty line. Many long time @code{awk} programmers
|
|
use @samp{print} in @code{BEGIN} and @code{END} rules, to mean
|
|
@samp{@w{print ""}}, relying on @code{$0} being null. While you might
|
|
generally get away with this in @code{BEGIN} rules, in @code{gawk} at
|
|
least, it is a very bad idea in @code{END} rules. It is also poor
|
|
style, since if you want an empty line in the output, you
|
|
should say so explicitly in your program.
|
|
|
|
@node Empty, , BEGIN/END, Pattern Overview
|
|
@subsection The Empty Pattern
|
|
|
|
@cindex empty pattern
|
|
@cindex pattern, empty
|
|
An empty (i.e.@: non-existent) pattern is considered to match @emph{every}
|
|
input record. For example, the program:
|
|
|
|
@example
|
|
awk '@{ print $1 @}' BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
prints the first field of every record.
|
|
|
|
@node Action Overview, , Pattern Overview, Patterns and Actions
|
|
@section Overview of Actions
|
|
@cindex action, definition of
|
|
@cindex curly braces
|
|
@cindex action, curly braces
|
|
@cindex action, separating statements
|
|
|
|
An @code{awk} program or script consists of a series of
|
|
rules and function definitions, interspersed. (Functions are
|
|
described later. @xref{User-defined, ,User-defined Functions}.)
|
|
|
|
A rule contains a pattern and an action, either of which (but not
|
|
both) may be
|
|
omitted. The purpose of the @dfn{action} is to tell @code{awk} what to do
|
|
once a match for the pattern is found. Thus, in outline, an @code{awk}
|
|
program generally looks like this:
|
|
|
|
@example
|
|
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
|
|
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
|
|
@dots{}
|
|
function @var{name}(@var{args}) @{ @dots{} @}
|
|
@dots{}
|
|
@end example
|
|
|
|
An action consists of one or more @code{awk} @dfn{statements}, enclosed
|
|
in curly braces (@samp{@{} and @samp{@}}). Each statement specifies one
|
|
thing to be done. The statements are separated by newlines or
|
|
semicolons.
|
|
|
|
The curly braces around an action must be used even if the action
|
|
contains only one statement, or even if it contains no statements at
|
|
all. However, if you omit the action entirely, omit the curly braces as
|
|
well. An omitted action is equivalent to @samp{@{ print $0 @}}.
|
|
|
|
@example
|
|
/foo/ @{ @} # match foo, do nothing - empty action
|
|
/foo/ # match foo, print the record - omitted action
|
|
@end example
|
|
|
|
Here are the kinds of statements supported in @code{awk}:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Expressions, which can call functions or assign values to variables
|
|
(@pxref{Expressions}). Executing
|
|
this kind of statement simply computes the value of the expression.
|
|
This is useful when the expression has side effects
|
|
(@pxref{Assignment Ops, ,Assignment Expressions}).
|
|
|
|
@item
|
|
Control statements, which specify the control flow of @code{awk}
|
|
programs. The @code{awk} language gives you C-like constructs
|
|
(@code{if}, @code{for}, @code{while}, and @code{do}) as well as a few
|
|
special ones (@pxref{Statements, ,Control Statements in Actions}).
|
|
|
|
@item
|
|
Compound statements, which consist of one or more statements enclosed in
|
|
curly braces. A compound statement is used in order to put several
|
|
statements together in the body of an @code{if}, @code{while}, @code{do}
|
|
or @code{for} statement.
|
|
|
|
@item
|
|
Input statements, using the @code{getline} command
|
|
(@pxref{Getline, ,Explicit Input with @code{getline}}), the @code{next}
|
|
statement (@pxref{Next Statement, ,The @code{next} Statement}),
|
|
and the @code{nextfile} statement
|
|
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
|
|
|
|
@item
|
|
Output statements, @code{print} and @code{printf}.
|
|
@xref{Printing, ,Printing Output}.
|
|
|
|
@item
|
|
Deletion statements, for deleting array elements.
|
|
@xref{Delete, ,The @code{delete} Statement}.
|
|
@end itemize
|
|
|
|
@iftex
|
|
The next chapter covers control statements in detail.
|
|
@end iftex
|
|
|
|
@node Statements, Built-in Variables, Patterns and Actions, Top
|
|
@chapter Control Statements in Actions
|
|
@cindex control statement
|
|
|
|
@dfn{Control statements} such as @code{if}, @code{while}, and so on
|
|
control the flow of execution in @code{awk} programs. Most of the
|
|
control statements in @code{awk} are patterned on similar statements in
|
|
C.
|
|
|
|
All the control statements start with special keywords such as @code{if}
|
|
and @code{while}, to distinguish them from simple expressions.
|
|
|
|
@cindex compound statement
|
|
@cindex statement, compound
|
|
Many control statements contain other statements; for example, the
|
|
@code{if} statement contains another statement which may or may not be
|
|
executed. The contained statement is called the @dfn{body}. If you
|
|
want to include more than one statement in the body, group them into a
|
|
single @dfn{compound statement} with curly braces, separating them with
|
|
newlines or semicolons.
|
|
|
|
@menu
|
|
* If Statement:: Conditionally execute some @code{awk}
|
|
statements.
|
|
* While Statement:: Loop until some condition is satisfied.
|
|
* Do Statement:: Do specified action while looping until some
|
|
condition is satisfied.
|
|
* For Statement:: Another looping statement, that provides
|
|
initialization and increment clauses.
|
|
* Break Statement:: Immediately exit the innermost enclosing loop.
|
|
* Continue Statement:: Skip to the end of the innermost enclosing
|
|
loop.
|
|
* Next Statement:: Stop processing the current input record.
|
|
* Nextfile Statement:: Stop processing the current file.
|
|
* Exit Statement:: Stop execution of @code{awk}.
|
|
@end menu
|
|
|
|
@node If Statement, While Statement, Statements, Statements
|
|
@section The @code{if}-@code{else} Statement
|
|
|
|
@cindex @code{if}-@code{else} statement
|
|
The @code{if}-@code{else} statement is @code{awk}'s decision-making
|
|
statement. It looks like this:
|
|
|
|
@example
|
|
if (@var{condition}) @var{then-body} @r{[}else @var{else-body}@r{]}
|
|
@end example
|
|
|
|
@noindent
|
|
The @var{condition} is an expression that controls what the rest of the
|
|
statement will do. If @var{condition} is true, @var{then-body} is
|
|
executed; otherwise, @var{else-body} is executed.
|
|
The @code{else} part of the statement is
|
|
optional. The condition is considered false if its value is zero or
|
|
the null string, and true otherwise.
|
|
|
|
Here is an example:
|
|
|
|
@example
|
|
if (x % 2 == 0)
|
|
print "x is even"
|
|
else
|
|
print "x is odd"
|
|
@end example
|
|
|
|
In this example, if the expression @samp{x % 2 == 0} is true (that is,
|
|
the value of @code{x} is evenly divisible by two), then the first @code{print}
|
|
statement is executed, otherwise the second @code{print} statement is
|
|
executed.
|
|
|
|
If the @code{else} appears on the same line as @var{then-body}, and
|
|
@var{then-body} is not a compound statement (i.e.@: not surrounded by
|
|
curly braces), then a semicolon must separate @var{then-body} from
|
|
@code{else}. To illustrate this, let's rewrite the previous example:
|
|
|
|
@example
|
|
if (x % 2 == 0) print "x is even"; else
|
|
print "x is odd"
|
|
@end example
|
|
|
|
@noindent
|
|
If you forget the @samp{;}, @code{awk} won't be able to interpret the
|
|
statement, and you will get a syntax error.
|
|
|
|
We would not actually write this example this way, because a human
|
|
reader might fail to see the @code{else} if it were not the first thing
|
|
on its line.
|
|
|
|
@node While Statement, Do Statement, If Statement, Statements
|
|
@section The @code{while} Statement
|
|
@cindex @code{while} statement
|
|
@cindex loop
|
|
@cindex body of a loop
|
|
|
|
In programming, a @dfn{loop} means a part of a program that can
|
|
be executed two or more times in succession.
|
|
|
|
The @code{while} statement is the simplest looping statement in
|
|
@code{awk}. It repeatedly executes a statement as long as a condition is
|
|
true. It looks like this:
|
|
|
|
@example
|
|
while (@var{condition})
|
|
@var{body}
|
|
@end example
|
|
|
|
@noindent
|
|
Here @var{body} is a statement that we call the @dfn{body} of the loop,
|
|
and @var{condition} is an expression that controls how long the loop
|
|
keeps running.
|
|
|
|
The first thing the @code{while} statement does is test @var{condition}.
|
|
If @var{condition} is true, it executes the statement @var{body}.
|
|
@ifinfo
|
|
(The @var{condition} is true when the value
|
|
is not zero and not a null string.)
|
|
@end ifinfo
|
|
After @var{body} has been executed,
|
|
@var{condition} is tested again, and if it is still true, @var{body} is
|
|
executed again. This process repeats until @var{condition} is no longer
|
|
true. If @var{condition} is initially false, the body of the loop is
|
|
never executed, and @code{awk} continues with the statement following
|
|
the loop.
|
|
|
|
This example prints the first three fields of each record, one per line.
|
|
|
|
@example
|
|
awk '@{ i = 1
|
|
while (i <= 3) @{
|
|
print $i
|
|
i++
|
|
@}
|
|
@}' inventory-shipped
|
|
@end example
|
|
|
|
@noindent
|
|
Here the body of the loop is a compound statement enclosed in braces,
|
|
containing two statements.
|
|
|
|
The loop works like this: first, the value of @code{i} is set to one.
|
|
Then, the @code{while} tests whether @code{i} is less than or equal to
|
|
three. This is true when @code{i} equals one, so the @code{i}-th
|
|
field is printed. Then the @samp{i++} increments the value of @code{i}
|
|
and the loop repeats. The loop terminates when @code{i} reaches four.
|
|
|
|
As you can see, a newline is not required between the condition and the
|
|
body; but using one makes the program clearer unless the body is a
|
|
compound statement or is very simple. The newline after the open-brace
|
|
that begins the compound statement is not required either, but the
|
|
program would be harder to read without it.
|
|
|
|
@node Do Statement, For Statement, While Statement, Statements
|
|
@section The @code{do}-@code{while} Statement
|
|
|
|
The @code{do} loop is a variation of the @code{while} looping statement.
|
|
The @code{do} loop executes the @var{body} once, and then repeats @var{body}
|
|
as long as @var{condition} is true. It looks like this:
|
|
|
|
@example
|
|
@group
|
|
do
|
|
@var{body}
|
|
while (@var{condition})
|
|
@end group
|
|
@end example
|
|
|
|
Even if @var{condition} is false at the start, @var{body} is executed at
|
|
least once (and only once, unless executing @var{body} makes
|
|
@var{condition} true). Contrast this with the corresponding
|
|
@code{while} statement:
|
|
|
|
@example
|
|
while (@var{condition})
|
|
@var{body}
|
|
@end example
|
|
|
|
@noindent
|
|
This statement does not execute @var{body} even once if @var{condition}
|
|
is false to begin with.
|
|
|
|
Here is an example of a @code{do} statement:
|
|
|
|
@example
|
|
awk '@{ i = 1
|
|
do @{
|
|
print $0
|
|
i++
|
|
@} while (i <= 10)
|
|
@}'
|
|
@end example
|
|
|
|
@noindent
|
|
This program prints each input record ten times. It isn't a very
|
|
realistic example, since in this case an ordinary @code{while} would do
|
|
just as well. But this reflects actual experience; there is only
|
|
occasionally a real use for a @code{do} statement.
|
|
|
|
@node For Statement, Break Statement, Do Statement, Statements
|
|
@section The @code{for} Statement
|
|
@cindex @code{for} statement
|
|
|
|
The @code{for} statement makes it more convenient to count iterations of a
|
|
loop. The general form of the @code{for} statement looks like this:
|
|
|
|
@example
|
|
for (@var{initialization}; @var{condition}; @var{increment})
|
|
@var{body}
|
|
@end example
|
|
|
|
@noindent
|
|
The @var{initialization}, @var{condition} and @var{increment} parts are
|
|
arbitrary @code{awk} expressions, and @var{body} stands for any
|
|
@code{awk} statement.
|
|
|
|
The @code{for} statement starts by executing @var{initialization}.
|
|
Then, as long
|
|
as @var{condition} is true, it repeatedly executes @var{body} and then
|
|
@var{increment}. Typically @var{initialization} sets a variable to
|
|
either zero or one, @var{increment} adds one to it, and @var{condition}
|
|
compares it against the desired number of iterations.
|
|
|
|
Here is an example of a @code{for} statement:
|
|
|
|
@example
|
|
@group
|
|
awk '@{ for (i = 1; i <= 3; i++)
|
|
print $i
|
|
@}' inventory-shipped
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
This prints the first three fields of each input record, one field per
|
|
line.
|
|
|
|
You cannot set more than one variable in the
|
|
@var{initialization} part unless you use a multiple assignment statement
|
|
such as @samp{x = y = 0}, which is possible only if all the initial values
|
|
are equal. (But you can initialize additional variables by writing
|
|
their assignments as separate statements preceding the @code{for} loop.)
|
|
|
|
The same is true of the @var{increment} part; to increment additional
|
|
variables, you must write separate statements at the end of the loop.
|
|
The C compound expression, using C's comma operator, would be useful in
|
|
this context, but it is not supported in @code{awk}.
|
|
|
|
Most often, @var{increment} is an increment expression, as in the
|
|
example above. But this is not required; it can be any expression
|
|
whatever. For example, this statement prints all the powers of two
|
|
between one and 100:
|
|
|
|
@example
|
|
for (i = 1; i <= 100; i *= 2)
|
|
print i
|
|
@end example
|
|
|
|
Any of the three expressions in the parentheses following the @code{for} may
|
|
be omitted if there is nothing to be done there. Thus, @w{@samp{for (; x
|
|
> 0;)}} is equivalent to @w{@samp{while (x > 0)}}. If the
|
|
@var{condition} is omitted, it is treated as @var{true}, effectively
|
|
yielding an @dfn{infinite loop} (i.e.@: a loop that will never
|
|
terminate).
|
|
|
|
In most cases, a @code{for} loop is an abbreviation for a @code{while}
|
|
loop, as shown here:
|
|
|
|
@example
|
|
@var{initialization}
|
|
while (@var{condition}) @{
|
|
@var{body}
|
|
@var{increment}
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
The only exception is when the @code{continue} statement
|
|
(@pxref{Continue Statement, ,The @code{continue} Statement}) is used
|
|
inside the loop; changing a @code{for} statement to a @code{while}
|
|
statement in this way can change the effect of the @code{continue}
|
|
statement inside the loop.
|
|
|
|
There is an alternate version of the @code{for} loop, for iterating over
|
|
all the indices of an array:
|
|
|
|
@example
|
|
for (i in array)
|
|
@var{do something with} array[i]
|
|
@end example
|
|
|
|
@noindent
|
|
@xref{Scanning an Array, ,Scanning All Elements of an Array},
|
|
for more information on this version of the @code{for} loop.
|
|
|
|
The @code{awk} language has a @code{for} statement in addition to a
|
|
@code{while} statement because often a @code{for} loop is both less work to
|
|
type and more natural to think of. Counting the number of iterations is
|
|
very common in loops. It can be easier to think of this counting as part
|
|
of looping rather than as something to do inside the loop.
|
|
|
|
The next section has more complicated examples of @code{for} loops.
|
|
|
|
@node Break Statement, Continue Statement, For Statement, Statements
|
|
@section The @code{break} Statement
|
|
@cindex @code{break} statement
|
|
@cindex loops, exiting
|
|
|
|
The @code{break} statement jumps out of the innermost @code{for},
|
|
@code{while}, or @code{do} loop that encloses it. The
|
|
following example finds the smallest divisor of any integer, and also
|
|
identifies prime numbers:
|
|
|
|
@example
|
|
awk '# find smallest divisor of num
|
|
@{ num = $1
|
|
for (div = 2; div*div <= num; div++)
|
|
if (num % div == 0)
|
|
break
|
|
if (num % div == 0)
|
|
printf "Smallest divisor of %d is %d\n", num, div
|
|
else
|
|
printf "%d is prime\n", num
|
|
@}'
|
|
@end example
|
|
|
|
When the remainder is zero in the first @code{if} statement, @code{awk}
|
|
immediately @dfn{breaks out} of the containing @code{for} loop. This means
|
|
that @code{awk} proceeds immediately to the statement following the loop
|
|
and continues processing. (This is very different from the @code{exit}
|
|
statement which stops the entire @code{awk} program.
|
|
@xref{Exit Statement, ,The @code{exit} Statement}.)
|
|
|
|
Here is another program equivalent to the previous one. It illustrates how
|
|
the @var{condition} of a @code{for} or @code{while} could just as well be
|
|
replaced with a @code{break} inside an @code{if}:
|
|
|
|
@example
|
|
@group
|
|
awk '# find smallest divisor of num
|
|
@{ num = $1
|
|
for (div = 2; ; div++) @{
|
|
if (num % div == 0) @{
|
|
printf "Smallest divisor of %d is %d\n", num, div
|
|
break
|
|
@}
|
|
if (div*div > num) @{
|
|
printf "%d is prime\n", num
|
|
break
|
|
@}
|
|
@}
|
|
@}'
|
|
@end group
|
|
@end example
|
|
|
|
@cindex @code{break}, outside of loops
|
|
@cindex historical features
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
@cindex dark corner
|
|
As described above, the @code{break} statement has no meaning when
|
|
used outside the body of a loop. However, although it was never documented,
|
|
historical implementations of @code{awk} have treated the @code{break}
|
|
statement outside of a loop as if it were a @code{next} statement
|
|
(@pxref{Next Statement, ,The @code{next} Statement}).
|
|
Recent versions of Unix @code{awk} no longer allow this usage.
|
|
@code{gawk} will support this use of @code{break} only if @samp{--traditional}
|
|
has been specified on the command line
|
|
(@pxref{Options, ,Command Line Options}).
|
|
Otherwise, it will be treated as an error, since the POSIX standard
|
|
specifies that @code{break} should only be used inside the body of a
|
|
loop (d.c.).
|
|
|
|
@node Continue Statement, Next Statement, Break Statement, Statements
|
|
@section The @code{continue} Statement
|
|
|
|
@cindex @code{continue} statement
|
|
The @code{continue} statement, like @code{break}, is used only inside
|
|
@code{for}, @code{while}, and @code{do} loops. It skips
|
|
over the rest of the loop body, causing the next cycle around the loop
|
|
to begin immediately. Contrast this with @code{break}, which jumps out
|
|
of the loop altogether.
|
|
|
|
@c The point of this program was to illustrate the use of continue with
|
|
@c a while loop. But Karl Berry points out that that is done adequately
|
|
@c below, and that this example is very un-awk-like. So for now, we'll
|
|
@c omit it.
|
|
@ignore
|
|
In Texinfo source files, text that the author wishes to ignore can be
|
|
enclosed between lines that start with @samp{@@ignore} and end with
|
|
@samp{@@end ignore}. Here is a program that strips out lines between
|
|
@samp{@@ignore} and @samp{@@end ignore} pairs.
|
|
|
|
@example
|
|
BEGIN @{
|
|
while (getline > 0) @{
|
|
if (/^@@ignore/)
|
|
ignoring = 1
|
|
else if (/^@@end[ \t]+ignore/) @{
|
|
ignoring = 0
|
|
continue
|
|
@}
|
|
if (ignoring)
|
|
continue
|
|
print
|
|
@}
|
|
@}
|
|
@end example
|
|
|
|
When an @samp{@@ignore} is seen, the @code{ignoring} flag is set to one (true).
|
|
When @samp{@@end ignore} is seen, the flag is reset to zero (false). As long
|
|
as the flag is true, the input record is not printed, because the
|
|
@code{continue} restarts the @code{while} loop, skipping over the @code{print}
|
|
statement.
|
|
|
|
@c Exercise!!!
|
|
@c How could this program be written to make better use of the awk language?
|
|
@end ignore
|
|
|
|
The @code{continue} statement in a @code{for} loop directs @code{awk} to
|
|
skip the rest of the body of the loop, and resume execution with the
|
|
increment-expression of the @code{for} statement. The following program
|
|
illustrates this fact:
|
|
|
|
@example
|
|
awk 'BEGIN @{
|
|
for (x = 0; x <= 20; x++) @{
|
|
if (x == 5)
|
|
continue
|
|
printf "%d ", x
|
|
@}
|
|
print ""
|
|
@}'
|
|
@end example
|
|
|
|
@noindent
|
|
This program prints all the numbers from zero to 20, except for five, for
|
|
which the @code{printf} is skipped. Since the increment @samp{x++}
|
|
is not skipped, @code{x} does not remain stuck at five. Contrast the
|
|
@code{for} loop above with this @code{while} loop:
|
|
|
|
@example
|
|
awk 'BEGIN @{
|
|
x = 0
|
|
while (x <= 20) @{
|
|
if (x == 5)
|
|
continue
|
|
printf "%d ", x
|
|
x++
|
|
@}
|
|
print ""
|
|
@}'
|
|
@end example
|
|
|
|
@noindent
|
|
This program loops forever once @code{x} gets to five.
|
|
|
|
@cindex @code{continue}, outside of loops
|
|
@cindex historical features
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
@cindex dark corner
|
|
As described above, the @code{continue} statement has no meaning when
|
|
used outside the body of a loop. However, although it was never documented,
|
|
historical implementations of @code{awk} have treated the @code{continue}
|
|
statement outside of a loop as if it were a @code{next} statement
|
|
(@pxref{Next Statement, ,The @code{next} Statement}).
|
|
Recent versions of Unix @code{awk} no longer allow this usage.
|
|
@code{gawk} will support this use of @code{continue} only if
|
|
@samp{--traditional} has been specified on the command line
|
|
(@pxref{Options, ,Command Line Options}).
|
|
Otherwise, it will be treated as an error, since the POSIX standard
|
|
specifies that @code{continue} should only be used inside the body of a
|
|
loop (d.c.).
|
|
|
|
@node Next Statement, Nextfile Statement, Continue Statement, Statements
|
|
@section The @code{next} Statement
|
|
@cindex @code{next} statement
|
|
|
|
The @code{next} statement forces @code{awk} to immediately stop processing
|
|
the current record and go on to the next record. This means that no
|
|
further rules are executed for the current record. The rest of the
|
|
current rule's action is not executed either.
|
|
|
|
Contrast this with the effect of the @code{getline} function
|
|
(@pxref{Getline, ,Explicit Input with @code{getline}}). That too causes
|
|
@code{awk} to read the next record immediately, but it does not alter the
|
|
flow of control in any way. So the rest of the current action executes
|
|
with a new input record.
|
|
|
|
At the highest level, @code{awk} program execution is a loop that reads
|
|
an input record and then tests each rule's pattern against it. If you
|
|
think of this loop as a @code{for} statement whose body contains the
|
|
rules, then the @code{next} statement is analogous to a @code{continue}
|
|
statement: it skips to the end of the body of this implicit loop, and
|
|
executes the increment (which reads another record).
|
|
|
|
For example, if your @code{awk} program works only on records with four
|
|
fields, and you don't want it to fail when given bad input, you might
|
|
use this rule near the beginning of the program:
|
|
|
|
@example
|
|
@group
|
|
NF != 4 @{
|
|
err = sprintf("%s:%d: skipped: NF != 4\n", FILENAME, FNR)
|
|
print err > "/dev/stderr"
|
|
next
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
so that the following rules will not see the bad record. The error
|
|
message is redirected to the standard error output stream, as error
|
|
messages should be. @xref{Special Files, ,Special File Names in @code{gawk}}.
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
According to the POSIX standard, the behavior is undefined if
|
|
the @code{next} statement is used in a @code{BEGIN} or @code{END} rule.
|
|
@code{gawk} will treat it as a syntax error.
|
|
Although POSIX permits it,
|
|
some other @code{awk} implementations don't allow the @code{next}
|
|
statement inside function bodies
|
|
(@pxref{User-defined, ,User-defined Functions}).
|
|
Just as any other @code{next} statement, a @code{next} inside a
|
|
function body reads the next record and starts processing it with the
|
|
first rule in the program.
|
|
|
|
If the @code{next} statement causes the end of the input to be reached,
|
|
then the code in any @code{END} rules will be executed.
|
|
@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.
|
|
|
|
@cindex @code{next}, inside a user-defined function
|
|
@strong{Caution:} Some @code{awk} implementations generate a run-time
|
|
error if you use the @code{next} statement inside a user-defined function
|
|
(@pxref{User-defined, , User-defined Functions}).
|
|
@code{gawk} does not have this problem.
|
|
|
|
@node Nextfile Statement, Exit Statement, Next Statement, Statements
|
|
@section The @code{nextfile} Statement
|
|
@cindex @code{nextfile} statement
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
|
|
@code{gawk} provides the @code{nextfile} statement,
|
|
which is similar to the @code{next} statement.
|
|
However, instead of abandoning processing of the current record, the
|
|
@code{nextfile} statement instructs @code{gawk} to stop processing the
|
|
current data file.
|
|
|
|
Upon execution of the @code{nextfile} statement, @code{FILENAME} is
|
|
updated to the name of the next data file listed on the command line,
|
|
@code{FNR} is reset to one, @code{ARGIND} is incremented, and processing
|
|
starts over with the first rule in the progam. @xref{Built-in Variables}.
|
|
|
|
If the @code{nextfile} statement causes the end of the input to be reached,
|
|
then the code in any @code{END} rules will be executed.
|
|
@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.
|
|
|
|
The @code{nextfile} statement is a @code{gawk} extension; it is not
|
|
(currently) available in any other @code{awk} implementation.
|
|
@xref{Nextfile Function, ,Implementing @code{nextfile} as a Function},
|
|
for a user-defined function you can use to simulate the @code{nextfile}
|
|
statement.
|
|
|
|
The @code{nextfile} statement would be useful if you have many data
|
|
files to process, and you expect that you
|
|
would not want to process every record in every file.
|
|
Normally, in order to move on to
|
|
the next data file, you would have to continue scanning the unwanted
|
|
records. The @code{nextfile} statement accomplishes this much more
|
|
efficiently.
|
|
|
|
@cindex @code{next file} statement
|
|
@strong{Caution:} Versions of @code{gawk} prior to 3.0 used two
|
|
words (@samp{next file}) for the @code{nextfile} statement. This was
|
|
changed in 3.0 to one word, since the treatment of @samp{file} was
|
|
inconsistent. When it appeared after @code{next}, it was a keyword.
|
|
Otherwise, it was a regular identifier. The old usage is still
|
|
accepted. However, @code{gawk} will generate a warning message, and
|
|
support for @code{next file} will eventually be discontinued in a
|
|
future version of @code{gawk}.
|
|
|
|
@node Exit Statement, , Nextfile Statement, Statements
|
|
@section The @code{exit} Statement
|
|
|
|
@cindex @code{exit} statement
|
|
The @code{exit} statement causes @code{awk} to immediately stop
|
|
executing the current rule and to stop processing input; any remaining input
|
|
is ignored. It looks like this:
|
|
|
|
@example
|
|
exit @r{[}@var{return code}@r{]}
|
|
@end example
|
|
|
|
If an @code{exit} statement is executed from a @code{BEGIN} rule the
|
|
program stops processing everything immediately. No input records are
|
|
read. However, if an @code{END} rule is present, it is executed
|
|
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
|
|
|
|
If @code{exit} is used as part of an @code{END} rule, it causes
|
|
the program to stop immediately.
|
|
|
|
An @code{exit} statement that is not part
|
|
of a @code{BEGIN} or @code{END} rule stops the execution of any further
|
|
automatic rules for the current record, skips reading any remaining input
|
|
records, and executes
|
|
the @code{END} rule if there is one.
|
|
|
|
If you do not want the @code{END} rule to do its job in this case, you
|
|
can set a variable to non-zero before the @code{exit} statement, and check
|
|
that variable in the @code{END} rule.
|
|
@xref{Assert Function, ,Assertions},
|
|
for an example that does this.
|
|
|
|
@cindex dark corner
|
|
If an argument is supplied to @code{exit}, its value is used as the exit
|
|
status code for the @code{awk} process. If no argument is supplied,
|
|
@code{exit} returns status zero (success). In the case where an argument
|
|
is supplied to a first @code{exit} statement, and then @code{exit} is
|
|
called a second time with no argument, the previously supplied exit value
|
|
is used (d.c.).
|
|
|
|
For example, let's say you've discovered an error condition you really
|
|
don't know how to handle. Conventionally, programs report this by
|
|
exiting with a non-zero status. Your @code{awk} program can do this
|
|
using an @code{exit} statement with a non-zero argument. Here is an
|
|
example:
|
|
|
|
@example
|
|
@group
|
|
BEGIN @{
|
|
if (("date" | getline date_now) <= 0) @{
|
|
print "Can't get system date" > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
print "current date is", date_now
|
|
close("date")
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
@node Built-in Variables, Arrays, Statements, Top
|
|
@chapter Built-in Variables
|
|
@cindex built-in variables
|
|
|
|
Most @code{awk} variables are available for you to use for your own
|
|
purposes; they never change except when your program assigns values to
|
|
them, and never affect anything except when your program examines them.
|
|
However, a few variables in @code{awk} have special built-in meanings.
|
|
Some of them @code{awk} examines automatically, so that they enable you
|
|
to tell @code{awk} how to do certain things. Others are set
|
|
automatically by @code{awk}, so that they carry information from the
|
|
internal workings of @code{awk} to your program.
|
|
|
|
This chapter documents all the built-in variables of @code{gawk}. Most
|
|
of them are also documented in the chapters describing their areas of
|
|
activity.
|
|
|
|
@menu
|
|
* User-modified:: Built-in variables that you change to control
|
|
@code{awk}.
|
|
* Auto-set:: Built-in variables where @code{awk} gives you
|
|
information.
|
|
* ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}.
|
|
@end menu
|
|
|
|
@node User-modified, Auto-set, Built-in Variables, Built-in Variables
|
|
@section Built-in Variables that Control @code{awk}
|
|
@cindex built-in variables, user modifiable
|
|
|
|
This is an alphabetical list of the variables which you can change to
|
|
control how @code{awk} does certain things. Those variables that are
|
|
specific to @code{gawk} are marked with an asterisk, @samp{*}.
|
|
|
|
@table @code
|
|
@vindex CONVFMT
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
@item CONVFMT
|
|
This string controls conversion of numbers to
|
|
strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).
|
|
It works by being passed, in effect, as the first argument to the
|
|
@code{sprintf} function
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
Its default value is @code{"%.6g"}.
|
|
@code{CONVFMT} was introduced by the POSIX standard.
|
|
|
|
@vindex FIELDWIDTHS
|
|
@item FIELDWIDTHS *
|
|
This is a space separated list of columns that tells @code{gawk}
|
|
how to split input with fixed, columnar boundaries. It is an
|
|
experimental feature. Assigning to @code{FIELDWIDTHS}
|
|
overrides the use of @code{FS} for field splitting.
|
|
@xref{Constant Size, ,Reading Fixed-width Data}, for more information.
|
|
|
|
If @code{gawk} is in compatibility mode
|
|
(@pxref{Options, ,Command Line Options}), then @code{FIELDWIDTHS}
|
|
has no special meaning, and field splitting operations are done based
|
|
exclusively on the value of @code{FS}.
|
|
|
|
@vindex FS
|
|
@item FS
|
|
@code{FS} is the input field separator
|
|
(@pxref{Field Separators, ,Specifying How Fields are Separated}).
|
|
The value is a single-character string or a multi-character regular
|
|
expression that matches the separations between fields in an input
|
|
record. If the value is the null string (@code{""}), then each
|
|
character in the record becomes a separate field.
|
|
|
|
The default value is @w{@code{" "}}, a string consisting of a single
|
|
space. As a special exception, this value means that any
|
|
sequence of spaces, tabs, and/or newlines is a single separator.@footnote{In
|
|
POSIX @code{awk}, newline does not count as whitespace.} It also causes
|
|
spaces, tabs, and newlines at the beginning and end of a record to be ignored.
|
|
|
|
You can set the value of @code{FS} on the command line using the
|
|
@samp{-F} option:
|
|
|
|
@example
|
|
awk -F, '@var{program}' @var{input-files}
|
|
@end example
|
|
|
|
If @code{gawk} is using @code{FIELDWIDTHS} for field-splitting,
|
|
assigning a value to @code{FS} will cause @code{gawk} to return to
|
|
the normal, @code{FS}-based, field splitting. An easy way to do this
|
|
is to simply say @samp{FS = FS}, perhaps with an explanatory comment.
|
|
|
|
@vindex IGNORECASE
|
|
@item IGNORECASE *
|
|
If @code{IGNORECASE} is non-zero or non-null, then all string comparisons,
|
|
and all regular expression matching are case-independent. Thus, regexp
|
|
matching with @samp{~} and @samp{!~}, and the @code{gensub},
|
|
@code{gsub}, @code{index}, @code{match}, @code{split} and @code{sub}
|
|
functions, record termination with @code{RS}, and field splitting with
|
|
@code{FS} all ignore case when doing their particular regexp operations.
|
|
The value of @code{IGNORECASE} does @emph{not} affect array subscripting.
|
|
@xref{Case-sensitivity, ,Case-sensitivity in Matching}.
|
|
|
|
If @code{gawk} is in compatibility mode
|
|
(@pxref{Options, ,Command Line Options}),
|
|
then @code{IGNORECASE} has no special meaning, and string
|
|
and regexp operations are always case-sensitive.
|
|
|
|
@vindex OFMT
|
|
@item OFMT
|
|
This string controls conversion of numbers to
|
|
strings (@pxref{Conversion, ,Conversion of Strings and Numbers}) for
|
|
printing with the @code{print} statement. It works by being passed, in
|
|
effect, as the first argument to the @code{sprintf} function
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
Its default value is @code{"%.6g"}. Earlier versions of @code{awk}
|
|
also used @code{OFMT} to specify the format for converting numbers to
|
|
strings in general expressions; this is now done by @code{CONVFMT}.
|
|
|
|
@vindex OFS
|
|
@item OFS
|
|
This is the output field separator (@pxref{Output Separators}). It is
|
|
output between the fields output by a @code{print} statement. Its
|
|
default value is @w{@code{" "}}, a string consisting of a single space.
|
|
|
|
@vindex ORS
|
|
@item ORS
|
|
This is the output record separator. It is output at the end of every
|
|
@code{print} statement. Its default value is @code{"\n"}.
|
|
(@xref{Output Separators}.)
|
|
|
|
@vindex RS
|
|
@item RS
|
|
This is @code{awk}'s input record separator. Its default value is a string
|
|
containing a single newline character, which means that an input record
|
|
consists of a single line of text.
|
|
It can also be the null string, in which case records are separated by
|
|
runs of blank lines, or a regexp, in which case records are separated by
|
|
matches of the regexp in the input text.
|
|
(@xref{Records, ,How Input is Split into Records}.)
|
|
|
|
@vindex SUBSEP
|
|
@item SUBSEP
|
|
@code{SUBSEP} is the subscript separator. It has the default value of
|
|
@code{"\034"}, and is used to separate the parts of the indices of a
|
|
multi-dimensional array. Thus, the expression @code{@w{foo["A", "B"]}}
|
|
really accesses @code{foo["A\034B"]}
|
|
(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
|
|
@end table
|
|
|
|
@node Auto-set, ARGC and ARGV, User-modified, Built-in Variables
|
|
@section Built-in Variables that Convey Information
|
|
@cindex built-in variables, convey information
|
|
|
|
This is an alphabetical list of the variables that are set
|
|
automatically by @code{awk} on certain occasions in order to provide
|
|
information to your program. Those variables that are specific to
|
|
@code{gawk} are marked with an asterisk, @samp{*}.
|
|
|
|
@table @code
|
|
@vindex ARGC
|
|
@vindex ARGV
|
|
@item ARGC
|
|
@itemx ARGV
|
|
The command-line arguments available to @code{awk} programs are stored in
|
|
an array called @code{ARGV}. @code{ARGC} is the number of command-line
|
|
arguments present. @xref{Other Arguments, ,Other Command Line Arguments}.
|
|
Unlike most @code{awk} arrays,
|
|
@code{ARGV} is indexed from zero to @code{ARGC} @minus{} 1. For example:
|
|
|
|
@example
|
|
@group
|
|
$ awk 'BEGIN @{
|
|
> for (i = 0; i < ARGC; i++)
|
|
> print ARGV[i]
|
|
> @}' inventory-shipped BBS-list
|
|
@print{} awk
|
|
@print{} inventory-shipped
|
|
@print{} BBS-list
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
|
|
contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
|
|
@code{"BBS-list"}. The value of @code{ARGC} is three, one more than the
|
|
index of the last element in @code{ARGV}, since the elements are numbered
|
|
from zero.
|
|
|
|
The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing
|
|
the array from zero to @code{ARGC} @minus{} 1, are derived from the C language's
|
|
method of accessing command line arguments.
|
|
@xref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}, for information
|
|
about how @code{awk} uses these variables.
|
|
|
|
@vindex ARGIND
|
|
@item ARGIND *
|
|
The index in @code{ARGV} of the current file being processed.
|
|
Every time @code{gawk} opens a new data file for processing, it sets
|
|
@code{ARGIND} to the index in @code{ARGV} of the file name.
|
|
When @code{gawk} is processing the input files, it is always
|
|
true that @samp{FILENAME == ARGV[ARGIND]}.
|
|
|
|
This variable is useful in file processing; it allows you to tell how far
|
|
along you are in the list of data files, and to distinguish between
|
|
successive instances of the same filename on the command line.
|
|
|
|
While you can change the value of @code{ARGIND} within your @code{awk}
|
|
program, @code{gawk} will automatically set it to a new value when the
|
|
next file is opened.
|
|
|
|
This variable is a @code{gawk} extension. In other @code{awk} implementations,
|
|
or if @code{gawk} is in compatibility mode
|
|
(@pxref{Options, ,Command Line Options}),
|
|
it is not special.
|
|
|
|
@vindex ENVIRON
|
|
@item ENVIRON
|
|
An associative array that contains the values of the environment. The array
|
|
indices are the environment variable names; the values are the values of
|
|
the particular environment variables. For example,
|
|
@code{ENVIRON["HOME"]} might be @file{/home/arnold}. Changing this array
|
|
does not affect the environment passed on to any programs that
|
|
@code{awk} may spawn via redirection or the @code{system} function.
|
|
(In a future version of @code{gawk}, it may do so.)
|
|
|
|
Some operating systems may not have environment variables.
|
|
On such systems, the @code{ENVIRON} array is empty (except for
|
|
@w{@code{ENVIRON["AWKPATH"]}}).
|
|
|
|
@vindex ERRNO
|
|
@item ERRNO *
|
|
If a system error occurs either doing a redirection for @code{getline},
|
|
during a read for @code{getline}, or during a @code{close} operation,
|
|
then @code{ERRNO} will contain a string describing the error.
|
|
|
|
This variable is a @code{gawk} extension. In other @code{awk} implementations,
|
|
or if @code{gawk} is in compatibility mode
|
|
(@pxref{Options, ,Command Line Options}),
|
|
it is not special.
|
|
|
|
@cindex dark corner
|
|
@vindex FILENAME
|
|
@item FILENAME
|
|
This is the name of the file that @code{awk} is currently reading.
|
|
When no data files are listed on the command line, @code{awk} reads
|
|
from the standard input, and @code{FILENAME} is set to @code{"-"}.
|
|
@code{FILENAME} is changed each time a new file is read
|
|
(@pxref{Reading Files, ,Reading Input Files}).
|
|
Inside a @code{BEGIN} rule, the value of @code{FILENAME} is
|
|
@code{""}, since there are no input files being processed
|
|
yet.@footnote{Some early implementations of Unix @code{awk} initialized
|
|
@code{FILENAME} to @code{"-"}, even if there were data files to be
|
|
processed. This behavior was incorrect, and should not be relied
|
|
upon in your programs.} (d.c.)
|
|
|
|
@vindex FNR
|
|
@item FNR
|
|
@code{FNR} is the current record number in the current file. @code{FNR} is
|
|
incremented each time a new record is read
|
|
(@pxref{Getline, ,Explicit Input with @code{getline}}). It is reinitialized
|
|
to zero each time a new input file is started.
|
|
|
|
@vindex NF
|
|
@item NF
|
|
@code{NF} is the number of fields in the current input record.
|
|
@code{NF} is set each time a new record is read, when a new field is
|
|
created, or when @code{$0} changes (@pxref{Fields, ,Examining Fields}).
|
|
|
|
@vindex NR
|
|
@item NR
|
|
This is the number of input records @code{awk} has processed since
|
|
the beginning of the program's execution
|
|
(@pxref{Records, ,How Input is Split into Records}).
|
|
@code{NR} is set each time a new record is read.
|
|
|
|
@vindex RLENGTH
|
|
@item RLENGTH
|
|
@code{RLENGTH} is the length of the substring matched by the
|
|
@code{match} function
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
@code{RLENGTH} is set by invoking the @code{match} function. Its value
|
|
is the length of the matched string, or @minus{}1 if no match was found.
|
|
|
|
@vindex RSTART
|
|
@item RSTART
|
|
@code{RSTART} is the start-index in characters of the substring matched by the
|
|
@code{match} function
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
@code{RSTART} is set by invoking the @code{match} function. Its value
|
|
is the position of the string where the matched substring starts, or zero
|
|
if no match was found.
|
|
|
|
@vindex RT
|
|
@item RT *
|
|
@code{RT} is set each time a record is read. It contains the input text
|
|
that matched the text denoted by @code{RS}, the record separator.
|
|
|
|
This variable is a @code{gawk} extension. In other @code{awk} implementations,
|
|
or if @code{gawk} is in compatibility mode
|
|
(@pxref{Options, ,Command Line Options}),
|
|
it is not special.
|
|
@end table
|
|
|
|
@cindex dark corner
|
|
A side note about @code{NR} and @code{FNR}.
|
|
@code{awk} simply increments both of these variables
|
|
each time it reads a record, instead of setting them to the absolute
|
|
value of the number of records read. This means that your program can
|
|
change these variables, and their new values will be incremented for
|
|
each record (d.c.). For example:
|
|
|
|
@example
|
|
@group
|
|
$ echo '1
|
|
> 2
|
|
> 3
|
|
> 4' | awk 'NR == 2 @{ NR = 17 @}
|
|
> @{ print NR @}'
|
|
@print{} 1
|
|
@print{} 17
|
|
@print{} 18
|
|
@print{} 19
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
Before @code{FNR} was added to the @code{awk} language
|
|
(@pxref{V7/SVR3.1, ,Major Changes between V7 and SVR3.1}),
|
|
many @code{awk} programs used this feature to track the number of
|
|
records in a file by resetting @code{NR} to zero when @code{FILENAME}
|
|
changed.
|
|
|
|
@node ARGC and ARGV, , Auto-set, Built-in Variables
|
|
@section Using @code{ARGC} and @code{ARGV}
|
|
|
|
In @ref{Auto-set, , Built-in Variables that Convey Information},
|
|
you saw this program describing the information contained in @code{ARGC}
|
|
and @code{ARGV}:
|
|
|
|
@example
|
|
@group
|
|
$ awk 'BEGIN @{
|
|
> for (i = 0; i < ARGC; i++)
|
|
> print ARGV[i]
|
|
> @}' inventory-shipped BBS-list
|
|
@print{} awk
|
|
@print{} inventory-shipped
|
|
@print{} BBS-list
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
|
|
contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
|
|
@code{"BBS-list"}.
|
|
|
|
Notice that the @code{awk} program is not entered in @code{ARGV}. The
|
|
other special command line options, with their arguments, are also not
|
|
entered. This includes variable assignments done with the @samp{-v}
|
|
option (@pxref{Options, ,Command Line Options}).
|
|
Normal variable assignments on the command line @emph{are}
|
|
treated as arguments, and do show up in the @code{ARGV} array.
|
|
|
|
@example
|
|
$ cat showargs.awk
|
|
@print{} BEGIN @{
|
|
@print{} printf "A=%d, B=%d\n", A, B
|
|
@print{} for (i = 0; i < ARGC; i++)
|
|
@print{} printf "\tARGV[%d] = %s\n", i, ARGV[i]
|
|
@print{} @}
|
|
@print{} END @{ printf "A=%d, B=%d\n", A, B @}
|
|
$ awk -v A=1 -f showargs.awk B=2 /dev/null
|
|
@print{} A=1, B=0
|
|
@print{} ARGV[0] = awk
|
|
@print{} ARGV[1] = B=2
|
|
@print{} ARGV[2] = /dev/null
|
|
@print{} A=1, B=2
|
|
@end example
|
|
|
|
Your program can alter @code{ARGC} and the elements of @code{ARGV}.
|
|
Each time @code{awk} reaches the end of an input file, it uses the next
|
|
element of @code{ARGV} as the name of the next input file. By storing a
|
|
different string there, your program can change which files are read.
|
|
You can use @code{"-"} to represent the standard input. By storing
|
|
additional elements and incrementing @code{ARGC} you can cause
|
|
additional files to be read.
|
|
|
|
If you decrease the value of @code{ARGC}, that eliminates input files
|
|
from the end of the list. By recording the old value of @code{ARGC}
|
|
elsewhere, your program can treat the eliminated arguments as
|
|
something other than file names.
|
|
|
|
To eliminate a file from the middle of the list, store the null string
|
|
(@code{""}) into @code{ARGV} in place of the file's name. As a
|
|
special feature, @code{awk} ignores file names that have been
|
|
replaced with the null string.
|
|
You may also use the @code{delete} statement to remove elements from
|
|
@code{ARGV} (@pxref{Delete, ,The @code{delete} Statement}).
|
|
|
|
All of these actions are typically done from the @code{BEGIN} rule,
|
|
before actual processing of the input begins.
|
|
@xref{Split Program, ,Splitting a Large File Into Pieces}, and see
|
|
@ref{Tee Program, ,Duplicating Output Into Multiple Files}, for an example
|
|
of each way of removing elements from @code{ARGV}.
|
|
|
|
The following fragment processes @code{ARGV} in order to examine, and
|
|
then remove, command line options.
|
|
|
|
@example
|
|
@group
|
|
BEGIN @{
|
|
for (i = 1; i < ARGC; i++) @{
|
|
if (ARGV[i] == "-v")
|
|
verbose = 1
|
|
else if (ARGV[i] == "-d")
|
|
debug = 1
|
|
@end group
|
|
@group
|
|
else if (ARGV[i] ~ /^-?/) @{
|
|
e = sprintf("%s: unrecognized option -- %c",
|
|
ARGV[0], substr(ARGV[i], 1, ,1))
|
|
print e > "/dev/stderr"
|
|
@} else
|
|
break
|
|
delete ARGV[i]
|
|
@}
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
To actually get the options into the @code{awk} program, you have to
|
|
end the @code{awk} options with @samp{--}, and then supply your options,
|
|
like so:
|
|
|
|
@example
|
|
awk -f myprog -- -v -d file1 file2 @dots{}
|
|
@end example
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
This is not necessary in @code{gawk}: Unless @samp{--posix} has been
|
|
specified, @code{gawk} silently puts any unrecognized options into
|
|
@code{ARGV} for the @code{awk} program to deal with.
|
|
|
|
As soon as it
|
|
sees an unknown option, @code{gawk} stops looking for other options it might
|
|
otherwise recognize. The above example with @code{gawk} would be:
|
|
|
|
@example
|
|
gawk -f myprog -d -v file1 file2 @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
Since @samp{-d} is not a valid @code{gawk} option, the following @samp{-v}
|
|
is passed on to the @code{awk} program.
|
|
|
|
@node Arrays, Built-in, Built-in Variables, Top
|
|
@chapter Arrays in @code{awk}
|
|
|
|
An @dfn{array} is a table of values, called @dfn{elements}. The
|
|
elements of an array are distinguished by their indices. @dfn{Indices}
|
|
may be either numbers or strings. @code{awk} maintains a single set
|
|
of names that may be used for naming variables, arrays and functions
|
|
(@pxref{User-defined, ,User-defined Functions}).
|
|
Thus, you cannot have a variable and an array with the same name in the
|
|
same @code{awk} program.
|
|
|
|
@menu
|
|
* Array Intro:: Introduction to Arrays
|
|
* Reference to Elements:: How to examine one element of an array.
|
|
* Assigning Elements:: How to change an element of an array.
|
|
* Array Example:: Basic Example of an Array
|
|
* Scanning an Array:: A variation of the @code{for} statement. It
|
|
loops through the indices of an array's
|
|
existing elements.
|
|
* Delete:: The @code{delete} statement removes an element
|
|
from an array.
|
|
* Numeric Array Subscripts:: How to use numbers as subscripts in
|
|
@code{awk}.
|
|
* Uninitialized Subscripts:: Using Uninitialized variables as subscripts.
|
|
* Multi-dimensional:: Emulating multi-dimensional arrays in
|
|
@code{awk}.
|
|
* Multi-scanning:: Scanning multi-dimensional arrays.
|
|
@end menu
|
|
|
|
@node Array Intro, Reference to Elements, Arrays, Arrays
|
|
@section Introduction to Arrays
|
|
|
|
@cindex arrays
|
|
The @code{awk} language provides one-dimensional @dfn{arrays} for storing groups
|
|
of related strings or numbers.
|
|
|
|
Every @code{awk} array must have a name. Array names have the same
|
|
syntax as variable names; any valid variable name would also be a valid
|
|
array name. But you cannot use one name in both ways (as an array and
|
|
as a variable) in one @code{awk} program.
|
|
|
|
Arrays in @code{awk} superficially resemble arrays in other programming
|
|
languages; but there are fundamental differences. In @code{awk}, you
|
|
don't need to specify the size of an array before you start to use it.
|
|
Additionally, any number or string in @code{awk} may be used as an
|
|
array index, not just consecutive integers.
|
|
|
|
In most other languages, you have to @dfn{declare} an array and specify
|
|
how many elements or components it contains. In such languages, the
|
|
declaration causes a contiguous block of memory to be allocated for that
|
|
many elements. An index in the array usually must be a positive integer; for
|
|
example, the index zero specifies the first element in the array, which is
|
|
actually stored at the beginning of the block of memory. Index one
|
|
specifies the second element, which is stored in memory right after the
|
|
first element, and so on. It is impossible to add more elements to the
|
|
array, because it has room for only as many elements as you declared.
|
|
(Some languages allow arbitrary starting and ending indices,
|
|
e.g., @samp{15 .. 27}, but the size of the array is still fixed when
|
|
the array is declared.)
|
|
|
|
A contiguous array of four elements might look like this,
|
|
conceptually, if the element values are eight, @code{"foo"},
|
|
@code{""} and 30:
|
|
|
|
@iftex
|
|
@c from Karl Berry, much thanks for the help.
|
|
@tex
|
|
\bigskip % space above the table (about 1 linespace)
|
|
\offinterlineskip
|
|
\newdimen\width \width = 1.5cm
|
|
\newdimen\hwidth \hwidth = 4\width \advance\hwidth by 2pt % 5 * 0.4pt
|
|
\centerline{\vbox{
|
|
\halign{\strut\hfil\ignorespaces#&&\vrule#&\hbox to\width{\hfil#\unskip\hfil}\cr
|
|
\noalign{\hrule width\hwidth}
|
|
&&{\tt 8} &&{\tt "foo"} &&{\tt ""} &&{\tt 30} &&\quad value\cr
|
|
\noalign{\hrule width\hwidth}
|
|
\noalign{\smallskip}
|
|
&\omit&0&\omit &1 &\omit&2 &\omit&3 &\omit&\quad index\cr
|
|
}
|
|
}}
|
|
@end tex
|
|
@end iftex
|
|
@ifinfo
|
|
@example
|
|
+---------+---------+--------+---------+
|
|
| 8 | "foo" | "" | 30 | @r{value}
|
|
+---------+---------+--------+---------+
|
|
0 1 2 3 @r{index}
|
|
@end example
|
|
@end ifinfo
|
|
|
|
@noindent
|
|
Only the values are stored; the indices are implicit from the order of
|
|
the values. Eight is the value at index zero, because eight appears in the
|
|
position with zero elements before it.
|
|
|
|
@cindex arrays, definition of
|
|
@cindex associative arrays
|
|
@cindex arrays, associative
|
|
Arrays in @code{awk} are different: they are @dfn{associative}. This means
|
|
that each array is a collection of pairs: an index, and its corresponding
|
|
array element value:
|
|
|
|
@example
|
|
@r{Element} 4 @r{Value} 30
|
|
@r{Element} 2 @r{Value} "foo"
|
|
@r{Element} 1 @r{Value} 8
|
|
@r{Element} 3 @r{Value} ""
|
|
@end example
|
|
|
|
@noindent
|
|
We have shown the pairs in jumbled order because their order is irrelevant.
|
|
|
|
One advantage of associative arrays is that new pairs can be added
|
|
at any time. For example, suppose we add to the above array a tenth element
|
|
whose value is @w{@code{"number ten"}}. The result is this:
|
|
|
|
@example
|
|
@r{Element} 10 @r{Value} "number ten"
|
|
@r{Element} 4 @r{Value} 30
|
|
@r{Element} 2 @r{Value} "foo"
|
|
@r{Element} 1 @r{Value} 8
|
|
@r{Element} 3 @r{Value} ""
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex sparse arrays
|
|
@cindex arrays, sparse
|
|
Now the array is @dfn{sparse}, which just means some indices are missing:
|
|
it has elements 1--4 and 10, but doesn't have elements 5, 6, 7, 8, or 9.
|
|
@c ok, I should spell out the above, but ...
|
|
|
|
Another consequence of associative arrays is that the indices don't
|
|
have to be positive integers. Any number, or even a string, can be
|
|
an index. For example, here is an array which translates words from
|
|
English into French:
|
|
|
|
@example
|
|
@r{Element} "dog" @r{Value} "chien"
|
|
@r{Element} "cat" @r{Value} "chat"
|
|
@r{Element} "one" @r{Value} "un"
|
|
@r{Element} 1 @r{Value} "un"
|
|
@end example
|
|
|
|
@noindent
|
|
Here we decided to translate the number one in both spelled-out and
|
|
numeric form---thus illustrating that a single array can have both
|
|
numbers and strings as indices.
|
|
(In fact, array subscripts are always strings; this is discussed
|
|
in more detail in
|
|
@ref{Numeric Array Subscripts, ,Using Numbers to Subscript Arrays}.)
|
|
|
|
@cindex Array subscripts and @code{IGNORECASE}
|
|
@cindex @code{IGNORECASE} and array subscripts
|
|
@vindex IGNORECASE
|
|
The value of @code{IGNORECASE} has no effect upon array subscripting.
|
|
You must use the exact same string value to retrieve an array element
|
|
as you used to store it.
|
|
|
|
When @code{awk} creates an array for you, e.g., with the @code{split}
|
|
built-in function,
|
|
that array's indices are consecutive integers starting at one.
|
|
(@xref{String Functions, ,Built-in Functions for String Manipulation}.)
|
|
|
|
@node Reference to Elements, Assigning Elements, Array Intro, Arrays
|
|
@section Referring to an Array Element
|
|
@cindex array reference
|
|
@cindex element of array
|
|
@cindex reference to array
|
|
|
|
The principal way of using an array is to refer to one of its elements.
|
|
An array reference is an expression which looks like this:
|
|
|
|
@example
|
|
@var{array}[@var{index}]
|
|
@end example
|
|
|
|
@noindent
|
|
Here, @var{array} is the name of an array. The expression @var{index} is
|
|
the index of the element of the array that you want.
|
|
|
|
The value of the array reference is the current value of that array
|
|
element. For example, @code{foo[4.3]} is an expression for the element
|
|
of array @code{foo} at index @samp{4.3}.
|
|
|
|
If you refer to an array element that has no recorded value, the value
|
|
of the reference is @code{""}, the null string. This includes elements
|
|
to which you have not assigned any value, and elements that have been
|
|
deleted (@pxref{Delete, ,The @code{delete} Statement}). Such a reference
|
|
automatically creates that array element, with the null string as its value.
|
|
(In some cases, this is unfortunate, because it might waste memory inside
|
|
@code{awk}.)
|
|
|
|
@cindex arrays, presence of elements
|
|
@cindex arrays, the @code{in} operator
|
|
You can find out if an element exists in an array at a certain index with
|
|
the expression:
|
|
|
|
@example
|
|
@var{index} in @var{array}
|
|
@end example
|
|
|
|
@noindent
|
|
This expression tests whether or not the particular index exists,
|
|
without the side effect of creating that element if it is not present.
|
|
The expression has the value one (true) if @code{@var{array}[@var{index}]}
|
|
exists, and zero (false) if it does not exist.
|
|
|
|
For example, to test whether the array @code{frequencies} contains the
|
|
index @samp{2}, you could write this statement:
|
|
|
|
@example
|
|
if (2 in frequencies)
|
|
print "Subscript 2 is present."
|
|
@end example
|
|
|
|
Note that this is @emph{not} a test of whether or not the array
|
|
@code{frequencies} contains an element whose @emph{value} is two.
|
|
(There is no way to do that except to scan all the elements.) Also, this
|
|
@emph{does not} create @code{frequencies[2]}, while the following
|
|
(incorrect) alternative would do so:
|
|
|
|
@example
|
|
if (frequencies[2] != "")
|
|
print "Subscript 2 is present."
|
|
@end example
|
|
|
|
@node Assigning Elements, Array Example, Reference to Elements, Arrays
|
|
@section Assigning Array Elements
|
|
@cindex array assignment
|
|
@cindex element assignment
|
|
|
|
Array elements are lvalues: they can be assigned values just like
|
|
@code{awk} variables:
|
|
|
|
@example
|
|
@var{array}[@var{subscript}] = @var{value}
|
|
@end example
|
|
|
|
@noindent
|
|
Here @var{array} is the name of your array. The expression
|
|
@var{subscript} is the index of the element of the array that you want
|
|
to assign a value. The expression @var{value} is the value you are
|
|
assigning to that element of the array.
|
|
|
|
@node Array Example, Scanning an Array, Assigning Elements, Arrays
|
|
@section Basic Array Example
|
|
|
|
The following program takes a list of lines, each beginning with a line
|
|
number, and prints them out in order of line number. The line numbers are
|
|
not in order, however, when they are first read: they are scrambled. This
|
|
program sorts the lines by making an array using the line numbers as
|
|
subscripts. It then prints out the lines in sorted order of their numbers.
|
|
It is a very simple program, and gets confused if it encounters repeated
|
|
numbers, gaps, or lines that don't begin with a number.
|
|
|
|
@example
|
|
@c file eg/misc/arraymax.awk
|
|
@{
|
|
if ($1 > max)
|
|
max = $1
|
|
arr[$1] = $0
|
|
@}
|
|
|
|
END @{
|
|
for (x = 1; x <= max; x++)
|
|
print arr[x]
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The first rule keeps track of the largest line number seen so far;
|
|
it also stores each line into the array @code{arr}, at an index that
|
|
is the line's number.
|
|
|
|
The second rule runs after all the input has been read, to print out
|
|
all the lines.
|
|
|
|
When this program is run with the following input:
|
|
|
|
@example
|
|
@group
|
|
@c file eg/misc/arraymax.data
|
|
5 I am the Five man
|
|
2 Who are you? The new number two!
|
|
4 . . . And four on the floor
|
|
1 Who is number one?
|
|
3 I three you.
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
its output is this:
|
|
|
|
@example
|
|
1 Who is number one?
|
|
2 Who are you? The new number two!
|
|
3 I three you.
|
|
4 . . . And four on the floor
|
|
5 I am the Five man
|
|
@end example
|
|
|
|
If a line number is repeated, the last line with a given number overrides
|
|
the others.
|
|
|
|
Gaps in the line numbers can be handled with an easy improvement to the
|
|
program's @code{END} rule:
|
|
|
|
@example
|
|
END @{
|
|
for (x = 1; x <= max; x++)
|
|
if (x in arr)
|
|
print arr[x]
|
|
@}
|
|
@end example
|
|
|
|
@node Scanning an Array, Delete, Array Example, Arrays
|
|
@section Scanning All Elements of an Array
|
|
@cindex @code{for (x in @dots{})}
|
|
@cindex arrays, special @code{for} statement
|
|
@cindex scanning an array
|
|
|
|
In programs that use arrays, you often need a loop that executes
|
|
once for each element of an array. In other languages, where arrays are
|
|
contiguous and indices are limited to positive integers, this is
|
|
easy: you can
|
|
find all the valid indices by counting from the lowest index
|
|
up to the highest. This
|
|
technique won't do the job in @code{awk}, since any number or string
|
|
can be an array index. So @code{awk} has a special kind of @code{for}
|
|
statement for scanning an array:
|
|
|
|
@example
|
|
for (@var{var} in @var{array})
|
|
@var{body}
|
|
@end example
|
|
|
|
@noindent
|
|
This loop executes @var{body} once for each index in @var{array} that your
|
|
program has previously used, with the
|
|
variable @var{var} set to that index.
|
|
|
|
Here is a program that uses this form of the @code{for} statement. The
|
|
first rule scans the input records and notes which words appear (at
|
|
least once) in the input, by storing a one into the array @code{used} with
|
|
the word as index. The second rule scans the elements of @code{used} to
|
|
find all the distinct words that appear in the input. It prints each
|
|
word that is more than 10 characters long, and also prints the number of
|
|
such words. @xref{String Functions, ,Built-in Functions for String Manipulation}, for more information
|
|
on the built-in function @code{length}.
|
|
|
|
@example
|
|
# Record a 1 for each word that is used at least once.
|
|
@{
|
|
for (i = 1; i <= NF; i++)
|
|
used[$i] = 1
|
|
@}
|
|
|
|
# Find number of distinct words more than 10 characters long.
|
|
END @{
|
|
for (x in used)
|
|
if (length(x) > 10) @{
|
|
++num_long_words
|
|
print x
|
|
@}
|
|
print num_long_words, "words longer than 10 characters"
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
@xref{Word Sorting, ,Generating Word Usage Counts},
|
|
for a more detailed example of this type.
|
|
|
|
The order in which elements of the array are accessed by this statement
|
|
is determined by the internal arrangement of the array elements within
|
|
@code{awk} and cannot be controlled or changed. This can lead to
|
|
problems if new elements are added to @var{array} by statements in
|
|
the loop body; you cannot predict whether or not the @code{for} loop will
|
|
reach them. Similarly, changing @var{var} inside the loop may produce
|
|
strange results. It is best to avoid such things.
|
|
|
|
@node Delete, Numeric Array Subscripts, Scanning an Array, Arrays
|
|
@section The @code{delete} Statement
|
|
@cindex @code{delete} statement
|
|
@cindex deleting elements of arrays
|
|
@cindex removing elements of arrays
|
|
@cindex arrays, deleting an element
|
|
|
|
You can remove an individual element of an array using the @code{delete}
|
|
statement:
|
|
|
|
@example
|
|
delete @var{array}[@var{index}]
|
|
@end example
|
|
|
|
Once you have deleted an array element, you can no longer obtain any
|
|
value the element once had. It is as if you had never referred
|
|
to it and had never given it any value.
|
|
|
|
Here is an example of deleting elements in an array:
|
|
|
|
@example
|
|
for (i in frequencies)
|
|
delete frequencies[i]
|
|
@end example
|
|
|
|
@noindent
|
|
This example removes all the elements from the array @code{frequencies}.
|
|
|
|
If you delete an element, a subsequent @code{for} statement to scan the array
|
|
will not report that element, and the @code{in} operator to check for
|
|
the presence of that element will return zero (i.e.@: false):
|
|
|
|
@example
|
|
delete foo[4]
|
|
if (4 in foo)
|
|
print "This will never be printed"
|
|
@end example
|
|
|
|
It is important to note that deleting an element is @emph{not} the
|
|
same as assigning it a null value (the empty string, @code{""}).
|
|
|
|
@example
|
|
foo[4] = ""
|
|
if (4 in foo)
|
|
print "This is printed, even though foo[4] is empty"
|
|
@end example
|
|
|
|
It is not an error to delete an element that does not exist.
|
|
|
|
@cindex arrays, deleting entire contents
|
|
@cindex deleting entire arrays
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
You can delete all the elements of an array with a single statement,
|
|
by leaving off the subscript in the @code{delete} statement.
|
|
|
|
@example
|
|
delete @var{array}
|
|
@end example
|
|
|
|
This ability is a @code{gawk} extension; it is not available in
|
|
compatibility mode (@pxref{Options, ,Command Line Options}).
|
|
|
|
Using this version of the @code{delete} statement is about three times
|
|
more efficient than the equivalent loop that deletes each element one
|
|
at a time.
|
|
|
|
@cindex portability issues
|
|
The following statement provides a portable, but non-obvious way to clear
|
|
out an array.
|
|
|
|
@cindex Brennan, Michael
|
|
@example
|
|
@group
|
|
# thanks to Michael Brennan for pointing this out
|
|
split("", array)
|
|
@end group
|
|
@end example
|
|
|
|
The @code{split} function
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation})
|
|
clears out the target array first. This call asks it to split
|
|
apart the null string. Since there is no data to split out, the
|
|
function simply clears the array and then returns.
|
|
|
|
@strong{Caution:} Deleting an array does not change its type; you cannot
|
|
delete an array and then use the array's name as a scalar. For
|
|
example, this will not work:
|
|
|
|
@example
|
|
a[1] = 3; delete a; a = 3
|
|
@end example
|
|
|
|
@node Numeric Array Subscripts, Uninitialized Subscripts, Delete, Arrays
|
|
@section Using Numbers to Subscript Arrays
|
|
|
|
An important aspect of arrays to remember is that @emph{array subscripts
|
|
are always strings}. If you use a numeric value as a subscript,
|
|
it will be converted to a string value before it is used for subscripting
|
|
(@pxref{Conversion, ,Conversion of Strings and Numbers}).
|
|
|
|
@cindex conversions, during subscripting
|
|
@cindex numbers, used as subscripts
|
|
@vindex CONVFMT
|
|
This means that the value of the built-in variable @code{CONVFMT} can potentially
|
|
affect how your program accesses elements of an array. For example:
|
|
|
|
@example
|
|
xyz = 12.153
|
|
data[xyz] = 1
|
|
CONVFMT = "%2.2f"
|
|
@group
|
|
if (xyz in data)
|
|
printf "%s is in data\n", xyz
|
|
else
|
|
printf "%s is not in data\n", xyz
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
This prints @samp{12.15 is not in data}. The first statement gives
|
|
@code{xyz} a numeric value. Assigning to
|
|
@code{data[xyz]} subscripts @code{data} with the string value @code{"12.153"}
|
|
(using the default conversion value of @code{CONVFMT}, @code{"%.6g"}),
|
|
and assigns one to @code{data["12.153"]}. The program then changes
|
|
the value of @code{CONVFMT}. The test @samp{(xyz in data)} generates a new
|
|
string value from @code{xyz}, this time @code{"12.15"}, since the value of
|
|
@code{CONVFMT} only allows two significant digits. This test fails,
|
|
since @code{"12.15"} is a different string from @code{"12.153"}.
|
|
|
|
According to the rules for conversions
|
|
(@pxref{Conversion, ,Conversion of Strings and Numbers}), integer
|
|
values are always converted to strings as integers, no matter what the
|
|
value of @code{CONVFMT} may happen to be. So the usual case of:
|
|
|
|
@example
|
|
for (i = 1; i <= maxsub; i++)
|
|
@i{do something with} array[i]
|
|
@end example
|
|
|
|
@noindent
|
|
will work, no matter what the value of @code{CONVFMT}.
|
|
|
|
Like many things in @code{awk}, the majority of the time things work
|
|
as you would expect them to work. But it is useful to have a precise
|
|
knowledge of the actual rules, since sometimes they can have a subtle
|
|
effect on your programs.
|
|
|
|
@node Uninitialized Subscripts, Multi-dimensional, Numeric Array Subscripts, Arrays
|
|
@section Using Uninitialized Variables as Subscripts
|
|
|
|
@cindex uninitialized variables, as array subscripts
|
|
@cindex array subscripts, uninitialized variables
|
|
Suppose you want to print your input data in reverse order.
|
|
A reasonable attempt at a program to do so (with some test
|
|
data) might look like this:
|
|
|
|
@example
|
|
@group
|
|
$ echo 'line 1
|
|
> line 2
|
|
> line 3' | awk '@{ l[lines] = $0; ++lines @}
|
|
> END @{
|
|
> for (i = lines-1; i >= 0; --i)
|
|
> print l[i]
|
|
> @}'
|
|
@print{} line 3
|
|
@print{} line 2
|
|
@end group
|
|
@end example
|
|
|
|
Unfortunately, the very first line of input data did not come out in the
|
|
output!
|
|
|
|
At first glance, this program should have worked. The variable @code{lines}
|
|
is uninitialized, and uninitialized variables have the numeric value zero.
|
|
So, the value of @code{l[0]} should have been printed.
|
|
|
|
The issue here is that subscripts for @code{awk} arrays are @strong{always}
|
|
strings. And uninitialized variables, when used as strings, have the
|
|
value @code{""}, not zero. Thus, @samp{line 1} ended up stored in
|
|
@code{l[""]}.
|
|
|
|
The following version of the program works correctly:
|
|
|
|
@example
|
|
@{ l[lines++] = $0 @}
|
|
END @{
|
|
for (i = lines - 1; i >= 0; --i)
|
|
print l[i]
|
|
@}
|
|
@end example
|
|
|
|
Here, the @samp{++} forces @code{lines} to be numeric, thus making
|
|
the ``old value'' numeric zero, which is then converted to @code{"0"}
|
|
as the array subscript.
|
|
|
|
@cindex null string, as array subscript
|
|
@cindex dark corner
|
|
As we have just seen, even though it is somewhat unusual, the null string
|
|
(@code{""}) is a valid array subscript (d.c.). If @samp{--lint} is provided
|
|
on the command line (@pxref{Options, ,Command Line Options}),
|
|
@code{gawk} will warn about the use of the null string as a subscript.
|
|
|
|
@node Multi-dimensional, Multi-scanning, Uninitialized Subscripts, Arrays
|
|
@section Multi-dimensional Arrays
|
|
|
|
@cindex subscripts in arrays
|
|
@cindex arrays, multi-dimensional subscripts
|
|
@cindex multi-dimensional subscripts
|
|
A multi-dimensional array is an array in which an element is identified
|
|
by a sequence of indices, instead of a single index. For example, a
|
|
two-dimensional array requires two indices. The usual way (in most
|
|
languages, including @code{awk}) to refer to an element of a
|
|
two-dimensional array named @code{grid} is with
|
|
@code{grid[@var{x},@var{y}]}.
|
|
|
|
@vindex SUBSEP
|
|
Multi-dimensional arrays are supported in @code{awk} through
|
|
concatenation of indices into one string. What happens is that
|
|
@code{awk} converts the indices into strings
|
|
(@pxref{Conversion, ,Conversion of Strings and Numbers}) and
|
|
concatenates them together, with a separator between them. This creates
|
|
a single string that describes the values of the separate indices. The
|
|
combined string is used as a single index into an ordinary,
|
|
one-dimensional array. The separator used is the value of the built-in
|
|
variable @code{SUBSEP}.
|
|
|
|
For example, suppose we evaluate the expression @samp{foo[5,12] = "value"}
|
|
when the value of @code{SUBSEP} is @code{"@@"}. The numbers five and 12 are
|
|
converted to strings and
|
|
concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus,
|
|
the array element @code{foo["5@@12"]} is set to @code{"value"}.
|
|
|
|
Once the element's value is stored, @code{awk} has no record of whether
|
|
it was stored with a single index or a sequence of indices. The two
|
|
expressions @samp{foo[5,12]} and @w{@samp{foo[5 SUBSEP 12]}} are always
|
|
equivalent.
|
|
|
|
The default value of @code{SUBSEP} is the string @code{"\034"},
|
|
which contains a non-printing character that is unlikely to appear in an
|
|
@code{awk} program or in most input data.
|
|
|
|
The usefulness of choosing an unlikely character comes from the fact
|
|
that index values that contain a string matching @code{SUBSEP} lead to
|
|
combined strings that are ambiguous. Suppose that @code{SUBSEP} were
|
|
@code{"@@"}; then @w{@samp{foo["a@@b", "c"]}} and @w{@samp{foo["a",
|
|
"b@@c"]}} would be indistinguishable because both would actually be
|
|
stored as @samp{foo["a@@b@@c"]}.
|
|
|
|
You can test whether a particular index-sequence exists in a
|
|
``multi-dimensional'' array with the same operator @samp{in} used for single
|
|
dimensional arrays. Instead of a single index as the left-hand operand,
|
|
write the whole sequence of indices, separated by commas, in
|
|
parentheses:
|
|
|
|
@example
|
|
(@var{subscript1}, @var{subscript2}, @dots{}) in @var{array}
|
|
@end example
|
|
|
|
The following example treats its input as a two-dimensional array of
|
|
fields; it rotates this array 90 degrees clockwise and prints the
|
|
result. It assumes that all lines have the same number of
|
|
elements.
|
|
|
|
@example
|
|
@group
|
|
awk '@{
|
|
if (max_nf < NF)
|
|
max_nf = NF
|
|
max_nr = NR
|
|
for (x = 1; x <= NF; x++)
|
|
vector[x, NR] = $x
|
|
@}
|
|
@end group
|
|
|
|
@group
|
|
END @{
|
|
for (x = 1; x <= max_nf; x++) @{
|
|
for (y = max_nr; y >= 1; --y)
|
|
printf("%s ", vector[x, y])
|
|
printf("\n")
|
|
@}
|
|
@}'
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
When given the input:
|
|
|
|
@example
|
|
@group
|
|
1 2 3 4 5 6
|
|
2 3 4 5 6 1
|
|
3 4 5 6 1 2
|
|
4 5 6 1 2 3
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
it produces:
|
|
|
|
@example
|
|
@group
|
|
4 3 2 1
|
|
5 4 3 2
|
|
6 5 4 3
|
|
1 6 5 4
|
|
2 1 6 5
|
|
3 2 1 6
|
|
@end group
|
|
@end example
|
|
|
|
@node Multi-scanning, , Multi-dimensional, Arrays
|
|
@section Scanning Multi-dimensional Arrays
|
|
|
|
There is no special @code{for} statement for scanning a
|
|
``multi-dimensional'' array; there cannot be one, because in truth there
|
|
are no multi-dimensional arrays or elements; there is only a
|
|
multi-dimensional @emph{way of accessing} an array.
|
|
|
|
However, if your program has an array that is always accessed as
|
|
multi-dimensional, you can get the effect of scanning it by combining
|
|
the scanning @code{for} statement
|
|
(@pxref{Scanning an Array, ,Scanning All Elements of an Array}) with the
|
|
@code{split} built-in function
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
It works like this:
|
|
|
|
@example
|
|
for (combined in array) @{
|
|
split(combined, separate, SUBSEP)
|
|
@dots{}
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
This sets @code{combined} to
|
|
each concatenated, combined index in the array, and splits it
|
|
into the individual indices by breaking it apart where the value of
|
|
@code{SUBSEP} appears. The split-out indices become the elements of
|
|
the array @code{separate}.
|
|
|
|
Thus, suppose you have previously stored a value in @code{array[1, "foo"]};
|
|
then an element with index @code{"1\034foo"} exists in
|
|
@code{array}. (Recall that the default value of @code{SUBSEP} is
|
|
the character with code 034.) Sooner or later the @code{for} statement
|
|
will find that index and do an iteration with @code{combined} set to
|
|
@code{"1\034foo"}. Then the @code{split} function is called as
|
|
follows:
|
|
|
|
@example
|
|
split("1\034foo", separate, "\034")
|
|
@end example
|
|
|
|
@noindent
|
|
The result of this is to set @code{separate[1]} to @code{"1"} and
|
|
@code{separate[2]} to @code{"foo"}. Presto, the original sequence of
|
|
separate indices has been recovered.
|
|
|
|
@node Built-in, User-defined, Arrays, Top
|
|
@chapter Built-in Functions
|
|
|
|
@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
|
|
@cindex built-in functions
|
|
@dfn{Built-in} functions are functions that are always available for
|
|
your @code{awk} program to call. This chapter defines all the built-in
|
|
functions in @code{awk}; some of them are mentioned in other sections,
|
|
but they are summarized here for your convenience. (You can also define
|
|
new functions yourself. @xref{User-defined, ,User-defined Functions}.)
|
|
|
|
@menu
|
|
* Calling Built-in:: How to call built-in functions.
|
|
* Numeric Functions:: Functions that work with numbers, including
|
|
@code{int}, @code{sin} and @code{rand}.
|
|
* String Functions:: Functions for string manipulation, such as
|
|
@code{split}, @code{match}, and
|
|
@code{sprintf}.
|
|
* I/O Functions:: Functions for files and shell commands.
|
|
* Time Functions:: Functions for dealing with time stamps.
|
|
@end menu
|
|
|
|
@node Calling Built-in, Numeric Functions, Built-in, Built-in
|
|
@section Calling Built-in Functions
|
|
|
|
To call a built-in function, write the name of the function followed
|
|
by arguments in parentheses. For example, @samp{atan2(y + z, 1)}
|
|
is a call to the function @code{atan2}, with two arguments.
|
|
|
|
Whitespace is ignored between the built-in function name and the
|
|
open-parenthesis, but we recommend that you avoid using whitespace
|
|
there. User-defined functions do not permit whitespace in this way, and
|
|
you will find it easier to avoid mistakes by following a simple
|
|
convention which always works: no whitespace after a function name.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
Each built-in function accepts a certain number of arguments.
|
|
In some cases, arguments can be omitted. The defaults for omitted
|
|
arguments vary from function to function and are described under the
|
|
individual functions. In some @code{awk} implementations, extra
|
|
arguments given to built-in functions are ignored. However, in @code{gawk},
|
|
it is a fatal error to give extra arguments to a built-in function.
|
|
|
|
When a function is called, expressions that create the function's actual
|
|
parameters are evaluated completely before the function call is performed.
|
|
For example, in the code fragment:
|
|
|
|
@example
|
|
i = 4
|
|
j = sqrt(i++)
|
|
@end example
|
|
|
|
@noindent
|
|
the variable @code{i} is set to five before @code{sqrt} is called
|
|
with a value of four for its actual parameter.
|
|
|
|
@cindex evaluation, order of
|
|
@cindex order of evaluation
|
|
The order of evaluation of the expressions used for the function's
|
|
parameters is undefined. Thus, you should not write programs that
|
|
assume that parameters are evaluated from left to right or from
|
|
right to left. For example,
|
|
|
|
@example
|
|
i = 5
|
|
j = atan2(i++, i *= 2)
|
|
@end example
|
|
|
|
If the order of evaluation is left to right, then @code{i} first becomes
|
|
six, and then 12, and @code{atan2} is called with the two arguments six
|
|
and 12. But if the order of evaluation is right to left, @code{i}
|
|
first becomes 10, and then 11, and @code{atan2} is called with the
|
|
two arguments 11 and 10.
|
|
|
|
@node Numeric Functions, String Functions, Calling Built-in, Built-in
|
|
@section Numeric Built-in Functions
|
|
|
|
Here is a full list of built-in functions that work with numbers.
|
|
Optional parameters are enclosed in square brackets (``['' and ``]'').
|
|
|
|
@table @code
|
|
@item int(@var{x})
|
|
@findex int
|
|
This produces the nearest integer to @var{x}, located between @var{x} and zero,
|
|
truncated toward zero.
|
|
|
|
For example, @code{int(3)} is three, @code{int(3.9)} is three, @code{int(-3.9)}
|
|
is @minus{}3, and @code{int(-3)} is @minus{}3 as well.
|
|
|
|
@item sqrt(@var{x})
|
|
@findex sqrt
|
|
This gives you the positive square root of @var{x}. It reports an error
|
|
if @var{x} is negative. Thus, @code{sqrt(4)} is two.
|
|
|
|
@item exp(@var{x})
|
|
@findex exp
|
|
This gives you the exponential of @var{x} (@code{e ^ @var{x}}), or reports
|
|
an error if @var{x} is out of range. The range of values @var{x} can have
|
|
depends on your machine's floating point representation.
|
|
|
|
@item log(@var{x})
|
|
@findex log
|
|
This gives you the natural logarithm of @var{x}, if @var{x} is positive;
|
|
otherwise, it reports an error.
|
|
|
|
@item sin(@var{x})
|
|
@findex sin
|
|
This gives you the sine of @var{x}, with @var{x} in radians.
|
|
|
|
@item cos(@var{x})
|
|
@findex cos
|
|
This gives you the cosine of @var{x}, with @var{x} in radians.
|
|
|
|
@item atan2(@var{y}, @var{x})
|
|
@findex atan2
|
|
This gives you the arctangent of @code{@var{y} / @var{x}} in radians.
|
|
|
|
@item rand()
|
|
@findex rand
|
|
This gives you a random number. The values of @code{rand} are
|
|
uniformly-distributed between zero and one.
|
|
The value is never zero and never one.
|
|
|
|
Often you want random integers instead. Here is a user-defined function
|
|
you can use to obtain a random non-negative integer less than @var{n}:
|
|
|
|
@example
|
|
function randint(n) @{
|
|
return int(n * rand())
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
The multiplication produces a random real number greater than zero and less
|
|
than @code{n}. We then make it an integer (using @code{int}) between zero
|
|
and @code{n} @minus{} 1, inclusive.
|
|
|
|
Here is an example where a similar function is used to produce
|
|
random integers between one and @var{n}. This program
|
|
prints a new random number for each input record.
|
|
|
|
@example
|
|
@group
|
|
awk '
|
|
# Function to roll a simulated die.
|
|
function roll(n) @{ return 1 + int(rand() * n) @}
|
|
@end group
|
|
|
|
@group
|
|
# Roll 3 six-sided dice and
|
|
# print total number of points.
|
|
@{
|
|
printf("%d points\n",
|
|
roll(6)+roll(6)+roll(6))
|
|
@}'
|
|
@end group
|
|
@end example
|
|
|
|
@cindex seed for random numbers
|
|
@cindex random numbers, seed of
|
|
@comment MAWK uses a different seed each time.
|
|
@strong{Caution:} In most @code{awk} implementations, including @code{gawk},
|
|
@code{rand} starts generating numbers from the same
|
|
starting number, or @dfn{seed}, each time you run @code{awk}. Thus,
|
|
a program will generate the same results each time you run it.
|
|
The numbers are random within one @code{awk} run, but predictable
|
|
from run to run. This is convenient for debugging, but if you want
|
|
a program to do different things each time it is used, you must change
|
|
the seed to a value that will be different in each run. To do this,
|
|
use @code{srand}.
|
|
|
|
@item srand(@r{[}@var{x}@r{]})
|
|
@findex srand
|
|
The function @code{srand} sets the starting point, or seed,
|
|
for generating random numbers to the value @var{x}.
|
|
|
|
Each seed value leads to a particular sequence of random
|
|
numbers.@footnote{Computer generated random numbers really are not truly
|
|
random. They are technically known as ``pseudo-random.'' This means
|
|
that while the numbers in a sequence appear to be random, you can in
|
|
fact generate the same sequence of random numbers over and over again.}
|
|
Thus, if you set the seed to the same value a second time, you will get
|
|
the same sequence of random numbers again.
|
|
|
|
If you omit the argument @var{x}, as in @code{srand()}, then the current
|
|
date and time of day are used for a seed. This is the way to get random
|
|
numbers that are truly unpredictable.
|
|
|
|
The return value of @code{srand} is the previous seed. This makes it
|
|
easy to keep track of the seeds for use in consistently reproducing
|
|
sequences of random numbers.
|
|
@end table
|
|
|
|
@node String Functions, I/O Functions, Numeric Functions, Built-in
|
|
@section Built-in Functions for String Manipulation
|
|
|
|
The functions in this section look at or change the text of one or more
|
|
strings.
|
|
Optional parameters are enclosed in square brackets (``['' and ``]'').
|
|
|
|
@table @code
|
|
@item index(@var{in}, @var{find})
|
|
@findex index
|
|
This searches the string @var{in} for the first occurrence of the string
|
|
@var{find}, and returns the position in characters where that occurrence
|
|
begins in the string @var{in}. For example:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{ print index("peanut", "an") @}'
|
|
@print{} 3
|
|
@end example
|
|
|
|
@noindent
|
|
If @var{find} is not found, @code{index} returns zero.
|
|
(Remember that string indices in @code{awk} start at one.)
|
|
|
|
@item length(@r{[}@var{string}@r{]})
|
|
@findex length
|
|
This gives you the number of characters in @var{string}. If
|
|
@var{string} is a number, the length of the digit string representing
|
|
that number is returned. For example, @code{length("abcde")} is five. By
|
|
contrast, @code{length(15 * 35)} works out to three. How? Well, 15 * 35 =
|
|
525, and 525 is then converted to the string @code{"525"}, which has
|
|
three characters.
|
|
|
|
If no argument is supplied, @code{length} returns the length of @code{$0}.
|
|
|
|
@cindex historical features
|
|
@cindex portability issues
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
In older versions of @code{awk}, you could call the @code{length} function
|
|
without any parentheses. Doing so is marked as ``deprecated'' in the
|
|
POSIX standard. This means that while you can do this in your
|
|
programs, it is a feature that can eventually be removed from a future
|
|
version of the standard. Therefore, for maximal portability of your
|
|
@code{awk} programs, you should always supply the parentheses.
|
|
|
|
@item match(@var{string}, @var{regexp})
|
|
@findex match
|
|
The @code{match} function searches the string, @var{string}, for the
|
|
longest, leftmost substring matched by the regular expression,
|
|
@var{regexp}. It returns the character position, or @dfn{index}, of
|
|
where that substring begins (one, if it starts at the beginning of
|
|
@var{string}). If no match is found, it returns zero.
|
|
|
|
@vindex RSTART
|
|
@vindex RLENGTH
|
|
The @code{match} function sets the built-in variable @code{RSTART} to
|
|
the index. It also sets the built-in variable @code{RLENGTH} to the
|
|
length in characters of the matched substring. If no match is found,
|
|
@code{RSTART} is set to zero, and @code{RLENGTH} to @minus{}1.
|
|
|
|
For example:
|
|
|
|
@example
|
|
@group
|
|
@c file eg/misc/findpat.sh
|
|
awk '@{
|
|
if ($1 == "FIND")
|
|
regex = $2
|
|
else @{
|
|
where = match($0, regex)
|
|
if (where != 0)
|
|
print "Match of", regex, "found at", \
|
|
where, "in", $0
|
|
@}
|
|
@}'
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
This program looks for lines that match the regular expression stored in
|
|
the variable @code{regex}. This regular expression can be changed. If the
|
|
first word on a line is @samp{FIND}, @code{regex} is changed to be the
|
|
second word on that line. Therefore, given:
|
|
|
|
@example
|
|
@c file eg/misc/findpat.data
|
|
FIND ru+n
|
|
My program runs
|
|
but not very quickly
|
|
FIND Melvin
|
|
JF+KM
|
|
This line is property of Reality Engineering Co.
|
|
Melvin was here.
|
|
@c endfile
|
|
@end example
|
|
|
|
@noindent
|
|
@code{awk} prints:
|
|
|
|
@example
|
|
Match of ru+n found at 12 in My program runs
|
|
Match of Melvin found at 1 in Melvin was here.
|
|
@end example
|
|
|
|
@item split(@var{string}, @var{array} @r{[}, @var{fieldsep}@r{]})
|
|
@findex split
|
|
This divides @var{string} into pieces separated by @var{fieldsep},
|
|
and stores the pieces in @var{array}. The first piece is stored in
|
|
@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
|
|
forth. The string value of the third argument, @var{fieldsep}, is
|
|
a regexp describing where to split @var{string} (much as @code{FS} can
|
|
be a regexp describing where to split input records). If
|
|
the @var{fieldsep} is omitted, the value of @code{FS} is used.
|
|
@code{split} returns the number of elements created.
|
|
|
|
The @code{split} function splits strings into pieces in a
|
|
manner similar to the way input lines are split into fields. For example:
|
|
|
|
@example
|
|
split("cul-de-sac", a, "-")
|
|
@end example
|
|
|
|
@noindent
|
|
splits the string @samp{cul-de-sac} into three fields using @samp{-} as the
|
|
separator. It sets the contents of the array @code{a} as follows:
|
|
|
|
@example
|
|
a[1] = "cul"
|
|
a[2] = "de"
|
|
a[3] = "sac"
|
|
@end example
|
|
|
|
@noindent
|
|
The value returned by this call to @code{split} is three.
|
|
|
|
As with input field-splitting, when the value of @var{fieldsep} is
|
|
@w{@code{" "}}, leading and trailing whitespace is ignored, and the elements
|
|
are separated by runs of whitespace.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
Also as with input field-splitting, if @var{fieldsep} is the null string, each
|
|
individual character in the string is split into its own array element.
|
|
(This is a @code{gawk}-specific extension.)
|
|
|
|
@cindex dark corner
|
|
Recent implementations of @code{awk}, including @code{gawk}, allow
|
|
the third argument to be a regexp constant (@code{/abc/}), as well as a
|
|
string (d.c.). The POSIX standard allows this as well.
|
|
|
|
Before splitting the string, @code{split} deletes any previously existing
|
|
elements in the array @var{array} (d.c.).
|
|
|
|
If @var{string} does not match @var{fieldsep} at all, @var{array} will have
|
|
one element. The value of that element will be the original
|
|
@var{string}.
|
|
|
|
@item sprintf(@var{format}, @var{expression1},@dots{})
|
|
@findex sprintf
|
|
This returns (without printing) the string that @code{printf} would
|
|
have printed out with the same arguments
|
|
(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
|
|
For example:
|
|
|
|
@example
|
|
sprintf("pi = %.2f (approx.)", 22/7)
|
|
@end example
|
|
|
|
@noindent
|
|
returns the string @w{@code{"pi = 3.14 (approx.)"}}.
|
|
|
|
@ignore
|
|
2e: For sub, gsub, and gensub, either here or in the "how much matches"
|
|
section, we need some explanation that it is possible to match the
|
|
null string when using closures like *. E.g.,
|
|
|
|
$ echo abc | awk '{ gsub(/m*/, "X"); print }'
|
|
@print{} XaXbXcX
|
|
|
|
Although this makes a certain amount of sense, it can be very
|
|
suprising.
|
|
@end ignore
|
|
|
|
@item sub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
|
|
@findex sub
|
|
The @code{sub} function alters the value of @var{target}.
|
|
It searches this value, which is treated as a string, for the
|
|
leftmost longest substring matched by the regular expression, @var{regexp},
|
|
extending this match as far as possible. Then the entire string is
|
|
changed by replacing the matched text with @var{replacement}.
|
|
The modified string becomes the new value of @var{target}.
|
|
|
|
This function is peculiar because @var{target} is not simply
|
|
used to compute a value, and not just any expression will do: it
|
|
must be a variable, field or array element, so that @code{sub} can
|
|
store a modified value there. If this argument is omitted, then the
|
|
default is to use and alter @code{$0}.
|
|
|
|
For example:
|
|
|
|
@example
|
|
str = "water, water, everywhere"
|
|
sub(/at/, "ith", str)
|
|
@end example
|
|
|
|
@noindent
|
|
sets @code{str} to @w{@code{"wither, water, everywhere"}}, by replacing the
|
|
leftmost, longest occurrence of @samp{at} with @samp{ith}.
|
|
|
|
The @code{sub} function returns the number of substitutions made (either
|
|
one or zero).
|
|
|
|
If the special character @samp{&} appears in @var{replacement}, it
|
|
stands for the precise substring that was matched by @var{regexp}. (If
|
|
the regexp can match more than one string, then this precise substring
|
|
may vary.) For example:
|
|
|
|
@example
|
|
awk '@{ sub(/candidate/, "& and his wife"); print @}'
|
|
@end example
|
|
|
|
@noindent
|
|
changes the first occurrence of @samp{candidate} to @samp{candidate
|
|
and his wife} on each input line.
|
|
|
|
Here is another example:
|
|
|
|
@example
|
|
awk 'BEGIN @{
|
|
str = "daabaaa"
|
|
sub(/a*/, "c&c", str)
|
|
print str
|
|
@}'
|
|
@print{} dcaacbaaa
|
|
@end example
|
|
|
|
@noindent
|
|
This shows how @samp{&} can represent a non-constant string, and also
|
|
illustrates the ``leftmost, longest'' rule in regexp matching
|
|
(@pxref{Leftmost Longest, ,How Much Text Matches?}).
|
|
|
|
The effect of this special character (@samp{&}) can be turned off by putting a
|
|
backslash before it in the string. As usual, to insert one backslash in
|
|
the string, you must write two backslashes. Therefore, write @samp{\\&}
|
|
in a string constant to include a literal @samp{&} in the replacement.
|
|
For example, here is how to replace the first @samp{|} on each line with
|
|
an @samp{&}:
|
|
|
|
@example
|
|
awk '@{ sub(/\|/, "\\&"); print @}'
|
|
@end example
|
|
|
|
@cindex @code{sub}, third argument of
|
|
@cindex @code{gsub}, third argument of
|
|
@strong{Note:} As mentioned above, the third argument to @code{sub} must
|
|
be a variable, field or array reference.
|
|
Some versions of @code{awk} allow the third argument to
|
|
be an expression which is not an lvalue. In such a case, @code{sub}
|
|
would still search for the pattern and return zero or one, but the result of
|
|
the substitution (if any) would be thrown away because there is no place
|
|
to put it. Such versions of @code{awk} accept expressions like
|
|
this:
|
|
|
|
@example
|
|
sub(/USA/, "United States", "the USA and Canada")
|
|
@end example
|
|
|
|
@noindent
|
|
For historical compatibility, @code{gawk} will accept erroneous code,
|
|
such as in the above example. However, using any other non-changeable
|
|
object as the third parameter will cause a fatal error, and your program
|
|
will not run.
|
|
|
|
Finally, if the @var{regexp} is not a regexp constant, it is converted into a
|
|
string and then the value of that string is treated as the regexp to match.
|
|
|
|
@item gsub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
|
|
@findex gsub
|
|
This is similar to the @code{sub} function, except @code{gsub} replaces
|
|
@emph{all} of the longest, leftmost, @emph{non-overlapping} matching
|
|
substrings it can find. The @samp{g} in @code{gsub} stands for
|
|
``global,'' which means replace everywhere. For example:
|
|
|
|
@example
|
|
awk '@{ gsub(/Britain/, "United Kingdom"); print @}'
|
|
@end example
|
|
|
|
@noindent
|
|
replaces all occurrences of the string @samp{Britain} with @samp{United
|
|
Kingdom} for all input records.
|
|
|
|
The @code{gsub} function returns the number of substitutions made. If
|
|
the variable to be searched and altered, @var{target}, is
|
|
omitted, then the entire input record, @code{$0}, is used.
|
|
|
|
As in @code{sub}, the characters @samp{&} and @samp{\} are special,
|
|
and the third argument must be an lvalue.
|
|
@end table
|
|
|
|
@table @code
|
|
@item gensub(@var{regexp}, @var{replacement}, @var{how} @r{[}, @var{target}@r{]})
|
|
@findex gensub
|
|
@code{gensub} is a general substitution function. Like @code{sub} and
|
|
@code{gsub}, it searches the target string @var{target} for matches of
|
|
the regular expression @var{regexp}. Unlike @code{sub} and
|
|
@code{gsub}, the modified string is returned as the result of the
|
|
function, and the original target string is @emph{not} changed. If
|
|
@var{how} is a string beginning with @samp{g} or @samp{G}, then it
|
|
replaces all matches of @var{regexp} with @var{replacement}.
|
|
Otherwise, @var{how} is a number indicating which match of @var{regexp}
|
|
to replace. If no @var{target} is supplied, @code{$0} is used instead.
|
|
|
|
@code{gensub} provides an additional feature that is not available
|
|
in @code{sub} or @code{gsub}: the ability to specify components of
|
|
a regexp in the replacement text. This is done by using parentheses
|
|
in the regexp to mark the components, and then specifying @samp{\@var{n}}
|
|
in the replacement text, where @var{n} is a digit from one to nine.
|
|
For example:
|
|
|
|
@example
|
|
@group
|
|
$ gawk '
|
|
> BEGIN @{
|
|
> a = "abc def"
|
|
> b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
|
|
> print b
|
|
> @}'
|
|
@print{} def abc
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
As described above for @code{sub}, you must type two backslashes in order
|
|
to get one into the string.
|
|
|
|
In the replacement text, the sequence @samp{\0} represents the entire
|
|
matched text, as does the character @samp{&}.
|
|
|
|
This example shows how you can use the third argument to control
|
|
which match of the regexp should be changed.
|
|
|
|
@example
|
|
$ echo a b c a b c |
|
|
> gawk '@{ print gensub(/a/, "AA", 2) @}'
|
|
@print{} a b c AA b c
|
|
@end example
|
|
|
|
In this case, @code{$0} is used as the default target string.
|
|
@code{gensub} returns the new string as its result, which is
|
|
passed directly to @code{print} for printing.
|
|
|
|
If the @var{how} argument is a string that does not begin with @samp{g} or
|
|
@samp{G}, or if it is a number that is less than zero, only one
|
|
substitution is performed.
|
|
|
|
If @var{regexp} does not match @var{target}, @code{gensub}'s return value
|
|
is the original, unchanged value of @var{target}.
|
|
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
@code{gensub} is a @code{gawk} extension; it is not available
|
|
in compatibility mode (@pxref{Options, ,Command Line Options}).
|
|
|
|
@item substr(@var{string}, @var{start} @r{[}, @var{length}@r{]})
|
|
@findex substr
|
|
This returns a @var{length}-character-long substring of @var{string},
|
|
starting at character number @var{start}. The first character of a
|
|
string is character number one. For example,
|
|
@code{substr("washington", 5, 3)} returns @code{"ing"}.
|
|
|
|
If @var{length} is not present, this function returns the whole suffix of
|
|
@var{string} that begins at character number @var{start}. For example,
|
|
@code{substr("washington", 5)} returns @code{"ington"}. The whole
|
|
suffix is also returned
|
|
if @var{length} is greater than the number of characters remaining
|
|
in the string, counting from character number @var{start}.
|
|
|
|
@strong{Note:} The string returned by @code{substr} @emph{cannot} be
|
|
assigned to. Thus, it is a mistake to attempt to change a portion of
|
|
a string, like this:
|
|
|
|
@example
|
|
string = "abcdef"
|
|
# try to get "abCDEf", won't work
|
|
substr(string, 3, 3) = "CDE"
|
|
@end example
|
|
|
|
@noindent
|
|
or to use @code{substr} as the third agument of @code{sub} or @code{gsub}:
|
|
|
|
@example
|
|
gsub(/xyz/, "pdq", substr($0, 5, 20)) # WRONG
|
|
@end example
|
|
|
|
@cindex case conversion
|
|
@cindex conversion of case
|
|
@item tolower(@var{string})
|
|
@findex tolower
|
|
This returns a copy of @var{string}, with each upper-case character
|
|
in the string replaced with its corresponding lower-case character.
|
|
Non-alphabetic characters are left unchanged. For example,
|
|
@code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}.
|
|
|
|
@item toupper(@var{string})
|
|
@findex toupper
|
|
This returns a copy of @var{string}, with each lower-case character
|
|
in the string replaced with its corresponding upper-case character.
|
|
Non-alphabetic characters are left unchanged. For example,
|
|
@code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}.
|
|
@end table
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading More About @samp{\} and @samp{&} with @code{sub}, @code{gsub} and @code{gensub}
|
|
|
|
@cindex escape processing, @code{sub} et. al.
|
|
When using @code{sub}, @code{gsub} or @code{gensub}, and trying to get literal
|
|
backslashes and ampersands into the replacement text, you need to remember
|
|
that there are several levels of @dfn{escape processing} going on.
|
|
|
|
First, there is the @dfn{lexical} level, which is when @code{awk} reads
|
|
your program, and builds an internal copy of your program that can
|
|
be executed.
|
|
|
|
Then there is the run-time level, when @code{awk} actually scans the
|
|
replacement string to determine what to generate.
|
|
|
|
At both levels, @code{awk} looks for a defined set of characters that
|
|
can come after a backslash. At the lexical level, it looks for the
|
|
escape sequences listed in @ref{Escape Sequences}.
|
|
Thus, for every @samp{\} that @code{awk} will process at the run-time
|
|
level, you type two @samp{\}s at the lexical level.
|
|
When a character that is not valid for an escape sequence follows the
|
|
@samp{\}, Unix @code{awk} and @code{gawk} both simply remove the initial
|
|
@samp{\}, and put the following character into the string. Thus, for
|
|
example, @code{"a\qb"} is treated as @code{"aqb"}.
|
|
|
|
At the run-time level, the various functions handle sequences of
|
|
@samp{\} and @samp{&} differently. The situation is (sadly) somewhat complex.
|
|
|
|
Historically, the @code{sub} and @code{gsub} functions treated the two
|
|
character sequence @samp{\&} specially; this sequence was replaced in
|
|
the generated text with a single @samp{&}. Any other @samp{\} within
|
|
the @var{replacement} string that did not precede an @samp{&} was passed
|
|
through unchanged. To illustrate with a table:
|
|
|
|
@c Thank to Karl Berry for help with the TeX stuff.
|
|
@tex
|
|
\vbox{\bigskip
|
|
% This table has lots of &'s and \'s, so unspecialize them.
|
|
\catcode`\& = \other \catcode`\\ = \other
|
|
% But then we need character for escape and tab.
|
|
@catcode`! = 4
|
|
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
|
|
You type!@code{sub} sees!@code{sub} generates@cr
|
|
@hrulefill!@hrulefill!@hrulefill@cr
|
|
@code{\&}! @code{&}!the matched text@cr
|
|
@code{\\&}! @code{\&}!a literal @samp{&}@cr
|
|
@code{\\\&}! @code{\&}!a literal @samp{&}@cr
|
|
@code{\\\\&}! @code{\\&}!a literal @samp{\&}@cr
|
|
@code{\\\\\&}! @code{\\&}!a literal @samp{\&}@cr
|
|
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\\&}@cr
|
|
@code{\\q}! @code{\q}!a literal @samp{\q}@cr
|
|
}
|
|
@bigskip}
|
|
@end tex
|
|
@ifinfo
|
|
@display
|
|
You type @code{sub} sees @code{sub} generates
|
|
-------- ---------- ---------------
|
|
@code{\&} @code{&} the matched text
|
|
@code{\\&} @code{\&} a literal @samp{&}
|
|
@code{\\\&} @code{\&} a literal @samp{&}
|
|
@code{\\\\&} @code{\\&} a literal @samp{\&}
|
|
@code{\\\\\&} @code{\\&} a literal @samp{\&}
|
|
@code{\\\\\\&} @code{\\\&} a literal @samp{\\&}
|
|
@code{\\q} @code{\q} a literal @samp{\q}
|
|
@end display
|
|
@end ifinfo
|
|
|
|
@noindent
|
|
This table shows both the lexical level processing, where
|
|
an odd number of backslashes becomes an even number at the run time level,
|
|
and the run-time processing done by @code{sub}.
|
|
(For the sake of simplicity, the rest of the tables below only show the
|
|
case of even numbers of @samp{\}s entered at the lexical level.)
|
|
|
|
The problem with the historical approach is that there is no way to get
|
|
a literal @samp{\} followed by the matched text.
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
The 1992 POSIX standard attempted to fix this problem. The standard
|
|
says that @code{sub} and @code{gsub} look for either a @samp{\} or an @samp{&}
|
|
after the @samp{\}. If either one follows a @samp{\}, that character is
|
|
output literally. The interpretation of @samp{\} and @samp{&} then becomes
|
|
like this:
|
|
|
|
@c thanks to Karl Berry for formatting this table
|
|
@tex
|
|
\vbox{\bigskip
|
|
% This table has lots of &'s and \'s, so unspecialize them.
|
|
\catcode`\& = \other \catcode`\\ = \other
|
|
% But then we need character for escape and tab.
|
|
@catcode`! = 4
|
|
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
|
|
You type!@code{sub} sees!@code{sub} generates@cr
|
|
@hrulefill!@hrulefill!@hrulefill@cr
|
|
@code{&}! @code{&}!the matched text@cr
|
|
@code{\\&}! @code{\&}!a literal @samp{&}@cr
|
|
@code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr
|
|
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
|
|
}
|
|
@bigskip}
|
|
@end tex
|
|
@ifinfo
|
|
@display
|
|
You type @code{sub} sees @code{sub} generates
|
|
-------- ---------- ---------------
|
|
@code{&} @code{&} the matched text
|
|
@code{\\&} @code{\&} a literal @samp{&}
|
|
@code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text
|
|
@code{\\\\\\&} @code{\\\&} a literal @samp{\&}
|
|
@end display
|
|
@end ifinfo
|
|
|
|
@noindent
|
|
This would appear to solve the problem.
|
|
Unfortunately, the phrasing of the standard is unusual. It
|
|
says, in effect, that @samp{\} turns off the special meaning of any
|
|
following character, but that for anything other than @samp{\} and @samp{&},
|
|
such special meaning is undefined. This wording leads to two problems.
|
|
|
|
@enumerate
|
|
@item
|
|
Backslashes must now be doubled in the @var{replacement} string, breaking
|
|
historical @code{awk} programs.
|
|
|
|
@item
|
|
To make sure that an @code{awk} program is portable, @emph{every} character
|
|
in the @var{replacement} string must be preceded with a
|
|
backslash.@footnote{This consequence was certainly unintended.}
|
|
@c I can say that, 'cause I was involved in making this change
|
|
@end enumerate
|
|
|
|
The POSIX standard is under revision.@footnote{As of @value{UPDATE-MONTH},
|
|
with final approval and publication hopefully sometime in 1997.}
|
|
Because of the above problems, proposed text for the revised standard
|
|
reverts to rules that correspond more closely to the original existing
|
|
practice. The proposed rules have special cases that make it possible
|
|
to produce a @samp{\} preceding the matched text.
|
|
|
|
@tex
|
|
\vbox{\bigskip
|
|
% This table has lots of &'s and \'s, so unspecialize them.
|
|
\catcode`\& = \other \catcode`\\ = \other
|
|
% But then we need character for escape and tab.
|
|
@catcode`! = 4
|
|
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
|
|
You type!@code{sub} sees!@code{sub} generates@cr
|
|
@hrulefill!@hrulefill!@hrulefill@cr
|
|
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
|
|
@code{\\\\&}! @code{\\&}!a literal @samp{\}, followed by the matched text@cr
|
|
@code{\\&}! @code{\&}!a literal @samp{&}@cr
|
|
@code{\\q}! @code{\q}!a literal @samp{\q}@cr
|
|
}
|
|
@bigskip}
|
|
@end tex
|
|
@ifinfo
|
|
@display
|
|
You type @code{sub} sees @code{sub} generates
|
|
-------- ---------- ---------------
|
|
@code{\\\\\\&} @code{\\\&} a literal @samp{\&}
|
|
@code{\\\\&} @code{\\&} a literal @samp{\}, followed by the matched text
|
|
@code{\\&} @code{\&} a literal @samp{&}
|
|
@code{\\q} @code{\q} a literal @samp{\q}
|
|
@end display
|
|
@end ifinfo
|
|
|
|
In a nutshell, at the run-time level, there are now three special sequences
|
|
of characters, @samp{\\\&}, @samp{\\&} and @samp{\&}, whereas historically,
|
|
there was only one. However, as in the historical case, any @samp{\} that
|
|
is not part of one of these three sequences is not special, and appears
|
|
in the output literally.
|
|
|
|
@code{gawk} 3.0 follows these proposed POSIX rules for @code{sub} and
|
|
@code{gsub}.
|
|
@c As much as we think it's a lousy idea. You win some, you lose some. Sigh.
|
|
Whether these proposed rules will actually become codified into the
|
|
standard is unknown at this point. Subsequent @code{gawk} releases will
|
|
track the standard and implement whatever the final version specifies;
|
|
this @value{DOCUMENT} will be updated as well.
|
|
|
|
The rules for @code{gensub} are considerably simpler. At the run-time
|
|
level, whenever @code{gawk} sees a @samp{\}, if the following character
|
|
is a digit, then the text that matched the corresponding parenthesized
|
|
subexpression is placed in the generated output. Otherwise,
|
|
no matter what the character after the @samp{\} is, that character will
|
|
appear in the generated text, and the @samp{\} will not.
|
|
|
|
@tex
|
|
\vbox{\bigskip
|
|
% This table has lots of &'s and \'s, so unspecialize them.
|
|
\catcode`\& = \other \catcode`\\ = \other
|
|
% But then we need character for escape and tab.
|
|
@catcode`! = 4
|
|
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
|
|
You type!@code{gensub} sees!@code{gensub} generates@cr
|
|
@hrulefill!@hrulefill!@hrulefill@cr
|
|
@code{&}! @code{&}!the matched text@cr
|
|
@code{\\&}! @code{\&}!a literal @samp{&}@cr
|
|
@code{\\\\}! @code{\\}!a literal @samp{\}@cr
|
|
@code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr
|
|
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
|
|
@code{\\q}! @code{\q}!a literal @samp{q}@cr
|
|
}
|
|
@bigskip}
|
|
@end tex
|
|
@ifinfo
|
|
@display
|
|
You type @code{gensub} sees @code{gensub} generates
|
|
-------- ------------- ------------------
|
|
@code{&} @code{&} the matched text
|
|
@code{\\&} @code{\&} a literal @samp{&}
|
|
@code{\\\\} @code{\\} a literal @samp{\}
|
|
@code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text
|
|
@code{\\\\\\&} @code{\\\&} a literal @samp{\&}
|
|
@code{\\q} @code{\q} a literal @samp{q}
|
|
@end display
|
|
@end ifinfo
|
|
|
|
Because of the complexity of the lexical and run-time level processing,
|
|
and the special cases for @code{sub} and @code{gsub},
|
|
we recommend the use of @code{gawk} and @code{gensub} for when you have
|
|
to do substitutions.
|
|
|
|
@node I/O Functions, Time Functions, String Functions, Built-in
|
|
@section Built-in Functions for Input/Output
|
|
|
|
The following functions are related to Input/Output (I/O).
|
|
Optional parameters are enclosed in square brackets (``['' and ``]'').
|
|
|
|
@table @code
|
|
@item close(@var{filename})
|
|
@findex close
|
|
Close the file @var{filename}, for input or output. The argument may
|
|
alternatively be a shell command that was used for redirecting to or
|
|
from a pipe; then the pipe is closed.
|
|
@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},
|
|
for more information.
|
|
|
|
@item fflush(@r{[}@var{filename}@r{]})
|
|
@findex fflush
|
|
@cindex portability issues
|
|
@cindex flushing buffers
|
|
@cindex buffers, flushing
|
|
@cindex buffering output
|
|
@cindex output, buffering
|
|
Flush any buffered output associated @var{filename}, which is either a
|
|
file opened for writing, or a shell command for redirecting output to
|
|
a pipe.
|
|
|
|
Many utility programs will @dfn{buffer} their output; they save information
|
|
to be written to a disk file or terminal in memory, until there is enough
|
|
for it to be worthwhile to send the data to the ouput device.
|
|
This is often more efficient than writing
|
|
every little bit of information as soon as it is ready. However, sometimes
|
|
it is necessary to force a program to @dfn{flush} its buffers; that is,
|
|
write the information to its destination, even if a buffer is not full.
|
|
This is the purpose of the @code{fflush} function; @code{gawk} too
|
|
buffers its output, and the @code{fflush} function can be used to force
|
|
@code{gawk} to flush its buffers.
|
|
|
|
@code{fflush} is a recent (1994) addition to the Bell Labs research
|
|
version of @code{awk}; it is not part of the POSIX standard, and will
|
|
not be available if @samp{--posix} has been specified on the command
|
|
line (@pxref{Options, ,Command Line Options}).
|
|
|
|
@code{gawk} extends the @code{fflush} function in two ways. The first
|
|
is to allow no argument at all. In this case, the buffer for the
|
|
standard output is flushed. The second way is to allow the null string
|
|
(@w{@code{""}}) as the argument. In this case, the buffers for
|
|
@emph{all} open output files and pipes are flushed.
|
|
|
|
@code{fflush} returns zero if the buffer was successfully flushed,
|
|
and nonzero otherwise.
|
|
|
|
@item system(@var{command})
|
|
@findex system
|
|
@cindex interaction, @code{awk} and other programs
|
|
The @code{system} function allows the user to execute operating system commands
|
|
and then return to the @code{awk} program. The @code{system} function
|
|
executes the command given by the string @var{command}. It returns, as
|
|
its value, the status returned by the command that was executed.
|
|
|
|
For example, if the following fragment of code is put in your @code{awk}
|
|
program:
|
|
|
|
@example
|
|
END @{
|
|
system("date | mail -s 'awk run done' root")
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
the system administrator will be sent mail when the @code{awk} program
|
|
finishes processing input and begins its end-of-input processing.
|
|
|
|
Note that redirecting @code{print} or @code{printf} into a pipe is often
|
|
enough to accomplish your task. If you need to run many commands, it
|
|
will be more efficient to simply print them to a pipe to the shell:
|
|
|
|
@example
|
|
while (@var{more stuff to do})
|
|
print @var{command} | "/bin/sh"
|
|
close("/bin/sh")
|
|
@end example
|
|
|
|
@noindent
|
|
However, if your @code{awk}
|
|
program is interactive, @code{system} is useful for cranking up large
|
|
self-contained programs, such as a shell or an editor.
|
|
|
|
Some operating systems cannot implement the @code{system} function.
|
|
@code{system} causes a fatal error if it is not supported.
|
|
@end table
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Interactive vs. Non-Interactive Buffering
|
|
@cindex buffering, interactive vs. non-interactive
|
|
@cindex buffering, non-interactive vs. interactive
|
|
@cindex interactive buffering vs. non-interactive
|
|
@cindex non-interactive buffering vs. interactive
|
|
|
|
As a side point, buffering issues can be even more confusing depending
|
|
upon whether or not your program is @dfn{interactive}, i.e., communicating
|
|
with a user sitting at a keyboard.@footnote{A program is interactive
|
|
if the standard output is connected
|
|
to a terminal device.}
|
|
|
|
Interactive programs generally @dfn{line buffer} their output; they
|
|
write out every line. Non-interactive programs wait until they have
|
|
a full buffer, which may be many lines of output.
|
|
|
|
@c Thanks to Walter.Mecky@dresdnerbank.de for this example, and for
|
|
@c motivating me to write this section.
|
|
Here is an example of the difference.
|
|
|
|
@example
|
|
$ awk '@{ print $1 + $2 @}'
|
|
1 1
|
|
@print{} 2
|
|
2 3
|
|
@print{} 5
|
|
@kbd{Control-d}
|
|
@end example
|
|
|
|
@noindent
|
|
Each line of output is printed immediately. Compare that behavior
|
|
with this example.
|
|
|
|
@example
|
|
$ awk '@{ print $1 + $2 @}' | cat
|
|
1 1
|
|
2 3
|
|
@kbd{Control-d}
|
|
@print{} 2
|
|
@print{} 5
|
|
@end example
|
|
|
|
@noindent
|
|
Here, no output is printed until after the @kbd{Control-d} is typed, since
|
|
it is all buffered, and sent down the pipe to @code{cat} in one shot.
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Controlling Output Buffering with @code{system}
|
|
@cindex flushing buffers
|
|
@cindex buffers, flushing
|
|
@cindex buffering output
|
|
@cindex output, buffering
|
|
|
|
The @code{fflush} function provides explicit control over output buffering for
|
|
individual files and pipes. However, its use is not portable to many other
|
|
@code{awk} implementations. An alternative method to flush output
|
|
buffers is by calling @code{system} with a null string as its argument:
|
|
|
|
@example
|
|
system("") # flush output
|
|
@end example
|
|
|
|
@noindent
|
|
@code{gawk} treats this use of the @code{system} function as a special
|
|
case, and is smart enough not to run a shell (or other command
|
|
interpreter) with the empty command. Therefore, with @code{gawk}, this
|
|
idiom is not only useful, it is efficient. While this method should work
|
|
with other @code{awk} implementations, it will not necessarily avoid
|
|
starting an unnecessary shell. (Other implementations may only
|
|
flush the buffer associated with the standard output, and not necessarily
|
|
all buffered output.)
|
|
|
|
If you think about what a programmer expects, it makes sense that
|
|
@code{system} should flush any pending output. The following program:
|
|
|
|
@example
|
|
BEGIN @{
|
|
print "first print"
|
|
system("echo system echo")
|
|
print "second print"
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
must print
|
|
|
|
@example
|
|
first print
|
|
system echo
|
|
second print
|
|
@end example
|
|
|
|
@noindent
|
|
and not
|
|
|
|
@example
|
|
system echo
|
|
first print
|
|
second print
|
|
@end example
|
|
|
|
If @code{awk} did not flush its buffers before calling @code{system}, the
|
|
latter (undesirable) output is what you would see.
|
|
|
|
@node Time Functions, , I/O Functions, Built-in
|
|
@section Functions for Dealing with Time Stamps
|
|
|
|
@cindex timestamps
|
|
@cindex time of day
|
|
A common use for @code{awk} programs is the processing of log files
|
|
containing time stamp information, indicating when a
|
|
particular log record was written. Many programs log their time stamp
|
|
in the form returned by the @code{time} system call, which is the
|
|
number of seconds since a particular epoch. On POSIX systems,
|
|
it is the number of seconds since Midnight, January 1, 1970, UTC.
|
|
|
|
In order to make it easier to process such log files, and to produce
|
|
useful reports, @code{gawk} provides two functions for working with time
|
|
stamps. Both of these are @code{gawk} extensions; they are not specified
|
|
in the POSIX standard, nor are they in any other known version
|
|
of @code{awk}.
|
|
|
|
Optional parameters are enclosed in square brackets (``['' and ``]'').
|
|
|
|
@table @code
|
|
@item systime()
|
|
@findex systime
|
|
This function returns the current time as the number of seconds since
|
|
the system epoch. On POSIX systems, this is the number of seconds
|
|
since Midnight, January 1, 1970, UTC. It may be a different number on
|
|
other systems.
|
|
|
|
@item strftime(@r{[}@var{format} @r{[}, @var{timestamp}@r{]]})
|
|
@findex strftime
|
|
This function returns a string. It is similar to the function of the
|
|
same name in ANSI C. The time specified by @var{timestamp} is used to
|
|
produce a string, based on the contents of the @var{format} string.
|
|
The @var{timestamp} is in the same format as the value returned by the
|
|
@code{systime} function. If no @var{timestamp} argument is supplied,
|
|
@code{gawk} will use the current time of day as the time stamp.
|
|
If no @var{format} argument is supplied, @code{strftime} uses
|
|
@code{@w{"%a %b %d %H:%M:%S %Z %Y"}}. This format string produces
|
|
output (almost) equivalent to that of the @code{date} utility.
|
|
(Versions of @code{gawk} prior to 3.0 require the @var{format} argument.)
|
|
@end table
|
|
|
|
The @code{systime} function allows you to compare a time stamp from a
|
|
log file with the current time of day. In particular, it is easy to
|
|
determine how long ago a particular record was logged. It also allows
|
|
you to produce log records using the ``seconds since the epoch'' format.
|
|
|
|
The @code{strftime} function allows you to easily turn a time stamp
|
|
into human-readable information. It is similar in nature to the @code{sprintf}
|
|
function
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
|
|
in that it copies non-format specification characters verbatim to the
|
|
returned string, while substituting date and time values for format
|
|
specifications in the @var{format} string.
|
|
|
|
@code{strftime} is guaranteed by the ANSI C standard to support
|
|
the following date format specifications:
|
|
|
|
@table @code
|
|
@item %a
|
|
The locale's abbreviated weekday name.
|
|
|
|
@item %A
|
|
The locale's full weekday name.
|
|
|
|
@item %b
|
|
The locale's abbreviated month name.
|
|
|
|
@item %B
|
|
The locale's full month name.
|
|
|
|
@item %c
|
|
The locale's ``appropriate'' date and time representation.
|
|
|
|
@item %d
|
|
The day of the month as a decimal number (01--31).
|
|
|
|
@item %H
|
|
The hour (24-hour clock) as a decimal number (00--23).
|
|
|
|
@item %I
|
|
The hour (12-hour clock) as a decimal number (01--12).
|
|
|
|
@item %j
|
|
The day of the year as a decimal number (001--366).
|
|
|
|
@item %m
|
|
The month as a decimal number (01--12).
|
|
|
|
@item %M
|
|
The minute as a decimal number (00--59).
|
|
|
|
@item %p
|
|
The locale's equivalent of the AM/PM designations associated
|
|
with a 12-hour clock.
|
|
|
|
@item %S
|
|
The second as a decimal number (00--60).@footnote{Occasionally there are
|
|
minutes in a year with a leap second, which is why the
|
|
seconds can go up to 60.}
|
|
|
|
@item %U
|
|
The week number of the year (the first Sunday as the first day of week one)
|
|
as a decimal number (00--53).
|
|
|
|
@item %w
|
|
The weekday as a decimal number (0--6). Sunday is day zero.
|
|
|
|
@item %W
|
|
The week number of the year (the first Monday as the first day of week one)
|
|
as a decimal number (00--53).
|
|
|
|
@item %x
|
|
The locale's ``appropriate'' date representation.
|
|
|
|
@item %X
|
|
The locale's ``appropriate'' time representation.
|
|
|
|
@item %y
|
|
The year without century as a decimal number (00--99).
|
|
|
|
@item %Y
|
|
The year with century as a decimal number (e.g., 1995).
|
|
|
|
@item %Z
|
|
The time zone name or abbreviation, or no characters if
|
|
no time zone is determinable.
|
|
|
|
@item %%
|
|
A literal @samp{%}.
|
|
@end table
|
|
|
|
If a conversion specifier is not one of the above, the behavior is
|
|
undefined.@footnote{This is because ANSI C leaves the
|
|
behavior of the C version of @code{strftime} undefined, and @code{gawk}
|
|
will use the system's version of @code{strftime} if it's there.
|
|
Typically, the conversion specifier will either not appear in the
|
|
returned string, or it will appear literally.}
|
|
|
|
@cindex locale, definition of
|
|
Informally, a @dfn{locale} is the geographic place in which a program
|
|
is meant to run. For example, a common way to abbreviate the date
|
|
September 4, 1991 in the United States would be ``9/4/91''.
|
|
In many countries in Europe, however, it would be abbreviated ``4.9.91''.
|
|
Thus, the @samp{%x} specification in a @code{"US"} locale might produce
|
|
@samp{9/4/91}, while in a @code{"EUROPE"} locale, it might produce
|
|
@samp{4.9.91}. The ANSI C standard defines a default @code{"C"}
|
|
locale, which is an environment that is typical of what most C programmers
|
|
are used to.
|
|
|
|
A public-domain C version of @code{strftime} is supplied with @code{gawk}
|
|
for systems that are not yet fully ANSI-compliant. If that version is
|
|
used to compile @code{gawk} (@pxref{Installation, ,Installing @code{gawk}}),
|
|
then the following additional format specifications are available:
|
|
|
|
@table @code
|
|
@item %D
|
|
Equivalent to specifying @samp{%m/%d/%y}.
|
|
|
|
@item %e
|
|
The day of the month, padded with a space if it is only one digit.
|
|
|
|
@item %h
|
|
Equivalent to @samp{%b}, above.
|
|
|
|
@item %n
|
|
A newline character (ASCII LF).
|
|
|
|
@item %r
|
|
Equivalent to specifying @samp{%I:%M:%S %p}.
|
|
|
|
@item %R
|
|
Equivalent to specifying @samp{%H:%M}.
|
|
|
|
@item %T
|
|
Equivalent to specifying @samp{%H:%M:%S}.
|
|
|
|
@item %t
|
|
A tab character.
|
|
|
|
@item %k
|
|
The hour (24-hour clock) as a decimal number (0-23).
|
|
Single digit numbers are padded with a space.
|
|
|
|
@item %l
|
|
The hour (12-hour clock) as a decimal number (1-12).
|
|
Single digit numbers are padded with a space.
|
|
|
|
@item %C
|
|
The century, as a number between 00 and 99.
|
|
|
|
@item %u
|
|
The weekday as a decimal number
|
|
[1 (Monday)--7].
|
|
|
|
@cindex ISO 8601
|
|
@item %V
|
|
The week number of the year (the first Monday as the first
|
|
day of week one) as a decimal number (01--53).
|
|
The method for determining the week number is as specified by ISO 8601
|
|
(to wit: if the week containing January 1 has four or more days in the
|
|
new year, then it is week one, otherwise it is week 53 of the previous year
|
|
and the next week is week one).
|
|
|
|
@item %G
|
|
The year with century of the ISO week number, as a decimal number.
|
|
|
|
For example, January 1, 1993, is in week 53 of 1992. Thus, the year
|
|
of its ISO week number is 1992, even though its year is 1993.
|
|
Similarly, December 31, 1973, is in week 1 of 1974. Thus, the year
|
|
of its ISO week number is 1974, even though its year is 1973.
|
|
|
|
@item %g
|
|
The year without century of the ISO week number, as a decimal number (00--99).
|
|
|
|
@item %Ec %EC %Ex %Ey %EY %Od %Oe %OH %OI
|
|
@itemx %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy
|
|
These are ``alternate representations'' for the specifications
|
|
that use only the second letter (@samp{%c}, @samp{%C}, and so on).
|
|
They are recognized, but their normal representations are
|
|
used.@footnote{If you don't understand any of this, don't worry about
|
|
it; these facilities are meant to make it easier to ``internationalize''
|
|
programs.}
|
|
(These facilitate compliance with the POSIX @code{date} utility.)
|
|
|
|
@item %v
|
|
The date in VMS format (e.g., 20-JUN-1991).
|
|
|
|
@cindex RFC-822
|
|
@cindex RFC-1036
|
|
@item %z
|
|
The timezone offset in a +HHMM format (e.g., the format necessary to
|
|
produce RFC-822/RFC-1036 date headers).
|
|
@end table
|
|
|
|
This example is an @code{awk} implementation of the POSIX
|
|
@code{date} utility. Normally, the @code{date} utility prints the
|
|
current date and time of day in a well known format. However, if you
|
|
provide an argument to it that begins with a @samp{+}, @code{date}
|
|
will copy non-format specifier characters to the standard output, and
|
|
will interpret the current time according to the format specifiers in
|
|
the string. For example:
|
|
|
|
@example
|
|
$ date '+Today is %A, %B %d, %Y.'
|
|
@print{} Today is Thursday, July 11, 1991.
|
|
@end example
|
|
|
|
Here is the @code{gawk} version of the @code{date} utility.
|
|
It has a shell ``wrapper'', to handle the @samp{-u} option,
|
|
which requires that @code{date} run as if the time zone
|
|
was set to UTC.
|
|
|
|
@example
|
|
@group
|
|
#! /bin/sh
|
|
#
|
|
# date --- approximate the P1003.2 'date' command
|
|
|
|
case $1 in
|
|
-u) TZ=GMT0 # use UTC
|
|
export TZ
|
|
shift ;;
|
|
esac
|
|
@end group
|
|
|
|
@group
|
|
gawk 'BEGIN @{
|
|
format = "%a %b %d %H:%M:%S %Z %Y"
|
|
exitval = 0
|
|
@end group
|
|
|
|
@group
|
|
if (ARGC > 2)
|
|
exitval = 1
|
|
else if (ARGC == 2) @{
|
|
format = ARGV[1]
|
|
if (format ~ /^\+/)
|
|
format = substr(format, 2) # remove leading +
|
|
@}
|
|
print strftime(format)
|
|
exit exitval
|
|
@}' "$@@"
|
|
@end group
|
|
@end example
|
|
|
|
@node User-defined, Invoking Gawk, Built-in, Top
|
|
@chapter User-defined Functions
|
|
|
|
@cindex user-defined functions
|
|
@cindex functions, user-defined
|
|
Complicated @code{awk} programs can often be simplified by defining
|
|
your own functions. User-defined functions can be called just like
|
|
built-in ones (@pxref{Function Calls}), but it is up to you to define
|
|
them---to tell @code{awk} what they should do.
|
|
|
|
@menu
|
|
* Definition Syntax:: How to write definitions and what they mean.
|
|
* Function Example:: An example function definition and what it
|
|
does.
|
|
* Function Caveats:: Things to watch out for.
|
|
* Return Statement:: Specifying the value a function returns.
|
|
@end menu
|
|
|
|
@node Definition Syntax, Function Example, User-defined, User-defined
|
|
@section Function Definition Syntax
|
|
@cindex defining functions
|
|
@cindex function definition
|
|
|
|
Definitions of functions can appear anywhere between the rules of an
|
|
@code{awk} program. Thus, the general form of an @code{awk} program is
|
|
extended to include sequences of rules @emph{and} user-defined function
|
|
definitions.
|
|
There is no need in @code{awk} to put the definition of a function
|
|
before all uses of the function. This is because @code{awk} reads the
|
|
entire program before starting to execute any of it.
|
|
|
|
The definition of a function named @var{name} looks like this:
|
|
|
|
@example
|
|
function @var{name}(@var{parameter-list})
|
|
@{
|
|
@var{body-of-function}
|
|
@}
|
|
@end example
|
|
|
|
@cindex names, use of
|
|
@cindex namespaces
|
|
@noindent
|
|
@var{name} is the name of the function to be defined. A valid function
|
|
name is like a valid variable name: a sequence of letters, digits and
|
|
underscores, not starting with a digit.
|
|
Within a single @code{awk} program, any particular name can only be
|
|
used as a variable, array or function.
|
|
|
|
@var{parameter-list} is a list of the function's arguments and local
|
|
variable names, separated by commas. When the function is called,
|
|
the argument names are used to hold the argument values given in
|
|
the call. The local variables are initialized to the empty string.
|
|
A function cannot have two parameters with the same name.
|
|
|
|
The @var{body-of-function} consists of @code{awk} statements. It is the
|
|
most important part of the definition, because it says what the function
|
|
should actually @emph{do}. The argument names exist to give the body a
|
|
way to talk about the arguments; local variables, to give the body
|
|
places to keep temporary values.
|
|
|
|
Argument names are not distinguished syntactically from local variable
|
|
names; instead, the number of arguments supplied when the function is
|
|
called determines how many argument variables there are. Thus, if three
|
|
argument values are given, the first three names in @var{parameter-list}
|
|
are arguments, and the rest are local variables.
|
|
|
|
It follows that if the number of arguments is not the same in all calls
|
|
to the function, some of the names in @var{parameter-list} may be
|
|
arguments on some occasions and local variables on others. Another
|
|
way to think of this is that omitted arguments default to the
|
|
null string.
|
|
|
|
Usually when you write a function you know how many names you intend to
|
|
use for arguments and how many you intend to use as local variables. It is
|
|
conventional to place some extra space between the arguments and
|
|
the local variables, to document how your function is supposed to be used.
|
|
|
|
@cindex variable shadowing
|
|
During execution of the function body, the arguments and local variable
|
|
values hide or @dfn{shadow} any variables of the same names used in the
|
|
rest of the program. The shadowed variables are not accessible in the
|
|
function definition, because there is no way to name them while their
|
|
names have been taken away for the local variables. All other variables
|
|
used in the @code{awk} program can be referenced or set normally in the
|
|
function's body.
|
|
|
|
The arguments and local variables last only as long as the function body
|
|
is executing. Once the body finishes, you can once again access the
|
|
variables that were shadowed while the function was running.
|
|
|
|
@cindex recursive function
|
|
@cindex function, recursive
|
|
The function body can contain expressions which call functions. They
|
|
can even call this function, either directly or by way of another
|
|
function. When this happens, we say the function is @dfn{recursive}.
|
|
|
|
@cindex @code{awk} language, POSIX version
|
|
@cindex POSIX @code{awk}
|
|
In many @code{awk} implementations, including @code{gawk},
|
|
the keyword @code{function} may be
|
|
abbreviated @code{func}. However, POSIX only specifies the use of
|
|
the keyword @code{function}. This actually has some practical implications.
|
|
If @code{gawk} is in POSIX-compatibility mode
|
|
(@pxref{Options, ,Command Line Options}), then the following
|
|
statement will @emph{not} define a function:
|
|
|
|
@example
|
|
func foo() @{ a = sqrt($1) ; print a @}
|
|
@end example
|
|
|
|
@noindent
|
|
Instead it defines a rule that, for each record, concatenates the value
|
|
of the variable @samp{func} with the return value of the function @samp{foo}.
|
|
If the resulting string is non-null, the action is executed.
|
|
This is probably not what was desired. (@code{awk} accepts this input as
|
|
syntactically valid, since functions may be used before they are defined
|
|
in @code{awk} programs.)
|
|
|
|
@cindex portability issues
|
|
To ensure that your @code{awk} programs are portable, always use the
|
|
keyword @code{function} when defining a function.
|
|
|
|
@node Function Example, Function Caveats, Definition Syntax, User-defined
|
|
@section Function Definition Examples
|
|
|
|
Here is an example of a user-defined function, called @code{myprint}, that
|
|
takes a number and prints it in a specific format.
|
|
|
|
@example
|
|
function myprint(num)
|
|
@{
|
|
printf "%6.3g\n", num
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
To illustrate, here is an @code{awk} rule which uses our @code{myprint}
|
|
function:
|
|
|
|
@example
|
|
$3 > 0 @{ myprint($3) @}
|
|
@end example
|
|
|
|
@noindent
|
|
This program prints, in our special format, all the third fields that
|
|
contain a positive number in our input. Therefore, when given:
|
|
|
|
@example
|
|
@group
|
|
1.2 3.4 5.6 7.8
|
|
9.10 11.12 -13.14 15.16
|
|
17.18 19.20 21.22 23.24
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
this program, using our function to format the results, prints:
|
|
|
|
@example
|
|
5.6
|
|
21.2
|
|
@end example
|
|
|
|
This function deletes all the elements in an array.
|
|
|
|
@example
|
|
function delarray(a, i)
|
|
@{
|
|
for (i in a)
|
|
delete a[i]
|
|
@}
|
|
@end example
|
|
|
|
When working with arrays, it is often necessary to delete all the elements
|
|
in an array and start over with a new list of elements
|
|
(@pxref{Delete, ,The @code{delete} Statement}).
|
|
Instead of having
|
|
to repeat this loop everywhere in your program that you need to clear out
|
|
an array, your program can just call @code{delarray}.
|
|
|
|
Here is an example of a recursive function. It takes a string
|
|
as an input parameter, and returns the string in backwards order.
|
|
|
|
@example
|
|
function rev(str, start)
|
|
@{
|
|
if (start == 0)
|
|
return ""
|
|
|
|
return (substr(str, start, 1) rev(str, start - 1))
|
|
@}
|
|
@end example
|
|
|
|
If this function is in a file named @file{rev.awk}, we can test it
|
|
this way:
|
|
|
|
@example
|
|
$ echo "Don't Panic!" |
|
|
> gawk --source '@{ print rev($0, length($0)) @}' -f rev.awk
|
|
@print{} !cinaP t'noD
|
|
@end example
|
|
|
|
Here is an example that uses the built-in function @code{strftime}.
|
|
(@xref{Time Functions, ,Functions for Dealing with Time Stamps},
|
|
for more information on @code{strftime}.)
|
|
The C @code{ctime} function takes a timestamp and returns it in a string,
|
|
formatted in a well known fashion. Here is an @code{awk} version:
|
|
|
|
@example
|
|
@c file eg/lib/ctime.awk
|
|
@group
|
|
# ctime.awk
|
|
#
|
|
# awk version of C ctime(3) function
|
|
|
|
function ctime(ts, format)
|
|
@{
|
|
format = "%a %b %d %H:%M:%S %Z %Y"
|
|
if (ts == 0)
|
|
ts = systime() # use current time as default
|
|
return strftime(format, ts)
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
@node Function Caveats, Return Statement, Function Example, User-defined
|
|
@section Calling User-defined Functions
|
|
|
|
@cindex call by value
|
|
@cindex call by reference
|
|
@cindex calling a function
|
|
@cindex function call
|
|
@dfn{Calling a function} means causing the function to run and do its job.
|
|
A function call is an expression, and its value is the value returned by
|
|
the function.
|
|
|
|
A function call consists of the function name followed by the arguments
|
|
in parentheses. What you write in the call for the arguments are
|
|
@code{awk} expressions; each time the call is executed, these
|
|
expressions are evaluated, and the values are the actual arguments. For
|
|
example, here is a call to @code{foo} with three arguments (the first
|
|
being a string concatenation):
|
|
|
|
@example
|
|
foo(x y, "lose", 4 * z)
|
|
@end example
|
|
|
|
@strong{Caution:} whitespace characters (spaces and tabs) are not allowed
|
|
between the function name and the open-parenthesis of the argument list.
|
|
If you write whitespace by mistake, @code{awk} might think that you mean
|
|
to concatenate a variable with an expression in parentheses. However, it
|
|
notices that you used a function name and not a variable name, and reports
|
|
an error.
|
|
|
|
@cindex call by value
|
|
When a function is called, it is given a @emph{copy} of the values of
|
|
its arguments. This is known as @dfn{call by value}. The caller may use
|
|
a variable as the expression for the argument, but the called function
|
|
does not know this: it only knows what value the argument had. For
|
|
example, if you write this code:
|
|
|
|
@example
|
|
foo = "bar"
|
|
z = myfunc(foo)
|
|
@end example
|
|
|
|
@noindent
|
|
then you should not think of the argument to @code{myfunc} as being
|
|
``the variable @code{foo}.'' Instead, think of the argument as the
|
|
string value, @code{"bar"}.
|
|
|
|
If the function @code{myfunc} alters the values of its local variables,
|
|
this has no effect on any other variables. Thus, if @code{myfunc}
|
|
does this:
|
|
|
|
@example
|
|
@group
|
|
function myfunc(str)
|
|
@{
|
|
print str
|
|
str = "zzz"
|
|
print str
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
to change its first argument variable @code{str}, this @emph{does not}
|
|
change the value of @code{foo} in the caller. The role of @code{foo} in
|
|
calling @code{myfunc} ended when its value, @code{"bar"}, was computed.
|
|
If @code{str} also exists outside of @code{myfunc}, the function body
|
|
cannot alter this outer value, because it is shadowed during the
|
|
execution of @code{myfunc} and cannot be seen or changed from there.
|
|
|
|
@cindex call by reference
|
|
However, when arrays are the parameters to functions, they are @emph{not}
|
|
copied. Instead, the array itself is made available for direct manipulation
|
|
by the function. This is usually called @dfn{call by reference}.
|
|
Changes made to an array parameter inside the body of a function @emph{are}
|
|
visible outside that function.
|
|
@ifinfo
|
|
This can be @strong{very} dangerous if you do not watch what you are
|
|
doing. For example:
|
|
@end ifinfo
|
|
@iftex
|
|
@emph{This can be very dangerous if you do not watch what you are
|
|
doing.} For example:
|
|
@end iftex
|
|
|
|
@example
|
|
function changeit(array, ind, nvalue)
|
|
@{
|
|
array[ind] = nvalue
|
|
@}
|
|
|
|
BEGIN @{
|
|
a[1] = 1; a[2] = 2; a[3] = 3
|
|
changeit(a, 2, "two")
|
|
printf "a[1] = %s, a[2] = %s, a[3] = %s\n",
|
|
a[1], a[2], a[3]
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
This program prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because
|
|
@code{changeit} stores @code{"two"} in the second element of @code{a}.
|
|
|
|
@cindex undefined functions
|
|
@cindex functions, undefined
|
|
Some @code{awk} implementations allow you to call a function that
|
|
has not been defined, and only report a problem at run-time when the
|
|
program actually tries to call the function. For example:
|
|
|
|
@example
|
|
@group
|
|
BEGIN @{
|
|
if (0)
|
|
foo()
|
|
else
|
|
bar()
|
|
@}
|
|
function bar() @{ @dots{} @}
|
|
# note that `foo' is not defined
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
Since the @samp{if} statement will never be true, it is not really a
|
|
problem that @code{foo} has not been defined. Usually though, it is a
|
|
problem if a program calls an undefined function.
|
|
|
|
@ignore
|
|
At one point, I had gawk dieing on this, but later decided that this might
|
|
break old programs and/or test suites.
|
|
@end ignore
|
|
|
|
If @samp{--lint} has been specified
|
|
(@pxref{Options, ,Command Line Options}),
|
|
@code{gawk} will report about calls to undefined functions.
|
|
|
|
Some @code{awk} implementations generate a run-time
|
|
error if you use the @code{next} statement
|
|
(@pxref{Next Statement, , The @code{next} Statement})
|
|
inside a user-defined function.
|
|
@code{gawk} does not have this problem.
|
|
|
|
@node Return Statement, , Function Caveats, User-defined
|
|
@section The @code{return} Statement
|
|
@cindex @code{return} statement
|
|
|
|
The body of a user-defined function can contain a @code{return} statement.
|
|
This statement returns control to the rest of the @code{awk} program. It
|
|
can also be used to return a value for use in the rest of the @code{awk}
|
|
program. It looks like this:
|
|
|
|
@example
|
|
return @r{[}@var{expression}@r{]}
|
|
@end example
|
|
|
|
The @var{expression} part is optional. If it is omitted, then the returned
|
|
value is undefined and, therefore, unpredictable.
|
|
|
|
A @code{return} statement with no value expression is assumed at the end of
|
|
every function definition. So if control reaches the end of the function
|
|
body, then the function returns an unpredictable value. @code{awk}
|
|
will @emph{not} warn you if you use the return value of such a function.
|
|
|
|
Sometimes, you want to write a function for what it does, not for
|
|
what it returns. Such a function corresponds to a @code{void} function
|
|
in C or to a @code{procedure} in Pascal. Thus, it may be appropriate to not
|
|
return any value; you should simply bear in mind that if you use the return
|
|
value of such a function, you do so at your own risk.
|
|
|
|
Here is an example of a user-defined function that returns a value
|
|
for the largest number among the elements of an array:
|
|
|
|
@example
|
|
@group
|
|
function maxelt(vec, i, ret)
|
|
@{
|
|
for (i in vec) @{
|
|
if (ret == "" || vec[i] > ret)
|
|
ret = vec[i]
|
|
@}
|
|
return ret
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
You call @code{maxelt} with one argument, which is an array name. The local
|
|
variables @code{i} and @code{ret} are not intended to be arguments;
|
|
while there is nothing to stop you from passing two or three arguments
|
|
to @code{maxelt}, the results would be strange. The extra space before
|
|
@code{i} in the function parameter list indicates that @code{i} and
|
|
@code{ret} are not supposed to be arguments. This is a convention that
|
|
you should follow when you define functions.
|
|
|
|
Here is a program that uses our @code{maxelt} function. It loads an
|
|
array, calls @code{maxelt}, and then reports the maximum number in that
|
|
array:
|
|
|
|
@example
|
|
@group
|
|
awk '
|
|
function maxelt(vec, i, ret)
|
|
@{
|
|
for (i in vec) @{
|
|
if (ret == "" || vec[i] > ret)
|
|
ret = vec[i]
|
|
@}
|
|
return ret
|
|
@}
|
|
@end group
|
|
|
|
@group
|
|
# Load all fields of each record into nums.
|
|
@{
|
|
for(i = 1; i <= NF; i++)
|
|
nums[NR, i] = $i
|
|
@}
|
|
|
|
END @{
|
|
print maxelt(nums)
|
|
@}'
|
|
@end group
|
|
@end example
|
|
|
|
Given the following input:
|
|
|
|
@example
|
|
@group
|
|
1 5 23 8 16
|
|
44 3 5 2 8 26
|
|
256 291 1396 2962 100
|
|
-6 467 998 1101
|
|
99385 11 0 225
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
our program tells us (predictably) that @code{99385} is the largest number
|
|
in our array.
|
|
|
|
@node Invoking Gawk, Library Functions, User-defined, Top
|
|
@chapter Running @code{awk}
|
|
@cindex command line
|
|
@cindex invocation of @code{gawk}
|
|
@cindex arguments, command line
|
|
@cindex options, command line
|
|
@cindex long options
|
|
@cindex options, long
|
|
|
|
There are two ways to run @code{awk}: with an explicit program, or with
|
|
one or more program files. Here are templates for both of them; items
|
|
enclosed in @samp{@r{[}@dots{}@r{]}} in these templates are optional.
|
|
|
|
Besides traditional one-letter POSIX-style options, @code{gawk} also
|
|
supports GNU long options.
|
|
|
|
@example
|
|
awk @r{[@var{options}]} -f progfile @r{[@code{--}]} @var{file} @dots{}
|
|
awk @r{[@var{options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
|
|
@end example
|
|
|
|
@cindex empty program
|
|
@cindex dark corner
|
|
It is possible to invoke @code{awk} with an empty program:
|
|
|
|
@example
|
|
$ awk '' datafile1 datafile2
|
|
@end example
|
|
|
|
@noindent
|
|
Doing so makes little sense though; @code{awk} will simply exit
|
|
silently when given an empty program (d.c.). If @samp{--lint} has
|
|
been specified on the command line, @code{gawk} will issue a
|
|
warning that the program is empty.
|
|
|
|
@menu
|
|
* Options:: Command line options and their meanings.
|
|
* Other Arguments:: Input file names and variable assignments.
|
|
* AWKPATH Variable:: Searching directories for @code{awk} programs.
|
|
* Obsolete:: Obsolete Options and/or features.
|
|
* Undocumented:: Undocumented Options and Features.
|
|
* Known Bugs:: Known Bugs in @code{gawk}.
|
|
@end menu
|
|
|
|
@node Options, Other Arguments, Invoking Gawk, Invoking Gawk
|
|
@section Command Line Options
|
|
|
|
Options begin with a dash, and consist of a single character.
|
|
GNU style long options consist of two dashes and a keyword.
|
|
The keyword can be abbreviated, as long the abbreviation allows the option
|
|
to be uniquely identified. If the option takes an argument, then the
|
|
keyword is either immediately followed by an equals sign (@samp{=}) and the
|
|
argument's value, or the keyword and the argument's value are separated
|
|
by whitespace. For brevity, the discussion below only refers to the
|
|
traditional short options; however the long and short options are
|
|
interchangeable in all contexts.
|
|
|
|
Each long option for @code{gawk} has a corresponding
|
|
POSIX-style option. The options and their meanings are as follows:
|
|
|
|
@table @code
|
|
@item -F @var{fs}
|
|
@itemx --field-separator @var{fs}
|
|
@cindex @code{-F} option
|
|
@cindex @code{--field-separator} option
|
|
Sets the @code{FS} variable to @var{fs}
|
|
(@pxref{Field Separators, ,Specifying How Fields are Separated}).
|
|
|
|
@item -f @var{source-file}
|
|
@itemx --file @var{source-file}
|
|
@cindex @code{-f} option
|
|
@cindex @code{--file} option
|
|
Indicates that the @code{awk} program is to be found in @var{source-file}
|
|
instead of in the first non-option argument.
|
|
|
|
@item -v @var{var}=@var{val}
|
|
@itemx --assign @var{var}=@var{val}
|
|
@cindex @code{-v} option
|
|
@cindex @code{--assign} option
|
|
Sets the variable @var{var} to the value @var{val} @strong{before}
|
|
execution of the program begins. Such variable values are available
|
|
inside the @code{BEGIN} rule
|
|
(@pxref{Other Arguments, ,Other Command Line Arguments}).
|
|
|
|
The @samp{-v} option can only set one variable, but you can use
|
|
it more than once, setting another variable each time, like this:
|
|
@samp{awk @w{-v foo=1} @w{-v bar=2} @dots{}}.
|
|
|
|
@item -mf @var{NNN}
|
|
@itemx -mr @var{NNN}
|
|
Set various memory limits to the value @var{NNN}. The @samp{f} flag sets
|
|
the maximum number of fields, and the @samp{r} flag sets the maximum
|
|
record size. These two flags and the @samp{-m} option are from the
|
|
Bell Labs research version of Unix @code{awk}. They are provided
|
|
for compatibility, but otherwise ignored by
|
|
@code{gawk}, since @code{gawk} has no predefined limits.
|
|
|
|
@item -W @var{gawk-opt}
|
|
@cindex @code{-W} option
|
|
Following the POSIX standard, options that are implementation
|
|
specific are supplied as arguments to the @samp{-W} option. These options
|
|
also have corresponding GNU style long options.
|
|
See below.
|
|
|
|
@item --
|
|
Signals the end of the command line options. The following arguments
|
|
are not treated as options even if they begin with @samp{-}. This
|
|
interpretation of @samp{--} follows the POSIX argument parsing
|
|
conventions.
|
|
|
|
This is useful if you have file names that start with @samp{-},
|
|
or in shell scripts, if you have file names that will be specified
|
|
by the user which could start with @samp{-}.
|
|
@end table
|
|
|
|
The following @code{gawk}-specific options are available:
|
|
|
|
@table @code
|
|
@item -W traditional
|
|
@itemx -W compat
|
|
@itemx --traditional
|
|
@itemx --compat
|
|
@cindex @code{--compat} option
|
|
@cindex @code{--traditional} option
|
|
@cindex compatibility mode
|
|
Specifies @dfn{compatibility mode}, in which the GNU extensions to
|
|
the @code{awk} language are disabled, so that @code{gawk} behaves just
|
|
like the Bell Labs research version of Unix @code{awk}.
|
|
@samp{--traditional} is the preferred form of this option.
|
|
@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}},
|
|
which summarizes the extensions. Also see
|
|
@ref{Compatibility Mode, ,Downward Compatibility and Debugging}.
|
|
|
|
@item -W copyleft
|
|
@itemx -W copyright
|
|
@itemx --copyleft
|
|
@itemx --copyright
|
|
@cindex @code{--copyleft} option
|
|
@cindex @code{--copyright} option
|
|
Print the short version of the General Public License, and then exit.
|
|
This option may disappear in a future version of @code{gawk}.
|
|
|
|
@item -W help
|
|
@itemx -W usage
|
|
@itemx --help
|
|
@itemx --usage
|
|
@cindex @code{--help} option
|
|
@cindex @code{--usage} option
|
|
Print a ``usage'' message summarizing the short and long style options
|
|
that @code{gawk} accepts, and then exit.
|
|
|
|
@item -W lint
|
|
@itemx --lint
|
|
@cindex @code{--lint} option
|
|
Warn about constructs that are dubious or non-portable to
|
|
other @code{awk} implementations.
|
|
Some warnings are issued when @code{gawk} first reads your program. Others
|
|
are issued at run-time, as your program executes.
|
|
|
|
@item -W lint-old
|
|
@itemx --lint-old
|
|
@cindex @code{--lint-old} option
|
|
Warn about constructs that are not available in
|
|
the original Version 7 Unix version of @code{awk}
|
|
(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).
|
|
|
|
@item -W posix
|
|
@itemx --posix
|
|
@cindex @code{--posix} option
|
|
@cindex POSIX mode
|
|
Operate in strict POSIX mode. This disables all @code{gawk}
|
|
extensions (just like @samp{--traditional}), and adds the following additional
|
|
restrictions:
|
|
|
|
@c IMPORTANT! Keep this list in sync with the one in node POSIX
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@code{\x} escape sequences are not recognized
|
|
(@pxref{Escape Sequences}).
|
|
|
|
@item
|
|
Newlines do not act as whitespace to separate fields when @code{FS} is
|
|
equal to a single space.
|
|
|
|
@item
|
|
The synonym @code{func} for the keyword @code{function} is not
|
|
recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).
|
|
|
|
@item
|
|
The operators @samp{**} and @samp{**=} cannot be used in
|
|
place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
|
|
and also @pxref{Assignment Ops, ,Assignment Expressions}).
|
|
|
|
@item
|
|
Specifying @samp{-Ft} on the command line does not set the value
|
|
of @code{FS} to be a single tab character
|
|
(@pxref{Field Separators, ,Specifying How Fields are Separated}).
|
|
|
|
@item
|
|
The @code{fflush} built-in function is not supported
|
|
(@pxref{I/O Functions, , Built-in Functions for Input/Output}).
|
|
@end itemize
|
|
|
|
If you supply both @samp{--traditional} and @samp{--posix} on the
|
|
command line, @samp{--posix} will take precedence. @code{gawk}
|
|
will also issue a warning if both options are supplied.
|
|
|
|
@item -W re-interval
|
|
@itemx --re-interval
|
|
Allow interval expressions
|
|
(@pxref{Regexp Operators, , Regular Expression Operators}),
|
|
in regexps.
|
|
Because interval expressions were traditionally not available in @code{awk},
|
|
@code{gawk} does not provide them by default. This prevents old @code{awk}
|
|
programs from breaking.
|
|
|
|
@item -W source @var{program-text}
|
|
@itemx --source @var{program-text}
|
|
@cindex @code{--source} option
|
|
Program source code is taken from the @var{program-text}. This option
|
|
allows you to mix source code in files with source
|
|
code that you enter on the command line. This is particularly useful
|
|
when you have library functions that you wish to use from your command line
|
|
programs (@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
|
|
|
|
@item -W version
|
|
@itemx --version
|
|
@cindex @code{--version} option
|
|
Prints version information for this particular copy of @code{gawk}.
|
|
This allows you to determine if your copy of @code{gawk} is up to date
|
|
with respect to whatever the Free Software Foundation is currently
|
|
distributing.
|
|
It is also useful for bug reports
|
|
(@pxref{Bugs, , Reporting Problems and Bugs}).
|
|
@end table
|
|
|
|
Any other options are flagged as invalid with a warning message, but
|
|
are otherwise ignored.
|
|
|
|
In compatibility mode, as a special case, if the value of @var{fs} supplied
|
|
to the @samp{-F} option is @samp{t}, then @code{FS} is set to the tab
|
|
character (@code{"\t"}). This is only true for @samp{--traditional}, and not
|
|
for @samp{--posix}
|
|
(@pxref{Field Separators, ,Specifying How Fields are Separated}).
|
|
|
|
The @samp{-f} option may be used more than once on the command line.
|
|
If it is, @code{awk} reads its program source from all of the named files, as
|
|
if they had been concatenated together into one big file. This is
|
|
useful for creating libraries of @code{awk} functions. Useful functions
|
|
can be written once, and then retrieved from a standard place, instead
|
|
of having to be included into each individual program.
|
|
|
|
You can type in a program at the terminal and still use library functions,
|
|
by specifying @samp{-f /dev/tty}. @code{awk} will read a file from the terminal
|
|
to use as part of the @code{awk} program. After typing your program,
|
|
type @kbd{Control-d} (the end-of-file character) to terminate it.
|
|
(You may also use @samp{-f -} to read program source from the standard
|
|
input, but then you will not be able to also use the standard input as a
|
|
source of data.)
|
|
|
|
Because it is clumsy using the standard @code{awk} mechanisms to mix source
|
|
file and command line @code{awk} programs, @code{gawk} provides the
|
|
@samp{--source} option. This does not require you to pre-empt the standard
|
|
input for your source code, and allows you to easily mix command line
|
|
and library source code
|
|
(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
|
|
|
|
If no @samp{-f} or @samp{--source} option is specified, then @code{gawk}
|
|
will use the first non-option command line argument as the text of the
|
|
program source code.
|
|
|
|
@cindex @code{POSIXLY_CORRECT} environment variable
|
|
@cindex environment variable, @code{POSIXLY_CORRECT}
|
|
If the environment variable @code{POSIXLY_CORRECT} exists,
|
|
then @code{gawk} will behave in strict POSIX mode, exactly as if
|
|
you had supplied the @samp{--posix} command line option.
|
|
Many GNU programs look for this environment variable to turn on
|
|
strict POSIX mode. If you supply @samp{--lint} on the command line,
|
|
and @code{gawk} turns on POSIX mode because of @code{POSIXLY_CORRECT},
|
|
then it will print a warning message indicating that POSIX
|
|
mode is in effect.
|
|
|
|
You would typically set this variable in your shell's startup file.
|
|
For a Bourne compatible shell (such as Bash), you would add these
|
|
lines to the @file{.profile} file in your home directory.
|
|
|
|
@example
|
|
@group
|
|
POSIXLY_CORRECT=true
|
|
export POSIXLY_CORRECT
|
|
@end group
|
|
@end example
|
|
|
|
For a @code{csh} compatible shell,@footnote{Not recommended.}
|
|
you would add this line to the @file{.login} file in your home directory.
|
|
|
|
@example
|
|
setenv POSIXLY_CORRECT true
|
|
@end example
|
|
|
|
@node Other Arguments, AWKPATH Variable, Options, Invoking Gawk
|
|
@section Other Command Line Arguments
|
|
|
|
Any additional arguments on the command line are normally treated as
|
|
input files to be processed in the order specified. However, an
|
|
argument that has the form @code{@var{var}=@var{value}}, assigns
|
|
the value @var{value} to the variable @var{var}---it does not specify a
|
|
file at all.
|
|
|
|
@vindex ARGIND
|
|
@vindex ARGV
|
|
All these arguments are made available to your @code{awk} program in the
|
|
@code{ARGV} array (@pxref{Built-in Variables}). Command line options
|
|
and the program text (if present) are omitted from @code{ARGV}.
|
|
All other arguments, including variable assignments, are
|
|
included. As each element of @code{ARGV} is processed, @code{gawk}
|
|
sets the variable @code{ARGIND} to the index in @code{ARGV} of the
|
|
current element.
|
|
|
|
The distinction between file name arguments and variable-assignment
|
|
arguments is made when @code{awk} is about to open the next input file.
|
|
At that point in execution, it checks the ``file name'' to see whether
|
|
it is really a variable assignment; if so, @code{awk} sets the variable
|
|
instead of reading a file.
|
|
|
|
Therefore, the variables actually receive the given values after all
|
|
previously specified files have been read. In particular, the values of
|
|
variables assigned in this fashion are @emph{not} available inside a
|
|
@code{BEGIN} rule
|
|
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}),
|
|
since such rules are run before @code{awk} begins scanning the argument list.
|
|
|
|
@cindex dark corner
|
|
The variable values given on the command line are processed for escape
|
|
sequences (d.c.) (@pxref{Escape Sequences}).
|
|
|
|
In some earlier implementations of @code{awk}, when a variable assignment
|
|
occurred before any file names, the assignment would happen @emph{before}
|
|
the @code{BEGIN} rule was executed. @code{awk}'s behavior was thus
|
|
inconsistent; some command line assignments were available inside the
|
|
@code{BEGIN} rule, while others were not. However,
|
|
some applications came to depend
|
|
upon this ``feature.'' When @code{awk} was changed to be more consistent,
|
|
the @samp{-v} option was added to accommodate applications that depended
|
|
upon the old behavior.
|
|
|
|
The variable assignment feature is most useful for assigning to variables
|
|
such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and
|
|
output formats, before scanning the data files. It is also useful for
|
|
controlling state if multiple passes are needed over a data file. For
|
|
example:
|
|
|
|
@cindex multiple passes over data
|
|
@cindex passes, multiple
|
|
@example
|
|
awk 'pass == 1 @{ @var{pass 1 stuff} @}
|
|
pass == 2 @{ @var{pass 2 stuff} @}' pass=1 mydata pass=2 mydata
|
|
@end example
|
|
|
|
Given the variable assignment feature, the @samp{-F} option for setting
|
|
the value of @code{FS} is not
|
|
strictly necessary. It remains for historical compatibility.
|
|
|
|
@node AWKPATH Variable, Obsolete, Other Arguments, Invoking Gawk
|
|
@section The @code{AWKPATH} Environment Variable
|
|
@cindex @code{AWKPATH} environment variable
|
|
@cindex environment variable, @code{AWKPATH}
|
|
@cindex search path
|
|
@cindex directory search
|
|
@cindex path, search
|
|
@cindex differences between @code{gawk} and @code{awk}
|
|
|
|
The previous section described how @code{awk} program files can be named
|
|
on the command line with the @samp{-f} option. In most @code{awk}
|
|
implementations, you must supply a precise path name for each program
|
|
file, unless the file is in the current directory.
|
|
|
|
@cindex search path, for source files
|
|
But in @code{gawk}, if the file name supplied to the @samp{-f} option
|
|
does not contain a @samp{/}, then @code{gawk} searches a list of
|
|
directories (called the @dfn{search path}), one by one, looking for a
|
|
file with the specified name.
|
|
|
|
The search path is a string consisting of directory names
|
|
separated by colons. @code{gawk} gets its search path from the
|
|
@code{AWKPATH} environment variable. If that variable does not exist,
|
|
@code{gawk} uses a default path, which is
|
|
@samp{.:/usr/local/share/awk}.@footnote{Your version of @code{gawk}
|
|
may use a directory that is different than @file{/usr/local/share/awk}; it
|
|
will depend upon how @code{gawk} was built and installed. The actual
|
|
directory will be the value of @samp{$(datadir)} generated when
|
|
@code{gawk} was configured. You probably don't need to worry about this
|
|
though.} (Programs written for use by
|
|
system administrators should use an @code{AWKPATH} variable that
|
|
does not include the current directory, @file{.}.)
|
|
|
|
The search path feature is particularly useful for building up libraries
|
|
of useful @code{awk} functions. The library files can be placed in a
|
|
standard directory that is in the default path, and then specified on
|
|
the command line with a short file name. Otherwise, the full file name
|
|
would have to be typed for each file.
|
|
|
|
By using both the @samp{--source} and @samp{-f} options, your command line
|
|
@code{awk} programs can use facilities in @code{awk} library files.
|
|
@xref{Library Functions, , A Library of @code{awk} Functions}.
|
|
|
|
Path searching is not done if @code{gawk} is in compatibility mode.
|
|
This is true for both @samp{--traditional} and @samp{--posix}.
|
|
@xref{Options, ,Command Line Options}.
|
|
|
|
@strong{Note:} if you want files in the current directory to be found,
|
|
you must include the current directory in the path, either by including
|
|
@file{.} explicitly in the path, or by writing a null entry in the
|
|
path. (A null entry is indicated by starting or ending the path with a
|
|
colon, or by placing two colons next to each other (@samp{::}).) If the
|
|
current directory is not included in the path, then files cannot be
|
|
found in the current directory. This path search mechanism is identical
|
|
to the shell's.
|
|
@c someday, @cite{The Bourne Again Shell}....
|
|
|
|
Starting with version 3.0, if @code{AWKPATH} is not defined in the
|
|
environment, @code{gawk} will place its default search path into
|
|
@code{ENVIRON["AWKPATH"]}. This makes it easy to determine
|
|
the actual search path @code{gawk} will use.
|
|
|
|
@node Obsolete, Undocumented, AWKPATH Variable, Invoking Gawk
|
|
@section Obsolete Options and/or Features
|
|
|
|
@cindex deprecated options
|
|
@cindex obsolete options
|
|
@cindex deprecated features
|
|
@cindex obsolete features
|
|
This section describes features and/or command line options from
|
|
previous releases of @code{gawk} that are either not available in the
|
|
current version, or that are still supported but deprecated (meaning that
|
|
they will @emph{not} be in the next release).
|
|
|
|
@c update this section for each release!
|
|
|
|
For version @value{VERSION}.@value{PATCHLEVEL} of @code{gawk}, there are no
|
|
command line options
|
|
or other deprecated features from the previous version of @code{gawk}.
|
|
@iftex
|
|
This section
|
|
@end iftex
|
|
@ifinfo
|
|
This node
|
|
@end ifinfo
|
|
is thus essentially a place holder,
|
|
in case some option becomes obsolete in a future version of @code{gawk}.
|
|
|
|
@ignore
|
|
@c This is pretty old news...
|
|
The public-domain version of @code{strftime} that is distributed with
|
|
@code{gawk} changed for the 2.14 release. The @samp{%V} conversion specifier
|
|
that used to generate the date in VMS format was changed to @samp{%v}.
|
|
This is because the POSIX standard for the @code{date} utility now
|
|
specifies a @samp{%V} conversion specifier.
|
|
@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for details.
|
|
@end ignore
|
|
|
|
@node Undocumented, Known Bugs, Obsolete, Invoking Gawk
|
|
@section Undocumented Options and Features
|
|
@cindex undocumented features
|
|
@display
|
|
@i{Use the Source, Luke!}
|
|
Obi-Wan
|
|
@end display
|
|
@sp 1
|
|
|
|
This section intentionally left blank.
|
|
|
|
@c Read The Source, Luke!
|
|
|
|
@ignore
|
|
@c If these came out in the Info file or TeX document, then they wouldn't
|
|
@c be undocumented, would they?
|
|
|
|
@code{gawk} has one undocumented option:
|
|
|
|
@table @code
|
|
@item -W nostalgia
|
|
@itemx --nostalgia
|
|
Print the message @code{"awk: bailing out near line 1"} and dump core.
|
|
This option was inspired by the common behavior of very early versions of
|
|
Unix @code{awk}, and by a t--shirt.
|
|
@end table
|
|
|
|
Early versions of @code{awk} used to not require any separator (either
|
|
a newline or @samp{;}) between the rules in @code{awk} programs. Thus,
|
|
it was common to see one-line programs like:
|
|
|
|
@example
|
|
awk '@{ sum += $1 @} END @{ print sum @}'
|
|
@end example
|
|
|
|
@code{gawk} actually supports this, but it is purposely undocumented
|
|
since it is considered bad style. The correct way to write such a program
|
|
is either
|
|
|
|
@example
|
|
awk '@{ sum += $1 @} ; END @{ print sum @}'
|
|
@end example
|
|
|
|
@noindent
|
|
or
|
|
|
|
@example
|
|
awk '@{ sum += $1 @}
|
|
END @{ print sum @}' data
|
|
@end example
|
|
|
|
@noindent
|
|
@xref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a fuller
|
|
explanation.
|
|
|
|
@end ignore
|
|
|
|
@node Known Bugs, , Undocumented, Invoking Gawk
|
|
@section Known Bugs in @code{gawk}
|
|
@cindex bugs, known in @code{gawk}
|
|
@cindex known bugs
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @samp{-F} option for changing the value of @code{FS}
|
|
(@pxref{Options, ,Command Line Options})
|
|
is not necessary given the command line variable
|
|
assignment feature; it remains only for backwards compatibility.
|
|
|
|
@item
|
|
If your system actually has support for @file{/dev/fd} and the
|
|
associated @file{/dev/stdin}, @file{/dev/stdout}, and
|
|
@file{/dev/stderr} files, you may get different output from @code{gawk}
|
|
than you would get on a system without those files. When @code{gawk}
|
|
interprets these files internally, it synchronizes output to the
|
|
standard output with output to @file{/dev/stdout}, while on a system
|
|
with those files, the output is actually to different open files
|
|
(@pxref{Special Files, ,Special File Names in @code{gawk}}).
|
|
|
|
@item
|
|
Syntactically invalid single character programs tend to overflow
|
|
the parse stack, generating a rather unhelpful message. Such programs
|
|
are surprisingly difficult to diagnose in the completely general case,
|
|
and the effort to do so really is not worth it.
|
|
@end itemize
|
|
|
|
@node Library Functions, Sample Programs, Invoking Gawk, Top
|
|
@chapter A Library of @code{awk} Functions
|
|
|
|
@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
|
|
This chapter presents a library of useful @code{awk} functions. The
|
|
sample programs presented later
|
|
(@pxref{Sample Programs, ,Practical @code{awk} Programs})
|
|
use these functions.
|
|
The functions are presented here in a progression from simple to complex.
|
|
|
|
@ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
|
|
presents a program that you can use to extract the source code for
|
|
these example library functions and programs from the Texinfo source
|
|
for this @value{DOCUMENT}.
|
|
(This has already been done as part of the @code{gawk} distribution.)
|
|
|
|
If you have written one or more useful, general purpose @code{awk} functions,
|
|
and would like to contribute them for a subsequent edition of this @value{DOCUMENT},
|
|
please contact the author. @xref{Bugs, ,Reporting Problems and Bugs},
|
|
for information on doing this. Don't just send code, as you will be
|
|
required to either place your code in the public domain,
|
|
publish it under the GPL (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),
|
|
or assign the copyright in it to the Free Software Foundation.
|
|
|
|
@menu
|
|
* Portability Notes:: What to do if you don't have @code{gawk}.
|
|
* Nextfile Function:: Two implementations of a @code{nextfile}
|
|
function.
|
|
* Assert Function:: A function for assertions in @code{awk}
|
|
programs.
|
|
* Round Function:: A function for rounding if @code{sprintf} does
|
|
not do it correctly.
|
|
* Ordinal Functions:: Functions for using characters as numbers and
|
|
vice versa.
|
|
* Join Function:: A function to join an array into a string.
|
|
* Mktime Function:: A function to turn a date into a timestamp.
|
|
* Gettimeofday Function:: A function to get formatted times.
|
|
* Filetrans Function:: A function for handling data file transitions.
|
|
* Getopt Function:: A function for processing command line
|
|
arguments.
|
|
* Passwd Functions:: Functions for getting user information.
|
|
* Group Functions:: Functions for getting group information.
|
|
* Library Names:: How to best name private global variables in
|
|
library functions.
|
|
@end menu
|
|
|
|
@node Portability Notes, Nextfile Function, Library Functions, Library Functions
|
|
@section Simulating @code{gawk}-specific Features
|
|
@cindex portability issues
|
|
|
|
The programs in this chapter and in
|
|
@ref{Sample Programs, ,Practical @code{awk} Programs},
|
|
freely use features that are specific to @code{gawk}.
|
|
This section briefly discusses how you can rewrite these programs for
|
|
different implementations of @code{awk}.
|
|
|
|
Diagnostic error messages are sent to @file{/dev/stderr}.
|
|
Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"}, if your system
|
|
does not have a @file{/dev/stderr}, or if you cannot use @code{gawk}.
|
|
|
|
A number of programs use @code{nextfile}
|
|
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}),
|
|
to skip any remaining input in the input file.
|
|
@ref{Nextfile Function, ,Implementing @code{nextfile} as a Function},
|
|
shows you how to write a function that will do the same thing.
|
|
|
|
Finally, some of the programs choose to ignore upper-case and lower-case
|
|
distinctions in their input. They do this by assigning one to @code{IGNORECASE}.
|
|
You can achieve the same effect by adding the following rule to the
|
|
beginning of the program:
|
|
|
|
@example
|
|
# ignore case
|
|
@{ $0 = tolower($0) @}
|
|
@end example
|
|
|
|
@noindent
|
|
Also, verify that all regexp and string constants used in
|
|
comparisons only use lower-case letters.
|
|
|
|
@node Nextfile Function, Assert Function, Portability Notes, Library Functions
|
|
@section Implementing @code{nextfile} as a Function
|
|
|
|
@cindex skipping input files
|
|
@cindex input files, skipping
|
|
The @code{nextfile} statement presented in
|
|
@ref{Nextfile Statement, ,The @code{nextfile} Statement},
|
|
is a @code{gawk}-specific extension. It is not available in other
|
|
implementations of @code{awk}. This section shows two versions of a
|
|
@code{nextfile} function that you can use to simulate @code{gawk}'s
|
|
@code{nextfile} statement if you cannot use @code{gawk}.
|
|
|
|
Here is a first attempt at writing a @code{nextfile} function.
|
|
|
|
@example
|
|
@group
|
|
# nextfile --- skip remaining records in current file
|
|
|
|
# this should be read in before the "main" awk program
|
|
|
|
function nextfile() @{ _abandon_ = FILENAME; next @}
|
|
|
|
_abandon_ == FILENAME @{ next @}
|
|
@end group
|
|
@end example
|
|
|
|
This file should be included before the main program, because it supplies
|
|
a rule that must be executed first. This rule compares the current data
|
|
file's name (which is always in the @code{FILENAME} variable) to a private
|
|
variable named @code{_abandon_}. If the file name matches, then the action
|
|
part of the rule executes a @code{next} statement, to go on to the next
|
|
record. (The use of @samp{_} in the variable name is a convention.
|
|
It is discussed more fully in
|
|
@ref{Library Names, , Naming Library Function Global Variables}.)
|
|
|
|
The use of the @code{next} statement effectively creates a loop that reads
|
|
all the records from the current data file.
|
|
Eventually, the end of the file is reached, and
|
|
a new data file is opened, changing the value of @code{FILENAME}.
|
|
Once this happens, the comparison of @code{_abandon_} to @code{FILENAME}
|
|
fails, and execution continues with the first rule of the ``real'' program.
|
|
|
|
The @code{nextfile} function itself simply sets the value of @code{_abandon_}
|
|
and then executes a @code{next} statement to start the loop
|
|
going.@footnote{Some implementations of @code{awk} do not allow you to
|
|
execute @code{next} from within a function body. Some other work-around
|
|
will be necessary if you use such a version.}
|
|
@c mawk is what we're talking about.
|
|
|
|
This initial version has a subtle problem. What happens if the same data
|
|
file is listed @emph{twice} on the command line, one right after the other,
|
|
or even with just a variable assignment between the two occurrences of
|
|
the file name?
|
|
|
|
@c @findex nextfile
|
|
@c do it this way, since all the indices are merged
|
|
@cindex @code{nextfile} function
|
|
In such a case,
|
|
this code will skip right through the file, a second time, even though
|
|
it should stop when it gets to the end of the first occurrence.
|
|
Here is a second version of @code{nextfile} that remedies this problem.
|
|
|
|
@example
|
|
@group
|
|
@c file eg/lib/nextfile.awk
|
|
# nextfile --- skip remaining records in current file
|
|
# correctly handle successive occurrences of the same file
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May, 1993
|
|
|
|
# this should be read in before the "main" awk program
|
|
|
|
function nextfile() @{ _abandon_ = FILENAME; next @}
|
|
|
|
_abandon_ == FILENAME @{
|
|
if (FNR == 1)
|
|
_abandon_ = ""
|
|
else
|
|
next
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
The @code{nextfile} function has not changed. It sets @code{_abandon_}
|
|
equal to the current file name and then executes a @code{next} satement.
|
|
The @code{next} statement reads the next record and increments @code{FNR},
|
|
so @code{FNR} is guaranteed to have a value of at least two.
|
|
However, if @code{nextfile} is called for the last record in the file,
|
|
then @code{awk} will close the current data file and move on to the next
|
|
one. Upon doing so, @code{FILENAME} will be set to the name of the new file,
|
|
and @code{FNR} will be reset to one. If this next file is the same as
|
|
the previous one, @code{_abandon_} will still be equal to @code{FILENAME}.
|
|
However, @code{FNR} will be equal to one, telling us that this is a new
|
|
occurrence of the file, and not the one we were reading when the
|
|
@code{nextfile} function was executed. In that case, @code{_abandon_}
|
|
is reset to the empty string, so that further executions of this rule
|
|
will fail (until the next time that @code{nextfile} is called).
|
|
|
|
If @code{FNR} is not one, then we are still in the original data file,
|
|
and the program executes a @code{next} statement to skip through it.
|
|
|
|
An important question to ask at this point is: ``Given that the
|
|
functionality of @code{nextfile} can be provided with a library file,
|
|
why is it built into @code{gawk}?'' This is an important question. Adding
|
|
features for little reason leads to larger, slower programs that are
|
|
harder to maintain.
|
|
|
|
The answer is that building @code{nextfile} into @code{gawk} provides
|
|
significant gains in efficiency. If the @code{nextfile} function is executed
|
|
at the beginning of a large data file, @code{awk} still has to scan the entire
|
|
file, splitting it up into records, just to skip over it. The built-in
|
|
@code{nextfile} can simply close the file immediately and proceed to the
|
|
next one, saving a lot of time. This is particularly important in
|
|
@code{awk}, since @code{awk} programs are generally I/O bound (i.e.@:
|
|
they spend most of their time doing input and output, instead of performing
|
|
computations).
|
|
|
|
@node Assert Function, Round Function, Nextfile Function, Library Functions
|
|
@section Assertions
|
|
|
|
@cindex assertions
|
|
@cindex @code{assert}, C version
|
|
When writing large programs, it is often useful to be able to know
|
|
that a condition or set of conditions is true. Before proceeding with a
|
|
particular computation, you make a statement about what you believe to be
|
|
the case. Such a statement is known as an
|
|
``assertion.'' The C language provides an @code{<assert.h>} header file
|
|
and corresponding @code{assert} macro that the programmer can use to make
|
|
assertions. If an assertion fails, the @code{assert} macro arranges to
|
|
print a diagnostic message describing the condition that should have
|
|
been true but was not, and then it kills the program. In C, using
|
|
@code{assert} looks this:
|
|
|
|
@example
|
|
#include <assert.h>
|
|
|
|
int myfunc(int a, double b)
|
|
@{
|
|
assert(a <= 5 && b >= 17);
|
|
@dots{}
|
|
@}
|
|
@end example
|
|
|
|
If the assertion failed, the program would print a message similar to
|
|
this:
|
|
|
|
@example
|
|
prog.c:5: assertion failed: a <= 5 && b >= 17
|
|
@end example
|
|
|
|
@findex assert
|
|
The ANSI C language makes it possible to turn the condition into a string for use
|
|
in printing the diagnostic message. This is not possible in @code{awk}, so
|
|
this @code{assert} function also requires a string version of the condition
|
|
that is being tested.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/assert.awk
|
|
# assert --- assert that a condition is true. Otherwise exit.
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May, 1993
|
|
|
|
function assert(condition, string)
|
|
@{
|
|
if (! condition) @{
|
|
printf("%s:%d: assertion failed: %s\n",
|
|
FILENAME, FNR, string) > "/dev/stderr"
|
|
_assert_exit = 1
|
|
exit 1
|
|
@}
|
|
@}
|
|
|
|
END @{
|
|
if (_assert_exit)
|
|
exit 1
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{assert} function tests the @code{condition} parameter. If it
|
|
is false, it prints a message to standard error, using the @code{string}
|
|
parameter to describe the failed condition. It then sets the variable
|
|
@code{_assert_exit} to one, and executes the @code{exit} statement.
|
|
The @code{exit} statement jumps to the @code{END} rule. If the @code{END}
|
|
rules finds @code{_assert_exit} to be true, then it exits immediately.
|
|
|
|
The purpose of the @code{END} rule with its test is to
|
|
keep any other @code{END} rules from running. When an assertion fails, the
|
|
program should exit immediately.
|
|
If no assertions fail, then @code{_assert_exit} will still be
|
|
false when the @code{END} rule is run normally, and the rest of the
|
|
program's @code{END} rules will execute.
|
|
For all of this to work correctly, @file{assert.awk} must be the
|
|
first source file read by @code{awk}.
|
|
|
|
You would use this function in your programs this way:
|
|
|
|
@example
|
|
function myfunc(a, b)
|
|
@{
|
|
assert(a <= 5 && b >= 17, "a <= 5 && b >= 17")
|
|
@dots{}
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
If the assertion failed, you would see a message like this:
|
|
|
|
@example
|
|
mydata:1357: assertion failed: a <= 5 && b >= 17
|
|
@end example
|
|
|
|
There is a problem with this version of @code{assert}, that it may not
|
|
be possible to work around with standard @code{awk}.
|
|
An @code{END} rule is automatically added
|
|
to the program calling @code{assert}. Normally, if a program consists
|
|
of just a @code{BEGIN} rule, the input files and/or standard input are
|
|
not read. However, now that the program has an @code{END} rule, @code{awk}
|
|
will attempt to read the input data files, or standard input
|
|
(@pxref{Using BEGIN/END, , Startup and Cleanup Actions}),
|
|
most likely causing the program to hang, waiting for input.
|
|
|
|
@node Round Function, Ordinal Functions, Assert Function, Library Functions
|
|
@section Rounding Numbers
|
|
|
|
@cindex rounding
|
|
The way @code{printf} and @code{sprintf}
|
|
(@pxref{Printf, , Using @code{printf} Statements for Fancier Printing})
|
|
do rounding will often depend
|
|
upon the system's C @code{sprintf} subroutine.
|
|
On many machines,
|
|
@code{sprintf} rounding is ``unbiased,'' which means it doesn't always
|
|
round a trailing @samp{.5} up, contrary to naive expectations. In unbiased
|
|
rounding, @samp{.5} rounds to even, rather than always up, so 1.5 rounds to
|
|
2 but 4.5 rounds to 4.
|
|
The result is that if you are using a format that does
|
|
rounding (e.g., @code{"%.0f"}) you should check what your system does.
|
|
The following function does traditional rounding;
|
|
it might be useful if your awk's @code{printf} does unbiased rounding.
|
|
|
|
@findex round
|
|
@example
|
|
@c file eg/lib/round.awk
|
|
# round --- do normal rounding
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, August, 1996
|
|
# Public Domain
|
|
|
|
function round(x, ival, aval, fraction)
|
|
@{
|
|
ival = int(x) # integer part, int() truncates
|
|
|
|
# see if fractional part
|
|
if (ival == x) # no fraction
|
|
return x
|
|
|
|
if (x < 0) @{
|
|
aval = -x # absolute value
|
|
ival = int(aval)
|
|
fraction = aval - ival
|
|
if (fraction >= .5)
|
|
return int(x) - 1 # -2.5 --> -3
|
|
else
|
|
return int(x) # -2.3 --> -2
|
|
@} else @{
|
|
fraction = x - ival
|
|
if (fraction >= .5)
|
|
return ival + 1
|
|
else
|
|
return ival
|
|
@}
|
|
@}
|
|
|
|
# test harness
|
|
@{ print $0, round($0) @}
|
|
@c endfile
|
|
@end example
|
|
|
|
@node Ordinal Functions, Join Function, Round Function, Library Functions
|
|
@section Translating Between Characters and Numbers
|
|
|
|
@cindex numeric character values
|
|
@cindex values of characters as numbers
|
|
One commercial implementation of @code{awk} supplies a built-in function,
|
|
@code{ord}, which takes a character and returns the numeric value for that
|
|
character in the machine's character set. If the string passed to
|
|
@code{ord} has more than one character, only the first one is used.
|
|
|
|
The inverse of this function is @code{chr} (from the function of the same
|
|
name in Pascal), which takes a number and returns the corresponding character.
|
|
|
|
Both functions can be written very nicely in @code{awk}; there is no real
|
|
reason to build them into the @code{awk} interpreter.
|
|
|
|
@findex ord
|
|
@findex chr
|
|
@example
|
|
@group
|
|
@c file eg/lib/ord.awk
|
|
# ord.awk --- do ord and chr
|
|
#
|
|
# Global identifiers:
|
|
# _ord_: numerical values indexed by characters
|
|
# _ord_init: function to initialize _ord_
|
|
#
|
|
# Arnold Robbins
|
|
# arnold@@gnu.org
|
|
# Public Domain
|
|
# 16 January, 1992
|
|
# 20 July, 1992, revised
|
|
|
|
BEGIN @{ _ord_init() @}
|
|
@c endfile
|
|
@end group
|
|
|
|
@c @group
|
|
@c file eg/lib/ord.awk
|
|
function _ord_init( low, high, i, t)
|
|
@{
|
|
low = sprintf("%c", 7) # BEL is ascii 7
|
|
if (low == "\a") @{ # regular ascii
|
|
low = 0
|
|
high = 127
|
|
@} else if (sprintf("%c", 128 + 7) == "\a") @{
|
|
# ascii, mark parity
|
|
low = 128
|
|
high = 255
|
|
@} else @{ # ebcdic(!)
|
|
low = 0
|
|
high = 255
|
|
@}
|
|
|
|
for (i = low; i <= high; i++) @{
|
|
t = sprintf("%c", i)
|
|
_ord_[t] = i
|
|
@}
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
@cindex character sets
|
|
@cindex character encodings
|
|
@cindex ASCII
|
|
@cindex EBCDIC
|
|
@cindex mark parity
|
|
Some explanation of the numbers used by @code{chr} is worthwhile.
|
|
The most prominent character set in use today is ASCII. Although an
|
|
eight-bit byte can hold 256 distinct values (from zero to 255), ASCII only
|
|
defines characters that use the values from zero to 127.@footnote{ASCII
|
|
has been extended in many countries to use the values from 128 to 255
|
|
for country-specific characters. If your system uses these extensions,
|
|
you can simplify @code{_ord_init} to simply loop from zero to 255.}
|
|
At least one computer manufacturer that we know of
|
|
@c Pr1me, blech
|
|
uses ASCII, but with mark parity, meaning that the leftmost bit in the byte
|
|
is always one. What this means is that on those systems, characters
|
|
have numeric values from 128 to 255.
|
|
Finally, large mainframe systems use the EBCDIC character set, which
|
|
uses all 256 values.
|
|
While there are other character sets in use on some older systems,
|
|
they are not really worth worrying about.
|
|
|
|
@example
|
|
@group
|
|
@c file eg/lib/ord.awk
|
|
function ord(str, c)
|
|
@{
|
|
# only first character is of interest
|
|
c = substr(str, 1, 1)
|
|
return _ord_[c]
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
|
|
@group
|
|
@c file eg/lib/ord.awk
|
|
function chr(c)
|
|
@{
|
|
# force c to be numeric by adding 0
|
|
return sprintf("%c", c + 0)
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
|
|
@c @group
|
|
@c file eg/lib/ord.awk
|
|
#### test code ####
|
|
# BEGIN \
|
|
# @{
|
|
# for (;;) @{
|
|
# printf("enter a character: ")
|
|
# if (getline var <= 0)
|
|
# break
|
|
# printf("ord(%s) = %d\n", var, ord(var))
|
|
# @}
|
|
# @}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
An obvious improvement to these functions would be to move the code for the
|
|
@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule. It was
|
|
written this way initially for ease of development.
|
|
|
|
There is a ``test program'' in a @code{BEGIN} rule, for testing the
|
|
function. It is commented out for production use.
|
|
|
|
@node Join Function, Mktime Function, Ordinal Functions, Library Functions
|
|
@section Merging an Array Into a String
|
|
|
|
@cindex merging strings
|
|
When doing string processing, it is often useful to be able to join
|
|
all the strings in an array into one long string. The following function,
|
|
@code{join}, accomplishes this task. It is used later in several of
|
|
the application programs
|
|
(@pxref{Sample Programs, ,Practical @code{awk} Programs}).
|
|
|
|
Good function design is important; this function needs to be general, but it
|
|
should also have a reasonable default behavior. It is called with an array
|
|
and the beginning and ending indices of the elements in the array to be
|
|
merged. This assumes that the array indices are numeric---a reasonable
|
|
assumption since the array was likely created with @code{split}
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
|
|
@findex join
|
|
@example
|
|
@group
|
|
@c file eg/lib/join.awk
|
|
# join.awk --- join an array into a string
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
function join(array, start, end, sep, result, i)
|
|
@{
|
|
if (sep == "")
|
|
sep = " "
|
|
else if (sep == SUBSEP) # magic value
|
|
sep = ""
|
|
result = array[start]
|
|
for (i = start + 1; i <= end; i++)
|
|
result = result sep array[i]
|
|
return result
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
An optional additional argument is the separator to use when joining the
|
|
strings back together. If the caller supplies a non-empty value,
|
|
@code{join} uses it. If it is not supplied, it will have a null
|
|
value. In this case, @code{join} uses a single blank as a default
|
|
separator for the strings. If the value is equal to @code{SUBSEP},
|
|
then @code{join} joins the strings with no separator between them.
|
|
@code{SUBSEP} serves as a ``magic'' value to indicate that there should
|
|
be no separation between the component strings.
|
|
|
|
It would be nice if @code{awk} had an assignment operator for concatenation.
|
|
The lack of an explicit operator for concatenation makes string operations
|
|
more difficult than they really need to be.
|
|
|
|
@node Mktime Function, Gettimeofday Function, Join Function, Library Functions
|
|
@section Turning Dates Into Timestamps
|
|
|
|
The @code{systime} function built in to @code{gawk}
|
|
returns the current time of day as
|
|
a timestamp in ``seconds since the Epoch.'' This timestamp
|
|
can be converted into a printable date of almost infinitely variable
|
|
format using the built-in @code{strftime} function.
|
|
(For more information on @code{systime} and @code{strftime},
|
|
@pxref{Time Functions, ,Functions for Dealing with Time Stamps}.)
|
|
|
|
@cindex converting dates to timestamps
|
|
@cindex dates, converting to timestamps
|
|
@cindex timestamps, converting from dates
|
|
An interesting but difficult problem is to convert a readable representation
|
|
of a date back into a timestamp. The ANSI C library provides a @code{mktime}
|
|
function that does the basic job, converting a canonical representation of a
|
|
date into a timestamp.
|
|
|
|
It would appear at first glance that @code{gawk} would have to supply a
|
|
@code{mktime} built-in function that was simply a ``hook'' to the C language
|
|
version. In fact though, @code{mktime} can be implemented entirely in
|
|
@code{awk}.
|
|
|
|
Here is a version of @code{mktime} for @code{awk}. It takes a simple
|
|
representation of the date and time, and converts it into a timestamp.
|
|
|
|
The code is presented here intermixed with explanatory prose. In
|
|
@ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
|
|
you will see how the Texinfo source file for this @value{DOCUMENT}
|
|
can be processed to extract the code into a single source file.
|
|
|
|
The program begins with a descriptive comment and a @code{BEGIN} rule
|
|
that initializes a table @code{_tm_months}. This table is a two-dimensional
|
|
array that has the lengths of the months. The first index is zero for
|
|
regular years, and one for leap years. The values are the same for all the
|
|
months in both kinds of years, except for February; thus the use of multiple
|
|
assignment.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/mktime.awk
|
|
# mktime.awk --- convert a canonical date representation
|
|
# into a timestamp
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
BEGIN \
|
|
@{
|
|
# Initialize table of month lengths
|
|
_tm_months[0,1] = _tm_months[1,1] = 31
|
|
_tm_months[0,2] = 28; _tm_months[1,2] = 29
|
|
_tm_months[0,3] = _tm_months[1,3] = 31
|
|
_tm_months[0,4] = _tm_months[1,4] = 30
|
|
_tm_months[0,5] = _tm_months[1,5] = 31
|
|
_tm_months[0,6] = _tm_months[1,6] = 30
|
|
_tm_months[0,7] = _tm_months[1,7] = 31
|
|
_tm_months[0,8] = _tm_months[1,8] = 31
|
|
_tm_months[0,9] = _tm_months[1,9] = 30
|
|
_tm_months[0,10] = _tm_months[1,10] = 31
|
|
_tm_months[0,11] = _tm_months[1,11] = 30
|
|
_tm_months[0,12] = _tm_months[1,12] = 31
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The benefit of merging multiple @code{BEGIN} rules
|
|
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
|
|
is particularly clear when writing library files. Functions in library
|
|
files can cleanly initialize their own private data and also provide clean-up
|
|
actions in private @code{END} rules.
|
|
|
|
The next function is a simple one that computes whether a given year is or
|
|
is not a leap year. If a year is evenly divisible by four, but not evenly
|
|
divisible by 100, or if it is evenly divisible by 400, then it is a leap
|
|
year. Thus, 1904 was a leap year, 1900 was not, but 2000 will be.
|
|
@c Change this after the year 2000 to ``2000 was'' (:-)
|
|
|
|
@findex _tm_isleap
|
|
@example
|
|
@group
|
|
@c file eg/lib/mktime.awk
|
|
# decide if a year is a leap year
|
|
function _tm_isleap(year, ret)
|
|
@{
|
|
ret = (year % 4 == 0 && year % 100 != 0) ||
|
|
(year % 400 == 0)
|
|
|
|
return ret
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
This function is only used a few times in this file, and its computation
|
|
could have been written @dfn{in-line} (at the point where it's used).
|
|
Making it a separate function made the original development easier, and also
|
|
avoids the possibility of typing errors when duplicating the code in
|
|
multiple places.
|
|
|
|
The next function is more interesting. It does most of the work of
|
|
generating a timestamp, which is converting a date and time into some number
|
|
of seconds since the Epoch. The caller passes an array (rather
|
|
imaginatively named @code{a}) containing six
|
|
values: the year including century, the month as a number between one and 12,
|
|
the day of the month, the hour as a number between zero and 23, the minute in
|
|
the hour, and the seconds within the minute.
|
|
|
|
The function uses several local variables to precompute the number of
|
|
seconds in an hour, seconds in a day, and seconds in a year. Often,
|
|
similar C code simply writes out the expression in-line, expecting the
|
|
compiler to do @dfn{constant folding}. E.g., most C compilers would
|
|
turn @samp{60 * 60} into @samp{3600} at compile time, instead of recomputing
|
|
it every time at run time. Precomputing these values makes the
|
|
function more efficient.
|
|
|
|
@findex _tm_addup
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/mktime.awk
|
|
# convert a date into seconds
|
|
function _tm_addup(a, total, yearsecs, daysecs,
|
|
hoursecs, i, j)
|
|
@{
|
|
hoursecs = 60 * 60
|
|
daysecs = 24 * hoursecs
|
|
yearsecs = 365 * daysecs
|
|
|
|
total = (a[1] - 1970) * yearsecs
|
|
|
|
@group
|
|
# extra day for leap years
|
|
for (i = 1970; i < a[1]; i++)
|
|
if (_tm_isleap(i))
|
|
total += daysecs
|
|
@end group
|
|
|
|
@group
|
|
j = _tm_isleap(a[1])
|
|
for (i = 1; i < a[2]; i++)
|
|
total += _tm_months[j, i] * daysecs
|
|
@end group
|
|
|
|
total += (a[3] - 1) * daysecs
|
|
total += a[4] * hoursecs
|
|
total += a[5] * 60
|
|
total += a[6]
|
|
|
|
return total
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The function starts with a first approximation of all the seconds between
|
|
Midnight, January 1, 1970,@footnote{This is the Epoch on POSIX systems.
|
|
It may be different on other systems.} and the beginning of the current
|
|
year. It then goes through all those years, and for every leap year,
|
|
adds an additional day's worth of seconds.
|
|
|
|
The variable @code{j} holds either one or zero, if the current year is or is not
|
|
a leap year.
|
|
For every month in the current year prior to the current month, it adds
|
|
the number of seconds in the month, using the appropriate entry in the
|
|
@code{_tm_months} array.
|
|
|
|
Finally, it adds in the seconds for the number of days prior to the current
|
|
day, and the number of hours, minutes, and seconds in the current day.
|
|
|
|
The result is a count of seconds since January 1, 1970. This value is not
|
|
yet what is needed though. The reason why is described shortly.
|
|
|
|
The main @code{mktime} function takes a single character string argument.
|
|
This string is a representation of a date and time in a ``canonical''
|
|
(fixed) form. This string should be
|
|
@code{"@var{year} @var{month} @var{day} @var{hour} @var{minute} @var{second}"}.
|
|
|
|
@findex mktime
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/mktime.awk
|
|
# mktime --- convert a date into seconds,
|
|
# compensate for time zone
|
|
|
|
function mktime(str, res1, res2, a, b, i, j, t, diff)
|
|
@{
|
|
i = split(str, a, " ") # don't rely on FS
|
|
|
|
if (i != 6)
|
|
return -1
|
|
|
|
# force numeric
|
|
for (j in a)
|
|
a[j] += 0
|
|
|
|
@group
|
|
# validate
|
|
if (a[1] < 1970 ||
|
|
a[2] < 1 || a[2] > 12 ||
|
|
a[3] < 1 || a[3] > 31 ||
|
|
a[4] < 0 || a[4] > 23 ||
|
|
a[5] < 0 || a[5] > 59 ||
|
|
a[6] < 0 || a[6] > 60 )
|
|
return -1
|
|
@end group
|
|
|
|
res1 = _tm_addup(a)
|
|
t = strftime("%Y %m %d %H %M %S", res1)
|
|
|
|
if (_tm_debug)
|
|
printf("(%s) -> (%s)\n", str, t) > "/dev/stderr"
|
|
|
|
split(t, b, " ")
|
|
res2 = _tm_addup(b)
|
|
|
|
diff = res1 - res2
|
|
|
|
if (_tm_debug)
|
|
printf("diff = %d seconds\n", diff) > "/dev/stderr"
|
|
|
|
res1 += diff
|
|
|
|
return res1
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The function first splits the string into an array, using spaces and tabs as
|
|
separators. If there are not six elements in the array, it returns an
|
|
error, signaled as the value @minus{}1.
|
|
Next, it forces each element of the array to be numeric, by adding zero to it.
|
|
The following @samp{if} statement then makes sure that each element is
|
|
within an allowable range. (This checking could be extended further, e.g.,
|
|
to make sure that the day of the month is within the correct range for the
|
|
particular month supplied.) All of this is essentially preliminary set-up
|
|
and error checking.
|
|
|
|
Recall that @code{_tm_addup} generated a value in seconds since Midnight,
|
|
January 1, 1970. This value is not directly usable as the result we want,
|
|
@emph{since the calculation does not account for the local timezone}. In other
|
|
words, the value represents the count in seconds since the Epoch, but only
|
|
for UTC (Universal Coordinated Time). If the local timezone is east or west
|
|
of UTC, then some number of hours should be either added to, or subtracted from
|
|
the resulting timestamp.
|
|
|
|
For example, 6:23 p.m. in Atlanta, Georgia (USA), is normally five hours west
|
|
of (behind) UTC. It is only four hours behind UTC if daylight savings
|
|
time is in effect.
|
|
If you are calling @code{mktime} in Atlanta, with the argument
|
|
@code{@w{"1993 5 23 18 23 12"}}, the result from @code{_tm_addup} will be
|
|
for 6:23 p.m. UTC, which is only 2:23 p.m. in Atlanta. It is necessary to
|
|
add another four hours worth of seconds to the result.
|
|
|
|
How can @code{mktime} determine how far away it is from UTC? This is
|
|
surprisingly easy. The returned timestamp represents the time passed to
|
|
@code{mktime} @emph{as UTC}. This timestamp can be fed back to
|
|
@code{strftime}, which will format it as a @emph{local} time; i.e.@: as
|
|
if it already had the UTC difference added in to it. This is done by
|
|
giving @code{@w{"%Y %m %d %H %M %S"}} to @code{strftime} as the format
|
|
argument. It returns the computed timestamp in the original string
|
|
format. The result represents a time that accounts for the UTC
|
|
difference. When the new time is converted back to a timestamp, the
|
|
difference between the two timestamps is the difference (in seconds)
|
|
between the local timezone and UTC. This difference is then added back
|
|
to the original result. An example demonstrating this is presented below.
|
|
|
|
Finally, there is a ``main'' program for testing the function.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/mktime.awk
|
|
BEGIN @{
|
|
if (_tm_test) @{
|
|
printf "Enter date as yyyy mm dd hh mm ss: "
|
|
getline _tm_test_date
|
|
|
|
t = mktime(_tm_test_date)
|
|
r = strftime("%Y %m %d %H %M %S", t)
|
|
printf "Got back (%s)\n", r
|
|
@}
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The entire program uses two variables that can be set on the command
|
|
line to control debugging output and to enable the test in the final
|
|
@code{BEGIN} rule. Here is the result of a test run. (Note that debugging
|
|
output is to standard error, and test output is to standard output.)
|
|
|
|
@example
|
|
@c @group
|
|
$ gawk -f mktime.awk -v _tm_test=1 -v _tm_debug=1
|
|
@print{} Enter date as yyyy mm dd hh mm ss: 1993 5 23 15 35 10
|
|
@error{} (1993 5 23 15 35 10) -> (1993 05 23 11 35 10)
|
|
@error{} diff = 14400 seconds
|
|
@print{} Got back (1993 05 23 15 35 10)
|
|
@c @end group
|
|
@end example
|
|
|
|
The time entered was 3:35 p.m. (15:35 on a 24-hour clock), on May 23, 1993.
|
|
The first line
|
|
of debugging output shows the resulting time as UTC---four hours ahead of
|
|
the local time zone. The second line shows that the difference is 14400
|
|
seconds, which is four hours. (The difference is only four hours, since
|
|
daylight savings time is in effect during May.)
|
|
The final line of test output shows that the timezone compensation
|
|
algorithm works; the returned time is the same as the entered time.
|
|
|
|
This program does not solve the general problem of turning an arbitrary date
|
|
representation into a timestamp. That problem is very involved. However,
|
|
the @code{mktime} function provides a foundation upon which to build. Other
|
|
software can convert month names into numeric months, and AM/PM times into
|
|
24-hour clocks, to generate the ``canonical'' format that @code{mktime}
|
|
requires.
|
|
|
|
@node Gettimeofday Function, Filetrans Function, Mktime Function, Library Functions
|
|
@section Managing the Time of Day
|
|
|
|
@cindex formatted timestamps
|
|
@cindex timestamps, formatted
|
|
The @code{systime} and @code{strftime} functions described in
|
|
@ref{Time Functions, ,Functions for Dealing with Time Stamps},
|
|
provide the minimum functionality necessary for dealing with the time of day
|
|
in human readable form. While @code{strftime} is extensive, the control
|
|
formats are not necessarily easy to remember or intuitively obvious when
|
|
reading a program.
|
|
|
|
The following function, @code{gettimeofday}, populates a user-supplied array
|
|
with pre-formatted time information. It returns a string with the current
|
|
time formatted in the same way as the @code{date} utility.
|
|
|
|
@findex gettimeofday
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/gettime.awk
|
|
# gettimeofday --- get the time of day in a usable format
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain, May 1993
|
|
#
|
|
# Returns a string in the format of output of date(1)
|
|
# Populates the array argument time with individual values:
|
|
# time["second"] -- seconds (0 - 59)
|
|
# time["minute"] -- minutes (0 - 59)
|
|
# time["hour"] -- hours (0 - 23)
|
|
# time["althour"] -- hours (0 - 12)
|
|
# time["monthday"] -- day of month (1 - 31)
|
|
# time["month"] -- month of year (1 - 12)
|
|
# time["monthname"] -- name of the month
|
|
# time["shortmonth"] -- short name of the month
|
|
# time["year"] -- year within century (0 - 99)
|
|
# time["fullyear"] -- year with century (19xx or 20xx)
|
|
# time["weekday"] -- day of week (Sunday = 0)
|
|
# time["altweekday"] -- day of week (Monday = 0)
|
|
# time["weeknum"] -- week number, Sunday first day
|
|
# time["altweeknum"] -- week number, Monday first day
|
|
# time["dayname"] -- name of weekday
|
|
# time["shortdayname"] -- short name of weekday
|
|
# time["yearday"] -- day of year (0 - 365)
|
|
# time["timezone"] -- abbreviation of timezone name
|
|
# time["ampm"] -- AM or PM designation
|
|
|
|
@group
|
|
function gettimeofday(time, ret, now, i)
|
|
@{
|
|
# get time once, avoids unnecessary system calls
|
|
now = systime()
|
|
|
|
# return date(1)-style output
|
|
ret = strftime("%a %b %d %H:%M:%S %Z %Y", now)
|
|
|
|
# clear out target array
|
|
for (i in time)
|
|
delete time[i]
|
|
@end group
|
|
|
|
@group
|
|
# fill in values, force numeric values to be
|
|
# numeric by adding 0
|
|
time["second"] = strftime("%S", now) + 0
|
|
time["minute"] = strftime("%M", now) + 0
|
|
time["hour"] = strftime("%H", now) + 0
|
|
time["althour"] = strftime("%I", now) + 0
|
|
time["monthday"] = strftime("%d", now) + 0
|
|
time["month"] = strftime("%m", now) + 0
|
|
time["monthname"] = strftime("%B", now)
|
|
time["shortmonth"] = strftime("%b", now)
|
|
time["year"] = strftime("%y", now) + 0
|
|
time["fullyear"] = strftime("%Y", now) + 0
|
|
time["weekday"] = strftime("%w", now) + 0
|
|
time["altweekday"] = strftime("%u", now) + 0
|
|
time["dayname"] = strftime("%A", now)
|
|
time["shortdayname"] = strftime("%a", now)
|
|
time["yearday"] = strftime("%j", now) + 0
|
|
time["timezone"] = strftime("%Z", now)
|
|
time["ampm"] = strftime("%p", now)
|
|
time["weeknum"] = strftime("%U", now) + 0
|
|
time["altweeknum"] = strftime("%W", now) + 0
|
|
|
|
return ret
|
|
@}
|
|
@end group
|
|
@c endfile
|
|
@end example
|
|
|
|
The string indices are easier to use and read than the various formats
|
|
required by @code{strftime}. The @code{alarm} program presented in
|
|
@ref{Alarm Program, ,An Alarm Clock Program},
|
|
uses this function.
|
|
|
|
@c exercise!!!
|
|
The @code{gettimeofday} function is presented above as it was written. A
|
|
more general design for this function would have allowed the user to supply
|
|
an optional timestamp value that would have been used instead of the current
|
|
time.
|
|
|
|
@node Filetrans Function, Getopt Function, Gettimeofday Function, Library Functions
|
|
@section Noting Data File Boundaries
|
|
|
|
@cindex per file initialization and clean-up
|
|
The @code{BEGIN} and @code{END} rules are each executed exactly once, at
|
|
the beginning and end respectively of your @code{awk} program
|
|
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
|
|
We (the @code{gawk} authors) once had a user who mistakenly thought that the
|
|
@code{BEGIN} rule was executed at the beginning of each data file and the
|
|
@code{END} rule was executed at the end of each data file. When informed
|
|
that this was not the case, the user requested that we add new special
|
|
patterns to @code{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
|
|
would have the desired behavior. He even supplied us the code to do so.
|
|
|
|
However, after a little thought, I came up with the following library program.
|
|
It arranges to call two user-supplied functions, @code{beginfile} and
|
|
@code{endfile}, at the beginning and end of each data file.
|
|
Besides solving the problem in only nine(!) lines of code, it does so
|
|
@emph{portably}; this will work with any implementation of @code{awk}.
|
|
|
|
@example
|
|
@c @group
|
|
# transfile.awk
|
|
#
|
|
# Give the user a hook for filename transitions
|
|
#
|
|
# The user must supply functions beginfile() and endfile()
|
|
# that each take the name of the file being started or
|
|
# finished, respectively.
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, January 1992
|
|
# Public Domain
|
|
|
|
FILENAME != _oldfilename \
|
|
@{
|
|
if (_oldfilename != "")
|
|
endfile(_oldfilename)
|
|
_oldfilename = FILENAME
|
|
beginfile(FILENAME)
|
|
@}
|
|
|
|
END @{ endfile(FILENAME) @}
|
|
@c @end group
|
|
@end example
|
|
|
|
This file must be loaded before the user's ``main'' program, so that the
|
|
rule it supplies will be executed first.
|
|
|
|
This rule relies on @code{awk}'s @code{FILENAME} variable that
|
|
automatically changes for each new data file. The current file name is
|
|
saved in a private variable, @code{_oldfilename}. If @code{FILENAME} does
|
|
not equal @code{_oldfilename}, then a new data file is being processed, and
|
|
it is necessary to call @code{endfile} for the old file. Since
|
|
@code{endfile} should only be called if a file has been processed, the
|
|
program first checks to make sure that @code{_oldfilename} is not the null
|
|
string. The program then assigns the current file name to
|
|
@code{_oldfilename}, and calls @code{beginfile} for the file.
|
|
Since, like all @code{awk} variables, @code{_oldfilename} will be
|
|
initialized to the null string, this rule executes correctly even for the
|
|
first data file.
|
|
|
|
The program also supplies an @code{END} rule, to do the final processing for
|
|
the last file. Since this @code{END} rule comes before any @code{END} rules
|
|
supplied in the ``main'' program, @code{endfile} will be called first. Once
|
|
again the value of multiple @code{BEGIN} and @code{END} rules should be clear.
|
|
|
|
@findex beginfile
|
|
@findex endfile
|
|
This version has same problem as the first version of @code{nextfile}
|
|
(@pxref{Nextfile Function, ,Implementing @code{nextfile} as a Function}).
|
|
If the same data file occurs twice in a row on command line, then
|
|
@code{endfile} and @code{beginfile} will not be executed at the end of the
|
|
first pass and at the beginning of the second pass.
|
|
This version solves the problem.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/ftrans.awk
|
|
# ftrans.awk --- handle data file transitions
|
|
#
|
|
# user supplies beginfile() and endfile() functions
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, November 1992
|
|
# Public Domain
|
|
|
|
FNR == 1 @{
|
|
if (_filename_ != "")
|
|
endfile(_filename_)
|
|
_filename_ = FILENAME
|
|
beginfile(FILENAME)
|
|
@}
|
|
|
|
END @{ endfile(_filename_) @}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
In @ref{Wc Program, ,Counting Things},
|
|
you will see how this library function can be used, and
|
|
how it simplifies writing the main program.
|
|
|
|
@node Getopt Function, Passwd Functions, Filetrans Function, Library Functions
|
|
@section Processing Command Line Options
|
|
|
|
@cindex @code{getopt}, C version
|
|
@cindex processing arguments
|
|
@cindex argument processing
|
|
Most utilities on POSIX compatible systems take options or ``switches'' on
|
|
the command line that can be used to change the way a program behaves.
|
|
@code{awk} is an example of such a program
|
|
(@pxref{Options, ,Command Line Options}).
|
|
Often, options take @dfn{arguments}, data that the program needs to
|
|
correctly obey the command line option. For example, @code{awk}'s
|
|
@samp{-F} option requires a string to use as the field separator.
|
|
The first occurrence on the command line of either @samp{--} or a
|
|
string that does not begin with @samp{-} ends the options.
|
|
|
|
Most Unix systems provide a C function named @code{getopt} for processing
|
|
command line arguments. The programmer provides a string describing the one
|
|
letter options. If an option requires an argument, it is followed in the
|
|
string with a colon. @code{getopt} is also passed the
|
|
count and values of the command line arguments, and is called in a loop.
|
|
@code{getopt} processes the command line arguments for option letters.
|
|
Each time around the loop, it returns a single character representing the
|
|
next option letter that it found, or @samp{?} if it found an invalid option.
|
|
When it returns @minus{}1, there are no options left on the command line.
|
|
|
|
When using @code{getopt}, options that do not take arguments can be
|
|
grouped together. Furthermore, options that take arguments require that the
|
|
argument be present. The argument can immediately follow the option letter,
|
|
or it can be a separate command line argument.
|
|
|
|
Given a hypothetical program that takes
|
|
three command line options, @samp{-a}, @samp{-b}, and @samp{-c}, and
|
|
@samp{-b} requires an argument, all of the following are valid ways of
|
|
invoking the program:
|
|
|
|
@example
|
|
@c @group
|
|
prog -a -b foo -c data1 data2 data3
|
|
prog -ac -bfoo -- data1 data2 data3
|
|
prog -acbfoo data1 data2 data3
|
|
@c @end group
|
|
@end example
|
|
|
|
Notice that when the argument is grouped with its option, the rest of
|
|
the command line argument is considered to be the option's argument.
|
|
In the above example, @samp{-acbfoo} indicates that all of the
|
|
@samp{-a}, @samp{-b}, and @samp{-c} options were supplied,
|
|
and that @samp{foo} is the argument to the @samp{-b} option.
|
|
|
|
@code{getopt} provides four external variables that the programmer can use.
|
|
|
|
@table @code
|
|
@item optind
|
|
The index in the argument value array (@code{argv}) where the first
|
|
non-option command line argument can be found.
|
|
|
|
@item optarg
|
|
The string value of the argument to an option.
|
|
|
|
@item opterr
|
|
Usually @code{getopt} prints an error message when it finds an invalid
|
|
option. Setting @code{opterr} to zero disables this feature. (An
|
|
application might wish to print its own error message.)
|
|
|
|
@item optopt
|
|
The letter representing the command line option.
|
|
While not usually documented, most versions supply this variable.
|
|
@end table
|
|
|
|
The following C fragment shows how @code{getopt} might process command line
|
|
arguments for @code{awk}.
|
|
|
|
@example
|
|
@group
|
|
int
|
|
main(int argc, char *argv[])
|
|
@{
|
|
@dots{}
|
|
/* print our own message */
|
|
opterr = 0;
|
|
@end group
|
|
@group
|
|
while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
|
|
switch (c) @{
|
|
case 'f': /* file */
|
|
@dots{}
|
|
break;
|
|
case 'F': /* field separator */
|
|
@dots{}
|
|
break;
|
|
case 'v': /* variable assignment */
|
|
@dots{}
|
|
break;
|
|
case 'W': /* extension */
|
|
@dots{}
|
|
break;
|
|
case '?':
|
|
default:
|
|
usage();
|
|
break;
|
|
@}
|
|
@}
|
|
@dots{}
|
|
@}
|
|
@end group
|
|
@end example
|
|
|
|
As a side point, @code{gawk} actually uses the GNU @code{getopt_long}
|
|
function to process both normal and GNU-style long options
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
The abstraction provided by @code{getopt} is very useful, and would be quite
|
|
handy in @code{awk} programs as well. Here is an @code{awk} version of
|
|
@code{getopt}. This function highlights one of the greatest weaknesses in
|
|
@code{awk}, which is that it is very poor at manipulating single characters.
|
|
Repeated calls to @code{substr} are necessary for accessing individual
|
|
characters (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
|
|
The discussion walks through the code a bit at a time.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/getopt.awk
|
|
# getopt --- do C library getopt(3) function in awk
|
|
#
|
|
# arnold@@gnu.org
|
|
# Public domain
|
|
#
|
|
# Initial version: March, 1991
|
|
# Revised: May, 1993
|
|
|
|
@group
|
|
# External variables:
|
|
# Optind -- index of ARGV for first non-option argument
|
|
# Optarg -- string value of argument to current option
|
|
# Opterr -- if non-zero, print our own diagnostic
|
|
# Optopt -- current option letter
|
|
@end group
|
|
|
|
# Returns
|
|
# -1 at end of options
|
|
# ? for unrecognized option
|
|
# <c> a character representing the current option
|
|
|
|
# Private Data
|
|
# _opti index in multi-flag option, e.g., -abc
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The function starts out with some documentation: who wrote the code,
|
|
and when it was revised, followed by a list of the global variables it uses,
|
|
what the return values are and what they mean, and any global variables that
|
|
are ``private'' to this library function. Such documentation is essential
|
|
for any program, and particularly for library functions.
|
|
|
|
@findex getopt
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/getopt.awk
|
|
function getopt(argc, argv, options, optl, thisopt, i)
|
|
@{
|
|
optl = length(options)
|
|
if (optl == 0) # no options given
|
|
return -1
|
|
|
|
if (argv[Optind] == "--") @{ # all done
|
|
Optind++
|
|
_opti = 0
|
|
return -1
|
|
@} else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) @{
|
|
_opti = 0
|
|
return -1
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The function first checks that it was indeed called with a string of options
|
|
(the @code{options} parameter). If @code{options} has a zero length,
|
|
@code{getopt} immediately returns @minus{}1.
|
|
|
|
The next thing to check for is the end of the options. A @samp{--} ends the
|
|
command line options, as does any command line argument that does not begin
|
|
with a @samp{-}. @code{Optind} is used to step through the array of command
|
|
line arguments; it retains its value across calls to @code{getopt}, since it
|
|
is a global variable.
|
|
|
|
The regexp used, @code{@w{/^-[^: \t\n\f\r\v\b]/}}, is
|
|
perhaps a bit of overkill; it checks for a @samp{-} followed by anything
|
|
that is not whitespace and not a colon.
|
|
If the current command line argument does not match this pattern,
|
|
it is not an option, and it ends option processing.
|
|
|
|
@example
|
|
@group
|
|
@c file eg/lib/getopt.awk
|
|
if (_opti == 0)
|
|
_opti = 2
|
|
thisopt = substr(argv[Optind], _opti, 1)
|
|
Optopt = thisopt
|
|
i = index(options, thisopt)
|
|
if (i == 0) @{
|
|
if (Opterr)
|
|
printf("%c -- invalid option\n",
|
|
thisopt) > "/dev/stderr"
|
|
if (_opti >= length(argv[Optind])) @{
|
|
Optind++
|
|
_opti = 0
|
|
@} else
|
|
_opti++
|
|
return "?"
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
The @code{_opti} variable tracks the position in the current command line
|
|
argument (@code{argv[Optind]}). In the case that multiple options were
|
|
grouped together with one @samp{-} (e.g., @samp{-abx}), it is necessary
|
|
to return them to the user one at a time.
|
|
|
|
If @code{_opti} is equal to zero, it is set to two, the index in the string
|
|
of the next character to look at (we skip the @samp{-}, which is at position
|
|
one). The variable @code{thisopt} holds the character, obtained with
|
|
@code{substr}. It is saved in @code{Optopt} for the main program to use.
|
|
|
|
If @code{thisopt} is not in the @code{options} string, then it is an
|
|
invalid option. If @code{Opterr} is non-zero, @code{getopt} prints an error
|
|
message on the standard error that is similar to the message from the C
|
|
version of @code{getopt}.
|
|
|
|
Since the option is invalid, it is necessary to skip it and move on to the
|
|
next option character. If @code{_opti} is greater than or equal to the
|
|
length of the current command line argument, then it is necessary to move on
|
|
to the next one, so @code{Optind} is incremented and @code{_opti} is reset
|
|
to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely
|
|
incremented.
|
|
|
|
In any case, since the option was invalid, @code{getopt} returns @samp{?}.
|
|
The main program can examine @code{Optopt} if it needs to know what the
|
|
invalid option letter actually was.
|
|
|
|
@example
|
|
@group
|
|
@c file eg/lib/getopt.awk
|
|
if (substr(options, i + 1, 1) == ":") @{
|
|
# get option argument
|
|
if (length(substr(argv[Optind], _opti + 1)) > 0)
|
|
Optarg = substr(argv[Optind], _opti + 1)
|
|
else
|
|
Optarg = argv[++Optind]
|
|
_opti = 0
|
|
@} else
|
|
Optarg = ""
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
If the option requires an argument, the option letter is followed by a colon
|
|
in the @code{options} string. If there are remaining characters in the
|
|
current command line argument (@code{argv[Optind]}), then the rest of that
|
|
string is assigned to @code{Optarg}. Otherwise, the next command line
|
|
argument is used (@samp{-xFOO} vs. @samp{@w{-x FOO}}). In either case,
|
|
@code{_opti} is reset to zero, since there are no more characters left to
|
|
examine in the current command line argument.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/getopt.awk
|
|
if (_opti == 0 || _opti >= length(argv[Optind])) @{
|
|
Optind++
|
|
_opti = 0
|
|
@} else
|
|
_opti++
|
|
return thisopt
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
Finally, if @code{_opti} is either zero or greater than the length of the
|
|
current command line argument, it means this element in @code{argv} is
|
|
through being processed, so @code{Optind} is incremented to point to the
|
|
next element in @code{argv}. If neither condition is true, then only
|
|
@code{_opti} is incremented, so that the next option letter can be processed
|
|
on the next call to @code{getopt}.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/getopt.awk
|
|
BEGIN @{
|
|
Opterr = 1 # default is to diagnose
|
|
Optind = 1 # skip ARGV[0]
|
|
|
|
# test program
|
|
if (_getopt_test) @{
|
|
while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
|
|
printf("c = <%c>, optarg = <%s>\n",
|
|
_go_c, Optarg)
|
|
printf("non-option arguments:\n")
|
|
for (; Optind < ARGC; Optind++)
|
|
printf("\tARGV[%d] = <%s>\n",
|
|
Optind, ARGV[Optind])
|
|
@}
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
|
|
@code{Opterr} is set to one, since the default behavior is for @code{getopt}
|
|
to print a diagnostic message upon seeing an invalid option. @code{Optind}
|
|
is set to one, since there's no reason to look at the program name, which is
|
|
in @code{ARGV[0]}.
|
|
|
|
The rest of the @code{BEGIN} rule is a simple test program. Here is the
|
|
result of two sample runs of the test program.
|
|
|
|
@example
|
|
@group
|
|
$ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
|
|
@print{} c = <a>, optarg = <>
|
|
@print{} c = <c>, optarg = <>
|
|
@print{} c = <b>, optarg = <ARG>
|
|
@print{} non-option arguments:
|
|
@print{} ARGV[3] = <bax>
|
|
@print{} ARGV[4] = <-x>
|
|
@end group
|
|
|
|
@group
|
|
$ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
|
|
@print{} c = <a>, optarg = <>
|
|
@error{} x -- invalid option
|
|
@print{} c = <?>, optarg = <>
|
|
@print{} non-option arguments:
|
|
@print{} ARGV[4] = <xyz>
|
|
@print{} ARGV[5] = <abc>
|
|
@end group
|
|
@end example
|
|
|
|
The first @samp{--} terminates the arguments to @code{awk}, so that it does
|
|
not try to interpret the @samp{-a} etc. as its own options.
|
|
|
|
Several of the sample programs presented in
|
|
@ref{Sample Programs, ,Practical @code{awk} Programs},
|
|
use @code{getopt} to process their arguments.
|
|
|
|
@node Passwd Functions, Group Functions, Getopt Function, Library Functions
|
|
@section Reading the User Database
|
|
|
|
@cindex @file{/dev/user}
|
|
The @file{/dev/user} special file
|
|
(@pxref{Special Files, ,Special File Names in @code{gawk}})
|
|
provides access to the current user's real and effective user and group id
|
|
numbers, and if available, the user's supplementary group set.
|
|
However, since these are numbers, they do not provide very useful
|
|
information to the average user. There needs to be some way to find the
|
|
user information associated with the user and group numbers. This
|
|
section presents a suite of functions for retrieving information from the
|
|
user database. @xref{Group Functions, ,Reading the Group Database},
|
|
for a similar suite that retrieves information from the group database.
|
|
|
|
@cindex @code{getpwent}, C version
|
|
@cindex user information
|
|
@cindex login information
|
|
@cindex account information
|
|
@cindex password file
|
|
The POSIX standard does not define the file where user information is
|
|
kept. Instead, it provides the @code{<pwd.h>} header file
|
|
and several C language subroutines for obtaining user information.
|
|
The primary function is @code{getpwent}, for ``get password entry.''
|
|
The ``password'' comes from the original user database file,
|
|
@file{/etc/passwd}, which kept user information, along with the
|
|
encrypted passwords (hence the name).
|
|
|
|
While an @code{awk} program could simply read @file{/etc/passwd} directly
|
|
(the format is well known), because of the way password
|
|
files are handled on networked systems,
|
|
this file may not contain complete information about the system's set of users.
|
|
|
|
@cindex @code{pwcat} program
|
|
To be sure of being
|
|
able to produce a readable, complete version of the user database, it is
|
|
necessary to write a small C program that calls @code{getpwent}.
|
|
@code{getpwent} is defined to return a pointer to a @code{struct passwd}.
|
|
Each time it is called, it returns the next entry in the database.
|
|
When there are no more entries, it returns @code{NULL}, the null pointer.
|
|
When this happens, the C program should call @code{endpwent} to close the
|
|
database.
|
|
Here is @code{pwcat}, a C program that ``cats'' the password database.
|
|
|
|
@findex pwcat.c
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/pwcat.c
|
|
/*
|
|
* pwcat.c
|
|
*
|
|
* Generate a printable version of the password database
|
|
*
|
|
* Arnold Robbins
|
|
* arnold@@gnu.org
|
|
* May 1993
|
|
* Public Domain
|
|
*/
|
|
|
|
#include <stdio.h>
|
|
#include <pwd.h>
|
|
|
|
int
|
|
main(argc, argv)
|
|
int argc;
|
|
char **argv;
|
|
@{
|
|
struct passwd *p;
|
|
|
|
while ((p = getpwent()) != NULL)
|
|
printf("%s:%s:%d:%d:%s:%s:%s\n",
|
|
p->pw_name, p->pw_passwd, p->pw_uid,
|
|
p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);
|
|
|
|
endpwent();
|
|
exit(0);
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
If you don't understand C, don't worry about it.
|
|
The output from @code{pwcat} is the user database, in the traditional
|
|
@file{/etc/passwd} format of colon-separated fields. The fields are:
|
|
|
|
@table @asis
|
|
@item Login name
|
|
The user's login name.
|
|
|
|
@item Encrypted password
|
|
The user's encrypted password. This may not be available on some systems.
|
|
|
|
@item User-ID
|
|
The user's numeric user-id number.
|
|
|
|
@item Group-ID
|
|
The user's numeric group-id number.
|
|
|
|
@item Full name
|
|
The user's full name, and perhaps other information associated with the
|
|
user.
|
|
|
|
@item Home directory
|
|
The user's login, or ``home'' directory (familiar to shell programmers as
|
|
@code{$HOME}).
|
|
|
|
@item Login shell
|
|
The program that will be run when the user logs in. This is usually a
|
|
shell, such as Bash (the Gnu Bourne-Again shell).
|
|
@end table
|
|
|
|
Here are a few lines representative of @code{pwcat}'s output.
|
|
|
|
@example
|
|
@c @group
|
|
$ pwcat
|
|
@print{} root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
|
|
@print{} nobody:*:65534:65534::/:
|
|
@print{} daemon:*:1:1::/:
|
|
@print{} sys:*:2:2::/:/bin/csh
|
|
@print{} bin:*:3:3::/bin:
|
|
@print{} arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
|
|
@print{} miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
|
|
@print{} andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
|
|
@dots{}
|
|
@c @end group
|
|
@end example
|
|
|
|
With that introduction, here is a group of functions for getting user
|
|
information. There are several functions here, corresponding to the C
|
|
functions of the same name.
|
|
|
|
@findex _pw_init
|
|
@example
|
|
@c file eg/lib/passwdawk.in
|
|
@group
|
|
# passwd.awk --- access password file information
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
BEGIN @{
|
|
# tailor this to suit your system
|
|
_pw_awklib = "/usr/local/libexec/awk/"
|
|
@}
|
|
@end group
|
|
|
|
@group
|
|
function _pw_init( oldfs, oldrs, olddol0, pwcat)
|
|
@{
|
|
if (_pw_inited)
|
|
return
|
|
oldfs = FS
|
|
oldrs = RS
|
|
olddol0 = $0
|
|
FS = ":"
|
|
RS = "\n"
|
|
pwcat = _pw_awklib "pwcat"
|
|
while ((pwcat | getline) > 0) @{
|
|
_pw_byname[$1] = $0
|
|
_pw_byuid[$3] = $0
|
|
_pw_bycount[++_pw_total] = $0
|
|
@}
|
|
close(pwcat)
|
|
_pw_count = 0
|
|
_pw_inited = 1
|
|
FS = oldfs
|
|
RS = oldrs
|
|
$0 = olddol0
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
The @code{BEGIN} rule sets a private variable to the directory where
|
|
@code{pwcat} is stored. Since it is used to help out an @code{awk} library
|
|
routine, we have chosen to put it in @file{/usr/local/libexec/awk}.
|
|
You might want it to be in a different directory on your system.
|
|
|
|
The function @code{_pw_init} keeps three copies of the user information
|
|
in three associative arrays. The arrays are indexed by user name
|
|
(@code{_pw_byname}), by user-id number (@code{_pw_byuid}), and by order of
|
|
occurrence (@code{_pw_bycount}).
|
|
|
|
The variable @code{_pw_inited} is used for efficiency; @code{_pw_init} only
|
|
needs to be called once.
|
|
|
|
Since this function uses @code{getline} to read information from
|
|
@code{pwcat}, it first saves the values of @code{FS}, @code{RS}, and
|
|
@code{$0}. Doing so is necessary, since these functions could be called
|
|
from anywhere within a user's program, and the user may have his or her
|
|
own values for @code{FS} and @code{RS}.
|
|
@ignore
|
|
Problem, what if FIELDWIDTHS is in use? Sigh.
|
|
@end ignore
|
|
|
|
The main part of the function uses a loop to read database lines, split
|
|
the line into fields, and then store the line into each array as necessary.
|
|
When the loop is done, @code{@w{_pw_init}} cleans up by closing the pipeline,
|
|
setting @code{@w{_pw_inited}} to one, and restoring @code{FS}, @code{RS}, and
|
|
@code{$0}. The use of @code{@w{_pw_count}} will be explained below.
|
|
|
|
@findex getpwnam
|
|
@example
|
|
@group
|
|
@c file eg/lib/passwdawk.in
|
|
function getpwnam(name)
|
|
@{
|
|
_pw_init()
|
|
if (name in _pw_byname)
|
|
return _pw_byname[name]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
The @code{getpwnam} function takes a user name as a string argument. If that
|
|
user is in the database, it returns the appropriate line. Otherwise it
|
|
returns the null string.
|
|
|
|
@findex getpwuid
|
|
@example
|
|
@group
|
|
@c file eg/lib/passwdawk.in
|
|
function getpwuid(uid)
|
|
@{
|
|
_pw_init()
|
|
if (uid in _pw_byuid)
|
|
return _pw_byuid[uid]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
Similarly,
|
|
the @code{getpwuid} function takes a user-id number argument. If that
|
|
user number is in the database, it returns the appropriate line. Otherwise it
|
|
returns the null string.
|
|
|
|
@findex getpwent
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/passwdawk.in
|
|
function getpwent()
|
|
@{
|
|
_pw_init()
|
|
if (_pw_count < _pw_total)
|
|
return _pw_bycount[++_pw_count]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{getpwent} function simply steps through the database, one entry at
|
|
a time. It uses @code{_pw_count} to track its current position in the
|
|
@code{_pw_bycount} array.
|
|
|
|
@findex endpwent
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/passwdawk.in
|
|
function endpwent()
|
|
@{
|
|
_pw_count = 0
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{@w{endpwent}} function resets @code{@w{_pw_count}} to zero, so that
|
|
subsequent calls to @code{getpwent} will start over again.
|
|
|
|
A conscious design decision in this suite is that each subroutine calls
|
|
@code{@w{_pw_init}} to initialize the database arrays. The overhead of running
|
|
a separate process to generate the user database, and the I/O to scan it,
|
|
will only be incurred if the user's main program actually calls one of these
|
|
functions. If this library file is loaded along with a user's program, but
|
|
none of the routines are ever called, then there is no extra run-time overhead.
|
|
(The alternative would be to move the body of @code{@w{_pw_init}} into a
|
|
@code{BEGIN} rule, which would always run @code{pwcat}. This simplifies the
|
|
code but runs an extra process that may never be needed.)
|
|
|
|
In turn, calling @code{_pw_init} is not too expensive, since the
|
|
@code{_pw_inited} variable keeps the program from reading the data more than
|
|
once. If you are worried about squeezing every last cycle out of your
|
|
@code{awk} program, the check of @code{_pw_inited} could be moved out of
|
|
@code{_pw_init} and duplicated in all the other functions. In practice,
|
|
this is not necessary, since most @code{awk} programs are I/O bound, and it
|
|
would clutter up the code.
|
|
|
|
The @code{id} program in @ref{Id Program, ,Printing Out User Information},
|
|
uses these functions.
|
|
|
|
@node Group Functions, Library Names, Passwd Functions, Library Functions
|
|
@section Reading the Group Database
|
|
|
|
@cindex @code{getgrent}, C version
|
|
@cindex group information
|
|
@cindex account information
|
|
@cindex group file
|
|
Much of the discussion presented in
|
|
@ref{Passwd Functions, ,Reading the User Database},
|
|
applies to the group database as well. Although there has traditionally
|
|
been a well known file, @file{/etc/group}, in a well known format, the POSIX
|
|
standard only provides a set of C library routines
|
|
(@code{<grp.h>} and @code{getgrent})
|
|
for accessing the information.
|
|
Even though this file may exist, it likely does not have
|
|
complete information. Therefore, as with the user database, it is necessary
|
|
to have a small C program that generates the group database as its output.
|
|
|
|
@cindex @code{grcat} program
|
|
Here is @code{grcat}, a C program that ``cats'' the group database.
|
|
|
|
@findex grcat.c
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/grcat.c
|
|
/*
|
|
* grcat.c
|
|
*
|
|
* Generate a printable version of the group database
|
|
*
|
|
* Arnold Robbins, arnold@@gnu.org
|
|
* May 1993
|
|
* Public Domain
|
|
*/
|
|
|
|
#include <stdio.h>
|
|
#include <grp.h>
|
|
|
|
@group
|
|
int
|
|
main(argc, argv)
|
|
int argc;
|
|
char **argv;
|
|
@{
|
|
struct group *g;
|
|
int i;
|
|
@end group
|
|
|
|
while ((g = getgrent()) != NULL) @{
|
|
printf("%s:%s:%d:", g->gr_name, g->gr_passwd,
|
|
g->gr_gid);
|
|
for (i = 0; g->gr_mem[i] != NULL; i++) @{
|
|
printf("%s", g->gr_mem[i]);
|
|
if (g->gr_mem[i+1] != NULL)
|
|
putchar(',');
|
|
@}
|
|
putchar('\n');
|
|
@}
|
|
endgrent();
|
|
exit(0);
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
Each line in the group database represent one group. The fields are
|
|
separated with colons, and represent the following information.
|
|
|
|
@table @asis
|
|
@item Group Name
|
|
The name of the group.
|
|
|
|
@item Group Password
|
|
The encrypted group password. In practice, this field is never used. It is
|
|
usually empty, or set to @samp{*}.
|
|
|
|
@item Group ID Number
|
|
The numeric group-id number. This number should be unique within the file.
|
|
|
|
@item Group Member List
|
|
A comma-separated list of user names. These users are members of the group.
|
|
Most Unix systems allow users to be members of several groups
|
|
simultaneously. If your system does, then reading @file{/dev/user} will
|
|
return those group-id numbers in @code{$5} through @code{$NF}.
|
|
(Note that @file{/dev/user} is a @code{gawk} extension;
|
|
@pxref{Special Files, ,Special File Names in @code{gawk}}.)
|
|
@end table
|
|
|
|
Here is what running @code{grcat} might produce:
|
|
|
|
@example
|
|
@group
|
|
$ grcat
|
|
@print{} wheel:*:0:arnold
|
|
@print{} nogroup:*:65534:
|
|
@print{} daemon:*:1:
|
|
@print{} kmem:*:2:
|
|
@print{} staff:*:10:arnold,miriam,andy
|
|
@print{} other:*:20:
|
|
@dots{}
|
|
@end group
|
|
@end example
|
|
|
|
Here are the functions for obtaining information from the group database.
|
|
There are several, modeled after the C library functions of the same names.
|
|
|
|
@findex _gr_init
|
|
@example
|
|
@group
|
|
@c file eg/lib/groupawk.in
|
|
# group.awk --- functions for dealing with the group file
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
BEGIN \
|
|
@{
|
|
# Change to suit your system
|
|
_gr_awklib = "/usr/local/libexec/awk/"
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
|
|
@group
|
|
@c file eg/lib/groupawk.in
|
|
function _gr_init( oldfs, oldrs, olddol0, grcat, n, a, i)
|
|
@{
|
|
if (_gr_inited)
|
|
return
|
|
@end group
|
|
|
|
@group
|
|
oldfs = FS
|
|
oldrs = RS
|
|
olddol0 = $0
|
|
FS = ":"
|
|
RS = "\n"
|
|
@end group
|
|
|
|
@group
|
|
grcat = _gr_awklib "grcat"
|
|
while ((grcat | getline) > 0) @{
|
|
if ($1 in _gr_byname)
|
|
_gr_byname[$1] = _gr_byname[$1] "," $4
|
|
else
|
|
_gr_byname[$1] = $0
|
|
if ($3 in _gr_bygid)
|
|
_gr_bygid[$3] = _gr_bygid[$3] "," $4
|
|
else
|
|
_gr_bygid[$3] = $0
|
|
|
|
n = split($4, a, "[ \t]*,[ \t]*")
|
|
@end group
|
|
@group
|
|
for (i = 1; i <= n; i++)
|
|
if (a[i] in _gr_groupsbyuser)
|
|
_gr_groupsbyuser[a[i]] = \
|
|
_gr_groupsbyuser[a[i]] " " $1
|
|
else
|
|
_gr_groupsbyuser[a[i]] = $1
|
|
@end group
|
|
|
|
@group
|
|
_gr_bycount[++_gr_count] = $0
|
|
@}
|
|
@end group
|
|
@group
|
|
close(grcat)
|
|
_gr_count = 0
|
|
_gr_inited++
|
|
FS = oldfs
|
|
RS = oldrs
|
|
$0 = olddol0
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
The @code{BEGIN} rule sets a private variable to the directory where
|
|
@code{grcat} is stored. Since it is used to help out an @code{awk} library
|
|
routine, we have chosen to put it in @file{/usr/local/libexec/awk}. You might
|
|
want it to be in a different directory on your system.
|
|
|
|
These routines follow the same general outline as the user database routines
|
|
(@pxref{Passwd Functions, ,Reading the User Database}).
|
|
The @code{@w{_gr_inited}} variable is used to
|
|
ensure that the database is scanned no more than once.
|
|
The @code{@w{_gr_init}} function first saves @code{FS}, @code{RS}, and
|
|
@code{$0}, and then sets @code{FS} and @code{RS} to the correct values for
|
|
scanning the group information.
|
|
|
|
The group information is stored is several associative arrays.
|
|
The arrays are indexed by group name (@code{@w{_gr_byname}}), by group-id number
|
|
(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}).
|
|
There is an additional array indexed by user name (@code{@w{_gr_groupsbyuser}}),
|
|
that is a space separated list of groups that each user belongs to.
|
|
|
|
Unlike the user database, it is possible to have multiple records in the
|
|
database for the same group. This is common when a group has a large number
|
|
of members. Such a pair of entries might look like:
|
|
|
|
@example
|
|
tvpeople:*:101:johny,jay,arsenio
|
|
tvpeople:*:101:david,conan,tom,joan
|
|
@end example
|
|
|
|
For this reason, @code{_gr_init} looks to see if a group name or
|
|
group-id number has already been seen. If it has, then the user names are
|
|
simply concatenated onto the previous list of users. (There is actually a
|
|
subtle problem with the code presented above. Suppose that
|
|
the first time there were no names. This code adds the names with
|
|
a leading comma. It also doesn't check that there is a @code{$4}.)
|
|
|
|
Finally, @code{_gr_init} closes the pipeline to @code{grcat}, restores
|
|
@code{FS}, @code{RS}, and @code{$0}, initializes @code{_gr_count} to zero
|
|
(it is used later), and makes @code{_gr_inited} non-zero.
|
|
|
|
@findex getgrnam
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/groupawk.in
|
|
function getgrnam(group)
|
|
@{
|
|
_gr_init()
|
|
if (group in _gr_byname)
|
|
return _gr_byname[group]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{getgrnam} function takes a group name as its argument, and if that
|
|
group exists, it is returned. Otherwise, @code{getgrnam} returns the null
|
|
string.
|
|
|
|
@findex getgrgid
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/groupawk.in
|
|
function getgrgid(gid)
|
|
@{
|
|
_gr_init()
|
|
if (gid in _gr_bygid)
|
|
return _gr_bygid[gid]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{getgrgid} function is similar, it takes a numeric group-id, and
|
|
looks up the information associated with that group-id.
|
|
|
|
@findex getgruser
|
|
@example
|
|
@group
|
|
@c file eg/lib/groupawk.in
|
|
function getgruser(user)
|
|
@{
|
|
_gr_init()
|
|
if (user in _gr_groupsbyuser)
|
|
return _gr_groupsbyuser[user]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
The @code{getgruser} function does not have a C counterpart. It takes a
|
|
user name, and returns the list of groups that have the user as a member.
|
|
|
|
@findex getgrent
|
|
@example
|
|
@c @group
|
|
@c file eg/lib/groupawk.in
|
|
function getgrent()
|
|
@{
|
|
_gr_init()
|
|
if (++_gr_count in _gr_bycount)
|
|
return _gr_bycount[_gr_count]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{getgrent} function steps through the database one entry at a time.
|
|
It uses @code{_gr_count} to track its position in the list.
|
|
|
|
@findex endgrent
|
|
@example
|
|
@group
|
|
@c file eg/lib/groupawk.in
|
|
function endgrent()
|
|
@{
|
|
_gr_count = 0
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
@code{endgrent} resets @code{_gr_count} to zero so that @code{getgrent} can
|
|
start over again.
|
|
|
|
As with the user database routines, each function calls @code{_gr_init} to
|
|
initialize the arrays. Doing so only incurs the extra overhead of running
|
|
@code{grcat} if these functions are used (as opposed to moving the body of
|
|
@code{_gr_init} into a @code{BEGIN} rule).
|
|
|
|
Most of the work is in scanning the database and building the various
|
|
associative arrays. The functions that the user calls are themselves very
|
|
simple, relying on @code{awk}'s associative arrays to do work.
|
|
|
|
The @code{id} program in @ref{Id Program, ,Printing Out User Information},
|
|
uses these functions.
|
|
|
|
@node Library Names, , Group Functions, Library Functions
|
|
@section Naming Library Function Global Variables
|
|
|
|
@cindex namespace issues in @code{awk}
|
|
@cindex documenting @code{awk} programs
|
|
@cindex programs, documenting
|
|
Due to the way the @code{awk} language evolved, variables are either
|
|
@dfn{global} (usable by the entire program), or @dfn{local} (usable just by
|
|
a specific function). There is no intermediate state analogous to
|
|
@code{static} variables in C.
|
|
|
|
Library functions often need to have global variables that they can use to
|
|
preserve state information between calls to the function. For example,
|
|
@code{getopt}'s variable @code{_opti}
|
|
(@pxref{Getopt Function, ,Processing Command Line Options}),
|
|
and the @code{_tm_months} array used by @code{mktime}
|
|
(@pxref{Mktime Function, ,Turning Dates Into Timestamps}).
|
|
Such variables are called @dfn{private}, since the only functions that need to
|
|
use them are the ones in the library.
|
|
|
|
When writing a library function, you should try to choose names for your
|
|
private variables so that they will not conflict with any variables used by
|
|
either another library function or a user's main program. For example, a
|
|
name like @samp{i} or @samp{j} is not a good choice, since user programs
|
|
often use variable names like these for their own purposes.
|
|
|
|
The example programs shown in this chapter all start the names of their
|
|
private variables with an underscore (@samp{_}). Users generally don't use
|
|
leading underscores in their variable names, so this convention immediately
|
|
decreases the chances that the variable name will be accidentally shared
|
|
with the user's program.
|
|
|
|
In addition, several of the library functions use a prefix that helps
|
|
indicate what function or set of functions uses the variables. For example,
|
|
@code{_tm_months} in @code{mktime}
|
|
(@pxref{Mktime Function, ,Turning Dates Into Timestamps}), and
|
|
@code{_pw_byname} in the user data base routines
|
|
(@pxref{Passwd Functions, ,Reading the User Database}).
|
|
This convention is recommended, since it even further decreases the chance
|
|
of inadvertent conflict among variable names.
|
|
Note that this convention can be used equally well both for variable names
|
|
and for private function names too.
|
|
|
|
While I could have re-written all the library routines to use this
|
|
convention, I did not do so, in order to show how my own @code{awk}
|
|
programming style has evolved, and to provide some basis for this
|
|
discussion.
|
|
|
|
As a final note on variable naming, if a function makes global variables
|
|
available for use by a main program, it is a good convention to start that
|
|
variable's name with a capital letter.
|
|
For example, @code{getopt}'s @code{Opterr} and @code{Optind} variables
|
|
(@pxref{Getopt Function, ,Processing Command Line Options}).
|
|
The leading capital letter indicates that it is global, while the fact that
|
|
the variable name is not all capital letters indicates that the variable is
|
|
not one of @code{awk}'s built-in variables, like @code{FS}.
|
|
|
|
It is also important that @emph{all} variables in library functions
|
|
that do not need to save state are in fact declared local. If this is
|
|
not done, the variable could accidentally be used in the user's program,
|
|
leading to bugs that are very difficult to track down.
|
|
|
|
@example
|
|
function lib_func(x, y, l1, l2)
|
|
@{
|
|
@dots{}
|
|
@var{use variable} some_var # some_var could be local
|
|
@dots{} # but is not by oversight
|
|
@}
|
|
@end example
|
|
|
|
@cindex Tcl
|
|
A different convention, common in the Tcl community, is to use a single
|
|
associative array to hold the values needed by the library function(s), or
|
|
``package.'' This significantly decreases the number of actual global names
|
|
in use. For example, the functions described in
|
|
@ref{Passwd Functions, , Reading the User Database},
|
|
might have used @code{@w{PW_data["inited"]}}, @code{@w{PW_data["total"]}},
|
|
@code{@w{PW_data["count"]}} and @code{@w{PW_data["awklib"]}}, instead of
|
|
@code{@w{_pw_inited}}, @code{@w{_pw_awklib}}, @code{@w{_pw_total}},
|
|
and @code{@w{_pw_count}}.
|
|
|
|
The conventions presented in this section are exactly that, conventions. You
|
|
are not required to write your programs this way, we merely recommend that
|
|
you do so.
|
|
|
|
@node Sample Programs, Language History, Library Functions, Top
|
|
@chapter Practical @code{awk} Programs
|
|
|
|
This chapter presents a potpourri of @code{awk} programs for your reading
|
|
enjoyment.
|
|
@iftex
|
|
There are two sections. The first presents @code{awk}
|
|
versions of several common POSIX utilities.
|
|
The second is a grab-bag of interesting programs.
|
|
@end iftex
|
|
|
|
Many of these programs use the library functions presented in
|
|
@ref{Library Functions, ,A Library of @code{awk} Functions}.
|
|
|
|
@menu
|
|
* Clones:: Clones of common utilities.
|
|
* Miscellaneous Programs:: Some interesting @code{awk} programs.
|
|
@end menu
|
|
|
|
@node Clones, Miscellaneous Programs, Sample Programs, Sample Programs
|
|
@section Re-inventing Wheels for Fun and Profit
|
|
|
|
This section presents a number of POSIX utilities that are implemented in
|
|
@code{awk}. Re-inventing these programs in @code{awk} is often enjoyable,
|
|
since the algorithms can be very clearly expressed, and usually the code is
|
|
very concise and simple. This is true because @code{awk} does so much for you.
|
|
|
|
It should be noted that these programs are not necessarily intended to
|
|
replace the installed versions on your system. Instead, their
|
|
purpose is to illustrate @code{awk} language programming for ``real world''
|
|
tasks.
|
|
|
|
The programs are presented in alphabetical order.
|
|
|
|
@menu
|
|
* Cut Program:: The @code{cut} utility.
|
|
* Egrep Program:: The @code{egrep} utility.
|
|
* Id Program:: The @code{id} utility.
|
|
* Split Program:: The @code{split} utility.
|
|
* Tee Program:: The @code{tee} utility.
|
|
* Uniq Program:: The @code{uniq} utility.
|
|
* Wc Program:: The @code{wc} utility.
|
|
@end menu
|
|
|
|
@node Cut Program, Egrep Program, Clones, Clones
|
|
@subsection Cutting Out Fields and Columns
|
|
|
|
@cindex @code{cut} utility
|
|
The @code{cut} utility selects, or ``cuts,'' either characters or fields
|
|
from its standard
|
|
input and sends them to its standard output. @code{cut} can cut out either
|
|
a list of characters, or a list of fields. By default, fields are separated
|
|
by tabs, but you may supply a command line option to change the field
|
|
@dfn{delimiter}, i.e.@: the field separator character. @code{cut}'s definition
|
|
of fields is less general than @code{awk}'s.
|
|
|
|
A common use of @code{cut} might be to pull out just the login name of
|
|
logged-on users from the output of @code{who}. For example, the following
|
|
pipeline generates a sorted, unique list of the logged on users:
|
|
|
|
@example
|
|
who | cut -c1-8 | sort | uniq
|
|
@end example
|
|
|
|
The options for @code{cut} are:
|
|
|
|
@table @code
|
|
@item -c @var{list}
|
|
Use @var{list} as the list of characters to cut out. Items within the list
|
|
may be separated by commas, and ranges of characters can be separated with
|
|
dashes. The list @samp{1-8,15,22-35} specifies characters one through
|
|
eight, 15, and 22 through 35.
|
|
|
|
@item -f @var{list}
|
|
Use @var{list} as the list of fields to cut out.
|
|
|
|
@item -d @var{delim}
|
|
Use @var{delim} as the field separator character instead of the tab
|
|
character.
|
|
|
|
@item -s
|
|
Suppress printing of lines that do not contain the field delimiter.
|
|
@end table
|
|
|
|
The @code{awk} implementation of @code{cut} uses the @code{getopt} library
|
|
function (@pxref{Getopt Function, ,Processing Command Line Options}),
|
|
and the @code{join} library function
|
|
(@pxref{Join Function, ,Merging an Array Into a String}).
|
|
|
|
The program begins with a comment describing the options and a @code{usage}
|
|
function which prints out a usage message and exits. @code{usage} is called
|
|
if invalid arguments are supplied.
|
|
|
|
@findex cut.awk
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/cut.awk
|
|
# cut.awk --- implement cut in awk
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
# Options:
|
|
# -f list Cut fields
|
|
# -d c Field delimiter character
|
|
# -c list Cut characters
|
|
#
|
|
# -s Suppress lines without the delimiter character
|
|
|
|
function usage( e1, e2)
|
|
@{
|
|
e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
|
|
e2 = "usage: cut [-c list] [files...]"
|
|
print e1 > "/dev/stderr"
|
|
print e2 > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
@noindent
|
|
The variables @code{e1} and @code{e2} are used so that the function
|
|
fits nicely on the
|
|
@iftex
|
|
page.
|
|
@end iftex
|
|
@ifinfo
|
|
screen.
|
|
@end ifinfo
|
|
|
|
Next comes a @code{BEGIN} rule that parses the command line options.
|
|
It sets @code{FS} to a single tab character, since that is @code{cut}'s
|
|
default field separator. The output field separator is also set to be the
|
|
same as the input field separator. Then @code{getopt} is used to step
|
|
through the command line options. One or the other of the variables
|
|
@code{by_fields} or @code{by_chars} is set to true, to indicate that
|
|
processing should be done by fields or by characters respectively.
|
|
When cutting by characters, the output field separator is set to the null
|
|
string.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/cut.awk
|
|
BEGIN \
|
|
@{
|
|
FS = "\t" # default
|
|
OFS = FS
|
|
while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{
|
|
if (c == "f") @{
|
|
by_fields = 1
|
|
fieldlist = Optarg
|
|
@group
|
|
@} else if (c == "c") @{
|
|
by_chars = 1
|
|
fieldlist = Optarg
|
|
OFS = ""
|
|
@} else if (c == "d") @{
|
|
if (length(Optarg) > 1) @{
|
|
printf("Using first character of %s" \
|
|
" for delimiter\n", Optarg) > "/dev/stderr"
|
|
Optarg = substr(Optarg, 1, 1)
|
|
@}
|
|
FS = Optarg
|
|
OFS = FS
|
|
if (FS == " ") # defeat awk semantics
|
|
FS = "[ ]"
|
|
@} else if (c == "s")
|
|
suppress++
|
|
else
|
|
usage()
|
|
@}
|
|
@end group
|
|
|
|
for (i = 1; i < Optind; i++)
|
|
ARGV[i] = ""
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
Special care is taken when the field delimiter is a space. Using
|
|
@code{@w{" "}} (a single space) for the value of @code{FS} is
|
|
incorrect---@code{awk} would
|
|
separate fields with runs of spaces, tabs and/or newlines, and we want them to be
|
|
separated with individual spaces. Also, note that after @code{getopt} is
|
|
through, we have to clear out all the elements of @code{ARGV} from one to
|
|
@code{Optind}, so that @code{awk} will not try to process the command line
|
|
options as file names.
|
|
|
|
After dealing with the command line options, the program verifies that the
|
|
options make sense. Only one or the other of @samp{-c} and @samp{-f} should
|
|
be used, and both require a field list. Then either @code{set_fieldlist} or
|
|
@code{set_charlist} is called to pull apart the list of fields or
|
|
characters.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/cut.awk
|
|
if (by_fields && by_chars)
|
|
usage()
|
|
|
|
if (by_fields == 0 && by_chars == 0)
|
|
by_fields = 1 # default
|
|
|
|
if (fieldlist == "") @{
|
|
print "cut: needs list for -c or -f" > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
|
|
@group
|
|
if (by_fields)
|
|
set_fieldlist()
|
|
else
|
|
set_charlist()
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
Here is @code{set_fieldlist}. It first splits the field list apart
|
|
at the commas, into an array. Then, for each element of the array, it
|
|
looks to see if it is actually a range, and if so splits it apart. The range
|
|
is verified to make sure the first number is smaller than the second.
|
|
Each number in the list is added to the @code{flist} array, which simply
|
|
lists the fields that will be printed.
|
|
Normal field splitting is used.
|
|
The program lets @code{awk}
|
|
handle the job of doing the field splitting.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/cut.awk
|
|
function set_fieldlist( n, m, i, j, k, f, g)
|
|
@{
|
|
n = split(fieldlist, f, ",")
|
|
j = 1 # index in flist
|
|
for (i = 1; i <= n; i++) @{
|
|
if (index(f[i], "-") != 0) @{ # a range
|
|
m = split(f[i], g, "-")
|
|
if (m != 2 || g[1] >= g[2]) @{
|
|
printf("bad field list: %s\n",
|
|
f[i]) > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
for (k = g[1]; k <= g[2]; k++)
|
|
flist[j++] = k
|
|
@} else
|
|
flist[j++] = f[i]
|
|
@}
|
|
nfields = j - 1
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{set_charlist} function is more complicated than @code{set_fieldlist}.
|
|
The idea here is to use @code{gawk}'s @code{FIELDWIDTHS} variable
|
|
(@pxref{Constant Size, ,Reading Fixed-width Data}),
|
|
which describes constant width input. When using a character list, that is
|
|
exactly what we have.
|
|
|
|
Setting up @code{FIELDWIDTHS} is more complicated than simply listing the
|
|
fields that need to be printed. We have to keep track of the fields to be
|
|
printed, and also the intervening characters that have to be skipped.
|
|
For example, suppose you wanted characters one through eight, 15, and
|
|
22 through 35. You would use @samp{-c 1-8,15,22-35}. The necessary value
|
|
for @code{FIELDWIDTHS} would be @code{@w{"8 6 1 6 14"}}. This gives us five
|
|
fields, and what should be printed are @code{$1}, @code{$3}, and @code{$5}.
|
|
The intermediate fields are ``filler,'' stuff in between the desired data.
|
|
|
|
@code{flist} lists the fields to be printed, and @code{t} tracks the
|
|
complete field list, including filler fields.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/cut.awk
|
|
function set_charlist( field, i, j, f, g, t,
|
|
filler, last, len)
|
|
@{
|
|
field = 1 # count total fields
|
|
n = split(fieldlist, f, ",")
|
|
j = 1 # index in flist
|
|
for (i = 1; i <= n; i++) @{
|
|
if (index(f[i], "-") != 0) @{ # range
|
|
m = split(f[i], g, "-")
|
|
if (m != 2 || g[1] >= g[2]) @{
|
|
printf("bad character list: %s\n",
|
|
f[i]) > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
len = g[2] - g[1] + 1
|
|
if (g[1] > 1) # compute length of filler
|
|
filler = g[1] - last - 1
|
|
else
|
|
filler = 0
|
|
if (filler)
|
|
t[field++] = filler
|
|
t[field++] = len # length of field
|
|
last = g[2]
|
|
flist[j++] = field - 1
|
|
@} else @{
|
|
if (f[i] > 1)
|
|
filler = f[i] - last - 1
|
|
else
|
|
filler = 0
|
|
if (filler)
|
|
t[field++] = filler
|
|
t[field++] = 1
|
|
last = f[i]
|
|
flist[j++] = field - 1
|
|
@}
|
|
@}
|
|
@group
|
|
FIELDWIDTHS = join(t, 1, field - 1)
|
|
nfields = j - 1
|
|
@}
|
|
@end group
|
|
@c endfile
|
|
@end example
|
|
|
|
Here is the rule that actually processes the data. If the @samp{-s} option
|
|
was given, then @code{suppress} will be true. The first @code{if} statement
|
|
makes sure that the input record does have the field separator. If
|
|
@code{cut} is processing fields, @code{suppress} is true, and the field
|
|
separator character is not in the record, then the record is skipped.
|
|
|
|
If the record is valid, then at this point, @code{gawk} has split the data
|
|
into fields, either using the character in @code{FS} or using fixed-length
|
|
fields and @code{FIELDWIDTHS}. The loop goes through the list of fields
|
|
that should be printed. If the corresponding field has data in it, it is
|
|
printed. If the next field also has data, then the separator character is
|
|
written out in between the fields.
|
|
|
|
@c 2e: Could use `index($0, FS) != 0' instead of `$0 !~ FS', below
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/cut.awk
|
|
@{
|
|
if (by_fields && suppress && $0 !~ FS)
|
|
next
|
|
|
|
for (i = 1; i <= nfields; i++) @{
|
|
if ($flist[i] != "") @{
|
|
printf "%s", $flist[i]
|
|
if (i < nfields && $flist[i+1] != "")
|
|
printf "%s", OFS
|
|
@}
|
|
@}
|
|
print ""
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
This version of @code{cut} relies on @code{gawk}'s @code{FIELDWIDTHS}
|
|
variable to do the character-based cutting. While it would be possible in
|
|
other @code{awk} implementations to use @code{substr}
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
|
|
it would also be extremely painful to do so.
|
|
The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem
|
|
of picking the input line apart by characters.
|
|
|
|
@node Egrep Program, Id Program, Cut Program, Clones
|
|
@subsection Searching for Regular Expressions in Files
|
|
|
|
@cindex @code{egrep} utility
|
|
The @code{egrep} utility searches files for patterns. It uses regular
|
|
expressions that are almost identical to those available in @code{awk}
|
|
(@pxref{Regexp Constants, ,Regular Expression Constants}). It is used this way:
|
|
|
|
@example
|
|
egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{}
|
|
@end example
|
|
|
|
The @var{pattern} is a regexp.
|
|
In typical usage, the regexp is quoted to prevent the shell from expanding
|
|
any of the special characters as file name wildcards.
|
|
Normally, @code{egrep} prints the
|
|
lines that matched. If multiple file names are provided on the command
|
|
line, each output line is preceded by the name of the file and a colon.
|
|
|
|
@c NEEDED
|
|
@page
|
|
The options are:
|
|
|
|
@table @code
|
|
@item -c
|
|
Print out a count of the lines that matched the pattern, instead of the
|
|
lines themselves.
|
|
|
|
@item -s
|
|
Be silent. No output is produced, and the exit value indicates whether
|
|
or not the pattern was matched.
|
|
|
|
@item -v
|
|
Invert the sense of the test. @code{egrep} prints the lines that do
|
|
@emph{not} match the pattern, and exits successfully if the pattern was not
|
|
matched.
|
|
|
|
@item -i
|
|
Ignore case distinctions in both the pattern and the input data.
|
|
|
|
@item -l
|
|
Only print the names of the files that matched, not the lines that matched.
|
|
|
|
@item -e @var{pattern}
|
|
Use @var{pattern} as the regexp to match. The purpose of the @samp{-e}
|
|
option is to allow patterns that start with a @samp{-}.
|
|
@end table
|
|
|
|
This version uses the @code{getopt} library function
|
|
(@pxref{Getopt Function, ,Processing Command Line Options}),
|
|
and the file transition library program
|
|
(@pxref{Filetrans Function, ,Noting Data File Boundaries}).
|
|
|
|
The program begins with a descriptive comment, and then a @code{BEGIN} rule
|
|
that processes the command line arguments with @code{getopt}. The @samp{-i}
|
|
(ignore case) option is particularly easy with @code{gawk}; we just use the
|
|
@code{IGNORECASE} built in variable
|
|
(@pxref{Built-in Variables}).
|
|
|
|
@findex egrep.awk
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/egrep.awk
|
|
# egrep.awk --- simulate egrep in awk
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
# Options:
|
|
# -c count of lines
|
|
# -s silent - use exit value
|
|
# -v invert test, success if no match
|
|
# -i ignore case
|
|
# -l print filenames only
|
|
# -e argument is pattern
|
|
|
|
BEGIN @{
|
|
while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) @{
|
|
if (c == "c")
|
|
count_only++
|
|
else if (c == "s")
|
|
no_print++
|
|
else if (c == "v")
|
|
invert++
|
|
else if (c == "i")
|
|
IGNORECASE = 1
|
|
else if (c == "l")
|
|
filenames_only++
|
|
else if (c == "e")
|
|
pattern = Optarg
|
|
else
|
|
usage()
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
Next comes the code that handles the @code{egrep} specific behavior. If no
|
|
pattern was supplied with @samp{-e}, the first non-option on the command
|
|
line is used. The @code{awk} command line arguments up to @code{ARGV[Optind]}
|
|
are cleared, so that @code{awk} won't try to process them as files. If no
|
|
files were specified, the standard input is used, and if multiple files were
|
|
specified, we make sure to note this so that the file names can precede the
|
|
matched lines in the output.
|
|
|
|
The last two lines are commented out, since they are not needed in
|
|
@code{gawk}. They should be uncommented if you have to use another version
|
|
of @code{awk}.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/egrep.awk
|
|
if (pattern == "")
|
|
pattern = ARGV[Optind++]
|
|
|
|
for (i = 1; i < Optind; i++)
|
|
ARGV[i] = ""
|
|
if (Optind >= ARGC) @{
|
|
ARGV[1] = "-"
|
|
ARGC = 2
|
|
@} else if (ARGC - Optind > 1)
|
|
do_filenames++
|
|
|
|
# if (IGNORECASE)
|
|
# pattern = tolower(pattern)
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The next set of lines should be uncommented if you are not using
|
|
@code{gawk}. This rule translates all the characters in the input line
|
|
into lower-case if the @samp{-i} option was specified. The rule is
|
|
commented out since it is not necessary with @code{gawk}.
|
|
@c bug: if a match happens, we output the translated line, not the original
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/egrep.awk
|
|
#@{
|
|
# if (IGNORECASE)
|
|
# $0 = tolower($0)
|
|
#@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{beginfile} function is called by the rule in @file{ftrans.awk}
|
|
when each new file is processed. In this case, it is very simple; all it
|
|
does is initialize a variable @code{fcount} to zero. @code{fcount} tracks
|
|
how many lines in the current file matched the pattern.
|
|
|
|
@example
|
|
@group
|
|
@c file eg/prog/egrep.awk
|
|
function beginfile(junk)
|
|
@{
|
|
fcount = 0
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
The @code{endfile} function is called after each file has been processed.
|
|
It is used only when the user wants a count of the number of lines that
|
|
matched. @code{no_print} will be true only if the exit status is desired.
|
|
@code{count_only} will be true if line counts are desired. @code{egrep}
|
|
will therefore only print line counts if printing and counting are enabled.
|
|
The output format must be adjusted depending upon the number of files to be
|
|
processed. Finally, @code{fcount} is added to @code{total}, so that we
|
|
know how many lines altogether matched the pattern.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/egrep.awk
|
|
function endfile(file)
|
|
@{
|
|
if (! no_print && count_only)
|
|
if (do_filenames)
|
|
print file ":" fcount
|
|
else
|
|
print fcount
|
|
|
|
total += fcount
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
This rule does most of the work of matching lines. The variable
|
|
@code{matches} will be true if the line matched the pattern. If the user
|
|
wants lines that did not match, the sense of the @code{matches} is inverted
|
|
using the @samp{!} operator. @code{fcount} is incremented with the value of
|
|
@code{matches}, which will be either one or zero, depending upon a
|
|
successful or unsuccessful match. If the line did not match, the
|
|
@code{next} statement just moves on to the next record.
|
|
|
|
There are several optimizations for performance in the following few lines
|
|
of code. If the user only wants exit status (@code{no_print} is true), and
|
|
we don't have to count lines, then it is enough to know that one line in
|
|
this file matched, and we can skip on to the next file with @code{nextfile}.
|
|
Along similar lines, if we are only printing file names, and we
|
|
don't need to count lines, we can print the file name, and then skip to the
|
|
next file with @code{nextfile}.
|
|
|
|
Finally, each line is printed, with a leading filename and colon if
|
|
necessary.
|
|
|
|
@ignore
|
|
2e: note, probably better to recode the last few lines as
|
|
if (! count_only) @{
|
|
if (no_print)
|
|
nextfile
|
|
|
|
if (filenames_only) @{
|
|
print FILENAME
|
|
nextfile
|
|
@}
|
|
|
|
if (do_filenames)
|
|
print FILENAME ":" $0
|
|
else
|
|
print
|
|
@}
|
|
@end ignore
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/egrep.awk
|
|
@{
|
|
matches = ($0 ~ pattern)
|
|
if (invert)
|
|
matches = ! matches
|
|
|
|
fcount += matches # 1 or 0
|
|
|
|
@group
|
|
if (! matches)
|
|
next
|
|
@end group
|
|
|
|
if (no_print && ! count_only)
|
|
nextfile
|
|
|
|
if (filenames_only && ! count_only) @{
|
|
print FILENAME
|
|
nextfile
|
|
@}
|
|
|
|
if (do_filenames && ! count_only)
|
|
print FILENAME ":" $0
|
|
else if (! count_only)
|
|
print
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
@c @strong{Exercise}: rearrange the code inside @samp{if (! count_only)}.
|
|
|
|
The @code{END} rule takes care of producing the correct exit status. If
|
|
there were no matches, the exit status is one, otherwise it is zero.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/egrep.awk
|
|
END \
|
|
@{
|
|
if (total == 0)
|
|
exit 1
|
|
exit 0
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{usage} function prints a usage message in case of invalid options
|
|
and then exits.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/egrep.awk
|
|
function usage( e)
|
|
@{
|
|
e = "Usage: egrep [-csvil] [-e pat] [files ...]"
|
|
print e > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The variable @code{e} is used so that the function fits nicely
|
|
on the printed page.
|
|
|
|
@cindex backslash continuation
|
|
Just a note on programming style. You may have noticed that the @code{END}
|
|
rule uses backslash continuation, with the open brace on a line by
|
|
itself. This is so that it more closely resembles the way functions
|
|
are written. Many of the examples
|
|
@iftex
|
|
in this chapter
|
|
@end iftex
|
|
use this style. You can decide for yourself if you like writing
|
|
your @code{BEGIN} and @code{END} rules this way,
|
|
or not.
|
|
|
|
@node Id Program, Split Program, Egrep Program, Clones
|
|
@subsection Printing Out User Information
|
|
|
|
@cindex @code{id} utility
|
|
The @code{id} utility lists a user's real and effective user-id numbers,
|
|
real and effective group-id numbers, and the user's group set, if any.
|
|
@code{id} will only print the effective user-id and group-id if they are
|
|
different from the real ones. If possible, @code{id} will also supply the
|
|
corresponding user and group names. The output might look like this:
|
|
|
|
@example
|
|
$ id
|
|
@print{} uid=2076(arnold) gid=10(staff) groups=10(staff),4(tty)
|
|
@end example
|
|
|
|
This information is exactly what is provided by @code{gawk}'s
|
|
@file{/dev/user} special file (@pxref{Special Files, ,Special File Names in @code{gawk}}).
|
|
However, the @code{id} utility provides a more palatable output than just a
|
|
string of numbers.
|
|
|
|
Here is a simple version of @code{id} written in @code{awk}.
|
|
It uses the user database library functions
|
|
(@pxref{Passwd Functions, ,Reading the User Database}),
|
|
and the group database library functions
|
|
(@pxref{Group Functions, ,Reading the Group Database}).
|
|
|
|
The program is fairly straightforward. All the work is done in the
|
|
@code{BEGIN} rule. The user and group id numbers are obtained from
|
|
@file{/dev/user}. If there is no support for @file{/dev/user}, the program
|
|
gives up.
|
|
|
|
The code is repetitive. The entry in the user database for the real user-id
|
|
number is split into parts at the @samp{:}. The name is the first field.
|
|
Similar code is used for the effective user-id number, and the group
|
|
numbers.
|
|
|
|
@findex id.awk
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/id.awk
|
|
# id.awk --- implement id in awk
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
# output is:
|
|
# uid=12(foo) euid=34(bar) gid=3(baz) \
|
|
# egid=5(blat) groups=9(nine),2(two),1(one)
|
|
|
|
BEGIN \
|
|
@{
|
|
if ((getline < "/dev/user") < 0) @{
|
|
err = "id: no /dev/user support - cannot run"
|
|
print err > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
close("/dev/user")
|
|
|
|
uid = $1
|
|
euid = $2
|
|
gid = $3
|
|
egid = $4
|
|
|
|
printf("uid=%d", uid)
|
|
pw = getpwuid(uid)
|
|
@group
|
|
if (pw != "") @{
|
|
split(pw, a, ":")
|
|
printf("(%s)", a[1])
|
|
@}
|
|
@end group
|
|
|
|
if (euid != uid) @{
|
|
printf(" euid=%d", euid)
|
|
pw = getpwuid(euid)
|
|
if (pw != "") @{
|
|
split(pw, a, ":")
|
|
printf("(%s)", a[1])
|
|
@}
|
|
@}
|
|
|
|
printf(" gid=%d", gid)
|
|
pw = getgrgid(gid)
|
|
if (pw != "") @{
|
|
split(pw, a, ":")
|
|
printf("(%s)", a[1])
|
|
@}
|
|
|
|
if (egid != gid) @{
|
|
printf(" egid=%d", egid)
|
|
pw = getgrgid(egid)
|
|
if (pw != "") @{
|
|
split(pw, a, ":")
|
|
printf("(%s)", a[1])
|
|
@}
|
|
@}
|
|
|
|
if (NF > 4) @{
|
|
printf(" groups=");
|
|
for (i = 5; i <= NF; i++) @{
|
|
printf("%d", $i)
|
|
pw = getgrgid($i)
|
|
if (pw != "") @{
|
|
split(pw, a, ":")
|
|
printf("(%s)", a[1])
|
|
@}
|
|
@group
|
|
if (i < NF)
|
|
printf(",")
|
|
@end group
|
|
@}
|
|
@}
|
|
print ""
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
@c exercise!!!
|
|
@ignore
|
|
The POSIX version of @code{id} takes arguments that control which
|
|
information is printed. Modify this version to accept the same
|
|
arguments and perform in the same way.
|
|
@end ignore
|
|
|
|
@node Split Program, Tee Program, Id Program, Clones
|
|
@subsection Splitting a Large File Into Pieces
|
|
|
|
@cindex @code{split} utility
|
|
The @code{split} program splits large text files into smaller pieces. By default,
|
|
the output files are named @file{xaa}, @file{xab}, and so on. Each file has
|
|
1000 lines in it, with the likely exception of the last file. To change the
|
|
number of lines in each file, you supply a number on the command line
|
|
preceded with a minus, e.g., @samp{-500} for files with 500 lines in them
|
|
instead of 1000. To change the name of the output files to something like
|
|
@file{myfileaa}, @file{myfileab}, and so on, you supply an additional
|
|
argument that specifies the filename.
|
|
|
|
Here is a version of @code{split} in @code{awk}. It uses the @code{ord} and
|
|
@code{chr} functions presented in
|
|
@ref{Ordinal Functions, ,Translating Between Characters and Numbers}.
|
|
|
|
The program first sets its defaults, and then tests to make sure there are
|
|
not too many arguments. It then looks at each argument in turn. The
|
|
first argument could be a minus followed by a number. If it is, this happens
|
|
to look like a negative number, so it is made positive, and that is the
|
|
count of lines. The data file name is skipped over, and the final argument
|
|
is used as the prefix for the output file names.
|
|
|
|
@findex split.awk
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/split.awk
|
|
# split.awk --- do split in awk
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
# usage: split [-num] [file] [outname]
|
|
|
|
BEGIN @{
|
|
outfile = "x" # default
|
|
count = 1000
|
|
if (ARGC > 4)
|
|
usage()
|
|
|
|
i = 1
|
|
if (ARGV[i] ~ /^-[0-9]+$/) @{
|
|
count = -ARGV[i]
|
|
ARGV[i] = ""
|
|
i++
|
|
@}
|
|
# test argv in case reading from stdin instead of file
|
|
if (i in ARGV)
|
|
i++ # skip data file name
|
|
if (i in ARGV) @{
|
|
outfile = ARGV[i]
|
|
ARGV[i] = ""
|
|
@}
|
|
|
|
s1 = s2 = "a"
|
|
out = (outfile s1 s2)
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The next rule does most of the work. @code{tcount} (temporary count) tracks
|
|
how many lines have been printed to the output file so far. If it is greater
|
|
than @code{count}, it is time to close the current file and start a new one.
|
|
@code{s1} and @code{s2} track the current suffixes for the file name. If
|
|
they are both @samp{z}, the file is just too big. Otherwise, @code{s1}
|
|
moves to the next letter in the alphabet and @code{s2} starts over again at
|
|
@samp{a}.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/split.awk
|
|
@{
|
|
if (++tcount > count) @{
|
|
close(out)
|
|
if (s2 == "z") @{
|
|
if (s1 == "z") @{
|
|
printf("split: %s is too large to split\n", \
|
|
FILENAME) > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
s1 = chr(ord(s1) + 1)
|
|
s2 = "a"
|
|
@} else
|
|
s2 = chr(ord(s2) + 1)
|
|
out = (outfile s1 s2)
|
|
tcount = 1
|
|
@}
|
|
print > out
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{usage} function simply prints an error message and exits.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/split.awk
|
|
function usage( e)
|
|
@{
|
|
e = "usage: split [-num] [file] [outname]"
|
|
print e > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
@noindent
|
|
The variable @code{e} is used so that the function
|
|
fits nicely on the
|
|
@iftex
|
|
page.
|
|
@end iftex
|
|
@ifinfo
|
|
screen.
|
|
@end ifinfo
|
|
|
|
This program is a bit sloppy; it relies on @code{awk} to close the last file
|
|
for it automatically, instead of doing it in an @code{END} rule.
|
|
|
|
@node Tee Program, Uniq Program, Split Program, Clones
|
|
@subsection Duplicating Output Into Multiple Files
|
|
|
|
@cindex @code{tee} utility
|
|
The @code{tee} program is known as a ``pipe fitting.'' @code{tee} copies
|
|
its standard input to its standard output, and also duplicates it to the
|
|
files named on the command line. Its usage is:
|
|
|
|
@example
|
|
tee @r{[}-a@r{]} file @dots{}
|
|
@end example
|
|
|
|
The @samp{-a} option tells @code{tee} to append to the named files, instead of
|
|
truncating them and starting over.
|
|
|
|
The @code{BEGIN} rule first makes a copy of all the command line arguments,
|
|
into an array named @code{copy}.
|
|
@code{ARGV[0]} is not copied, since it is not needed.
|
|
@code{tee} cannot use @code{ARGV} directly, since @code{awk} will attempt to
|
|
process each file named in @code{ARGV} as input data.
|
|
|
|
If the first argument is @samp{-a}, then the flag variable
|
|
@code{append} is set to true, and both @code{ARGV[1]} and
|
|
@code{copy[1]} are deleted. If @code{ARGC} is less than two, then no file
|
|
names were supplied, and @code{tee} prints a usage message and exits.
|
|
Finally, @code{awk} is forced to read the standard input by setting
|
|
@code{ARGV[1]} to @code{"-"}, and @code{ARGC} to two.
|
|
|
|
@c 2e: the `ARGC--' in the `if (ARGV[1] == "-a")' isn't needed.
|
|
|
|
@findex tee.awk
|
|
@example
|
|
@group
|
|
@c file eg/prog/tee.awk
|
|
# tee.awk --- tee in awk
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
# Revised December 1995
|
|
@end group
|
|
|
|
@group
|
|
BEGIN \
|
|
@{
|
|
for (i = 1; i < ARGC; i++)
|
|
copy[i] = ARGV[i]
|
|
@end group
|
|
|
|
@group
|
|
if (ARGV[1] == "-a") @{
|
|
append = 1
|
|
delete ARGV[1]
|
|
delete copy[1]
|
|
ARGC--
|
|
@}
|
|
@end group
|
|
@group
|
|
if (ARGC < 2) @{
|
|
print "usage: tee [-a] file ..." > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@end group
|
|
@group
|
|
ARGV[1] = "-"
|
|
ARGC = 2
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
The single rule does all the work. Since there is no pattern, it is
|
|
executed for each line of input. The body of the rule simply prints the
|
|
line into each file on the command line, and then to the standard output.
|
|
|
|
@example
|
|
@group
|
|
@c file eg/prog/tee.awk
|
|
@{
|
|
# moving the if outside the loop makes it run faster
|
|
if (append)
|
|
for (i in copy)
|
|
print >> copy[i]
|
|
else
|
|
for (i in copy)
|
|
print > copy[i]
|
|
print
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
It would have been possible to code the loop this way:
|
|
|
|
@example
|
|
for (i in copy)
|
|
if (append)
|
|
print >> copy[i]
|
|
else
|
|
print > copy[i]
|
|
@end example
|
|
|
|
@noindent
|
|
This is more concise, but it is also less efficient. The @samp{if} is
|
|
tested for each record and for each output file. By duplicating the loop
|
|
body, the @samp{if} is only tested once for each input record. If there are
|
|
@var{N} input records and @var{M} input files, the first method only
|
|
executes @var{N} @samp{if} statements, while the second would execute
|
|
@var{N}@code{*}@var{M} @samp{if} statements.
|
|
|
|
Finally, the @code{END} rule cleans up, by closing all the output files.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/tee.awk
|
|
END \
|
|
@{
|
|
for (i in copy)
|
|
close(copy[i])
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
@node Uniq Program, Wc Program, Tee Program, Clones
|
|
@subsection Printing Non-duplicated Lines of Text
|
|
|
|
@cindex @code{uniq} utility
|
|
The @code{uniq} utility reads sorted lines of data on its standard input,
|
|
and (by default) removes duplicate lines. In other words, only unique lines
|
|
are printed, hence the name. @code{uniq} has a number of options. The usage is:
|
|
|
|
@example
|
|
uniq @r{[}-udc @r{[}-@var{n}@r{]]} @r{[}+@var{n}@r{]} @r{[} @var{input file} @r{[} @var{output file} @r{]]}
|
|
@end example
|
|
|
|
The option meanings are:
|
|
|
|
@table @code
|
|
@item -d
|
|
Only print repeated lines.
|
|
|
|
@item -u
|
|
Only print non-repeated lines.
|
|
|
|
@item -c
|
|
Count lines. This option overrides @samp{-d} and @samp{-u}. Both repeated
|
|
and non-repeated lines are counted.
|
|
|
|
@item -@var{n}
|
|
Skip @var{n} fields before comparing lines. The definition of fields
|
|
is similar to @code{awk}'s default: non-whitespace characters separated
|
|
by runs of spaces and/or tabs.
|
|
|
|
@item +@var{n}
|
|
Skip @var{n} characters before comparing lines. Any fields specified with
|
|
@samp{-@var{n}} are skipped first.
|
|
|
|
@item @var{input file}
|
|
Data is read from the input file named on the command line, instead of from
|
|
the standard input.
|
|
|
|
@item @var{output file}
|
|
The generated output is sent to the named output file, instead of to the
|
|
standard output.
|
|
@end table
|
|
|
|
Normally @code{uniq} behaves as if both the @samp{-d} and @samp{-u} options
|
|
had been provided.
|
|
|
|
Here is an @code{awk} implementation of @code{uniq}. It uses the
|
|
@code{getopt} library function
|
|
(@pxref{Getopt Function, ,Processing Command Line Options}),
|
|
and the @code{join} library function
|
|
(@pxref{Join Function, ,Merging an Array Into a String}).
|
|
|
|
The program begins with a @code{usage} function and then a brief outline of
|
|
the options and their meanings in a comment.
|
|
|
|
The @code{BEGIN} rule deals with the command line arguments and options. It
|
|
uses a trick to get @code{getopt} to handle options of the form @samp{-25},
|
|
treating such an option as the option letter @samp{2} with an argument of
|
|
@samp{5}. If indeed two or more digits were supplied (@code{Optarg} looks
|
|
like a number), @code{Optarg} is
|
|
concatenated with the option digit, and then result is added to zero to make
|
|
it into a number. If there is only one digit in the option, then
|
|
@code{Optarg} is not needed, and @code{Optind} must be decremented so that
|
|
@code{getopt} will process it next time. This code is admittedly a bit
|
|
tricky.
|
|
|
|
If no options were supplied, then the default is taken, to print both
|
|
repeated and non-repeated lines. The output file, if provided, is assigned
|
|
to @code{outputfile}. Earlier, @code{outputfile} was initialized to the
|
|
standard output, @file{/dev/stdout}.
|
|
|
|
@findex uniq.awk
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/uniq.awk
|
|
# uniq.awk --- do uniq in awk
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
@group
|
|
function usage( e)
|
|
@{
|
|
e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
|
|
print e > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@end group
|
|
|
|
@group
|
|
# -c count lines. overrides -d and -u
|
|
# -d only repeated lines
|
|
# -u only non-repeated lines
|
|
# -n skip n fields
|
|
# +n skip n characters, skip fields first
|
|
@end group
|
|
|
|
BEGIN \
|
|
@{
|
|
count = 1
|
|
outputfile = "/dev/stdout"
|
|
opts = "udc0:1:2:3:4:5:6:7:8:9:"
|
|
while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
|
|
if (c == "u")
|
|
non_repeated_only++
|
|
else if (c == "d")
|
|
repeated_only++
|
|
else if (c == "c")
|
|
do_count++
|
|
else if (index("0123456789", c) != 0) @{
|
|
# getopt requires args to options
|
|
# this messes us up for things like -5
|
|
if (Optarg ~ /^[0-9]+$/)
|
|
fcount = (c Optarg) + 0
|
|
else @{
|
|
fcount = c + 0
|
|
Optind--
|
|
@}
|
|
@} else
|
|
usage()
|
|
@}
|
|
|
|
if (ARGV[Optind] ~ /^\+[0-9]+$/) @{
|
|
charcount = substr(ARGV[Optind], 2) + 0
|
|
Optind++
|
|
@}
|
|
|
|
for (i = 1; i < Optind; i++)
|
|
ARGV[i] = ""
|
|
|
|
if (repeated_only == 0 && non_repeated_only == 0)
|
|
repeated_only = non_repeated_only = 1
|
|
|
|
@group
|
|
if (ARGC - Optind == 2) @{
|
|
outputfile = ARGV[ARGC - 1]
|
|
ARGV[ARGC - 1] = ""
|
|
@}
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
The following function, @code{are_equal}, compares the current line,
|
|
@code{$0}, to the
|
|
previous line, @code{last}. It handles skipping fields and characters.
|
|
|
|
If no field count and no character count were specified, @code{are_equal}
|
|
simply returns one or zero depending upon the result of a simple string
|
|
comparison of @code{last} and @code{$0}. Otherwise, things get more
|
|
complicated.
|
|
|
|
If fields have to be skipped, each line is broken into an array using
|
|
@code{split}
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
|
|
and then the desired fields are joined back into a line using @code{join}.
|
|
The joined lines are stored in @code{clast} and @code{cline}.
|
|
If no fields are skipped, @code{clast} and @code{cline} are set to
|
|
@code{last} and @code{$0} respectively.
|
|
|
|
Finally, if characters are skipped, @code{substr} is used to strip off the
|
|
leading @code{charcount} characters in @code{clast} and @code{cline}. The
|
|
two strings are then compared, and @code{are_equal} returns the result.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/uniq.awk
|
|
function are_equal( n, m, clast, cline, alast, aline)
|
|
@{
|
|
if (fcount == 0 && charcount == 0)
|
|
return (last == $0)
|
|
|
|
if (fcount > 0) @{
|
|
n = split(last, alast)
|
|
m = split($0, aline)
|
|
clast = join(alast, fcount+1, n)
|
|
cline = join(aline, fcount+1, m)
|
|
@} else @{
|
|
clast = last
|
|
cline = $0
|
|
@}
|
|
if (charcount) @{
|
|
clast = substr(clast, charcount + 1)
|
|
cline = substr(cline, charcount + 1)
|
|
@}
|
|
|
|
return (clast == cline)
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The following two rules are the body of the program. The first one is
|
|
executed only for the very first line of data. It sets @code{last} equal to
|
|
@code{$0}, so that subsequent lines of text have something to be compared to.
|
|
|
|
The second rule does the work. The variable @code{equal} will be one or zero
|
|
depending upon the results of @code{are_equal}'s comparison. If @code{uniq}
|
|
is counting repeated lines, then the @code{count} variable is incremented if
|
|
the lines are equal. Otherwise the line is printed and @code{count} is
|
|
reset, since the two lines are not equal.
|
|
|
|
If @code{uniq} is not counting, @code{count} is incremented if the lines are
|
|
equal. Otherwise, if @code{uniq} is counting repeated lines, and more than
|
|
one line has been seen, or if @code{uniq} is counting non-repeated lines,
|
|
and only one line has been seen, then the line is printed, and @code{count}
|
|
is reset.
|
|
|
|
Finally, similar logic is used in the @code{END} rule to print the final
|
|
line of input data.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/uniq.awk
|
|
@group
|
|
NR == 1 @{
|
|
last = $0
|
|
next
|
|
@}
|
|
@end group
|
|
|
|
@{
|
|
equal = are_equal()
|
|
|
|
if (do_count) @{ # overrides -d and -u
|
|
if (equal)
|
|
count++
|
|
else @{
|
|
printf("%4d %s\n", count, last) > outputfile
|
|
last = $0
|
|
count = 1 # reset
|
|
@}
|
|
next
|
|
@}
|
|
|
|
if (equal)
|
|
count++
|
|
else @{
|
|
if ((repeated_only && count > 1) ||
|
|
(non_repeated_only && count == 1))
|
|
print last > outputfile
|
|
last = $0
|
|
count = 1
|
|
@}
|
|
@}
|
|
|
|
@group
|
|
END @{
|
|
if (do_count)
|
|
printf("%4d %s\n", count, last) > outputfile
|
|
else if ((repeated_only && count > 1) ||
|
|
(non_repeated_only && count == 1))
|
|
print last > outputfile
|
|
@}
|
|
@end group
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
@node Wc Program, , Uniq Program, Clones
|
|
@subsection Counting Things
|
|
|
|
@cindex @code{wc} utility
|
|
The @code{wc} (word count) utility counts lines, words, and characters in
|
|
one or more input files. Its usage is:
|
|
|
|
@example
|
|
wc @r{[}-lwc@r{]} @r{[} @var{files} @dots{} @r{]}
|
|
@end example
|
|
|
|
If no files are specified on the command line, @code{wc} reads its standard
|
|
input. If there are multiple files, it will also print total counts for all
|
|
the files. The options and their meanings are:
|
|
|
|
@table @code
|
|
@item -l
|
|
Only count lines.
|
|
|
|
@item -w
|
|
Only count words.
|
|
A ``word'' is a contiguous sequence of non-whitespace characters, separated
|
|
by spaces and/or tabs. Happily, this is the normal way @code{awk} separates
|
|
fields in its input data.
|
|
|
|
@item -c
|
|
Only count characters.
|
|
@end table
|
|
|
|
Implementing @code{wc} in @code{awk} is particularly elegant, since
|
|
@code{awk} does a lot of the work for us; it splits lines into words (i.e.@:
|
|
fields) and counts them, it counts lines (i.e.@: records) for us, and it can
|
|
easily tell us how long a line is.
|
|
|
|
This version uses the @code{getopt} library function
|
|
(@pxref{Getopt Function, ,Processing Command Line Options}),
|
|
and the file transition functions
|
|
(@pxref{Filetrans Function, ,Noting Data File Boundaries}).
|
|
|
|
This version has one major difference from traditional versions of @code{wc}.
|
|
Our version always prints the counts in the order lines, words,
|
|
and characters. Traditional versions note the order of the @samp{-l},
|
|
@samp{-w}, and @samp{-c} options on the command line, and print the counts
|
|
in that order.
|
|
|
|
The @code{BEGIN} rule does the argument processing.
|
|
The variable @code{print_total} will
|
|
be true if more than one file was named on the command line.
|
|
|
|
@findex wc.awk
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/wc.awk
|
|
# wc.awk --- count lines, words, characters
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
# Options:
|
|
# -l only count lines
|
|
# -w only count words
|
|
# -c only count characters
|
|
#
|
|
# Default is to count lines, words, characters
|
|
|
|
BEGIN @{
|
|
# let getopt print a message about
|
|
# invalid options. we ignore them
|
|
while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
|
|
if (c == "l")
|
|
do_lines = 1
|
|
else if (c == "w")
|
|
do_words = 1
|
|
else if (c == "c")
|
|
do_chars = 1
|
|
@}
|
|
for (i = 1; i < Optind; i++)
|
|
ARGV[i] = ""
|
|
|
|
# if no options, do all
|
|
if (! do_lines && ! do_words && ! do_chars)
|
|
do_lines = do_words = do_chars = 1
|
|
|
|
print_total = (ARGC - i > 2)
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{beginfile} function is simple; it just resets the counts of lines,
|
|
words, and characters to zero, and saves the current file name in
|
|
@code{fname}.
|
|
|
|
The @code{endfile} function adds the current file's numbers to the running
|
|
totals of lines, words, and characters. It then prints out those numbers
|
|
for the file that was just read. It relies on @code{beginfile} to reset the
|
|
numbers for the following data file.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/wc.awk
|
|
function beginfile(file)
|
|
@{
|
|
chars = lines = words = 0
|
|
fname = FILENAME
|
|
@}
|
|
|
|
function endfile(file)
|
|
@{
|
|
tchars += chars
|
|
tlines += lines
|
|
twords += words
|
|
@group
|
|
if (do_lines)
|
|
printf "\t%d", lines
|
|
@end group
|
|
if (do_words)
|
|
printf "\t%d", words
|
|
if (do_chars)
|
|
printf "\t%d", chars
|
|
printf "\t%s\n", fname
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
There is one rule that is executed for each line. It adds the length of the
|
|
record to @code{chars}. It has to add one, since the newline character
|
|
separating records (the value of @code{RS}) is not part of the record
|
|
itself. @code{lines} is incremented for each line read, and @code{words} is
|
|
incremented by the value of @code{NF}, the number of ``words'' on this
|
|
line.@footnote{Examine the code in
|
|
@ref{Filetrans Function, ,Noting Data File Boundaries}.
|
|
Why must @code{wc} use a separate @code{lines} variable, instead of using
|
|
the value of @code{FNR} in @code{endfile}?}
|
|
|
|
Finally, the @code{END} rule simply prints the totals for all the files.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/wc.awk
|
|
# do per line
|
|
@{
|
|
chars += length($0) + 1 # get newline
|
|
lines++
|
|
words += NF
|
|
@}
|
|
|
|
END @{
|
|
if (print_total) @{
|
|
if (do_lines)
|
|
printf "\t%d", tlines
|
|
if (do_words)
|
|
printf "\t%d", twords
|
|
if (do_chars)
|
|
printf "\t%d", tchars
|
|
print "\ttotal"
|
|
@}
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
@node Miscellaneous Programs, , Clones, Sample Programs
|
|
@section A Grab Bag of @code{awk} Programs
|
|
|
|
This section is a large ``grab bag'' of miscellaneous programs.
|
|
We hope you find them both interesting and enjoyable.
|
|
|
|
@menu
|
|
* Dupword Program:: Finding duplicated words in a document.
|
|
* Alarm Program:: An alarm clock.
|
|
* Translate Program:: A program similar to the @code{tr} utility.
|
|
* Labels Program:: Printing mailing labels.
|
|
* Word Sorting:: A program to produce a word usage count.
|
|
* History Sorting:: Eliminating duplicate entries from a history
|
|
file.
|
|
* Extract Program:: Pulling out programs from Texinfo source
|
|
files.
|
|
* Simple Sed:: A Simple Stream Editor.
|
|
* Igawk Program:: A wrapper for @code{awk} that includes files.
|
|
@end menu
|
|
|
|
@node Dupword Program, Alarm Program, Miscellaneous Programs, Miscellaneous Programs
|
|
@subsection Finding Duplicated Words in a Document
|
|
|
|
A common error when writing large amounts of prose is to accidentally
|
|
duplicate words. Often you will see this in text as something like ``the
|
|
the program does the following @dots{}.'' When the text is on-line, often
|
|
the duplicated words occur at the end of one line and the beginning of
|
|
another, making them very difficult to spot.
|
|
@c as here!
|
|
|
|
This program, @file{dupword.awk}, scans through a file one line at a time,
|
|
and looks for adjacent occurrences of the same word. It also saves the last
|
|
word on a line (in the variable @code{prev}) for comparison with the first
|
|
word on the next line.
|
|
|
|
The first two statements make sure that the line is all lower-case, so that,
|
|
for example,
|
|
``The'' and ``the'' compare equal to each other. The second statement
|
|
removes all non-alphanumeric and non-whitespace characters from the line, so
|
|
that punctuation does not affect the comparison either. This sometimes
|
|
leads to reports of duplicated words that really are different, but this is
|
|
unusual.
|
|
|
|
@c FIXME: add check for $i != ""
|
|
@findex dupword.awk
|
|
@example
|
|
@group
|
|
@c file eg/prog/dupword.awk
|
|
# dupword --- find duplicate words in text
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# December 1991
|
|
|
|
@{
|
|
$0 = tolower($0)
|
|
gsub(/[^A-Za-z0-9 \t]/, "");
|
|
if ($1 == prev)
|
|
printf("%s:%d: duplicate %s\n",
|
|
FILENAME, FNR, $1)
|
|
for (i = 2; i <= NF; i++)
|
|
if ($i == $(i-1))
|
|
printf("%s:%d: duplicate %s\n",
|
|
FILENAME, FNR, $i)
|
|
prev = $NF
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
@node Alarm Program, Translate Program, Dupword Program, Miscellaneous Programs
|
|
@subsection An Alarm Clock Program
|
|
|
|
The following program is a simple ``alarm clock'' program.
|
|
You give it a time of day, and an optional message. At the given time,
|
|
it prints the message on the standard output. In addition, you can give it
|
|
the number of times to repeat the message, and also a delay between
|
|
repetitions.
|
|
|
|
This program uses the @code{gettimeofday} function from
|
|
@ref{Gettimeofday Function, ,Managing the Time of Day}.
|
|
|
|
All the work is done in the @code{BEGIN} rule. The first part is argument
|
|
checking and setting of defaults; the delay, the count, and the message to
|
|
print. If the user supplied a message, but it does not contain the ASCII BEL
|
|
character (known as the ``alert'' character, @samp{\a}), then it is added to
|
|
the message. (On many systems, printing the ASCII BEL generates some sort
|
|
of audible alert. Thus, when the alarm goes off, the system calls attention
|
|
to itself, in case the user is not looking at their computer or terminal.)
|
|
|
|
@findex alarm.awk
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/alarm.awk
|
|
# alarm --- set an alarm
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
# usage: alarm time [ "message" [ count [ delay ] ] ]
|
|
|
|
BEGIN \
|
|
@{
|
|
# Initial argument sanity checking
|
|
usage1 = "usage: alarm time ['message' [count [delay]]]"
|
|
usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])
|
|
|
|
if (ARGC < 2) @{
|
|
print usage > "/dev/stderr"
|
|
exit 1
|
|
@} else if (ARGC == 5) @{
|
|
delay = ARGV[4] + 0
|
|
count = ARGV[3] + 0
|
|
message = ARGV[2]
|
|
@} else if (ARGC == 4) @{
|
|
count = ARGV[3] + 0
|
|
message = ARGV[2]
|
|
@} else if (ARGC == 3) @{
|
|
message = ARGV[2]
|
|
@} else if (ARGV[1] !~ /[0-9]?[0-9]:[0-9][0-9]/) @{
|
|
print usage1 > "/dev/stderr"
|
|
print usage2 > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
|
|
# set defaults for once we reach the desired time
|
|
if (delay == 0)
|
|
delay = 180 # 3 minutes
|
|
if (count == 0)
|
|
count = 5
|
|
@group
|
|
if (message == "")
|
|
message = sprintf("\aIt is now %s!\a", ARGV[1])
|
|
else if (index(message, "\a") == 0)
|
|
message = "\a" message "\a"
|
|
@end group
|
|
@c endfile
|
|
@end example
|
|
|
|
The next section of code turns the alarm time into hours and minutes,
|
|
and converts it if necessary to a 24-hour clock. Then it turns that
|
|
time into a count of the seconds since midnight. Next it turns the current
|
|
time into a count of seconds since midnight. The difference between the two
|
|
is how long to wait before setting off the alarm.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/alarm.awk
|
|
# split up dest time
|
|
split(ARGV[1], atime, ":")
|
|
hour = atime[1] + 0 # force numeric
|
|
minute = atime[2] + 0 # force numeric
|
|
|
|
# get current broken down time
|
|
gettimeofday(now)
|
|
|
|
# if time given is 12-hour hours and it's after that
|
|
# hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
|
|
# then add 12 to real hour
|
|
if (hour < 12 && now["hour"] > hour)
|
|
hour += 12
|
|
|
|
# set target time in seconds since midnight
|
|
target = (hour * 60 * 60) + (minute * 60)
|
|
|
|
# get current time in seconds since midnight
|
|
current = (now["hour"] * 60 * 60) + \
|
|
(now["minute"] * 60) + now["second"]
|
|
|
|
# how long to sleep for
|
|
naptime = target - current
|
|
if (naptime <= 0) @{
|
|
print "time is in the past!" > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
Finally, the program uses the @code{system} function
|
|
(@pxref{I/O Functions, ,Built-in Functions for Input/Output})
|
|
to call the @code{sleep} utility. The @code{sleep} utility simply pauses
|
|
for the given number of seconds. If the exit status is not zero,
|
|
the program assumes that @code{sleep} was interrupted, and exits. If
|
|
@code{sleep} exited with an OK status (zero), then the program prints the
|
|
message in a loop, again using @code{sleep} to delay for however many
|
|
seconds are necessary.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/alarm.awk
|
|
# zzzzzz..... go away if interrupted
|
|
if (system(sprintf("sleep %d", naptime)) != 0)
|
|
exit 1
|
|
|
|
# time to notify!
|
|
command = sprintf("sleep %d", delay)
|
|
for (i = 1; i <= count; i++) @{
|
|
print message
|
|
# if sleep command interrupted, go away
|
|
if (system(command) != 0)
|
|
break
|
|
@}
|
|
|
|
exit 0
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
@node Translate Program, Labels Program, Alarm Program, Miscellaneous Programs
|
|
@subsection Transliterating Characters
|
|
|
|
The system @code{tr} utility transliterates characters. For example, it is
|
|
often used to map upper-case letters into lower-case, for further
|
|
processing.
|
|
|
|
@example
|
|
@var{generate data} | tr '[A-Z]' '[a-z]' | @var{process data} @dots{}
|
|
@end example
|
|
|
|
You give @code{tr} two lists of characters enclosed in square brackets.
|
|
Usually, the lists are quoted to keep the shell from attempting to do a
|
|
filename expansion.@footnote{On older, non-POSIX systems, @code{tr} often
|
|
does not require that the lists be enclosed in square brackets and quoted.
|
|
This is a feature.} When processing the input, the
|
|
first character in the first list is replaced with the first character in the
|
|
second list, the second character in the first list is replaced with the
|
|
second character in the second list, and so on.
|
|
If there are more characters in the ``from'' list than in the ``to'' list,
|
|
the last character of the ``to'' list is used for the remaining characters
|
|
in the ``from'' list.
|
|
|
|
Some time ago,
|
|
@c early or mid-1989!
|
|
a user proposed to us that we add a transliteration function to @code{gawk}.
|
|
Being opposed to ``creeping featurism,'' I wrote the following program to
|
|
prove that character transliteration could be done with a user-level
|
|
function. This program is not as complete as the system @code{tr} utility,
|
|
but it will do most of the job.
|
|
|
|
The @code{translate} program demonstrates one of the few weaknesses of
|
|
standard
|
|
@code{awk}: dealing with individual characters is very painful, requiring
|
|
repeated use of the @code{substr}, @code{index}, and @code{gsub} built-in
|
|
functions
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).@footnote{This
|
|
program was written before @code{gawk} acquired the ability to
|
|
split each character in a string into separate array elements.
|
|
How might this ability simplify the program?}
|
|
|
|
There are two functions. The first, @code{stranslate}, takes three
|
|
arguments.
|
|
|
|
@table @code
|
|
@item from
|
|
A list of characters to translate from.
|
|
|
|
@item to
|
|
A list of characters to translate to.
|
|
|
|
@item target
|
|
The string to do the translation on.
|
|
@end table
|
|
|
|
Associative arrays make the translation part fairly easy. @code{t_ar} holds
|
|
the ``to'' characters, indexed by the ``from'' characters. Then a simple
|
|
loop goes through @code{from}, one character at a time. For each character
|
|
in @code{from}, if the character appears in @code{target}, @code{gsub}
|
|
is used to change it to the corresponding @code{to} character.
|
|
|
|
The @code{translate} function simply calls @code{stranslate} using @code{$0}
|
|
as the target. The main program sets two global variables, @code{FROM} and
|
|
@code{TO}, from the command line, and then changes @code{ARGV} so that
|
|
@code{awk} will read from the standard input.
|
|
|
|
Finally, the processing rule simply calls @code{translate} for each record.
|
|
|
|
@findex translate.awk
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/translate.awk
|
|
# translate --- do tr like stuff
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# August 1989
|
|
|
|
# bugs: does not handle things like: tr A-Z a-z, it has
|
|
# to be spelled out. However, if `to' is shorter than `from',
|
|
# the last character in `to' is used for the rest of `from'.
|
|
|
|
function stranslate(from, to, target, lf, lt, t_ar, i, c)
|
|
@{
|
|
lf = length(from)
|
|
lt = length(to)
|
|
for (i = 1; i <= lt; i++)
|
|
t_ar[substr(from, i, 1)] = substr(to, i, 1)
|
|
if (lt < lf)
|
|
for (; i <= lf; i++)
|
|
t_ar[substr(from, i, 1)] = substr(to, lt, 1)
|
|
for (i = 1; i <= lf; i++) @{
|
|
c = substr(from, i, 1)
|
|
if (index(target, c) > 0)
|
|
gsub(c, t_ar[c], target)
|
|
@}
|
|
return target
|
|
@}
|
|
|
|
@group
|
|
function translate(from, to)
|
|
@{
|
|
return $0 = stranslate(from, to, $0)
|
|
@}
|
|
@end group
|
|
|
|
# main program
|
|
BEGIN @{
|
|
if (ARGC < 3) @{
|
|
print "usage: translate from to" > "/dev/stderr"
|
|
exit
|
|
@}
|
|
FROM = ARGV[1]
|
|
TO = ARGV[2]
|
|
ARGC = 2
|
|
ARGV[1] = "-"
|
|
@}
|
|
|
|
@{
|
|
translate(FROM, TO)
|
|
print
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
While it is possible to do character transliteration in a user-level
|
|
function, it is not necessarily efficient, and we started to consider adding
|
|
a built-in function. However, shortly after writing this program, we learned
|
|
that the System V Release 4 @code{awk} had added the @code{toupper} and
|
|
@code{tolower} functions. These functions handle the vast majority of the
|
|
cases where character transliteration is necessary, and so we chose to
|
|
simply add those functions to @code{gawk} as well, and then leave well
|
|
enough alone.
|
|
|
|
An obvious improvement to this program would be to set up the
|
|
@code{t_ar} array only once, in a @code{BEGIN} rule. However, this
|
|
assumes that the ``from'' and ``to'' lists
|
|
will never change throughout the lifetime of the program.
|
|
|
|
@node Labels Program, Word Sorting, Translate Program, Miscellaneous Programs
|
|
@subsection Printing Mailing Labels
|
|
|
|
Here is a ``real world''@footnote{``Real world'' is defined as
|
|
``a program actually used to get something done.''}
|
|
program. This script reads lists of names and
|
|
addresses, and generates mailing labels. Each page of labels has 20 labels
|
|
on it, two across and ten down. The addresses are guaranteed to be no more
|
|
than five lines of data. Each address is separated from the next by a blank
|
|
line.
|
|
|
|
The basic idea is to read 20 labels worth of data. Each line of each label
|
|
is stored in the @code{line} array. The single rule takes care of filling
|
|
the @code{line} array and printing the page when 20 labels have been read.
|
|
|
|
The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
|
|
@code{awk} will split records at blank lines
|
|
(@pxref{Records, ,How Input is Split into Records}).
|
|
It sets @code{MAXLINES} to 100, since @code{MAXLINE} is the maximum number
|
|
of lines on the page (20 * 5 = 100).
|
|
|
|
Most of the work is done in the @code{printpage} function.
|
|
The label lines are stored sequentially in the @code{line} array. But they
|
|
have to be printed horizontally; @code{line[1]} next to @code{line[6]},
|
|
@code{line[2]} next to @code{line[7]}, and so on. Two loops are used to
|
|
accomplish this. The outer loop, controlled by @code{i}, steps through
|
|
every 10 lines of data; this is each row of labels. The inner loop,
|
|
controlled by @code{j}, goes through the lines within the row.
|
|
As @code{j} goes from zero to four, @samp{i+j} is the @code{j}'th line in
|
|
the row, and @samp{i+j+5} is the entry next to it. The output ends up
|
|
looking something like this:
|
|
|
|
@example
|
|
line 1 line 6
|
|
line 2 line 7
|
|
line 3 line 8
|
|
line 4 line 9
|
|
line 5 line 10
|
|
@end example
|
|
|
|
As a final note, at lines 21 and 61, an extra blank line is printed, to keep
|
|
the output lined up on the labels. This is dependent on the particular
|
|
brand of labels in use when the program was written. You will also note
|
|
that there are two blank lines at the top and two blank lines at the bottom.
|
|
|
|
The @code{END} rule arranges to flush the final page of labels; there may
|
|
not have been an even multiple of 20 labels in the data.
|
|
|
|
@findex labels.awk
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/labels.awk
|
|
# labels.awk
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# June 1992
|
|
|
|
# Program to print labels. Each label is 5 lines of data
|
|
# that may have blank lines. The label sheets have 2
|
|
# blank lines at the top and 2 at the bottom.
|
|
|
|
BEGIN @{ RS = "" ; MAXLINES = 100 @}
|
|
|
|
function printpage( i, j)
|
|
@{
|
|
if (Nlines <= 0)
|
|
return
|
|
|
|
printf "\n\n" # header
|
|
|
|
for (i = 1; i <= Nlines; i += 10) @{
|
|
if (i == 21 || i == 61)
|
|
print ""
|
|
for (j = 0; j < 5; j++) @{
|
|
if (i + j > MAXLINES)
|
|
break
|
|
printf " %-41s %s\n", line[i+j], line[i+j+5]
|
|
@}
|
|
print ""
|
|
@}
|
|
|
|
printf "\n\n" # footer
|
|
|
|
for (i in line)
|
|
line[i] = ""
|
|
@}
|
|
|
|
# main rule
|
|
@{
|
|
if (Count >= 20) @{
|
|
printpage()
|
|
Count = 0
|
|
Nlines = 0
|
|
@}
|
|
n = split($0, a, "\n")
|
|
for (i = 1; i <= n; i++)
|
|
line[++Nlines] = a[i]
|
|
for (; i <= 5; i++)
|
|
line[++Nlines] = ""
|
|
Count++
|
|
@}
|
|
|
|
END \
|
|
@{
|
|
printpage()
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
@node Word Sorting, History Sorting, Labels Program, Miscellaneous Programs
|
|
@subsection Generating Word Usage Counts
|
|
|
|
The following @code{awk} program prints
|
|
the number of occurrences of each word in its input. It illustrates the
|
|
associative nature of @code{awk} arrays by using strings as subscripts. It
|
|
also demonstrates the @samp{for @var{x} in @var{array}} construction.
|
|
Finally, it shows how @code{awk} can be used in conjunction with other
|
|
utility programs to do a useful task of some complexity with a minimum of
|
|
effort. Some explanations follow the program listing.
|
|
|
|
@example
|
|
awk '
|
|
# Print list of word frequencies
|
|
@{
|
|
for (i = 1; i <= NF; i++)
|
|
freq[$i]++
|
|
@}
|
|
|
|
END @{
|
|
for (word in freq)
|
|
printf "%s\t%d\n", word, freq[word]
|
|
@}'
|
|
@end example
|
|
|
|
The first thing to notice about this program is that it has two rules. The
|
|
first rule, because it has an empty pattern, is executed on every line of
|
|
the input. It uses @code{awk}'s field-accessing mechanism
|
|
(@pxref{Fields, ,Examining Fields}) to pick out the individual words from
|
|
the line, and the built-in variable @code{NF} (@pxref{Built-in Variables})
|
|
to know how many fields are available.
|
|
|
|
For each input word, an element of the array @code{freq} is incremented to
|
|
reflect that the word has been seen an additional time.
|
|
|
|
The second rule, because it has the pattern @code{END}, is not executed
|
|
until the input has been exhausted. It prints out the contents of the
|
|
@code{freq} table that has been built up inside the first action.
|
|
|
|
This program has several problems that would prevent it from being
|
|
useful by itself on real text files:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Words are detected using the @code{awk} convention that fields are
|
|
separated by whitespace and that other characters in the input (except
|
|
newlines) don't have any special meaning to @code{awk}. This means that
|
|
punctuation characters count as part of words.
|
|
|
|
@item
|
|
The @code{awk} language considers upper- and lower-case characters to be
|
|
distinct. Therefore, @samp{bartender} and @samp{Bartender} are not treated
|
|
as the same word. This is undesirable since, in normal text, words
|
|
are capitalized if they begin sentences, and a frequency analyzer should not
|
|
be sensitive to capitalization.
|
|
|
|
@item
|
|
The output does not come out in any useful order. You're more likely to be
|
|
interested in which words occur most frequently, or having an alphabetized
|
|
table of how frequently each word occurs.
|
|
@end itemize
|
|
|
|
The way to solve these problems is to use some of the more advanced
|
|
features of the @code{awk} language. First, we use @code{tolower} to remove
|
|
case distinctions. Next, we use @code{gsub} to remove punctuation
|
|
characters. Finally, we use the system @code{sort} utility to process the
|
|
output of the @code{awk} script. Here is the new version of
|
|
the program:
|
|
|
|
@findex wordfreq.sh
|
|
@example
|
|
@c file eg/prog/wordfreq.awk
|
|
# Print list of word frequencies
|
|
@{
|
|
$0 = tolower($0) # remove case distinctions
|
|
gsub(/[^a-z0-9_ \t]/, "", $0) # remove punctuation
|
|
for (i = 1; i <= NF; i++)
|
|
freq[$i]++
|
|
@}
|
|
@c endfile
|
|
|
|
END @{
|
|
for (word in freq)
|
|
printf "%s\t%d\n", word, freq[word]
|
|
@}
|
|
@end example
|
|
|
|
Assuming we have saved this program in a file named @file{wordfreq.awk},
|
|
and that the data is in @file{file1}, the following pipeline
|
|
|
|
@example
|
|
awk -f wordfreq.awk file1 | sort +1 -nr
|
|
@end example
|
|
|
|
@noindent
|
|
produces a table of the words appearing in @file{file1} in order of
|
|
decreasing frequency.
|
|
|
|
The @code{awk} program suitably massages the data and produces a word
|
|
frequency table, which is not ordered.
|
|
|
|
The @code{awk} script's output is then sorted by the @code{sort} utility and
|
|
printed on the terminal. The options given to @code{sort} in this example
|
|
specify to sort using the second field of each input line (skipping one field),
|
|
that the sort keys should be treated as numeric quantities (otherwise
|
|
@samp{15} would come before @samp{5}), and that the sorting should be done
|
|
in descending (reverse) order.
|
|
|
|
We could have even done the @code{sort} from within the program, by
|
|
changing the @code{END} action to:
|
|
|
|
@example
|
|
@c file eg/prog/wordfreq.awk
|
|
END @{
|
|
sort = "sort +1 -nr"
|
|
for (word in freq)
|
|
printf "%s\t%d\n", word, freq[word] | sort
|
|
close(sort)
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
You would have to use this way of sorting on systems that do not
|
|
have true pipes.
|
|
|
|
See the general operating system documentation for more information on how
|
|
to use the @code{sort} program.
|
|
|
|
@node History Sorting, Extract Program, Word Sorting, Miscellaneous Programs
|
|
@subsection Removing Duplicates from Unsorted Text
|
|
|
|
The @code{uniq} program
|
|
(@pxref{Uniq Program, ,Printing Non-duplicated Lines of Text}),
|
|
removes duplicate lines from @emph{sorted} data.
|
|
|
|
Suppose, however, you need to remove duplicate lines from a data file, but
|
|
that you wish to preserve the order the lines are in? A good example of
|
|
this might be a shell history file. The history file keeps a copy of all
|
|
the commands you have entered, and it is not unusual to repeat a command
|
|
several times in a row. Occasionally you might wish to compact the history
|
|
by removing duplicate entries. Yet it is desirable to maintain the order
|
|
of the original commands.
|
|
|
|
This simple program does the job. It uses two arrays. The @code{data}
|
|
array is indexed by the text of each line.
|
|
For each line, @code{data[$0]} is incremented.
|
|
|
|
If a particular line has not
|
|
been seen before, then @code{data[$0]} will be zero.
|
|
In that case, the text of the line is stored in @code{lines[count]}.
|
|
Each element of @code{lines} is a unique command, and the indices of
|
|
@code{lines} indicate the order in which those lines were encountered.
|
|
The @code{END} rule simply prints out the lines, in order.
|
|
|
|
@cindex Rakitzis, Byron
|
|
@findex histsort.awk
|
|
@example
|
|
@group
|
|
@c file eg/prog/histsort.awk
|
|
# histsort.awk --- compact a shell history file
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
# Thanks to Byron Rakitzis for the general idea
|
|
@{
|
|
if (data[$0]++ == 0)
|
|
lines[++count] = $0
|
|
@}
|
|
|
|
END @{
|
|
for (i = 1; i <= count; i++)
|
|
print lines[i]
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
This program also provides a foundation for generating other useful
|
|
information. For example, using the following @code{print} satement in the
|
|
@code{END} rule would indicate how often a particular command was used.
|
|
|
|
@example
|
|
print data[lines[i]], lines[i]
|
|
@end example
|
|
|
|
This works because @code{data[$0]} was incremented each time a line was
|
|
seen.
|
|
|
|
@node Extract Program, Simple Sed, History Sorting, Miscellaneous Programs
|
|
@subsection Extracting Programs from Texinfo Source Files
|
|
|
|
@iftex
|
|
Both this chapter and the previous chapter
|
|
(@ref{Library Functions, ,A Library of @code{awk} Functions}),
|
|
present a large number of @code{awk} programs.
|
|
@end iftex
|
|
@ifinfo
|
|
The nodes
|
|
@ref{Library Functions, ,A Library of @code{awk} Functions},
|
|
and @ref{Sample Programs, ,Practical @code{awk} Programs},
|
|
are the top level nodes for a large number of @code{awk} programs.
|
|
@end ifinfo
|
|
If you wish to experiment with these programs, it is tedious to have to type
|
|
them in by hand. Here we present a program that can extract parts of a
|
|
Texinfo input file into separate files.
|
|
|
|
This @value{DOCUMENT} is written in Texinfo, the GNU project's document
|
|
formatting language. A single Texinfo source file can be used to produce both
|
|
printed and on-line documentation.
|
|
@iftex
|
|
Texinfo is fully documented in @cite{Texinfo---The GNU Documentation Format},
|
|
available from the Free Software Foundation.
|
|
@end iftex
|
|
@ifinfo
|
|
The Texinfo language is described fully, starting with
|
|
@ref{Top, , Introduction, texi, Texinfo---The GNU Documentation Format}.
|
|
@end ifinfo
|
|
|
|
For our purposes, it is enough to know three things about Texinfo input
|
|
files.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The ``at'' symbol, @samp{@@}, is special in Texinfo, much like @samp{\} in C
|
|
or @code{awk}. Literal @samp{@@} symbols are represented in Texinfo source
|
|
files as @samp{@@@@}.
|
|
|
|
@item
|
|
Comments start with either @samp{@@c} or @samp{@@comment}.
|
|
The file extraction program will work by using special comments that start
|
|
at the beginning of a line.
|
|
|
|
@item
|
|
Example text that should not be split across a page boundary is bracketed
|
|
between lines containing @samp{@@group} and @samp{@@end group} commands.
|
|
@end itemize
|
|
|
|
The following program, @file{extract.awk}, reads through a Texinfo source
|
|
file, and does two things, based on the special comments.
|
|
Upon seeing @samp{@w{@@c system @dots{}}},
|
|
it runs a command, by extracting the command text from the
|
|
control line and passing it on to the @code{system} function
|
|
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
|
|
Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to
|
|
the file @var{filename}, until @samp{@@c endfile} is encountered.
|
|
The rules in @file{extract.awk} will match either @samp{@@c} or
|
|
@samp{@@comment} by letting the @samp{omment} part be optional.
|
|
Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
|
|
@file{extract.awk} uses the @code{join} library function
|
|
(@pxref{Join Function, ,Merging an Array Into a String}).
|
|
|
|
The example programs in the on-line Texinfo source for @cite{@value{TITLE}}
|
|
(@file{gawk.texi}) have all been bracketed inside @samp{file},
|
|
and @samp{endfile} lines. The @code{gawk} distribution uses a copy of
|
|
@file{extract.awk} to extract the sample
|
|
programs and install many of them in a standard directory, where
|
|
@code{gawk} can find them.
|
|
The Texinfo file looks something like this:
|
|
|
|
@example
|
|
@dots{}
|
|
This program has a @@code@{BEGIN@} block,
|
|
which prints a nice message:
|
|
|
|
@@example
|
|
@@c file examples/messages.awk
|
|
BEGIN @@@{ print "Don't panic!" @@@}
|
|
@@c end file
|
|
@@end example
|
|
|
|
It also prints some final advice:
|
|
|
|
@@example
|
|
@@c file examples/messages.awk
|
|
END @@@{ print "Always avoid bored archeologists!" @@@}
|
|
@@c end file
|
|
@@end example
|
|
@dots{}
|
|
@end example
|
|
|
|
@file{extract.awk} begins by setting @code{IGNORECASE} to one, so that
|
|
mixed upper-case and lower-case letters in the directives won't matter.
|
|
|
|
The first rule handles calling @code{system}, checking that a command was
|
|
given (@code{NF} is at least three), and also checking that the command
|
|
exited with a zero exit status, signifying OK.
|
|
|
|
@findex extract.awk
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/extract.awk
|
|
# extract.awk --- extract files and run programs
|
|
# from texinfo files
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
BEGIN @{ IGNORECASE = 1 @}
|
|
|
|
@group
|
|
/^@@c(omment)?[ \t]+system/ \
|
|
@{
|
|
if (NF < 3) @{
|
|
e = (FILENAME ":" FNR)
|
|
e = (e ": badly formed `system' line")
|
|
print e > "/dev/stderr"
|
|
next
|
|
@}
|
|
$1 = ""
|
|
$2 = ""
|
|
stat = system($0)
|
|
if (stat != 0) @{
|
|
e = (FILENAME ":" FNR)
|
|
e = (e ": warning: system returned " stat)
|
|
print e > "/dev/stderr"
|
|
@}
|
|
@}
|
|
@end group
|
|
@c endfile
|
|
@end example
|
|
|
|
@noindent
|
|
The variable @code{e} is used so that the function
|
|
fits nicely on the
|
|
@iftex
|
|
page.
|
|
@end iftex
|
|
@ifinfo
|
|
screen.
|
|
@end ifinfo
|
|
|
|
The second rule handles moving data into files. It verifies that a file
|
|
name was given in the directive. If the file named is not the current file,
|
|
then the current file is closed. This means that an @samp{@@c endfile} was
|
|
not given for that file. (We should probably print a diagnostic in this
|
|
case, although at the moment we do not.)
|
|
|
|
The @samp{for} loop does the work. It reads lines using @code{getline}
|
|
(@pxref{Getline, ,Explicit Input with @code{getline}}).
|
|
For an unexpected end of file, it calls the @code{@w{unexpected_eof}}
|
|
function. If the line is an ``endfile'' line, then it breaks out of
|
|
the loop.
|
|
If the line is an @samp{@@group} or @samp{@@end group} line, then it
|
|
ignores it, and goes on to the next line.
|
|
(These Texinfo control lines keep blocks of code together on one page;
|
|
unfortunately, @TeX{} isn't always smart enough to do things exactly right,
|
|
and we have to give it some advice.)
|
|
|
|
Most of the work is in the following few lines. If the line has no @samp{@@}
|
|
symbols, it can be printed directly. Otherwise, each leading @samp{@@} must be
|
|
stripped off.
|
|
|
|
To remove the @samp{@@} symbols, the line is split into separate elements of
|
|
the array @code{a}, using the @code{split} function
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
Each element of @code{a} that is empty indicates two successive @samp{@@}
|
|
symbols in the original line. For each two empty elements (@samp{@@@@} in
|
|
the original file), we have to add back in a single @samp{@@} symbol.
|
|
|
|
When the processing of the array is finished, @code{join} is called with the
|
|
value of @code{SUBSEP}, to rejoin the pieces back into a single
|
|
line. That line is then printed to the output file.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/extract.awk
|
|
@group
|
|
/^@@c(omment)?[ \t]+file/ \
|
|
@{
|
|
if (NF != 3) @{
|
|
e = (FILENAME ":" FNR ": badly formed `file' line")
|
|
print e > "/dev/stderr"
|
|
next
|
|
@}
|
|
@end group
|
|
if ($3 != curfile) @{
|
|
if (curfile != "")
|
|
close(curfile)
|
|
curfile = $3
|
|
@}
|
|
|
|
for (;;) @{
|
|
if ((getline line) <= 0)
|
|
unexpected_eof()
|
|
if (line ~ /^@@c(omment)?[ \t]+endfile/)
|
|
break
|
|
else if (line ~ /^@@(end[ \t]+)?group/)
|
|
continue
|
|
if (index(line, "@@") == 0) @{
|
|
print line > curfile
|
|
continue
|
|
@}
|
|
n = split(line, a, "@@")
|
|
@group
|
|
# if a[1] == "", means leading @@,
|
|
# don't add one back in.
|
|
@end group
|
|
for (i = 2; i <= n; i++) @{
|
|
if (a[i] == "") @{ # was an @@@@
|
|
a[i] = "@@"
|
|
if (a[i+1] == "")
|
|
i++
|
|
@}
|
|
@}
|
|
print join(a, 1, n, SUBSEP) > curfile
|
|
@}
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
An important thing to note is the use of the @samp{>} redirection.
|
|
Output done with @samp{>} only opens the file once; it stays open and
|
|
subsequent output is appended to the file
|
|
(@pxref{Redirection, , Redirecting Output of @code{print} and @code{printf}}).
|
|
This allows us to easily mix program text and explanatory prose for the same
|
|
sample source file (as has been done here!) without any hassle. The file is
|
|
only closed when a new data file name is encountered, or at the end of the
|
|
input file.
|
|
|
|
Finally, the function @code{@w{unexpected_eof}} prints an appropriate
|
|
error message and then exits.
|
|
|
|
The @code{END} rule handles the final cleanup, closing the open file.
|
|
|
|
@example
|
|
@c file eg/prog/extract.awk
|
|
@group
|
|
function unexpected_eof()
|
|
@{
|
|
printf("%s:%d: unexpected EOF or error\n", \
|
|
FILENAME, FNR) > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@end group
|
|
|
|
END @{
|
|
if (curfile)
|
|
close(curfile)
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@node Simple Sed, Igawk Program, Extract Program, Miscellaneous Programs
|
|
@subsection A Simple Stream Editor
|
|
|
|
@cindex @code{sed} utility
|
|
The @code{sed} utility is a ``stream editor,'' a program that reads a
|
|
stream of data, makes changes to it, and passes the modified data on.
|
|
It is often used to make global changes to a large file, or to a stream
|
|
of data generated by a pipeline of commands.
|
|
|
|
While @code{sed} is a complicated program in its own right, its most common
|
|
use is to perform global substitutions in the middle of a pipeline:
|
|
|
|
@example
|
|
command1 < orig.data | sed 's/old/new/g' | command2 > result
|
|
@end example
|
|
|
|
Here, the @samp{s/old/new/g} tells @code{sed} to look for the regexp
|
|
@samp{old} on each input line, and replace it with the text @samp{new},
|
|
globally (i.e.@: all the occurrences on a line). This is similar to
|
|
@code{awk}'s @code{gsub} function
|
|
(@pxref{String Functions, , Built-in Functions for String Manipulation}).
|
|
|
|
The following program, @file{awksed.awk}, accepts at least two command line
|
|
arguments; the pattern to look for and the text to replace it with. Any
|
|
additional arguments are treated as data file names to process. If none
|
|
are provided, the standard input is used.
|
|
|
|
@cindex Brennan, Michael
|
|
@cindex @code{awksed}
|
|
@cindex simple stream editor
|
|
@cindex stream editor, simple
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/awksed.awk
|
|
# awksed.awk --- do s/foo/bar/g using just print
|
|
# Thanks to Michael Brennan for the idea
|
|
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# August 1995
|
|
|
|
@group
|
|
function usage()
|
|
@{
|
|
print "usage: awksed pat repl [files...]" > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@end group
|
|
|
|
BEGIN @{
|
|
# validate arguments
|
|
if (ARGC < 3)
|
|
usage()
|
|
|
|
RS = ARGV[1]
|
|
ORS = ARGV[2]
|
|
|
|
# don't use arguments as files
|
|
ARGV[1] = ARGV[2] = ""
|
|
@}
|
|
|
|
# look ma, no hands!
|
|
@{
|
|
if (RT == "")
|
|
printf "%s", $0
|
|
else
|
|
print
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The program relies on @code{gawk}'s ability to have @code{RS} be a regexp
|
|
and on the setting of @code{RT} to the actual text that terminated the
|
|
record (@pxref{Records, ,How Input is Split into Records}).
|
|
|
|
The idea is to have @code{RS} be the pattern to look for. @code{gawk}
|
|
will automatically set @code{$0} to the text between matches of the pattern.
|
|
This is text that we wish to keep, unmodified. Then, by setting @code{ORS}
|
|
to the replacement text, a simple @code{print} statement will output the
|
|
text we wish to keep, followed by the replacement text.
|
|
|
|
There is one wrinkle to this scheme, which is what to do if the last record
|
|
doesn't end with text that matches @code{RS}? Using a @code{print}
|
|
statement unconditionally prints the replacement text, which is not correct.
|
|
|
|
However, if the file did not end in text that matches @code{RS}, @code{RT}
|
|
will be set to the null string. In this case, we can print @code{$0} using
|
|
@code{printf}
|
|
(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
|
|
|
|
The @code{BEGIN} rule handles the setup, checking for the right number
|
|
of arguments, and calling @code{usage} if there is a problem. Then it sets
|
|
@code{RS} and @code{ORS} from the command line arguments, and sets
|
|
@code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they will
|
|
not be treated as file names
|
|
(@pxref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}).
|
|
|
|
The @code{usage} function prints an error message and exits.
|
|
|
|
Finally, the single rule handles the printing scheme outlined above,
|
|
using @code{print} or @code{printf} as appropriate, depending upon the
|
|
value of @code{RT}.
|
|
|
|
@ignore
|
|
Exercise, compare the performance of this version with the more
|
|
straightforward:
|
|
|
|
BEGIN {
|
|
pat = ARGV[1]
|
|
repl = ARGV[2]
|
|
ARGV[1] = ARGV[2] = ""
|
|
}
|
|
|
|
{ gsub(pat, repl); print }
|
|
|
|
Exercise: what are the advantages and disadvantages of this version vs. sed?
|
|
Advantage: egrep regexps
|
|
speed (?)
|
|
Disadvantage: no & in replacement text
|
|
|
|
Others?
|
|
@end ignore
|
|
|
|
@node Igawk Program, , Simple Sed, Miscellaneous Programs
|
|
@subsection An Easy Way to Use Library Functions
|
|
|
|
Using library functions in @code{awk} can be very beneficial. It
|
|
encourages code re-use and the writing of general functions. Programs are
|
|
smaller, and therefore clearer.
|
|
However, using library functions is only easy when writing @code{awk}
|
|
programs; it is painful when running them, requiring multiple @samp{-f}
|
|
options. If @code{gawk} is unavailable, then so too is the @code{AWKPATH}
|
|
environment variable and the ability to put @code{awk} functions into a
|
|
library directory (@pxref{Options, ,Command Line Options}).
|
|
|
|
It would be nice to be able to write programs like so:
|
|
|
|
@example
|
|
# library functions
|
|
@@include getopt.awk
|
|
@@include join.awk
|
|
@dots{}
|
|
|
|
# main program
|
|
BEGIN @{
|
|
while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
|
|
@dots{}
|
|
@dots{}
|
|
@}
|
|
@end example
|
|
|
|
The following program, @file{igawk.sh}, provides this service.
|
|
It simulates @code{gawk}'s searching of the @code{AWKPATH} variable,
|
|
and also allows @dfn{nested} includes; i.e.@: a file that has been included
|
|
with @samp{@@include} can contain further @samp{@@include} statements.
|
|
@code{igawk} will make an effort to only include files once, so that nested
|
|
includes don't accidentally include a library function twice.
|
|
|
|
@code{igawk} should behave externally just like @code{gawk}. This means it
|
|
should accept all of @code{gawk}'s command line arguments, including the
|
|
ability to have multiple source files specified via @samp{-f}, and the
|
|
ability to mix command line and library source files.
|
|
|
|
The program is written using the POSIX Shell (@code{sh}) command language.
|
|
The way the program works is as follows:
|
|
|
|
@enumerate
|
|
@item
|
|
Loop through the arguments, saving anything that doesn't represent
|
|
@code{awk} source code for later, when the expanded program is run.
|
|
|
|
@item
|
|
For any arguments that do represent @code{awk} text, put the arguments into
|
|
a temporary file that will be expanded. There are two cases.
|
|
|
|
@enumerate a
|
|
@item
|
|
Literal text, provided with @samp{--source} or @samp{--source=}. This
|
|
text is just echoed directly. The @code{echo} program will automatically
|
|
supply a trailing newline.
|
|
|
|
@item
|
|
File names provided with @samp{-f}. We use a neat trick, and echo
|
|
@samp{@@include @var{filename}} into the temporary file. Since the file
|
|
inclusion program will work the way @code{gawk} does, this will get the text
|
|
of the file included into the program at the correct point.
|
|
@end enumerate
|
|
|
|
@item
|
|
Run an @code{awk} program (naturally) over the temporary file to expand
|
|
@samp{@@include} statements. The expanded program is placed in a second
|
|
temporary file.
|
|
|
|
@item
|
|
Run the expanded program with @code{gawk} and any other original command line
|
|
arguments that the user supplied (such as the data file names).
|
|
@end enumerate
|
|
|
|
The initial part of the program turns on shell tracing if the first
|
|
argument was @samp{debug}. Otherwise, a shell @code{trap} statement
|
|
arranges to clean up any temporary files on program exit or upon an
|
|
interrupt.
|
|
|
|
@c 2e: For the temp file handling, go with Darrel's ig=${TMP:-/tmp}/igs.$$
|
|
@c 2e: or something as similar as possible.
|
|
|
|
The next part loops through all the command line arguments.
|
|
There are several cases of interest.
|
|
|
|
@table @code
|
|
@item --
|
|
This ends the arguments to @code{igawk}. Anything else should be passed on
|
|
to the user's @code{awk} program without being evaluated.
|
|
|
|
@item -W
|
|
This indicates that the next option is specific to @code{gawk}. To make
|
|
argument processing easier, the @samp{-W} is appended to the front of the
|
|
remaining arguments and the loop continues. (This is an @code{sh}
|
|
programming trick. Don't worry about it if you are not familiar with
|
|
@code{sh}.)
|
|
|
|
@item -v
|
|
@itemx -F
|
|
These are saved and passed on to @code{gawk}.
|
|
|
|
@item -f
|
|
@itemx --file
|
|
@itemx --file=
|
|
@itemx -Wfile=
|
|
The file name is saved to the temporary file @file{/tmp/ig.s.$$} with an
|
|
@samp{@@include} statement.
|
|
The @code{sed} utility is used to remove the leading option part of the
|
|
argument (e.g., @samp{--file=}).
|
|
|
|
@item --source
|
|
@itemx --source=
|
|
@itemx -Wsource=
|
|
The source text is echoed into @file{/tmp/ig.s.$$}.
|
|
|
|
@item --version
|
|
@itemx --version
|
|
@itemx -Wversion
|
|
@code{igawk} prints its version number, and runs @samp{gawk --version}
|
|
to get the @code{gawk} version information, and then exits.
|
|
@end table
|
|
|
|
If none of @samp{-f}, @samp{--file}, @samp{-Wfile}, @samp{--source},
|
|
or @samp{-Wsource}, were supplied, then the first non-option argument
|
|
should be the @code{awk} program. If there are no command line
|
|
arguments left, @code{igawk} prints an error message and exits.
|
|
Otherwise, the first argument is echoed into @file{/tmp/ig.s.$$}.
|
|
|
|
In any case, after the arguments have been processed,
|
|
@file{/tmp/ig.s.$$} contains the complete text of the original @code{awk}
|
|
program.
|
|
|
|
The @samp{$$} in @code{sh} represents the current process ID number.
|
|
It is often used in shell programs to generate unique temporary file
|
|
names. This allows multiple users to run @code{igawk} without worrying
|
|
that the temporary file names will clash.
|
|
|
|
@cindex @code{sed} utility
|
|
Here's the program:
|
|
|
|
@findex igawk.sh
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/igawk.sh
|
|
#! /bin/sh
|
|
|
|
# igawk --- like gawk but do @@include processing
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# July 1993
|
|
|
|
if [ "$1" = debug ]
|
|
then
|
|
set -x
|
|
shift
|
|
else
|
|
# cleanup on exit, hangup, interrupt, quit, termination
|
|
trap 'rm -f /tmp/ig.[se].$$' 0 1 2 3 15
|
|
fi
|
|
|
|
while [ $# -ne 0 ] # loop over arguments
|
|
do
|
|
case $1 in
|
|
--) shift; break;;
|
|
|
|
-W) shift
|
|
set -- -W"$@@"
|
|
continue;;
|
|
|
|
-[vF]) opts="$opts $1 '$2'"
|
|
shift;;
|
|
|
|
-[vF]*) opts="$opts '$1'" ;;
|
|
|
|
-f) echo @@include "$2" >> /tmp/ig.s.$$
|
|
shift;;
|
|
|
|
@group
|
|
-f*) f=`echo "$1" | sed 's/-f//'`
|
|
echo @@include "$f" >> /tmp/ig.s.$$ ;;
|
|
@end group
|
|
|
|
-?file=*) # -Wfile or --file
|
|
f=`echo "$1" | sed 's/-.file=//'`
|
|
echo @@include "$f" >> /tmp/ig.s.$$ ;;
|
|
|
|
-?file) # get arg, $2
|
|
echo @@include "$2" >> /tmp/ig.s.$$
|
|
shift;;
|
|
|
|
-?source=*) # -Wsource or --source
|
|
t=`echo "$1" | sed 's/-.source=//'`
|
|
echo "$t" >> /tmp/ig.s.$$ ;;
|
|
|
|
-?source) # get arg, $2
|
|
echo "$2" >> /tmp/ig.s.$$
|
|
shift;;
|
|
|
|
-?version)
|
|
echo igawk: version 1.0 1>&2
|
|
gawk --version
|
|
exit 0 ;;
|
|
|
|
-[W-]*) opts="$opts '$1'" ;;
|
|
|
|
*) break;;
|
|
esac
|
|
shift
|
|
done
|
|
|
|
if [ ! -s /tmp/ig.s.$$ ]
|
|
then
|
|
if [ -z "$1" ]
|
|
then
|
|
echo igawk: no program! 1>&2
|
|
exit 1
|
|
else
|
|
echo "$1" > /tmp/ig.s.$$
|
|
shift
|
|
fi
|
|
fi
|
|
|
|
# at this point, /tmp/ig.s.$$ has the program
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The @code{awk} program to process @samp{@@include} directives reads through
|
|
the program, one line at a time using @code{getline}
|
|
(@pxref{Getline, ,Explicit Input with @code{getline}}).
|
|
The input file names and @samp{@@include} statements are managed using a
|
|
stack. As each @samp{@@include} is encountered, the current file name is
|
|
``pushed'' onto the stack, and the file named in the @samp{@@include}
|
|
directive becomes
|
|
the current file name. As each file is finished, the stack is ``popped,''
|
|
and the previous input file becomes the current input file again.
|
|
The process is started by making the original file the first one on the
|
|
stack.
|
|
|
|
The @code{pathto} function does the work of finding the full path to a
|
|
file. It simulates @code{gawk}'s behavior when searching the @code{AWKPATH}
|
|
environment variable
|
|
(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
|
|
If a file name has a @samp{/} in it, no path search
|
|
is done. Otherwise, the file name is concatenated with the name of each
|
|
directory in the path, and an attempt is made to open the generated file
|
|
name. The only way in @code{awk} to test if a file can be read is to go
|
|
ahead and try to read it with @code{getline}; that is what @code{pathto}
|
|
does.@footnote{On some very old versions of @code{awk}, the test
|
|
@samp{getline junk < t} can loop forever if the file exists but is empty.
|
|
Caveat Emptor.}
|
|
If the file can be read, it is closed, and the file name is
|
|
returned.
|
|
@ignore
|
|
An alternative way to test for the file's existence would be to call
|
|
@samp{system("test -r " t)}, which uses the @code{test} utility to
|
|
see if the file exists and is readable. The disadvantage to this method
|
|
is that it requires creating an extra process, and can thus be slightly
|
|
slower.
|
|
@end ignore
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/igawk.sh
|
|
gawk -- '
|
|
# process @@include directives
|
|
|
|
function pathto(file, i, t, junk)
|
|
@{
|
|
if (index(file, "/") != 0)
|
|
return file
|
|
|
|
for (i = 1; i <= ndirs; i++) @{
|
|
t = (pathlist[i] "/" file)
|
|
if ((getline junk < t) > 0) @{
|
|
# found it
|
|
close(t)
|
|
return t
|
|
@}
|
|
@}
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The main program is contained inside one @code{BEGIN} rule. The first thing it
|
|
does is set up the @code{pathlist} array that @code{pathto} uses. After
|
|
splitting the path on @samp{:}, null elements are replaced with @code{"."},
|
|
which represents the current directory.
|
|
|
|
@example
|
|
@group
|
|
@c file eg/prog/igawk.sh
|
|
BEGIN @{
|
|
path = ENVIRON["AWKPATH"]
|
|
ndirs = split(path, pathlist, ":")
|
|
for (i = 1; i <= ndirs; i++) @{
|
|
if (pathlist[i] == "")
|
|
pathlist[i] = "."
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
The stack is initialized with @code{ARGV[1]}, which will be @file{/tmp/ig.s.$$}.
|
|
The main loop comes next. Input lines are read in succession. Lines that
|
|
do not start with @samp{@@include} are printed verbatim.
|
|
|
|
If the line does start with @samp{@@include}, the file name is in @code{$2}.
|
|
@code{pathto} is called to generate the full path. If it could not, then we
|
|
print an error message and continue.
|
|
|
|
The next thing to check is if the file has been included already. The
|
|
@code{processed} array is indexed by the full file name of each included
|
|
file, and it tracks this information for us. If the file has been
|
|
seen, a warning message is printed. Otherwise, the new file name is
|
|
pushed onto the stack and processing continues.
|
|
|
|
Finally, when @code{getline} encounters the end of the input file, the file
|
|
is closed and the stack is popped. When @code{stackptr} is less than zero,
|
|
the program is done.
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/igawk.sh
|
|
stackptr = 0
|
|
input[stackptr] = ARGV[1] # ARGV[1] is first file
|
|
|
|
for (; stackptr >= 0; stackptr--) @{
|
|
while ((getline < input[stackptr]) > 0) @{
|
|
if (tolower($1) != "@@include") @{
|
|
print
|
|
continue
|
|
@}
|
|
fpath = pathto($2)
|
|
if (fpath == "") @{
|
|
printf("igawk:%s:%d: cannot find %s\n", \
|
|
input[stackptr], FNR, $2) > "/dev/stderr"
|
|
continue
|
|
@}
|
|
@group
|
|
if (! (fpath in processed)) @{
|
|
processed[fpath] = input[stackptr]
|
|
input[++stackptr] = fpath
|
|
@} else
|
|
print $2, "included in", input[stackptr], \
|
|
"already included in", \
|
|
processed[fpath] > "/dev/stderr"
|
|
@}
|
|
@end group
|
|
@group
|
|
close(input[stackptr])
|
|
@}
|
|
@}' /tmp/ig.s.$$ > /tmp/ig.e.$$
|
|
@end group
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
The last step is to call @code{gawk} with the expanded program and the original
|
|
options and command line arguments that the user supplied. @code{gawk}'s
|
|
exit status is passed back on to @code{igawk}'s calling program.
|
|
|
|
@c this causes more problems than it solves, so leave it out.
|
|
@ignore
|
|
The special file @file{/dev/null} is passed as a data file to @code{gawk}
|
|
to handle an interesting case. Suppose that the user's program only has
|
|
a @code{BEGIN} rule, and there are no data files to read. The program should exit without reading any data
|
|
files. However, suppose that an included library file defines an @code{END}
|
|
rule of its own. In this case, @code{gawk} will hang, reading standard
|
|
input. In order to avoid this, @file{/dev/null} is explicitly to the
|
|
command line. Reading from @file{/dev/null} always returns an immediate
|
|
end of file indication.
|
|
|
|
@c Hmm. Add /dev/null if $# is 0? Still messes up ARGV. Sigh.
|
|
@end ignore
|
|
|
|
@example
|
|
@c @group
|
|
@c file eg/prog/igawk.sh
|
|
eval gawk -f /tmp/ig.e.$$ $opts -- "$@@"
|
|
|
|
exit $?
|
|
@c endfile
|
|
@c @end group
|
|
@end example
|
|
|
|
This version of @code{igawk} represents my third attempt at this program.
|
|
There are three key simplifications that made the program work better.
|
|
|
|
@enumerate
|
|
@item
|
|
Using @samp{@@include} even for the files named with @samp{-f} makes building
|
|
the initial collected @code{awk} program much simpler; all the
|
|
@samp{@@include} processing can be done once.
|
|
|
|
@item
|
|
The @code{pathto} function doesn't try to save the line read with
|
|
@code{getline} when testing for the file's accessibility. Trying to save
|
|
this line for use with the main program complicates things considerably.
|
|
@c what problem does this engender though - exercise
|
|
@c answer, reading from "-" or /dev/stdin
|
|
|
|
@item
|
|
Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
|
|
place. It is not necessary to call out to a separate loop for processing
|
|
nested @samp{@@include} statements.
|
|
@end enumerate
|
|
|
|
Also, this program illustrates that it is often worthwhile to combine
|
|
@code{sh} and @code{awk} programming together. You can usually accomplish
|
|
quite a lot, without having to resort to low-level programming in C or C++, and it
|
|
is frequently easier to do certain kinds of string and argument manipulation
|
|
using the shell than it is in @code{awk}.
|
|
|
|
Finally, @code{igawk} shows that it is not always necessary to add new
|
|
features to a program; they can often be layered on top. With @code{igawk},
|
|
there is no real reason to build @samp{@@include} processing into
|
|
@code{gawk} itself.
|
|
|
|
As an additional example of this, consider the idea of having two
|
|
files in a directory in the search path.
|
|
|
|
@table @file
|
|
@item default.awk
|
|
This file would contain a set of default library functions, such
|
|
as @code{getopt} and @code{assert}.
|
|
|
|
@item site.awk
|
|
This file would contain library functions that are specific to a site or
|
|
installation, i.e.@: locally developed functions.
|
|
Having a separate file allows @file{default.awk} to change with
|
|
new @code{gawk} releases, without requiring the system administrator to
|
|
update it each time by adding the local functions.
|
|
@end table
|
|
|
|
One user
|
|
@c Karl Berry, karl@ileaf.com, 10/95
|
|
suggested that @code{gawk} be modified to automatically read these files
|
|
upon startup. Instead, it would be very simple to modify @code{igawk}
|
|
to do this. Since @code{igawk} can process nested @samp{@@include}
|
|
directives, @file{default.awk} could simply contain @samp{@@include}
|
|
statements for the desired library functions.
|
|
|
|
@c Exercise: make this change
|
|
|
|
@node Language History, Gawk Summary, Sample Programs, Top
|
|
@chapter The Evolution of the @code{awk} Language
|
|
|
|
This @value{DOCUMENT} describes the GNU implementation of @code{awk}, which follows
|
|
the POSIX specification. Many @code{awk} users are only familiar
|
|
with the original @code{awk} implementation in Version 7 Unix.
|
|
(This implementation was the basis for @code{awk} in Berkeley Unix,
|
|
through 4.3--Reno. The 4.4 release of Berkeley Unix uses @code{gawk} 2.15.2
|
|
for its version of @code{awk}.) This chapter briefly describes the
|
|
evolution of the @code{awk} language, with cross references to other parts
|
|
of the @value{DOCUMENT} where you can find more information.
|
|
|
|
@menu
|
|
* V7/SVR3.1:: The major changes between V7 and System V
|
|
Release 3.1.
|
|
* SVR4:: Minor changes between System V Releases 3.1
|
|
and 4.
|
|
* POSIX:: New features from the POSIX standard.
|
|
* BTL:: New features from the Bell Laboratories
|
|
version of @code{awk}.
|
|
* POSIX/GNU:: The extensions in @code{gawk} not in POSIX
|
|
@code{awk}.
|
|
@end menu
|
|
|
|
@node V7/SVR3.1, SVR4, Language History, Language History
|
|
@section Major Changes between V7 and SVR3.1
|
|
|
|
The @code{awk} language evolved considerably between the release of
|
|
Version 7 Unix (1978) and the new version first made generally available in
|
|
System V Release 3.1 (1987). This section summarizes the changes, with
|
|
cross-references to further details.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The requirement for @samp{;} to separate rules on a line
|
|
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
|
|
|
|
@item
|
|
User-defined functions, and the @code{return} statement
|
|
(@pxref{User-defined, ,User-defined Functions}).
|
|
|
|
@item
|
|
The @code{delete} statement (@pxref{Delete, ,The @code{delete} Statement}).
|
|
|
|
@item
|
|
The @code{do}-@code{while} statement
|
|
(@pxref{Do Statement, ,The @code{do}-@code{while} Statement}).
|
|
|
|
@item
|
|
The built-in functions @code{atan2}, @code{cos}, @code{sin}, @code{rand} and
|
|
@code{srand} (@pxref{Numeric Functions, ,Numeric Built-in Functions}).
|
|
|
|
@item
|
|
The built-in functions @code{gsub}, @code{sub}, and @code{match}
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
|
|
@item
|
|
The built-in functions @code{close}, and @code{system}
|
|
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
|
|
|
|
@item
|
|
The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART},
|
|
and @code{SUBSEP} built-in variables (@pxref{Built-in Variables}).
|
|
|
|
@item
|
|
The conditional expression using the ternary operator @samp{?:}
|
|
(@pxref{Conditional Exp, ,Conditional Expressions}).
|
|
|
|
@item
|
|
The exponentiation operator @samp{^}
|
|
(@pxref{Arithmetic Ops, ,Arithmetic Operators}) and its assignment operator
|
|
form @samp{^=} (@pxref{Assignment Ops, ,Assignment Expressions}).
|
|
|
|
@item
|
|
C-compatible operator precedence, which breaks some old @code{awk}
|
|
programs (@pxref{Precedence, ,Operator Precedence (How Operators Nest)}).
|
|
|
|
@item
|
|
Regexps as the value of @code{FS}
|
|
(@pxref{Field Separators, ,Specifying How Fields are Separated}), and as the
|
|
third argument to the @code{split} function
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
|
|
@item
|
|
Dynamic regexps as operands of the @samp{~} and @samp{!~} operators
|
|
(@pxref{Regexp Usage, ,How to Use Regular Expressions}).
|
|
|
|
@item
|
|
The escape sequences @samp{\b}, @samp{\f}, and @samp{\r}
|
|
(@pxref{Escape Sequences}).
|
|
(Some vendors have updated their old versions of @code{awk} to
|
|
recognize @samp{\r}, @samp{\b}, and @samp{\f}, but this is not
|
|
something you can rely on.)
|
|
|
|
@item
|
|
Redirection of input for the @code{getline} function
|
|
(@pxref{Getline, ,Explicit Input with @code{getline}}).
|
|
|
|
@item
|
|
Multiple @code{BEGIN} and @code{END} rules
|
|
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
|
|
|
|
@item
|
|
Multi-dimensional arrays
|
|
(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
|
|
@end itemize
|
|
|
|
@node SVR4, POSIX, V7/SVR3.1, Language History
|
|
@section Changes between SVR3.1 and SVR4
|
|
|
|
@cindex @code{awk} language, V.4 version
|
|
The System V Release 4 version of Unix @code{awk} added these features
|
|
(some of which originated in @code{gawk}):
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{ENVIRON} variable (@pxref{Built-in Variables}).
|
|
|
|
@item
|
|
Multiple @samp{-f} options on the command line
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
@item
|
|
The @samp{-v} option for assigning variables before program execution begins
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
@item
|
|
The @samp{--} option for terminating command line options.
|
|
|
|
@item
|
|
The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences
|
|
(@pxref{Escape Sequences}).
|
|
|
|
@item
|
|
A defined return value for the @code{srand} built-in function
|
|
(@pxref{Numeric Functions, ,Numeric Built-in Functions}).
|
|
|
|
@item
|
|
The @code{toupper} and @code{tolower} built-in string functions
|
|
for case translation
|
|
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
|
|
|
|
@item
|
|
A cleaner specification for the @samp{%c} format-control letter in the
|
|
@code{printf} function
|
|
(@pxref{Control Letters, ,Format-Control Letters}).
|
|
|
|
@item
|
|
The ability to dynamically pass the field width and precision (@code{"%*.*d"})
|
|
in the argument list of the @code{printf} function
|
|
(@pxref{Control Letters, ,Format-Control Letters}).
|
|
|
|
@item
|
|
The use of regexp constants such as @code{/foo/} as expressions, where
|
|
they are equivalent to using the matching operator, as in @samp{$0 ~ /foo/}
|
|
(@pxref{Using Constant Regexps, ,Using Regular Expression Constants}).
|
|
@end itemize
|
|
|
|
@node POSIX, BTL, SVR4, Language History
|
|
@section Changes between SVR4 and POSIX @code{awk}
|
|
|
|
The POSIX Command Language and Utilities standard for @code{awk}
|
|
introduced the following changes into the language:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The use of @samp{-W} for implementation-specific options.
|
|
|
|
@item
|
|
The use of @code{CONVFMT} for controlling the conversion of numbers
|
|
to strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).
|
|
|
|
@item
|
|
The concept of a numeric string, and tighter comparison rules to go
|
|
with it (@pxref{Typing and Comparison, ,Variable Typing and Comparison Expressions}).
|
|
|
|
@item
|
|
More complete documentation of many of the previously undocumented
|
|
features of the language.
|
|
@end itemize
|
|
|
|
The following common extensions are not permitted by the POSIX
|
|
standard:
|
|
|
|
@c IMPORTANT! Keep this list in sync with the one in node Options
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@code{\x} escape sequences are not recognized
|
|
(@pxref{Escape Sequences}).
|
|
|
|
@item
|
|
Newlines do not act as whitespace to separate fields when @code{FS} is
|
|
equal to a single space.
|
|
|
|
@item
|
|
The synonym @code{func} for the keyword @code{function} is not
|
|
recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).
|
|
|
|
@item
|
|
The operators @samp{**} and @samp{**=} cannot be used in
|
|
place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
|
|
and also @pxref{Assignment Ops, ,Assignment Expressions}).
|
|
|
|
@item
|
|
Specifying @samp{-Ft} on the command line does not set the value
|
|
of @code{FS} to be a single tab character
|
|
(@pxref{Field Separators, ,Specifying How Fields are Separated}).
|
|
|
|
@item
|
|
The @code{fflush} built-in function is not supported
|
|
(@pxref{I/O Functions, , Built-in Functions for Input/Output}).
|
|
@end itemize
|
|
|
|
@node BTL, POSIX/GNU, POSIX, Language History
|
|
@section Extensions in the Bell Laboratories @code{awk}
|
|
|
|
@cindex Kernighan, Brian
|
|
Brian Kernighan, one of the original designers of Unix @code{awk},
|
|
has made his version available via anonymous @code{ftp}
|
|
(@pxref{Other Versions, ,Other Freely Available @code{awk} Implementations}).
|
|
This section describes extensions in his version of @code{awk} that are
|
|
not in POSIX @code{awk}.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @samp{-mf @var{NNN}} and @samp{-mr @var{NNN}} command line options
|
|
to set the maximum number of fields, and the maximum
|
|
record size, respectively
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
@item
|
|
The @code{fflush} built-in function for flushing buffered output
|
|
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
|
|
|
|
@ignore
|
|
@item
|
|
The @code{SYMTAB} array, that allows access to the internal symbol
|
|
table of @code{awk}. This feature is not documented, largely because
|
|
it is somewhat shakily implemented. For instance, you cannot access arrays
|
|
or array elements through it.
|
|
@end ignore
|
|
@end itemize
|
|
|
|
@node POSIX/GNU, , BTL, Language History
|
|
@section Extensions in @code{gawk} Not in POSIX @code{awk}
|
|
|
|
@cindex compatibility mode
|
|
The GNU implementation, @code{gawk}, adds a number of features.
|
|
This sections lists them in the order they were added to @code{gawk}.
|
|
They can all be disabled with either the @samp{--traditional} or
|
|
@samp{--posix} options
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
Version 2.10 of @code{gawk} introduced these features:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{AWKPATH} environment variable for specifying a path search for
|
|
the @samp{-f} command line option
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
@item
|
|
The @code{IGNORECASE} variable and its effects
|
|
(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).
|
|
|
|
@item
|
|
The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and
|
|
@file{/dev/fd/@var{n}} file name interpretation
|
|
(@pxref{Special Files, ,Special File Names in @code{gawk}}).
|
|
@end itemize
|
|
|
|
Version 2.13 of @code{gawk} introduced these features:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{FIELDWIDTHS} variable and its effects
|
|
(@pxref{Constant Size, ,Reading Fixed-width Data}).
|
|
|
|
@item
|
|
The @code{systime} and @code{strftime} built-in functions for obtaining
|
|
and printing time stamps
|
|
(@pxref{Time Functions, ,Functions for Dealing with Time Stamps}).
|
|
|
|
@item
|
|
The @samp{-W lint} option to provide source code and run time error
|
|
and portability checking
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
@item
|
|
The @samp{-W compat} option to turn off these extensions
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
@item
|
|
The @samp{-W posix} option for full POSIX compliance
|
|
(@pxref{Options, ,Command Line Options}).
|
|
@end itemize
|
|
|
|
Version 2.14 of @code{gawk} introduced these features:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{next file} statement for skipping to the next data file
|
|
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
|
|
@end itemize
|
|
|
|
Version 2.15 of @code{gawk} introduced these features:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{ARGIND} variable, that tracks the movement of @code{FILENAME}
|
|
through @code{ARGV} (@pxref{Built-in Variables}).
|
|
|
|
@item
|
|
The @code{ERRNO} variable, that contains the system error message when
|
|
@code{getline} returns @minus{}1, or when @code{close} fails
|
|
(@pxref{Built-in Variables}).
|
|
|
|
@item
|
|
The ability to use GNU-style long named options that start with @samp{--}
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
@item
|
|
The @samp{--source} option for mixing command line and library
|
|
file source code
|
|
(@pxref{Options, ,Command Line Options}).
|
|
|
|
@item
|
|
The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and
|
|
@file{/dev/user} file name interpretation
|
|
(@pxref{Special Files, ,Special File Names in @code{gawk}}).
|
|
@end itemize
|
|
|
|
Version 3.0 of @code{gawk} introduced these features:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{next file} statement became @code{nextfile}
|
|
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
|
|
|
|
@item
|
|
The @samp{--lint-old} option to
|
|
warn about constructs that are not available in
|
|
the original Version 7 Unix version of @code{awk}
|
|
(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).
|
|
|
|
@item
|
|
The @samp{--traditional} option was added as a better name for
|
|
@samp{--compat} (@pxref{Options, ,Command Line Options}).
|
|
|
|
@item
|
|
The ability for @code{FS} to be a null string, and for the third
|
|
argument to @code{split} to be the null string
|
|
(@pxref{Single Character Fields, , Making Each Character a Separate Field}).
|
|
|
|
@item
|
|
The ability for @code{RS} to be a regexp
|
|
(@pxref{Records, , How Input is Split into Records}).
|
|
|
|
@item
|
|
The @code{RT} variable
|
|
(@pxref{Records, , How Input is Split into Records}).
|
|
|
|
@item
|
|
The @code{gensub} function for more powerful text manipulation
|
|
(@pxref{String Functions, , Built-in Functions for String Manipulation}).
|
|
|
|
@item
|
|
The @code{strftime} function acquired a default time format,
|
|
allowing it to be called with no arguments
|
|
(@pxref{Time Functions, , Functions for Dealing with Time Stamps}).
|
|
|
|
@item
|
|
Full support for both POSIX and GNU regexps
|
|
(@pxref{Regexp, , Regular Expressions}).
|
|
|
|
@item
|
|
The @samp{--re-interval} option to provide interval expressions in regexps
|
|
(@pxref{Regexp Operators, , Regular Expression Operators}).
|
|
|
|
@item
|
|
@code{IGNORECASE} changed, now applying to string comparison as well
|
|
as regexp operations
|
|
(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).
|
|
|
|
@item
|
|
The @samp{-m} option and the @code{fflush} function from the
|
|
Bell Labs research version of @code{awk}
|
|
(@pxref{Options, ,Command Line Options}; also
|
|
@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
|
|
|
|
@item
|
|
The use of GNU Autoconf to control the configuration process
|
|
(@pxref{Quick Installation, , Compiling @code{gawk} for Unix}).
|
|
|
|
@item
|
|
Amiga support
|
|
(@pxref{Amiga Installation, ,Installing @code{gawk} on an Amiga}).
|
|
|
|
@c XXX ADD MORE STUFF HERE
|
|
|
|
@end itemize
|
|
|
|
@node Gawk Summary, Installation, Language History, Top
|
|
@appendix @code{gawk} Summary
|
|
|
|
This appendix provides a brief summary of the @code{gawk} command line and the
|
|
@code{awk} language. It is designed to serve as ``quick reference.'' It is
|
|
therefore terse, but complete.
|
|
|
|
@menu
|
|
* Command Line Summary:: Recapitulation of the command line.
|
|
* Language Summary:: A terse review of the language.
|
|
* Variables/Fields:: Variables, fields, and arrays.
|
|
* Rules Summary:: Patterns and Actions, and their component
|
|
parts.
|
|
* Actions Summary:: Quick overview of actions.
|
|
* Functions Summary:: Defining and calling functions.
|
|
* Historical Features:: Some undocumented but supported ``features''.
|
|
@end menu
|
|
|
|
@node Command Line Summary, Language Summary, Gawk Summary, Gawk Summary
|
|
@appendixsec Command Line Options Summary
|
|
|
|
The command line consists of options to @code{gawk} itself, the
|
|
@code{awk} program text (if not supplied via the @samp{-f} option), and
|
|
values to be made available in the @code{ARGC} and @code{ARGV}
|
|
predefined @code{awk} variables:
|
|
|
|
@example
|
|
gawk @r{[@var{POSIX or GNU style options}]} -f @var{source-file} @r{[@code{--}]} @var{file} @dots{}
|
|
gawk @r{[@var{POSIX or GNU style options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
|
|
@end example
|
|
|
|
The options that @code{gawk} accepts are:
|
|
|
|
@table @code
|
|
@item -F @var{fs}
|
|
@itemx --field-separator @var{fs}
|
|
Use @var{fs} for the input field separator (the value of the @code{FS}
|
|
predefined variable).
|
|
|
|
@item -f @var{program-file}
|
|
@itemx --file @var{program-file}
|
|
Read the @code{awk} program source from the file @var{program-file}, instead
|
|
of from the first command line argument.
|
|
|
|
@item -mf @var{NNN}
|
|
@itemx -mr @var{NNN}
|
|
The @samp{f} flag sets
|
|
the maximum number of fields, and the @samp{r} flag sets the maximum
|
|
record size. These options are ignored by @code{gawk}, since @code{gawk}
|
|
has no predefined limits; they are only for compatibility with the
|
|
Bell Labs research version of Unix @code{awk}.
|
|
|
|
@item -v @var{var}=@var{val}
|
|
@itemx --assign @var{var}=@var{val}
|
|
Assign the variable @var{var} the value @var{val} before program execution
|
|
begins.
|
|
|
|
@item -W traditional
|
|
@itemx -W compat
|
|
@itemx --traditional
|
|
@itemx --compat
|
|
Use compatibility mode, in which @code{gawk} extensions are turned
|
|
off.
|
|
|
|
@item -W copyleft
|
|
@itemx -W copyright
|
|
@itemx --copyleft
|
|
@itemx --copyright
|
|
Print the short version of the General Public License on the standard
|
|
output, and exit. This option may disappear in a future version of @code{gawk}.
|
|
|
|
@item -W help
|
|
@itemx -W usage
|
|
@itemx --help
|
|
@itemx --usage
|
|
Print a relatively short summary of the available options on the standard
|
|
output, and exit.
|
|
|
|
@item -W lint
|
|
@itemx --lint
|
|
Give warnings about dubious or non-portable @code{awk} constructs.
|
|
|
|
@item -W lint-old
|
|
@itemx --lint-old
|
|
Warn about constructs that are not available in
|
|
the original Version 7 Unix version of @code{awk}.
|
|
|
|
@item -W posix
|
|
@itemx --posix
|
|
Use POSIX compatibility mode, in which @code{gawk} extensions
|
|
are turned off and additional restrictions apply.
|
|
|
|
@item -W re-interval
|
|
@itemx --re-interval
|
|
Allow interval expressions
|
|
(@pxref{Regexp Operators, , Regular Expression Operators}),
|
|
in regexps.
|
|
|
|
@item -W source=@var{program-text}
|
|
@itemx --source @var{program-text}
|
|
Use @var{program-text} as @code{awk} program source code. This option allows
|
|
mixing command line source code with source code from files, and is
|
|
particularly useful for mixing command line programs with library functions.
|
|
|
|
@item -W version
|
|
@itemx --version
|
|
Print version information for this particular copy of @code{gawk} on the error
|
|
output.
|
|
|
|
@item --
|
|
Signal the end of options. This is useful to allow further arguments to the
|
|
@code{awk} program itself to start with a @samp{-}. This is mainly for
|
|
consistency with POSIX argument parsing conventions.
|
|
@end table
|
|
|
|
Any other options are flagged as invalid, but are otherwise ignored.
|
|
@xref{Options, ,Command Line Options}, for more details.
|
|
|
|
@node Language Summary, Variables/Fields, Command Line Summary, Gawk Summary
|
|
@appendixsec Language Summary
|
|
|
|
An @code{awk} program consists of a sequence of zero or more pattern-action
|
|
statements and optional function definitions. One or the other of the
|
|
pattern and action may be omitted.
|
|
|
|
@example
|
|
@var{pattern} @{ @var{action statements} @}
|
|
@var{pattern}
|
|
@{ @var{action statements} @}
|
|
|
|
function @var{name}(@var{parameter list}) @{ @var{action statements} @}
|
|
@end example
|
|
|
|
@code{gawk} first reads the program source from the
|
|
@var{program-file}(s), if specified, or from the first non-option
|
|
argument on the command line. The @samp{-f} option may be used multiple
|
|
times on the command line. @code{gawk} reads the program text from all
|
|
the @var{program-file} files, effectively concatenating them in the
|
|
order they are specified. This is useful for building libraries of
|
|
@code{awk} functions, without having to include them in each new
|
|
@code{awk} program that uses them. To use a library function in a file
|
|
from a program typed in on the command line, specify
|
|
@samp{--source '@var{program}'}, and type your program in between the single
|
|
quotes.
|
|
@xref{Options, ,Command Line Options}.
|
|
|
|
The environment variable @code{AWKPATH} specifies a search path to use
|
|
when finding source files named with the @samp{-f} option. The default
|
|
path, which is
|
|
@samp{.:/usr/local/share/awk}@footnote{The path may use a directory
|
|
other than @file{/usr/local/share/awk}, depending upon how @code{gawk}
|
|
was built and installed.} is used if @code{AWKPATH} is not set.
|
|
If a file name given to the @samp{-f} option contains a @samp{/} character,
|
|
no path search is performed.
|
|
@xref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
|
|
|
|
@code{gawk} compiles the program into an internal form, and then proceeds to
|
|
read each file named in the @code{ARGV} array.
|
|
The initial values of @code{ARGV} come from the command line arguments.
|
|
If there are no files named
|
|
on the command line, @code{gawk} reads the standard input.
|
|
|
|
If a ``file'' named on the command line has the form
|
|
@samp{@var{var}=@var{val}}, it is treated as a variable assignment: the
|
|
variable @var{var} is assigned the value @var{val}.
|
|
If any of the files have a value that is the null string, that
|
|
element in the list is skipped.
|
|
|
|
For each record in the input, @code{gawk} tests to see if it matches any
|
|
@var{pattern} in the @code{awk} program. For each pattern that the record
|
|
matches, the associated @var{action} is executed.
|
|
|
|
@node Variables/Fields, Rules Summary, Language Summary, Gawk Summary
|
|
@appendixsec Variables and Fields
|
|
|
|
@code{awk} variables are not declared; they come into existence when they are
|
|
first used. Their values are either floating-point numbers or strings.
|
|
@code{awk} also has one-dimensional arrays; multiple-dimensional arrays
|
|
may be simulated. There are several predefined variables that
|
|
@code{awk} sets as a program runs; these are summarized below.
|
|
|
|
@menu
|
|
* Fields Summary:: Input field splitting.
|
|
* Built-in Summary:: @code{awk}'s built-in variables.
|
|
* Arrays Summary:: Using arrays.
|
|
* Data Type Summary:: Values in @code{awk} are numbers or strings.
|
|
@end menu
|
|
|
|
@node Fields Summary, Built-in Summary, Variables/Fields, Variables/Fields
|
|
@appendixsubsec Fields
|
|
|
|
As each input line is read, @code{gawk} splits the line into
|
|
@var{fields}, using the value of the @code{FS} variable as the field
|
|
separator. If @code{FS} is a single character, fields are separated by
|
|
that character. Otherwise, @code{FS} is expected to be a full regular
|
|
expression. In the special case that @code{FS} is a single space,
|
|
fields are separated by runs of spaces, tabs and/or newlines.@footnote{In
|
|
POSIX @code{awk}, newline does not separate fields.}
|
|
If @code{FS} is the null string (@code{""}), then each individual
|
|
character in the record becomes a separate field.
|
|
Note that the value
|
|
of @code{IGNORECASE} (@pxref{Case-sensitivity, ,Case-sensitivity in Matching})
|
|
also affects how fields are split when @code{FS} is a regular expression.
|
|
|
|
Each field in the input line may be referenced by its position, @code{$1},
|
|
@code{$2}, and so on. @code{$0} is the whole line. The value of a field may
|
|
be assigned to as well. Field numbers need not be constants:
|
|
|
|
@example
|
|
n = 5
|
|
print $n
|
|
@end example
|
|
|
|
@noindent
|
|
prints the fifth field in the input line. The variable @code{NF} is set to
|
|
the total number of fields in the input line.
|
|
|
|
References to non-existent fields (i.e.@: fields after @code{$NF}) return
|
|
the null string. However, assigning to a non-existent field (e.g.,
|
|
@code{$(NF+2) = 5}) increases the value of @code{NF}, creates any
|
|
intervening fields with the null string as their value, and causes the
|
|
value of @code{$0} to be recomputed, with the fields being separated by
|
|
the value of @code{OFS}.
|
|
Decrementing @code{NF} causes the values of fields past the new value to
|
|
be lost, and the value of @code{$0} to be recomputed, with the fields being
|
|
separated by the value of @code{OFS}.
|
|
@xref{Reading Files, ,Reading Input Files}.
|
|
|
|
@node Built-in Summary, Arrays Summary, Fields Summary, Variables/Fields
|
|
@appendixsubsec Built-in Variables
|
|
|
|
@code{gawk}'s built-in variables are:
|
|
|
|
@table @code
|
|
@item ARGC
|
|
The number of elements in @code{ARGV}. See below for what is actually
|
|
included in @code{ARGV}.
|
|
|
|
@item ARGIND
|
|
The index in @code{ARGV} of the current file being processed.
|
|
When @code{gawk} is processing the input data files,
|
|
it is always true that @samp{FILENAME == ARGV[ARGIND]}.
|
|
|
|
@item ARGV
|
|
The array of command line arguments. The array is indexed from zero to
|
|
@code{ARGC} @minus{} 1. Dynamically changing @code{ARGC} and
|
|
the contents of @code{ARGV}
|
|
can control the files used for data. A null-valued element in
|
|
@code{ARGV} is ignored. @code{ARGV} does not include the options to
|
|
@code{awk} or the text of the @code{awk} program itself.
|
|
|
|
@item CONVFMT
|
|
The conversion format to use when converting numbers to strings.
|
|
|
|
@item FIELDWIDTHS
|
|
A space separated list of numbers describing the fixed-width input data.
|
|
|
|
@item ENVIRON
|
|
An array of environment variable values. The array
|
|
is indexed by variable name, each element being the value of that
|
|
variable. Thus, the environment variable @code{HOME} is
|
|
@code{ENVIRON["HOME"]}. One possible value might be @file{/home/arnold}.
|
|
|
|
Changing this array does not affect the environment seen by programs
|
|
which @code{gawk} spawns via redirection or the @code{system} function.
|
|
(This may change in a future version of @code{gawk}.)
|
|
|
|
Some operating systems do not have environment variables.
|
|
The @code{ENVIRON} array is empty when running on these systems.
|
|
|
|
@item ERRNO
|
|
The system error message when an error occurs using @code{getline}
|
|
or @code{close}.
|
|
|
|
@item FILENAME
|
|
The name of the current input file. If no files are specified on the command
|
|
line, the value of @code{FILENAME} is the null string.
|
|
|
|
@item FNR
|
|
The input record number in the current input file.
|
|
|
|
@item FS
|
|
The input field separator, a space by default.
|
|
|
|
@item IGNORECASE
|
|
The case-sensitivity flag for string comparisons and regular expression
|
|
operations. If @code{IGNORECASE} has a non-zero value, then pattern
|
|
matching in rules, record separating with @code{RS}, field splitting
|
|
with @code{FS}, regular expression matching with @samp{~} and
|
|
@samp{!~}, and the @code{gensub}, @code{gsub}, @code{index},
|
|
@code{match}, @code{split} and @code{sub} built-in functions all
|
|
ignore case when doing regular expression operations, and all string
|
|
comparisons are done ignoring case.
|
|
The value of @code{IGNORECASE} does @emph{not} affect array subscripting.
|
|
|
|
@item NF
|
|
The number of fields in the current input record.
|
|
|
|
@item NR
|
|
The total number of input records seen so far.
|
|
|
|
@item OFMT
|
|
The output format for numbers for the @code{print} statement,
|
|
@code{"%.6g"} by default.
|
|
|
|
@item OFS
|
|
The output field separator, a space by default.
|
|
|
|
@item ORS
|
|
The output record separator, by default a newline.
|
|
|
|
@item RS
|
|
The input record separator, by default a newline.
|
|
If @code{RS} is set to the null string, then records are separated by
|
|
blank lines. When @code{RS} is set to the null string, then the newline
|
|
character always acts as a field separator, in addition to whatever value
|
|
@code{FS} may have. If @code{RS} is set to a multi-character
|
|
string, it denotes a regexp; input text matching the regexp
|
|
separates records.
|
|
|
|
@item RT
|
|
The input text that matched the text denoted by @code{RS},
|
|
the record separator.
|
|
|
|
@item RSTART
|
|
The index of the first character last matched by @code{match}; zero if no match.
|
|
|
|
@item RLENGTH
|
|
The length of the string last matched by @code{match}; @minus{}1 if no match.
|
|
|
|
@item SUBSEP
|
|
The string used to separate multiple subscripts in array elements, by
|
|
default @code{"\034"}.
|
|
@end table
|
|
|
|
@xref{Built-in Variables}, for more information.
|
|
|
|
@node Arrays Summary, Data Type Summary, Built-in Summary, Variables/Fields
|
|
@appendixsubsec Arrays
|
|
|
|
Arrays are subscripted with an expression between square brackets
|
|
(@samp{[} and @samp{]}). Array subscripts are @emph{always} strings;
|
|
numbers are converted to strings as necessary, following the standard
|
|
conversion rules
|
|
(@pxref{Conversion, ,Conversion of Strings and Numbers}).
|
|
|
|
If you use multiple expressions separated by commas inside the square
|
|
brackets, then the array subscript is a string consisting of the
|
|
concatenation of the individual subscript values, converted to strings,
|
|
separated by the subscript separator (the value of @code{SUBSEP}).
|
|
|
|
The special operator @code{in} may be used in a conditional context
|
|
to see if an array has an index consisting of a particular value.
|
|
|
|
@example
|
|
if (val in array)
|
|
print array[val]
|
|
@end example
|
|
|
|
If the array has multiple subscripts, use @samp{(i, j, @dots{}) in @var{array}}
|
|
to test for existence of an element.
|
|
|
|
The @code{in} construct may also be used in a @code{for} loop to iterate
|
|
over all the elements of an array.
|
|
@xref{Scanning an Array, ,Scanning All Elements of an Array}.
|
|
|
|
You can remove an element from an array using the @code{delete} statement.
|
|
|
|
You can clear an entire array using @samp{delete @var{array}}.
|
|
|
|
@xref{Arrays, ,Arrays in @code{awk}}.
|
|
|
|
@node Data Type Summary, , Arrays Summary, Variables/Fields
|
|
@appendixsubsec Data Types
|
|
|
|
The value of an @code{awk} expression is always either a number
|
|
or a string.
|
|
|
|
Some contexts (such as arithmetic operators) require numeric
|
|
values. They convert strings to numbers by interpreting the text
|
|
of the string as a number. If the string does not look like a
|
|
number, it converts to zero.
|
|
|
|
Other contexts (such as concatenation) require string values.
|
|
They convert numbers to strings by effectively printing them
|
|
with @code{sprintf}.
|
|
@xref{Conversion, ,Conversion of Strings and Numbers}, for the details.
|
|
|
|
To force conversion of a string value to a number, simply add zero
|
|
to it. If the value you start with is already a number, this
|
|
does not change it.
|
|
|
|
To force conversion of a numeric value to a string, concatenate it with
|
|
the null string.
|
|
|
|
Comparisons are done numerically if both operands are numeric, or if
|
|
one is numeric and the other is a numeric string. Otherwise one or
|
|
both operands are converted to strings and a string comparison is
|
|
performed. Fields, @code{getline} input, @code{FILENAME}, @code{ARGV}
|
|
elements, @code{ENVIRON} elements and the elements of an array created
|
|
by @code{split} are the only items that can be numeric strings. String
|
|
constants, such as @code{"3.1415927"} are not numeric strings, they are
|
|
string constants. The full rules for comparisons are described in
|
|
@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
|
|
|
|
Uninitialized variables have the string value @code{""} (the null, or
|
|
empty, string). In contexts where a number is required, this is
|
|
equivalent to zero.
|
|
|
|
@xref{Variables}, for more information on variable naming and initialization;
|
|
@pxref{Conversion, ,Conversion of Strings and Numbers}, for more information
|
|
on how variable values are interpreted.
|
|
|
|
@node Rules Summary, Actions Summary, Variables/Fields, Gawk Summary
|
|
@appendixsec Patterns
|
|
|
|
@menu
|
|
* Pattern Summary:: Quick overview of patterns.
|
|
* Regexp Summary:: Quick overview of regular expressions.
|
|
@end menu
|
|
|
|
An @code{awk} program is mostly composed of rules, each consisting of a
|
|
pattern followed by an action. The action is enclosed in @samp{@{} and
|
|
@samp{@}}. Either the pattern may be missing, or the action may be
|
|
missing, but not both. If the pattern is missing, the
|
|
action is executed for every input record. A missing action is
|
|
equivalent to @samp{@w{@{ print @}}}, which prints the entire line.
|
|
|
|
@c These paragraphs repeated for both patterns and actions. I don't
|
|
@c like this, but I also don't see any way around it. Update both copies
|
|
@c if they need fixing.
|
|
Comments begin with the @samp{#} character, and continue until the end of the
|
|
line. Blank lines may be used to separate statements. Statements normally
|
|
end with a newline; however, this is not the case for lines ending in a
|
|
@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines
|
|
ending in @code{do} or @code{else} also have their statements automatically
|
|
continued on the following line. In other cases, a line can be continued by
|
|
ending it with a @samp{\}, in which case the newline is ignored.
|
|
|
|
Multiple statements may be put on one line by separating each one with
|
|
a @samp{;}.
|
|
This applies to both the statements within the action part of a rule (the
|
|
usual case), and to the rule statements.
|
|
|
|
@xref{Comments, ,Comments in @code{awk} Programs}, for information on
|
|
@code{awk}'s commenting convention;
|
|
@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a
|
|
description of the line continuation mechanism in @code{awk}.
|
|
|
|
@node Pattern Summary, Regexp Summary, Rules Summary, Rules Summary
|
|
@appendixsubsec Pattern Summary
|
|
|
|
@code{awk} patterns may be one of the following:
|
|
|
|
@example
|
|
/@var{regular expression}/
|
|
@var{relational expression}
|
|
@var{pattern} && @var{pattern}
|
|
@var{pattern} || @var{pattern}
|
|
@var{pattern} ? @var{pattern} : @var{pattern}
|
|
(@var{pattern})
|
|
! @var{pattern}
|
|
@var{pattern1}, @var{pattern2}
|
|
BEGIN
|
|
END
|
|
@end example
|
|
|
|
@code{BEGIN} and @code{END} are two special kinds of patterns that are not
|
|
tested against the input. The action parts of all @code{BEGIN} rules are
|
|
concatenated as if all the statements had been written in a single @code{BEGIN}
|
|
rule. They are executed before any of the input is read. Similarly, all the
|
|
@code{END} rules are concatenated, and executed when all the input is exhausted (or
|
|
when an @code{exit} statement is executed). @code{BEGIN} and @code{END}
|
|
patterns cannot be combined with other patterns in pattern expressions.
|
|
@code{BEGIN} and @code{END} rules cannot have missing action parts.
|
|
|
|
For @code{/@var{regular-expression}/} patterns, the associated statement is
|
|
executed for each input record that matches the regular expression. Regular
|
|
expressions are summarized below.
|
|
|
|
A @var{relational expression} may use any of the operators defined below in
|
|
the section on actions. These generally test whether certain fields match
|
|
certain regular expressions.
|
|
|
|
The @samp{&&}, @samp{||}, and @samp{!} operators are logical ``and,''
|
|
logical ``or,'' and logical ``not,'' respectively, as in C. They do
|
|
short-circuit evaluation, also as in C, and are used for combining more
|
|
primitive pattern expressions. As in most languages, parentheses may be
|
|
used to change the order of evaluation.
|
|
|
|
The @samp{?:} operator is like the same operator in C. If the first
|
|
pattern matches, then the second pattern is matched against the input
|
|
record; otherwise, the third is matched. Only one of the second and
|
|
third patterns is matched.
|
|
|
|
The @samp{@var{pattern1}, @var{pattern2}} form of a pattern is called a
|
|
range pattern. It matches all input lines starting with a line that
|
|
matches @var{pattern1}, and continuing until a line that matches
|
|
@var{pattern2}, inclusive. A range pattern cannot be used as an operand
|
|
of any of the pattern operators.
|
|
|
|
@xref{Pattern Overview, ,Pattern Elements}.
|
|
|
|
@node Regexp Summary, , Pattern Summary, Rules Summary
|
|
@appendixsubsec Regular Expressions
|
|
|
|
Regular expressions are based on POSIX EREs (extended regular expressions).
|
|
The escape sequences allowed in string constants are also valid in
|
|
regular expressions (@pxref{Escape Sequences}).
|
|
Regexps are composed of characters as follows:
|
|
|
|
@table @code
|
|
@item @var{c}
|
|
matches the character @var{c} (assuming @var{c} is none of the characters
|
|
listed below).
|
|
|
|
@item \@var{c}
|
|
matches the literal character @var{c}.
|
|
|
|
@item .
|
|
matches any character, @emph{including} newline.
|
|
In strict POSIX mode, @samp{.} does not match the @sc{nul}
|
|
character, which is a character with all bits equal to zero.
|
|
|
|
@item ^
|
|
matches the beginning of a string.
|
|
|
|
@item $
|
|
matches the end of a string.
|
|
|
|
@item [@var{abc}@dots{}]
|
|
matches any of the characters @var{abc}@dots{} (character list).
|
|
|
|
@item [[:@var{class}:]]
|
|
matches any character in the character class @var{class}. Allowable classes
|
|
are @code{alnum}, @code{alpha}, @code{blank}, @code{cntrl},
|
|
@code{digit}, @code{graph}, @code{lower}, @code{print}, @code{punct},
|
|
@code{space}, @code{upper}, and @code{xdigit}.
|
|
|
|
@item [[.@var{symbol}.]]
|
|
matches the multi-character collating symbol @var{symbol}.
|
|
@code{gawk} does not currently support collating symbols.
|
|
|
|
@item [[=@var{classname}=]]
|
|
matches any of the equivalent characters in the current locale named by the
|
|
equivalence class @var{classname}.
|
|
@code{gawk} does not currently support equivalence classes.
|
|
|
|
@item [^@var{abc}@dots{}]
|
|
matches any character except @var{abc}@dots{} (negated
|
|
character list).
|
|
|
|
@item @var{r1}|@var{r2}
|
|
matches either @var{r1} or @var{r2} (alternation).
|
|
|
|
@item @var{r1r2}
|
|
matches @var{r1}, and then @var{r2} (concatenation).
|
|
|
|
@item @var{r}+
|
|
matches one or more @var{r}'s.
|
|
|
|
@item @var{r}*
|
|
matches zero or more @var{r}'s.
|
|
|
|
@item @var{r}?
|
|
matches zero or one @var{r}'s.
|
|
|
|
@item (@var{r})
|
|
matches @var{r} (grouping).
|
|
|
|
@item @var{r}@{@var{n}@}
|
|
@itemx @var{r}@{@var{n},@}
|
|
@itemx @var{r}@{@var{n},@var{m}@}
|
|
matches at least @var{n}, @var{n} to any number, or @var{n} to @var{m}
|
|
occurrences of @var{r} (interval expressions).
|
|
|
|
@item \y
|
|
matches the empty string at either the beginning or the
|
|
end of a word.
|
|
|
|
@item \B
|
|
matches the empty string within a word.
|
|
|
|
@item \<
|
|
matches the empty string at the beginning of a word.
|
|
|
|
@item \>
|
|
matches the empty string at the end of a word.
|
|
|
|
@item \w
|
|
matches any word-constituent character (alphanumeric characters and
|
|
the underscore).
|
|
|
|
@item \W
|
|
matches any character that is not word-constituent.
|
|
|
|
@item \`
|
|
matches the empty string at the beginning of a buffer (same as a string
|
|
in @code{gawk}).
|
|
|
|
@item \'
|
|
matches the empty string at the end of a buffer.
|
|
@end table
|
|
|
|
The various command line options
|
|
control how @code{gawk} interprets characters in regexps.
|
|
|
|
@c NOTE!!! Keep this in sync with the same table in the regexp chapter!
|
|
@table @asis
|
|
@item No options
|
|
In the default case, @code{gawk} provide all the facilities of
|
|
POSIX regexps and the GNU regexp operators described above.
|
|
However, interval expressions are not supported.
|
|
|
|
@item @code{--posix}
|
|
Only POSIX regexps are supported, the GNU operators are not special
|
|
(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions
|
|
are allowed.
|
|
|
|
@item @code{--traditional}
|
|
Traditional Unix @code{awk} regexps are matched. The GNU operators
|
|
are not special, interval expressions are not available, and neither
|
|
are the POSIX character classes (@code{[[:alnum:]]} and so on).
|
|
Characters described by octal and hexadecimal escape sequences are
|
|
treated literally, even if they represent regexp metacharacters.
|
|
|
|
@item @code{--re-interval}
|
|
Allow interval expressions in regexps, even if @samp{--traditional}
|
|
has been provided.
|
|
@end table
|
|
|
|
@xref{Regexp, ,Regular Expressions}.
|
|
|
|
@node Actions Summary, Functions Summary, Rules Summary, Gawk Summary
|
|
@appendixsec Actions
|
|
|
|
Action statements are enclosed in braces, @samp{@{} and @samp{@}}.
|
|
A missing action statement is equivalent to @samp{@w{@{ print @}}}.
|
|
|
|
Action statements consist of the usual assignment, conditional, and looping
|
|
statements found in most languages. The operators, control statements,
|
|
and Input/Output statements available are similar to those in C.
|
|
|
|
@c These paragraphs repeated for both patterns and actions. I don't
|
|
@c like this, but I also don't see any way around it. Update both copies
|
|
@c if they need fixing.
|
|
Comments begin with the @samp{#} character, and continue until the end of the
|
|
line. Blank lines may be used to separate statements. Statements normally
|
|
end with a newline; however, this is not the case for lines ending in a
|
|
@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines
|
|
ending in @code{do} or @code{else} also have their statements automatically
|
|
continued on the following line. In other cases, a line can be continued by
|
|
ending it with a @samp{\}, in which case the newline is ignored.
|
|
|
|
Multiple statements may be put on one line by separating each one with
|
|
a @samp{;}.
|
|
This applies to both the statements within the action part of a rule (the
|
|
usual case), and to the rule statements.
|
|
|
|
@xref{Comments, ,Comments in @code{awk} Programs}, for information on
|
|
@code{awk}'s commenting convention;
|
|
@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a
|
|
description of the line continuation mechanism in @code{awk}.
|
|
|
|
@menu
|
|
* Operator Summary:: @code{awk} operators.
|
|
* Control Flow Summary:: The control statements.
|
|
* I/O Summary:: The I/O statements.
|
|
* Printf Summary:: A summary of @code{printf}.
|
|
* Special File Summary:: Special file names interpreted internally.
|
|
* Built-in Functions Summary:: Built-in numeric and string functions.
|
|
* Time Functions Summary:: Built-in time functions.
|
|
* String Constants Summary:: Escape sequences in strings.
|
|
@end menu
|
|
|
|
@node Operator Summary, Control Flow Summary, Actions Summary, Actions Summary
|
|
@appendixsubsec Operators
|
|
|
|
The operators in @code{awk}, in order of decreasing precedence, are:
|
|
|
|
@table @code
|
|
@item (@dots{})
|
|
Grouping.
|
|
|
|
@item $
|
|
Field reference.
|
|
|
|
@item ++ --
|
|
Increment and decrement, both prefix and postfix.
|
|
|
|
@item ^
|
|
Exponentiation (@samp{**} may also be used, and @samp{**=} for the assignment
|
|
operator, but they are not specified in the POSIX standard).
|
|
|
|
@item + - !
|
|
Unary plus, unary minus, and logical negation.
|
|
|
|
@item * / %
|
|
Multiplication, division, and modulus.
|
|
|
|
@item + -
|
|
Addition and subtraction.
|
|
|
|
@item @var{space}
|
|
String concatenation.
|
|
|
|
@item < <= > >= != ==
|
|
The usual relational operators.
|
|
|
|
@item ~ !~
|
|
Regular expression match, negated match.
|
|
|
|
@item in
|
|
Array membership.
|
|
|
|
@item &&
|
|
Logical ``and''.
|
|
|
|
@item ||
|
|
Logical ``or''.
|
|
|
|
@item ?:
|
|
A conditional expression. This has the form @samp{@var{expr1} ?
|
|
@var{expr2} : @var{expr3}}. If @var{expr1} is true, the value of the
|
|
expression is @var{expr2}; otherwise it is @var{expr3}. Only one of
|
|
@var{expr2} and @var{expr3} is evaluated.
|
|
|
|
@item = += -= *= /= %= ^=
|
|
Assignment. Both absolute assignment (@code{@var{var}=@var{value}})
|
|
and operator assignment (the other forms) are supported.
|
|
@end table
|
|
|
|
@xref{Expressions}.
|
|
|
|
@node Control Flow Summary, I/O Summary, Operator Summary, Actions Summary
|
|
@appendixsubsec Control Statements
|
|
|
|
The control statements are as follows:
|
|
|
|
@example
|
|
if (@var{condition}) @var{statement} @r{[} else @var{statement} @r{]}
|
|
while (@var{condition}) @var{statement}
|
|
do @var{statement} while (@var{condition})
|
|
for (@var{expr1}; @var{expr2}; @var{expr3}) @var{statement}
|
|
for (@var{var} in @var{array}) @var{statement}
|
|
break
|
|
continue
|
|
delete @var{array}[@var{index}]
|
|
delete @var{array}
|
|
exit @r{[} @var{expression} @r{]}
|
|
@{ @var{statements} @}
|
|
@end example
|
|
|
|
@xref{Statements, ,Control Statements in Actions}.
|
|
|
|
@node I/O Summary, Printf Summary, Control Flow Summary, Actions Summary
|
|
@appendixsubsec I/O Statements
|
|
|
|
The Input/Output statements are as follows:
|
|
|
|
@table @code
|
|
@item getline
|
|
Set @code{$0} from next input record; set @code{NF}, @code{NR}, @code{FNR}.
|
|
@xref{Getline, ,Explicit Input with @code{getline}}.
|
|
|
|
@item getline <@var{file}
|
|
Set @code{$0} from next record of @var{file}; set @code{NF}.
|
|
|
|
@item getline @var{var}
|
|
Set @var{var} from next input record; set @code{NR}, @code{FNR}.
|
|
|
|
@item getline @var{var} <@var{file}
|
|
Set @var{var} from next record of @var{file}.
|
|
|
|
@item @var{command} | getline
|
|
Run @var{command}, piping its output into @code{getline}; sets @code{$0},
|
|
@code{NF}, @code{NR}.
|
|
|
|
@item @var{command} | getline @code{var}
|
|
Run @var{command}, piping its output into @code{getline}; sets @var{var}.
|
|
|
|
@item next
|
|
Stop processing the current input record. The next input record is read and
|
|
processing starts over with the first pattern in the @code{awk} program.
|
|
If the end of the input data is reached, the @code{END} rule(s), if any,
|
|
are executed.
|
|
@xref{Next Statement, ,The @code{next} Statement}.
|
|
|
|
@item nextfile
|
|
Stop processing the current input file. The next input record read comes
|
|
from the next input file. @code{FILENAME} is updated, @code{FNR} is set to one,
|
|
@code{ARGIND} is incremented,
|
|
and processing starts over with the first pattern in the @code{awk} program.
|
|
If the end of the input data is reached, the @code{END} rule(s), if any,
|
|
are executed.
|
|
Earlier versions of @code{gawk} used @samp{next file}; this usage is still
|
|
supported, but is considered to be deprecated.
|
|
@xref{Nextfile Statement, ,The @code{nextfile} Statement}.
|
|
|
|
@item print
|
|
Prints the current record.
|
|
@xref{Printing, ,Printing Output}.
|
|
|
|
@item print @var{expr-list}
|
|
Prints expressions.
|
|
|
|
@item print @var{expr-list} > @var{file}
|
|
Prints expressions to @var{file}. If @var{file} does not exist, it is
|
|
created. If it does exist, its contents are deleted the first time the
|
|
@code{print} is executed.
|
|
|
|
@item print @var{expr-list} >> @var{file}
|
|
Prints expressions to @var{file}. The previous contents of @var{file}
|
|
are retained, and the output of @code{print} is appended to the file.
|
|
|
|
@item print @var{expr-list} | @var{command}
|
|
Prints expressions, sending the output down a pipe to @var{command}.
|
|
The pipeline to the command stays open until the @code{close} function
|
|
is called.
|
|
|
|
@item printf @var{fmt, expr-list}
|
|
Format and print.
|
|
|
|
@item printf @var{fmt, expr-list} > file
|
|
Format and print to @var{file}. If @var{file} does not exist, it is
|
|
created. If it does exist, its contents are deleted the first time the
|
|
@code{printf} is executed.
|
|
|
|
@item printf @var{fmt, expr-list} >> @var{file}
|
|
Format and print to @var{file}. The previous contents of @var{file}
|
|
are retained, and the output of @code{printf} is appended to the file.
|
|
|
|
@item printf @var{fmt, expr-list} | @var{command}
|
|
Format and print, sending the output down a pipe to @var{command}.
|
|
The pipeline to the command stays open until the @code{close} function
|
|
is called.
|
|
@end table
|
|
|
|
@code{getline} returns zero on end of file, and @minus{}1 on an error.
|
|
In the event of an error, @code{getline} will set @code{ERRNO} to
|
|
the value of a system-dependent string that describes the error.
|
|
|
|
@node Printf Summary, Special File Summary, I/O Summary, Actions Summary
|
|
@appendixsubsec @code{printf} Summary
|
|
|
|
Conversion specification have the form
|
|
@code{%}[@var{flag}][@var{width}][@code{.}@var{prec}]@var{format}.
|
|
@c whew!
|
|
Items in brackets are optional.
|
|
|
|
The @code{awk} @code{printf} statement and @code{sprintf} function
|
|
accept the following conversion specification formats:
|
|
|
|
@table @code
|
|
@item %c
|
|
An ASCII character. If the argument used for @samp{%c} is numeric, it is
|
|
treated as a character and printed. Otherwise, the argument is assumed to
|
|
be a string, and the only first character of that string is printed.
|
|
|
|
@item %d
|
|
@itemx %i
|
|
A decimal number (the integer part).
|
|
|
|
@item %e
|
|
@itemx %E
|
|
A floating point number of the form
|
|
@samp{@r{[}-@r{]}d.dddddde@r{[}+-@r{]}dd}.
|
|
The @samp{%E} format uses @samp{E} instead of @samp{e}.
|
|
|
|
@item %f
|
|
A floating point number of the form
|
|
@r{[}@code{-}@r{]}@code{ddd.dddddd}.
|
|
|
|
@item %g
|
|
@itemx %G
|
|
Use either the @samp{%e} or @samp{%f} formats, whichever produces a shorter
|
|
string, with non-significant zeros suppressed.
|
|
@samp{%G} will use @samp{%E} instead of @samp{%e}.
|
|
|
|
@item %o
|
|
An unsigned octal number (again, an integer).
|
|
|
|
@item %s
|
|
A character string.
|
|
|
|
@item %x
|
|
@itemx %X
|
|
An unsigned hexadecimal number (an integer).
|
|
The @samp{%X} format uses @samp{A} through @samp{F} instead of
|
|
@samp{a} through @samp{f} for decimal 10 through 15.
|
|
|
|
@item %%
|
|
A single @samp{%} character; no argument is converted.
|
|
@end table
|
|
|
|
There are optional, additional parameters that may lie between the @samp{%}
|
|
and the control letter:
|
|
|
|
@table @code
|
|
@item -
|
|
The expression should be left-justified within its field.
|
|
|
|
@item @var{space}
|
|
For numeric conversions, prefix positive values with a space, and
|
|
negative values with a minus sign.
|
|
|
|
@item +
|
|
The plus sign, used before the width modifier (see below),
|
|
says to always supply a sign for numeric conversions, even if the data
|
|
to be formatted is positive. The @samp{+} overrides the space modifier.
|
|
|
|
@item #
|
|
Use an ``alternate form'' for certain control letters.
|
|
For @samp{o}, supply a leading zero.
|
|
For @samp{x}, and @samp{X}, supply a leading @samp{0x} or @samp{0X} for
|
|
a non-zero result.
|
|
For @samp{e}, @samp{E}, and @samp{f}, the result will always contain a
|
|
decimal point.
|
|
For @samp{g}, and @samp{G}, trailing zeros are not removed from the result.
|
|
|
|
@item 0
|
|
A leading @samp{0} (zero) acts as a flag, that indicates output should be
|
|
padded with zeros instead of spaces.
|
|
This applies even to non-numeric output formats.
|
|
This flag only has an effect when the field width is wider than the
|
|
value to be printed.
|
|
|
|
@item @var{width}
|
|
The field should be padded to this width. The field is normally padded
|
|
with spaces. If the @samp{0} flag has been used, it is padded with zeros.
|
|
|
|
@item .@var{prec}
|
|
A number that specifies the precision to use when printing.
|
|
For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the
|
|
number of digits you want printed to the right of the decimal point.
|
|
For the @samp{g}, and @samp{G} formats, it specifies the maximum number
|
|
of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u},
|
|
@samp{x}, and @samp{X} formats, it specifies the minimum number of
|
|
digits to print. For the @samp{s} format, it specifies the maximum number of
|
|
characters from the string that should be printed.
|
|
@end table
|
|
|
|
Either or both of the @var{width} and @var{prec} values may be specified
|
|
as @samp{*}. In that case, the particular value is taken from the argument
|
|
list.
|
|
|
|
@xref{Printf, ,Using @code{printf} Statements for Fancier Printing}.
|
|
|
|
@node Special File Summary, Built-in Functions Summary, Printf Summary, Actions Summary
|
|
@appendixsubsec Special File Names
|
|
|
|
When doing I/O redirection from either @code{print} or @code{printf} into a
|
|
file, or via @code{getline} from a file, @code{gawk} recognizes certain special
|
|
file names internally. These file names allow access to open file descriptors
|
|
inherited from @code{gawk}'s parent process (usually the shell). The
|
|
file names are:
|
|
|
|
@table @file
|
|
@item /dev/stdin
|
|
The standard input.
|
|
|
|
@item /dev/stdout
|
|
The standard output.
|
|
|
|
@item /dev/stderr
|
|
The standard error output.
|
|
|
|
@item /dev/fd/@var{n}
|
|
The file denoted by the open file descriptor @var{n}.
|
|
@end table
|
|
|
|
In addition, reading the following files provides process related information
|
|
about the running @code{gawk} program. All returned records are terminated
|
|
with a newline.
|
|
|
|
@table @file
|
|
@item /dev/pid
|
|
Returns the process ID of the current process.
|
|
|
|
@item /dev/ppid
|
|
Returns the parent process ID of the current process.
|
|
|
|
@item /dev/pgrpid
|
|
Returns the process group ID of the current process.
|
|
|
|
@item /dev/user
|
|
At least four space-separated fields, containing the return values of
|
|
the @code{getuid}, @code{geteuid}, @code{getgid}, and @code{getegid}
|
|
system calls.
|
|
If there are any additional fields, they are the group IDs returned by
|
|
@code{getgroups} system call.
|
|
(Multiple groups may not be supported on all systems.)
|
|
@end table
|
|
|
|
@noindent
|
|
These file names may also be used on the command line to name data files.
|
|
These file names are only recognized internally if you do not
|
|
actually have files with these names on your system.
|
|
|
|
@xref{Special Files, ,Special File Names in @code{gawk}}, for a longer description that
|
|
provides the motivation for this feature.
|
|
|
|
@node Built-in Functions Summary, Time Functions Summary, Special File Summary, Actions Summary
|
|
@appendixsubsec Built-in Functions
|
|
|
|
@code{awk} provides a number of built-in functions for performing
|
|
numeric operations, string related operations, and I/O related operations.
|
|
|
|
The built-in arithmetic functions are:
|
|
|
|
@table @code
|
|
@item atan2(@var{y}, @var{x})
|
|
the arctangent of @var{y/x} in radians.
|
|
|
|
@item cos(@var{expr})
|
|
the cosine of @var{expr}, which is in radians.
|
|
|
|
@item exp(@var{expr})
|
|
the exponential function (@code{e ^ @var{expr}}).
|
|
|
|
@item int(@var{expr})
|
|
truncates to integer.
|
|
|
|
@item log(@var{expr})
|
|
the natural logarithm of @code{expr}.
|
|
|
|
@item rand()
|
|
a random number between zero and one.
|
|
|
|
@item sin(@var{expr})
|
|
the sine of @var{expr}, which is in radians.
|
|
|
|
@item sqrt(@var{expr})
|
|
the square root function.
|
|
|
|
@item srand(@r{[}@var{expr}@r{]})
|
|
use @var{expr} as a new seed for the random number generator. If no @var{expr}
|
|
is provided, the time of day is used. The return value is the previous
|
|
seed for the random number generator.
|
|
@end table
|
|
|
|
@code{awk} has the following built-in string functions:
|
|
|
|
@table @code
|
|
@item gensub(@var{regex}, @var{subst}, @var{how} @r{[}, @var{target}@r{]})
|
|
If @var{how} is a string beginning with @samp{g} or @samp{G}, then
|
|
replace each match of @var{regex} in @var{target} with @var{subst}.
|
|
Otherwise, replace the @var{how}'th occurrence. If @var{target} is not
|
|
supplied, use @code{$0}. The return value is the changed string; the
|
|
original @var{target} is not modified. Within @var{subst},
|
|
@samp{\@var{n}}, where @var{n} is a digit from one to nine, can be used to
|
|
indicate the text that matched the @var{n}'th parenthesized
|
|
subexpression.
|
|
This function is @code{gawk}-specific.
|
|
|
|
@item gsub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]})
|
|
for each substring matching the regular expression @var{regex} in the string
|
|
@var{target}, substitute the string @var{subst}, and return the number of
|
|
substitutions. If @var{target} is not supplied, use @code{$0}.
|
|
|
|
@item index(@var{str}, @var{search})
|
|
returns the index of the string @var{search} in the string @var{str}, or
|
|
zero if
|
|
@var{search} is not present.
|
|
|
|
@item length(@r{[}@var{str}@r{]})
|
|
returns the length of the string @var{str}. The length of @code{$0}
|
|
is returned if no argument is supplied.
|
|
|
|
@item match(@var{str}, @var{regex})
|
|
returns the position in @var{str} where the regular expression @var{regex}
|
|
occurs, or zero if @var{regex} is not present, and sets the values of
|
|
@code{RSTART} and @code{RLENGTH}.
|
|
|
|
@item split(@var{str}, @var{arr} @r{[}, @var{regex}@r{]})
|
|
splits the string @var{str} into the array @var{arr} on the regular expression
|
|
@var{regex}, and returns the number of elements. If @var{regex} is omitted,
|
|
@code{FS} is used instead. @var{regex} can be the null string, causing
|
|
each character to be placed into its own array element.
|
|
The array @var{arr} is cleared first.
|
|
|
|
@item sprintf(@var{fmt}, @var{expr-list})
|
|
prints @var{expr-list} according to @var{fmt}, and returns the resulting string.
|
|
|
|
@item sub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]})
|
|
just like @code{gsub}, but only the first matching substring is replaced.
|
|
|
|
@item substr(@var{str}, @var{index} @r{[}, @var{len}@r{]})
|
|
returns the @var{len}-character substring of @var{str} starting at @var{index}.
|
|
If @var{len} is omitted, the rest of @var{str} is used.
|
|
|
|
@item tolower(@var{str})
|
|
returns a copy of the string @var{str}, with all the upper-case characters in
|
|
@var{str} translated to their corresponding lower-case counterparts.
|
|
Non-alphabetic characters are left unchanged.
|
|
|
|
@item toupper(@var{str})
|
|
returns a copy of the string @var{str}, with all the lower-case characters in
|
|
@var{str} translated to their corresponding upper-case counterparts.
|
|
Non-alphabetic characters are left unchanged.
|
|
@end table
|
|
|
|
The I/O related functions are:
|
|
|
|
@table @code
|
|
@item close(@var{expr})
|
|
Close the open file or pipe denoted by @var{expr}.
|
|
|
|
@item fflush(@r{[}@var{expr}@r{]})
|
|
Flush any buffered output for the output file or pipe denoted by @var{expr}.
|
|
If @var{expr} is omitted, standard output is flushed.
|
|
If @var{expr} is the null string (@code{""}), all output buffers are flushed.
|
|
|
|
@item system(@var{cmd-line})
|
|
Execute the command @var{cmd-line}, and return the exit status.
|
|
If your operating system does not support @code{system}, calling it will
|
|
generate a fatal error.
|
|
|
|
@samp{system("")} can be used to force @code{awk} to flush any pending
|
|
output. This is more portable, but less obvious, than calling @code{fflush}.
|
|
@end table
|
|
|
|
@node Time Functions Summary, String Constants Summary, Built-in Functions Summary, Actions Summary
|
|
@appendixsubsec Time Functions
|
|
|
|
The following two functions are available for getting the current
|
|
time of day, and for formatting time stamps.
|
|
They are specific to @code{gawk}.
|
|
|
|
@table @code
|
|
@item systime()
|
|
returns the current time of day as the number of seconds since a particular
|
|
epoch (Midnight, January 1, 1970 UTC, on POSIX systems).
|
|
|
|
@item strftime(@r{[}@var{format}@r{[}, @var{timestamp}@r{]]})
|
|
formats @var{timestamp} according to the specification in @var{format}.
|
|
The current time of day is used if no @var{timestamp} is supplied.
|
|
A default format equivalent to the output of the @code{date} utility is used if
|
|
no @var{format} is supplied.
|
|
@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for the
|
|
details on the conversion specifiers that @code{strftime} accepts.
|
|
@end table
|
|
|
|
@iftex
|
|
@xref{Built-in, ,Built-in Functions}, for a description of all of
|
|
@code{awk}'s built-in functions.
|
|
@end iftex
|
|
|
|
@node String Constants Summary, , Time Functions Summary, Actions Summary
|
|
@appendixsubsec String Constants
|
|
|
|
String constants in @code{awk} are sequences of characters enclosed
|
|
in double quotes (@code{"}). Within strings, certain @dfn{escape sequences}
|
|
are recognized, as in C. These are:
|
|
|
|
@table @code
|
|
@item \\
|
|
A literal backslash.
|
|
|
|
@item \a
|
|
The ``alert'' character; usually the ASCII BEL character.
|
|
|
|
@item \b
|
|
Backspace.
|
|
|
|
@item \f
|
|
Formfeed.
|
|
|
|
@item \n
|
|
Newline.
|
|
|
|
@item \r
|
|
Carriage return.
|
|
|
|
@item \t
|
|
Horizontal tab.
|
|
|
|
@item \v
|
|
Vertical tab.
|
|
|
|
@item \x@var{hex digits}
|
|
The character represented by the string of hexadecimal digits following
|
|
the @samp{\x}. As in ANSI C, all following hexadecimal digits are
|
|
considered part of the escape sequence. E.g., @code{"\x1B"} is a
|
|
string containing the ASCII ESC (escape) character. (The @samp{\x}
|
|
escape sequence is not in POSIX @code{awk}.)
|
|
|
|
@item \@var{ddd}
|
|
The character represented by the one, two, or three digit sequence of octal
|
|
digits. Thus, @code{"\033"} is also a string containing the ASCII ESC
|
|
(escape) character.
|
|
|
|
@item \@var{c}
|
|
The literal character @var{c}, if @var{c} is not one of the above.
|
|
@end table
|
|
|
|
The escape sequences may also be used inside constant regular expressions
|
|
(e.g., the regexp @code{@w{/[@ \t\f\n\r\v]/}} matches whitespace
|
|
characters).
|
|
|
|
@xref{Escape Sequences}.
|
|
|
|
@node Functions Summary, Historical Features, Actions Summary, Gawk Summary
|
|
@appendixsec User-defined Functions
|
|
|
|
Functions in @code{awk} are defined as follows:
|
|
|
|
@example
|
|
function @var{name}(@var{parameter list}) @{ @var{statements} @}
|
|
@end example
|
|
|
|
Actual parameters supplied in the function call are used to instantiate
|
|
the formal parameters declared in the function. Arrays are passed by
|
|
reference, other variables are passed by value.
|
|
|
|
If there are fewer arguments passed than there are names in @var{parameter-list},
|
|
the extra names are given the null string as their value. Extra names have the
|
|
effect of local variables.
|
|
|
|
The open-parenthesis in a function call of a user-defined function must
|
|
immediately follow the function name, without any intervening white space.
|
|
This is to avoid a syntactic ambiguity with the concatenation operator.
|
|
|
|
The word @code{func} may be used in place of @code{function} (but not in
|
|
POSIX @code{awk}).
|
|
|
|
Use the @code{return} statement to return a value from a function.
|
|
|
|
@xref{User-defined, ,User-defined Functions}.
|
|
|
|
@node Historical Features, , Functions Summary, Gawk Summary
|
|
@appendixsec Historical Features
|
|
|
|
@cindex historical features
|
|
There are two features of historical @code{awk} implementations that
|
|
@code{gawk} supports.
|
|
|
|
First, it is possible to call the @code{length} built-in function not only
|
|
with no arguments, but even without parentheses!
|
|
|
|
@example
|
|
a = length
|
|
@end example
|
|
|
|
@noindent
|
|
is the same as either of
|
|
|
|
@example
|
|
a = length()
|
|
a = length($0)
|
|
@end example
|
|
|
|
@noindent
|
|
For example:
|
|
|
|
@example
|
|
$ echo abcdef | awk '@{ print length @}'
|
|
@print{} 6
|
|
@end example
|
|
|
|
@noindent
|
|
This feature is marked as ``deprecated'' in the POSIX standard, and
|
|
@code{gawk} will issue a warning about its use if @samp{--lint} is
|
|
specified on the command line.
|
|
(The ability to use @code{length} this way was actually an accident of the
|
|
original Unix @code{awk} implementation. If any built-in function used
|
|
@code{$0} as its default argument, it was possible to call that function
|
|
without the parentheses. In particular, it was common practice to use
|
|
the @code{length} function in this fashion, and this usage was documented
|
|
in the @code{awk} manual page.)
|
|
|
|
The other historical feature is the use of either the @code{break} statement,
|
|
or the @code{continue} statement
|
|
outside the body of a @code{while}, @code{for}, or @code{do} loop. Traditional
|
|
@code{awk} implementations have treated such usage as equivalent to the
|
|
@code{next} statement. More recent versions of Unix @code{awk} do not allow
|
|
it. @code{gawk} supports this usage if @samp{--traditional} has been
|
|
specified.
|
|
|
|
@xref{Options, ,Command Line Options}, for more information about the
|
|
@samp{--posix} and @samp{--lint} options.
|
|
|
|
@node Installation, Notes, Gawk Summary, Top
|
|
@appendix Installing @code{gawk}
|
|
|
|
This appendix provides instructions for installing @code{gawk} on the
|
|
various platforms that are supported by the developers. The primary
|
|
developers support Unix (and one day, GNU), while the other ports were
|
|
contributed. The file @file{ACKNOWLEDGMENT} in the @code{gawk}
|
|
distribution lists the electronic mail addresses of the people who did
|
|
the respective ports, and they are also provided in
|
|
@ref{Bugs, , Reporting Problems and Bugs}.
|
|
|
|
@menu
|
|
* Gawk Distribution:: What is in the @code{gawk} distribution.
|
|
* Unix Installation:: Installing @code{gawk} under various versions
|
|
of Unix.
|
|
* VMS Installation:: Installing @code{gawk} on VMS.
|
|
* PC Installation:: Installing and Compiling @code{gawk} on MS-DOS
|
|
and OS/2
|
|
* Atari Installation:: Installing @code{gawk} on the Atari ST.
|
|
* Amiga Installation:: Installing @code{gawk} on an Amiga.
|
|
* Bugs:: Reporting Problems and Bugs.
|
|
* Other Versions:: Other freely available @code{awk}
|
|
implementations.
|
|
@end menu
|
|
|
|
@node Gawk Distribution, Unix Installation, Installation, Installation
|
|
@appendixsec The @code{gawk} Distribution
|
|
|
|
This section first describes how to get the @code{gawk}
|
|
distribution, how to extract it, and then what is in the various files and
|
|
subdirectories.
|
|
|
|
@menu
|
|
* Getting:: How to get the distribution.
|
|
* Extracting:: How to extract the distribution.
|
|
* Distribution contents:: What is in the distribution.
|
|
@end menu
|
|
|
|
@node Getting, Extracting, Gawk Distribution, Gawk Distribution
|
|
@appendixsubsec Getting the @code{gawk} Distribution
|
|
@cindex getting @code{gawk}
|
|
@cindex anonymous @code{ftp}
|
|
@cindex @code{ftp}, anonymous
|
|
@cindex Free Software Foundation
|
|
There are three ways you can get GNU software.
|
|
|
|
@enumerate
|
|
@item
|
|
You can copy it from someone else who already has it.
|
|
|
|
@cindex Free Software Foundation
|
|
@item
|
|
You can order @code{gawk} directly from the Free Software Foundation.
|
|
Software distributions are available for Unix, MS-DOS, and VMS, on
|
|
tape and CD-ROM. The address is:
|
|
|
|
@quotation
|
|
Free Software Foundation @*
|
|
59 Temple Place---Suite 330 @*
|
|
Boston, MA 02111-1307 USA @*
|
|
Phone: +1-617-542-5942 @*
|
|
Fax (including Japan): +1-617-542-2652 @*
|
|
E-mail: @code{gnu@@gnu.org} @*
|
|
@end quotation
|
|
|
|
@noindent
|
|
Ordering from the FSF directly contributes to the support of the foundation
|
|
and to the production of more free software.
|
|
|
|
@item
|
|
You can get @code{gawk} by using anonymous @code{ftp} to the Internet host
|
|
@code{gnudist.gnu.org}, in the directory @file{/gnu/gawk}.
|
|
|
|
Here is a list of alternate @code{ftp} sites from which you can obtain GNU
|
|
software. When a site is listed as ``@var{site}@code{:}@var{directory}'' the
|
|
@var{directory} indicates the directory where GNU software is kept.
|
|
You should use a site that is geographically close to you.
|
|
|
|
@table @asis
|
|
@item Asia:
|
|
@table @code
|
|
@item cair-archive.kaist.ac.kr:/pub/gnu
|
|
@itemx ftp.cs.titech.ac.jp
|
|
@itemx ftp.nectec.or.th:/pub/mirrors/gnu
|
|
@itemx utsun.s.u-tokyo.ac.jp:/ftpsync/prep
|
|
@end table
|
|
|
|
@item Australia:
|
|
@table @code
|
|
@item archie.au:/gnu
|
|
(@code{archie.oz} or @code{archie.oz.au} for ACSnet)
|
|
@end table
|
|
|
|
@item Africa:
|
|
@table @code
|
|
@item ftp.sun.ac.za:/pub/gnu
|
|
@end table
|
|
|
|
@item Middle East:
|
|
@table @code
|
|
@item ftp.technion.ac.il:/pub/unsupported/gnu
|
|
@end table
|
|
|
|
@item Europe:
|
|
@table @code
|
|
@item archive.eu.net
|
|
@itemx ftp.denet.dk
|
|
@itemx ftp.eunet.ch
|
|
@itemx ftp.funet.fi:/pub/gnu
|
|
@itemx ftp.ieunet.ie:pub/gnu
|
|
@itemx ftp.informatik.rwth-aachen.de:/pub/gnu
|
|
@itemx ftp.informatik.tu-muenchen.de
|
|
@itemx ftp.luth.se:/pub/unix/gnu
|
|
@itemx ftp.mcc.ac.uk
|
|
@itemx ftp.stacken.kth.se
|
|
@itemx ftp.sunet.se:/pub/gnu
|
|
@itemx ftp.univ-lyon1.fr:pub/gnu
|
|
@itemx ftp.win.tue.nl:/pub/gnu
|
|
@itemx irisa.irisa.fr:/pub/gnu
|
|
@itemx isy.liu.se
|
|
@itemx nic.switch.ch:/mirror/gnu
|
|
@itemx src.doc.ic.ac.uk:/gnu
|
|
@itemx unix.hensa.ac.uk:/pub/uunet/systems/gnu
|
|
@end table
|
|
|
|
@item South America:
|
|
@table @code
|
|
@item ftp.inf.utfsm.cl:/pub/gnu
|
|
@itemx ftp.unicamp.br:/pub/gnu
|
|
@end table
|
|
|
|
@item Western Canada:
|
|
@table @code
|
|
@item ftp.cs.ubc.ca:/mirror2/gnu
|
|
@end table
|
|
|
|
@item USA:
|
|
@table @code
|
|
@item col.hp.com:/mirrors/gnu
|
|
@itemx f.ms.uky.edu:/pub3/gnu
|
|
@itemx ftp.cc.gatech.edu:/pub/gnu
|
|
@itemx ftp.cs.columbia.edu:/archives/gnu/prep
|
|
@itemx ftp.digex.net:/pub/gnu
|
|
@itemx ftp.hawaii.edu:/mirrors/gnu
|
|
@itemx ftp.kpc.com:/pub/mirror/gnu
|
|
@end table
|
|
|
|
@c NEEDED
|
|
@page
|
|
@item USA (continued):
|
|
@table @code
|
|
@itemx ftp.uu.net:/systems/gnu
|
|
@itemx gatekeeper.dec.com:/pub/GNU
|
|
@itemx jaguar.utah.edu:/gnustuff
|
|
@itemx labrea.stanford.edu
|
|
@itemx mrcnext.cso.uiuc.edu:/pub/gnu
|
|
@itemx vixen.cso.uiuc.edu:/gnu
|
|
@itemx wuarchive.wustl.edu:/systems/gnu
|
|
@end table
|
|
@end table
|
|
@end enumerate
|
|
|
|
@node Extracting, Distribution contents, Getting, Gawk Distribution
|
|
@appendixsubsec Extracting the Distribution
|
|
@code{gawk} is distributed as a @code{tar} file compressed with the
|
|
GNU Zip program, @code{gzip}.
|
|
|
|
Once you have the distribution (for example,
|
|
@file{gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz}), first use @code{gzip} to expand the
|
|
file, and then use @code{tar} to extract it. You can use the following
|
|
pipeline to produce the @code{gawk} distribution:
|
|
|
|
@example
|
|
# Under System V, add 'o' to the tar flags
|
|
gzip -d -c gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz | tar -xvpf -
|
|
@end example
|
|
|
|
@noindent
|
|
This will create a directory named @file{gawk-@value{VERSION}.@value{PATCHLEVEL}} in the current
|
|
directory.
|
|
|
|
The distribution file name is of the form
|
|
@file{gawk-@var{V}.@var{R}.@var{n}.tar.gz}.
|
|
The @var{V} represents the major version of @code{gawk},
|
|
the @var{R} represents the current release of version @var{V}, and
|
|
the @var{n} represents a @dfn{patch level}, meaning that minor bugs have
|
|
been fixed in the release. The current patch level is @value{PATCHLEVEL},
|
|
but when
|
|
retrieving distributions, you should get the version with the highest
|
|
version, release, and patch level. (Note that release levels greater than
|
|
or equal to 90 denote ``beta,'' or non-production software; you may not wish
|
|
to retrieve such a version unless you don't mind experimenting.)
|
|
|
|
If you are not on a Unix system, you will need to make other arrangements
|
|
for getting and extracting the @code{gawk} distribution. You should consult
|
|
a local expert.
|
|
|
|
@node Distribution contents, , Extracting, Gawk Distribution
|
|
@appendixsubsec Contents of the @code{gawk} Distribution
|
|
|
|
The @code{gawk} distribution has a number of C source files,
|
|
documentation files,
|
|
subdirectories and files related to the configuration process
|
|
(@pxref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}),
|
|
and several subdirectories related to different, non-Unix,
|
|
operating systems.
|
|
|
|
@table @asis
|
|
@item various @samp{.c}, @samp{.y}, and @samp{.h} files
|
|
These files are the actual @code{gawk} source code.
|
|
@end table
|
|
|
|
@table @file
|
|
@item README
|
|
@itemx README_d/README.*
|
|
Descriptive files: @file{README} for @code{gawk} under Unix, and the
|
|
rest for the various hardware and software combinations.
|
|
|
|
@item INSTALL
|
|
A file providing an overview of the configuration and installation process.
|
|
|
|
@item PORTS
|
|
A list of systems to which @code{gawk} has been ported, and which
|
|
have successfully run the test suite.
|
|
|
|
@item ACKNOWLEDGMENT
|
|
A list of the people who contributed major parts of the code or documentation.
|
|
|
|
@item ChangeLog
|
|
A detailed list of source code changes as bugs are fixed or improvements made.
|
|
|
|
@item NEWS
|
|
A list of changes to @code{gawk} since the last release or patch.
|
|
|
|
@item COPYING
|
|
The GNU General Public License.
|
|
|
|
@item FUTURES
|
|
A brief list of features and/or changes being contemplated for future
|
|
releases, with some indication of the time frame for the feature, based
|
|
on its difficulty.
|
|
|
|
@item LIMITATIONS
|
|
A list of those factors that limit @code{gawk}'s performance.
|
|
Most of these depend on the hardware or operating system software, and
|
|
are not limits in @code{gawk} itself.
|
|
|
|
@item POSIX.STD
|
|
A description of one area where the POSIX standard for @code{awk} is
|
|
incorrect, and how @code{gawk} handles the problem.
|
|
|
|
@item PROBLEMS
|
|
A file describing known problems with the current release.
|
|
|
|
@cindex artificial intelligence, using @code{gawk}
|
|
@cindex AI programming, using @code{gawk}
|
|
@item doc/awkforai.txt
|
|
A short article describing why @code{gawk} is a good language for
|
|
AI (Artificial Intelligence) programming.
|
|
|
|
@item doc/README.card
|
|
@itemx doc/ad.block
|
|
@itemx doc/awkcard.in
|
|
@itemx doc/cardfonts
|
|
@itemx doc/colors
|
|
@itemx doc/macros
|
|
@itemx doc/no.colors
|
|
@itemx doc/setter.outline
|
|
The @code{troff} source for a five-color @code{awk} reference card.
|
|
A modern version of @code{troff}, such as GNU Troff (@code{groff}) is
|
|
needed to produce the color version. See the file @file{README.card}
|
|
for instructions if you have an older @code{troff}.
|
|
|
|
@item doc/gawk.1
|
|
The @code{troff} source for a manual page describing @code{gawk}.
|
|
This is distributed for the convenience of Unix users.
|
|
|
|
@item doc/gawk.texi
|
|
The Texinfo source file for this @value{DOCUMENT}.
|
|
It should be processed with @TeX{} to produce a printed document, and
|
|
with @code{makeinfo} to produce an Info file.
|
|
|
|
@item doc/gawk.info
|
|
The generated Info file for this @value{DOCUMENT}.
|
|
|
|
@item doc/igawk.1
|
|
The @code{troff} source for a manual page describing the @code{igawk}
|
|
program presented in
|
|
@ref{Igawk Program, ,An Easy Way to Use Library Functions}.
|
|
|
|
@item doc/Makefile.in
|
|
The input file used during the configuration process to generate the
|
|
actual @file{Makefile} for creating the documentation.
|
|
|
|
@item Makefile.in
|
|
@itemx acconfig.h
|
|
@itemx aclocal.m4
|
|
@itemx configh.in
|
|
@itemx configure.in
|
|
@itemx configure
|
|
@itemx custom.h
|
|
@itemx missing/*
|
|
These files and subdirectory are used when configuring @code{gawk}
|
|
for various Unix systems. They are explained in detail in
|
|
@ref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}.
|
|
|
|
@item awklib/extract.awk
|
|
@itemx awklib/Makefile.in
|
|
The @file{awklib} directory contains a copy of @file{extract.awk}
|
|
(@pxref{Extract Program, ,Extracting Programs from Texinfo Source Files}),
|
|
which can be used to extract the sample programs from the Texinfo
|
|
source file for this @value{DOCUMENT}, and a @file{Makefile.in} file, which
|
|
@code{configure} uses to generate a @file{Makefile}.
|
|
As part of the process of building @code{gawk}, the library functions from
|
|
@ref{Library Functions, , A Library of @code{awk} Functions},
|
|
and the @code{igawk} program from
|
|
@ref{Igawk Program, , An Easy Way to Use Library Functions},
|
|
are extracted into ready to use files.
|
|
They are installed as part of the installation process.
|
|
|
|
@item atari/*
|
|
Files needed for building @code{gawk} on an Atari ST.
|
|
@xref{Atari Installation, ,Installing @code{gawk} on the Atari ST}, for details.
|
|
|
|
@item pc/*
|
|
Files needed for building @code{gawk} under MS-DOS and OS/2.
|
|
@xref{PC Installation, ,MS-DOS and OS/2 Installation and Compilation}, for details.
|
|
|
|
@item vms/*
|
|
Files needed for building @code{gawk} under VMS.
|
|
@xref{VMS Installation, ,How to Compile and Install @code{gawk} on VMS}, for details.
|
|
|
|
@item test/*
|
|
A test suite for
|
|
@code{gawk}. You can use @samp{make check} from the top level @code{gawk}
|
|
directory to run your version of @code{gawk} against the test suite.
|
|
If @code{gawk} successfully passes @samp{make check} then you can
|
|
be confident of a successful port.
|
|
@end table
|
|
|
|
@node Unix Installation, VMS Installation, Gawk Distribution, Installation
|
|
@appendixsec Compiling and Installing @code{gawk} on Unix
|
|
|
|
Usually, you can compile and install @code{gawk} by typing only two
|
|
commands. However, if you do use an unusual system, you may need
|
|
to configure @code{gawk} for your system yourself.
|
|
|
|
@menu
|
|
* Quick Installation:: Compiling @code{gawk} under Unix.
|
|
* Configuration Philosophy:: How it's all supposed to work.
|
|
@end menu
|
|
|
|
@node Quick Installation, Configuration Philosophy, Unix Installation, Unix Installation
|
|
@appendixsubsec Compiling @code{gawk} for Unix
|
|
|
|
@cindex installation, unix
|
|
After you have extracted the @code{gawk} distribution, @code{cd}
|
|
to @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}. Like most GNU software,
|
|
@code{gawk} is configured
|
|
automatically for your Unix system by running the @code{configure} program.
|
|
This program is a Bourne shell script that was generated automatically using
|
|
GNU @code{autoconf}.
|
|
@iftex
|
|
(The @code{autoconf} software is
|
|
described fully in
|
|
@cite{Autoconf---Generating Automatic Configuration Scripts},
|
|
which is available from the Free Software Foundation.)
|
|
@end iftex
|
|
@ifinfo
|
|
(The @code{autoconf} software is described fully starting with
|
|
@ref{Top, , Introduction, autoconf, Autoconf---Generating Automatic Configuration Scripts}.)
|
|
@end ifinfo
|
|
|
|
To configure @code{gawk}, simply run @code{configure}:
|
|
|
|
@example
|
|
sh ./configure
|
|
@end example
|
|
|
|
This produces a @file{Makefile} and @file{config.h} tailored to your system.
|
|
The @file{config.h} file describes various facts about your system.
|
|
You may wish to edit the @file{Makefile} to
|
|
change the @code{CFLAGS} variable, which controls
|
|
the command line options that are passed to the C compiler (such as
|
|
optimization levels, or compiling for debugging).
|
|
|
|
Alternatively, you can add your own values for most @code{make}
|
|
variables, such as @code{CC} and @code{CFLAGS}, on the command line when
|
|
running @code{configure}:
|
|
|
|
@example
|
|
CC=cc CFLAGS=-g sh ./configure
|
|
@end example
|
|
|
|
@noindent
|
|
See the file @file{INSTALL} in the @code{gawk} distribution for
|
|
all the details.
|
|
|
|
After you have run @code{configure}, and possibly edited the @file{Makefile},
|
|
type:
|
|
|
|
@example
|
|
make
|
|
@end example
|
|
|
|
@noindent
|
|
and shortly thereafter, you should have an executable version of @code{gawk}.
|
|
That's all there is to it!
|
|
(If these steps do not work, please send in a bug report;
|
|
@pxref{Bugs, ,Reporting Problems and Bugs}.)
|
|
|
|
@node Configuration Philosophy, , Quick Installation, Unix Installation
|
|
@appendixsubsec The Configuration Process
|
|
|
|
@cindex configuring @code{gawk}
|
|
(This section is of interest only if you know something about using the
|
|
C language and the Unix operating system.)
|
|
|
|
The source code for @code{gawk} generally attempts to adhere to formal
|
|
standards wherever possible. This means that @code{gawk} uses library
|
|
routines that are specified by the ANSI C standard and by the POSIX
|
|
operating system interface standard. When using an ANSI C compiler,
|
|
function prototypes are used to help improve the compile-time checking.
|
|
|
|
Many Unix systems do not support all of either the ANSI or the
|
|
POSIX standards. The @file{missing} subdirectory in the @code{gawk}
|
|
distribution contains replacement versions of those subroutines that are
|
|
most likely to be missing.
|
|
|
|
The @file{config.h} file that is created by the @code{configure} program
|
|
contains definitions that describe features of the particular operating
|
|
system where you are attempting to compile @code{gawk}. The three things
|
|
described by this file are what header files are available, so that
|
|
they can be correctly included,
|
|
what (supposedly) standard functions are actually available in your C
|
|
libraries, and
|
|
other miscellaneous facts about your
|
|
variant of Unix. For example, there may not be an @code{st_blksize}
|
|
element in the @code{stat} structure. In this case @samp{HAVE_ST_BLKSIZE}
|
|
would be undefined.
|
|
|
|
@cindex @code{custom.h} configuration file
|
|
It is possible for your C compiler to lie to @code{configure}. It may
|
|
do so by not exiting with an error when a library function is not
|
|
available. To get around this, you can edit the file @file{custom.h}.
|
|
Use an @samp{#ifdef} that is appropriate for your system, and either
|
|
@code{#define} any constants that @code{configure} should have defined but
|
|
didn't, or @code{#undef} any constants that @code{configure} defined and
|
|
should not have. @file{custom.h} is automatically included by
|
|
@file{config.h}.
|
|
|
|
It is also possible that the @code{configure} program generated by
|
|
@code{autoconf}
|
|
will not work on your system in some other fashion. If you do have a problem,
|
|
the file
|
|
@file{configure.in} is the input for @code{autoconf}. You may be able to
|
|
change this file, and generate a new version of @code{configure} that will
|
|
work on your system. @xref{Bugs, ,Reporting Problems and Bugs}, for
|
|
information on how to report problems in configuring @code{gawk}. The same
|
|
mechanism may be used to send in updates to @file{configure.in} and/or
|
|
@file{custom.h}.
|
|
|
|
@node VMS Installation, PC Installation, Unix Installation, Installation
|
|
@appendixsec How to Compile and Install @code{gawk} on VMS
|
|
|
|
@c based on material from Pat Rankin <rankin@eql.caltech.edu>
|
|
|
|
@cindex installation, vms
|
|
This section describes how to compile and install @code{gawk} under VMS.
|
|
|
|
@menu
|
|
* VMS Compilation:: How to compile @code{gawk} under VMS.
|
|
* VMS Installation Details:: How to install @code{gawk} under VMS.
|
|
* VMS Running:: How to run @code{gawk} under VMS.
|
|
* VMS POSIX:: Alternate instructions for VMS POSIX.
|
|
@end menu
|
|
|
|
@node VMS Compilation, VMS Installation Details, VMS Installation, VMS Installation
|
|
@appendixsubsec Compiling @code{gawk} on VMS
|
|
|
|
To compile @code{gawk} under VMS, there is a @code{DCL} command procedure that
|
|
will issue all the necessary @code{CC} and @code{LINK} commands, and there is
|
|
also a @file{Makefile} for use with the @code{MMS} utility. From the source
|
|
directory, use either
|
|
|
|
@example
|
|
$ @@[.VMS]VMSBUILD.COM
|
|
@end example
|
|
|
|
@noindent
|
|
or
|
|
|
|
@example
|
|
$ MMS/DESCRIPTION=[.VMS]DESCRIP.MMS GAWK
|
|
@end example
|
|
|
|
Depending upon which C compiler you are using, follow one of the sets
|
|
of instructions in this table:
|
|
|
|
@table @asis
|
|
@item VAX C V3.x
|
|
Use either @file{vmsbuild.com} or @file{descrip.mms} as is. These use
|
|
@code{CC/OPTIMIZE=NOLINE}, which is essential for Version 3.0.
|
|
|
|
@item VAX C V2.x
|
|
You must have Version 2.3 or 2.4; older ones won't work. Edit either
|
|
@file{vmsbuild.com} or @file{descrip.mms} according to the comments in them.
|
|
For @file{vmsbuild.com}, this just entails removing two @samp{!} delimiters.
|
|
Also edit @file{config.h} (which is a copy of file @file{[.config]vms-conf.h})
|
|
and comment out or delete the two lines @samp{#define __STDC__ 0} and
|
|
@samp{#define VAXC_BUILTINS} near the end.
|
|
|
|
@item GNU C
|
|
Edit @file{vmsbuild.com} or @file{descrip.mms}; the changes are different
|
|
from those for VAX C V2.x, but equally straightforward. No changes to
|
|
@file{config.h} should be needed.
|
|
|
|
@item DEC C
|
|
Edit @file{vmsbuild.com} or @file{descrip.mms} according to their comments.
|
|
No changes to @file{config.h} should be needed.
|
|
@end table
|
|
|
|
@code{gawk} has been tested under VAX/VMS 5.5-1 using VAX C V3.2,
|
|
GNU C 1.40 and 2.3. It should work without modifications for VMS V4.6 and up.
|
|
|
|
@node VMS Installation Details, VMS Running, VMS Compilation, VMS Installation
|
|
@appendixsubsec Installing @code{gawk} on VMS
|
|
|
|
To install @code{gawk}, all you need is a ``foreign'' command, which is
|
|
a @code{DCL} symbol whose value begins with a dollar sign. For example:
|
|
|
|
@example
|
|
$ GAWK :== $disk1:[gnubin]GAWK
|
|
@end example
|
|
|
|
@noindent
|
|
(Substitute the actual location of @code{gawk.exe} for
|
|
@samp{$disk1:[gnubin]}.) The symbol should be placed in the
|
|
@file{login.com} of any user who wishes to run @code{gawk},
|
|
so that it will be defined every time the user logs on.
|
|
Alternatively, the symbol may be placed in the system-wide
|
|
@file{sylogin.com} procedure, which will allow all users
|
|
to run @code{gawk}.
|
|
|
|
Optionally, the help entry can be loaded into a VMS help library:
|
|
|
|
@example
|
|
$ LIBRARY/HELP SYS$HELP:HELPLIB [.VMS]GAWK.HLP
|
|
@end example
|
|
|
|
@noindent
|
|
(You may want to substitute a site-specific help library rather than
|
|
the standard VMS library @samp{HELPLIB}.) After loading the help text,
|
|
|
|
@example
|
|
$ HELP GAWK
|
|
@end example
|
|
|
|
@noindent
|
|
will provide information about both the @code{gawk} implementation and the
|
|
@code{awk} programming language.
|
|
|
|
The logical name @samp{AWK_LIBRARY} can designate a default location
|
|
for @code{awk} program files. For the @samp{-f} option, if the specified
|
|
filename has no device or directory path information in it, @code{gawk}
|
|
will look in the current directory first, then in the directory specified
|
|
by the translation of @samp{AWK_LIBRARY} if the file was not found.
|
|
If after searching in both directories, the file still is not found,
|
|
then @code{gawk} appends the suffix @samp{.awk} to the filename and the
|
|
file search will be re-tried. If @samp{AWK_LIBRARY} is not defined, that
|
|
portion of the file search will fail benignly.
|
|
|
|
@node VMS Running, VMS POSIX, VMS Installation Details, VMS Installation
|
|
@appendixsubsec Running @code{gawk} on VMS
|
|
|
|
Command line parsing and quoting conventions are significantly different
|
|
on VMS, so examples in this @value{DOCUMENT} or from other sources often need minor
|
|
changes. They @emph{are} minor though, and all @code{awk} programs
|
|
should run correctly.
|
|
|
|
Here are a couple of trivial tests:
|
|
|
|
@example
|
|
$ gawk -- "BEGIN @{print ""Hello, World!""@}"
|
|
$ gawk -"W" version
|
|
! could also be -"W version" or "-W version"
|
|
@end example
|
|
|
|
@noindent
|
|
Note that upper-case and mixed-case text must be quoted.
|
|
|
|
The VMS port of @code{gawk} includes a @code{DCL}-style interface in addition
|
|
to the original shell-style interface (see the help entry for details).
|
|
One side-effect of dual command line parsing is that if there is only a
|
|
single parameter (as in the quoted string program above), the command
|
|
becomes ambiguous. To work around this, the normally optional @samp{--}
|
|
flag is required to force Unix style rather than @code{DCL} parsing. If any
|
|
other dash-type options (or multiple parameters such as data files to be
|
|
processed) are present, there is no ambiguity and @samp{--} can be omitted.
|
|
|
|
The default search path when looking for @code{awk} program files specified
|
|
by the @samp{-f} option is @code{"SYS$DISK:[],AWK_LIBRARY:"}. The logical
|
|
name @samp{AWKPATH} can be used to override this default. The format
|
|
of @samp{AWKPATH} is a comma-separated list of directory specifications.
|
|
When defining it, the value should be quoted so that it retains a single
|
|
translation, and not a multi-translation @code{RMS} searchlist.
|
|
|
|
@node VMS POSIX, , VMS Running, VMS Installation
|
|
@appendixsubsec Building and Using @code{gawk} on VMS POSIX
|
|
|
|
Ignore the instructions above, although @file{vms/gawk.hlp} should still
|
|
be made available in a help library. The source tree should be unpacked
|
|
into a container file subsystem rather than into the ordinary VMS file
|
|
system. Make sure that the two scripts, @file{configure} and
|
|
@file{vms/posix-cc.sh}, are executable; use @samp{chmod +x} on them if
|
|
necessary. Then execute the following two commands:
|
|
|
|
@example
|
|
@group
|
|
psx> CC=vms/posix-cc.sh configure
|
|
psx> make CC=c89 gawk
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
The first command will construct files @file{config.h} and @file{Makefile} out
|
|
of templates, using a script to make the C compiler fit @code{configure}'s
|
|
expectations. The second command will compile and link @code{gawk} using
|
|
the C compiler directly; ignore any warnings from @code{make} about being
|
|
unable to redefine @code{CC}. @code{configure} will take a very long
|
|
time to execute, but at least it provides incremental feedback as it
|
|
runs.
|
|
|
|
This has been tested with VAX/VMS V6.2, VMS POSIX V2.0, and DEC C V5.2.
|
|
|
|
Once built, @code{gawk} will work like any other shell utility. Unlike
|
|
the normal VMS port of @code{gawk}, no special command line manipulation is
|
|
needed in the VMS POSIX environment.
|
|
|
|
@c Rewritten by Scott Deifik <scottd@amgen.com>
|
|
@c and Darrel Hankerson <hankedr@mail.auburn.edu>
|
|
@node PC Installation, Atari Installation, VMS Installation, Installation
|
|
@appendixsec MS-DOS and OS/2 Installation and Compilation
|
|
|
|
@cindex installation, MS-DOS and OS/2
|
|
If you have received a binary distribution prepared by the DOS
|
|
maintainers, then @code{gawk} and the necessary support files will appear
|
|
under the @file{gnu} directory, with executables in @file{gnu/bin},
|
|
libraries in @file{gnu/lib/awk}, and manual pages under @file{gnu/man}.
|
|
This is designed for easy installation to a @file{/gnu} directory on your
|
|
drive, but the files can be installed anywhere provided @code{AWKPATH} is
|
|
set properly. Regardless of the installation directory, the first line of
|
|
@file{igawk.cmd} and @file{igawk.bat} (in @file{gnu/bin}) may need to be
|
|
edited.
|
|
|
|
The binary distribution will contain a separate file describing the
|
|
contents. In particular, it may include more than one version of the
|
|
@code{gawk} executable. OS/2 binary distributions may have a
|
|
different arrangement, but installation is similar.
|
|
|
|
The OS/2 and MS-DOS versions of @code{gawk} search for program files as
|
|
described in @ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
|
|
However, semicolons (rather than colons) separate elements
|
|
in the @code{AWKPATH} variable. If @code{AWKPATH} is not set or is empty,
|
|
then the default search path is @code{@w{".;c:/lib/awk;c:/gnu/lib/awk"}}.
|
|
|
|
An @code{sh}-like shell (as opposed to @code{command.com} under MS-DOS
|
|
or @code{cmd.exe} under OS/2) may be useful for @code{awk} programming.
|
|
Ian Stewartson has written an excellent shell for MS-DOS and OS/2, and a
|
|
@code{ksh} clone and GNU Bash are available for OS/2. The file
|
|
@file{README_d/README.pc} in the @code{gawk} distribution contains
|
|
information on these shells. Users of Stewartson's shell on DOS should
|
|
examine its documentation on handling of command-lines. In particular,
|
|
the setting for @code{gawk} in the shell configuration may need to be
|
|
changed, and the @code{ignoretype} option may also be of interest.
|
|
|
|
@code{gawk} can be compiled for MS-DOS and OS/2 using the GNU development tools
|
|
from DJ Delorie (DJGPP, MS-DOS-only) or Eberhard Mattes (EMX, MS-DOS and OS/2).
|
|
Microsoft C can be used to build 16-bit versions for MS-DOS and OS/2. The file
|
|
@file{README_d/README.pc} in the @code{gawk} distribution contains additional
|
|
notes, and @file{pc/Makefile} contains important notes on compilation options.
|
|
|
|
To build @code{gawk}, copy the files in the @file{pc} directory (@emph{except}
|
|
for @file{ChangeLog}) to the
|
|
directory with the rest of the @code{gawk} sources. The @file{Makefile}
|
|
contains a configuration section with comments, and may need to be
|
|
edited in order to work with your @code{make} utility.
|
|
|
|
The @file{Makefile} contains a number of targets for building various MS-DOS
|
|
and OS/2 versions. A list of targets will be printed if the @code{make}
|
|
command is given without a target. As an example, to build @code{gawk}
|
|
using the DJGPP tools, enter @samp{make djgpp}.
|
|
|
|
Using @code{make} to run the standard tests and to install @code{gawk}
|
|
requires additional Unix-like tools, including @code{sh}, @code{sed}, and
|
|
@code{cp}. In order to run the tests, the @file{test/*.ok} files may need to
|
|
be converted so that they have the usual DOS-style end-of-line markers. Most
|
|
of the tests will work properly with Stewartson's shell along with the
|
|
companion utilities or appropriate GNU utilities. However, some editing of
|
|
@file{test/Makefile} is required. It is recommended that the file
|
|
@file{pc/Makefile.tst} be copied to @file{test/Makefile} as a
|
|
replacement. Details can be found in @file{README_d/README.pc}.
|
|
|
|
@node Atari Installation, Amiga Installation, PC Installation, Installation
|
|
@appendixsec Installing @code{gawk} on the Atari ST
|
|
|
|
@c based on material from Michal Jaegermann <michal@gortel.phys.ualberta.ca>
|
|
|
|
@cindex atari
|
|
@cindex installation, atari
|
|
There are no substantial differences when installing @code{gawk} on
|
|
various Atari models. Compiled @code{gawk} executables do not require
|
|
a large amount of memory with most @code{awk} programs and should run on all
|
|
Motorola processor based models (called further ST, even if that is not
|
|
exactly right).
|
|
|
|
In order to use @code{gawk}, you need to have a shell, either text or
|
|
graphics, that does not map all the characters of a command line to
|
|
upper-case. Maintaining case distinction in option flags is very
|
|
important (@pxref{Options, ,Command Line Options}).
|
|
These days this is the default, and it may only be a problem for some
|
|
very old machines. If your system does not preserve the case of option
|
|
flags, you will need to upgrade your tools. Support for I/O
|
|
redirection is necessary to make it easy to import @code{awk} programs
|
|
from other environments. Pipes are nice to have, but not vital.
|
|
|
|
@menu
|
|
* Atari Compiling:: Compiling @code{gawk} on Atari
|
|
* Atari Using:: Running @code{gawk} on Atari
|
|
@end menu
|
|
|
|
@node Atari Compiling, Atari Using, Atari Installation, Atari Installation
|
|
@appendixsubsec Compiling @code{gawk} on the Atari ST
|
|
|
|
A proper compilation of @code{gawk} sources when @code{sizeof(int)}
|
|
differs from @code{sizeof(void *)} requires an ANSI C compiler. An initial
|
|
port was done with @code{gcc}. You may actually prefer executables
|
|
where @code{int}s are four bytes wide, but the other variant works as well.
|
|
|
|
You may need quite a bit of memory when trying to recompile the @code{gawk}
|
|
sources, as some source files (@file{regex.c} in particular) are quite
|
|
big. If you run out of memory compiling such a file, try reducing the
|
|
optimization level for this particular file; this may help.
|
|
|
|
@cindex Linux
|
|
With a reasonable shell (Bash will do), and in particular if you run
|
|
Linux, MiNT or a similar operating system, you have a pretty good
|
|
chance that the @code{configure} utility will succeed. Otherwise
|
|
sample versions of @file{config.h} and @file{Makefile.st} are given in the
|
|
@file{atari} subdirectory and can be edited and copied to the
|
|
corresponding files in the main source directory. Even if
|
|
@code{configure} produced something, it might be advisable to compare
|
|
its results with the sample versions and possibly make adjustments.
|
|
|
|
Some @code{gawk} source code fragments depend on a preprocessor define
|
|
@samp{atarist}. This basically assumes the TOS environment with @code{gcc}.
|
|
Modify these sections as appropriate if they are not right for your
|
|
environment. Also see the remarks about @code{AWKPATH} and @code{envsep} in
|
|
@ref{Atari Using, ,Running @code{gawk} on the Atari ST}.
|
|
|
|
As shipped, the sample @file{config.h} claims that the @code{system}
|
|
function is missing from the libraries, which is not true, and an
|
|
alternative implementation of this function is provided in
|
|
@file{atari/system.c}. Depending upon your particular combination of
|
|
shell and operating system, you may wish to change the file to indicate
|
|
that @code{system} is available.
|
|
|
|
@node Atari Using, , Atari Compiling, Atari Installation
|
|
@appendixsubsec Running @code{gawk} on the Atari ST
|
|
|
|
An executable version of @code{gawk} should be placed, as usual,
|
|
anywhere in your @code{PATH} where your shell can find it.
|
|
|
|
While executing, @code{gawk} creates a number of temporary files. When
|
|
using @code{gcc} libraries for TOS, @code{gawk} looks for either of
|
|
the environment variables @code{TEMP} or @code{TMPDIR}, in that order.
|
|
If either one is found, its value is assumed to be a directory for
|
|
temporary files. This directory must exist, and if you can spare the
|
|
memory, it is a good idea to put it on a RAM drive. If neither
|
|
@code{TEMP} nor @code{TMPDIR} are found, then @code{gawk} uses the
|
|
current directory for its temporary files.
|
|
|
|
The ST version of @code{gawk} searches for its program files as described in
|
|
@ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
|
|
The default value for the @code{AWKPATH} variable is taken from
|
|
@code{DEFPATH} defined in @file{Makefile}. The sample @code{gcc}/TOS
|
|
@file{Makefile} for the ST in the distribution sets @code{DEFPATH} to
|
|
@code{@w{".,c:\lib\awk,c:\gnu\lib\awk"}}. The search path can be
|
|
modified by explicitly setting @code{AWKPATH} to whatever you wish.
|
|
Note that colons cannot be used on the ST to separate elements in the
|
|
@code{AWKPATH} variable, since they have another, reserved, meaning.
|
|
Instead, you must use a comma to separate elements in the path. When
|
|
recompiling, the separating character can be modified by initializing
|
|
the @code{envsep} variable in @file{atari/gawkmisc.atr} to another
|
|
value.
|
|
|
|
Although @code{awk} allows great flexibility in doing I/O redirections
|
|
from within a program, this facility should be used with care on the ST
|
|
running under TOS. In some circumstances the OS routines for file
|
|
handle pool processing lose track of certain events, causing the
|
|
computer to crash, and requiring a reboot. Often a warm reboot is
|
|
sufficient. Fortunately, this happens infrequently, and in rather
|
|
esoteric situations. In particular, avoid having one part of an
|
|
@code{awk} program using @code{print} statements explicitly redirected
|
|
to @code{"/dev/stdout"}, while other @code{print} statements use the
|
|
default standard output, and a calling shell has redirected standard
|
|
output to a file.
|
|
|
|
When @code{gawk} is compiled with the ST version of @code{gcc} and its
|
|
usual libraries, it will accept both @samp{/} and @samp{\} as path separators.
|
|
While this is convenient, it should be remembered that this removes one,
|
|
technically valid, character (@samp{/}) from your file names, and that
|
|
it may create problems for external programs, called via the @code{system}
|
|
function, which may not support this convention. Whenever it is possible
|
|
that a file created by @code{gawk} will be used by some other program,
|
|
use only backslashes. Also remember that in @code{awk}, backslashes in
|
|
strings have to be doubled in order to get literal backslashes
|
|
(@pxref{Escape Sequences}).
|
|
|
|
@node Amiga Installation, Bugs, Atari Installation, Installation
|
|
@appendixsec Installing @code{gawk} on an Amiga
|
|
|
|
@cindex amiga
|
|
@cindex installation, amiga
|
|
You can install @code{gawk} on an Amiga system using a Unix emulation
|
|
environment available via anonymous @code{ftp} from
|
|
@code{ftp.ninemoons.com} in the directory @file{pub/ade/current}.
|
|
This includes a shell based on @code{pdksh}. The primary component of
|
|
this environment is a Unix emulation library, @file{ixemul.lib}.
|
|
@c could really use more background here, who wrote this, etc.
|
|
|
|
A more complete distribution for the Amiga is available on
|
|
the Geek Gadgets CD-ROM from:
|
|
|
|
@quotation
|
|
CRONUS @*
|
|
1840 E. Warner Road #105-265 @*
|
|
Tempe, AZ 85284 USA @*
|
|
US Toll Free: (800) 804-0833 @*
|
|
Phone: +1-602-491-0442 @*
|
|
FAX: +1-602-491-0048 @*
|
|
Email: @code{info@@ninemoons.com} @*
|
|
WWW: @code{http://www.ninemoons.com} @*
|
|
Anonymous @code{ftp} site: @code{ftp.ninemoons.com} @*
|
|
@end quotation
|
|
|
|
Once you have the distribution, you can configure @code{gawk} simply by
|
|
running @code{configure}:
|
|
|
|
@example
|
|
configure -v m68k-amigaos
|
|
@end example
|
|
|
|
Then run @code{make}, and you should be all set!
|
|
(If these steps do not work, please send in a bug report;
|
|
@pxref{Bugs, ,Reporting Problems and Bugs}.)
|
|
|
|
@node Bugs, Other Versions, Amiga Installation, Installation
|
|
@appendixsec Reporting Problems and Bugs
|
|
@display
|
|
@i{There is nothing more dangerous than a bored archeologist.}
|
|
The Hitchhiker's Guide to the Galaxy
|
|
@c the radio show, not the book. :-)
|
|
@end display
|
|
@sp 1
|
|
|
|
If you have problems with @code{gawk} or think that you have found a bug,
|
|
please report it to the developers; we cannot promise to do anything
|
|
but we might well want to fix it.
|
|
|
|
Before reporting a bug, make sure you have actually found a real bug.
|
|
Carefully reread the documentation and see if it really says you can do
|
|
what you're trying to do. If it's not clear whether you should be able
|
|
to do something or not, report that too; it's a bug in the documentation!
|
|
|
|
Before reporting a bug or trying to fix it yourself, try to isolate it
|
|
to the smallest possible @code{awk} program and input data file that
|
|
reproduces the problem. Then send us the program and data file,
|
|
some idea of what kind of Unix system you're using, and the exact results
|
|
@code{gawk} gave you. Also say what you expected to occur; this will help
|
|
us decide whether the problem was really in the documentation.
|
|
|
|
Once you have a precise problem, there are two e-mail addresses you
|
|
can send mail to.
|
|
|
|
@table @asis
|
|
@item Internet:
|
|
@samp{bug-gnu-utils@@gnu.org}
|
|
|
|
@item UUCP:
|
|
@samp{uunet!gnu.org!bug-gnu-utils}
|
|
@end table
|
|
|
|
Please include the
|
|
version number of @code{gawk} you are using. You can get this information
|
|
with the command @samp{gawk --version}.
|
|
You should send a carbon copy of your mail to Arnold Robbins, who can
|
|
be reached at @samp{arnold@@gnu.org}.
|
|
|
|
@cindex @code{comp.lang.awk}
|
|
@strong{Important!} Do @emph{not} try to report bugs in @code{gawk} by
|
|
posting to the Usenet/Internet newsgroup @code{comp.lang.awk}.
|
|
While the @code{gawk} developers do occasionally read this newsgroup,
|
|
there is no guarantee that we will see your posting. The steps described
|
|
above are the official, recognized ways for reporting bugs.
|
|
|
|
Non-bug suggestions are always welcome as well. If you have questions
|
|
about things that are unclear in the documentation or are just obscure
|
|
features, ask Arnold Robbins; he will try to help you out, although he
|
|
may not have the time to fix the problem. You can send him electronic
|
|
mail at the Internet address above.
|
|
|
|
If you find bugs in one of the non-Unix ports of @code{gawk}, please send
|
|
an electronic mail message to the person who maintains that port. They
|
|
are listed below, and also in the @file{README} file in the @code{gawk}
|
|
distribution. Information in the @file{README} file should be considered
|
|
authoritative if it conflicts with this @value{DOCUMENT}.
|
|
|
|
@c NEEDED for looks
|
|
@page
|
|
The people maintaining the non-Unix ports of @code{gawk} are:
|
|
|
|
@cindex Deifik, Scott
|
|
@cindex Fish, Fred
|
|
@cindex Hankerson, Darrel
|
|
@cindex Jaegermann, Michal
|
|
@cindex Rankin, Pat
|
|
@cindex Rommel, Kai Uwe
|
|
@table @asis
|
|
@item MS-DOS
|
|
Scott Deifik, @samp{scottd@@amgen.com}, and
|
|
Darrel Hankerson, @samp{hankedr@@mail.auburn.edu}.
|
|
|
|
@item OS/2
|
|
Kai Uwe Rommel, @samp{rommel@@ars.de}.
|
|
|
|
@item VMS
|
|
Pat Rankin, @samp{rankin@@eql.caltech.edu}.
|
|
|
|
@item Atari ST
|
|
Michal Jaegermann, @samp{michal@@gortel.phys.ualberta.ca}.
|
|
|
|
@item Amiga
|
|
Fred Fish, @samp{fnf@@ninemoons.com}.
|
|
@end table
|
|
|
|
If your bug is also reproducible under Unix, please send copies of your
|
|
report to the general GNU bug list, as well as to Arnold Robbins, at the
|
|
addresses listed above.
|
|
|
|
@node Other Versions, , Bugs, Installation
|
|
@appendixsec Other Freely Available @code{awk} Implementations
|
|
@cindex Brennan, Michael
|
|
@ignore
|
|
From: emory!amc.com!brennan (Michael Brennan)
|
|
Subject: C++ comments in awk programs
|
|
To: arnold@gnu.ai.mit.edu (Arnold Robbins)
|
|
Date: Wed, 4 Sep 1996 08:11:48 -0700 (PDT)
|
|
|
|
@end ignore
|
|
@display
|
|
@i{It's kind of fun to put comments like this in your awk code.}
|
|
@code{// Do C++ comments work? answer: yes! of course}
|
|
Michael Brennan
|
|
@end display
|
|
@sp 1
|
|
|
|
There are two other freely available @code{awk} implementations.
|
|
This section briefly describes where to get them.
|
|
|
|
@table @asis
|
|
@cindex Kernighan, Brian
|
|
@cindex anonymous @code{ftp}
|
|
@cindex @code{ftp}, anonymous
|
|
@item Unix @code{awk}
|
|
Brian Kernighan has been able to make his implementation of
|
|
@code{awk} freely available. You can get it via anonymous @code{ftp}
|
|
to the host @code{@w{netlib.bell-labs.com}}. Change directory to
|
|
@file{/netlib/research}. Use ``binary'' or ``image'' mode, and
|
|
retrieve @file{awk.bundle.gz}.
|
|
|
|
This is a shell archive that has been compressed with the GNU @code{gzip}
|
|
utility. It can be uncompressed with the @code{gunzip} utility.
|
|
|
|
You can also retrieve this version via the World Wide Web from
|
|
@uref{http://cm.bell-labs.com/who/bwk, Brian Kernighan's home page}.
|
|
|
|
This version requires an ANSI C compiler; GCC (the GNU C compiler)
|
|
works quite nicely.
|
|
|
|
@cindex Brennan, Michael
|
|
@cindex @code{mawk}
|
|
@item @code{mawk}
|
|
Michael Brennan has written an independent implementation of @code{awk},
|
|
called @code{mawk}. It is available under the GPL
|
|
(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),
|
|
just as @code{gawk} is.
|
|
|
|
You can get it via anonymous @code{ftp} to the host
|
|
@code{@w{ftp.whidbey.net}}. Change directory to @file{/pub/brennan}.
|
|
Use ``binary'' or ``image'' mode, and retrieve @file{mawk1.3.3.tar.gz}
|
|
(or the latest version that is there).
|
|
|
|
@code{gunzip} may be used to decompress this file. Installation
|
|
is similar to @code{gawk}'s
|
|
(@pxref{Unix Installation, , Compiling and Installing @code{gawk} on Unix}).
|
|
@end table
|
|
|
|
@node Notes, Glossary, Installation, Top
|
|
@appendix Implementation Notes
|
|
|
|
This appendix contains information mainly of interest to implementors and
|
|
maintainers of @code{gawk}. Everything in it applies specifically to
|
|
@code{gawk}, and not to other implementations.
|
|
|
|
@menu
|
|
* Compatibility Mode:: How to disable certain @code{gawk} extensions.
|
|
* Additions:: Making Additions To @code{gawk}.
|
|
* Future Extensions:: New features that may be implemented one day.
|
|
* Improvements:: Suggestions for improvements by volunteers.
|
|
@end menu
|
|
|
|
@node Compatibility Mode, Additions, Notes, Notes
|
|
@appendixsec Downward Compatibility and Debugging
|
|
|
|
@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}},
|
|
for a summary of the GNU extensions to the @code{awk} language and program.
|
|
All of these features can be turned off by invoking @code{gawk} with the
|
|
@samp{--traditional} option, or with the @samp{--posix} option.
|
|
|
|
If @code{gawk} is compiled for debugging with @samp{-DDEBUG}, then there
|
|
is one more option available on the command line:
|
|
|
|
@table @code
|
|
@item -W parsedebug
|
|
@itemx --parsedebug
|
|
Print out the parse stack information as the program is being parsed.
|
|
@end table
|
|
|
|
This option is intended only for serious @code{gawk} developers,
|
|
and not for the casual user. It probably has not even been compiled into
|
|
your version of @code{gawk}, since it slows down execution.
|
|
|
|
@node Additions, Future Extensions, Compatibility Mode, Notes
|
|
@appendixsec Making Additions to @code{gawk}
|
|
|
|
If you should find that you wish to enhance @code{gawk} in a significant
|
|
fashion, you are perfectly free to do so. That is the point of having
|
|
free software; the source code is available, and you are free to change
|
|
it as you wish (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
|
|
|
|
This section discusses the ways you might wish to change @code{gawk},
|
|
and any considerations you should bear in mind.
|
|
|
|
@menu
|
|
* Adding Code:: Adding code to the main body of @code{gawk}.
|
|
* New Ports:: Porting @code{gawk} to a new operating system.
|
|
@end menu
|
|
|
|
@node Adding Code, New Ports, Additions, Additions
|
|
@appendixsubsec Adding New Features
|
|
|
|
@cindex adding new features
|
|
@cindex features, adding
|
|
You are free to add any new features you like to @code{gawk}.
|
|
However, if you want your changes to be incorporated into the @code{gawk}
|
|
distribution, there are several steps that you need to take in order to
|
|
make it possible for me to include your changes.
|
|
|
|
@enumerate 1
|
|
@item
|
|
Get the latest version.
|
|
It is much easier for me to integrate changes if they are relative to
|
|
the most recent distributed version of @code{gawk}. If your version of
|
|
@code{gawk} is very old, I may not be able to integrate them at all.
|
|
@xref{Getting, ,Getting the @code{gawk} Distribution},
|
|
for information on getting the latest version of @code{gawk}.
|
|
|
|
@item
|
|
@iftex
|
|
Follow the @cite{GNU Coding Standards}.
|
|
@end iftex
|
|
@ifinfo
|
|
See @inforef{Top, , Version, standards, GNU Coding Standards}.
|
|
@end ifinfo
|
|
This document describes how GNU software should be written. If you haven't
|
|
read it, please do so, preferably @emph{before} starting to modify @code{gawk}.
|
|
(The @cite{GNU Coding Standards} are available as part of the Autoconf
|
|
distribution, from the FSF.)
|
|
|
|
@cindex @code{gawk} coding style
|
|
@cindex coding style used in @code{gawk}
|
|
@item
|
|
Use the @code{gawk} coding style.
|
|
The C code for @code{gawk} follows the instructions in the
|
|
@cite{GNU Coding Standards}, with minor exceptions. The code is formatted
|
|
using the traditional ``K&R'' style, particularly as regards the placement
|
|
of braces and the use of tabs. In brief, the coding rules for @code{gawk}
|
|
are:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Use old style (non-prototype) function headers when defining functions.
|
|
|
|
@item
|
|
Put the name of the function at the beginning of its own line.
|
|
|
|
@item
|
|
Put the return type of the function, even if it is @code{int}, on the
|
|
line above the line with the name and arguments of the function.
|
|
|
|
@item
|
|
The declarations for the function arguments should not be indented.
|
|
|
|
@item
|
|
Put spaces around parentheses used in control structures
|
|
(@code{if}, @code{while}, @code{for}, @code{do}, @code{switch}
|
|
and @code{return}).
|
|
|
|
@item
|
|
Do not put spaces in front of parentheses used in function calls.
|
|
|
|
@item
|
|
Put spaces around all C operators, and after commas in function calls.
|
|
|
|
@item
|
|
Do not use the comma operator to produce multiple side-effects, except
|
|
in @code{for} loop initialization and increment parts, and in macro bodies.
|
|
|
|
@item
|
|
Use real tabs for indenting, not spaces.
|
|
|
|
@item
|
|
Use the ``K&R'' brace layout style.
|
|
|
|
@item
|
|
Use comparisons against @code{NULL} and @code{'\0'} in the conditions of
|
|
@code{if}, @code{while} and @code{for} statements, and in the @code{case}s
|
|
of @code{switch} statements, instead of just the
|
|
plain pointer or character value.
|
|
|
|
@item
|
|
Use the @code{TRUE}, @code{FALSE}, and @code{NULL} symbolic constants,
|
|
and the character constant @code{'\0'} where appropriate, instead of @code{1}
|
|
and @code{0}.
|
|
|
|
@item
|
|
Provide one-line descriptive comments for each function.
|
|
|
|
@item
|
|
Do not use @samp{#elif}. Many older Unix C compilers cannot handle it.
|
|
|
|
@item
|
|
Do not use the @code{alloca} function for allocating memory off the stack.
|
|
Its use causes more portability trouble than the minor benefit of not having
|
|
to free the storage. Instead, use @code{malloc} and @code{free}.
|
|
@end itemize
|
|
|
|
If I have to reformat your code to follow the coding style used in
|
|
@code{gawk}, I may not bother.
|
|
|
|
@item
|
|
Be prepared to sign the appropriate paperwork.
|
|
In order for the FSF to distribute your changes, you must either place
|
|
those changes in the public domain, and submit a signed statement to that
|
|
effect, or assign the copyright in your changes to the FSF.
|
|
Both of these actions are easy to do, and @emph{many} people have done so
|
|
already. If you have questions, please contact me
|
|
(@pxref{Bugs, , Reporting Problems and Bugs}),
|
|
or @code{gnu@@gnu.org}.
|
|
|
|
@item
|
|
Update the documentation.
|
|
Along with your new code, please supply new sections and or chapters
|
|
for this @value{DOCUMENT}. If at all possible, please use real
|
|
Texinfo, instead of just supplying unformatted ASCII text (although
|
|
even that is better than no documentation at all).
|
|
Conventions to be followed in @cite{@value{TITLE}} are provided
|
|
after the @samp{@@bye} at the end of the Texinfo source file.
|
|
If possible, please update the man page as well.
|
|
|
|
You will also have to sign paperwork for your documentation changes.
|
|
|
|
@item
|
|
Submit changes as context diffs or unified diffs.
|
|
Use @samp{diff -c -r -N} or @samp{diff -u -r -N} to compare
|
|
the original @code{gawk} source tree with your version.
|
|
(I find context diffs to be more readable, but unified diffs are
|
|
more compact.)
|
|
I recommend using the GNU version of @code{diff}.
|
|
Send the output produced by either run of @code{diff} to me when you
|
|
submit your changes.
|
|
@xref{Bugs, , Reporting Problems and Bugs}, for the electronic mail
|
|
information.
|
|
|
|
Using this format makes it easy for me to apply your changes to the
|
|
master version of the @code{gawk} source code (using @code{patch}).
|
|
If I have to apply the changes manually, using a text editor, I may
|
|
not do so, particularly if there are lots of changes.
|
|
@end enumerate
|
|
|
|
Although this sounds like a lot of work, please remember that while you
|
|
may write the new code, I have to maintain it and support it, and if it
|
|
isn't possible for me to do that with a minimum of extra work, then I
|
|
probably will not.
|
|
|
|
@node New Ports, , Adding Code, Additions
|
|
@appendixsubsec Porting @code{gawk} to a New Operating System
|
|
|
|
@cindex porting @code{gawk}
|
|
If you wish to port @code{gawk} to a new operating system, there are
|
|
several steps to follow.
|
|
|
|
@enumerate 1
|
|
@item
|
|
Follow the guidelines in
|
|
@ref{Adding Code, ,Adding New Features},
|
|
concerning coding style, submission of diffs, and so on.
|
|
|
|
@item
|
|
When doing a port, bear in mind that your code must co-exist peacefully
|
|
with the rest of @code{gawk}, and the other ports. Avoid gratuitous
|
|
changes to the system-independent parts of the code. If at all possible,
|
|
avoid sprinkling @samp{#ifdef}s just for your port throughout the
|
|
code.
|
|
|
|
If the changes needed for a particular system affect too much of the
|
|
code, I probably will not accept them. In such a case, you will, of course,
|
|
be able to distribute your changes on your own, as long as you comply
|
|
with the GPL
|
|
(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
|
|
|
|
@item
|
|
A number of the files that come with @code{gawk} are maintained by other
|
|
people at the Free Software Foundation. Thus, you should not change them
|
|
unless it is for a very good reason. I.e.@: changes are not out of the
|
|
question, but changes to these files will be scrutinized extra carefully.
|
|
The files are @file{alloca.c}, @file{getopt.h}, @file{getopt.c},
|
|
@file{getopt1.c}, @file{regex.h}, @file{regex.c}, @file{dfa.h},
|
|
@file{dfa.c}, @file{install-sh}, and @file{mkinstalldirs}.
|
|
|
|
@item
|
|
Be willing to continue to maintain the port.
|
|
Non-Unix operating systems are supported by volunteers who maintain
|
|
the code needed to compile and run @code{gawk} on their systems. If no-one
|
|
volunteers to maintain a port, that port becomes unsupported, and it may
|
|
be necessary to remove it from the distribution.
|
|
|
|
@item
|
|
Supply an appropriate @file{gawkmisc.???} file.
|
|
Each port has its own @file{gawkmisc.???} that implements certain
|
|
operating system specific functions. This is cleaner than a plethora of
|
|
@samp{#ifdef}s scattered throughout the code. The @file{gawkmisc.c} in
|
|
the main source directory includes the appropriate
|
|
@file{gawkmisc.???} file from each subdirectory.
|
|
Be sure to update it as well.
|
|
|
|
Each port's @file{gawkmisc.???} file has a suffix reminiscent of the machine
|
|
or operating system for the port. For example, @file{pc/gawkmisc.pc} and
|
|
@file{vms/gawkmisc.vms}. The use of separate suffixes, instead of plain
|
|
@file{gawkmisc.c}, makes it possible to move files from a port's subdirectory
|
|
into the main subdirectory, without accidentally destroying the real
|
|
@file{gawkmisc.c} file. (Currently, this is only an issue for the MS-DOS
|
|
and OS/2 ports.)
|
|
|
|
@item
|
|
Supply a @file{Makefile} and any other C source and header files that are
|
|
necessary for your operating system. All your code should be in a
|
|
separate subdirectory, with a name that is the same as, or reminiscent
|
|
of, either your operating system or the computer system. If possible,
|
|
try to structure things so that it is not necessary to move files out
|
|
of the subdirectory into the main source directory. If that is not
|
|
possible, then be sure to avoid using names for your files that
|
|
duplicate the names of files in the main source directory.
|
|
|
|
@item
|
|
Update the documentation.
|
|
Please write a section (or sections) for this @value{DOCUMENT} describing the
|
|
installation and compilation steps needed to install and/or compile
|
|
@code{gawk} for your system.
|
|
|
|
@item
|
|
Be prepared to sign the appropriate paperwork.
|
|
In order for the FSF to distribute your code, you must either place
|
|
your code in the public domain, and submit a signed statement to that
|
|
effect, or assign the copyright in your code to the FSF.
|
|
@ifinfo
|
|
Both of these actions are easy to do, and @emph{many} people have done so
|
|
already. If you have questions, please contact me, or
|
|
@code{gnu@@gnu.org}.
|
|
@end ifinfo
|
|
@end enumerate
|
|
|
|
Following these steps will make it much easier to integrate your changes
|
|
into @code{gawk}, and have them co-exist happily with the code for other
|
|
operating systems that is already there.
|
|
|
|
In the code that you supply, and that you maintain, feel free to use a
|
|
coding style and brace layout that suits your taste.
|
|
|
|
@node Future Extensions, Improvements, Additions, Notes
|
|
@appendixsec Probable Future Extensions
|
|
@ignore
|
|
From emory!scalpel.netlabs.com!lwall Tue Oct 31 12:43:17 1995
|
|
Return-Path: <emory!scalpel.netlabs.com!lwall>
|
|
Message-Id: <9510311732.AA28472@scalpel.netlabs.com>
|
|
To: arnold@skeeve.atl.ga.us (Arnold D. Robbins)
|
|
Subject: Re: May I quote you?
|
|
In-Reply-To: Your message of "Tue, 31 Oct 95 09:11:00 EST."
|
|
<m0tAHPQ-00014MC@skeeve.atl.ga.us>
|
|
Date: Tue, 31 Oct 95 09:32:46 -0800
|
|
From: Larry Wall <emory!scalpel.netlabs.com!lwall>
|
|
|
|
: Greetings. I am working on the release of gawk 3.0. Part of it will be a
|
|
: thoroughly updated manual. One of the sections deals with planned future
|
|
: extensions and enhancements. I have the following at the beginning
|
|
: of it:
|
|
:
|
|
: @cindex PERL
|
|
: @cindex Wall, Larry
|
|
: @display
|
|
: @i{AWK is a language similar to PERL, only considerably more elegant.} @*
|
|
: Arnold Robbins
|
|
: @sp 1
|
|
: @i{Hey!} @*
|
|
: Larry Wall
|
|
: @end display
|
|
:
|
|
: Before I actually release this for publication, I wanted to get your
|
|
: permission to quote you. (Hopefully, in the spirit of much of GNU, the
|
|
: implied humor is visible... :-)
|
|
|
|
I think that would be fine.
|
|
|
|
Larry
|
|
@end ignore
|
|
@cindex PERL
|
|
@cindex Wall, Larry
|
|
@display
|
|
@i{AWK is a language similar to PERL, only considerably more elegant.}
|
|
Arnold Robbins
|
|
|
|
@i{Hey!}
|
|
Larry Wall
|
|
@end display
|
|
@sp 1
|
|
|
|
This section briefly lists extensions and possible improvements
|
|
that indicate the directions we are
|
|
currently considering for @code{gawk}. The file @file{FUTURES} in the
|
|
@code{gawk} distributions lists these extensions as well.
|
|
|
|
This is a list of probable future changes that will be usable by the
|
|
@code{awk} language programmer.
|
|
|
|
@c these are ordered by likelihood
|
|
@table @asis
|
|
@item Localization
|
|
The GNU project is starting to support multiple languages.
|
|
It will at least be possible to make @code{gawk} print its warnings and
|
|
error messages in languages other than English.
|
|
It may be possible for @code{awk} programs to also use the multiple
|
|
language facilities, separate from @code{gawk} itself.
|
|
|
|
@item Databases
|
|
It may be possible to map a GDBM/NDBM/SDBM file into an @code{awk} array.
|
|
|
|
@item A @code{PROCINFO} Array
|
|
The special files that provide process-related information
|
|
(@pxref{Special Files, ,Special File Names in @code{gawk}})
|
|
may be superseded by a @code{PROCINFO} array that would provide the same
|
|
information, in an easier to access fashion.
|
|
|
|
@item More @code{lint} warnings
|
|
There are more things that could be checked for portability.
|
|
|
|
@item Control of subprocess environment
|
|
Changes made in @code{gawk} to the array @code{ENVIRON} may be
|
|
propagated to subprocesses run by @code{gawk}.
|
|
|
|
@ignore
|
|
@item @code{RECLEN} variable for fixed length records
|
|
Along with @code{FIELDWIDTHS}, this would speed up the processing of
|
|
fixed-length records.
|
|
|
|
@item A @code{restart} keyword
|
|
After modifying @code{$0}, @code{restart} would restart the pattern
|
|
matching loop, without reading a new record from the input.
|
|
|
|
@item A @samp{|&} redirection
|
|
The @samp{|&} redirection, in place of @samp{|}, would open a two-way
|
|
pipeline for communication with a sub-process (via @code{getline} and
|
|
@code{print} and @code{printf}).
|
|
|
|
@item Function valued variables
|
|
It would be possible to assign the name of a user-defined or built-in
|
|
function to a regular @code{awk} variable, and then call the function
|
|
indirectly, by using the regular variable. This would make it possible
|
|
to write general purpose sorting and comparing routines, for example,
|
|
by simply passing the name of one function into another.
|
|
|
|
@item A built-in @code{stat} function
|
|
The @code{stat} function would provide an easy-to-use hook to the
|
|
@code{stat} system call so that @code{awk} programs could determine information
|
|
about files.
|
|
|
|
@item A built-in @code{ftw} function
|
|
Combined with function valued variables and the @code{stat} function,
|
|
@code{ftw} (file tree walk) would make it easy for an @code{awk} program
|
|
to walk an entire file tree.
|
|
@end ignore
|
|
@end table
|
|
|
|
This is a list of probable improvements that will make @code{gawk}
|
|
perform better.
|
|
|
|
@table @asis
|
|
@item An Improved Version of @code{dfa}
|
|
The @code{dfa} pattern matcher from GNU @code{grep} has some
|
|
problems. Either a new version or a fixed one will deal with some
|
|
important regexp matching issues.
|
|
|
|
@item Use of GNU @code{malloc}
|
|
The GNU version of @code{malloc} could potentially speed up @code{gawk},
|
|
since it relies heavily on the use of dynamic memory allocation.
|
|
|
|
@end table
|
|
|
|
@node Improvements, , Future Extensions, Notes
|
|
@appendixsec Suggestions for Improvements
|
|
|
|
Here are some projects that would-be @code{gawk} hackers might like to take
|
|
on. They vary in size from a few days to a few weeks of programming,
|
|
depending on which one you choose and how fast a programmer you are. Please
|
|
send any improvements you write to the maintainers at the GNU project.
|
|
@xref{Adding Code, , Adding New Features},
|
|
for guidelines to follow when adding new features to @code{gawk}.
|
|
@xref{Bugs, ,Reporting Problems and Bugs}, for information on
|
|
contacting the maintainers.
|
|
|
|
@enumerate
|
|
@item
|
|
Compilation of @code{awk} programs: @code{gawk} uses a Bison (YACC-like)
|
|
parser to convert the script given it into a syntax tree; the syntax
|
|
tree is then executed by a simple recursive evaluator. This method incurs
|
|
a lot of overhead, since the recursive evaluator performs many procedure
|
|
calls to do even the simplest things.
|
|
|
|
It should be possible for @code{gawk} to convert the script's parse tree
|
|
into a C program which the user would then compile, using the normal
|
|
C compiler and a special @code{gawk} library to provide all the needed
|
|
functions (regexps, fields, associative arrays, type coercion, and so
|
|
on).
|
|
|
|
An easier possibility might be for an intermediate phase of @code{awk} to
|
|
convert the parse tree into a linear byte code form like the one used
|
|
in GNU Emacs Lisp. The recursive evaluator would then be replaced by
|
|
a straight line byte code interpreter that would be intermediate in speed
|
|
between running a compiled program and doing what @code{gawk} does
|
|
now.
|
|
|
|
@item
|
|
The programs in the test suite could use documenting in this @value{DOCUMENT}.
|
|
|
|
@item
|
|
See the @file{FUTURES} file for more ideas. Contact us if you would
|
|
seriously like to tackle any of the items listed there.
|
|
@end enumerate
|
|
|
|
@node Glossary, Copying, Notes, Top
|
|
@appendix Glossary
|
|
|
|
@table @asis
|
|
@item Action
|
|
A series of @code{awk} statements attached to a rule. If the rule's
|
|
pattern matches an input record, @code{awk} executes the
|
|
rule's action. Actions are always enclosed in curly braces.
|
|
@xref{Action Overview, ,Overview of Actions}.
|
|
|
|
@item Amazing @code{awk} Assembler
|
|
Henry Spencer at the University of Toronto wrote a retargetable assembler
|
|
completely as @code{awk} scripts. It is thousands of lines long, including
|
|
machine descriptions for several eight-bit microcomputers.
|
|
It is a good example of a
|
|
program that would have been better written in another language.
|
|
|
|
@item Amazingly Workable Formatter (@code{awf})
|
|
Henry Spencer at the University of Toronto wrote a formatter that accepts
|
|
a large subset of the @samp{nroff -ms} and @samp{nroff -man} formatting
|
|
commands, using @code{awk} and @code{sh}.
|
|
|
|
@item ANSI
|
|
The American National Standards Institute. This organization produces
|
|
many standards, among them the standards for the C and C++ programming
|
|
languages.
|
|
|
|
@item Assignment
|
|
An @code{awk} expression that changes the value of some @code{awk}
|
|
variable or data object. An object that you can assign to is called an
|
|
@dfn{lvalue}. The assigned values are called @dfn{rvalues}.
|
|
@xref{Assignment Ops, ,Assignment Expressions}.
|
|
|
|
@item @code{awk} Language
|
|
The language in which @code{awk} programs are written.
|
|
|
|
@item @code{awk} Program
|
|
An @code{awk} program consists of a series of @dfn{patterns} and
|
|
@dfn{actions}, collectively known as @dfn{rules}. For each input record
|
|
given to the program, the program's rules are all processed in turn.
|
|
@code{awk} programs may also contain function definitions.
|
|
|
|
@item @code{awk} Script
|
|
Another name for an @code{awk} program.
|
|
|
|
@item Bash
|
|
The GNU version of the standard shell (the Bourne-Again shell).
|
|
See ``Bourne Shell.''
|
|
|
|
@item BBS
|
|
See ``Bulletin Board System.''
|
|
|
|
@item Boolean Expression
|
|
Named after the English mathematician Boole. See ``Logical Expression.''
|
|
|
|
@item Bourne Shell
|
|
The standard shell (@file{/bin/sh}) on Unix and Unix-like systems,
|
|
originally written by Steven R.@: Bourne.
|
|
Many shells (Bash, @code{ksh}, @code{pdksh}, @code{zsh}) are
|
|
generally upwardly compatible with the Bourne shell.
|
|
|
|
@item Built-in Function
|
|
The @code{awk} language provides built-in functions that perform various
|
|
numerical, time stamp related, and string computations. Examples are
|
|
@code{sqrt} (for the square root of a number) and @code{substr} (for a
|
|
substring of a string). @xref{Built-in, ,Built-in Functions}.
|
|
|
|
@item Built-in Variable
|
|
@code{ARGC}, @code{ARGIND}, @code{ARGV}, @code{CONVFMT}, @code{ENVIRON},
|
|
@code{ERRNO}, @code{FIELDWIDTHS}, @code{FILENAME}, @code{FNR}, @code{FS},
|
|
@code{IGNORECASE}, @code{NF}, @code{NR}, @code{OFMT}, @code{OFS}, @code{ORS},
|
|
@code{RLENGTH}, @code{RSTART}, @code{RS}, @code{RT}, and @code{SUBSEP},
|
|
are the variables that have special meaning to @code{awk}.
|
|
Changing some of them affects @code{awk}'s running environment.
|
|
Several of these variables are specific to @code{gawk}.
|
|
@xref{Built-in Variables}.
|
|
|
|
@item Braces
|
|
See ``Curly Braces.''
|
|
|
|
@item Bulletin Board System
|
|
A computer system allowing users to log in and read and/or leave messages
|
|
for other users of the system, much like leaving paper notes on a bulletin
|
|
board.
|
|
|
|
@item C
|
|
The system programming language that most GNU software is written in. The
|
|
@code{awk} programming language has C-like syntax, and this @value{DOCUMENT}
|
|
points out similarities between @code{awk} and C when appropriate.
|
|
|
|
@cindex ISO 8859-1
|
|
@cindex ISO Latin-1
|
|
@item Character Set
|
|
The set of numeric codes used by a computer system to represent the
|
|
characters (letters, numbers, punctuation, etc.) of a particular country
|
|
or place. The most common character set in use today is ASCII (American
|
|
Standard Code for Information Interchange). Many European
|
|
countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1).
|
|
|
|
@item CHEM
|
|
A preprocessor for @code{pic} that reads descriptions of molecules
|
|
and produces @code{pic} input for drawing them. It was written in @code{awk}
|
|
by Brian Kernighan and Jon Bentley, and is available from
|
|
@email{@w{netlib@@research.bell-labs.com}}.
|
|
|
|
@item Compound Statement
|
|
A series of @code{awk} statements, enclosed in curly braces. Compound
|
|
statements may be nested.
|
|
@xref{Statements, ,Control Statements in Actions}.
|
|
|
|
@item Concatenation
|
|
Concatenating two strings means sticking them together, one after another,
|
|
giving a new string. For example, the string @samp{foo} concatenated with
|
|
the string @samp{bar} gives the string @samp{foobar}.
|
|
@xref{Concatenation, ,String Concatenation}.
|
|
|
|
@item Conditional Expression
|
|
An expression using the @samp{?:} ternary operator, such as
|
|
@samp{@var{expr1} ? @var{expr2} : @var{expr3}}. The expression
|
|
@var{expr1} is evaluated; if the result is true, the value of the whole
|
|
expression is the value of @var{expr2}, otherwise the value is
|
|
@var{expr3}. In either case, only one of @var{expr2} and @var{expr3}
|
|
is evaluated. @xref{Conditional Exp, ,Conditional Expressions}.
|
|
|
|
@item Comparison Expression
|
|
A relation that is either true or false, such as @samp{(a < b)}.
|
|
Comparison expressions are used in @code{if}, @code{while}, @code{do},
|
|
and @code{for}
|
|
statements, and in patterns to select which input records to process.
|
|
@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
|
|
|
|
@item Curly Braces
|
|
The characters @samp{@{} and @samp{@}}. Curly braces are used in
|
|
@code{awk} for delimiting actions, compound statements, and function
|
|
bodies.
|
|
|
|
@item Dark Corner
|
|
An area in the language where specifications often were (or still
|
|
are) not clear, leading to unexpected or undesirable behavior.
|
|
Such areas are marked in this @value{DOCUMENT} with ``(d.c.)'' in the
|
|
text, and are indexed under the heading ``dark corner.''
|
|
|
|
@item Data Objects
|
|
These are numbers and strings of characters. Numbers are converted into
|
|
strings and vice versa, as needed.
|
|
@xref{Conversion, ,Conversion of Strings and Numbers}.
|
|
|
|
@item Double Precision
|
|
An internal representation of numbers that can have fractional parts.
|
|
Double precision numbers keep track of more digits than do single precision
|
|
numbers, but operations on them are more expensive. This is the way
|
|
@code{awk} stores numeric values. It is the C type @code{double}.
|
|
|
|
@item Dynamic Regular Expression
|
|
A dynamic regular expression is a regular expression written as an
|
|
ordinary expression. It could be a string constant, such as
|
|
@code{"foo"}, but it may also be an expression whose value can vary.
|
|
@xref{Computed Regexps, , Using Dynamic Regexps}.
|
|
|
|
@item Environment
|
|
A collection of strings, of the form @var{name@code{=}val}, that each
|
|
program has available to it. Users generally place values into the
|
|
environment in order to provide information to various programs. Typical
|
|
examples are the environment variables @code{HOME} and @code{PATH}.
|
|
|
|
@item Empty String
|
|
See ``Null String.''
|
|
|
|
@item Escape Sequences
|
|
A special sequence of characters used for describing non-printing
|
|
characters, such as @samp{\n} for newline, or @samp{\033} for the ASCII
|
|
ESC (escape) character. @xref{Escape Sequences}.
|
|
|
|
@item Field
|
|
When @code{awk} reads an input record, it splits the record into pieces
|
|
separated by whitespace (or by a separator regexp which you can
|
|
change by setting the built-in variable @code{FS}). Such pieces are
|
|
called fields. If the pieces are of fixed length, you can use the built-in
|
|
variable @code{FIELDWIDTHS} to describe their lengths.
|
|
@xref{Field Separators, ,Specifying How Fields are Separated},
|
|
and also see
|
|
@xref{Constant Size, , Reading Fixed-width Data}.
|
|
|
|
@item Floating Point Number
|
|
Often referred to in mathematical terms as a ``rational'' number, this is
|
|
just a number that can have a fractional part.
|
|
See ``Double Precision'' and ``Single Precision.''
|
|
|
|
@item Format
|
|
Format strings are used to control the appearance of output in the
|
|
@code{printf} statement. Also, data conversions from numbers to strings
|
|
are controlled by the format string contained in the built-in variable
|
|
@code{CONVFMT}. @xref{Control Letters, ,Format-Control Letters}.
|
|
|
|
@item Function
|
|
A specialized group of statements used to encapsulate general
|
|
or program-specific tasks. @code{awk} has a number of built-in
|
|
functions, and also allows you to define your own.
|
|
@xref{Built-in, ,Built-in Functions},
|
|
and @ref{User-defined, ,User-defined Functions}.
|
|
|
|
@item FSF
|
|
See ``Free Software Foundation.''
|
|
|
|
@item Free Software Foundation
|
|
A non-profit organization dedicated
|
|
to the production and distribution of freely distributable software.
|
|
It was founded by Richard M.@: Stallman, the author of the original
|
|
Emacs editor. GNU Emacs is the most widely used version of Emacs today.
|
|
|
|
@item @code{gawk}
|
|
The GNU implementation of @code{awk}.
|
|
|
|
@item General Public License
|
|
This document describes the terms under which @code{gawk} and its source
|
|
code may be distributed. (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE})
|
|
|
|
@item GNU
|
|
``GNU's not Unix''. An on-going project of the Free Software Foundation
|
|
to create a complete, freely distributable, POSIX-compliant computing
|
|
environment.
|
|
|
|
@item GPL
|
|
See ``General Public License.''
|
|
|
|
@item Hexadecimal
|
|
Base 16 notation, where the digits are @code{0}-@code{9} and
|
|
@code{A}-@code{F}, with @samp{A}
|
|
representing 10, @samp{B} representing 11, and so on up to @samp{F} for 15.
|
|
Hexadecimal numbers are written in C using a leading @samp{0x},
|
|
to indicate their base. Thus, @code{0x12} is 18 (one times 16 plus 2).
|
|
|
|
@item I/O
|
|
Abbreviation for ``Input/Output,'' the act of moving data into and/or
|
|
out of a running program.
|
|
|
|
@item Input Record
|
|
A single chunk of data read in by @code{awk}. Usually, an @code{awk} input
|
|
record consists of one line of text.
|
|
@xref{Records, ,How Input is Split into Records}.
|
|
|
|
@item Integer
|
|
A whole number, i.e.@: a number that does not have a fractional part.
|
|
|
|
@item Keyword
|
|
In the @code{awk} language, a keyword is a word that has special
|
|
meaning. Keywords are reserved and may not be used as variable names.
|
|
|
|
@code{gawk}'s keywords are:
|
|
@code{BEGIN},
|
|
@code{END},
|
|
@code{if},
|
|
@code{else},
|
|
@code{while},
|
|
@code{do@dots{}while},
|
|
@code{for},
|
|
@code{for@dots{}in},
|
|
@code{break},
|
|
@code{continue},
|
|
@code{delete},
|
|
@code{next},
|
|
@code{nextfile},
|
|
@code{function},
|
|
@code{func},
|
|
and @code{exit}.
|
|
|
|
@item Logical Expression
|
|
An expression using the operators for logic, AND, OR, and NOT, written
|
|
@samp{&&}, @samp{||}, and @samp{!} in @code{awk}. Often called Boolean
|
|
expressions, after the mathematician who pioneered this kind of
|
|
mathematical logic.
|
|
|
|
@item Lvalue
|
|
An expression that can appear on the left side of an assignment
|
|
operator. In most languages, lvalues can be variables or array
|
|
elements. In @code{awk}, a field designator can also be used as an
|
|
lvalue.
|
|
|
|
@item Null String
|
|
A string with no characters in it. It is represented explicitly in
|
|
@code{awk} programs by placing two double-quote characters next to
|
|
each other (@code{""}). It can appear in input data by having two successive
|
|
occurrences of the field separator appear next to each other.
|
|
|
|
@item Number
|
|
A numeric valued data object. The @code{gawk} implementation uses double
|
|
precision floating point to represent numbers.
|
|
Very old @code{awk} implementations use single precision floating
|
|
point.
|
|
|
|
@item Octal
|
|
Base-eight notation, where the digits are @code{0}-@code{7}.
|
|
Octal numbers are written in C using a leading @samp{0},
|
|
to indicate their base. Thus, @code{013} is 11 (one times 8 plus 3).
|
|
|
|
@item Pattern
|
|
Patterns tell @code{awk} which input records are interesting to which
|
|
rules.
|
|
|
|
A pattern is an arbitrary conditional expression against which input is
|
|
tested. If the condition is satisfied, the pattern is said to @dfn{match}
|
|
the input record. A typical pattern might compare the input record against
|
|
a regular expression. @xref{Pattern Overview, ,Pattern Elements}.
|
|
|
|
@item POSIX
|
|
The name for a series of standards being developed by the IEEE
|
|
that specify a Portable Operating System interface. The ``IX'' denotes
|
|
the Unix heritage of these standards. The main standard of interest for
|
|
@code{awk} users is
|
|
@cite{IEEE Standard for Information Technology, Standard 1003.2-1992,
|
|
Portable Operating System Interface (POSIX) Part 2: Shell and Utilities}.
|
|
Informally, this standard is often referred to as simply ``P1003.2.''
|
|
|
|
@item Private
|
|
Variables and/or functions that are meant for use exclusively by library
|
|
functions, and not for the main @code{awk} program. Special care must be
|
|
taken when naming such variables and functions.
|
|
@xref{Library Names, , Naming Library Function Global Variables}.
|
|
|
|
@item Range (of input lines)
|
|
A sequence of consecutive lines from the input file. A pattern
|
|
can specify ranges of input lines for @code{awk} to process, or it can
|
|
specify single lines. @xref{Pattern Overview, ,Pattern Elements}.
|
|
|
|
@item Recursion
|
|
When a function calls itself, either directly or indirectly.
|
|
If this isn't clear, refer to the entry for ``recursion.''
|
|
|
|
@item Redirection
|
|
Redirection means performing input from other than the standard input
|
|
stream, or output to other than the standard output stream.
|
|
|
|
You can redirect the output of the @code{print} and @code{printf} statements
|
|
to a file or a system command, using the @samp{>}, @samp{>>}, and @samp{|}
|
|
operators. You can redirect input to the @code{getline} statement using
|
|
the @samp{<} and @samp{|} operators.
|
|
@xref{Redirection, ,Redirecting Output of @code{print} and @code{printf}},
|
|
and @ref{Getline, ,Explicit Input with @code{getline}}.
|
|
|
|
@item Regexp
|
|
Short for @dfn{regular expression}. A regexp is a pattern that denotes a
|
|
set of strings, possibly an infinite set. For example, the regexp
|
|
@samp{R.*xp} matches any string starting with the letter @samp{R}
|
|
and ending with the letters @samp{xp}. In @code{awk}, regexps are
|
|
used in patterns and in conditional expressions. Regexps may contain
|
|
escape sequences. @xref{Regexp, ,Regular Expressions}.
|
|
|
|
@item Regular Expression
|
|
See ``regexp.''
|
|
|
|
@item Regular Expression Constant
|
|
A regular expression constant is a regular expression written within
|
|
slashes, such as @code{/foo/}. This regular expression is chosen
|
|
when you write the @code{awk} program, and cannot be changed doing
|
|
its execution. @xref{Regexp Usage, ,How to Use Regular Expressions}.
|
|
|
|
@item Rule
|
|
A segment of an @code{awk} program that specifies how to process single
|
|
input records. A rule consists of a @dfn{pattern} and an @dfn{action}.
|
|
@code{awk} reads an input record; then, for each rule, if the input record
|
|
satisfies the rule's pattern, @code{awk} executes the rule's action.
|
|
Otherwise, the rule does nothing for that input record.
|
|
|
|
@item Rvalue
|
|
A value that can appear on the right side of an assignment operator.
|
|
In @code{awk}, essentially every expression has a value. These values
|
|
are rvalues.
|
|
|
|
@item @code{sed}
|
|
See ``Stream Editor.''
|
|
|
|
@item Short-Circuit
|
|
The nature of the @code{awk} logical operators @samp{&&} and @samp{||}.
|
|
If the value of the entire expression can be deduced from evaluating just
|
|
the left-hand side of these operators, the right-hand side will not
|
|
be evaluated
|
|
(@pxref{Boolean Ops, ,Boolean Expressions}).
|
|
|
|
@item Side Effect
|
|
A side effect occurs when an expression has an effect aside from merely
|
|
producing a value. Assignment expressions, increment and decrement
|
|
expressions and function calls have side effects.
|
|
@xref{Assignment Ops, ,Assignment Expressions}.
|
|
|
|
@item Single Precision
|
|
An internal representation of numbers that can have fractional parts.
|
|
Single precision numbers keep track of fewer digits than do double precision
|
|
numbers, but operations on them are less expensive in terms of CPU time.
|
|
This is the type used by some very old versions of @code{awk} to store
|
|
numeric values. It is the C type @code{float}.
|
|
|
|
@item Space
|
|
The character generated by hitting the space bar on the keyboard.
|
|
|
|
@item Special File
|
|
A file name interpreted internally by @code{gawk}, instead of being handed
|
|
directly to the underlying operating system. For example, @file{/dev/stderr}.
|
|
@xref{Special Files, ,Special File Names in @code{gawk}}.
|
|
|
|
@item Stream Editor
|
|
A program that reads records from an input stream and processes them one
|
|
or more at a time. This is in contrast with batch programs, which may
|
|
expect to read their input files in entirety before starting to do
|
|
anything, and with interactive programs, which require input from the
|
|
user.
|
|
|
|
@item String
|
|
A datum consisting of a sequence of characters, such as @samp{I am a
|
|
string}. Constant strings are written with double-quotes in the
|
|
@code{awk} language, and may contain escape sequences.
|
|
@xref{Escape Sequences}.
|
|
|
|
@item Tab
|
|
The character generated by hitting the @kbd{TAB} key on the keyboard.
|
|
It usually expands to up to eight spaces upon output.
|
|
|
|
@item Unix
|
|
A computer operating system originally developed in the early 1970's at
|
|
AT&T Bell Laboratories. It initially became popular in universities around
|
|
the world, and later moved into commercial evnironments as a software
|
|
development system and network server system. There are many commercial
|
|
versions of Unix, as well as several work-alike systems whose source code
|
|
is freely available (such as Linux, NetBSD, and FreeBSD).
|
|
|
|
@item Whitespace
|
|
A sequence of space, tab, or newline characters occurring inside an input
|
|
record or a string.
|
|
@end table
|
|
|
|
@node Copying, Index, Glossary, Top
|
|
@unnumbered GNU GENERAL PUBLIC LICENSE
|
|
@center Version 2, June 1991
|
|
|
|
@display
|
|
Copyright @copyright{} 1989, 1991 Free Software Foundation, Inc.
|
|
59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA
|
|
|
|
Everyone is permitted to copy and distribute verbatim copies
|
|
of this license document, but changing it is not allowed.
|
|
@end display
|
|
|
|
@c fakenode --- for prepinfo
|
|
@unnumberedsec Preamble
|
|
|
|
The licenses for most software are designed to take away your
|
|
freedom to share and change it. By contrast, the GNU General Public
|
|
License is intended to guarantee your freedom to share and change free
|
|
software---to make sure the software is free for all its users. This
|
|
General Public License applies to most of the Free Software
|
|
Foundation's software and to any other program whose authors commit to
|
|
using it. (Some other Free Software Foundation software is covered by
|
|
the GNU Library General Public License instead.) You can apply it to
|
|
your programs, too.
|
|
|
|
When we speak of free software, we are referring to freedom, not
|
|
price. Our General Public Licenses are designed to make sure that you
|
|
have the freedom to distribute copies of free software (and charge for
|
|
this service if you wish), that you receive source code or can get it
|
|
if you want it, that you can change the software or use pieces of it
|
|
in new free programs; and that you know you can do these things.
|
|
|
|
To protect your rights, we need to make restrictions that forbid
|
|
anyone to deny you these rights or to ask you to surrender the rights.
|
|
These restrictions translate to certain responsibilities for you if you
|
|
distribute copies of the software, or if you modify it.
|
|
|
|
For example, if you distribute copies of such a program, whether
|
|
gratis or for a fee, you must give the recipients all the rights that
|
|
you have. You must make sure that they, too, receive or can get the
|
|
source code. And you must show them these terms so they know their
|
|
rights.
|
|
|
|
We protect your rights with two steps: (1) copyright the software, and
|
|
(2) offer you this license which gives you legal permission to copy,
|
|
distribute and/or modify the software.
|
|
|
|
Also, for each author's protection and ours, we want to make certain
|
|
that everyone understands that there is no warranty for this free
|
|
software. If the software is modified by someone else and passed on, we
|
|
want its recipients to know that what they have is not the original, so
|
|
that any problems introduced by others will not reflect on the original
|
|
authors' reputations.
|
|
|
|
Finally, any free program is threatened constantly by software
|
|
patents. We wish to avoid the danger that redistributors of a free
|
|
program will individually obtain patent licenses, in effect making the
|
|
program proprietary. To prevent this, we have made it clear that any
|
|
patent must be licensed for everyone's free use or not licensed at all.
|
|
|
|
The precise terms and conditions for copying, distribution and
|
|
modification follow.
|
|
|
|
@iftex
|
|
@c fakenode --- for prepinfo
|
|
@unnumberedsec TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
|
|
@end iftex
|
|
@ifinfo
|
|
@center TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
|
|
@end ifinfo
|
|
|
|
@enumerate 0
|
|
@item
|
|
This License applies to any program or other work which contains
|
|
a notice placed by the copyright holder saying it may be distributed
|
|
under the terms of this General Public License. The ``Program'', below,
|
|
refers to any such program or work, and a ``work based on the Program''
|
|
means either the Program or any derivative work under copyright law:
|
|
that is to say, a work containing the Program or a portion of it,
|
|
either verbatim or with modifications and/or translated into another
|
|
language. (Hereinafter, translation is included without limitation in
|
|
the term ``modification''.) Each licensee is addressed as ``you''.
|
|
|
|
Activities other than copying, distribution and modification are not
|
|
covered by this License; they are outside its scope. The act of
|
|
running the Program is not restricted, and the output from the Program
|
|
is covered only if its contents constitute a work based on the
|
|
Program (independent of having been made by running the Program).
|
|
Whether that is true depends on what the Program does.
|
|
|
|
@item
|
|
You may copy and distribute verbatim copies of the Program's
|
|
source code as you receive it, in any medium, provided that you
|
|
conspicuously and appropriately publish on each copy an appropriate
|
|
copyright notice and disclaimer of warranty; keep intact all the
|
|
notices that refer to this License and to the absence of any warranty;
|
|
and give any other recipients of the Program a copy of this License
|
|
along with the Program.
|
|
|
|
You may charge a fee for the physical act of transferring a copy, and
|
|
you may at your option offer warranty protection in exchange for a fee.
|
|
|
|
@item
|
|
You may modify your copy or copies of the Program or any portion
|
|
of it, thus forming a work based on the Program, and copy and
|
|
distribute such modifications or work under the terms of Section 1
|
|
above, provided that you also meet all of these conditions:
|
|
|
|
@enumerate a
|
|
@item
|
|
You must cause the modified files to carry prominent notices
|
|
stating that you changed the files and the date of any change.
|
|
|
|
@item
|
|
You must cause any work that you distribute or publish, that in
|
|
whole or in part contains or is derived from the Program or any
|
|
part thereof, to be licensed as a whole at no charge to all third
|
|
parties under the terms of this License.
|
|
|
|
@item
|
|
If the modified program normally reads commands interactively
|
|
when run, you must cause it, when started running for such
|
|
interactive use in the most ordinary way, to print or display an
|
|
announcement including an appropriate copyright notice and a
|
|
notice that there is no warranty (or else, saying that you provide
|
|
a warranty) and that users may redistribute the program under
|
|
these conditions, and telling the user how to view a copy of this
|
|
License. (Exception: if the Program itself is interactive but
|
|
does not normally print such an announcement, your work based on
|
|
the Program is not required to print an announcement.)
|
|
@end enumerate
|
|
|
|
These requirements apply to the modified work as a whole. If
|
|
identifiable sections of that work are not derived from the Program,
|
|
and can be reasonably considered independent and separate works in
|
|
themselves, then this License, and its terms, do not apply to those
|
|
sections when you distribute them as separate works. But when you
|
|
distribute the same sections as part of a whole which is a work based
|
|
on the Program, the distribution of the whole must be on the terms of
|
|
this License, whose permissions for other licensees extend to the
|
|
entire whole, and thus to each and every part regardless of who wrote it.
|
|
|
|
Thus, it is not the intent of this section to claim rights or contest
|
|
your rights to work written entirely by you; rather, the intent is to
|
|
exercise the right to control the distribution of derivative or
|
|
collective works based on the Program.
|
|
|
|
In addition, mere aggregation of another work not based on the Program
|
|
with the Program (or with a work based on the Program) on a volume of
|
|
a storage or distribution medium does not bring the other work under
|
|
the scope of this License.
|
|
|
|
@item
|
|
You may copy and distribute the Program (or a work based on it,
|
|
under Section 2) in object code or executable form under the terms of
|
|
Sections 1 and 2 above provided that you also do one of the following:
|
|
|
|
@enumerate a
|
|
@item
|
|
Accompany it with the complete corresponding machine-readable
|
|
source code, which must be distributed under the terms of Sections
|
|
1 and 2 above on a medium customarily used for software interchange; or,
|
|
|
|
@item
|
|
Accompany it with a written offer, valid for at least three
|
|
years, to give any third party, for a charge no more than your
|
|
cost of physically performing source distribution, a complete
|
|
machine-readable copy of the corresponding source code, to be
|
|
distributed under the terms of Sections 1 and 2 above on a medium
|
|
customarily used for software interchange; or,
|
|
|
|
@item
|
|
Accompany it with the information you received as to the offer
|
|
to distribute corresponding source code. (This alternative is
|
|
allowed only for non-commercial distribution and only if you
|
|
received the program in object code or executable form with such
|
|
an offer, in accord with Subsection b above.)
|
|
@end enumerate
|
|
|
|
The source code for a work means the preferred form of the work for
|
|
making modifications to it. For an executable work, complete source
|
|
code means all the source code for all modules it contains, plus any
|
|
associated interface definition files, plus the scripts used to
|
|
control compilation and installation of the executable. However, as a
|
|
special exception, the source code distributed need not include
|
|
anything that is normally distributed (in either source or binary
|
|
form) with the major components (compiler, kernel, and so on) of the
|
|
operating system on which the executable runs, unless that component
|
|
itself accompanies the executable.
|
|
|
|
If distribution of executable or object code is made by offering
|
|
access to copy from a designated place, then offering equivalent
|
|
access to copy the source code from the same place counts as
|
|
distribution of the source code, even though third parties are not
|
|
compelled to copy the source along with the object code.
|
|
|
|
@item
|
|
You may not copy, modify, sublicense, or distribute the Program
|
|
except as expressly provided under this License. Any attempt
|
|
otherwise to copy, modify, sublicense or distribute the Program is
|
|
void, and will automatically terminate your rights under this License.
|
|
However, parties who have received copies, or rights, from you under
|
|
this License will not have their licenses terminated so long as such
|
|
parties remain in full compliance.
|
|
|
|
@item
|
|
You are not required to accept this License, since you have not
|
|
signed it. However, nothing else grants you permission to modify or
|
|
distribute the Program or its derivative works. These actions are
|
|
prohibited by law if you do not accept this License. Therefore, by
|
|
modifying or distributing the Program (or any work based on the
|
|
Program), you indicate your acceptance of this License to do so, and
|
|
all its terms and conditions for copying, distributing or modifying
|
|
the Program or works based on it.
|
|
|
|
@item
|
|
Each time you redistribute the Program (or any work based on the
|
|
Program), the recipient automatically receives a license from the
|
|
original licensor to copy, distribute or modify the Program subject to
|
|
these terms and conditions. You may not impose any further
|
|
restrictions on the recipients' exercise of the rights granted herein.
|
|
You are not responsible for enforcing compliance by third parties to
|
|
this License.
|
|
|
|
@item
|
|
If, as a consequence of a court judgment or allegation of patent
|
|
infringement or for any other reason (not limited to patent issues),
|
|
conditions are imposed on you (whether by court order, agreement or
|
|
otherwise) that contradict the conditions of this License, they do not
|
|
excuse you from the conditions of this License. If you cannot
|
|
distribute so as to satisfy simultaneously your obligations under this
|
|
License and any other pertinent obligations, then as a consequence you
|
|
may not distribute the Program at all. For example, if a patent
|
|
license would not permit royalty-free redistribution of the Program by
|
|
all those who receive copies directly or indirectly through you, then
|
|
the only way you could satisfy both it and this License would be to
|
|
refrain entirely from distribution of the Program.
|
|
|
|
If any portion of this section is held invalid or unenforceable under
|
|
any particular circumstance, the balance of the section is intended to
|
|
apply and the section as a whole is intended to apply in other
|
|
circumstances.
|
|
|
|
It is not the purpose of this section to induce you to infringe any
|
|
patents or other property right claims or to contest validity of any
|
|
such claims; this section has the sole purpose of protecting the
|
|
integrity of the free software distribution system, which is
|
|
implemented by public license practices. Many people have made
|
|
generous contributions to the wide range of software distributed
|
|
through that system in reliance on consistent application of that
|
|
system; it is up to the author/donor to decide if he or she is willing
|
|
to distribute software through any other system and a licensee cannot
|
|
impose that choice.
|
|
|
|
This section is intended to make thoroughly clear what is believed to
|
|
be a consequence of the rest of this License.
|
|
|
|
@item
|
|
If the distribution and/or use of the Program is restricted in
|
|
certain countries either by patents or by copyrighted interfaces, the
|
|
original copyright holder who places the Program under this License
|
|
may add an explicit geographical distribution limitation excluding
|
|
those countries, so that distribution is permitted only in or among
|
|
countries not thus excluded. In such case, this License incorporates
|
|
the limitation as if written in the body of this License.
|
|
|
|
@item
|
|
The Free Software Foundation may publish revised and/or new versions
|
|
of the General Public License from time to time. Such new versions will
|
|
be similar in spirit to the present version, but may differ in detail to
|
|
address new problems or concerns.
|
|
|
|
Each version is given a distinguishing version number. If the Program
|
|
specifies a version number of this License which applies to it and ``any
|
|
later version'', you have the option of following the terms and conditions
|
|
either of that version or of any later version published by the Free
|
|
Software Foundation. If the Program does not specify a version number of
|
|
this License, you may choose any version ever published by the Free Software
|
|
Foundation.
|
|
|
|
@item
|
|
If you wish to incorporate parts of the Program into other free
|
|
programs whose distribution conditions are different, write to the author
|
|
to ask for permission. For software which is copyrighted by the Free
|
|
Software Foundation, write to the Free Software Foundation; we sometimes
|
|
make exceptions for this. Our decision will be guided by the two goals
|
|
of preserving the free status of all derivatives of our free software and
|
|
of promoting the sharing and reuse of software generally.
|
|
|
|
@iftex
|
|
@c fakenode --- for prepinfo
|
|
@heading NO WARRANTY
|
|
@end iftex
|
|
@ifinfo
|
|
@center NO WARRANTY
|
|
@end ifinfo
|
|
|
|
@item
|
|
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
|
|
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW@. EXCEPT WHEN
|
|
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
|
|
PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
|
|
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
|
|
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE@. THE ENTIRE RISK AS
|
|
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU@. SHOULD THE
|
|
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
|
|
REPAIR OR CORRECTION.
|
|
|
|
@item
|
|
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
|
|
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
|
|
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
|
|
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
|
|
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
|
|
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
|
|
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
|
|
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
|
|
POSSIBILITY OF SUCH DAMAGES.
|
|
@end enumerate
|
|
|
|
@iftex
|
|
@c fakenode --- for prepinfo
|
|
@heading END OF TERMS AND CONDITIONS
|
|
@end iftex
|
|
@ifinfo
|
|
@center END OF TERMS AND CONDITIONS
|
|
@end ifinfo
|
|
|
|
@page
|
|
@c fakenode --- for prepinfo
|
|
@unnumberedsec How to Apply These Terms to Your New Programs
|
|
|
|
If you develop a new program, and you want it to be of the greatest
|
|
possible use to the public, the best way to achieve this is to make it
|
|
free software which everyone can redistribute and change under these terms.
|
|
|
|
To do so, attach the following notices to the program. It is safest
|
|
to attach them to the start of each source file to most effectively
|
|
convey the exclusion of warranty; and each file should have at least
|
|
the ``copyright'' line and a pointer to where the full notice is found.
|
|
|
|
@smallexample
|
|
@var{one line to give the program's name and an idea of what it does.}
|
|
Copyright (C) 19@var{yy} @var{name of author}
|
|
|
|
This program is free software; you can redistribute it and/or
|
|
modify it under the terms of the GNU General Public License
|
|
as published by the Free Software Foundation; either version 2
|
|
of the License, or (at your option) any later version.
|
|
|
|
This program is distributed in the hope that it will be useful,
|
|
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE@. See the
|
|
GNU General Public License for more details.
|
|
|
|
You should have received a copy of the GNU General Public License
|
|
along with this program; if not, write to the Free Software
|
|
Foundation, Inc., 59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA.
|
|
@end smallexample
|
|
|
|
Also add information on how to contact you by electronic and paper mail.
|
|
|
|
If the program is interactive, make it output a short notice like this
|
|
when it starts in an interactive mode:
|
|
|
|
@smallexample
|
|
Gnomovision version 69, Copyright (C) 19@var{yy} @var{name of author}
|
|
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details
|
|
type `show w'. This is free software, and you are welcome
|
|
to redistribute it under certain conditions; type `show c'
|
|
for details.
|
|
@end smallexample
|
|
|
|
The hypothetical commands @samp{show w} and @samp{show c} should show
|
|
the appropriate parts of the General Public License. Of course, the
|
|
commands you use may be called something other than @samp{show w} and
|
|
@samp{show c}; they could even be mouse-clicks or menu items---whatever
|
|
suits your program.
|
|
|
|
You should also get your employer (if you work as a programmer) or your
|
|
school, if any, to sign a ``copyright disclaimer'' for the program, if
|
|
necessary. Here is a sample; alter the names:
|
|
|
|
@smallexample
|
|
@group
|
|
Yoyodyne, Inc., hereby disclaims all copyright
|
|
interest in the program `Gnomovision'
|
|
(which makes passes at compilers) written
|
|
by James Hacker.
|
|
|
|
@var{signature of Ty Coon}, 1 April 1989
|
|
Ty Coon, President of Vice
|
|
@end group
|
|
@end smallexample
|
|
|
|
This General Public License does not permit incorporating your program into
|
|
proprietary programs. If your program is a subroutine library, you may
|
|
consider it more useful to permit linking proprietary applications with the
|
|
library. If this is what you want to do, use the GNU Library General
|
|
Public License instead of this License.
|
|
|
|
@node Index, , Copying, Top
|
|
@unnumbered Index
|
|
@printindex cp
|
|
|
|
@summarycontents
|
|
@contents
|
|
@bye
|
|
|
|
Unresolved Issues:
|
|
------------------
|
|
1. From ADR.
|
|
|
|
Robert J. Chassell points out that awk programs should have some indication
|
|
of how to use them. It would be useful to perhaps have a "programming
|
|
style" section of the manual that would include this and other tips.
|
|
|
|
2. The default AWKPATH search path should be configurable via `configure'
|
|
The default and how this changes needs to be documented.
|
|
|
|
Consistency issues:
|
|
/.../ regexps are in @code, not @samp
|
|
".." strings are in @code, not @samp
|
|
no @print before @dots
|
|
values of expressions in the text (@code{x} has the value 15),
|
|
should be in roman, not @code
|
|
Use tab and not TAB
|
|
Use ESC and not ESCAPE
|
|
Use space and not blank to describe the space bar's character
|
|
The term "blank" is thus basically reserved for "blank lines" etc.
|
|
The `(d.c.)' should appear inside the closing `.' of a sentence
|
|
It should come before (pxref{...})
|
|
" " should have an @w{} around it
|
|
Use "non-" everywhere
|
|
Use @code{ftp} when talking about anonymous ftp
|
|
Use upper-case and lower-case, not "upper case" and "lower case"
|
|
Use alphanumeric, not alpha-numeric
|
|
Use --foo, not -Wfoo when describing long options
|
|
Use findex for all programs and functions in the example chapters
|
|
Use "Bell Laboratories", but not "Bell Labs".
|
|
Use "behavior" instead of "behaviour".
|
|
Use "zeros" instead of "zeroes".
|
|
Use "Input/Output", not "input/output". Also "I/O", not "i/o".
|
|
Use @code{do}, and not @code{do}-@code{while}, except where
|
|
actually discussing the do-while.
|
|
The words "a", "and", "as", "between", "for", "from", "in", "of",
|
|
"on", "that", "the", "to", "with", and "without",
|
|
should not be capitalized in @chapter, @section etc.
|
|
"Into" and "How" should.
|
|
Search for @dfn; make sure important items are also indexed.
|
|
"e.g." should always be followed by a comma.
|
|
"i.e." should never be followed by a comma, and should be followed
|
|
by `@:'.
|
|
The numbers zero through ten should be spelled out, except when
|
|
talking about file descriptor numbers. > 10 and < 0, it's
|
|
ok to use numbers.
|
|
In tables, put command line options in @code, while in the text,
|
|
put them in @samp.
|
|
When using @strong, use "Note:" or "Caution:" with colons and
|
|
not exclamation points. Do not surround the paragraphs
|
|
with @quotation ... @end quotation.
|
|
|
|
Date: Wed, 13 Apr 94 15:20:52 -0400
|
|
From: rsm@gnu.ai.mit.edu (Richard Stallman)
|
|
To: gnu-prog@gnu.ai.mit.edu
|
|
Subject: A reminder: no pathnames in GNU
|
|
|
|
It's a GNU convention to use the term "file name" for the name of a
|
|
file, never "pathname". We use the term "path" for search paths,
|
|
which are lists of file names. Using it for a single file name as
|
|
well is potentially confusing to users.
|
|
|
|
So please check any documentation you maintain, if you think you might
|
|
have used "pathname".
|
|
|
|
Note that "file name" should be two words when it appears as ordinary
|
|
text. It's ok as one word when it's a metasyntactic variable, though.
|
|
|
|
Suggestions:
|
|
------------
|
|
Enhance FIELDWIDTHS with some way to indicate "the rest of the record".
|
|
E.g., a length of 0 or -1 or something. May be "n"?
|
|
|
|
Make FIELDWIDTHS be an array?
|
|
|
|
What if FIELDWIDTHS has invalid values in it?
|