Clarify undisplayable characters, --unibyte, locales.

Clarify self-insertion of non-ASCII 8-bit chars. Clarify coding system detection of escape sequences. Clarify keyboard input methods and coding systems. Comment out the commands to inquire about character sets. Misc cleanups.
2024-11-28 07:45:00 +00:00 · 2001-02-17 18:12:07 +00:00 · 2001-02-17 18:12:07 +00:00 · 4b40407a71
commit 4b40407a71
parent 8e375db276
1 changed file with 160 additions and 150 deletions
--- a/man/mule.texi
+++ b/man/mule.texi
@ -42,7 +42,7 @@ have been merged from the modified version of Emacs known as MULE (for
 ``MULti-lingual Enhancement to GNU Emacs'')

  Emacs also supports various encodings of these characters used by
-internationalized software, such as word processors, mailers, etc.
+other internationalized software, such as word processors and mailers.

@menu
 * International Intro::     Basic concepts of multibyte characters.
@ -80,16 +80,31 @@ cases) in the @kbd{C-q} command (@pxref{Multibyte Conversion}).
@kindex C-h h
@findex view-hello-file
@cindex undisplayable characters
-@cindex ?
-@cindex ??
+@cindex @samp{?} in display
  The command @kbd{C-h h} (@code{view-hello-file}) displays the file
@file{etc/HELLO}, which shows how to say ``hello'' in many languages.
-This illustrates various scripts.  If the font you're using doesn't have
-characters for all those different languages, you will see some hollow
-boxes instead of characters; see @ref{Fontsets}.  On non-windowing
-displays, @samp{?} is displayed in place of the hollow box.  More than
-one @samp{?} is displayed for undisplayable characters that are wider
-than one column.
+This illustrates various scripts.  If some characters can't be
+displayed on your terminal, they appear as @samp{?} or as hollow boxes
+(@pxref{Undisplayable Characters}).
+
+  Keyboards, even in the countries where these character sets are used,
+generally don't have keys for all the characters in them.  So Emacs
+supports various @dfn{input methods}, typically one for each script or
+language, to make it convenient to type them.
+
+@kindex C-x RET
+  The prefix key @kbd{C-x @key{RET}} is used for commands that pertain
+to multibyte characters, coding systems, and input methods.
+
+@ignore
+@c This is commented out because it doesn't fit here, or anywhere.
+@c This manual does not discuss "character sets" as they
+@c are used in Mule, and it makes no sense to mention these commands
+@c except as part of a larger discussion of the topic.
+@c But it is not clear that topic is worth mentioning here,
+@c since that is more of an implementation concept
+@c than a user-level concept.  And when we switch to Unicode,
+@c character sets in the current sense may not even exist.

@findex list-charset-chars
@cindex characters in a certain charset
@ -101,15 +116,7 @@ character set, and displays all the characters in that character set.
  The command @kbd{M-x describe-character-set} prompts for a character
 set name and displays information about that character set, including
 its internal representation within Emacs.
-
-  Keyboards, even in the countries where these character sets are used,
-generally don't have keys for all the characters in them.  So Emacs
-supports various @dfn{input methods}, typically one for each script or
-language, to make it convenient to type them.
-
-@kindex C-x RET
-  The prefix key @kbd{C-x @key{RET}} is used for commands that pertain
-to multibyte characters, coding systems, and input methods.
+@end ignore

@node Enabling Multibyte
@section Enabling Multibyte Characters
@ -153,16 +160,22 @@ have basically the same effect as @samp{--unibyte}.
@cindex unibyte operation, and Lisp files
@cindex init file, and non-ASCII characters
@cindex environment variables, and non-ASCII characters
-  Multibyte strings are not created during initialization from the
-values of environment variables, @file{/etc/passwd} entries etc.@: that
-contain non-ASCII 8-bit characters.  However, Lisp files, when they are
-loaded for running, and in particular the initialization file
-@file{.emacs}, are normally read as multibyte---even with
-@samp{--unibyte}.  To avoid multibyte strings being generated by
-non-ASCII characters in Lisp files, put @samp{-*-unibyte: t;-*-} in a
-comment on the first line, or specify the coding system @samp{raw-text}
-with @kbd{C-x @key{RET} c}.  Do the same for initialization files for
-packages like Gnus.
+  With @samp{--unibyte}, multibyte strings are not created during
+initialization from the values of environment variables,
+@file{/etc/passwd} entries etc.@: that contain non-ASCII 8-bit
+characters.
+
+  Emacs normally loads Lisp files as multibyte, regardless of whether
+you used @samp{--unibyte}.  This includes the Emacs initialization
+file, @file{.emacs}, and the initialization files of Emacs packages
+such as Gnus.  However, you can specify unibyte loading for a
+particular Lisp file, by putting @samp{-*-unibyte: t;-*-} in a comment
+on the first line.  Then that file is always loaded as unibyte text,
+even if you did not start Emacs with @samp{--unibyte}.  The motivation
+for these conventions is that it is more reliable to always load any
+particular Lisp file in the same way.  However, you can load a Lisp
+file as unibyte, on any one occasion, by typing @kbd{C-x @key{RET} c
+raw-text @key{RET}} immediately before loading it.

  The mode line indicates whether multibyte character support is enabled
 in the current buffer.  If it is, there are two or more characters (most
@ -206,13 +219,12 @@ sign), Polish, Romanian, Slovak, Slovenian, Thai, Tibetan, Turkish,
 Dutch, Spanish, and Vietnamese.
@end quotation

-@cindex fonts, for displaying different languages
-  To be able to display the script(s) used by your language environment
-on a windowed display, you need to have a suitable font installed.  If
-some of the characters appear as empty boxes, download and install the
-GNU Intlfonts distribution, which includes fonts for all supported
-scripts.  @xref{Fontsets}, for more details about setting up your
-fonts.
+@cindex fonts for various scripts
+  To display the script(s) used by your language environment on a
+graphical display, you need to have a suitable font.  If some of the
+characters appear as empty boxes, you should install the GNU Intlfonts
+package, which includes fonts for all supported scripts.
+@xref{Fontsets}, for more details about setting up your fonts.

@findex set-locale-environment
@vindex locale-language-names
@ -220,31 +232,21 @@ fonts.
@cindex locales
  Some operating systems let you specify the language you are using by
 setting the locale environment variables @env{LC_ALL}, @env{LC_CTYPE},
-and @env{LANG}; the first of these which is nonempty specifies your
-locale.  Emacs handles this during startup by invoking the
-@code{set-locale-environment} function, which matches your locale
-against entries in the value of the variable
+or @env{LANG}.@footnote{If more than one of these is set, the first
+one that is nonempty specifies your locale for this purpose.}  Emacs
+handles this during startup by matching your locale against entries in
+the value of the variables @code{locale-charset-language-names} and
@code{locale-language-names} and selects the corresponding language
-environment if a match is found.  But if your locale also matches an
-entry in the variable @code{locale-charset-language-names}, this entry
-is preferred if its character set disagrees.  For example, suppose the
-locale @samp{en_GB.ISO8859-15} matches @code{"Latin-1"} in
-@code{locale-language-names} and @code{"Latin-9"} in
-@code{locale-charset-language-names}; since these two language
-environments' character sets disagree, Emacs uses @code{"Latin-9"}.
+environment if a match is found.  (The former variable overrides the
+latter.)  It also adjusts the display table and terminal coding
+system, the locale coding system, and the preferred coding system as
+needed for the locale.

-  If all goes well, the @code{set-locale-environment} function selects
-the language environment, since language is part of locale.  It also
-adjusts the display table and terminal coding system, the locale coding
-system, and the preferred coding system as needed for the locale.
+  If you modify the @env{LC_ALL}, @env{LC_CTYPE}, or @env{LANG}
+environment variables while running Emacs, you may want to invoke the
+@code{set-locale-environment} function afterwards to readjust the
+language environment from the new locale.

-  Since the @code{set-locale-environment} function is automatically
-invoked during startup, you normally do not need to invoke it yourself.
-However, if you modify the @env{LC_ALL}, @env{LC_CTYPE}, or @env{LANG}
-environment variables, you may want to invoke the
-@code{set-locale-environment} function afterwards.
-
-@findex set-locale-environment
@vindex locale-preferred-coding-systems
  The @code{set-locale-environment} function normally uses the preferred
 coding system established by the language environment to decode system
@ -255,10 +257,10 @@ matches @code{japanese-shift-jis} in
@code{locale-preferred-coding-systems}, Emacs uses that encoding even
 though it might normally use @code{japanese-iso-8bit}.

-  The environment chosen from the locale when Emacs starts is
-overidden by any explicit use of the command
-@code{set-language-environment} or customization of
-@code{current-language-environment} in your init file.
+  You can override the language environment chosen at startup with
+explicit use of the command @code{set-language-environment}, or with
+customization of @code{current-language-environment} in your init
+file.

@kindex C-h L
@findex describe-language-environment
@ -369,8 +371,10 @@ characters to type next is displayed in the echo area (but not when you
 are in the minibuffer).

@cindex Leim package
-Input methods are implemented in the separate Leim package, which must
-be installed with Emacs.
+  Input methods are implemented in the separate Leim package: they are
+available only if the system administrator used Leim when building
+Emacs.  If Emacs was built without Leim, you will find that no input
+methods are defined.

@node Select Input Method
@section Selecting an Input Method
@ -443,11 +447,12 @@ method, including the string that stands for it in the mode line.
 through 0377 (octal) are not really legitimate in the buffer.  The valid
 non-ASCII printing characters have codes that start from 0400.

-  If you type a self-inserting character in the range 0240
-through 0377, Emacs assumes you intended to use one of the ISO
-Latin-@var{n} character sets, and converts it to the Emacs code
-representing that Latin-@var{n} character.  You select @emph{which} ISO
-Latin character set to use through your choice of language environment
+  If you type a self-inserting character in the range 0240 through
+0377, or if you use @kbd{C-q} to insert one, Emacs assumes you
+intended to use one of the ISO Latin-@var{n} character sets, and
+converts it to the Emacs code representing that Latin-@var{n}
+character.  You select @emph{which} ISO Latin character set to use
+through your choice of language environment
@iftex
 (see above).
@end iftex
@ -456,13 +461,12 @@ Latin character set to use through your choice of language environment
@end ifinfo
 If you do not specify a choice, the default is Latin-1.

-  The same thing happens when you use @kbd{C-q} to enter an octal code
-in this range.  If you enter a code in the range 0200 through 0237,
-which forms the @code{eight-bit-control} character set, it is inserted
+  If you insert a character in the range 0200 through 0237, which
+forms the @code{eight-bit-control} character set, it is inserted
 literally.  You should normally avoid doing this since buffers
 containing such characters have to be written out in either the
-@code{emacs-mule} or @code{raw-text} coding system, which is usually not
-what you want.
+@code{emacs-mule} or @code{raw-text} coding system, which is usually
+not what you want.

@node Coding Systems
@section Coding Systems
@ -652,24 +656,24 @@ to non-@code{nil}.
@cindex escape sequences in files
  By default, the automatic detection of coding system is sensitive to
 escape sequences.  If Emacs sees a sequence of characters that begin
-with an @key{ESC} character, and the sequence is valid as an ISO-2022
-code, the code is determined as one of ISO-2022 encoding, and the file
-is decoded by the corresponding coding system
-(e.g. @code{iso-2022-7bit}).
+with an escape character, and the sequence is valid as an ISO-2022
+code, that tells Emacs to use one of the ISO-2022 encodings to decode
+the file.

-  However, there may be cases that you want to read escape sequences in
-a file as is.  In such a case, you can set th variable
+  However, there may be cases in which you want to read escape sequences
+in a file as is.  In such a case, you can set the variable
@code{inhibit-iso-escape-detection} to non-@code{nil}.  Then the code
-detection will ignore any escape sequences, and so no file is detected
-as being encoded in some of ISO-2022 encoding.  The result is that all
-escape sequences become visible in a buffer.
+detection ignores any escape sequences, and never uses an ISO-2022
+encoding.  The result is that all escape sequences become visible in
+the buffer.

  The default value of @code{inhibit-iso-escape-detection} is
-@code{nil}, and it is strongly recommended not to change it.  That's
-because many Emacs Lisp source files that contain non-ASCII characters
-are encoded in the coding system @code{iso-2022-7bit} in the Emacs
-distribution, and they won't be decoded correctly when you visit those
-files if you suppress the escape sequence detection.
+@code{nil}.  We recommend that you not change it permanently, only for
+one specific operation.  That's because many Emacs Lisp source files
+that contain non-ASCII characters are encoded in the coding system
+@code{iso-2022-7bit} in the Emacs distribution, and they won't be
+decoded correctly when you visit those files if you suppress the
+escape sequence detection.

@vindex coding
  You can specify the coding system for a particular file using the
@ -700,33 +704,34 @@ a different coding system, you can specify a different coding system for
 the buffer using @code{set-buffer-file-coding-system} (@pxref{Specify
 Coding}).

-  While editing a file, you will sometimes insert characters which
-cannot be encoded with the coding system stored in
-@code{buffer-file-coding-system}.  For example, suppose you start with
-an ASCII file and insert a few Latin-1 characters into it.  Or you could
-edit a text file in Polish encoded in @code{iso-8859-2} and add to it
-translations of several Polish words into Russian.  When you save the
-buffer, Emacs can no longer use the previous value of the buffer's
-coding system, because the characters you added cannot be encoded by
-that coding system.
+  You can insert any possible character into any Emacs buffer, but
+most coding systems can only handle some of the possible characters.
+This means that you can insert characters that cannot be encoded with
+the coding system that will be used to save the buffer.  For example,
+you could start with an ASCII file and insert a few Latin-1 characters
+into it, or you could edit a text file in Polish encoded in
+@code{iso-8859-2} and add to it translations of several Polish words
+into Russian.  When you save the buffer, Emacs cannot use the current
+value of @code{buffer-file-coding-system}, because the characters you
+added cannot be encoded by that coding system.

  When that happens, Emacs tries the most-preferred coding system (set
 by @kbd{M-x prefer-coding-system} or @kbd{M-x
-set-language-environment}), and if that coding system can safely encode
-all of the characters in the buffer, Emacs uses it, and stores its value
-in @code{buffer-file-coding-system}.  Otherwise, Emacs pops up a window
-with a list of coding systems suitable for encoding the buffer, and
-prompts you to choose one of those coding systems.
+set-language-environment}), and if that coding system can safely
+encode all of the characters in the buffer, Emacs uses it, and stores
+its value in @code{buffer-file-coding-system}.  Otherwise, Emacs
+displays a list of coding systems suitable for encoding the buffer's
+contents, and asks you to choose one of those coding systems.

-  If you insert characters which cannot be encoded by the buffer's
-coding system while editing a mail message, Emacs behaves a bit
-differently.  It additionally checks whether the most-preferred coding
-system is recommended for use in MIME messages; if it isn't, Emacs tells
-you that the most-preferred coding system is not recommended and prompts
-you for another coding system.  This is so you won't inadvertently send
-a message encoded in a way that your recipient's mail software will have
-difficulty decoding.  (If you do want to use the most-preferred coding
-system, you can type its name to Emacs prompt anyway.)
+  If you insert the unsuitable characters in a mail message, Emacs
+behaves a bit differently.  It additionally checks whether the
+most-preferred coding system is recommended for use in MIME messages;
+if it isn't, Emacs tells you that the most-preferred coding system is
+not recommended and prompts you for another coding system.  This is so
+you won't inadvertently send a message encoded in a way that your
+recipient's mail software will have difficulty decoding.  (If you do
+want to use the most-preferred coding system, you can type its name at
+the Emacs prompt anyway.)

@vindex sendmail-coding-system
  When you send a message with Mail mode (@pxref{Sending Mail}), Emacs has
@ -916,13 +921,14 @@ name, or it may get an error.  If such a problem happens, use @kbd{C-x
 C-w} to specify a new file name for that buffer.

@vindex locale-coding-system
-  The variable @code{locale-coding-system} specifies a coding system to
-use when encoding and decoding system strings such as system error
-messages and @code{format-time-string} formats and time stamps.  This
-coding system should be compatible with the underlying system's coding
-system, which is normally specified by the first environment variable in
-the list @env{LC_ALL}, @env{LC_CTYPE}, @env{LANG} whose value is
-nonempty.
+  The variable @code{locale-coding-system} specifies a coding system
+to use when encoding and decoding system strings such as system error
+messages and @code{format-time-string} formats and time stamps.  You
+should choose a coding system that is compatible with the underlying
+system's text representation, which is normally specified by one of
+the environment variables @env{LC_ALL}, @env{LC_CTYPE}, and
+@env{LANG}.  (The first one whose value is nonempty is the one that
+determines the text representation.)

@node Fontsets
@section Fontsets
@ -941,7 +947,7 @@ specifying its name, anywhere that you could use a single font.  Of
 course, Emacs fontsets can use only the fonts that the X server
 supports; if certain characters appear on the screen as hollow boxes,
 this means that the fontset in use for them has no font for those
-characters.@footnote{The installation instructions have information on
+characters.@footnote{The Emacs installation instructions have information on
 additional font support.}

  Emacs creates two fontsets automatically: the @dfn{standard fontset}
@ -1099,23 +1105,27 @@ call this function explicitly to create a fontset.
@node Undisplayable Characters
@section Undisplayable Characters

-Your terminal may not be able to display some non-@sc{ascii} characters.
-Most non-windowing terminals can only use a single character set,
-specified by the variable @code{default-terminal-coding-system}
-(@pxref{Specify Coding}) and characters which can't be encoded in it are
-displayed as @samp{?} by default.  Windowing terminals may not have the
-necessary font available to display a given character and display a
-hollow box instead.  You can change the default behavior.
+  Your terminal may be unable to display some non-@sc{ascii}
+characters.  Most non-windowing terminals can only use a single
+character set (use the variable @code{default-terminal-coding-system}
+(@pxref{Specify Coding}) to tell Emacs which one); characters which
+can't be encoded in that coding system are displayed as @samp{?} by
+default.

-If you use Latin-1 characters but your terminal can't display Latin-1,
-you can arrange to display mnemonic @sc{ascii} sequences instead, e.g.@:
-@samp{"o} for o-umlaut.  Load the library @file{iso-ascii} to do this.
+  Windowing terminals can display a broader range of characters, but
+you may not have fonts installed for all of them; characters that have
+no font appear as a hollow box.

-If your terminal can display Latin-1, you can display characters from
-other European character sets using a mixture of equivalent Latin-1
-characters and @sc{ascii} mnemonics.  Use the Custom option
-@code{latin1-display} to enable this.  The mnemonic @sc{ascii} sequences
-mostly correspond to those of the prefix input methods.
+  If you use Latin-1 characters but your terminal can't display
+Latin-1, you can arrange to display mnemonic @sc{ascii} sequences
+instead, e.g.@: @samp{"o} for o-umlaut.  Load the library
+@file{iso-ascii} to do this.
+
+  If your terminal can display Latin-1, you can display characters
+from other European character sets using a mixture of equivalent
+Latin-1 characters and @sc{ascii} mnemonics.  Use the Custom option
+@code{latin1-display} to enable this.  The mnemonic @sc{ascii}
+sequences mostly correspond to those of the prefix input methods.

@node Single-Byte Character Support
@section Single-byte Character Set Support
@ -1172,18 +1182,18 @@ characters:
@findex set-keyboard-coding-system
@vindex keyboard-coding-system
 If your keyboard can generate character codes 128 and up, representing
-non-ASCII characters, use the command @code{M-x
-set-keyboard-coding-system} or the Custom option
-@code{keyboard-coding-system} to specify this in the same way as for
-multibyte usage (@pxref{Specify Coding}).
+non-ASCII characters, you can type those character codes directly.

-It is not necessary to do this under a window system which can
-distinguish 8-bit characters and Meta keys.  If you do this on a normal
-terminal, you will probably need to use @kbd{ESC} to type Meta
-characters.@footnote{In some cases, such as the Linux console and
-@code{xterm}, you can arrange for Meta to be converted to @kbd{ESC} and
-still be able type 8-bit characters present directly on the keyboard or
-using @kbd{Compose} or @kbd{AltGr} keys.}  @xref{User Input}.
+On a windowing terminal, you should not need to do anything special to
+use these keys; they should simply work.  On a text-only terminal, you
+should use the command @code{M-x set-keyboard-coding-system} or the
+Custom option @code{keyboard-coding-system} to specify which coding
+system your keyboard uses (@pxref{Specify Coding}).  Enabling this
+feature will probably require you to use @kbd{ESC} to type Meta
+characters; however, on a Linux console or in @code{xterm}, you can
+arrange for Meta to be converted to @kbd{ESC} and still be able to type
+8-bit characters present directly on the keyboard or using
+@kbd{Compose} or @kbd{AltGr} keys.  @xref{User Input}.

@item
 You can use an input method for the selected language environment.
@ -1205,7 +1215,7 @@ and in any other context where a key sequence is allowed.
 library is loaded, the @key{ALT} modifier key, if you have one, serves
 the same purpose as @kbd{C-x 8}; use @key{ALT} together with an accent
 character to modify the following letter.  In addition, if you have keys
-for the Latin-1 ``dead accent characters'', they too are defined to
+for the Latin-1 ``dead accent characters,'' they too are defined to
 compose with the following character, once @code{iso-transl} is loaded.
 Use @kbd{C-x 8 C-h} to list the available translations as mnemonic
 command names.
@ -1215,9 +1225,9 @@ command names.
@cindex ISO Accents mode
@findex iso-accents-mode
@cindex Latin-1, Latin-2 and Latin-3 input mode
-For Latin-1, Latin-2 and Latin-3, @kbd{M-x iso-accents-mode} installs a
-minor mode which provides a facility like the @code{latin-1-prefix}
-input method but independent of the Leim package.  This mode is
-buffer-local.  It can be customized for various languages with @kbd{M-x
-iso-accents-customize}.
+For Latin-1, Latin-2 and Latin-3, @kbd{M-x iso-accents-mode} installs
+a minor mode which works much like the @code{latin-1-prefix} input
+method, but does not depend on having the input methods installed.  This
+mode is buffer-local.  It can be customized for various languages with
+@kbd{M-x iso-accents-customize}.
@end itemize
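
A few of the settings this change documents can be sketched as an Emacs Lisp init-file fragment. This is an illustrative sketch and not part of the commit: the locale value, the file name, and the choice of `iso-latin-1` are hypothetical placeholders, and details may differ between Emacs versions.

```elisp
;; Re-derive the language environment after changing the locale
;; variables inside a running Emacs.  "ja_JP.eucJP" is a placeholder;
;; it has no effect if LC_ALL or LC_CTYPE is already set and nonempty,
;; because the first nonempty variable wins.
(setenv "LANG" "ja_JP.eucJP")
(set-locale-environment)

;; Suppress ISO-2022 escape-sequence detection for one operation only,
;; by binding the variable with `let' rather than setting it globally.
;; "sample-with-escapes.txt" is a hypothetical file name.
(let ((inhibit-iso-escape-detection t))
  (find-file-noselect "sample-with-escapes.txt"))

;; On a text-only terminal, tell Emacs which coding system the
;; keyboard sends.  `iso-latin-1' is an assumed choice.
(set-keyboard-coding-system 'iso-latin-1)
```

Binding `inhibit-iso-escape-detection` with `let` keeps the change local to a single file visit, which follows the manual's advice not to change the variable permanently.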