mirror of https://git.savannah.gnu.org/git/emacs.git synced 2024-11-28 07:45:00 +00:00

Clarify undisplayable characters, --unibyte, locales.

Clarify self-insertion of non-ASCII 8-bit chars.
Clarify coding system detection of escape sequences.
Clarify keyboard input methods and coding systems.
Comment out the commands to inquire about character sets.

Misc cleanups.
This commit is contained in:
Richard M. Stallman 2001-02-17 18:12:07 +00:00
parent 8e375db276
commit 4b40407a71


@ -42,7 +42,7 @@ have been merged from the modified version of Emacs known as MULE (for
``MULti-lingual Enhancement to GNU Emacs'')
Emacs also supports various encodings of these characters used by
internationalized software, such as word processors, mailers, etc.
other internationalized software, such as word processors and mailers.
@menu
* International Intro:: Basic concepts of multibyte characters.
@ -80,16 +80,31 @@ cases) in the @kbd{C-q} command (@pxref{Multibyte Conversion}).
@kindex C-h h
@findex view-hello-file
@cindex undisplayable characters
@cindex ?
@cindex ??
@cindex @samp{?} in display
The command @kbd{C-h h} (@code{view-hello-file}) displays the file
@file{etc/HELLO}, which shows how to say ``hello'' in many languages.
This illustrates various scripts. If the font you're using doesn't have
characters for all those different languages, you will see some hollow
boxes instead of characters; see @ref{Fontsets}. On non-windowing
displays, @samp{?} is displayed in place of the hollow box. More than
one @samp{?} is displayed for undisplayable characters that are wider
than one column.
This illustrates various scripts. If some characters can't be
displayed on your terminal, they appear as @samp{?} or as hollow boxes
(@pxref{Undisplayable Characters}).
Keyboards, even in the countries where these character sets are used,
generally don't have keys for all the characters in them. So Emacs
supports various @dfn{input methods}, typically one for each script or
language, to make it convenient to type them.
@kindex C-x RET
The prefix key @kbd{C-x @key{RET}} is used for commands that pertain
to multibyte characters, coding systems, and input methods.
@ignore
@c This is commented out because it doesn't fit here, or anywhere.
@c This manual does not discuss "character sets" as they
@c are used in Mule, and it makes no sense to mention these commands
@c except as part of a larger discussion of the topic.
@c But it is not clear that topic is worth mentioning here,
@c since that is more of an implementation concept
@c than a user-level concept. And when we switch to Unicode,
@c character sets in the current sense may not even exist.
@findex list-charset-chars
@cindex characters in a certain charset
@ -101,15 +116,7 @@ character set, and displays all the characters in that character set.
The command @kbd{M-x describe-character-set} prompts for a character
set name and displays information about that character set, including
its internal representation within Emacs.
Keyboards, even in the countries where these character sets are used,
generally don't have keys for all the characters in them. So Emacs
supports various @dfn{input methods}, typically one for each script or
language, to make it convenient to type them.
@kindex C-x RET
The prefix key @kbd{C-x @key{RET}} is used for commands that pertain
to multibyte characters, coding systems, and input methods.
@end ignore
@node Enabling Multibyte
@section Enabling Multibyte Characters
@ -153,16 +160,22 @@ have basically the same effect as @samp{--unibyte}.
@cindex unibyte operation, and Lisp files
@cindex init file, and non-ASCII characters
@cindex environment variables, and non-ASCII characters
Multibyte strings are not created during initialization from the
values of environment variables, @file{/etc/passwd} entries etc.@: that
contain non-ASCII 8-bit characters. However, Lisp files, when they are
loaded for running, and in particular the initialization file
@file{.emacs}, are normally read as multibyte---even with
@samp{--unibyte}. To avoid multibyte strings being generated by
non-ASCII characters in Lisp files, put @samp{-*-unibyte: t;-*-} in a
comment on the first line, or specify the coding system @samp{raw-text}
with @kbd{C-x @key{RET} c}. Do the same for initialization files for
packages like Gnus.
With @samp{--unibyte}, multibyte strings are not created during
initialization from the values of environment variables,
@file{/etc/passwd} entries etc.@: that contain non-ASCII 8-bit
characters.
Emacs normally loads Lisp files as multibyte, regardless of whether
you used @samp{--unibyte}. This includes the Emacs initialization
file, @file{.emacs}, and the initialization files of Emacs packages
such as Gnus. However, you can specify unibyte loading for a
particular Lisp file, by putting @samp{-*-unibyte: t;-*-} in a comment
on the first line. Then that file is always loaded as unibyte text,
even if you did not start Emacs with @samp{--unibyte}. The motivation
for these conventions is that it is more reliable to always load any
particular Lisp file in the same way. However, you can load a Lisp
file as unibyte, on any one occasion, by typing @kbd{C-x @key{RET} c
raw-text @key{RET}} immediately before loading it.
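As a minimal sketch of the per-file setting described above, a Lisp
file that should always be loaded as unibyte could begin with a
comment line like the following. Only the @samp{-*-unibyte: t;-*-}
cookie comes from the text above; the rest of the file is hypothetical.
@example
;; -*-unibyte: t;-*-
;; Hypothetical file: its ordinary Lisp contents follow this line.
@end example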
The mode line indicates whether multibyte character support is enabled
in the current buffer. If it is, there are two or more characters (most
@ -206,13 +219,12 @@ sign), Polish, Romanian, Slovak, Slovenian, Thai, Tibetan, Turkish,
Dutch, Spanish, and Vietnamese.
@end quotation
@cindex fonts, for displaying different languages
To be able to display the script(s) used by your language environment
on a windowed display, you need to have a suitable font installed. If
some of the characters appear as empty boxes, download and install the
GNU Intlfonts distribution, which includes fonts for all supported
scripts. @xref{Fontsets}, for more details about setting up your
fonts.
@cindex fonts for various scripts
To display the script(s) used by your language environment on a
graphical display, you need to have a suitable font. If some of the
characters appear as empty boxes, you should install the GNU Intlfonts
package, which includes fonts for all supported scripts.
@xref{Fontsets}, for more details about setting up your fonts.
@findex set-locale-environment
@vindex locale-language-names
@ -220,31 +232,21 @@ fonts.
@cindex locales
Some operating systems let you specify the language you are using by
setting the locale environment variables @env{LC_ALL}, @env{LC_CTYPE},
and @env{LANG}; the first of these which is nonempty specifies your
locale. Emacs handles this during startup by invoking the
@code{set-locale-environment} function, which matches your locale
against entries in the value of the variable
or @env{LANG}.@footnote{If more than one of these is set, the first
one that is nonempty specifies your locale for this purpose.} Emacs
handles this during startup by matching your locale against entries in
the value of the variables @code{locale-charset-language-names} and
@code{locale-language-names} and selects the corresponding language
environment if a match is found. But if your locale also matches an
entry in the variable @code{locale-charset-language-names}, this entry
is preferred if its character set disagrees. For example, suppose the
locale @samp{en_GB.ISO8859-15} matches @code{"Latin-1"} in
@code{locale-language-names} and @code{"Latin-9"} in
@code{locale-charset-language-names}; since these two language
environments' character sets disagree, Emacs uses @code{"Latin-9"}.
environment if a match is found. (The former variable overrides the
latter.) It also adjusts the display table and terminal coding
system, the locale coding system, and the preferred coding system as
needed for the locale.
If all goes well, the @code{set-locale-environment} function selects
the language environment, since language is part of locale. It also
adjusts the display table and terminal coding system, the locale coding
system, and the preferred coding system as needed for the locale.
If you modify the @env{LC_ALL}, @env{LC_CTYPE}, or @env{LANG}
environment variables while running Emacs, you may want to invoke the
@code{set-locale-environment} function afterwards to readjust the
language environment from the new locale.
Since the @code{set-locale-environment} function is automatically
invoked during startup, you normally do not need to invoke it yourself.
However, if you modify the @env{LC_ALL}, @env{LC_CTYPE}, or @env{LANG}
environment variables, you may want to invoke the
@code{set-locale-environment} function afterwards.
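For instance, a sketch of readjusting the language environment after
changing @env{LANG} inside a running Emacs might look like this. The
locale name is only an illustration, and the sketch assumes that
@code{set-locale-environment} may be called with no arguments, in
which case it consults the environment variables.
@example
;; Illustrative locale name; substitute one that your system supports.
(setenv "LANG" "ja_JP.eucJP")
;; Re-read the locale variables and readjust the language environment.
(set-locale-environment)
@end example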
@findex set-locale-environment
@vindex locale-preferred-coding-systems
The @code{set-locale-environment} function normally uses the preferred
coding system established by the language environment to decode system
@ -255,10 +257,10 @@ matches @code{japanese-shift-jis} in
@code{locale-preferred-coding-systems}, Emacs uses that encoding even
though it might normally use @code{japanese-iso-8bit}.
The environment chosen from the locale when Emacs starts is
overidden by any explicit use of the command
@code{set-language-environment} or customization of
@code{current-language-environment} in your init file.
You can override the language environment chosen at startup with
explicit use of the command @code{set-language-environment}, or with
customization of @code{current-language-environment} in your init
file.
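A minimal init-file sketch of such an override follows. The
environment name @code{"Latin-1"} is only an example; use whichever
language environment suits you.
@example
;; In the init file: choose the language environment explicitly,
;; overriding whatever was selected from the locale at startup.
(set-language-environment "Latin-1")
@end example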
@kindex C-h L
@findex describe-language-environment
@ -369,8 +371,10 @@ characters to type next is displayed in the echo area (but not when you
are in the minibuffer).
@cindex Leim package
Input methods are implemented in the separate Leim package, which must
be installed with Emacs.
Input methods are implemented in the separate Leim package: they are
available only if the system administrator used Leim when building
Emacs. If Emacs was built without Leim, you will find that no input
methods are defined.
@node Select Input Method
@section Selecting an Input Method
@ -443,11 +447,12 @@ method, including the string that stands for it in the mode line.
through 0377 (octal) are not really legitimate in the buffer. The valid
non-ASCII printing characters have codes that start from 0400.
If you type a self-inserting character in the range 0240
through 0377, Emacs assumes you intended to use one of the ISO
Latin-@var{n} character sets, and converts it to the Emacs code
representing that Latin-@var{n} character. You select @emph{which} ISO
Latin character set to use through your choice of language environment
If you type a self-inserting character in the range 0240 through
0377, or if you use @kbd{C-q} to insert one, Emacs assumes you
intended to use one of the ISO Latin-@var{n} character sets, and
converts it to the Emacs code representing that Latin-@var{n}
character. You select @emph{which} ISO Latin character set to use
through your choice of language environment
@iftex
(see above).
@end iftex
@ -456,13 +461,12 @@ Latin character set to use through your choice of language environment
@end ifinfo
If you do not specify a choice, the default is Latin-1.
The same thing happens when you use @kbd{C-q} to enter an octal code
in this range. If you enter a code in the range 0200 through 0237,
which forms the @code{eight-bit-control} character set, it is inserted
If you insert a character in the range 0200 through 0237, which
forms the @code{eight-bit-control} character set, it is inserted
literally. You should normally avoid doing this since buffers
containing such characters have to be written out in either the
@code{emacs-mule} or @code{raw-text} coding system, which is usually not
what you want.
@code{emacs-mule} or @code{raw-text} coding system, which is usually
not what you want.
@node Coding Systems
@section Coding Systems
@ -652,24 +656,24 @@ to non-@code{nil}.
@cindex escape sequences in files
By default, the automatic detection of coding system is sensitive to
escape sequences. If Emacs sees a sequence of characters that begin
with an @key{ESC} character, and the sequence is valid as an ISO-2022
code, the code is determined as one of ISO-2022 encoding, and the file
is decoded by the corresponding coding system
(e.g. @code{iso-2022-7bit}).
with an escape character, and the sequence is valid as an ISO-2022
code, that tells Emacs to use one of the ISO-2022 encodings to decode
the file.
However, there may be cases that you want to read escape sequences in
a file as is. In such a case, you can set th variable
However, there may be cases in which you want to read escape sequences
in a file as is. In such a case, you can set the variable
@code{inhibit-iso-escape-detection} to non-@code{nil}. Then the code
detection will ignore any escape sequences, and so no file is detected
as being encoded in some of ISO-2022 encoding. The result is that all
escape sequences become visible in a buffer.
detection ignores any escape sequences, and never uses an ISO-2022
encoding. The result is that all escape sequences become visible in
the buffer.
The default value of @code{inhibit-iso-escape-detection} is
@code{nil}, and it is strongly recommended not to change it. That's
because many Emacs Lisp source files that contain non-ASCII characters
are encoded in the coding system @code{iso-2022-7bit} in the Emacs
distribution, and they won't be decoded correctly when you visit those
files if you suppress the escape sequence detection.
@code{nil}. We recommend that you not change it permanently, only for
one specific operation. That's because many Emacs Lisp source files
that contain non-ASCII characters are encoded in the coding system
@code{iso-2022-7bit} in the Emacs distribution, and they won't be
decoded correctly when you visit those files if you suppress the
escape sequence detection.
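One way to follow this recommendation, shown here only as a sketch,
is to bind the variable temporarily around a single operation instead
of setting it globally. The file name below is hypothetical.
@example
;; Inhibit escape-sequence detection only while visiting this one file;
;; the default value is restored when the `let' form exits.
(let ((inhibit-iso-escape-detection t))
  (find-file "/tmp/file-with-escapes.txt"))
@end example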
@vindex coding
You can specify the coding system for a particular file using the
@ -700,33 +704,34 @@ a different coding system, you can specify a different coding system for
the buffer using @code{set-buffer-file-coding-system} (@pxref{Specify
Coding}).
While editing a file, you will sometimes insert characters which
cannot be encoded with the coding system stored in
@code{buffer-file-coding-system}. For example, suppose you start with
an ASCII file and insert a few Latin-1 characters into it. Or you could
edit a text file in Polish encoded in @code{iso-8859-2} and add to it
translations of several Polish words into Russian. When you save the
buffer, Emacs can no longer use the previous value of the buffer's
coding system, because the characters you added cannot be encoded by
that coding system.
You can insert any possible character into any Emacs buffer, but
most coding systems can only handle some of the possible characters.
This means that you can insert characters that cannot be encoded with
the coding system that will be used to save the buffer. For example,
you could start with an ASCII file and insert a few Latin-1 characters
into it, or you could edit a text file in Polish encoded in
@code{iso-8859-2} and add to it translations of several Polish words
into Russian. When you save the buffer, Emacs cannot use the current
value of @code{buffer-file-coding-system}, because the characters you
added cannot be encoded by that coding system.
When that happens, Emacs tries the most-preferred coding system (set
by @kbd{M-x prefer-coding-system} or @kbd{M-x
set-language-environment}), and if that coding system can safely encode
all of the characters in the buffer, Emacs uses it, and stores its value
in @code{buffer-file-coding-system}. Otherwise, Emacs pops up a window
with a list of coding systems suitable for encoding the buffer, and
prompts you to choose one of those coding systems.
set-language-environment}), and if that coding system can safely
encode all of the characters in the buffer, Emacs uses it, and stores
its value in @code{buffer-file-coding-system}. Otherwise, Emacs
displays a list of coding systems suitable for encoding the buffer's
contents, and asks you to choose one of those coding systems.
If you insert characters which cannot be encoded by the buffer's
coding system while editing a mail message, Emacs behaves a bit
differently. It additionally checks whether the most-preferred coding
system is recommended for use in MIME messages; if it isn't, Emacs tells
you that the most-preferred coding system is not recommended and prompts
you for another coding system. This is so you won't inadvertently send
a message encoded in a way that your recipient's mail software will have
difficulty decoding. (If you do want to use the most-preferred coding
system, you can type its name to Emacs prompt anyway.)
If you insert the unsuitable characters in a mail message, Emacs
behaves a bit differently. It additionally checks whether the
most-preferred coding system is recommended for use in MIME messages;
if it isn't, Emacs tells you that the most-preferred coding system is
not recommended and prompts you for another coding system. This is so
you won't inadvertently send a message encoded in a way that your
recipient's mail software will have difficulty decoding. (If you do
want to use the most-preferred coding system, you can type its name at
the Emacs prompt anyway.)
@vindex sendmail-coding-system
When you send a message with Mail mode (@pxref{Sending Mail}), Emacs has
@ -916,13 +921,14 @@ name, or it may get an error. If such a problem happens, use @kbd{C-x
C-w} to specify a new file name for that buffer.
@vindex locale-coding-system
The variable @code{locale-coding-system} specifies a coding system to
use when encoding and decoding system strings such as system error
messages and @code{format-time-string} formats and time stamps. This
coding system should be compatible with the underlying system's coding
system, which is normally specified by the first environment variable in
the list @env{LC_ALL}, @env{LC_CTYPE}, @env{LANG} whose value is
nonempty.
The variable @code{locale-coding-system} specifies a coding system
to use when encoding and decoding system strings such as system error
messages and @code{format-time-string} formats and time stamps. You
should choose a coding system that is compatible with the underlying
system's text representation, which is normally specified by one of
the environment variables @env{LC_ALL}, @env{LC_CTYPE}, and
@env{LANG}. (The first one whose value is nonempty is the one that
determines the text representation.)
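As a sketch, assuming the underlying system represents its messages in
Latin-1, you could set the variable in your init file as follows. The
coding system name is only an example, and in many cases the value
chosen from the locale at startup needs no change.
@example
;; Decode and encode system strings (error messages, time stamps)
;; as Latin-1.  Example value only.
(setq locale-coding-system 'iso-latin-1)
@end example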
@node Fontsets
@section Fontsets
@ -941,7 +947,7 @@ specifying its name, anywhere that you could use a single font. Of
course, Emacs fontsets can use only the fonts that the X server
supports; if certain characters appear on the screen as hollow boxes,
this means that the fontset in use for them has no font for those
characters.@footnote{The installation instructions have information on
characters.@footnote{The Emacs installation instructions have information on
additional font support.}
Emacs creates two fontsets automatically: the @dfn{standard fontset}
@ -1099,23 +1105,27 @@ call this function explicitly to create a fontset.
@node Undisplayable Characters
@section Undisplayable Characters
Your terminal may not be able to display some non-@sc{ascii} characters.
Most non-windowing terminals can only use a single character set,
specified by the variable @code{default-terminal-coding-system}
(@pxref{Specify Coding}) and characters which can't be encoded in it are
displayed as @samp{?} by default. Windowing terminals may not have the
necessary font available to display a given character and display a
hollow box instead. You can change the default behavior.
Your terminal may be unable to display some non-@sc{ascii}
characters. Most non-windowing terminals can only use a single
character set (use the variable @code{default-terminal-coding-system}
(@pxref{Specify Coding}) to tell Emacs which one); characters which
can't be encoded in that coding system are displayed as @samp{?} by
default.
If you use Latin-1 characters but your terminal can't display Latin-1,
you can arrange to display mnemonic @sc{ascii} sequences instead, e.g.@:
@samp{"o} for o-umlaut. Load the library @file{iso-ascii} to do this.
Windowing terminals can display a broader range of characters, but
you may not have fonts installed for all of them; characters that have
no font appear as a hollow box.
If your terminal can display Latin-1, you can display characters from
other European character sets using a mixture of equivalent Latin-1
characters and @sc{ascii} mnemonics. Use the Custom option
@code{latin1-display} to enable this. The mnemonic @sc{ascii} sequences
mostly correspond to those of the prefix input methods.
If you use Latin-1 characters but your terminal can't display
Latin-1, you can arrange to display mnemonic @sc{ascii} sequences
instead, e.g.@: @samp{"o} for o-umlaut. Load the library
@file{iso-ascii} to do this.
If your terminal can display Latin-1, you can display characters
from other European character sets using a mixture of equivalent
Latin-1 characters and @sc{ascii} mnemonics. Use the Custom option
@code{latin1-display} to enable this. The mnemonic @sc{ascii}
sequences mostly correspond to those of the prefix input methods.
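For the first case above, a minimal init-file sketch would simply load
the library by name. The second case is handled through Custom, for
example with @kbd{M-x customize-option @key{RET} latin1-display
@key{RET}}, so it is not shown as code here.
@example
;; Display Latin-1 characters as mnemonic ASCII sequences on a
;; terminal that cannot show Latin-1.
(load "iso-ascii")
@end example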
@node Single-Byte Character Support
@section Single-byte Character Set Support
@ -1172,18 +1182,18 @@ characters:
@findex set-keyboard-coding-system
@vindex keyboard-coding-system
If your keyboard can generate character codes 128 and up, representing
non-ASCII characters, use the command @code{M-x
set-keyboard-coding-system} or the Custom option
@code{keyboard-coding-system} to specify this in the same way as for
multibyte usage (@pxref{Specify Coding}).
non-ASCII characters, you can type those character codes directly.
It is not necessary to do this under a window system which can
distinguish 8-bit characters and Meta keys. If you do this on a normal
terminal, you will probably need to use @kbd{ESC} to type Meta
characters.@footnote{In some cases, such as the Linux console and
@code{xterm}, you can arrange for Meta to be converted to @kbd{ESC} and
still be able type 8-bit characters present directly on the keyboard or
using @kbd{Compose} or @kbd{AltGr} keys.} @xref{User Input}.
On a windowing terminal, you should not need to do anything special to
use these keys; they should simply work. On a text-only terminal, you
should use the command @code{M-x set-keyboard-coding-system} or the
Custom option @code{keyboard-coding-system} to specify which coding
system your keyboard uses (@pxref{Specify Coding}). Enabling this
feature will probably require you to use @kbd{ESC} to type Meta
characters; however, on a Linux console or in @code{xterm}, you can
arrange for Meta to be converted to @kbd{ESC} and still be able to type
8-bit characters present directly on the keyboard or using
@kbd{Compose} or @kbd{AltGr} keys. @xref{User Input}.
@item
You can use an input method for the selected language environment.
@ -1205,7 +1215,7 @@ and in any other context where a key sequence is allowed.
library is loaded, the @key{ALT} modifier key, if you have one, serves
the same purpose as @kbd{C-x 8}; use @key{ALT} together with an accent
character to modify the following letter. In addition, if you have keys
for the Latin-1 ``dead accent characters'', they too are defined to
for the Latin-1 ``dead accent characters,'' they too are defined to
compose with the following character, once @code{iso-transl} is loaded.
Use @kbd{C-x 8 C-h} to list the available translations as mnemonic
command names.
@ -1215,9 +1225,9 @@ command names.
@cindex ISO Accents mode
@findex iso-accents-mode
@cindex Latin-1, Latin-2 and Latin-3 input mode
For Latin-1, Latin-2 and Latin-3, @kbd{M-x iso-accents-mode} installs a
minor mode which provides a facility like the @code{latin-1-prefix}
input method but independent of the Leim package. This mode is
buffer-local. It can be customized for various languages with @kbd{M-x
iso-accents-customize}.
For Latin-1, Latin-2 and Latin-3, @kbd{M-x iso-accents-mode} installs
a minor mode which works much like the @code{latin-1-prefix} input
method, but does not depend on having the input methods installed. This
mode is buffer-local. It can be customized for various languages with
@kbd{M-x iso-accents-customize}.
@end itemize
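As a sketch of the first method in the list above, a text-only
terminal whose keyboard sends Latin-1 codes could be described to
Emacs from the init file like this. The coding system name is only an
example; use the one that matches your keyboard.
@example
;; Tell Emacs which coding system the keyboard sends (example value).
(set-keyboard-coding-system 'iso-latin-1)
@end example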