2.2. The alphabet of C
This is an interesting area; alphabets are important. All the same, this is the one part of this chapter that you can read superficially first time round without missing too much. Read it to make sure that you've seen the contents once, and make a mental note to come back to it later on.
2.2.1. Basic Alphabet
Few computer languages bother to define their alphabet rigorously. There's usually an assumption that the English alphabet augmented by a sprinkling of more or less arbitrary punctuation symbols will be available in every environment that is trying to support the language. The assumption is not always borne out by experience. Older languages suffer less from this sort of problem, but try sending C programs by Telex or restrictive e-mail links and you'll understand the difficulty.
The Standard talks about two different character sets: the one that programs are written in and the one that programs execute with. This is basically to allow for different systems for compiling and execution, which might use different ways of encoding their characters. It doesn't actually matter a lot except when you are using character constants in the preprocessor, where they may not have the same value as they do at execution time. This behaviour is implementation-defined, so it must be documented. Don't worry about it yet.
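If you would like to see the two character sets side by side, the following sketch asks both the preprocessor and the running program about the value of 'a'. The names PP_A_IS_97, runtime_a_is_97 and require_matching_char_sets are invented for this illustration, and the value 97 simply assumes an ASCII-like encoding. On most systems the two answers agree, but the Standard does not promise that.

```c
#include <assert.h>

/* The preprocessor's view: 'a' is evaluated while the program is translated. */
#if 'a' == 97
#define PP_A_IS_97 1
#else
#define PP_A_IS_97 0
#endif

/* The program's view: 'a' has its value in the execution character set. */
int runtime_a_is_97(void)
{
    return 'a' == 97;
}

/* A program that relies on the two views agreeing can check at start-up. */
void require_matching_char_sets(void)
{
    assert(PP_A_IS_97 == runtime_a_is_97());
}
```

A cross-compiler whose host and target use different encodings is where the two results might differ.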
The Standard requires that an alphabet of 96 symbols is available for C as follows:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
! " # % & ' ( ) * + , - . /
: ; < = > ? [ \ ] ^ _ { | } ~
space, horizontal and vertical tab
form feed, newline
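As a quick check on the count of 96, the alphabet listed above can be written out as a C string and measured. This is only a sketch; the names basic_alphabet, basic_alphabet_size and in_basic_alphabet are made up for the example.

```c
#include <assert.h>
#include <string.h>

/* The alphabet listed above; adjacent string literals are joined together. */
static const char basic_alphabet[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "!\"#%&'()*+,-./"
    ":;<=>?[\\]^_{|}~"
    " \t\v\f\n";

/* Number of characters in the list, with a sanity check on the total. */
size_t basic_alphabet_size(void)
{
    size_t n = strlen(basic_alphabet);
    assert(n == 96);
    return n;
}

/* Non-zero if c is one of the characters listed above. */
int in_basic_alphabet(int c)
{
    return c != '\0' && strchr(basic_alphabet, c) != NULL;
}
```

Characters such as $ and @ are reported as outside the alphabet, which is why they can only appear in comments, strings and character constants.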
It turns out that most of the commonly used computer alphabets contain all the symbols that are needed for C, with a few notorious exceptions. The characters of the C alphabet shown below are missing from the International Organization for Standardization's ISO 646 standard 7-bit character set, which is a subset of all the widely used computer alphabets.
# [ \ ] ^ { | } ~
To cater for systems that can't provide the full 96 characters needed by C, the Standard specifies a method of using the ISO 646 characters to represent the missing few; the technique is the use of trigraphs.
2.2.2. Trigraphs
Trigraphs are sequences of three ISO 646 characters that get treated as if they were one character in the C alphabet; all of the trigraphs start with two question marks (??), which helps to indicate that ‘something funny’ is going on. Table 2.1 below shows the trigraphs defined in the Standard.
Table 2.1. Trigraphs
C character    Trigraph
#              ??=
[              ??(
\              ??/
]              ??)
^              ??'
{              ??<
|              ??!
}              ??>
~              ??-
As an example, let's assume that your terminal doesn't have the # symbol. To write the preprocessor line
#define MAX 32767
isn't possible; you must use trigraph notation instead:
??=define MAX 32767
Of course trigraphs will work even if you do have a # symbol; they are there to help in difficult circumstances more than to be used for routine programming.
The ? ‘binds to the right’, so in any sequence of repeated ?s, only the two at the right could possibly be part of a trigraph, depending on what comes next; this disposes of any ambiguity.
It would be a mistake to assume that programs written to be highly portable would use trigraphs ‘in case they had to be moved to systems that only support ISO 646’. If your system can handle all 96 characters in the C alphabet, then that is what you should be using. Trigraphs will only be seen in restricted environments, and it is extremely simple to write a character-by-character translator between the two representations. However, all compilers that conform to the Standard will recognize trigraphs when they are seen.
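To give an idea of how simple such a translator is, here is a minimal sketch of the trigraph-to-character direction. The function name replace_trigraphs and its two lookup tables are our own invention, not anything the Standard provides. The tables hold only the third character of each trigraph, so that the example contains no trigraphs itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Third character of each trigraph, and the character it stands for.
   The two leading question marks are tested for separately. */
static const char trigraph_tail[] = "=(/)'<!>-";
static const char trigraph_repl[] = "#[\\]^{|}~";

/* Copy src to dst, replacing each trigraph by the single character it
   represents. dst needs room for strlen(src) + 1 chars. Returns dst. */
char *replace_trigraphs(char *dst, const char *src)
{
    char *out = dst;

    assert(dst != NULL && src != NULL);
    while (*src != '\0') {
        const char *hit = NULL;

        /* Look for two question marks followed by a known third character. */
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0')
            hit = strchr(trigraph_tail, src[2]);

        if (hit != NULL) {
            *out++ = trigraph_repl[hit - trigraph_tail];
            src += 3;
        } else {
            *out++ = *src++;    /* ordinary character: copy it unchanged */
        }
    }
    *out = '\0';
    return dst;
}
```

Because a lone leading ? is simply copied, a run of three question marks followed by = comes out as ?#, which matches the ‘binds to the right’ rule. If you try it, split the question marks across adjacent string literals, as in "?" "?=define MAX 32767", so that your own compiler does not replace the trigraph before the function ever sees it.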
Trigraph substitution is the very first operation that a compiler performs on its input text.
2.2.3. Multibyte Characters
Support for multibyte characters is new in the Standard. Why?
A very large proportion of day-to-day computing involves data that represents text of one form or another. Until recently, the rather chauvinist computing industry has assumed that it is adequate to provide support for about a hundred or so printable characters (hence the 96 character alphabet of C), based on the requirements of the English language—not surprising, since the bulk of the development of commercial computing has been in the US market. This alphabet (technically called the repertoire) fits conveniently into 7 or 8 bits of storage, which is why the US-ASCII character set standard and the architecture of mini and microcomputers both give very heavy emphasis to the use of 8-bit bytes as the basic unit of storage.
C also has a byte-oriented approach to data storage. The smallest individual item of storage that can be directly used in C is the byte, which is defined to be at least 8 bits in size. Older systems or architectures that are not designed explicitly to support this may incur a performance penalty when running C as a result, although there are not many that find this a big problem.
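The size of a byte on your own system is available as CHAR_BIT in <limits.h>. A small sketch, with the invented name bits_per_byte, shows how to get at it:

```c
#include <assert.h>
#include <limits.h>

/* Number of bits in a byte (that is, in a char) on this implementation. */
int bits_per_byte(void)
{
    assert(CHAR_BIT >= 8);   /* the minimum that the Standard allows */
    return CHAR_BIT;
}
```

On most current hardware the result is exactly 8, but a portable program should assume only that it is 8 or more.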
Perhaps there was a time when the English alphabet was acceptable for data processing applications worldwide—when computers were used in environments where the users could be expected to adapt—but those days are gone. Nowadays it is absolutely essential to provide for the storage and processing of textual material in the native alphabet of whoever wants to use the system. Most of the US and Western European language requirements can be squeezed together into a character set that still fits in 8 bits per character, but Asian and other languages simply cannot.
There are two general ways of extending character sets. One is to use a fixed number of bytes (often two) for every character. This is what the wide character support in C is designed to do. The other method is to use a shift-in shift-out coding scheme; this is popular over 8-bit communication links. Imagine a stream of characters that looks like:
a b c <SI> a b g <SO> x y
where <SI> and <SO> mean ‘switch to Greek’ and ‘switch back to English’ respectively. A display device that agreed to use that method might well then display a, b, c, alpha, beta, gamma, x and y. This is roughly the scheme used by shift-state Japanese encodings such as ISO-2022-JP, except that once the shift-in has been seen, pairs of bytes together are used as the code for a single Japanese character. (Shift-JIS, despite its name, keeps no shift state; it marks each two-byte character with a distinctive first byte.) Alternative schemes exist which use more than one shift-in character, but they are less common.
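The shift-in shift-out idea can be sketched in a few lines of C. Everything specific here is an assumption made for the example: the marker values SHIFT_TO_GREEK and SHIFT_TO_ENGLISH, the GREEK_BASE offset used to tell the two alphabets apart, and the function name decode_shifted. Real encodings choose their own values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker bytes and code offset, chosen for this sketch only. */
#define SHIFT_TO_GREEK   0x01
#define SHIFT_TO_ENGLISH 0x02
#define GREEK_BASE       0x100

/* Decode n bytes from in, storing one code per character in out, which
   needs room for n codes. Returns the number of characters produced;
   the shift markers change the state but produce no character. */
size_t decode_shifted(const unsigned char *in, size_t n, int *out)
{
    size_t i, count = 0;
    int greek = 0;                      /* initial shift state: English */

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (in[i] == SHIFT_TO_GREEK)
            greek = 1;
        else if (in[i] == SHIFT_TO_ENGLISH)
            greek = 0;
        else
            out[count++] = greek ? GREEK_BASE + in[i] : in[i];
    }
    return count;
}
```

Fed the ten bytes of the example stream, this produces eight codes. The byte for ‘a’ yields a different code depending on the shift state in force when it is read.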
The Standard now allows explicitly for the use of extended character sets. Only the 96 characters defined earlier are used for the C part of a program, but in comments, strings, character constants and header names (these are really data, not part of the program as such) extended characters are permitted if your environment supports them. The Standard lays down a number of pretty obvious rules about how you are allowed to use them which we will not repeat here. The most significant one is that a byte whose value is zero is interpreted as a null character irrespective of any shift state. That is important, because C uses a null character to indicate the end of strings and many library functions rely on it. An additional requirement is that multibyte sequences must start and end in the initial shift state.
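The library function mblen from <stdlib.h> understands these rules, and a hedged sketch of counting the characters (as opposed to the bytes) in a multibyte string might look like the following. The name count_multibyte_chars is ours. The answer depends on the current locale; in the default "C" locale each character in the basic alphabet occupies one byte.

```c
#include <assert.h>
#include <stdlib.h>

/* Count the characters in the multibyte string s.
   Returns -1 if an invalid multibyte sequence is found. */
long count_multibyte_chars(const char *s)
{
    long count = 0;
    int len;

    assert(s != NULL);
    (void)mblen(NULL, 0);       /* put mblen into the initial shift state */

    /* mblen returns 0 at the null character, which ends the string
       whatever the shift state, and -1 for an invalid sequence. */
    while ((len = mblen(s, MB_CUR_MAX)) > 0) {
        s += len;
        count++;
    }
    return len < 0 ? -1 : count;
}
```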
The char type is specified by the Standard as suitable to hold the value of all of the characters in the ‘execution character set’, which will be defined in your system's documentation. This means that (in the example above) it could hold the value of ‘a’ or ‘b’ or even the ‘switch to Greek’ character itself. Because of the shift-in shift-out mechanism, there would be no difference between the value stored in a char that was intended to represent ‘a’ and one intended to represent the Greek ‘alpha’ character. To tell them apart would mean using a different representation, probably needing more than 8 bits, which on many systems would be too big for a char. That is why the Standard introduces the wchar_t type. To use this, you must include the <stddef.h> header, because wchar_t is simply defined as an alternative name for one of C's other types. We discuss it further in Section 2.8.
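As a small taste of wchar_t before Section 2.8, the sketch below copies a string of ordinary characters into an array of wide ones. The function name widen_basic is invented for the example, and the sketch assumes that the string contains only characters from the 96-character alphabet, which have the same value as a char and as a wchar_t.

```c
#include <assert.h>
#include <stddef.h>

/* Copy at most max - 1 characters of src into dst as wide characters
   and terminate dst with a wide null. Returns the number copied.
   Only meant for strings made up of the basic C alphabet. */
size_t widen_basic(wchar_t *dst, const char *src, size_t max)
{
    size_t n = 0;

    assert(dst != NULL && src != NULL && max > 0);
    while (src[n] != '\0' && n + 1 < max) {
        dst[n] = (wchar_t)(unsigned char)src[n];
        n++;
    }
    dst[n] = L'\0';
    return n;
}
```

Each element of dst is a wchar_t and may take up more storage than the char it was copied from.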
Summary
- C requires at least 96 characters in the source program character set.
- Not all character sets in common use can stretch to 96 characters; trigraphs allow the basic ISO 646 character set to be used (at a pinch).
- Multibyte character support has been added by the Standard, with support for
  - Shift-encoded multibyte characters, which can be squeezed into ‘ordinary’ character arrays, so still have char type.
  - Wide characters, each of which may use more storage than a regular character. These usually have a different type from char.