5.4. Character handling
C is widely used for character and string handling applications. This is odd, in some ways, because the language doesn't really have any built-in string handling features. If you're used to languages that know about string handling, you will almost certainly find C tedious to begin with.
The standard library contains lots of functions to help with string processing but the fact remains that it still feels like hard work. To compare two strings you have to call a function instead of using an equality operator. There is a bright side to this, though. It means that the language isn't burdened by having to support string processing directly, which helps to keep it small and less cluttered. What's more, once you get your string handling programs working in C, they do tend to run very quickly.
Character handling in C is done by declaring arrays (or allocating them
dynamically) and moving characters in and out of them ‘by hand’.
Here is an example of a program which reads text a line at a time from its standard input. If the line consists of the string of characters stop, it stops; otherwise it prints the length of the line.
It uses a technique which is invariably used in C programs; it reads the characters into an array and indicates the end of them with an extra character whose value is explicitly 0 (zero). It uses the library strcmp function to compare two strings.
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define LINELNG 100     /* max. length of input line */

    int main(void){
        char in_line[LINELNG];
        char *cp;
        int c;

        cp = in_line;
        while((c = getc(stdin)) != EOF){
            if(cp == &in_line[LINELNG-1] || c == '\n'){
                /*
                 * Insert end-of-line marker
                 */
                *cp = 0;
                if(strcmp(in_line, "stop") == 0 )
                    exit(EXIT_SUCCESS);
                else
                    printf("line was %d characters long\n",
                        (int)(cp-in_line));
                cp = in_line;
            }
            else
                *cp++ = c;
        }
        exit(EXIT_SUCCESS);
    }
Example 5.6
Once more, the example illustrates some interesting methods used widely in C programs. By far the most important is the way that strings are represented and manipulated.
Here is a possible implementation of a simplified strcmp, called str_eq here, which compares two strings for equality and returns zero if they are the same. The library function actually does a bit more than that, but the added complication can be ignored for the moment. Notice the use of const in the argument declarations. This shows that the function will not modify the contents of the strings, but just inspects them. The definitions of the standard library functions make extensive use of this technique.
    /*
     * Compare two strings for equality.
     * Return 0 ('false') if they are, 1 if not.
     */
    int str_eq(const char *s1, const char *s2){
        while(*s1 == *s2){
            /*
             * At end of string return 0.
             */
            if(*s1 == 0)
                return(0);
            s1++; s2++;
        }
        /* Difference detected! */
        return(1);
    }
Example 5.7
5.4.1. Strings
Every C programmer ‘knows’ what a string is. It is an array of char variables, with the last character in the string followed by a null. ‘But I thought a string was something in double quote marks’, you cry. You are right, too. In C, a sequence like this
"a string"
is really a character array. It's the only example in C where you can declare something at the point of its use.
Be warned: in Old C, strings were stored just like any other character array, and were modifiable. Now, the Standard states that although they are arrays of char (not const char), attempting to modify them results in undefined behaviour.
Whenever a string in quotes is seen, it has two effects: it provides a declaration and a substitute for a name. It makes a hidden declaration of a char array, whose contents are initialized to the character values in the string, followed by a character whose integer value is zero. The array has no name. So, apart from the name being present, we have a situation like this:
    char secret[9];
    secret[0] = 'a';
    secret[1] = ' ';
    secret[2] = 's';
    secret[3] = 't';
    secret[4] = 'r';
    secret[5] = 'i';
    secret[6] = 'n';
    secret[7] = 'g';
    secret[8] = 0;
an array of characters, terminated by zero, with character values in it. But when it's declared using the string notation, it hasn't got a name. How can we use it?
Whenever C sees a quoted string, the presence of the string itself serves as the name of the hidden array—not only is the string an implicit sort of declaration, it is as if an array name had been given. Now, we all remember that the name of an array is equivalent to giving the address of its first element, so what is the type of this?
"a string"
It's a pointer of course: a pointer to the first element of the hidden unnamed array, which is of type char, so the pointer is of type ‘pointer to char’. The situation is shown in Figure 5.7.
For proof of that, look at the following program:
    #include <stdio.h>
    #include <stdlib.h>

    int main(void){
        int i;
        char *cp;

        cp = "a string";
        while(*cp != 0){
            putchar(*cp);
            cp++;
        }
        putchar('\n');

        for(i = 0; i < 8; i++)
            putchar("a string"[i]);
        putchar('\n');
        exit(EXIT_SUCCESS);
    }
Example 5.8
The first loop sets a pointer to the start of the array, then walks along until it finds the zero at the end. The second one ‘knows’ about the length of the string and is less useful as a result. Notice how the first one is independent of the length; that is a most important point to remember. It's the way that strings are handled in C almost without exception; it's certainly the format that all of the library string manipulation functions expect. The zero at the end allows string processing routines to find out that they have reached the end of the string: look back now to the example function str_eq. The function takes two character pointers as arguments (so a string would be acceptable as one or both arguments). It compares them for equality by checking that the strings are character-for-character the same. If they are the same at any point, then it checks to make sure it hasn't reached the end of them both with if(*s1 == 0): if it has, then it returns 0 to show that they were equal. The test could just as easily have been on *s2; it wouldn't have made any difference. Otherwise a difference has been detected, so it returns 1 to indicate failure.
In Example 5.6, strcmp is called with two arguments which look quite different. One is a character array, the other is a string. In fact they're the same thing: a character array terminated by zero (the program is careful to put a zero in the first ‘empty’ element of in_line), and a string in quotes, which is a character array terminated by a zero. Their use as arguments to strcmp results in character pointers being passed, for the reasons explained to the point of tedium above.
5.4.2. Pointers and increment operators
We said that we'd eventually revisit expressions like
(*p)++;
and now it's time. Pointers are used so often to walk down arrays that it just seems natural to use the ++ and -- operators on them. Here we write zeros into an array:
The pointer ip is set to the start of the array. While it remains inside the array, the place that it points to has zero written into it, then the increment takes effect and the pointer is stepped one element along the array. The postfix form of ++ is particularly useful here.
This is very common stuff indeed. In most programs you'll find pointers and increment operators used together like that, not just once or twice, but on almost every line (or so it seems while you find them difficult). What is happening, and what combinations can we get? Well, the * means indirection, and ++ or -- means increment or decrement, in either prefix or postfix form. The combinations can be pre- or post-increment of either the pointer or the thing it points to, depending on where the brackets are put. Table 5.1 gives a list.
Expression | Meaning |
---|---|
++(*p) | pre-increment thing pointed to |
(*p)++ | post-increment thing pointed to |
*(p++) | access via pointer, post-increment pointer |
*(++p) | access via pointer which has already been incremented |
Read it carefully; make sure that you understand the combinations.
The expressions in the list above can usually be understood after a bit of head-scratching. Now, given that the precedence of *, ++ and -- is the same in all three cases and that they associate right to left, can you work out what happens if the brackets are removed? Nasty, isn't it? Table 5.2 shows that there's only one case where the brackets have to be there.
With parentheses | Without, if possible |
---|---|
++(*p) | ++*p |
(*p)++ | (*p)++ |
*(p++) | *p++ |
*(++p) | *++p |
The usual reaction to that horrible sight is to decide that you don't care that the parentheses can be removed; you will always use them in your code. That's all very well but the problem is that most C programmers have learnt the important precedence rules (or at least learnt the table above) and they very rarely put the parentheses in. Like them, we don't—so if you want to be able to read the rest of the examples, you had better learn to read those expressions with or without parentheses. It'll be worth the effort in the end.
5.4.3. Untyped pointers
In certain cases it's essential to be able to convert pointers from one type to another. This is always done with the aid of casts, in expressions like the one below:
(type *) expression
The expression is converted into ‘pointer to type’, regardless of the expression's previous type. This is only supposed to be done if you're sure that you know what you're trying to do. It is not a good idea to do much of it until you have got plenty of experience. Furthermore, do not assume that the cast simply suppresses diagnostics of the ‘mismatched pointer’ sort from your compiler. On several architectures it is necessary to calculate new values when pointer types are changed.
There are also some occasions when you will want to use a ‘generic’ pointer. The most common example is the malloc library function, which is used to allocate storage for objects that haven't been declared. It is used by telling it how much storage is wanted: enough for a float, or an array of int, or whatever. It passes back a pointer to enough storage, which it allocates in its own mysterious way from a pool of free storage (the way that it does this is its own business). That pointer is then cast into the right type; for example, if a float needs 4 bytes of free store, this is the flavour of what you would write:
    float *fp;

    fp = (float *)malloc(4);
Malloc finds 4 bytes of store, then the address of that piece of storage is cast into pointer-to-float and assigned to the pointer.
What type should malloc be declared to have? The type must be able to represent every known value of every type of pointer; there is no guarantee that any of the basic types in C can hold such a value.
The solution is to use the void * type that we've already talked about. Here is the last example with a declaration of malloc:
    void *malloc();
    float *fp;

    fp = (float *)malloc(4);
The rules for assignment of pointers show that there is no need to use a cast on the return value from malloc, but it is often done in practice.
Obviously there needs to be a way to find out what value the argument to malloc should be: it will be different on different machines, so you can't just use a constant like 4. That is what the sizeof operator is for.