5.4. Character handling

C is widely used for character and string handling applications. This is odd, in some ways, because the language doesn't really have any built-in string handling features. If you're used to languages that know about string handling, you will almost certainly find C tedious to begin with.

The standard library contains lots of functions to help with string processing but the fact remains that it still feels like hard work. To compare two strings you have to call a function instead of using an equality operator. There is a bright side to this, though. It means that the language isn't burdened by having to support string processing directly, which helps to keep it small and less cluttered. What's more, once you get your string handling programs working in C, they do tend to run very quickly.

Character handling in C is done by declaring arrays (or allocating them dynamically) and moving characters in and out of them ‘by hand’. Here is an example of a program which reads text a line at a time from its standard input. If the line consists of the string of characters stop, it stops; otherwise it prints the length of the line. It uses a technique found in almost every C program: it reads the characters into an array and marks the end of them with an extra character whose value is explicitly 0 (zero). It uses the library strcmp function to compare two strings.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINELNG 100     /* max. length of input line */

int main(void){
      char in_line[LINELNG];
      char *cp;
      int c;

      cp = in_line;
      while((c = getc(stdin)) != EOF){
              if(cp == &in_line[LINELNG-1] || c == '\n'){
                      /*
                       * Insert end-of-line marker
                       */
                      *cp = 0;
                      if(strcmp(in_line, "stop") == 0 )
                              exit(EXIT_SUCCESS);
                      else
                              printf("line was %d characters long\n",
                                      (int)(cp-in_line));
                      cp = in_line;
              }
              else
                      *cp++ = c;
      }
      exit(EXIT_SUCCESS);
}
Example 5.6

Once more, the example illustrates some interesting methods used widely in C programs. By far the most important is the way that strings are represented and manipulated.

Here is a possible implementation of strcmp, which compares two strings for equality and returns zero if they are the same. The library function actually does a bit more than that, but the added complication can be ignored for the moment. Notice the use of const in the argument declarations. This shows that the function will not modify the contents of the strings, but just inspects them. The definitions of the standard library functions make extensive use of this technique.

/*
* Compare two strings for equality.
* Return 0 ('false') if they are equal, 1 if they differ.
*/
int
str_eq(const char *s1, const char *s2){
      while(*s1 == *s2){
              /*
               * At end of string return 0.
               */
              if(*s1 == 0)
                      return(0);
              s1++; s2++;
      }
      /* Difference detected! */
      return(1);
}
Example 5.7
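
To make the ‘bit more’ that the library function does concrete, here is a small sketch that sets str_eq beside strcmp. The helper cmp_sign is a hypothetical name invented for this illustration; it relies only on the fact that strcmp returns a negative, zero or positive value according to how the two strings are ordered.

```c
#include <string.h>

/* str_eq as in Example 5.7: returns 0 when equal, 1 otherwise */
int
str_eq(const char *s1, const char *s2){
      while(*s1 == *s2){
              if(*s1 == 0)
                      return(0);
              s1++; s2++;
      }
      return(1);
}

/*
 * Hypothetical helper: reduce the result of strcmp to -1, 0 or 1.
 * Only the sign of strcmp's result is meaningful, not its size.
 */
int
cmp_sign(const char *s1, const char *s2){
      int r = strcmp(s1, s2);

      return((r > 0) - (r < 0));
}
```

Both functions agree that "stop" and "stop" are the same by returning 0. For "stop" and "stops", str_eq can only report a difference, while cmp_sign also reports that the first string sorts before the second.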

5.4.1. Strings

Every C programmer ‘knows’ what a string is. It is an array of char variables, with the last character in the string followed by a null. ‘But I thought a string was something in double quote marks’, you cry. You are right, too. In C, a sequence like this

"a string"

is really a character array. It's the only example in C where you can declare something at the point of its use.

Be warned: in Old C, strings were stored just like any other character array, and were modifiable. Now, the Standard states that although they are arrays of char (not const char), attempting to modify them results in undefined behaviour.

Whenever a string in quotes is seen, it has two effects: it provides a declaration and a substitute for a name. It makes a hidden declaration of a char array, whose contents are initialized to the character values in the string, followed by a character whose integer value is zero. The array has no name. So, apart from the name being present, we have a situation like this:

char secret[9];
secret[0] = 'a';
secret[1] = ' ';
secret[2] = 's';
secret[3] = 't';
secret[4] = 'r';
secret[5] = 'i';
secret[6] = 'n';
secret[7] = 'g';
secret[8] = 0;

an array of characters, terminated by zero, with character values in it. But when it's declared using the string notation, it hasn't got a name. How can we use it?

Whenever C sees a quoted string, the presence of the string itself serves as the name of the hidden array—not only is the string an implicit sort of declaration, it is as if an array name had been given. Now, we all remember that the name of an array is equivalent to giving the address of its first element, so what is the type of this?

"a string"

It's a pointer of course: a pointer to the first element of the hidden unnamed array, which is of type char, so the pointer is of type ‘pointer to char’. The situation is shown in Figure 5.7.

Diagram showing an unnamed array of 'const char' values, where the last item has the value '0', and showing that a 'const char *' value that points to the first of them can be used as a string.
Figure 5.7. Effect of using a string

For proof of that, look at the following program:

#include <stdio.h>
#include <stdlib.h>
int main(void){
      int i;
      char *cp;

      cp = "a string";
      while(*cp != 0){
              putchar(*cp);
              cp++;
      }
      putchar('\n');

      for(i = 0; i < 8; i++)
              putchar("a string"[i]);
      putchar('\n');
      exit(EXIT_SUCCESS);
}
Example 5.8

The first loop sets a pointer to the start of the array, then walks along until it finds the zero at the end. The second one ‘knows’ about the length of the string and is less useful as a result. Notice how the first one is independent of the length; that is a most important point to remember. It's the way that strings are handled in C almost without exception; it's certainly the format that all of the library string manipulation functions expect. The zero at the end allows string processing routines to find out that they have reached the end of the string.

Look back now to the example function str_eq. The function takes two character pointers as arguments (so a string would be acceptable as one or both arguments). It compares them for equality by checking that the strings are character-for-character the same. Each time the current characters match, it checks whether it has reached the end of both strings with if(*s1 == 0): if it has, then it returns 0 to show that they were equal. The test could just as easily have been on *s2; it wouldn't have made any difference, because the two characters are known to be equal at that point. If the characters differ, the loop ends and the function returns 1 to indicate failure.

In Example 5.6, strcmp is called with two arguments which look quite different. One is a character array, the other is a string. In fact they're the same thing: a character array terminated by zero (the program is careful to put a zero in the first ‘empty’ element of in_line), and a string in quotes, which is also a character array terminated by a zero. Their use as arguments to strcmp results in character pointers being passed, for the reasons explained to the point of tedium above.
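
The length-independent loop can be packaged as a function. What follows is only a sketch: str_len is a hypothetical name chosen for this illustration, and it does by hand what the library function strlen does, counting characters until it meets the terminating zero.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count the characters before the terminating zero.
 * It never needs to be told how long the array is.
 */
size_t
str_len(const char *s){
      const char *cp = s;

      while(*cp != 0)
              cp++;
      return((size_t)(cp - s));
}
```

Called as str_len("a string") it returns 8, whereas sizeof "a string" is 9, because the hidden array also holds the zero at the end.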

5.4.2. Pointers and increment operators

We said that we'd eventually revisit expressions like

(*p)++;

and now it's time. Pointers are used so often to walk down arrays that it just seems natural to use the ++ and -- operators on them. Here we write zeros into an array:

#define ARLEN 10

int ar[ARLEN], *ip;

ip = ar;
while(ip < &ar[ARLEN])
      *(ip++) = 0;
Example 5.9

The pointer ip is set to the start of the array. While it remains inside the array, the place that it points to has zero written into it, then the increment takes effect and the pointer is stepped one element along the array. The postfix form of ++ is particularly useful here.

This is very common stuff indeed. In most programs you'll find pointers and increment operators used together like that, not just once or twice, but on almost every line (or so it seems while you find them difficult). What is happening, and what combinations can we get? Well, the * means indirection, and ++ or -- mean increment or decrement, in either prefix or postfix form. The combinations can be pre- or post-increment of either the pointer or the thing it points to, depending on where the brackets are put. Table 5.1 gives a list.

Expression    Meaning
++(*p)        pre-increment thing pointed to
(*p)++        post-increment thing pointed to
*(p++)        access via pointer, post-increment pointer
*(++p)        access via pointer which has already been incremented
Table 5.1. Pointer notation

Read it carefully; make sure that you understand the combinations.

The expressions in the list above can usually be understood after a bit of head-scratching. Now, given that the precedence of *, ++ and -- is the same in all three cases and that they associate right to left, can you work out what happens if the brackets are removed? Nasty, isn't it? Table 5.2 shows that there's only one case where the brackets have to be there.

With parentheses    Without, if possible
++(*p)              ++*p
(*p)++              (*p)++
*(p++)              *p++
*(++p)              *++p
Table 5.2. More pointer notation

The usual reaction to that horrible sight is to decide that you don't care that the parentheses can be removed; you will always use them in your code. That's all very well but the problem is that most C programmers have learnt the important precedence rules (or at least learnt the table above) and they very rarely put the parentheses in. Like them, we don't—so if you want to be able to read the rest of the examples, you had better learn to read those expressions with or without parentheses. It'll be worth the effort in the end.
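
As a way of checking your reading of Table 5.2, here is a sketch that applies the four unparenthesized forms in turn and records what each one yields. The function names show_forms and str_copy are hypothetical, made up for this illustration; str_copy shows the kind of loop in which *p++ most often turns up.

```c
/*
 * Hypothetical demonstration of the four forms in Table 5.2.
 * ar must point to at least three ints; the value of each
 * expression is stored in out[0] to out[3].
 */
void
show_forms(int *ar, int *out){
      int *p = ar;

      out[0] = ++*p;      /* increments ar[0]; value is the new ar[0] */
      out[1] = (*p)++;    /* value is ar[0] before this second increment */
      out[2] = *p++;      /* value is ar[0]; p then moves on to ar[1] */
      out[3] = *++p;      /* p moves on to ar[2] first; value is ar[2] */
}

/*
 * Hypothetical string copy built on *p++; dst must have room for
 * every character of src plus the terminating zero.
 */
void
str_copy(char *dst, const char *src){
      while((*dst++ = *src++) != 0)
              ;       /* all the work is done in the test */
}
```

Starting from an array holding 10, 20 and 30, show_forms stores 11, 11, 12 and 30, and leaves the array holding 12, 20 and 30.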

5.4.3. Untyped pointers

In certain cases it's essential to be able to convert pointers from one type to another. This is always done with the aid of casts, in expressions like the one below:

(type *) expression

The expression is converted into ‘pointer to type’, regardless of the expression's previous type. This is only supposed to be done if you're sure that you know what you're trying to do. It is not a good idea to do much of it until you have got plenty of experience. Furthermore, do not assume that the cast simply suppresses diagnostics of the ‘mismatched pointer’ sort from your compiler. On several architectures it is necessary to calculate new values when pointer types are changed.
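
One conversion that is generally safe is to a pointer to unsigned char, which allows the individual bytes of an object to be inspected. The sketch below assumes nothing beyond that; zero_bytes is a hypothetical name chosen for this illustration, and it uses the sizeof operator to find out how many bytes an int occupies.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count how many of the bytes that make up
 * the int pointed to by ip have the value zero.
 */
size_t
zero_bytes(const int *ip){
      /* the cast changes the pointer's type, not the object it points to */
      const unsigned char *bp = (const unsigned char *)ip;
      size_t left = sizeof *ip;
      size_t zeros = 0;

      while(left-- > 0)
              if(*bp++ == 0)
                      zeros++;
      return(zeros);
}
```

Going the other way, from a pointer to char to a pointer to int, is the sort of cast that needs the care described above, because the address may not be suitably aligned for an int.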

There are also some occasions when you will want to use a ‘generic’ pointer. The most common example is the malloc library function, which is used to allocate storage for objects that haven't been declared. It is used by telling it how much storage is wanted—enough for a float, or an array of int, or whatever. It passes back a pointer to enough storage, which it allocates in its own mysterious way from a pool of free storage (the way that it does this is its own business). That pointer is then cast into the right type—for example if a float needs 4 bytes of free store, this is the flavour of what you would write:

float *fp;

fp = (float *)malloc(4);

Malloc finds 4 bytes of store, then the address of that piece of storage is cast into pointer-to-float and assigned to the pointer.

What type should malloc be declared to have? The type must be able to represent every known value of every type of pointer; there is no guarantee that any of the basic types in C can hold such a value.

The solution is to use the void * type that we've already talked about. Here is the last example with a declaration of malloc:

void *malloc();
float *fp;

fp = (float *)malloc(4);

The rules for assignment of pointers show that there is no need to use a cast on the return value from malloc, but it is often done in practice.

Obviously there needs to be a way to find out what value the argument to malloc should be: it will be different on different machines, so you can't just use a constant like 4. That is what the sizeof operator is for.
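
A minimal sketch of how that usually looks in practice is given here. The function name make_floats is hypothetical; the sketch assumes only the standard declaration of malloc in <stdlib.h> and the fact that malloc returns a null pointer when it cannot find the storage.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: allocate storage for n floats, set each one
 * to zero, and return a pointer to the first. Returns a null
 * pointer if the storage cannot be allocated.
 */
float *
make_floats(size_t n){
      float *fp, *cp;

      fp = malloc(n * sizeof *fp);    /* no cast needed from void * */
      if(fp == NULL)
              return(NULL);
      for(cp = fp; cp < fp + n; cp++)
              *cp = 0.0f;
      return(fp);
}
```

Writing sizeof *fp rather than a constant means that the request stays correct on every machine, and even if the type of fp is changed later. The caller is responsible for passing the pointer to free once the storage is no longer needed.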