Sequential File etc. [Part 2]

Now we shall discuss how to use "structure" in files. This is the preferred way of storing data, and this method may be used for random access files as well. Unfortunately, the data thus stored are not human readable, because the bytes of the structure are written to the file exactly as they are held in memory.

We first discuss some library subroutines in <stdio.h>.

(1)  int fseek(FILE *ptr, long offset, int origin)

     In UNIX, a file is regarded as a sequence of bytes (numbered from 0, not 1,
     i.e. byte 0,1,2,3,.... and not 1,2,3, ...).  A file is called a "stream"
     in UNIX, because of this.  

     This subroutine will move the current file position to "offset" bytes
     from "origin".  "offset" is a long integer, and may be negative.

     "origin" may be

          SEEK_SET      origin is at 0 (byte) position (beginning of file).
          SEEK_CUR      origin is the current file position.
          SEEK_END      origin is the end of file (used for append).

     Notice that in LINUX, SEEK_SET is actually 0, SEEK_CUR is 1, and SEEK_END is 2.  
     These 3 constants "SEEK_SET, ..." are defined in  <stdio.h>; it is safer to
     use the names than the numbers.

     Usually, we measure "offset" from the start, i.e. SEEK_SET.

     "offset" is relative to the "origin".

     On error, it returns non-zero.  On success, it returns 0.

(2)  size_t fread(void *rec, size_t size, size_t no_of_objects, FILE *ptr)

     This subroutine will read in "no_of_objects" objects (usually one
     structure), starting at the current file-position.

     size_t        is an unsigned integer type (e.g. unsigned long in 64-bit LINUX).

     void *rec     is usually the address of a "structure".

     size_t size   is the size of the structure.  This may be found out
                   using  sizeof(structure_name) or sizeof structure_name.

     size_t  no_of_objects  We may read in an array of structures.  But usually,
                   it is 1 (i.e. we read in one structure only).

     It returns the number of objects read.  This number should be equal to
     "no_of_objects".  If it is less, EOF may have been reached, or there is an error.

(3)  size_t fwrite(const void *rec, size_t size, size_t no_of_objects, FILE *ptr)

     This subroutine will write the "structure" (whose address is "rec", and
     whose size is "size", i.e. sizeof structure_name) to a file.

     Notice that an array of structures may be written, but usually we write only
     1 structure (i.e. no_of_objects = 1).

     It returns the number of objects written.  If it is less than
     "no_of_objects", an error has occurred.

(4)  long ftell(FILE *stream)

     This subroutine returns the current file-position, or -1L if an error occurs.

fseek(..), fread(..), fwrite(..) are usually used in random files, but we may use them in sequential files as well.

Example : We rewrite the Qbasic program
OPEN "C:\test\master.txt" FOR OUTPUT AS #1

L20:
    INPUT "Enter name "; name$

    IF name$ <> "****" THEN

         INPUT "Enter age "; age
         INPUT "Enter telephone no. "; tel
         WRITE #1, name$, age, tel
         GOTO L20

    ELSE

         CLOSE #1
         END

    END IF
using this technique.

#include <stdio.h>           /* You may use <c.h> */
#include <stdlib.h>
#include <string.h>

int main()
{    char buffer[80];
     int nrecsize;
     struct {char name[40];
             int age;
             char tel[24];
            } rec;
     FILE *fout;

     if ( (fout = fopen("master2","w") ) == NULL )
          {perror("Unable to open file");
           exit(1);
          }

     nrecsize = sizeof rec;
     printf("Size of record is %d bytes\n", nrecsize);

 L20:
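     printf("Enter name\n");
     if ( fgets(rec.name, sizeof rec.name, stdin) == NULL )
          {strcpy(rec.name,"****");}            /* treat end of input as "****" */
     rec.name[strcspn(rec.name,"\n")] = '\0';   /* fgets(..) keeps the newline */
     if ( strcmp(rec.name,"****") == 0 )   
          {fclose(fout);
           return 0;
          }
     printf("Enter age\n");
     fgets(buffer, sizeof buffer, stdin);       /* gets(..) is unsafe; use fgets(..) */
     sscanf(buffer,"%d",&(rec.age) );
     printf("Enter telephone no\n");
     fgets(rec.tel, sizeof rec.tel, stdin);
     rec.tel[strcspn(rec.tel,"\n")] = '\0';
     fwrite(&rec, nrecsize, 1, fout);
     printf("Current file position is %ld\n",ftell(fout));   /* ftell(..) returns long */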
     printf("%s\n%d\n%s\n", rec.name, rec.age, rec.tel);
     goto L20;

}

Exercise : Write a program to read in the data from the above file.

Ans :
#include <stdio.h>           /* You may use <c.h> */
#include <stdlib.h>

int main()
{    int nrecsize;
     struct {char name[40];
             int age;
             char tel[24];
            } rec;
     FILE *fin;

     if ( (fin = fopen("master2","r") ) == NULL )
          {perror("Unable to open file");
           exit(1);
          }

     nrecsize = sizeof rec;
     printf("Size of record is %d bytes\n", nrecsize);


 L20:
     if ( fread(&rec, nrecsize, 1, fin) != 1)   
          {fclose(fin);
           return 0;
          }
     printf("Current file position is %ld\n",ftell(fin));   /* ftell(..) returns long */
     printf("%s\n%d\n%s\n", rec.name, rec.age, rec.tel);
     goto L20;

}

Writing Custom Subroutines for Input/Output

Suppose we have run the following program in Qbasic

OPEN "C:\test\master.txt" FOR OUTPUT AS #1

L20:
    INPUT "Enter name "; name$

    IF name$ <> "****" THEN

         INPUT "Enter age "; age
         INPUT "Enter telephone no. "; tel
         WRITE #1, name$, age, tel
         GOTO L20

    ELSE

         CLOSE #1
         END

    END IF
We then get a data file "c:\test\master.txt" whose content is shown below. (Note that the Qbasic statement "WRITE #n, ..." writes out the data using comma as field separator, and also puts quotation marks around strings.)
"Wu Siu Yan",50,23003
"tom",60,93939
"mary",30,939393
"jackie chan",49,3003

Later we copied this file onto floppy. We wish to read this file under LINUX using a C program.

Exercise : What is the command in LINUX that enables us to view the file on the floppy?

Ans :
mount -t msdos /dev/fd0 /mnt
cat /mnt/master.txt

Recall that under LINUX, floppy drives are called "/dev/fd0, /dev/fd1, ..." and hard drives are called "/dev/hda, /dev/hdb, ..." (or /dev/hda1, /dev/hda2, ... if we have divided the hard disks into partitions.) Here we mount the floppy, "/dev/fd0".

The option "-t msdos" means the floppy is prepared by Microsoft DOS (Disk Operating System). LINUX can read files from many different operating systems.

"/mnt" is the mount directory under root directory, "/". We may make many mount points under "/", e.g. "cd /; mkdir mnt1; mkdir mnt2; mkdir mnt3; ...", then we may mount various drives on "/mnt1, /mnt2, /mnt3, ... ".

Exercise : Although the operating systems are different (one is MSDOS, the other is LINUX), do you think the recording format for text files is the same for both, i.e. a linear sequence of bytes?

Ans : Not quite. Both store a file as a linear sequence of bytes, but the lines end differently. The difference is minor.

For MSDOS, after one line, there is "CR, LF" (carriage-return, line-feed, or '\r', '\n').

But for LINUX, there is only "LF" (line-feed, '\n'), and no carriage return.

Exercise : Write custom subroutines to read in the data

"Wu Siu Yan",50,23003
"tom",60,93939
"mary",30,939393
"jackie chan",49,3003
And print the result in
Wu Siu Yan
50
23003
tom
60
93939
...
...
(Hint: write a subroutine that reads in one line, using fgets(..), then replace the "CR" (or '\r') by '\0'.
Next, write a subroutine to chop up the line into 3 separate strings.
Finally, write a subroutine to remove the double quotation marks.)

Ans :

#include <stdio.h>           /* You may use <c.h> */
#include <stdlib.h>          /* for exit(..) */
#include <assert.h>
#include <string.h>

/*  This subroutine reads a string from FILE *fin, and puts it in
    char a[..]   (It also returns the string a[]).
  
    It returns NULL if EOF is detected in FILE *fin.

    It is assumed that the file is prepared with MSDOS where
    a line ends with CR-LF (i.e. \r\n).

    Notice that fgetc(..) is used here.  But you may use fgets(..)
*/

char *QB_gets2(char *a, FILE *fin)
{    char *b;
     int c;

     b=a ;
L20:
     c=fgetc(fin);
     if (c == EOF) {return NULL;}
     if (c != '\r')
         {*b=c ;
          b++;
          goto L20;
         }
     else
         {*b='\0';
          c=fgetc(fin);  /*this reads in the newline too */
          return a;
         }
}

/*  The string written by Qbasic using  "write #n, .... " has comma
    as field separator.  This subroutine parses the string and
    returns the number of fields.  The addresses of the various
    strings are put in char **ptr, which is in fact char *ptr[..] - an
    array of character pointers.  (A comma inside a quoted string is
    also treated as a separator.)

    Notice that there is a "#define MFIELD 10" statement, which is
    a directive for the C pre-processor.  It tells the C pre-processor
    to substitute every occurrence of MFIELD by 10.  We will discuss such
    directives later.
*/

#define MFIELD 10

int QB_parse(char *a, char **ptr)
{   int i=1;
    int c;

    ptr[0]=a;

L20:
    c=*a;
    if (c=='\0') {return i;}
    if (c==',' )
        {*a='\0';
         a++;
         ptr[i]=a;
         i++;
         assert( i<MFIELD );
         goto L20;
        }
    else
        {a++;
         goto L20;
        }
}

/*   This subroutine removes (blanks or double quotation mark) at the 
     beginning and end of a string 
*/

char *QB_blanks(char *a)
{   char *newa, *b;
    int nlength, c, i;

    nlength = strlen(a);
    b=a;

L20:
    c=*b;
    if (c==' ' || c=='\"')
       {b++;
        goto L20;
       }

    newa=b;

    for (i=nlength-1; i>=0; i--)
       {c=a[i];
        if (c==' ' || c=='\"')        /* Note : we use backslash to preserve (") */
           {continue;}
        else
           {a[i+1]='\0';
            return newa;
           }
       }
    return newa;     /* string was empty, or all blanks and quotation marks */
}

int main()
{  char buf[80], *ptr[MFIELD];   /* ptr[..] is an array of char pointers to store
                                    the various fields (sub-strings) */
   FILE *fin;
   int mfield,i;
    

   if ( (fin=fopen("/mnt/master.txt","r")) == NULL )
       { perror("Unable to open file");
         exit(1);
       }

L20:
   if ( (QB_gets2(buf,fin) == NULL) ) {goto L40;}
   puts(buf);
   mfield = QB_parse(buf,ptr);
   printf("No of fields in string %d\n",mfield);
   if (mfield > 0)
      {for (i=0; i<mfield; i++)
          {ptr[i] = QB_blanks(ptr[i]);
           puts(ptr[i]);
          }
      }
   printf("\n");
   goto L20;

L40:
   fclose(fin);
   return 0;
}

===========================================
The content of file 

"Wu Siu Yan",50,23003
"tom",60,93939
"mary",30,939393
"jackie chan",49,3003

===========================================
Output of this test program

"Wu Siu Yan",50,23003
No of fields in string 3
Wu Siu Yan
50
23003

"tom",60,93939
No of fields in string 3
tom
60
93939

"mary",30,939393
No of fields in string 3
mary
30
939393

"jackie chan",49,3003
No of fields in string 3
jackie chan
49
3003

Exercise : What library subroutines are there that can convert a string to an integer or a double? (Hint : see also the subroutines in <stdlib.h>, apart from <stdio.h>.)

Ans : We may use sscanf(..), with %d (for integer), or %lf (for double), or we may use

double atof(const char *s)
int atoi(const char *s)
in <stdlib.h>.
atof(..) (Ascii string to float) will convert a string to double.
atoi(..) (Ascii string to integer) will convert a string to integer.
Neither of them reports an error if the string is not a number.

Exercise : (This exercise is about the WWW.) When surfing the net, viewers are very often presented with a form and have to fill in data. The responses are sent back to the WWW-server in one string, which is usually encoded before sending. The exercises below attempt to decode the string.

  1. The data gathered from a form are encoded and sent back as, e.g.
    name=tom%20sawyer&age=40&tel=123456
    where the fields are separated by "&". Write a subroutine to separate the fields at "&". The subroutine should return the number of fields, as well as put their addresses in a character pointer array.

    Ans : It should be similar to QB_parse(..) above, except that the character is now "&" and not comma, ",".

  2. Now that we have 3 separate strings,
    name=tom%20sawyer
    age=40
    tel=123456
    Now write a subroutine to separate the string at "=", e.g. "age=40" is separated into two strings : "age", "40".

    Ans : Also similar to QB_parse(..), except that we have "=", and not ",", nor "&".

  3. Notice that originally, the name is "tom sawyer". But the browser has encoded the "blank" as "%20": a prefix "%" followed by the two hexadecimal digits of its ASCII code (20 in hexadecimal is 32 in decimal). The browser leaves 0-9, a-z and A-Z as they are (a few punctuation characters such as "-", "_" and "." are usually left alone too) and encodes other characters this way. All control characters are encoded this way.

    Find out the ASCII code for line-feed ('\n'), carriage-return, ('\r'), horizontal-tab ('\t'), backspace ('\b'), form-feed ('\f'), and then encode them as the browser does.

    Ans :

          Control Character       ASCII in decimal       Encoded in hexadecimal
            
             line-feed                 10                     %0A
             carriage-return           13                     %0D
             horizontal tab             9                     %09
             backspace                  8                     %08
             form feed                 12                     %0C
             

  4. Write a subroutine

    char *WWW_decode(char *d, char *o)
    which will change the "original encoded string", char *o, into a decoded string char *d. It also returns the string "char *d".

    Ans :

    #include <stdio.h>           /* You may use <c.h> */
    #include <assert.h>
    #include <string.h>

    /* Converts one hexadecimal digit character to its value, 0 to 15 */
    static int convert(int d)
    {    if ( d >= '0' && d<='9' )
              {return d-'0'; }
         else if ( d >='a' && d<='f' )
              {return d-'a'+10; }
         else if ( d>='A' && d<='F' )
              {return d-'A'+10; }
         else
              {assert(0);
               return 0;
              }
    }

    char *WWW_decode(char *d, char *o)
    {    int c, ca, cb;
         char *dd, *oo;

         dd=d;
         oo=o;
    L20:
         c=*oo;
         if (c == '\0')
              {*dd='\0';          /* terminate the decoded string */
               return d;
              }
         if (c != '%')
              {*dd=c; }
         else
              {oo++;
               ca=*oo;
               oo++;
               cb=*oo;
               *dd = 16* convert(ca) + convert(cb);
              }
         oo++;
         dd++;
         goto L20;
    }

    /* This main program is used to test the above subroutines */
    int main()
    {    char a[80], b[80];

    L20:
         printf("Enter encoded string ");
         if (fgets(a, sizeof a, stdin) == NULL) {return 0;}
         a[strcspn(a,"\n")] = '\0';      /* fgets(..) keeps the newline */
         if (strcmp(a,"****")==0) {return 0;}
         WWW_decode(b,a);
         puts(b);
         goto L20;
    }

    Notice that there is a "static" in "static int convert(int d)". This tells the linkage-editor not to allow "convert(..)" to be linked by subroutines outside this file, i.e. "convert(..)" is private. (Note that "gcc" will first compile a program, then use a linkage editor to link up all the subroutines.) As there is no "static" before "char *WWW_decode(..)", "WWW_decode" will appear in the "symbol table", and may be called by subroutines outside this file. We will come back to this "static" usage later.



C is a standardized assembler, but is a minimal language. So we should form the habit of breaking down a program into many function modules, and write small subroutines for these function modules.

If we can make such modules "general", that is good. But we need not do so. We should aim at a working program first.

For the C programs given in this book, you SHOULD NOT try to understand the logic, but think out the logic of your own. If you draw a flow-chart, and dry-run both the flow-chart and the coding, you can usually get your program right. To understand the logic of a program written by other people is a tedious and unrewarding task. Instead, you should read over the programs and pick up programming techniques.

Personally, I think it is a waste of time to assign programmers to maintain the programs of others. Once I was assigned to run a program written by another person. I found it hard to understand his logic, and so I wrote the program afresh. It took me very little time to write a fresh program.

We shall discuss "random file" in the next chapter. It will use "fseek(..), fread(..), and fwrite(..)", and I think you may have guessed how to write such programs. "fseek(..)" will be used to position the file to whatever position we want, and then read/write may be performed.

