Thoughts on using C-- in the practical implementation of a vector graphics Migration on Request tool

Phil Mellor <p.j.mellor@as.leeds.ac.uk>

CAMiLEON

Migration of digital data from one format to another is a key strategy for the preservation of digital materials. The CAMiLEON project investigated the practicalities of a particular migration concept called Migration on Request [1]. A tool to migrate vector graphic files was developed to evaluate the technique.

A key principle behind Migration on Request is to preserve the software tool over time. The Migration on Request tool must be maintained in the future for as yet unknown systems and architectures. Important choices have to be made about how the program should be implemented in order to ensure that it remains portable.

C is a fairly small and simple language. It is well established and support exists on practically every platform. It seems likely that such support would be continued on future platforms due to the popularity of the language and the amount of legacy code that already exists.

Not all compilers are the same. One C compiler may reject source code that another compiler handles quite happily. The problem is partly due to compiler writers adding extra features or being lenient when checking programs against the language standards. The language is also still evolving: C99 allows variables to be declared anywhere within a block, whereas earlier standards require variables to be declared at the start of a block.

If a C program needs to be converted to another language, the job would be fairly easy were it not for a few parts of the language that are now regarded as bad practice and do not appear in modern language designs. A restricted version, named C--, has been proposed by David Holdsworth of the CAMiLEON group [2]. C-- removes the "unhygienic" aspects of C.

The vector graphic Migration on Request tool was written in accordance with the suggestions for C-- (see Appendix A), along with some extra restrictions and recommendations described in this document.

Object oriented programming

C and C-- do not provide classes, an important facility for developing large, modern applications. A little more ingenuity and some enforced consistency are required to simulate a similar style of programming. This section describes how that was done.

C structures (struct) were used to create the data areas of an object. Public and private member variables could be distinguished through the use of special prefixes in their names. This could not be enforced by the compiler - accessing a "private" variable directly would not therefore raise an error or warning.

Class methods were simulated using ordinary functions. The function name was prefixed by the class name and an underscore; the first parameter would always be a pointer to the object on which the method was called.

Several standard functions had to be written for each class: new, destroy and copy. These were to behave in a particular way, to simulate the constructors, destructors and so on found in object oriented programming.
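The conventions above can be sketched as follows. This is only an illustration, not the tool's actual code: the class name Point, its members, and the pub_ and priv_ prefixes are all assumptions, since the prefixes really used are not given here.

```c
#include <stdlib.h>

/* Hypothetical "class" Point. The pub_/priv_ prefixes are an assumed
   naming convention; the compiler does not enforce them. */
struct Point {
    int pub_x;
    int pub_y;
    int priv_moves;   /* "private": how many times the point was moved */
};

/* new: allocates and initialises an object, or returns NULL. */
struct Point* Point_new(int x, int y)
{
    struct Point* self = (struct Point*) malloc(sizeof(struct Point));
    if (self != NULL) {
        self->pub_x = x;
        self->pub_y = y;
        self->priv_moves = 0;
    }
    return self;
}

/* destroy: releases the object. */
void Point_destroy(struct Point* self)
{
    free(self);
}

/* copy: returns an independent duplicate, or NULL. */
struct Point* Point_copy(struct Point* self)
{
    struct Point* duplicate = Point_new(self->pub_x, self->pub_y);
    if (duplicate != NULL) {
        duplicate->priv_moves = self->priv_moves;
    }
    return duplicate;
}

/* Method: the first parameter is always the object acted on. */
void Point_move(struct Point* self, int dx, int dy)
{
    self->pub_x = self->pub_x + dx;
    self->pub_y = self->pub_y + dy;
    self->priv_moves = self->priv_moves + 1;
}

/* Accessor, so that callers need not touch the "private" member. */
int Point_get_moves(struct Point* self)
{
    return self->priv_moves;
}
```

A call such as Point_move(p, 3, 4) then corresponds to p.move(3, 4) in an object oriented language, which is the kind of pattern a future converter could look for.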

This might sound like the organisational approaches that a programmer should be using anyway - just good software development and management skills. The idea is that when a suitable converter to another language is written, it should recognise these standard naming conventions, and from them be able to construct proper objects (or whatever the major construct of the next programming paradigm might be).

The method described here does not cater for the more complicated aspects of object oriented programming -- overloading functions, inheritance or polymorphism. It may be possible, although untidy, to do this in C and C--, as development tools and early C++ compilers such as Cfront already exist to convert C++ code into C.

Number storage

The exact sizes of the integer types are not fixed by the C standards, which guarantee only minimum ranges. Although most current C implementations use 32 bit integers, this cannot be assumed. Likewise, a short int cannot be assumed to be 16 bits (or half the size of an int). Some languages such as Java are more explicit about the size of each type.

There is also the issue of byte order: whether numbers are stored in little-endian (least significant byte first) or big-endian (most significant byte first) format.

These problems are not easily resolved. The approach taken when developing the vector graphic Migration on Request tool was to assume that ints are at least 32 bits long, and that short ints are at least 16 bits long. This seemed sensible given the bit lengths used in other languages, and it reduced the risk of, for example, an attempt to set a bit flag that did not exist.
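One way such assumptions might be checked and worked around is sketched below. The function names are hypothetical and nothing here is taken from the tool itself. The first function tests the int assumption with an ordinary if test on constants from limits.h. The second builds a number from individual byte values arithmetically, so the result does not depend on the byte order of the host machine.

```c
#include <limits.h>

/* Returns 1 if int can hold at least 32 bits, otherwise 0.
   The C standard already guarantees that short int holds at least
   16 bits (SHRT_MAX >= 32767), so only int needs to be tested. */
int Platform_int_is_32_bits(void)
{
    int ok = 1;
    if (INT_MAX < 2147483647L) {
        ok = 0;
    }
    return ok;
}

/* Combines four byte values (each 0..255), given least significant
   byte first, into one number. Only arithmetic is used, so the
   answer is the same on little-endian and big-endian machines. */
unsigned long Bytes_to_ulong_le(int b0, int b1, int b2, int b3)
{
    unsigned long value = (unsigned long) b3;
    value = value * 256UL + (unsigned long) b2;
    value = value * 256UL + (unsigned long) b1;
    value = value * 256UL + (unsigned long) b0;
    return value;
}
```

A file reader written this way would fetch bytes one at a time and combine them, rather than copying raw memory into an int.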

Memory management

The C language does not have very advanced memory management functions. The malloc function is used to claim a number of bytes of memory. The sizeof operator can be used to calculate the amount of memory required, or the programmer can work out the amount by hand. Arrays can be allocated by multiplying the size passed to malloc, or by giving calloc the number of items to allocate. It can be difficult to determine the size of the array being allocated, especially if sizeof is not used, and this could cause problems when the source code is converted to another language.

C
    Individual object:
        struct Thing* x = (struct Thing*)
            malloc(sizeof(struct Thing));
        free(x);
    Array of objects:
        struct Thing* xs = (struct Thing*)
            calloc(10, sizeof(struct Thing));
        free(xs);

C++
    Individual object:
        Thing* x = new Thing;
        delete x;
    Array of objects:
        Thing* xs = new Thing[10];
        delete [] xs;

Java
    Individual object:
        Thing x = new Thing();
    Array of objects:
        Thing xs[] = new Thing[10];

Table: Dynamic memory allocation

These basic memory facilities can be dangerous, as it is very easy to allocate the wrong amount of memory. It is also possible to access array elements that lie outside the allocated memory, which can lead to unexpected results. Other languages (e.g. Java) may prevent this from happening, perhaps by throwing an exception. Great care must be taken to avoid these sorts of problems.

In Java every object variable can be thought of as a pointer, and all objects are allocated memory dynamically using the new operator. This doesn't apply to the basic data types such as int, which cannot be pointed to and instead have to be put in a 'wrapper class' such as Integer.

For these reasons, in the Migration on Request tool, all structs were created dynamically using malloc. Obtaining and manipulating the reference address for basic data types such as int was also avoided.

Because in C there is no distinction between a pointer to an individual object and a pointer to the start of an array, it was recommended that standard arrays were not used - linked lists (or similar concepts) were to be used instead. It could now be assumed (but not enforced) that a C pointer would point to an individual object. This also reduced the temptation to use address arithmetic.
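A minimal sketch of a linked list used in place of an array is given below. The CoordList and CoordNode names and members are hypothetical, chosen only for illustration. Every pointer refers to a single dynamically allocated object, and no subscripting or address arithmetic is needed.

```c
#include <stdlib.h>

/* Hypothetical node holding one value and a link to the next node. */
struct CoordNode {
    int pub_value;
    struct CoordNode* pub_next;
};

/* Hypothetical list object: every pointer refers to ONE object. */
struct CoordList {
    struct CoordNode* priv_first;
    struct CoordNode* priv_last;
    int priv_length;
};

struct CoordList* CoordList_new(void)
{
    struct CoordList* self =
        (struct CoordList*) malloc(sizeof(struct CoordList));
    if (self != NULL) {
        self->priv_first = NULL;
        self->priv_last = NULL;
        self->priv_length = 0;
    }
    return self;
}

/* Adds a value at the end. Returns 1 on success, 0 if out of memory. */
int CoordList_append(struct CoordList* self, int value)
{
    struct CoordNode* node =
        (struct CoordNode*) malloc(sizeof(struct CoordNode));
    if (node == NULL) {
        return 0;
    }
    node->pub_value = value;
    node->pub_next = NULL;
    if (self->priv_last == NULL) {
        self->priv_first = node;
    } else {
        self->priv_last->pub_next = node;
    }
    self->priv_last = node;
    self->priv_length = self->priv_length + 1;
    return 1;
}

int CoordList_get_length(struct CoordList* self)
{
    return self->priv_length;
}

/* Walks the list from first to last, adding up the values. */
long CoordList_sum(struct CoordList* self)
{
    long total = 0;
    struct CoordNode* node = self->priv_first;
    while (node != NULL) {
        total = total + node->pub_value;
        node = node->pub_next;
    }
    return total;
}

/* Frees every node and then the list itself. */
void CoordList_destroy(struct CoordList* self)
{
    struct CoordNode* node = self->priv_first;
    struct CoordNode* next = NULL;
    while (node != NULL) {
        next = node->pub_next;
        free(node);
        node = next;
    }
    free(self);
}
```

The cost is visible even in this small case: a few lines of array code become a whole module, which is part of the verbosity discussed later in this document.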

Variadic functions

These are not allowed in C--. There are some common library routines that are variadic (e.g. printf), but the facility does not carry forward to many other languages.

Within highly specified modules, the use of variadic routines such as fprintf was permitted in the Migration on Request tool. Calls were placed in separate functions, which were identified and well documented. This minimised the occurrences of variadic functions and highlighted exactly where modification would need to take place when migrating the program to another language.
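The idea of confining variadic calls to a few documented functions might look like the sketch below. The Output_ module and its function names are assumptions made for illustration. Only the first two functions call fprintf; everything else is built from them and so contains no variadic call.

```c
#include <stdio.h>

/* Hypothetical output module. These two functions are the ONLY
   places where the variadic fprintf is called, so they are the
   only places needing attention when porting. */

/* Writes an int in decimal. Returns 1 on success, 0 on failure. */
int Output_write_int(FILE* out, int value)
{
    int ok = 1;
    if (fprintf(out, "%d", value) < 0) {
        ok = 0;
    }
    return ok;
}

/* Writes a string. Returns 1 on success, 0 on failure. */
int Output_write_text(FILE* out, const char* text)
{
    int ok = 1;
    if (fprintf(out, "%s", text) < 0) {
        ok = 0;
    }
    return ok;
}

/* Writes a point as "(x,y)" using only the fixed-argument wrappers
   above. Stops at the first failure and returns 0; returns 1 if
   every part was written. */
int Output_write_point(FILE* out, int x, int y)
{
    int ok = Output_write_text(out, "(");
    if (ok == 1) {
        ok = Output_write_int(out, x);
    }
    if (ok == 1) {
        ok = Output_write_text(out, ",");
    }
    if (ok == 1) {
        ok = Output_write_int(out, y);
    }
    if (ok == 1) {
        ok = Output_write_text(out, ")");
    }
    return ok;
}
```

In a target language without variadic functions, only Output_write_int and Output_write_text would need rewriting by hand.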

Thoughts on programming with C--

The restrictions of C-- sometimes made parts of the Migration on Request tool more difficult to write than if conventional C or a higher level language like C++ or Java had been used. Memory allocation, input and output, and data structures were the main areas of difficulty. The problems could be worked around, but the resulting code tended to be more verbose, so care had to be taken to keep it readable.

The vector graphic tool would have been easier to design and implement if a true object oriented approach could have been used. The data structures for the elements would have been more logical with a true hierarchy, and inheritance would have meant that adding new element types and applying attribute effects would have been neater.

Of course this could apply to many languages, not just C--, but it is an important consideration.

Although the portability of the program to future systems and languages remains essential, it is also necessary to remember that it has to be maintained by a human programmer who has not overseen the development of the software. Even if the program can be ported to another system with ease, this advantage could be diminished if intensive work is required to extend its capabilities.

Developing a large program in C-- is certainly possible, but, as with any major project, it requires discipline and good management.

A new language?

C-- was conceived as a language for writing emulators more than migration tools. Holdsworth [2] states that "although C may not lend itself quite so well to the task of writing migration tools as it does to the writing of emulators, it is still quite viable in this area ... by writing migration tools in our more restrictive C--, the extra longevity achieved will more than repay any inconvenience in the programming language."

It would be worthwhile to investigate whether a restricted version of a higher level language such as C++ or Java could offer a similar degree of longevity but with a specification more suited to writing a migration tool.

C++ has received criticism for being too large and having a sometimes flawed design [3]. It is based on C, and therefore the concerns raised about using C-- might also apply to C++--.

Java uses core object oriented features without many of the complications found in C++. But Java also brings, as an example, garbage collection, which may be less portable to other languages. Although Java programs are executed on a virtual machine, it does not necessarily follow that a virtual machine would be used to provide longevity of a Migration on Request tool. There is still the possibility in the future of porting the tool to another language where it could run natively. Java-- would be used because of its language constructs rather than its method of execution.

The C-- proposal excluded the undesirable features from the C language. To devise a subset of C++ or Java involves a selection from a much larger and more complicated specification. It is perhaps better to start with a very basic set of features (as found in C--) and build from there. Quite what to include requires further analysis of the possible languages.

Appendix A: Recommendations for C--

These recommendations were taken from 'C-ing ahead for Digital Longevity' [2].

Features for omission from C--

Macros

The C macro preprocessor is widely regarded as a route to confusing code, although it originally allowed efficient implementation of multiple variants from a single source code.

It is now accepted that normal if tests using values known at compile time enable modern optimising compilers to achieve the same level of efficiency -- which, in any case, is not our main concern.
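As a rough illustration of this point, a variant that might once have been selected with #ifdef can be selected with an ordinary if test on a constant. The names VERBOSE_NAMES and Element_line_label below are hypothetical. An optimising compiler can usually see that the constant never changes and drop the unused branch.

```c
/* Hypothetical variant switch. With the preprocessor this would be
   "#ifdef VERBOSE_NAMES"; here it is an ordinary constant whose
   value is known at compile time. */
static const int VERBOSE_NAMES = 0;

/* Returns the label used for a line element in the chosen variant. */
const char* Element_line_label(void)
{
    const char* label = "L";
    if (VERBOSE_NAMES == 1) {
        label = "line";
    }
    return label;
}
```

Both variants remain ordinary C that any converter can read, whereas text removed by the preprocessor is never seen by the compiler at all.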

Unions

The particular style of the C union does not survive to other languages, although Pascal and Ada both have (different) equivalent facilities. Object orientation techniques rather render the idea obsolete. In any case the nature of the code of an emulator is such that the concept is likely to be of little value.

In some respects unions have their origin in FORTRAN's EQUIVALENCE statement, which was a notorious cause of portability problems in the past.

Address arithmetic

Many typical C programs are filled with address arithmetic. In part this is historic, because the array facilities of C were not there in the earliest versions of the language.

Also, address arithmetic code often compiles to faster code than the equivalent algorithm written using array subscripting. C-- should force the use of array subscripting (as does Java).
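The difference between the two styles can be seen in the following sketch, in which both function names are invented for illustration. The first version uses subscripting only, as C-- would require. The second is the same algorithm written with address arithmetic and is shown purely for contrast.

```c
/* Sums the first count elements using array subscripting,
   the style that C-- would enforce. */
long Values_sum(const int values[], int count)
{
    long total = 0;
    int i = 0;
    while (i < count) {
        total = total + values[i];
        i = i + 1;
    }
    return total;
}

/* The same algorithm using address arithmetic. Shown only for
   contrast: this is the style that C-- would omit. */
long Values_sum_with_pointers(const int* values, int count)
{
    const int* end = values + count;
    long total = 0;
    while (values != end) {
        total += *values++;
    }
    return total;
}
```

The subscripted form translates almost line for line into Java or Pascal, while the pointer form has no direct equivalent in either.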

Various bits of the C library might well be omitted.

Perhaps we should be stronger, and explicitly list those library routines that are allowed as part of C--.

Restrictions on usage in C--

Conditions

The condition in if-statements, while-statements, etc. must be boolean, or perhaps a relational expression. C's use of an integer does not carry forward into other languages. Forcing a relational expression would give correct code that delivered a boolean test when translated into other languages.

Assignments

The result of an assignment must be voided. This is perhaps more contentious. The facility does carry forward into Java, but is absent from Pascal and many other languages. Its use can make code more difficult to read.
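The two restrictions on conditions and assignments can be illustrated together. The function below is a hypothetical example, not code from the tool. It uses a relational expression as its loop condition, and each assignment is a statement in its own right whose result is discarded.

```c
/* Counts the decimal digits of a non-negative number. */
int Number_count_digits(long number)
{
    int digits = 1;
    long remaining = number / 10;
    /* Relational test, rather than C's "while (remaining)". */
    while (remaining != 0) {
        /* Each assignment stands alone; its value is never used,
           unlike "while ((remaining /= 10) != 0)". */
        remaining = remaining / 10;
        digits = digits + 1;
    }
    return digits;
}
```

For example, Number_count_digits(12345) returns 5 and Number_count_digits(0) returns 1.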

Variadic functions are not allowed

There are some classic library routines that are variadic (e.g. printf) but the facility does not carry forward to many other languages.

References

[1] Phil Mellor, Paul Wheatley, Derek Sergeant. Migration on Request: a practical technique for preservation.
[2] David Holdsworth. C-ing ahead for Digital Longevity.
[3] Ian Joyner. C++?? - A Critique of C++ (3rd Ed.) and Programming and Language Trends of the 1990s.