Migration of digital data from one format to another is a key strategy for the preservation of digital materials. The CAMiLEON project investigated the practicalities of a particular migration concept called Migration on Request [1]. A tool to migrate vector graphic files was developed to evaluate the technique.
A key principle behind Migration on Request is to preserve the software tool over time. The Migration on Request tool must be maintained in the future for as yet unknown systems and architectures. Important choices have to be made about how the program should be implemented in order to ensure that it remains portable.
C is a fairly small and simple language. It is well established and support exists on practically every platform. It seems likely that such support would be continued on future platforms due to the popularity of the language and the amount of legacy code that already exists.
Not all compilers are the same. One C compiler may encounter problems with some source code that another compiler will handle quite happily. The problem is partly due to compiler writers adding extra features or being lenient when checking programs against the language standards. The language is also still evolving: C99 allows variables to be declared anywhere in a block, whereas earlier standards require them to be declared at the start of a block.
If there is a need to convert a C program to another language, this would be a fairly easy job were it not for a few parts of the language which are now accepted as bad practice and do not appear in modern language design. A restricted version, named C--, has been proposed by David Holdsworth of the CAMiLEON group [2]. C-- removes the "unhygienic" aspects of C.
The vector graphic Migration on Request tool was written in accordance with the suggestions for C-- (see Appendix A), along with some extra restrictions and recommendations described in this document.
C and C-- do not provide classes, an important facility for developing large, modern applications. A little more ingenuity and some enforced consistency are required to simulate a similar method of programming. How this was done is described in this section.
C structures (struct) were used to create the data areas of an
object. Public and private member variables could be distinguished through
the use of special prefixes in their names. This could not be enforced by the
compiler - accessing a "private" variable directly would not therefore raise
an error or warning.
Class methods were simulated using ordinary functions. The function name was prefixed by the class name and an underscore; the first parameter would always be a pointer to the object on which the method was called.
It was required that several standard functions be written for each
class: new, destroy and copy.
These were to behave in a particular way, to simulate the constructors,
destructors, etc. found in object oriented programming.
This might sound like the organisational approaches that a programmer should be using anyway - just good software development and management skills. The idea is that when a suitable converter to another language is written, it should recognise these standard naming conventions, and from them be able to construct proper objects (or whatever the major construct of the next programming paradigm might be).
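The conventions described above can be sketched as follows. This is only an illustration, not code from the Migration on Request tool: the Point class, its members, the Point_moveBy method and the pub_/priv_ prefixes are all hypothetical names chosen for the example.

```c
#include <stdlib.h>

/* Hypothetical "class" Point. The pub_ and priv_ prefixes are an assumed
   naming convention; the compiler does not enforce them. */
struct Point {
    int pub_x;
    int pub_y;
    int priv_moves;   /* number of times the point has been moved */
};

/* Constructor: allocates and initialises a Point, or returns NULL. */
struct Point* Point_new(int x, int y)
{
    struct Point* self = (struct Point*) malloc(sizeof(struct Point));
    if (self != NULL) {
        self->pub_x = x;
        self->pub_y = y;
        self->priv_moves = 0;
    }
    return self;
}

/* Destructor: releases the storage claimed by Point_new. */
void Point_destroy(struct Point* self)
{
    free(self);
}

/* Copy: returns a new, independent Point with the same state. */
struct Point* Point_copy(const struct Point* self)
{
    struct Point* duplicate = Point_new(self->pub_x, self->pub_y);
    if (duplicate != NULL) {
        duplicate->priv_moves = self->priv_moves;
    }
    return duplicate;
}

/* Method: the first parameter is always the object being acted on. */
void Point_moveBy(struct Point* self, int dx, int dy)
{
    self->pub_x = self->pub_x + dx;
    self->pub_y = self->pub_y + dy;
    self->priv_moves = self->priv_moves + 1;
}
```

A caller would write Point_new(1, 2), then Point_moveBy(p, 3, 4), and finally Point_destroy(p). A converter that recognised the Class_method pattern could, in principle, map each group of functions onto a class in the target language.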
The method described here does not cater for the more complicated aspects of object oriented programming -- overloading functions, inheritance or polymorphism. It may be possible, although untidy, to do this in C and C--, as development tools and early C++ compilers such as Cfront already exist to convert C++ code into C.
The sizes of the various types are not fixed by the C standards, which
specify only minimum ranges. Although most current C implementations use
32 bit integers, this cannot be assumed. Likewise, a short int cannot be
assumed to be exactly 16 bits (or half the size of an int).
Some languages such as Java are more explicit about the size of each type.
There is also the issue of byte order: whether multi-byte numbers are stored least-significant byte first (little-endian) or most-significant byte first (big-endian).
These problems are not easily resolved. The approach taken when
developing the vector graphic Migration on Request tool is to assume that
ints are at least 32 bits long, and that short
ints are at least 16 bits long. This seemed sensible given the
bit lengths used in other languages, and reduced worries of, for example, an
attempt to set a bit flag that didn't exist.
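One way to make such assumptions explicit is to test them once at start-up and to build multi-byte values from individual bytes, so that the host's byte order never matters. The sketch below is an assumption about how this could be done, not a description of the tool itself; the names Platform_meetsAssumptions and Bytes_readU32LE are invented for the example.

```c
#include <limits.h>

/* Returns 1 if the platform satisfies the assumed minimum sizes
   (int at least 32 bits, short int at least 16 bits), 0 otherwise. */
int Platform_meetsAssumptions(void)
{
    int ok = 1;
    if (INT_MAX < 2147483647L) {    /* int is narrower than 32 bits */
        ok = 0;
    }
    if (SHRT_MAX < 32767) {         /* short int is narrower than 16 bits */
        ok = 0;
    }
    return ok;
}

/* Assembles a 32 bit value from four bytes stored least-significant
   byte first. The result is the same on little-endian and big-endian
   hosts because only shifts and subscripts are used. */
unsigned long Bytes_readU32LE(const unsigned char bytes[])
{
    unsigned long value = 0UL;
    value = value | ((unsigned long) bytes[0]);
    value = value | ((unsigned long) bytes[1] << 8);
    value = value | ((unsigned long) bytes[2] << 16);
    value = value | ((unsigned long) bytes[3] << 24);
    return value;
}
```

The example uses unsigned long for the 32 bit value because the C standard guarantees that long is at least 32 bits wide, whereas int carries no such guarantee.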
The C language does not have very advanced memory management functions.
The malloc function is used to claim a number of bytes of
memory. The sizeof operator can be used to calculate the amount
of memory required, or the programmer can predict the amount. Arrays can be
allocated by multiplying the size passed to malloc or by giving
the number of items to allocate to calloc. It could be
difficult to determine the size of the array being allocated, especially if
sizeof is not used, and this could cause problems when the source
code is converted to another language.
| Language | Individual object | Array of objects |
| --- | --- | --- |
| C | struct Thing* x = (struct Thing*) malloc(sizeof(struct Thing)); free(x); | struct Thing* xs = (struct Thing*) calloc(10, sizeof(struct Thing)); free(xs); |
| C++ | Thing* x = new Thing; delete x; | Thing* xs = new Thing[10]; delete [] xs; |
| Java | Thing x = new Thing(); | Thing xs[] = new Thing[10]; |
These basic memory facilities can be dangerous as it is very easy to allocate the wrong amount of memory. It is also possible to access array elements that lie outside the claimed memory, which can lead to unexpected results. Other languages (e.g. Java) may prevent this from happening, perhaps by throwing an exception. Great care must be taken to avoid these sorts of problems.
In Java every object variable can be thought of as a pointer, and all
objects are allocated memory dynamically using the new operator. This
doesn't apply to the basic data types such as int, which
cannot be pointed to and instead have to be put in a 'wrapper class' such as
Integer.
For these reasons, in the Migration on Request tool, all
structs were created dynamically using malloc.
Obtaining and manipulating the reference address for basic data types such as
int was also avoided.
Because in C there is no distinction between a pointer to an individual object and a pointer to the start of an array, it was recommended that standard arrays were not used - linked lists (or similar concepts) were to be used instead. It could now be assumed (but not enforced) that a C pointer would point to an individual object. This also reduced the temptation to use address arithmetic.
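A minimal sketch of this style is given below, assuming a singly linked list of integers. The List and Node types and their functions are hypothetical and are not taken from the tool. Every struct is claimed with malloc and sizeof, each pointer refers to exactly one object, and no address arithmetic is needed.

```c
#include <stdlib.h>

/* One element of the list; next is NULL at the end of the list. */
struct Node {
    int value;
    struct Node* next;
};

/* The list object itself, holding the first node and a count. */
struct List {
    struct Node* head;
    int length;
};

/* Constructor: returns an empty list, or NULL if memory is exhausted. */
struct List* List_new(void)
{
    struct List* self = (struct List*) malloc(sizeof(struct List));
    if (self != NULL) {
        self->head = NULL;
        self->length = 0;
    }
    return self;
}

/* Adds a value at the front of the list. Returns 1 on success, 0 on failure. */
int List_prepend(struct List* self, int value)
{
    struct Node* node = (struct Node*) malloc(sizeof(struct Node));
    if (node == NULL) {
        return 0;
    }
    node->value = value;
    node->next = self->head;
    self->head = node;
    self->length = self->length + 1;
    return 1;
}

/* Walks the list by following next pointers and totals the values. */
int List_sum(const struct List* self)
{
    int total = 0;
    const struct Node* node = self->head;
    while (node != NULL) {
        total = total + node->value;
        node = node->next;
    }
    return total;
}

/* Destructor: frees every node and then the list object. */
void List_destroy(struct List* self)
{
    struct Node* node = self->head;
    while (node != NULL) {
        struct Node* following = node->next;
        free(node);
        node = following;
    }
    free(self);
}
```

The cost of this approach is that access to the n-th element requires a walk along the list, which is one reason the resulting code tends to be more verbose than the array equivalent.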
Variadic functions are not allowed in C--. There are some common library
routines that are variadic (e.g. printf), but the facility does not carry
forward to many other languages.
Within highly specified modules, the use of variadic routines such as
fprintf was permitted in the Migration on Request tool.
The calls were placed in separate
functions, which were identified and well documented. This minimised the
occurrences of variadic functions and highlighted exactly where modification
would need to take place when migrating the program to another language.
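As an illustration of this isolation, the hypothetical wrapper below is the only function in its (imagined) output module that calls fprintf. The name Output_writeLabelledInt and the "label=value" format are assumptions made for the example.

```c
#include <stdio.h>

/* VARIADIC CALL: this function is the single point in the module where
   fprintf is used. When porting, only this body needs rewriting.
   Writes "label=value" followed by a newline.
   Returns 1 on success, 0 if the write failed. */
int Output_writeLabelledInt(FILE* stream, const char* label, int value)
{
    int ok = 0;
    int written = fprintf(stream, "%s=%d\n", label, value);
    if (written >= 0) {
        ok = 1;
    }
    return ok;
}
```

The rest of the program calls Output_writeLabelledInt(stdout, "width", 10) and never sees a format string, so its own function signatures stay fixed-arity.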
The restrictions on C-- sometimes made parts of the Migration on Request tool more difficult to write than if conventional C or a higher level language like C++ or Java had been used. Memory allocation, input and output, and data structures were the main areas of difficulty. Although the problems could be worked around, care had to be taken to ensure the code remained readable, as it was likely to be more verbose.
The vector graphic tool would have been easier to design and implement if a true object oriented approach could have been used. The data structures for the elements would have been more logical with a true hierarchy, and inheritance would have meant that adding new element types and applying attribute effects would have been neater.
Of course this could apply to many languages, not just C--, but it is an important consideration.
Although the portability of the program to future systems and languages remains essential, it is also necessary to remember that it has to be maintained by a human programmer who has not overseen the development of the software. Even if the program can be ported to another system with ease, this advantage could be diminished if intensive work is required to extend its capabilities.
Developing a large program in C-- is certainly possible, but, as with any major project, it requires discipline and good management.
C-- was conceived as a language for writing emulators more than migration tools. Holdsworth [2] states that "although C may not lend itself quite so well to the task of writing migration tools as it does to the writing of emulators, it is still quite viable in this area ... by writing migration tools in our more restrictive C--, the extra longevity achieved will more than repay any inconvenience in the programming language."
It would be worthwhile to investigate whether a restricted version of a higher level language such as C++ or Java could offer a similar degree of longevity but with a specification more suited to writing a migration tool.
C++ has received criticism for being too large and having a sometimes flawed design [3]. It is based on C, and therefore the concerns raised about using C-- might also apply to C++--.
Java uses core object oriented features without many of the complications found in C++. But Java also brings, as an example, garbage collection, which may be less portable to other languages. Although Java programs are executed on a virtual machine, it does not necessarily follow that a virtual machine would be used to provide longevity of a Migration on Request tool. There is still the possibility in the future of porting the tool to another language where it could run natively. Java-- would be used because of its language constructs rather than its method of execution.
The C-- proposal excluded the undesirable features from the C language. To devise a subset of C++/Java involves a selection from a much larger and more complicated specification. It is perhaps better to start with a very basic set of features (as found in C--) and build from there. Quite what to include requires further analysis of the possible languages.
These recommendations were taken from 'C-ing ahead for Digital Longevity' [2].
The C macro preprocessor is widely regarded as a route to confusing code, although it originally allowed efficient implementation of multiple variants from a single source code.
It is now accepted that normal if tests using values
known at compile time enable modern optimising compilers to achieve
the same level of efficiency, which in any case is not our main concern.
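A small sketch of this idea follows, with invented names (USE_HIGH_PRECISION, Units_toInternal) and invented scale factors. The variant is selected by an ordinary constant and an if test rather than by #if or #ifdef. Whether a given compiler removes the unused branch is not guaranteed by the language.

```c
/* Hypothetical build variant switch: an ordinary constant, not a macro.
   Changing it to 0 selects the coarser variant. */
static const int USE_HIGH_PRECISION = 1;

/* Converts a length in points to the program's internal units. */
long Units_toInternal(long points)
{
    long result;
    if (USE_HIGH_PRECISION == 1) {
        result = points * 1000L;   /* fine-grained internal units */
    } else {
        result = points * 10L;     /* coarse internal units */
    }
    return result;
}
```

Both branches are always checked by the compiler, which is an advantage over conditional compilation: a variant that is rarely built cannot silently stop compiling.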
The particular style of the C union does not survive to other languages, although Pascal and Ada both have (different) equivalent facilities. Object orientation techniques rather render the idea obsolete. In any case the nature of the code of an emulator is such that the concept is likely to be of little value.
In some respects unions have their origin in FORTRAN's
EQUIVALENCE statement, which was a notorious cause of portability
problems in the past.
Many typical C programs are filled with address arithmetic. In part this is historic, because the array facilities of C were not there in the earliest versions of the language.
Also, address arithmetic code often compiles to faster code than the equivalent algorithm written using array subscripting. C-- should force the use of array subscripting (as does Java).
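The difference can be seen in a simple summation. The function below is a hypothetical example (Samples_sum is not part of C-- or of any library) written with subscripts only. The address arithmetic form it avoids is shown in a comment.

```c
/* Sums the first count elements of samples using subscripts only.
   A typical address arithmetic version would instead advance a pointer:
       while (p < end) { total += *p++; }
   which has no direct equivalent in languages without pointer arithmetic. */
long Samples_sum(const int samples[], int count)
{
    long total = 0L;
    int i = 0;
    while (i < count) {
        total = total + samples[i];
        i = i + 1;
    }
    return total;
}
```

The subscripted loop translates almost line for line into Java or Pascal.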
Perhaps we should be stronger, and explicitly list those library routines that are allowed as part of C--.
The condition in if-, while-statements, etc. must be boolean, or perhaps a relational expression. C's use of an integer does not carry forward into other languages. Forcing a relational expression would give correct code that delivered a boolean test when translated into other languages.
The result of an assignment must be voided. This is perhaps more contentious. The facility does carry forward into Java, but is absent from Pascal and many other languages. Its use can make code more difficult to read.
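Both of these rules can be illustrated with one short function. The example is a sketch with an invented name (Text_countSpaces). The conventional C idiom that breaks both rules appears only in a comment.

```c
/* Counts the space characters in a NUL-terminated string.
   Conventional C might write the loop as
       while ((c = text[i++])) { ... }
   which uses the value of an assignment and tests a char as if it
   were a boolean. Here each assignment is a statement on its own and
   every condition is a relational expression. */
int Text_countSpaces(const char text[])
{
    int count = 0;
    int i = 0;
    char c = text[0];
    while (c != '\0') {
        if (c == ' ') {
            count = count + 1;
        }
        i = i + 1;
        c = text[i];
    }
    return count;
}
```

The restricted form is a few lines longer, but the loop condition can be carried over unchanged into a language with a true boolean type.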
There are some classic library routines that are variadic (e.g. printf), but the facility does not carry forward to many other languages.