Digital Library Models and Implementations

Ken Varnum

Keywords: Digital Libraries, Archives, Collection Development

Table of Contents

Introduction

I. What is a Digital Library: Toward a Working Definition
A. What a digital library should look like
B. Existing models

II. Archives as a Model
What is an archives
Lessons To be added

III. Criteria for Judging the Implementation of a Model
A. Scalable
B. Useful
C. Useable

IV. Virtual Books for the Virtual Library: Collection Development
A. Comparing collection development policies: traditional and digital libraries
B. Developing a collection development policy for the digital library

Conclusion: What Next?


Introduction

It seems that the technological aspects of the digital library are the ones that receive the most attention. The technology is important, but also important are the intellectual underpinnings which both drive and inform technological innovation. It is assumed, for the purpose of this essay, that the optimists among the computer science set are correct and that technological problems (such as Internet bandwidth, computing power, and interactions among intelligent agents, among others) will inevitably be solved. These solutions will happen with or without a theoretical underpinning to what problems the digital library hopes to solve, how those problems, once identified, should be solved, or what to do with a working digital library once it comes into existence. This essay seeks to explore these three areas of inquiry.

This essay has three main sections. The first section will explore the concept of the digital library, examine how it differs from traditional libraries, and in the process develop a definition of the digital library. The second section is devoted to developing a set of criteria for evaluating digital library models and proposing a conceptual model to help clarify what a digital library should be. The final section concerns the aftermath of having created a digital library. Once one exists, how should it be stocked with resources?

I. What is a Digital Library:
Toward a Working Definition

Before I begin talking about models for the digital library, or criteria for judging one, I will first develop what I think a digital library is. The term itself is widely used in a variety of contexts and has already been endowed with so many connotations that to discuss a "digital library" without explaining what is meant is to risk misunderstanding. At the same time, a precise description is difficult since the concept of the digital library is evolving rapidly. I shall endeavor to at least narrow the range of concepts embodied in the rubric to a manageable and coherent range.

What a digital library should look like

My basic premise is that there is not a single digital library that will be perfect for all situations. To think so is as unreasonable as to think that there is just one kind of library--that there is one organizational scheme, one set of materials, one fixed range of subjects that all libraries should have. That regular libraries quite happily use the Dewey and Library of Congress organizational systems without undue stress and strain on the user, that there are libraries specializing in certain types of resources--periodicals, microform publications, pre-printing press manuscripts, etc.--and that there are libraries as general as the Library of Congress and as specialized as the Folger Shakespeare Library should belie that belief. And while the digital library will be many things the (now broadly-defined) "traditional" library is not, uniform is not one of them.

Thus, a digital library should have whatever set of resources is of most use to its clientele. And, since it is irrelevant to the search engine how the materials are displayed to the user, the organizational scheme that most benefits a set of users should be employed for that set. The flexibility of the digital library is one of its main selling points. The digital library can be many things because it can be customized. Thus, it should be able to attract any user who might currently not use the traditional library because of the arcane or misunderstood organizational scheme employed there. If a user wants to browse a collection in Library of Congress cataloging order, so be it. If she prefers a straight alphabetical order, then she should be accommodated.
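
As a rough illustration of the point that display order is independent of the underlying collection, the following Python sketch presents the same small set of records in two different orders. The records, field names, and call numbers are hypothetical examples, and the call-number comparison is a plain string sort, not an implementation of actual Library of Congress filing rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    title: str
    call_number: str  # hypothetical, simplified call numbers


# Hypothetical sample collection; not drawn from any real catalog.
COLLECTION = [
    Record("Hamlet", "PR2807"),
    Record("A Brief History of Time", "QB981"),
    Record("Cataloging Basics", "Z693"),
]

# Each browsing preference maps to a sort key over the same records.
ORDERINGS = {
    "call_number": lambda r: r.call_number,
    "alphabetical": lambda r: r.title.lower(),
}


def browse(collection, preference):
    """Return the same collection in the order the user asked for."""
    return sorted(collection, key=ORDERINGS[preference])
```

Supporting a further browsing preference would, under these assumptions, only require another entry in `ORDERINGS`; the collection itself is never reorganized.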

As for content, the question is even broader. Just as the digital library is not bound by a physical organization, it will not be (by its very nature) bound by any "spatial" constraint--since "space" is no longer an apt term. As both Bill Birmingham and Mike Wellman described [Lectures to ILS 605, October 5, 1994 and November 30, 1994, respectively], and as is logical given the capabilities of decentralized computing, there will not be a digital library, but rather digital libraries, interconnected but independent. Many heralds of the digital future speak with hushed tones about the wonders of decentralization and how it will change the face of information retrieval and access. These heralds have, perhaps, become too caught up in the future to remember that, at present, information is already distributed. As outlined above, there already exists a variety of paper libraries with specializations ranging from the very broad to the very narrow. Libraries are already decentralized. They are even interconnected, but use Interlibrary Loan (ILL) in place of the NSF Backbone. The true advantage of the digital library will not be decentralization or interconnectedness, but the speed at which a user can access any particular digital library within the overall environment.

Existing models

A digital library can be defined broadly or narrowly. It seems to me that the digital library is, in fact, not a new creation--we already have and use them, although their current form is not as extensive or elaborate as what is envisioned for the future. As a starting definition, a digital library is a collection of electronic resources which can be searched from a common (often, but not necessarily, central) location. Dialog, the well-known collection of on-line databases, is a digital library, as is Lexis-Nexis. User interfaces for these databases are not as developed as they might be, and require a great deal of training and experience for successful use. Nonetheless, they are digital libraries. A large number of disparate information resources are available "under one roof," as it were, and although the vocabulary and syntax needed to search one file may be different from that needed for another, the process is essentially the same.

Although this definition is a good starting point, it does not quite capture the scope (I might even say "grandeur") of what will soon be the digital library. The future digital library should be much larger, should allow searches without much, if any, specialized training or experience, should adapt to the needs and expectations of the user, and should be.

II. Archives as a Model

This section to be added

III. Criteria for Judging the
Implementation of a Model

The assumption that the problem of creating a working digital library architecture will be solved does not imply that whatever the computer programmers and theoreticians decide is a good solution actually will be one. If the technological problems stated above are solved--if digital libraries can be created as a blank form into which virtually unlimited information can be loaded--a second question arises: What should this digital library look like? What are the criteria to judge whether or not a particular model of the digital library is a successful implementation of the concept? In this section, I propose three general requirements that must be met if a digital library is to be considered a success. The implementation must be: scaleable, useful and--most important--useable. Below, I will outline what I mean by each of these.

Scaleable

One of the largest assumptions of the digital library project is that digital materials will be created or found to fill the virtual shelves of the library. In the current, testbed, phase of digital library development, the assumption that sufficient materials can be found to fill the single shelf being used is not far from the mark. The UMDL, for example, plans to have a testbed program up and running by February 1995 with a small number of carefully selected databases made available for exploration by a small number of simultaneous users [Bill Birmingham, lecture to ILS 605, 30 November 1994]. Such a testbed is essential to ensure that the architecture created for the digital library is robust enough not to come down with a crash when subjected to "real" use. I put real in quotation marks here because, just as the digital library itself is an amorphous and slippery concept, so are ideas of how such a library will be used, even initially. Regardless of whether a digital library testbed is endowed with more resources or subjected to more users, it will still be a testbed, which will not function the same way as a real-world working model.

Any digital library design must, therefore, be able to expand to handle many more users and provide access to many more resources than even seems likely to be extant. One lesson (and possibly even a rule) of the information revolution of the last few years is that use grows far faster than expectations. For example:

America Online

AOL, an Internet access provider, had to stop advertising for new customers in mid-1994 until it could expand its computing resources and rewrite its source code to handle the use loads it unexpectedly achieved, and announced plans at the end of 1994 to double its subscription base (to three million people) by the end of 1995. [The New York Times, National Edition, 6 December 1994, p. C6.]

Lycos

The most popular WWW search engine, which came on-line in May 1994 as part of Carnegie Mellon University's Computer Science Department server, grew from about 30 accesses/week to over 120,000 accesses/week by November 1994 and now resides on two dedicated servers. [Lycos and its usage statistics are viewable online.]

These are but two examples of the phenomenal growth experienced by Internet resources and service providers. While such rapid rates cannot continue indefinitely, it is reasonable to expect the number of service providers and users to increase significantly over the foreseeable future. Any system must be expandable beyond what seems reasonable at the time it is created.
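
To put the Lycos figures in perspective, the short Python calculation below estimates the doubling time they imply. It assumes that roughly 26 weeks elapsed between May and November 1994 and that growth was steadily exponential over that period; both are simplifying assumptions made for illustration, not figures from the sources cited above.

```python
import math


def doubling_time(start, end, elapsed_weeks):
    """Weeks needed for usage to double, assuming steady exponential growth."""
    doublings = math.log2(end / start)
    return elapsed_weeks / doublings


# Figures quoted above: about 30 accesses/week rising to over 120,000.
# The 26-week span is an assumption (May to November 1994).
weeks = doubling_time(30, 120_000, 26)
print(f"Usage doubled roughly every {weeks:.1f} weeks")
```

Under these assumptions, use doubled a little more often than once every three weeks, which suggests how far beyond the "reasonable" a design's capacity may need to stretch.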

Useful

The idea of usefulness is so subjective that I do not believe there are concrete measures by which a system's usefulness can be judged effectively until it is up and running. At that point (or during testing, when real users work with the system), an estimate of user satisfaction can be made. How this should be measured is an open question. Much research in such "well-understood" areas as on-line database retrieval focuses on relevance of materials obtained through a searching tool. But who is the judge of relevance? Even if a test database is constructed, with articles carefully selected to fit into one of a small number of categories, who is to say that an item obtained by a particular user's search is not relevant simply because it falls outside the assumed category? It was selected, for better or worse; there may be some aspect of it that matches the searcher's needs. The University of Michigan UMDL proposal recognizes the failure of traditional means of measuring usefulness of materials culled from a database, and by extension, of the database itself. The UMDL will evaluate the database according to another, very subjective, measurement--the value of the information obtained to the user who obtained it. [UMDL Proposal, p. 44.] I do not think the substitution of "value" for "relevance" in any way clears up the vagueness of either concept. Perhaps the only way to judge the usefulness of a digital library is to allow it to compete with other digital libraries, all of which provide access to the same information resources, and to see which one is used more.

When the digital library is perceived as better to use than the traditional library for a given purpose, the usefulness criterion will have been met. This is not to say that a digital library cannot be somewhat useful; it might work sufficiently well in a given area to be useful for one purpose, or for one group of people, but not for another.

Useable

Usability is less simple to define than scalability. In the latter's case, it is clear if a system is scaleable--if it can handle increased loads on both ends (user and provider), then it is scaleable. This might not be testable beforehand, but it is nonetheless obvious when scalability breaks down. Usability, on the other hand, is more subjective. I will explore some of the more important criteria in this section.

Can users find information on a topic? If the system allows users to find information on a topic of interest to them, then the system has passed the most important test of usability. Obviously, a digital library will not contain information on every topic immediately. There will be a prolonged development stage before even an approximation of "all knowledge" is available. In the short run, though, if a digital library can provide the bulk of its users with information resources they need, it will be useable.

Is the information provided by the digital library at an appropriate level? A useable digital library will have information in a variety of formats and for a variety of purposes--from cursory and introductory to thorough and intensive. Not all users will want highly detailed, footnoted, and researched information; for many purposes, an overview will do. The age and education level of the individual user must also be taken into account. A grade school student will not be able to use a graduate-level explanation of how an aircraft flies, just as a Ph.D. candidate will have little use for a high-school text on political party theory.

Is the access interface sufficiently easy to use that it can be employed by people at different educational levels? No matter how good the program that matches users with resources, no good results will be achieved if the user cannot effectively instruct the computer what he wants. The interface must be intuitive, and must be flexible enough that more advanced users, who better know how to use it, can access more advanced functions. A WWW-browsing tool like Netscape or Lynx might be an appropriate model for the interface used at a lower level of library savvy--much as inexperienced library users go right for the card catalog or on-line public-access catalog without first looking through a thesaurus of subject terms (LCSH, for example). For more advanced users, a less scripted interface would be appropriate.

A wonderful metaphor for this, coined by Yuri Rubinsky, is that, much like Disneyland, a digital library must keep the technology (the "magic", if you will) hidden. In Disneyland, the magic is tunnels beneath the entire park--the same tunnels are beneath Space Mountain as Mr. Toad's Wild Ride. In the digital library, the tunnels become the programming--completely transparent to the user. There is, in Mr. Rubinsky's phrase, no "difference between 'asking a question' and 'doing research'." [Yuri Rubinsky, Electronic Texts the Day after Tomorrow, p. 12.]

Usability is not an absolute; what works in one environment, with one group of users, will not work so well (or at all) in a different context. The system must therefore be able to communicate at a variety of levels.

There are certainly other measures by which a digital library can be evaluated. I am not considering economic measures because I do not think that they should be the guiding force by which a digital library should be judged. This is not to say that basic economic factors will not or should not be considered, but that while the costs of a digital library can be added up quite easily, the benefits to society of better information are invaluable but unfortunately inestimable. In an article which advocates the quickest possible development and implementation of the digital library, Brian Hawkins of Brown University writes that "the electronic library is specifically both a solution to the economic problems facing libraries and a vehicle for a new functionality that promises to transform scholarship and bring the cultural, social, and economic benefits of information to many." [Brian Hawkins, "Creating the Library of the Future: Incrementalism Won't Get Us There", New Scholarship: New Serials, 1994.]

I think that this dream is what underpins much of the enthusiasm for the digital library. The digital library is neither the first technological/philosophical creation to be hailed as mankind's panacea, nor will it be the last. There is a tendency in the world of Internet and computer experts to do something just because it can be done. I think those people interested in the digital library should think carefully about what a digital library should be, and be careful to create a system that meets the criteria outlined above, and others, and does not provide features that are wonders of programming but do not serve a particularly useful function.

IV. Virtual Books for the Virtual Library:
Collection Development

Having established what I think a digital library should be, and what the criteria for evaluating a particular implementation of a digital library infrastructure are, I now turn to my third topic: the question of how information should find its way into the digital library. The process with traditional libraries is quite well understood, but is currently only slightly applicable to the digital environment; it seems likely to me that, as people become more familiar with the digital world over time, the traditional methods will have increasingly less application.

Collection development in the digital library will not be concerned with finding resources (as it is today), but with separating the wheat from the chaff. If previous experience with the Internet is any guide at all, the future of the digital library will be an information feast, not a famine. The proliferation of information on the Internet, and particularly on the World Wide Web (WWW), leads me to believe that the digital library's shelves will be filled. Again, based on the experience of the WWW, quantity is not likely to be the main issue, but quality without doubt is.

Comparing collection development policies:
traditional and digital libraries

For those readers who are not familiar with collection development in the traditional library, I will briefly outline the tools available to help librarians. A large number of reference sources have been published to help the librarian select resources. Jobbers, intermediaries between the library and the publisher, help libraries select materials relevant to their clientele or budget. The pattern of use among a library's patrons is a significant help. But the most important thing is a collection development policy, which, briefly, is a statement of focus (what the library will and will not collect), clientele (educational level and interest of the bulk of patrons), and decision-making (who decides the specifics of what resources should be acquired). In the remainder of this section, I will outline the major features of a collection development policy for the traditional library, followed by an elaboration of how these ideas could apply to the digital library.

It might seem that a collection development policy is not important for a digital library because of the interconnected nature of the network. It is true that any resource will likely be made available on any terminal. But I think it also likely that the "Ann Arbor Public Virtual Library" will have some resources loaded and ready to use, while others will have to be located remotely (much like the distinction between the reference room and the stacks in most public and academic libraries today).

Focus

The start of a collection development policy is an explicit statement of the library's interests. Is this a public, private, or academic library? Who are the main users of the library's materials? Archives have their own twist on the collection development policy. Since they endeavor to document specific facets of life, and must conserve resources in other areas to further that end, archival collection development policies often include explicit statements of subjects that are not of interest. Another important feature of a collection development policy is an explicit statement that resources will not be selected or excluded because of the political, religious or other views expressed within them. Just as librarians and archivists have traditionally upheld the principle of equal access to information, so must the digital librarian.

The digital library actually has two separate foci. A particular resource might be created with a very narrow audience in mind. The Human Genome Project, for example, may be vastly important to a certain class of scientists and scholars, but beyond that relatively small group is not comprehensible. Or a resource might be of very broad interest--electronic texts of out-of-copyright literature, for example, are in this category. A collection development policy does not really come into play at this level. However, at the broader, system, level, it does. In this case, it could well be embedded in the computer code that matches a user with a resource by whatever mechanism.

In the archival environment, it is sometimes possible to find a single collection that answers a specific question--especially if the subject of research is an individual. When the topic being covered is an institution, it is often necessary to examine several different collections to discover the entire story. Much the same will be true in the digital library environment. The relationships among and between collections are sometimes explicit--noted on the catalog card or computer record, as is the case in libraries as well--but often not. The contents of one collection will lead the researcher to a second, and so on. Digital collections must develop themselves so that the researcher who finds one collection can move to another transparently.

Clientele

The next section of a collection development policy is an explicit acknowledgment of the intellectual interests and abilities of the people who will be using the library's resources. Budgets aside, academic research and town public libraries have very different sets of users for which they obtain very different kinds of resources. For a research library, high school texts will likely be excluded for general use. For a public library, highly technical and scholarly works will likewise be excluded. A library should select materials that will both be of use and be accessible to its patrons.

The digital library must make the same decisions, but as above must do so at a different level. The creator of a resource must determine its level of use and interest and inform the digital library of that so appropriate users can be directed to it. While this process might be hidden from the user, it must be done carefully and accurately. High-school aged users will find the digital library useless without materials accessible to that educational level, while the layperson will find abstruse technical descriptions of chemical reactions unintelligible. It goes without saying, though, that the system must not prevent users from finding information that is not at their presumed level.
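
One way to reconcile these two requirements (directing users to material at their level without shutting them out of anything else) is to rank resources rather than filter them. The Python sketch below is only an illustration of that idea: the resource titles and the numeric 1-to-5 "level" scale are hypothetical assumptions, not part of any existing system.

```python
def rank_by_level(resources, user_level):
    """Order resources so those closest to the user's level come first.

    Nothing is removed: resources far from the presumed level are still
    returned, only later in the list.
    """
    return sorted(resources, key=lambda r: abs(r["level"] - user_level))


# Hypothetical resources; "level" is an assumed scale running from
# 1 (introductory) to 5 (specialist), supplied by each resource's creator.
resources = [
    {"title": "Reaction kinetics monograph", "level": 5},
    {"title": "Chemistry for beginners", "level": 1},
    {"title": "Undergraduate chemistry text", "level": 3},
]

for item in rank_by_level(resources, user_level=1):
    print(item["title"])
```

Because the function sorts instead of filtering, a high-school user in this sketch still sees the specialist monograph; it simply appears after the more accessible material.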

Decision making

A third important part of the collection development policy is to describe the selection process fairly explicitly. Are decisions made by committee, by a librarian expert in that field or by the head librarian? Who resolves disputes within the committee? When there are several resources available on the same topic, what are the guidelines for selecting the best resource for that library? Since the digital library will be formed in a more decentralized manner, these exact questions may not apply directly. However, the archives version of these does provide more insight into the process. The best archives (those which thoroughly document a specific topic) often work closely with an organization or individual over a long period of time to ensure that sufficient evidence of that entity's activities is ultimately preserved in the archives. Other archives, and even the "best" on some occasions, rely on chance donations of documents to the archives. This situation seems to reflect the digital environment better than that of libraries, which can plan out future acquisition of resources with some sureness because of reference works like Forthcoming Books.

The digital environment is less sure, though; resources can come and go, and be changed at whim. A good digital library must act like an archives and seize on a good resource when it appears. The choice is less irrevocable for a digital library than an archives; the usual alternative disposition of documents not taken by an archives is the local landfill, which will not happen with electronic resources. Nonetheless, the digital library must be sure that its policy neither excludes chance arrivals nor accepts them all.

Developing a collection development policy
for the digital library

A collection development policy is in many ways similar to a mission statement: it details the reason for collecting a certain range of materials and explains how that process is undertaken. A policy should be relatively constant over time--if it can (or does) change at a whim, the collection will lose focus and holes will appear in it. A well-written policy can be a powerful document in the defense of freedom of information and freedom of access--but it should be so written as to ensure flexibility.

In neither digital libraries nor traditional libraries does a collection development policy state explicitly what resources should be purchased (except to the extent that periodically revised reference sources might be mentioned). In the digital library environment, however, such prescriptions are, in the short run, not recommended because sources will come and go. At present, WWW and other servers come on and off line with abandon, and often refuse to connect to users if too many are already accessing them. Furthermore, a truly useful resource could be entered into a digital library from one server, only to disappear when the creator moves on. Since the network will be decentralized--resources will not be stored on one machine--there is very little to prevent the only copy of a resource from vanishing into the void--the equivalent of "out of print" except that, in the digital environment, there would not necessarily be a library which possessed a copy from which it could be borrowed.

The need for a collection development policy in the digital library environment is probably even greater than it is for the traditional library. The traditional library is just that--traditional--so there are conventions and expectations about what should be found on the shelves. Since the mass-use digital library is a new and rapidly evolving concept, a collection development policy for it is all the more urgent. A digital library which includes resources simply because they are available might be acceptable in the earlier, testbed, phases of development, but a "working" digital library so created will be a very poor one indeed. Unfortunately, the foregoing discussion presumes the existence of sufficient resources to select among them. At the current stage of digital library development, implementing the above outline is not likely to be very effective. It is presented not as a description of what should be done today, but as a way of thinking about the problem in the future.

Conclusion: What Next?

While the digital library, as presently conceived, is a new idea, I think that it is not so revolutionary in its practical aspects that an entirely new means of thinking needs to be developed. The digital library has many analogies in the present world of libraries and archives. Although the differences between archives and libraries are presently fairly broad, I think they will diminish in the electronic world of information storage and retrieval to which we seem to be heading.

As a last thought, I would like to mention briefly an interesting model for developing WWW resources. A class titled Internet Resource Discovery and Organization has been offered at the University of Michigan's School of Information and Library Studies (SILS) for the past two years. It mixes the traditional collection development issues with the Internet. The course focuses on locating, evaluating and describing Internet resources on a specific topic. The evaluation and description take the form of a guide to that subject area which is made available through various Internet tools to the world at large. While only a beginning, the combined efforts of two years of students, and many other, non-SILS, people have resulted in about 150 subject guides to Internet resources--a first step toward cataloging the Internet. While that is not the avowed purpose of these guides, and they by no means cover all, or even most, of the information available, it is a start. And a start using a mix of traditional library tools and concepts with the new organizational tools of the Internet.

As strong as the inclination to start fresh may be, I think that would be a mistake. The current world of library and archival science has a great deal to offer the digital library environment. Our expertise is not in programming, but in helping the programmers create interfaces between the database and the end user (who, it must never be forgotten, is not a computer expert, is not a librarian or archivist, and, for that matter, is not even an expert in navigating the public library of today). It is likely that many ILS types and many computer programmers will not see the importance of working with one another. While unfortunate, it is to be expected. Whatever systems are created and thrown into the marketplace for public use, the ones that are easiest to use and most successful at locating appropriate information resources will be the ones in highest demand (especially if there is money involved). The successful ones will take advantage of what both the traditional and digital libraries have to offer.