In this section we present our vision for the Digital Library, discuss several scenarios of how both basic and advanced users might take advantage of it, and describe how we plan to build the digital library by combining the research and development strengths of many individuals and groups on and off campus. We also summarize the state of the art in approaches to building digital libraries and the underlying technical approaches.
The following figure, #2.2, from the Executive Summary provides a roadmap for the entire project description section.
In the following, we present two scenarios of how various types of users might interact with the UMDL. Each scenario raises different research questions and serves to inform the design of the Library. Our goal is to enable user-friendly personalization and customization of information access and representation, including support for harvesting relevant information and protection from information overload.
The UMDL will incorporate a very large number of distributed information sources, which communicate over existing networks such as the Internet and the evolving National Information Infrastructure (NII). The UMDL will be built as a distributed, modular system, where a variety of active "agents" coordinate to perform various tasks, such as query processing, information integration, and information management. Each agent has specialized knowledge and mechanisms for applying that knowledge to different library tasks.
Sheila Grant is a research scientist specializing in pollution in the upper atmosphere. She is interested in exploring the effect of jet exhaust on the ozone layer and, further, whether such pollution leads to any significant modifications in the frequency of aurora borealis phenomena.
Accessing the UMDL from her office workstation, Grant activates her interview agent, an interface agent which helps her to formulate queries and to intercept results appropriate to her specialist background. Avoiding older, general information sources, she focuses her search, volunteering constraints (e.g., retrieve sources created since 1991; include specific journals, research groups, or data archives). Occasionally the interface agent poses additional questions: "According to the European space-physics conferences agent, there are sources from the 1980s which have become available since the collapse of the Soviet Union. Do you want to include these?"
Grant's interview agent communicates with other agents that maintain information about factors such as availability of sources, costs and delays of access, and relationships among alternative sources (e.g., refereed publications might be considered more valuable than technical reports). Other agents specialize in information about particular journals or journal groups. For example, an agent that tracks atmospheric journals would direct her query to sources with a physical-chemical focus, searching the Journal of Atmospheric and Terrestrial Physics rather than Sky and Telescope.
Requests for services flow among agents and the search for information is spread across geographically or conceptually disparate information sources. To reduce the breadth of the search, only those content agents that have announced to the network an interest or competence in the domain of the query are invoked. A competitive mechanism within the network of agents ensures that the available computational resources are allocated to the most cost-effective sources. As responses to her query come in, Grant browses retrieved facts, documents, and portions of documents, and gives feedback to her interviewer agent on the value of the information. As results of the search are retrieved, the underlying agent infrastructure changes, acquiring better (i.e., more complete, certain, and current) knowledge of the available information in the network.
Before leaving the system, Grant instructs her interface agent to activate a monitoring agent that will periodically, or upon notification of a change to a relevant information source, send her electronic mail if any new information turns up. Later that night, a data-collection site in Greenland senses and identifies a significant plasma event. The agent representing the site's database multicasts a notice to mediators that the new plasma data have been recorded. Various agents throughout the network update their indices, and the network progressively assimilates the notice. Grant's monitoring agent is informed of the data, and automatically retrieves it for her review in the morning.
Mr. Sorenson showed his 11th grade science class a PBS video entitled: "Great Moments In Science in the 90's". A video clip of scientists using the UARC System (scientists in Denmark, California, Maryland, and Michigan monitoring instruments in Greenland as part of the Upper Atmospheric Research Collaboratory) captured the attention of the students, particularly the clip which featured the most unusual Northern Lights ever recorded. The excitement of the scientists and the actual phenomena combined to stimulate the students to action.
The group decided to focus their class project on the Northern Lights phenomenon, specifically the question whether the Northern Lights can be seen in Ann Arbor. In the school's Media Center, they chatted with the media specialist, Ms. Duvall, about their plans; she in turn gave the group pointers on how to find resources on the UMDL and logged the projects into the system. The students then began their exploration in the digital library.
Quite coincidentally, students at Stuyvesant High School in New York City had seen the same PBS program, and one group wanted to know if the Northern Lights can be seen in New York City. Aided by their school media specialist, they signed onto the UMDL and discovered that there was a student group working on a similar question in Ann Arbor. Before the New York students proceeded with their inquiry, they contacted the students in Ann Arbor and proposed that they share their strategies and ideas.
That evening, after discussing their project with their parents, the Ann Arbor students met at the Public Library. Accessing the UMDL, they sent off a video information agent to find good video footage of the Northern Lights. Then, they interacted with the UARC Project database, retrieving data about sightings of the Northern Lights in the lower North American continent. The video information agent returned near the end of that exploration with several video clips of Northern Lights taken from NASA satellites. These were the only free video clips, the agent reported, but information about other clips, associated costs, and retrieval mechanisms were displayed as well.
The students identified the following sources of information that they needed for their report: (1) contacts with actual UARC scientists present at "The Event"; (2) digital video of the readings of the instruments capturing changes over time; (3) digital video of the scientists engaged in trying to decipher the instruments and the "ah-ha" moment when they realized that something wonderful had happened; (4) text and graphics information on the UARC project; (5) contacts with the PBS producer and permission to use portions of the PBS clip for their report; (6) data about Northern Lights in Ann Arbor and New York, including simulations that permitted the students to explore the physics of northern lights; and (7) collections of related journal articles.
As the week passed, having become rather adept in their interactions with the UMDL, the students updated their preference file with the help of the media specialist. Now they were able to access a broader range of sources, moving to a lower scaffolding parameter.
Throughout the entire research process, the Ann Arbor and New York groups collaborated, exchanging information. Given the vast collection of resources available, it was only reasonable to divide the labor and share the findings. Upon completion, the students entered their reports into the UMDL. These contained not only a study of Northern Lights in Ann Arbor and NYC, but an annotated history of their research process tracked by the UMDL for other groups to review and re-use.
Finally, the students put together a multimedia presentation on Northern Lights which was shared during an evening session hosted by the Ann Arbor Public Library and recorded for broadcast on the community's local access cable TV station.
Although the semester ended and the NYC and Ann Arbor groups broke up, some of the students continued into Earth Science II. These students teamed up with other students to examine the local waterways and pollution. The UMDL was again a rich information resource. And, as in the days of the one-room schoolhouse, the more knowledgeable students helped the more novice students become UMDL literate.
As illustrated in the scenarios, the digital library has a wide range of capabilities and broad implications for education and professional life. We view the development, evaluation, and refinement of the digital library as an organic exercise that requires the talents of a diverse group. We have assembled a team with research interests and experience in the fields of computer science, information and library studies, longitudinal studies, economics, and system building and support.
The following diagram shows the interrelationship among the various research activities with reference to the particular sections where the work is described.
Under this initiative, we plan on advancing the basic methods for building and deploying digital libraries. In this section we discuss some of the current research and systems approaches and their limitations.
Information retrieval contains a rich tradition of research into the effective design of user interfaces and search and retrieval algorithms. In general, the interface should help the user in formulating the search, providing assistance where necessary. Search algorithms work from that point on to effectively match the user's query with representations of available information resources.
Although on-line retrieval systems have been available for many years, they are not much of an improvement over print-based systems in terms of helping users determine which source will be most useful for satisfying their information needs. They typically list available databases on initial screens and let users discover the usefulness (or uselessness) of these sources on their own (Evans 1989, Machovec 1989). Tools for exploring cyberspace also repeat the mistakes of the past. For example, gophers are merely hierarchically organized menus of available resources, although they do go some extra distance in connecting end users with a particular resource that catches their interest or fancy.
One way of helping the user to navigate through information resources makes use of tools such as hypertext. In using hypertext-linked data, success in finding desired information is highly dependent on links being established by the information provider, a priori, according to the same "cognitive map" as the person trying to find the data. These links really represent "pre-computed queries", and if the person needing information thinks of the structure of a query along different lines from the person creating the original hypertext links, these links can be useless in locating desired information.
The research proposed here will focus on techniques to improve system responsiveness to user queries. Information retrieval searching capabilities have ranged from simple Boolean techniques through keyword, statistical approaches, and, finally, to linguistic algorithms and artificial intelligence techniques.
Most current operational information retrieval systems index individual words in text (minus stop words) and phrases that are unambiguously delimited, i.e., in controlled vocabulary fields where they have been assigned manually by indexers. They search by matching Boolean-connected text words and controlled vocabulary phrases in queries with entries in the inverted indexes processed from texts. Users can further structure their queries by using truncation and proximity operators (e.g., to search natural language phrases). They can also incorporate additional terms from a manually-structured thesaurus (if one has been constructed and is available in machine-readable form), or from "hedges" (groups of free text and controlled vocabulary terms stored and used frequently to search particular concepts).
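The Boolean matching described above can be illustrated with a minimal sketch. It assumes a tiny in-memory collection; the stop-word list, the sample documents, and the function names are ours and purely illustrative, and a trailing "*" is our own notation for right truncation. It is not a description of any particular operational system.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents):
    """Map each non-stop word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index


def lookup(index, term):
    """Return the postings for a term; a trailing '*' means right truncation."""
    if term.endswith("*"):
        prefix = term[:-1]
        matches = set()
        for word, postings in index.items():
            if word.startswith(prefix):
                matches |= postings
        return matches
    return set(index.get(term, set()))


def boolean_and(index, terms):
    """Documents that contain every term."""
    results = [lookup(index, t) for t in terms]
    return set.intersection(*results) if results else set()


def boolean_or(index, terms):
    """Documents that contain at least one term."""
    results = [lookup(index, t) for t in terms]
    return set.union(*results) if results else set()


# Hypothetical sample collection.
documents = {
    1: "Jet exhaust and the ozone layer",
    2: "Ozone depletion in the upper atmosphere",
    3: "Aurora observations in Greenland",
}
index = build_inverted_index(documents)
print(boolean_and(index, ["ozone", "atmospher*"]))
print(boolean_or(index, ["aurora", "jet"]))
```

Proximity operators and thesaurus expansion are omitted here; they would require storing word positions and a term-to-term mapping, respectively.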
Although users can employ the controlled vocabulary of a particular database, it is usually the case that controlled vocabularies are not compatible across databases, making it necessary to reformulate controlled terms to match successfully in each database. An approach to solving this problem involves the design and implementation of a "metathesaurus", which goes some way toward allowing users to access a variety of databases through a common "thesaurus of thesauri" (Warner, 1992; Warner, 1990; Warner, 1991). However, construction of this metathesaurus is very time-consuming and expensive, and it does not allow users to retrieve documents on other document fields, such as natural language.
An alternative to the binary vectors just described is the approach offered by statistical information retrieval techniques, as described by Salton and McGill (1983). This employs free text documents and search requests, with automatic stemming using one of a variety of algorithms intended to conflate morphological variants of words (Harman 1991). Individual stems are weighted using a variety of statistical techniques. In addition, where weighting on isolated stems is not satisfactory, stems may be combined into larger "statistical phrases" or into "clusters" (combinations of re-weighted co-occurring stems). Resulting sets of documents can be ranked based on their probable (statistical) relevance to the request. Based on user feedback about which documents are and are not in fact relevant, weights may be recalculated and documents re-ranked, in an iterative process. Statistical approaches have the advantage of working on natural language, but do not make use of the retrieval devices offered by the controlled vocabulary approach.
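A minimal sketch of this statistical pipeline follows. The source does not fix the weighting or feedback formulas, so the choices here are our assumptions: a crude suffix stripper stands in for a real stemming algorithm, stems are weighted by TF-IDF, documents are ranked by cosine similarity, and feedback uses a Rocchio-style update. The sample documents and all names are hypothetical.

```python
import math
from collections import Counter


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    return [stem(w) for w in text.lower().split()]


def tfidf_vectors(documents):
    """Return ({doc_id: {stem: weight}}, {stem: idf}) using TF-IDF weights."""
    counts = {d: Counter(stems(text)) for d, text in documents.items()}
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    n_docs = len(documents)
    idf = {s: math.log(n_docs / df) + 1.0 for s, df in doc_freq.items()}
    vectors = {d: {s: tf * idf[s] for s, tf in c.items()} for d, c in counts.items()}
    return vectors, idf


def cosine(u, v):
    dot = sum(w * v.get(s, 0.0) for s, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vec, vectors):
    """Document ids ordered by decreasing similarity to the query."""
    scored = [(cosine(query_vec, vec), d) for d, vec in vectors.items()]
    return [d for _, d in sorted(scored, key=lambda p: (-p[0], p[1]))]


def feedback(query_vec, vectors, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Rocchio-style update: move the query toward relevant documents."""
    updated = dict(query_vec)
    for ids, weight in ((relevant, beta), (nonrelevant, -gamma)):
        for d in ids:
            for s, w in vectors[d].items():
                updated[s] = updated.get(s, 0.0) + weight * w / len(ids)
    return {s: w for s, w in updated.items() if w > 0}


# Hypothetical sample collection and request.
documents = {
    "a": "ozone layers thinning",
    "b": "jet engines and ozone",
    "c": "aurora sightings",
}
vectors, idf = tfidf_vectors(documents)
query = {s: idf.get(s, 0.0) for s in stems("ozone layer")}
first_pass = rank(query, vectors)

# The user judges "b" relevant and "a" not relevant; weights are recalculated.
second_pass = rank(feedback(query, vectors, ["b"], ["a"]), vectors)
print(first_pass, second_pass)
```

The second ranking illustrates the iterative process: once the user's judgments are folded into the query weights, the document judged relevant moves ahead of the one judged not relevant.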
There have been many information retrieval systems with linguistic capabilities (Dillon and Gray 1983; Lewis and Croft 1990; Liddy, Paik and Yu 1993; Voorhees 1993). They usually feature a true lexicon, consisting of a list of words and (at least) their associated parts of speech. This information is used to drive a true morphological analysis, which, unlike stemming, considers both suffixes on words and their associated grammatical classes. Further linguistic sophistication can be added by creating a true parser, which extracts important linguistic constructions from free text; in the case of information retrieval systems, these are various types of noun phrases. Knowledge of the semantic content of words can be incorporated by adding a machine-readable lexicon. Finally, knowledge of actual search heuristics can be added to the system interface, making it an expert system (Gauch and Smith 1993).
An important generalization about linguistic processing and expert systems techniques is that there is a continued trade-off between depth and breadth. The techniques employed by most Boolean information retrieval and statistical systems use mainly the surface structure of language and incorporate little true structural or semantic analysis in their algorithms. However, they are useful in that they work across virtually any subject domain. As algorithms become linguistically more sophisticated, they require increasing amounts of computational overhead and/or the manual construction of large lexicons. Additionally, systems containing large semantic components are restricted to a fairly narrow subject domain. However, processing can be more accurate using greater amounts of information.
The UMDL architecture is based on the paradigm of having multiple, distributed agents, each with its own limited and modular capabilities, that cooperatively team up to provide end-to-end services. These types of system architectures are being explored by a number of researchers in a variety of fields.
In the field of computer-supported cooperative work (CSCW), for example, one role for agents is to act as a buffer between the human and the larger information system. In Malone's ObjectLens and OVAL work, a user can construct agents that direct, manipulate, and respond to objects such as mail messages, thus freeing up human attention for other matters. The emphasis in this work has thus been on providing users with agents that act as personal assistants. The UMDL requires these kinds of agents, but also agents that work behind the scenes, not associated with any particular users but rather with important functions for managing, maintaining, and exploiting a diverse collection of information resources.
Concurrent-engineering researchers have used agents to model design problems, where an agent generally represents some element of expertise or point of view (e.g., reliability improvement). These architectures, such as SHADE/PACT, have been successful in solving complex design problems, but they assume a closed world, where all the agents and the communication paths among them are known. These systems cannot adapt to situations where agents can come on-line and off-line at their whim, as will happen in a digital library, nor do these architectures support the level of decentralized control that is necessary. For example, if a critical agent goes off-line, the concurrent-engineering system is crippled.
The digital library requires federating agents developed by different groups. Efforts for federating systems range from ontological approaches to developing common data models. The ontological philosophy for federation is in stark contrast to other schemes for federation. For example, the electronic computer-aided design (CAD) area has built over the past several decades an enormous number of databases containing intricate details of electronic systems. As collaboration in the industry grows and CAD tools become increasingly specialized, there is a need to federate databases either to share information or to use different CAD tools. The CAD area has embraced a traditional federation approach that focuses on specifying a common data model and data-interchange formats, usually in the context of a particular language (EDIF, CFI). These efforts, though long-standing, have produced precious little meaningful interoperability. The problem lies in standardizing on the syntax of some specific representation, without fully specifying the semantics. This gives rise to ambiguity and inconsistent usage, making federation impossible.
Semantic translation has also been explored in the CARNOT project at MCC (Collet, Huhns, Chen 1991), which translates information that might be related but is expressed differently. The CYC system provides a comprehensive knowledge base for concepts and objects that CARNOT uses to identify commonalities in information represented in diverse ways.
Finally, networks of intelligent agents have been a primary focus of distributed artificial intelligence research since the inception of that field. To date, several approaches to resource allocation using contracting (Davis and Smith 1983) and distributed constraint satisfaction search (Sycara, Sadeh and Foy 1991; Conry et al. 1991; Yokoo et al. 1992; Darr and Birmingham 1993) have been investigated to solve problems in information systems, communications networks, and manufacturing control. More recent mechanisms based on market concepts play a fundamental role in the UMDL (Wellman 1993). Beyond the needs for resource allocation, however, are the complexities of dynamic organizational self-design among autonomous agents, which continually team up in different ways in response to varying service requests from an evolving user community. Techniques for organization self-design among autonomous systems without the intervention of a global coordinator are only now evolving (Ishida, Gasser and Yokoo 1992; So and Durfee 1992; Durfee and Lesser 1991).
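To make the contracting idea concrete, the following is a minimal sketch of a single announce-bid-award round in the spirit of contract-based allocation. It is not the mechanism of any cited system or of the UMDL itself: the agent functions, the task fields, and the value-per-cost award rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    agent: str             # name of the bidding agent (hypothetical)
    cost: float            # what the agent would charge; assumed > 0
    expected_value: float  # the agent's own estimate of result quality


def journal_agent(task):
    """Bids only on tasks inside its advertised domain."""
    if task["domain"] == "atmospheric science":
        return Bid("journal_agent", cost=2.0, expected_value=8.0)
    return None


def archive_agent(task):
    """A general-purpose archive that bids on every task, at a higher cost."""
    return Bid("archive_agent", cost=5.0, expected_value=10.0)


def astronomy_agent(task):
    if task["domain"] == "astronomy":
        return Bid("astronomy_agent", cost=1.0, expected_value=6.0)
    return None


def announce(task, agents):
    """Send the task announcement to each agent and collect the bids returned."""
    bids = []
    for agent in agents:
        bid = agent(task)
        if bid is not None:  # None means the agent declined to bid
            bids.append(bid)
    return bids


def award(bids):
    """Award the task to the bid with the best expected value per unit cost."""
    if not bids:
        return None
    return max(bids, key=lambda b: b.expected_value / b.cost)


task = {"domain": "atmospheric science", "query": "ozone and jet exhaust"}
bids = announce(task, [journal_agent, archive_agent, astronomy_agent])
winner = award(bids)
print(winner.agent if winner else "no bids")
```

A market-based mechanism would replace the single award step with prices that adjust over repeated rounds; this sketch shows only one round with fixed bids.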
With virtually all systems in place today, there exists a very strong binding between clients and the servers they use to locate and retrieve information. This binding takes many forms. In most commercial systems, clients are tightly bound to particular databases, and one typically uses a pre-defined (and database vendor provided) client to access a particular database. Often these clients are further bound to running on the same machine as the database, although new systems are using distributed protocols such as Z39.50 to break this tight attachment.
In other systems, typified by Gopher, one manually navigates from one server to another. Servers advertise their existence in advance (simply by letting a central source know of their existence, but not their contents), and are statically linked into a mesh that users navigate through. In this case, advance or "guessed" (and often serendipitous) knowledge of what information physical servers are likely to contain is critical to finding the desired information. Super-indexes such as Veronica attempt to provide some facility to allow server location independent querying, but are still based on static indexing of servers. Furthermore, the quality of a query result is totally dependent on servers employing a common set of index attributes, which is increasingly unlikely for independently operated servers.
Systems like Mosaic and the WorldWideWeb replace the notion of navigating from physical server to physical server based on location with one based on following someone's pre-defined idea of what are likely to be important traversal paths. While this approach seemingly makes invisible the fact that one is navigating from one server to another, it replaces it with an environment where information is easy to find only as long as the retriever's view of important links is congruent with that of the person establishing the link.
While digital library systems being deployed typically support a more diverse set of media types, these are largely included simply to provide a richer presentation format to the user. For example, systems providing scanned images of documents are giving way to those displaying rendered versions of structured text from formats such as SGML or ODA. In these systems, tables may be rendered from underlying spreadsheets, graphics from a graphical database, and equations from an embedded mathematical system such as Maple or Mathematica.
In all these cases, the structure is being used to provide a higher-quality rendering of what is presented on the screen, or, in more advanced cases, to allow a user to interact with a document once it has been located. However, seldom is such structure employed in actually assisting the user in finding desired information in the first place. We plan on exploring how embedded structure can be used to aid in the information retrieval process in addition to improving the quality of the end presentation.
Evaluation has always been an important part of research on and development of information systems. Systems which provide little or no information of interest or relevance to their users will quickly find themselves with no users. However, the methods for evaluating systems to date are in large part derived from ideas which have little empirical basis regarding why people use information systems, what they are seeking when they do use them, and how they should express those decisions.
Most early (pre-1975) research focused on judgments of relevance of documents retrieved from small, experimental systems. Much of it did not make use of real users with real queries against real systems, and there was significant confusion as to what exactly was being measured (relevance, utility, satisfaction, pertinence, topicality, or something altogether different). In fact, in 1975, work in relevance effectively stopped for more than a decade. "Evaluation" since that time in virtually all information systems design work reflects this paradigm.
In addition, the methods which have been used to capture such decisions are also flawed, since they typically involved asking for a binary judgment of each document as it is retrieved, out of the context of a user's problem or project, and often without the ability to go back and refine or redo a search based on what was learned from the first pass.
The combination of these factors has led to poor information regarding the real value of information systems to users, and thus has not allowed systems designers to improve the performance of their systems in ways in which they could be of more use to more people.
In pursuing our research, we will be guided by the practical context in which the library is to be developed and deployed. In particular, we will address issues arising from the diversity of user groups, computational environments, and information collections comprising the UMDL. Characteristics of the digital library posing special challenges include:
Further, a major challenge of digital libraries is avoiding information overload. The ever-growing availability of data can reduce the amount of effective information that users can retrieve from the system in an acceptable amount of time and with reasonable ease. Our goal is to enable users truly to profit from the amount of available information by providing them with tools that simplify the retrieval of meaningful information from this mountain of data. Several ramifications of these challenges influence the design of the digital library:
Our system architecture, described in the remainder of this section, is designed primarily to afford flexibility in addressing the diversity of requirements and resources described above. Our broad goal is to develop a new paradigm for integration of autonomous, disparate systems that is truly distributed, yet performs seamlessly. A major challenge is to share information across the UMDL, while maintaining the autonomy of individual collections. In particular, consider that individual information-source providers (third parties to those developing and maintaining the digital library), working without interaction with other providers, need to be able to place their information resources in the network without necessarily understanding the details of the overall library system. The same applies to third parties producing user interfaces.
Figure 2 illustrates, at a very high level, the digital library, linking several users through their User-Interface (UI) agents to collections through Collection-Interface (CI) agents. In a simplistic system, the UI agents and CI agents could be networked together, allowing UI agents to query collections directly, either sequentially or in parallel. There are many problems with this simple solution, however, such as the duplication of effort in having UI agents determine the subset of CI agents relevant for meeting a particular request, or the complexities of terminating the search once one of the CI agents has successfully answered the user's query.

Figure 2: A federated agent network.
Embedding specialized information agents (besides the UI and CI agents) into the infrastructure to act as mediators between users and collections can alleviate these problems, and provide additional useful services to library users and contributors of information to the library. Different types of mediating agents for finding, processing, and delivering information are distinguished by their specific knowledge and expertise. Roughly speaking, we can identify the following types of required skills:
The knowledge and computational resources available to particular information agents dictate the range of information services they can provide to users or other agents. Each individual service offered by an information agent is a building block from which to construct complex information-processing strategies. Combinations of cooperative agents can collectively implement the more complex tasks required from the digital library system, such as information storage tasks (e.g., caching and indexing schemes), access plan strategies (e.g., browsing options and traversal paths), and so on. To realize the benefits of populating the digital-library infrastructure with a community of diverse information agents requires that the agents be able to team together dynamically to provide a particular information service on demand. In the next sections, we describe a representative set of agents that comprise the digital library, as well as the general mechanisms for coordination and communication that we employ.
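One way such dynamic teaming could begin is with a mediating agent that keeps a registry of advertised competences, so that a request reaches only the collection agents that have announced an interest in its topic. The sketch below is a simplified illustration under our own assumptions: the class names, the topic strings, and the substring search are hypothetical, and real agents would communicate over a network rather than through direct method calls.

```python
from collections import defaultdict


class CollectionAgent:
    """Hypothetical collection-interface agent over an in-memory list of titles."""

    def __init__(self, name, records):
        self.name = name
        self.records = records

    def search(self, keyword):
        return [r for r in self.records if keyword.lower() in r.lower()]


class Registry:
    """Mediator-side record of which agents advertised which topics."""

    def __init__(self):
        self._by_topic = defaultdict(list)

    def advertise(self, topic, agent):
        self._by_topic[topic].append(agent)

    def lookup(self, topic):
        return list(self._by_topic.get(topic, []))


def mediate(registry, topic, keyword):
    """Query only the agents registered for the topic; report who was contacted."""
    contacted, results = [], []
    for agent in registry.lookup(topic):
        contacted.append(agent.name)
        results.extend(agent.search(keyword))
    return contacted, results


registry = Registry()
registry.advertise(
    "atmospheric science",
    CollectionAgent("atmos_journals", ["Ozone chemistry survey", "Plasma events in the aurora"]),
)
registry.advertise(
    "economics",
    CollectionAgent("econ_reports", ["Ozone regulation and market costs"]),
)

contacted, results = mediate(registry, "atmospheric science", "ozone")
print(contacted, results)
```

Because the economics collection never advertised the topic, it is not contacted even though one of its titles matches the keyword. That is the trade-off of narrowing the search by announced competence.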
The information agents in the network, as computational processes, can operate continuously and work actively on behalf of users and collections. We will thus develop a variation of the personalized UI agent that is active. Such an agent will continuously interact with other agents in the network, rather than simply carrying out one particular user request. An important task of this active UI agent is the notification of users about relevant modifications to the library; for instance, the addition of new articles on a particular subject. To facilitate the development of services of this type, we will study the integration of declarative rules into data models, and in particular, into object-oriented models. Such rules can be utilized to easily encode alerters and triggers, which will activate the appropriate notification agent on the appropriate occasion. This will be more efficient than continuously reprocessing the standing query, since a query is effectively distributed throughout the system. Relevant objects which may influence the outcome of the query result will thus have to be identified and then associated with the particular notification mechanism.
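The alerter idea can be sketched as condition-action rules attached to a collection object, so that a rule fires when a matching item is added rather than the standing query being re-run. This is only an illustration under our own assumptions: the `Collection` class, the rule representation as a pair of Python callables, and the sample items are hypothetical and do not represent a committed data model.

```python
class Collection:
    """Hypothetical collection object that evaluates rules on each insertion."""

    def __init__(self):
        self._items = []
        self._rules = []  # list of (condition, action) pairs

    def add_rule(self, condition, action):
        """Register a trigger: run action(item) whenever condition(item) holds."""
        self._rules.append((condition, action))

    def add(self, item):
        self._items.append(item)
        for condition, action in self._rules:
            if condition(item):
                action(item)


notifications = []  # stands in for electronic mail sent by a notification agent

plasma_data = Collection()
plasma_data.add_rule(
    condition=lambda item: item["subject"] == "plasma event",
    action=lambda item: notifications.append(f"New item: {item['title']}"),
)

plasma_data.add({"subject": "calibration", "title": "Instrument check"})
plasma_data.add({"subject": "plasma event", "title": "Greenland readings"})
print(notifications)
```

Only the second insertion satisfies the rule, so only one notification is produced; the cost of checking is paid at update time, on the objects that can affect the query result.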
These personalized session objects will represent an invaluable resource, addressing several important issues. First, they will allow a user to resubmit a typical query request, modifying search terms if necessary. Second, they will allow such complicated requests to be submitted to the system for continuous and proactive reevaluation. Third, they will provide an instructional tool, allowing experts to demonstrate search strategies by simply exercising the system. The UI agent would maintain the session, providing templates of these sessions to the students. The students can either simply replay them in order to observe the search process as it unfolds, or they can modify a session to meet their particular information needs.
As an example, we will develop and test a user interface agent to provide search strategies to the user accessing the UMDL. This Interviewing Agent (IA) strives to lead digital library users to the best information for their needs regardless of the type or genre of the resource. It also serves as a helpful companion that users can call upon for guidance and instruction during their navigation through the federated network or examination of digital resources.
Since the IA provides the end user with a guided pathway to resources relevant to the individual search, it relies on described behavior according to definable characteristics and styles of users (e.g., high school, undergraduate, graduate researchers), as well as discipline-based methodologies. We plan to interview and study this broad base of users to design the required search strategies.
While earlier work at Ohio State University (OSU) built strategies for conducting such interviews into their online catalog and CD-ROM sources, they were limited to a single library (Tiefel 1993). In our approach, the IA will have access to all the other agents and resources in the federated network. Interviewer pathways include generalized strategies, but they will be expanded to include others for the various user groups that will utilize this digital resource. Furthermore, the UMDL will search for resources in response to queries that end users enter through an "express window" or immediately after the initial IA interaction or initial Gateway navigation.
Through the interviewing agent, information will be obtained from the user about:
The interview agent will use all of this information to gain a sense of what the user is looking for and to place parameters and constraints on the search space required for the efficient technical operation of the system. This information will be sent to the appropriate agents for further processing.
A number of significant problems will need to be addressed and resolved in doing this research and creating this agent. These problems include:
In our distributed-agent paradigm, queries are eventually submitted to local data repositories in order to execute the elementary requests on the actual data sources. In order for an information source to effectively participate in the network, we envision that there will be Collection Interface (CI) agents designated to focus on each autonomous data repository. Each CI agent is in charge of maintaining a link between the repository and the rest of the system. These agents will be capable of translating query requests, mapping between data types and formats, resolving schema inconsistencies, etc.
Given the complexity of constructing these CI agents and the potential similarity between such CI agents, we propose to aid this labor-intensive process in the following two ways. First, we will investigate the key characteristics of the architecture and range of capabilities that such CI agents typically would exhibit. This will result in a set of guidelines for constructing CI agents, possibly classified by the type of information source. Second, we will develop CI agent templates for particular classes of collections, e.g., for directories of technical reports in PostScript format, for highly structured relational systems, etc. These templates, built using object-oriented principles, can then be customized for a particular collection type by refining objects and by plugging in specifics about the information content of the source, the type of query services offered, etc.
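As a minimal sketch of what such a template might look like, the following Python fragment defines one shared base class and two specializations. All class names, method names, and the query conventions (file-name patterns, a parameterized SQL statement) are hypothetical illustrations, not part of the UMDL design.

```python
from abc import ABC, abstractmethod


class CIAgentTemplate(ABC):
    """Hypothetical template shared by all Collection Interface agents."""

    def __init__(self, collection_name, content_description):
        self.collection_name = collection_name
        self.content_description = content_description

    @abstractmethod
    def translate_query(self, keywords):
        """Map a library-level keyword list to the repository's own request form."""

    def advertise(self):
        # Common behaviour inherited unchanged by every specialization.
        return {
            "collection": self.collection_name,
            "content": self.content_description,
            "agent_type": type(self).__name__,
        }


class PostScriptDirectoryAgent(CIAgentTemplate):
    """Specialization for a directory of technical reports in PostScript."""

    def translate_query(self, keywords):
        # Assumed convention: the directory is searched by file-name pattern.
        return [f"*{word.lower()}*.ps" for word in keywords]


class RelationalAgent(CIAgentTemplate):
    """Specialization for a highly structured relational source."""

    def __init__(self, collection_name, content_description, table, column):
        super().__init__(collection_name, content_description)
        self.table = table
        self.column = column

    def translate_query(self, keywords):
        # Returns a parameterized statement plus its parameters.
        clause = " OR ".join(f"{self.column} LIKE ?" for _ in keywords)
        statement = f"SELECT * FROM {self.table} WHERE {clause}"
        return statement, [f"%{word}%" for word in keywords]
```

Customizing the template for a new collection class then amounts to subclassing and supplying the collection-specific translation, while the advertisement logic is reused as is.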
It will ultimately be the responsibility of the collection provider to develop the corresponding CI agent for the new information source and to guarantee its accuracy. We expect that economic measures, such as users' evaluation of the source in terms of negative hits and wasted resources, will provide incentives to correctly reflect the capabilities and scope of the source. For a detailed discussion on this topic see Section 3.2.7 Economic Resource Allocation.
As indicated earlier, our system philosophy is to impose a minimum of requirements on publishers of new information sources. In fact, our system will respect the autonomy of individual information sources, allowing for existing technology to be plugged into our future network, though with possible limitations on their effective usage. Our goal is to gain a better understanding of the ideal component of a future digital-library system. We will thus investigate what basic services an information source should provide in order to be used to its fullest potential in the network. Our results may drive the development of novel database and information technology providing these identified library-specific services. One example of a basic service is a protocol for reporting on the status of a query, including whether any relevant results have been found so far and which percentage of the information source has already been searched. Other examples include protocols for terminating a submitted search request prematurely or for modifying a query request without having to restart query processing from scratch.
The availability of particular services via a CI agent will determine the effectiveness of the corresponding information source in this new type of system. For instance, assume that a given source uses the default response to a status inquiry (simply returning the 'not-completed' label) and also disables the request for termination of a search activity. In this case, the non-interruptible information source may waste resources by completing a query, while the distributed search may already have terminated this particular search path in the library. Another CI agent supporting the termination service would have the capability of avoiding this waste of resources.
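The contrast between a source offering only the default services and one supporting status reporting and termination can be sketched as follows. This is an illustrative simulation under our own assumptions: the class names, the dictionary-shaped status report, and the `step` method that stands in for actual searching are all hypothetical.

```python
class BasicSource:
    """Source offering only the default services described above."""

    def __init__(self, total_items):
        self.total_items = total_items  # assumed to be greater than zero
        self.searched = 0
        self.hits = 0
        self.terminated = False

    def step(self, batch=10, hits_in_batch=0):
        # Simulate searching one more batch of the repository.
        if self.terminated or self.searched >= self.total_items:
            return
        self.searched = min(self.total_items, self.searched + batch)
        self.hits += hits_in_batch

    def status(self):
        # Default response: no detail beyond a completion label.
        done = self.searched >= self.total_items
        return {"state": "completed" if done else "not-completed"}

    def terminate(self):
        # Termination requests are disabled; the search runs to completion.
        return False


class CooperativeSource(BasicSource):
    """Source that also reports progress and honours termination requests."""

    def status(self):
        report = super().status()
        if self.terminated:
            report["state"] = "terminated"
        report["hits_so_far"] = self.hits
        report["percent_searched"] = 100.0 * self.searched / self.total_items
        return report

    def terminate(self):
        self.terminated = True
        return True
```

A coordinating agent that has abandoned a search path can stop the cooperative source immediately, whereas the basic source keeps consuming resources until its search completes.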
In the remainder of this section, we focus on two particular capabilities that CI agents will possess in UMDL. The first capability is to use knowledge supplied by document and domain specialists about the structure of the documents and/or other information sources to characterize formats and contents so as to support queries and browsing. In general, this amounts to expertly guided structuring and organization of the documents and other information resources. The second capability is a more dynamic structuring activity, based on usage patterns, into (possibly transient) virtual collections.
The UM digital library testbed is composed of image and text data. One important task is to organize this data so content in different collections and formats can be intelligently located, quickly retrieved, and easily reused in unanticipated and arbitrary ways. For large, complex collections, brute force strategies have limited utility. In these cases, conventional retrieval terms need to be supplemented by some form of knowledge representation. Knowledge representation is used to segment the search space so intelligent agents apply brute force techniques only in areas where probability of success is high. To be useful in the digital library environment, the entire process has to be heavily automated.
Project testbed data will be inventoried to determine which components are in image and which are in text format, and which components are or can be made available in a structured representation by their source organizations. The corpus will be analyzed to determine the set of objects that need to be defined, and the context relationships that need to be supported. Generalized strategies for data conversion will be identified and pursued. These investigations will incorporate the expertise of content specialist consultants, librarians, and software developers listed in the Appendix.
One simple way of capturing knowledge representation in a digital library collection is to associate abstracts and reviews of works with items, separated from content. These objects can be further enhanced by expert review of seminal works to build 'key concepts' catalogues. Annotation by knowledgeable readers is yet another path that will be explored. So will correlation of works that share particular references to the literature. Provisions for these kinds of activities will be made when the structure of items is specified. The results will be populated as part of testbed development.
In addition to the information about structure that the information source is providing to its CI agent, we must also investigate what meta-data each source should make available, as well as which modeling techniques should best be used to describe this meta-data. The meta-data includes a description of the content of the database (schema), available index strategies and access methods, the integrity mechanisms enforced, and other information for administrative purposes. A meta-data agent will then be in charge of posting a comprehensive description of the collection to the digital library system, representing a wrapper between the local information source and the rest of the system. (See section on structured documents.)
Since the structures of the entities in the database could change with time, we must develop tools to facilitate this evolution of the schema. The meta-data needs to be represented in a manner that would facilitate such schema evolution. This is the responsibility of the meta-data servers mentioned above. These agents may propagate this change in content to their adjacent directory agents, which have published an advertisement of the capabilities of the particular information source to the rest of the system in the form of catalogues. By serving meta-data on demand, the information services available can be responsive to changes in the content and form of information in the repositories.
As an example, we propose to develop a schema for browsing images. Image collections pose problems of intellectual access because of the different meanings that images can convey to different users (Besser 1990, 1992). Thus, systems which are primarily dependent on text-based descriptors fail to capture the richness of possible meanings of a given image. Automatic image recognition systems are intriguing in that they exploit the language of the visual medium as a retrieval mechanism for searching visual data, but there is room for alternative systems which allow searching of a more "abstract" nature (Cawkell 1992, Leung 1990).
This research emphasizes classification structures as devices to group image sets into meaningful categories that support browsing. Utilizing the collection agents and mediators described in section 3.2.6, the images residing in disparate repositories are brought together as a single virtual collection appropriate for the user's browsing. Browsing lends itself particularly well to visual images since the graphic image is able to convey its message in its own terms to the viewer. A number of researchers have advocated the inclusion of browsing features into the design of information systems (Kwasnik 1992, Bates 1989, Larson 1986, Marchionini and Shneiderman 1988). Besser's prototype system uses visual browsing tools on computer workstations to effect retrieval of visual images (Besser 1990, Besser 1992).
Image browsing provides UMDL users with an alternative search strategy which can be of particular value in navigating image collections. If the UMDL user wishes to see visual information sources, he is presented with choices represented by sets of thumbnail images. The image set can be determined by attributes chosen by the user. For example, the user may wish to see instances of "Northern Lights" as they occur in a specific geographic locale, or data gathered by a particular researcher, or by a particular type of instrument. The thumbnail images may in some instances represent a fuller version of that same image; in other instances, the image may serve as a surrogate to represent a dataset. The system retrieves and displays sequences of image clusters which match the user's query and which the user can browse. The user looks at the clusters to see if he is on the right track; then selects individual images to see in fuller detail or looks to see what other images are in the dataset.
In other cases, the user may not be familiar with search terms which describe a visual resource. While the researcher may conduct a known-item search using technical terminology, the high school student or person unfamiliar with the field may benefit from a graphical display of examples of categories on their topic. The graphical interface will group images from data repositories into organizational schemes which provide a contextual display for the user to browse.
The novice user can then narrow the search by selecting choices within the classification framework. The user could progress through the classification "tree," using the organizational scheme as a context in which to narrow or broaden the search as desired, and see exemplars of images placed at appropriate stages in the tree.
In addition to experimenting with structures for image data, we plan on exploring techniques for structuring video data. This work will be a cooperative effort with researchers from Bellcore, and will include work in the following areas:
Given the number of information sources in the network, a flat structure of interconnected agents and information sources will result in inefficient performance of the system. We propose to develop strategies for structuring the network components into meaningful organizational structures, thus limiting the amount of knowledge and communication required by individual members of the community for achieving particular tasks. This research will involve close collaboration with the domain scientists to provide us with initial reasonable classifications of the subject areas. We also will exploit information structuring mechanisms developed by information scientists and librarians. This was described above.
In addition to these other approaches, the structure of the network will also evolve according to the flow of information among users, collections, and agents. Based on usage patterns, the conceptual organization of the information sources can evolve to decrease the time needed to process queries (So and Durfee, 1993) and to group together information that is currently being accessed together. For example, if one information source maintaining images of experiments is repeatedly accessed together with another source maintaining written reports on these experiments, then these two sources may be grouped into the same virtual vicinity. Such structure may eliminate agent coordination efforts between these two sources, improving search and information integration performance. This reorganization can be envisioned as a network of virtual links among agents and knowledge that is superimposed on other organizational structures of the system. As new topics of national interest appear, or even as established scientific disciplines merge or split, these changes will be reflected in the internal structure of the library because of adaptation to usage patterns. We propose to approach this dynamic reconfiguration problem by developing self-monitoring agents that may suggest changes to the given structures based on typical usage patterns.
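One simple way a self-monitoring agent might derive such virtual vicinities is to count how often pairs of sources are used within the same session and to group sources whose co-access count reaches a threshold. The sketch below is only an illustration of that idea: the session format, the threshold rule, and the function names are our assumptions, and the grouping uses a standard union-find structure rather than any method from the cited work.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(sessions):
    """Count how often each pair of sources is used within the same session."""
    counts = Counter()
    for sources in sessions:
        for pair in combinations(sorted(set(sources)), 2):
            counts[pair] += 1
    return counts


def virtual_vicinities(sessions, threshold):
    """Group sources linked by at least `threshold` co-accesses (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for sources in sessions:
        for source in sources:
            find(source)  # register every source, even if never grouped
    for (a, b), n in co_access_counts(sessions).items():
        if n >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for source in list(parent):
        groups.setdefault(find(source), set()).add(source)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical usage log: each inner list is the set of sources used in one session.
sessions = [
    ["images", "reports"],
    ["images", "reports", "maps"],
    ["maps"],
    ["images", "reports"],
]
vicinities = virtual_vicinities(sessions, threshold=2)
```

With this log, the image and report sources are accessed together three times and end up in one vicinity, while the map source remains on its own.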
The development of query paradigms that allow the user to retrieve the desired material with ease by processing complex requests in this distributed environment is a key research problem. Traditionally, query-optimization techniques determine a fixed execution strategy for a query by evaluating and comparing all information given in the meta-data, e.g., availability of indices, size of data sets, etc. In the federated digital library, this becomes a much harder problem because the query optimizer will have to make decisions with incomplete information (e.g., without studying all possible metadata servers). More importantly, the query processor will have to incrementally adjust the query execution plan depending on hit ratios, quality of partial results, etc.
In the UMDL, this sort of information is distributed among the agents expert in the various repositories and access techniques, the so-called meta-data agents. We will develop strategies to control the parallel spawning of query requests to different meta-data servers. This includes determining strategies for whether such requests can be interrupted, whether partial results can be collected, and whether on-going query requests submitted to local data repositories can themselves be queried on their status of query processing completion. We will establish strategies to evaluate when a particular thread of the search should be terminated, based on the success of other subsearches and the value of material found thus far. For instance, as soon as we discover the major journal article on the desired subject, we no longer want to search for technical reports on that same subject, unless written at a later date. Such dynamic modification of the query will result in a reevaluation of the query plan and potentially in the termination or redirection of some of the still outstanding query threads.
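The control strategy described above can be illustrated with a small deterministic simulation in which sub-searches are advanced round-robin and a stop rule may cancel outstanding threads. The function names, the result format, and the thread-naming convention are hypothetical; for brevity the rule below ignores the exception for technical reports written at a later date, and real sub-searches would run concurrently rather than in a single loop.

```python
def run_subsearches(threads, stop_rule):
    """Advance sub-searches round-robin; `stop_rule` may cancel threads.

    threads: dict mapping a thread name to an iterator of result dicts.
    stop_rule(result, active_names): returns the set of thread names to cancel.
    """
    active = dict(threads)
    results, cancelled = [], []
    while active:
        for name in list(active):
            if name not in active:  # cancelled earlier in this round
                continue
            try:
                result = next(active[name])
            except StopIteration:
                del active[name]  # this sub-search finished on its own
                continue
            results.append((name, result))
            for victim in stop_rule(result, set(active)):
                if victim in active:
                    del active[victim]
                    cancelled.append(victim)
    return results, cancelled


def journal_beats_reports(result, active_names):
    # Once the major journal article is found, stop searching technical reports.
    if result["kind"] == "journal" and result.get("major"):
        return {n for n in active_names if n.startswith("techreports")}
    return set()


threads = {
    "journals": iter([
        {"kind": "journal", "title": "A", "major": False},
        {"kind": "journal", "title": "B", "major": True},
        {"kind": "journal", "title": "C", "major": False},
    ]),
    "techreports-1": iter([
        {"kind": "report", "title": "R1"},
        {"kind": "report", "title": "R2"},
        {"kind": "report", "title": "R3"},
    ]),
}
results, cancelled = run_subsearches(threads, journal_beats_reports)
```

In this run the technical-report thread is cancelled as soon as article B arrives, so reports R2 and R3 are never fetched.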
We no longer expect to achieve the optimal query execution plan, since we are dealing with incomplete knowledge about available information resources due to the enormous size of the system. In fact, there are often many plausible plans depending, for instance, on what data source we decide to get the information from, since the same type of (but probably not identical) information may now be found in multiple sources. This clearly puts a new twist on existing query-execution techniques.
In our paradigm, queries are submitted to other agents for further processing and/or to local data repositories, and the results are passed back through the network via various layers of agents. The network will thus have to deal with load-balancing of these communication links and data transfers, minimizing contention of individual resources as well as the overall network. Our general approach will be to ship query requests to data repositories, and get them processed locally, rather than sending huge data units across the network to the general-purpose query processor (processing agents). This will have to be based on heuristic measures involving typical size of queries, data responses, query processing expertise available at local data repositories, etc.
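A heuristic of this kind could, in its simplest form, compare the estimated network traffic of the two alternatives. The following sketch assumes that size estimates are available in bytes and that a flag records whether the repository can process the query locally; the function name and parameters are illustrative only.

```python
def choose_plan(query_bytes, expected_result_bytes, collection_bytes, local_can_process):
    """Return 'ship-query' or 'ship-data' by comparing estimated network traffic."""
    if not local_can_process:
        # The repository lacks the expertise to evaluate the query itself.
        return "ship-data"
    ship_query_cost = query_bytes + expected_result_bytes
    ship_data_cost = collection_bytes
    return "ship-query" if ship_query_cost <= ship_data_cost else "ship-data"
```

For a small query with a modest expected answer over a large collection, the comparison favors processing at the repository, which matches the general approach stated above.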
Much of the information in the UMDL will consist of documents and representations in natural and controlled language. Problems in the digital library include not only the intrinsic problems posed by the language used in a given database, but also those posed by the quantity and heterogeneity of the information which needs to be searched and integrated across multiple collections. This requires techniques which will make searching more precise, by including information about the linguistic content and structure of documents and databases. Linguistic retrieval methods will be identified to address these problems.
The linguistic information which will need to be incorporated includes the following:
The identification and construction of linguistic techniques is already under investigation by Warner (1992), and builds upon prior research in manipulating the surface structure of documents and queries to build linguistic capabilities into an information retrieval system (Warner, 1990; Warner and Wenzel, 1991). These methods make use of the existing surface structure found in documents and queries, as well as the structure and content available in already existing controlled vocabularies. They make as much use as possible of the explicit structure available in documents and queries, without resorting to more intensive knowledge engineering techniques. This is desirable because of the tremendous quantity and diversity of language which will have to be handled.
In terms of the overall system architecture, there are two different ways that linguistic processing can be accomplished. These two methods will be investigated, and the most effective one will be chosen:
In order to increase the benefit that users can get from utilizing the library as a resource, we want to support the formation of comprehensive answers, rather than simply returning the raw data as stored in one of the information repositories. Processing of raw data is one important category of information service that can be provided by mediating agents.
One type of integration is the abstraction of information from several sources, for example, composing the average rainfall as a summary value from numerous charts. Another example is the construction of a reduced document containing only section headings and figures, but not the actual paragraphs. It also may involve conversion between different forms of media, e.g., rather than returning a complete digitized image, the system might return the classification terms extracted from the image.
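The first two kinds of integration can be sketched in a few lines. The data layouts used here (charts as lists of rainfall readings, documents as lists of typed blocks) are assumptions made for the illustration, and the average is taken over all readings pooled together.

```python
def average_rainfall(charts):
    """Summarize readings from several charts as one pooled mean value."""
    readings = [value for chart in charts for value in chart]
    return sum(readings) / len(readings)


def reduce_document(blocks):
    """Keep only section headings and figures, dropping body paragraphs."""
    return [block for block in blocks if block["type"] in ("heading", "figure")]


charts = [[10.0, 20.0], [30.0]]
document = [
    {"type": "heading", "text": "1. Introduction"},
    {"type": "paragraph", "text": "Body text of the introduction."},
    {"type": "figure", "text": "Figure 1"},
]
summary_value = average_rainfall(charts)
outline = reduce_document(document)
```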
Most existing work on schema integration has been done in the context of traditional (relational) database systems, which are plagued by simplistic data structures. Our goal now is to build upon this work, by placing particular emphasis on how such integration can be achieved for more unstructured types of data sources (such as textual documents) and for specialized multi-media types of data sources (such as images and weather maps). In particular, we need to determine to what degree structure can be imposed on these new data types, i.e., whether a non-trivial schema can be constructed for these new types of data that decomposes these "massive unstructured data types" into a group of several smaller, well-defined objects and their relationships. This study also will deal with an evaluation of whether such added structure results in true benefits in terms of retrieval speeds for typical queries, information content successfully retrieved, etc.
With most integration effort initially targeted to the meta-data level, we will need to compare the commonly used strategies of statically, a priori, resolving inconsistencies and establishing mappings between meta-data sets (schemas) versus dynamic on-the-fly processing. It is likely that the former will be unrealistic in the context of huge library systems, where we cannot expect to statically resolve and integrate all data repositories. Rather, this process may have to be done on demand. Clearly, for groups of data repositories which frequently are used together to extract related information, the establishment of such static mappings may be a desirable choice. We will consider the application of object-oriented view mechanisms to develop and maintain such mappings (Rundensteiner 92). Self-monitoring of the system activities, and possibly self-initiation of the construction of such resolution mediators, is an interesting, open question.
Knowledge about how to focus, monitor, and terminate queries can itself be embedded in agents other than UI agents. These agents will be responsible for estimating the benefit and cost of alternative queries, involving both the expected utility of the querying message [Gmytrasiewicz and Durfee] and of the response to the query. Responses can be evaluated along several quality dimensions, including timeliness (is the response returned quickly enough), completeness (does it represent all of the relevant data), confidence (how likely is it to be correct), and precision (is it a ballpark response or completely accurate) [Musliner et al]. For example, if the user asks about the current population of Russia, she might get several responses, some of which could be precise (to the person), but of low confidence (based on data gathered before the 1917 revolution), and other data could be more complete (including expatriates still with Russian citizenship), but not timely (it took several days to generate this answer). Mediating agents have to determine which solutions to strive to achieve, and when acceptable solutions have been generated.
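One simple way a mediating agent might compare responses is to score each along the four quality dimensions and combine the scores with user-dependent weights. The weighted-sum rule, the numeric values, and all names below are illustrative assumptions rather than the evaluation method of the cited work.

```python
from dataclasses import dataclass


@dataclass
class Response:
    label: str
    timeliness: float    # each dimension scored in [0, 1]
    completeness: float
    confidence: float
    precision: float


def score(response, weights):
    """Weighted sum over the quality dimensions named in `weights`."""
    return sum(w * getattr(response, dim) for dim, w in weights.items())


def best_response(responses, weights):
    return max(responses, key=lambda r: score(r, weights))


# Hypothetical answers to the population query discussed above.
old_census = Response("old census", timeliness=1.0, completeness=0.5,
                      confidence=0.25, precision=1.0)
slow_survey = Response("slow survey", timeliness=0.25, completeness=1.0,
                       confidence=1.0, precision=0.75)

balanced = {"timeliness": 0.25, "completeness": 0.25,
            "confidence": 0.25, "precision": 0.25}
urgent = {"timeliness": 1.0, "completeness": 0.0,
          "confidence": 0.0, "precision": 0.0}
```

Under the balanced weights the slower but more reliable answer is preferred, while a user who values only timeliness would receive the quick, low-confidence one.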
To make such decisions, it is useful to know the costs of searching for solutions to queries. Mediators directing searches must interact with other mediators responsible for monitoring and allocating computational and other network resources (see section 3.2.7, Economic Resource Allocation). At times when few users are accessing the library, a more thorough search employing more computational resources acting on more collections can be justified, while at other times a search will have to be much more circumscribed. If timeliness is less critical, the user might initiate a surrogate mediator to inquire from the resource monitoring agents whether sufficient resources exist for a query, and carry out this query when appropriate, even if the user has left the library.
Other information agents in the network will have the expertise needed to translate queries and responses between the heterogeneous collections, as described in section 3.2.8, Ontology. Still others will have sufficient domain expertise to be capable of decomposing a user's query, posing subqueries, and using their results to synthesize an answer to the user, all without the user being aware of these activities.
Finally, recall that the information agents in the network, as computational processes, can operate continuously and work actively on behalf of users and collections. In particular, the structure of the network, in terms of virtual links among agents and knowledge that has been propagated, is fluid, and will evolve according to the flow of information among users, collections, and agents. For example, based on usage patterns, the organization of the information sources can evolve to decrease the time needed to process queries [So and Durfee] and to group together information that is currently being accessed together. As interdisciplinary fields change, or even as established scientific disciplines merge or split, these changes will be reflected in the internal structure of the library because of adaptation to usage patterns.
Thus, interactions among mediators will resemble a cooperative problem-solving effort among a diverse set of specialists, where the problems to be solved, the expertise of the specialists, and the population of specialists can all change over time. To provide this functionality, we will draw on a variety of techniques for distributed problem solving, organizational self design, coordination theory, and distributed artificial intelligence. For example, the process of query decomposition, subquery allocation, and result synthesis can be cast as a contracting arrangement among query processors, and can thus employ standard distributed-AI techniques, such as the Contract Net (cite). However, implementing such techniques in the digital library poses an exciting research challenge because of the dynamic nature of the relevant knowledge. For example, decisions about how best to decompose queries must be based on what collections are likely to be available to respond to queries. Consequently, the decomposition process itself may require communication among mediators to first determine reasonable decompositions, followed up by further communication to then allocate subqueries and collect results. In similar ways, many existing techniques in the literature roughly match the requirements for coordinating mediators working together to provide other services, but the unique nature of the digital library calls for extensions and improvements to these techniques.
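The contracting arrangement can be sketched as an announce, bid, and award cycle. The fragment below is a heavily simplified illustration: the bid functions, the lowest-cost award rule, and the names are our assumptions, and it omits the messaging, negotiation, and dynamic availability issues discussed above.

```python
def contract_net(subqueries, processors):
    """Award each subquery to the lowest bidder among willing processors.

    processors: dict mapping a processor name to a bid function that takes a
    subquery and returns an estimated cost, or None to decline.
    """
    awards = {}
    for subquery in subqueries:                   # 1. task announcement
        bids = {}
        for name, bid_fn in processors.items():   # 2. bidding
            cost = bid_fn(subquery)
            if cost is not None:
                bids[name] = cost
        # 3. award to the cheapest bidder, or None if nobody bid
        awards[subquery] = min(bids, key=bids.get) if bids else None
    return awards


# Hypothetical query processors with different specialties.
processors = {
    "image-mediator": lambda sq: 2 if "image" in sq else None,
    "text-mediator": lambda sq: 1 if "report" in sq else 5,
}
awards = contract_net(["aurora image", "ozone report"], processors)
```

Here the image subquery goes to the image mediator, which underbids the text mediator, and the report subquery goes to the text mediator, the only bidder.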
Given the size and scope of the digital library, there is potentially unbounded demand for computational resources, both time and space, that could be requested by information agents. For example, any amount of preprocessing of data in the collections--such as indexing, meta-data gathering, or caching--might improve the response of the system to subsequent user requests. In order to achieve an appropriate level of the various preprocessing options, we seek principled methods for expressing the operating tradeoffs and allocating the computational resources toward their maximal expected benefit. Moreover, we require that these methods be sufficiently flexible to adapt to patterns of usage that evolve during the operation of the digital library.
In approaching this resource allocation problem we plan to treat the alternative information services as competing economic activities. Given a measure of priorities over the end-user services provided, the various agents effectively compete to provide the highest level of service using the minimal computational resources. We emphasize, however, that the ultimate specification of service priorities is left open to the operators of the digital library. Whether the competition is to be realized via explicit economic transactions or merely an internal "currency" is a policy question on which we take no stance at present. Regardless, if the competition operates smoothly, the result can be an efficient overall allocation of computational resources towards the optimal provision of services to users.
To organize the processing activities within an economic framework, we view the interactions between agents as supplier-producer relationships, where each agent produces value-added information products from the input products provided by others. Agents dynamically connect with each other as opportunities arise for mutually beneficial exchanges. The collections provide the ultimate "raw materials" in this process, whereas the end users are the ultimate consumers of the "finished goods". The intermediate agents ("middlemen") bridge the gap by bringing to bear knowledge, processing, storage, or other computational resources to improve in some way the expected value of the information as it passes along the chain from agent to agent.
Economic mechanisms for allocating computational resources have been studied by a variety of researchers in recent years (Cheriton & Harty, 1993; Huberman, 1988; Kurose & Simha, 1989; Waldspurger, Hogg, Huberman, Kephart, & Stornetta, 1992). Our implementation of virtual markets in information services will be based on the idea of "smart auctions" proposed for smooth allocation of bandwidth on the Internet (MacKie-Mason & Varian, 1993). The mechanisms for managing multiple, interacting markets will be based on our previously developed "market-oriented programming" system (Wellman, 1993).
In order for the production and distribution of information goods to be economically viable, there must be some mechanism for recovering costs. Yet pricing information goods is notoriously difficult. The first copy of an information good will often be very costly to produce, while subsequent copies may cost next to nothing. The combination of high fixed costs and negligible marginal costs creates difficulties for conventional forms of pricing. For example, standard economic theory argues that it is desirable to price goods at marginal cost. But if the cost of (re)production is zero, marginal cost pricing will not recover costs.
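The shortfall can be made explicit with a small worked calculation. The symbols are introduced here for illustration only: $F$ is the fixed cost of the first copy, $c$ the marginal cost of each copy, and $n$ the number of copies sold.

```latex
\begin{align*}
C(n) &= F + c\,n
  && \text{(total cost of producing $n$ copies)} \\
R_{\mathrm{MC}}(n) &= c\,n
  && \text{(revenue when the price equals marginal cost)} \\
R_{\mathrm{MC}}(n) - C(n) &= -F
  && \text{(the fixed cost is never recovered)} \\
p_{\mathrm{AC}} &= c + \frac{F}{n}
  && \text{(break-even price per copy)}
\end{align*}
```

When $c$ is close to zero, marginal cost pricing yields almost no revenue at all, while the break-even price $p_{\mathrm{AC}}$ stays above marginal cost for any finite $n$.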
Conventional markets for information address this problem by bundling the information good with a good that is costly to reproduce: printed books, documentation, user support, a special kind of viewer, etc. We will consider digital-library analogs of this approach, where provision of documents is bundled with special services such as delivery, search, customization, and so on, which add "user-specific value" to the information. For example, a user might want to retrieve cross-tabulated data on the incidence of hurricanes and insurance premiums for different locations, along with newspaper anecdotes and photographs. Such a search would require querying several disparate databases and merging the results.
As this example illustrates, the value added to the user depends on the organization of the information, not simply the raw data. Similarly, the resource cost to the provider depends on the expense of organizing and customizing the information. Since this reflects a non-negligible marginal cost, the charging mechanism can approach the desired economic result. As a side benefit, the fact that information may be organized differently for different users reduces the incentive for unauthorized copying and redistribution.
At an abstract level we can pose the pricing problem as follows. Our objective is to construct a payment scheme that depends only on observable characteristics of users and that maximizes overall benefits subject to the constraint of covering costs. Of course, we have to build into this optimization problem the fact that the users' choices of information services will depend on the nature of the pricing scheme that they face; economists refer to this constraint as the incentive compatibility constraint.
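One possible way to write this problem down is sketched below. The notation is ours and is only one of several reasonable formalizations: user $i$ has observable characteristics $\theta_i$ and a private valuation $v_i(x)$ for service $x$ from the set $X$ of available services, $c(x)$ is the provider's cost of delivering $x$, $F$ is the fixed cost, and $p(x, \theta)$ is the payment scheme.

```latex
\begin{align*}
\max_{p} \quad & \sum_{i=1}^{N} \bigl[ v_i(x_i) - c(x_i) \bigr] - F
  && \text{(overall benefits)} \\
\text{subject to} \quad
  & \sum_{i=1}^{N} p(x_i, \theta_i) \;\ge\; F + \sum_{i=1}^{N} c(x_i)
  && \text{(cost recovery)} \\
  & x_i \in \arg\max_{x \in X} \bigl[ v_i(x) - p(x, \theta_i) \bigr],
    \quad i = 1, \dots, N
  && \text{(incentive compatibility)}
\end{align*}
```

The payment depends on the user only through $\theta_i$ and the chosen service $x_i$, reflecting the requirement that the scheme rely on observable characteristics alone.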
This abstract formulation of the problem can be examined using the methods of mechanism design (Wilson, 1993). In general, we will want prices to differ across users based on both their observed characteristics and the form and amount of the information delivered. For example, the charge for information could be based on 1) frequency of use, 2) immediacy of delivery, 3) structure or formatting, 4) amount of information retrieved, 5) membership in a group, etc. This means, of course, that the digital library must maintain sufficient records to be able to base charges on these characteristics.
We propose to investigate both the theoretical formulation of the pricing problem for information goods and the concrete implementation of various pricing strategies in the context of digital library materials.
Information held in the collections may be owned by various entities, some of which may demand some control over the dissemination of contents or compensation for access to their copyrighted material. This can be particularly difficult in dealing with digitized information, since users can copy information just as efficiently as publishers. Although we cannot in this project resolve all the thorny issues underlying the notion of intellectual property in a digital library, we must design our system to accommodate mechanisms to protect information access and support royalty payments and other remuneration operations. This includes provisions for executing the transactions themselves, as well as designing the methods for selection of information services so that they are sensitive to relative costs and other economic factors. We will also explore particular schemes for compensating publishers in a distributed environment. One example is "superdistribution" (Cox, 1993; Mori & Kawahara, 1990), where the access of information is free but its use is charged.
To understand the virtual information-services marketplace, it may help to consider the perspective of a specific hypothetical agent that performs one small function in the process of an information retrieval task. This agent watches the network for keyword requests (or whatever a meaningful piece of a query might be--this example is meant to illustrate resource-allocation issues rather than our information-retrieval model), and produces abstracts of documents that it believes are salient to those keywords. This agent's activities demand resources--processing, storage, and access to documents--and its value to the system as a whole must justify the allocation of these resources.
To ensure that resources are allocated to the most valuable enterprises, we measure the value of this agent's product by direct compensation. For example, we could pay the agent a fee for each abstract produced. However, we must also provide an incentive for the agent to produce relevant abstracts. We could have the consumer of the abstract (another mediator agent, or ultimately the end user) pass judgment about relevance, but we also want to ensure that the value of a document is not misrepresented in order to evade fees. One way to do this is to provide the abstracts for free, but then charge a finder's commission whenever the full document is retrieved. Presumably, retrieval is an accurate signal of the perceived value of the document.
It may seem that the agent still has an incentive to over-produce abstracts, since increasing the number of abstracts offered might increase the number of documents retrieved. However, an astute information entrepreneur will recognize that offering irrelevant documents will damage its reputation as an information agent, and in the long run decrease its commissions. By keeping track of the identities of other agents they interact with, agents can adapt to patronize those performing most effectively.
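The commission-and-reputation argument can be sketched as a small ledger kept by a consumer of abstracts. The class name, the commission rate, and the reputation measure (retrievals per abstract offered) are assumptions made for illustration, not a committed design.

```python
from collections import defaultdict


class Ledger:
    """Tracks, per abstracting agent, abstracts offered and documents retrieved."""

    def __init__(self, commission: float = 0.05):
        self.commission = commission      # hypothetical fee per retrieved document
        self.offered = defaultdict(int)
        self.retrieved = defaultdict(int)
        self.earnings = defaultdict(float)

    def offer(self, agent: str, n_abstracts: int) -> None:
        # Abstracts are free, so offering them earns nothing.
        self.offered[agent] += n_abstracts

    def retrieve(self, agent: str, n_documents: int = 1) -> None:
        # The commission is paid only when a full document is retrieved.
        self.retrieved[agent] += n_documents
        self.earnings[agent] += n_documents * self.commission

    def reputation(self, agent: str) -> float:
        # Fraction of offered abstracts that led to a retrieval.
        offered = self.offered[agent]
        return self.retrieved[agent] / offered if offered else 0.0

    def preferred(self) -> list[str]:
        # Agents ordered from most to least reliable.
        return sorted(self.offered, key=self.reputation, reverse=True)
```

In this sketch an agent that floods the consumer with abstracts may collect more commissions at first, but its reputation falls, and a consumer that consults `preferred()` will route future requests elsewhere.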
One sharp example of the importance of reputation is in the prevalence of subscription services. Readers may subscribe to The New York Times, for example, because this newspaper has a track record of providing news relevant to their interests. Similarly, we expect subscription services to proliferate in the digital library, though with far greater customization and flexibility than traditional subscription services. The concept of standing requests and proactive agents can be viewed as a form of subscription service.
This example illustrates some of the economic issues that come up in designing the configuration of agents comprising the digital library. The central point is that designing a distributed system of this sort is largely a matter of getting the incentives right, and economic analysis can play a central role.
Central to federating any collection of independently generated information sources, or databases, is a common language for describing contents without detailed information about access mechanisms, organization, or any other implementation-specific issues. The description of the content, in a sense, is a declaration of what is; this is commonly called an ontology. The ontology, because it must be communicated, is described in some (semi) formal language, facilitating concise and thorough statement of the contents {cite Gruber AID '92 Portable Ontology paper}. For example, we can communicate the contents of a database through attribute-value pairs. An ontology would describe the meanings of the attributes and units for the values, but would not give the actual value in the value field for a particular database, nor would the ontology specify how the data should be stored or accessed.
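The attribute-value example can be illustrated with a minimal sketch in which the ontology records only the meaning and unit of each attribute. The attribute names and the `AttributeSpec` type are hypothetical; an actual ontology would be written in a language such as Ontolingua rather than in program code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeSpec:
    """Meaning and unit of one attribute; holds no data values."""
    meaning: str
    unit: str


# Hypothetical ontology fragment for an atmospheric-science source.
ONTOLOGY = {
    "ozone_column": AttributeSpec("total column ozone above a site", "Dobson unit"),
    "altitude": AttributeSpec("height above mean sea level", "kilometer"),
}


def describe(record: dict) -> list[str]:
    """Interpret a record's attribute-value pairs using the ontology."""
    lines = []
    for name, value in record.items():
        spec = ONTOLOGY[name]  # KeyError for attributes the ontology does not define
        lines.append(f"{spec.meaning}: {value} {spec.unit}")
    return lines
```

Note that `ONTOLOGY` contains no values and says nothing about storage or access; the values arrive separately, in whatever record a particular database supplies.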
Our approach overcomes this problem by ignoring syntax and concentrating on content: the specific data that is to be interchanged. The syntax of representation is left to individual database or software developers.
The ontological approach therefore concentrates on defining the following:
We expect the ontologies developed as part of the digital library to describe bibliographical items, specifically, the organization of large collections of collections (information sources in the digital library), which are necessary to control query processing. As the number of information sources is vast, the ontological descriptions will be hierarchical: we expect to concurrently develop ontologies for specific domains, such as atmospheric science, as well as an overarching ontology for describing the organization of the library itself. Partitioning the ontology-writing process in this way keeps each ontology small enough to be written and maintained. It is important to note that our project has experts in the area of writing ontologies (see MATRIX).
The writing of ontologies will initially be done by hand, exploiting techniques commonly used by information and library scientists. Our plan is to write primary ontologies, and then publish them in the library. Designers of information sources and interfaces can use these ontologies to find the appropriate terms for publishing information about the contents of their sources (for information providers), or finding terms for making queries (for interface builders). The ontologies can be modified by developers to cover new information and media types. A protocol for modifying ontologies will be established, but it is important to note that we are specifically not proposing committees or other coordinating bodies to control the modification of ontologies. Rather, we propose that market incentives will be the coordinating mechanism. For example, if an information provider misunderstands or permutes an ontology in an unacceptable way, his information source will not be accessed by users of the library. This provides broad-based feedback to the provider on the adequacy of both the information source and the way in which its contents were published.
Currently available ontological description languages, such as Ontolingua, do not support hierarchical descriptions. We plan to extend Ontolingua to include mechanisms for describing various forms of hierarchy relations.
The protocols define the full range of activities that can be performed in the library and will apply primarily to mediators and communication wrappers around information sources and user interfaces, as they are the active agents in the library. Insulating information sources and interfaces from the details of the operation of the network makes them easier to construct and to maintain. Furthermore, we can control their actions, adding security to the network. The metaphor is the telephone network: many manufacturers produce telephone sets, which interface to the extremely complex telephone network in a relatively simple way. As the common carriers make most of their money by carrying conversations, it is in their interest to make telephone sets as available as possible.
The definition of protocols will be under our control, as will their implementation (at least initially). As the design of the network is not complete (it will be fully developed as part of the proposed project), the details of the protocols are not yet available. We, however, give several example operations below:
As part of this proposal, the University of Michigan plans a comprehensive on- and off-campus deployment activity. Partnerships have been established with the publishers and users, resulting in our ability to test and evaluate our research under realistic conditions with a large representative collection. Since we have already developed and deployed an image-based digital library system (DIRECT) at the University, we expect to begin implementation of the testbed as soon as the project starts.
In order to test and evaluate the proposed research in a controlled but still significant environment, we have chosen the domain of earth and space sciences, and a user community consisting of high school, undergraduate and graduate students, researchers and the general public. This subject domain and user community were selected because they offer an especially diverse range of content types and expertise. Moreover, we are able to exploit existing relationships at the University of Michigan, as well as to complement other on-going research efforts in the areas of collaboration technology and educational technology, several of which are already funded by the NSF.
The University's Department of Atmospheric, Oceanic, and Space Sciences (AOSS) and its associated Space Physics Research Lab (SPRL) form a nationally recognized program and represent a core competency of this institution. This department and laboratory have already been heavily involved in setting the requirements for the testbed collection and in providing us with their needs.
NSF is currently supporting an experimental collaboratory for upper atmospheric research involving Michigan scientists and colleagues in Denmark, the University of Maryland, and the Stanford Research Institute. The digital library testbed will be made available to these researchers, and will provide an important supplement to the on-going collaboratory research. The Foundation has also awarded several curriculum development projects to the Michigan AOSS and Computer Science faculty to develop new approaches to teaching and learning in this discipline at the high school level.
Basic science curricula in high schools contain earth and space science concepts, making these topics highly relevant to students in grades 10 through 12. Through our digital library, it will be possible to provide these students with a broad range of information in a variety of formats and to connect them to the researchers and research activity at the University as well as at remote specialized laboratories.
We have developed working relationships with several high school programs and selected the Ann Arbor Public Schools, the Ann Arbor Public Library, Stuyvesant High School and the New York Public Science, Industry and Business Library as primary non-campus deployment sites. Based on the results of the research and testbed deployment, we may extend UMDL availability to Bloomfield Township and Battle Creek, Michigan communities. In these communities, we will work with high schools through the science teachers, high school media centers, and public libraries to develop a coordinated base to bring the digital library to students and adult learners.
The Ann Arbor site will build on well established working relationships between project faculty and science teachers and librarians at Community and Pioneer High Schools and the Ann Arbor Public Library. In addition, the UMDL will be available to students at Stuyvesant Science Magnet School in New York in collaboration with the New York Public Library's new Science, Industry and Business Library, scheduled for completion in 1995.
We have forged strategic alliances with several primary and secondary publishers of key journals, textbooks, magazines, and other reference materials relevant to this domain. These publishing partners, including Elsevier, McGraw Hill, and University Microfilms, have agreed to make available an extensive array of published materials, in full text, formatted form, for the purposes of this research project. We have agreements with all of these publishers to work aggressively to capture much of the material in source digital form, and to provide it to us over time in a structured digital form such as SGML.
This approach lets us offer a large collection of existing material that may be available only in digital image form, and yet move quickly toward providing materials in a structured format. It also provides an environment for understanding trade-offs between the different formats, and in particular, for understanding whether and how structured documents and the availability of structured information can assist and inform the search process.
The collection will include textual, video, still image and data sets as well as archives of the UARC project. The journal, monograph, and reference material will span the range of user sophistication and types of resources--e.g., journals such as Aviation Week and Space Technology, Remote Sensing of the Environment, and Atmospheric Research; the McGraw Hill Encyclopedia of Science and Technology; and Elsevier's major bibliographic tool GEOBASE.
We have already held focus groups with representatives of the user communities to understand the intersection of their needs with materials published by our partners, and have come up with an extensive list of titles that will be made available to this project. A detailed list of materials for which we have negotiated use permissions for this project is located in Appendix 10.2.
Data sets to be included in the testbed will come from a variety of sources including the federal government, professional associations, and academic providers. Partnerships with academic units at the University will allow us to include significant data such as: EPA Air Quality Archives, I.R.I.S. Real Time Seismic Data, and UNIDATA Real-Time Meteorological Data. These will be complemented with geographic data including Michigan Land Use/Land Cover Data, N.A.S.A. G.I.S.S. Global Vegetation and Land Use Databases, and an array of U.S. Geological Survey data.
We also have discussions underway with providers of video, for example Encyclopedia Britannica, and audio programs and expect these materials will be made available to the UMDL.
The UMDL will enable students to explore questions in ways that would be exceedingly difficult, if not impossible, with current resources. For example, students will have access to the same data as would researchers, as well as some access to the researchers themselves. No longer must students rely on minimalist summaries in outdated textbooks. Taken together, the UMDL provides an information infrastructure that should enable students to carry out inquiries into timely, provocative, and authentic -- and hence, motivating -- scientific questions.
The challenge, then, of this effort from the educational standpoint is how to scaffold the process so that learners can indeed transform information into knowledge and understanding. By "scaffolding" we mean all manners of support, from computer-based coaches to prompts for specific information to watching an expert model interacting with the UMDL. To that end, we see two mechanisms by which that scaffolding will take place:
We start deployment of the testbed with a significant advantage: a small software development project, DIRECT (Desktop Information Resources and Collaboration Technology), a joint initiative of Digital Equipment Corporation and the University, has produced a prototype digital library system for image-based documents. The initial deployment of DIRECT has been undertaken with a journal set provided by Elsevier Science Publishers under its TULIP (The University Licensing Program) initiative.
TULIP has provided nine universities with access to 43 materials science journals in image form. Michigan was the first university to make these journals available -- in full text and with full page fidelity -- to anyone on campus with a bit-mapped workstation on their desk. DIRECT/TULIP has now been in operation since April of 1993, and we have gained extensive experience with providing such resources as a production service to a user community.
The journals are delivered biweekly as a series of bitmap images of the full text with associated 'dirty' OCR files and 'clean' abstract/citation data. In the future, Elsevier plans to move to SGML. The participating universities have full 'on campus' rights to use the journals. Elsevier is particularly interested in experimentation with various economic models for usage (e.g. subscription models vs. charge-per-print or other methods), and with the behavior of the journals' users.
In return, the universities develop the software required to access the information and maintain and report journal usage logs to Elsevier. They must also provide some control of access to the journals, limiting access to university users only.
Implementation at the University of Michigan has followed a two-pronged approach, both as part of the University's Library Management System (NOTIS) and as a database in the experimental DIRECT system. The former addresses the needs of a user with a low-end, possibly ASCII only, low-bandwidth network connection terminal, while the latter provides functionality for a user with high-end access such as a workstation with high resolution display and an Ethernet or higher-speed network connection.
Under NOTIS, users can search the citation/abstract data, but not the full text. From the results of this search, the user can view the full abstract and citation of any article. If the user then wants to see the full article, she can issue a print request which is sent to the DIRECT system. The article is then printed to the specified laser printer anywhere on campus.
When the TULIP project began, DIRECT (Desktop Information REsources and Collaboration Technology) was an existing project focused on building experimental systems for delivering multimedia information to end users. It also sought to explore issues associated with intelligent agents watching over information collections for data of interest to the user. Under the DIRECT system, the user can search the full text of the TULIP data using a search engine developed here at Michigan, retrieve an article or abstract, and then display the full bitmapped page images on his workstation. The printing capabilities include access to any of approximately 300 printers on the campus network.
In addition to its interactive search facility, the DIRECT system allows users to store 'productive' searches as agents for automatic processing against additions to the search base. The results of these stored searches are automatically e-mailed to the user in the form of a digest containing the abstract/citation information of any matching articles. This service amounts to a sort of personalized subscription, which frees the user from having to continually return to the system to re-enter queries in order to not miss something that might have arrived since the last search.
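The stored-search facility can be illustrated with a minimal sketch. It is not the DIRECT implementation: the type names are hypothetical, and matching is simplified to requiring that every search term appear as a word in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredSearch:
    user: str
    terms: frozenset  # lowercase terms; all must appear (simplifying assumption)


@dataclass(frozen=True)
class Article:
    citation: str
    abstract: str


def matches(search: StoredSearch, article: Article) -> bool:
    """True when every search term occurs as a word in the abstract."""
    words = set(article.abstract.lower().split())
    return search.terms <= words


def build_digests(searches, new_articles):
    """Map each user to the citations of newly added articles matching a stored search."""
    digests = {}
    for search in searches:
        hits = [a.citation for a in new_articles if matches(search, a)]
        if hits:
            digests.setdefault(search.user, []).extend(hits)
    return digests
```

Running `build_digests` only against each new delivery, rather than the whole collection, is what spares the user from re-entering queries: the digest contains just what has arrived since the last run.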
While DIRECT is a 'distributed' system in the sense that it is client/server based, the clients are strongly bound to a particular server, known a priori. DIRECT provides a starting point for deployment of the testbed; the research described in the previous section is critical to breaking this binding and moving toward an environment where clients dynamically locate servers based on the contents of queries.
Migration from a standard client/server system to a completely distributed self-assembling federated database constitutes the major development in the underlying structure of the system. Currently, DIRECT consists of multiple client programs which all talk to a single known central server, which contains the search engine, indices and documents. The system will evolve from its current state into the advanced digital library architecture of the UMDL by incorporating technologies devised by the research groups involved with this project.
The first phase of development will introduce the notion of a mediator, which adds an additional level of abstraction between client and server. Initially the clients will connect to a single known mediator, which in turn will be able to connect to multiple separate database servers depending on the contents of a particular query. This differs from a system like WAIS, in which a query is sent sequentially to all databases that a user explicitly selects; a mediator will itself choose the appropriate data sources.
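A minimal sketch of this first phase follows. The class names, the topic-overlap rule for choosing sources, and the substring search are illustrative assumptions; the actual selection mechanism is a subject of the proposed research.

```python
class Database:
    """A data source that advertises the topics it covers."""

    def __init__(self, name, topics, documents):
        self.name = name
        self.topics = set(topics)    # subjects this source claims to cover
        self.documents = documents   # list of document titles

    def search(self, terms):
        return [d for d in self.documents if any(t in d.lower() for t in terms)]


class Mediator:
    """Chooses data sources from the query contents; the client never names a database."""

    def __init__(self, databases):
        self.databases = databases   # known a priori in this first phase

    def select(self, terms):
        # Consult only the sources whose advertised topics overlap the query.
        return [db for db in self.databases if db.topics & set(terms)]

    def query(self, terms):
        results = []
        for db in self.select(terms):
            results.extend((db.name, title) for title in db.search(terms))
        return results
```

The client calls only `Mediator.query`; which databases are consulted is decided inside `select`, so sources whose advertised topics do not overlap the query are never contacted.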
The second phase will involve multiple mediators and multiple databases. Where the mediators previously knew about various data sources a priori, now none of the mediators or databases are initially known to each other or the clients. Through combinations of multicast methods and registration systems, new pieces of the system will automatically coalesce. This is useful not only for ensuring that large numbers of different databases can work together, but also for enabling them to maintain autonomy of administration and easy redundancy. Some distributed-systems research will be required to determine the optimum way to enable this self-assembly process.
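The registration half of this self-assembly can be sketched as follows. The sketch is an in-process stand-in, assuming a single registry object in place of multicast or a network service, and all names are hypothetical.

```python
class Registry:
    """In-process stand-in for a network registration service."""

    def __init__(self):
        self._entries = {}     # name -> (kind, topics)
        self._listeners = []

    def subscribe(self, callback):
        # Replay existing entries so late joiners learn about earlier ones.
        for name, (kind, topics) in self._entries.items():
            callback(name, kind, topics)
        self._listeners.append(callback)

    def register(self, name, kind, topics=()):
        entry = (kind, frozenset(topics))
        self._entries[name] = entry
        for callback in self._listeners:
            callback(name, *entry)


class DiscoveringMediator:
    """Starts with no knowledge of any database and learns of them by announcement."""

    def __init__(self, name, registry):
        self.name = name
        self.known_databases = {}
        registry.subscribe(self._on_announce)
        registry.register(name, "mediator")

    def _on_announce(self, name, kind, topics):
        if kind == "database":
            self.known_databases[name] = topics
```

A mediator created after some databases have registered still learns of them through the replay in `subscribe`, and it hears of later arrivals through `register`, so no component needs to be configured with the others in advance.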
The third phase will allow greatly increased heterogeneity of data sources. To accomplish this, the research group will devise an ontological description language. Data sources will use this language to describe themselves to mediators in an object-oriented manner by giving details about their contents, capabilities, classification system, and access methods. It is then the job of the mediators to use this information to determine the worth of the particular database, formulate the proper sort of query, and translate the results into a usable form.
With the exception of basic citation fields, the DIRECT system currently employs very little document structure information. Full text documents are represented by bitmap images, unstructured plain text, and citations. For this project, the system will be expanded to deal with arbitrary document structure types.
The UMDL is designed as a testbed to which any number of toolsets can be adapted and a variety of effects studied. Specifically, the relationship of structure to efficiency and effectiveness of retrieval (location) and reuse (application) of information will be investigated. Elsevier and McGraw-Hill are committed to providing structured representations for all or some collections. Both organizations have adopted SGML (Standard Generalized Mark-up Language) for this purpose, a technology that is being increasingly adopted by other publishers and information providers.
SGML is a grammar for specifying valid objects and their contexts. It is a storage representation that allows data to outlive the technology used to create or access it, and it is a representation suitable to direct reuse, including linking (e.g., HTML). It is a foundation for enabling multiple paths by which intelligent agents can locate, retrieve, and manipulate data.
SGML applications consist of a Document Type Definition (DTD), an instance of a conforming document(s), and a parser which validates the instance of a document against valid objects and context specified by the DTD:
Intelligent agents do not need any prior or special knowledge of the structure of a collection; DTDs provide that. Agents can learn new document structures as they encounter them. Users determine application of retrieval operations.
Conforming SGML documents are in an object-oriented representation. Document databases can be created, based on generalized parent-child relationships between objects. A document can be traversed by its structural components. Attribute-value pairs can be attached to objects which can be manipulated and can be linked within and among documents. Versions of objects can be stored and annotated.
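The parent-child view of a document can be illustrated with a small sketch. The `Element` type below is a simplified stand-in for a parsed SGML element, not an SGML parser, and the tag and attribute names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One structural component of a document (simplified stand-in for an SGML element)."""
    tag: str
    attributes: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)

    def walk(self):
        """Yield this element and all descendants in document order."""
        yield self
        for child in self.children:
            yield from child.walk()

    def find_all(self, tag):
        """Return every element with the given tag, in document order."""
        return [e for e in self.walk() if e.tag == tag]


article = Element("article", {"id": "a1"}, children=[
    Element("title", text="Ozone and Jet Exhaust"),
    Element("section", {"id": "s1"}, children=[
        Element("para", text="First paragraph."),
        Element("para", {"reviewed": "yes"}, text="Second paragraph."),
    ]),
])
```

Here `article.find_all("para")` retrieves components by structure rather than by position in a file, and the `reviewed` attribute shows how an attribute-value pair can be attached to a single object.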
Existing DTDs, the controlling environment for particular document types (e.g., bibliographic data, journal article, book chapter, book), will be analyzed from the point of view of digital library requirements. A framework DTD that is modular for fragments will be developed. It will incorporate both conventional and multimedia document structures as well as fragments needed to encode components of knowledge representation. The DTD will be used to validate testbed data. Then, FOSIs (Formatting Output Specification Instances), DTDs that specify a formatter-independent layout style sheet, will be developed.
While SGML is gaining in popularity, competitors do exist, and others may emerge. Adobe's PDF and Xerox's use of ODA are alternative approaches that have communities of supporters. Appendix 10.3 identifies a group of experts in structured information drawn from university, not-for-profit, and for-profit settings who will help assure that the paths selected by the UMDL can interoperate and evolve.
A very significant benefit of having the testbed in a conforming SGML application is the ease with which changes in record layout can be made to accommodate advances in retrieval technology. Simple models for knowledge representation are easily replaced by more sophisticated mechanisms as the project unfolds. The effort involved in creating the initial SGML application provides excellent experience and is not, in any case, throwaway work. It is the first step in a process.
We will investigate different aspects of document structure, both internal and external to individual documents. Internal structure can be tags within the document such as SGML, and external structure can be named attribute tags and classification systems that are attached outside of the documents themselves. We intend to determine the optimum amounts and types of each, and the ways to best use them to help the search process.
For this purpose, the system will have flexible support for many types of internal and external structure, both of which are described to mediators by the previously mentioned database ontology. We expect to have some data sources that will make extensive use of SGML, and others that support annotations and experimental classification systems.
This subsection describes functionality that we initially believe should be contained in the digital library. The order in which we implement these features will depend upon the priorities of the different research efforts for which they are needed. The feature list will grow and change based on research results, iterative testing, user feedback, user studies, and new ideas.
The system will naturally allow interactive searching and browsing across multiple remote databases simultaneously and automatically, without the user having to select particular databases. Location and format of information will be transparent unless the user chooses otherwise. Many kinds of search queries will be supported, including requests to find additional information similar to a particular document. The system will allow a user to browse and navigate through large amounts of information, probably initially through a WorldWideWeb/Mosaic hypertext model. It will also support the notion of a "virtual library," allowing the construction of a personal information space or logical view into the information universe.
The system will provide "scheduled" searching, or notification facilities. Active computerized agents will seek out new or changed information that meets a user's interests, collate it, and inform the user about it. The intervals and methods of this notification will be specifiable by the user.
Users will access the digital library through many different desktop and network access methods, ranging from text-based terminals to high-resolution graphics workstations. Client programs will run on as many platforms as possible, taking advantage of the characteristics of each. At a lower level, it will interoperate with existing systems using standard protocols, including HTTP (WorldWideWeb) and Z39.50.
Some databases will allow users to add their own documents. This user-publishing feature will require a detailed bibliographic control system in order for the database and the mediators to know the significance and proper classification of the new documents. While individual databases can maintain autonomy of administration, the bibliographic control system will maintain consistency within a particular database. This part of the system will also incorporate digital signatures at a fundamental level to keep track of both authorship and acceptance. For example, an on-line journal may use a digital signature annotation as an indication that a particular item has indeed been accepted by the journal's editorial board.
The system will provide "collaboration hooks" for integration with collaborative technology tools. This will allow users to work with others and with human librarians when needed, just as with a conventional library. The collaboration tools will allow sharing of advice, queries, and documents. There may also be tools to facilitate collaborative authorship as part of the digital publishing process.
Some databases will require authentication. This is necessary for information that is restricted to a certain group of users due to copyright or licensing, and for information that requires a fee for use. Multiple authentication methods will be supported for different databases. Finding the best methods of charging and billing will require investigation, as detailed in section 3.2.7.
Usage will be monitored, both for any billing needs and also anonymously for user studies and research. Usage statistics will be fed back to the researchers and user-studies groups, who will in turn suggest changes, improvements, and new features for further testing. By using this iterative process through the duration of the project, the system will continue to evolve both to incorporate new research results and to meet the changing needs of the user community.
While a dominant role of the testbed activity is to provide a real-life vehicle for testing our research ideas, we nevertheless expect it to support production users from the outset and to provide real benefits to our test user communities. The fact that we will be able to deploy, in image mode and with the limitations admittedly inherent in the existing system, a wide variety of useful publications from day 0 will enable us to begin immediately providing a useful service to our users.
The following represents current thinking on the roll out of the testbed, and how we plan on moving results from the research activity into the testbed.
Upon initial award of the cooperative agreement, we will be able immediately to mount under the existing DIRECT system the journals and other materials provided us by our publishing partners. As this material is already available in image form, we plan on making this material available within several months after proposal award.
As part of this initial effort, we plan on working with the publishers to provide us with material in SGML format as soon as possible; we have commitments from the publishers to aggressively work with us in this area. There are several phases to this joint activity, including:
Toward the end of the second year, we expect to be able to begin deploying a new version of the client that is able to interact with mediators instead of directly querying a particular database. Initially, this will still be based on a fixed query format and a priori identification of the mediator of interest.
In the third and fourth years we plan to further provide extensions to the system to allow for interaction with the network of mediators and servers, using results from the research activity. Details of the precise mechanisms to be deployed are clearly dependent on research to be carried out under this program.
While the above plan presents a linear rollout of research into the testbed, the entire development and deployment philosophy is based on the model of fast prototyping and testing. The goal is to develop experiments from the research activity and very quickly to build prototypes implementing those experiments within the existing testbed. Thus, we envision the testbed as continually evolving as workable solutions to problems are developed. Also, as the evaluation and assessment activity tests changes for their utility and their success in improving the ability to find relevant information, input from that process will be used quickly to change and improve the implementation.
Management of the testbed will involve three main areas: development, technical support, and user support. The development activity is responsible for the actual programming of testbed software. Technical support includes responsibility for system operation, system programming, hardware support, and maintenance. User support includes any necessary user assistance, user training, and system documentation.
The testbed development activity will involve extending the existing DIRECT system in many new directions. This activity will be led by Randall Frank, Director of Information Technology for the College of Engineering and the School of Information and Library Studies. As has been demonstrated by TULIP, the DIRECT team has established a record of building production-level digital library systems. They were the first (by over half a year) to implement the TULIP initiative in a working system, and in the process gained credibility with the publishing community. The willingness of publishers such as Elsevier to work with us in this project is a direct result of their experiences to date with the TULIP project at the University of Michigan (see supporting letter from Elsevier). Mr. Frank has headed the existing DIRECT project from its inception, and was responsible for overall system architecture and conception. He will continue as overall director of the testbed development and deployment activity.
The existing DIRECT team, augmented by additional staff and graduate students, will have primary responsibility for programming the testbed and for moving back and forth between the research and development efforts. To ensure maximum technology transfer between the research and development components of this project, we plan on treating both staff and research assistants as a project-wide resource, and not as dedicated to a particular component. Thus, as research results become ready for deployment, the staff and research assistants involved with the initial research will take lead responsibility for actually deploying the work in the testbed. We believe that this incremental and interactive process will shorten the time from research to deployment, and will help ensure that research results get early exposure and testing by users.
The responsibility for technical support and operation of the testbed (including servers, networks, and related equipment) will reside within the Computer Aided Engineering Network (CAEN). CAEN has overall responsibility for one of the largest workstation and server-based networks within the academic community, consisting of over 1,200 UNIX-based workstations and a comparable number of Macintoshes and other PCs. CAEN maintains dozens of Ethernets and a 100 Mb/sec FDDI backbone connecting all of these machines. CAEN is also leading an experimental campus activity in Asynchronous Transfer Mode (ATM) networking.
Hardware support and maintenance will be supplied by CAEN's hardware operations group. This group has considerable experience maintaining the thousands of computers and connecting networks of the College of Engineering. The operations group also works cooperatively with other University computing organizations to ensure the stability of the overall campus network. Support for services such as backup and server maintenance will also be provided by CAEN.
As the testbed moves into a production system, CAEN will maintain primary responsibility for continued system programming and evolution of the software.
The user community for the digital library spans a range of expertise and sophistication with information systems, with computers, and with information resources. Critical to the exploitation of these resources will be ongoing programs of training, user assistance, and outreach to promote use of the digital library. Closely associated with these issues of user support is the ongoing development of the collection of information resources through continued partnerships with information providers, including commercial, governmental, or academic sources. The user support structures envisioned for this project will bring together these themes of technical assistance, user skill development and responsiveness to user need both in terms of tapping existing information resources and the development of future resources.
A project librarian within the University Library system will be responsible for coordinating the on- and off-campus user support programs, serving as a vital link between systems designers and technical support personnel on the one hand and the user community on the other. This professional will be assisted by a number of subject specialist librarians (with assigned responsibilities for oceanography, meteorology, space science, global change, geographic information systems, etc.) drawn from the University of Michigan University Library. These librarians will play an ongoing role in identifying and evaluating information resources to be added to the UMDL and in working with campus constituencies to encourage use of and participation in the digital library.
The project librarian will also work closely with the curricular research team and high school communities to build a system for remote user support through on-site workshops, phone and electronic mail reference services, and instructional guides available through the digital library. Ongoing communication between the school media specialists, science teachers, project librarians, and users will be essential to ensure both the collections and systems are meeting the curricular needs of the external partners.
The New York Public Library expects to provide user support for the general public and high school outreach to Stuyvesant Science Magnet School. This will be accomplished through close interaction with the Michigan user support team and assistants, who will provide training to the New York professionals.
Each time the user of the Digital Library obtains information, several intellectual property rights questions arise.
First, the reader must be provided with data that identify the source of the information being provided from the DL. This is essential if the source and the author are to be appropriately and accurately acknowledged in any document subsequently prepared by the user. This information will also be important for the user's subsequent research, in which he or she may well want to read more material from particularly apposite sources than the information search mechanism initially provided.
Second, the user of the information and its author both need to be assured of its integrity: that it is being presented exactly as the originator intended it. This raises subtle questions. For example, for much information, the typographic presentation is not part of the author's concept and need not be sustained. But for some materials, the typographic presentation is an integral part of the authorial concept and does need to be maintained as the material migrates from author to the UMDL to reader. Of course, the dominant concern and one applicable to all materials is that of textual integrity.
Third, documents in the UMDL will have owners. The Constitution recognizes the need to provide incentives and the basis for cost recovery to creators and disseminators of documents. It has done this by establishing intellectual property rights and allowing the means for the owners of those rights to earn income from their property. The UMDL must respect the property rights in the materials included in the Library and provide means for recording usage and for collecting payments from the users and remitting those payments to the property owners.
Within the general area of intellectual property, there is one problem that involves both conceptual questions and complex operational ones. This is "fair use" -- that there are some defined uses of copyrighted materials for which neither permission need be obtained nor payment made. There is of course considerable debate about the extension of fair use to the electronic environment. But adoption of any definition of fair use requires that a system for monitoring and charging for the use of copyrighted documents take notice not only of the parameters set by the owner but also of the use to which the material is to be put.
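The requirement stated above, that a charging system take notice both of the parameters set by the owner and of the use to which the material is to be put, can be sketched as a fee calculation with two inputs. Everything in this sketch is hypothetical: the purpose labels, the exempt set, and the function name are placeholders chosen for illustration, and which uses actually qualify as fair use is a legal question this sketch does not attempt to answer.

```python
from dataclasses import dataclass

# Placeholder set of purposes treated as exempt. Purely illustrative;
# not a statement of what copyright law actually permits.
ASSUMED_EXEMPT_PURPOSES = {"private_study", "criticism"}


@dataclass(frozen=True)
class OwnerTerms:
    fee_cents: int  # fee the owner asks per chargeable use


def charge_for_use(terms, purpose):
    """Return the fee owed, given the owner's terms and the declared use."""
    if purpose in ASSUMED_EXEMPT_PURPOSES:
        return 0
    return terms.fee_cents


terms = OwnerTerms(fee_cents=50)
print(charge_for_use(terms, "private_study"))   # exempt under the assumed policy
print(charge_for_use(terms, "redistribution"))  # charged at the owner's rate
```

Keeping the exempt purposes in a separate table from the owner's terms means that whatever definition of fair use is eventually adopted can be changed without touching the fees set by individual property owners.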
A complete system for handling intellectual property rights within the UMDL must attend not only to the first use of materials, i.e., by the immediate UMDL user, but also to subsequent use of the material in documents created by the user and then copied to other people. There will be many instances of such activity where payment is legally due to the original copyright holder.
In managing the testbed and implementing the research (see sections 3.2 and 3.3), the project team will seek strategies for recognizing and monitoring these intellectual property issues. Some of the research is expected to resolve the problems described above.
As discussed previously, several existing production organizations, including the Computer Aided Engineering Network and the University Library system, are heavily involved in the deployment and support of the testbed. By using existing organizations to provide this support, systems built under this project will easily and naturally find their way into the supported production environment of the University upon completion of this project. Both CAEN and the University Library are committed to supporting the results of this project after its completion as part of their ongoing activities.
The eventual physical home for production resources (hardware, software, and collections) developed under this project is the Integrated Technology Instructional Center (ITIC) being built on the North Campus of the University. This new 225,000 square foot facility will be the physical centerpiece of the University's efforts in digital libraries, and will house existing engineering, art, and architecture library resources along with information technology resources for the North Campus. This $42 million building, scheduled for completion in December of 1995, is being funded by the State of Michigan as an indication of the importance that the State and University place on making electronic information available both on and off campus. The transition of this project to ITIC will be relatively straightforward, as Randall Frank, who will manage the testbed and deployment activity, is acting director of ITIC as well, and he will ensure a smooth transition of the project to the new facility.
The design of this building has centered around the integration of library and information technology resources, as opposed to simply housing them side by side. This facility will contain extensive facilities for the electronic capture and conversion of information to digital form, along with over 700 workstations and an extensive training facility to help users learn how to make best use of the evolving digital infrastructure.
Where no intellectual property constraints are present, we plan to allow any users on the Internet to have access to the UMDL collections. This will be true for various government data and other information not owned by private publishers. Significant data to be used in this project, such as the weather data provided by the Weather Underground project at the University, are already provided to thousands of Internet users world-wide (at last count at the rate of over 250,000 accesses per week) and will be an active component of this project.
Using the new extensible forms facility in Mosaic, we are in the process of allowing Mosaic users to access various servers already provided through DIRECT. This obviates the need for users to learn yet another client, and simplifies our task of implementing clients for all possible machines. Clearly, our ability to allow arbitrary network users to access materials within the UMDL is dependent on intellectual property issues.
Where intellectual property constraints do exist, our publishing partners have expressed an interest in exploring use of the Internet to allow for widespread distribution of the information, although not necessarily at no cost. (As part of their contribution to this project, the publishers are making their intellectual property available at no cost for use by the direct project partners.)
In particular, both this project and our publishing partners are interested in exploring mechanisms for allowing widespread use of the UMDL with some type of royalty recovery arrangement. Since a significant component of the research activity involves the integration of cost considerations into the information finding process, this offers a natural path towards exploring how even data with intellectual property constraints could be made available to all Internet users.
In order to handle requests from non-project researchers for access to components of the testbed that are not publicly available, we plan on initiating a formal application process. As collections become available, we will make announcements over the Internet and in other publications appropriate to the subject matter, and solicit applications for access. Once a substantial collection base has been established, this will be done at least quarterly. Applications will be solicited both from research end users interested primarily in access to the developing collection and from systems researchers who would like to investigate interfacing their own clients and servers to the federated environment that will be developed. In particular, especially in the later years of the project, we are very interested in integrating servers developed by others into the testbed, and gaining practical experience in how well external services fit with our environment.
Applications will be evaluated by a team of project personnel headed by the director of testbed activity, and then submitted to the project operating committee for concurrence. We will base access decisions on the quality of the match between the needs of external researchers and our ability to provide access that meets those needs (in terms of the contents of the collection, and the methods of access needed by the external researchers). Due to intellectual property constraints, some supplementary funding may be required from these external researchers in order to compensate intellectual property providers for access. We will pursue the guidelines identified by NSF to respond to this need. External users must also agree to work with the evaluation and assessment team in understanding how successful their use of the testbed has been.
As with any library, the digital library exists to assist people in finding information. In the course of work in developing and studying basic ideas on the design of the UMDL, it is also vital that people are able to use the facilities of the library to locate, identify and integrate appropriate information which is of value to them. This is an important component of any research into information systems, but it is especially crucial here since this project will involve a great deal of innovative design work and its intent is to help point the way to an architecture for future information environments.
Although systems evaluation over the last few decades has been inadequate for designers, in the last few years (since 1988) there has been a renaissance of interest in how users evaluate information and information items. A number of researchers have taken up these issues and begun to investigate them in new ways and from new perspectives. There is increasing concern with the criteria which users employ in making judgments, a greater connection with other areas of research (especially research focusing on users and their needs), the introduction of qualitative methodologies and new quantitative ones, and a fundamental re-examination of exactly what users are looking for and how they decide on what is presented to them.
Specifically, the UMDL will be ev