Validating quality in large-scale digitization

This project studied quality in large-scale digitization, with a particular interest in the errors introduced by the digitization process and the difference those errors make to users of digitized content. The project examined books digitized by the University of Michigan and the more than 80 partner institutions that make up the HathiTrust Digital Library. It also analyzed the potential impact of errors on educational and scholarly use within a representative set of use cases: reading online, printing copies, mining texts, and managing print collections.

Start date: 9/1/2008
End date: 8/31/2012


Ongoing mass digitization of books and serials is generating vast digital collections and transforming education and research at all levels. However, these efforts have also raised questions about the value of the digital copies produced by such large-scale projects. For digital repositories and their communities of users to trust that deposited objects have the capacity to meet the uses envisioned for them, repositories must validate the quality and fitness for use of the objects they preserve.

This project examined some of the questions concerning the value of digital copies: What difference do imaging errors and missing pages make? Do these errors get in the way of learning? Do they prevent people from finding what they want? Do they prevent people from understanding books in an online environment?

To assess the frequency and severity of errors and investigate possible methods for detecting and measuring errors and other quality issues, the research team defined 11 error types and developed a six-point severity scale characterizing levels of loss in readability and content. A review staff, trained for consistency, assigned a severity score to each perceived error in displayed page images. The research team and review staff coded error data on more than 350,000 individually sampled page images from 3,000 volumes; 690,000 pages for whole-volume error from 2,000 volumes; and physical-characteristic data from over 1,500 volumes.
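A coding scheme like the one described above can be modeled as a simple data structure: each observation pairs an error type with a severity score on the six-point scale. The sketch below is a minimal illustration of that idea; the error-type names, identifiers, and the 0-5 scale semantics are assumptions for demonstration, not the project's actual taxonomy or instrument.

```python
from dataclasses import dataclass
from collections import Counter

# Illustrative error-type labels. The study defined 11 types; these
# names are hypothetical stand-ins, not the project's published list.
ERROR_TYPES = [
    "thick_text", "broken_text", "blur", "skew", "crop",
    "obscured", "warp", "bleed_through", "missing_page",
    "duplicate_page", "foldout_unopened",
]

@dataclass
class PageError:
    """One coded observation: an error type plus a severity score
    on an assumed six-point scale (0 = no loss ... 5 = total loss)."""
    volume_id: str
    page_seq: int
    error_type: str
    severity: int

    def __post_init__(self):
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {self.error_type}")
        if not 0 <= self.severity <= 5:
            raise ValueError("severity must be on the 0-5 scale")

def severity_distribution(observations):
    """Tally how many coded errors fall at each severity level."""
    return Counter(o.severity for o in observations)

# Hypothetical coded sample (volume IDs are made up)
obs = [
    PageError("vol-0001", 12, "blur", 2),
    PageError("vol-0001", 13, "skew", 1),
    PageError("vol-0002", 5, "missing_page", 5),
]
print(severity_distribution(obs))
```

Aggregating such per-page records across sampled volumes is what supports frequency and severity estimates at collection scale.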

The study produced four distinct clusters of results: an error model, data-gathering interfaces, data analysis, and user validation studies. These results represent a significant contribution to the field of information quality and are designed to inform digital repositories in assessing the quality of objects they have committed to preserving at scale. Understanding how to judge the quality of HathiTrust digital deposits will help libraries make future decisions about re-digitization of materials and about managing print collections when secure and usable copies are held in digital repositories. The ability to assess and document the quality of volumes will pave the way for certifying volumes for specific uses, enhancing the decision-making of users and stakeholders when selecting a volume or set of volumes for particular purposes.

To hear Paul Conway talk about the project, watch his YouTube interview:



The Institute of Museum and Library Services' National Leadership Grants for Libraries program enhances the quality of library services nationwide by supporting innovative projects that can be widely replicated. Areas of funding include education, research, digitization, and library-museum collaboration.


The Andrew W. Mellon Foundation’s grant-making philosophy is to build, strengthen and sustain institutions and their core capacities by developing thoughtful, long-term collaborations with recipients and investing sufficient funds to achieve meaningful results.