Integrated research support, training and data documentation

Professor Lagoze’s work is a subcontract of a larger project headed by Cornell University’s John Abowd and will contribute to the design of the Comprehensive Census Bureau Metadata Repository (CCBMR), a curation system that permits synchronization between the public and confidential versions of the repository. The project focuses on finding new ways to unlock social, behavioral and economic data collected by the U.S. Census Bureau.

Start date: 10/1/2012
End date: 9/30/2016


The Confidential Information Protection and Statistical Efficiency Act of 2002 required every statistical agency in the United States to take custody of confidential micro-data used for its work, leading to the demise of public-use micro-datasets as a cornerstone of empirical research in the social sciences. While it is still feasible to create public-use data without breaching confidentiality, scholars are pursuing research programs that require identifiable data, such as geospatial relations, exact genome data, networks of all sorts, and linked administrative records. Researchers must therefore obtain authorized, restricted access to confidential identifiable data and perform their analyses in secure environments.

Researchers are allowed to publish results that have been filtered through a statistical disclosure limitation protocol, but this process hampers scientific scrutiny because researchers cannot effectively share restricted-access data with other scholars. This problem is impeding the "acquire, archive, and curate" model that dominated social science data preservation in the era of public-use micro-data. 

This project will bridge the transition to restricted-access data and offer the scholar, the scientific community, and the custodial agency a path to long-term data preservation. It will seek to generate new research methods and advanced practices and procedures, and to relate basic research findings to the core missions of the federal statistical system. Those missions include collecting data that serve the public interest through censuses, surveys and administrative records while respecting the privacy of individual citizens and businesses.

The Comprehensive Census Bureau Metadata Repository (CCBMR) will be a Data Documentation Initiative-based curation system designed and implemented to permit synchronization between the public and confidential versions of the repository. The scholarly community will use the public version of the CCBMR as it would use a conventional metadata repository; that version will lack only the values of certain confidential items, not their metadata. The authorized user, working on the secure Census Bureau network, will use the CCBMR with full information in authorized domains.
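The split between the public and confidential views described above can be illustrated with a minimal sketch. It is not the CCBMR design or the Data Documentation Initiative schema: the `VariableRecord` type, its fields, the `public_view` function, and the sample records are all hypothetical and exist only to show the idea of withholding confidential values while keeping their metadata.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class VariableRecord:
    """Hypothetical metadata record for one variable (not a DDI structure)."""
    name: str                       # variable name, always visible
    label: str                      # descriptive label, always visible
    confidential: bool              # True if the values must stay inside the secure network
    summary: Optional[dict] = None  # value-level information, e.g. ranges or counts


def public_view(record: VariableRecord) -> VariableRecord:
    """Return the record as a public copy would show it.

    Metadata (name, label, confidentiality flag) is kept; value-level
    information is withheld for confidential variables.
    """
    if record.confidential:
        return replace(record, summary=None)
    return record


# Made-up example records standing in for the confidential repository.
confidential_repo = [
    VariableRecord("state", "State of residence", False, {"categories": 51}),
    VariableRecord("earnings", "Annual earnings", True, {"min": 0, "max": 250000}),
]

# Deriving the public copy from the confidential one keeps the two in step.
public_repo = [public_view(r) for r in confidential_repo]
```

In this sketch the public copy is always derived from the confidential one, so both list the same variables and differ only in the withheld values.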

This study will also teach doctoral students how to develop research programs using restricted-access Census Bureau data and repository tools developed in this project and previous projects. The same tools will be used to develop algorithms to improve the integration, editing, and imputation models that assemble the micro-data used for the Census Bureau's employer-employee database.

The CCBMR, the education based on this repository, and the collaborative computational statistics model all can be generalized to meet the restricted-access research requirements of other statistical agencies. These tools allow statistical agencies to harness the efforts of researchers who want to understand the structure and complexity of confidential data in order to propose and implement reproducible scientific results. 

Future generations of scientists will be able to build on these efforts because long-term data preservation in the CCBMR will operate on the original scientific inputs, not inputs subjected to statistical disclosure limitation. This curation will result in a viable system for enforcing data management plans on projects, ensuring that results can be tested and replicated by future scientists. 

Joining Lagoze and principal investigator John Abowd on this project are William Block, Warren Brown, Stefan Kramer and Lars Vilhuber, all from Cornell. The press release announcing funding for this project is available on the website of Cornell’s Industrial and Labor Relations School.


NCRN-MN: Cornell Census-NSF Research Node: Integrated Research Support, Training and Data Documentation, National Science Foundation: $139,649


The National Science Foundation (NSF) is an independent federal agency created by Congress in 1950 "to promote the progress of science; to advance the national health, prosperity, and welfare; to secure the national defense…"