Using AI to Make a Century of Congressional Speeches Searchable and Analyzable
Tuesday, 05/05/2026
Last Updated: Tuesday, 05/05/2026
University of Michigan School of Information assistant professor Dallas Card is featured in this story written by Nicole Frawley-Panyard and published with the Initiative for Democracy & Civic Empowerment at U-M.
When researchers like Dallas Card, Assistant Professor in the School of Information, and Sandhya Srivinas, Research Fellow at the Ross School of Business, try to study how political attitudes have shifted over time, they often turn to one of the most comprehensive archives of American democracy: the Congressional Record. The Record captures everything said on the floors of the U.S. House and Senate, providing an unparalleled view of public discourse and policymaking over more than a century.
But there’s a problem.
“The PDFs are massive, messy, and practically unusable for large-scale analysis,” Card explained. “We’re talking about thousands of pages per session. The text is technically public, but functionally inaccessible.”
Card’s project, funded through the University of Michigan’s Center for Ethics, Society, and Computing (ESC) and the Initiative for Democracy & Civic Empowerment (DCE), aims to change that. He and his team are using machine learning and optical character recognition (OCR) to transform decades of congressional transcripts into a structured, searchable, and analyzable format that can support both scholarly research and public understanding.
Unlocking a Democratic Archive
The first phase of the project focuses on the technology itself. In practice, this means teaching computers to accurately read and organize historical documents that were never designed for computational analysis.
“The modern government produces both PDF and HTML versions of the Congressional Record,” said Card. “The HTML versions are much easier to work with, but they only go back so far. For older sessions, all we have are PDFs or printed books sitting in law libraries.”
To bridge that gap, Card’s team is refining OCR tools that convert scanned text into structured data by identifying who is speaking, what they said, and when. After testing several options, they settled on an open-source package called Surya, which combines text recognition with layout analysis.
“Surya not only reads the words,” Card noted. “It also interprets the structure by distinguishing headings, quotes, and paragraphs. That’s crucial for correctly mapping speeches and speaker attribution.”
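To make the idea of speaker attribution concrete, here is a minimal sketch of how OCR output might be turned into structured records of who spoke, what they said, and when. It does not use Surya's API and is not the team's actual pipeline. It assumes the OCR step has already produced plain text lines, and it assumes a new speaker turn begins with a title and a capitalized surname (for example "Mr. SMITH."). The names `Turn` and `split_turns`, and the sample lines, are hypothetical and invented for illustration.

```python
import re
from dataclasses import dataclass

# Assumed convention (not confirmed by the article): a new turn starts with
# a title plus a surname in capitals, optionally followed by "of <State>".
SPEAKER_RE = re.compile(
    r"^(Mr|Mrs|Ms|Miss)\. ([A-Z][A-Z'\-]+)"
    r"(?: of [A-Z][a-z]+(?: [A-Z][a-z]+)*)?\. "
)


@dataclass
class Turn:
    speaker: str  # who is speaking
    date: str     # when (ISO date of the session, supplied by the caller)
    text: str     # what they said


def split_turns(lines, date):
    """Group OCR'd text lines into speaker turns for one session date."""
    turns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from the page layout
        match = SPEAKER_RE.match(line)
        if match:
            # A speaker heading opens a new turn; keep the rest of the line.
            speaker = f"{match.group(1)}. {match.group(2)}"
            turns.append(Turn(speaker, date, line[match.end():]))
        elif turns:
            # Continuation lines belong to the most recent speaker.
            turns[-1].text += " " + line
        # Lines before the first speaker heading are ignored in this sketch.
    return turns


if __name__ == "__main__":
    # Invented sample lines, not real Congressional Record text.
    sample = [
        "Mr. EXAMPLE. Mr. Speaker, I rise today.",
        "I yield back.",
        "Ms. SAMPLE of New York. I thank the gentleman.",
    ]
    for turn in split_turns(sample, "1950-01-01"):
        print(turn)
```

A pattern like this is brittle on noisy scans, which is one reason layout analysis that separates headings from body text matters for attribution.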
Once complete, the team plans to release the data in open formats such as spreadsheets and JSON files, allowing anyone to download and work with the full text. A longer-term goal is to create an online interface where users can search by keyword, date, or speaker.
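As a rough illustration of what such a release could look like, the sketch below writes a list of speech records to JSON and to a spreadsheet-friendly CSV, and filters them by keyword, speaker, or date range. The field names (`date`, `speaker`, `text`), the helper names, and the placeholder records are all assumptions made for this example; the project's actual schema has not been published in this story.

```python
import csv
import io
import json

FIELDS = ["date", "speaker", "text"]  # hypothetical schema


def to_json(records):
    """Serialize records as a JSON array."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records):
    """Serialize records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def search(records, keyword=None, speaker=None, start=None, end=None):
    """Return records matching every filter that was provided.

    Dates are ISO strings (YYYY-MM-DD), so plain string comparison
    orders them chronologically.
    """
    hits = []
    for rec in records:
        if keyword and keyword.lower() not in rec["text"].lower():
            continue
        if speaker and rec["speaker"] != speaker:
            continue
        if start and rec["date"] < start:
            continue
        if end and rec["date"] > end:
            continue
        hits.append(rec)
    return hits


if __name__ == "__main__":
    # Invented placeholder records, not real speeches.
    records = [
        {"date": "1950-01-01", "speaker": "Mr. EXAMPLE",
         "text": "A placeholder remark about immigration."},
        {"date": "1990-06-15", "speaker": "Ms. SAMPLE",
         "text": "A placeholder remark about agriculture."},
    ]
    print(to_csv(records))
    print(search(records, keyword="immigration"))
```

A linear scan like this is only suitable for small samples; a public search interface over a century of text would more plausibly sit on a full-text index.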
“For researchers like me, having the full-text data dumps is the most useful part,” Card said. “But I would love to see this evolve into an interactive site where anyone could explore what their representatives have said about specific issues.”
From Data to Accountability
While the infrastructure work is the current focus, the motivation for the project is deeply tied to Card’s earlier research on political rhetoric, particularly around immigration. In previous work, his team analyzed more than 100 years of congressional speeches to trace how attitudes toward immigration have changed over time.
Using machine learning and linguistic analysis, they found that while rhetoric today is more positive on average than in the past, polarization around immigration has intensified sharply. With a cleaner and more complete dataset, Card sees an opportunity to revisit and extend that analysis.
“With this dataset, we can go further,” he said. “You can examine how individual members’ language shifts over time or how parties frame the same issue differently. Ultimately, it’s about transparency and making it easier to hold public officials accountable for what they have said.”
Although immigration will be the first policy area revisited, the tools themselves are not issue-specific. Any topic debated in Congress could be analyzed in the same way, opening the door to research on polarization, authoritarian language, or broader shifts in democratic discourse.
Ethics, Scale, and a Living Record
Unlike many AI-driven projects, this one raises relatively few ethical concerns. “If there’s any domain where analyzing text is ethically clear, it’s this one,” Card said. “These are public officials speaking in public, in an official capacity. Our goal is simply to make that information easier to access.”
The team is also deliberately avoiding massive, resource-intensive language models. Instead, they are using smaller, more efficient systems that can operate at scale without high computational or financial costs.
Looking ahead, Card hopes to have a preliminary version of the dataset available within the year, with future progress depending on additional funding and staffing. Beyond the technical milestones, his broader goal is to create a living public resource that supports scholarship, journalism, and civic engagement alike.
“It really is the story of democracy told in words,” Card said. “Every speech and every debate is there. We just need to make it readable.”