MIDAS Talk: Arthur Spirling
340 West Hall
Word Embeddings: What works, what doesn't, and how to tell the difference for applied research
We consider the properties and performance of word embedding techniques in the context of political science research. In particular, we explore key parameter choices (including context window length, embedding vector dimensionality, and the use of pre-trained versus locally fitted variants) with respect to the efficiency and quality of the inferences these models make possible. Reassuringly, we show that results are generally robust to such choices for political corpora of various sizes and in various languages. Beyond reporting extensive technical findings, we provide a novel crowdsourced "Turing test"-style method for examining the relative performance of any two models that produce substantive, text-based outputs. Encouragingly, we show that popular, easily available pre-trained embeddings perform at a level close to, or surpassing, both human coders and more complicated locally fitted models. For completeness, we provide best-practice advice for cases where local fitting is required.
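For readers unfamiliar with the "context window length" parameter mentioned above, a minimal toy sketch (not the speaker's code, and agnostic to any particular embedding model) shows what the window controls: which neighboring words count as a target word's context when an embedding is fit.

```python
# Toy illustration only: real embedding software (word2vec, GloVe, etc.)
# handles this internally. The window size determines how many tokens on
# each side of a target word are treated as its "context".

def context_pairs(tokens, window):
    """Return (target, context) pairs for a symmetric window of the given size."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the minister addressed the house today".split()
# A window of 1 uses only immediate neighbors; a larger window pulls in
# longer-range context, which tends to emphasize topical similarity.
print(len(context_pairs(sentence, 1)))  # 10 pairs
print(len(context_pairs(sentence, 3)))  # 24 pairs
```

Varying `window` (along with vector dimensionality and the pre-trained versus locally fitted choice) is exactly the kind of decision whose robustness the talk examines.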
Arthur Spirling is Professor of Politics and Data Science at New York University. He received a bachelor's and a master's degree from the London School of Economics, and a master's degree and PhD from the University of Rochester. Spirling's research centers on quantitative methods for social science, especially those that use text as data and, more recently, deep learning and embedding representations. His work on these subjects has appeared in outlets such as the American Political Science Review, the American Journal of Political Science, the Journal of the American Statistical Association, and conference proceedings in computer science. Substantively, he is interested in the political development of institutions, especially in the United Kingdom.
Sponsored by the Michigan Institute for Data Science (MIDAS).