“Large Language Models: Replication and Robustness for Social Science,” Arthur Spirling, Princeton University

COMPUTATIONAL SOCIAL SCIENCE WORKSHOP
Abstract: Large Language Models (LLMs) are exciting tools: they require minimal researcher input yet make it possible to annotate and generate large quantities of data. However, there has been almost no systematic research into the reproducibility of research using LLMs. This is a potential problem for scientific integrity. In the first part of the talk, we present a theoretical framework for replication in the discipline and show that LLM work is perhaps uniquely problematic. We demonstrate the problem empirically using a rolling iterated replication design in which we compare crowdsourcing and LLMs on multiple repeated tasks over many months. We find that LLMs can be accurate, but the observed variance in their performance is often unacceptably high, and this affects downstream results. In the second part of the talk, we consider the effects of LLMs becoming more accurate than expert human coders. We show what this will mean for inference in downstream tasks: optimistically, estimated treatment effects will become larger, although claimed null effects may become more dubious. We argue that authors should focus more on sensitivity and robustness with respect to future technological change, and we demonstrate how to use local calibration for such problems.
Arthur Spirling is the Class of 1987 Professor of Politics at Princeton University. He received bachelor’s and master’s degrees from the London School of Economics, and a master’s degree and PhD from the University of Rochester. Previously, he served on the faculties of Harvard University and New York University. His research centers on problems at the intersection of data science and social science, including those related to machine learning and large language models. He currently serves as the Director of Princeton’s Center for Statistics and Machine Learning.
Lunch will be served.