Reproducibility Matters: Yale Experts Lead Sessions on Research Integrity and Best Practices

Good science requires more than a good experiment. It invites others to assess the work and to reproduce and replicate the results. To do so, researchers need to provide access to their methods and data.
Facing growing pressure from funders and evolving journal standards, the Institution for Social and Policy Studies (ISPS), the Data-Intensive Social Science Center (DISSC), and Yale Library are holding a series of training sessions for faculty, students, and staff about open and reproducible research (ORR).
“These sessions are not just about compliance,” said Limor Peer, associate director for research and strategic initiatives at ISPS and ORR program lead at DISSC. “They are about teaching scholars how to share their research with the same rigor they apply to conducting their research. Attendees come away with practical tools, guidelines, and a sense of community around the shared challenge of making science more transparent and trustworthy.”
Last week, Peer launched the series with a session on who reproducible research is for and why it matters. She began by clarifying the differences between reproducibility (the ability to produce the same results with the same data and the same code), replicability (the ability to reach similar conclusions using new data and independent methods), and robustness (the degree to which results hold under different assumptions, models, or analytical choices).
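In computational work, these distinctions can be made concrete with a toy example. The sketch below (illustrative only, not from the session) treats a fixed random seed as "the same data" and a different seed as "new data":

```python
import random

def analysis(seed):
    # A stand-in "analysis": draw a sample and compute its mean.
    rng = random.Random(seed)
    sample = [rng.gauss(0, 1) for _ in range(1000)]
    return sum(sample) / len(sample)

# Reproducible: same code and same data (seed) give the identical result.
assert analysis(7) == analysis(7)

# Replicable: "new data" (a different seed) should give a similar,
# though not identical, estimate of the same underlying quantity.
print(analysis(7), analysis(8))
```

Robustness, by contrast, would ask whether the conclusion survives swapping the mean for, say, a median or a trimmed estimator.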
Reproducibility strengthens scientific findings by enabling peers to verify results and assess how well they generalize.
“Really what we want to be able to do is codify, expand, and instill practices that follow scientific principles, which include investigating, testing, and self-correcting when warranted,” Peer said. “These practices will also improve your productivity, allow you to verify your own results, and enable other people to extend the work.”
Anthony Lollo, director of data science and analytics at Yale School of Public Health, and Maurice Dalton, DISSC’s data engineering and solutions lead, also contributed to the first session.
Lollo presented a detailed walkthrough of his team’s journey preparing a replication package for their paper on Medicaid privatization in Louisiana. The project spanned seven years, involved data limited by federal privacy laws protecting medical information, and required coordination across multiple researchers and programming languages.
“Reproducibility is not optional,” Lollo said. “We knew that if the numbers were changing in figures or tables and we didn’t know what was going on, those were going to be red flags for data editors.”
Because the project took seven years, the researchers worked to minimize what software engineers call “technical debt,” the accumulated cost of shortcuts, quick fixes, or deferred maintenance in a codebase or technical system that eventually makes future work harder, slower, or more error prone.
“It was crucial to carve out specific time to solve these challenges throughout the project, even if it slowed research progress in the short term,” Lollo said.
And he stressed the importance of documentation. Lollo’s team assembled a report about a dozen pages long, including citations, environment setup, and file lists. Due to privacy protections, they could not share their data, but they documented how others could obtain it and replicate the work.
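The contents of such a document vary by journal, but a replication README of the kind Lollo describes typically covers a handful of standard sections. The outline below is a generic sketch, not Lollo's actual report:

```text
README for replication package
1. Data availability statement  - sources, access conditions, and
   instructions for obtaining any restricted files
2. Computational requirements   - software and package versions,
   hardware, approximate runtime
3. Directory and file list      - what each script and dataset is
4. Instructions for replicators - the order in which to run the code
5. Citations                    - references for data and software
```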
Dalton, whose paper on hospital quality indicators and closures has been conditionally accepted for publication, focused on best practices for reproducibility while still in the process of building his replication package. His process has included an artificial intelligence tool that scraped replication documentation from the American Economic Association website; he then used the tool to auto-generate metadata and a draft README document, restructure the data repository to the journal’s standards, and create synthetic data for testing.
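Synthetic data of the kind Dalton mentions lets replicators exercise the code when the real data cannot be shared. A minimal Python sketch, with a hypothetical schema standing in for a restricted hospital dataset (the column names are illustrative, not Dalton's):

```python
import random

random.seed(42)  # fix the seed so the synthetic file is itself reproducible

def make_synthetic_rows(n):
    """Generate fake rows that match the real data's schema, not its values."""
    rows = []
    for i in range(n):
        rows.append({
            "hospital_id": f"H{i:04d}",                      # fake identifier
            "quality_score": round(random.uniform(0, 100), 1),
            "closed": random.random() < 0.1,                 # ~10% closures
        })
    return rows

rows = make_synthetic_rows(500)
print(len(rows), sorted(rows[0]))
```

Because the synthetic rows share the real data's structure, the full analysis pipeline can be run end to end, on any machine, without touching protected records.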
“For me, using AI produced a good first draft,” Dalton said. “I’ll still need to review the replication package and tweak it, but having that initial draft will save me time.”
DISSC promotes open and reproducible research by providing infrastructure, centralized resources, and expert consultation through its Open and Reproducible Research Program. In April, DISSC and the Tobin Center for Economic Policy hosted Lars Vilhuber, executive director of the Labor Dynamics Institute at Cornell University and data editor for the American Economic Association, for a pair of lessons on best practices for reproducible science.
On Oct. 21, the organizers will host another hybrid in-person and Zoom event from 9 a.m. to 10 a.m. in ISPS classroom 108 at 24 Hillhouse, on “Tips from a Data Archive: Preparing and Working with Replication Packages.” On Nov. 18, at the same time and location, the Yale community is invited to attend a session on “Basics of Research Data Management.” Register to attend either or both.
Ron Borzekowski, executive director of DISSC, encouraged any faculty, students, or staff members engaged in research to attend the series and sign up for the DISSC newsletter to learn about upcoming events.
“With our partners across the university, we’re helping researchers do better science by offering the tools, support, and infrastructure they need to make their work open, reproducible, and reliable,” Borzekowski said. “Bringing researchers together through events like these allows us to learn from one another and raise the standard of scientific inquiry.”