Yale’s New Dataverse: Enhancing Research Transparency and Reproducibility

By Rick Harrison
March 18, 2025


In October 1960, Fred Greenstein, then a junior faculty member who had recently earned his Ph.D. at Yale, sent a letter to the United Nations seeking all information available on international data stored on IBM punch cards.

In an internal memo four days later, he appealed to the executive committee of Yale’s Political Science Research Library, of which he served as director, to invest in the preservation and sharing of detailed congressional district data, collected by Ph.D. candidate Leroy Rieselbach, as a research boon for students and faculty.

“For a minimum expense (using bursary labor and mimeograph), the library could sponsor and circulate the material Rieselbach describes,” Greenstein wrote. “In return we could have not only a general altruistic satisfaction, but also an opportunity to make people elsewhere aware of our existence. Such a publication invariably would be cited in journals over a period of several years. In addition, it would enable us to offer something to the various research groups from which we solicit data.”

Those seeds and others like them have grown into a deep forest of data that has fueled Yale research for decades. And earlier this month, the university launched the Yale Dataverse, a campus-wide repository to maintain and share such raw materials.

The Yale Dataverse originated as part of a broader effort toward open and reproducible research at the Institution for Social and Policy Studies (ISPS) and Yale’s Data-Intensive Social Science Center (DISSC). It is now run by Yale Library with support from Yale’s Information Technology Services and the Office of the Provost.

“Good science not only requires good data, good study design, and careful analysis but also a reliable way to store the data and allow others to check your work and build upon it,” said Ron Borzekowski, executive director of DISSC. “We are so pleased that DISSC could play a part in establishing this new resource for our colleagues across the university.”

A lot has changed since 1960. Greenstein, who died in 2018 at the age of 88, taught politics at Princeton University for nearly three decades and helped define and redefine how to understand the leadership styles of American presidents. Rieselbach, professor emeritus of political science at Indiana University, has written books and taught about national institutions and the legislative process.

The data on which they and their colleagues built their careers have migrated from hardbound books, paper photocopies, punch cards, and magnetic tape to solid-state drives and cloud storage available for near-instant retrieval around the world.

Currently, Yale stores about 7 petabytes of “active” data (1 petabyte = 1 million gigabytes) in more than 6 billion files, much of it designated as research materials. Another 9.6 petabytes of data are archived.

In the 2000s, other institutions created centralized repositories for digital research data, while Yale encouraged researchers to share their data within separate discipline-specific locations. In 2011, ISPS created a custom data archive, working with faculty members and graduate students to enable long-term usability of the data and reproducibility of the research.

“The ISPS data archive aligns with the core mission of ISPS to uphold the very best practices in all aspects of social science research,” said Limor Peer, associate director for research and strategic initiatives at ISPS and senior research support specialist at DISSC. “And to back it up with resources.”

In 2018, ISPS launched YARD (Yale Application for Research Data), an open-source web application for reviewing and enhancing research outputs that feeds into the ISPS data archive.

ISPS staff work with social science researchers to review their files, documentation, data, and any code for studies prior to publication. This process has influenced and informed the development of similar practices in other social science data archives that recently joined together in a consortium called Curating for Reproducibility (CURE).

“We make sure that people can use the digital material as intended,” Peer said. “If there’s a data file, we make sure they can open it and understand what’s in it. If there’s a script, we ensure they can run it, and it works. We want the data and code to be as usable as possible to the scientific community for as long as possible, so the community can do science.”

But outside of ISPS and its affiliates, faculty members have often had to find their own storage solutions to meet the standards for data management, reproducibility, and ease of sharing that research funders and academic journals increasingly demand.

“When researchers submit grant proposals, they need to be able to describe how they plan to manage their data and ensure that it be safe and accessible for the long term,” said Rebecca Dikow, director of computational methods and data at Yale Library. “This has become a growing burden that Yale Dataverse can alleviate.”

In 2019, ISPS began to meet with the library to build what would become Yale Dataverse, based on a platform created at Harvard University and eventually made available for Yale’s use.

In 2022, DISSC arrived on campus and joined the effort, having emerged from a series of recommendations by a committee of social scientists from across the university. DISSC now serves as a campus hub to manage and facilitate the collection, protection, and utilization of new, frequently large datasets that are currently revolutionizing fields like political science, economics, psychology, and sociology.

“This has been more than a seven-year journey to bring Yale Dataverse to fruition,” Peer said. “We are incredibly grateful to our peers at Harvard and our partners across the university.”

The Yale Dataverse team explored the metadata needs for Yale researchers to efficiently search for files, piloted the process of uploading data with select faculty members, demonstrated the capacity to harvest data from other repositories, documented the process to create non-Yale collaborator accounts, and worked with faculty members to test the system.

“We have intentionally broken this many times to ensure its robustness,” said Barbara Esty, head of data services for Yale Library, noting the disparate, uncoordinated methods that researchers currently use to manage their data, such as websites and external hard drives. “We are trying to create a system that works for us, using what we have learned from our experience elsewhere.”

Joshua Kalla, ISPS faculty fellow and associate professor of political science, said that as a researcher, he values how this new repository enhances research transparency and strengthens the reproducibility of scholarly work.

“The ability to easily archive, discover, and access research data through Yale Dataverse will help ensure our findings can be properly replicated and built upon by other scholars, expanding our research’s impact and reach,” Kalla said.

ISPS Director Alan Gerber, Sterling Professor of Political Science, sees the repository as an extension of rigorous scientific work.

“There is a growing expectation that researchers make the data underlying their work available in a reliable and easy-to-use form,” Gerber said. “It is great to see how the Yale Dataverse is dramatically expanding what ISPS has done to build tools and infrastructure in support of this collective scientific project.”

The library aims to recruit a dedicated research data management librarian for the Dataverse, improve the interface, and integrate other systems like YARD.

“My goal would be for this service to get to a point where once the researcher has completed their research and deposited their data, they don’t need to worry about it,” said Dale Hendrickson, senior director for library information technology. “What better place than the library to play a role in that, since that’s what we do.”