A Look at Essential Data-Related Challenges for Research in Political Science
Imagine you want to collect data in a war zone. How do you safeguard the data, as well as the people involved in gathering and providing it? Where do you keep it? How do you collaboratively work on the data with the rest of the research team? Do you publicly share it? Where? And what goes into preparing the data for sharing?
These were some of the questions discussed by ISPS Faculty Fellow Jason Lyall in a fascinating talk earlier this month at Yale.
Lyall’s presentation was part of the Day of Data Spring Discussion Series organized by Yale's Data and eScience Group (DaEG). The presentation, entitled "Archives, Satellite Imagery, and Everything in Between: Reflections on Data Management in Political Science," responded to a set of questions the DaEG formulated to get a better sense of the workflow of a single researcher working on diverse projects, specifically with respect to data collection, re-use, and long-term preservation.
First, some relevant background:
Professor Lyall is Associate Professor of Political Science at Yale University, a faculty fellow at the Institution for Social and Policy Studies, and affiliated with the Jackson Institute for Global Affairs and the MacMillan Center for International and Area Studies. Since April 2012 he has been the Technical Adviser for USAID's MISTI initiative in Afghanistan.
He is currently working on three projects related to international conflicts:
- The Changing Patterns of Warfare: By exploring primary source materials in 16 languages, Lyall and his team are compiling a data set of conflict events to explore conflict worldwide more comprehensively than has previously been possible.
- MISTI-Minerva: A complex project that involves conducting impact evaluation of USAID and Department of Defense programming on insurgent recruitment and behavior in Afghanistan.
- The Coercion Lab: A new project to conduct survey-based experiments, lab experiments, and lab-in-the-field experiments on the effect of violence on a population.
In the fall of 2013, DaEG organized Yale’s first Day of Data. In his talk, Provost Ben Polak identified three major challenges that face researchers working with data at academic institutions: access, storage, and “everything in between” (methods/tools for analysis and management) (see summary).
Professor Lyall's experiences certainly fall within the three challenges the Provost identified, but they also go beyond them in ways that are both unique to his projects and common to many researchers in similar disciplines.
Access:
In the Changing Patterns of Warfare project, Lyall is drawing information directly from primary source materials to build an entirely new data set. This involves the time-consuming, elaborate data entry required to turn written knowledge, memoirs, and photographs into numeric data that can be analyzed. This is not just a problem of access to data sources, but a problem of access to and management of all kinds of primary source information, in various digital formats, spanning centuries.
The MISTI-Minerva project includes massive population surveys done bi-annually and the integration of bombing coordinates and satellite imagery to form a more complete data set to analyze. Access here is really about the ability to obtain and compile the appropriate data needed to answer research questions.
Methods/Analysis (and everything in between):
Professor Lyall emphasized that, like many researchers doing diverse work in the social sciences, he has no standard infrastructure to refer to for data collection, management, and analysis when conducting projects, and has to rebuild his infrastructure every time he begins a new project in order to meet the needs and specifications of that project.
In addition to selecting tools and methods and building an infrastructure, Lyall needs access to these tools in the field, often without a reliable Internet connection. He also has extraordinary requirements for safeguarding his team and the data they're gathering in Afghanistan, including provisions for quick evacuation, and the strict anonymity of survey respondents.
Lyall noted that the key to success for his research logistics is a dedicated project manager who interfaces on logistics and data with the rest of the team and is paid from his own research funds. She also performs quality control on the data after the students have worked on it and before it reaches Lyall himself. Lyall indicated that at least two people are really needed to run a project: to administer it adequately and also to handle training, protocols, and data management.
For analysis of his data, he partners with his colleagues in Princeton's Political Science department, which has a dedicated high-performance computing center.
Storage:
When the Provost mentioned storage at the Day of Data 2013, he contextualized it as a response to funder mandates. While the storage of massive amounts of data is indeed a challenge for many disciplines, the preservation and long-term accessibility of data is a greater challenge facing researchers. Mandates increasingly treat not only the long-term retention of research products, but also their long-term accessibility and availability, as a primary concern. This means that digital data must not only be stored, but also maintained through format migrations when necessary, and sufficiently backed up and preserved to guard against natural disasters, hardware failures, and bit rot. In addition, the data will have to be made available through some platform -- whether a website, a dedicated repository for digital materials, or another solution.
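To make the idea of guarding against bit rot concrete, one common approach is a fixity check: record a checksum for every file when the data is deposited, then recompute the checksums later and compare. The sketch below is only an illustration, not a description of any workflow used by Lyall or at Yale. The function names (`sha256_of`, `build_manifest`, `find_damaged`) and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its checksum (hypothetical helper)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose content no longer matches the manifest."""
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel_path)
    return damaged
```

In practice the manifest would be saved alongside, but separately from, each backup copy, so that a damaged file in one location can be identified and restored from another.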
For Lyall, the question of storage is one of security -- backups of data on physical drives in various locations that aren't networked, and field-usable encryption standards and formats. The data Lyall gathers for his projects is small in volume compared to that of other disciplines, but complex in terms of access and description. Lyall uses a variety of systems to provide for the long-term retention of his data products, including the ISPS Data Archive, Dataverse, and project websites.
A full write-up of Professor Lyall's talk and his slides are available on the DaEG site.
The next event in the Day of Data Spring Discussion Series is on May 15 from 9:30 - 11:00 at the Center for Science and Social Science Information and features Kelly M. Anastasio discussing Yale's implementation of OnCore for clinical research data management.
Michelle Hudson is Science and Social Science Data Librarian at Yale University.