Common Data Problems
At the recent Yale Day of Data, researchers spoke about practical challenges they face with the data they use in their research. These challenges require swift, decisive, and first-rate solutions if researchers are to continue to produce excellent scholarship. In this short post, I will use Provost Ben Polak’s introductory remarks at the Day of Data as a framework for categorizing these “data problems”: data access, methods, and storage. I will rephrase these terms slightly and expand on them here.
First, access to data. This refers to the fundamental need scholars have for data to do their research, test hypotheses, and build theory. The essential “data problem” for researchers here is, “how do I get my hands on some awesome data?” Data of all sorts – administrative, survey, images, GIS, observational, qualitative, etc. – are the bread and butter of research, necessary for researchers to carry out their work. The critical issues here are how to get these data efficiently and seamlessly, and then how to manage them once you have them, especially when working with collaborators.
For example, Walter Jetz (Ecology & Evolutionary Biology) described his lab’s development of an application to collect data about species from citizen scientists. In another project, by the Yale Center for Earth Observations, Jetz explained how remote sensing data are collected, noting that keeping track of, organizing, and documenting such data presents real challenges. Michael Krauthammer (Medicine) discussed the challenges of determining that the right data are collected in the first place, managing the incoming flow of data, and assigning responsibility to various people in that flow. David Rand (Psychology) described collecting data for online experiments through Amazon’s Mechanical Turk, and Sarah Demers (Physics) talked about the effort of collecting vast amounts of data, and then making a decision in real time about which data to keep. In addition to collecting the data, her group also records the conditions under which the data were collected. Pieter van Dokkum (Astronomy) described a host of infrastructure challenges, from server back-ups to facilitating far-flung collaborations, displaying and interacting with images, data management, debugging, and version control for software. And Gregory Huber (Political Science) said one of his “data nightmares” is losing access to data he has collected, due to things like back-up failure, aging software, or lack of documentation.
The second issue is what the Provost calls data methods, which I think of as “data literacy.” It refers to the tools and the techniques for making sense of the data. The “data problem” here is, “how do I analyze and interpret these data?” The analytical approaches tend to be discipline-specific and have historically developed idiosyncratically, but making sense of data is a universal goal of all research endeavors. As van Dokkum noted, all researchers are engaged in turning raw data into “data products,” that is, into something people can actually understand and use. Huber echoed that idea when he described the data he frequently works with as often heterogeneous, compiled, and merged from various sources. The work of the researcher, he said, is to use the raw data and turn them into a work product; the value of the work product is realized only when all elements are shared – the data, the code, and the documentation of the entire process. This requires knowledge and skills ranging from experimental design, to sampling technique, to following IRB protocols, to scraping code, to proficiency with statistical analysis and more. However, some methods and approaches can be applied more broadly, especially in certain contexts or via interdisciplinary research efforts. For example, the notion of “big data” can facilitate a convergence of methods, tools, and approaches. At the Yale Day of Data, Nicholas Christakis (Sociology, Human Nature Lab) and Andrew Papachristos (Sociology) demonstrated how diverse sources and types of data combine for analyzing networks and answering a fascinating set of questions.
Part of the challenge here is to support researchers by providing the tools and infrastructure they need for analysis. The other part of the challenge is to make this type of activity entirely familiar and integral to education: As the Provost said, both access to data and knowing what to do with data are as fundamental to students’ experience as writing instruction.
A third issue is data storage, which can more accurately be named “data stewardship.” This refers to the question of what to do with the data that researchers collect and analyze in the course of doing their research – not just in terms of storage, but in terms of care. The central “data problem” is, “who should take care of these data?” and numerous questions cascade from this first one: What do I do with these data now that the research is complete? What data should be cared for? For how long? By whom? What does “care” really mean? Who owns these data? Who has the right to keep these data? What data could be shared publicly? At the Yale Day of Data, faculty from various disciplines expressed the same concerns about sharing data. Several faculty presenters said they share data via their websites. Krauthammer emphasized that researchers should be allowed to share their data on their own terms, and warned that imposing standards and mandates may backfire. Christakis noted that the desire to have freely accessible data needs to be balanced with privacy and confidentiality considerations. Similarly, Huber said that researchers may need to protect their data and only release public versions of the data. The policies, standards, and additional work involved in doing that are a burden on researchers. Huber pointed to the ISPS Data Archive as a model solution and a resource for researchers on those issues.
Solutions to these various problems are still evolving, unevenly, across the disciplines. But, as Yale University Librarian Susan Gibbons observed, similar “data problems” are shared by researchers in all disciplines. At the Day of Data, there was consensus that institutions could and should also support the disciplines and their researchers. Researchers’ data problems need to be solved so they can excel at what they do: produce top-notch research.