Panel on Confidentiality and Open Access to Research Data at the International Digital Curation Conference
*** This post originally appeared on the Digital Curation Centre blog on January 31, 2013 ***
In his talk at IDCC, Ewan Birney, Associate Director of the EMBL European Bioinformatics Institute, described how the field of genomics has paved the way to large-scale data sharing. The field now publicly confronts the issues arising from breaches of privacy as a result of such sharing. An editorial in Nature titled “Genetic Privacy” was published just one day after the January 16 panel at IDCC (8th International Digital Curation Conference, Amsterdam, the Netherlands). The piece discusses the “privacy loophole” that was revealed when researchers at the Whitehead Institute for Biomedical Research in Cambridge, Massachusetts, were able to identify individuals whose genomes had been sequenced by cross-referencing their patterns of genetic markers with names from genealogical databases and public records containing age and location. The editorial calls upon the research community to come up with a solution soon, because the potential for exploiting the privacy loophole is likely only to grow. This panel suggested that our community is already well on its way.
At the heart of the issue is the tension between privacy and openness. Funders are beginning to demand open access, advocacy groups to expect it, journals to require it, institutions to encourage it, and even researchers are increasingly aware of the benefits of sharing their data. But how can we reconcile this momentum with the long-held academic tradition of protecting the identity of research subjects? That was the topic of a panel on confidentiality and open research data at the 8th International Digital Curation Conference in Amsterdam on January 16, 2013.
Jared Lyle, Director of Curation Service at the Inter-University Consortium for Political and Social Research (ICPSR) at the University of Michigan, opened by speaking about archiving and providing access to confidential social science data. The issue is disclosure risk: sensitive information could be linked to research subjects in ways that cause harm. While protecting human subjects’ confidentiality is a long-standing practice in the social sciences, Lyle described new challenges that this long-established repository (operating since 1962) faces from new types of social science data, such as video, sensor, and online behavior data. These data, if released, could be used to identify research subjects. ICPSR’s approach to minimizing the risk is comprehensive: safe data, safe places, and safe people. As a data archive, ICPSR puts extensive resources behind its goal of minimizing disclosure risk: data files are processed and vetted by humans and machines, and various techniques are used to ensure that no personally identifiable information remains in public versions of the datasets. When sensitive data cannot be removed or modified, ICPSR maintains them in secure environments and has developed ways to manage access to the files using secure technologies. Finally, ICPSR enforces compliance with its disclosure risk policies by training staff and by requiring users to sign user agreements, with more restrictive agreements for those accessing restricted data.
Louise Corti, Associate Director of the UK Data Archive (UKDA) at the University of Essex, which archives digital social and economic research data, spoke next about the disclosure spectrum and how to enable access. Corti laid out the case for and against open access, reminding us that access is not always restricted because the data warrant it: preparing the data for release can be costly and time consuming, the data may be of poor quality, or modifying them may render them less useful or unusable. Corti described the UKDA’s multiple use licenses, which respond to UK laws governing both open data and privacy protection. The UKDA also follows the safe data, safe places, safe people framework, applying techniques and licenses as needed according to the “open data spectrum.” Corti spoke of the particular challenges of anonymizing qualitative data and the strategies used by the UKDA. She ended her talk with a key question: “Who should do this work?” Her answer, basically, was “everyone along the data life cycle,” but repositories are the ones who have special responsibility and could also be the ones on the line.
Carl Lagoze, Associate Professor at the School of Information, University of Michigan, then talked about data management of confidential data. His talk focused on the issues that arise from increased use of restricted-access data. He reiterated the difficulty of reconciling the pressure to archive and share the inputs of scientific results with the increased use of inherently identifiable data (e.g., geospatial relations, exact genome data, etc.). Lagoze then discussed the problem of having restricted-access data in the provenance chain: it complicates the curation and knowledge discovery process. He described his team’s work on managing the exposure of metadata fields across open and restricted environments. The new tool, CED2AR, allows metadata ‘cloaking’ through DDI by coding each metadata element for the appropriate level of access, keeping restricted information behind the firewall. The tool is a novel solution that supports better curation of the data and more accurate identification, and it allows selective hiding of both data and metadata.
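To make the idea of coding each metadata element for a level of access more concrete, here is a minimal sketch in Python. It is not CED2AR's implementation and does not use the DDI schema; the two access tiers, the `MetadataElement` class, the `cloak` function, and the sample codebook entries are all hypothetical names and values invented for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Access(IntEnum):
    """Hypothetical access tiers, ordered from most open to most restricted."""
    PUBLIC = 0
    RESTRICTED = 1


@dataclass(frozen=True)
class MetadataElement:
    """One metadata field, tagged with the access level needed to see it."""
    name: str
    value: str
    level: Access


def cloak(elements, clearance):
    """Return only the elements that a viewer with `clearance` may see."""
    return [e for e in elements if e.level <= clearance]


# Illustrative codebook entries (made up for this sketch).
codebook = [
    MetadataElement("variable_label", "Household income", Access.PUBLIC),
    MetadataElement("value_range", "0 to 2,500,000", Access.RESTRICTED),
    MetadataElement("universe", "Adults 18+", Access.PUBLIC),
]

# The open environment sees the cloaked view; the restricted one sees it all.
public_view = [e.name for e in cloak(codebook, Access.PUBLIC)]
restricted_view = [e.name for e in cloak(codebook, Access.RESTRICTED)]
```

The point of the sketch is that a single metadata record can serve both environments: the tag on each element, rather than a separate copy of the codebook, decides what leaves the secure side.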
I came away from these excellent talks with a few thoughts: First, curating the data responsibly, including managing the tension between confidentiality and openness, is extremely labor intensive. Kudos to data repositories and archives for being responsible citizens and for investing resources into this essential area. Second, all three papers described excellent technical and workflow solutions embedded in the curation process – these are going to continue to be a vital part of data curation. Finally, the data curation community has shown leadership out of necessity and devotion to ethical research practices. But is that enough?
The broader research community also includes practitioners (that is, data producers, researchers) and policy-makers (at all levels, including institutional, national, disciplinary). But there are deficiencies at both ends: Practitioners need to take ownership of their data management practices (and to do that they need education and they need support; see workshops on data management planning and training). Policy-makers need to work toward clarifying and updating standards and figuring out how to apply them consistently.
As I said in my comments to kick off the panel, roles and oversight are currently unclear. Who is, or should be, responsible for protecting subjects’ personal information: the researchers, their institutional review boards (IRBs), the repositories that preserve and disseminate the data, or national governing bodies? Who decides if a dataset can be published? How do we make the decision-making process simple and responsive but also intelligent? A related problem is that the currently available criteria are applied unevenly. Current guidelines for assessing whether release of information gathered about individuals, organizations, or businesses might cause harm are inconsistent and unevenly enforced. For example, in the U.S., data may fall under at least four distinct regimes with different requirements: health information is covered by the Health Insurance Portability and Accountability Act (HIPAA), education records are under the Family Educational Rights and Privacy Act (FERPA), statistical data collected by government agencies are regulated by the Confidential Information Protection and Statistical Efficiency Act (CIPSEA), and most federally funded research data fall under the “Common Rule.” IRBs are relatively free to interpret these regulations and to impose their own standards on sensitive data, and there is even more confusion about the role of repositories. Finally, these criteria may be outdated in light of societal changes in the meaning of, and expectations about, privacy, openness, and indeed personal data itself. Standard procedures such as informed consent and the right to withdraw from the study may have worked well when one PI, in one institution, collected data for one study, and kept the data locked.
This is often not the case anymore: we now have collaboration across institutions, networks of people and data, unanticipated data re-use, and data that are by definition identifiable (genomics). All of this puts stress on the old model (see the excellent paper by Jane Kaye).
In addition, new models for researcher-subject relationships are emerging. For example, commercial companies like 23andMe are finding ways to share data with study participants, treating them as partners in research, not passive information providers. Meanwhile, as our community works hard at best practices, vast amounts of personal data collected by the commercial sector are leaked, linked, bought, and sold, often without permission or knowledge and with little restraint or oversight. These data can find their way into our public datasets, potentially leading to re-identification in our own data despite our best efforts. Perhaps a shift in paradigm is necessary?
To sum up, we’ve heard from leading data repositories in the U.S. and the U.K. and from a team of experts in data security about how to walk the delicate line of protecting privacy and opening access to data. As a community we know what we have to do. Now this issue should be addressed head on by all involved in the data lifecycle.
* I’d like to acknowledge and thank George Alter and John Abowd for helping me put together this panel.