Why “Intelligent Openness” Is Especially Important When Content Is Disaggregated
Ten years ago, my job at the Media Management Center at Northwestern University involved working with newspaper companies on their digital strategy. I was recently reminded of a particular aspect of digital that was blamed for disrupting all sorts of systems and ways of doing things: the fact that it enables the disaggregation of content. Where a traditional newspaper combined a mix of editorial content with display and classified advertising, digital news content can be distributed in an unbundled way. Online, news content “could appear in any number of configurations next to any mix of content over which you [news organizations] have no control” (Benkoil, 2005). While this may be beneficial to news consumers, news organizations worried about financial losses because they would no longer be able to monetize people’s habit of spending time with a packaged product (the industry is still grappling with this).
In my current job, I focus on scholarly communication, open data, and research transparency, and I see the disaggregation of content in this arena as well. In the academic context, Sandusky & Tenopir (2008) describe disaggregation as “the disintegration of an established and taken-for-granted genre, the scholarly journal article, into discrete components,” such as maps, tables, and figures. These components, too, can appear in any number of configurations and be used in any number of ways. And now, thanks to the open access movement, we are seeing even more of the components underlying the traditional journal article – such as datasets, code, software, and lab notebooks – being made publicly available. Researchers can use and re-use data, for example, to run additional analyses, validate results, teach, or conduct a replication study.
These are positive developments for all the reasons that many in the open science movement have articulated (e.g., Piwowar & Vision, and lists by UKDA and DataONE). Champions of reproducibility make a particularly compelling argument for publishing all the components underlying the scholarship: “Most of the work in a modern research project is hidden in computational scripts that go to produce the reported results. If these scripts are not brought out into the open, no one really knows what was done in a certain project” (Donoho, 2010).
The research ecosystem – scholars, university libraries, journals, repositories, archives, and other actors – has generally embraced the proper capture of digital research components by focusing on open access, persistence, and citation. It is especially exciting to see a growing number of initiatives that aim to ensure that discrete research components can be shared in the first place (e.g., figshare, RunMyCode), that they have persistent links (e.g., DOIs), and that they are properly cited (e.g., DataCite). These should be recognized as great leaps forward toward open science.
However, the potential for content to be disaggregated means that the research community must also find ways to ensure the content is usable, not just accessible, persistent, and citable.
Disaggregated content in the scholarly context requires extraordinary care so that it remains usable for the foreseeable future. Let’s focus on research data. In many cases, it is difficult to interpret and make use of data: perhaps descriptive information is minimal, variable labels are incomprehensible, it isn’t clear how the data were generated, or the software needed to read the data is out of date.
So what does it take for data to “be usable”? First, data has to be independent of the particular people who created it. According to the OAIS Reference Model, information should be “independently understandable to (and usable by) the Designated Community,” and there needs to be enough information for it to be understood “without needing the assistance of the experts who produced the information.” Similarly, Gary King’s “replication standard,” defined in the mid-1990s, holds that “sufficient information exists with which to understand, evaluate, and build upon a prior work if a third party can replicate the results without any additional information from the author.”
Second, usable data has to be independent of the specific machines or technologies with which the data is read or rendered. A group at the 2013 Open Knowledge Conference defined usable data as “understood by any person asking a question of it,” and explicitly added the requirement that machines other than those storing the data must be able to “read data.” The basic idea is that data should remain usable beyond the technology and the people that produced it, both now and in the future, and that it could be used for any purpose.
Summing it up nicely, the United Kingdom’s Royal Society states that the standard we need to aim for is intelligent openness: “Data must be accessible and readily located; they must be intelligible to those who wish to scrutinise them; data must be assessable so that judgments can be made about their reliability and the competence of those who created them; and they must be usable by others” (p. 7).
I have argued that, while data archives may have a special responsibility to ensure the usability of the data they hold, everyone in the research ecosystem can, and should, engage in making data usable.
For the news consumer, disaggregated content may result in some loss of context. For the potential data re-user, by contrast, content that is not contextualized, documented, and properly annotated may be impossible to use or decipher at all. If the threat of disaggregated content for the news industry is losing its audience, the threat for science is losing itself.