Verifying vs Reproducing Science and the importance of software

Torsten Reimer’s excellent blogpost on the use of software at Imperial College led me to similar reflections on the use of software at TU Delft.

As part of the Data Stewardship project, I am involved in interviews with research groups across TU Delft’s eight faculties. In many of these interviews, the role of software comes up.

Whether it is code written to simulate the movement of volcanic ash clouds, to model the problems of integrating renewable energy sources into power grids, or to normalise a mass of data resulting from the chemical analysis of biological samples, software plays a critical role in defining and testing scientific hypotheses.


Utah House Solar Panels, CC-BY, Rick Willoughby

The relative importance attributed to software by researchers as part of their research life cycle also has interesting implications for research data management.

Take for example a scientist running simulations on the effectiveness of solar power in the electricity grid. Such a scientist runs hundreds of simulations, testing the effects of small adjustments to the input parameters (e.g. customer demand, energy input from the solar panels) that feed into the model defined by the code. Each simulation spits out results as data.
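
As a toy illustration of that workflow, here is a minimal sketch in Python. The run_simulation function and the parameter values are hypothetical stand-ins, not the researcher’s actual model; the point is only that each combination of inputs produces one run’s worth of data.

```python
# Minimal sketch of a parameter sweep: the model and the parameter
# values are hypothetical stand-ins, not a real grid simulation.
import csv
import itertools

def run_simulation(customer_demand, solar_input):
    """Stand-in for the real model: returns unmet demand (MWh) for one run."""
    return max(customer_demand - solar_input, 0.0)

demand_levels = [80.0, 100.0, 120.0]      # hypothetical customer demand (MWh)
solar_levels = [20.0, 40.0, 60.0, 80.0]   # hypothetical solar input (MWh)

# Each combination of inputs is one simulation; every run "spits out" a row of data.
with open("simulation_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_demand", "solar_input", "unmet_demand"])
    for demand, solar in itertools.product(demand_levels, solar_levels):
        writer.writerow([demand, solar, run_simulation(demand, solar)])
```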

When it comes to writing up the results in a paper, typically only the data from a few of the simulations will be referenced, perhaps in the form of graphs. Hopefully, the data from these referenced simulations will be available from a data repository.

But from the data management point of view, what is interesting here is thinking about the reproducibility of this research. If another group wants to verify the results of the original group’s research, that is not too difficult: they can download the resulting data and documentation from the data repository (or ask the original scientist for it).

But does that mean the science in the paper is reproducible? To be reproducible, the second group would have to test the same software with the same input parameters and check that they got the same data. The data itself would not be enough. Reproducing science needs not just the data but the software as well.
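
One way to make the distinction concrete is sketched below, assuming (hypothetically) that the authors published a small record of each referenced run alongside the data. Verifying needs only the archived data file and its checksum; actually reproducing the run also needs the original software. The file names and the published_run.json record are illustrative, not anyone’s real workflow.

```python
# Sketch of the verification vs reproduction distinction.
# File names and the published_run.json record are hypothetical.
import hashlib
import json

def file_sha256(path):
    """Hash a data file so two copies (or two runs) can be compared byte for byte."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Verification: the data downloaded from the repository matches what the authors archived.
record = json.load(open("published_run.json"))  # e.g. {"params": {...}, "sha256": "..."}
assert file_sha256("downloaded_results.csv") == record["sha256"]

# Reproduction: re-run the authors' software with the published parameters and
# check that it regenerates the same data. Impossible without the software itself.
# rerun(record["params"], out="rerun_results.csv")   # requires the original code
# assert file_sha256("rerun_results.csv") == record["sha256"]
```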

This has implications for data management. As the term suggests, much of the focus in libraries is currently on ‘data’. Yet much of the rationale behind good data management is that it helps make science reproducible. If we really want to achieve that, then maybe we need to do a lot more in terms of good ‘software development’ as well.

As the Imperial College blog post demonstrates, some university libraries are already thinking about this. But I think we have a long way to go.

Open Science Framework for Data and Project Management

One of the most common requests from new research projects at TU Delft is for a tool that can manage all kinds of research data during a project and also deal with the other material a project creates – for example, steering group minutes, presentations, and interview permissions.


Often, projects end up using a mixture of tools (Basecamp, Google Drive, GitHub, SharePoint) that have different advantages and disadvantages.

In this light, I’ve had an introductory look at the science-focussed Open Science Framework (OSF), which provides tools to support the entire research workflow. Some of the advantages are listed below.

  • Very quick start-up time – it’s possible to get a project up and running in a couple of minutes

  • Possible to upload and categorise all kinds of data and files – for example, under ‘methods’, ‘hypotheses’ and ‘communication’

  • Ability to store versions of data – revisions to each file can be stored

  • Different files can have different levels of permission. OSF introduces the concept of a ‘component’ to help organise files and data in different ways. Each component can have a different level of access (e.g. admin, read/write, read only). This is very useful for projects involving multiple institutions and for data requiring protection.

  • Ability to create public versions of parts of projects, with citations. For fully-fledged projects that wish to share data and ensure appropriate attribution, this could be a strong pull.

Other questions that using OSF raises:

  • How efficiently does OSF deal with big data sets? Individual files can be no more than 5GB. For larger files, linking to add-ons such as Dropbox is possible, but it would be interesting to see whether OSF retains its speed when accessing multiple large data sets.

  • How does it work with third-party tools? Integration with common cloud apps such as Google Drive is already included, but for some research projects it is the ability to connect to specialist code, tools and instruments that could make OSF much more useful. Such integration is challenging. For example, how could a sensor recording meteorological data on a daily basis automatically transfer its data to OSF? Or how could OSF expose data from traffic logs to allow visual analysis of the movement of cars, buses and lorries in a city? OSF has made its API public to support such goals, but using it requires developer time (a rough sketch of what such an integration might look like follows this list).

  • If data is being made public and given a DOI for use in citations, OSF will need to work hard to ensure long-term sustainability and the trustworthiness of the data. It will still be useful for research projects to deposit their final published data in a repository that accords with the Data Seal of Approval, for long-term curation.
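
As a very rough sketch (not a tested integration), something like the following could let a daily script push a sensor’s data file to an OSF project over the public API. The node id, the token and the endpoint details are assumptions based on OSF’s documented file-storage API and would need checking against the current documentation.

```python
# Hypothetical daily upload of a meteorological data file to OSF storage.
# NODE_ID, the token and the endpoint shape are assumptions to verify
# against the current OSF API documentation.
import datetime
import requests

OSF_TOKEN = "personal-access-token"   # created under OSF account settings
NODE_ID = "abc12"                     # hypothetical OSF project/component id

today = datetime.date.today().isoformat()
filename = f"meteo_{today}.csv"

with open(filename, "rb") as f:
    response = requests.put(
        f"https://files.osf.io/v1/resources/{NODE_ID}/providers/osfstorage/",
        params={"kind": "file", "name": filename},
        headers={"Authorization": f"Bearer {OSF_TOKEN}"},
        data=f,
    )
response.raise_for_status()
print("Uploaded", filename)
```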

PREPARE TO SHARE – DATA STEWARDSHIP AT TU DELFT

Coming fast on the heels of the Open Access movement for scholarly articles, the research data movement aims to liberate the data that provides the evidence for scholarly debate and argument.

Take one of the projects currently being carried out at the Faculty of Architecture. Led by Assistant Professor Stefan van der Spek, the Rhythm of the Campus project is creating a huge dataset of wifi usage by staff and students across the entire TU Delft campus.

Such a rich dataset can tell researchers, students and indeed other interested parties a great deal about how teachers, undergraduates, postgraduates and support services all make use of the university wifi.

A whole spectrum of questions can be asked on how individuals interact in different types of groups, and how attitudes and behaviour change in specific places and at specific times.

From the point of view of the Research Data Services team in the Library, there is great interest in what happens to the collected data. Such a catalogue of digital behaviour will be of interest not just to Stefan’s colleagues and students but to many potential re-users around the world.

But before research data like this can be re-used, many issues need to be addressed. How is it documented? How is it anonymised? How is it archived? How is it cited? These are all questions for Data Stewardship.

At TU Delft Library, the Research Data Services team has just kicked off the Data Stewardship project. It aims to create mature working practices and policies for research data management across each of the faculties at TU Delft, so that any project can make sure their data is managed well.

Four key values underpin this work:

  1. The safe storage and protection of intellectual capital developed by scientists
  2. Best practice in ensuring scientific arguments are replicable in the long term
  3. Better exposure of work of scientists and improved citation rates
  4. Improved practices for meeting the demands of funders, publishers and others in respect to research data

To implement these values, work has begun on a draft policy framework (https://drive.google.com/open?id=0BxR5kUQ2pArDX1gtQXJNaFRQdTQ). This is being discussed with the faculties over the summer, and their input will steer and refine the policies and practices throughout the university (e.g. on the need for data management training for PhD candidates). We will continue to report on developments on this blog as the project continues.

On the funding cuts at Trove

I read with great disappointment about the funding cuts at the National Library of Australia, including its flagship digital archive, Trove.

Trove has always been at the forefront of work to share and publish digital cultural heritage.


Its tremendous newspapers collection demonstrates that digitising historical sources can have wide public impact. Musty documents, when digitised, are not just for fusty scholars but for those interested in animal accidents, knitting patterns, cricket scores and any kind of odd historical quirk. The quantity and quality of usage are outstanding.

Trove also shows the way for crowdsourcing. It was one of the very first services to ask users to transcribe and correct OCR (Optical Character Recognition) text, and its users have done so in great abundance.

But Trove is not just newspapers. A whole range of content from across Australia is included – records, books, maps, photos, diaries, letters, sounds.

Many digital libraries have attempted to aggregate such material and provide compelling friendly interfaces. Trove is one of the few where searching through such a heterogeneous range of material seems natural.

In the 1950s and 60s, Australian writers spoke of the cultural cringe, the feeling that their work would always be judged as inferior and in subservience to older English and European traditions. Australian culture has moved on fabulously since then, and in the case of Trove, it’s been us in Europe who have looked at Trove and wondered how we could do something as good as that.

Some articles on the benefits of open data for citation rates

Piwowar HA, Vision TJ (2013) Data reuse and the open data citation advantage. PeerJ 1:e175. https://doi.org/10.7717/peerj.175

Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered.

We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.

—————————————————————————————————

Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. https://doi.org/10.1371/journal.pone.0000308

Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for the investigator who makes his or her data available.

We examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations.

Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression.

—————————————————————————————————

Belter CW (2014) Measuring the Value of Research Data: A Citation Analysis of Oceanographic Data Sets. PLoS ONE 9(3): e92590. https://doi.org/10.1371/journal.pone.0092590

Evaluation of scientific research is becoming increasingly reliant on publication-based bibliometric indicators, which may result in the devaluation of other scientific activities – such as data curation – that do not necessarily result in the production of scientific publications. This issue may undermine the movement to openly share and cite data sets in scientific publications because researchers are unlikely to devote the effort necessary to curate their research data if they are unlikely to receive credit for doing so.

This analysis attempts to demonstrate the bibliometric impact of properly curated and openly accessible data sets by attempting to generate citation counts for three data sets archived at the National Oceanographic Data Center.

My findings suggest that all three data sets are highly cited, with estimated citation counts in most cases higher than 99% of all the journal articles published in Oceanography during the same years. I also find that methods of citing and referring to these data sets in scientific publications are highly inconsistent, despite the fact that a formal citation format is suggested for each data set.

These findings have important implications for developing a data citation format, encouraging researchers to properly curate their research data, and evaluating the bibliometric impact of individuals and institutions.

—————————————————————————————————

Pienta AM, Alter GC, Lyle JA (2010) The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data. http://hdl.handle.net/2027.42/78307

Abstract: The goal of this paper is to examine the extent to which social science research data are shared and assess whether data sharing affects research productivity tied to the research data themselves. We construct a database from administrative records containing information about thousands of social science studies that have been conducted over the last 40 years.

Included in the database are descriptions of social science data collections funded by the National Science Foundation and the National Institutes of Health. A survey of the principal investigators of a subset of these social science awards was also conducted.

We report that very few social science data collections are preserved and disseminated by an archive or institutional repository. Informal sharing of data in the social sciences is much more common. The main analysis examines publication metrics that can be tied to the research data collected with NSF and NIH funding – total publications, primary publications (including PI), and secondary publications (non-research team).

Multivariate models of count of publications suggest that data sharing, especially sharing data through an archive, leads to many times more publications than not sharing data. This finding is robust even when the models are adjusted for PI characteristics, grant award features, and institutional characteristics.

—————————————————————————————————

Dorch B (2012) On the Citation Advantage of linking to data: Astrophysics. hprints-00714715v2

Abstract: This paper presents some indications of the existence of a Citation Advantage related to linked data, using astrophysics as a case. Using simple measures, I find that the Citation Advantage presently (at least since 2009) amounts to papers with links to data receiving on average 50% more citations per paper per year than papers without links to data.

A similar study by other authors showed a cumulative effect after several years amounting to 20%. Hence, a Data Sharing Citation Advantage seems inevitable.

—————————————————————————————————

Henneken EA, Accomazzi A (2011) Linking to Data – Effect on Citation Rates in Astronomy. http://arxiv.org/abs/1111.3618

Abstract: Is there a difference in citation rates between articles that were published with links to data and articles that were not? Besides being interesting from a purely academic point of view, this question is also highly relevant for the process of furthering science. Data sharing not only helps the process of verification of claims, but also the discovery of new findings in archival data.

However, linking to data is still a far cry away from being a “practice”, especially when it comes to authors providing these links during the writing and submission process. You need to have both a willingness and a publication mechanism in order to create such a practice.

Showing that articles with links to data get higher citation rates might increase the willingness of scientists to take the extra steps of linking data sources to their publications. In this presentation we will show this is indeed the case: articles with links to data result in higher citation rates than articles without such links.

A Vision for the Digital Humanities

The Digital Humanities does not exist. Or rather, it does not exist as a separate field, a neatly bounded box distinct from traditional disciplines.

Rather, it is a metaphor, a powerful connector that allies existing disciplines with nascent ones. It is a connector that suddenly injects disparate subjects across the humanities with common concerns. Concerns over infrastructure, public engagement, scholarly communication. Over method.

Interdisciplinarity has always existed in the humanities, but the digital turn strengthens bonds in new ways. Archaeologists and linguists share needs for powerful processing infrastructures; philosophers need to reconsider their publishing strategies; theologians and historians suddenly find new audiences opened up for their research. A Sinologist has common cause with an historian of the book over XML markup.

Grand Challenges

While this instinctive interdisciplinarity has influenced the restructuring of humanities faculties to include DH components, there is still doubt as to the effectiveness of the digital humanities. What have these ‘technological insurgents’ done to help answer the grand intellectual challenges facing the humanities? Therefore, an essential component of any contemporary digital humanities vision is helping address such challenges.

Hurdles Race (taken by Independent Newspapers PLC), National Library of Ireland, Rights Reserved

Any lasting vision for the Digital Humanities needs to strike a common basis with the broader faculty. It needs to be a venue for forming teams, not just in an interdisciplinary sense but in the sense of methods and knowledge. A DH centre that wishes to thrive cannot content itself with tinkering with technological expertise and digital innovation.

Rather it needs to employ this knowledge in the framework of larger intellectual questions. A DH Centre must strike out and find common cause with those who are currently pursuing cutting edge themes but without incorporating digital methods. The task is not necessarily easy, but it is essential for the digital humanities to attain the respect it deserves.

Internationalisation

The Digital Humanities is a connector. Yet until very recently it has failed to tap into global perspectives, with an emphasis on first-world cultural history. This has sidelined narratives from the global south, with the added effect of denying the relationships that exist at a global level.

Terrestrial pocket globe, Royal Museums Greenwich, CC-BY-NC-SA

By tackling questions of global import (often related to the grand challenges above), and by deploying open technical infrastructures and standards (for metadata, licensing, and via linked data), creating connections between the global objects of study in the humanities becomes much more feasible.

There is a softer side to this as well. The failure of the digital humanities to become truly global has practical roots; units in the global south can lack the finances, access to technology and general cultural support needed for DH. DH Centres can help in creating the alliances and sharing the infrastructures that would allow a global DH to blossom.

Scholarly Communication and Public Engagement

Inside and outside the academy, the humanities is undergoing a crisis. Public opinion is sometimes characterised by mistrust or disdain; governments loathe the ambiguous, non-commercial and ideological aspects of the humanities. This has an obvious knock-on effect in many ways – threatening student numbers and reducing government investment. The digital humanities can play a critical role in tackling this crisis.

Erasmus of Rotterdam, Austrian National Library, Public Domain

This goes hand in hand with the changing landscape for scholarly communication. For the active DH centre, reconceptualising modes of scholarly communication is not an afterthought but an intrinsic part of examining how the humanities communicates within itself and with a wider public.

The adoption (and critical awareness) of new platforms for writing, visualisation, crowdsourcing, multimedia, and online resources themselves – allied to the reformed use of traditional outputs such as articles and monographs – can radically alter humanists’ engagement with their audiences.

Any DH Centre must explore and build on these changes, as well as tackling the grand international challenges of our time.