Torsten Reimer’s excellent blog post on the use of software at Imperial College led me to similar reflections on the use of software at TU Delft.

As part of the Data Stewardship project, I am involved in interviews with research groups across TU Delft’s eight faculties. In many of these interviews, the role of software comes up.

Whether it is simulating the movement of volcanic ash clouds, modelling the problems of integrating renewable energy sources into power grids, or normalising a mass of data from the chemical analysis of biological samples, software plays a critical role in defining and testing scientific hypotheses.

[Image: Utah House Solar Panels, CC-BY, Rick Willoughby]

The relative importance that researchers attribute to software within their research life cycle also has interesting implications for research data management.

Take, for example, a scientist running simulations on the effectiveness of solar power in the electricity grid. Such a scientist runs hundreds of simulations, testing the effects of small adjustments to the input parameters (e.g. customer demand, energy input from the solar panels) that go into the model defined by the code. Each simulation will spit out its results as data.
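
To make this concrete, here is a minimal sketch of what such a parameter sweep might look like. Everything in it (the model, the parameter names, the value ranges) is hypothetical, invented purely for illustration:

```python
import csv
import itertools

def run_simulation(demand_kw, solar_input_kw):
    """Hypothetical stand-in for the real grid model:
    returns the fraction of customer demand covered by solar."""
    return min(solar_input_kw / demand_kw, 1.0)

# Small adjustments to the input parameters; one simulation per combination
demands = [80, 90, 100, 110, 120]   # customer demand (kW)
solar_inputs = [20, 40, 60, 80]     # energy input from the solar panels (kW)

# Each run spits out a row of results as data
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["demand_kw", "solar_input_kw", "solar_fraction"])
    for demand, solar in itertools.product(demands, solar_inputs):
        writer.writerow([demand, solar, run_simulation(demand, solar)])
```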

When it comes to writing up the results in a paper, typically only the data from a few of the simulations will be referenced, perhaps in the form of graphs. Hopefully, the data from these referenced simulations will be available from a data repository.

But from a data management point of view, what is interesting here is thinking about the reproducibility of this research. If another group wants to verify the results of the original group’s research, that’s not too difficult. They can download the resulting data and documentation from the data repository (or they can ask the original scientist for it).

But does that mean the science in the paper is reproducible? To be reproducible, the second group would have to run the same software with the same input parameters and check that they got the same data. The data itself would not be enough. Reproducing science needs not just the data but the software as well.
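
In practice, such a check might look something like the sketch below: rerun the published code with the published parameters and compare the regenerated output against the archived data. The script name, its command-line flags, and the file names are all hypothetical, assumed for the sake of the example:

```python
import hashlib
import subprocess

def sha256(path):
    """Checksum of a file, for comparing archived and regenerated data."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Rerun the archived simulation code with the published input parameters
subprocess.run(
    ["python", "simulation.py", "--demand-kw", "100",
     "--solar-input-kw", "60", "--output", "rerun_results.csv"],
    check=True,
)

# Reproduction succeeds only if the regenerated data matches the archived data
if sha256("rerun_results.csv") == sha256("archived_results.csv"):
    print("Results reproduced: output matches the archived data.")
else:
    print("Mismatch: check the software version, parameters, or nondeterminism.")
```

Even a simple check like this only works if the software itself, in the exact version used, is available alongside the data, which is precisely the point.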

This has implications for data management. As the term suggests, much of the focus in libraries is currently on ‘data’. Yet much of the rationale behind good data management is that it helps make science reproducible. But if we really want to achieve this, then maybe we also need to do a lot more in terms of good ‘software development’.

As the Imperial College blog post demonstrates, some university libraries are already thinking about this. But I think we have a long way to go.