Winning the Tour de France, Research Data and Data Stewardship


Presentation to Sport Data Valley given at TU Delft Library meeting on value of Data Stewardship, May 2016


On the funding cuts at Trove

I read with great disappointment about the funding cuts at the National Library of Australia, including its flagship digital archive, Trove.

Trove has always been at the forefront of work to share and publish digital cultural heritage.


Its tremendous newspapers collection work demonstrates that digitising historical sources can have wide public impact. Musty documents, when digitised, are not just for fusty scholars but for those interested in animal accidents, knitting patterns, cricket scores and any kind of odd historical quirk. The quantity and quality of usage are outstanding.

Trove also shows the way for crowdsourcing. It is one of the very first libraries to ask users to transcribe and correct OCR (Optical Character Recognition) text, and they have done so in great abundance.

But Trove is not just newspapers. A whole range of content from across Australia is included – records, books, maps, photos, diaries, letters, sounds.

Many digital libraries have attempted to aggregate such material and provide compelling, friendly interfaces. Trove is one of the few where searching through such a heterogeneous range of material seems natural.

In the 1950s and 60s, Australian writers spoke of the cultural cringe, the feeling that their work would always be judged as inferior and in subservience to older English and European traditions. Australian culture has moved on fabulously since then, and in the case of Trove, it’s been us in Europe who have looked at Trove and wondered how we could do something as good as that.




Some articles on the benefits of open data for citation rates

Piwowar HA, Vision TJ. (2013) Data reuse and the open data citation advantage. PeerJ 1:e175

Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered.

We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.


Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308.  doi:10.1371/journal.pone.0000308

Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for the investigator who makes his or her data available.

We examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations.

Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression.


Belter CW (2014) Measuring the Value of Research Data: A Citation Analysis of Oceanographic Data Sets. PLoS ONE 9(3): e92590.  doi:10.1371/journal.pone.0092590

Evaluation of scientific research is becoming increasingly reliant on publication-based bibliometric indicators, which may result in the devaluation of other scientific activities – such as data curation – that do not necessarily result in the production of scientific publications. This issue may undermine the movement to openly share and cite data sets in scientific publications because researchers are unlikely to devote the effort necessary to curate their research data if they are unlikely to receive credit for doing so.

This analysis attempts to demonstrate the bibliometric impact of properly curated and openly accessible data sets by attempting to generate citation counts for three data sets archived at the National Oceanographic Data Center.

My findings suggest that all three data sets are highly cited, with estimated citation counts in most cases higher than 99% of all the journal articles published in Oceanography during the same years. I also find that methods of citing and referring to these data sets in scientific publications are highly inconsistent, despite the fact that a formal citation format is suggested for each data set.

These findings have important implications for developing a data citation format, encouraging researchers to properly curate their research data, and evaluating the bibliometric impact of individuals and institutions.


Pienta, Amy M.; Alter, George C.; Lyle, Jared A. (2010) The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data

Abstract: The goal of this paper is to examine the extent to which social science research data are shared and assess whether data sharing affects research productivity tied to the research data themselves. We construct a database from administrative records containing information about thousands of social science studies that have been conducted over the last 40 years.

Included in the database are descriptions of social science data collections funded by the National Science Foundation and the National Institutes of Health. A survey of the principal investigators of a subset of these social science awards was also conducted.

We report that very few social science data collections are preserved and disseminated by an archive or institutional repository. Informal sharing of data in the social sciences is much more common. The main analysis examines publication metrics that can be tied to the research data collected with NSF and NIH funding – total publications, primary publications (including PI), and secondary publications (non-research team).

Multivariate models of count of publications suggest that data sharing, especially sharing data through an archive, leads to many more times the publications than not sharing data. This finding is robust even when the models are adjusted for PI characteristics, grant award features, and institutional characteristics.


Bertil Dorch. On the Citation Advantage of linking to data: Astrophysics. 2012.  <hprints-00714715v2>

Abstract: This paper presents some indications of the existence of a Citation Advantage related to linked data, using astrophysics as a case. Using simple measures, I find that the Citation Advantage presently (at the least since 2009) amounts to papers with links to data receiving on the average 50% more citations per paper per year, than the papers without links to data.

A similar study by other authors showed a cumulative effect after several years amounting to 20%. Hence, a Data Sharing Citation Advantage seems inevitable.


Edwin A. Henneken, Alberto Accomazzi (2011) Linking to Data – Effect on Citation Rates in Astronomy.

Abstract: Is there a difference in citation rates between articles that were published with links to data and articles that were not? Besides being interesting from a purely academic point of view, this question is also highly relevant for the process of furthering science. Data sharing not only helps the process of verification of claims, but also the discovery of new findings in archival data.

However, linking to data still is a far cry away from being a “practice”, especially where it comes to authors providing these links during the writing and submission process. You need to have both a willingness and a publication mechanism in order to create such a practice.

Showing that articles with links to data get higher citation rates might increase the willingness of scientists to take the extra steps of linking data sources to their publications. In this presentation we will show this is indeed the case: articles with links to data result in higher citation rates than articles without such links.

A Vision for the Digital Humanities

The Digital Humanities does not exist. Or rather, it does not exist as a separate field, a bounded up box distinct from traditional disciplines.

Rather, it is a metaphor, a powerful connector that allies existing disciplines with nascent ones. It is a connector that suddenly injects disparate subjects across the humanities with common concerns. Concerns over infrastructure, public engagement, scholarly communication. Over method.

Interdisciplinarity has always existed in the humanities, but the digital turn strengthens bonds in new ways. Archaeologists and linguists share needs for powerful processing infrastructures; philosophers need to reconsider their publishing strategies; theologians and historians suddenly have new audiences for their research opened up. A Sinologist suddenly has common cause with an historian of the book over XML mark up.

Grand Challenges

While this instinctive interdisciplinarity has influenced the restructuring of humanities faculties to include DH components, there is still doubt as to the effectiveness of the digital humanities. What have these ‘technological insurgents’ done to help answer the grand intellectual challenges facing the humanities? Therefore, an essential component of any contemporary digital humanities vision is helping address such challenges.


Hurdles Race (taken by Independent Newspapers PLC), National Library of Ireland, Rights Reserved


Any lasting vision for the Digital Humanities needs to strike a common basis with the broader faculty. It needs to be a venue to form teams not just in an interdisciplinary sense but in the sense of methods and knowledge. A DH centre that wishes to thrive cannot content itself with tinkering with technological expertise and digital innovation.

Rather it needs to employ this knowledge in the framework of larger intellectual questions. A DH Centre must strike out and find common cause with those who are currently pursuing cutting edge themes but without incorporating digital methods. The task is not necessarily easy, but it is essential for the digital humanities to attain the respect it deserves.


The Digital Humanities is a connector. Up until very recently it has failed to tap into global perspectives, with an emphasis on first world cultural history. This has sidelined narratives from the global south, with the side effect of denying the relationships that exist at a global level.


Terrestrial pocket globe, Royal Museums Greenwich, CC-BY-NC-SA


By tackling questions of global import (often related to the grand challenges above), and by deploying open technical infrastructures and standards (for metadata, licensing, and via linked data), creating connections between the global objects of study in the humanities becomes much more feasible.

There is a softer side to this as well. The failure of the digital humanities to become truly global has practical roots; units in the global south can lack the finances, access to technology and general culture to support DH. DH Centres can help in creating the alliances and sharing the infrastructures that would allow a global DH to blossom.

Scholarly Communication and Public Engagement

Inside and outside the academy, the humanities is undergoing a crisis. Public opinion is sometimes characterised by mistrust or disdain; governments loathe the ambiguity of the humanities, along with their non-commercial and ideological aspects. This has an obvious knock-on effect in many ways – threatening student numbers and reducing government investment. The digital humanities can play a critical role in tackling this crisis.


Erasmus of Rotterdam, Austrian National Library, Public Domain

This goes hand in hand with the changing landscape for scholarly communication. For the active DH centre, reconceptualising modes of scholarly communication is not an afterthought but an intrinsic part of examining how the humanities communicates amongst itself and with a wider public.

The adoption (and critical awareness) of new platforms for writing, visualisation, crowdsourcing, multimedia, and online resources themselves – allied to the reformed use of traditional outputs such as articles and monographs – can radically alter humanists’ engagement with their audiences.

Any DH Centre must explore and build on these changes, as well as tackling the grand international challenges of our time.

Truth in Art History – 3 Dutch examples

Art historians love interpreting paintings but they also love finding ‘true facts’ about paintings (excuse the postmodern snigger quotes). Three recent examples related to Dutch art history are below, two of which show a definite input from digital / scientific methodology.


The Next Rembrandt created an entirely fictitious Rembrandt portrait based on the mass of existing technical data related to existing Rembrandt paintings.


Van Gogh’s Bedrooms was an exhibition at the Art Institute of Chicago. It included results of chemical analysis that allowed conservators to claim they found the true, original colours of the painting (or at least one of them; Van Gogh painted three). While there has been much hubbub about the slowness of art history to adopt digital methods, it’s worth noting that conservators / technical art historians have been working with scientific analysis of paintings for a considerable time.


Finally, the actual location of Vermeer’s The Little Street in Delft was revealed. However, this was based on painstaking analysis of archival material in its physical rather than digital form, checking extant maps and documents in archives to try and find the eponymous location.

#IDCC16: Atomising data: Rethinking data use in the age of explicitome

(Originally posted at Digital Curation Centre)

Data re-use is an elixir for those involved in research data.

Make the data available, add rich metadata, and then users will download the spreadsheets, databases, and images. The archive will be visited, making librarians happy. Datasets will be cited, making researchers happy. Datasets may be even re-used by the private sector, making university deans even happier.

But it seems to me that data re-use, or at least a particular conceptualisation of re-use as is established in most data repositories, is not the definitive way of conceiving of data in the 21st century.

Two great examples from the International Data Curation Conference illustrated this.

Barend Mons declared that the real scientific value in scholarly communication is not abstracts, articles or supplementary information. Rather the data that sits behind these outputs is the real oil to be exploited, featuring millions of assertions about all kinds of biological entities.

Mons described the sum of these assertions as the explicitome, which enables cross-fertilisation between distinct scientific work. With all experimental data made available in the explicitome, researchers taking an aerial view can suddenly see all kinds of new connections and patterns between entities cited in wholly different research projects.

Secondly, Eric Kansa gave a talk on the Open Context framework for publishing archaeological data. Following the same principle as Barend Mons, Open Context breaks data down into individual items. Instead of downloading a whole spreadsheet relating to a single excavation, you can access individual bits of data. From an excavation, you can see the data related to a particular trench, and then items discovered in that trench.


(A screenshot from Open Context)

In both cases, data re-use is promoted, but in an entirely different way to datasets being uploaded to an archive and then downloaded by a re-user.

In the model proposed by Mons and Kansa, data is atomised, and then published. Each individual item, or each individual assertion, gets its own identity. And that piece of data can then easily be linked to other relevant pieces of data.

This hugely increases the chance of data re-use; not whole datasets of course, but tiny fractions of datasets. An archaeologist examining remains of jars on French archaeological sites might not even think to look at a dataset from a Turkish excavation. But if the latter dataset is atomised in a way that allows the presence of jars to be identified as well, then suddenly that element of the Turkish dataset will become useful.
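
To make the idea concrete, here is a minimal sketch in Python of what atomising might look like. It is only an illustration: the two excavation tables, the field names, the identifier scheme and the helper functions `atomise` and `index_by` are all invented for this example, and none of it reflects how Open Context or the explicitome is actually implemented.

```python
from collections import defaultdict

# Hypothetical excavation tables, one row per find (invented sample data).
french_site = [
    {"trench": "A1", "find": "jar", "material": "ceramic"},
    {"trench": "A2", "find": "coin", "material": "bronze"},
]
turkish_site = [
    {"trench": "T7", "find": "jar", "material": "ceramic"},
    {"trench": "T7", "find": "bead", "material": "glass"},
]


def atomise(dataset_name, rows):
    """Give every row its own identifier rather than leaving it buried in a file."""
    return [
        {"id": f"{dataset_name}/{number}", "dataset": dataset_name, **row}
        for number, row in enumerate(rows, start=1)
    ]


def index_by(items, key):
    """Link items from any dataset that share the same value for `key`."""
    index = defaultdict(list)
    for item in items:
        index[item[key]].append(item["id"])
    return dict(index)


# Pool the atomised items from both excavations and link them by type of find.
items = atomise("french-site", french_site) + atomise("turkish-site", turkish_site)
by_find = index_by(items, "find")

print(by_find["jar"])  # identifiers of every jar record, whichever dataset it came from
```

Once each record carries its own identifier, a search for jars surfaces the Turkish record alongside the French one without anyone having to download or even know about the Turkish spreadsheet.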

This approach to data is the big challenge for those charged with archiving such data. Many data repositories, particularly institutional ones, store individual files but not individual pieces of data. How research data managers begin to cope with the explicitome – enabling it, nourishing and sustaining it – may well be a topic of interest for IDCC17.

Strategies and Tactics in changing behaviour around research data

The International Data Curation Conference (IDCC) continues to be about change.

That is, how do we change the eco-system so that managing data is an essential component of the research lifecycle? How can we free the rich data trapped in PDFs or lost to linkrot? How can we get researchers to data mine and not data whine?

While, for some, the pace of change is not quick enough, IDCC still demonstrates an impressive breadth of strategy and tactics to enable this change.

On the first day of the conference, Barend Mons set out the vision. The value of research is not in journals but in the underlying data – thousands and thousands of assertions about genes, bacteria, viruses, proteins, indeed any biological entity are locked in figures and tables. Release such data and the interconnections between related entities in different datasets reveals whole new patterns. How to make this happen? One part of the solution: all projects should allocate 5% of their budget to data stewardship.

Andrew Sallans of the Center for Open Science followed this up with their eponymous platform for managing open science, which links data to all kinds of cloud providers and (fingers crossed) institutions’ data repositories. In large-scale projects, sharing and versioning data can easily get out of control; the framework helps to manage this process more easily. They have some pretty nifty financial incentives to change practice too – $1000 awards for pre-registration of research plans.

Following this we saw many posters – tactics to alter behaviours of individuals and groups of researchers. There were some great ideas here, such as plans at the University of Toronto to develop packages of information for librarians on data requirements of different disciplines. 

Despite this, my principal concern was the huge gap between the massive sweep of the strategic visions and the tactics for implementing change. Many of the posters were valiant but were locked in an institutional setting – libraries wrestling with how to influence faculty without the in-depth knowledge (or institutional clout) to make winning arguments within a particular area.

What still seems to be missing from IDCC is the disciplinary voice. How are particular subjects approaching research data? How can the existing community work more closely with them? There was one excellent presentation on building workflows for physicists studying gravitational waves; and other results from OCLC work with social scientists and zoologists. But in most cases it was us librarians doing the talking rather than it being a shared platform with the researchers. If we want that change to happen, there still needs to be greater engagement with the subjects that are creating the research data in the first place.

