One of the great problems of digitisation programmes is the time and cost of copyright clearance. There can be problems not just in getting the agreement of copyright holders, but actually finding the holders in the first place.
Many projects have to undertake due diligence searches to reach copyright holders – a time-consuming mix of Google searches, general advertising, and emails and letters sent off, often as ‘shots in the dark’.
The Arrow project (and its sequel ArrowPlus) is the first big building block in an EU-wide infrastructure: a tool that should give managers of large-scale digitisation programmes copyright information on the books they want to digitise.
One of the project’s test cases involved undertaking a due diligence search on 1,700 books (from the 19th and 20th centuries) on genetics that the Wellcome Trust wished to digitise, returning information on whether a book was in or out of commerce, whether it was an orphan work, and who the copyright holder was.
To function, Arrow requires the bringing together of three sets of databases in the 16 countries involved – their national library catalogues, their books-in-print databases, and their databases of rightsholder information. This is no easy task, with massive interoperability and data-quality challenges. For some countries, the databases do not exist, or commercial interests act against such centralised collection of data.
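Arrow's real data model is of course far richer than this, but the basic per-country join it has to perform might be sketched as follows (all database names, fields and records here are invented for illustration):

```python
# Illustrative only: three national databases, each keyed by a book identifier.
catalogue = {"isbn-123": {"title": "Genetics and Society", "year": 1932}}
books_in_print = {"isbn-123": {"in_commerce": False}}
rightsholders = {"isbn-123": {"holder": "Example Estate", "contact": "known"}}

def clearance_status(isbn):
    """Combine the three databases into one rights answer for a book."""
    record = catalogue.get(isbn)
    if record is None:
        return "not found in national catalogue"
    commerce = books_in_print.get(isbn, {}).get("in_commerce")
    holder = rightsholders.get(isbn, {}).get("holder")
    if holder is None:
        # No traceable rightsholder: flag as a possible orphan work.
        return "candidate orphan work"
    return f"{'in' if commerce else 'out of'} commerce; holder: {holder}"

print(clearance_status("isbn-123"))  # out of commerce; holder: Example Estate
```

The interoperability challenge mentioned above is precisely that these three sources rarely share clean common identifiers in practice.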
It’s an exceptionally ambitious piece of work, indicative of the EU desire to create large chunks of infrastructure to solve pan-European problems. If it works, its impact will be tremendous, paving the way to the digitisation of many 20th-century books.
But the challenges are great. Sustaining the Arrow system will have a basic administrative cost of Euro 100k a year, and it’s not clear (to me at least) who will support that annual cost.
It should also be noted that Arrow only supports book rights clearance (although visual images are being explored) and large-scale queries. There is no web page for making queries about individual books (for reasons I have not grasped). For most digitisation projects (small-scale, or digitising non-book material) the system does not really help.
The mass digitisation projects that could support the costs of Arrow are not frequent, but they might still happen. The French are talking about digitising 2m out-of-commerce works. Arrow could still play a role in the Google digitisation programme. But whether such irregular programmes can sustain the system, only time will tell.
King’s College London’s Andrew Prescott has written a knowledgeable and very readable blog post on digital humanities infrastructure in the UK. I agree wholeheartedly with his main point – there are big issues relating to the provision of services and content, in particular the commercialisation of knowledge, which the (digital) humanities community needs to address. However, to achieve this I still think there is a role for some kind of grouping around an ‘arts and humanities service’ within the UK, although perhaps with a very different focus from the erstwhile AHDS.
There are three main issues to address:
1) Liaising with publishers and librarians
As the blog post points out, humanities scholars are unaware of much of the current infrastructure provision, and of the benefits and restrictions it places on scholarship.
This is, I think, not surprising. Researchers are interested in developing resources for their research questions or those of their generally quite narrow community. When it comes to working out the much broader licensing deals, copyright issues and institutional demand, the responsibility falls to the librarians – digital humanities scholars rarely have the time or the focus to deal with such issues.
But, as Andrew says, they need to be involved in those discussions. If the AHRC is not going to take that role, then (digital) humanities in the UK needs to find some kind of common grouping, so as to inform a better dialogue with librarians and publishers and get its interests heard.
2) Training, communication
The AHDS was not just about data preservation. It was also about developing a community, putting people in touch and providing expert training. If you needed to learn about scanning, metadata, copyright etc and see what others in your area were doing, the AHDS provided that forum. The Methods Network did very similar things at a more advanced level.
My guess is that services like these are still very much needed. Where does the new humanities scholar go when they need customised help with scanning, data modelling or publishing resources on the web? How do experts in the field find the time and space to exchange findings at a disciplinary and an interdisciplinary level? Possibilities for transfer of knowledge between interested parties do still exist, but the AHDS infrastructure provided a layer on top of this.
3) Data storage / aggregation
As far as I understand, the US HathiTrust was partially born out of the need to provide a repository for Google-digitised material, but also a common area for a variety of other digitised material.
If the UK, or indeed Europe, is to create structures to free up our content for sophisticated reuse in digital scholarship, then we will need the technical infrastructure to allow this to happen. Without such technical infrastructure we rely on what the publishers provide.
Of course, we have a growing system of institutional repositories and digital libraries, so there is less demand for a centralised repository. However, we do still need a way of connecting data distributed across different sources. And for that connecting to be done, there needs to be some kind of service and expertise that can provide the loose connections between different sources and enable innovative digital scholarship over a range of datasets.
One of the main projects I am involved in at my new job at The European Library is helping to create the interface for an EU project to provide A Gateway to European Newspapers (worth around €5.16 million).
It sounds like an obvious and very worthwhile project – taking the scanned and OCRed newspaper collections of numerous libraries in Europe (including the Netherlands, Germany, France, Austria, Serbia and also Turkey), improving the quality of that OCRed text, and then placing it in a centralised index. Instead of having to search over more than 10 different archives, the end user can search over one. The benefits are obvious.
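The core idea of such aggregation, at least conceptually, is a single full-text index built over pages harvested from many archives. A minimal sketch (with invented archive names and page identifiers, and a toy inverted index in place of a real search engine):

```python
from collections import defaultdict

# Purely illustrative: each "archive" maps a page id to its OCRed text.
archive_nl = {"nl-001": "storm floods the city harbour"}
archive_fr = {"fr-042": "harbour trade and storm damage reported"}

def build_index(archives):
    """Merge many archives into one inverted index: word -> page ids."""
    index = defaultdict(set)
    for archive in archives:
        for page_id, text in archive.items():
            for word in text.lower().split():
                index[word].add(page_id)
    return index

def search(index, word):
    """One query over the central index, instead of one query per archive."""
    return sorted(index.get(word.lower(), set()))

index = build_index([archive_nl, archive_fr])
print(search(index, "storm"))  # hits from both archives in a single result
```

A production service would of course use a proper search engine rather than an in-memory dictionary, but the user-facing gain is the same: one query, many collections.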
Other excellent websites, such as the Australian Trove or the US Chronicling America, point the way in creating such an aggregated service; and this is what The European Library is attempting to do as part of this project.
However, the actual reality of the situation with the European project is a little different.
Whilst there are considerable technical challenges in marking up articles, refining text and indexing 18 million pages of newspapers, the real issue is political.
For some of the individual libraries contributing content to the project, there is a fear that any central service will draw users away from their own national services for making digitised newspapers accessible.
Thus there is an understandable reluctance on the part of the libraries to make their images available elsewhere for cross-searching. For some of those involved in the project, part of their library’s income from their respective governments, and possibly even their jobs, depends on the success of their newspaper (or broader digital-content) platforms.
So while all interested European citizens, in fact any interested party around the world, would obviously gain much advantage from a sophisticated aggregated interface, allowing full text searching and displaying images from multiple newspaper collections, things may not transpire so easily.
The project is young, having started in February 2012 and due to last three years. There is still time for debate and discussion, and for different approaches to delivering digital content to evolve.
An increasing push towards libraries exposing their content via APIs may help. Rather than an aggregating service having to centrally store screen-resolution versions of the images, it could simply display the images held at the local libraries. This might not guarantee more traffic to a library’s *platform*, but it would definitely guarantee more use of their *digitised content*.
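The arrangement can be sketched very simply: the central index holds only metadata and a URL pointing back to the image on the contributing library's own server, so every view of a scan is a use of the library's hosted content. All URLs and fields below are invented for illustration; a real library would expose something like a stable image API:

```python
# Central index stores metadata plus a pointer to the library-hosted image.
central_index = [
    {
        "title": "Example Gazette, 14 July 1900",
        "snippet": "...report on the national celebrations...",
        # The image itself stays on the contributing library's server:
        "image_url": "https://library.example.fr/images/eg-1900-07-14-p1.jpg",
    },
]

def render_result(record):
    """The aggregator's result page embeds the remote image rather than
    serving a locally stored copy of it."""
    return (f"<h2>{record['title']}</h2>\n"
            f"<p>{record['snippet']}</p>\n"
            f"<img src=\"{record['image_url']}\">")

print(render_result(central_index[0]))
```

In practice this is the design choice behind image-API approaches: the aggregator gains searchability, while image delivery (and the usage statistics that go with it) remains with the source library.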
We shall see.