The Great Twentieth-Century Hole Or, what the Digital Humanities Miss


Presentation given at DH Benelux June 2014

Presentation on Europeana Newspapers

Presentation given at British Library information day on digitised newspapers

Digitisation Projects Classified by Date of Corpus

At the DH Benelux Conference in The Hague in June, I’m looking into the extent to which the Digital Humanities ignores the twentieth century. The abstract is here.

As part of this work, I’ve been investigating the projects undertaken at various DH centres, in particular those projects that are working with a specific corpus of data (as opposed to doing networking, or tools development), and the dates of those corpora.

I’ve taken some significant DH Centres and marked each of their projects according to a very rudimentary temporal classification: ‘Classical, Medieval, Renaissance, 18th century, 19th century, 1900-1950, 1950 onwards’.

The Google spreadsheet with the results so far is at

So far, I’ve included

Department of Digital Humanities, King’s College London
Huygens Institute, National Library of Netherlands (The Hague)
Maryland Institute for Technology in the Humanities, University of Maryland
Centre for Literary and Linguistic Computing, University of Newcastle
Center for Digital Research in the Humanities, University of Nebraska

There is a sixth tab with the total number of projects.

There is a fuller list of centres I wish to explore on the ‘Totals’ tab of the published spreadsheet. Any more links to identifiable lists of projects based at DH Centres would be gratefully received!

PS I’m aware there are a whole bunch of methodological and sampling problems with focussing on ‘projects in DH Centres’! I’m hoping to bring this out in the paper.

The Great 20th-Century Hole; What the Digital Humanities Miss

(Abstract submitted for the DH Benelux 2014 conference, The Hague, June 2014)

Over the past few years, there have been endless debates about the definition of the Digital Humanities. Many angles are considered – the practitioner of DH as builder, as coder, as theorist, as user – and also where the practitioner of DH sits and works – in the library, at home, in a ‘laboratory’, in the computer science department or with other disciplines.

However, this paper argues that another angle has been ignored – a temporal one. The Digital Humanities has an uncritiqued bias towards the pre-20th century. The projects, papers, books and conferences that constitute the field of Digital Humanities (or at least the Digital Humanities within the western tradition) have taken as their objects of study the classics, the Middle Ages, the early modern period, the Enlightenment and the nineteenth century. The twentieth century – arguably the most important era of study for the humanities – remains relatively untouched as a point of investigation. Whereas there is a mass of projects related to the digitisation of early printed books, manuscripts, maps and early photography, those related to film and media, contemporary books, modern letters and documents, or recent politics are relatively scarce.

The paper draws on evidence from projects such as Europeana Newspapers; programmes like Digging into Data; centres such as the King’s College London Department of Digital Humanities; events like the annual DH conference; and books such as the recent Debates in the Digital Humanities to indicate the extent of this bias. It explores the extent to which projects relating to the twentieth century feature within such academic endeavour.

The paper explores the reasons for this bias. Not surprisingly, licensing and copyright play a role. The copyright status of much twentieth-century material creates a barrier that seems to block engagement from the outset. Indeed, it will be argued that this is the key problem, and one that the community has been slow not only to address but even to recognise. But there are other reasons to consider as well – issues relating to economics, file formats, and the ambitions and relationships of individual disciplines within the humanities to the digital.

It concludes that if the Digital Humanities wishes to fully live up to its potential it needs to conceive of itself in a particular way and tackle these problems as part of a larger alliance. The type of partnerships that scholars within the DH umbrella have formed – with librarians, archivists, publishers – need to be reformulated and strengthened. The twentieth-century hole is a massive problem for the digital humanities, and one that the community can only deal with by presenting and articulating the issue as part of a larger group of interested stakeholders.

Europeana in a Research Context

Slides on ‘Europeana in a Research Context’ for the Mining Digital Repositories conference, April 2014, National Library of the Netherlands

Distributed or Centralised Hosting of Content

Digitised scans of historic newspapers make for rather large files. If you need to see a text up close, a low-resolution version just won’t do – words and characters are too blurry. So libraries that have undertaken digitisation projects on newspapers create individual master files of anything from 10 to 50 GB per image.

This creates quite a challenge for The European Library (TEL) within the Europeana Newspapers project. TEL is creating an end-user interface for these historic documents, assembling around 10 million images of newspaper pages from the 12 library partners involved. To create a successful user experience, TEL needs to be able to present good quality images – maybe not master files, but images of at least 0.5 megabytes (MB) and up to 2.5 MB.

Great for the user, but a headache for the technical manager. 10 million images at an average of 1.5 MB per image demand around 15 million MB of server space in total (around 14 terabytes). This is okay in a project setting, but not sustainable in the long term.
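To make the arithmetic explicit, here is a minimal sketch using only the figures quoted above. The function name and the returned fields are mine, purely for illustration; the one assumption worth flagging is that the ‘around 14’ figure comes from counting a terabyte as 1,024 × 1,024 MB rather than as a round million.

```python
# Figures quoted in the post.
IMAGE_COUNT = 10_000_000                   # around 10 million page images
MIN_MB, AVG_MB, MAX_MB = 0.5, 1.5, 2.5     # size of one access image, in MB


def total_storage(image_count, mb_per_image):
    """Return the total storage needed, in MB and in two flavours of terabyte."""
    total_mb = image_count * mb_per_image
    return {
        "mb": total_mb,
        "tb_decimal": total_mb / 1000**2,  # 1 TB = 1,000,000 MB
        "tb_binary": total_mb / 1024**2,   # 1 TB = 1,048,576 MB (assumption)
    }


estimate = total_storage(IMAGE_COUNT, AVG_MB)
print(f"{estimate['mb']:,.0f} MB")
print(f"{estimate['tb_decimal']:.1f} TB (decimal)")
print(f"{estimate['tb_binary']:.1f} TB (binary)")

# The same sum for the smallest and largest image sizes gives the likely range.
low = total_storage(IMAGE_COUNT, MIN_MB)
high = total_storage(IMAGE_COUNT, MAX_MB)
print(f"range: {low['tb_binary']:.1f} to {high['tb_binary']:.1f} TB (binary)")
```

On the average figure this comes out at 15 million MB, which is 15 TB in decimal terms or a little over 14 TB in binary terms.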

Therefore the project has come up with a new solution.

Rather than all the images being centrally harvested and then stored at TEL, some libraries have offered TEL access to their image server, i.e. their own hardware space where suitably sized images are stored.

When a user makes a request (via search or browse) to see a particular image, the TEL interface dynamically grabs the image from the source library.

Have a look at an 1814 issue of the Viennese newspaper Wiener Zeitung from the National Library of Austria. Here the user can zoom in and out and explore the image within the TEL interface – but the digital version remains housed in Vienna.

This approach has other advantages in that it lets the curator of the original material maintain control of the digitised versions – lack of control is one of the reasons cited by managers as to why they are reluctant to share content with third party publishers.

However, not all libraries in the project have taken this approach, as it takes a bit of effort to allow the images to be grabbed in this way.

Therefore copies of the newspaper from the National Library of Latvia (such as a 1914 issue of ‘Drywa’) are pre-harvested and stored at TEL.
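The two hosting modes can be sketched roughly as follows. This is not TEL’s actual code: the record fields, the function name and the example.org URLs are all hypothetical, and the sketch only shows the decision being made, i.e. use the source library’s image server when one is offered, otherwise fall back to the pre-harvested copy.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical location of the centrally stored, pre-harvested images.
CENTRAL_STORE = "https://central.example.org/images"


@dataclass(frozen=True)
class PageRecord:
    """Metadata for one newspaper page (field names are illustrative)."""
    page_id: str
    library: str
    # Set only when the source library exposes its own image server.
    remote_image_url: Optional[str] = None


def resolve_image_url(record: PageRecord) -> str:
    """Pick where the viewer should fetch the page image from."""
    if record.remote_image_url:
        # Distributed hosting: the image stays with the source library.
        return record.remote_image_url
    # Centralised hosting: serve the copy harvested in advance.
    return f"{CENTRAL_STORE}/{record.library}/{record.page_id}.jpg"


distributed = PageRecord(
    page_id="page-0001",
    library="library-a",
    remote_image_url="https://images.library-a.example.org/page-0001.jpg",
)
centralised = PageRecord(page_id="page-0002", library="library-b")

print(resolve_image_url(distributed))
print(resolve_image_url(centralised))
```

The point of keeping the decision in one place is that a library can move from one mode to the other by changing a single field in its records, without the interface needing to know.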

But as knowledge of this technique increases, I imagine it will become more popular. Rather than pre-assembling such a collection and having to go through the process of harvesting and then storing it (which is time-consuming and costly), third-party aggregators will be able to curate, showcase and publish specific collections drawn from a variety of sources. The result is that content no longer remains trapped in institutional silos, but can be more easily seen and contextualised in a variety of different settings.

