Europeana in a Research Context

Slides on ‘Europeana in a Research Context’ for the Mining Digital Repositories conference, April 2014, National Library of the Netherlands


Distributed or Centralised Hosting of Content

Digitised scans of historic newspapers create rather large file sizes. If you need to see a text up close, a low-resolution version just won’t do – words and characters are too blurry. So libraries that have undertaken digitisation projects on newspapers create individual master files of anything from 10 to 50 GB per image.

This creates quite a challenge for The European Library (TEL) within the Europeana Newspapers project. TEL is creating an end-user interface for these historic documents, assembling around 10m images of newspaper pages from the 12 library partners involved. To create a successful user experience, TEL needs to be able to present good quality images – maybe not master files, but images of at least 0.5 megabytes (MB) and up to 2.5 MB.

Great for the user, but a headache for the technical manager. 10m images at an average of 1.5 MB per image demand around 15m MB of server space in total (around 14 terabytes). This is okay in a project setting, but not sustainable in the long term.
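For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The page count and average size are the figures quoted above; the constant and function names are my own, and the only thing added is the unit conversion, which shows that the 14 terabyte figure assumes 1,024 x 1,024 MB per terabyte.

```python
# Figures quoted in the post; the names below are illustrative only.
PAGES = 10_000_000        # around 10m newspaper page images
AVG_MB_PER_IMAGE = 1.5    # average size of a display-quality image, in MB

# Total storage needed if every image is held centrally.
total_mb = PAGES * AVG_MB_PER_IMAGE


def mb_to_tb_decimal(mb: float) -> float:
    """Convert megabytes to terabytes using 1 TB = 1,000,000 MB."""
    return mb / 1_000_000


def mb_to_tb_binary(mb: float) -> float:
    """Convert megabytes to terabytes using 1 TB = 1,024 * 1,024 MB."""
    return mb / (1024 * 1024)


if __name__ == "__main__":
    print(f"Total: {total_mb:,.0f} MB")
    print(f"Decimal terabytes: {mb_to_tb_decimal(total_mb):.1f}")
    print(f"Binary terabytes:  {mb_to_tb_binary(total_mb):.1f}")
```

Counted in decimal units the total is 15 terabytes; counted in binary units it is a little over 14, which matches the rounded figure above.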

Therefore the project has come up with a new solution.

Rather than all the images being centrally harvested and then stored at TEL, some libraries have offered TEL access to their image server, i.e. their own hardware space where suitably sized images are stored.

When a user makes a request (via search or browse) to see a particular image, the TEL interface dynamically grabs the image from the source library.

Have a look at an 1814 issue of the Viennese newspaper Wiener Zeitung from the National Library of Austria. Here the user can zoom in and out and explore the image within the TEL interface – but the digital version remains housed in Vienna.

This approach has another advantage: it lets the curator of the original material maintain control of the digitised versions. Lack of control is one of the reasons cited by managers as to why they are reluctant to share content with third-party publishers.

However, not all libraries in the project have taken this approach, as it takes a bit of effort to allow the images to be grabbed in this way.

Therefore copies of the newspaper from the National Library of Latvia (such as a 1914 issue of ‘Drywa’) are pre-harvested and stored at TEL.
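The mixed approach, with some images fetched from the source library on request and others served from pre-harvested copies, can be sketched roughly as follows. This is not TEL's actual implementation: the `Partner` class, the `resolve_image_url` function and all the URLs are hypothetical, invented purely to illustrate the routing decision.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical base URL for images that were pre-harvested and stored centrally.
LOCAL_STORE = "https://images.aggregator.example/store"


@dataclass(frozen=True)
class Partner:
    """A contributing library (hypothetical model, not TEL's data structure)."""
    name: str
    # Base URL of the library's own image server, or None if the library
    # does not offer one and its images were harvested in advance.
    image_server: Optional[str]


def resolve_image_url(partner: Partner, page_id: str) -> str:
    """Return the URL from which a page image should be fetched."""
    if partner.image_server:
        # Distributed hosting: the image stays on the library's own hardware.
        return f"{partner.image_server.rstrip('/')}/{page_id}"
    # Centralised hosting: fall back to the pre-harvested copy.
    return f"{LOCAL_STORE}/{partner.name}/{page_id}"


if __name__ == "__main__":
    remote = Partner("library-a", "https://images.library-a.example/")
    harvested = Partner("library-b", None)
    print(resolve_image_url(remote, "page-001"))
    print(resolve_image_url(harvested, "page-001"))
```

The point of the design is that the interface only needs to know where an image lives, not to hold a copy of it, so a library can move from the harvested route to the distributed one simply by exposing an image server.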

But as knowledge of this technique increases, I imagine it will become more popular. Rather than pre-assembling such collections and having to go through the process of harvesting and then storing them (which is time consuming and costly), third-party aggregators will be able to curate, showcase and publish specific collections drawn from a variety of sources. The result is that content no longer remains trapped in institutional silos, but can be more easily seen and contextualised in a variety of different settings.


Feedback on Horizon 2020 ICT day

Since it published the first calls for the Horizon 2020 programme, the European Commission has been busy organising events to give more information on particular calls.

As the European Library is interested in being involved in new projects (particularly on how publishers and libraries can work together on usage data), I attended the event covering the elements of the ICT call related to big data.

Over 500 delegates registered; a significant proportion were from university research departments.

The EU project officers went through the calls and it soon became apparent that there was a significant shift from FP7, the previous EU research programme.

There was very much a focus on embedding research within business, particularly Small and Medium Enterprises (SMEs). Research for research’s sake was nowhere to be seen. The EU project officers were clear about the need for impact.

Most projects would be either ‘Research and Innovation’ or just ‘Innovation’. In this case innovation is defined in a business context rather than a knowledge context; i.e. it’s about innovation in products and services, rather than ideas. Projects that resulted in research publications would be frowned upon. 80% of the pre-proposals one EU officer had read did not have a sufficiently strong business case to get funding.

Of course, this is only a slice of the Horizon 2020 funding. Other aspects will be focussed on pure research, or related research infrastructures. But within the context of these ICT-focussed calls, those applying from universities and related organisations will need to work much more closely with commercial partners if they wish to get project funding.


Memorial to Alessandro Valtrini

Made a stub of the Memorial to Alessandro Valtrini: https://en.wikipedia.org/wiki/Memorial_to_Alessandro_Valtrini


How I use social media in my work

As part of a survey by the UK’s Arts and Humanities Research Council, I was asked how I made use of Twitter. Here are the responses I gave.

 

1. Which social media services do you currently use for professional networking or discussing your research?

Twitter (a lot), Google Plus, LinkedIn (a little bit)

 

2. What do you see as the benefits of engaging with social media for you as a researcher?

I am not strictly a researcher, but am involved in a lot of projects related to digital humanities, digital libraries and information science. Twitter is great as it allows one to build networks and learn what is going on elsewhere. The latter is really important – I have a much better idea of work being done around the globe in the field of digital humanities, and can garner that basic information much more quickly than from other sources (e.g. conferences, papers). Twitter is not a replacement for that latter type of scholarly comms though; it is a supplement.

 

3. Have you encountered any problems or barriers to using social media in relation to your research work?

You need to be aware of the limits. Twitter is good for starting or maintaining some social connections, but a lot more is needed if you want to have in-depth conversations. Also, I find that I share viewpoints with the people I am in contact with on Twitter, so less direct argument and critique happens (although in my Twitter stream there is plenty of critique of third parties that are not part of that group).

 

4. Do you find it easy to find and connect with other researchers in the arts and humanities fields?

Yes, very easy to connect with people in information science, digital humanities, libraries. But I think this group is well disposed to Twitter in the first place. I am also interested in connecting with art historians but there are very few on Twitter. A community needs a critical mass of numbers to be worthwhile.

 

5. How do you usually find out about other current research projects in your field?

Nearly always via Twitter; but conferences and word of mouth can help provide much more illumination.


Representation and Absence in Digital Resources: The Case of Europeana Newspapers

Abstract submitted to DH2014 by Alastair Dunning (The European Library) and Clemens Neudecker (KB National Library of the Netherlands)

Within the Digital Humanities, there is a long history of debate and discussion as to how texts are accurately represented in digital form. Arguments as to how texts are encoded in both a logical and semantic sense are a recurring feature of past DH conferences.

Yet the intense intellectual focus on the precise details of marking up small corpora or even individual texts has masked the fact that issues related to the representation of large corpora of digitised materials – books, manuscripts, newspapers, records etc. – have too often been ignored. Libraries, archives, museums and other collecting institutions have now been digitising corpora of material for many years, but with very few exceptions, it is still quite rare for an entire run of primary sources to be digitised and made available online.

This means that there are gaps within the digital record. Yet it is unusual for online resources to actively demonstrate these gaps; resources may be advertised as a growing corpus, but when searching through or downloading a digital resource there is rarely any indication of what has not been digitised. This skews the scholar’s sense of the nature of the collection they are working with and erodes trust.

This problem is compounded by the assumption made by end users that when they search a digital resource, they are actually searching over everything in the original archive. In most cases, this is far from true.

This long paper looks at this problem in the context of the Europeana Newspapers project (www.europeana-newspapers.eu), a three-year, four-million-euro project, which is creating full-text for 10m pages of digitised newspapers from 12 libraries across Europe, and also developing an interface to allow for cross-searching of over 18m newspaper pages. The final interface, available from the European Library in 2014 (http://www.theeuropeanlibrary.org/), will also provide keyword searching over the OCRd (Optical Character Recognition) text and allow users to compare different newspapers from around Europe published on the same day.

While it is an ambitious project, it is only a drop in the ocean of the overall number of digitised newspapers in Europe (a conservative calculation within the project put the number of digitised newspaper pages in European libraries at 130m). What appears on the final interface will only be a sample of what actually exists in European libraries.

Moreover, other issues – political, economic, legal and technical – mean that the quality and national distribution of newspapers in the project (and therefore represented in the final online interface) are unevenly balanced. For the resource to be trusted by the academic community, this lack of balance must be acknowledged.

In terms of the economic and legal issues, the project is integrating newspapers from 12 existing online newspaper libraries, each of which has a different business model. These different business models affect the final project interface. The National Library of Turkey and the British Library newspapers operate behind a pay wall, for instance – therefore the final Europeana Newspapers site will not be able to directly show images from their collections.

Other libraries are wary of sharing full-resolution images, with the legitimate fear that the users will no longer visit their own national website. In such cases, only fragments of their newspaper images will appear in the central site. Legal issues are also pertinent; some libraries are unsure of the copyright status of some of their historic newspapers and therefore do not want to commit to allowing another entity to publish them.

In addition, there are several technical issues impeding uniform access to the resources. Nearly every digital newspaper collection today contains full-text derived from automatic processing with OCR software. But while some newspaper repositories grant access to the full-text, often the full-text is hidden and only exposed as an index for searching, but not available to the end user for online display or (programmatic) download, or sometimes not even for indexing by Google.

In other cases, full-text is made available, but not for the entirety of the collection, either due to IP issues or because the content holder took a deliberate decision not to show the full-text to the user, often because of the error rate in the OCRd text. Frequently, insufficient information is provided about the OCR error rate of a particular digital resource, which makes it even harder to assess how much of the content can realistically be retrieved through a full-text search.

There are also different ways in which digital facsimiles are made accessible. Many recent online newspaper portals use the JPEG2000 image file format. The benefit of this is the ability to zoom more or less seamlessly in and out of the digital facsimile. But since JPEG2000 has not been around for very long in the digitisation community, many collections that were digitised in the past are only available in TIFF format. This means that zooming can only be provided in a static way on these images, e.g. through JPEGs at different resolutions. As a result, it is often not possible for researchers to explore these legacy resources in the same way as they do recently digitised materials.

In other cases, digital facsimiles have been produced by capturing existing microfilm copies rather than the original source material. The digital versions therefore expose artefacts that were not present in the original paper source, but were only introduced in the microfilm. However, this type of provenance information is typically not available to end users, who are left alone in their interpretation of the differences in resource presentation and functionality.

Finally, the metadata standards used to describe the digital contents also vary. Not only are there different representations in use for encoding full-text, such as plain text, ALTO or TEI, but descriptive metadata is also commonly encoded in different standards, and with different degrees of granularity. While standard bibliographic information such as the title or date of publication is commonly available, more specific information on, for example, a particular article or the names of persons or places occurring in it rarely is. Within the Europeana Newspapers project a subset of 2m pages out of the total 10m will be refined further down to the article level, thus enabling more sophisticated search and retrieval functionality than for the remaining 8m pages.
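To make the difference between these full-text representations a little more concrete, here is a minimal Python sketch that pulls plain text out of an ALTO fragment. The sample XML is a hand-made, deliberately incomplete fragment (it would not validate against the ALTO schema), and the function names are my own. It relies only on the `CONTENT` and optional `WC` (word confidence) attributes that ALTO defines on `String` elements; the confidence values are whatever the OCR engine reported, which is not the same thing as a measured error rate.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Hand-made illustrative fragment; real ALTO files carry many more attributes.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Wiener" WC="1.0"/>
      <SP/>
      <String CONTENT="Zeitung" WC="0.75"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_text_and_confidence(alto_xml: str) -> Tuple[str, Optional[float]]:
    """Return the plain text (one line per TextLine) and the mean WC value."""
    root = ET.fromstring(alto_xml)
    lines = []
    confidences = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = []
        for child in elem:
            if local_name(child.tag) == "String":
                words.append(child.get("CONTENT", ""))
                wc = child.get("WC")
                if wc is not None:
                    confidences.append(float(wc))
        lines.append(" ".join(words))
    mean_wc = sum(confidences) / len(confidences) if confidences else None
    return "\n".join(lines), mean_wc


if __name__ == "__main__":
    text, mean_wc = extract_text_and_confidence(SAMPLE_ALTO)
    print(text)
    print(mean_wc)
```

A plain-text export would keep only the first of the two return values; the layout and confidence information that ALTO records alongside each word is exactly what gets lost when a collection exposes its full-text in a simpler form.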

A central point of this paper is that these issues are not just issues for librarians; it is not about showcasing how a digital resource is. Rather it is the urgent need to demonstrate how such issues have a profound effect on the academic community’s engagement with online resources.

If a researcher wants to conduct a comparative analysis of newspapers in Chronicling America (the US historic newspaper site), the National Library of France and the British Library, she will have to use three different interfaces with different levels of content and metadata quality. Moreover, she will also have to grasp the particularities of each of these collections with regard to their quality and completeness and what that entails for her research.

This paper will conclude with some recommendations for how those building digital resources can make their content choices more transparent. Informed dialogue between the cultural heritage organisations and the research communities is required. The paper calls on creators to tear down the illusion of completeness and help persuade end users that many digital resources are fragmentary things, where the representation of absence is just as important as the representation of existence.



If you make a searchable full-text index of a digital document, is that index metadata or content?

