On 14 May, The European Library took part in a hackday organised by Research Libraries UK (RLUK). This was the first opportunity to make use of RLUK's 20m bibliographic records (relating to the holdings of over 30 UK university libraries) following their conversion to Linked Open Data – work undertaken by The European Library. The day also offered attendees the chance to use The European Library API, which searches over the entire European bibliographic dataset.
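
To give a flavour of what working with the API looked like on the day, here is a minimal Python sketch. The endpoint URL, parameter names and response fields below are placeholders rather than the real API specification, which is documented by The European Library.

```python
import requests

# Placeholder endpoint and parameters -- the real European Library API
# has its own URL, query syntax and response format.
API_URL = "http://api.example.org/tel/search"

def search(query, max_records=10):
    """Run a keyword search against the (hypothetical) endpoint."""
    response = requests.get(
        API_URL,
        params={"query": query, "maxRecords": max_records},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # The "records" and "title" keys are assumptions about the response shape.
    for record in search("Shakespeare").get("records", []):
        print(record.get("title"))
```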


A summary of the day as a whole has been written, but there are a couple of specific points worth highlighting.

It is clear that the RLUK data is incredibly rich and offers plenty of opportunity for analysis, exploration and further exploitation. There are over 660m subject headings, 240m place names and 289m person names in the data.

But actually undertaking that process of exploitation at the hackday was tricky. The entire RLUK dataset is over 6GB; even a random sample of 1m records was over 330MB. For developers working on laptops at the event, manipulating such large chunks of data was prohibitively slow.
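
One way around the memory problem is to stream a dump rather than load it whole. The sketch below assumes the data is available as N-Triples, a line-oriented RDF serialisation that suits this approach; the file name is illustrative and the parsing is deliberately naive.

```python
import gzip

def iter_triples(path):
    """Stream (subject, predicate, object) tuples from an N-Triples dump
    without ever holding the whole file in memory."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Naive split: subject and predicate contain no spaces,
            # the object is everything up to the trailing " ."
            subject, predicate, obj = line.rstrip(" .").split(" ", 2)
            yield subject, predicate, obj

# Example: count triples in a 6GB dump on a laptop, one line at a time.
print(sum(1 for _ in iter_triples("rluk-dump.nt.gz")))
```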

Most of the developers started by exploring a prepared sample of 1,000 records, using it in combination with the API. This small sample was a manageable 349KB.

Of course, this meant that the full extent of the data was not being exploited – less than 1% of the collection! Some of the developers could make good use of just a few records in RDF form, linking them to other sources of linked data on the web. But others expressed a wish to work on the whole dataset and do some statistical analysis, at least on certain facets, e.g. all the records related to a particular library.
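
That kind of facet-level analysis can also be done in a single streaming pass. The sketch below counts subject-heading occurrences across a dump; the Dublin Core predicate is an assumption, as the RLUK data may use a different vocabulary, and the file name is again illustrative.

```python
import gzip
from collections import Counter

# Assumed predicate for subject headings (Dublin Core); the actual
# RLUK conversion may model subjects with a different vocabulary.
SUBJECT = "<http://purl.org/dc/terms/subject>"

counts = Counter()
with gzip.open("rluk-dump.nt.gz", "rt", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().rstrip(" .").split(" ", 2)
        if len(parts) == 3 and parts[1] == SUBJECT:
            counts[parts[2]] += 1

# The ten most frequent subject headings in the dump.
for heading, n in counts.most_common(10):
    print(n, heading)
```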

So the question that became obvious at the end of the day was how RLUK and The European Library can make Linked Open Data more usable.

Should we offer several different versions of the LOD, split by properties such as university, subject or author? Should we offer an on-demand service that lets users request a specific sub-set of the data? Should we give users the chance not to download the data at all, but to upload their code instead? These are questions we still need to investigate!
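
For instance, the on-demand sub-setting idea could be as simple as one streaming pass over the full dump that keeps only the triples for a requested library. In this sketch the library URI prefix is purely illustrative; the real identifiers depend on how the RLUK conversion models library-level provenance.

```python
import gzip

# Illustrative prefix for one library's record URIs.
PREFIX = "<http://example.org/rluk/library/cambridge"

# In N-Triples the subject comes first on each line, so a prefix test
# on the raw line is enough to extract one library's records.
with gzip.open("rluk-dump.nt.gz", "rt", encoding="utf-8") as dump, \
     open("cambridge-subset.nt", "w", encoding="utf-8") as out:
    for line in dump:
        if line.startswith(PREFIX):
            out.write(line)
```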