Is crowdsourcing dumbing down research?

(Originally published in The Guardian, July 29 2011)

Whether your favourite tipple is a lanny or a craitur might depend on whether you’re a wine or a whiskey drinker, and even where in Scotland you live.

Last month an innovative new project funded by JISC asked people to contribute to a unique dictionary of Scottish words and place-names. The twist? Contributors are using tools of the web: posting messages on a Facebook page, tweeting the project team and contributing to an online discussion.

It’s the latest in a series of community projects that are asking the general public to contribute their knowledge and expertise to research through interactive web technology, not simply because they can or because it’s trendy, but because crowdsourcing is now, by default, digital. The idea behind this particular project is to focus firmly on how people are speaking now rather than the more traditional approach of largely gathering evidence from written material – so it makes perfect sense to go out to where people are, already tweeting, posting and updating their Facebook pages.

Two major factors have contributed to the growth of such projects. Web 2.0 technologies have developed to offer far more interactivity in the past few years – whether it’s adding comments to a page, video to YouTube or simply uploading photos to a central archive, content publishers now have more flexibility than ever before for interacting with a wide range of users. The British Library sound map project asks people to contribute audio recordings that are published on their webpages; JISC’s Strandlines project is assembling documents that articulate the history of one of London’s most famous streets, The Strand. As digital cameras, video devices and supporting software become more widespread, it’s possible to collate a range of media from the crowd when this might be very expensive to do independently.

But it’s not just about multimedia – the Old Weather project asks the public to transcribe Royal Navy log books from the early 20th century, which include valuable meteorological data recorded by ships’ crew members. Such an approach has a triple benefit – naval enthusiasts have whole new stories about British seafaring; military and other historians have fresh evidence and scientists have access to vital meteorological information to help them understand long-term patterns in climate change.

Researchers are seeing the advantages in developing meaningful relationships with businesses, public sector partners and community groups just as the universities they work for are actively developing their external engagement missions. These outside groups are sources of expertise, funding and advice but can also take research to wider audiences. Getting people involved means these users evolve to become both consumers and creators of digital data.

But when does ‘crowdsourcing’ work well? First, if you’re looking for expertise from a range of sources then the potential for ideas is massive. BMW received 4000 ideas within seven days of setting up its Virtual Innovation Agency which invites ideas for products and designs. The term crowdsourcing doesn’t seem to accurately cover the depth of this kind of activity.

Second, asking for contributions online can be an excellent option when funds are limited. JISC supported the Great War Archive project, which asked people to contribute photos and memories from their own wartime collections to a central website, either directly online or through roadshows where items were brought along and digitised on the spot. The project team calculated that this was incredibly cost-effective – each item submitted through the archive cost around £3.50 to ‘capture’, catalogue, and distribute, compared with around £40 per item when digitisation was managed in-house. Collecting on this scale would also take much longer with a small team, whereas crowdsourcing can speed up a potentially time-consuming process.

Idle computing power has long been donated by those wishing to contribute to projects like the search for extra-terrestrial intelligence. But when some of the responsibility for content is pushed out into the public arena, is there a risk that we are trawling research data from the hands of those who know little about it? How do we balance the quantity of content we need with rigorous quality control?

The University of Oxford’s Galaxy Zoo, which asks the general public to describe and classify astronomical images, has addressed this well. In addition to developing intelligent mechanisms for recording and analysing public contributions, the Oxford team and their partners ensure that they give due credit to their contributing ‘citizen scientists’ right from the outset – to the extent that they are cited as contributing authors in published articles. Galaxy Zoo demonstrates that we have to be prepared to share that balance of power with those who fund, contribute to and benefit from our research. Only by showing our processes and opening up our data, early findings and papers are we going to find support for the research of tomorrow. Just as big brands can build trust by getting consumers involved in initiatives like MyStarbucks, so we can enhance the non-academic world’s trust in research by inviting them through the keyhole right from the start of our projects.


Crowdsourcing and Variant Digital Editions – some troubles ahead

(This post was first published on the JISC Digitisation blog, July 2011)

Projects like UCL’s Transcribe Bentham and New York Public Library’s What’s on the Menu? have done groundbreaking work in engaging the public to transcribe their manuscript collections.

Crowdsourcing allows rapid, and it seems high-quality, creation of transcribed data from original documents. Transcribe Bentham has so far created 1,330 transcribed versions, and only a handful have been rejected for a lack of quality. Previously, such scholarly transcription would have taken considerable time and effort, spanning many years.

With notable successes like these, crowdsourcing is now becoming more familiar as an academic tool. But for certain datasets, particularly ones of considerable academic importance, it could bring some problems, because crowdsourcing has the ability to create multiple editions.

For example, the much-lauded Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) are now beginning to appear on many different digital platforms.

ProQuest currently hold a licence that allows users to search over the entire EEBO corpus, while Gale-Cengage own the rights to ECCO.

Meanwhile, JISC Collections are planning to release a platform entitled JISC Historic Books, which makes licensed versions of EEBO and ECCO available to UK Higher Education users.

And finally, the Universities of Michigan and Oxford are heading the Text Creation Partnership (TCP), which is methodically working its way through releasing full-text versions of EEBO, ECCO and other resources. These versions are available online, and are also being harvested out to sites like 18th Century Connect.

So this gives us four entry points into ECCO – and it’s not inconceivable that there could be more in the future.

What’s more, there have been some initial discussions about introducing crowdsourcing techniques to some of these licensed versions, allowing permitted users to transcribe and interpret the original historical documents. But of course this crowdsourcing would happen on different platforms with different communities, who may interpret and transcribe the documents in different ways. This could lead to the tricky problem of different digital versions of the corpus. Rather than there being one EEBO, several EEBOs would exist.

But this is part of a larger problem. If there are multiple versions of the original content, then which one is the one you use? In fact it’s not only about the content. Which platform works quickest? Which gives the most ‘accurate’ search results? Which one provides enhanced tools for analysis? Which gives the best results for your particular area of research? Where do you send your students? Which one do you cite?

Most importantly, which one do you trust? And why?

In ‘traditional scholarship’, different editions of original documents would be published at, for example, 50 year intervals, and it would be part of the scholarly workflow to review and criticise such editions. The complexity and proliferation of digital resources radically changes this – not only are there more digital resources but the knowledge and skills needed to critically analyse a resource are considerably widened out.

At the moment, there are no immediate solutions for these challenges. But it’s clear that the potential of the Internet continues to fracture existing practices of scholarship – despite the care, attention, and research intelligence that has gone into creating EEBO, ECCO and their various platforms, the potential for academics, funders and publishers to push forward and develop new digital ideas means that the notion of the Internet as a place where traditional scholarly practices can simply be repeated continues to disintegrate.

