On 7 December 2015, the text and data mining projects OpenMinTeD and FutureTDM organised a workshop about the text and data mining challenges for cultural heritage institutions. This workshop took place at the DISH conference, a biennial international conference on digital heritage and strategies for heritage institutions.
Presentations on text and data mining: examples and projects
In the first part of the workshop, the participants were given a quick peek into the world of text and data mining: Hege van Dijke (LIBER) gave a presentation about the need for text and data mining, and the European projects OpenMinTeD and FutureTDM that both work on this topic (See Slideshare). Steven Claeyssens (National Library of the Netherlands) presented how researchers can mine the data of the National Library and how the National Library has made this possible (See Slideshare).
Interactive session: text and data mining, what does it mean for cultural heritage institutions?
In the second part of the workshop, the participants were invited to give their input on what they perceive as the biggest barriers to making their text and data available for mining.
The identified barriers can be divided into three categories:
The cultural heritage institutions felt they don’t have the in-house knowledge to understand what researchers want or technically need in order to work with their data. There is a clear knowledge gap, and both the institutions and the researchers would benefit from more cooperation and interaction. Researchers can also play a role in convincing the institutions of the benefits of text and data mining of cultural heritage data.
The institutions are protective of their data. They want to prevent their data from being misused or used to make a profit. In that regard, they feel that opening up their text and data for mining is a risk. They also find it important that their data is credited when it is used.
The cultural heritage institutions find the current copyright laws very difficult to understand. Questions they are dealing with are: how do we deal with personal records among our data? When do we need to get permission from authors? What if the authors are deceased? What are the copyright rules on old images, old newspapers, and legal documents?
The institutions have data with many different kinds of licenses attached to them. Some items even have a per-item license, instead of a per-dataset license.
The cultural heritage institutions mentioned that the quality of their data differs greatly from item to item. Some items are well suited to optical character recognition (OCR), others not at all. The institutions are afraid that the OCR will not turn out 100% perfect. However, other institutions pointed out that it’s always better to try: the OCR doesn’t have to be 100% perfect, which is almost impossible anyway. Also, researchers can improve datasets along the way by pointing out mistakes in the OCR (open source).
The institutions also saw cost barriers: the price of creating and maintaining metadata is quite high, and investments in harmonisation of formats and licenses would be needed.
The cultural heritage institutions also brought to the table that they need external support: they need IT skills, sustainable tools, and TDM knowledge.
It was good to see that the cultural heritage institutions are very interested in text and data mining, but it is clear that there are many barriers to overcome before their data is “ready to be mined”. This workshop was a good exercise in identifying those challenges, and it fuelled the energy to start working on overcoming them. The outcomes of the workshop will be fed back into the projects OpenMinTeD and FutureTDM as they work towards solutions.
The workshop made many participants realise that the potential of text and data mining is huge, and that every text and dataset is a valuable bubble in the grand ocean of worldwide data.
Text and data mining truly is the future! 🙂