Mining Repositories: Assisting Researchers in their Text and Data mining Needs


On 13 June 2016, the OpenMinTeD project organised its third stakeholder workshop, titled “Mining Repositories: How to assist the research and academic community in their text and data mining needs”. The workshop took place at Trinity College Dublin as part of the OpenRepositories Conference and brought together repository managers from all over the world who are interested in text and data mining.

The workshop started with a presentation by project lead Natalia Manola (Athena Research Centre, Greece), who explained the OpenMinTeD project. She asked the repositories in the room to make their data available for text and data mining and showed why this matters scientifically: a wealth of information is hidden in big data, and it can only be discovered by mining it. For the presentation slides, please click here.




Natalia Manola

The next speakers were Nancy Pontika and Petr Knoth from the Open University in the UK. In their presentation they explained why repositories are in an ideal position to educate and inform researchers about the possibilities of text and data mining. They defined the role of repositories in supporting text and data mining as follows: “TDM [text and data mining] is all about processing text and data at scale. The role of repositories is to facilitate the aggregation of research papers at a full-text level (and beyond) effectively enabling TDM services to operate seamlessly on all available research content.” Petr Knoth highlighted a number of principles and tips that repositories should follow to best enable the harvesting of their data. For the presentation slides, please click here.


Nancy Pontika


Petr Knoth

The next presentation was given by Thomas Margoni from the University of Stirling, who talked about legal issues around text and data mining. Thomas explained the legal work done in the OpenMinTeD project and showed some of the legal barriers that currently stand in the way of text and data mining. He also explained how important the correct application of licenses is for repositories and which licenses qualify as “open access” licenses. For the presentation slides, please click here.


Thomas Margoni

In the interactive session that followed, the participants divided themselves into three groups. Each group talked for 15 minutes with a technical expert (Petr Knoth), a legal expert (Thomas Margoni) or an institutional expert (Natalia Manola). After 15 minutes, the experts changed to the next group, so that every group got to talk with every one of the experts.


Interactive session

After the break, each expert presented the topics discussed at their table:

At the technical table, the discussion went into considerable depth. The groups talked about the need for a coordinated effort to improve the way metadata records are linked to full texts within repositories, in order to increase the amount of content that can be aggregated for text and data mining purposes. There are few technical barriers to content harvesting as long as repositories follow certain common good practices. These include creating and agreeing on minimum service levels for open repositories, which can include, for example, maximum allowable download rate limits for aggregators.
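
As an illustration of the download rate limits mentioned above, here is a minimal sketch of how an aggregator might space out its requests to stay under a limit agreed with a repository. It was not presented at the workshop; the class name `PoliteHarvester`, its parameters and the `fetch` callable are all hypothetical, and a real harvester would also need error handling and retries.

```python
import time


class PoliteHarvester:
    """Hypothetical helper: keeps requests under an agreed maximum rate."""

    def __init__(self, max_requests_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_requests_per_second <= 0:
            raise ValueError("max_requests_per_second must be positive")
        # Minimum number of seconds that must pass between two requests.
        self.min_interval = 1.0 / max_requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait_for_slot(self):
        """Block until the next request is allowed; return the seconds waited."""
        now = self._clock()
        waited = 0.0
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
                now = self._clock()
        self._last_request = now
        return waited

    def harvest(self, urls, fetch):
        """Call fetch(url) for each URL, never faster than the agreed rate."""
        results = []
        for url in urls:
            self.wait_for_slot()
            results.append(fetch(url))
        return results
```

The `clock` and `sleep` arguments are injectable only so that the pacing can be checked without real waiting; in normal use the defaults apply and `fetch` would be whatever function downloads a record or full text.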


Technical expert table

At the legal table, a lot of questions were asked about open licenses and which licenses can be categorized as “open access”. According to Thomas Margoni, licences such as Creative Commons Public Licence (CCPL) Attribution (BY) and waivers such as CC0 (CC Zero, public domain dedication) are proper open access tools, whereas licenses with limitations such as non-commercial or non-derivative restrictions are not open access because they do not comply with the relevant international statements on Open Access. Another important concern that came forward was the lack of general metadata interoperability and standardisation, which in turn causes constant problems for the use and reuse of resources. Likewise, measurements and metrics are still produced by just a few commercial publishers, and this is also an important lock-in factor that prevents OA from being fully embraced. Thomas reminded the repositories, academic institutions and funding agencies of their responsibility, especially towards young researchers, to encourage them to publish open access.
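
The licence distinction described above can be sketched as a small check that a repository might run over its licence metadata. This is only an illustration of the criteria as reported here, not a tool from the project: the function name and the accepted spellings are assumptions, and licences the discussion did not cover (for example share-alike variants) are deliberately left undecided.

```python
def is_open_access_licence(licence):
    """Classify a licence label using the criteria reported from the legal table.

    Returns True for CC BY and CC0, False for any label carrying a
    non-commercial (NC) or non-derivative (ND) restriction, and None for
    labels the workshop report does not cover.
    """
    # Normalise separators so "CC BY-NC", "cc_by_nc" and "CC-BY-NC" match.
    tokens = licence.upper().replace("_", "-").replace(" ", "-").split("-")
    # Drop empty tokens and version numbers such as "4.0".
    tokens = [t for t in tokens if t and not t[0].isdigit()]
    if "NC" in tokens or "ND" in tokens:
        return False
    if "-".join(tokens) in {"CC0", "CC-ZERO", "CC-BY"}:
        return True
    return None  # not addressed in the report; needs a human decision
```

For example, `is_open_access_licence("CC BY 4.0")` is treated as open access, `is_open_access_licence("CC BY-NC")` is not, and `is_open_access_licence("CC BY-SA")` returns `None` because the report does not say how such licences were classified.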


Legal expert table

At the institutional table, the groups talked about how repositories can be most valuable to researchers. The groups mentioned the need for repositories to be fuller and richer, and noted that additional services on top of the data make them more valuable. Text and data mining services would be great additional services. However, not many academics want to text mine, and they are often not even aware of its potential. The groups expressed their need for better education on how they can be the best point of reference for researchers when it comes to text and data mining the data in their repositories, and on how they can make researchers more aware of the potential and benefits of text and data mining.


Institutional expert table

The next part of the workshop consisted of four talks:

  • Sara Gould (EThOS): “Tentative steps in mining UK theses”.

In her presentation, Sara first introduced EThOS, an online e-theses service that publishes about 20,000 PhD theses every year. She then gave some examples of how these publications have been mined so far: first, a report on the current landscape of UK dementia research, and second, an interactive language learning environment called FLAX. For the presentation slides, please click here.


Sara Gould

  • Chris Mansfield (QMUL): “Building Teaching and Learning Corpora with the British Library EThOS Collection”.

The presentation by Chris built on Sara Gould’s presentation, going into more detail about EThOS. Chris explained the benefits and concerns that have been identified about EThOS. For the presentation slides, please click here.


Chris Mansfield

  • Mahendra Mahey (British Library Labs): “Small Text and Data Mining Experiments with the British Library’s Digitised Collections”.

Mahendra wasn’t able to join in person, but he sent his presentation as a video, which is now available for viewing online. In his presentation, Mahendra explained what kind of data British Library Labs makes available and how it encourages researchers to mine that data. He also gave examples of researchers who have already mined this data: researcher Katrina Navickas mined the data to make a political meetings mapper, and researcher Bob Nicholson made a Victorian Meme Machine. For more information, watch the video.

  • Balviar Notay: “JISC Open Access Services and importance of text mining capabilities”.

In her presentation, Balviar showed the importance of text and data mining through a number of benefits that text and data mining can bring to society. She also demonstrated the many ways that JISC is involved in facilitating text and data mining. For the presentation slides, please click here.


Balviar Notay

Petr Knoth concluded the workshop by thanking all the participants and speakers for their contributions. He noted that the repository community is interested in text and data mining, but that more awareness still needs to be raised among this group. Repository managers can play an important part in helping, guiding and instructing researchers on how to text and data mine the publications in repositories.
