The OpenMinTeD event titled ‘Paving the way for text and data mining in science’ was successfully held in Brussels on May 24th, 2018. It was an open invitation to all TDM stakeholders in Europe (publishers as content providers, TDM experts, researchers and SMEs). The agenda was carefully designed to provide a full TDM experience while focusing only on OpenMinTeD. After all, OpenMinTeD is a “TDM Hub” of TDM applications and components combined with open access content from open access aggregators.
The event started with a brief welcome and a short introduction to OpenMinTeD by the OpenMinTeD coordinator and OpenAIRE Managing Director, Natalia Manola.
Next, the EC perspective on Text and Data Mining and Open Science was presented by two EC officers, Caroline Colin and Jean-François Dechamp. Their presentation covered the main objectives of the new directive on copyright in the Digital Single Market:
- Modernising EU rules on key exceptions and limitations in the areas of research, education, and preservation of cultural heritage
- Facilitating licences in order to ensure wider access to content (out-of-commerce works, negotiation mechanism/VoD platforms)
- Introducing fairer rules for a better functioning copyright marketplace (press publisher’s rights, value gap, remuneration of authors and performers)
They also explained why the EC cares about TDM:
- It’s different for science, meaning that authors usually give away their copyright, and license-based solutions for scientific papers do not seem to work
- The volume of digital content requires large-scale analysis with TDM, and almost all scientific journals are already available online, as are research library collections
- Open Science is supported by public funding, draws on multidisciplinary sources from public and private owners, and allows data to be reused
Additionally, it was mentioned that the European Commission's TDM proposal in the European Council was to set a mandatory exception allowing research organisations to carry out TDM, for scientific research purposes (commercial and non-commercial), on content to which they have lawful access.
The next session was a storyline on TDM, “Making sense of Science”. Stelios Piperidis (Institute for Language and Speech Processing, Athena Research & Innovation Center) also presented the story of OpenMinTeD: how it started and how users can now process, share and discover TDM tools and content. The presentation pointed to the massive production of content in general and focused on scientific content (2.5 million articles/year). It stressed the need to make sense of all that data by using machine learning and an understanding of entities, relations and structures to extract meaningful insights and improve the ability to predict. Even though solutions exist, they focus on different text types, domains, tasks and languages, creating a complex landscape. This complexity triggered the launch of the OpenMinTeD project and its services, which target content providers, software providers, researchers and SMEs. The services and the overall operations of OpenMinTeD were explained.
In brief, the services of the OpenMinTeD platform are the following:
- The OpenMinTeD catalogue of corpora, mainly datasets of open access scholarly publications, registered in the OpenMinTeD platform. Users can view and browse publicly available corpora.
- The OpenMinTeD catalogue of TDM applications. The catalogue targets users with little or no prior text mining experience, who can search for, discover and easily use ready-to-run applications on content registered in the platform.
- The OpenMinTeD catalogue of TDM components, i.e. pieces of software that perform basic tasks and can be reused to build applications. It mainly targets TDM developers who know how to combine components into workflows with the OpenMinTeD workflow editor and then offer them to end-users as ready-to-use applications.
- The OpenMinTeD catalogue of ancillary knowledge resources. It includes Machine Learning (ML) models and computational grammars that can be combined with TDM software, as well as annotation resources (lexica, ontologies, etc.) that can be used for annotating content resources. Users can browse through the catalogue or discover resources according to specific criteria.
- OpenMinTeD TDM applications execution service. This service primarily targets researchers with little or no knowledge of text mining who need to find and run TDM applications on content without going through complicated processes.
- OpenMinTeD corpus builder of scholarly works. This service allows users to form a collection of open access scholarly and scientific content from major content aggregators (i.e. OpenAIRE, CORE) and create a “corpus” to mine.
- OpenMinTeD builder of TDM applications, where users can build new TDM applications by combining various TDM components. The service is intended for expert TDM developers who know how to configure the TDM components.
- OpenMinTeD TDM Support & Training services that aim to (a) raise awareness about TDM among researchers and instruct them on how to integrate it in their research activities and workflows, and (b) promote the OpenMinTeD platform. These services include FAQs, webinars, tutorials, TDM stories, courses and guidelines. More can be found in the OpenMinTeD Knowledge Base in the FOSTER platform.
- Catering for legal interoperability, OpenMinTeD has elaborated a license compatibility matrix, a service whose usefulness extends beyond OpenMinTeD. It demonstrates the compatibility among available licenses on content, software and services.
Lastly, Piperidis demonstrated how OpenMinTeD has been reaching out to scientific communities since the very beginning of the project, in Scholarly Communication, Life Sciences, Agriculture and Social Sciences.
The next session was on TDM for scientific literature in practice, starting with the publishers and closing with the success stories of three winners of the OpenMinTeD Open Call 2 for software providers. The publishers who kindly accepted our invitation to participate in this discussion were Elizabeth Crossick (RELX Group), Frederick Fenter (Frontiers) and Stuart Taylor (Royal Society). All three publisher representatives explained that applying TDM to analyse large numbers of articles is crucial for assisting research.
The session started with the panelists making brief presentations on the barriers to and opportunities of TDM from their own perspectives and experiences. An open discussion between the panelists and the audience followed. Several key themes were touched upon, including technical and policy barriers to mining content from scientific publishers, expectations and trust from both the publisher and the miner perspective, opportunities for effective collaboration and mutual benefits, licensing, and the role of Open Access publishing in TDM.
Throughout the discussion, two points were emphasised: the collaborative aspect, and the need of the TDM community to mine the corpora hosted on publisher platforms efficiently, without running into unnecessary technical barriers. Both the panelists and the audience agreed that it is extremely important to lower the barriers to TDM as much as possible within the legal framework of copyright. They also agreed that only through thoughtful and practical conversations with the community will publishers be able to provide the best services in support of efficient and effective TDM practices.
The session was completed with the following presentations:
Three winners of the Open Calls were invited to present their work. Horacio Saggion (UPF, TALN Group, University of Barcelona) showed the “Scientific Summarization Services” tool that his team has integrated in the OpenMinTeD platform. It automatically identifies the most important information in a research article by analyzing, extracting and characterizing several aspects of each sentence. This information is used to compute different scores, which are then used to rank each sentence of the article.
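To make the idea of score-based sentence ranking concrete, here is a minimal Python sketch. It is not the “Scientific Summarization Services” implementation: the three features (word frequency, sentence position, overlap with the title), their equal weighting, and the function names are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase alphabetic tokens only; a real system would use a proper tokenizer.
    return re.findall(r"[a-z]+", sentence.lower())


def rank_sentences(text, title=""):
    """Return the sentences of `text`, ordered from highest to lowest score."""
    sentences = split_sentences(text)
    word_counts = Counter(tok for s in sentences for tok in tokenize(s))
    title_terms = set(tokenize(title))

    scored = []
    for index, sentence in enumerate(sentences):
        tokens = tokenize(sentence)
        if not tokens:
            continue
        # Feature 1 (hypothetical): average document-level count of the sentence's words.
        frequency = sum(word_counts[t] for t in tokens) / len(tokens)
        # Feature 2 (hypothetical): earlier sentences receive a higher score.
        position = 1.0 / (index + 1)
        # Feature 3 (hypothetical): share of title terms that occur in the sentence.
        overlap = len(title_terms & set(tokens)) / len(title_terms) if title_terms else 0.0
        # Equal weights are an arbitrary choice for this sketch.
        scored.append((frequency + position + overlap, sentence))

    # Stable sort by score only, so tied sentences keep their document order.
    return [s for _, s in sorted(scored, key=lambda pair: pair[0], reverse=True)]


ranked = rank_sentences(
    "Text mining helps researchers. Cats sleep. Text mining tools rank text.",
    title="text mining",
)
```

A summary could then be formed by keeping the first few entries of `ranked`; how many to keep, and how the features are weighted, would depend on the actual tool.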
Fabio Rinaldi (University of Zurich and Swiss Institute of Bioinformatics, Switzerland) presented the “BTH & OGER for OpenMinTeD” tool integrated in OpenMinTeD. The OntoGene Biomedical Entity Recogniser (OGER) allows annotation of a collection of documents, while the Bio Term Hub (BTH) is a one-stop site for obtaining up-to-date biomedical terminological resources.
Matthew Shardlow (Manchester Metropolitan University) presented a text mining application for journalism, integrated in the OpenMinTeD platform. “A journalist must be a temporary expert in a wide variety of topics”. Starting from this premise, the presentation showed how answers to the five W’s a journalist has to address (What, Where, When, Who, Why) can be found by searching the scientific literature and applying this text mining tool.
At the end of this session, the winners of Open Call 2 discussed and commented on the unique features of OpenMinTeD in comparison to other platforms in this area. One is that OpenMinTeD, as opposed to other TDM orchestration platforms, offers a very flexible way of integrating text and data mining components: components available in widely used TDM tools, including UIMA and GATE, as well as custom-built TDM components packaged as Docker images or exposed as external web services. Another feature seen as powerful is the availability of large corpora and text processing tools within the same platform.
The legal session followed, with Maria Rehbinder (Aalto University) and Prodromos Tsiavos (Athena Research Center) accepting the invitation to join. The event almost coincided with the date on which the GDPR became applicable across Europe, which prompted an open discussion on the effect of the GDPR on TDM. Would the GDPR signal the death of TDM? Thomas Margoni (University of Glasgow, CREATe) explained how OpenMinTeD managed to overcome legal challenges and barriers and to inform researchers, TDM experts and content providers. The key element was the “Compatibility Matrix” created within the OpenMinTeD project to guide stakeholders on combining licenses on content, software and services.
The legal session offered an overview of the main results of the project’s legal interoperability working group, led by Thomas Margoni from CREATe, University of Glasgow. The report started with a brief overview of the current EU legal framework in the field of TDM and of why the currently proposed text of Art. 3 (the TDM exception for research organisations), while underpinned by the right innovation policy goal, is not satisfactory. In addition to the licence compatibility matrix already mentioned, the session showcased a set of supporting documents (e.g. the Open Science Fact Sheet and an Open Access FAQ) and a recent analysis of the legal implications of training models for natural language processing (NLP) applications (also available as a poster). These results and documents were presented in the format of an open discussion. Maria Rehbinder (Aalto University) kindly agreed to moderate, and Prodromos Tsiavos (Athena Research Center) offered a high-level perspective extending to privacy/data protection (very timely, as the GDPR became applicable the next day!) and Public Sector Information. He suggested that these latter pieces of EU law, which are or have recently been the object of reform or reform proposals, may offer a better source of inspiration for the future challenges of data governance.
The last session, 3YFN (3 years from now), was a panel discussion focusing on the potential use of TDM technologies, platforms and infrastructures in the near future. How is industry responding and moving towards TDM adoption? What do researchers foresee? The panel was composed of Alfonso Valencia (ELIXIR & Barcelona Supercomputing Center), Laurence El Khouri (ISTEX & National Center for Scientific Research (DIST/CNRS)), Sophia Ananiadou (NaCTeM, National Centre for Text Mining, University of Manchester) and Claire Nédellec (INRA, Institut national de la recherche agronomique).