LREC Workshop on Cross-Platform Text Mining and Natural Language Processing Interoperability
Our efforts towards improving interoperability in the Text Mining (TM) and Natural Language Processing (NLP) communities continue. OpenMinTeD organised a workshop on this subject at the International Conference on Language Resources and Evaluation (LREC) on 23 May 2016. Alessandro Di Bari (IBM) opened the workshop with a keynote on transferring ideas from the model-driven approaches of software engineering to enhance interoperability in TM and NLP.
Then, 14 accepted papers were presented in a series of 5-minute lightning talks. For most of the day, however, the 40 participants from all over the globe discussed interoperability issues and solutions in four groups focussing on data and metadata management, processing and workflows, language and semantics, and legal and policy aspects.
The focus on personal communication and discussion set this workshop distinctly apart from the mini-conference style of other workshops. It gave the participants the opportunity to go beyond merely reporting on their individual work and to investigate particular aspects in detail. The participants enjoyed the format and the creative, constructive environment so much that lunch and coffee breaks became largely secondary.
Several topics raised at the workshop were picked up again during the remainder of the conference. Here are a few of the many points that were discussed:
- Creating and curating metadata on digital objects is important. This requires techniques to collect and harmonise metadata from various sources and to combine user-generated metadata with metadata mined or extracted from the data. The goals are to avoid redundancy, to promote standardisation of descriptive metadata and to uniquely identify digital objects across different repositories, e.g. through content-based hashing.
- The discussion on processing workflows gave a lot of attention to the Galaxy workflow editor and execution system, which has its origins in the bioinformatics domain. Recently, it has also gained attention in the text mining and NLP communities, and many of its benefits and drawbacks were discussed. It was interesting to note that everybody seems to have had a different strategy for adapting the tool to language processing needs and for running language processing components as distributed services or on a computer cluster. Better support for workflow validation and the capability to deploy workflows dynamically to computing resources were among the most desired but unavailable features.
- The discussion on language and semantics revolved around mapping different knowledge representations and annotation schemes to each other. Approaches discussed ranged from techniques of the semantic web and linked open data to the concept of model-driven architectures from the domain of software engineering.
- The legal discussion group examined issues around mining and processing text based on various concrete use-case scenarios in an attempt to pinpoint the legal obstacles and uncertainties that currently hinder the full development of text and data mining (TDM). A specific European copyright exception for text and data mining was considered to be necessary and feasible in the short term, while a more general, open-ended norm was seen as the ultimate and optimal goal in the long term. In addition, the issues of derivative works and of applicable laws in transnational research were identified as areas that need further investigation. Read more about this group in their blog post.
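To make the idea of content-based hashing from the metadata discussion a bit more concrete, here is a minimal sketch in Python. It is not something presented at the workshop. The function names `content_id` and `find_duplicates`, the choice of SHA-256 and the normalisation steps are all assumptions made for illustration.

```python
import hashlib
import unicodedata


def content_id(text: str) -> str:
    """Return a SHA-256 hex digest computed from the normalised text.

    Hypothetical helper: the normalisation (Unicode NFC plus whitespace
    collapsing) is an assumed choice, so that trivial formatting
    differences do not change the identifier.
    """
    normalised = unicodedata.normalize("NFC", text)
    normalised = " ".join(normalised.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_duplicates(records):
    """Group (repository, text) pairs by content identifier.

    Returns only those identifiers that occur in more than one record,
    mapped to the list of repositories holding the same content.
    """
    groups = {}
    for repository, text in records:
        groups.setdefault(content_id(text), []).append(repository)
    return {cid: repos for cid, repos in groups.items() if len(repos) > 1}
```

With such an identifier, two repositories holding the same document under different local names would compute the same hash, so their metadata records could be linked instead of duplicated. A real system would also have to agree on what counts as "the same content", for example across file formats, which this sketch does not address.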
We made good use of the social internet by collaboratively editing the workshop minutes on Google Docs and providing links to these via the workshop’s website. Not only did this allow participants to peek into the discussions taking place at other tables, but even outsiders could follow the course of the workshop as it evolved. For example, one of the paper authors was unfortunately unable to attend in person but, thanks to the excellent media system at the conference venue, could present their paper via Skype and later follow the discussion on Google Docs.
This blogpost was written by Richard Eckart de Castilho, UKP Lab, Technische Universität Darmstadt