From Mess to Machine: Making Text and Data Mining Services Interoperable



Hi, my name is Richard. I lead the “Interoperability Framework” work package (WP 5) in OpenMinTeD. Today, I am blogging about our activity in the “Interoperability Specification” task, which is one of several tasks within the work package. This is very exciting because we have many high-profile partners in this work package, not only from within OpenMinTeD but also from the Natural Language Processing and Data Mining communities at large, and the list is still growing as I write this.

What is interoperability and why do we need it?

A data mining system consists of many small parts that perform specialized operations. In the context of OpenMinTeD, such operations may include identifying the research methods used in a scientific publication, identifying the genes mentioned in a publication, extracting citation information, and so on. This information can then be used, for example, to locate all scientific publications describing the effects of a particular gene or to determine the most relevant publications in a particular field of research. Interoperability is a tough issue here because many researchers worldwide (and many European ones organized in OpenMinTeD) develop such specialized analysis operations independently of each other. Colloquially speaking, we have lots of cogs that do not quite fit together into a running clockwork because they adhere to different standards. The task of the OpenMinTeD interoperability work package is to work out a way of overcoming these differences.
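To make the picture of cogs that do not fit a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two taggers, their output formats, the `Annotation` type, and the adapter functions are hypothetical and are not part of any OpenMinTeD specification. It only shows the general idea that independently developed components can be combined once their outputs are mapped onto one shared representation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Annotation:
    """Hypothetical shared format: a labelled span of character offsets."""
    begin: int
    end: int
    label: str


def find_all(text: str, term: str) -> Iterator[int]:
    """Yield the start offset of every occurrence of term in text."""
    start = text.find(term)
    while start != -1:
        yield start
        start = text.find(term, start + 1)


# Hypothetical component A: reports gene mentions as (begin, end) tuples.
def gene_tagger(text: str) -> List[Tuple[int, int]]:
    genes = ["BRCA1", "TP53"]  # toy lexicon, for illustration only
    return [(i, i + len(g)) for g in genes for i in find_all(text, g)]


# Hypothetical component B: reports method mentions as dicts
# with an offset and a length instead of begin/end offsets.
def method_tagger(text: str) -> List[Dict[str, object]]:
    methods = ["PCR", "sequencing"]  # toy lexicon, for illustration only
    return [
        {"offset": i, "length": len(m), "type": "method"}
        for m in methods
        for i in find_all(text, m)
    ]


# Adapters translate each component's native output into the shared format.
def adapt_gene_tagger(text: str) -> List[Annotation]:
    return [Annotation(b, e, "gene") for b, e in gene_tagger(text)]


def adapt_method_tagger(text: str) -> List[Annotation]:
    return [
        Annotation(d["offset"], d["offset"] + d["length"], d["type"])
        for d in method_tagger(text)
    ]


def run_pipeline(
    text: str, components: List[Callable[[str], List[Annotation]]]
) -> List[Annotation]:
    """Run every adapted component and merge the results in text order."""
    annotations: List[Annotation] = []
    for component in components:
        annotations.extend(component(text))
    return sorted(annotations, key=lambda a: (a.begin, a.end))


if __name__ == "__main__":
    sentence = "We analysed BRCA1 using PCR."
    for a in run_pipeline(sentence, [adapt_gene_tagger, adapt_method_tagger]):
        print(a.label, sentence[a.begin:a.end])
```

In this toy setup, each adapter is the only place that knows about a component's native format. A shared specification aims to make such one-off adapters unnecessary by agreeing on the common representation up front.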

Four expert working groups

We have four expert working groups that cover the four aspects of metadata (WG1), language resources (WG2), licensing (WG3), and workflow (WG4):

• Metadata: relates to how the data being analyzed (e.g. scientific publications), the operations used to analyze the data, and the analysis outputs can be described, and how they can be found by people looking for them.
• Language resources: relates to auxiliary data that is needed during the data mining process. This could be, for example, a catalog of all known genes and their different names.
• Licensing: covers legal questions as to whether data may be analyzed at all, if so in which ways, and how the results of the analysis may be used further.
• Workflow: considers how simple analysis operations can be combined into complex analysis workflows. (The example operations mentioned above are not really simple; they are made up of smaller analysis steps that are not very illustrative.)

This is a lot of ground to cover, and there is a lot of existing work to take into consideration.

Workshop on 12 November

This brings me back to the work with the partners. Many hours of discussion and research will go into the OpenMinTeD interoperability specification. An upcoming milestone on this road is our first OpenMinTeD Interoperability Workshop, which will take place on 12 November in The Hague. It brings together experts from OpenMinTeD and from the community at large to discuss the state of the art of interoperability in text and data mining, and to set out the first ideas and action items for interoperability within OpenMinTeD and, of course, between OpenMinTeD and the rest of the world.

So thanks for your interest, and I hope to see you again soon in another blog post.

Richard

Dr. Richard Eckart de Castilho works as a Technical Lead at the Ubiquitous Knowledge Processing (UKP) Lab (Technische Universität Darmstadt). For more information or inquiries about OpenMinTeD’s interoperability framework, please contact Richard by email or leave a comment below this post.