e-Infrastructures in the language technology community get together
On 22 May 2016, OpenMinTeD held its second stakeholder workshop at the LREC conference in Portoroz, Slovenia. The workshop took the form of a roundtable and brought together strategic players and stakeholders from the language technology community and neighbouring areas. Stelios Piperidis (Athena Research Center / ILSP) led the discussion. Among the attendees were representatives from CLARIN-CZ, CLARIN-ERIC, OpenAIRE, ELDA and LAPPS Grid.
The goal of the roundtable was to discuss parallel efforts currently under way, as well as opportunities for long-term harmonisation and strategic collaboration among existing (and future) e-Infrastructures in Europe and beyond. Stelios Piperidis opened the discussion by explaining that the roundtable was structured around achieving interoperability at five levels:
- Metadata of language resources and annotations
- Data representations and vocabularies
- Services across platforms and frameworks
- Cloud computing infrastructures
- Access restrictions and permissions for certain operations
For each of these levels, the participants discussed:
- the strategies needed to cross the boundaries of platforms, scientific domains, languages and national legislation
- what they have learned from similar attempts so far
- what the research community can do to promote interoperability at this level
- what sort of alliances and policies are necessary to support overcoming the current barriers
For example, in the discussion on metadata interoperability, the participants debated whether a detailed metadata schema is needed for storing data, or whether a minimal set of metadata is enough. The issues raised included:
- Different domains require different metadata, so it is difficult to define one set that fits all. Either machines need to make this distinction, or we just require a minimal set.
- We can leave it up to the researcher to submit the metadata; each e-Infrastructure can then make a copy of the dataset and convert it to its own standard format.
- The metadata of the processing is also important to keep track of, especially for text and data miners who need to make their workflows reproducible. We would need a version-tracking system that records every step in the workflow of the text or data miner. A workflow consists not only of tools, but also of dictionaries and other datasets.
- The quality and accuracy of the metadata need to become much higher than they currently are in order to be useful for text mining.
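The first and third points can be illustrated with a small data structure: a minimal core set of metadata, room for domain-specific additions, and a processing history that records the version of every tool, dictionary and dataset used. The Python sketch below is only an illustration under stated assumptions. The class names, the choice of core fields and the example identifiers are all hypothetical and do not come from OpenMinTeD or any of the infrastructures named above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class ResourceRef:
    """A versioned reference to anything a workflow step used."""
    identifier: str  # hypothetical identifier, not a real scheme
    kind: str        # e.g. "tool", "dictionary" or "dataset"
    version: str


@dataclass
class WorkflowStep:
    """One processing step together with the resources it depended on."""
    description: str
    resources: List[ResourceRef] = field(default_factory=list)


@dataclass
class MetadataRecord:
    # Minimal core set (an illustrative choice of fields, not a standard).
    identifier: str
    title: str
    language: str
    license: str
    # Domain-specific metadata is kept outside the core set, keyed by domain.
    extensions: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    # Processing history, oldest step first.
    provenance: List[WorkflowStep] = field(default_factory=list)

    def missing_core_fields(self) -> List[str]:
        """Return the names of core fields that are empty or blank."""
        core = ("identifier", "title", "language", "license")
        return [name for name in core if not getattr(self, name).strip()]


# Example usage with made-up values.
record = MetadataRecord(
    identifier="example:corpus-1",
    title="Example corpus",
    language="sl",
    license="CC-BY-4.0",
)
record.extensions["biomedical"] = {"ontology": "example-ontology"}
record.provenance.append(
    WorkflowStep(
        description="tokenisation",
        resources=[
            ResourceRef("example:tokenizer", "tool", "1.2.0"),
            ResourceRef("example:lexicon", "dictionary", "2016-05"),
        ],
    )
)
```

Keeping the core set small makes it easy to validate across infrastructures, while the `extensions` mapping lets each domain attach what it needs without changing the shared part.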
The roundtable took place in two sessions of two hours each, and was part of an OpenMinTeD interoperability workshop at the LREC conference in Slovenia.
In summary, the most important points discussed were:
- In general, e-Infrastructures observe that data providers are very reluctant to share data and results
- Scientific publications should be openly available (through publication hubs)
- e-Infrastructures would benefit from:
  - A common minimal metadata set but also room for linking additional metadata
  - Identifiers for datasets/services/workflows
  - A registry of registries
  - Data management plans
  - Incentives for researchers to publish their data in repositories
  - Workflows for data publishing that are as simple as possible
- Galaxy is a good example of offering an incentive for adopting standards and caring about interoperability
- We should distinguish between e-Infrastructures for users who are not IT-savvy (simple, controllable, observable and understandable) and more complex e-Infrastructures for IT-skilled users
- Fragmentation weakens our human language technology (HLT) field, and the field is also isolated from other groups and areas in computer science, even though language technology is just a modality for many other disciplines
- The impact of current deep-learning-based methodologies on the future of e-Infrastructures and their profile (mostly in terms of their complexity)
Towards the end of the roundtable, Nicoletta Calzolari (LAPPS Grid / ELRA / ILC-CNR) said that, over time, different language communities have established different ways of storing their data, and that a way to link all this data is in high demand. Google has become one of the major hubs where researchers find data, yet much of the data produced by the language community is not visible there. Maybe the Linked Hub will give us the visibility we need; otherwise, we need to look for other ways to link the data within our field and make it visible. Compared to other communities, the language community also needs to become less isolated and better linked to other research communities. We are looking for cooperation and opportunities to do this.
After this remark, the participants discussed the opportunities they see for better cooperation. The new OpenMinTeD platform will be one way to establish it, as it will link open data with services and tools and seeks solutions to a range of interoperability issues. But more is needed, both for open data and for data under stricter licenses. Participants said that what is needed is not another interoperability project, but a visionary open science programme in which e-Infrastructures work together. As a first step, the attendees of the roundtable will organise a small joint action among e-Infrastructures to offer services to data providers in exchange for their contributions.
The next OpenMinTeD stakeholder workshop will be for repository managers and will take place in Dublin on 13 June as part of the OpenRepositories conference.