Working Groups

OpenMinTeD will develop an interoperability framework to connect existing tools, services, resources and content.
We have formed four working groups of internal and external experts who will focus on horizontal infrastructural aspects and produce guidelines (common formats and protocols) for the text-mining infrastructure.

Resource Metadata

Description

The purpose of WG1 “Resource Metadata Working Group” is to compile and maintain an inventory of existing metadata schemas and their representation (XML, RDF) for documenting scholarly communication/literature content, language resources, text mining/language processing services and platforms. The focus will be on content & service discoverability and interaction, as well as on metadata quality. The work will be based on consortium expertise and services developed by ARC (OpenAIRE, META-SHARE) and USFD (AnnoMarket), as well as specifications and recommendations developed in the framework of CLARIN and the LOD initiatives.

WG1 will prepare a draft specification consolidating the above to support TDM’s resource registration and access services. This specification should cater for:
  • generic and domain-specific metadata descriptions for scientific publications (e.g., OpenAIRE/RIOXX/NISO/CrossRef guidelines),
  • generic and domain-specific metadata descriptions for language resources, language processing and text mining services (e.g. W3C DCAT, META-SHARE, CLARIN/CMDI) as well as links to data categories taxonomies (e.g., ISOCat),
  • standards for persistently identifying scientific publications, language resources, language processing and text mining services (DOIs, PIDs, etc.),
  • standards for metadata harvesting and federated search in distributed repositories,
  • standards for representing provenance information (e.g., W3C PROV-O).
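As an illustration of the harvesting point above, metadata in distributed repositories is commonly collected via OAI-PMH. The sketch below builds a ListRecords request URL using only Python's standard library; the repository endpoint and set name are hypothetical.

```python
from urllib.parse import urlencode

def oai_listrecords_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for harvesting metadata."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional selective harvesting
    return base_url + "?" + urlencode(params)

# Hypothetical repository endpoint, for illustration only.
url = oai_listrecords_url("https://repo.example.org/oai", set_spec="openaire")
print(url)
```

A harvester would then page through the XML responses using the resumption tokens defined by the protocol.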
WG1 will deal with external metadata. By its nature, WG1 will coordinate with the other WGs, most notably:
  • WG3 IPR, considering lawful access, reuse and processability of content an integral part of metadata-based documentation
  • WG4 Annotation and workflows, in particular for the language processing/text mining services dimension.

Members

Stelios Piperidis (ILSP/ARC) (Lead), Penny Labropoulou (ILSP/ARC) (Lead), Maria Gavriilidou (ILSP/ARC), Prokopis Prokopidis (ILSP/ARC), Natalia Manola (ILSP/ARC), Theo Manouilidis (ILSP/ARC), Kalina Bontcheva (USFD), Wim Peters (USFD), Angus Roberts (USFD), John McNaught (UNIMAN), Sophie Aubin (INRA), Lucas Anastasiou (OU), Richard Eckart de Castilho (UKP-TUDA), Masoud Kiaeeha (UKP-TUDA), Patricia Geretto (INRA), Fred Fenter (Frontiers), Daan Broeder (EUDAT, CLARIN ERIC), Jochen Schirrwagen (OpenAIRE, University of Bielefeld), Lukasz Bolikowski (OpenAIRE, ICM Poland), Geoffrey Bilder (CrossRef), Christian Chiarcos (LLOD)

First Workshop Results

Interoperability issues

Metadata schema diversity: For documenting primary content (e.g. the scientific publications that are in focus in OpenMinTeD), Dublin Core (DC), in many cases appropriately qualified, is currently the only common denominator amongst stakeholders. However, DC is not considered expressive enough for TDM tasks and TDM users. In particular, the following information is not sufficiently covered: licensing, domain classification, provenance, full-text download location, format, relationships to other data, and versioning. For documenting language and knowledge resources, several schemas exist, with META-SHARE and various CMDI profiles currently enjoying the widest uptake, META-SHARE being compatible with CMDI.
Additional metadata: For metadata that is automatically generated, or enriched by human users beyond the original metadata of a resource, a second level of metadata is required to describe the provenance of this metadata.
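To make the expressiveness gap concrete, the sketch below flattens a hypothetical rich record onto the fifteen unqualified DC elements and lists the fields with no direct DC slot. All field names and values are invented for illustration; note that licensing information could at best be squeezed into the free-text dc:rights element.

```python
# A richer (hypothetical) record as a repository might produce it.
record = {
    "title": "Mining biomedical abstracts",
    "creator": "A. Researcher",
    "identifier": "doi:10.1234/example",
    "license": "CC-BY-4.0",                              # only free-text dc:rights exists
    "fulltext_url": "https://repo.example.org/ft.pdf",   # download location: not covered
    "provenance": "harvested from repository X",
    "version": "2",
}

# The fifteen elements of unqualified Dublin Core.
DC_ELEMENTS = {"title", "creator", "subject", "description", "publisher",
               "contributor", "date", "type", "format", "identifier",
               "source", "language", "relation", "coverage", "rights"}

dc_view = {k: v for k, v in record.items() if k in DC_ELEMENTS}
lost = sorted(set(record) - DC_ELEMENTS)
print(lost)  # fields with no direct unqualified-DC element
```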

Next steps

At the present stage, collecting the important metadata schemas in the Interoperability Landscape (D5.1) was considered the most important next step, so as to make an informed decision on whether to prescribe a single metadata schema in OpenMinTeD or to support/endorse multiple schemas and provide mappings between them. The schemas to be collected concern primary content (scientific publications), language and knowledge resources, content processing software and web services.

Language Resources

Description

This group will define a specification for the representation of text, lexicons, terminologies, thesauri and ontologies, and their interoperability across different resources/tools for the purpose of their exploitation by text and data mining (TDM) applications.

At present there are many converging developments in the form of (de facto) standardization of the representation of information elements required for interoperable text consumption and processing. The focus of this group will be on ensuring the interoperability, consistency and discoverability of linguistic, terminological and ontological content at the granular representation level of individual knowledge elements. This knowledge is either contained within resources or produced by text mining tools. Its interoperability will foster common understanding, data sharing and reuse.

Data model

Based on the practical requirements of OpenMinTeD’s use cases, the group’s activities will cover a maximal range of knowledge either contained in resources or produced and consumed by text mining services. The group will seek to adopt and link existing standards for the representation of multilingual linguistic, terminological and ontological information, in order to arrive at a practically motivated interoperability specification for TDM. The use of existing data category semantics, data structures and linking strategies will ensure maximal consensus regarding standardization and best practice.

Tasks

  • Overview of relevant initiatives and standards for description of resource content and tool output.
  • Identification of candidate standards for the provision of a core set of data category elements.
  • The creation of links between the elements of these vocabularies
  • Compilation/creation of an RDF serialization of the core set.
  • Draft specification report and publication of the specification model as linked open data.

Who will be involved

Close collaboration with the resource metadata group (T5.2.1) is foreseen where resource description meets content specification. Our recommendations will inform the input/output specifications for components of language processing/text mining workflows. The group will involve internal and external experts from the global fields of linguistic, terminological and ontological representation, as well as representatives of content and data/text mining service providers and consumers.

Members

Wim Peters (USFD) (Lead), Jacob Carter (UNIMAN), John McNaught (UNIMAN), Matt Shardlow (UNIMAN), Kalina Bontcheva (USFD), Angus Roberts (USFD), Louise Deléger (INRA), Sophie Aubin (INRA), Maria Gavriilidou (ILSP/ARC), Theo Manouilidis (ILSP/ARC), Penny Labropoulou (ILSP/ARC), Prokopis Prokopidis (ILSP/ARC), Richard Eckart de Castilho (UKP-TUDA), Masoud Kiaeeha (UKP-TUDA), John McCrae (LIDER), Nancy Ide (LAPPS), Steve Cassidy (ALVEO), Dominique Estival (ALVEO), Menzo Windhouwer (CLARIN), Andreas Kempf (ZBW, Hamburg), Ineke Schuurman (CCL), Maarten van Gompel (Radboud University)

First Workshop Results

Interoperability issues

Resource access: It was recommended that URIs be used for each individual resource and resource element. Together with resource schemas in a standard format such as OWL/RDF, this enables flexible querying, e.g. through SPARQL.
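A minimal sketch of what URI-based identification buys: with resources and their elements stored as triples, queries become simple pattern matching of the kind SPARQL provides over RDF. The URIs and property names below are illustrative, not taken from any actual resource.

```python
# URI-identified resources stored as (subject, predicate, object) triples;
# a tiny pattern matcher mimicking a SPARQL basic graph pattern.
triples = [
    ("http://example.org/lex/run", "rdf:type", "ontolex:LexicalEntry"),
    ("http://example.org/lex/run", "rdfs:label", "run"),
    ("http://example.org/lex/walk", "rdf:type", "ontolex:LexicalEntry"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a variable."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

entries = match(triples, p="rdf:type", o="ontolex:LexicalEntry")
print(len(entries))  # 2
```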
Resource content: There are many informational elements describing linguistic/terminological/ontological content that can be operationalized in TDM workflows. In order to harmonize these they need to be mapped onto each other. Various options were considered for this purpose:
● Pairwise conversions between each format/vocabulary coming from external resources and pipelines. This results in a many-to-many mapping.
● The adoption of standard vocabularies, which would be advantageous with respect to uptake and conversion/mapping load.
We decided that harmonization of resource-specific knowledge into standardised data categories and an interchange format is the way forward to ensure that OpenMinTeD can make full use of this information in its TDM workflows. This option offers practical advantages and is supported by various worldwide initiatives working on (de facto) standard vocabularies for capturing content.
Resource linking: Various linking vocabularies, e.g. SKOS, are in existence. In particular, the ISO 25964 standard was mentioned, which deals with interoperability between controlled vocabularies and describes different mapping types.
Vocabulary: Our criteria for any configuration of adopted vocabularies are customizability and extensibility. This will allow us to flexibly extend and adjust our specification if necessary. We identified a bottom-up approach that starts with simple lists of descriptors for linguistic/terminological/ontological knowledge. Incremental extension will take place within the specification phase based on practical needs, adding complexity where needed.
This first step in our incremental approach will inform us on the coverage of various candidate vocabularies for the description of resource content and workflow component I/O. We decided to start with the following vocabularies:
● FoLiA (https://proycon.github.io/folia/)
● LAPPS (http://vocab.lappsgrid.org)
● Ontolex (http://www.w3.org/community/ontolex/wiki/Main_Page)
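The bottom-up, list-of-descriptors approach described above can be sketched as a plain mapping from one tool's tag set onto a shared reference vocabulary. The tag names and reference labels below are invented examples, not drawn from FoLiA, LAPPS or Ontolex.

```python
# Illustrative mapping from a tool-specific POS tag set to a
# hypothetical shared reference vocabulary of descriptors.
TOOL_TO_REFERENCE = {
    "NN": "commonNoun",
    "NNP": "properNoun",
    "VB": "verb",
    "JJ": "adjective",
}

def harmonize(tags, mapping):
    """Translate tool-specific tags; keep unknown tags flagged for review."""
    return [mapping.get(t, f"UNMAPPED:{t}") for t in tags]

print(harmonize(["NN", "VB", "XYZ"], TOOL_TO_REFERENCE))
```

Flagging unmapped tags rather than dropping them supports the incremental extension of the descriptor lists as gaps surface in practice.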

Next steps

Our next step is to compare the type systems defined in the TDM platforms of OpenMinTeD's participants with the three reference vocabularies above and report on compatibility findings. This gives us the opportunity to start building a network of extendable core vocabularies for our specification task and to further explore issues. Once we have established a core vocabulary and format, we will need to enable users to harmonize their own data categories with it; converters for common formats, and tools providing mapping recommendations such as Google Refine (now OpenRefine, http://openrefine.org/), will be useful for that purpose.

IPR and licensing

Description

The goal of WG3 “IPR and licensing” is to study and identify copyright and related-rights (e.g. the sui generis database right) restrictions and exceptions affecting the use and reuse of sources (both textual sources and text-mining services) in TDM activities. On this basis, the WG will also identify contractual tools and schemes (e.g. licences) that can best serve the needs of TDM services.

In particular, it will examine which exceptions are currently available (e.g. the newly implemented TDM exception in the UK), which are upcoming, and whether the current/proposed solutions embrace all the needs of the scientific and academic sector (e.g., is the non-commercial limitation necessary?).

Additionally, open licensing models for both the scientific related textual sources and the text-mining services will be explored and evaluated.

The group will compile and maintain an inventory of existing licences that grant access and reuse rights for content: Creative Commons, OSI-approved licenses, EU public licenses, NISO standards, META-SHARE and CLARIN licensing kits, as well as commercial schemes open to content reuse.

This work will be developed in close coordination with the GARRI project (FutureTDM) in order to:
  • prepare a draft policy specification for reuse empowered by copyright law permissions and based on licensing contracts;
  • represent such rights in appropriate standardised rights expression languages (ODRL, CCREL) for open content consumption and reuse;
  • translate the legal and policy aspects into authentication and authorization specifications for user-to-service and service-to-service interactions (OAuth2, GÉANT's eduGAIN, ORCID IDs).
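For the rights-expression point, ODRL policies are typically serialized as JSON-LD. Below is a minimal, hedged sketch of such a policy; the policy, target and purpose URIs are hypothetical, and a real policy would need to be checked against the ODRL Information Model.

```python
import json

# A minimal ODRL-style policy in JSON-LD (illustrative URIs only):
# one permission allowing reproduction of a corpus for research purposes.
policy = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Set",
    "uid": "http://example.org/policy/1",
    "permission": [{
        "target": "http://example.org/corpus/abstracts",
        "action": "reproduce",
        "constraint": [{
            "leftOperand": "purpose",
            "operator": "eq",
            "rightOperand": "http://example.org/vocab/research",
        }],
    }],
}

print(json.dumps(policy, indent=2))
```

Expressing permissions in a machine-readable language like this is what would allow the authorization layer (OAuth2 etc.) to enforce them automatically.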

Members

Thomas Margoni (UoG)(LEAD), Giulia Dore (UoG), Angus Roberts (USFD), Kalina Bontcheva (USFD), Wim Peters (USFD), John McNaught (UNIMAN), Lucie Guibault (UvA), Marco Caspers (UvA), Richard Eckart de Castilho (UKP-TUDA), Stelios Piperidis (ILSP/ARC), Penny Labropoulou (ILSP/ARC), Fred Fenter (Frontiers), Mappet Marquez (Frontiers), Natalia Manola (ILSP/ARC), Matt Shardlow (UNIMAN), Liam Earney (JISC), Geoffrey Bilder (CrossRef), Diane Peters (CCHQ), Christopher Cieri (LDC, LAPPS), Federico Morando (Synapta and NEXA), Mark Perry (UNE AU), Maurizio Borghi (CIPPM Bournemouth University), Pawel Kamocki (CLARIN), Enrique Alonso (Consejo de Estado), Paul Uhlir (National Academy of Sciences), Giulia Ajmone Marsan (OECD), Maarten Zeinstra (Kennisland), Kristofer Erickson (UoG), Prodromos Tsiavos (The Media Institute), Gwen Franck (Creative Commons and EIFL), Peter Suber (Berkman Klein Centre, Harvard University), Freyja van den Boom (Open Knowledge International), Antonio Vetrò (Nexa Center for Internet and Society).

First Workshop Results

Interoperability issues

Licences: The definition of licences as legal documents is central. In particular, the scenarios reflect the difference between copyright licences on content (which is formed by elements that may or may not be copyrightable, and which can take the form of a protected database), copyright licences/EULAs on tools such as the software used to perform TDM, and the Terms of Service (ToS) applied to services employed to perform TDM. This three-way partition is fundamental to the activity of WG3, since the “licence compatibility matrix” will need to address not only “horizontal interoperability”, i.e. interoperability of different licences within one of the three aforementioned categories, but also “vertical interoperability”, i.e. interoperability of licences and ToS placed at different levels across these categories.
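The idea of a licence compatibility matrix can be sketched as a simple lookup table. The entries below are deliberately simplified illustrations of how SA, NC and ND terms propagate to derivative works; they are not legal advice and not the actual WG3 matrix.

```python
# Illustrative fragment of a "licence compatibility matrix": under which
# licence (if any) can a derivative combining works under A and B be
# redistributed? Simplified for illustration; not legal advice.
COMPAT = {
    ("CC-BY-4.0", "CC-BY-4.0"): "CC-BY-4.0",
    ("CC-BY-4.0", "CC-BY-SA-4.0"): "CC-BY-SA-4.0",   # share-alike dominates
    ("CC-BY-4.0", "CC-BY-NC-4.0"): "CC-BY-NC-4.0",   # NC restriction persists
    ("CC-BY-ND-4.0", "CC-BY-4.0"): None,             # ND blocks derivatives
}

def combined_licence(a, b):
    """Look up the licence of a derivative work, order-insensitively."""
    return COMPAT.get((a, b), COMPAT.get((b, a)))

print(combined_licence("CC-BY-SA-4.0", "CC-BY-4.0"))  # CC-BY-SA-4.0
```

A full matrix would also need the "vertical" dimension, crossing content licences with tool EULAs and service ToS.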
Rights: It emerged that the persistence of licences with regard to derivative uses in TDM activities is a central point that needs to be properly addressed in order to put the licence interoperability matrix in the right context. This aspect is addressed in scenario 3.
Metadata: An aspect that emerged during the discussion and is not directly addressed by the current WG3 scenarios is metadata licensing, i.e. the licence on the metadata itself. This aspect will be addressed in the WG3 scenarios, either in a dedicated case or as an add-on to an existing case. The key issue is that in many instances metadata are factual information and therefore not protected by copyright or related rights; however, in certain cases some metadata may qualify for protection.
Terminology: The clarification of terminology is a major issue when talking about licensing and intellectual property rights (IPR). For example, what are the exact meanings of “free access”, “open access” or “free to read”? WG3 will create a glossary in collaboration with relevant stakeholders which can then be used as a basis for discussion between publishers, TDM experts, and legal experts.
Other legal issues: The question of examining legal issues beyond licensing and IPR, in particular privacy issues, was discussed. A conclusion was reached that such additional issues will not be in the focus of the project unless they are specifically relevant to performing TDM on scholarly publications. Considerations regarding these additional issues will be listed in the Interoperability Landscape (D5.1).

Next steps

To address the problem of a common terminology, WG3 will create a glossary in collaboration with relevant stakeholders which can then be used as a basis for discussion between publishers, TDM experts, and legal experts. Related to the creation of the glossary, WG3 will collect terms of use and licences from relevant stakeholders. These will be examined in the context of the scenarios which the WGs have defined, to identify and understand potential incompatibility problems.

Annotation and workflows

Description

This group will study aspects concerning annotation and workflow services: annotation models/type systems, input-output representation formats (e.g., XML, XMI, RDF, JSON, NIF, W3C Annotations, efficient wire formats like UIMA Binary CAS), appropriate annotation service input-output conversions, type system alignment, annotation tag sets, confidence scores, workflow persistence formats (e.g., UIMA aggregate engine descriptor).

Modern approaches to text mining emphasise using combinations of components (i.e., in a workflow). Each component addresses some part of the overall task. Such approaches increase re-usability, allow experimentation with different components for the same sub-task, enable certain sub-tasks to be distributed, etc. Some components handle core processing (e.g., a named entity recogniser), others provide interfaces to resources to support processing (e.g., a lexicon), others enable human interaction (e.g., an annotation editor), yet others deal with acquisition and conversion of inputs (e.g., Web crawler, content collection reader, PDF converter), or with data export (e.g., mapping to a database, to linked open data, Web service API).
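The component/workflow idea above can be sketched as function composition over a shared document state; the components here are trivial stand-ins, not real text mining tools.

```python
# Each component is a function from document state to document state;
# a workflow is simply their chain.
def reader(doc):
    doc["text"] = doc["raw"].strip()
    return doc

def tokenizer(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def ner_stub(doc):
    # Stand-in for a named entity recogniser: flags capitalised tokens.
    doc["entities"] = [t for t in doc["tokens"] if t[0].isupper()]
    return doc

def run_workflow(doc, components):
    for component in components:
        doc = component(doc)
    return doc

result = run_workflow({"raw": "  OpenMinTeD links European infrastructures  "},
                      [reader, tokenizer, ner_stub])
print(result["entities"])  # ['OpenMinTeD', 'European']
```

The value of a shared specification is precisely that the `tokenizer` here could be swapped for any conforming component without touching the rest of the chain.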

However, it is not currently possible, in the general case, to plug and play with arbitrary components or to share or link workflows. There are many text mining frameworks, each with its own specificities, and many standalone tools that one would like to incorporate in some framework. Different standards are used and custom approaches are adopted, with excellent original motivation, but these may prevent seamless integration of components or workflows. Some barriers to interoperability are listed in the first paragraph above. These in essence revolve around annotations, whether human- or machine-supplied. That is, the typical strategy in text mining when analysing text is to represent the analysis at some informational level as annotations suited to that level. Humans play a part in building systems by supplying ‘gold standard’ annotations over texts to help train or evaluate systems; they may also interact with the results of system annotation to validate or correct these.

Annotations, in the sense of data models representing annotations over spans of text, are the means of communication between components, between components and humans, between humans (e.g., when establishing consensus annotations), between components and external stores, and between entire workflows. Some text mining frameworks rely on formally-defined type systems to specify the various types of annotation allowed in building representations over spans of text, others take a less formal approach: both work perfectly well in their own world. The challenge for this project is to find ways that allow different worlds to interoperate, which means, for this WG, in essence studying all the different aspects of annotation formats/languages/formalisms, from low- to high-level, what is standardised, what is widely used (or not), what we should foreseeably and flexibly take into account as text mining research expands to encompass new tasks and types of information (e.g., recent moves to encode an author’s level of certainty/uncertainty, not just bare ‘facts’).
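A minimal, hypothetical standoff model of such annotations might look as follows: each annotation records a character span, a type, a confidence score and its provenance, rather than embedding markup in the text. The type and source names are illustrative, not taken from any particular framework.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    start: int           # character offset, inclusive
    end: int             # character offset, exclusive
    type: str            # e.g. "Token", "NamedEntity" (illustrative names)
    confidence: float    # producer's confidence score
    source: str          # provenance: which component or human produced it

text = "Aspirin inhibits COX-1."
anns = [
    Annotation(0, 7, "NamedEntity", 0.93, "ner-component"),
    Annotation(17, 22, "NamedEntity", 0.88, "ner-component"),
]
print([text[a.start:a.end] for a in anns])  # ['Aspirin', 'COX-1']
```

Because the annotations only reference spans, the same text can carry layers from multiple components and humans side by side, which is what makes inter-component and human-machine communication possible.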

Moreover, this WG is also concerned with how, in different text mining workflow environments, components are registered, made available, selected, built, tested, compared, etc., and in how workflows are designed, executed, made persistent, exposed, composed or linked with other environments.

Building on the Interoperability Landscaping work of Task 5.1, and interacting with the other WP5 WGs, particularly WG2, specifications will be proposed that maximise interoperability of annotations and workflows, while recognising that these should also allow for different levels of interoperability to be achieved, depending on circumstances. Specifications will initially target interoperability of the infrastructures of the text mining partners, but be designed to be open to motivated evolution as well as the possibility of level-wise conformity.

Members

John McNaught (UNIMAN) (Lead), Angus Roberts (USFD), Mark Greenwood (USFD), Kalina Bontcheva (USFD), Wim Peters (USFD), Claire Nedellec (INRA), Robert Bossy (INRA), Jacob Carter (UNIMAN), Sophia Ananiadou (UNIMAN), Matt Shardlow (UNIMAN), Piotr Przybyla (UNIMAN), Richard Eckart de Castilho (UKP-TUDA), Theodoros Manouilidis (ILSP/ARC), Prokopis Prokopidis (ILSP/ARC), Dimitris Galanis (ILSP/ARC), Effie Tsiflidou (Agro-Know), Lukasz Bolikowski (OpenAIRE, ICM Poland), Rafal Rak (UberResearch), Nancy Ide (LAPPS, Vassar College), Dominique Estival (ALVEO), Steve Cassidy (ALVEO), Thilo Götz (IBM), Piek Vossen (KYOTO project), Takuya Matsuzaki (Nagoya University, Japan), Eric Nyberg (LAPPS, Carnegie Mellon University), Marc Verhagen (Brandeis University, LAPPS)

First Workshop Results

Interoperability issues

Architecture: We discussed the need to have a component registry and a workflow editor provided by OpenMinTeD. The workflow editor should communicate with the component registry to obtain a list of components from which workflows can be built. An instance of the workflow editor should be hosted by OpenMinTeD, but users should also be able to run it on their own hardware, e.g. to process sensitive data that should not leave their machines. In the latter case, the editor could download the processing components, e.g. in the form of Docker images or Java libraries, and run them locally.
Metadata: For effective communication between the workflow editor and the component repository, as well as for the ability to guide the user during workflow construction, we need a common metadata schema to describe components. It was suggested that the META-SHARE repositories and the META-SHARE metadata schema could be used or extended for this goal.
Data interchange model: Data exchanged between analysis components needs to be interoperable at several levels, e.g. the meta model (what can be expressed), the type system (what is expressed), and the serialization format (how it is transferred to disk or through a network). The ability for users to customize any model endorsed by OpenMinTeD was perceived as a key requirement. A need was also expressed that such a model should already be associated with different types of serialization mechanisms (e.g. XML for archival, a binary encoding for network communication, etc.). The question was raised whether currently used type systems are sufficiently similar that a generic configurable conversion mechanism could be implemented by OpenMinTeD to facilitate the creation of type system mappings, but we decided that our knowledge about these type systems is presently insufficient to answer this question. The idea of selecting a single interchange format (either an existing one or a new one defined by OpenMinTeD) was also brought up, but met with scepticism: on the one hand, a single standard might not be sufficient, and creating a new model would only add yet another model to be interoperable with; on the other hand, the requirement for customizability conflicts with the prescription of a single model. The need for a clear policy on character encodings and encoding conversions was also expressed multiple times.
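The generic configurable conversion mechanism mooted above could, at its simplest, be a declarative table renaming annotation types between two frameworks' type systems. The framework and type names below are invented for illustration.

```python
# Declarative mapping between two hypothetical frameworks' type systems.
MAPPING = {
    "framework_a.Token": "framework_b.lexical.Token",
    "framework_a.Sentence": "framework_b.structure.Sentence",
    "framework_a.NamedEntity": "framework_b.semantic.Entity",
}

def convert(annotations, mapping):
    """Rename annotation types; raise on gaps so mismatches surface early."""
    out = []
    for ann in annotations:
        if ann["type"] not in mapping:
            raise KeyError(f"no mapping for type {ann['type']!r}")
        out.append({**ann, "type": mapping[ann["type"]]})
    return out

converted = convert([{"type": "framework_a.Token", "start": 0, "end": 4}],
                    MAPPING)
print(converted[0]["type"])  # framework_b.lexical.Token
```

Real type systems also differ in features and structure, not just names, which is why the WG judged its current knowledge insufficient to commit to such a mechanism.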
User experience: We expect users with different levels of experience and different backgrounds to be using the OpenMinTeD platform. This must be taken into account both for the granularity of the offered analytics components (low-level, e.g. POS tagging, or high-level, e.g. sentiment analysis) and for the way that analytics are described to the user. For example, in a workflow editor, end-to-end workflows could be advertised more prominently to new users, while individual components for building new workflows would be aggregated under an “experts” section. Workflows for particular target user communities could be described using specific keywords common in these communities.

Next steps

In collaboration with WG2, alignments between the annotation type systems used by the OpenMinTeD partners and suitable external reference vocabularies will be created to identify compatibilities, interoperability problems, and gaps. Similarly, all partners should examine the META-SHARE metadata schema and align to it the component-level metadata used in their respective frameworks. This will help in determining the necessary set of metadata as well as the metadata presently missing from META-SHARE.