Text mining for the discovery of small molecules

When scientists need information about the structure, name or properties of small molecules, they often turn to a high-quality database called ChEBI. This database is largely curated manually, a process that takes a lot of time. OpenMinTeD is working on a text mining application that can help speed up the process while maintaining the quality of the database.

ChEBI: a high quality database for molecules

ChEBI (short for Chemical Entities of Biological Interest) is a freely available electronic dictionary and ontology of small molecules. It is produced by the European Bioinformatics Institute, which is part of the European Molecular Biology Laboratory (EMBL). ChEBI was created to help researchers in the field of molecular biology who need to know the structure, names, and properties of the small molecules that they encounter in their research.

There are a number of freely-available chemical databases. Most of them are created by an automatic ‘pipeline’ process and contain information on polymers, industrial chemicals, synthetic intermediates, etc. Some contain over 50 million compounds! Their sheer size creates problems for users, as any search may result in hundreds or even thousands of answers. For non-expert users it is very difficult to determine which, if any, is the compound they are really looking for, which one shows the correct stereochemistry, which isomer is naturally occurring, etc. By contrast, the focus of the ChEBI database is on high quality rather than quantity. It is manually curated and more narrowly focused on the requirements of the molecular biology community. For example, a vast array of industrial polymers, intermediates and screening compounds is deliberately excluded.

An additional feature of ChEBI is that all entries are classified according to their structural features and biological properties in an ontology: a special classification system designed to be easily understood by both people and machines, and to enable a single search to be made across several different databases.

A popular feature of ChEBI is that each entity is assigned a permanent ID. This can be used in the researcher’s database or in publications to link to an entry in ChEBI. By using this ID, users are assured that the link will never go dead and will always point to the latest version of the ChEBI page where the particular entity is described.
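Since the ID is permanent, a stable link can be built from it mechanically. The sketch below assumes the URL pattern commonly used for ChEBI entries at the time of writing; treat the exact pattern as an illustrative assumption rather than a guaranteed interface.

```python
# Build a stable link to a ChEBI entry from its permanent ID.
# The URL pattern below is an assumption for illustration.
def chebi_url(chebi_id):
    """Return a link for an ID such as 'CHEBI:15377' or plain '15377'."""
    if not chebi_id.startswith("CHEBI:"):
        chebi_id = f"CHEBI:{chebi_id}"
    return f"https://www.ebi.ac.uk/chebi/searchId.do?chebiId={chebi_id}"

print(chebi_url("15377"))  # CHEBI:15377 is the entry for water
```

Because the ID never changes, a link built this way keeps pointing at the latest version of the entry.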

Curation of ChEBI: largely a manual job

The manual curation assures the high quality of the database, but it also makes the database expensive to produce. Users who would like to add a new entry may only know a research code or a trivial name from which it is not possible to deduce a structure. The curator then has to search through the scientific literature to find as much information as possible about the compound. The information is likely to include:

  • When was the compound first reported?
  • What is known about its molecular structure?
  • Have any subsequent revisions to the proposed structure been made?
  • What other names is it known by?
  • Does it have any interesting biological properties, possible applications, etc?
  • If the compound is a naturally occurring product, additional details need to be recorded, including the name of the organism(s) from which the compound has been isolated.

All of this information ideally needs to be supported by appropriate citations to publications in the scientific literature. An overview of the steps that the curator takes is shown in the image below.

Simplified workflow for curation of a new entry into ChEBI

Even in the age of high-speed internet, the repetitive searches across various resources generally make finding the required information a slow and tedious process – or “very expensive” in terms of curator time. However, we anticipate that the curation process can be significantly improved by the use of appropriate text mining tools.
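The checklist above can be captured as a simple record. The field names below are illustrative, not ChEBI's actual database schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# A sketch of the information a curator gathers for a new ChEBI entry,
# following the checklist above. Field names are illustrative assumptions.
@dataclass
class CandidateEntry:
    preferred_name: str
    first_reported: Optional[int] = None           # year the compound was first reported
    structure: Optional[str] = None                # e.g. a SMILES string, if known
    structure_revisions: List[str] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)
    biological_notes: List[str] = field(default_factory=list)
    source_organisms: List[str] = field(default_factory=list)  # if naturally occurring
    citations: List[str] = field(default_factory=list)         # supporting publications

entry = CandidateEntry(
    preferred_name="aurasperone F",
    source_organisms=["Aspergillus niger C-433"],
)
entry.citations.append("(reference to the isolation paper)")
```

Each field corresponds to one of the questions the curator must answer, and every filled-in value should be backed by a citation.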

Text mining techniques that can help

There are many options we could employ to aid automated curation in ChEBI. Below, we give a quick overview of a few tools and techniques from the text mining literature. Named entity recognition identifies terms of interest in certain categories. Relation extraction allows us to link these terms together. Argo is a text mining workbench that even a novice user can use to annotate documents. Finally, UIMA is a framework for the annotation of unstructured information.

Named Entity Recognition

Named Entity Recognition (NER) is one of the most important techniques in any text miner’s arsenal. In NER, a tool automatically highlights (or annotates, as it is often called) all parts of a text that correspond to a specific class that we are interested in. For example, let’s see the results of running the following text through a chemical NER tool:

Aurasperone F – a new member of the naphtho-gamma-pyrone class isolated from a cultured microfungus, Aspergillus niger C-433.

You can see that the tool was able to identify that Aurasperone F and naphtho-gamma-pyrone are entities of interest. Next, we might also run an NER tool to identify species names:

Aurasperone F – a new member of the naphtho-gamma-pyrone class isolated from a cultured microfungus, Aspergillus niger C-433.

Again, you can see that the tool has now identified the species of interest and labelled it as such.
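A minimal Python sketch may make the idea concrete. Real chemical NER tools rely on trained statistical models rather than a hand-written lexicon; the lexicon and label names below are illustrative assumptions only.

```python
import re

# A hand-written lexicon standing in for a trained NER model (illustrative
# assumption; real chemical NER tools use statistical models, not lookup).
LEXICON = {
    "Aurasperone F": "Chemical",
    "naphtho-gamma-pyrone": "Chemical",
    "Aspergillus niger C-433": "Species",
}

def annotate(text):
    """Return sorted (start, end, surface, label) tuples for each lexicon match."""
    annotations = []
    for term, label in LEXICON.items():
        for m in re.finditer(re.escape(term), text):
            annotations.append((m.start(), m.end(), m.group(), label))
    return sorted(annotations)

sentence = ("Aurasperone F - a new member of the naphtho-gamma-pyrone class "
            "isolated from a cultured microfungus, Aspergillus niger C-433.")
for start, end, surface, label in annotate(sentence):
    print(f"{label}: '{surface}' at [{start}:{end}]")
```

Note that each annotation records character offsets into the text rather than editing the text itself, which is also how the UIMA-based tools discussed later behave.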

We will use NER in our OpenMinTeD ChEBI application to identify names of Metabolites, Chemicals and Species, as well as chemical structural information.

 Relation Extraction

As the name suggests, relation extraction helps us to identify whether two named entities are related. These relationships have specific types, which help us to classify and distinguish between them. In the example above, we can use a relation extraction tool to identify whether the chemicals have been extracted from the species. If we do this, we get the following result:

Aurasperone F – a new member of the naphtho-gamma-pyrone class isolated from a cultured microfungus, Aspergillus niger C-433.

Aurasperone F is class member naphtho-gamma-pyrone
Aurasperone F isolated from Aspergillus niger C-433

So, the tool has extracted two relations. The first tells us that Aurasperone F is a member of the chemical class naphtho-gamma-pyrone. The second tells us that Aurasperone F was isolated from the species Aspergillus niger C-433. Note that these relations are directional (“A is a member of B” does not entail “B is a member of A”) and that a relation can exist between any two entities of the correct types. There are several potential relations in the sentence above, as shown below (the incorrect ones are marked with a “*”):

  Aurasperone F            is class member    naphtho-gamma-pyrone
* naphtho-gamma-pyrone     is class member    Aurasperone F
  Aurasperone F            isolated from      Aspergillus niger C-433
* naphtho-gamma-pyrone     isolated from      Aspergillus niger C-433

Of these four potential relations, we have identified that only two truly exist in the sentence. For example, the incorrect relation:

* Naphtho-gamma-pyrone isolated from Aspergillus niger C-433

is not included as part of our results.
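The type constraints and directionality described above can be sketched in Python. This is an illustrative toy, not a real relation extraction system: real tools use syntactic analysis or trained classifiers, whereas here a crude heuristic (the entity that opens the sentence is treated as its subject) stands in for proper parsing.

```python
# Entities as produced by the NER step, with their types.
entities = [
    ("Aurasperone F", "Chemical"),
    ("naphtho-gamma-pyrone", "Chemical"),
    ("Aspergillus niger C-433", "Species"),
]

sentence = ("Aurasperone F - a new member of the naphtho-gamma-pyrone class "
            "isolated from a cultured microfungus, Aspergillus niger C-433.")

def extract_relations(text, ents):
    """Generate type-constrained candidate pairs and keep the plausible ones."""
    relations = []
    # Crude stand-in for syntactic analysis: the entity that opens the
    # sentence is treated as its subject, and so as the head of relations.
    subjects = [(e, t) for e, t in ents if text.startswith(e)]
    for head, head_type in subjects:
        for tail, tail_type in ents:
            if head == tail:
                continue
            # "is class member" may only hold between two chemicals,
            # cued here by the surface pattern "member of the <tail> class".
            if (head_type == "Chemical" and tail_type == "Chemical"
                    and f"member of the {tail} class" in text):
                relations.append((head, "is class member", tail))
            # "isolated from" may only hold from a Chemical to a Species.
            if (head_type == "Chemical" and tail_type == "Species"
                    and "isolated from" in text):
                relations.append((head, "isolated from", tail))
    return relations

for rel in extract_relations(sentence, entities):
    print(" ".join(rel))
```

On the example sentence this yields exactly the two correct relations and rejects the starred candidates, because naphtho-gamma-pyrone is never in the subject position.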

Incorporating relation extraction into our ChEBI application will allow us to link the named entities together correctly, e.g. to tell a user which species a metabolite has been extracted from.

Argo / UIMA

Argo is a text mining workbench which allows a user to work with text and data mining (TDM) without having to be an expert. Argo provides precompiled modules of code that users can connect together into a text mining workflow to suit their needs. Argo is distributed by the National Centre for Text Mining, which is a partner of the OpenMinTeD project. We are currently using the Argo infrastructure as an experimental workbench for development of the OpenMinTeD platform. We expect to have interoperability between the OpenMinTeD platform and Argo, so that components developed for Argo will be available in OpenMinTeD and vice versa. Argo is available for free use in beta at http://argo.nactem.ac.uk/.

An example of a text mining workflow for recognising chemicals in Argo

Argo is built on top of Apache UIMA, an implementation of the OASIS UIMA standard for the interoperability of unstructured information processing pipelines (like the one in the image above). UIMA provides a central point for all annotations called the Common Analysis Structure (CAS). Each processing component in a UIMA pipeline interacts with the CAS, having the ability to read, add and remove annotations. UIMA tools never edit the underlying text, ensuring that subsequent tools always see exactly what the user provided. UIMA also provides types, which are organised into hierarchical type systems, allowing us to distinguish between different annotations. In the example above, Aurasperone F may have the type uk.ac.nactem.uima.Chemical, whereas Aspergillus niger C-433 may have the type uk.ac.nactem.uima.Species.
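UIMA itself is a Java framework, but the core idea of the CAS can be sketched as a Python analogy. The class and method names below are our own illustrative inventions, not UIMA's API; they mimic three properties described above: immutable text, typed annotations with character offsets, and hierarchical type names.

```python
from dataclasses import dataclass, field

# A toy Python analogy of UIMA's Common Analysis Structure (CAS).
@dataclass(frozen=True)
class Annotation:
    begin: int
    end: int
    type_name: str  # hierarchical name, e.g. "uk.ac.nactem.uima.Chemical"

@dataclass
class CAS:
    text: str  # the underlying text is never edited by components
    annotations: list = field(default_factory=list)

    def add(self, begin, end, type_name):
        self.annotations.append(Annotation(begin, end, type_name))

    def select(self, type_prefix):
        """Select annotations whose type falls under a hierarchical prefix."""
        return [a for a in self.annotations if a.type_name.startswith(type_prefix)]

# Two components (an NER tool for chemicals, one for species) each add
# their own typed annotations to the shared CAS.
cas = CAS("Aurasperone F was isolated from Aspergillus niger C-433.")
cas.add(0, 13, "uk.ac.nactem.uima.Chemical")
cas.add(32, 55, "uk.ac.nactem.uima.Species")
```

Because annotations are stored as offsets alongside the untouched text, a downstream component such as a relation extractor can read both tools' output without either tool knowing about the other.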

The components we use in Argo are built on top of the UIMA framework and will therefore be usable within the OpenMinTeD framework.

Choosing the scenario for our application

There are several approaches that we could take to use the tools outlined above to help ChEBI’s curators get new molecules into the database.

Scenario 1: Fully automated curation

In one scenario, we would run our tools over a large collection of documents and automatically export the results into ChEBI’s database. In this scenario, we would create an annotation template which would be populated on the basis of entities and relations mined from literature sources (See Fig. 1). However, this approach could cause more problems than it solves. The text mining tools we use work to a certain level of accuracy: while many of the annotations will be correct, the tools will also make mistakes and get some wrong. It is dangerous, then, to put all of the annotations directly into the database, as we may end up presenting false information to the final user of the database. That is why we chose not to work according to this scenario for our application.

Scenario 2: Fully automated detection, human validation

In this scenario we would do the same as above, running a fully automated curation pipeline and attempting to produce accurate results. However, instead of directly inputting the results to the database we would pass them back to a curator for validation. If there is no error then the curator accepts the result. If there is an error then the curator fixes the record before inputting it to the database. Although this measure ensures comprehensive coverage, it also has some drawbacks. Firstly, the validation process may be tedious for a curator, especially when processing large corpora. If a curator loses focus during their task, they may begin to make mistakes, introducing errors into the process. Further, the curation process is not targeted to the entities of interest to the curator, but rather depends on what is found by the TDM process. This may lead to uninteresting results being detected, which the curator has to wade through to find the valuable TDM results they are interested in. Because of this lack of focus, we decided not to work with this scenario.

Scenario 3: Semi-automated curation

Finally, we could use TDM in semi-automated curation. In this scenario, the curator uploads a corpus of papers that they know to be in the correct domain to a TDM system. The system processes these documents for the entities and relations of interest to the user and then makes the results available to search. The curator uses the search interface to identify information about new molecules of interest. The curator digests this information and adds it to the database, ensuring the final record is correct. This overcomes both the accuracy issues of Scenario 1 as well as the lack of focus we witnessed in Scenario 2. NaCTeM (University of Manchester) has a wide range of experience in developing this type of system, for example in the Facta (http://www.nactem.ac.uk/facta/), Kleio (http://www.nactem.ac.uk/Kleio/) or History of Medicine (http://nactem.ac.uk/MHM/) projects.

The Application

The application we will build is this third type of system. We will look at identifying metabolites, chemicals, species, proteins, biological information and chemical structural information. We will also identify relations between these entities to help curators quickly find useful associations. The application has been implemented as an Argo workflow with a REST interface. It is privately accessible via a simple web form as shown in the image below:

The interface for the Web Application

The user enters their bio-text into the box and clicks submit, triggering an Argo workflow, which returns a link to the annotations displayed in the brat visualisation tool.

An example of the output of the application
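As a rough illustration of what calling such a REST interface looks like, a client might build a request as sketched below. The endpoint URL and the payload field name are assumptions for illustration only, not the real Argo API; the sketch only constructs the request without sending it.

```python
import json
import urllib.request

# Hypothetical client for a workflow exposed over REST. The endpoint URL and
# the payload field name "text" are illustrative assumptions, not Argo's
# real API. We only build the request object here, without sending it.
def build_annotation_request(text,
                             endpoint="http://argo.nactem.ac.uk/hypothetical/annotate"):
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_annotation_request(
    "Aurasperone F was isolated from Aspergillus niger C-433.")
```

In the deployed application the web form plays the role of this client, and the response carries the link to the brat visualisation of the annotations.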

In future releases of the application we intend to expand the range of entities that are identified, introduce automatically annotated information about relations, and store the output in a database which can be queried by the users.

Written by Matthew Shardlow, Gareth Owen, Piotr Przybyła, Jiakang Chang