Mapping Seed Development Thanks to TDM

The Bibliome group at the French National Institute for Agricultural Research (INRA) has developed a text-mining application that extracts fine-grained information about seed development from thousands of texts. It gives scientists better and quicker access to knowledge of how molecules, genes and proteins interact when a seed starts to grow.

Good Seed Makes a Good Crop

A seed contains components such as molecules, genes and proteins. The presence of these components and the way they interact determine whether a particular seed can be used for human or animal consumption or by industry. A better understanding of seed biology and development is therefore important for both crop breeders and industrial companies. Finding out which genes interact with which proteins, in which tissue and at which stage, is a key question for researchers in plant breeding.

Crucial Information Scattered in Thousands of Papers

Many scientific publications describe the relations between biological entities during plant development. Extracting these relations from text is difficult, however, because the information is scattered among thousands of papers and each paper usually focuses on a few particular interactions, for example between some genes and proteins. Building the entire regulatory network involved in seed development would require researchers to read all the literature on the subject. This could take years, time which they do not have!

Text and Data Mining Can Help

Everyone interested in plant breeding needs more powerful tools to aggregate knowledge on seed development. As this complex information is expressed in natural language, dedicated text mining is the obvious solution. Within the OpenMinTeD project, the INRA Bibliome group developed a text and data mining (TDM) application for the extraction of information on seed development. It focuses on Arabidopsis thaliana, a small flowering plant used as a model organism by biologists.

How Does the TDM Process Work?

The TDM process first recognizes words that denote biological entities (gene, protein, RNA). It then normalises them: different words that refer to the same entity are attached to a single identifier. For instance, mentions such as "AP2" or "Apetala-2" are attached to the same reference object, "AT4G36920", which is described in public databases (e.g. TAIR). This part is tricky because authors coin many names, and some words are used to name several different biological entities.
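The normalisation step can be pictured with a minimal Python sketch. It is not the SeeDev implementation: the lookup table, the `_key` helper and the `normalise` function are hypothetical names introduced for illustration, and only the AP2 / Apetala-2 to AT4G36920 mapping comes from the text above.

```python
from typing import Optional

# Hypothetical synonym table: comparison keys mapped to one reference identifier.
# Only the AP2 / Apetala-2 -> AT4G36920 mapping is taken from the article.
SYNONYMS = {
    "ap2": "AT4G36920",
    "apetala2": "AT4G36920",
}


def _key(mention: str) -> str:
    """Reduce a mention to a comparison key: lowercase letters and digits only."""
    return "".join(ch for ch in mention.lower() if ch.isalnum())


def normalise(mention: str) -> Optional[str]:
    """Return the reference identifier for a mention, or None if it is unknown."""
    return SYNONYMS.get(_key(mention))


for mention in ("AP2", "Apetala-2", "Apetala 2"):
    print(mention, "->", normalise(mention))
```

All three spellings resolve to the same identifier. A plain dictionary like this one cannot handle the harder case mentioned above, where a single word names several different entities; resolving that requires looking at the context of the mention.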

Entities are then typed according to a knowledge model, which formalizes the objects and relations that matter in this specific use case.

In the third step, text mining is used to identify relations between the biological entities: for example, an interaction of protein AGL21 with protein ABI5, as shown in the screenshot below. In general, relationships are not as clearly denoted by verbs such as "regulate" or "bind" as they are in this example. A sophisticated machinery of natural language processing methods, adapted by machine learning, is hidden behind the scenes and supported by the OpenMinTeD framework.

Screenshot of one relation (green arrow) expressed in two different ways as recognized by the TDM process.

Results Obtained for the SeeDev Prototype Application

For this application, TDM was performed on the titles and abstracts of almost 7,000 references from PubMed about Arabidopsis seed development. It automatically extracted more than 7,000 different entities, some of which interact with each other. These automatically extracted data are used in conjunction with curated data produced by experts, who manually annotated 20 full-text documents with the same knowledge model as the automatic process. Both sources of data were therefore compatible and could be combined in the SeeDev database.
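Because both sources follow the same knowledge model, combining them amounts to grouping identical relations and remembering where each one came from. The sketch below illustrates that idea in Python under stated assumptions: the `Relation` record, the `interacts_with` label and the `combine` function are hypothetical, and the sample records are made up for illustration (only the entity names AGL21 and ABI5 appear in the text above). It does not describe the actual SeeDev database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: str  # first entity
    kind: str    # relation type from the shared knowledge model (hypothetical label)
    target: str  # second entity
    origin: str  # "automatic" (TDM) or "curated" (expert annotation)


def combine(automatic, curated):
    """Group relations by (source, kind, target) and collect their origins."""
    merged = {}
    for rel in list(automatic) + list(curated):
        key = (rel.source, rel.kind, rel.target)
        merged.setdefault(key, set()).add(rel.origin)
    return merged


# Illustrative records only; not actual SeeDev data.
automatic = [Relation("AGL21", "interacts_with", "ABI5", "automatic")]
curated = [Relation("AGL21", "interacts_with", "ABI5", "curated")]

for key, origins in combine(automatic, curated).items():
    print(key, sorted(origins))
```

Keeping the origin alongside each relation lets a user see whether a result was confirmed by an expert or comes from automatic extraction alone.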

How Can Scientists Mine Information?

Not surprisingly, biologists prefer to use tools they are familiar with. That is why the INRA Bibliome team integrated the results that the TDM process obtained from scientific papers into the public tool FLAGdb++.

SeeDev for FLAGdb++

FLAGdb++ is a reference tool for biologists who study plant genetics. It gives access to different plant genomes and relations between biological entities. It connects several databases so that the user can map out the path from genes to proteins to biological functions.

Starting from the location of a given gene on the genetic map, the user can display, with a single click, the list of entities with which this gene interacts and the types of those relations. For each interaction, the texts from which it was extracted can be displayed.

Screenshot of results obtained by a query to the FLAGdb++ application.

The result table shows the interactions together with the location of the gene product in the cell and its biological function, which are collected from other databases.

At this stage, not all OpenMinTeD TDM results are available in FLAGdb++. They will be posted online as development progresses, and the integration should be fully operational by the end of 2018.

SeeDev for Alvis Search Engine

The results obtained by the OpenMinTeD TDM process are also made available through the SeeDev Alvis Semantic Information Retrieval engine (AlvisIR), which displays the article extracts that describe the entities and their relationships as a list of snippets, in a Google-like form.

The SeeDev Alvis Semantic Information Retrieval engine is a publicly available web application, compatible with the main web browsers.

What AlvisIR offers in comparison with other search engines is the possibility of expressing a query on the relationships between biological entities or, more generally, on types of entities. From the home page, a user who wants to find, for instance, all the proteins that interact with the Apetala 2 gene would enter the query {protein}* ~interact AP2 and obtain the results shown in the image below. Here, {protein}* means any protein.

Not only does SeeDev AlvisIR give the user the type of relations between entities, it also shows the text where the relations are mentioned. The user can filter results according to entity types and bibliographic metadata elements (journal, authors, …), which are displayed as facets on the left side. The information in the right panel allows the user to better understand the query results.
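To make the idea of a typed query more concrete, here is a minimal Python sketch of what a request such as {protein}* ~interact AP2 asks for: every extracted relation of type "interact" that links any entity typed as a protein to the entity AP2, returned with its supporting snippet. This is not how AlvisIR is implemented, and it does not parse the AlvisIR query syntax. The `Entity` and `Hit` records, the `search` function and all sample data (ProteinX, GeneY, ProteinZ and the placeholder snippets) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    etype: str  # entity type, e.g. "protein" or "gene"


@dataclass(frozen=True)
class Hit:
    left: Entity
    relation: str
    right: Entity
    snippet: str  # text extract in which the relation was found


def search(hits, etype, relation, name):
    """Return hits where an entity of type `etype` is linked by `relation`
    to the entity called `name`, whichever side each argument is on."""
    results = []
    for hit in hits:
        if hit.relation != relation:
            continue
        pairs = ((hit.left, hit.right), (hit.right, hit.left))
        if any(a.etype == etype and b.name == name for a, b in pairs):
            results.append(hit)
    return results


# Made-up records for illustration; not actual SeeDev results.
ap2 = Entity("AP2", "gene")
hits = [
    Hit(Entity("ProteinX", "protein"), "interact", ap2, "placeholder snippet 1"),
    Hit(Entity("GeneY", "gene"), "interact", ap2, "placeholder snippet 2"),
    Hit(Entity("ProteinZ", "protein"), "bind", ap2, "placeholder snippet 3"),
]

for hit in search(hits, "protein", "interact", "AP2"):
    print(hit.left.name, hit.relation, hit.right.name, "|", hit.snippet)
```

Only the first record matches: the second involves a gene rather than a protein, and the third has a different relation type. Filtering by entity type in this way is what a keyword-only search engine cannot express.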

Easily Obtained New Data

Producing high-value data by TDM and exposing it in a tool well known to biologists gives them access to precise data from a wide range of sources, all in a familiar environment and without having to run other applications. It is definitely a more efficient process than more traditional TDM applications!

Future work will focus on improving relationship extraction, which remains difficult in many cases, such as implied relationships or relationships that span several sentences. The TDM application will then be adapted and generalized to plant organs other than seeds, such as the maize flower.

More Information

The SeeDev TDM application will soon be available as a reusable OpenMinTeD platform component under the name Gene Regulation Extractor.

Associated publications:

  • Presentation of the knowledge model used for the extraction of Arabidopsis seed development information from text and ML tools training corpus: Chaix, E., Dubreucq, B., Fatihi, A., Valsamou, D., Bossy, R., Ba, M., … & Nédellec, C. (2016). Overview of the Regulatory Network of Plant Seed Development (SeeDev) Task at the BioNLP Shared Task 2016. In Proceedings of the 4th BioNLP Shared Task Workshop (pp. 1-11).