What microorganisms live in my cheese?

Every tasty bite of cheese comes with some microbial strains. Some of them are well known, but the presence of others can puzzle researchers, who may want to investigate why they are there. A better understanding of microorganisms, their interactions and their adaptation to their environment is an important issue for research and industry. It could help improve public health or develop innovative products.

Critical information stored in many public databases

In recent years, developments in molecular technologies have led to an exponential growth of experimental data and publications. Many of these are open, but each must be accessed separately. It is therefore crucial for researchers to have bioinformatics infrastructures at their disposal that offer unified access to both the data and the related scientific articles. With the right text-mining infrastructures and tools, application developers and data managers can rapidly access and process textual data, link them with other data and make the results available to scientists.

Florilege web application

As one of the OpenMinTeD use cases, the French National Institute for Agricultural Research (INRA) has set up the Florilege[1] application. It aims to gather, in a unified representation, public information on the microbial flora of food products, with a focus on positive flora (microorganisms involved in transformation, bioconservation or probiotics).

For example, if a strain of Listeria monocytogenes, the bacterium responsible for listeriosis, is found in a cheese, it is important to identify the potential source of contamination and the stage of processing at which the contamination may have occurred. The Florilege web application lets users submit queries such as “Listeria monocytogenes” and obtain the list of all the habitats where this microorganism has been found according to scientific studies. Using Florilege, a microbiologist can rapidly see that Listeria monocytogenes is also present in pasture, beef, raw milk or feces, and can therefore establish the food-chain link and start forming hypotheses for studying contamination cycles.

A query can cover multiple strains, species or families. The results, which can be downloaded as a table, also link to the scientific papers describing the relationship between the strains and their habitats.

Example of results for the query “Listeria monocytogenes” in Florilege

Extracting precise information from scientific papers and databases

The text-mining process behind Florilege has been set up by INRA using the OpenMinTeD environment. It consists of extracting the relevant information, mostly textual, from scientific literature and databases. Words or word groups are identified and assigned a type, such as “gene”, “habitat” or “taxon”. They are then normalized, meaning that they are assigned either a finer category (e.g. cheese as a habitat) or an ID that is shared with other public databases (e.g. 1639 is the ID of Listeria monocytogenes in the NCBI taxonomy). Reference semantic resources such as nomenclatures and ontologies define these IDs and categories. For example, “Irish dairy farms”, “dairy cattle farms” and “dairy farms environment” are all assigned to the same habitat reference class, “dairy farm”, in the OntoBiotope ontology. The main source of information in Florilege is scientific articles, but Florilege is integrating an increasing volume of textual and non-textual information from relevant biological databases, such as those of Biological Resource Centers (e.g. INRA CIRM, DSMZ) and major genetic databases (GenBank).
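The normalization step described above can be illustrated with a minimal Python sketch. It is not the INRA workflow: the tiny lexicon, the identifiers prefixed with HAB and the word-overlap heuristic are all assumptions made for illustration, and only the taxon ID 1639 comes from the text. The sketch maps several surface forms to one habitat class and a taxon name to its shared ID.

```python
import re

# Hypothetical miniature lexicon: class label -> made-up identifier.
# Real OntoBiotope identifiers are not reproduced here.
HABITAT_CLASSES = {
    "dairy farm": "HAB:0001",
    "farm": "HAB:0002",
    "cheese": "HAB:0003",
}

# Taxon names mapped to NCBI Taxonomy IDs (1639 is the ID cited in the text).
TAXON_IDS = {"listeria monocytogenes": 1639}


def _tokens(text):
    """Lowercase, keep alphabetic words, and crudely strip a plural 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]


def normalize_habitat(mention):
    """Return (label, id) of the most specific class whose words all occur
    in the mention, or None when no class matches."""
    mention_tokens = set(_tokens(mention))
    best = None
    for label, class_id in HABITAT_CLASSES.items():
        label_tokens = _tokens(label)
        if set(label_tokens) <= mention_tokens:
            # Prefer the label with the most words ("dairy farm" over "farm").
            if best is None or len(label_tokens) > len(_tokens(best[0])):
                best = (label, class_id)
    return best


def normalize_taxon(mention):
    """Return the shared taxonomy ID for a taxon name, or None if unknown."""
    return TAXON_IDS.get(mention.strip().lower())


if __name__ == "__main__":
    for text in ("Irish dairy farms", "dairy cattle farms", "dairy farms environment"):
        print(text, "->", normalize_habitat(text))
    print("Listeria monocytogenes ->", normalize_taxon("Listeria monocytogenes"))
```

A production system would rely on the full ontology and on more robust linguistic analysis than this word-overlap rule. The point here is only that distinct mentions end up sharing a single reference class.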

The result of this text and data mining (TDM) process is a database of relevant, organized information, in which 3 million taxa, including strains, are linked to their habitats [2] and phenotypes [3].

TDM process, from data sources to service

Formal representation of the information

The Florilege application exploits the results of the TDM process to answer user queries quickly and precisely. As mentioned above, scientists can query by organism, such as “Listeria monocytogenes”, but also by habitat. If they want to know which microorganisms are found in “cheese”, they will obtain results not only for the word “cheese” but also for “Cheddar”, “Tibetan Qula Cheese” or any other kind of cheese.
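The habitat query described above, where “cheese” also returns results for specific cheeses, amounts to expanding a class to its subclasses before filtering. The Python sketch below shows the idea under stated assumptions: the hierarchy and the taxon-habitat relations are invented toy data (the taxa are placeholders, not real findings), and the function names are hypothetical rather than part of Florilege.

```python
from collections import defaultdict

# Invented toy hierarchy (child -> parent); not the real OntoBiotope structure.
PARENT = {
    "Cheddar": "cheese",
    "Tibetan Qula Cheese": "cheese",
    "cheese": "dairy product",
    "raw milk": "dairy product",
}

# Invented toy relations (taxon, habitat class); placeholders, not real findings.
RELATIONS = [
    ("taxon A", "cheese"),
    ("taxon B", "Cheddar"),
    ("taxon C", "Tibetan Qula Cheese"),
    ("taxon D", "raw milk"),
]


def subclasses(root):
    """Return the root class plus every class below it in the hierarchy."""
    children = defaultdict(list)
    for child, parent in PARENT.items():
        children[parent].append(child)
    found, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def taxa_in_habitat(habitat):
    """List taxa linked to the habitat or to any of its subclasses."""
    wanted = subclasses(habitat)
    return sorted({taxon for taxon, hab in RELATIONS if hab in wanted})


if __name__ == "__main__":
    print(taxa_in_habitat("cheese"))
    print(taxa_in_habitat("Cheddar"))
```

Because mentions are normalized to ontology classes beforehand, the query never needs to match the literal word “cheese” in the text; it only walks the class hierarchy.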

The unique organized information displayed by the Florilege application is publicly accessible online and offers researchers numerous cross-functional avenues of exploitation in different fields like food security, ecology, and human health.

The integration of the Florilege text-mining workflow in the OpenMinTeD platform will allow users to reproduce the process with other corpora and integrate the results in other applications.

The text-mining process developed by INRA is also used by the Alvis search engine, which directly displays the text from which the information was extracted.

The development of Florilege involved a large number of INRA laboratories, including MaIAGE (Bibliome and Migale teams), STLO and DIST (Scientific and Technical Information Department).

Related links:

Articles on TDM technology and evaluation:

  • Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Dialekti Valsamou, Philippe Bessières, Claire Nédellec. Overview of the Gene Regulation Network and the Bacteria Biotope Tasks in BioNLP’13 shared task. BMC Bioinformatics, 16(Suppl 10):S1, 2015. doi:10.1186/1471-2105-16-S10-S1.
  • Estelle Chaix, Louise Deléger, Robert Bossy, Claire Nédellec. Text-mining tools for extracting information about microbial biodiversity in food. Food Microbiology, in press.
  • MaIAGE, INRA, Université Paris-Saclay, 78352 Jouy-en-Josas, France

[1] Florilege runs on the Migale bioinformatics platform at INRA.

[2] 18.5 million habitats are assigned to 2,000 ontology classes from OntoBiotope, an ontology of microorganism habitats distributed by AgroPortal.

[3] 1 million phenotypes are assigned to 203 ontology classes from OntoBiotope.