
Text mining in Agriculture: The AgroTagger Keyword Extractor
The use of keywords is crucial for the description, organization, indexing, retrieval and sharing of research in every scientific field, and agriculture is no exception. However, manual annotation of research outcomes is time-consuming and error-prone, so automatic methods for metadata annotation are continually being explored. AgroTagger is one of the tools that facilitate the work of information and knowledge managers (among others) in the agri-food sector by applying text mining to agri-food research outcomes.
AgroTagger is a keyword extractor that uses a subset of the AGROVOC thesaurus (about 2,500 of AGROVOC's more than 40,000 concepts) as the set of allowable keywords for indexing information resources. It can extract keywords from Microsoft Office documents, PDF files and web pages. AgroTagger was originally developed by the Indian Institute of Technology Kanpur (IITK) in 2010, using the popular Keyphrase Extraction Algorithm (KEA) as its basis. It later evolved into a web-based application built by MIMOS (Malaysia’s national R&D centre in ICT) in collaboration with UN FAO and IITK; this new version generated keywords in the form of RDF triples. Its latest update came in the context of the agINFRA FP7 project (which has since evolved into the AGINFRA agricultural research hub), during which an AGROVOC-based indexing package was assembled using the Maui indexing framework.
How does AgroTagger work?
Simply put, AgroTagger crawls documents for AGROVOC terms, maps each AGROVOC term it finds to an AgroTag term (a term from the subset mentioned earlier) and uses statistical techniques to calculate how suitable these terms are as keywords.
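To make the idea more concrete, here is a minimal Python sketch of controlled-vocabulary keyword extraction. It is not AgroTagger's actual code: the real tool relies on the trained KEA/Maui models, whereas this toy version swaps in a simple frequency-and-position score. The term mapping, the function name extract_keywords and the scoring formula are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical mapping from AGROVOC terms to AgroTag terms (illustrative only,
# not taken from the real vocabularies).
AGROVOC_TO_AGROTAG = {
    "maize": "maize",
    "zea mays": "maize",
    "irrigation": "irrigation",
    "drip irrigation": "irrigation",
    "soil fertility": "soil fertility",
}


def extract_keywords(text, mapping=AGROVOC_TO_AGROTAG, top_n=3):
    """Return up to top_n AgroTag terms ranked by a toy suitability score."""
    normalized = re.sub(r"\s+", " ", text.lower())
    length = max(len(normalized), 1)
    scores = defaultdict(float)

    for term, tag in mapping.items():
        # Match the vocabulary term as a whole word or phrase.
        pattern = r"\b" + re.escape(term) + r"\b"
        matches = list(re.finditer(pattern, normalized))
        if not matches:
            continue
        frequency = len(matches)
        # Relative position of the first occurrence (0.0 = start of the text).
        first_position = matches[0].start() / length
        # Assumed score: frequent terms that appear early rank higher.
        scores[tag] += frequency * (1.0 - first_position)

    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:top_n]]


if __name__ == "__main__":
    sample = ("Maize yields improved under drip irrigation. "
              "Zea mays responds well to irrigation.")
    print(extract_keywords(sample))
```

Term frequency and the position of the first occurrence are the kind of statistical signals that KEA-style extractors use, but here they are combined by hand rather than learned from annotated documents.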
On a more technical level, AgroTagger is a Java application made up of three sub-applications that are executed sequentially. Bash scripts are provided to run the application in a Unix environment. When the input consists of web documents, AgroTagger can take either a file containing a list of URLs to be indexed or a file containing the output of an Apache Nutch web crawler. The output of AgroTagger is an RDF N-Triples file (packed in a tar.gz archive), which mainly contains the “dcterms:subject” predicate. Other predicates can be activated using boolean flags. You can find more technical information on the FAO AIMS web page and the video above.
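As an illustration of what such output might look like, the following Python sketch writes dcterms:subject statements in N-Triples syntax and packs them into a tar.gz archive. It is only a sketch under stated assumptions: the document and concept URIs (under example.org), the member file name keywords.nt and both function names are hypothetical, and AgroTagger's real file layout may differ.

```python
import io
import tarfile

# Full IRI of the dcterms:subject predicate (Dublin Core Terms namespace).
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def to_ntriples(doc_uri, concept_uris):
    """Serialize one N-Triples line per (document, subject concept) pair."""
    return "".join(
        f"<{doc_uri}> <{DCTERMS_SUBJECT}> <{concept}> .\n"
        for concept in concept_uris
    )


def pack_tar_gz(ntriples, member_name="keywords.nt"):
    """Return the bytes of a tar.gz archive holding the N-Triples text."""
    data = ntriples.encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical URIs used purely for illustration.
    triples = to_ntriples(
        "http://example.org/doc/1",
        ["http://example.org/concept/maize",
         "http://example.org/concept/irrigation"],
    )
    print(triples, end="")
    print(len(pack_tar_gz(triples)), "bytes in archive")
```

Each line of the file is one complete statement, so output from many documents can simply be concatenated before being loaded into a triple store.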
Applications of AgroTagger
AgroTagger is used in FAO AGRIS, a collection of almost 8 million bibliographic references in agriculture. Most of these references are described with AGROVOC terms; at the same time, AGROVOC terms have been mapped to other vocabularies used for the classification of different data types (such as statistical data, maps, germplasm data and country profiles), so links are automatically created between related resources, leading to a semantically rich description of a specific bibliographic reference (see figure below). However, not all AGRIS records are annotated with AGROVOC terms, so those records cannot be linked to related external resources.
The solution comes from AgroTagger: by running AgroTagger on the records that have not been annotated with AGROVOC terms, keywords are automatically extracted and applied to each record. This adds information in the form of metadata and enables linking to external data sources annotated with the same AGROVOC terms (or with terms mapped to a related AGROVOC term). The process has already been applied in several cases with high success rates in terms of quantity (e.g. 4-10 AGROVOC terms applied to each resource) and quality (high relevance of the extracted terms to the context of the publication). AgroTagger is particularly valuable where the full text of publications is available but no metadata records (or only poor-quality ones) are associated with them, saving the time and effort needed for manual annotation of these resources.
The source code of AgroTagger is available on GitHub.
This blog post was written by Vassilis Protonotarios, Senior Project Manager at AgroKnow.
More information
Agrotags – A Tagging Scheme for Agricultural Digital Objects
AgroTagger as a part of the SemaGrow demonstrator (slides)
Agrotagger: indexation automatique de PDFs avec le thesaurus Agrovoc (INRA blog)
This blogpost was written by Vassilis Protonotarios, who is a member of the Business Development team of Agro-Know. It was initially posted on the Agroknow blog.
Tags: AgroTagger, AGROVOC, keyword extraction, Text mining