Univ. of Manchester (NaCTeM)
The National Centre for Text Mining (NaCTeM), based in the School of Computer Science at the University of Manchester, is the first publicly-funded text mining centre in the world. It was established to provide support, advice, and information on text mining (TM) technologies and to disseminate information from the larger TM community, whilst also providing tailored services and tools in response to the requirements of the academic community.
NaCTeM has developed a number of software tools, ranging from tokenisers and part-of-speech taggers to more complex tools able to extract named entities and the relationships between them (events). These tools, which operate on different types of texts covering different subject areas, including biomedicine, systems biology, chemistry and social sciences, provide researchers with the building blocks needed to apply text mining techniques to problems within their specific areas of interest. Our resources, including corpora annotated with various levels of semantic information, terminological repositories and computational lexica, provide the means to train and develop new tools for different domains.
We have used our TM tools to develop semantic search systems with various functionalities, operating over document collections belonging to different textual domains. Examples include the KLEIO system for MEDLINE abstracts, which provides faceted refinement of search results based on the presence of named entities. FACTA+, also operating over MEDLINE, finds both direct and indirect associations between concepts of different types. Our FACTA+ Visualizer tool helps users to understand these associations.
NaCTeM’s EvidenceFinder and EvidenceFinder for Anatomical Entities tools help users to explore relationships involving entities of interest in the Europe PubMed Central archive. The latter also allows such relationships to be filtered according to various aspects of their interpretation (meta-knowledge), e.g., whether the relationship is definite or speculated, positive or negated, and whether it describes a definite fact, an experimental observation, an analysis of results, etc. The ASCOT system clusters clinical trial documents and allows semantically motivated, faceted search refinements. The recently developed History of Medicine (HOM) search system operates over historical medical archives dating from 1840 to the present day. It combines semantic faceted search, using domain-specific and historically relevant named entities, with more specific event-based searching, the suggestion of (possibly historically relevant) terms related to query terms to widen searches, and historical tracking and comparison of term usage.
Construction of customised TM solutions for specific textual domains and tasks is further enhanced by our focus on interoperability of tools, based on the UIMA framework. We have designed and implemented graphical user interfaces that allow pipelines of interoperable tools to be rapidly constructed and evaluated. The latest of these interfaces, Argo, is a web-based platform that allows construction of complex workflows, and facilitates collaboration between groups of users through sharing of workflows and documents. The compliance of our interoperable tools with a hierarchy of annotation types commonly produced by different types of TM tools, i.e., the U-Compare type system, promotes greater interoperability between tools. The type system has been demonstrated to be applicable to tools operating on a number of different textual domains and on a number of different European languages.
Recent and ongoing projects at NaCTeM include Big Mechanism, which is using TM techniques to help build up detailed background knowledge about causal models of cancer mechanisms by automatically analysing vast volumes of literature; Mining Biodiversity, which is providing enhanced access to the Biodiversity Heritage Library (BHL) by applying innovative TM techniques to enrich the library contents with semantic metadata and by developing a term inventory; Supporting Evidence-based Public Health Interventions using Text Mining, which aims to improve the efficiency of the systematic reviewing process in the public health domain through the application of innovative TM techniques to detect topics and cluster documents; ISHER, which provides semantically enhanced search facilities over the archives of the New York Times; OSSMETER, whose aim has been to extend the state of the art in automated analysis and quality measurement of Open Source Software, and which integrates TM techniques to automatically assess the quality of user support and the level of user satisfaction as observed in different communication channels; and Mining the History of Medicine, which, in addition to the HOM system introduced above, has developed two new resources for historical medical TM: a corpus of historical medical documents representing different periods and writing styles, annotated with named entities and events, and a time-sensitive inventory of historical medical terminology, which accounts for semantic shifts in terminology over time, to facilitate more effective searching of document collections spanning long time periods.
Sophia Ananiadou (Sophia.Ananiadou@manchester.ac.uk) and John McNaught (John.McNaught@manchester.ac.uk)