Text mine millions of research papers with the CORE dataset
CORE is an aggregation service that harvests content from open access journals and from institutional and disciplinary repositories around the world. It offers one of the largest collections of scientific content via its Datasets, ready to be text-mined. We encourage everyone to use it as part of OpenMinTeD and beyond.
The current version of the dataset was released last October and contains 24 million metadata records and 4 million full-text records of research articles. Compared with previous years, the amount of data has grown substantially: the collection has doubled since the previous dataset release in September 2015. Our dataset collection dates back to April 2013. CORE is a strong supporter of Open Access, and its service aims to provide content that can be text-mined, mainly for research purposes.
The CORE Dataset is composed of two files: one with enriched metadata and one with the full text. On our website, you can find more information about the file compression and about how a metadata item in a dataset is structured. The dataset can be downloaded from the CORE website. If you are interested in the CORE “live” data, we also offer the CORE API for free.
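As a rough illustration of how a downloaded file might be processed, here is a minimal Python sketch. It rests on assumptions that this post does not confirm: that the file is gzip-compressed and holds one JSON object per line, and that a record has a field called `fullText`. The function names, the file name and the field name are all hypothetical, so please check the dataset documentation on the CORE website for the actual compression format and record structure before relying on it.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a gzip-compressed JSON Lines file.

    Assumption: the file is gzip-compressed, one JSON object per line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_with_field(path: str, field: str) -> int:
    """Count records in which `field` is present and non-empty."""
    return sum(1 for record in iter_records(path) if record.get(field))


if __name__ == "__main__":
    # Hypothetical file name and field name, used only for illustration.
    print(count_with_field("core_dataset.jsonl.gz", "fullText"))
```

Reading the file line by line keeps memory use flat, which matters for a collection with millions of records.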
CORE participates in the OpenMinTeD project, and one of our roles is to act as a content provider of Open Access scholarly literature. To this end, a massive effort has been undertaken to normalise and harmonise the dataset so that it is easily accessible to text miners, especially those who wish to use it for professional purposes. The dataset is designed to be interoperable, since it is intended for use in a variety of tasks and by heterogeneous software components built for different purposes. Within the OpenMinTeD project, the use cases considered include automatic classification of publications based on specific taxonomies (used for the agriculture/biodiversity domain), complex information linking and retrieval from social sciences publications, extraction of metabolites and their properties and modes of action, and many others.
This blogpost was written by Nancy Pontika, Lucas Anastasiou and Petr Knoth, CORE, The Open University.