Providing insight into the structure of scientific papers
How is a scientific paper structured, and how closely is it related to other papers? These are some of the questions that Iana Atanassova of the University of Bourgogne Franche-Comté (Besançon, France) addresses in her research. She uses text and data mining (TDM) to study full-text scientific articles. Studying these papers can be a challenge, as they are usually available in a format that is hard to process.
“In this paper we look at the full-text content of scientific articles, and we look at citations. We try to identify references that appear more than once in an article: for example, a reference that appears in the introduction section and then again in the methods or results section. This is really interesting, because a reference that is cited more than once suggests that the citing paper is very strongly related to the cited paper. All this allowed us to produce a visualisation and examine the age of references.
What we found, considering the age of the recurring references, is for example that in the methods section the average age of these references is somewhat lower than in the other sections. This means that methods sections tend to cite more recent sources than the other sections do. We were also able to produce visualisations of recurring references that can give some insight into the structure of scientific papers.
I work mostly on the full text of scientific articles, and one of the problems is the availability of full-text corpora. Most corpora are actually in PDF format, which is quite difficult to process and to mine for text.
I think text mining is a great way to approach new domains. And it is always interesting to take up a dataset and look into the results. This is really a great way to start working on something.”
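The analysis described in the interview, finding references that are cited more than once in an article and comparing their ages across sections, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `recurring_reference_ages`, the `(ref_id, section, ref_year)` input format, the definition of age as the citing article's year minus the reference's year, and the sample data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean


def recurring_reference_ages(citations, article_year):
    """Return the mean age of recurring references per section.

    citations: iterable of (ref_id, section, ref_year) tuples, one tuple
    per in-text citation (hypothetical input format).
    article_year: publication year of the citing article.
    """
    citations = list(citations)

    # A reference counts as "recurring" if it is cited more than once.
    counts = Counter(ref_id for ref_id, _, _ in citations)
    recurring = {ref_id for ref_id, n in counts.items() if n > 1}

    # Collect the age of each recurring citation under its section.
    ages = defaultdict(list)
    for ref_id, section, ref_year in citations:
        if ref_id in recurring:
            ages[section].append(article_year - ref_year)

    return {section: mean(values) for section, values in ages.items()}


# Made-up example data for a single article published in 2016.
citations = [
    ("smith2010", "introduction", 2010),
    ("smith2010", "methods", 2010),
    ("lee2014", "methods", 2014),
    ("lee2014", "results", 2014),
    ("jones2001", "introduction", 2001),  # cited once, so not recurring
]

result = recurring_reference_ages(citations, article_year=2016)
for section, avg_age in result.items():
    print(section, avg_age)
```

A real study would first have to extract the citations and section boundaries from the full text, which is the difficult step when the articles are only available as PDF.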
- Related paper: Atanassova and Bertin, Temporal properties of recurring in-text references, D-Lib Magazine Vol 22, September/October 2016