Text Mining for social sciences – tackling the challenges to make search systems smarter

In the OpenMinTeD project, partners from different scientific communities are involved to make sure the OpenMinTeD infrastructure will address their needs. For the social sciences, a particularly useful application of text mining is the improvement of literature search and information interlinking. To this end, three main challenges were identified: Named Entity Recognition, automatic keyword assignment to texts, and automatic detection of mentions of survey variables. This post gives an overview of these tasks and the progress of our work so far.


Update: where we are now

In our last blog post, we explained how we identified the major challenges that social scientists face when searching for the information they need for their research. We conducted a survey which showed that many respondents wish for better search facilities for finding scientific publications and research data. In particular, they rated many of our proposed solutions to this problem as helpful or very helpful, for example the detection of domain-specific entities in full texts or the explicit presentation of relations between publications and research data.

Thus, we decided to divide our use case work into three sub-tasks:

  1. Named Entity Recognition on social science publications, which will allow, among others, detection and disambiguation of terms during the search process;
  2. Automatic keyword assignment to texts, which enhances indexing and thus indirectly influences the search experience for the user, hopefully for the better; and
  3. Automatic detection of mentions of survey variables in publications, and linking between publications and corresponding variables from research datasets.

In the following we will explain briefly how we plan to tackle these problems with text mining methods.

Named Entity Recognition

The potential use of this is twofold: First, as said before, it will allow for disambiguation of search terms. If the search system knows, for example, that there’s a difference between “Washington” as a person and “Washington” as a location, it can go beyond simple keyword search and provide the user with more precise results according to their information need. Second, the recognized entities in a text may be presented to the user in a way that makes it easier for them to see which entities occur in a text, how often and where. This may be achieved for example with interactive highlighting.




Both of the ideas just mentioned require information about words in a text that can be generated via a text mining technique called Named Entity Recognition. A named entity recognizer is a piece of software that identifies certain entities in a text and attaches to each a label denoting its entity type. It does this either by following hand-crafted rules (e.g. built with the help of linguists and domain experts), or by exploiting certain features of the input words, for example shape or context features. Because hand-crafting rules for different entity types is laborious, time-consuming and in many cases not even achievable with satisfactory accuracy, the second option (which belongs to the field of supervised learning) is preferable. In either case, an entity can be identified by looking at its textual context. Picking up the “Washington” example from above, a well-trained or well-configured system will label the word “Washington” as a person when it is accompanied by phrases denoting actions (go, say, sign etc.), while the same word might be labeled as a location when the context indicates it is a place where something happens, where people go to, etc.
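To make the notion of shape and context features concrete, here is a minimal sketch in Python of the kind of feature dictionary a supervised recognizer (e.g. a CRF-based tagger) might receive per token. The feature names are invented for illustration, not taken from any particular tool:

```python
def token_features(tokens, i):
    """Shape and context features for the token at position i,
    of the kind typically fed to a sequence classifier."""
    tok = tokens[i]
    return {
        "word.lower": tok.lower(),
        "word.istitle": tok.istitle(),
        "word.isupper": tok.isupper(),
        # word shape: uppercase -> X, lowercase -> x, digit -> d
        "word.shape": "".join(
            "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
            for c in tok
        ),
        # context features: the neighbouring words
        "prev.word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
```

For “Washington said yes” versus “moved to Washington”, the `prev.word` and `next.word` features differ (`said` follows in one case, `to` precedes in the other), which is exactly the contextual signal a classifier can exploit to label the same word as person or location.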

In order to train the named entity recognizer, we first need the relevant entities annotated in a set of documents. Because we plan to go beyond the traditional set of entities (persons, locations and organizations) to obtain results more suitable to our domain, it is essential to define our own tag set and annotate a domain-specific corpus. On the one hand, we decided to define more fine-grained categories, for example distinguishing between individual persons and groups (like ethnic groups, which are important in sociology) or between scientific, political and other organizations. On the other hand, we introduced new categories, e.g. for media, historical events, scientific topics and research methods. With these annotations at hand, we can then train the classifier, evaluate its accuracy, and finally apply it to any desired publication from the social sciences. We plan to test several of the available named entity recognizers, evaluate their performance and choose the best one to be implemented in our digital library services.
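Such annotations are commonly stored in BIO encoding (B = beginning of an entity, I = inside, O = outside). The following sketch shows what one annotated sentence could look like, together with a small helper for reading the entity spans back out. The sentence and the fine-grained tag names (GROUP for social/ethnic groups, ORG-SCI for scientific organizations, METHOD for research methods) are illustrative examples, not the project's actual tag set:

```python
# One example sentence in BIO encoding with invented fine-grained tags.
annotated = [
    ("The", "O"), ("Roma", "B-GROUP"), ("minority", "O"),
    ("was", "O"), ("surveyed", "O"), ("by", "O"),
    ("GESIS", "B-ORG-SCI"), ("using", "O"),
    ("telephone", "B-METHOD"), ("interviews", "I-METHOD"), (".", "O"),
]

def entities(tagged):
    """Collect (entity text, entity type) pairs from a BIO-tagged sentence."""
    spans, current, etype = [], [], None
    for tok, tag in tagged:
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        spans.append((" ".join(current), etype))
    return spans
```

A corpus of sentences in this form is what both training and evaluation of the recognizer operate on.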

Automatic keyword assignment

The task of assigning keywords from a controlled vocabulary (e.g. a thesaurus) to publications is currently a manual one in our domain. This means that librarians, who can be considered experts in the scientific field they are working in, have to read each incoming publication and decide carefully which terms and descriptors match its content best.

There are text mining techniques that may at least partly automate this procedure. We don't aim to make the librarians' jobs superfluous; rather, we want to assist them in what they are doing, so that they can classify publications more quickly and easily.

The idea is simple: We plan to compile a corpus of textual documents that are already classified by human curators. We take the classification terms as ground truth and train the system to assign thesaurus terms to new documents that are not yet classified. Then we compare the results to the ground truth to assess the performance of the system.
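To illustrate the idea, here is a minimal, self-contained baseline sketch in Python. All document texts and thesaurus terms below are invented; a production system would use a proper multi-label classifier over TF-IDF features, but this k-NN-style scoring captures the train-on-curated-data principle:

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words with naive lowercase whitespace tokenisation."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def suggest_keywords(new_doc, training, top_n=3):
    """Score each thesaurus term by the similarity of the new document
    to the curated training documents carrying that term."""
    scores = Counter()
    for text, keywords in training:
        sim = cosine(bow(new_doc), bow(text))
        for kw in keywords:
            scores[kw] += sim
    return [kw for kw, _ in scores.most_common(top_n)]

# Curated ground truth: (document text, manually assigned thesaurus terms).
training = [
    ("survey on voting behaviour and elections", ["elections"]),
    ("analysis of voting and electoral turnout", ["elections", "political participation"]),
    ("study of migration and ethnic groups", ["migration"]),
]
```

Comparing the suggested terms for held-out documents against the librarians' assignments then gives the performance estimate described above.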

Unfortunately, this approach has one big flaw: the manual keyword assignments cannot be treated as a gold standard, because a keyword assigned by the system but not by a human is not necessarily a wrong choice. It could simply mean that the human overlooked that term. So we will still need the librarian in the chain to evaluate the system's output. This is also the reason why we don't plan to fully automate the keyword assignment process, but rather to create an interactive tool that presents the librarian with the keywords the system has identified, so that they can mark them as correct or incorrect, and add missing ones manually if necessary. This process of manually correcting the system's output may even feed back into training, so that the system makes better decisions over time.

Automatic detection of variable mentions in publications

In empirical social sciences, survey data plays an important role, not just for those planning and conducting the surveys, but also for those working with the results. They analyze them, combine them with other survey data or with data from other sources, and finally publish their results in textual form. Other researchers might look for those publications to see which surveys have been analyzed in the past and in which context the data has been used. Unfortunately, there is currently no easy way to identify dataset mentions in texts other than careful reading. We want to change this situation.

A first step towards that end is another project at GESIS called InFoLiS, in which a system has been implemented that automatically identifies mentions of datasets in text and links corresponding publications and datasets for easy navigation between the two. The next step is to identify the precise subset of the data that is referenced; in the case of surveys, these are the survey variables. Oftentimes a survey contains hundreds of different variables, whereas a publication analyzes only a small group of them in detail, e.g. because they all cover a specific topic.

As with the datasets themselves, there is no standard way to reference a specific variable in text. The situation is even worse: as opposed to datasets, variables don't really have a unique name. They have an ID that is used internally in data catalogues and is neither unique across datasets nor used as a reference in running text. They also have a label, which is a shortened form of the actual question asked in the questionnaire or interview, and there is no standard for how to turn such a question text into a variable label. Long story short: variable labels have to be treated as more or less natural-language text.

Now, when authors refer to a variable, they use this natural-language label and adjust it so that it fits neatly into their own formulations. Here's an example that illustrates this case:

variable label: Religious leaders should not influence vote

mention in the text: […] where respondents were asked about the influence of religious leaders on people's votes and the government.

Unlike for Named Entity Recognition and keyword extraction, there are no ready-to-use tools that merely have to be adjusted to the domain at hand, so there is no trivial solution to this problem. We need to come up with clever algorithms ourselves that take an input text and a knowledge base containing datasets and their variables, and identify overlaps between phrases in the text and variable labels (or question texts) with a confidence above a certain threshold. How we do this in detail is still open to discussion. We believe this is a very interesting use case for text mining in the context of OpenMinTeD.
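One simple way to sketch such an overlap check is fuzzy string matching between variable labels and sliding windows over the text, keeping matches whose similarity exceeds a threshold. The variable IDs and labels below are invented for illustration; this is a baseline sketch, not our final algorithm:

```python
from difflib import SequenceMatcher

def find_variable_mentions(sentence, variables, threshold=0.6):
    """Score each variable label against every word window of the
    sentence and keep matches above the similarity threshold.
    `variables` maps (invented) variable IDs to natural-language labels."""
    words = sentence.lower().split()
    hits = []
    for var_id, label in variables.items():
        lab = label.lower()
        lab_len = len(lab.split())
        best = 0.0
        # compare the label against windows up to twice its own length
        for i in range(len(words)):
            for j in range(i + 1, min(len(words), i + 2 * lab_len) + 1):
                window = " ".join(words[i:j])
                best = max(best, SequenceMatcher(None, lab, window).ratio())
        if best >= threshold:
            hits.append((var_id, round(best, 2)))
    return hits

variables = {
    "v42": "Religious leaders should not influence vote",
    "v43": "Trust in parliament",
}
sentence = ("respondents were asked whether religious leaders "
            "should influence how people vote")
```

Here the paraphrased mention still scores well above the threshold for the matching label, while the unrelated variable stays below it; a real system would additionally match against the full question texts and tune the threshold on annotated data.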

Currently we are in the process of building an evaluation dataset by having students of the social sciences manually annotate empirical publications for survey variable mentions. We will use this data 1) to gain more insight into how the problem might best be tackled, and 2) as a gold standard against which to evaluate our algorithms.


In this post, we outlined the subtasks of the social sciences use case in more detail. We explained why, and how, we need the tasks of Named Entity Recognition, Keyword Detection and detection of survey variable mentions in full-texts to enhance digital services for our community. We plan to make the fruits of our work so far publicly available with the First Use Case Application Release coming up in January 2017. There is also a lot more to do on the road ahead, so stay tuned for our next post about progress in the social sciences use case.

This blogpost was written by Mandy Neumann (GESIS) and Masoud Kiaeeha (TU Darmstadt).