An essential first step towards an open infrastructure for text and data mining (TDM) is to identify the target users who will eventually use the tools and services it provides. We need to understand the needs, benefits, challenges and barriers of text mining in each community. To this end, several use cases have been identified, which fall under four thematic areas of interest to the project. One of these areas is the Social Sciences, where GESIS leads the subtask that gathers the requirements for text and data mining in this domain.
GESIS – Leibniz Institute for the Social Sciences is the largest German infrastructure institute for the Social Sciences. The GESIS department “Knowledge Technologies for the Social Sciences” (WTS) focuses on advancing and improving digital services for the Social Sciences on the basis of novel knowledge technologies. To ensure the high quality of GESIS services, WTS carries out research in applied Computer Science, in particular in the fields of Web Science, Semantic Web, Linked Open Data and Information Retrieval.
TDM for the Social Sciences
There are several ways in which social scientists may come into contact with text and data mining. They can be divided broadly into two categories:
social science researchers who actively perform TDM in their research, for example political scientists analyzing large amounts of tweet data to gain insights into political discourse;
social science researchers who do not perform TDM themselves, but make use of services that perform TDM in the background to create a satisfying user experience.
Because research at WTS focuses on providing services for social scientists, we primarily target the second group of users. That is, we plan to use the OpenMinTeD platform to develop and advance solutions that will be used by social science researchers.
To this end, we envisage two use cases:
Automatic detection, disambiguation and linking of entities in Social Science text corpora to enhance indexing and searching
Automatic coding of unstructured answers in surveys
The first use case aims to ensure the reliability and usefulness of search results. Social scientists spend a significant part of their time searching for relevant information and data, such as publications related to their research or research data from studies. GESIS provides services that assist them in their search, and these services could be enhanced using TDM. If relevant entities in texts, such as persons or data citations, are reliably detected and disambiguated, they can be linked within and across documents, which might facilitate the retrieval process.
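To make the idea of detecting and linking entities across documents more concrete, here is a minimal sketch in Python. It is not the OpenMinTeD or GESIS implementation. The alias table, the entity identifiers and the sample documents are all hypothetical, and simple dictionary lookup stands in for real detection and disambiguation, which would have to take context into account.

```python
import re
from collections import defaultdict

# Hypothetical alias table: surface form -> canonical entity ID.
# A real service would draw on an authority file or knowledge base.
ALIASES = {
    "max weber": "person:max_weber",
    "m. weber": "person:max_weber",
    "allbus": "dataset:allbus",
    "german general social survey": "dataset:allbus",
}


def link_entities(text):
    """Return the sorted canonical IDs of all known entities mentioned in text."""
    lowered = text.lower()
    found = set()
    for alias, entity_id in ALIASES.items():
        # Lookarounds ensure the alias is not part of a longer word.
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(entity_id)
    return sorted(found)


def build_index(documents):
    """Map each canonical entity ID to the IDs of the documents mentioning it."""
    index = defaultdict(list)
    for doc_id, text in sorted(documents.items()):
        for entity_id in link_entities(text):
            index[entity_id].append(doc_id)
    return dict(index)


# Hypothetical sample documents.
DOCS = {
    "doc1": "We analyse data from the German General Social Survey.",
    "doc2": "Following M. Weber, we use ALLBUS to study attitudes.",
    "doc3": "Max Weber's typology is applied to interview data.",
}

INDEX = build_index(DOCS)
```

Because different surface forms resolve to the same identifier, a search for the data set retrieves both doc1 and doc2, although the two documents share no common name for it.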
The second use case addresses the problem that open-ended questions in surveys are far harder to code than closed questions. Nevertheless, they are still frequently included in surveys because they can provide valuable insights into respondents’ thoughts. As answers to this kind of question are free natural language text (though often elliptical or otherwise ungrammatical and erroneous), text mining may be used to support human coders in mapping those unstructured answers to a finite set of categories, known as a code schema.
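The following Python sketch illustrates how such coder support might look in its simplest form. It is an assumption-laden illustration, not the method used in the project: the code schema, its keywords and the scoring by keyword overlap are all hypothetical, and a real system would more likely rely on a trained classifier. The function only suggests ranked codes; the final decision stays with the human coder.

```python
import re

# Hypothetical code schema: category code -> indicative keywords.
CODE_SCHEMA = {
    "ECONOMY": {"unemployment", "jobs", "wages", "inflation", "prices"},
    "ENVIRONMENT": {"climate", "pollution", "energy", "environment"},
    "MIGRATION": {"immigration", "refugees", "asylum", "migration"},
}


def tokenize(answer):
    """Lowercase the answer and return its set of alphabetic tokens."""
    return set(re.findall(r"[^\W\d_]+", answer.lower()))


def suggest_codes(answer, schema=CODE_SCHEMA):
    """Return (code, score) pairs ranked by keyword overlap, best first.

    An empty list means no suggestion could be made, so the answer
    is left entirely to the human coder.
    """
    tokens = tokenize(answer)
    scores = {code: len(tokens & keywords) for code, keywords in schema.items()}
    return sorted(
        ((code, score) for code, score in scores.items() if score > 0),
        key=lambda item: (-item[1], item[0]),
    )


SUGGESTIONS = suggest_codes("too few jobs, rising prices and the climate")
```

For the sample answer, the sketch ranks ECONOMY (two matching keywords) ahead of ENVIRONMENT (one). Elliptical answers that match no keyword yield no suggestion, which reflects the point that such answers still need human judgment.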
To gain insight into the data and information needs of social scientists and, specifically, the challenges they face when searching for information or data (thus addressing our first use case), we conducted an online survey in our user community. We specifically asked respondents about content-related challenges they face when using specialized portals for their information search. We also proposed some potential solutions to these challenges and asked respondents to rate their usefulness. To get to know our target personas even better, we also asked whether they already had experience with TDM and, if so, which tools or services they knew or used. In addition, we included an open-ended question asking for research projects using TDM (which targets the first group of users identified above).
The survey results gave us some interesting insights. Many respondents wish for better search facilities for finding scientific publications and research projects or data. Concerning current challenges, besides the difficulty of finding the data, another problem is that search engines do not disambiguate search terms, so search results are often unusable. Around 80% of the respondents rated our proposed solutions to those problems as at least partially useful. The proposed solutions include, among others:
Resolution of ambiguities (conceptual and topical)
Automatic extraction of e.g. definitions of terms
Explicit presentation of relations between publications and research data, or publications and authors (who cited whom, who was cited by whom)
We will further elaborate on the challenges and proposed solutions for enhancing information search with TDM techniques. To this end, we plan to conduct interviews with stakeholders from our community and to organize a small workshop with the same purpose.
This article was written by Mandy Neumann.
Mandy Neumann works as a junior scientist in the department “Knowledge Technologies for the Social Sciences” (WTS) at GESIS. If you have further questions about this topic, you can send her an email.