Why the proposed Text and Data Mining exception is not what EU copyright law needs



The Proposal for a Directive on Copyright in the Digital Single Market (the Proposal) contains a number of provisions intended to modernise EU copyright law and to make it “fit for the digital age”.[1] Some of these provisions have been the object of a lively scholarly debate, either because of their controversial nature (the proposed adjustment of intermediary liability for copyright purposes contained in Art. 13, see here at p. 7) or because they propose to introduce a new right within the already variegated EU neighbouring rights landscape (i.e. the protection for press publishers contained in Art. 11).

The provision contained in Art. 3 of the Proposal, dedicated to “Text and data mining”, has attracted far less attention (however, see here and here). The goal of Art. 3 is to introduce a mandatory exception in EU copyright law which will exempt acts of reproduction made by research organisations in order to carry out text and data mining for the purposes of scientific research. In this blog, Thomas Margoni and Martin Kretschmer discuss Art. 3 and explain why its formulation – although underpinned by the right innovation policy goal – is wrong.

This blog was originally posted on April 25, 2018 by Thomas Margoni and Martin Kretschmer.

2) Text and Data Mining. Or the creation of new knowledge from existing information (but not in the EU)

It has been calculated that the global research community generates over 1.5 million new scholarly articles per annum (The STM report (2009) p. 5) or approximately one new paper every 30 seconds (Spangler et al, Automated Hypothesis Generation based on Mining Scientific Literature, (2014), p. 1877). It is quite clear that the scientific community as a whole is not able to maintain an adequate level of understanding of all the scientific knowledge produced. This is not only bad for science but also bad for the economy, because resources are spent duplicating knowledge that probably already exists but has not been found. Data confirm this by showing that some 90% of all published scientific papers are never cited, whereas 50% of them are never read by anyone other than their authors, referees and journal editors (Lokman I. Meho, The rise and rise of citation analysis (2007)). It would be fantastic if it were possible to “hire” an additional 1 million well trained and well paid researchers willing to cover all this wealth of knowledge. But this is unlikely to happen any time soon. Nevertheless, there is a solution that could be put in place right now: to use the power of computers and the “intelligence” of modern machine learning algorithms to perform that job. The low cost of computer hardware and software, together with their speed and tireless energy, could easily be used to allow the scientific community to fix the problem that half of the scientific knowledge currently produced goes unread. There is only one little problem: this is a copyright infringement, at least in the EU. In other, more “innovation oriented” economies, TDM is generally considered a lawful activity.

There are many examples of how TDM may significantly improve the quality of research and boost its development, including in ways that could not be covered even by the 1 million new researchers hired under the aforementioned anecdotal example. In the EU, in fields such as linguistics, the ability to develop automated translation tools is currently limited mostly to the official documents produced by the European Union, which are translated into all EU official languages and, most importantly, are generally openly available and reusable. Imagine what it would mean for these types of applications if the original data sources were not limited to the official texts of the EU bodies but, thanks to properly devised copyright laws, included all information available on the Internet (you might even get an EU start-up to finally compete on a level playing field with the various Google, Facebook, Amazon, Twitter, etc). Similar examples can be found in the possibility to TDM the web and the online archives of journals, libraries and collections in order to verify the historical accuracy of certain facts and thus to combat fake news (something not covered by Art. 3 because journalists are not research organisations operating for research purposes), or to favour new developments within the field of TDM such as deep learning, knowledge discovery, machine learning and so on.

It is worth noting that the large majority (if not the totality) of these cases are restricted by copyright law in the EU, but are considered lawful in countries such as the US (and other countries implementing similar investment-innovation balancing approaches), mainly thanks to flexible norms, e.g. the fair use doctrine, which considers most of these uses transformative. In particular, under US copyright law the more transformative the new work, the likelier it is to constitute fair use, and courts have found that text and data mining is inherently transformative.

2.1) A TDM definition

Text and data mining is a term used to refer to a variety of analytical tools normally based on the use of digital technologies, big data and the Internet. The Proposal defines TDM as “any automated analytical technique aiming to analyse text and data in digital form in order to generate information such as patterns, trends and correlations” (Art. 2(2) of the Proposal) as well as “the automated computational analysis of information in digital form, such as text, sounds, images or data” enabled by new technologies (Recital 8).

Importantly, TDM allows the creation of new knowledge from any sort of information, in particular from already existing structured and unstructured data such as texts, images, sounds or databases which often were created for other purposes (e.g. a public agency maintaining a log of temperature measurements in a given location, or the dataset collected for a now concluded research project). Furthermore, and perhaps crucially, what TDM enables is the correlation of the most diverse sets of information by combining data that would otherwise never have been combined, simply because no one would have thought that any correlation or pattern could be identified. This type of analysis is usually very time and labour consuming and involves a certain degree of risk (it does not guarantee that any pattern or correlation will be identified), but if a properly trained algorithm can do this efficiently (i.e. at a marginal cost tending to zero) then the risk is significantly reduced, if not completely eliminated. In cases like these, where scientific achievements can offer new opportunities for socio-economic and cultural development, the legal system must offer a clear set of rules within which science can move confidently.

2.2) Where is the problem?

The main problem is that EU copyright law treats most TDM activities as copyright infringement. It is noteworthy that other, more innovation-oriented jurisdictions (such as the U.S., Singapore, Japan) consider TDM lawful. The scientific and economic sectors in those jurisdictions have therefore been employing TDM for a number of years, leaving the EU behind.

The reason for this situation can be found in a broad definition of protected rights (especially, but not exclusively, the right of reproduction, i.e. to make copies) which is not counterbalanced by a similarly broad definition of limitations to copyright (especially, but not exclusively, to the right of reproduction). The right of reproduction is defined as any “direct or indirect, temporary or permanent reproduction by any means and in any form, in whole or in part” by Art. 2 of Directive 2001/29/EC (InfoSoc Directive). As is the norm with digital technologies, in order to “text-and-data-mine” information it is usually necessary to make (temporary) copies of the original data and dataset in order to extract information (see here for a paper describing a machine learning example). It is important to note that TDM is a type of “non consumptive use” of copyright material. The work is not used as a work; only the information, ideas and facts contained therein are used.
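To make the point about temporary copies concrete, here is a minimal sketch in Python. It is purely illustrative and not drawn from any real TDM system: the function name `mine_term_frequencies` and the choice of word counts as the “mined” information are our own assumptions. It shows the three typical stages: a working copy is made, facts are extracted from it, and the copy is discarded.

```python
import re
from collections import Counter


def mine_term_frequencies(source_text: str, top_n: int = 3) -> list[tuple[str, int]]:
    """Hypothetical, toy-sized TDM step: count the most frequent words."""
    # Stage 1 ("data capture"): a normalised working copy of the source is
    # created in memory. This temporary copy is the act that engages the
    # right of reproduction.
    working_copy = source_text.lower()

    # Stage 2: only factual information (how often each word occurs) is
    # extracted; the expression of the work is not reused as such.
    counts = Counter(re.findall(r"[a-z]+", working_copy))

    # Stage 3: the working copy is dropped; only the counts survive.
    del working_copy
    return counts.most_common(top_n)
```

The output contains no sentence of the source text, only counts, which is what “non consumptive use” refers to. The intermediate `working_copy` is nonetheless a reproduction in the broad sense of Art. 2 InfoSoc.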

In the light of the broad definition of the right of reproduction, the copies made during TDM analysis have the potential to infringe copyright. This infringement, however, could be exempted on the basis of an exception or limitation to copyright. After all, as explained in the Preamble of the same InfoSoc Directive, the broadly defined EU right of reproduction would make the very act of browsing the Internet a copyright infringement (because of the temporary copies of web pages made in the cache memory of computers) were it not for the mandatory exception of Art. 5(1) InfoSoc, which allows certain temporary acts of reproduction.

2.3) Is the exception for temporary acts of reproduction of Art. 5(1) available to acts of TDM?

Partially. The CJEU had the occasion to clarify that temporary acts of reproduction made during “data capture” processes can be covered by the exemption of Art. 5(1) under the cumulative conditions that they:

1) constitute an integral and essential part of a technological process;

2) pursue a sole purpose, namely to enable the lawful use of a protected work; and

3) do not have an independent economic significance provided that:

3.1) the implementation of those acts does not enable the generation of an additional profit going beyond that derived from the lawful use of the protected work; and

3.2) the acts of temporary reproduction do not lead to a modification of that work.

These conditions, which as the Court of Justice of the European Union (CJEU) pointed out have to be interpreted narrowly, are not always easy to meet in TDM processes and, as regards conditions 2 and 3.1, are difficult to interpret. Therefore, whereas Art. 5(1) constitutes an important exception for TDM activities, the cumulative, narrow and uncertain nature of those conditions does not offer a clear and efficient legal framework within which science can move confidently. In other words, the current EU copyright law framework is failing to meet the goal of efficiently balancing the promotion of innovation and the protection of investments.

2.4) Is a dedicated TDM exception therefore necessary?

This is a good question. There are two levels at which this question should be answered: the copyright theory level and the copyright law level.

On the theoretical level, copyright protects the original expression of ideas, not ideas themselves or facts or data. Therefore, the extraction of factual information or ideas from textual or data sources is simply outwith copyright’s scope (this, together with other aspects of this blog, is analysed in greater detail in a forthcoming paper).

However, for a number of reasons that cannot be analysed in depth here but that mostly relate to the development that copyright law has undergone as a consequence of the digital revolution – a development mainly in the direction of resisting it rather than understanding and exploiting it – a machine learning algorithm analysing a poem almost certainly constitutes a copyright infringement. This is due to the temporary copy that the data capture process of the machine learning – or most other TDM – procedure creates. The copyright infringement can be avoided in one of two cases: the authorisation of the copyright holder (e.g. a copyright licence, see here) or the authorisation of the law in the form of a copyright exception. This exception could be Art. 5(1) InfoSoc, although, as seen above, the restrictiveness and uncertain boundaries of the exception do not really offer a satisfying answer. Other exceptions to copyright are available but, due to their narrow or fragmented nature, they likewise do not offer an adequate answer.

Therefore, in practice, in the current state of EU copyright law a TDM exception is necessary.

3) Is the TDM exception as drafted in Art. 3 of the Proposal the right solution?

No, it is not. The main argument against the current formulation of Art. 3 of the Proposal is that it introduces a double limitation for TDM: it can only be performed by research organisations and only for the purpose of scientific research. Therefore, a commercial enterprise will not be able to benefit from the exception. Nor will a university acting for any purpose other than research (e.g. a commercial one). Other purposes commonly accepted as fundamental in democratic societies are also excluded, such as journalism, criticism or review.

In the opinion of the drafter of Art. 3 of the Proposal, the current wording is less restrictive than the “non commercial” limitation (which is instead found in the UK TDM exception), as confirmed by the analysis developed in the Impact Assessment at pages 108-9. It seems, however, that Art. 3’s double limitation is very close to the non-commercial requirement and in certain respects even more restrictive, in the sense that a “non commercial” limitation would allow a business acting for non-commercial purposes (e.g. research, criticism, news reporting, etc) to benefit from the exception, something that is not possible under Art. 3 (although Public-Private Partnerships are explicitly allowed). This is a major and unjustified limit that excludes important economic sectors and SMEs from benefiting from a crucially important innovation tool. It clearly conflicts with fundamental rights such as the freedom of expression and the freedom to conduct a business (Arts. 11 and 16 of the Charter of Fundamental Rights of the European Union), even though in the same Proposal this conflict has been explicitly, although somewhat superficially, excluded (see page 9 of the Proposal).

Many have suggested that the Commission should have opted for the so-called “option four” (see page 8 of the current Proposal, or pages 108-109 of the Impact Assessment), that is to say a TDM exception not limited to any beneficiary nor to any type of purpose. Option four would certainly have been a much better option but, and this is something that is not fully addressed in the current debate around Art. 3, it would still have been insufficient.

Even if Art. 3 did not contemplate the reported double limitation (only research organisations and for the purposes of scientific research), there are a number of additional problems with that formulation.

3.1) Reproductions and distributions

The main problem with the structure of Art. 3 is that it only exempts the right of reproduction but not the right of distribution or communication to the public, nor the right of adaptation (although the latter is not the object of harmonisation at the EU level, with some limited exceptions).

This means that in all the situations where the results of an act of TDM include a protected part of the original “mined” work (and the CJEU clarified that excerpts as short as 11 consecutive words could be protected) these results cannot be communicated to the public or redistributed. In certain areas this will not be a major concern. However, in other areas, e.g. natural language processing, the fact that certain models trained on a number of copyright protected corpora (i.e. texts) could include 11 consecutive words means that those models, the result of the research conducted by the research organisation, cannot be shared with anyone. Of course, the test is not “11 consecutive words” but whether those 11 (or 15 or 8?) consecutive words are the “author’s own intellectual creation”, an answer that will depend on each specific case, making the situation even less predictable.
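As a rough illustration of why this is hard to manage in practice, a researcher who wanted to screen TDM results before sharing them might run a check along the following lines. This is a hypothetical sketch under our own assumptions (the function name `shared_word_runs` and the whitespace tokenisation are invented for the example). It detects only verbatim runs of n words, so it is at best a crude proxy: it cannot answer the legal question of whether a run is the “author’s own intellectual creation”.

```python
def shared_word_runs(source: str, output: str, n: int = 11) -> set[tuple[str, ...]]:
    """Return every run of n consecutive words found in both texts.

    Hypothetical screening helper: a purely mechanical overlap check,
    not an implementation of any legal test.
    """

    def runs(text: str) -> set[tuple[str, ...]]:
        words = text.lower().split()
        # A text shorter than n words yields an empty range, hence no runs.
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    return runs(source) & runs(output)
```

An empty result shows only that no n-word sequence was copied verbatim. A shorter but original excerpt could still be protected, and a longer but banal one might not be, which is exactly the unpredictability described above.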

Therefore, a properly formulated TDM exception should cover not only the right of reproduction but also the rights that cover the human (and computer) activities connected with the sharing of those results, such as redistribution and communication to the public. An exemption so devised would only apply to TDM activities; only the communication to the public of those parts of the original work which are necessary for TDM purposes would therefore be covered, nothing else. This would not be too different from what currently happens with the exception for parody or quotation, where the original work is redistributed but only as part of the parody or quotation. Once again, more flexible and innovation-friendly solutions (e.g. fair use doctrines) already cover all the above.

3.2) Contractual overridability and technological overridability

Art. 3 in its current formulation clarifies that contractual provisions contrary to the TDM exception shall be unenforceable. This is a good provision, as access to scientific databases is often based on acceptance of Terms of Use that limit TDM. Nevertheless, if the same restriction is imposed through a Technological Protection Measure rather than a contract, the exception ceases to prevail, as there is no direct reference to Technological Protection Measures in Art. 3.

In other words, a result that the law forbids (contracting out of the TDM exception) can in fact be reintroduced by other means (the Technological Protection Measure), as the current formulation fails to cover this case. It is worth recalling here that there is no basis in EU law to circumvent an illegitimate technological protection measure, that is to say a technological measure that prevents someone from doing what an exception to copyright allows. This is contradictory, creates legal uncertainty and frustrates the policy goals of Art. 3 paragraph 2. The EU legislature is fully aware of this contradiction but failed to address it properly. In fact, Art. 6 of the Proposal (“common provisions”) clarifies that the provisions of the first, third and fifth subparagraph of Art. 6(4) InfoSoc Directive apply. In plain English this means that if a user qualifies for an exception to copyright (e.g. TDM) but a Technological Protection Measure prevents them from relying on it, Member States have an obligation to take appropriate measures to ensure that right holders make available to the beneficiary of an exception or limitation the means of benefiting from it. In the almost 20 years since the InfoSoc Directive was enacted, the UKIPO, which has correctly put in place a specific procedure for this type of situation, has received fewer than a handful of requests.

4) Conclusions

The current formulation of Art. 3 is unsatisfactory and lacks ambition. In the Commission proposal there are some good elements that properly reflect the copyright theory behind TDM, which – it should be stressed – provides that ideas, facts and mere data are not the object of copyright protection. Copyright protects authors’ original expressions, which ensures a creativity-innovation equilibrium leading to the maximum level of socio-economic welfare.

The good elements of the Commission proposal are: the mandatory nature of the exception, the fact that it cannot be limited by contract and that no remuneration scheme is set.

Nevertheless, a number of outstanding issues remain in the current Proposal, in particular the limitation to research organisations acting for research purposes, the absence of a prohibition on circumventing the exception through technological measures, and the fact that it only exempts acts of reproduction.

What is needed is a broad and flexible EU-wide exception that covers not only TDM but also any similar future technological development. Otherwise, each time a new technology is developed, EU copyright law will need to go through a lengthy and probably contested legislative process in order to create an exception. During this period, which usually lasts years, other jurisdictions (those that have designed into their copyright laws the flexibility needed to address the natural tension between protecting investments and favouring innovation) will leave the EU further behind. If the Proposal, as it appears at the current stage, is not able to address these as well as other concerns, perhaps it should be abandoned altogether.