TDM Rights in PDF Documents and the TDMRep Protocol

How content owners can declare text and data mining rights in PDF files using the W3C TDMRep protocol — and why it matters in the age of generative AI.

You can also declare TDM rights on PDFs online for free using Mapsoft's PDF Hub — no installation required.

What is Text and Data Mining?

Quick answer: Text and Data Mining (TDM) is the automated process of extracting patterns, facts, and relationships from large volumes of text and data. AI companies use TDM to train large language models by scraping web pages, PDFs, and other digital content — often without the knowledge or consent of the content owner.

Text and Data Mining encompasses a broad range of computational techniques used to analyse text, images, and structured data at scale. In its traditional academic form, TDM has been used for decades in fields such as computational linguistics, bioinformatics, and social science research. Researchers would collect corpora of published papers, extract named entities, identify statistical relationships, and draw conclusions that would be impossible through manual reading alone.

The rise of generative AI has transformed TDM from a niche research technique into a global industrial operation. Companies building large language models (LLMs) such as GPT, Claude, Gemini, and LLaMA require vast training datasets. These datasets are assembled by crawling the open web, downloading PDFs, scraping digital libraries, and ingesting any text-based content that can be reached by automated crawlers. The resulting models can then generate text, answer questions, summarise documents, and produce derivative works — all based on patterns learned from the original content.

This creates a fundamental tension between the interests of content creators and the interests of AI developers. Authors, publishers, journalists, academics, and businesses invest significant effort in producing original content. When that content is ingested into an AI training pipeline, the resulting model may reproduce or paraphrase the material without attribution, compensation, or consent. The content owner may never know their work was used, and the AI model's outputs may directly compete with the original source.

PDF documents are particularly vulnerable to this form of extraction. Unlike web pages, which can use robots.txt and meta tags to signal crawling preferences, PDFs have historically lacked any standardised mechanism for declaring mining rights. A PDF sitting on a public web server, in a digital repository, or attached to an email is an open target for any crawler that can parse the format — and virtually all modern AI pipelines can.

The Legal Framework

The legal landscape around text and data mining is complex and varies significantly by jurisdiction. Understanding the relevant legislation is essential for any organisation seeking to protect its content or, conversely, to conduct mining lawfully.

The EU Digital Single Market Directive

The most significant piece of TDM legislation is the European Union's Directive on Copyright in the Digital Single Market (Directive (EU) 2019/790), which came into force in June 2019 and was required to be transposed into national law by June 2021. Two articles are directly relevant:

  • Article 3 provides a mandatory exception allowing research organisations and cultural heritage institutions to carry out text and data mining for scientific research purposes. This exception cannot be overridden by contract. Content owners cannot prevent mining by these organisations when it is conducted for genuine scientific research.
  • Article 4 provides a broader exception that allows anyone to carry out text and data mining on lawfully accessed works, unless the rightsholder has expressly reserved their rights. This is the critical provision: if a content owner declares that they reserve their TDM rights, then mining under Article 4 is not permitted. Article 4(3) requires the reservation to be made "in an appropriate manner, such as machine-readable means in the case of content made publicly available online."

The phrase "machine-readable means" is the key that unlocks the entire TDMRep protocol. The Directive does not specify how rights should be reserved in machine-readable form; it only requires that, for online content, the reservation be expressed in a way automated systems can recognise. This created a need for a standardised technical mechanism, which a W3C community group subsequently addressed.

The EU AI Act

The EU AI Act (Regulation (EU) 2024/1689), which entered into force in August 2024 with provisions being phased in through 2027, reinforces these requirements. Article 53 obliges providers of general-purpose AI models to comply with EU copyright law, including the TDM opt-out mechanism established by the DSM Directive. Providers must implement policies to respect rights reservations expressed in machine-readable form, and they must make publicly available a sufficiently detailed summary of the training data used. This gives the TDMRep protocol additional legal weight: AI providers operating in the EU market are now legally required to respect properly declared TDM reservations.

The UK Perspective

The United Kingdom, no longer bound by EU directives following Brexit, has taken a different path. The UK's existing copyright exception for text and data mining (Section 29A of the Copyright, Designs and Patents Act 1988) is narrower than the EU's Article 3, allowing mining only for non-commercial research. The UK Intellectual Property Office consulted on a broader TDM exception in 2022 but ultimately did not proceed with it. As of 2026, the UK government has been working on a code of practice for AI and copyright, but no equivalent to the EU's machine-readable opt-out regime has been enacted. Content owners in the UK rely primarily on existing copyright protections and contractual terms.

The US Perspective

In the United States, text and data mining falls under the fair use doctrine (17 U.S.C. § 107). There is no specific TDM exception or opt-out mechanism in US copyright law. Several high-profile lawsuits — including cases brought by the New York Times, Getty Images, and groups of authors against AI companies — are testing whether large-scale ingestion of copyrighted works for AI training constitutes fair use. The outcome of these cases will shape the US legal framework for years to come. In the meantime, declaring TDM rights using TDMRep provides a clear, documented signal of the rightsholder's intent, which may be relevant in any future legal proceedings.

The W3C TDMRep Protocol

The Text and Data Mining Reservation Protocol (TDMRep) is a specification published by the W3C TDM Reservation Protocol Community Group as a Final Community Group Report on 10 May 2024. It defines a standardised, machine-readable way for rightsholders to declare their TDM rights — that is, to state whether they reserve the right to control text and data mining of their content, and optionally to point to a policy document that describes the terms under which mining may be permitted.

The protocol is built around two core properties:

  • tdm-reservation — A boolean value (0 or 1) indicating whether the rightsholder reserves their TDM rights. A value of 1 means rights are reserved (mining is not freely permitted under the EU DSM Directive Article 4 exception). A value of 0 means no reservation is made (mining is permitted under applicable exceptions).
  • tdm-policy — An optional URL pointing to a TDM policy file that describes the specific terms and conditions under which mining may be conducted. This allows rightsholders to offer licences, set conditions, or specify permitted uses rather than simply blocking all mining.

The elegance of TDMRep lies in its simplicity and its cross-format design. The same two properties can be expressed in multiple ways depending on the content type:

  • HTML pages — Using <meta> tags in the document head, e.g. <meta name="tdm-reservation" content="1">.
  • HTTP headers — Using custom HTTP response headers, allowing servers to declare TDM rights for any resource type without modifying the content itself.
  • PDF documents — Using XMP metadata embedded within the PDF file, ensuring the declaration travels with the document regardless of how it is distributed.
  • EPUB publications — Using metadata elements in the EPUB package document.
  • Site-wide declarations — Using a .well-known/tdmrep.json file, similar in concept to robots.txt, that declares TDM rights for an entire domain or specific URL patterns.

This multi-channel approach means that a rightsholder can protect their content at the server level, the page level, and the individual document level simultaneously. A PDF that is downloaded, forwarded by email, or uploaded to a third-party repository retains its TDM declaration in its embedded metadata, regardless of whether the original server's HTTP headers or .well-known file are accessible.
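
As a rough illustration of the HTML and HTTP channels, the following Python sketch reads the two properties from a set of response headers and from an HTML document head. The function names tdm_from_headers and tdm_from_html are hypothetical, and the sketch only collects the raw values; it makes no attempt to fetch anything or to interpret the policy.

```python
from html.parser import HTMLParser

TDM_KEYS = ("tdm-reservation", "tdm-policy")


class _TdmMetaParser(HTMLParser):
    """Collects <meta name="tdm-..."> declarations while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in TDM_KEYS:
            self.found[name] = attr.get("content")


def tdm_from_html(html_text):
    """Return the TDMRep meta declarations found in an HTML string."""
    parser = _TdmMetaParser()
    parser.feed(html_text)
    return parser.found


def tdm_from_headers(headers):
    """Return the TDMRep declarations found in a dict of HTTP headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    return {key: lowered[key] for key in TDM_KEYS if key in lowered}
```

Both functions return a plain dictionary keyed by property name, so the results from different channels can be compared or merged by the caller.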

TDMRep in PDF Documents

For PDF documents, TDMRep declarations are embedded using XMP (Extensible Metadata Platform) metadata. XMP is an ISO standard (ISO 16684-1) for embedding structured metadata in files, and it has been a core part of the PDF specification since PDF 1.4. Every modern PDF reader can access XMP metadata, and it survives most PDF processing operations including optimisation, linearisation, and format conversion.
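
To illustrate how the embedded metadata can be located, the following minimal Python sketch scans a PDF's raw bytes for XMP packets. It assumes the metadata stream is stored uncompressed and unencrypted, which is common but not guaranteed. The function name find_xmp_packets is ours, and a full PDF library would be the more reliable choice for production use.

```python
import re

# Matches an XMP wrapper element anywhere in the raw file bytes.
XMP_PATTERN = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def find_xmp_packets(pdf_bytes):
    """Return every <x:xmpmeta> element found in the bytes, as text.

    Assumption: the packets are UTF-8 and not hidden inside a
    compressed or encrypted stream.
    """
    return [
        match.group(0).decode("utf-8", errors="replace")
        for match in XMP_PATTERN.finditer(pdf_bytes)
    ]
```

A file may contain several packets, for example one per embedded image, so this scan alone does not identify which packet holds the document-level metadata.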

The TDMRep specification defines a dedicated XMP namespace for its properties:

  • Namespace URI: http://www.w3.org/ns/tdmrep/
  • Preferred prefix: tdm

Within this namespace, two properties are defined:

  • tdm:reservation — An integer value: 1 to reserve TDM rights, 0 to indicate no reservation. This directly corresponds to the tdm-reservation property in the general protocol.
  • tdm:policy — A URI string pointing to the TDM policy file. This corresponds to the tdm-policy property and is optional. When present, it tells crawlers where to find the detailed licensing terms.

In practice, the XMP metadata packet within a PDF file would contain an RDF description block similar to this:

<rdf:Description rdf:about=""
  xmlns:tdm="http://www.w3.org/ns/tdmrep/">
  <tdm:reservation>1</tdm:reservation>
  <tdm:policy>https://example.com/tdm-policy.json</tdm:policy>
</rdf:Description>

A key advantage of the XMP approach is its compatibility across PDF versions. The TDMRep metadata works with PDF 1.4 through PDF 1.7, PDF 2.0 (ISO 32000-2), and the PDF/A family (PDF/A-1, PDF/A-2, PDF/A-3, and PDF/A-4). Since XMP is a standard part of the PDF metadata architecture, adding TDMRep properties does not in itself break conformance with these standards. One caveat applies to PDF/A-1 through PDF/A-3: those parts require custom XMP properties to be described by an embedded extension schema, so a conforming file must include one for the tdm namespace.
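
The RDF description block shown earlier can be read with a few lines of standard-library Python. The sketch below is illustrative only: read_tdm_from_xmp is a hypothetical helper, it uses the namespace URI exactly as given in this article, and it expects a well-formed x:xmpmeta element in which the rdf prefix is declared.

```python
import xml.etree.ElementTree as ET

TDM_NS = "http://www.w3.org/ns/tdmrep/"  # namespace URI as given in this article
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def read_tdm_from_xmp(xmp_xml):
    """Extract tdm:reservation and tdm:policy from an XMP packet string."""
    root = ET.fromstring(xmp_xml)
    result = {"reservation": None, "policy": None}
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        for key in result:
            qualified = f"{{{TDM_NS}}}{key}"
            elem = desc.find(qualified)
            if elem is not None and elem.text:
                # Property written as a child element.
                result[key] = elem.text.strip()
            elif qualified in desc.attrib:
                # XMP also allows simple properties as attributes.
                result[key] = desc.attrib[qualified]
    if result["reservation"] is not None:
        # Assumption: the value is the literal 0 or 1.
        result["reservation"] = int(result["reservation"])
    return result
```

A missing property is reported as None rather than 0, which keeps "no declaration found" distinct from an explicit "no reservation".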

The PDF Association has published guidance on using TDMRep to license AI mining of PDFs, and has also addressed the broader question of defending PDFs from AI scraping. Their recommendation is clear: embedding TDMRep metadata in PDF documents is the most robust and standards-compliant method currently available for declaring mining rights in a way that is both legally meaningful and technically discoverable.

Unlike PDF encryption and password protection, which attempt to prevent access to content entirely, TDMRep operates as a rights declaration. It does not prevent a determined actor from extracting text — just as a "No Trespassing" sign does not physically prevent entry. But it does create a clear, machine-readable record of the rightsholder's intent, which has legal significance under the EU DSM Directive and the EU AI Act. An AI company that ignores a properly declared TDM reservation is knowingly acting against the rightsholder's expressed wishes, with potential legal consequences.

The TDM Policy File

When a rightsholder sets tdm-reservation to 1 and provides a tdm-policy URL, the URL should point to a JSON file that describes the terms under which text and data mining is permitted. This policy file uses the ODRL 2.2 (Open Digital Rights Language) vocabulary, an existing W3C Recommendation for expressing digital rights and policies in a machine-readable format.

A typical TDM policy file has the following structure:

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    { "tdm": "http://www.w3.org/ns/tdmrep#" }
  ],
  "uid": "https://example.com/tdm-policy.json",
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "assigner": {
    "uid": "https://example.com",
    "vcard:fn": "Example Publisher Ltd",
    "vcard:hasEmail": "rights@example.com"
  },
  "permission": [
    {
      "action": "tdm:mine",
      "duty": [
        {
          "action": "odrl:compensate",
          "constraint": [
            {
              "leftOperand": "odrl:payAmount",
              "operator": "odrl:eq",
              "rightOperand": { "@value": "0", "@type": "xsd:decimal" },
              "unit": "http://dbpedia.org/resource/Euro"
            }
          ]
        }
      ]
    }
  ]
}

Let us break down the key elements:

  • @context — Combines the standard ODRL JSON-LD context with the TDMRep namespace, allowing the policy to reference both ODRL vocabulary and TDM-specific actions.
  • uid — A unique identifier for the policy, typically the URL at which the policy file is hosted.
  • @type: "Offer" — Indicates that this policy is an offer from the rightsholder. A crawler or AI company reading this policy understands that the rightsholder is offering specific terms for mining.
  • profile — References the TDMRep profile, identifying this as a TDM-specific policy rather than a general ODRL policy.
  • assigner — Identifies the rightsholder using vCard properties. The vcard:fn property provides the full name, and vcard:hasEmail provides a contact address for licensing enquiries.
  • permission — Describes what is permitted. The tdm:mine action is the TDMRep-specific action for text and data mining. Permissions can include duties (obligations the miner must fulfil) and constraints (conditions that must be met).

The policy file is intentionally flexible. A rightsholder might offer free mining for non-commercial research, require payment for commercial use, mandate attribution, or impose any combination of conditions that can be expressed in ODRL. The machine-readable format means that sophisticated crawlers can automatically evaluate whether they meet the policy's conditions before proceeding with mining.
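
As a hedged sketch of what such automatic evaluation might start from, the Python function below reads a policy like the one above and lists the duties attached to each tdm:mine permission. It treats the file as plain JSON rather than performing full JSON-LD processing, so it only recognises the compact spellings used in the example; summarise_policy is a hypothetical name.

```python
import json


def summarise_policy(policy_text):
    """Summarise the mining permissions in a TDM policy document.

    Assumption: the policy uses the compact key names shown in the
    example above (no JSON-LD expansion is attempted).
    """
    policy = json.loads(policy_text)
    mining_permissions = []
    for permission in policy.get("permission", []):
        if permission.get("action") != "tdm:mine":
            continue
        duties = [duty.get("action") for duty in permission.get("duty", [])]
        mining_permissions.append({"duties": duties})
    return {
        "type": policy.get("@type"),
        "contact": policy.get("assigner", {}).get("vcard:hasEmail"),
        "mining_permissions": mining_permissions,
    }
```

A crawler operator would still need to decide, per duty and constraint, whether it can comply; the summary only makes those obligations easy to inspect.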

In the simplest case, a rightsholder who wants to block all mining can set tdm-reservation to 1 and omit the tdm-policy URL entirely. This signals that rights are reserved and no terms are being offered — mining is simply not permitted.

How AI Crawlers Discover TDM Rights

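The TDMRep specification defines a clear priority order for discovering TDM declarations, ensuring that when multiple signals are present, the most specific declaration takes precedence. The mechanisms below are listed from the broadest (lowest priority) to the most specific (highest priority):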

  1. .well-known/tdmrep.json — A site-wide declaration file hosted at a well-known URL. This is the broadest mechanism, allowing a domain owner to declare TDM rights for all content on their site, or for specific URL patterns, in a single file. It functions similarly to robots.txt but is specifically designed for TDM rights rather than general crawling permissions.
  2. HTTP response headers — The tdm-reservation and tdm-policy headers can be set on individual HTTP responses. This is more specific than the .well-known file and overrides it for the resource in question.
  3. HTML meta tags — For HTML pages, <meta> tags in the document head provide page-level declarations. These override HTTP headers for the page content.
  4. PDF and EPUB embedded metadata — XMP metadata in PDF files and package metadata in EPUB files provide the most specific, document-level declarations. Because this metadata is embedded in the file itself, it persists regardless of how the document is distributed or hosted.

This layered approach allows organisations to set a site-wide default (for example, reserving rights across all content) while overriding it for specific resources (for example, allowing mining of press releases or openly licensed research papers).
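
The override behaviour described above can be sketched in a few lines of Python. This is a simplified model that follows the ordering given in this article, not a reference implementation: the layer names and the resolve_reservation function are illustrative, and each layer is assumed to have been read already into 0, 1, or None (no declaration found).

```python
from typing import Optional

# Ordered from broadest to most specific, following the list above.
LAYERS = ("well_known", "http_header", "html_meta", "embedded")


def resolve_reservation(signals: dict) -> Optional[int]:
    """Return the reservation value from the most specific layer present."""
    decision = None
    for layer in LAYERS:
        value = signals.get(layer)
        if value is not None:
            # A more specific layer overrides any broader one seen so far.
            decision = value
    return decision
```

For example, a site-wide reservation combined with a PDF whose embedded metadata sets the value to 0 resolves to 0 for that PDF, while other resources on the site keep the site-wide default.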

As for crawler support, adoption is still in its early stages. The major AI companies have made varying commitments to respecting content owner preferences. Google has stated that it respects robots.txt directives, including rules addressed to Google-Extended, the robots.txt token that controls whether crawled content is used for its AI models, and support for TDMRep is under consideration. OpenAI's GPTBot respects robots.txt and has indicated awareness of TDMRep. Anthropic's crawler similarly respects robots.txt. The EU AI Act's requirements are expected to accelerate adoption, as AI providers selling into the EU market will need to demonstrate compliance with machine-readable rights reservations. The practical reality is that the legal obligation already exists; the open questions are technical implementation and enforcement.

It is worth noting that TDMRep and robots.txt serve complementary but distinct purposes. Robots.txt controls whether a crawler may access a resource. TDMRep controls whether the content may be mined — a distinction that matters because content may be lawfully accessed (for indexing, caching, or display) without being lawfully mined for AI training. A PDF may be freely downloadable and indexable by search engines, yet its TDMRep metadata may reserve all mining rights.
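
The distinction between access and mining can be made concrete with Python's standard robots.txt parser. The sketch below is an assumption-laden illustration: may_fetch_and_mine is a hypothetical helper, the reservation value is assumed to have been read elsewhere, and a missing declaration is treated as "no reservation", which is a design choice for this example rather than legal guidance.

```python
from urllib.robotparser import RobotFileParser


def may_fetch_and_mine(robots_lines, user_agent, url, tdm_reservation):
    """Return (may_fetch, may_mine) for one URL.

    robots_lines: the lines of the site's robots.txt file.
    tdm_reservation: 1, 0, or None as read from a TDMRep declaration.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)
    may_fetch = robots.can_fetch(user_agent, url)
    # Mining additionally requires that no TDM reservation is declared.
    may_mine = may_fetch and tdm_reservation != 1
    return may_fetch, may_mine
```

With a robots.txt that allows access and a PDF whose metadata sets the reservation to 1, the function reports that the file may be fetched but not mined, which is exactly the case described above.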

Using Mapsoft TDMRep

Mapsoft provides two free tools for adding TDMRep declarations to PDF documents, making it straightforward for content owners to protect their files without any programming or command-line knowledge.

TDMRep Acrobat Plug-in

The Mapsoft TDMRep plug-in is a free Adobe Acrobat plug-in that adds TDMRep metadata to PDF files directly within the Acrobat interface. Once installed, it adds a menu item that allows you to set the tdm:reservation flag and optionally specify a tdm:policy URL. The plug-in writes the correct XMP namespace and properties into the PDF's metadata stream, ensuring full compliance with the W3C TDMRep specification.

For organisations with large document libraries, the plug-in supports batch processing through Acrobat's Action Wizard, allowing you to apply TDMRep declarations to hundreds or thousands of PDFs in a single operation. Detailed instructions are available in the TDMRep User Guide and the TDMRep Batch Processing Guide.

TDMRep Online Tool

For users who do not have Adobe Acrobat, or who need to process individual files quickly, the TDMRep online tool on Mapsoft's PDF Hub provides the same functionality through a web browser. Upload a PDF, set the reservation flag and optional policy URL, and download the updated file. No software installation is required, and files are processed securely without being stored on the server.

Both tools produce identical results: a PDF file with correctly structured XMP metadata declaring the rightsholder's TDM rights in full compliance with the W3C specification. The resulting files are compatible with all PDF versions (1.x and 2.0) and all PDF/A conformance levels.

Protect Your PDFs from AI Mining

Declare your text and data mining rights in PDF documents using the W3C TDMRep protocol. Mapsoft's free tools make it easy to add machine-readable rights declarations — whether you are processing a single file or an entire document library.
