AI Training Opt-Outs: A Practical Strategy for Publishers

Beyond a single tdm:reservation flag — segmenting the archive, monetising the licensable tier, and operationalising rights across CMS, PDF, and EPUB pipelines.


If you have already read our companion piece on TDM rights and the TDMRep protocol, you know the technical and legal foundation: how to embed a tdm:reservation flag in a PDF, what the EU DSM Directive and AI Act require, and how an ODRL policy file is structured. This post is for the next conversation — the one that happens in a publishing organisation after the technical team says “we can declare TDM rights on every PDF; what should we declare?”

The Publisher's Dilemma

Quick answer: There is no one-size-fits-all opt-out. Blocking every AI crawler removes content from training pipelines but also forecloses licensing revenue and reduces visibility in AI-mediated discovery. Allowing everything cedes control. The right answer for most publishers is a tiered policy: open for some content, licensable for some, fully reserved for the rest.

The framing of “opt-out” is itself misleading. The W3C TDMRep protocol is not a binary block-or-allow switch — it is a structured way to declare what terms apply. A publisher who reasons about this purely in defensive terms (“keep our content out of the models”) misses half the point. The same metadata that signals “rights reserved” can also point at a licensing offer, turning a potential infringement into a potential customer. Publishers who have already negotiated bulk licensing deals with AI labs — News Corp, Axel Springer, the Financial Times, Reuters, the Associated Press, and Le Monde, among others — have shown that an opt-out tag with a tdm:policy URL behind it is also a price tag.

At the same time, the cost of a sloppy opt-out strategy is asymmetric. Once a piece of content is ingested into a training run, it cannot be reliably removed from the resulting model weights; even “machine unlearning” research is not at production scale. So the cautious default for content the publisher is unsure about is to reserve rights, and revisit the policy as the legal and commercial picture clarifies. Reservations can be lifted later; ingestion cannot be undone.

The Three Tiers Most Publishers Should Build

Most publishing operations end up with three useful tiers. They map cleanly onto the two TDMRep properties (tdm:reservation and tdm:policy) and onto the underlying business models.

  • Open tier — tdm:reservation = 0. Mining is permitted. Examples: press releases, marketing collateral, open-access journal articles, free reports, public-domain reissues. The publisher actively wants this content in training data because exposure drives downstream demand.
  • Licensed tier — tdm:reservation = 1 with a tdm:policy URL. Mining requires a licence; the policy file describes the terms. Examples: paywalled articles, premium reports, technical specifications, syndicated wire copy. The publisher is willing to license but on commercial terms; the policy file is the storefront.
  • Reserved tier — tdm:reservation = 1 with no policy URL. Rights reserved, no terms offered. Examples: subscriber-only research, expensive proprietary datasets, content under restrictive author contracts, sensitive editorial archives. The signal to crawlers is unambiguous: do not mine, do not ask.

The omission of tdm:policy in the third tier is a deliberate choice rather than a missing field. Including a policy URL implies an offer; omitting it signals that no offer exists. The W3C TDMRep specification treats this as the default reserved state.
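As a concrete sketch, here is what the three tiers look like as per-page HTML <meta> declarations (the HTTP-header and XMP layers carry the same two values). The meta names follow the TDMRep specification; the policy URL is a placeholder:

```html
<!-- Open tier: mining permitted -->
<meta name="tdm-reservation" content="0">

<!-- Licensed tier: rights reserved, with an offer behind the policy URL -->
<meta name="tdm-reservation" content="1">
<meta name="tdm-policy" content="https://example.com/policies/tdm-commercial.json">

<!-- Reserved tier: rights reserved, no offer -->
<meta name="tdm-reservation" content="1">
```

The tier is entirely determined by the 0/1 flag and the presence or absence of the policy URL; no other values are involved at any declaration layer.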

Mapping Content to Tiers

The hard part of a tiered strategy is not setting the flag — it is deciding which content sits in which tier. A useful test is to consider four questions for each content type in the catalogue:

  1. Legal status. Who holds the rights? Is the contract clear about derivative uses including AI training? Pre-2023 author contracts rarely contemplate this; the safe default for older content is the reserved tier until clarified.
  2. Monetisation model. Is the content already sold, syndicated, or licensed? Then it belongs in the licensed tier — an AI training corpus is just another distribution channel. Is it a free promotional asset? Then opening it up costs nothing and may help.
  3. Brand and reputation. Is there a meaningful risk that an LLM regurgitates this content and embarrasses the publisher (errors, dated views, sensitive editorial decisions)? If so, lean toward reserved.
  4. Jurisdiction. If the publisher operates primarily in the EU, the DSM Directive and the AI Act give a properly declared reservation real legal weight. In the US, fair use is unsettled; declarations are more about establishing intent for future litigation than about clear-cut enforcement.
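One way to make the four questions operational is a small classification helper. This is a minimal sketch: the field names, decision order, and cautious-default logic below are illustrative choices, not part of TDMRep:

```python
from dataclasses import dataclass

@dataclass
class ContentProfile:
    """Answers to the four classification questions for one content type."""
    rights_cleared: bool      # contract clearly covers AI-training uses
    already_monetised: bool   # sold, syndicated, or licensed today
    free_promotional: bool    # marketing/OA content where exposure helps
    brand_risk: bool          # regurgitation could embarrass the publisher

def classify(p: ContentProfile) -> str:
    """Map a profile to a TDMRep tier, defaulting to 'reserved' when unsure."""
    if not p.rights_cleared or p.brand_risk:
        return "reserved"     # cautious default: reservations can be lifted later
    if p.free_promotional:
        return "open"         # tdm:reservation = 0
    if p.already_monetised:
        return "licensed"     # tdm:reservation = 1 plus a tdm:policy URL
    return "reserved"         # no clear upside: keep rights reserved

# Example: a paywalled article under a clear post-2023 contract
print(classify(ContentProfile(True, True, False, False)))  # licensed
```

The point of encoding the test, even informally, is that it forces the classification workshop to produce explicit answers per content category rather than a vague gut feeling per file.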

For most general-interest publishers, the rough split that emerges is: 10–20% open, 50–70% licensed, 15–30% reserved. Academic publishers tend to have a larger open tier (open-access output) and a larger reserved tier (paywalled monographs). Trade publishers and newsrooms tend to concentrate in the licensed tier. The exact mix matters less than having the conversation explicitly rather than defaulting every file to the same flag.

Designing the Licensing Tier

The licensed tier is where TDMRep stops being a defensive tag and becomes a commercial instrument. The tdm:policy URL points to a JSON file written in ODRL 2.2 — a vocabulary expressive enough to encode most realistic licensing patterns. Five patterns cover the majority of publisher use cases.

  • Free for non-commercial research. A permission to mine with a constraint that the purpose must be non-commercial. Useful for academic publishers who want to support legitimate research without granting a free licence to commercial AI labs.
  • Compensated commercial licence. A permission to mine with a duty to compensate, parameterised by amount and unit (per article, per token, per month). The policy file lists the rate card; the actual contracting still happens off-platform but with the rate exposed up front.
  • Attribution-required. A permission to mine with a duty to attribute the source in any model output that surfaces the content. Difficult to enforce technically today, but the declaration matters for legal positioning.
  • Consent required. A permission predicated on prior contact with the rightsholder. The policy file's vcard:hasEmail field is the contact channel. The signal to AI crawlers is “come and talk to us” rather than “go away.”
  • Tiered or hybrid. ODRL permits multiple permission rules. A common pattern is “free for the first N records, compensate above that” or “free for academics, paid for commercial use.” Combine permissions deliberately rather than burying the structure in fine print.
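As a sketch of the compensated-commercial pattern, here is an ODRL 2.2 JSON-LD policy built in Python so the mailto: requirement can be checked programmatically. The uid, rate, unit URI, and contact address are placeholders, and the tdm:mine action name and tdm namespace are assumptions to verify against the current TDMRep profile before use:

```python
import json

# Illustrative "compensated commercial licence" policy in ODRL 2.2 JSON-LD.
policy = {
    "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep/",          # assumed namespace
         "vcard": "http://www.w3.org/2006/vcard/ns#"},
    ],
    "@type": "Offer",
    "uid": "https://example.com/policies/tdm-commercial.json",  # placeholder
    "assigner": {
        "uid": "https://example.com",
        "vcard:hasEmail": "mailto:licensing@example.com",  # mailto: scheme required
    },
    "permission": [{
        "action": "tdm:mine",              # assumed TDMRep action name
        "duty": [{
            "action": "compensate",        # ODRL core action
            "constraint": [{
                "leftOperand": "payAmount",
                "operator": "eq",
                "rightOperand": {"@value": "0.002", "@type": "xsd:decimal"},
                "unit": "https://example.com/units/perArticle",  # placeholder
            }],
        }],
    }],
}

print(json.dumps(policy, indent=2)[:80])
```

Generating the file from a data structure like this, rather than editing JSON by hand, makes it trivial to assert the invariants (mailto: prefix, @context present) before publishing.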

Whichever pattern is chosen, the policy file should always include working contact information. The TDMRep specification requires vcard:hasEmail to use the mailto: URI scheme — a bare email string is invalid. A surprising number of early TDMRep deployments fail this check, which makes the licensing tier effectively non-functional even if the flag is set correctly. The Mapsoft TDMRep plug-in's policy file generator handles this automatically.

Operationalising It: Where to Set the Flags

The TDMRep specification deliberately allows declarations at multiple layers, in priority order from broadest to most specific:

  1. Site-wide — a .well-known/tdmrep.json file at the domain root, with rules keyed to URL patterns. This is the cheapest layer to deploy; one file covers the whole catalogue. It is also the easiest to get wrong, because the URL patterns must accurately reflect how the publishing system serves files.
  2. HTTP headers — tdm-reservation and tdm-policy response headers, set by the web server, CDN, or application layer. Useful for overriding the site-wide default for a specific path or for resources that do not have an obvious in-file metadata channel.
  3. HTML <meta> tags — in the <head> of each page. The right layer for HTML article pages where editorial systems already inject per-article metadata.
  4. Embedded metadata — XMP in PDFs and EPUB package metadata in EPUBs. The most specific layer, and the one that travels with the file regardless of where it ends up. A PDF emailed to a third party, uploaded to a repository, or mirrored on a partner site retains its embedded TDM declaration even when none of the upstream HTTP headers or .well-known files apply.
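A minimal site-wide file might look like the following. The rule properties (location, tdm-reservation, tdm-policy) follow the TDMRep specification, but the paths, the policy URL, and the exact top-level shape shown here should be verified against the current version of the spec before deployment:

```json
[
  {
    "location": "press/*",
    "tdm-reservation": 0
  },
  {
    "location": "articles/*",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/policies/tdm-commercial.json"
  },
  {
    "location": "*",
    "tdm-reservation": 1
  }
]
```

The three rules mirror the three tiers: press content open, articles licensable, everything else reserved by the catch-all pattern.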

The mistake to avoid is “declare it once and assume it is everywhere.” A single site-wide .well-known/tdmrep.json only protects content that is actually fetched from the publisher's domain. The moment a PDF is downloaded, attached to an email, posted on a partner site, or syndicated through an aggregator, the site-wide declaration no longer applies. For any publisher whose business model includes file distribution — downloadable PDFs, EPUB sales, licensed reprints — embedded metadata is not optional.

The opposite mistake is to declare only at the file level and skip the site-wide layer. That works for downloads but not for the publisher's own website pages. The pragmatic answer is to declare at all four layers and let the priority order handle conflicts: site-wide default, header overrides for specific paths, per-page meta tags for HTML, and embedded metadata for everything that ships as a file.

Author and Contributor Communication

Most publishing contracts written before 2023 say nothing useful about AI training. Some grant the publisher broad rights to “reproduce, distribute, and create derivative works in any medium now known or hereafter invented,” which arguably covers training-data ingestion. Others reserve all rights not explicitly granted, which arguably does not. The legal answer depends on jurisdiction, the precise language, and whether courts decide that AI training constitutes a derivative work or a separate use category. None of this is settled.

A defensible publisher posture has three parts:

  • Default policy for the existing back-catalogue, communicated in writing to authors and contributors. “Unless we hear otherwise from you by [date], we will declare your work in the licensed tier, with our standard ODRL policy. Revenue from AI licensing will be shared per [terms].” This shifts the burden in a fair way and creates a paper trail.
  • Updated contract templates for new commissions, with explicit AI-training language. The simplest pattern is to give the publisher TDM rights as part of the bundle of distribution rights, with the author retaining the option to request a reserved-tier flag for specific works.
  • An opt-out mechanism for individual authors who object. Honour these requests by setting tdm:reservation = 1 with no policy URL on the affected files, and document the request. This costs almost nothing operationally and is the only credible answer to authors who have made their views public.

The Society of Authors, the Authors Guild, and similar bodies have begun publishing model contract clauses on AI training. Adopting their language verbatim is faster than drafting from scratch and signals good faith.

Audit, Monitoring, and Enforcement

Setting flags is the easy part; verifying that they remain set, accurately, across thousands of files over time is harder. A workable monitoring approach has four components.

Inventory and audit. Before declaring anything, find out what is already declared. The Mapsoft TDMRep plug-in's batch Check Status mode reads existing TDMRep metadata across folders of PDFs without modifying anything, producing a report of reservation values and policy URLs. Run this against the live archive before a rollout to establish a baseline; run it again periodically to detect drift.

Web server logs. Every fetch of a tdm:policy URL is, by construction, an automated agent reading the policy. Logging these requests gives a free signal of which crawlers are paying attention. The User-Agent strings will not be definitive — some crawlers spoof — but the access pattern (sudden bursts, specific IPs, particular times of day) is informative.
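A minimal sketch of the log-scanning idea, assuming combined-format access logs; the policy path, sample lines, and crawler name are invented for the example:

```python
import re
from collections import Counter

# Matches the request, status, size, referrer, and User-Agent fields of a
# combined-format access log line.
LOG_LINE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" \d{3} \d+ "[^"]*" "(?P<agent>[^"]*)"')

def policy_fetches(log_lines, policy_path="/policies/tdm-commercial.json"):
    """Return a Counter of User-Agent strings that fetched the policy file."""
    hits = Counter()
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m and m.group("path") == policy_path:
            hits[m.group("agent")] += 1
    return hits

sample = [
    '1.2.3.4 - - [01/Jul/2025:10:00:00 +0000] "GET /policies/tdm-commercial.json HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '1.2.3.4 - - [01/Jul/2025:10:00:01 +0000] "GET /article/42 HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
]
print(policy_fetches(sample))  # Counter({'ExampleBot/1.0': 1})
```

Run against real logs, the interesting signals are which agents appear at all, and whether a policy fetch is followed by bulk content fetches from the same source.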

LLM output sampling. Periodically prompt the major commercial models with verbatim phrases from reserved-tier content and check whether they reproduce it. This is imperfect because models can be trained on the content without surfacing it verbatim, but a positive hit is unambiguous evidence of ingestion. Some publishers maintain “canary” phrases — deliberately distinctive sentences in reserved content — to make detection easier.
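The canary approach reduces to a normalised substring check over sampled model output. The document ID and phrase below are invented for illustration:

```python
# Registry of canary phrases planted in reserved-tier content.
CANARIES = {
    "doc-7781": "the violet harbour clock struck thirteen over the ledger",
}

def canary_hits(model_output: str) -> list[str]:
    """Return IDs of reserved-tier documents whose canary phrase appears verbatim."""
    text = " ".join(model_output.lower().split())  # normalise case and whitespace
    return [doc_id for doc_id, phrase in CANARIES.items() if phrase in text]

print(canary_hits("...the violet harbour clock struck THIRTEEN over the ledger..."))  # ['doc-7781']
```

The normalisation step matters: models often change case, line breaks, or spacing while still reproducing the phrase word-for-word.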

Legal recourse. Article 4 of the EU DSM Directive and Article 53 of the EU AI Act provide a clear basis for action against AI providers who ignore properly declared reservations in the EU market. The standard of proof is the reservation declaration plus evidence of ingestion. Outside the EU the picture is murkier, but a documented reservation strengthens any future copyright claim by establishing that the rightsholder did not consent.

Common Pitfalls Publishers Make

  • Single global “block all” flag. Easiest to deploy, lowest commercial value. Sets a defensive posture but forecloses licensing revenue, and may push AI labs to ignore the publisher entirely rather than negotiate.
  • Mismatched declarations across formats. The HTML article allows mining; the downloadable PDF reserves rights. Crawlers that fetch the HTML get one signal; those that fetch the PDF get another. The publisher does not in practice know which the AI lab actually ingested. Align declarations across formats, or be explicit about the difference and why.
  • Stale or unmaintained tdm:policy URLs. A policy URL that returns 404 or has not been updated in two years is worse than no URL at all. It signals that the publisher does not take the licensing tier seriously, and it may not satisfy the “machine-readable” requirement of the DSM Directive in a future court case.
  • Hand-coded JSON with formatting bugs. The mailto: prefix on vcard:hasEmail, the JSON-LD @context, and the profile URI all have to be exact. A tooling-generated policy file avoids this whole class of error.
  • Ignoring PDF/A extension schema. Adding TDMRep XMP properties to a PDF/A document without the matching extension schema breaks PDF/A conformance. For archival publishers, this is a regression. The Mapsoft plug-in includes the schema automatically; ad-hoc XMP edits often do not.
  • Treating TDMRep as DRM. It is a declaration, not a lock. A determined actor can strip XMP metadata in seconds. The value is in establishing intent and creating legal leverage, not in technical prevention.
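Several of these pitfalls are mechanically checkable. Here is a minimal lint sketch for hand-written policy JSON covering just two of the checks named above (mailto: scheme and JSON-LD @context); a real validator should follow the full TDMRep specification:

```python
def lint_policy(policy: dict) -> list[str]:
    """Flag two common hand-coding errors in a TDMRep/ODRL policy dict."""
    errors = []
    if "@context" not in policy:
        errors.append("missing JSON-LD @context")

    def walk(node):
        # Recursively look for vcard:hasEmail values anywhere in the document.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "vcard:hasEmail" and not str(value).startswith("mailto:"):
                    errors.append(f"vcard:hasEmail must use mailto: scheme, got {value!r}")
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(policy)
    return errors

bad = {"assigner": {"vcard:hasEmail": "licensing@example.com"}}
print(lint_policy(bad))
```

Wiring a check like this into the CI pipeline that publishes the policy file catches the whole class of formatting bugs before a crawler ever sees them.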

A Six-Week Rollout Plan

For a publisher with an existing backlist and active publishing pipeline, six weeks is enough to go from no TDMRep deployment to a working tiered policy. The schedule below assumes a small cross-functional team (legal, editorial, IT) and an organisation that already has standard publishing tooling.

  • Week 1 — Inventory and tier classification. Run the audit tool across the file archive. List content categories (imprints, series, journal sections, content types) and assign each to a tier. Hold one workshop with legal and editorial to ratify the classification.
  • Week 2 — Draft policy files. Write the ODRL JSON for each licensing tier in use (often two or three variants is enough). Validate against a JSON-LD playground; verify vcard:hasEmail and contact details. Get sign-off from legal.
  • Week 3 — Site-wide and HTTP layer. Deploy .well-known/tdmrep.json with the URL pattern rules. Configure CDN or web server to set tdm-reservation and tdm-policy headers. Add HTML <meta> tags to article templates.
  • Week 4 — Backfill PDF and EPUB libraries. Use batch tools to embed XMP metadata in existing files. The Mapsoft TDMRep plug-in handles thousands of PDFs in a single Action Wizard run; for EPUBs, equivalent CLI tools exist. Re-run the audit to confirm coverage.
  • Week 5 — CMS and contract templates. Update the publishing system so new files inherit the right tier from a CMS field. Update author contract templates with explicit AI-training language. Send the back-catalogue communication to existing authors.
  • Week 6 — Monitoring and ecosystem comms. Set up dashboards for policy-URL fetches and audit drift. Publish a short policy page explaining the publisher's TDM stance to readers, authors, and AI labs. Notify trade bodies (PA, STM, IFRRO) so the policy is visible.

Six weeks is not the only viable schedule, but it is short enough to retain executive attention and long enough to do the work properly. Faster rollouts tend to skip the classification workshop in week 1 and end up with a single global flag they later regret.

Tools Mapsoft Provides

Three Mapsoft products map to the operational steps above.

  • The TDMRep Acrobat plug-in handles single-file and batch declaration, policy-file generation, audit (Check Status), and removal. Free, Windows-only.
  • The TDMRep online tool on Mapsoft's PDF Hub does single-file work in a browser, with no installation. Useful for occasional users and for verifying behaviour without a desktop deployment.
  • For publishers who need integration into a CMS, asset management, or workflow system, our custom development team builds bespoke pipelines that emit TDMRep-compliant files automatically as part of publication.

The plug-in's batch mode is the workhorse for week-4 backfills: a single Action Wizard run can apply a chosen reservation flag and policy URL to every PDF in a folder tree, with a status report at the end. For PDF/A archives, the Include PDF/A extension schema option preserves conformance.

Where the Market Is Heading

The next two years will reshape this picture in three predictable ways.

First, EU enforcement will become concrete. The AI Act's Article 53 obligations on general-purpose AI providers are phasing in through 2026 and 2027; the first regulator decisions are likely in 2026. Publishers with declared reservations and good audit trails will be in a stronger position than those without.

Second, US litigation will clarify (or further muddy) the fair-use boundary. The cases brought by the New York Times, Getty Images, the Authors Guild, and music publishers are working through the courts. Whatever the outcome, the existence of a clear machine-readable reservation will be cited — as evidence of intent on one side, as a contractual signal on the other.

Third, ecosystem support will improve. The PDF Association, the STM Association, and IFRRO are building shared infrastructure for TDM declarations and licensing. Crawler compliance will become measurable; non-compliance will become reputationally costly. The publishers who deploy now will find that their declarations work better in 2027 than they do in 2026, simply because more of the ecosystem is paying attention.

None of this requires waiting. The flag, the policy file, the operational rollout, and the audit are all available today, and a serviceable deployment is six weeks of work. Publishers who are still asking “should we” are increasingly the exception; the question for most is “how, and how well.”

Roll Out TDMRep Across Your Publishing Pipeline

The Mapsoft TDMRep plug-in handles declaration, policy generation, batch backfill, and audit across PDF libraries of any size — the operational backbone of a tiered AI training opt-out strategy.

Learn more about TDMRep →

Related Articles

TDM Rights in PDF Documents and the TDMRep Protocol

The technical and legal foundation for TDMRep: the EU DSM Directive and AI Act framework, XMP metadata implementation, ODRL policy file structure, and PDF/A compliance.

PDF Document Metadata

How PDF metadata works: DocInfo, XMP streams, custom schemas, PDF/A requirements, and the metadata layer that TDMRep declarations live in.

The PDF/A Standard for Long-Term Archiving

PDF/A conformance levels, validation, and what extension schemas are needed for archival documents to carry custom XMP properties such as TDMRep.

Build a Tiered AI Opt-Out Strategy for Your PDF Library

Use the free Mapsoft TDMRep plug-in to declare, generate policy files, batch-backfill, and audit TDM rights across thousands of PDFs — or have us build a CMS-integrated pipeline that does it automatically.