PDF document properties: why every PDF needs proper metadata and how to set it at scale

Most PDFs ship with whatever Word or InDesign last typed into the title field — “Document1”, “Untitled-1”, “FINAL_v3_USE_THIS_ONE” — with no author, no subject, no keywords. That's a cost. Here's what each property is for, why it matters, and how to fix it across thousands of files in one pass.

Quick summary

A PDF carries eight standard metadata fields (Title, Subject, Author, Keywords, Creator, Producer, Index, Base URL) plus an extensible XMP stream. Search engines, SharePoint, document management systems, e-signature tools, and screen readers all read this metadata to identify what a file is and who it's by. The defaults are almost always wrong: Word leaves “Microsoft Word - contract_v2.docx” in the Title; InDesign leaves the original .indd filename. Setting metadata properly on one PDF is a five-minute job in Acrobat. Setting it on three thousand PDFs by hand is impractical, which is why a free Mapsoft plug-in, InfoSetter, exists to apply named metadata configurations to single files, batches, and CSV-driven workflows.

A records team imports 3,400 historical contracts into a SharePoint library. The library is configured to use the PDF Title field as the document title in search results. Half the contracts list “Microsoft Word - contract_v2.docx” as their title. A third have no Author at all. Almost none have Keywords. The search index goes live and the lawyers can't find anything: every result reads the same. The cost — tens of thousands of pounds in lost discoverability — traces back to a metadata field nobody set, on documents nobody re-opened. PDF metadata is the file's address. It's how every downstream system identifies what it's looking at. And almost no one sets it.

What “document properties” really are

A PDF stores metadata in two places. The older, smaller location is the document information dictionary, a set of named entries that travels inside the file: Title, Author, Subject, Keywords, Creator, and Producer, plus creation and modification dates. Acrobat exposes these under File › Properties › Description. Two further document-level settings, Index and Base URL, are not entries in that dictionary; they are stored elsewhere in the file and appear on the Advanced tab of the same Properties dialog. This article groups them with the six dictionary fields as the eight standard fields. The newer, richer location is the XMP stream, an XML block that holds standardized vocabularies (Dublin Core, rights management, PDF/A) plus any custom schemas an organization cares to bolt on. Most modern PDFs carry both; older ones only have the dictionary. For a deeper tour of how the two layers interact, see our explainer on PDF document metadata.
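For a sense of what the document information dictionary looks like inside the file, here is a minimal Python sketch that serializes a couple of fields as PDF text strings. It assumes the two common encodings (a parenthesized literal for ASCII text, a UTF-16BE hex string with a byte-order mark otherwise). The function names are illustrative only; this is not a PDF writer and not how any particular tool does it.

```python
def pdf_text_string(value: str) -> str:
    """Serialize a Python string as a PDF text string object."""
    try:
        text = value.encode("ascii").decode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII text: UTF-16BE with a byte-order mark, written as a hex string.
        return "<" + (b"\xfe\xff" + value.encode("utf-16-be")).hex().upper() + ">"
    # ASCII text: literal string; escape the backslash first, then parentheses.
    for ch in ("\\", "(", ")"):
        text = text.replace(ch, "\\" + ch)
    return "(" + text + ")"


def info_dictionary(fields: dict) -> str:
    """Render field name/value pairs as a PDF dictionary body."""
    entries = " ".join(f"/{key} {pdf_text_string(val)}" for key, val in fields.items())
    return f"<< {entries} >>"


print(info_dictionary({"Title": "Acme MSA (2026)", "Author": "Acme Legal"}))
# << /Title (Acme MSA \(2026\)) /Author (Acme Legal) >>
```

The point is only that these fields are short, plainly named strings: easy for any downstream system to read, and just as easy to leave wrong.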

Acrobat's built-in dialog edits these fields one PDF at a time. Open the file, click through Properties, type, save, close. That's a fine workflow for one document. For ten it's tedious. For ten thousand it's impossible — which is exactly the scale at which metadata matters most.

The eight standard fields, and what each is for

Title

Title is the most visible piece of metadata in a PDF. It's what Acrobat displays in the title bar, what SharePoint shows as the document name in a library view, what screen readers announce when a document is opened, what Google displays in search results. The default failure is that nobody sets it — so Acrobat falls back to the filename, which is usually a slug like contract_v2_FINAL.pdf or, worse, the original Word document's internal title (“Document1”) that Word never updated when the user pressed Save As. A Title set to “Acme Master Services Agreement — ABC Corp — March 2026” is the difference between a SharePoint library that's searchable and one that isn't.
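Placeholder titles like these are easy to flag before a collection goes live. The sketch below is a heuristic in Python built only from the examples in this article; the pattern list and the function name are assumptions to adapt, not an exhaustive rule set.

```python
import re
from typing import Optional

# Heuristic patterns based on the failure cases described above.
_PLACEHOLDER_PATTERNS = [
    re.compile(r"^\s*$"),                                          # empty or whitespace only
    re.compile(r"^Microsoft Word - ", re.IGNORECASE),              # leftover conversion prefix
    re.compile(r"^(Document|Untitled)[-_ ]?\d*$", re.IGNORECASE),  # application defaults
    re.compile(r"\.(docx?|indd|pdf)$", re.IGNORECASE),             # a filename, not a title
    re.compile(r"FINAL|_v\d+"),                                    # working-file markers
]


def is_placeholder_title(title: Optional[str]) -> bool:
    """Return True when a Title looks like a default rather than a real title."""
    if title is None:
        return True
    return any(pattern.search(title) for pattern in _PLACEHOLDER_PATTERNS)


for candidate in ["Document1", "Untitled-1", "FINAL_v3_USE_THIS_ONE",
                  "Acme Master Services Agreement, March 2026"]:
    print(candidate, "->", is_placeholder_title(candidate))
```

The working-file pattern is deliberately case-sensitive so that a legitimate title such as “Final Report” is not flagged.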

Author

Author is attribution and accountability. Many archival schemas require it (Dublin Core's dc:creator property maps directly to it); regulator submission systems frequently reject documents with empty author fields; e-discovery tooling uses it as a hint when sorting documents during litigation. The default Author is typically the Windows username of whoever happened to be logged in when the PDF was made (“System Administrator”, for instance), which is rarely useful and occasionally embarrassing.

Subject

Subject is a one-line description of what the document is about. It appears in search snippets when the document body is image-only, in document-management UIs as a column, and in many e-discovery and DAM systems as a faceting field. A document with the Subject field set to “Q4 financial performance summary” sorts and searches differently from a document with no Subject at all. Treat it as a one-sentence abstract that travels with the file.

Keywords

Keywords are weighted heavily by enterprise search indexers (SharePoint Search, Google Search Appliance and its successors, Elastic-based intranet search, Algolia, every DAM that handles PDFs). The convention is comma-separated terms; the typical mistake is leaving the field empty. A short, deliberate set of keywords — product names, region codes, project IDs, document type tags — turns a flat repository into a faceted one. The keywords don't need to be visible to humans; they need to be visible to the search index.
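Because the convention is a comma-separated list, it helps to normalize keywords before they are written. The following Python sketch trims whitespace, accepts semicolons as separators, and drops case-insensitive duplicates. The function name and the exact rules are assumptions for illustration.

```python
def normalize_keywords(raw: str) -> str:
    """Return a clean, comma-separated keyword string with duplicates removed."""
    seen = set()
    cleaned = []
    for term in raw.replace(";", ",").split(","):
        term = " ".join(term.split())   # trim and collapse internal whitespace
        key = term.casefold()           # compare without regard to case
        if term and key not in seen:
            seen.add(key)
            cleaned.append(term)
    return ", ".join(cleaned)


print(normalize_keywords(" contract, MSA ;contract,, ABC  Corp "))
# contract, MSA, ABC Corp
```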

Creator

Creator is the application that originally created the document content — Word, InDesign, LaTeX, AutoCAD — before the conversion to PDF. It's a triage field: when something goes wrong with rendering, accessibility, or PDF/A conformance, the Creator value tells you which upstream tool produced the problem. In compliance audits, the Creator field is often the first thing an auditor checks to confirm a document came from an approved authoring tool.

Producer

Producer is the application or library that converted the source to PDF — Adobe PDF Library, Distiller, Microsoft's print-to-PDF, Ghostscript, an internal pipeline. Like Creator, it's a triage field. When a particular PDF library has a known bug (font subsetting issue, broken transparency, missing structure tags), the Producer tells you which files are affected. Setting Producer deliberately — e.g., “Acme PDF Pipeline 4.2” — lets a records team identify provenance years after the fact.

Index

Index points at a separately-built full-text search index file (a .pdx). Niche, but in enterprise document libraries the difference is dramatic: Acrobat's Search command can use a pre-built index to query 200,000 PDFs in a fraction of a second, where opening each file individually would take hours. If your organization runs an Acrobat Catalog index, the Index field is what tells each PDF where to look.

Base URL

Base URL resolves relative web links inside the document. A manual that links out to /supplements/index.html rather than to a fully-qualified URL relies on the Base URL setting to know which server to ask. The field matters when the same PDF is mirrored to multiple domains (a public site and a partner extranet, say), or when a document references its own online supplements without hard-coding the host. Set it once and the relative links resolve correctly wherever the file lands.
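The mirroring case can be illustrated with ordinary URL resolution. This Python sketch uses the standard library's urljoin and made-up hosts. It assumes a viewer resolves relative links against the Base URL by the usual URL rules, which is worth verifying in the viewers you care about.

```python
from urllib.parse import urljoin


def resolve_link(base_url: str, link: str) -> str:
    """Resolve a link found in the document against the document's Base URL."""
    return urljoin(base_url, link)


relative = "/supplements/index.html"

# The same relative link lands on whichever host the Base URL names.
print(resolve_link("https://www.example.com/manuals/", relative))
# https://www.example.com/supplements/index.html
print(resolve_link("https://partners.example.net/docs/", relative))
# https://partners.example.net/supplements/index.html

# A fully qualified link ignores the base entirely.
print(resolve_link("https://www.example.com/manuals/", "https://other.example.org/x"))
# https://other.example.org/x
```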

Single file: when one PDF needs the right metadata

Most metadata problems start with a single file headed somewhere it matters. The scenarios are familiar:

  • A press-ready PDF heading to a third-party printer with the Subject and Keywords set so the printer's MIS picks the file up automatically and routes it to the right job ticket without human intervention.
  • A regulatory filing where the receiving agency's intake schema requires Title and Author populated to specific formats before the submission is accepted at all.
  • A contract sent for signature through DocuSign or Adobe Sign — the recipient sees the document Title in the e-signature UI, and a Title that reads “Document1” signals carelessness in a context where signals matter.
  • An archival deposit to an institutional repository (Figshare, Zenodo, an internal records archive) where Dublin Core metadata is part of the ingest contract and a missing dc:creator rejects the deposit at the door.
  • A revised manual where the Subject and Keywords need to change to reflect a new product version, the Title needs a revision marker, and everything else — Author, Producer, Index — should be left alone. This is where InfoSetter's checkbox-per-field design earns its keep: enable only the fields you want written, and the rest are preserved exactly as they were.

For all of these, the task is small enough that Acrobat's built-in File › Properties › Description dialog will do; our existing How-To on editing PDF metadata and document properties covers that ground. The point of a dedicated metadata tool isn't to replace that dialog for the one-off case; it's to handle the cases where the one-off becomes ten thousand.

Batch processing: when one PDF is a thousand PDFs

The reality of any document estate is that metadata isn't a one-file problem. It's a collection problem. The typical scenarios:

  • Standardizing an existing collection. Five thousand historical contracts in a shared drive, half with empty Author fields, the other half with the personal Windows username of whoever happened to convert them. Set Author = “Acme Legal” and Producer = “Acme PDF Pipeline” across the lot in one pass; suddenly the records team's faceting works.
  • CSV-driven per-file metadata. The CMS already knows what each document is — title, author, subject, keywords are all sitting in a database export. The CSV becomes the source of truth: each row carries filename, title, author, subject, keywords and InfoSetter applies it row-by-row. Hours of manual editing become a one-button job.
  • Pre-distribution clean-up. Before publishing 200 datasheets to a customer-facing portal, strip the “FINAL_v3” titles that crept in during authoring and replace them with marketing-approved titles drawn from a CSV. The portal renders the cleaned-up titles; nobody outside the team ever sees the working filenames.
  • Compliance sweep. An audit directory of 800 quarterly reports is supposed to have a populated Subject field. A batch run with only Subject enabled fills in the missing values from a master list while leaving every other field untouched.
  • Regulatory submissions. A quarter's worth of submission-ready PDFs needs the receiving agency's metadata schema applied across the entire batch before the submission deadline. The batch is the deadline; the alternative is missing it.

An InfoSetter CSV looks something like this in its simplest form:

filename,title,author,subject,keywords
contracts\2026-001.pdf,Acme MSA — ABC Corp,Acme Legal,Master Services Agreement,"contract, MSA, ABC Corp, 2026"
contracts\2026-002.pdf,Acme MSA — XYZ Ltd,Acme Legal,Master Services Agreement,"contract, MSA, XYZ Ltd, 2026"
contracts\2026-003.pdf,Acme NDA — PQR Inc,Acme Legal,Non-Disclosure Agreement,"contract, NDA, PQR Inc, 2026"

One row per file. One pass. The collection acquires consistent, accurate metadata that it would never have acquired through hand-editing.
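A CSV that drives thousands of writes deserves a sanity check before the batch runs. The Python sketch below validates a file shaped like the sample above: it checks for the five columns, flags empty cells, and catches duplicate filenames. The column names come from the sample; the checks themselves and the function name are assumptions, and InfoSetter's own CSV rules may differ.

```python
import csv
import io

REQUIRED = ("filename", "title", "author", "subject", "keywords")


def check_metadata_csv(text: str) -> list:
    """Return a list of human-readable problems found in a metadata CSV."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [col for col in REQUIRED if col not in (reader.fieldnames or [])]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    seen = set()
    for row_no, row in enumerate(reader, start=2):  # row 1 is the header
        name = (row["filename"] or "").strip()
        if not name:
            problems.append(f"row {row_no}: empty filename")
        elif name.lower() in seen:
            problems.append(f"row {row_no}: duplicate filename {name}")
        seen.add(name.lower())
        for col in REQUIRED[1:]:
            if not (row[col] or "").strip():
                problems.append(f"row {row_no}: empty {col}")
    return problems


# Raw string so the backslashes in the Windows-style paths are kept literally.
sample = r'''filename,title,author,subject,keywords
contracts\2026-001.pdf,Acme MSA - ABC Corp,Acme Legal,Master Services Agreement,"contract, MSA, ABC Corp, 2026"
contracts\2026-002.pdf,,Acme Legal,Master Services Agreement,"contract, MSA"
contracts\2026-001.pdf,Acme NDA - PQR Inc,Acme Legal,Non-Disclosure Agreement,
'''

for problem in check_metadata_csv(sample):
    print(problem)
```

Run against the deliberately broken sample, this reports an empty title on row 3, and a duplicate filename and empty keywords on row 4.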

Saved configurations: why this is the load-bearing feature

The dialog is not the differentiator; the configuration is. InfoSetter's Configuration dropdown lets you save a named set of metadata values and the checkbox state — which fields will actually be written when the configuration is applied. That's not a values snapshot; it's a policy. A configuration encodes both “what should the metadata say” and “which fields are this configuration's business”. Two different configurations can share most of their values and disagree only on which fields they touch.

The realisation that turns InfoSetter into a productivity tool, rather than a nicer dialog, is that different document types want different metadata policies. A team that produces contracts, manuals, brochures, and slide decks doesn't want one metadata standard — it wants one per category. With saved configurations that's free: define the policies once, give each one a clear name, and the dialog becomes a chooser. The same idea is why named configurations matter for PDF open options: in both cases, the configuration is the contract, and the dialog is just a way to apply it.

A realistic set of named configurations a single team might keep on tap:

  • “House style” — Author = “Acme Ltd”, Producer = “Acme PDF Pipeline 4.2”, Title disabled (so existing per-document titles are preserved). The everyday baseline applied to anything outgoing.
  • “Press-ready” — Subject = “Print master”, Keywords = “CMYK, PDF/X-4, press-ready”, everything else disabled. Applied to files heading to the printer; the printer's MIS reads Subject and Keywords to route the job.
  • “Customer-facing” — Title enabled (overwritten per-row from a CSV pulled from the CMS), Author = “Acme Customer Services”, Subject enabled (per-row from CSV), Keywords enabled (per-row from CSV). Applied as part of the publish-to-portal pipeline.
  • “Archival” — full Dublin Core in the XMP stream, Author / Subject / Keywords all enabled, Title left to the CSV. Applied before the deposit to an institutional repository.
  • “Compliance audit” — only Author and Producer enabled, used as a no-op verification pass to confirm the rest of the metadata is intact across a sample of files.

The same configuration applies to a single open document and to a batch run — the InfoSetter dialog and the CSV pipeline both read from the same set of named configurations. A new starter joining the team inherits the configurations rather than guessing the house style. The policy is in the configurations; the configurations are in source control or on a shared drive; the metadata is consistent because the policy is consistent.
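The “values plus checkbox state” idea is easy to model. The Python sketch below is a hypothetical data model, not InfoSetter's file format or API: a configuration holds values and a set of enabled fields, and applying it changes only the enabled fields, with optional per-file values (a CSV row, say) taking precedence.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class MetadataConfig:
    """A named policy: what to write, and which fields it is allowed to touch."""
    name: str
    values: dict = field(default_factory=dict)
    enabled: frozenset = frozenset()

    def apply(self, existing: dict, per_file: Optional[dict] = None) -> dict:
        """Return new metadata; fields outside `enabled` are preserved untouched."""
        result = dict(existing)
        for name in self.enabled:
            if per_file and name in per_file:
                result[name] = per_file[name]      # per-file value wins
            elif name in self.values:
                result[name] = self.values[name]   # fall back to the policy value
        return result


house_style = MetadataConfig(
    name="House style",
    values={"Author": "Acme Ltd", "Producer": "Acme PDF Pipeline 4.2"},
    enabled=frozenset({"Author", "Producer"}),     # Title is deliberately not enabled
)

before = {"Title": "Acme MSA - ABC Corp", "Author": "jsmith"}
print(house_style.apply(before))
# {'Title': 'Acme MSA - ABC Corp', 'Author': 'Acme Ltd', 'Producer': 'Acme PDF Pipeline 4.2'}
```

Keeping the enabled set separate from the values is what lets two policies share values while disagreeing on which fields they own.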

XMP: the metadata Acrobat doesn't show you in Description

The XMP stream is an XML block embedded in the PDF that holds metadata richer than the eight document-information fields. It's the lingua franca of digital archives, DAMs, and Adobe's own creative apps. What it carries beyond the standard fields:

  • Dublin Core (dc:creator, dc:rights, dc:title, dc:description, dc:subject) — the core archival vocabulary that institutional repositories, libraries, and records-management systems all understand.
  • PDF/A-specific properties when the document is targeting an archival standard. PDF/A validators inspect the XMP for conformance markers; a document missing them is a document that won't pass an archival audit.
  • Rights management (xmpRights:UsageTerms, xmpRights:Marked, copyright owner) — usage terms that travel with the file regardless of whether the body of the document spells them out.
  • Custom namespaces — internal classification codes, retention schedules, project IDs, content-type tags. Any organization that takes records management seriously eventually defines its own XMP namespace, and InfoSetter's Advanced dialog gives namespace-level control over which parts of XMP are written or preserved.

InfoSetter's Copy All XMP button copies an entire XMP stream from one file to another — useful when a master file holds the source-of-truth metadata and a new release needs the same metadata applied. The Advanced dialog is the surgical option: preserve dc:creator while overwriting xmpRights, leave the custom namespace alone while updating Dublin Core. Crucially, when InfoSetter writes the eight standard fields it automatically keeps the matching XMP properties in sync — so the Title in the document-info dictionary always agrees with dc:title in the XMP. That two-streams-disagree failure mode is one of the more common bugs in older PDFs, and InfoSetter just doesn't let it happen.
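The two-streams-disagree failure can be detected once you have both values in hand. The Python sketch below parses a simplified XMP packet with the standard library and compares dc:title against an Info-dictionary Title. The sample packet is invented and trimmed (no xpacket wrapper), and extracting the XMP and Info values from a real PDF is assumed to happen elsewhere.

```python
import xml.etree.ElementTree as ET
from typing import Optional

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def xmp_dc_title(xmp_packet: str) -> Optional[str]:
    """Return the x-default dc:title from an XMP packet, if there is one."""
    root = ET.fromstring(xmp_packet)
    items = root.findall(".//dc:title/rdf:Alt/rdf:li", NS)
    for item in items:
        if item.get(XML_LANG) == "x-default":
            return item.text
    return items[0].text if items else None   # fall back to the first alternative


def titles_agree(info_title: Optional[str], xmp_packet: str) -> bool:
    """True when the Info-dictionary Title matches dc:title in the XMP."""
    return (info_title or "") == (xmp_dc_title(xmp_packet) or "")


SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Acme MSA - ABC Corp</rdf:li>
        </rdf:Alt>
      </dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

print(titles_agree("Acme MSA - ABC Corp", SAMPLE_XMP))                 # True
print(titles_agree("Microsoft Word - contract_v2.docx", SAMPLE_XMP))  # False
```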

Automation: COM, PowerShell, watched folders

For organizations whose PDFs flow through a pipeline rather than a human, InfoSetter exposes a COM automation interface, IInfoSetterAuto, with properties for each metadata field and a Set() method that applies the configured values to a PDF document handle. VBScript, PowerShell, C#, VB.NET, and Python all consume it the same way. Typical use cases: an overnight watched-folder script that applies house-style metadata to every new PDF dropped into an inbound directory; a custom intake pipeline that reads metadata from a database and writes it to corresponding PDFs in a single run; an integration with a document-management system that calls InfoSetter as part of a publishing workflow. The full API surface is documented in the user guide; once the automation DLL is registered on the system, it behaves like any other COM object.
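The watched-folder case reduces to a small loop. The Python sketch below shows only the folder-scanning half: it finds PDFs that have not been handled yet and passes each to a callback. The callback is a hypothetical stand-in for the real automation call; the COM class name, property names, and Set() arguments are in the user guide and are deliberately not guessed at here.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Set


def process_new_pdfs(inbox: Path, seen: Set[Path],
                     apply_metadata: Callable[[Path], None]) -> List[Path]:
    """Apply metadata to each PDF in `inbox` that has not been handled yet."""
    processed = []
    for pdf in sorted(inbox.glob("*.pdf")):   # pattern is case-sensitive on some systems
        if pdf in seen:
            continue
        apply_metadata(pdf)   # hypothetical hook; the real call is tool-specific
        seen.add(pdf)
        processed.append(pdf)
    return processed


# Demonstration with a temporary folder and a stand-in callback.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp)
    (inbox / "2026-001.pdf").touch()
    (inbox / "notes.txt").touch()
    handled: Set[Path] = set()
    first = process_new_pdfs(inbox, handled, lambda p: print("would tag", p.name))
    second = process_new_pdfs(inbox, handled, lambda p: print("would tag", p.name))
    print(len(first), len(second))   # 1 0
```

In a scheduled job the seen set would be persisted between runs, or handled files moved to an outbound folder, so that a restart does not reprocess everything.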

What about Acrobat's built-in Description dialog?

Acrobat exposes the same standard fields under File › Properties (most on the Description tab, with Base URL and the search index on the Advanced tab), and the more advanced XMP options via the Additional Metadata button on the Description tab. If you only have one PDF to touch, that built-in dialog is fine and free. The InfoSetter advantage is what comes after that: named configurations, batch processing across folders, CSV-driven per-file metadata, and namespace-level XMP control. If you produce or receive PDFs at scale, hand-editing each one in Acrobat is not a workflow; it's a bottleneck dressed up as a process.

Where to start

If your team produces a handful of PDFs a week, install the free InfoSetter plug-in and get into the habit of saving named configurations for each document type you ship — it's a one-time setup cost that pays back from the second document onwards. If you've inherited a document estate with thousands of files and obviously bad metadata, build a CSV of the metadata you wish those files had and run InfoSetter's CSV import once: a single batch will fix more metadata in five minutes than a person could fix in a week. And if your PDF pipeline is already automated, the COM interface is what wires InfoSetter into it. The user guide walks through every option in detail.

Related Articles

PDF Document Metadata

The full explainer on how PDF metadata works — the document information dictionary, XMP streams, custom schemas, PDF/A requirements, and scripting.

How to Edit PDF Metadata and Document Properties

The hands-on companion to this post — how to view and edit PDF metadata in Acrobat, in scripts, and at the command line.

PDF Open Options: Why How a PDF Opens Matters

The companion post on the other side of PDF document policy — how each file should appear when it opens, and why named configurations apply there too.

Try it yourself

InfoSetter is a free Mapsoft plug-in for Adobe Acrobat. Download it, save a configuration for your most common document type, and apply it to a folder.