Understanding Tagged PDFs: Definition, Usage and Examples
A comprehensive guide to tagged PDF documents, their role in accessibility, and how to create them effectively.
What Is a Tagged PDF?
A tagged PDF is a PDF document that contains an underlying logical structure tree, sometimes called a "tag tree," which describes the organisation and semantic meaning of the document's content. Tags identify elements such as headings, paragraphs, lists, tables, images, and links, much like HTML tags describe the structure of a web page.
While a standard (untagged) PDF describes only the visual appearance of content on each page — where to place text, which fonts to use, how to draw graphics — a tagged PDF adds a layer of semantic information that describes what the content is, not just how it looks. This distinction is critical for accessibility, content reflow, and reliable text extraction.
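At the file level, a tagged PDF normally declares itself in the document catalog through a /StructTreeRoot entry and a /MarkInfo dictionary containing /Marked true. The sketch below is a rough, standard-library-only heuristic that looks for those two markers in the raw bytes. The function name and the sample byte strings are hypothetical, and the check will miss files whose catalog is stored in a compressed object stream, so a real check should inspect the catalog with a PDF parser.

```python
import re

# Heuristic markers; a real PDF parser should inspect the catalog instead.
STRUCT_TREE = re.compile(rb"/StructTreeRoot\b")
MARKED_TRUE = re.compile(rb"/Marked\s+true\b")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes contain both tagged-PDF markers."""
    has_tree = STRUCT_TREE.search(pdf_bytes) is not None
    has_mark = MARKED_TRUE.search(pdf_bytes) is not None
    return has_tree and has_mark


# Hypothetical catalog fragments used only for illustration.
tagged = b"1 0 obj << /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >> endobj"
untagged = b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj"

print(looks_tagged(tagged))
print(looks_tagged(untagged))
```

A positive result only means the markers are present; it says nothing about whether the tags are correct or complete.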
Why Tagged PDFs Matter
Accessibility
The primary purpose of tagged PDF is to make documents accessible to people with disabilities. Screen readers and other assistive technologies rely on the tag structure to read content in the correct order, convey the document's logical hierarchy, and provide meaningful descriptions of non-text elements such as images.
Without tags, a screen reader can only extract raw text from the page content streams, which may not be in the correct reading order and provides no information about the document's structure. A visually impaired user would have no way of distinguishing a heading from body text, understanding table relationships, or knowing that an image is present.
Legal and Regulatory Requirements
Many jurisdictions and organisations require that electronic documents be accessible. Key regulations and standards include:
- Section 508 of the US Rehabilitation Act, which requires federal agencies to make electronic information accessible to people with disabilities.
- WCAG 2.1 (Web Content Accessibility Guidelines), which, while primarily focused on web content, is increasingly applied to PDF documents as well.
- PDF/UA (ISO 14289), the international standard for universally accessible PDF documents, which mandates the use of tags and specifies detailed requirements for how they must be structured.
- European Accessibility Act and the EN 301 549 standard, which require accessible ICT products and services across the European Union.
Content Reflow
Tagged PDFs enable content reflow, allowing the document to be displayed on screens of different sizes (such as mobile devices) by rearranging the content to fit the available space. Without tags, reflow is unreliable because the reading order and element relationships cannot be determined from the visual layout alone.
Reliable Text Extraction and Repurposing
Tags make it possible to extract text from a PDF in the correct logical order, which is important for search indexing, copy-paste operations, and converting PDFs to other formats such as HTML, EPUB, or Word documents.
The Tag Structure
The tag tree in a tagged PDF is a hierarchy of structure elements, each identified by a structure type. The most common standard structure types defined in the PDF specification include:
Document-Level Tags
- Document: The root element of the tag tree, representing the entire document.
- Part: A large division of the document.
- Sect: A section within the document.
- Art: A self-contained article or body of content.
Block-Level Tags
- H, H1–H6: Headings at various levels, establishing the document's outline.
- P: A paragraph of text.
- L, LI, Lbl, LBody: Lists, list items, labels, and list item bodies.
- Table, TR, TH, TD: Tables, table rows, header cells, and data cells.
- BlockQuote: A block quotation.
- TOC, TOCI: Table of contents and table of contents items.
Inline-Level Tags
- Span: A generic inline element.
- Link: A hyperlink.
- Note: Explanatory text, such as a footnote or endnote, that is referred to from the body of the text.
- Reference: A citation or cross-reference.
- Code: Computer code.
- Em: Emphasis (typically rendered as italic); added in PDF 2.0.
- Strong: Strong emphasis (typically rendered as bold); added in PDF 2.0.
Illustration Tags
- Figure: An image, diagram, or other graphical content. Figures should have alternative text describing their content.
- Formula: A mathematical formula.
- Form: A widget annotation (form field).
Alternative Text and Attributes
Structure elements in a tagged PDF can carry several important properties (often loosely called attributes):
- Alt (Alternative Text): A textual description of a non-text element, such as an image. This is read by screen readers in place of the visual content.
- ActualText: The actual text represented by content that might be visually rendered in a non-standard way (for example, a ligature or decorative glyph).
- E (Expansion Text): The expansion of an abbreviation or acronym.
- Lang: The natural language of the content (for example, "en-GB" for British English), which helps screen readers use the correct pronunciation.
Examples of Tagged PDF in Practice
Example 1: A Simple Document
Consider a simple document with a title, two paragraphs, and an image. The tag tree would look like:
- Document
  - H1: "Report Title"
  - P: "This is the first paragraph..."
  - Figure (Alt: "A bar chart showing quarterly sales")
  - P: "This is the second paragraph..."
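The tree above can be modelled as nested objects and walked depth-first, which is how a consumer recovers the logical reading order. The following is a minimal sketch, assuming a hypothetical StructElem class and a deliberately simplified tag-to-HTML mapping; neither comes from a real PDF library.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class StructElem:
    """Hypothetical in-memory stand-in for one node of a tag tree."""
    tag: str
    text: str = ""
    alt: str = ""
    children: list["StructElem"] = field(default_factory=list)


# Simplified, illustrative mapping from PDF structure types to HTML elements.
HTML_NAMES = {"Document": "article", "H1": "h1", "P": "p"}


def to_html(elem: StructElem) -> str:
    """Serialise the tree depth-first, which is the logical reading order."""
    if elem.tag == "Figure":
        # The Alt text stands in for the image content.
        return f'<img alt="{escape(elem.alt)}">'
    name = HTML_NAMES.get(elem.tag, "div")
    inner = escape(elem.text) + "".join(to_html(c) for c in elem.children)
    return f"<{name}>{inner}</{name}>"


doc = StructElem("Document", children=[
    StructElem("H1", "Report Title"),
    StructElem("P", "This is the first paragraph..."),
    StructElem("Figure", alt="A bar chart showing quarterly sales"),
    StructElem("P", "This is the second paragraph..."),
])

print(to_html(doc))
```

Because the semantic role of each element is explicit, the same walk could just as easily feed a screen reader or an EPUB converter as an HTML serialiser.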
Example 2: A Table
For a table with headers and data, the tag structure would be:
- Table
  - TR
    - TH: "Product"
    - TH: "Price"
    - TH: "Quantity"
  - TR
    - TD: "Widget A"
    - TD: "$10.00"
    - TD: "500"
The TH tags identify header cells, which allows assistive technologies to associate each data cell with its corresponding header when reading the table.
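To make that association concrete, here is a small sketch that pairs each data cell with the header in the same column, roughly as an assistive technology might announce it. The row representation and the function name are assumptions for illustration, and the logic only handles simple tables with a single header row and no merged cells.

```python
# Each row is a list of (tag, text) cells, mirroring the TR/TH/TD structure.
Row = list[tuple[str, str]]

table: list[Row] = [
    [("TH", "Product"), ("TH", "Price"), ("TH", "Quantity")],
    [("TD", "Widget A"), ("TD", "$10.00"), ("TD", "500")],
]


def announce(rows: list[Row]) -> list[str]:
    """Pair every TD with the TH at the same column index of the header row."""
    headers = [text for tag, text in rows[0] if tag == "TH"]
    lines = []
    for row in rows[1:]:
        for header, (tag, text) in zip(headers, row):
            if tag == "TD":
                lines.append(f"{header}: {text}")
    return lines


for line in announce(table):
    print(line)
```

If the header cells were tagged as TD instead of TH, the headers list would be empty and nothing could be paired, which is the situation a screen reader faces with a poorly tagged table.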
Example 3: A List
A numbered list would be tagged as:
- L (ListNumbering: Decimal)
  - LI
    - Lbl: "1."
    - LBody: "First item in the list"
  - LI
    - Lbl: "2."
    - LBody: "Second item in the list"
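A consumer of this structure can announce the size of the list and then read each LI as a label followed by its body. The following is a minimal sketch, assuming list items have already been reduced to hypothetical (Lbl, LBody) text pairs:

```python
# Each LI from the list structure, reduced to its Lbl and LBody text.
items = [
    ("1.", "First item in the list"),
    ("2.", "Second item in the list"),
]


def read_list(list_items: list[tuple[str, str]]) -> list[str]:
    """Announce the list size first, then each label with its body."""
    lines = [f"List with {len(list_items)} items"]
    lines += [f"{label} {body}" for label, body in list_items]
    return lines


print("\n".join(read_list(items)))
```

Without the L and LI tags, the same content would be indistinguishable from two ordinary paragraphs that happen to start with a number.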
How to Create Tagged PDFs
There are several approaches to creating tagged PDF documents:
From Authoring Applications
The most reliable way to create tagged PDFs is to generate them from an authoring application that supports tagging:
- Microsoft Word: When exporting to PDF, select the "Best for electronic distribution and accessibility" option (Word for Mac), or tick "Document structure tags for accessibility" in the PDF options (Word for Windows). Make sure the source document uses proper heading styles (Heading 1, Heading 2, etc.), because the exported tags are derived from them.
- Adobe InDesign: InDesign can export tagged PDFs when the "Create Tagged PDF" option is selected during PDF export. The Articles panel can be used to control the reading order.
- LibreOffice: LibreOffice Writer can export tagged PDFs when the "Tagged PDF" option is checked in the PDF export dialogue.
Using Adobe Acrobat
Adobe Acrobat Pro provides tools for adding, editing, and verifying tags in existing PDF documents:
- The Accessibility tools can automatically add tags to an untagged PDF, though the results typically require manual review and correction.
- The Tags panel (View > Show/Hide > Navigation Panes > Tags) allows you to view and edit the tag tree directly.
- The Reading Order tool provides a visual interface for assigning tags to content on the page.
- The Accessibility Checker can validate a tagged PDF against accessibility requirements and identify issues.
Programmatically
Developers can create tagged PDFs programmatically using PDF libraries that support the tagged PDF specification. This approach is ideal for automated document generation workflows where accessibility must be built in from the start.
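Whatever library is used, the output has the same shape at the file level: page content is wrapped in marked-content sequences that carry an MCID, and structure element dictionaries refer back to those MCIDs. The sketch below only builds the relevant PDF syntax as strings. The helper names, the font name /F1, the text position, and the object numbers are all hypothetical, and the result is not a complete PDF file.

```python
def marked_content(tag: str, mcid: int, text: str) -> str:
    """Wrap a text-showing operation in a BDC/EMC marked-content sequence."""
    # Escape the characters that are special inside a PDF literal string.
    safe = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"/{tag} <</MCID {mcid}>> BDC BT /F1 12 Tf 72 720 Td ({safe}) Tj ET EMC"


def struct_elem(tag: str, parent_ref: str, page_ref: str, mcid: int) -> str:
    """Build a structure element dictionary that owns one marked-content ID."""
    return f"<< /Type /StructElem /S /{tag} /P {parent_ref} /Pg {page_ref} /K {mcid} >>"


# Object references "5 0 R" and "3 0 R" are placeholders, not real objects.
print(marked_content("H1", 0, "Report Title"))
print(struct_elem("H1", "5 0 R", "3 0 R", 0))
```

A complete tagged file also needs a structure tree root, a parent tree that maps MCIDs back to structure elements, and the /MarkInfo entry in the catalog, which is why a library with tagging support is usually the practical choice.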
Common Issues and Best Practices
- Reading order: Ensure the tag tree reflects the correct logical reading order, which may differ from the visual layout. Multi-column layouts and documents with sidebars require particular attention.
- Alternative text: Every meaningful image must have appropriate alternative text. Decorative images should be tagged as artifacts so they are ignored by assistive technologies.
- Table structure: Complex tables with merged cells or nested headers need careful tagging: set the Scope attribute on header cells, and where that is not enough, link data cells to their headers explicitly through the Headers and ID attributes.
- Language specification: Set the document's primary language and mark any passages in a different language with the appropriate Lang attribute.
- Artifacts: Content that is not part of the logical document structure — such as page numbers, headers, footers, and decorative elements — should be marked as artifacts rather than tagged as content.
- Validation: Always validate tagged PDFs using accessibility checking tools such as Adobe Acrobat's built-in checker or the PAC (PDF Accessibility Checker) tool.
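Some of these checks are easy to automate before a full validator is run. The following is a minimal sketch, assuming the tag tree has already been loaded into hypothetical nested dictionaries with "tag", "alt", and "children" keys. It covers only missing alternative text and skipped heading levels, and it is no substitute for Acrobat's checker or PAC.

```python
import re


def walk(node: dict):
    """Yield every node of the tree in depth-first (reading) order."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def check(tree: dict) -> list[str]:
    """Report figures without Alt text and headings that skip a level."""
    problems = []
    last_level = 0
    for node in walk(tree):
        tag = node["tag"]
        if tag == "Figure" and not node.get("alt", "").strip():
            problems.append("Figure is missing alternative text")
        match = re.fullmatch(r"H([1-6])", tag)
        if match:
            level = int(match.group(1))
            if level > last_level + 1:
                problems.append(f"Heading jumps from level {last_level} to {level}")
            last_level = level
    return problems


tree = {"tag": "Document", "children": [
    {"tag": "H1"},
    {"tag": "H3"},                      # skips H2
    {"tag": "Figure", "alt": ""},       # no alternative text
    {"tag": "Figure", "alt": "A bar chart showing quarterly sales"},
]}

for problem in check(tree):
    print(problem)
```

Automated checks like this catch structural omissions only; whether the alternative text is meaningful or the reading order makes sense still needs human review.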
Conclusion
Tagged PDFs are essential for creating accessible, well-structured documents that everyone can use, including people who rely on assistive technologies. As accessibility requirements tighten across industries and jurisdictions, implementing tagged PDF properly has become a core skill for document professionals. Following the standards and best practices outlined above helps ensure that your PDF documents are accessible, compliant, and usable well beyond simple visual display.