The Impact of PDF/A for Long-term Digital Preservation

When it comes to preserving electronic documents for the long term, it is important to use a standardized format. One such format that has gained significant recognition in the field of digital preservation is PDF/A. The International Organization for Standardization (ISO) developed PDF/A as a specialized version of the PDF format designed specifically for preserving electronic documents. This article explores how PDF/A affects the longevity of archival digital records and why it matters for keeping those records accessible over time.

What is PDF/A?

PDF/A is the archival profile of the Portable Document Format; the “A” stands for archiving. It is an ISO-standardized version of the PDF format that focuses on maintaining the visual appearance and content integrity of electronic documents over time. Unlike regular PDFs, which may rely on external resources or on features such as encryption and JavaScript, PDF/A files are restricted to features suited to long-term archiving. They are designed to be self-contained and self-descriptive.

One of the key features of PDF/A is that it embeds all the necessary fonts, images, and metadata within the file itself, ensuring that the document will look the same regardless of the software or operating system used to view it. This self-contained nature of PDF/A makes it an ideal archival format for long-term preservation, as it eliminates the risk of dependencies on external resources that may become obsolete or unavailable over time.

The Significance of PDF/A for Digital Preservation and Archiving

Preserving digital documents for the long term is a complex task that involves not only storing the files but also ensuring their accessibility and usability in the future. PDF/A plays a crucial role in this process by providing a standardized archival format that helps protect the authenticity, integrity, and longevity of electronic records stored in an archive.

One of the main challenges in digital preservation is the rapid obsolescence of technology. File formats that were once widely used can quickly become obsolete, rendering the documents stored in those formats inaccessible. PDF/A addresses this issue by providing a format that is independent of specific software or hardware platforms. This means that PDF/A documents can be reliably accessed and rendered decades from now, regardless of the advancements in technology.

Another significant aspect of PDF/A is its support for metadata preservation. Metadata, such as author information, creation date, and document properties, is crucial for understanding the context and provenance of digital records. PDF/A embeds this metadata within the file, ensuring it is preserved over time. This approach makes it easier for future generations to understand and interpret the documents.
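As an illustration of embedded metadata, PDF/A files declare their version and conformance level in the XMP metadata through the PDF/A identification schema (the `pdfaid:part` and `pdfaid:conformance` properties). The following is a minimal sketch, assuming the XMP packet has already been extracted from the file as a string; the function name and the sample packet are made up for the example, and a declared level is only a claim, not proof that the file actually conforms.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"  # PDF/A identification schema


def pdfa_identification(xmp_packet: str):
    """Return (part, conformance) declared in an XMP packet, or None."""
    root = ET.fromstring(xmp_packet)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # XMP properties may be serialised as attributes or as child elements.
        part = desc.get(f"{{{PDFAID_NS}}}part")
        if part is None:
            part = desc.findtext(f"{{{PDFAID_NS}}}part")
        conformance = desc.get(f"{{{PDFAID_NS}}}conformance")
        if conformance is None:
            conformance = desc.findtext(f"{{{PDFAID_NS}}}conformance")
        if part is not None:
            return part.strip(), (conformance or "").strip()
    return None


# Hypothetical packet for illustration only.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>1</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

print(pdfa_identification(SAMPLE_XMP))
```

Checking real conformance requires a dedicated validator; this sketch only reads what the file says about itself.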

Benefits of Using PDF/A for Long-term Digital Preservation and Archiving

Using PDF/A for long-term digital preservation offers several benefits:

  1. Document integrity: PDF/A preserves the content, layout, and visual appearance of electronic documents over time, maintaining the document’s original intent and meaning.
  2. Platform independence: PDF/A files can be viewed and accessed on any platform or operating system, ensuring long-term accessibility and reducing the risk of format obsolescence.
  3. Metadata preservation: PDF/A embeds metadata within the file, preserving important information about the document and allowing it to be easily accessed in the future.
  4. Legal compliance: PDF/A is widely recognized as a reliable format for long-term preservation, making it suitable for meeting legal and regulatory compliance requirements.
  5. Searchability: PDF/A supports text extraction and indexing, making it easier to search for and retrieve specific information within a document.

Versions and Conformance Levels

PDF/A has evolved through several versions, each building on the previous one to include more features while maintaining the core principles of the standard. A later version does not invalidate an earlier one, so files that conform to an earlier version remain valid PDF/A files. The versions are:

  • PDF/A-1: Based on PDF 1.4, this version is the most restrictive, excluding features like transparency and layers. It includes two levels of conformance: PDF/A-1a (accessible) and PDF/A-1b (basic).
  • PDF/A-2: Introduced new features such as JPEG 2000 compression, transparency, and layers. It is based on PDF 1.7 and includes an additional conformance level, PDF/A-2u, which focuses on Unicode text semantics.
  • PDF/A-3: Similar to PDF/A-2 but allows embedding of any file type as an attachment, making it useful for archiving documents that require additional files.
  • PDF/A-4: The latest version, based on PDF 2.0, continues to expand on the capabilities of earlier versions while maintaining backward compatibility.
 

Tools for Creating and Converting to PDF/A

Several tools support the creation and conversion of documents to the PDF/A format. Here are some notable ones:

  • Adobe Acrobat: Adobe integrated PDF/A support starting from Acrobat 8, allowing users to create PDF/A-compliant documents directly from their software.
  • Microsoft Office: Microsoft offers a plug-in for Office 2007 and later versions that enables users to save documents as PDF/A directly from Office applications.
  • PDFelement: This tool allows users to validate and convert documents to PDF/A, supporting various standard versions like PDF/A-1a and PDF/A-1b.
  • PDF Studio: This tool provides advanced features for validating and creating PDF/A documents, including a “Preflight” mode for ensuring compliance with different PDF/A standards.
  • Apryse PDF/A Library: Offers a comprehensive solution for PDF/A conversion and validation on multiple platforms, including Windows, Linux, and macOS. It also provides a command-line tool, PDF/A Manager, for batch processing.
  • PDF2Go: Provides an online tool for converting PDF files to PDF/A format, ensuring compliance with ISO standards.

These tools are widely used across various industries to ensure that documents are preserved in a format that guarantees long-term accessibility and compliance with archival standards.

 

Conclusion

PDF/A has had a major impact on long-term digital archiving by providing a standardized format that helps ensure the authenticity, integrity, and accessibility of electronic documents. Its self-contained nature, platform independence, and support for metadata preservation make it an ideal choice for organizations and institutions looking to preserve their digital records for future generations. By adopting PDF/A, we can ensure that our valuable electronic documents will remain accessible and usable for years to come.

ISO Standards

The ISO standards for PDF/A, which is a format designed for long-term preservation of electronic documents, are part of the ISO 19005 series. Here are the key parts of the standard:

  • ISO 19005-1:2005: This is the original PDF/A standard, known as PDF/A-1, which is based on PDF 1.4. It defines how to use PDF for long-term preservation of electronic documents, ensuring that documents can be reliably reproduced in the future. It includes two conformance levels: PDF/A-1a (accessible) and PDF/A-1b (basic).
  • ISO 19005-2:2011: Known as PDF/A-2, this standard is based on PDF 1.7 and introduces new features like JPEG 2000 compression, transparency, and layers. It also includes an additional conformance level, PDF/A-2u, which focuses on Unicode text semantics.
  • ISO 19005-3:2012: This version, PDF/A-3, is similar to PDF/A-2 but allows any file type to be embedded as an attachment, which is useful for archiving documents that require additional files.
  • ISO 19005-4:2020: Known as PDF/A-4, this version is based on PDF 2.0 and introduces the PDF/A-4e conformance level, which supports interactive 3D models for engineering workflows.

These standards can be purchased from the ISO website; they are protected by copyright and are not available for free distribution.

 

Summary of the Structure of PDF files

Deeper Insight into the Complex Structure of PDF Files and Their Key Components.

The PDF file format can be looked upon as a combination of different file types presented in a single container. This is because a PDF file contains text, vector art, images, and fonts, and other file formats can be embedded, even the native files that were used to create the PDF in the first place.

The complex structure of PDF files consists of objects, where items can be connected directly or indirectly to each other. Often the indirection is because an object is used multiple times, as would be the case for a logo, font, or color.

The objects within a PDF file can be divided into the following types:

Dictionaries

A group containing direct objects or references to indirect objects. Dictionaries can be seen as the glue holding together the elements in a PDF file. A typical page dictionary includes the following entries:

  • The Contents stream has an attributes dictionary that contains a filter name and the length of the stream.
  • The CropBox array contains the coordinates of the rectangle that defines the area that is visible on the page.
  • The MediaBox array contains the coordinates of the rectangle that defines the media size. This will typically match a standard media size such as Letter or A4 and will allow the PDF page to be reliably printed on a device that supports these standard media sizes.
  • The Resources dictionary contains references and information for elements that are needed to reliably output the visual elements of the page, such as colors, fonts, and images.
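To make the entries above concrete, here is a rough sketch that models a page dictionary as a Python dict and converts its MediaBox into millimetres. All values and object numbers are invented for illustration (strings like "12 0 R" stand in for indirect references), and the conversion assumes the default user space unit of 1/72 inch.

```python
POINTS_PER_INCH = 72.0  # assumes the default PDF user space unit
MM_PER_INCH = 25.4

# Hypothetical page dictionary; keys mirror the entries described above.
page = {
    "Type": "Page",
    "Parent": "3 0 R",
    "MediaBox": [0, 0, 595.276, 841.89],   # A4 expressed in points
    "CropBox": [0, 0, 595.276, 841.89],
    "Contents": "12 0 R",
    "Resources": {
        "Font": {"F1": "7 0 R"},
        "XObject": {"Im0": "9 0 R"},
        "ExtGState": {"GS0": "10 0 R"},
        "ColorSpace": {"CS0": "11 0 R"},
    },
}


def box_size_mm(box):
    """Return (width, height) in millimetres for a [llx, lly, urx, ury] box."""
    llx, lly, urx, ury = box
    width = abs(urx - llx) / POINTS_PER_INCH * MM_PER_INCH
    height = abs(ury - lly) / POINTS_PER_INCH * MM_PER_INCH
    return round(width, 1), round(height, 1)


print(box_size_mm(page["MediaBox"]))
```

The same helper works for the CropBox, since both are arrays of four numbers giving two opposite corners of a rectangle.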

Streams

The collection of operators outputting information onto the page. Normally the stream will also require elements of the page resources dictionary such as colors and fonts. The content of a page is stored either as a single stream or as an array of streams. The following is an extract from a page content stream:

q
567.48 61.011 -540 720 re
W* n
q
/GS0 gs
0 720 -541.1399536 0 567.4799194 61.0105438 cm
/Im0 Do
Q
Q
/CS0 cs 0.302 0.302 0.302  scn
1 i 
/GS1 gs
56.7 286.911 m
56.7 295.191 56.7 303.471 56.7 311.751 c
59.1 311.751 61.5 311.751 63.9 311.751 c
63.9 306.831 63.9 301.911 63.9 296.991 c
65.88 296.991 67.8 296.991 69.72 296.991 c
69.72 301.191 69.72 305.391 69.72 309.591 c
72 309.591 74.22 309.591 76.5 309.591 c
76.5 305.391 76.5 301.191 76.5 296.991 c
81.06 296.991 85.62 296.991 90.18 296.991 c
90.18 293.631 90.18 290.271 90.18 286.911 c
79.02 286.911 67.86 286.911 56.7 286.911 c
f*

You can see that there are several references to items in the page resources dictionary:
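  • GS0 is a reference to a graphics state parameter dictionary and gs is the operator that sets it.
  • Im0 is an image XObject and the Do operator draws it.
  • CS0 is a reference to a color space. The cs operator selects it for non-stroking (fill) operations and the scn operator sets the fill color within it.

You can also see several path operators: re (rectangle), m (moveto), c (curve), and f* (fill using the even-odd rule). W* followed by n sets the clipping path without painting it.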
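Content streams use postfix notation: operands come first and the operator follows. The sketch below groups the tokens of a stream into (operator, operands) pairs. It is deliberately simplified and assumes the stream contains only numbers, names, and operators, as in the extract above; strings, arrays, inline images, and comments would need a real tokenizer.

```python
def parse_content_stream(data: str):
    """Group whitespace-separated tokens into (operator, operands) pairs."""
    instructions = []
    operands = []
    for token in data.split():
        if token.startswith("/"):
            operands.append(token)            # name operand, e.g. /GS0
            continue
        try:
            operands.append(float(token))     # numeric operand
        except ValueError:
            # Anything that is not a name or a number is treated as an operator.
            instructions.append((token, operands))
            operands = []
    return instructions


SAMPLE = """q
567.48 61.011 -540 720 re
W* n
/GS0 gs
/Im0 Do
Q"""

for operator, operands in parse_content_stream(SAMPLE):
    print(operator, operands)
```

Each pair can then be matched against the resources dictionary, for example by looking up /GS0 under ExtGState or /Im0 under XObject.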
 

Text strings

These can either be ANSI (single-byte characters) or Unicode (multi-byte). An example is the representation of the last-modified date in the catalog dictionary.
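Dates are stored as text strings of the form D:YYYYMMDDHHmmSS followed by an optional time zone offset, with everything after the year being optional. The following is a minimal sketch of a parser for that pattern; the function name and the sample value are illustrative, and it assumes the "D:" prefix is present.

```python
import re
from datetime import datetime, timedelta, timezone

# Year is required; month, day, hour, minute, second and offset are optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a PDF date string into a datetime (naive if no offset is given)."""
    match = _PDF_DATE.match(value.strip())
    if not match:
        raise ValueError(f"not a PDF date string: {value!r}")
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    tz = None
    if sign in ("Z", "z"):
        tz = timezone.utc
    elif sign in ("+", "-"):
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(delta if sign == "+" else -delta)
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tz,
    )


# Hypothetical value for illustration.
print(parse_pdf_date("D:20240115103000+01'00'").isoformat())
```

Missing fields fall back to the first month, first day, and midnight, which keeps partial dates such as "D:2024" usable.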

Images

Images are normally referenced from the page resources, and the image stream has an associated attributes dictionary that describes the data within the stream. BitsPerComponent gives the number of bits used for each color component of a pixel (dot) within the image. The ColorSpace entry describes the colour model that is used to define the colors within the image.

Names

Names are normally used as keys that identify a dictionary or a dictionary entry. For example, the pages dictionary has a “Type” entry with the value “Pages”, and a single page has a “Type” entry with the value “Page”.

Arrays

Ordered collections of objects and/or references to other elements. For an example, see Real numbers below.

Real numbers

Decimal numbers. For example, they are used to define the rectangle of the page media box.

Integers

Whole numbers, used for example to record the total number of pages in the PDF file.

For further details on the PDF file format structure, see the PDF Specification at https://www.adobe.com/devnet/pdf/pdf_reference.html
 
Contact:
 
Michael Peters

What is an Acrobat Plugin?

An Adobe Acrobat plugin is a software component that extends the functionality of Adobe Acrobat, a popular program for viewing, creating, and editing PDF documents. Plugins can add new features to Acrobat, such as the ability to create interactive forms, add watermarks, or perform advanced document processing tasks. Some plugins are developed by Adobe, while others are created by third parties. Plugins such as our TOCBuilder are installed within the Acrobat application and are typically activated when a particular action or task is performed within the software.

Why do we need plugins?

To keep Adobe Acrobat flexible and applicable to a broad range of industries and organizations, its built-in features must be restricted to those that serve the wider community. Adding features that serve only a small portion of Acrobat’s user base would unnecessarily increase the application’s size. Consequently, plugins are needed to add functionality as required by the user.

Can Acrobat plugins be used in the Adobe Reader?

Special support needs to be added to a plugin so that it can run under Adobe Reader. In addition, a Reader plugin requires a special license and needs to go through an approval process with Adobe Systems Inc. (see https://www.adobe.com/devnet/reader/ikla.html).

Are plugins specific to a particular version of Adobe Acrobat?

We have plug-ins that we developed for Acrobat 6 that still run without modification in Acrobat DC. However, a plug-in that uses features specific to a later version won’t work under earlier versions, and older plug-ins that used the Adobe Dialog Manager (ADM) no longer work in current versions of Acrobat.
 

Update:

Probably the biggest change in this last year was the introduction of the 64-bit version of Acrobat, which meant rebuilding all of our Acrobat plug-ins so that they run on both the 32-bit and 64-bit versions. In this process we decided to remove a number of our products that are now decades old and concentrate on maintaining and improving a subset of the products.

Examples of Plugins

  • New security handlers that might be specific to a particular organisation. For example, we have developed security handlers that do not allow PDF files to be viewed outside a particular organisation’s offices.
  • New annotation types. For example, we created a plugin that supported all of the British Standard Markups.
  • Flattening annotations and form fields into the main document. This ensures that they cannot be changed or modified and that they print as part of the document even if the printing of annotations is switched off.
  • Adding text and images to PDF files.
  • Creating a table of contents for PDF files.
  • Automating the creation of bookmarks based on the styles in a PDF file.
  • Adding fields for variable data printing.
  • Hardware integration of Adobe Acrobat into whiteboards and interactive tables.

How Portable is PDF?

PDF is an ISO standard, and it has spawned further standards such as PDF/A and PDF/X that are intended to enhance its portability. Nonetheless, several factors can render it unportable. Adobe deliberately imposes some of these limitations to maintain control over the format and over the functionality of its products, such as Adobe Acrobat, that interact with PDFs. Third parties impose additional limitations.

The question is how close PDF is to being truly portable across all devices and whether this portability is improving or worsening over time. Let’s examine the initial definition of PDF.

A definition of PDF

The following passage, which outlines the PDF ‘ideal’, has stayed largely the same from the early versions of the PDF specification through PDF 1.7, PDF 2.0, and the ISO standard.

The goal of PDF is to enable users to exchange and view electronic documents easily and reliably, independent of the environment in which they were created or the environment in which they are viewed or printed. At the core of PDF is an advanced imaging model derived from the PostScript® page description language. This PDF Imaging Model enables the description of text and graphics in a device-independent and resolution-independent manner. To improve performance for interactive viewing, PDF defines a more structured format than that used by most PostScript language programs. Unlike Postscript, which is a programming language, PDF is based on a structured binary file format that is optimised for high performance in interactive viewing. PDF also includes objects, such as annotations and hypertext links, that are not part of the page content itself but are useful for interactive viewing and document interchange.

So the big question is: how close is PDF today to reaching this “goal” envisioned by its designers?

Adobe Imposed Unportability

Adobe has made PDF an ISO standard. Yet, some argue Adobe still treats it like a proprietary format. This strategy may aim to keep users within Adobe’s ecosystem, favoring applications like Adobe Acrobat. Adobe employs several methods to achieve this.

  • Rights management, with large fees charged to companies that want to create their own DRM systems that allow files to be opened in the free Adobe Reader. For most organisations these fees are unaffordable.
  • Keeping XFA forms proprietary.
  • Introducing their own proprietary signing service.
  • Reader extensions to enable commenting on PDF files. Although with later versions of the Reader this has steadily become more relaxed.

Licensing and moving the goal posts.

Does anyone remember Business Tools or Acrobat Exchange (a cut down version of Adobe Acrobat)? 

Anybody who wants to produce a plug-in for the Adobe Reader must first receive Adobe’s approval. Essentially, even if a third party seeks broad acceptance for their functionality, Adobe retains the right to reject it. Recently, Adobe restricted access to the Reader Mobile SDK (RMSDK) that developers could use to modify the Reader on mobile devices. Starting from Acrobat XI, Adobe removed the ability to save comments on WebDAV, forcing companies to use Acrobat.com services instead.

There are many other examples.

Opportunities for third parties to introduce Unportability

Although many third-party developers will want to keep their PDF files as portable as possible, especially if they are creating tools to create PDF files, others will attempt to turn PDF files into a proprietary format again.

Security handlers

Security handlers such as the one produced by FileOpen Systems change the encryption in the PDF file according to encryption algorithms known only to FileOpen. How many other security handlers are in use is unknown because this information is not published. For example, Mapsoft has created customised security handlers for several of our clients. These files are rendered useless outside the client environments, and this is a deliberate action to ensure that these PDF files are not portable.

Custom Data

Third-party plugins and applications also save data in PDF files. Plugins often save information that is specific to them in the PDF file. Apart from the security handlers mentioned above, users or systems can embed custom annotations, preference data, and proprietary data formats into the file.

Custom Annotations

On numerous occasions my company has created custom annotations that become embedded in the PDF file. It is largely because the existing annotations don’t provide the functionality required by the customer. This functionality is normally provided by a custom plug-in for Adobe Acrobat. Custom annotations normally leave their appearances in the PDF file, allowing the PDF to be viewed even without the Annotation handler. This approach maintains the PDF files’ portability, enabling viewing while removing the ability to modify them.

Unintended Unportability

On many occasions we have received files from customers that appear to be sound, only to find glaring errors in the structure of the PDF file that render it useless for any further processing or even viewing.

Corrupt fonts or incomplete font information are often a cause of problems, even when rendering the file is possible. We have seen instances where a PDF file will render on screen but can’t be printed.

A PDF file may render correctly and then fail inside Acrobat when further processing takes place, such as the editing of images or text. On some files the extraction of text outputs gibberish because there wasn’t enough information in the PDF file to translate the font encoding. Some font information is purely graphical and there is no way of reliably creating text from this information.

Mobile Use

One of the most problematic areas of portability is on mobile devices. You might think producing a PDF to work seamlessly on iOS or Android is straightforward. However, even Adobe Acrobat Reader for Mobile on these platforms doesn’t guarantee it.

An example I have come across recently is in the use of PDF forms. I am not talking about XFA forms, which I believe are still a proprietary format, but the standard forms that have been available for PDF since version 1.2 of the specification. Before the 1.7 version of PDF became an ISO standard (ISO 32000), you could align the version of PDF with a specific version of Acrobat by adding the major and minor versions of PDF together. Therefore, we can deduce that PDF 1.2 was introduced with Acrobat 3, released in 1996.
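The rule of thumb for matching PDF versions to Acrobat versions can be written as a one-line calculation. The sketch below uses a made-up function name and only applies to PDF 1.0 through 1.7, the range for which the rule held.

```python
def acrobat_version_for_pdf(pdf_version: str) -> int:
    """Apply the 'major + minor' rule of thumb for PDF 1.0 through 1.7."""
    major, minor = (int(part) for part in pdf_version.split("."))
    if major != 1 or not 0 <= minor <= 7:
        raise ValueError("the rule of thumb only covers PDF 1.0 to 1.7")
    return major + minor


for version in ("1.2", "1.4", "1.7"):
    print(f"PDF {version} -> Acrobat {acrobat_version_for_pdf(version)}")
```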

In any PDF form it should be possible to provide some level of interactivity for users and it is in this area where forms viewed on mobiles are severely lacking in their support. In recent work for a client we needed to introduce some functionality into the form for the following:

  • Showing or hiding fields based on responses to previous questions on the form.
  • Showing form outlines in red when an entry is required and removing this when the field has been filled.

I would argue that neither of the above expectations is unreasonable. However, the Adobe Reader for iOS or Android does not support them, for the simple reason that it supports only a tiny part of the JavaScript object model. I have yet to find a third-party viewer that actually documents what it supports in this area.

ISO Standards

PDF is no longer just a standard format; it has fragmented into multiple standards. One could argue that if Adobe had maintained PDF as a proprietary format, it would have preserved its portability with Adobe solely in charge of the format.

However, Adobe introduced a new version of PDF with every new Acrobat release, from PDF 1.0 to PDF 1.7, corresponding with Acrobat versions 1 through 8. Had it remained proprietary, Adobe likely would have continued this practice.

I vividly recall a period when an early version of InDesign could only generate PDF files that one very specific version of Acrobat could view reliably. I remember feeling horrified the day I discovered that our Acrobat plug-in (Mapsoft MaskIt) was creating PDF files that appeared as grey areas in Acrobat 7.0. It wasn’t until the release of 7.0.6 that they started working properly.

We also have versions of PDF for specific industries or purposes such as PDF/A and PDF/X.

Summary and Conclusion

PDF is one of the most portable file formats, yet it’s not without flaws. Its flat file nature ensures broad portability, but the upper layer of annotations often introduces issues. PDFs are ubiquitous across industries and websites, and nearly everyone who uses a computer is familiar with them. We hope for significant improvements in the coming years to enhance its portability. Now that PDF is an ISO standard, it is the responsibility of the developer community to ensure that this happens. However, there will always be commercial pressures to personalise PDF, to create yet another sub-standard, but perhaps that is the beauty of the format.
