PDFs in Academic Research and Publishing
PDF's role in academic publishing — from journal distribution and preprint servers to accessibility requirements, data extraction challenges, and the characteristics of LaTeX-generated papers.
PDF as the Dominant Format for Academic Papers
PDF has been the primary distribution format for academic research literature for over two decades. Journal publishers, preprint servers, and institutional repositories almost universally offer articles as PDF files. arXiv — the preprint server covering physics, mathematics, computer science, and related fields — hosts millions of PDF papers and receives thousands of new submissions each week. JSTOR's digitised journal archive, publisher portals from Elsevier, Springer, Wiley, and Taylor & Francis, and open-access repositories such as PubMed Central all deliver content primarily as PDF.
The dominance of PDF in academic publishing is not accidental. It reflects several genuine advantages for the specific needs of scholarly communication.
Advantages of PDF for Academic Use
Consistent layout and typography. Academic papers contain complex typographic elements — mathematical equations, tables, multi-column layouts, footnotes, and figures — that do not translate reliably to formats such as HTML or Word. PDF freezes the author's intended layout exactly, ensuring that every reader sees the same rendering regardless of their device, operating system, or installed fonts.
Stable page references. Academic citation practice depends on page numbers. A citation to "Smith (2019), p. 47" is only meaningful if "p. 47" means the same thing in every copy of the article. PDF's fixed pagination makes this possible; reflowable formats like EPUB or HTML cannot support page-level citations reliably.
Integrated figures and references. PDF supports embedded high-resolution figures, clickable cross-references, and active hyperlinks to cited DOIs and URLs — all within a single portable file.
PDF/A for Institutional Repositories
Universities and research institutions are required to preserve published research for the long term, often under legal or funding mandates. The PDF/A standard (ISO 19005) is specifically designed for archival preservation. PDF/A prohibits features that create external dependencies — embedded multimedia, encryption, external content streams, and JavaScript are all disallowed. All fonts must be embedded. A conforming PDF/A document can be reliably rendered without any external resources, providing confidence that an institutional repository built today will still be able to display its holdings in fifty years.
PDF/A-2 and PDF/A-3 are the most commonly required variants for new submissions. PDF/A-3 permits the embedding of arbitrary file attachments (including XML metadata in formats like JATS), which is increasingly used by publishers to attach machine-readable article data alongside the human-readable PDF.
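A conforming PDF/A file declares its claimed conformance level in its XMP metadata using the PDF/A identification schema (`pdfaid:part` and `pdfaid:conformance`), and PDF/A requires the document-level XMP packet to be stored uncompressed so non-PDF tools can read it. That makes a rough byte-level scan possible. The sketch below is a heuristic only, not validation — real conformance checking needs a dedicated validator such as veraPDF:

```python
import re

def detect_pdfa_claim(pdf_bytes: bytes):
    """Heuristically find a PDF/A conformance claim in raw PDF bytes.

    Looks for the PDF/A identification schema (pdfaid) in the XMP
    metadata packet, which PDF/A requires to be uncompressed. Returns
    e.g. ("3", "B") for a PDF/A-3b claim, or None if no claim is found.
    This only detects what the file *claims* -- it does not validate.
    """
    # XMP may encode the values as attributes (pdfaid:part="3") or as
    # elements (<pdfaid:part>3</pdfaid:part>); match either form.
    part = re.search(rb'pdfaid:part(?:="|>)(\d)', pdf_bytes)
    conf = re.search(rb'pdfaid:conformance(?:="|>)([ABU])', pdf_bytes)
    if part is None:
        return None
    return (part.group(1).decode(),
            conf.group(1).decode() if conf else None)
```

A `None` result means only that no claim was found, not that the file is invalid; conversely, a positive result says nothing about actual conformance.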
Tagged PDF and Accessibility Requirements
Research publishers face growing accessibility requirements from funders, institutions, and disability rights legislation. A tagged PDF includes a logical structure tree that maps the document's visual presentation to semantic elements: headings, paragraphs, tables, lists, figures, and alternative text for images. Screen readers use this tag structure to present the document to blind or visually impaired readers in a logical reading order rather than following the arbitrary visual layout of the page.
The Web Content Accessibility Guidelines (WCAG) 2.1 at Level AA, the EU Web Accessibility Directive, and Section 508 of the US Rehabilitation Act all have implications for PDF documents published by institutions and public-sector research bodies. Creating a properly tagged, accessible PDF requires deliberate effort during document production — it cannot be reliably achieved by automated post-processing of an untagged PDF. LaTeX-generated PDFs are particularly problematic in this respect, as standard LaTeX workflows produce untagged output (though projects such as LaTeX Tagged PDF are working to address this).
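Whether a PDF even claims to be tagged is signalled in its document catalog: a `/MarkInfo` dictionary with `/Marked true`, alongside a `/StructTreeRoot` entry pointing at the structure tree. A hedged byte-scanning sketch (it will miss catalogs stored inside compressed object streams, so a negative result really means "unknown"):

```python
import re

def claims_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: does this PDF declare itself as Tagged PDF?

    A tagged PDF's catalog carries /MarkInfo << /Marked true >> and a
    /StructTreeRoot entry. A raw byte scan misses catalogs stored in
    compressed object streams, so treat False as 'unknown', and note
    that the presence of tags says nothing about their quality.
    """
    has_marked = re.search(rb"/MarkInfo\s*<<[^>]*?/Marked\s+true", pdf_bytes)
    has_tree = b"/StructTreeRoot" in pdf_bytes
    return bool(has_marked) and has_tree
```

This kind of check is useful for triaging a repository's holdings before a proper accessibility audit with a full PDF parser.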
The Challenge of Extracting Data from Research PDFs
While PDF is excellent for human reading, it is a poor format for machine processing. The PDF format stores text as positioned character sequences rather than as structured prose. When a document is laid out in two columns, the underlying character stream may interleave text from both columns rather than preserving column order. Tables are represented as positioned characters with no inherent row or column structure. Mathematical notation is encoded as sequences of symbols without the semantic structure of MathML or LaTeX source.
These characteristics make extracting structured data from research PDFs — for tasks like systematic review, meta-analysis, or building training datasets — technically demanding. Tabula is a widely used open-source tool for extracting tables from PDFs; it works by analysing the spatial positions of characters and inferring grid structure, but requires manual calibration and produces unreliable results on complex table layouts or scanned pages. GROBID (GeneRation Of BIbliographic Data) is a machine learning system trained to parse the structure of academic papers — identifying title, abstract, authors, sections, and references — from the raw PDF character stream. It is used by large-scale academic literature processing pipelines such as Semantic Scholar. General-purpose PDF processing libraries such as PDFMiner (Python), iText, or Aspose.PDF provide lower-level access to the character stream for custom extraction workflows.
Scanned PDFs — common in digitised historical journal archives — present additional challenges because they contain only image data rather than embedded text. Extracting content requires OCR as a first step, introducing recognition errors that compound downstream extraction difficulties.
LaTeX-Generated PDFs
A significant proportion of academic papers in physics, mathematics, computer science, and economics are authored in LaTeX and submitted to publishers or preprint servers as LaTeX source, which is then compiled to PDF. LaTeX-generated PDFs have distinctive structural characteristics: they typically embed subsetted copies of a small set of standard fonts (Computer Modern, Latin Modern), and when compiled with the hyperref package they carry document metadata such as title and author in the PDF's information dictionary (embedding full XMP metadata generally requires an additional package such as hyperxmp).
The hyperref package automatically generates PDF bookmarks from section headings, creates clickable cross-references and citations, and embeds document metadata including title, author, and keywords. Papers compiled with hyperref are considerably more navigable than those compiled without it, which yield flat, unstructured PDFs. However, as noted above, LaTeX PDFs are typically untagged, limiting their accessibility.
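A minimal preamble showing typical hyperref usage — the option keys (`pdftitle`, `pdfauthor`, `pdfkeywords`, `bookmarks`) are standard hyperref options, while the title and author values are of course placeholders:

```latex
\documentclass{article}
% hyperref writes the metadata below into the PDF's information
% dictionary and generates bookmarks from \section commands.
\usepackage[
  pdftitle={Placeholder Article Title},
  pdfauthor={A. Researcher},
  pdfkeywords={PDF, accessibility, metadata},
  bookmarks=true,
  colorlinks=true,
  citecolor=blue
]{hyperref}

\begin{document}
\section{Introduction} % becomes a clickable PDF bookmark
Body text with an active link: \url{https://doi.org/}.
\end{document}
```

The same keys can instead be set later in the preamble with `\hypersetup{...}`, which is often preferred when hyperref is loaded by a document class.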
Open Access PDF Distribution and DOI Linking
Open access mandates from funders such as UKRI, the European Research Council, and the National Institutes of Health require that publicly funded research be made freely available. This has driven the growth of green open access (depositing accepted manuscripts in institutional repositories) and gold open access (publishing in fully open journals). In both cases, PDF remains the primary format for the deposited or published version of the article.
DOIs (Digital Object Identifiers) are the standard persistent identifier for academic publications. Modern academic PDFs embed DOI links in their metadata and typically include a clickable DOI link on the first page. Resolving a DOI through doi.org redirects to the publisher's canonical landing page for the article, which provides access to the PDF. This infrastructure of DOIs, persistent URLs, and PDF distribution underpins the entire scholarly communication system.
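Because DOIs follow a predictable syntax (a `10.` prefix, a registrant code, a slash, and a suffix), they can be pulled out of extracted PDF text with a regular expression. The pattern below follows the form Crossref recommends for modern DOIs; treat it as a practical sketch, since older DOI suffixes can contain characters it excludes:

```python
import re

# Pattern for modern DOIs: "10." + 4-9 digit registrant code + "/" + suffix.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')

def extract_dois(text: str):
    """Return DOIs found in text extracted from a PDF.

    Strips trailing punctuation that sentence endings and line wrapping
    commonly glue onto the match (e.g. a full stop after a citation).
    """
    return [m.group(0).rstrip('.;,)') for m in DOI_RE.finditer(text)]
```

A matched DOI can then be resolved by prepending `https://doi.org/`, which redirects to the publisher's landing page as described above.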
PDF Processing for Research and Publishing
Mapsoft's PDF solutions support structured data extraction, accessibility compliance, PDF/A conversion, and bulk processing workflows for publishers, institutions, and research organisations.