Why Information Managers Should Care How AI Handles PDF
The PDF Association recently published FAQ: AI and PDF, a freely available resource designed to clear up some widespread misconceptions about how AI systems should handle PDF documents. It's aimed at journalists, analysts, and policy-makers who may not have deep expertise in PDF technology, but the content has direct relevance for information professionals, who are often the ones deciding how documents get prepared, stored, and fed into AI pipelines.
A few themes from the FAQ are worth highlighting for this audience.
PDF content is highly valued by AI, for good reason
PDFs serve as the "document of record" in human communication. Unlike HTML web pages, which are transactional, dynamically generated, and subject to constant change, PDF documents are persistent. As a recent HuggingFace analysis noted, PDFs tend to be long, dense documents — reports, government papers, manuals — and content that requires significant effort to create typically correlates with higher information density. That's exactly why AI systems prize it.
Down-converting PDF before ingestion is generally a poor strategy
One of the more important points the FAQ addresses is the common practice of converting PDFs to plain text or Markdown before feeding them into an AI system. The FAQ is clear that this conversion is "inevitably lossy in terms of rich information and semantics."
Consider something as simple as a superscript: the number 2² converted to plain text becomes indistinguishable from the number 22. That superscript might represent a footnote, an exponent, or a fraction numerator. Once the formatting is stripped, the context is gone, and the AI has no way to recover it. The FAQ describes this as an unnecessary "dumbing down" process that risks increasing AI hallucinations. For information professionals who have long understood that structure and context are inseparable from content, this should resonate.
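The loss is easy to reproduce in a few lines. The sketch below is an illustration only: it uses Unicode compatibility normalization (NFKC) as a stand-in for a converter that flattens formatting, which is an assumption made for demonstration and not a description of how any particular tool works. In a real PDF the superscript is usually conveyed by glyph position and size rather than by a dedicated character, but the outcome is the same. The helper name `flatten` is hypothetical.

```python
import unicodedata


def flatten(text: str) -> str:
    """Stand-in for a lossy plain-text conversion (hypothetical helper)."""
    # NFKC maps compatibility characters such as SUPERSCRIPT TWO (U+00B2)
    # to their plain equivalents, discarding the "this was raised" signal.
    return unicodedata.normalize("NFKC", text)


squared = "2\u00b2"   # two, squared
twenty_two = "22"     # the number twenty-two

# The two strings differ before conversion and are identical after it.
print(squared == twenty_two)                     # False
print(flatten(squared) == flatten(twenty_two))   # True
```

After the conversion, nothing downstream can tell which of the two inputs it received.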
Tagged PDF provides semantic information AI can actually use
Tagged PDF documents include logical reading order, natural language indicators, table structure, and alt-text for images. Critically, Tagged PDF represents a document's unpaginated logical structure, which means AI systems can completely avoid the need to parse pagination artifacts. Modern office suites produce Tagged PDF, and most browsers can export it directly from HTML. The question worth asking, when evaluating any AI ingestion system, is whether it's actually leveraging those tags or discarding them.
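One rough way to ask that question of a document is to check whether it declares itself as tagged at all. The sketch below assumes the document catalog has already been loaded into dictionary form by a PDF library of your choice; plain Python dictionaries stand in for that object here, and `is_tagged` is a hypothetical helper name. It looks for the two catalog entries that signal Tagged PDF: a `/MarkInfo` dictionary with `/Marked` set to true, and a `/StructTreeRoot`.

```python
from typing import Any, Mapping


def is_tagged(catalog: Mapping[str, Any]) -> bool:
    """Return True if the catalog declares Tagged PDF (hypothetical helper)."""
    # /MarkInfo /Marked true is the document's claim that it follows
    # Tagged PDF conventions.
    mark_info = catalog.get("/MarkInfo") or {}
    marked = bool(mark_info.get("/Marked", False))
    # /StructTreeRoot is where the logical structure tree actually lives.
    has_structure = "/StructTreeRoot" in catalog
    return marked and has_structure


# Illustrative stand-ins for catalogs read from real files.
tagged = {
    "/Type": "/Catalog",
    "/MarkInfo": {"/Marked": True},
    "/StructTreeRoot": {"/Type": "/StructTreeRoot"},
}
untagged = {"/Type": "/Catalog"}

print(is_tagged(tagged), is_tagged(untagged))  # True False
```

A positive result only shows that tags are present, not that they are accurate or that the ingestion system goes on to use them.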
Annotations and metadata are part of the document
AI systems should ingest all components of a PDF, not just visible page content. Annotations include text markup, digital signatures, multimedia, and embedded file attachments — content that is often critical to correctly understanding a document. Metadata, including both legacy Document Information entries and modern XMP metadata, can reveal title, author, provenance, and creation and modification dates that may not appear anywhere in the page content itself. An ingestion system that ignores these components is working with an incomplete picture of the document.
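As a minimal sketch of what reading that metadata can look like, the example below parses an XMP packet with Python's standard XML library and pulls out the title, creators, and dates. It assumes the packet has already been extracted from the PDF as a string, and it handles only properties serialized as XML elements (XMP also permits an attribute form, which this sketch ignores). The sample packet, its values, and the helper name `read_xmp` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Standard XMP namespace URIs for RDF, Dublin Core, and the XMP basic schema.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}

# Made-up sample packet; real packets are extracted from the PDF itself.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xmp="http://ns.adobe.com/xap/1.0/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Annual Report</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>Records Office</rdf:li></rdf:Seq></dc:creator>
   <xmp:CreateDate>2024-03-01T09:00:00Z</xmp:CreateDate>
   <xmp:ModifyDate>2024-04-15T16:30:00Z</xmp:ModifyDate>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def read_xmp(packet: str) -> dict:
    """Extract a few common XMP properties (hypothetical helper)."""
    root = ET.fromstring(packet)

    def first(path: str):
        node = root.find(path, NS)
        return node.text if node is not None else None

    return {
        "title": first(".//dc:title/rdf:Alt/rdf:li"),
        "creators": [li.text for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)],
        "created": first(".//xmp:CreateDate"),
        "modified": first(".//xmp:ModifyDate"),
    }


print(read_xmp(SAMPLE_XMP))
```

None of these four values needs to appear anywhere on a page, which is why a pipeline that reads only page content can miss them entirely.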
Incomplete redaction is a real risk
The FAQ draws a distinction that anyone working with sensitive records should understand. A correctly redacted PDF is a valid document from which the sensitive content has been purged. But a PDF containing redaction annotations is something different. It's an artifact of an incomplete redaction workflow, where content has been marked for removal but not yet eliminated. AI systems that process such documents may ingest personally identifiable information or other sensitive content that was intended to be gone. This is not a theoretical edge case.
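A pre-ingestion check for this condition can be quite small. The sketch below assumes each page has already been loaded into dictionary form by a PDF library, with plain dictionaries standing in for those page objects, and it flags any page whose `/Annots` array still contains an annotation of subtype `/Redact`. The helper name `pending_redactions` and the sample pages are hypothetical.

```python
from typing import Any, Iterable, List, Mapping


def pending_redactions(pages: Iterable[Mapping[str, Any]]) -> List[int]:
    """Return 1-based numbers of pages that still carry /Redact annotations."""
    flagged = []
    for number, page in enumerate(pages, start=1):
        annots = page.get("/Annots") or []
        # A /Redact annotation only marks content for removal; the content
        # underneath is still present until the redaction is applied.
        if any(a.get("/Subtype") == "/Redact" for a in annots):
            flagged.append(number)
    return flagged


# Illustrative stand-ins for pages read from a real file.
pages = [
    {"/Annots": [{"/Subtype": "/Highlight"}]},
    {"/Annots": [{"/Subtype": "/Redact"}]},
    {},
]

print(pending_redactions(pages))  # [2]
```

A non-empty result is a reason to hold the document back and finish the redaction workflow before anything is ingested.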
Publishers can encode their TDM preferences in the document itself
For organizations managing content at scale, the FAQ covers how copyright owners can include XMP metadata in accordance with the W3C's TDMRep protocol, expressing their rights and preferences for text and data mining, including opting out of AI training. The PDF Association is working alongside industry, publishers, and regulators to ensure that PDF can encapsulate these various methods for expressing TDM rights consistently and reliably.
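For a sense of how an ingestion system might honor such a declaration, the sketch below reads TDM preferences from an XMP packet. Several details are assumptions to verify against the TDMRep specification before relying on them: the namespace URI `http://www.w3.org/ns/tdmrep/`, the property names `reservation` and `policy`, and the convention that a reservation value of `1` means rights are reserved. The sample packet, the policy URL, and the helper name `tdm_preferences` are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumed TDMRep namespace URI; confirm against the specification.
TDM_NS = "http://www.w3.org/ns/tdmrep/"

# Made-up sample packet with a hypothetical policy URL.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:tdm="http://www.w3.org/ns/tdmrep/">
   <tdm:reservation>1</tdm:reservation>
   <tdm:policy>https://example.com/policies/tdm.json</tdm:policy>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def tdm_preferences(packet: str) -> dict:
    """Read TDM reservation and policy from XMP (hypothetical helper)."""
    root = ET.fromstring(packet)
    reservation = root.findtext(f".//{{{TDM_NS}}}reservation")
    policy = root.findtext(f".//{{{TDM_NS}}}policy")
    return {
        # None means the document expresses no preference at all.
        "reserved": None if reservation is None else reservation.strip() == "1",
        "policy": policy.strip() if policy else None,
    }


print(tdm_preferences(SAMPLE_XMP))
```

Returning `None` when no declaration is present keeps "no preference stated" distinct from an explicit opt-in or opt-out, and leaves that policy decision to the caller.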
The full resource is available at pdfa.org/faq-ai-and-pdf, and includes a public feedback mechanism for additional questions or requests for clarification. A live webinar introducing the FAQ to a broader audience will be announced shortly.
About Duff Johnson
Duff started working with PDF in 1996 when he founded Document Solutions, Inc. in Oakland, California. Today, as CEO of the PDF Association, Duff coordinates industry activities in support of PDF technology. He serves in technical roles as Project Leader for ISO 14289 (PDF/UA) and as Project co-Leader for ISO 32000 (the PDF specification). Duff also chairs the US Technical Advisory Group (TAG) for ISO TC 171 SC 2, and serves as its Head of Delegation.