Archival Challenges: Preserving Accessibility Standards in Long-Term Digital Repositories

A journal article published today needs to be readable by a screen reader in 2045. That sentence sounds straightforward. In practice, it describes one of the most underexamined problems in academic publishing: the long-term survival of accessibility metadata through decades of software migration, format conversion, and repository infrastructure change.

Publishers have gotten reasonably good at building accessibility into current output. The harder question is whether that accessibility survives archiving. When a file enters Portico, CLOCKSS, or a university library repository, it does not enter a static vault. It enters a system that will be updated, migrated, and rebuilt multiple times before the content inside it is ever retrieved by a reader who needs it.

What PDF/A and PDF/UA Actually Require

PDF/A is the ISO standard for long-term archiving of electronic documents (ISO 19005). Its defining principle is self-containment: everything a viewer needs to render the file correctly must be embedded within it. Fonts, colour profiles, metadata. No external dependencies, no links to resources that may not exist in twenty years.

PDF/UA (ISO 14289) is the Universal Accessibility standard. It specifies what an accessible PDF must contain: a complete logical structure tree, tagged content with meaningful reading order, alternative text on every non-decorative image, document language declared in metadata, and navigation features that assistive technology can use.

The problem is that these two standards were designed largely in parallel, and combining them correctly requires more than ticking both boxes. A PDF/A-2a or PDF/A-3a file (conformance levels that require tagged content) can satisfy both standards simultaneously. But most publishers archiving content produce PDF/A-1b or PDF/A-2b files, which confirm visual reproducibility without guaranteeing that the logical structure needed for assistive technology is present. The file will look right in thirty years. Whether it reads right is a different question entirely.
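As a rough illustration of the difference, here is a minimal Python sketch that reads the conformance claim out of a file's XMP packet. It assumes the packet has already been extracted from the PDF as a string; the function names and the sample packet are illustrative, not part of any particular tool. It only reports what the file claims about itself, which a validator would still need to confirm.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the PDF/A and PDF/UA identification schemas in XMP.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "pdfaid": "http://www.aiim.org/pdfa/ns/id/",
    "pdfuaid": "http://www.aiim.org/pdfua/ns/id/",
}

def _find_value(root, prefix, name):
    """Return a property stored either as an attribute or as a child element."""
    qname = f"{{{NS[prefix]}}}{name}"
    for desc in root.iter(f"{{{NS['rdf']}}}Description"):
        if qname in desc.attrib:
            return desc.attrib[qname].strip()
        child = desc.find(qname)
        if child is not None and child.text:
            return child.text.strip()
    return None

def classify(xmp_packet: str) -> dict:
    """Summarise what the XMP packet claims (a claim, not a validation result)."""
    root = ET.fromstring(xmp_packet.strip())
    part = _find_value(root, "pdfaid", "part")
    level = _find_value(root, "pdfaid", "conformance")
    ua_part = _find_value(root, "pdfuaid", "part")
    return {
        "pdfa": f"PDF/A-{part}{level.lower()}" if part and level else None,
        # Only level "A" requires a tagged logical structure.
        "claims_tagged_structure": bool(level) and level.upper() == "A",
        "claims_pdfua": ua_part is not None,
    }

# Illustrative packet for a typical visually-conformant archive file.
SAMPLE = """
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>2</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
"""

print(classify(SAMPLE))
# {'pdfa': 'PDF/A-2b', 'claims_tagged_structure': False, 'claims_pdfua': False}
```

A file that reports PDF/A-2b here makes no promise about tagging at all, which is exactly the gap described above.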

How Files Degrade Without Changing

Digital files do not rot the way paper does. But they degrade in a different sense: the ecosystem around them changes while the files themselves stay fixed. A PDF with correct accessibility tagging produced in 2015 may have relied on structural conventions that current screen readers interpret differently, or that future assistive technology will not support at all.

Font embedding is a concrete example. PDF/A requires every font used for rendering to be embedded, and it permits subsets as long as they contain every glyph the document uses. But if an archived file contains an incomplete subset, the missing characters cannot be rendered by any viewer, accessible or otherwise. For Latin-script content this tends to surface only with special characters. For non-Latin scripts, incomplete font embedding can make entire sections of a document unreadable to assistive technology, silently and without any error flag.

Reading order is a subtler risk. The logical reading sequence embedded in a PDF tag tree reflects the assumptions of the software that created it. When repositories migrate files through format conversion tools, tag trees are frequently stripped or rebuilt incorrectly. The content survives. The structure that makes it navigable for a screen reader user does not.
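One way a repository might catch this is to compare a file's structure before and after each migration. The sketch below is a simplified illustration under a strong assumption: that some PDF tool has already produced a flat list of structure-element type names in reading order (for example "H1", "P", "Figure") for both versions. The function names and the input shape are hypothetical, and comparing type names alone is a crude proxy for reading order.

```python
from collections import Counter

def is_subsequence(sub, full):
    """True if the items of `sub` appear in `full` in the same relative order."""
    it = iter(full)
    return all(item in it for item in sub)

def structure_loss(before, after):
    """Compare structure-element types seen before and after a migration."""
    lost = Counter(before) - Counter(after)  # element types that disappeared
    return {
        "stripped": len(before) > 0 and len(after) == 0,  # whole tag tree gone
        "lost": dict(lost),
        "order_preserved": is_subsequence(after, before),
    }

# Hypothetical tag sequences extracted before and after a format conversion.
before = ["H1", "P", "Figure", "P", "Table"]
after = ["H1", "P", "P"]

print(structure_loss(before, after))
# {'stripped': False, 'lost': {'Figure': 1, 'Table': 1}, 'order_preserved': True}
```

A non-empty "lost" entry or a "stripped" result is a signal to hold the migrated file for review rather than replace the original.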

The Metadata Problem Nobody Talks About

Accessibility compliance is a point-in-time claim. A document that conforms to WCAG 2.1 and PDF/UA at the moment of archiving may not conform to WCAG 2.2 requirements introduced later, or to whatever standard supersedes both in the 2030s. Institutional repositories currently have no standardised mechanism for tracking the accessibility compliance history of a file: which standard it was validated against, which version of which checking tool confirmed conformance, and when that check was last performed.

This is not a theoretical gap. Libraries and repositories are already holding content that was validated against standards now superseded. When a user with a disability requests that content, there is no reliable way for the repository system to flag whether the file's accessibility claim is current or ten years out of date.
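To make the gap concrete, here is a minimal sketch of what a compliance record and a currency check could look like. Everything in it is an assumption for illustration: the field names, the tool name, the set of standards treated as current, and the five-year re-check threshold are all policy choices a repository would have to make for itself.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class AccessibilityCheck:
    standard: str       # e.g. "PDF/UA-1" or "WCAG 2.1"
    tool: str           # name of the checking tool (hypothetical here)
    tool_version: str
    checked_on: date

def claim_status(check, current_standards, today, max_age_years=5):
    """Label a stored claim as 'current', 'superseded', or 'stale'."""
    if check.standard not in current_standards:
        return "superseded"   # validated against a standard no longer current
    age_days = (today - check.checked_on).days
    if age_days > max_age_years * 365:
        return "stale"        # standard still current, but the check is old
    return "current"

# Assumed policy: which standards the repository treats as current.
CURRENT = {"WCAG 2.2", "PDF/UA-1"}
today = date(2025, 1, 1)

old_wcag = AccessibilityCheck("WCAG 2.1", "example-checker", "1.0", date(2019, 3, 1))
old_ua = AccessibilityCheck("PDF/UA-1", "example-checker", "1.0", date(2016, 6, 1))
recent_ua = AccessibilityCheck("PDF/UA-1", "example-checker", "2.4", date(2024, 6, 1))

for check in (old_wcag, old_ua, recent_ua):
    print(check.standard, check.checked_on, claim_status(check, CURRENT, today))
```

With a record like this attached to each file, a repository could at least tell a requester how old the accessibility claim is and what it was measured against.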

XMP (the Extensible Metadata Platform, the metadata format embedded in PDF files) has fields that can carry accessibility information. But the Dublin Core and PRISM metadata schemas used in publishing workflows do not natively include accessibility compliance attributes. Repositories like DSpace and EPrints allow custom metadata fields, but without a community standard for what those fields should contain, each institution implements something different. The result is that accessibility provenance, the documented chain of what was checked and when, exists nowhere in most archives.

Building for Decades, Not Quarters

The solution sits upstream. Files destined for long-term archiving need to be produced to PDF/A-2a or PDF/A-3a conformance from the start, not converted after the fact. Conversion from standard PDF to PDF/A frequently drops or corrupts the tag structure. Producing a correctly tagged, fully embedded, structure-complete file at the point of export costs a fraction of what remediation or re-archiving costs later.

Metadata standards need to catch up. The publishing industry has the infrastructure to embed accessibility compliance records in XMP at the point of production: which standard, which conformance level, which tool, which date. ONIX already carries some accessibility fields for commercial distribution. Extending that logic into archival metadata schemas is achievable and overdue.
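As a sketch of what such a record might look like in XMP, the Python below builds an RDF description carrying the four items named above. The namespace URI, the "a11yprov" prefix, and the property names are hypothetical placeholders, since no community schema for this exists yet. A real PDF/A file would also need an extension schema description for any custom properties like these.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical namespace, for illustration only; not a published schema.
PROV = "http://example.org/ns/a11y-provenance/"

ET.register_namespace("rdf", RDF)
ET.register_namespace("a11yprov", PROV)

def provenance_xmp(standard, level, tool, checked_on):
    """Build an rdf:RDF fragment recording one accessibility check."""
    rdf = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})
    fields = [
        ("standard", standard),        # which standard
        ("conformanceLevel", level),   # which conformance level
        ("tool", tool),                # which tool
        ("checkedOn", checked_on),     # which date (ISO 8601 string)
    ]
    for name, value in fields:
        ET.SubElement(desc, f"{{{PROV}}}{name}").text = value
    return ET.tostring(rdf, encoding="unicode")

# Example values are illustrative, not taken from a real validation run.
fragment = provenance_xmp("PDF/UA-1", "PDF/A-2a", "example-checker 2.4", "2025-01-15")
print(fragment)
```

The point of the design is that the record travels inside the file, so it survives a move between repositories even when the surrounding catalogue metadata does not.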

S4Carlisle's XML-first production workflows generate files with complete structural tagging from the source, reducing the risk that format conversion strips the metadata that accessibility depends on. For publishers depositing into Portico, CLOCKSS, or institutional repositories, this means content that meets current standards and carries the structural integrity to remain accessible through the format migrations that archiving systems will run over the next thirty years.

Accessibility built into a file at production is durable. Accessibility patched onto a file before archiving is fragile. The distinction matters most precisely when someone needs the content decades from now and no one is around to fix it.

Our NINJA AI Ecosystem and XML-first workflows produce PDF/A-compliant, structurally sound files built to survive long-term archiving without loss of accessibility. Contact sales@s4carlisle.com to find out how we can support your preservation and compliance strategy.
