This document is one in a series of technical papers produced by the Digital Preservation Department of The National Archives (TNA), covering detailed technical issues related to the preservation and management of electronic records. This technical paper describes the methodology developed and implemented by TNA to automatically identify the formats in which digital objects are encoded.
For the purposes of this document, a format is defined as follows:
The internal structure and encoding of a digital object, which allows it to be processed, or to be rendered in human-accessible form. A digital object may be a file, or a bitstream embedded within a file.
A distinction must be drawn between format identification and format validation. Identification simply ascertains the format in which an object purports to be encoded, whereas validation ensures that the object is fully conformant to the format specification. As such, validation actually provides the most secure form of identification. However, validation is a technically complex process and is most efficient if it is informed by prior identification of the format.
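The difference between the two operations can be illustrated with a minimal sketch, using PNG purely as a convenient example. The function names `looks_like_png`, `png_chunks_are_valid` and `make_chunk` are hypothetical and are not part of any TNA tool. The validation shown is only a partial structural check (chunk lengths and checksums), well short of full conformance to the PNG specification.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG file signature


def looks_like_png(data: bytes) -> bool:
    """Identification: does the object purport to be a PNG?"""
    return data.startswith(PNG_SIGNATURE)


def png_chunks_are_valid(data: bytes) -> bool:
    """Partial validation, informed by prior identification.

    Walks the chunk structure and verifies each chunk's CRC. A full
    validator would check far more than this (assumption: sketch only).
    """
    if not looks_like_png(data):
        return False
    pos = len(PNG_SIGNATURE)
    seen_iend = False
    while pos < len(data):
        if pos + 8 > len(data):
            return False  # truncated chunk header
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4
        if end > len(data):
            return False  # truncated chunk body or CRC
        body = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[end - 4:end])
        if zlib.crc32(ctype + body) != stored_crc:
            return False  # corrupted chunk
        pos = end
        if ctype == b"IEND":
            seen_iend = True
            break
    # The IEND chunk must be present and must be the last thing in the object
    return seen_iend and pos == len(data)


def make_chunk(ctype: bytes, body: bytes) -> bytes:
    """Helper for the example: build a chunk with a correct CRC."""
    crc = zlib.crc32(ctype + body)
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)
```

An object whose signature is intact but whose content has been corrupted would still be identified as PNG by `looks_like_png`, while `png_chunks_are_valid` would reject it. This is the sense in which identification only reports what an object purports to be.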
Identifying and validating the format in which a digital object is encoded is a fundamental prerequisite to accessing and managing the object. This knowledge is required both by human users and IT systems: without it, a digital object cannot be used or preserved. Most of the time, this requirement is not obvious, because most people use a limited set of software tools within a consistent IT environment that is able to assign the correct software to a given object. Even so, at some point most users encounter the problem of attempting to access a file in a format which is unknown to them, and unrecognised by either the operating system or any available software. It is also not uncommon to encounter a file which appears to be in a known format, but cannot be opened by the relevant software.
However, it is managed information environments, such as Electronic Records Management Systems (ERMS), or digital preservation facilities, which impose the most rigorous requirements for identifying and validating formats. In these scenarios, it is essential to understand the precise format and version in which every stored object is encoded. It is not enough to know that an object is a Microsoft Word document, for example; the exact version must be identified. It is equally essential to ensure that stored objects are valid, and do conform to the appropriate specification. Invalid objects can arise through the use of poor-quality software tools, or as a result of accidental or deliberate corruption.
It is presumed that any format identification or validation method must be automated. Although manual identification is possible, given sufficiently detailed knowledge of how a given format specification and structure is expressed in hexadecimal values, this is clearly not recommended as a practical approach under normal circumstances. This technical paper describes an identification method developed by TNA, which uses automated analysis of the binary structure of a digital object, and comparison with predefined internal and external ‘signatures’ for specific formats. This approach has been implemented by TNA as a software application called DROID (Digital Record Object Identification), which uses signature information stored in the PRONOM technical registry. PRONOM and DROID are both freely available on the web at http://www.nationalarchives.gov.uk/pronom. This method is currently limited to identification; full object characterisation, including validation and property extraction, is intended as a future enhancement.
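The general idea of comparing an object against internal and external signatures can be sketched as follows. This is an illustration under stated assumptions and is not DROID's algorithm: the `Signature` class, the `identify` function and the small signature table are hypothetical, and the table is hand-written for the example rather than taken from PRONOM. The file name extension stands in for an external signature, and a fixed byte sequence at a known offset stands in for an internal one.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Dict, List


@dataclass(frozen=True)
class Signature:
    format_name: str   # human-readable label (illustrative)
    extension: str     # external signature: file name extension
    magic: bytes       # internal signature: expected byte sequence
    offset: int = 0    # position of the byte sequence from the start


# Illustrative table only; a real tool would load signatures from a registry.
SIGNATURES = [
    Signature("PDF", ".pdf", b"%PDF-"),
    Signature("PNG", ".png", b"\x89PNG\r\n\x1a\n"),
    Signature("GIF", ".gif", b"GIF8"),
]


def identify(name: str, data: bytes) -> Dict[str, List[str]]:
    """Report which formats match internally (bytes) and externally (name)."""
    ext = PurePath(name).suffix.lower()
    internal = [
        s.format_name
        for s in SIGNATURES
        if data[s.offset:s.offset + len(s.magic)] == s.magic
    ]
    external = [s.format_name for s in SIGNATURES if s.extension == ext]
    return {"internal": internal, "external": external}
```

For example, `identify("scan.png", b"%PDF-1.4\n")` reports `PDF` as the internal match and `PNG` as the external match. Keeping the two results separate makes such disagreements visible instead of silently trusting the file name.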
The signature information required to perform automatic format identification is stored in the PRONOM technical registry. This section describes how signatures are modelled within PRONOM.
A format signature is any collection of characteristics which may be used to indicate the format of a digital object. This signature may be external or internal to the actual object bitstream. The PRONOM technical registry contains detailed information about individual formats, including their associated signatures. A simplified version of the data model used by PRONOM to describe the relationships between formats and their signatures is illustrated in the following UML class diagram:
Each format may have multiple associated internal and external signatures, with each internal signature comprising one or more byte sequences. Multiple relationships may also be defined between formats. Formats may also be assigned PRONOM Unique Identifiers (PUIDs)1, which can provide an unambiguous and persistent binding between the format identification of a given object, and the description of that format in PRONOM. Each type of signature is described in more detail in the following sections.
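The relationships described above can be sketched as a small set of data classes. This is a reading of the prose, not the PRONOM schema: all class and attribute names are hypothetical, the relationship label is invented for illustration, and the PUID values are placeholders rather than real registry identifiers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ByteSequence:
    value: bytes      # the bytes to look for (illustrative attribute)
    offset: int = 0   # where to look for them (illustrative attribute)


@dataclass
class InternalSignature:
    name: str
    byte_sequences: List[ByteSequence]

    def __post_init__(self) -> None:
        # An internal signature comprises one or more byte sequences.
        if not self.byte_sequences:
            raise ValueError("an internal signature needs at least one byte sequence")


@dataclass
class ExternalSignature:
    kind: str    # e.g. "file extension" (hypothetical label)
    value: str


@dataclass
class FormatRelationship:
    kind: str              # hypothetical label, e.g. "is later version of"
    target: "FileFormat"


@dataclass
class FileFormat:
    name: str
    version: Optional[str] = None
    puid: Optional[str] = None  # a format may, but need not, have a PUID
    internal_signatures: List[InternalSignature] = field(default_factory=list)
    external_signatures: List[ExternalSignature] = field(default_factory=list)
    relationships: List[FormatRelationship] = field(default_factory=list)


# Two versions of a made-up format, linked by a relationship.
v1 = FileFormat(name="Example Format", version="1.0", puid="example/1")
v2 = FileFormat(
    name="Example Format",
    version="2.0",
    puid="example/2",  # placeholder, not a real PUID
    internal_signatures=[InternalSignature("header", [ByteSequence(b"EXMP2")])],
    external_signatures=[ExternalSignature("file extension", "exm")],
    relationships=[FormatRelationship("is later version of", v1)],
)
```

Holding the PUID on the format, and not on a signature, mirrors the text: the identifier binds an identification result to a format description, however many signatures that format happens to have.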