The XML schema definition for the file collection file is as follows:
The PRONOM File Collection Schema
Copyright The National Archives 2006. All rights reserved.

Preprocessing signatures
Background for pattern matching algorithm
The pattern matching algorithm is based on the BoyerMooreHorspool (BMH) algorithm for string matching (Horspool, 1980). The BMH algorithm does not allow for wildcard characters such as defined above, and so a certain amount of preprocessing needs to be performed on the internal signatures before the algorithm can be applied. The signatures are stored in unprocessed form in the PRONOM registry, and preprocessing is automatically applied as part of the generation of a new signature file.
Preprocessing of the signature file
The following preprocessing is required on each byte sequence to be used in the pattern matching. To simplify the exposition, we will use the following sequence, defined with an offset of 10 bytes from the beginning of the file as a running example:
A1A2A3[A4:A5]??B1B2B3(B4B5)*{5}01??C1C2C3{47}D1????F1(F2F3)F4F5

Step 1. Split the byte sequence pattern into fragments to remove all ‘*’

Split up each byte sequence pattern P into smaller fragments (P_{1}  P_{n}) wherever a “*” is found (assume P_{1} – P_{n} are in order left to right). Drop any wildcards of the form {n} or {nm} which appear at the ends of the P_{i}.
In the example:
P_{1} = A1A2A3[A4:A5]??B1B2B3(B4B5)
P_{2} = 01??C1C2C3{47}D1????F1(F2F3)F4F5 (note that we have dropped the “{5}”)
Step 2. Find the minimum and maximum subsequence offsests

For each sequence P_{i}, work out the minimum and (for P_{1}) maximum distance of the start of P_{i} from the end of P_{i1} (or the start of the file, for P_{1}).

Alternatively, if the pattern is defined relative to EOF:
In the example:
P_{1} has minimum and maximum subsequence offsets of 10 (recall that the pattern was defined with an offset of 10 bytes from BOF).
P_{2} has a minimum subsequence offset of 5 (from the “{5}” we dropped).
Step 3. Find the longest unambiguous byte sequence in every fragment

For each pattern fragment, pull out longest unambiguous byte sequence (i.e. not containing ??, {n}, {km}, (ab), [!a:b]): P_{1}X – P_{n}X (if there is more than one possible longest byte sequence for P_{i}X, choose one arbitrarily.)

If the pattern is not defined relative to EOF:

Work out the minimum offset of the start of P_{i}X from the start of P_{i}. This is the “minimum fragment length”.

Alternatively, if the pattern is defined relative to EOF:

Work out the minimum offset of the end of P_{i}X from the end of P_{i}. This is the “minimum fragment length”.
In the example:
P_{1}X = B1B2B3 (although it could equally well have been A1A2A3). The minimum fragment length is 5 (the length of “A1A2A3[A4:A5]??”).
P_{2}X = C1C2C3. The minimum fragment length is 2 (the length of “01??”).


For each P_{i}, split up the remainder of the sequence (i.e. the part not in P_{i}X) to the left and the right of P_{i}X according to any occurrences of ? or {n} or {km}. This creates arrays of objects P_{i}L_{j} (to the left of P_{i}X) and P_{i}R_{j} (to the right of P_{i}X) where each of these objects contains one or more unambiguous sequences (i.e. if “” occurs in sequence, then list all possibilities). Unlike in the previous step, these sequences may contain occurrences of the [:] wildcards, and are therefore not technically unambiguous.

For each subsequence P_{i}L_{j}, calculate the minimum and maximum offsets of the end of P_{i}L_{j} from the start of P_{i}L_{j+1} (or the start of P_{i}X, for the rightmost P_{i}L_{j}).

Similarly, for each subsequence P_{i}R_{j}, calculate the minimum and maximum offsets of the start of P_{i}R_{j} from the end of P_{i}R_{j1} (or the end of P_{i}X, for the leftmost P_{i}R_{j}).
In the example:
P_{1}L_{1} = A1A2A3[A4:A5], with a minimum and maximum offset of 1 (the “??”)
P_{1}R_{1} = B4 or B5, with a minimum and maximum offset of 0 (since P_{1}R_{1} follows directly on from P_{1}X).
P_{2}L_{1} = 01, with a minimum and maximum offset of 1 (the “??”)
P_{2}R_{1} = D1, with a minimum offset of 4, and a maximum offset of 7 (corresponding to “{47}”).
P_{2}R_{2} = F1F2F4F5 or F1F3F4F5, with a minimum and maximum offset of 2 (corresponding to “????”).
Step 5. Calculate the ‘shift distance’: the minimum distance between each byte and the end (or start) of the longest unambiguous byte sequence in its fragment.

For each distinct byte, b, the “shift distance” D_{i}(b) is equal to the minimum distance from the end of pattern P_{i}X to the occurrence of that byte in P_{i}X (unless the byte sequence is defined relative to EOF, in which case it is from the start of P_{i}X). Any bytes which do not occur in P_{i}X are given a shift distance equal to the length of P_{i}X + 1.
In the example, the shift distances for P_{1} are given by:
D_{1}(B3) = 1
D_{1}(B2) = 2
D_{1}(B1) = 3
D_{1}() = 4
The shift distances for P_{2} are defined similarly.
Preprocessing glossary
Term

Meaning

P

A byte sequence pattern as contained in the internal signature

P_{i}

A byte sequence pattern fragment created by splitting the pattern so that it does not contain a ‘*’

P_{i}X

The longest P_{i} in P.

Minimum fragment length for P_{i}X

If the pattern is not defined relative to EOF, this is the minimum offset of the start of P_{i}X from the start of P_{i}.
If the pattern is defined relative to EOF, this is the minimum offset of the end of P_{i}X from the end of P_{i}.

P_{i}L_{j}

A byte sequence pattern fragment to the left of P_{i}X

P_{i}R_{j}

A byte sequence pattern fragment to the right of P_{i}X

Minimum and maximum offsets

For each subsequence P_{i}L_{j}, these are the minimum and maximum number of bytes between the end of P_{i}L_{j} from the start of P_{i}L_{j+1} (or the start of P_{i}X, for the rightmost P_{i}L_{j}).
Similarly, for each subsequence P_{i}R_{j}, these are the minimum and maximum number of bytes between the start of P_{i}R_{j} from the end of P_{i}R_{j1} (or the end of P_{i}X, for the leftmost P_{i}R_{j}).

D_{i}(byte)

The ‘shift distance’ of that byte. This is defined as the minimum distance from the end of pattern P_{i}X to the occurrence of that byte in P_{i}X (unless the byte sequence is defined relative to EOF, in which case it is from the start of P_{i}X).

The pattern matching algorithm
The direction in which the pattern matching is carried out is determined by whether the byte sequence is relative to BOF, EOF or neither. The default in the following algorithm is that it is carried out from left to right from the beginning of the file, but if byte sequence is relative to EOF, then the pattern matching will be carried out from right to left, starting at the end of the file. The latter is described below by text in brackets: (/…).

We begin by trying to find P_{1}X (/P_{n}X). To this end, commence the search at the beginning (/end) of file F, at an offset of the minimum subsequence offset plus the minimum fragment length. This is the earliest (/latest) point in the file at which P_{1}X (/P_{n}X) may occur. Take a “window” on F of the same length as P_{1}X (/P_{n}X) and compare it with sequence P_{1}LX (/P_{n}X).

If the window and sequence don’t match, then get the “shift distance” associated with the first byte to the right (/left) of the window in F. Shift the window forwards (/backwards) by that many bytes.

Now repeat step 2 until either a match is found (move on to next step) or until either the end (/beginning) of the file or the maximum offset is reached (byte sequence fails).

Check for matches to the P_{1}L_{i} and P_{1}R_{i} (/P_{n}L_{i} and P_{n}R_{i}) to see whether a match for the whole of P_{1} (/P_{n}) can be found. For each sequence, start from smallest possible offset and stop as soon as a match is found so that we end up with the shortest possible sequence in F that matches the pattern. If matches are found for all these sequences, then record the location of the rightmost (/leftmost) byte for match of P_{1} (/P_{n}) as this will determine where the search for the next pattern fragment starts. If no match is found for P_{1} (/P_{n}), then search for next possible occurrence of P_{1}X (/P_{n}X) as in steps 2 and 3 until either the end of the file or the maximum offset is reached.

Now that we have found a match for the whole of P_{1}, repeat steps 1 to 4 for the remaining patterns P_{2} to P_{n}. In each case however, do not begin searching at the start (/end) of the file, but at the rightmost (/leftmost) byte of the last pattern found, as recorded in Step 4 (plus (/minus) the minimum subsequence offset for the pattern). Continue until all patterns have been found (i.e. positive match) or until the end (/start) of the file is reached (i.e. no match).
Note that because we search for the patterns in order (from left to right, in the default case), and always find the earliest occurrence of each pattern, there is never any need to backtrack using this algorithm. If we can’t find pattern P_{3}, say, we know that this has nothing to do with the placement of patterns P_{1} and P_{2}.
An activity diagram for the pattern matching is given below (this corresponds to the ‘comparison of the file with an internal signature’ step in the main activity diagram in section 3.1).
