# Digital Preservation Technical Paper

 Page 5/5 Date conversion 16.05.2016 Size 381.61 Kb.
1   2   3   4   5

## File collection file schema

The XML schema definition for the file collection file is as follows:

1. ## Pre-processing signatures

Background for pattern matching algorithm

The pattern matching algorithm is based on the Boyer-Moore-Horspool (BMH) algorithm for string matching (Horspool, 1980). The BMH algorithm does not allow for wildcard characters such as defined above, and so a certain amount of pre-processing needs to be performed on the internal signatures before the algorithm can be applied. The signatures are stored in unprocessed form in the PRONOM registry, and pre-processing is automatically applied as part of the generation of a new signature file.

Pre-processing of the signature file

The following pre-processing is required on each byte sequence to be used in the pattern matching. To simplify the exposition, we will use the following sequence, defined with an offset of 10 bytes from the beginning of the file as a running example:

 A1A2A3[A4:A5]??B1B2B3(B4|B5)*{5}01??C1C2C3{4-7}D1????F1(F2|F3)F4F5

1. ### Step 1. Split the byte sequence pattern into fragments to remove all ‘*’

• Split up each byte sequence pattern P into smaller fragments (P1 - Pn) wherever a “*” is found (assume P1 – Pn are in order left to right). Drop any wildcards of the form {n} or {n-m} which appear at the ends of the Pi.

In the example:

P1 = A1A2A3[A4:A5]??B1B2B3(B4|B5)

P2 = 01??C1C2C3{4-7}D1????F1(F2|F3)F4F5 (note that we have dropped the “{5}”)

1. ### Step 2. Find the minimum and maximum subsequence offsests

• For each sequence Pi, work out the minimum and (for P1) maximum distance of the start of Pi from the end of Pi-1 (or the start of the file, for P1).

• Alternatively, if the pattern is defined relative to EOF:

In the example:

P1 has minimum and maximum subsequence offsets of 10 (recall that the pattern was defined with an offset of 10 bytes from BOF).

P2 has a minimum subsequence offset of 5 (from the “{5}” we dropped).

1. ### Step 3. Find the longest unambiguous byte sequence in every fragment

• For each pattern fragment, pull out longest unambiguous byte sequence (i.e. not containing ??, {n}, {k-m}, (a|b), [!a:b]): P1X – PnX (if there is more than one possible longest byte sequence for PiX, choose one arbitrarily.)

• If the pattern is not defined relative to EOF:

• Work out the minimum offset of the start of PiX from the start of Pi. This is the “minimum fragment length”.

• Alternatively, if the pattern is defined relative to EOF:

• Work out the minimum offset of the end of PiX from the end of Pi. This is the “minimum fragment length”.

In the example:

P1X = B1B2B3 (although it could equally well have been A1A2A3). The minimum fragment length is 5 (the length of “A1A2A3[A4:A5]??”).

P2X = C1C2C3. The minimum fragment length is 2 (the length of “01??”).

1. ### Step 4. Split the fragments into remaining unambiguous byte sequences

• For each Pi, split up the remainder of the sequence (i.e. the part not in PiX) to the left and the right of PiX according to any occurrences of ? or {n} or {k-m}. This creates arrays of objects PiLj (to the left of PiX) and PiRj (to the right of PiX) where each of these objects contains one or more unambiguous sequences (i.e. if “|” occurs in sequence, then list all possibilities). Unlike in the previous step, these sequences may contain occurrences of the [:] wildcards, and are therefore not technically unambiguous.

• For each subsequence PiLj, calculate the minimum and maximum offsets of the end of PiLj from the start of PiLj+1 (or the start of PiX, for the rightmost PiLj).

• Similarly, for each subsequence PiRj, calculate the minimum and maximum offsets of the start of PiRj from the end of PiRj-1 (or the end of PiX, for the leftmost PiRj).

In the example:

P1L1 = A1A2A3[A4:A5], with a minimum and maximum offset of 1 (the “??”)

P1R1 = B4 or B5, with a minimum and maximum offset of 0 (since P1R1 follows directly on from P1X).

P2L1 = 01, with a minimum and maximum offset of 1 (the “??”)

P2R1 = D1, with a minimum offset of 4, and a maximum offset of 7 (corresponding to “{4-7}”).

P2R2 = F1F2F4F5 or F1F3F4F5, with a minimum and maximum offset of 2 (corresponding to “????”).

1. ### Step 5. Calculate the ‘shift distance’: the minimum distance between each byte and the end (or start) of the longest unambiguous byte sequence in its fragment.

• For each distinct byte, b, the “shift distance” Di(b) is equal to the minimum distance from the end of pattern PiX to the occurrence of that byte in PiX (unless the byte sequence is defined relative to EOF, in which case it is from the start of PiX). Any bytes which do not occur in PiX are given a shift distance equal to the length of PiX + 1.

In the example, the shift distances for P1 are given by:

D1(B3) = 1

D1(B2) = 2

D1(B1) = 3

D1() = 4

The shift distances for P2 are defined similarly.

1. ## Pre-processing glossary

 Term Meaning P A byte sequence pattern as contained in the internal signature Pi A byte sequence pattern fragment created by splitting the pattern so that it does not contain a ‘*’ PiX The longest Pi in P. Minimum fragment length for PiX If the pattern is not defined relative to EOF, this is the minimum offset of the start of PiX from the start of Pi. If the pattern is defined relative to EOF, this is the minimum offset of the end of PiX from the end of Pi. PiLj A byte sequence pattern fragment to the left of PiX PiRj A byte sequence pattern fragment to the right of PiX Minimum and maximum offsets For each subsequence PiLj, these are the minimum and maximum number of bytes between the end of PiLj from the start of PiLj+1 (or the start of PiX, for the rightmost PiLj). Similarly, for each subsequence PiRj, these are the minimum and maximum number of bytes between the start of PiRj from the end of PiRj-1 (or the end of PiX, for the leftmost PiRj). Di(byte) The ‘shift distance’ of that byte. This is defined as the minimum distance from the end of pattern PiX to the occurrence of that byte in PiX (unless the byte sequence is defined relative to EOF, in which case it is from the start of PiX).

1. ## The pattern matching algorithm

The direction in which the pattern matching is carried out is determined by whether the byte sequence is relative to BOF, EOF or neither. The default in the following algorithm is that it is carried out from left to right from the beginning of the file, but if byte sequence is relative to EOF, then the pattern matching will be carried out from right to left, starting at the end of the file. The latter is described below by text in brackets: (/…).

1. We begin by trying to find P1X (/PnX). To this end, commence the search at the beginning (/end) of file F, at an offset of the minimum subsequence offset plus the minimum fragment length. This is the earliest (/latest) point in the file at which P1X (/PnX) may occur. Take a “window” on F of the same length as P1X (/PnX) and compare it with sequence P1LX (/PnX).

1. If the window and sequence don’t match, then get the “shift distance” associated with the first byte to the right (/left) of the window in F. Shift the window forwards (/backwards) by that many bytes.

1. Now repeat step 2 until either a match is found (move on to next step) or until either the end (/beginning) of the file or the maximum offset is reached (byte sequence fails).

1. Check for matches to the P1Li and P1Ri (/PnLi and PnRi) to see whether a match for the whole of P1 (/Pn) can be found. For each sequence, start from smallest possible offset and stop as soon as a match is found so that we end up with the shortest possible sequence in F that matches the pattern. If matches are found for all these sequences, then record the location of the rightmost (/leftmost) byte for match of P1 (/Pn) as this will determine where the search for the next pattern fragment starts. If no match is found for P1 (/Pn), then search for next possible occurrence of P1X (/PnX) as in steps 2 and 3 until either the end of the file or the maximum offset is reached.

1. Now that we have found a match for the whole of P1, repeat steps 1 to 4 for the remaining patterns P2 to Pn. In each case however, do not begin searching at the start (/end) of the file, but at the rightmost (/leftmost) byte of the last pattern found, as recorded in Step 4 (plus (/minus) the minimum subsequence offset for the pattern). Continue until all patterns have been found (i.e. positive match) or until the end (/start) of the file is reached (i.e. no match).

Note that because we search for the patterns in order (from left to right, in the default case), and always find the earliest occurrence of each pattern, there is never any need to backtrack using this algorithm. If we can’t find pattern P3, say, we know that this has nothing to do with the placement of patterns P1 and P2.

An activity diagram for the pattern matching is given below (this corresponds to the ‘comparison of the file with an internal signature’ step in the main activity diagram in section 3.1).

1: See Brown (2005)

1   2   3   4   5