Digital Preservation Technical Paper




The DROID format identification process


The DROID software tool uses signature information stored in PRONOM to perform automatic format identification. This section describes the identification algorithm employed by DROID, and the formats of the XML files used by DROID to describe signatures and record the results of the identification process.

The format identification algorithm

The format identification process employed by DROID is illustrated in the following activity diagram:

If a file is too big to hold in memory in its entirety, the first step is skipped. Instead, the file is read into memory a chunk at a time. Specifically, whenever the algorithm needs to access byte n of the file, it checks whether byte n is already in memory. If not, the currently buffered portion of the file is discarded and a new portion containing byte n is loaded in its place. The buffer size is currently 1,000,000 bytes.
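The buffered access described above can be sketched as follows. This is a minimal illustration, not DROID's implementation: the class name `WindowedFile`, the `loads` counter and the choice to align each window to a multiple of the buffer size are assumptions made for the example. Only the reload-on-miss behaviour and the 1,000,000-byte buffer size come from the text.

```python
import io

BUFFER_SIZE = 1_000_000  # buffer size quoted in the text, in bytes


class WindowedFile:
    """Access single bytes of a file through one fixed-size buffer (hypothetical helper)."""

    def __init__(self, fileobj, buffer_size=BUFFER_SIZE):
        self._f = fileobj        # any seekable binary file object
        self._size = buffer_size
        self._start = 0          # file offset of the first buffered byte
        self._buf = b""          # currently buffered portion of the file
        self.loads = 0           # number of buffer loads, kept for illustration

    def get_byte(self, n):
        """Return byte n as an int, replacing the buffer if n is not in memory."""
        if n < 0:
            raise IndexError("byte index must be non-negative")
        if not (self._start <= n < self._start + len(self._buf)):
            # Byte n is not buffered: discard the current portion and load a
            # new one containing byte n. Window alignment is an assumption.
            self._start = (n // self._size) * self._size
            self._f.seek(self._start)
            self._buf = self._f.read(self._size)
            self.loads += 1
            if n >= self._start + len(self._buf):
                raise IndexError(f"byte {n} is beyond the end of the file")
        return self._buf[n - self._start]
```

A caller wraps an open binary file, for example `WindowedFile(open(path, "rb"))`, and requests bytes by offset. Consecutive reads within the same window cost no further I/O, whereas a matcher that jumps between the start and end of a large file reloads the buffer on each jump.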

The algorithm uses the word ‘file’ throughout, but streams are identified using the same algorithm. A stream is read into memory at the start of the process. If memory runs out while doing so, the contents already in memory are written to a temporary file and the rest of the stream is appended to that file. The algorithm then runs on the temporary file.
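The stream handling can be sketched in the same spirit. The text describes falling back to a temporary file when memory runs out. This sketch substitutes a fixed `memory_limit` for that condition, which is an assumption, as are the function name `load_stream` and the chunk size.

```python
import io
import shutil
import tempfile


def load_stream(stream, memory_limit=1_000_000, chunk_size=64 * 1024):
    """Read a binary stream to its end and return a seekable file object.

    The data stays in memory unless it grows past memory_limit (a stand-in
    for running out of memory). In that case the bytes read so far are
    written to a temporary file and the rest of the stream is appended.
    """
    held = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            held.seek(0)
            return held                    # the whole stream fitted in memory
        held.write(chunk)
        if held.tell() > memory_limit:
            break                          # too large: spill to disk below
    spill = tempfile.TemporaryFile()       # binary mode, deleted when closed
    spill.write(held.getvalue())           # contents already in memory
    shutil.copyfileobj(stream, spill)      # remainder of the stream
    spill.seek(0)
    return spill
```

Either return value is seekable, so the identification step can treat it like an ordinary file.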

A file is first compared with the set of internal signatures, and any matches are classified as ‘positive’ identifications. These are further classified as ‘specific’ or ‘generic’ in accordance with the specificity rules defined in Section 2.2.1. If any matches have priority over other matches then the lower priority matches are discarded in accordance with the priority rules defined in Section 2.2.2. If the file matches one or more internal signatures then its file extension is checked against any allowed external signatures; if the file extension does not match any external signatures then a warning of a possible file extension mismatch is generated, but the identification result is not otherwise affected. If the file has no positive results against internal signatures then it is compared with the set of external signatures, and any matches are classified as ‘tentative’ identifications. If the file does not match any internal or external signatures then it is classified as a ‘negative’ identification.

The possible results of an identification process can be summarised as follows:



Status              | Warning                          | Comments
--------------------+----------------------------------+---------------------------------------------
Positive (Specific) |                                  | The file matches a specific internal signature
Positive (Specific) | Possible file extension mismatch | The file matches a specific internal signature but the file extension does not match any associated external signatures
Positive (Generic)  |                                  | The file matches a generic internal signature
Positive (Generic)  | Possible file extension mismatch | The file matches a generic internal signature but the file extension does not match any associated external signatures
Tentative           |                                  | The file matches an external signature but no internal signatures
Negative            |                                  | The file does not match any internal or external signatures

The pattern matching algorithm employed to undertake the internal signature identification is described in detail in Appendix C.
Identification example


In this example, PRONOM contains information on five formats. The first two (“Format A1” and “Format A2”) each have their own specific internal signature consisting of a single byte sequence. The third (“Format B”) has no internal signature. The last two (“Format C1” and “Format C2”) share the same generic internal signature. Every format has the file extension “txt” as an external signature, and each also has its own specific extension: “fa1”, “fa2”, “fb”, “fc1” and “fc2” respectively.

A set of files is processed by DROID using these signatures, with the following results:



  • aFile.fa1 passes the internal signature for “Format A1”, but fails the others. It is classified as a “Positive Specific” hit on “Format A1”.



  • bFile.fa1 passes the internal signature for “Format A2”, but fails the others. It is classified as a “Positive Specific” hit on “Format A2”. A warning is given, because the file extension is not associated with this format.



  • cFile.fa1 matches the internal signatures for both “Format A1” and “Format A2”. It is therefore classified as a possible “Positive Specific” hit on both formats. However, “Format A2” is defined as having priority over “Format A1”, so the hit on “Format A1” is discarded. The final output for this file is therefore a “Positive Specific” hit on “Format A2” with a warning about the file extension being wrong.



  • dFile.fa1 fails all internal signatures. Despite having a valid extension for “Format A1”, it is not classified as a tentative hit as it has failed the internal signature. It is therefore classified as a “Negative” identification.



  • eFile.txt fails all internal signatures. Despite having a valid extension for all file formats, it is only classified as a “Tentative” hit on “Format B” as it has failed the internal signatures for all other formats.



  • fFile.xxx passes the internal signature for “Format A2”. The file extension is unknown, so this is classified as a “Positive Specific” identification but with a warning about the file extension.



  • gFile.fb passes the internal signature for “Format A2”. Even though the file extension corresponds to “Format B”, the comparison is never made because a positive identification has already been made on the internal signature for “Format A2”. This file is classified as a “Positive Specific” identification on “Format A2” but with a warning about the file extension.



  • hFile.xxx fails all internal signatures. The file extension is unknown, so this is classified as a “Negative” identification.



  • iFile.txt passes the third internal signature and fails the other two. It is classified as a “Positive Generic” hit on “Format C1” and “Format C2” (no indication is made that the extension matches the other file formats).



  • jFile.fc1 passes the third internal signature and fails the other two. It is classified as a “Positive Generic” hit on “Format C1” and “Format C2”. The hit on “Format C2” also has a warning about the file extension.



  • kFile.txt passes all three internal signatures. It is classified as a “Positive Specific” hit on “Format A1” and “Format A2”. By the same argument as for “cFile.fa1”, the hit on “Format A1” is discarded. This file is also a “Positive Generic” hit on “Format C1” and “Format C2”.
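The classification rules can be checked against these examples with a short sketch. It assumes that byte-level matching has already happened, so `identify` takes the set of internal signature identifiers the file passed. The names `Format`, `Hit`, `identify`, `FORMATS` and the identifiers `sigA1`, `sigA2` and `sigC` are invented for illustration. Following dFile.fa1 and eFile.txt, a tentative hit is reported only for formats that have no internal signature.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, NamedTuple, Optional, Set


@dataclass(frozen=True)
class Format:
    name: str
    extensions: FrozenSet[str]            # external signatures
    signature: Optional[str] = None       # id of its internal signature, if any
    specific: bool = True                 # False when the signature is generic
    priority_over: FrozenSet[str] = frozenset()


class Hit(NamedTuple):
    format_name: str
    status: str
    extension_warning: bool               # possible file extension mismatch


def identify(matched: Set[str], extension: str, formats: List[Format]) -> List[Hit]:
    """Classify one file; an empty list stands for a Negative identification."""
    positive = [f for f in formats if f.signature in matched]
    if positive:
        # Discard hits on formats that another hit has priority over.
        outranked = set()
        for f in positive:
            outranked |= f.priority_over
        return [
            Hit(f.name,
                "Positive (Specific)" if f.specific else "Positive (Generic)",
                extension not in f.extensions)
            for f in positive if f.name not in outranked
        ]
    # No positive hit: fall back to external signatures, but only for
    # formats whose internal signature cannot have failed (they have none).
    return [Hit(f.name, "Tentative", False)
            for f in formats
            if f.signature is None and extension in f.extensions]


# The five example formats; signature ids are hypothetical labels.
FORMATS = [
    Format("Format A1", frozenset({"txt", "fa1"}), "sigA1"),
    Format("Format A2", frozenset({"txt", "fa2"}), "sigA2",
           priority_over=frozenset({"Format A1"})),
    Format("Format B", frozenset({"txt", "fb"})),
    Format("Format C1", frozenset({"txt", "fc1"}), "sigC", specific=False),
    Format("Format C2", frozenset({"txt", "fc2"}), "sigC", specific=False),
]
```

For instance, `identify({"sigA1", "sigA2"}, "fa1", FORMATS)` models cFile.fa1 and `identify(set(), "txt", FORMATS)` models eFile.txt.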

The DROID signature file

The signature information contained in PRONOM must be exported in a form which can be used by the DROID tool to perform automatic format identification. A DROID signature file contains all the information required by DROID to identify formats, and takes the form of an XML document which complies with the schema described in Appendix A.1. The internal signatures contained in the signature file have been pre-processed in accordance with the method described in Appendix B. The resultant signature file contains the following information:



  • A list of file formats and for each one, the internal system identifier, format name, format version, format PUID, and a collection of identifiers for all internal signatures associated with the file format, as well as the collection of external signatures. Any priority relationships over other formats are also recorded.



  • A list of internal signatures and for each one, a unique identifier and a collection of byte sequences.



  • Each byte sequence has a position type (absolute from BOF, absolute from EOF, or variable) and possibly a maximum offset. Each sequence is made up of a series of subsequences, each represented by its longest unambiguous byte sequence, its minimum offset, its shift distance function and a series of left and right sequence fragments (see Appendix B for further details).



  • Each sequence fragment has a position number, an unambiguous byte sequence and minimum and maximum offsets (see Appendix B for further details).
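The information listed above can be pictured as a small in-memory data model. The sketch below is only one possible reading: the class and field names are invented for illustration and are not the element names of the schema in Appendix A.1. Representing the shift distance function as a mapping from byte value to shift is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Fragment:
    position: int                  # position number of the fragment
    byte_sequence: str             # unambiguous byte sequence, as hex text
    min_offset: int
    max_offset: int


@dataclass
class SubSequence:
    sequence: str                  # longest unambiguous byte sequence, as hex text
    min_offset: int
    shift_distances: Dict[str, int] = field(default_factory=dict)  # assumed form
    left_fragments: List[Fragment] = field(default_factory=list)
    right_fragments: List[Fragment] = field(default_factory=list)


@dataclass
class ByteSequence:
    position_type: str             # "BOF", "EOF" or "variable" (labels assumed)
    max_offset: Optional[int] = None
    subsequences: List[SubSequence] = field(default_factory=list)


@dataclass
class InternalSignature:
    identifier: int
    byte_sequences: List[ByteSequence] = field(default_factory=list)


@dataclass
class FileFormat:
    identifier: int                # internal system identifier
    name: str
    version: str
    puid: str
    internal_signature_ids: List[int] = field(default_factory=list)
    extensions: List[str] = field(default_factory=list)     # external signatures
    priority_over: List[int] = field(default_factory=list)  # ids of outranked formats


@dataclass
class SignatureFile:
    formats: List[FileFormat] = field(default_factory=list)
    internal_signatures: List[InternalSignature] = field(default_factory=list)

    def signatures_for(self, fmt: FileFormat) -> List[InternalSignature]:
        """Resolve a format's signature identifiers to signature objects."""
        by_id = {s.identifier: s for s in self.internal_signatures}
        return [by_id[i] for i in fmt.internal_signature_ids if i in by_id]
```

Formats refer to internal signatures by identifier rather than containing them, which is what allows two formats to share one generic signature.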


Example signature file

The following example DROID Signature File describes the five example formats described in Section 3.1.1.







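[Example signature file: the XML markup has not survived; the element values are given in document order.]

B1B2B3B4, 4, 3, 2, 1, 0, A1A2A3, C1C2C3, 3, 2, 1, 0, B1B2B3, 3, 2, 1, 0, A1A2A3, B4, B5, C1C2C3, 3, 2, 1, 0, 01, D1, F1F2F4F5, F1F3F4F5, ........, ........

15, txt, fa1, 16, txt, fa2, 1, txt, fb, 17, txt, fc1, 17, txt, fc2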






The DROID file collection file

The DROID file collection file contains the list of files selected for identification, and the results of a DROID identification process. It is generated by the DROID software and takes the form of an XML document. If saved prior to the identification process being performed, it contains the following information:



  • The DROID version number



  • The signature file version number



If saved following identification, it contains the following information:

  • The DROID version number



  • The signature file version number



  • The list of the files submitted to the identification process



  • For each file, a list of file format identifications



  • For each file format identification, the name, version and PUID of the file format, the identification status (e.g. positive or tentative), and any relevant warnings or errors.
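A file collection document of this shape could be assembled as sketched below with Python's standard `xml.etree.ElementTree` module. Every element name here (`FileCollection`, `DROIDVersion`, `SignatureFileVersion`, `File`, `Path`, `Identification`, `Name`, `Version`, `PUID`, `Status`, `Warning`) is a placeholder chosen for the example and is not taken from the DROID schema. The function name `build_file_collection` and the PUID value used in the usage note are also made up.

```python
import xml.etree.ElementTree as ET


def build_file_collection(droid_version, signature_file_version, results=None):
    """Build a file collection document (hypothetical element names).

    results maps a file path to a list of identification dicts with the keys
    "Name", "Version", "PUID", "Status" and, optionally, "Warning". When
    results is omitted, only the two version numbers are recorded, as for a
    file saved before identification.
    """
    root = ET.Element("FileCollection")
    ET.SubElement(root, "DROIDVersion").text = droid_version
    ET.SubElement(root, "SignatureFileVersion").text = str(signature_file_version)
    for path, hits in (results or {}).items():
        file_el = ET.SubElement(root, "File")
        ET.SubElement(file_el, "Path").text = path
        for hit in hits:                       # a file may have several hits
            hit_el = ET.SubElement(file_el, "Identification")
            for tag in ("Name", "Version", "PUID", "Status"):
                ET.SubElement(hit_el, tag).text = hit[tag]
            if hit.get("Warning"):
                ET.SubElement(hit_el, "Warning").text = hit["Warning"]
    return root
```

Calling `ET.tostring(build_file_collection("V1.1", 3, results), encoding="unicode")` yields the XML text. A file with an empty list of hits is still listed, which corresponds to a negative identification.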


Example file collection file


The following example DROID File Collection File describes the results of the identification process described in Section 3.1.1.





[Example file collection file: the XML markup has not survived; the element values are given in document order, grouped by file.]

V1.1; 3; 2006-02-09T11:59:52

C:\testfiles\aFile.fa1
    Positive (Specific Format); Format A1; V1.1; V1.1 of format A

C:\testfiles\bFile.fa1
    Positive (Specific Format); Format A2; V1.2; V1.2 of format A; Possible file extension mismatch

C:\testfiles\cFile.fa1
    Positive (Specific Format); Format A2; V1.2; V1.2 of format A; Possible file extension mismatch

C:\testfiles\dFile.fa1

C:\testfiles\eFile.txt
    Tentative; Format B; V0.0; V0.0 of format B

C:\testfiles\fFile.xxx
    Positive (Specific Format); Format A2; V1.2; V1.2 of format A; Possible file extension mismatch

C:\testfiles\gFile.fb
    Positive (Specific Format); Format A2; V1.2; V1.2 of format A; Possible file extension mismatch

C:\testfiles\hFile.xxx

C:\testfiles\iFile.txt
    Positive (Generic Format); Format C1; V1; V1 of format C
    Positive (Generic Format); Format C2; V2; V2 of format C

C:\testfiles\jFile.fc1
    Positive (Generic Format); Format C1; V1; V1 of format C
    Positive (Generic Format); Format C2; V2; V2 of format C; Possible file extension mismatch







C:\testfiles\kFile.txt
    Positive (Specific Format); Format A2; V1.2; V1.2 of format A
    Positive (Generic Format); Format C1; V1; V1 of format C
    Positive (Generic Format); Format C2; V2; V2 of format C











