Carlyle, A., Ranger, S., & Summerlin, J. (2008). Making the Pieces Fit: Little Women, Works, and the Pursuit of Quality. CATALOGING AND CLASSIFICATION QUARTERLY. 46 (1), 35-64.
MAKING THE PIECES FIT: LITTLE WOMEN, WORKS, AND THE PURSUIT OF QUALITY
Allyson Carlyle, PhD Phone: 206 / 543-1887
Associate Professor Fax: 206 / 616-3152
Information School Email: firstname.lastname@example.org
MGH Suite 370
University of Washington
Seattle, WA 98195-2840
Sara Ranger, MLIS Phone: 713 / 290-1614
Teacher Fax: 713 / 290-1273
The Monarch School
1231 Wirt Rd. Email: email@example.com
Houston, TX 77055
Joel Summerlin, MLIS Phone: 206 / 373-6061
Search Vocabulary Manager Email: firstname.lastname@example.org
710 2nd Ave., Suite 200
Seattle, WA 98104
SUMMARY. In current cataloging practice, the identification of an item as a member of a particular work set is accomplished by assigning a main entry heading, or main entry citation, in the bibliographic record representing that item. The main entry citation is normally comprised of a primary author name and the uniform title associated with the work. However, the quality of bibliographic records varies, and this means of identification is not universally used by catalogers. Thus, consistent identification and retrieval of records representing editions of works is not guaranteed. Research is reported that investigates the extent to which records that are members of a particular work set may be automatically identified as such.
Allyson Carlyle, PhD, is Associate Professor, Information School, University of Washington, Suite 370, Mary Gates Hall, Seattle, WA 98195-2840. (Email: email@example.com).
Sara Ranger, MLIS, MS, is a teacher at the Monarch School, Houston, TX 77055. (Email: firstname.lastname@example.org).
Joel Summerlin, MLIS, is Search Vocabulary Manager at Corbis Corporation, 710 2nd Ave. Suite 200, Seattle, WA 98104. (Email: email@example.com)
We thank OCLC and the Association for Library and Information Science Education for support of this project via the OCLC/ALISE Library and Information Science Research Grant. In addition, we gratefully acknowledge the contributions of the following along the way, in alphabetical order: Samantha R. Becker, Diana Brooking, Harry Bruce, Collette Davis, Karen Fisher, Lisa M. Fusco, Maurice Green, Joseph W. Janes, Elizabeth S. Knight, Marsha McGuire, and Edward T. O’Neill.
INTRODUCTION In card and book catalogs, collocated arrangement of bibliographic records under author name and uniform title accomplished the retrieval and display of all records representing a work. The human hands and minds of the filer forgave cataloger errors and omissions, neatly placing records into arrangements constructed to elucidate the nature of the documents held by a particular library associated with a particular work. Such arrangements showed the relationships among items; for example, translations were grouped together and arranged by language. For all that we have gained by moving our catalogs into an electronic environment, this, so far, we have lost: the ability to identify the records associated with a work and organize them in helpful and intelligent ways.
A first step toward recovering our ability to effectively retrieve and display works in the electronic environment is the identification of all of the records that represent manifestations or editions (these two terms will be used interchangeably in this paper) of a work. Assignment of a standardized uniform title and author name, a work citation, is the primary mechanism provided to identify a bibliographic record as belonging to a set of work records (a work set) in the Anglo-American Cataloguing Rules, 2nd edition (AACR2). Unfortunately, problems with standards and practice prevent effective identification of works in cataloging records, and thus, in our catalogs. It may be difficult or impossible to identify work set records for several reasons:
The lack of standard author or title information in access fields; that is, author and title information as it appears in authority records is not present in the record. One reason for this is that access fields contain information as it is presented in the item. This may be in accordance with cataloging rules, for example, when a uniform title is not present because uniform titles are not mandated in AACR2, or it may be contrary to cataloging rules, as when access field information is abbreviated.
The assignment of correct names or titles to incorrect MARC fields.
The presence of typos or other unintentional errors.
All of these reasons may be attributed to a lack of quality in record creation.
Doubtless the most effective procedure for improving the current situation would be to find the records that contain low quality information, examine the documents that they represent, determine their work status based on current cataloging standards, and upgrade those records. Sometimes it would not be necessary to examine the documents themselves, but sometimes the lack of information in records would necessitate document examination. With or without physical examination, this solution is less than optimal because it would require many hours of cataloger time and, thus, be very costly. It would also have to be done by many individual catalogers in many different catalogs, requiring even more time and expense because of the coordination of effort needed. One means of reducing the cost of work set record identification would be to apply automatic methods to the task (for example, Ayres, 1990, or the FRBR Work Set Algorithm).1 But what are the best methods of automatically identifying work set records? To what extent would automatic methods accurately identify work set records lacking adequate author and title information?
The research project described here was undertaken to begin to answer these questions. In this project, four large works of fiction were studied. These works were Bleak House, by Charles Dickens, Kidnapped, by Robert Louis Stevenson, The Three Musketeers, by Alexandre Dumas, and Little Women, by Louisa May Alcott. The project included the following steps:
assemble and analyze records representing editions of sample works;
determine the most effective means of collocating work set records automatically;
test the effectiveness of an automatic process in identifying and collocating records representing the sample works.
RATIONALE FOR THE RESEARCH Previous research has revealed that online catalogs do not effectively collocate records representing works with which many records are associated (voluminous works). Carlyle found that searches for voluminous works in online catalogs retrieved record sets with low precision and that arrangements of work set records were interrupted frequently by records representing other works.2 Poor collocation may be exacerbated in union catalogs and bibliographic utilities by the presence of duplicate records.3
The lack of collocation of work set records representing voluminous works may be a significant barrier to users seeking such works in library catalogs. Large retrieval sets are not always easy for users to negotiate, and may be more challenging when results sets consist of long lists of records that are not ordered to demonstrate the nature of the items retrieved or the relationships that exist among them. In the first major study of online library catalogs, Matthews, Lawrence, and Ferguson found that users reported difficulty with large retrieval sets.4 Other studies of online catalog use have reported similar results.5
A variety of research projects have addressed the problem of poor retrieval and collocation of work set records in online catalogs. The experimental BOPAC (Bradford Online Public Access Catalog) and BOPAC2 systems were developed at the University of Bradford specifically to identify and cluster work set records so that searches for works would produce simpler and clearer results displays.6 In the first BOPAC, researchers assembled “manifestation sets” automatically. They reported a variety of difficulties, not solved automatically in this project, in identifying all of the records that should have been retrieved in a work set. These difficulties included misspellings, incorrect MARC tagging, and inadequate information included in a record. They also reported that it was impossible to determine whether certain records were members of a particular manifestation set, because the content in the record contained insufficient identification information.7
In BOPAC2, a project building on and expanding the idea behind the original BOPAC, the research team developed a Z39.50 system to perform searches for authors and titles across a variety of catalogs. Goals for BOPAC2 included assembling records representing manifestations of a work in organized ways and allowing users to manipulate the organization according to their needs.8 Records representing manifestations of works were identified automatically by using the author and title of the work. BOPAC2 searched for titles entered by users in both uniform title fields and title fields. Only single titles were searched; other titles that works were published under were not sought (see Section 2.3).
Another project attempting to assemble and organize work set records was described by Weinstein.9 In this project, 500 records were sampled from the University of Michigan library, all by or about Beethoven. Although works, in this project called "conceptions", were identified manually, the process of identification mimicked an automatic procedure (p. 259).
Kilgour sought to increase the effectiveness of known-item searching by introducing a strategy of searching by combinations of author surnames and title keywords.10 Author-title combination searching was viewed as increasing search effectiveness by reducing the number of screens or records that were retrieved in response to a search. While it is probably true that the strategy of using author surnames and title keywords would prove more effective than author or title searching in retrieving records associated with a work, the problem of title variants remains, in that work set records containing titles that lack the title keywords sought would still not be retrieved. In addition, a strategy such as this still might retrieve records not associated with a single work. For example, if one were to search the author surname “Milton” and the title keyword “paradise”, records could be retrieved representing two of Milton’s works, Paradise Lost and Paradise Regained, and a work by David Scott Milton called Paradise Road. In such a search, records for these works would be interfiled, obstructing the identification of manifestations of interest to a catalog user.
Several research projects whose aim was to explore bibliographic relationships identified records representing manifestations of particular works in the course of data collection. Smiraglia, investigating the derivative relationship in the Georgetown University Libraries, sampled for earliest representative records associated with a work.11 After collecting a set of sample records, he searched OCLC using author and title searches to find records representing progenitor work set records (the first record associated with a work) and to assemble a complete set of records associated with each work. In a similar project, Smiraglia and Leazer drew a sample of records from which to examine the derivative relationship in library catalogs.12 With an initial sample record in hand, they searched WorldCat for records sharing a derivative relationship with that record. Vellucci conducted an investigation of bibliographic relationships existing among musical works. She performed exhaustive searches in OCLC and a local catalog using variant titles discovered in the course of the research and in various music reference sources.13 Because the goals of these projects were not directly related to problems identifying records representing works, data was not collected on the variability of author and title information in work set records.
The research project described in this paper is closely related to, and follows up on, a project completed by O’Neill at OCLC in 1994.14 The object of O’Neill’s research was to develop a software program to identify records for all English language manifestations of a large fiction work. In this research, records representing 10,000 widely held fiction works were initially identified using an automatic identification program. Once the 10,000 works were identified, the automatic program was made less restrictive and additional manifestations were discovered. Manual identification of variant names and titles was occasionally used to increase the number of work set records retrieved. The software developed in the project consisted of algorithms that matched the following combination of attributes: author name (MARC 100 field, subfield a), title proper (245 field, subfield a) and uniform title (240 field, subfield a). However, although high measures of recall and precision were observed, the report did not include specific measures, nor did it provide details on how these measures were completed.
Recently, a good deal of attention has been given to work issues in cataloging. The publication of Functional Requirements for Bibliographic Records in 1998, in which works are defined as an important entity in a conceptual model of the bibliographic universe, stimulated a flurry of activity surrounding works.15 It has, in addition, stimulated articles (e.g., Carlyle, 2006),16 blogs (e.g., William Denton’s FRBR Blog),17 and multiple implementation projects (e.g., OCLC Office of Research, Fiction Finder).18 The new draft set of cataloging rules, Resource Description and Access (RDA), has included rules to incorporate the FRBR entity model of work, expression, manifestation, and item.19 In addition, OCLC has followed up by creating an algorithm that may be used to assemble records representing work set records.20 Although these projects and algorithms have been widely distributed, statistics demonstrating their level of success are not available.
Research and anecdotal evidence demonstrate the poor performance of online catalogs in retrieving and displaying records representing manifestations of voluminous works. Because the means by which effective retrieval and display of work set records may be accomplished in library catalogs is tightly linked to the process by which relevant records are identified, it is useful for a variety of explicit procedures to be documented and evaluated. A major purpose of this research project is the discovery and analysis of procedures that may be used to identify records representing manifestations of works automatically. Because a further gap in the research is the extent to which automatic procedures successfully identify records representing editions of works that have been published under varying titles, including translations, another major goal of this research is to investigate the potential success of automatic procedures in identifying, and thus collocating, records associated with a work.
METHODOLOGY This research project entailed a detailed investigation and test of the potential of automatic methods to successfully identify bibliographic records as editions of particular voluminous works. The steps followed in completing the project are summarized below:
1) Assemble records representing editions of sample works.
2) Based on an analysis of the records representing the sample works, determine the most effective means of collocating work set records automatically. This includes the following:
a) constructing work identifiers;
b) excluding non-work set records.
3) Test the effectiveness of these processes in collocating records for the sample works.
Assembly of Work set records A sample of four works was selected for a detailed analysis from the list produced by O'Neill's 1994 research of the 100 largest fiction works in WorldCat:21
Bleak House by Charles Dickens,
Kidnapped by Robert Louis Stevenson,
Little Women by Louisa May Alcott,
Three Musketeers by Alexandre Dumas.
These works were in the mid-range of works identified in O’Neill’s research and were associated with approximately 300 records each. Works associated with the largest numbers of bibliographic records were not selected because of the time necessary for a detailed analysis and smaller works were not selected to make sure the range of identification issues would be sufficiently broad.
In this research, determining an automatic identification process was done by analyzing the set of the records associated with each sample work. In order to obtain as complete a set of work set records as possible to conduct this analysis, several steps were performed. The OCLC Office of Research provided an initial set of work set records identified automatically in WorldCat using the identification program developed in the O’Neill project. Because the identification program identified English language editions only, records for editions not discovered using the O’Neill identification program had to be identified. Records not in the OCLC record set included records for English language editions with variant titles not identified in the O’Neill program, non-English language editions, and non-book editions. These records were identified manually for this study using a variety of strategies described below.
First, the authority records associated with each work were retrieved. Variant titles, including non-English titles, identified in the authority records were searched in WorldCat. Next, each work was searched in the National Union Catalog to discover additional variant titles to be used in searching WorldCat.22 The following two strategies were also used in searching WorldCat: last name author and title word keyword searches, and author searches alone. Records for translations that included uniform titles were reviewed to discover variant titles proper that had not been identified in the previous steps, and WorldCat was searched again using these titles.
It was necessary, on occasion, to verify that specific titles were, in fact, variant title forms of the works studied. For the unclear non-English titles, we verified that they were variations of the work titles in question by asking catalogers who had the necessary language expertise. For the remaining unclear titles, we searched in reference works and on the web for complete bibliographies of the authors that would include known title variants. After all of these steps, a few titles still remained unclear. These were not treated as variant titles and the few records containing only these titles and no uniform title were excluded from the set of work set records.
Only records that represented manifestations of the sample works were sought. A manifestation was operationally defined as any manifestation of a work that should receive the same main entry citation (author and title or uniform title) as the work itself, as prescribed by AACR2R and including application of Chapter 25, Uniform Titles. Documents representing the collected works of the authors in the study would also contain the sample works; however, the researchers did not search for or identify any of these records, primarily because of the extra time and complications involved in identifying and including them. That said, records representing items that contained multiple works (most often, two works published together in a single volume) that contained the work studied were included so long as an identifiable work title was present in the record.
Determining Work Identifiers After work set records were assembled, we analyzed them to determine the means by which bibliographic records could be identified automatically. Each bibliographic record in the sample was reviewed to determine the specific attributes in a record that would act reliably as work identifiers. These attributes included: 1) type of field content, or attribute type, for example, author name, and 2) MARC fields in which field content occurred.
Three types of field content were discovered that were effective in determining whether a record belonged to a work set:
Author name.
Title.
Library of Congress Classification (LCC) number.
Author name and title together are the standard identifiers of works with personal authors. However, three of the works studied (Bleak House, Kidnapped, and Three Musketeers) were associated with unique LCC numbers that could also be used to identify work set records.
Author names in the records analyzed included authorized and non-authorized forms of name. Because these different forms of name were discovered, the author name attribute includes multiple forms of name. This attribute, composed of both authorized and non-authorized forms of name is referred to as “Name” in the rest of the paper. Uniform title is used by catalogers expressly to identify a work and is a type of title that frequently identifies a work. However, works are often associated with variant forms of title, which may be the only title appearing in a record. As a result, the attribute “Title” is also composed of both authorized (uniform) and non-authorized titles. LCC numbers occur in a single form; dates of publication and other numbers that frequently occur in a call number that are not part of the LCC number proper were disregarded. This attribute is referred to as “LCC” in the rest of the paper.
The next step in the research was to determine work identifiers – the combination of field content and specific MARC field/subfield that could be used automatically to mark records as belonging to a particular work set. Work identifiers are listed in Table 1.
[TABLE 1 ABOUT HERE]
Name and Title are almost always present in bibliographic records for works of fiction. Together, they identify a work. In the records analyzed here, Name always occurred in a 100 field, subfield “a” or in a 700 field, subfield “a,” when occurring in a name-title added entry. Title, however, was discovered in several different fields. Obviously, Title occurs in both 240 and 245 fields, the uniform title and title fields. Title also occurs in a subfield “t” in a name-title added entry field.
Sometimes when a work is published with another work in a single physical volume, it is cataloged using a 246 or 740 field with a uniform or variant title instead of using a name-title added entry for a contained work. This would, perhaps, not be recommended cataloging practice, but it occurs frequently nonetheless. Thus, Title could occur in 240, 245, 246, or 740 fields, or in the subfield “t” in a 700 (name-title added entry) field. Finally, Title sometimes occurs in a 500 field. To make identification of work set records more reliable, its presence in a 500 field is only treated as a work identifier when it follows the phrases “Translation of” or “Trans. of.” It is possible that additional fields would have to be considered for other works – particularly other types of works, such as non-fiction or non-textual works.
It was necessary to restrict the name-title added entry fields included as work identifiers to those with a second indicator of “2” because only “2” unambiguously identifies the presence of a work that is contained within the item being cataloged. The name-title added entry field is also used to indicate that the work cataloged is related to another work. Unfortunately, cataloging practice has not been consistent in assigning a second indicator of “2” for contained works, so some manifestation records that were excluded (see discussion below) should have been included in the work set.
LCC may appear in either an 050 or 090 field. LCC has the potential to be particularly useful for identifying records such as translations, which can contain non-standard author or title information. As stated above, only the LCC number was used as an identifier; any information in the field following LCC, such as date of publication, was disregarded.
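The field rules described above can be sketched in code. The following Python fragment is only an illustration under stated assumptions, not the procedure used in the study: the record shape (a list of field dictionaries with `tag`, `ind2`, and `subfields` keys) and the function name `candidate_identifiers` are hypothetical, and taking subfield “a” of the 050 or 090 field is a rough stand-in for disregarding the information that follows the LCC number.

```python
import re

# Hypothetical record shape: a list of fields, each a dict with a MARC tag,
# a second indicator, and a mapping of subfield codes to strings.
TITLE_TAGS = {"240", "245", "246", "740"}
TRANSLATION_NOTE = re.compile(
    r"^\s*(?:Translation of|Trans\. of)\s+(.+)", re.IGNORECASE
)


def candidate_identifiers(record):
    """Collect the Name, Title, and LCC values a record offers as work identifiers."""
    found = {"name": [], "title": [], "lcc": []}
    for field in record:
        tag, sub = field["tag"], field["subfields"]
        if tag == "100" and "a" in sub:
            found["name"].append(sub["a"])
        elif tag == "700" and field.get("ind2") == "2":
            # Only a second indicator of "2" marks a contained work.
            if "a" in sub:
                found["name"].append(sub["a"])
            if "t" in sub:
                found["title"].append(sub["t"])
        elif tag in TITLE_TAGS and "a" in sub:
            found["title"].append(sub["a"])
        elif tag == "500" and "a" in sub:
            # A note counts only when it opens with a translation phrase.
            match = TRANSLATION_NOTE.match(sub["a"])
            if match:
                found["title"].append(match.group(1).strip())
        elif tag in {"050", "090"} and "a" in sub:
            # Assumption: subfield "a" holds the class number; the rest is ignored.
            found["lcc"].append(sub["a"])
    return found
```

The values returned are candidates only; whether a record joins a work set still depends on comparing them with the Name, Title, and LCC attributes of the work sought.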
Constructing Work Identifiers Automatically
The work identifiers above were determined by manually examining records. However, one object of the research was to determine how effectively work identification could be done using automatic means. In other words, is it possible for work identifiers to be constructed using automatic means alone? Consequently, the next step in the research was to determine how effectively this could be done.
Two general means of automatically discovering work identifiers were found:
1) using author name and title information harvested from name authority records in the National Authority File (NAF), and
2) using work identifiers harvested directly from bibliographic records.
Because a goal of the research was to make automatic work set record identification more effective, it was natural to pursue the possibility of using information about authors and works present in authority records. Authority work is done specifically to make the retrieval and display of authors, works, subjects, and series consistent and effective for users. The NAF, because it contains information about persons and works, is likely to be the most effective place to discover name and (sometimes) title attributes. Moreover, although many works are not represented in work authority records in the NAF, voluminous works frequently are.
Authority records make it easy to include non-authorized as well as authorized forms of name as a part of Name. However, only non-authorized forms of name found in authority records for the author in 400 fields may be included, as these are the only forms that can be identified automatically. To ease matching constraints, only subfield “a” is included in Name. The records received for the project from OCLC included names with slight typos. Because the programming already exists to include these forms using an error correction algorithm, they were also considered to be non-authorized forms of name. The assumption of the research is that error correction algorithms would be effective in identifying errors in author names.
The next attribute used in the identification process is Title. Uniform titles are used specifically to name a work, and thus are included in Title. Work authority records, when present, may be used to discover the uniform title of a work and a variety of variant titles. Unfortunately, work authority records do not consistently contain all known variant titles, making it desirable to discover other variant titles using other methods. All of the works in the sample included a variety of variant titles.
In a real catalog, the steps necessary to construct effective work identifiers would begin after a user did a search for a particular work. First, an existing work-record clustering program such as the FRBR Work Set Algorithm would retrieve an initial set of bibliographic records representing the work. If the work is associated with an author name, the name appearing in the majority of 100 field, subfield “a” occurrences in those records would be searched automatically in the NAF. Names appearing in the “a” subfields of 100 and 400 fields of the authority record would be automatically collected, searched, and matched against 100 and 700 fields in the bibliographic records. The collection of all forms of name found would comprise Name. As stated above, this process would be more effective using an automatic error detection program to retrieve records with typos.
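A minimal sketch of the Name step follows, in Python. It assumes the same hypothetical field dictionaries for authority records, and it substitutes a simple similarity ratio from the standard library's `difflib` for the error correction algorithm mentioned above, which the paper does not specify. The 0.9 cutoff and all function names are assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def most_common_name(bib_names):
    """Pick the form of name found most often in 100 $a of the initial record set."""
    return Counter(bib_names).most_common(1)[0][0]


def build_name_attribute(authority_fields):
    """Collect authorized (100 $a) and non-authorized (400 $a) forms of name."""
    return {
        field["subfields"]["a"]
        for field in authority_fields
        if field["tag"] in {"100", "400"} and "a" in field["subfields"]
    }


def matches_name(candidate, name_forms, cutoff=0.9):
    """True when the candidate equals, or nearly equals, any collected form.

    The similarity ratio stands in for an unspecified error correction
    algorithm; the cutoff is an assumption, not a tested value.
    """
    cleaned = candidate.lower().strip(" ,.")
    return any(
        SequenceMatcher(None, cleaned, form.lower().strip(" ,.")).ratio() >= cutoff
        for form in name_forms
    )
```

A cutoff set too low would begin to accept names of other authors, so any real use would need the threshold tested against a sample of records.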
The most common 240 field title in the initial set of bibliographic records would then be searched in the NAF for authority records that contain the same titles in t subfields of the 100 field, and the same author name in subfield a. Then, all forms of title present in 400 t subfields would be collected. Next, the bibliographic file would be searched using these titles, matching them against 240, 245, 246, and 740 fields in bibliographic records containing the Name attribute. Here, the identification process requires Title to be present at the beginning of the “a” subfield in the title fields searched. It was quickly apparent from a review of the records in the sample why Title occurrences would have to be limited to the beginning of a field: sometimes (often?) preliminary wording indicates a related work, for example, "Sequel to Kidnapped." However, the process disregards any text following Title, whether it is in the same subfield (for example, subfield “a”) or another (for example, subfield “b”).
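The requirement that Title occur at the beginning of subfield “a” can be illustrated with a short Python sketch. The normalization shown (lowercasing and collapsing punctuation) and the function names are assumptions; the paper does not say how strings were compared. Leading articles and nonfiling characters are not handled here.

```python
import re


def normalize(text):
    """Lowercase and collapse runs of punctuation and whitespace to single spaces."""
    return re.sub(r"[^\w]+", " ", text.lower()).strip()


def starts_with_title(subfield_a, titles):
    """True when subfield "a" begins with one of the collected forms of Title.

    Text after the title is disregarded; text before it (for example,
    "Sequel to ...") prevents a match.
    """
    value = normalize(subfield_a)
    for title in titles:
        wanted = normalize(title)
        if not wanted:
            continue
        # Require a word boundary so a longer first word does not match.
        if value == wanted or value.startswith(wanted + " "):
            return True
    return False
```

For instance, `starts_with_title("Sequel to Kidnapped.", ["Kidnapped"])` is false, while a subfield that opens with the title and continues with other wording is accepted.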
Additional variant titles would be identified and collected as follows. After retrieving bibliographic records containing 100 and 240 fields, titles proper (245 field, subfield a) in those bibliographic records would be collected, harvesting any variant titles that had not yet been identified. This procedure has the potential to be particularly effective for identifying translated titles that do not appear in work authority records.
The last step would identify an LCC number associated with a work, if one existed. The LCC number would be harvested from 050 and 090 fields in bibliographic records. Since not all works are represented by a single LCC number, an algorithm would be developed that would: (1) identify commonly occurring LCC numbers in 050 and 090 fields in the initial work record set; (2) search that LCC number in the catalog; and (3) accept an LCC number as a work attribute only if the majority of records retrieved contain the name and title attributes, or variants of those attributes, of the work sought. Further research would be necessary to determine an effective threshold for acceptance of LCC number as a work identifier and any other steps necessary to exclude records for other works by the same author. Although unique class numbers do not exist for all highly-manifested works, it should be relatively easy to determine automatically if a work had one, as normally such a high percentage of records, especially LC records, representing such a work would have the same LCC number. This final step could also be followed by a search of 245 field forms of title that had not yet been identified, adding more variant titles to the algorithm.
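The three-step LCC test might be sketched as follows in Python. This is a hypothetical outline only: the catalog search is represented by a plain dictionary lookup, the test for name and title attributes is passed in as a function, and the default threshold of 0.5 simply encodes “the majority,” since an effective threshold remains to be determined.

```python
from collections import Counter


def accept_lcc(initial_lccs, catalog, is_work_record, threshold=0.5):
    """Return the most common LCC number if it passes the majority test, else None.

    initial_lccs: LCC numbers harvested from 050/090 fields of the initial set.
    catalog: mapping from LCC number to the records classed under it
             (a stand-in for searching the number in a real catalog).
    is_work_record: callable that is true when a record carries the Name
                    and Title attributes of the work sought.
    threshold: assumed share of matching records that must be exceeded.
    """
    if not initial_lccs:
        return None
    # Step 1: the most commonly occurring number in the initial set.
    candidate, _count = Counter(initial_lccs).most_common(1)[0]
    # Step 2: retrieve everything classed under that number.
    retrieved = catalog.get(candidate, [])
    if not retrieved:
        return None
    # Step 3: accept only if most retrieved records belong to the work.
    share = sum(1 for record in retrieved if is_work_record(record)) / len(retrieved)
    return candidate if share > threshold else None
```

Only the single most frequent number is tested in this sketch; a fuller version could test each commonly occurring number in turn.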
Excluding Non-Work set records
While a work set record identification process would be extremely useful for a database cleanup and record upgrade project, it could also be used on the fly to improve user searching. If used in this manner, the success of a work identification program would be contingent upon not simply its ability to identify records that belong in the work set (recall) but also its ability to exclude records that do not belong in it (precision). Many records that represent related works, such as adaptations, were incidentally retrieved in the course of assembling records for the study. These records would decrease the precision of a search or display for a work. Some of these records were retrieved by the O’Neill program and were included in the initial set of records sent to the researchers; others were retrieved in the search for relevant records in WorldCat. These records posed a problem for automatic work identification because they frequently contained the same attribute combinations that work set records contained, and could be misidentified as belonging to the work set. We recognize that records representing related works are of potential interest to a person searching for a work, but for the purposes of this paper, we are considering them not relevant to the search.
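For readers less familiar with the two measures, the short Python sketch below shows how recall and precision would be computed for an automatically identified set against a manually assembled work set. The function name and the use of record identifiers are assumptions made for illustration.

```python
def recall_precision(retrieved_ids, work_set_ids):
    """Compute recall and precision for an automatically identified record set.

    retrieved_ids: identifiers of records the automatic process selected.
    work_set_ids: identifiers of records judged to belong to the work set.
    """
    retrieved, relevant = set(retrieved_ids), set(work_set_ids)
    hits = len(retrieved & relevant)
    # Recall: share of true work set records that were found.
    recall = hits / len(relevant) if relevant else 0.0
    # Precision: share of selected records that truly belong.
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```

An adaptation wrongly admitted to the set lowers precision without affecting recall, which is why exclusion is treated separately from identification.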
Records assembled for the project but rejected because they were not work set records were analyzed to determine exclusion criteria. Like work identifiers, exclusion criteria were to be determinable automatically, using characteristics present in the records. The criteria for exclusion were developed to be as extensive as possible. It should be noted that the criteria presented here are so extensive that they might not perform effectively in a large database if used on the fly, whenever a user does a search. In other words, they might increase response time so much it would not be worthwhile to use them all. However, given the possibility of the partial use of exclusion criteria, we explored it as completely as possible here, to provide a variety of options.
In addition, as discussed below, some of the criteria may exclude too many relevant records. Again, we wished to provide an extensive list of exclusion criteria for those interested in doing further research in this area. It may be that some of the criteria are very effective in excluding irrelevant records, and others are not.
Records were excluded from work sets when these criteria were present:
a name-title added entry with a second indicator other than 2 and the absence of Name in a 100 field. Because of the inconsistent use of the second indicator, this step could exclude relevant records.
the word stems “dramat*”, “adapt*”, or the words “paraphrase” or "rewritten" in a 245, 500, or 520 field (to exclude dramatizations, adaptations, and paraphrases).
the stem “simpl*” or the words “retold”, “retelling”, or "revised" in a 245 field (to exclude simplifications, retellings, and revisions).
a “g” (projected medium, e.g., videorecording or motion picture) or “j” (musical sound recording) in the “Type” fixed field.
the stem “comic*” in a 650 field (to exclude comic strip versions).
the term “kit” in a 245 subfield “h” (to exclude kits, which frequently contain adapted versions of texts).
In addition, the following criterion was applied to exclude records:
Absence of a 100 field or presence of a 100 field with a name other than the name associated with the work in question, when a name-title added entry attribute combination is not present.
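The text-based and Type-based criteria listed above could be sketched roughly as follows. This is an illustration only, not the procedure used in the study (which was applied manually): the record structure is a hypothetical simplification in which a dict maps a MARC tag to a list of strings, with "type" holding the Type fixed field code and "245h" holding 245 subfield h. The criteria that depend on name-title added entries and the 100 field are not covered here.

```python
import re

# Stems and whole words from the criteria above, matched case-insensitively.
ADAPT_PATTERN = re.compile(r"\b(dramat\w*|adapt\w*|paraphrase|rewritten)\b", re.I)
SIMPL_PATTERN = re.compile(r"\b(simpl\w*|retold|retelling|revised)\b", re.I)
COMIC_PATTERN = re.compile(r"\bcomic\w*", re.I)
KIT_PATTERN = re.compile(r"\bkit\b", re.I)


def any_match(pattern, values):
    """True if the pattern occurs in at least one of the strings."""
    return any(pattern.search(value) for value in values)


def exclusion_reason(record):
    """Return a short reason if the record meets a criterion, else None.

    `record` is a hypothetical simplified structure, not a real MARC parser output.
    """
    def values(tag):
        return record.get(tag, [])

    if record.get("type") in ("g", "j"):
        return "Type fixed field g or j"
    if any_match(ADAPT_PATTERN, values("245") + values("500") + values("520")):
        return "adaptation term in 245, 500, or 520"
    if any_match(SIMPL_PATTERN, values("245")):
        return "simplification term in 245"
    if any_match(COMIC_PATTERN, values("650")):
        return "comic term in 650"
    if any_match(KIT_PATTERN, values("245h")):
        return "kit in 245 subfield h"
    return None


# Illustrative records (invented for this sketch).
adapted = {"type": "a", "245": ["Little women / adapted for the stage"]}
plain = {"type": "a", "245": ["Little women"], "650": ["Sisters -- Fiction"]}
reason_adapted = exclusion_reason(adapted)
reason_plain = exclusion_reason(plain)
```

Because the patterns are literal, a misspelled term in a record would not be caught, and a legitimate heading containing a listed stem would trigger exclusion.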
Several phrases that were considered as exclusion criteria were ultimately tallied as unclear, because it would not be possible to determine work status without examining the items themselves. In a real automatic work identification program, a determination of whether or not to include these phrases in the exclusion criteria would have to be made, but we decided to see what percentage of records would be affected. Phrases treated as unclear occurred most frequently in the 245 field and included:
“edited for college use”
“edited for school use”
“abridged for modern reading”
Criteria that would only sometimes indicate adaptation were not used to exclude records from a work set. For example, the phrases "based on" or "edited by" in a 245 field sometimes indicate adaptation, and sometimes not.
Collocating Work Set Records Automatically
To summarize, the work identification process is composed of the following steps to construct work identifiers and collocate work set records automatically:
1) Assemble an initial set of work set records using an existing algorithm. Preferably this algorithm would include processes of identifying slight variations, e.g., misspellings, in author names and titles.
2) Determine the Name attribute by searching authority records in the NAF using the predominant 100 form of name derived from the preliminary work set and collecting 400 forms of name.
3) Determine the Title attribute by:
a) searching authority records in the NAF using the predominant 100 form of name and the predominant 240 form of title derived from the preliminary work set to collect variant titles from $t forms in all matching work authority records, and
b) collecting remaining variant titles from 245 fields by searching all bibliographic records containing Name and titles collected in step 3a that match a 240 field, and collecting previously undiscovered titles from the 245 field.
4) Determine whether an LCC number unique to the work exists by setting a threshold for the percentage of work set records that have identical LCC numbers in an 050 or 090 field. This step could be followed by a search of 245 field forms of title that had not yet been identified, to add any remaining variant titles to Title.
5) Search bibliographic records using Name, Title, and LCC, retrieving records that contain the work identifiers.
6) If using the identification process on the fly for individual searches, exclude records using exclusion criteria (if the process were being used to correct and enhance records permanently, it would be more effective not to exclude records).
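Step 4 in the list above can be sketched as a simple frequency threshold. This is only one possible reading: the paper does not state a threshold value, so the default of 0.5 is an assumption, as is the choice to count records that lack an LCC number in the denominator. The function name and the call numbers in the example are illustrative.

```python
from collections import Counter


def predominant_lcc(lcc_per_record, threshold=0.5):
    """Return the most common LCC number if enough records share it, else None.

    `lcc_per_record` holds one entry per work set record: the LCC number taken
    from an 050 or 090 field, or None when the record has neither.
    The threshold is a hypothetical parameter, not a value from the study.
    """
    counts = Counter(lcc for lcc in lcc_per_record if lcc)
    if not counts:
        return None
    lcc, n = counts.most_common(1)[0]
    # Share is computed over all records, including those without an LCC number.
    return lcc if n / len(lcc_per_record) >= threshold else None


# Illustrative values only: three of five records share one class number.
sample = ["PS1017 .L5", "PS1017 .L5", "PS1017 .L5", None, "PZ7"]
chosen = predominant_lcc(sample)
strict = predominant_lcc(sample, threshold=0.7)
```

A higher threshold makes the LCC attribute less likely to be adopted for a work, which trades recall for a lower risk of retrieving records for other works that share the class number.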
Testing Effectiveness
After the work identification process was developed, the sample work set records were again analyzed manually, this time to apply the process, testing its effectiveness in actually identifying work set records in the sample. The work identifiers were matched with data in each bibliographic record assembled for the research project. At this point, additional decisions were made or refined regarding initial articles, diacritics, spacing, indicators, and position of the work identifiers in the fields and subfields. Initial articles, diacritics, and spacing were disregarded, regardless of MARC tagging. Indicators in MARC fields were ignored, as were subfield delimiters, other subfields, and other subfield content. Punctuation was disregarded unless it occurred in the middle of a word or call number.
The matching process was always truncated. As stated above, if Title was followed by text, either in the same subfield or another subfield, the remainder of the text was ignored. If LCC was followed by Cutter numbers or other characters appearing in the same subfield or additional subfields, they were ignored. For Name, only the "a" subfield was considered, and if additional text appeared in the subfield "a" after Name, it was ignored.
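The matching rules described above for Title (disregard initial articles, diacritics, and spacing; ignore punctuation except inside a word; ignore any text after the identifier) might be approximated as follows. This is a sketch under stated assumptions, not the study's procedure, which was carried out manually. The article list is illustrative and far from complete, and the function names are hypothetical.

```python
import string
import unicodedata

# Illustrative initial articles only; a real list would be much longer.
INITIAL_ARTICLES = ("the ", "a ", "an ", "les ", "le ", "la ")


def normalize(text):
    """Reduce a title string to a comparison key."""
    # Decompose accented characters and drop the combining marks (diacritics).
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in decomposed if not unicodedata.combining(c)).lower()
    # Strip punctuation at word edges but keep it inside a word.
    words = [w.strip(string.punctuation) for w in text.split()]
    text = " ".join(w for w in words if w) + " "
    # Drop one leading article, if present.
    for article in INITIAL_ARTICLES:
        if text.startswith(article):
            text = text[len(article):]
            break
    # Spacing is disregarded, so remove all remaining spaces.
    return text.replace(" ", "")


def truncated_match(identifier, field_value):
    """True if the field begins with the identifier after normalization."""
    return normalize(field_value).startswith(normalize(identifier))


matched = truncated_match("Three musketeers", "The three musketeers / by Alexandre Dumas")
unmatched = truncated_match("Bleak house", "Little women")
```

Because spacing is removed before comparison, this key also matches across word boundaries, so a stricter implementation might compare word by word instead.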
After an initial set of records was identified, variant titles were collected from 245 fields in records that also had 240 fields, and from records that had the phrases "Translation of" or "Trans. of" in a 500 field, and were added to the Title attribute. Records not yet identified as work set records were then reviewed against the expanded Title attribute.
After records with work identifiers were discovered, exclusion criteria were applied and records that met these criteria were excluded. Unfortunately, it was not possible to compare each bibliographic record with the actual physical item it represented, so researchers decided whether the record represented an edition or manifestation of the work studied based solely upon examination of the record.
Results
The identification process proposed here, which draws on data from authority records as well as multiple passes through bibliographic records to collect additional work data, was largely successful. Seventy-seven to 95 percent of all records for the sample works were correctly identified (tables 2-5). This means that the majority of records for all four works could be identified automatically using the work identification process. Excluding Little Women, which had a unique identification problem described below, only two percent of records were unidentifiable, and three to five percent were either misidentified as the work, incorrectly excluded, or unclear.
Only a single work identifier was tallied for each record. Name+Title was considered the preferred work identifier as it is the standard means of identifying works, and occurs more predictably than an LCC for a given work. For the sample work set records, Name+Title correctly identified a record as belonging to a particular work set in most cases. For some works, it is clear that LCC would be a highly effective attribute for identifying work set records that do not contain a correct Name+Title. For instance, the ability of the LCC to identify a work set record is particularly important for Three Musketeers, which is comprised of records that contain many varying or incorrect names and titles, thus limiting the ability of Name+Title to identify the record. The distinctive characteristics of Three Musketeers that probably account for the unidentifiable variant forms of name and title are that it was not published originally in English and it is an older work. Name-title added entry was the unique identifier of only 5 records in the entire sample.
Tables 2-5 also report four categories of problems with the automated identification process. The first category is work set records that are not identifiable using the identification process. Problems in this category are broken down by data type. A few records were not identifiable because both author and title data were unidentifiable. Given that use of uniform title is not required, it is not surprising that the most common reason for records not being identifiable is that the title data cannot be discovered automatically. This was especially a problem for Little Women, which proved to be an unusual challenge because it consists of two parts – Little Women (part 1) and Good Wives (part 2) – which are sometimes published separately. Although this information is relatively easy for a person to ascertain and understand from the records for Good Wives and Little Women, we could not determine a way in which an automatic identification program could discover the relationship between Good Wives and Little Women (see Figure 1). We need to bear in mind that some percentage of individual works will have unusual characteristics, such as this one, that will impede record identification. However, with a minimal amount of human intervention, almost all work set records could be identified.
[Figure 1 (Good Wives) about here.]
The second category, labeled "misidentified," indicates records that were mistakenly identified as belonging to the work set (see Figure 2). These records contained the work identifiers, but manual examination of the record revealed other information indicating they were definitely not representations of the work, or they were most probably not representations of the work. All of the records in this category were, however, records representing related works. For example, one of the Little Women records was misidentified as the work because the word "adapted" was spelled as "adpted." In this case, then, the exclusion criteria didn't work. Another example is that records for items in non-English languages contained non-English terms for "adaptation," etc. If this type of identification process were to be implemented, a fuller complement of these terms could be included.
[Figure 2 (Misidentified) about here.]
Sometimes the work status of a record was questionable because of indications of extent in the record. For example, a 50-page version of Little Women is likely to be an adaptation, an abridgement, or a part. The works represented in this study were all of significant length. Unabridged sound recordings generally ranged from 8 to 19 cassettes/sound discs. Because of this, records for sound recordings that had only 1 or 2 cassettes/sound discs were tallied as "misidentified as the work" unless the record specifically indicated abridgement or parts (automatic recognition of these records is discussed in Carlyle & Summerlin, 2002).23 Likewise, the works Little Women, Three Musketeers, and Bleak House were generally published in editions of 300 to 500 pages or more. For this study, records for these works that had 200 pages or fewer were also tallied as "misidentified as the work" if they were not explicitly identified as abridgements, condensations, or parts. The "misidentified as the work" category was included to keep estimates of the success of the automatic identification process conservative. In other words, we decided that it was better to underestimate the success of the automatic collocation method than to overestimate it.
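The extent rule described above could be expressed as a small check on the physical description. In the study this judgment was made by hand; the sketch below is an illustration that assumes extent statements such as "50 p." or "2 sound cassettes", and the function name, patterns, and parameters are hypothetical. The default limits (200 pages, 2 cassettes or discs) are the ones given in the text.

```python
import re

# Simplified patterns for common extent statements; real records vary widely.
PAGES = re.compile(r"(\d+)\s*(?:p\b|pages\b)", re.I)
UNITS = re.compile(r"(\d+)\s*(?:sound\s+)?(?:cassettes?|discs?)", re.I)


def suspicious_extent(extent, marked_abridged=False, max_pages=200, max_units=2):
    """True if the extent suggests an unmarked abridgement, adaptation, or part.

    `marked_abridged` stands for a record that explicitly indicates
    abridgement, condensation, or parts; such records are not flagged.
    """
    if marked_abridged:
        return False
    match = UNITS.search(extent)
    if match:
        return int(match.group(1)) <= max_units
    match = PAGES.search(extent)
    if match:
        return int(match.group(1)) <= max_pages
    # No recognizable extent: leave the record unflagged.
    return False


short_book = suspicious_extent("50 p. : ill. ; 19 cm.")
full_book = suspicious_extent("546 p. ; 20 cm.")
short_audio = suspicious_extent("2 sound cassettes")
```

These limits suit long novels like those in the sample; shorter works would need different values.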
The third category of problem records was records that should have been retrievable in the work set, but were excluded incorrectly by the identification procedure. Given this sample of works, the automatic exclusion process worked extremely well: only one record, in the Little Women work set, was excluded incorrectly (see Table 4 below). However, it must be noted that although the exclusion criteria worked well for the four works studied here, it is not certain that they would be effective for all other works. For example, the identification procedure would perform badly for any work that contained one of the exclusion word stems in a Title attribute. It would also perform badly, for example, for a work assigned the subject heading “Comic literature” because it would exclude relevant records containing that subject heading in a 650 field.
Another weakness of the exclusion criteria is that the stems, words, and phrases used here work only for English and a few other languages; they would not work to exclude records in many other languages. Further research on exclusion criteria is needed both to identify non-English language criteria and to test the effectiveness of the exclusion criteria identified here.
The final category consists of records that appeared to be either work set records or related work records, but for which it was impossible to determine which, given the lack or ambiguity of information in the record. This category reflects another failure of the cataloging record to provide sufficient information; physical examination of items would have been necessary to determine work status. An example of records of this type is given in Figure 3.
[Figure 3 (Unclear) about here.]