THE DESIGN OF BROWSING AND BERRYPICKING TECHNIQUES
FOR THE ONLINE SEARCH INTERFACE
by Marcia J. Bates
Graduate School of Library and Information Science
University of California at Los Angeles
Los Angeles, CA 90024-1520
Copyright © 1989
by Marcia J. Bates
All Rights Reserved
First, a new model of searching in online and other information systems, called "berrypicking," is discussed. This model, it is argued, is much closer to the real behavior of information searchers than the traditional model of information retrieval is, and, consequently, will guide our thinking better in the design of effective interfaces. Second, the research literature of manual information seeking behavior is drawn on for suggestions of capabilities that users might like to have in online systems. Third, based on the new model and the research on information seeking, suggestions are made for how new search capabilities could be incorporated into the design of search interfaces. Particular attention is given to the nature and types of browsing that can be facilitated.
As more and more different types of databases are brought online, the universe of information available to search online is beginning to resemble the vast array of sources available in manual print environments. From an original emphasis on bibliographic databases, which are the online equivalent of abstracting and indexing (A & I) services, databanks have expanded to the full text of journals and other documents, as well as directories, encyclopedias, and other reference sources traditionally available in libraries. Soon there really will be something approaching whole libraries accessible by computer.
As more types of resource are brought online, however, the searcher has a more complex search environment to consider, both in terms of types of sources to use and search techniques to employ with these sources [1,2]. We need to expand our understanding of these resources and the search techniques to use in them, and at the same time, expand our view of how the search interface should be designed to assist searchers with their new, complex tasks.
In this article the following is done:
•A new model of searching in online and other information systems, called "berrypicking," is discussed. This model, it is argued, is much closer to the real behavior of information searchers than the traditional model of information retrieval is, and, consequently, will guide our thinking better in the design of effective interfaces.
•The research literature of manual information seeking behavior is drawn on for suggestions of capabilities that users might like to have in online systems.
•Based on the new model and the research on information seeking, suggestions are made for how these capabilities could be incorporated into the design of search interfaces. Particular attention is given to the nature and types of browsing that can be facilitated.
II. A "BERRYPICKING" MODEL OF INFORMATION RETRIEVAL
The classic model of information retrieval (IR) used in information science research for over twenty-five years can be characterized as follows (Compare Robertson , especially p. 129):
This model has been very productive and has promoted our understanding of information retrieval in many ways. However, as Kuhn  noted, major models that are as central to a field as this one is, eventually begin to show inadequacies as testing leads to greater and greater understanding of the processes being studied. The limitations of the original model's representation of the phenomenon of interest become more and more evident.
It is only fitting, then, that in recent years the above classic model has come under attack in various ways [5-8]. Oddy  and Belkin et al.  have asked why it is necessary for the searcher to find a way to represent the information need in a query understandable by the system. Why cannot the system make it possible for the searcher to express the need directly as they would ordinarily, instead of in an artificial query representation for the system's consumption?
At the other end of the model, that of document representation, powerful developments in computing make possible free text and full text searching so that the traditional document representation (controlled vocabulary) takes on a different role and, for some purposes, is less important in much information retrieval practice.
Here I want to challenge the model as a whole--to the effect that it represents some searches, but not all, perhaps not even the majority, and that with respect to those it does represent, it frequently does so inadequately. As a formal model for testing, it has taught us much; as a realistic representation of actual searches, it has many limitations. As a consequence, as long as this model dominates information science thinking, it will limit our creativity in developing IR systems that really meet user needs and preferences.
The model I am about to propose differs from the traditional one in four areas:
1. Nature of the query.
2. Nature of the overall search process.
3. Range of search techniques used.
4. Information "domain" or territory where the search is conducted.
The first two areas will be dealt with in this section and the second two in the next section.
Let us return for a closer look at the classic model. Fundamental to it is the idea of a single query presented by the user, matched to the database contents, yielding a single output set. One of Gerard Salton's  contributions to research in this area was the idea of iterative feedback to improve output. He developed a system that would modify the query formulation based on user feedback to the first preliminary output set. The formulation would be successively improved through the use of feedback on user document preferences until recall and precision were optimized.
But Salton's iterative feedback is still well within the original classic model as presented in Figure 1--because the presumption is that the information need leading to the query is the same, unchanged, throughout, no matter what the user might learn from the documents in the preliminary retrieved set. In fact, if a user in a Salton experiment were to change the query after seeing some documents, it would be "unfair," a violation of the basic design of the experiment. The point of the feedback is to improve the representation of a static need, not to provide information that enables a change in the need itself.
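The feedback loop described above can be illustrated with a small sketch. It uses a Rocchio-style update, a standard textbook form of relevance feedback, as a stand-in; it is not claimed to be the exact procedure used in the experiments discussed here. The function name, the term weights, and the parameters alpha, beta, and gamma are assumptions made for illustration. Note what the update does and does not do: it moves the representation of the query toward documents judged relevant, while the underlying need is assumed to stay fixed.

```python
from collections import defaultdict


def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a revised term-weight dict for one fixed information need.

    query, and each document in relevant/nonrelevant, is a dict of
    term -> weight. alpha, beta, gamma are illustrative defaults (assumed).
    """
    updated = defaultdict(float)
    # Keep the original query terms, scaled by alpha.
    for term, weight in query.items():
        updated[term] += alpha * weight
    # Move toward the average of the documents the user judged relevant.
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)
    # Move away from the average of the documents judged not relevant.
    for doc in nonrelevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(nonrelevant)
    # Drop terms whose weight is not positive.
    return {term: w for term, w in updated.items() if w > 0}


# Hypothetical toy data: one query term, one judged document of each kind.
query = {"browsing": 1.0}
relevant = [{"browsing": 1.0, "interface": 1.0}]
nonrelevant = [{"shelving": 1.0}]
new_query = rocchio_update(query, relevant, nonrelevant)
```

In this toy run the query gains the term "interface" and never acquires "shelving", but it is still treated as a better statement of the same need. Nothing in the procedure allows the need itself to change.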
So throughout the process of information retrieval evaluation under the classic model, the query is treated as a single unitary, one-time conception of the problem. Though this assumption is useful for simplifying IR system research, real-life searches frequently do not work this way.
In real-life searches in manual sources, end users may begin with just one feature of a broader topic, or just one relevant reference, and move through a variety of sources. Each new piece of information they encounter gives them new ideas and directions to follow and, consequently, a new conception of the query. At each stage they are not just modifying the search terms used in order to get a better match for a single query. Rather the query itself (as well as the search terms used) is continually shifting, in part or whole. This type of search is here called an evolving search.
Furthermore, at each stage, with each different conception of the query, the user may identify useful information and references. In other words, the query is satisfied not by a single final retrieved set, but by a series of selections of individual references and bits of information at each stage of the ever-modifying search. A bit-at-a-time retrieval of this sort is here called berrypicking. This term is used by analogy to picking huckleberries or blueberries in the forest. The berries are scattered on the bushes; they do not come in bunches. One must pick them one at a time. One could do berrypicking of information without the search need itself changing (evolving), but in this article the attention is given to searches that combine both of these features.
Figure 2 represents a berrypicking, evolving search.
In Figure 3 we see the size of the picture shrunk in order to show the context within which the search takes place.
The focus of the classic model in Figure 1 is the match between the document and query representations. The focus of the model in Figures 2 and 3 is the sequence of searcher behaviors. The continuity represented by the line of the arrow is the continuity of a single human being moving through many actions toward a general goal of a satisfactory completion of research related to an information need. The changes in direction of the arrow illustrate the changes of an evolving search as the individual follows up various leads and shifts in thinking. The diagram also shows documents and information being produced from the search at many points along the way.
In the case of a straightforward single-match search of the classic sort, we can think of the arrow as being very short and straight, with a single query and a single information output set. Thus, we can see that this model differs from the classic one in the first two respects mentioned above: 1) The nature of the query is an evolving one, rather than single and unchanging, and 2) the nature of the search process is such that it follows a berrypicking pattern, instead of leading to a single best retrieved set.
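One way to make these two differences concrete is a small data sketch. All class, field, and method names below are hypothetical, introduced only for illustration. A classic search would hold one query and one output set; an evolving, berrypicking search holds a sequence of stages, each with its own conception of the query and its own picked items, and the result of the search is everything gathered along the way.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    query: str    # the searcher's conception of the query at this stage
    picked: list  # references or bits of information kept at this stage


@dataclass
class BerrypickingSearch:
    stages: list = field(default_factory=list)

    def move(self, query, picked):
        """Record one stage: the current query and whatever was picked."""
        self.stages.append(Stage(query, list(picked)))

    def basket(self):
        """Everything picked so far, in order of first pick, no repeats."""
        seen, result = set(), []
        for stage in self.stages:
            for item in stage.picked:
                if item not in seen:
                    seen.add(item)
                    result.append(item)
        return result

    def query_changed(self):
        """True if the query itself evolved, not just the results."""
        return len({stage.query for stage in self.stages}) > 1


# Hypothetical session with placeholder queries and references.
search = BerrypickingSearch()
search.move("topic as first understood", ["reference 1"])
search.move("narrower aspect suggested by reference 1", ["reference 2", "reference 3"])
search.move("related question raised by reference 3", ["reference 3", "reference 4"])
```

A classic single-match search is the special case with exactly one stage, which corresponds to the very short, straight arrow mentioned above.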
There is ample evidence of the popularity of searches of the evolving/berrypicking sort. Reviews of research by Line , Hogeweg-de Haart , Stone , and Stoan  attest to the popularity of this approach in a variety of environments, particularly in the social sciences and humanities. A recent landmark study by Ellis  on social scientists supports and amplifies the results of earlier studies. Kuhlthau's work  with high school students suggests that there is a great deal of exploratory searching that goes on, both before and after a topic for a paper is selected. While the research reviewed here refers largely to the academic environment, I would suggest that many searches by people in many contexts other than academic can also be better characterized by the berrypicking/evolving model than by the classic IR model. The sources consulted may differ, but the process is similar.
III. HOW AND WHERE USERS SEARCH FOR INFORMATION NOW
It was argued in the previous section that information seekers in manual environments use a berrypicking/evolving search mode. In this section we will examine in more detail some of the search techniques used and information sources consulted by users in manual environments.
We might be tempted to say that the path taken in Figures 2 and 3 is simply a series of mini-matches of the classic sort. That is, that at each point where searchers identify documents of interest, they are making a match as represented in Figure 1, and that Figure 2 is simply a representation of searching at a higher level of generality. To make that assumption, however, would be to misrepresent what is being proposed here. Figure 2 is different in essential character, not just in level of generality. Specifically, in a real search there are many different ways people encounter information of interest to them. We will discuss several of them below. Only one of those ways is the kind represented by the classic model.
Users employ a number of strategies. With the help particularly of Stoan  and Ellis , I will describe just six of them, which are widely used:
• Footnote chasing (or "backward chaining" ). This technique involves following up footnotes found in books and articles of interest, and therefore moving backward in successive leaps through reference lists. Note that with this technique, as with other citation methods, the searcher avoids the problem of subject description altogether. This method is extremely popular with researchers in the social sciences and humanities. See, for example, Stenstrom and McBride .
• Citation searching (or "forward chaining" ). One begins with a citation, finds out who cites it by looking it up in a citation index, and thus leaps forward.
• Journal run. Once, by whatever means, one identifies a central journal in an area, one then locates the run of volumes of the journal and searches straight through relevant volume years. Such a technique, by definition, guarantees complete recall within that journal, and, if the journal is central enough to the searcher's interests, this technique also has tolerably good precision. In effect, this approach exploits Bradford's Law: the core journals in a subject area are going to have very high rates of relevant materials in that area.
• Area scanning. Browsing the materials that are physically collocated with materials located earlier in a search is a widely used and effective technique. Studies dating all the way to the 1940's confirm the popularity of the technique in catalog use. Frarey , in reviewing three of those early studies, found that use of the subject catalog is divided about equally between selecting books on a subject on the one hand, and finding the shelf location of a category in the classification in order to make book selections in the stacks on the other hand. The latter is, of course, the sort of area scanning described here. Recent work by Hancock  again confirms the importance of this approach.
• Subject searches in bibliographies and abstracting and indexing (A & I) services. Many bibliographies and most A & I services are arranged by subject. Both classified arrangements and subject indexes are popular. These forms of subject description (classifications and indexing languages) constitute the most common forms of "document representation" that are familiar from the classic model of information retrieval discussed earlier.
• Author searching. We customarily think of searching by author as an approach that contrasts with searching by subject. In the literature of catalog use research, "known-item" searches are frequently contrasted with "subject" searches, for example. But author searching can be an effective part of subject searching as well, when a searcher uses an author name to see if the author has done any other work on the same topic .
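The first two techniques in this list can be read as the two directions of travel over the same citation graph. The sketch below assumes a toy dictionary, cites, that maps each document to the documents in its reference list; the document labels and function names are hypothetical.

```python
def backward_chain(cites, start):
    """Footnote chasing: the documents that `start` itself cites."""
    return sorted(cites.get(start, []))


def forward_chain(cites, start):
    """Citation searching: the documents whose reference lists include `start`."""
    return sorted(doc for doc, refs in cites.items() if start in refs)


def successive_leaps(step, cites, start, leaps):
    """Apply one chaining step repeatedly, collecting every document reached."""
    frontier, found = {start}, set()
    for _ in range(leaps):
        reached = {d for doc in frontier for d in step(cites, doc)}
        frontier = reached - found - {start}
        found |= frontier
    return sorted(found)


# Toy citation data (assumed): each key cites the documents in its list.
cites = {
    "C": ["A", "B"],
    "D": ["B", "C"],
    "E": ["C"],
}

older = backward_chain(cites, "C")   # what C cites
newer = forward_chain(cites, "C")    # what cites C
two_back = successive_leaps(backward_chain, cites, "D", 2)
```

Neither function takes a subject term as input, which mirrors the point made above that citation methods avoid the problem of subject description altogether.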
Until now most of the emphasis in online databanks and other automated IR systems--theoretical, experimental, and operational--has been on use of just one of the above techniques, namely, searching abstracting and indexing services. It is assumed that to do an automated information search, one is searching on a bibliographic database, a list of references with or without abstracts, that is just like an abstracting and indexing service, except that it is online. In experiments, the "document representations" in the classic IR model may involve very sophisticated methods, but most come down to some form of representation of the contents of documents that is usually much shorter, and different from, the documents themselves. In short, most IR research, until a recent flurry of interest in full text databases, has been research on databases of document surrogates.
Real searches, by contrast, use all the above techniques and more, in endless variation. It is part of the nature of berrypicking that people adapt the strategy to the particular need at the moment; as the need shifts in part or whole, the strategy often shifts as well--at least for effective searchers. So, to return to an earlier point, the berrypicking model does not represent a number of mini-matches of the classic sort, i.e., between search term and A & I service (database) term. Rather, the evolving/berrypicking search also involves the third and fourth features mentioned earlier: 3) the search techniques change throughout, and 4) the sources searched change in both form and content.
We have generally assumed in library/information science that the fifth technique in the list above, the A & I search, is clearly superior to the others. That is an important reason for the primacy given to the bibliographic search in our research and practice. However, Stenstrom and McBride  found, when they asked social science faculty where they got the references for journal articles they used, that over 87 percent of them said they got the references from abstracting journals only occasionally, rarely, or never (p. 429). They relied far more heavily on footnote chasing: 69 percent (p. 429). Both Stoan  and Ellis  provide evidence and are very persuasive on the power and effectiveness of these other techniques for academic researchers and students at the very least.
Some of the other search techniques described above are possible on some systems--see, for example, Palay & Fox , Croft & Thompson , Cove & Walsh , Noerr & Noerr . See also Hildreth's masterly review of intelligent interfaces for bibliographic retrieval systems . Nowhere, to my knowledge, however, are all of these techniques easily applied by a searcher within a single system.
A model containing a unified perspective, incorporating the full range of searcher behaviors in the information seeking process, may make it easier to design many more such features for information retrieval systems. Ellis  has presented the results of his own research on social scientists and, on that basis, argues for the implementation of most of the above techniques, as well as others not discussed here. The particular mix of different capabilities that should ultimately be made available is a question deserving much more attention in the future.
Citation searching is also available, of course, in online systems in the Institute for Scientific Information databases. This searching method is now widely accepted in library/information science as another valuable database approach. Not all readers may be aware, however, of how hard Eugene Garfield had to work in the 1960's and 1970's to persuade librarians of the value of citation searching. I vividly recall observing an otherwise very capable reference instructor telling a class in the late 1960's that a citation index was a waste of money, that it was just a vanity publication for professors--its only value being for them to look up and see who was citing their own work. My point here is that we have yet to fully accept all six of these techniques as valid, effective approaches to information. Even citation indexing, now widely used, was not received easily into the thinking of library/information science.
From the standpoint of general effectiveness in searching, it is clear, on reflection, however, that, other things being equal, the searcher with the widest range of search strategies available is the searcher with the greatest retrieval power. We in information science feel that information searchers should take more advantage of A & I services in online or manual form. We, in our turn, should recognize that these other techniques used so commonly by researchers must have some real value for them, and that there may be times when they are preferable (see Stoan ). With each of the six retrieval techniques described above, it is possible to think of instances when that technique is clearly superior to the others as a route to the desired information.
I would argue on two grounds that these techniques should all be available in at least some future automated IR systems, and that our model of information retrieval should include berrypicking through use of these and other techniques:
1. The more different strategies searchers can use on an information store, the more retrieval effectiveness and efficiency is possible.
2. There are many experienced searchers who use these techniques already--in a berrypicking mode--with great satisfaction. These approaches represent well established patterns that are handed down from scholars to their students and which work well for them in many cases. If we want to meet users' needs, we should enable them to search in familiar ways that are effective for them.
To summarize the argument to this point, this model of searching differs from the traditional one not only in that it reflects evolving, berrypicking searches, but also in that it reflects searches in a much wider variety of sources, using a much wider variety of search techniques, than has been typically represented in information retrieval models to date. With this broader picture of information retrieval in mind, many new design possibilities open up. In the next section, some of those possibilities will be examined, with particular attention to the role of browsing in the broader search process.
IV. SEARCH CAPABILITIES FOR A BERRYPICKING SEARCH INTERFACE
Browsing. The view of searching as frequently being an evolving/berrypicking process, and one which uses a variety of types of information sources and search techniques, changes our sense of what browsing capabilities should be like in online systems, and how the database and the search interface should be designed.
Concepts of browsing in IR systems are becoming more and more sophisticated. See Noerr & Noerr , Wade & Willett , Cove & Walsh , Hildreth , Bawden , Ingwersen & Wormell . But there is still a lingering tendency in information science to see browsing in contrast to directed searching, to see it as a casual, don't-know-what-I-want behavior that one engages in separately from "regular" searching.
However, as Ellis notes , browsing is an important part of standard information searching; he calls it "semi-directed or semi-structured searching" when used this way. He recommends that browsing of a variety of types of information, e.g., contents pages, lists of cited works, subject terms, should be made available in automated systems. He further argues that, since the user is doing the browsing and we therefore do not have to design a cognitive model of user browsing into the system, providing browsing features should be relatively simple.
Relatively simpler perhaps, but making effective provision for browsing capabilities involves its own complexities. The techniques above combine browsing and conventional use of the information access apparatus in a variety of specific configurations. With all of the six techniques above, as well as with other features that might be designed for browsing, it will be desirable to set up combinations of features that incorporate browsing in different ways in each case.
The nature of browsing associated with each of the techniques listed above is examined in more detail below. Key design features recommended for automated IR systems will be stated for each technique.
So that there is no confusion, however, I want to emphasize that browsing and berrypicking are not the same behavior. There will be a great deal of discussion of browsing in the remainder of this article, but only because browsing has gotten less attention in our field than other kinds of searching. Berrypicking involves the use of a wide variety of techniques, some of which are very standard, and others which involve a considerable amount of browsing. One of the points emphasized in this model is precisely that people use a wide variety of techniques.
Each of the six techniques is discussed below, followed by some general points about database and interface design for berrypicking and browsing.
Footnote chasing: In footnote chasing one might want both to be able to browse through the article or book that generates the references as well as through the list of references--in fact, to move back and forth easily between the two parts of the document. The body of information browsed in footnote chasing has a coherence and meaning that clusters around the idiosyncratic purposes of the author of the article or book. Browsing in the footnotes or endnotes will be minimal if the searcher only looks up individual references found in the text, and sticks to them. Browsing of the references can be more extensive if the searcher scans the list, independently of an originating textual reference.
Key design features: User can get the following easily, preferably by direct manipulation, e.g., with mouse and pull-down menus: 1) Overview of document contents--chapter or section headings, 2) Full text of documents and references, 3) Ability to jump back and forth between text and references.
Citation searching: In citation searching, one might want either to browse the set of references that cite a given starter reference, or read any of the citing articles. No single human created this grouping of citations; rather they come together because they all happened to cite the originating reference; they may otherwise be quite unrelated. Such a collection of references is likely to be stimulating to creativity, as the citing articles may not be on the "same" topic in the conventional sense, yet nonetheless create a grouping that has at least one key thread of similarity that may go along unconventional lines. (See also Bawden .) Because of this unconventional grouping, the user might well want to expand the search indefinitely in any direction, that is, upon finding a citing article, learn which articles cite it, and so on.
Key design features: Users should have the ability to 1) Scan lists of citing references, 2) Make simple single step jumps to a) full text of citing articles, b) full list of references in citing article, and 3) Make jumps in any direction ad infinitum, i.e., the user should not have to "return to go" and reenter a starting article for each jump in any direction.
Journal run: Looking through journals manually, the searcher flips through issues, scanning large chunks of the text of the articles, as well as the contents lists and abstracts. Here the grouping of articles is that subject area represented by the coverage of the journal. When the journal has a very broad subject coverage, such as that of Science or American Psychologist, it is unlikely to meet a searcher's need for information on a topic of the normal degree of specificity associated with a research project. To put it differently, browsing such general journals is probably more useful for general monitoring of the environment than for meeting a well defined need.
In cases where the journal coverage is a more specific subject area, however, reviewing the contents lists or articles in that journal may be an excellent way to see quickly a large number of articles exactly in the heart of an area that interests one. The grouping of articles that results from their joint publication in a journal can be expected to be coherent and well thought out, since the focus of journals is generally well defined by editors for prospective authors.
Key design features: 1) Easy specification of journal title and starting date in a journal run search, 2) Easy jumps between contents lists and articles and back again, 3) Capability of requesting, if wanted, standard section headings in scholarly articles, such as "Methodology," or "Conclusions," so the searcher is shown these sections directly.
Area scanning: This technique is most commonly used with books arranged by a library classification scheme on the shelves of a library. With area scanning, one may either follow the exact arrangement of the classification scheme by reading linearly along the shelves, or alternatively, and, I suspect, more commonly, deliberately not follow that order. In practice, one of the most useful aspects of area scanning is that one can visually scan in a random manner over the shelves in a subject area of interest.
The effect of this latter method is to "jump the rails" of the classification scheme, to skip to other parts of the scheme that are near the starting point, without having to look at every single intervening book and category. This technique represents a deliberate breaking up of the conventional classified order, while enabling the searcher to remain in the same general initial subject area. Thus the search domain may consist of a variety of specific areas within one larger area.
Area scanning is the quintessential form of browsing in manual environments. As noted earlier, the research shows that it has remained very popular over many years among users. It is reasonable to presume that it meets some real needs. More research into why this approach is popular is desirable. However, here are a couple of guesses: 1) The searcher is exposed to a variety of related areas, some of which, because of the jumping around, may be related in unexpected ways--thus producing serendipitous discoveries. 2) The searcher can look directly at the full text of the materials. By flipping through the pages and reading a passage here and there, the searcher gets a quick gestalt sense of the "feel" or character of the author and his or her approach. Whatever that feel is, it is almost never accessible through any classification or subject description.
Key design features: 1) A library's listing of its books on the shelves arranged by the order of the classification scheme is called a shelf list. Thus, for area scanning linearly along the shelves, a capability of browsing the shelf list can be provided. 2) For "jumping the rails" of the classification scheme, browsing at several levels of generality within the classification scheme itself can be provided, i.e., giving the searcher the option of browsing a list of the most general categories in the scheme, or a list of the general categories plus their subdivisions, and so on, down to the full detail of the scheme. 3) At any point, with either of the first two capabilities listed in this section, the searcher should be able to ask for "snapshots" of full text of books (more discussion later).
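The second feature, browsing the scheme at several levels of generality, amounts to listing a tree of categories down to a chosen depth. The sketch below uses a tiny made-up scheme; the category names, the nested-dictionary layout, and the browse function are assumptions for illustration and do not reproduce any actual classification.

```python
def browse(scheme, max_depth, depth=1, indent="  "):
    """List category names down to max_depth, indented by level."""
    lines = []
    for name, subdivisions in scheme.items():
        lines.append(indent * (depth - 1) + name)
        # Descend only if the searcher asked for more detail.
        if depth < max_depth:
            lines.extend(browse(subdivisions, max_depth, depth + 1, indent))
    return lines


# Made-up scheme: each category maps to a dict of its subdivisions.
scheme = {
    "Science": {
        "Physics": {"Optics": {}, "Acoustics": {}},
        "Chemistry": {},
    },
    "Technology": {"Engineering": {}},
}

most_general = browse(scheme, 1)
with_subdivisions = browse(scheme, 2)
```

Calling browse with a larger max_depth shows progressively more of the scheme, so the searcher can skip across neighboring categories at whatever level of detail is convenient rather than stepping through every intervening entry.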
Subject searching in bibliographies and A & I services: In discussions of "browsing" in online databases, the term usually refers to reading short lists of alphabetically arranged subject terms or reading citations and their associated abstracts. But, in fact, in such activities, there is little sense of the random visual movement usually associated with browsing. Indeed, the lists of terms printed out are short, and the printing of citations is costly, so searchers often keep it to a minimum. When the cost of printing out abstracts falls, and/or CD-ROM database use becomes more widespread, true browsing may be easier to do.
It may help the discussion here if we compare the manual form of A & I services, and consider how they are used for browsing. We may be able to do more, of course, with the online form, but let us first see if the text lends itself to browsing in principle. A very common pattern in manual forms of A & I services is to arrange the abstracts by a classified order, and attach a subject index using more specific subject terms. When an online searcher searches by controlled vocabulary, or by free text on the titles and abstracts, all the entries associated with the more specific subject terms are brought together in one location, so they become easy to examine. In the manual form, usually only the abstract numbers are brought together in the index. So grouping entries by these specific terms is a useful function of online services, though the browsing potential is limited for the reasons given above.
Since the A & I services generally arrange the abstracts by a classified order, it is possible in the manual form to browse through the abstracts in a classified section. This is generally impractical in online databases unless the search is also limited to certain dates or issues of the service, because the online database usually combines many years of the service in one, and each classification category therefore contains very large numbers of items (see Bates ). However, in a database in which cost per reference is not a factor, then some sort of browsing in the classified sections might be possible, particularly if brief forms of the reference were printed out, so many could be seen on the screen at once.
Key design features: The user should have the capability of 1) Rapid browsing of many references without cost, and/or ability to ask to see every nth reference in a large set (see further discussion in Bates , p. 21ff.), 2) Browsing the classification used in an A & I service, as well as abstracts within each classification, either all or every nth one.
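Showing every nth reference in a large set is simple to express. The sketch below assumes the retrieved set is an ordered list of brief references; the function name, the offset parameter, and the placeholder labels are hypothetical.

```python
def every_nth(references, n, offset=0):
    """Return every nth reference, starting at position `offset`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return references[offset::n]


# Placeholder set of ten brief references.
refs = [f"ref {i}" for i in range(1, 11)]
sample = every_nth(refs, 4)
```

Changing the offset on a second pass shows a different slice of the same set, which gives the searcher a rough impression of a large classified section without displaying all of it.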
Author searching: Author searching makes sense as a form of subject searching in that authors tend to write on similar things from one article or book to another. Thus, if one item pays off, maybe another by the same person will too. While bibliographies and catalogs have brought together in one place the references to an author's work since time immemorial, it would be a novel contribution of online systems if they made it possible to see grouped in one place the full text of an author's works. Library stacks do it for books, but there is currently no way to bring together other forms of publication, or to combine book texts with those other forms. When the day comes that full text online becomes very cheap, this grouping of an author's work in one place will be possible. The question in the meantime is, can we design the interface to make it easy to "flip through" the pages of the author's work?
Key design features: When author searching, the user should have the capability of calling up 1) Bibliographies of authors' works, 2) "Snapshots" of the text of works (see discussion later), and 3) Features that enable footnote chasing and citation searching.
Each of these approaches can be seen as a different way to identify and exploit particular regions in the total information store that are more likely than other regions to contain information of interest for the search at hand. To put it differently, these are different ways of identifying berry patches in the forest, and then going berrypicking within them.