Transformation and Management of XML data

By Michael Christoffersen, Lars Nyberg and Qi Zang


One of the most valued resources today is information, and the ability to manage and share this information across multiple technological platforms and applications. This can be difficult due to the many thousands of different document formats and the incompatibility between them. The use of Extensible Markup Language (XML), whose use has grown enormously in the last few years due to its flexibility and the many applications supporting the format, promises to solve at least some of these problems.
Our project is to transform documents in the BibTeX format into a new XML format (xBib), thus making it possible to manage and transform the data stored in the documents. These actions should be performed through a web service with an interface available in a standard web browser. This report is the documentation of this product as well as a description of the thoughts and considerations involved in completing the project.

Project definition

The project is centred on formats for bibliographic data, specifically the BibTeX format, which is used in conjunction with LaTeX documents. The purpose is to transform BibTeX bibliographies into an XML format so that different tasks can be performed with the data contained in the files. More specifically, the aims of the project can be defined as follows:

  • Definition of an XML format in DTD or XML Schema for the BibTeX data.

  • Transformation of BibTeX data to the defined XML format (by a Java program).

  • Implementation of simple queries on the XML files in an XML query language.

  • XSLT transformation of output into XHTML.

  • Implementation of a web service based on the above queries and transformations.

  • Implementation of a simple web interface using the web service.

Project Plan

The defined aims are quite loose and can be accomplished in many different ways. The following section specifies more precisely how we intend to structure the process of analysing the many possibilities and building the actual products.
First of all we need to define the data that we are working on. In defining the new xBib format we will also examine the problems and constraints that the BibTeX format imposes on us. Secondly, we need to build the parser in small steps, each bringing us closer to the needed output. The first step should be to define, discuss and build methods for reading the files into our parser. Next, the goal should be to separate the different parts of the BibTeX data and to transform them into elements. From there the parser should continue to develop and to handle more and more problems and exceptions in the source files. While building the parser we should also develop the XSLT and the queries based on example files produced while defining the format. Finally, all these parts are put together in a web service and a simple interface is constructed.

Terms and definitions



Catalogue: A collection of bibliographic records for documents in a collection, such as a library.

Bibliographic record: A collection of bibliographic data describing a document. In this report the term entry is often used instead.

Bibliographic data: Data such as title, author etc. contained within a bibliographic record.

About the authors

Qi Zang is
Michael Christoffersen is a librarian, graduated from the Royal School of Library and Information Science in 2002, and is currently studying at The IT University of Copenhagen’s Master of Science in Information Technology (Internet Technology).
Lars Nyberg is a librarian, graduated from the Royal School of Library and Information Science in 2002, and is currently studying at The IT University of Copenhagen’s Master of Science in Information Technology (Design, Communication and Media).

Defining the format

One might think that the simplest task in our project would be the definition of a new XML format for the bibliographic data. But defining a format should never be a simple task. Each decision we make in constructing our XML format, xBib, will influence everything we do later in the process. We need to consider not only the short-term goals of our web service and the way BibTeX is constructed, but also the more fundamental requirements that bibliographic information demands.

Bibliographic formats

The purpose of a bibliographic record is to describe a given document: not the actual intellectual content, but the document as an existent object (Hagler 1997). A collection of bibliographic records, a catalogue or a bibliography, is often used with four generic tasks in mind related to these objects: to find, to identify, to select and to obtain (IFLA 1998). These tasks are the basis for the requirements asked of a bibliographic format and the bibliographic data contained in a record. Different types of documents require specific types of information to fulfil, for example, the task of selecting a document. In a record describing a book one might need information about the language or edition of the text, while a record for a software application would benefit from information about hardware requirements. The task of identification benefits greatly from a unique identifier such as the ISBN or ISSN for books and periodicals, while the URI serves the same purpose for an online document and is at the same time important for the ability to obtain the document.
When building a bibliographic format one should consider which of the four tasks to prioritise and which document types the format should describe. The format should specify which data should be included, but also consider how this information should be expressed and possibly reformatted from how it is expressed in the document itself – e.g. if the publisher’s name is misspelled, should it be corrected in the bibliographic record, and if the document has more than three authors, should they all be included?

The BibTeX format

The BibTeX format was designed back in 1985 as a format for bibliographies, primarily intended to hold bibliographic data for references cited within LaTeX documents. LaTeX is a so-called “document preparation system for high-quality typesetting” (LaTeX3 Project 2002), and BibTeX shares some of its syntax. LaTeX is widely used all over the world, especially for scientific papers, and this bias is very visible in the BibTeX format. The different document types (called entry types in BibTeX) and the choice of bibliographic information for these types (i.e. the different fields) are centred on academic and technical publications, especially publications from the Anglo-American world (Dempsey & Heery 1997). Furthermore, the specification is quite vague in describing how data in the different fields should be written (except for author and title), and fields like institution, organization and school are tied to specific entry types even though their purpose is very much the same (a summary of the different fields and entry types in BibTeX is included in Appendix A of this report).
A BibTeX bibliography consists of different entries, each with a citation key and a number of fields containing bibliographic data. The syntax is loose in the sense that either braces or quotes can embrace the data in a field, and for data containing only numerals nothing at all is necessary. It is also possible to include any non-standard field or entry type in the file; such data is simply ignored by most applications. BibTeX files can also include comments and a string command, where the latter is used to define abbreviations used in the entries.
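For illustration, a small constructed BibTeX file showing these features might look as follows (the entry and the abbreviation are invented for this example, not taken from the project’s test files):

```bibtex
% A comment line.
@string{ituc = "The IT University of Copenhagen"}

@article{smith2002,
  author  = "John Smith and Jane Doe",
  title   = {An Example Article},
  journal = ituc,
  year    = 2002
}
```

Note the mixture of quotes and braces around field values, the bare numeral in the year field, and the journal field referring to the abbreviation defined by the string command.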

Constructing xBib

In the construction of our new XML format, which we have chosen to call xBib, we must apply the knowledge we now have of BibTeX and of the purpose of bibliographic records. First of all we must decide how close to BibTeX, and thus also to its limits, we wish xBib to stay. We could define a format where the entry types were more generic and where specific fields for ISBNs or URIs were present. We could also be stricter with the data contained in each file, and thus specify that chapter should always be a numeric value or that month could only contain the abbreviated name of each month. However, since our format is primarily used with data transformed from BibTeX, we are quite constrained by the data already present – or by data missing even though we would have liked it to be present. It would also be convenient to be able to transform our xBib files back into BibTeX after we have performed our queries. Thus we are forced to keep the basic data fields of BibTeX in our XML format. What we can do is to separate the names in the author and editor fields into separate fields for each name, with separate fields for the first and last name. This will make it easier to search and sort the entries later. We are also able to prescribe a sequence in which the data should appear, so that data can be grouped more logically and thus be more easily understood by human readers.
Another decision is what to do with existing fields that are not standard fields in BibTeX. Due to its loose structure BibTeX allows unknown fields, something that is not possible once the XML is validated against a schema. We could exclude the data, but that might erase useful information not accounted for by the specified BibTeX fields. A solution is therefore to make a container for storing this information, perhaps for later use in a transformation back into BibTeX. Another option would be to include the data in the note field, but that would pollute the data already stored there.
Apart from the basic bibliographic fields there are other features whose fate we have to decide. It would be a plus if we were able to keep every piece of functionality from the BibTeX format. First of all, the comments from BibTeX could be valuable and should therefore also be included in our format (not yet included in the parser). Secondly, the string command, which is referenced from data fields in the entries, contains data that we at least should consider including in some way. We could include the data and the references in specific elements and attributes; we could drop the command as such but expand the referring abbreviations into the values specified in the string command; or we could simply ignore the information altogether. We have decided that the first option would be the best, but due to the constraints of this project we have left it out for the moment.
Finally, the file should contain a basic description of itself, so that its provenance and history are clearer. An example of an xBib file is included in Appendix F.

Defining a Schema

Once we have decided what the format should contain, we are able to define exactly what can or cannot be included in a file by means of a schema. There are different schema languages to choose from, and our choice is between DTD and XML Schema. The DTD is the simplest and often also a sufficient solution. The limits of the DTD lie in specifying restrictions on the data contained within elements and attributes, and in the fact that DTDs are not themselves XML. XML Schema overcomes these shortcomings: it is written in XML and gives many options for constraining the data.
Since we are prevented from specifying the content of the BibTeX fields more precisely, we do not need the extra facilities provided by XML Schema. Thus we have chosen the DTD as our schema (see Appendix B). On the other hand, the future of the DTD seems limited, so we have also made a version in XML Schema (see Appendix C).
We define xBib in our DTD in the following way. First some parameter entities are defined; they contain some of the most used groups of elements and are used as shortcuts for writing these elements. Then the root element is defined, and it is specified that the first child element should always be description, followed by at least one entry, string or comment element. We have included the latter two elements in the DTD even though they are not yet implemented in our parser.
The entry element must have an attribute called citationID and exactly one of the 14 different entry type elements. Each of these elements can include specific elements containing the bibliographic data. Noteworthy are the editors, authors and nonStandardFields elements. The first two contain one or more author/editor elements, which themselves contain firstname and lastname elements, of which the first is optional. The nonStandardFields element contains at least one nonStandardField element with an attribute giving the name of the transformed field.
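As an illustration, the nesting just described could look like this in an xBib file (a constructed fragment: the element names follow the report, but the field content and the attribute name on nonStandardField are invented for this example):

```xml
<entry citationID="smith2002">
  <article>
    <authors>
      <author>
        <firstname>John</firstname>
        <lastname>Smith</lastname>
      </author>
    </authors>
    <title>An Example Article</title>
    <nonStandardFields>
      <nonStandardField name="isbn">87-12345-67-8</nonStandardField>
    </nonStandardFields>
  </article>
</entry>
```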
The inbook element is specified by BibTeX to contain either a chapter and/or a pages field. Defining this in both DTD and XML Schema caused some problems. Our preferred way to express it in the DTD was: chapter | pages | (chapter, pages). But this could not be validated in the validators that we used, because from such a model they cannot unambiguously determine the structure of the document. DTDs with this kind of structure are called non-deterministic (Microsoft 2001). For the sake of validation we have made another solution, which does not fully give us what we want but is still acceptable.
A final note on the DTD is that we included the annote field as an optional element for all entry type elements, even though the standard BibTeX styles ignore it.

Constructing the parser

Design Overview

The overall purpose of the parser is to take a file in BibTeX format and convert it to output in our XML format, xBib.
We created eight classes to achieve this goal. The core classes are ConvertEntry and ConvertFile, which are the classes that actually perform the conversion from the BibTeX format to the xBib format. The other classes are helper classes, with several class methods to be used when needed.

Core Process

The user chooses a file in BibTeX format to upload and convert to xBib format. We create an array (NewVector) to hold the data read from the file.
Figure 1

When the file is read in the ReadFile class, we check whether a line starts with % or @string; if it does, we do not read that particular line into the array holding the entries. Otherwise, whether the line starts with @ or continues a current entry, we wait to store anything in the array until all lines of the entry have been collected. We store each entry as a single string in the array. This process is followed by dissecting (see figure 2) the obtained entries into entry type, citation id, authors, title, publisher etc. First we extract the entry type as what lies between the leading @ and the first {, and the citation id as what lies between that { and the following comma. Next we find each field’s name and value by splitting the remainder of the entry around the = characters. Apart from the first and last element of the resulting array, each string then consists of a field value and the next field’s name, separated by a comma. Hereafter we have an array that can be converted to xBib.
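The extraction of the entry type and citation id described above can be sketched as follows (a minimal illustration; the class and method names are invented here and are not the project’s actual ConvertEntry code):

```java
// Sketch of the dissection step: pick out the entry type between the
// leading '@' and the first '{', and the citation id between that
// '{' and the following comma.
public class EntryDissector {

    // Entry type, e.g. "article" from "@article{smith2002, ...".
    public static String entryType(String entry) {
        int at = entry.indexOf('@');
        int brace = entry.indexOf('{');
        return entry.substring(at + 1, brace).trim();
    }

    // Citation id, e.g. "smith2002" from "@article{smith2002, ...".
    public static String citationKey(String entry) {
        int brace = entry.indexOf('{');
        int comma = entry.indexOf(',', brace);
        return entry.substring(brace + 1, comma).trim();
    }
}
```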

Figure 2

Now the task is to determine which fields are standard BibTeX fields as specified by our DTD (or XML Schema) and which are not. Furthermore, we have to divide the author and editor fields into sub-fields containing the first and last name of each person mentioned in the entry.

When we convert the entry we check whether the entry type is valid, and if it is, we use this information to find the fields the particular entry must have (required) and can have (optional). Fields that are neither required nor optional are put in an element variable holding all non-standard fields. Each time an author or editor (hereafter just person(s)) field is present, some sub-code deals with dividing the names. If several persons are mentioned, they are separated by the word and surrounded by whitespace. We use this to split the persons from each other, and then we check for separating characters such as comma and whitespace. If a comma is present, the part before it is the last name; otherwise the part after the last whitespace is the last name. Some conditions, such as names wrapped in braces and von-like names, make the task a little more complex. Text within braces should be kept together, and an and inside braces is not considered a separator between persons. Therefore we temporarily replace and within braces with #and# until we have separated each person’s name. Another point about braces is that if there is no text outside the braces, all the text should be considered the last name. And if a von-like name is present, the last name starts where the von-like name begins.
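The splitting strategy above, including the #and# placeholder for braced text, can be sketched like this (an illustrative reimplementation under the assumptions stated in the text, not the project’s actual code; the von-name handling is omitted for brevity):

```java
import java.util.ArrayList;
import java.util.List;

public class NameSplitter {

    // Hide " and " inside braces behind "#and#" so that braced text
    // such as {Barnes and Noble} is kept together as one person.
    static String protectBracedAnd(String field) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            if (c == '{') depth++;
            if (c == '}') depth--;
            if (depth > 0 && field.startsWith(" and ", i)) {
                out.append(" #and# ");
                i += 4;                    // skip the rest of " and "
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Split an author/editor field into individual persons.
    public static List<String> splitPersons(String field) {
        List<String> persons = new ArrayList<>();
        for (String p : protectBracedAnd(field).split(" and ")) {
            persons.add(p.replace("#and#", "and").trim());
        }
        return persons;
    }

    // "Last, First" -> the part before the comma is the last name;
    // otherwise the part after the final whitespace is the last name.
    public static String lastName(String person) {
        int comma = person.indexOf(',');
        if (comma >= 0) return person.substring(0, comma).trim();
        int space = person.lastIndexOf(' ');
        return space < 0 ? person : person.substring(space + 1);
    }
}
```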
When the conversion to xBib has been made, the entry is returned and replaces its own BibTeX representation in the array shown in figure 2. The conversion is not always successful, and an entry is therefore deleted if it does not validate. There are two reasons for not validating: either no field contains any value, or the entry type is not specified in the DTD.

Testing the Parser and Validating the Results

In testing the parser we made a test file with a few of our own examples and with content from the three BibTeX files delivered by the teachers. We have not tested against the new version of tough.bib or Ed Demaine's .bib file, and with the parser as it stands there is no reason to test them, because we know that the parsing will fail. The reason is that we use = as the separator when splitting an entry into field names and values. A solution could be analogous to the one where we replace and with #and# when separating persons from each other. Besides this problem, we know that comments at the end of an entry are not discarded but actually appended to the preceding field value. We have not considered a solution to this problem, because the error was discovered too late in the project period.
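The suggested fix for the = separator problem could look like this: hide = characters that occur inside braces (e.g. in a URL field value) behind a placeholder before splitting, and restore them afterwards. This is a hypothetical sketch of that idea, not code from the project:

```java
public class EqualsProtector {

    // Replace '=' inside braces with the placeholder "#eq#" so that a
    // later split on '=' only hits the field name/value separators.
    public static String protect(String entryBody) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (char c : entryBody.toCharArray()) {
            if (c == '{') depth++;
            if (c == '}') depth--;
            if (c == '=' && depth > 0) out.append("#eq#");
            else out.append(c);
        }
        return out.toString();
    }

    // Undo the placeholder once the field value has been extracted.
    public static String restore(String value) {
        return value.replace("#eq#", "=");
    }
}
```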

Running queries

About the queries

Building the xQueries


Critical note

Transforming output with XSLT

There are two types of files that we need to transform in order to present them as web pages: the xBib file output by the parser, and the results of our queries from Quip. The stylesheets are included in Appendix D. Since we serve the files from a servlet (see the next section), the results of the XSLT processing need not be complete, valid XHTML; the elements that are not included in the XSLT output are inserted by the servlet.

Transforming xBib

The XSLT transforms the XML into something that is viewable in a browser as styled text, but it also allows us to structure and style the bibliographic data more specifically. We have chosen to present the data in the American Psychological Association's “style of citation” as described in Skov (2000). Thus the data is not just emitted as paragraphs of text but restructured according to the entry type, and different separators are inserted between the data depending on which element follows which. This is done using the xsl:if and xsl:choose elements. The transformation is made with the file xbib.xsl. An example of a result file given to the XSLT processor is shown in Appendix G, and the result of the process is included as Appendix H.
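In Java, applying a stylesheet such as xbib.xsl can be done with the standard JAXP transformation API. The sketch below uses a tiny inline stylesheet and input as stand-ins for the real files (both are invented for this example), showing xsl:choose picking a separator depending on whether a year element is present:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XBibTransform {

    // A miniature stylesheet: author, then "(year). " if a year
    // exists, otherwise just ". ", then the title.
    static final String XSL =
        "<xsl:stylesheet version='1.0' "
      + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='entry'>"
      + "<xsl:value-of select='author'/>"
      + "<xsl:choose>"
      + "<xsl:when test='year'> (<xsl:value-of select='year'/>). </xsl:when>"
      + "<xsl:otherwise>. </xsl:otherwise>"
      + "</xsl:choose>"
      + "<xsl:value-of select='title'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Run the stylesheet over an XML string and return the result.
    public static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString();
    }
}
```

In the servlet, the same Transformer would be fed the real xbib.xsl and the parser’s output instead of the inline strings.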

Transforming query results

The root element will always be quip:result, but some of the results will contain actual valid entry elements from xBib, and those results, once Quip’s root element is replaced by the remaining required elements from xBib (see the next section), can use the stylesheet described above. Otherwise another XSLT document (xbib2.xsl) transforms the results into something that can be inserted into an XHTML document. Since only one of these queries is implemented in our web service, this XSLT document is only optimised for the result of that query.

Bringing it all together



References

Dempsey, L. & Heery, R. (1997). A review of metadata: a survey of current resource description formats. Available online 19/12-2002:

Hagler, R. (1997). The Bibliographic Record and Information Technology (3. ed.). Chicago: American Library Association.

Harold, E. R. & Means, W. S. (2002). XML in a nutshell. O’Reilly.

Help On BibTeX bib files. (n.d.). Available online 19/12-2002:

IFLA Study Group on the Functional Requirements for Bibliographic Records. (1998). Functional Requirements for Bibliographic Records: Final Report. München: Saur. (UBCIM Publications. New Series; Vol. 19).

LaTeX3 Project. (2002). LaTeX Project Home page. Available online 19/12-2002:

Microsoft. (2001). Deterministic and Non-Deterministic Schemas. Available online 19/12-2002:

Skov, A. (2000). Referér korrekt!: Om udarbejdelsen af bibliografiske referencer [Cite correctly: on the preparation of bibliographic references]. København: Danmarks Biblioteksskole. Available online 19/12-2002:
