A Glossary of Corpus Linguistics

Book Review
A Glossary of Corpus Linguistics
Paul Baker, Andrew Hardie, & Tony McEnery. Edinburgh: Edinburgh University Press, 2006. Pp. 1-187.

Reviewed by Vander Viana
Queen's University Belfast
Northern Ireland, UK

Launched in 2006, A Glossary of Corpus Linguistics is part of a series of glossaries by Edinburgh University Press, similar to the ones in Sociolinguistics, Applied Linguistics or Cognitive Linguistics. Authored by specialists from the Department of Linguistics and English Language at Lancaster University, the publication is a welcome addition to a fast-evolving research area.
The book opens with a warning about website addresses. Because such addresses change quickly, the authors have decided to include only those which are likely to remain valid for a longer period of time.

Still, readers do not find the online address of the British National Corpus, for instance. Also in the introductory notes, a list of over 150 acronyms is offered even though, for the sake of consistency, the authors have decided to list the terms in full in the glossary.

The entries in the glossary are organized alphabetically, which helps readers find specific terms such as header or raw corpus. In this way, the book may work like a dictionary. Although the authors offer no thematic clustering, the entries might be grouped into six main categories.
First, there are numerous references to distinct concepts in corpus linguistics such as alignment, hapax legomena and representativeness, to cite a few.

There are also related notions in the fields of computational linguistics and statistics. Other entries which do not at first seem to be directly related to corpus linguistics also find a place in the glossary, for instance, dictionary, ethics, form-focused teaching and Java.

A second category refers to existing corpora. There are entries on the Brown Corpus, the British National Corpus, the American National Corpus and the International Corpus of English. The publication additionally encompasses more recent compilations such as the Corpus del Español, a 100-million-word corpus of Spanish created by Mark Davies (p. 49) in 2001/2002. Some less popular corpora are also referenced in the glossary, for instance, Cronfa Electroneg o Gymraeg (representing modern Welsh), the Guangzhou Petroleum English Corpus (developed under the auspices of the Chinese Petroleum University) and the Jiao Tong University Corpus for English in Science and Technology.
Thirdly, several computer programs are mentioned in the volume. Not only are concordancers defined, but such programs, for example ConcApp by C. Greaves, Concordance by R. J. C. Watt, and Concordancer / Le Concordanceur by D. W. Rand, are also described. Parsers and taggers find a home in the volume together with related concepts (parsing, part-of-speech tagging, skeleton parsing, and tag transition probabilities), programs (Constraint Grammar Parser of English, Link Grammar Parser, Minipar, TAGGIT and Trigrams'n'Tags) and parsed and/or tagged corpora (CHRISTINE Corpus and Gothenburg Corpus).

A fourth category covers statistical tests. Brief explanations are provided for the difference between parametric and non-parametric tests as well as their most commonly used types. Chi-square, log-likelihood and Fisher's Exact Test, among others, are also explained.

A fifth group of entries encompasses those related to well-known journals and associations in corpus linguistics. The former includes references to the International Journal of Corpus Linguistics and the Journal of Quantitative Linguistics. The latter comprises, for instance, the Association for Computational Linguistics and its European chapter, the Association for Literary and Linguistic Computing, the European Association for Lexicography and the European Language Resources Association. Emphasis is placed on the European associations, perhaps because the authors are based in the UK.

Finally, there are references to projects, archives, databases and universities. One example is Project Gutenberg, a massive internet archive of over 16,000 copyright-free books stored as machine-readable text (p. 135). The Alex Catalogue of Electronic Texts resembles Project Gutenberg, for it is an archive of on-line, freely available texts that are copyright free (pp. 8-9). As far as historical databases are concerned, there is a reference to the Chadwyck-Healey Databases. The only location mentioned in the glossary is the University Centre for Computer Corpus Research on Language (UCREL), situated at Lancaster University, where the authors work.

In relation to further reading, some of the entries refer to publications in which readers may find more information on a specific topic. As far as the bibliography is concerned, almost 200 works are listed. The volume refers to 4 publications from 2001, 9 from 2002, 3 from 2003, 4 from 2004 and only 1 from 2005, the year before the glossary was published.
The references in this volume are in most cases adequately explained and exemplified. The volume also provides, in a user-friendly way, a number of cross-references either in the middle or at the end of an entry. In the first case, they are in bold type. This system makes it easy for readers to find what they are looking for if they need further information.
It is true that writing a book necessarily implies a selection. In the glossary, however, the absence of some terms should be reconsidered. For instance, despite the reference to Corpus del Español, the volume does not include Davies's interface to search the British National Corpus (http://corpus.byu.edu/bnc), nor does it dedicate an entry to Davies and Ferreira's O Corpus do Português (http://www.corpusdoportugues.org), totaling over 45 million words of Brazilian and European Portuguese from the 1300s to the 1900s. By the same token, readers may feel a lack of balance in the choice of institutions listed: mainly European associations and the university where the authors work. In terms of presentation, it would have been advisable if the entries had also been grouped thematically. As this is not the case, readers may find themselves flipping through the publication to learn about the different aspects of an issue they are interested in.

All these minor shortcomings, however, may be resolved in a second edition. A Glossary of Corpus Linguistics does in fact provide very useful introductory information on corpus linguistics. The publication adds to the field by playing a twofold role: it stands as a source of concepts that corpus linguists should use worldwide and, at the same time, may work as an introduction for novices in the area.