A Crash Course in Corpus Linguistics

May 2002

Jóhanna Barðdal
Linguistics Division, UNT

Table of Contents:

1. Introduction
2. Available Corpora
3. Instruction Sheet for a Guest Account at the LDC corpora

1. Introduction

The emergence of corpus linguistics has provided us with an amazing tool: the ability to investigate huge amounts of text in order to find answers to various questions about language use, questions which could not easily be answered earlier, when computer technology was more limited and linguists had to rely on their own memory and needed endless amounts of time at their disposal. Biber, Conrad and Reppen (1998: Ch. 1) point out the following major types of linguistic fields which can now be researched thanks to statistical methods, the availability of large text collections and increased computing power:
  • Words: word meaning, frequency, distribution, and collocation patterns.
  • Grammar: use, function, distribution across registers, and grammatical synonymy.
  • Lexico-Grammar: the relation between lexical items and grammatical items, word distribution across grammatical constructions, and constructional choice depending on lexical items.
  • Discourse: reference marking in different types of texts, encoding of new and given information, evolving of grammatical features in the course of texts.
  • Registers: variation of linguistic features within and across registers, similarities and differences between spoken and written registers, similarities and differences between various written registers.
  • Language acquisition and development: acquisition of specific linguistic features or patterns across learners, first and second language acquisition, development of writing skills.
  • Historical linguistics: register development, style, changes in language use across historical periods. 

I will now discuss each of these in turn.
        Research on lexicographic questions is very limited without recourse to corpus linguistics and corpus-based approaches. With the aid of computers and computer technology it is possible to calculate the relative frequency of words, to compare word frequency in different registers, to detect the collocates of words in a large amount of text material, and subsequently to isolate the various meanings, or senses, a word has. It is possible to compare seemingly synonymous words and discover whether they are true synonyms or whether their distribution and collocates vary, for instance, according to register. These kinds of research questions are particularly important for learners of a language and for the compilation of dictionaries.
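        The detection of collocates mentioned above can be sketched in a few lines of Python. The window-based counting scheme, the collocates helper and the example words are illustrative assumptions of mine, not the method of any particular corpus tool:

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count the words occurring within `window` positions of `node`:
    a crude measure of a word's collocates."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            lo, hi = max(0, i - window), i + window + 1
            # Count every word in the window except the node itself.
            counts.update(t.lower() for j, t in enumerate(tokens[lo:hi], lo)
                          if j != i)
    return counts

tokens = "strong tea and strong coffee but powerful arguments".split()
print(collocates(tokens, "strong").most_common(3))
```

A real study would of course run this over millions of words and weigh the raw counts against the overall frequency of each candidate collocate.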
        Corpus-based research on grammatical features can include everything from morphology to word classes to syntactic structures. Comparison between registers is likely to reveal systematic differences in the distribution of, for instance, derivational morphemes; whether some morphemes are typical for certain types of roots, constrained by either phonological or semantic factors; whether nominalizations are more widely used in some registers as opposed to verbal predicates in other registers; whether there are systematic differences in the use of certain types of seemingly synonymous syntactic structures across registers. Information on language use is of utmost importance to second language learning and/or learning a language for specific purposes, and should thus be adequately represented in the textbooks and workbooks used within that area.
        In recent years there has been a growing interest within the linguistic community in the relation between lexical items, i.e. words, and syntactic constructions, and the possible interaction between the two. Large corpora make it easy to single out specific words and to locate and analyze the syntactic frames they occur in. This is especially interesting for words which are considered near synonyms, since an investigation may reveal differences in syntactic and/or stylistic distribution. Conversely, it is also possible to single out near synonymous syntactic structures and to locate and analyze the particular lexical items which instantiate these frames. Such research might show that near synonymous words or structures are used in different ways. This is a relatively new area of investigation, an area which may become more important in the future together with increased awareness of and interest in actual language use and frequency patterns.
        Within the area of discourse analysis it is possible to carry out corpus-based research, for instance, on nominal reference across word category, such as nouns vs. pronouns, across type of reference, such as anaphoric, text-external or inferred, across registers, such as conversation vs. academic prose, etc. It is also possible to investigate rhetorical features or structures as they evolve during the course of a text.
        Corpus linguistics methods are ideal for research on registers and register differences, because in order to establish similarities and/or differences between registers huge amounts of text are needed. For each register it is essential that as many authors as possible are represented so that individual idiosyncratic differences will be evened out. It is also important that as many linguistic features as possible be investigated, since categorization of genres cannot be based on a few linguistic variables. Recent research on registers has shown that linguistic variables tend to come in clusters, i.e. certain linguistic features co-occur with other linguistic features in texts. Furthermore, these clusters can be shown to be in complementary distribution. That is, if one bundle of linguistic features is evident in texts, then another bundle is typically lacking in the same texts. Thus, it has been argued that registers and register differences are most accurately characterized by their location along a dimension, of which five have been identified. More research is needed on more registers and more languages. The results of research on register characteristics can be used to develop instructions for specific-purpose writing and for developing more accurate grammar-checking programs for computer and word-processing users.
        Without corpus linguistics, research on language acquisition has been limited to the language of very young children, to only one or two learners, to only a few linguistic features, and to a single register. Corpus linguistics allows certain linguistic features to be studied across a large number of speakers, and thus provides a basis for generalizations across language learners. Also, because of increased computer capacity, bundles of features can be studied and correlations among features are more easily detected. The development of, for instance, writing proficiency can easily be studied through the examination of a large number of texts written by school children. Various kinds of texts can be studied and compared with equivalent registers for adults, thus contributing to research on the development of stylistic proficiency. Finally, computer technology aids in the comparison between first and second language learners, since groups of speakers with various backgrounds can more easily be compared.
        Similarly, research on historical linguistics can also benefit from corpus-based approaches; with extensive text material from different historical periods both lexicographic and grammatical features can be identified and traced chronologically. Also, research on registers, and changes within registers across time, can be conducted with the aid of corpus-based technology. Research on both geographic and demographic variation over time has become easier to conduct, as has research on individual authors' styles.
        These are only some of the possible research areas which can benefit from studies based on corpora. Generally, corpus-based linguistics is ideal as a research method for answering most questions about language use, the only limitation being the imagination of the analyst.
                  (The overview in this section is mostly based on Biber, Conrad & Reppen 1998)

2. Available Corpora

A corpus is a linguistic database, i.e. a database of language use, of either spoken or written language. However, a corpus is not only a large amount of text material; it has to be compiled in a systematic way and perhaps even further processed in order to be useful. Corpora are compiled in different ways depending on their purpose. The most useful corpus consists of many registers, or genres, of both written and spoken language. The more registers, the more authors and texts within each register, and the larger the corpus, the better stratified it will be, and thus the more reliable the conclusions drawn from it.
        Corpora can be either tagged or untagged. A tagged corpus is a corpus where all words have been marked in some way, for instance for word category, i.e. nouns are tagged as nouns, verbs as verbs, adjectives as adjectives, etc. An untagged corpus is not processed at all: words have not been tagged for word category; it is simply raw text material. Tagged corpora are sometimes lemmatized. A lemma is the base of a word, before any endings are attached to it. Thus, in a lemmatized corpus it is possible to search for a base, such as, for instance, go, and the results will list all instances of the lexeme go, i.e. all word forms of go. Such a list will therefore include go, goes, going, went and gone. For further information, a very useful on-line tutorial on corpus linguistics has been compiled by Catherine Ball at Georgetown University.
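        The idea behind a lemmatized search can be sketched as a simple lookup from word forms to base forms. The LEMMA table and the search_lemma helper below are illustrative inventions; a real lemmatized corpus stores such links for every word in the text:

```python
# A toy lemma lookup: map each inflected word form to its base form (lemma).
LEMMA = {
    "go": "go", "goes": "go", "going": "go", "went": "go", "gone": "go",
    "is": "be", "was": "be", "are": "be",
}

def search_lemma(tokens, lemma):
    """Return all word forms in `tokens` whose base form is `lemma`."""
    return [t for t in tokens if LEMMA.get(t.lower()) == lemma]

text = "She goes home and he went too , going slowly".split()
print(search_lemma(text, "go"))   # → ['goes', 'went', 'going']
```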
        Computer software for corpus linguistics is basically of three kinds: concordance programs, taggers/coders, and specifically made programs used to answer certain research questions. Concordance programs are search engines which give the result of the search as text samples. Depending on how the corpus has been tagged, a concordance program can give us text samples of a specific word, a collocation, a lemma or a syntactic construction (see next section for an example of the display of the output). Taggers/coders are programs made particularly to tag raw text material for variables such as morphological properties, word category, lemmas, or even syntactic functions. In addition to these two kinds of programs, the third possibility is to design one's own computer program, or have it done by a computational linguist. Such programs are needed if the research question is of a very specific kind and concordances and taggers are of no use. Concordance programs and taggers/coders are available as either freeware or commercial software. The Linguist List has a site on various software available for linguists and linguistic analysis, including both freeware and commercial software.
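        What a concordance program does at its core can be sketched in a few lines. The concordance function and its sample sentence below are my own illustration; real concordance programs offer far richer search and display options:

```python
def concordance(tokens, target, width=4):
    """A minimal keyword-in-context (KWIC) concordance: for every
    occurrence of `target`, return `width` words of context on each side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keyword forms a column.
            hits.append(f"{left:>30}  [{tok}]  {right}")
    return hits

tokens = ("the proposal would have to receive final approval "
          "before they have to carry out the plan").split()
for line in concordance(tokens, "have"):
    print(line)
```

The aligned-keyword layout is the same one visible in the Brown search display in the next section.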
        A huge number of linguistic corpora are being compiled around the world today. Many of these are available on-line, usually with a password. A password can be obtained free of charge for a limited time or for limited use of the corpus, or a site license can be bought on an annual basis. Corpora also come on CDs, for either institutional or personal use. I will now describe only a handful of the corpora which exist today.
        The British National Corpus contains 100 million words of both written and spoken British English. It is segmented into orthographic units, i.e. sentences, and tagged for word category. A copy of the corpus for one person costs £50 and a network license costs £250. BNC will soon be available on-line at the British Library. A similar initiative is also being carried out for American English. The American National Corpus is scheduled to be launched in September 2002.
        COBUILD offers an English corpus of 450 million words, both written and spoken, called the Bank of English. The majority of the material is from 1990 or later and is constantly updated. An on-line demo is available at the COBUILD website, which can be used to generate concordances on words and collocations. More elaborate linguistic research, such as on collocations with up to four intervening words, or on syntactic constructions, requires a subscription at the cost of £500 for three user IDs per annum. The subscription can be extended to ten user IDs at the most, and class IDs can also be set up. The BoE is useful for everybody working on English, and everybody teaching either English or ESL/EFL.
        ICAME is a collection of corpora of both spoken and written language. It comes on CD with various fonts and software. It includes, for instance, the Brown corpus (see next section), various corpora with historical texts, and the Bergen Corpus of London Teenage Language. The price for one user ID is approximately $440, and $1000 for ten user IDs.
        The Bergen Corpus of London Teenage Language (COLT) contains spoken conversation of London teenagers aged 13 to 17, collected in 1993. It is approximately 500,000 words and is tagged for word category. At the moment a 151-text sample of the corpus is available on-line at no cost, where it is possible to search for words, collocations and combinations of letters.
        The Corpus of Middle English Prose or Verse is a part of the Middle English Compendium, also containing the Middle English Dictionary and a HyperBibliography of Middle English Prose and Verse. At the moment it is available at no cost. It contains 61 texts used in the Middle English Dictionary. It is possible to browse the texts, to search for words, collocations and phrases, and co-occurring words with up to 80 intervening characters. It is possible to confine searches to individual texts, groups of texts, or the whole collection.
        The Penn-Helsinki Parsed Corpus of Middle English contains 1.3 million words from 55 texts of Middle English prose. It is tagged for word category and it is syntactically parsed. A site license for five users costs $200 for the corpus and $50 for the search program.
        The International Corpus of English (ICE) is a project on varieties of English worldwide. It started in 1990 and includes 15 research teams in different parts of the world, each compiling a corpus of their own regional or national variety of English, including Australia, Great Britain, Singapore, India, Canada, South Africa, New Zealand, etc. After completion the corpus will contain 15 million words, 1 million for each variety of both spoken and written English. The corpora are being annotated at various levels: text level, word category, and phrasal and sentence level. Certain subcorpora will be coded for phonetic/phonological variables. Some of the corpora are free, others are a part of ICAME (see above), yet others can be obtained through their home institutions.
        The American component of ICE contains a part on spoken language, compiled in the Linguistics Department at the University of California, Santa Barbara. The corpus is often referred to as CSAE or SBCSAE. Its main web page is at UCSB, but the corpus is distributed by the Linguistic Data Consortium (LDC; see next section). It comes on three CDs and costs $75 for non-members of the consortium.
        CHILDES is a system of Child Language Data Exchange. Various researchers working on language acquisition, both first and second language acquisition of children and adults, have contributed to this exchange system. The system contains a databank of transcribed conversation, computer programs to analyze the transcripts, methods of how to transcribe and code the material, and the possibility of linking the material to audio/video systems. In order to obtain the material individuals can join the TalkBank and access the database for free.
        Corpora have been compiled not only for the English language but for other languages as well. Michael Barlow has links on his website to corpora in 21 languages and/or language families, such as Chinese, Russian, Swedish, Danish, Dutch, Spanish, Turkish, and many more. There are also the so-called parallel corpora, which either contain parallel texts in two or more languages or are translations of each other. There are links on Michael Barlow's site to various parallel corpus projects being compiled in the world. Some of these include English - French, English - German, English - Thai, English - Norwegian, Swedish - English, Swedish - German, Swedish - French, English - French - Greek, English - French - Dutch, English - Swedish - Norwegian - Finnish, and some Eastern European languages, to mention only a few.
        Finally, the largest electronic corpus available on-line is the World Wide Web itself, which can be searched with various engines, such as Google, Lycos, Yahoo, etc., at no cost.

3. Instruction Sheet for a Guest Account at the LDC Corpora

The Linguistic Data Consortium (LDC) is a consortium compiling and distributing databanks, mostly of English but also of other languages. It was founded with research grants from ARPA and NSF in 1992, and is hosted by the University of Pennsylvania. The LDC holds 214 different corpora, of which three are available at no cost to everybody. These three corpora are the Brown corpus, the TIMIT corpus and the Switchboard corpus. The Brown corpus is approximately 1.2 million words, containing texts from at least 15 written registers within the humanities, such as belles lettres, reports, fiction, biography, popular culture, etc. It exists, and can be accessed, as a text file, and can thus be used for lexicographic research. The TIMIT corpus is a corpus of recorded speech, containing 6,300 sentences recorded from male and female speakers of eight dialects of American English. It exists as a speech file and is made for acoustic-phonetic research and/or research on speech recognition systems. The Switchboard corpus is approximately 2.4 million words of telephone conversation in American English. It exists as both a speech and a text file, and can thus be used both for acoustic-phonetic research and for lexicographic research.
        For free access to the Brown, TIMIT and Switchboard corpora you need to sign up for a guest account. On entering the LDC website, click on LDC Online. This will take you to a site which allows you to take an interactive tutorial on the corpora, to sign up for a guest account, to find answers to frequently asked questions about the LDC and its corpora, and finally to access either the text corpora (Brown and Switchboard) or the speech corpora (TIMIT and Switchboard).
        After signing up for the guest account you will receive a password by e-mail. You will be asked for this password together with your user name when you try to enter either the text corpora or the speech corpora on the LDC Online site mentioned above. When you have logged in and agreed to the conditions for using the corpora, you will get a list of your options and a list of all the corpora available on-line. However, as a non-member, you can only access the two text or speech corpora that are available to guests.
        Let us concentrate on the text corpora. The first five options before the list of all the on-line text corpora serve the purpose of doing searches etc. on combined or all corpora, and are thus not of interest to non-members. Scroll down the page and select either Brown or Switchboard. At this point you have four choices: you can choose to view the corpus, to do a search for either one word or two words and get a text concordance, to obtain frequencies and/or combined frequencies, or to get the histogram of the corpus, which includes frequency lists of all word forms in the corpus. The following display shows the search results for the combination have to in Brown:
Text concordances for have to

 Search result for corpus BROWN
        3737    have/HV
        3737    have
        14047   to/TO
        10567   to/IN
        3       to/NIL
        1       to/QL
        1       to/NN
        24619   to
In region 0 to 1189209
Found:  255     have to
Page Number:0 Page Length:100
 article  e Republicans would have to face is a state law
 article    hether notice would have to first be given that
 article   The proposal would have to receive final legis
 article       <s> Each ally will have to carry out obligatio
 article   year and would only have to put up half that am
 article     ually , Davis would have to toss in the towel s
 article         tration will either have to cut down expenses o
 article         two days , it will have to be at the expense o
 article       speculating , but I have to think Jack feels he
 article     ants . <p> <s> `` I have to stay with Nieman fo
 article    ere told they would have to get to know certain
 article   sold and they would have to get to know people 
 article    nd potatoes -- they have to have that go-go-go 
 article      ctory . <s> If they have to take any car , they

From this display we can read that out of 3,737 occurrences of have and out of 14,047 occurrences of the infinitive marker to there are only 255 occurrences of have to in this corpus. Below the statistics a few text samples from the corpus are given with the sequence have to in the middle. You can specify how many text samples you want from the text, and how long you want the context to be. A considerable number of text samples should suffice as material for a detailed examination and analysis of the possible collocations and grammatical behavior of have to, i.e. for instance, whether it selects for certain types of main verbs, whether it tends to occur in certain tenses, whether it is used with a personal subject or impersonally, etc.
        It is not only possible to search for words and combinations of words in Brown, i.e. for untagged raw word forms, but also for forms already tagged for word category. This means that it is possible to search for the word to as either an infinitive marker or as a preposition. It is also possible to search only for a certain word category without specifying any particular word. Such a search might give us all instances of nouns if we searched for the word category Noun. A search of forms irrespective of upper or lower case usage in the text is also possible, and finally, it is possible to search for lemmas.
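        A tagged search of this kind can be sketched as follows. The word/TAG token format mirrors the Brown-style display above (have/HV, to/TO, to/IN), but the search_tagged helper and the sample sentence are my own illustrations, not the LDC's search program:

```python
# Tagged tokens in Brown-style word/TAG format; the sample is invented.
tagged = ["I/PPSS", "have/HV", "to/TO", "go/VB", "to/IN", "school/NN"]

def search_tagged(tokens, word=None, tag=None):
    """Match tokens by word form, by tag, or by both."""
    out = []
    for t in tokens:
        w, _, g = t.rpartition("/")   # split "to/TO" into "to" and "TO"
        if (word is None or w.lower() == word.lower()) and \
           (tag is None or g == tag):
            out.append(t)
    return out

print(search_tagged(tagged, word="to"))            # both instances of to
print(search_tagged(tagged, word="to", tag="TO"))  # infinitive marker only
print(search_tagged(tagged, tag="NN"))             # all singular nouns
```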
        Returning to have to: since Brown consists solely of written material, it is interesting in this context to find out whether have to shows the same behavior in spoken language. It is thus interesting to do a similar search in Switchboard! It turns out that have occurs 29,445 times, to occurs 70,339 times, and the combination have to occurs 4,848 times. Switchboard differs from Brown in being untagged; thus the 70,339 instances of to are all instances of to irrespective of word category or part of speech, i.e. to can here be an infinitive marker or a preposition. Going back to have to, there are 4,848 instances of have to in a spoken corpus of 2.4 million words, whereas there are only 255 occurrences in a written corpus of 1.2 million words, i.e. roughly 2,000 occurrences per million words as against some 210. These data therefore clearly show that the use of have to is much more common in spoken language than in written language.
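        Since the two corpora differ in size, the raw counts are best compared after normalizing them to occurrences per million words. A quick calculation with the figures above (the per_million helper is merely an illustration of the arithmetic):

```python
def per_million(count, corpus_size):
    """Normalize a raw frequency to occurrences per million words."""
    return count / corpus_size * 1_000_000

# Figures from the Brown and Switchboard searches above.
brown = per_million(255, 1_200_000)          # written: ≈ 212 per million
switchboard = per_million(4_848, 2_400_000)  # spoken:  ≈ 2,020 per million

print(f"Brown:       {brown:.1f} per million words")
print(f"Switchboard: {switchboard:.1f} per million words")
print(f"Ratio:       {switchboard / brown:.1f}")   # roughly tenfold
```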
        A non-member guest account for Brown and Switchboard is ideal for all comparisons of words and collocations between spoken and written language, since Brown is a well-stratified corpus with many registers and many authors included in each register. Brown is thus a representative corpus of written language.


Ball, Catherine N. Tutorial: Concordances and Corpora. Available at http://www.georgetown.edu/cball/corpora/tutorial.html

Barlow, Michael. Corpus Linguistics. Available at http://www.ruf.rice.edu/~barlow/corpus.html

Barlow, Michael. Parallel Corpora. Available at http://www.ruf.rice.edu/~barlow/para.html

Biber, Douglas, Susan Conrad & Randi Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.

Child Language Data Exchange System. Carnegie Mellon University. Information available at http://childes.psy.cmu.edu/

Cobuild. University of Birmingham & HarperCollins Publisher. Available at http://titania.cobuild.collins.co.uk/

ICAME Corpus Collection. University of Bergen. Information available at http://www.hit.uib.no/corpora.html

International Corpus of English. University of Hong Kong. Information available at http://www.hku.hk/english/research/ice/index.htm 

The American National Corpus. Northern Arizona University & Vassar College. Information available at http://americannationalcorpus.org/

The Bergen Corpus of London Teenage Language. University of Bergen. Available at http://www.hit.uib.no/colt/

The British National Corpus. Oxford University. Available at http://www.hcu.ox.ac.uk/BNC/

The Corpus of Middle English Prose or Verse. University of Michigan, Oxford Text Archive & the Humanities Text Initiative. Available at http://www.hti.umich.edu/c/cme/

The Corpus of Spoken American English. University of California, Santa Barbara. Information available at http://www.linguistics.ucsb.edu/research/sbcorpus/default.htm

The Linguist List. Eastern Michigan University & Wayne State University. Available at http://www.linguistlist.org

The Linguistic Data Consortium. University of Pennsylvania. Available at http://www.ldc.upenn.edu

The Penn-Helsinki Parsed Corpus of Middle English. University of Pennsylvania. Available at http://www.ling.upenn.edu/mideng