A Crash Course in Corpus
Linguistics
May 2002
Jóhanna Barðdal
Linguistics Division, UNT
Table of Contents:
1. Introduction
2. Available Corpora
3. Instruction Sheet for a Guest Account at
the LDC Corpora
References
1. Introduction
The emergence of corpus linguistics has provided
us with a powerful tool, namely the means to investigate huge amounts of
text in order to find answers to various questions about language use, questions
which could not easily be answered earlier, when computer technology
was more limited and linguists had to rely on their own memory
and needed endless amounts of time at their disposal. Biber, Conrad and Reppen
(1998: Ch. 1) point out the following major areas of linguistic research which
can now be pursued thanks to statistical methods, the availability of large
text material, and improved computer technology:
- Words: word meaning, frequency, distribution, and collocation patterns.
- Grammar: use, function, distribution across registers, and grammatical synonymy.
- Lexico-grammar: the relation between lexical items and grammatical items, word distribution across grammatical constructions, and constructional choice depending on lexical items.
- Discourse: reference marking in different types of texts, encoding of new and given information, and the evolution of grammatical features over the course of texts.
- Registers: variation of linguistic features within and across registers, similarities and differences between spoken and written registers, and similarities and differences between various written registers.
- Language acquisition and development: acquisition of specific linguistic features or patterns across learners, first and second language acquisition, and the development of writing skills.
- Historical linguistics: register development, style, and changes in language use across historical periods.
I will now discuss each of these in turn.
Research
on lexicographic questions is very limited without recourse to corpus linguistics
and corpus-based approaches. With the aid of computers and computer technology
it is possible to calculate the relative frequency of words, to compare
word frequency in different registers, to detect the collocates of words
in a large amount of text material, and subsequently to isolate the various
meanings, or senses, a word has. It is possible to compare seemingly synonymous
words and discover whether they are true synonyms or whether their distribution
and collocates vary, for instance, according to register. These kinds
of research questions are particularly important for learners of a language
and for the compilation of dictionaries.
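The two basic counts mentioned above, word frequency and collocates, can be sketched in a few lines of Python. This is a hypothetical illustration, not the code of any actual corpus tool; the toy sentence and the two-word window are assumptions for the example only.

```python
from collections import Counter

def frequencies(tokens):
    """Count how often each word form occurs in a token list."""
    return Counter(tokens)

def collocates(tokens, node, window=2):
    """Count the words occurring within `window` tokens of each
    occurrence of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Neighbours to the left and right of the node word.
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + window + 1])
    return counts

text = "the strong tea and the strong coffee but not the powerful tea".split()
print(frequencies(text).most_common(3))
print(collocates(text, "strong").most_common(3))
```

On real corpus material the same counts would be run over millions of tokens, and the collocate counts would usually be weighted by a statistical association measure rather than reported raw.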
Corpus-based
research on grammatical features can include everything from morphology
to word classes to syntactic structures. Comparison between registers is
likely to reveal systematic differences in the distribution of, for instance,
derivational morphemes; whether some morphemes are typical of certain
types of roots, constrained by either phonological or semantic factors;
whether nominalizations are more widely used in some registers as opposed
to verbal predicates in other registers; whether there are systematic differences
in the use of certain types of seemingly synonymous syntactic structures
across registers. Information on language use is of the utmost importance
for second language learning and for learning a language for specific purposes,
and should thus be adequately represented in the textbooks and workbooks used
in that area.
In
recent years there has been growing interest within the linguistic community
in the relation between lexical items, i.e. words, and syntactic constructions,
and the possible interaction between the two. Large corpora make it easy
to single out specific words and to locate and analyze the syntactic frame
they occur in. This is especially interesting for words which are considered
near synonyms, since an investigation may reveal differences in syntactic
and/or stylistic distribution. Conversely, it is also possible to single
out near synonymous syntactic structures and to locate and analyze the
particular lexical items which instantiate these frames. Such research
might show that near synonymous words or structures are used in different
ways. This is a relatively new area of investigation, an area which may
become more important in the future together with increased awareness of
and interest in actual language use and frequency patterns.
Within
the area of discourse analysis it is possible to carry out corpus-based
research, for instance, on nominal reference across word category, such
as nouns vs. pronouns, across type of reference, such as anaphoric, text-external
or inferred, across registers, such as conversation vs. academic prose,
etc. It is also possible to investigate rhetorical features or structures
as they evolve during the course of a text.
Corpus
linguistic methods are ideal for research on registers and register differences,
because in order to establish similarities and/or differences between registers,
huge amounts of text are needed. For each register it is essential that
as many authors as possible are represented so that individual idiosyncratic
differences will be evened out. It is also important that as many linguistic
features as possible be investigated since categorization of genres cannot
be based on a few linguistic variables. Recent research on registers has
shown that linguistic variables tend to come in clusters, i.e. certain
linguistic features co-occur with other linguistic features in texts. Furthermore,
these clusters can be shown to be in complementary distribution. That is,
if one bundle of linguistic features is evident in texts, then another
bundle is typically lacking in the same texts. Thus, it has been argued
that registers and register differences are most accurately characterized
by their location along a dimension, of which five have been identified.
More research is needed on more registers and more languages. The results
of research on register characteristics can be used to develop instructions
for specific-purpose writing and for developing more accurate grammar checking
programs for computer and word-processing users.
Without
corpus linguistics, research on language acquisition has been limited to
the study of the language of very young children, the study of only one
or two learners, the study of only a few linguistic features, and has been
restricted to only one register. Corpus linguistics allows for the possibility
of studying certain linguistic features across a large number of speakers,
and thus it provides a basis for generalizations across language learners.
Also, because of increased computer capacity, bundles of features can be studied
and correlations among features are more easily detected. The development of,
for instance, writing proficiency can easily be studied through the examination
of a large number of texts written by school children. Various kinds of
texts can be studied and compared with equivalent registers for adults,
thus contributing to research on development of stylistic proficiency.
Finally, computer technology aids in the comparison between first and second
language learners, since groups of speakers with various backgrounds, can
more easily be compared.
Similarly,
research on historical linguistics can also benefit from corpus-based approaches;
with extensive text material from different historical periods both lexicographic
and grammatical features can be identified and traced chronologically.
Also, research on registers, and changes within registers across time,
can be conducted with the aid of corpus-based technology. Research on both
geographic and demographic variation over time has become easier to conduct,
as has research on individual authors' styles.
These
are only some of the possible research areas which can benefit from studies
based on corpora. Generally, corpus-based linguistics is an ideal research
method for answering most questions about language use, the only
limitation being the imagination of the analyst.
(The overview in this section is mostly based on Biber, Conrad
& Reppen 1998)
2. Available Corpora
A corpus is a linguistic database, i.e. a database
of language use, of either spoken or written language. However, a corpus
is not only a large amount of text material, it has to be compiled in a
systematic way and maybe even further processed in order to be useful.
Corpora are compiled in different ways depending on their purpose. The
most useful corpus consists of many registers, or genres, of both written
and spoken language. The more registers, the more authors and texts in
each register, and the larger the corpus, the better stratified the corpus
will be, and thus the more reliable the conclusions drawn from it will be.
Corpora
can either be tagged or untagged. A tagged corpus is a corpus where all
words have been marked in some way, for instance for word category, i.e.
nouns are tagged as nouns, verbs as verbs, adjectives as adjectives, etc.
An untagged corpus is not processed at all; the words have not been tagged
for word category, and it is simply raw text material. Tagged corpora are
sometimes lemmatized. A lemma is the base of a word, before any endings
are attached to it. Thus, in a lemmatized corpus it is possible to search
for a base such as, for instance, go, and the results will list
all instances of the lexeme go, i.e. all word forms of go.
Such a list will therefore include: go, goes, going,
went and gone. For further information, a very useful on-line
tutorial on corpus linguistics has been compiled by Catherine Ball
at Georgetown University.
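To make the idea of a lemmatized, tagged corpus concrete, here is a minimal sketch in Python. The toy corpus of (word form, lemma, word category) triples is an invented illustration; real corpora use their own annotation formats.

```python
# A toy lemmatized, tagged corpus: (word form, lemma, word category).
corpus = [
    ("We", "we", "PRON"), ("go", "go", "VERB"), ("out", "out", "ADV"),
    ("she", "she", "PRON"), ("goes", "go", "VERB"), ("home", "home", "ADV"),
    ("they", "they", "PRON"), ("went", "go", "VERB"), ("too", "too", "ADV"),
    ("after", "after", "PREP"), ("going", "go", "VERB"),
]

def search_lemma(corpus, lemma):
    """Return every attested word form whose lemma matches,
    as a lemmatized search would."""
    return [form for form, lem, pos in corpus if lem == lemma]

print(search_lemma(corpus, "go"))
```

A search for the base go thus retrieves go, goes, went and going in one pass, which is exactly what an untagged corpus cannot do: there, went and go are unrelated strings.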
Computer
software for corpus linguistics is basically of three kinds: concordance
programs, taggers/coders, and specifically made programs used to answer
certain research questions. Concordance programs are search engines which
give the result of the search as text samples. Depending on how the corpus
has been tagged, a concordance program can give us text samples of a specific
word, a collocation, a lemma or a syntactic construction (see next section
for an example of the display of the output). Taggers/coders are programs
made particularly to tag raw text material for variables such as morphological
properties, word category, lemmas, or even syntactic functions. In addition
to these two kinds of programs, the third possibility is to design one's
own computer program, or to have one written by a computational linguist. Such programs
are needed if the research question is of a very specific kind and concordancers
and taggers are of no use. Concordance programs and taggers/coders
are available as either freeware or commercial software. The
Linguist List has a site listing various software available for linguists
and linguistic analysis, including both freeware and commercial software.
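The core operation of a concordance program, often called KWIC (Key Word In Context), can be sketched in a few lines of Python. This is a hypothetical illustration, not the code of any of the programs just discussed, and the sample sentence is invented.

```python
def kwic(tokens, node, context=4):
    """Key Word In Context: return one line per hit, with the search
    word shown between `context` words of left and right context."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            # Right-align the left context so the hits line up in a column.
            hits.append(f"{left:>28}  {tok}  {right}")
    return hits

sample = ("The proposal would have to receive final approval and "
          "each ally will have to carry out its obligations").split()
for line in kwic(sample, "have"):
    print(line)
```

Aligning the hits in a column is what makes collocational patterns visible at a glance, which is why real concordancers display their output this way.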
A
huge number of linguistic corpora are being compiled in the world today.
Many of these are available on-line, usually with a password. A password
can either be obtained free of charge for a limited time or for limited
use of the corpus, or a site license can be bought on an annual basis. Alternatively,
corpora come on CDs, for either institutional or personal use. I will
now describe only a handful of the corpora which exist today.
The
British National Corpus contains 100 million words of both written
and spoken British English. It is segmented into orthographic units, i.e.
sentences, and tagged for word category. A copy of the corpus for one person
costs £50 and a network license costs £250. BNC will soon be
available on-line at the British Library.
A similar initiative is also being carried out for American English.
The
American National Corpus is scheduled to be launched in September 2002.
COBUILD
offers an English corpus of 450 million words, both written and spoken,
called the Bank of English (BoE). The majority of the material is from
1990 or later and is constantly updated. An on-line demo is available at
the COBUILD websites, which can be used to generate concordances on words
and collocations. More elaborate linguistic research, such as on collocations
with up to four intervening words, or on syntactic constructions, requires
a subscription at the cost of £500 for three user IDs per annum.
The subscription can be extended to ten user IDs at the most, and class
IDs can also be set up. The BoE is useful for everybody working on English,
and everybody teaching either English or ESL/EFL.
ICAME
is a collection of corpora of both spoken and written language. It comes
on CD with various fonts and software. It includes, for instance, the Brown
corpus (see next section), various corpora with historical texts, and the
Bergen Corpus of London Teenage Language. The price for one user ID is
approximately $440, and $1000 for ten user IDs.
The
Bergen Corpus of London Teenage Language (COLT) contains spoken conversation
of London teenagers aged 13 to 17, collected in 1993. It is approximately
500,000 words and tagged for word category. At the moment a 151-text sample
of the corpus is available on-line at no cost, where it is possible to
search for words, collocations and combinations of letters.
The
Corpus of Middle English Prose or Verse is a part of the Middle English
Compendium, also containing the Middle English Dictionary and a HyperBibliography
of Middle English Prose and Verse. At the moment it is available at no
cost. It contains 61 texts used in the Middle English Dictionary. It is
possible to browse the texts, to search for words, collocations and phrases,
and co-occurring words with up to 80 intervening characters. It is possible
to confine searches to individual texts, groups of texts, or the whole
collection.
The
Penn-Helsinki Parsed Corpus of Middle English contains 1.3 million
words from 55 texts of Middle English Prose. It is tagged for word category
and it is syntactically parsed. A site license for five users costs $200
for the corpus and $50 for the search program.
The
International Corpus of English (ICE) is a project on varieties of
English worldwide. It started in 1990 and includes 15 research teams in
different parts of the world, each compiling a corpus of their own regional
or national variety of English, including Australia, Great Britain, Singapore,
India, Canada, South Africa, New Zealand, etc. After completion the corpus
will contain 15 million words, 1 million for each variety of both spoken
and written English. The corpora are being annotated at various levels:
text level, word category, and phrasal and sentence level. Certain subcorpora
will be coded for phonetic/phonological variables. Some of the corpora
are free, others are a part of ICAME (see above), yet others can be obtained
through their home institutes.
The
American component of ICE contains a part on spoken language, compiled
in the Linguistics Department at the University of California, Santa Barbara.
The corpus is often referred to as CSAE (the Corpus of Spoken American English)
or SBCSAE. Its main web page is at UCSB, but the corpus is distributed
by the Linguistic Data Consortium (LDC, see next section). It comes on
three CDs and costs $75 for non-members of the consortium.
CHILDES
is the Child Language Data Exchange System. Various researchers working
on language acquisition, both first and second language acquisition of
children and adults, have contributed to this exchange system. The system
contains a databank of transcribed conversation, computer programs to analyze
the transcripts, methods of how to transcribe and code the material, and
the possibility of linking the material to audio/video systems. In order
to obtain the material individuals can join the TalkBank
and access the database for free.
Corpora
have been compiled not only for the English language but for other languages
as well. Michael
Barlow has links on his websites to corpora in 21 languages and/or
language families, such as Chinese, Russian, Swedish, Danish, Dutch, Spanish,
Turkish, and many more. There are also the so-called parallel corpora which
either contain parallel texts in two or more languages or are translations
of each other. There are links on Michael Barlow's sites to various parallel
corpus projects being compiled in the world. Some of those include
English - French, English - German, English - Thai, English - Norwegian,
Swedish - English, Swedish - German, Swedish - French, English - French
- Greek, English - French - Dutch, English - Swedish - Norwegian - Finnish,
some Eastern European languages, to mention only a few.
Finally,
the largest electronic corpus available on-line is the World Wide Web
itself, and it can be searched with various engines, such as Google, Lycos,
Yahoo, etc., at no cost.
3. Instruction Sheet for a Guest Account at
the LDC Corpora
The Linguistic
Data Consortium (LDC) is a consortium compiling and distributing databases
of mostly English, but of other languages as well. It was founded
with research grants from ARPA and NSF in 1992, and is hosted by the University
of Pennsylvania. The LDC consists of 214 different corpora, of which three
corpora are available at no cost to everybody. These three corpora are
the Brown corpus, the TIMIT corpus and the Switchboard corpus. The Brown
corpus is approximately 1.2 million words, containing texts from at least
15 written registers within the humanities, such as belles-lettres, reports,
fiction, biography, popular culture, etc. It exists, and can be accessed,
as a text file, and can thus be used for lexicographic research. The TIMIT
corpus is a corpus of recorded speech, containing 6,300 sentences, recorded
from male and female speakers of eight dialects of American English.
It exists as a speech file and is made for acoustic-phonetic research and/or
research on speech recognition systems. The Switchboard corpus is approximately
2.4 million words of telephone conversation in American English. It exists
both as a speech and text file. It can thus both be used for acoustic-phonetic
research and lexicographic research.
For
free access to the Brown, TIMIT and Switchboard corpora you need
to sign up for a guest account. When entering the LDC website you click
on LDC Online. This will take you to a site which allows you to attend
an interactive tutorial on the corpora, to sign up for a guest account,
to find answers to all frequently asked questions about the LDC and its
corpora, and finally to access either the text corpus (Brown and Switchboard)
or the speech corpus (TIMIT and Switchboard).
After
signing up for the guest account you will receive a password through e-mail.
You will be asked for this password together with your user name when you
try to enter either the text corpora or the speech corpora on the LDC Online
site mentioned above. When you have logged in and agreed to the conditions
for using the corpora you will get a list of the options you have and a
list of all the corpora available on-line. However, as a non-member, you
can only access the two text or speech corpora that are available to guests.
Let
us concentrate on the text corpora. The first five options before the list
of all the on-line text corpora serve the purpose of doing searches, etc.,
on combined or all of the corpora, and are thus not of interest to non-members.
Scroll down the page and select either Brown or Switchboard. At this point
you have four choices: you can choose to view the corpus, to do a search
for either one word or two words and get a text concordance, to obtain
frequencies and/or combined frequencies, or to get the corpus's histogram,
which includes frequency lists of all word forms in the corpus. The following
display shows the search results for the combination have to in
Brown:
====================================
Text concordances for have to
Search result for corpus BROWN

 3737  have/HV
 3737  have
14047  to/TO
10567  to/IN
    3  to/NIL
    1  to/QL
    1  to/NN
24619  to

In region 0 to 1189209
Found: 255 have to
Page Number: 0   Page Length: 100
----------------------------
article   e Republicans would have to face is a state law
article   hether notice would have to first be given that
article   The proposal would have to receive final legis
article   <s> Each ally will have to carry out obligatio
article   year and would only have to put up half that am
article   ually , Davis would have to toss in the towel s
article   tration will either have to cut down expenses o
article   two days , it will have to be at the expense o
article   speculating , but I have to think Jack feels he
article   ants . <p> <s> `` I have to stay with Nieman fo
article   ere told they would have to get to know certain
article   sold and they would have to get to know people
article   nd potatoes -- they have to have that go-go-go
article   ctory . <s> If they have to take any car , they
========================================================
From this display we can read that out of 3,737
occurrences of have and out of 14,047 occurrences of the infinitive
marker to (tagged to/TO, as opposed to the preposition to/IN), there are
only 255 occurrences of have to in this
corpus. Below the statistics a few text samples from the corpus are given
with the sequence have to in the middle. You can specify how many
text samples you want from the text, and how long you want the context
to be. A considerable number of text samples should suffice as material
for a detailed examination and analysis of the possible collocations and
grammatical behavior of have to, i.e., for instance, whether it selects
for certain types of main verbs, whether it tends to occur in certain tenses,
whether it is used with a personal subject or impersonally, etc.
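One simple way to examine which main verbs have to selects is to count the word that follows each hit. Here is a minimal sketch in Python; the sample string is an invented stand-in for real concordance output, not data from Brown.

```python
from collections import Counter

def words_after(tokens, first, second):
    """Count the word immediately following each occurrence of a
    two-word sequence (e.g. the main verbs following 'have to')."""
    return Counter(
        tokens[i + 2]
        for i in range(len(tokens) - 2)
        if tokens[i] == first and tokens[i + 1] == second
    )

sample = ("they have to get to know people and we have to get going "
          "before they have to take the car").split()
print(words_after(sample, "have", "to"))
```

Run over a full set of concordance lines, such a count immediately shows whether have to favors particular main verbs or whether its complements are evenly spread.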
It
is not only possible to search for words and combinations of words in Brown,
i.e. for untagged raw word forms, but also for forms already tagged for
word category. This means that it is possible to search for the word to
as either an infinitive marker or as a preposition. It is also possible
to search only for a certain word category without specifying any particular
word. Such a search might give us all instances of nouns if we searched
for the word category Noun. A search for forms irrespective of upper- or
lower-case usage in the text is also possible, and, finally, it is possible
to search for lemmas.
Returning
to have to: since Brown consists solely of written material, it is
interesting in this context to find out whether have to shows
the same behavior in spoken language. It is thus worth doing a similar
search in Switchboard! It turns out that have occurs 29,445 times,
to occurs 70,339 times, and the combination have to occurs
4,848 times. Switchboard differs from Brown in being untagged; thus the
70,339 instances of to are all instances of to irrespective
of word category or part of speech, i.e. to can here be either an infinitive
marker or a preposition. Going back to have to, it turns out that
there are 4,848 instances of have to in a spoken corpus of 2.4 million
words, whereas there are only 255 occurrences in a written corpus of 1.2
million words. Even when the difference in corpus size is taken into account,
that is roughly 2,000 occurrences per million words in speech against roughly
200 per million words in writing. These data therefore clearly show that the use of have
to is much more common in spoken language than in written language.
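Because the two corpora differ in size, raw counts are best compared after normalizing to a common base, typically occurrences per million words. A quick sketch in Python using the figures quoted above:

```python
# Raw figures quoted above for "have to" in the two corpora.
brown_hits, brown_size = 255, 1_200_000                # written
switchboard_hits, switchboard_size = 4_848, 2_400_000  # spoken

def per_million(hits, corpus_size):
    """Normalize a raw count to occurrences per million words, so that
    corpora of different sizes can be compared directly."""
    return hits / corpus_size * 1_000_000

written = per_million(brown_hits, brown_size)             # 212.5
spoken = per_million(switchboard_hits, switchboard_size)  # 2020.0
print(f"written: {written:.1f}/million, spoken: {spoken:.1f}/million")
print(f"ratio: about {spoken / written:.1f} to 1")
```

Normalizing this way is standard practice in corpus work: comparing the raw 4,848 against 255 would exaggerate the difference, since Switchboard is twice the size of Brown.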
Non-member
guest access to Brown and Switchboard is ideal for all comparisons
of words and collocations between spoken and written language,
since Brown is a well-stratified corpus with many registers and many authors
included in each register. Brown is thus a representative corpus of
written language.
GOOD LUCK!
References
Ball, Catherine N. Tutorial: Concordances and
Corpora. Available at http://www.georgetown.edu/cball/corpora/tutorial.html
Barlow, Michael. Corpus Linguistics. Available
at http://www.ruf.rice.edu/~barlow/corpus.html
Barlow, Michael. Parallel Corpora. Available
at http://www.ruf.rice.edu/~barlow/para.html
Biber, Douglas, Susan Conrad & Randi Reppen.
1998. Corpus Linguistics: Investigating Language Structure and Use.
Cambridge: Cambridge University Press.
Child Language Data Exchange System. Carnegie
Mellon University. Information available at http://childes.psy.cmu.edu/
Cobuild. University of Birmingham &
HarperCollins Publisher. Available at http://titania.cobuild.collins.co.uk/
ICAME Corpus Collection. University of
Bergen. Information available at http://www.hit.uib.no/corpora.html
International Corpus of English. University
of Hong Kong. Information available at http://www.hku.hk/english/research/ice/index.htm
The American National Corpus. Northern
Arizona University & Vassar College.
Information available at http://americannationalcorpus.org/
The Bergen Corpus of London Teenage Language.
University of Bergen. Available at http://www.hit.uib.no/colt/
The British National Corpus. Oxford University.
Available at http://www.hcu.ox.ac.uk/BNC/
The Corpus of Middle English Prose or Verse.
University of Michigan, Oxford Text Archive & the Humanities Text Initiative.
Available at http://www.hti.umich.edu/c/cme/
The Corpus of Spoken American English.
University of California, Santa Barbara. Information available at http://www.linguistics.ucsb.edu/research/sbcorpus/default.htm
The Linguist List. Eastern Michigan University
& Wayne State University. Available at http://www.linguistlist.org
The Linguistic Data Consortium. University
of Pennsylvania. Available at http://www.ldc.upenn.edu
The Penn-Helsinki Parsed Corpus of Middle English.
University of Pennsylvania. Available at http://www.ling.upenn.edu/mideng