Corpus Studies Definitions English Language Essay



A corpus is a large collection of texts produced by native speakers of English, which are stored electronically and can be accessed using search software. Users type in a query (a word or phrase) to generate 'concordance lines' (randomly selected lines of text containing the target language), which are extracted from the corpus.

Concordancer: A concordancer is a computer program that is able to search rapidly through large quantities of text for a target item (morpheme, word, or phrase) and print out all the examples it finds in the contexts in which they appear.
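To make the definition concrete, the following is a minimal sketch of a keyword-in-context search written in Python. It is an illustration only and not the software used in any of the studies summarized here; the function name `concordance`, the `width` parameter, and the sample sentence are all assumptions made for the example.

```python
import re

def concordance(text, target, width=30):
    """Return keyword-in-context lines for every match of `target` in `text`."""
    lines = []
    # Match the target as a whole word or phrase, ignoring case.
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Keep up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Hypothetical sample text, used only to show the output format.
sample = "Thank you for coming. Once again thank you for spending time with us."
for line in concordance(sample, "thank you", width=20):
    print(line)
```

Each printed line shows the target item in brackets with its left and right context, which is the basic layout a concordancer presents to the user.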

The aim of this study is two-fold:

1. To determine whether training learners in the use of online corpora would have any noticeable effect on the 'naturalness' of their redrafted essays;

2. To explore learners' reactions and preferences regarding the BNC (British National Corpus) and COBUILD (the COBUILD Concordance and Collocations Sampler).

The participants in this study were forty-five second-year intermediate-level Japanese university students enrolled on a compulsory academic writing course. They were asked to write a factual report based on the theme of 'obsession'. Sentence-level, lexical, and grammatical problems in students' first drafts were highlighted by the teacher. Prior to any corrections, learners received a 30-minute introduction on how to use online corpora. Students were then asked to produce second drafts correcting problem areas identified in their first drafts by referring to one or both of the online corpora introduced. Sentences identified as problematic in the first drafts were isolated and rated for 'naturalness' by four native-speaker teachers. Students were also asked to comment on the usefulness of online corpora for improving their writing and on their preferences for either the BNC or the COBUILD Concordance and Collocations Sampler.

Of the changes made by students between the first and second drafts, 214 (61.14 per cent) were rated as more natural, 114 (32.57 per cent) as equivalent, and 22 (6.29 per cent) as less natural by the native-speaker raters.
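The percentages reported above can be checked against the raw counts. The short Python sketch below simply recomputes each share from the three counts given in the text; the variable names are chosen for illustration.

```python
# Counts of rated changes as reported in the study summary.
counts = {"more natural": 214, "equivalent": 114, "less natural": 22}

# Total number of changes rated by the native-speaker teachers.
total = sum(counts.values())

# Share of each rating as a percentage, rounded to two decimal places.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

print(total)
print(shares)
```

The three counts sum to 350 rated changes, and the recomputed shares agree with the figures quoted in the text.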

Student feedback on the activities was generally very positive, with 95 per cent of respondents believing that online corpora were a useful resource to aid them in redrafting their essays. In terms of preferences, 84.5 per cent of students preferred the COBUILD Concordance and Collocations Sampler to the BNC, typically stating that it was more user-friendly and faster.

Since around 61 per cent of changes made to students' first drafts, with the support of online corpora, resulted in more natural language, we can safely say that this is an approach worthy of further investigation. There was no control group in this study so it is impossible to say to what degree the improvements seen can be attributed to the training given. Another point here is that the reported results are somewhat distorted by subjects who made no effort to improve their second drafts, increasing the number of equivalent ratings. Furthermore, the whole sentence or clause containing mistakes had been underlined, so it was not always obvious to learners exactly what to search for, and the wrong choice could easily produce misleading information. For busy teachers, online corpora can reduce their workloads by providing learners with the support they need to make corrections autonomously, without the necessity of lengthy explanations in the margins.

Article 2: Expressions of gratitude by Hong Kong speakers of English (pragmatics 2009)

In this study, the data consist of real-life spoken discourse by Hong Kong speakers of English. The data comprise 300 samples of approximately 2000 words each, and are part of the Hong Kong component of the International Corpus of English (ICE-HK), which was made publicly available in March 2006.

The one-million-word corpus is intended to be broadly representative of the English language used in Hong Kong in the 1990s. In each text category of the spoken section of the corpus, both males and females were sampled, and speakers from different age groups and from different regional backgrounds were also included. All the informants in the corpus have completed second-level schooling, which serves as the minimum requirement.

The study here centers on the stem 'thank', which is contained in the most frequently used formulaic expressions conveying gratitude, such as thanks, thank you, thanks a lot, and thank you so much, and on additional sequences or strategies that express gratitude. These thanking strategies fall into three broad groups. The first group is a formulaic expression (e.g. thanks / thank you) accompanied by other utterances that reinforce the expression of gratitude. The accompanying utterance can take the form of a single word (e.g. Thank you, Professor) or it can be a complete sentence (e.g. Thank you. That's very sweet of you). The second group involves a single occurrence of the formulaic expression only. Speakers use the formulaic expression (e.g. thanks / thank you) as a signal to close the conversation or as a response to an expression of gratitude in previous discourse. The third group is thanking as an extended turn. It is characterized by two properties: 1. Thanking is accomplished by means of several turns rather than the single turn used in the first and second groups. 2. There are usually two or more thanking strategies being used in an extended turn.

In the study the data extraction, coding and sorting are carried out with the help of WordSmith Tools version 5.

A total of 233 thanking expressions were examined for Hong Kong English. The results showed that Hong Kong speakers of English do not employ a wide variety of thanking strategies. Their acts of thanking are often used as a closing signal [Okay thank you bye bye] (28.8%) and as a single turn (26.6%). In the corpus, the single expressions of gratitude thanks and thank you tend to be used as complete turns more frequently than any of the longer formulaic sequences such as thank you very much and thanks a lot. In fact, the intensified expressions of gratitude are not common. Instead, the single lexical items thanks and thank you are often used in everyday conversation and in courts. One-fifth of the ICE-HK corpus data contains thanking expressions followed by other speakers' titles and names. They are used in parliamentary debates and in spontaneous commentaries (notably horse racing). Another one-fifth of the corpus data is accounted for by the 'thanking + stating reason' strategy. Expressions of gratitude in this category typically begin with thanks/thank you and are then followed by for + verb + -ing (Uh once again thank you for coming here and spending time with us).

In expressing their gratitude, Hong Kong speakers of English seldom (3%) compliment their interlocutors or repeat sequences and lexical items of gratitude in an extended turn [Thank you for helping me for in this assignment. and thank you really thank you].

One possible explanation would be that Chinese people are too reserved to express their gratitude openly and explicitly. By contrast, these "gratitude clusters" are commonplace in foreign countries. Hong Kong speakers of English rarely employ thank you as a response to their interlocutors' expressions of gratitude. 'Thanking + refusing' is also rare in the corpus. This might seem to suggest that Hong Kong people (and the Chinese in general) are not inclined to refuse an offer.

The frequency of occurrence of different strategies of expressing gratitude can thus be used as one of the guiding principles for the selection and prioritizing of language content. More focus can also be given to teaching the ways in which one can express gratitude and at the same time reject an offer.

Article 3: Issues in creating a corpus for EAP pedagogy and research (EAP 2007)

Corpora have been used in EAP since the 1980s, but were initially used mainly for research. Corpora have proved useful in determining the features of an academic register, in terms of both word frequencies and specific vocabulary. The advent of corpora has greatly stimulated register and genre analysis.

The value of corpus work in EAP teaching lies in the fact that it can both replace instruction with discovery and refocus attention on accuracy as an appropriate aspect of learning. It also promotes a learner-centred approach bringing flexibility of time and place.

Issues in creating a corpus for EAP pedagogy:

1. Domain categories: There is a lack of consensus among academic institutions and librarians, as well as within the corpus community, in this regard. The major differences are in assigning specific subject areas to broader categories. For example, while the Brown corpus lists Political Science in a separate category alongside Law and Education, a completely different approach is taken by the Academic Corpus, which puts Politics under the Arts category. The differences in classification systems can result in incompatibilities between EAP corpora, potentially preventing users from accessing several corpora simultaneously. One way of addressing the issue of classification is to localize classification to a particular EAP environment. For example, universities in the UK could adopt a joint classification system.

2. Genre categories: There is only limited consensus about what genre fundamentally entails. In some cases, the corpus designers avoid such genre classification problems by themselves setting limitations on the types of data collected. For example, they may impose the restriction that data contributors can submit only argumentative essays or literature examination papers.

3. Text integrity: EAP corpora sometimes contain incomplete texts. Removing references and quotations, for instance, from academic texts raises a serious question about the 'authenticity' of such texts. Furthermore, EAP teachers will sometimes need examples to show the typical length and type of statement that tends to be quoted, and which verbs are used to introduce the quoted text.

4. Different levels: Some EAP corpora are restricted to data from specific levels of EAP, such as the Reading Academic Text (RAT) corpus. A comprehensive EAP corpus should contain texts produced by students at all levels.

5. All grades: Some EAP corpora contain only student texts that have been awarded high grades. However, without lower-grade student texts, there is no opportunity for monitoring progression, or for making comparisons with the higher-grade student writing.

6. Availability: EAP corpora, like EAP courses, are often intended only for intra-institutional use. This means that there is little sharing of best practice, little institutional cooperation, and considerable duplication of effort. Moreover, many corpora are collected for personal research, and are never used again. Several valuable resources are created but not made available to the public at all.

Article 4: Learner corpora: The missing link in EAP pedagogy (EAP 2007)

This article deals with the place of learner corpora, i.e. corpora containing authentic language data produced by learners of a foreign/second language, in English for academic purposes (EAP) pedagogy and sets out to demonstrate that they have a valuable contribution to make to the field.

The specificity of learner corpora lies in the fact that they contain data from foreign or second language learners. Interest in learner corpora is growing fast and has already generated a range of stimulating studies, which highlight the potential of this new resource for the EAP field. For instance, Flowerdew (2001) shows how careful investigation of learner corpus data can help uncover three areas of difficulty in learner EAP writing: collocational patterning, pragmatic appropriacy, and discourse features.

One important finding from learner-corpus-based studies in general and EAP in particular is that some of the linguistic features that characterize learner language are shared by learners from a wide range of mother tongue backgrounds while others are exclusive to one particular learner population. The shared features can be assumed to be developmental while the latter are presumably due to transfer from the learners' mother tongue.

A few materials designed to help students improve their academic writing skills are corpus-informed. Most of these (few) materials tend to be based on native corpora only. But learner writing is characterized by errors and infelicities which are often quite different from those found in native writing, even novice native writing. By relying solely on native corpus data, EAP materials ignore these and thus fail to provide non-native learners with the type of information that is arguably most vital to them. What L2 learners really need is EAP resource books addressing the specific problems they encounter as non-native writers. Learner corpora make such an approach possible.

Article 5: Learners' writing skills in French: Corpus consultation and learner evaluation (second language writing 2006)

This study aims at investigating the effects of corpus consultation on students' writing and their reaction to the process.

The learners' task was to write a short essay (600 words) in French on a topical subject concerning language diversity in France. The corpus was therefore compiled from a variety of texts in French, written by educated and informed native speakers, relating to the history and development of the French language and to current issues concerning the language. The corpus analysis tool used for the purpose of this study was WordSmith Tools.

The students involved in this study were 14 undergraduate learners of French following the B.A. in Applied Languages and the B.A. in Applied Languages with Computing at the University of Limerick. As part of the module requirements, students completed a written assessment. The writing assessment consisted of a short essay in French (600 words) relating to an aspect of the French language. Students completed this task in their own time and could consult dictionaries and grammars. As they submitted the essays, they were just beginning classes on corpus consultation skills over a three-week period.

The marking of the assessment took the form of underlining errors/mistakes and placing an x in the margin to indicate a basic inaccuracy, for example, in gender, agreement, verb forms, lexical and stylistic issues. Students were given an introduction of 10 minutes outlining what they had to do; they were allowed up to 100 minutes to consult the corpus, revise their texts, make any changes they wished, and provide feedback on the corrections which they had made using the concordancing output.

The following categories of errors were identified: grammatical errors (prepositions, articles, singular/plural, adjectives, tenses); lexical errors (word choice, informal usage, idioms); syntactic errors (sentence structure, word order); and substance/mechanical errors. Prepositions accounted for the greatest number of changes within the grammatical category. The value of the concordancer in these attempted changes lies in the fact that it can make correct forms and prepositions more salient to the learner, possibly making them more memorable than a dictionary or grammar would. Twenty-two of the 28 attempts to correct word choice and inappropriate vocabulary were successful. In several of the examples of errors, it would appear that the learner had a specific word in mind but did not use this word in the correct context. This is where the real value of the concordancer lies: it shows how words should be used in context.

The students' overall feedback on the use of the corpus was positive.

The results of this study suggest that, with training and guidance, consultation of an appropriate corpus may provide a means for the learners to participate more actively in the development of their writing skills. This active participation could be enhanced by integrating corpora and concordancing into the word processing environment.

Article 6: Discourse Particles in Corpus Data and Textbooks: The Case of Well (Applied linguistics 2009)

Discourse particles such as okay, so and well are syntactically optional linguistic items which have little or no propositional value but serve important pragmatic functions. In the pedagogical setting, however, discourse particles are often dismissed as a sign of disfluency and their use is discouraged. Without these items in their speech, learners may come across as unnatural, dogmatic and/or incoherent, leading to a greater possibility of communicative failure.

The present study looks at how the discourse particle well is used by expert speakers in an intercultural spoken corpus in Hong Kong and how it is described and presented in textbooks designed for learners within the same community. Data are drawn from two sources. The first is the Hong Kong Corpus of Spoken English (HKCSE), which contains approximately 1 million words of naturally-occurring speech. The corpus consists of 311 texts which are primarily intercultural encounters in English between Hong Kong Chinese whose first language is Cantonese and speakers of languages other than Cantonese. Participants in the corpus are all competent speakers of English who regularly and successfully communicate in English either professionally or socially. The second source, created to compare particle usage in 'real' English and 'school' English, is a database consisting of 15 English textbooks collected in Hong Kong. This allows a direct comparison of the use of well between invented texts in textbooks and naturally-occurring data in the HKCSE. The analysis focused on three aspects of D-use well, namely its frequency of occurrence, position and discourse function.

Results and discussion

The functions of well as described and realized in textbooks did not seem to be a close match with their functions in corpus data. Given that discourse particles are crucial in achieving pragmatic competence and that the descriptions and examples of discourse particles in the textbooks examined are far from satisfactory, substantial revisions with the incorporation of naturally-occurring examples are required in order to present a more comprehensive picture to students concerning how discourse particles are used.

Article 7: Can a graded reader corpus provide 'authentic' input? (ELT 2009)

Graded readers are a useful way of motivating learners to read extensively, through the accessibility they provide by limiting the number of headwords. This accessibility also makes them a valuable resource when made into a corpus, a database of texts, for learners not yet able to manipulate an authentic corpus.

The question is how far it can be assumed that a corpus of graded readers like this reflects authentic language. Do the language patterns learners need to know still emerge? A useful way of examining this is to compare the occurrence of lexical chunks in graded and authentic corpora. Lexical chunks are considered fundamental to achieving native-like fluency, so even text that is simplified should contain such chunks to provide useful input. This makes it important to find out how far such items are filtered out in the grading process.

The corpus

In this study, B1 and B2 corpora were made up of simplified texts from graded readers. The Penguin series of graded readers was chosen because of the variety of genres and topics covered. The authentic corpus used was the written-only portion of the BNC, comprising around 93 million words of a wide variety of text types. The Keywords function of WordSmith Tools (Scott 1996) was used to find words whose frequency was unusually high in the graded corpora in comparison with the BNC.
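The idea behind a keyword comparison can be sketched with a log-likelihood score, one commonly used keyness statistic. The summary above does not state which statistic the study relied on, so the choice of log-likelihood here is an assumption, as are the function name and the example figures.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word: study corpus vs. reference corpus."""
    total = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Hypothetical figures: a word seen 50 times in a 1,000-word graded corpus
# and 100 times in a 10,000-word reference corpus.
print(log_likelihood(50, 1000, 100, 10000))
```

A score near zero means the word is about equally frequent in both corpora; larger scores flag words whose relative frequency differs markedly between the two.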


The first thing considered was how many word clusters recurred within the two graded corpora, to see if this was comparable with an authentic corpus. The B1 and B2 corpora showed a similar number of occurrences of clusters at each level, and the BNC, when scaled down proportionally, showed a similar pattern overall. However, there are fewer two-word clusters occurring in the BNC compared to the two graded corpora. Two- and three-word clusters (chunks) occurring in the graded corpora showed relations of time and place, other prepositional relations, interpersonal functions, and linking functions. In other words, through the graded corpora learners will be exposed to plenty of chunks that are representative of the most commonly used authentic language. Of course, the limitations of the corpus restrict the learners' exposure to some very frequent chunks. To a large extent, this appears to be bound up with text type in the graded corpora, and the predominance of fiction. It is difficult, perhaps impossible, to find graded texts which reflect the range included within an authentic corpus.
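The cluster comparison described above rests on two simple operations: counting recurring word sequences and scaling counts so that corpora of different sizes can be compared. The Python sketch below illustrates both under stated assumptions; the tokenizer, the `min_freq` threshold, the per-million scaling, and the sample sentence are illustrative choices, not details taken from the study.

```python
from collections import Counter
import re

def clusters(text, n, min_freq=2):
    """Return n-word clusters that occur at least `min_freq` times in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= min_freq}

def per_million(count, corpus_size):
    """Scale a raw count to a rate per million words."""
    return count * 1_000_000 / corpus_size

# Hypothetical sample text, not taken from any graded reader.
sample_text = "at the end of the day he sat at the end of the road"
print(clusters(sample_text, 2))
print(clusters(sample_text, 3))
print(per_million(50, 100_000))
```

Normalizing to a common base in this way is one means of comparing a small graded corpus with a much larger reference corpus.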

Article 8: An exploratory study of collocational use by ESL students: a task based approach (system 2009)

The present study was an attempt to better understand the competence in L2 collocational use of the ESL secondary school leavers in the Hong Kong context by looking at their actual performance in a writing task, deploying two highly comparable native and non-native corpora compiled from a portion of the writing of 60 Hong Kong and 60 British students. The size of the corpora was relatively small. The 60 British essays were collected from a school in northern England. The corpora were worked on with ConcApp, a free concordancer.

Data analysis began by identifying all the lexical words. The focus of analysis was on language that was overused, under-used or not used by the L2 learners compared to that of their British counterparts. No statistical tests were used to find out 'significant' differences between the two groups because the main concern of the study was pedagogical.

In general, the two groups used nouns and adjectives more frequently than verbs and intensifiers. Nevertheless, the frequency of use of nouns, adjectives, verbs and intensifiers was on the whole higher in the British corpus than in the Hong Kong corpus. The HK learners were, on the whole, weaker in vocabulary compared to their British counterparts. It was interesting to note that the Hong Kong students seemed to have a preference for the amplifier 'very' while the British students preferred the downtoner 'quite'. The HK learners used fewer collocations and an extremely restricted range of collocating words compared to the British students. Surprisingly, some collocating words used by the British students with, for example, 'scar', such as 'large', 'big' and 'deep', were words also known to the Hong Kong learners but were not used by them in collocation, e.g. 'large scar'.

There is some evidence that L2 collocational use is affected by the L1, for example in the use of 'circle' in various examples in the Hong Kong corpus, e.g. 'circle eye'. Such use of 'circle' might be explained by the fact that 'round' and 'circle' are expressed by the same lexical item in Chinese. Some other collocational uses were affected by confusion within the L2. For example, as regards the collocation 'curly hair', the Hong Kong learners seemed to be confused by L2 pronunciation (e.g. 'cury hair', 'curley hair'), or they might have confused it with loose synonyms in English (e.g. 'curve', 'coil' and 'curvy'). Furthermore, some informal collocations found in the British corpus were totally absent from the HK corpus. Such collocational use, like the intensifier + adjective collocations (e.g. a bit doggy), contributes to the sense of nativeness in English, of which the HK learners seemed to be completely unaware.

Finally, teachers should raise awareness of collocation use in the students' L1. It is paramount to draw the learners' attention to L2 collocational use that differs from that of their L1 whenever necessary. To minimize the adverse effect of learners' confusion within the L2, vocabulary should be taught in collocational contexts.

Article 9: A corpus-based lexical study on frequency and distribution of Coxhead's AWL [Academic Word List] word families in medical research articles (RAs) (ESP 2007)

Coxhead screened out 570 "words" with high frequency and wide text coverage from academic texts in her Academic Corpus, irrespective of subject areas and disciplines. A corpus-based research approach was adopted to study the 570 academic word families from Coxhead's AWL in the field of medicine, with an emphasis on their frequency and coverage in medical RAs.

The relevant materials and data in this study were mainly obtained from three sources: the public Internet and two databases, namely ScienceDirect Online and Medline. The data processing mainly comprised the normalization and segmentation of the AWL terms. Normalization was a lexical step to transform the words into their basic forms. After normalization, the final lexical terms were mostly composed of segmentation units that were elementary to inflected or derived forms in the word family. The first step in processing the word families was thus to normalize and segment the words so that the occurrence of each word family in the corpora could be counted with a self-designed computer program. This process led to the production of a Whole Paper Corpus (WPC) consisting of 50 medical RAs written in English. The 570 AWL word items were first input into computer memory as the basic word database. The four sections (Introduction, Methods, Results, and Discussion) and the attached Abstract of the RAs were then input separately into five sub-corpora, which together formed the integrated WPC.

The result of the study on the text coverage of the AWL word families in medical RAs (10.073%) suggested that academic words play a role in medicine similar to the one they play in other subject disciplines. The AWL word families had about 10% coverage of the running words in both the WPC and each of the five sub-corpora, which demonstrates that they form a large proportion of the running words in medical RAs and are distributed throughout the whole text with high dispersion. The higher coverage of academic words in the Abstract and the Discussion section and the lower coverage in the Materials and Methods and the Results sections in this study can be explained by the claim that academic words are often used when people tend to express abstract ideas rather than content. A more detailed analysis of a small number of words in this study showed that language learners should be made aware not only of the frequency of some word items, but also of the influence of subject matter and academic discourse on lexical units, which might vary in accordance with the different subjects and genre categories. Choosing and using academic words in RAs should meet a two-sided requirement: on the one hand, the rhetorical functions the academic words themselves are supposed to serve and, on the other hand, the actual need for these rhetorical functions in each RA section under specific circumstances.
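Text coverage of this kind is the percentage of running words in a text that belong to a given word list. The study used its own program, which is not described in detail here, so the Python sketch below is only an assumed illustration of the calculation; the three-family word list, the tokenizer, and the sample sentence are hypothetical.

```python
import re

def coverage(text, word_list):
    """Percentage of running words in `text` that appear in `word_list`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in word_list)
    return 100 * hits / len(tokens)

# Hypothetical mini word list: each family maps to a few of its member forms.
families = {
    "analyse": {"analyse", "analysed", "analysis"},
    "data": {"data"},
    "significant": {"significant", "significantly"},
}
# Flatten the families into one set of word forms for fast lookup.
word_forms = set().union(*families.values())

sample_text = "The data were analysed and the analysis was significant"
print(round(coverage(sample_text, word_forms), 2))
```

Running the same function over each section of an article separately would give section-level coverage figures of the kind the study compares.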

Article 10: Academic vocabulary in agriculture research articles: A corpus-based study (ESP 2009)

This study integrates corpus-based and genre-based approaches, studying the research article to uncover specific characteristics of academic vocabulary using the AWL as its point of departure. This study focuses on frequency, coverage and distribution of the words from the AWL in agriculture research articles, both in the whole article and across its sections.

For the study, we built an 826,416-word corpus of research articles in the agricultural sciences. The corpus is representative of a genre, the experimental research article, and of a discipline, agriculture. It consists of articles taken from the on-line versions of journals indexed by the Science Citation Index (SCI) Report. The corpus contains 218 articles produced by academics working in English-speaking universities, and selected from journals published between 2000 and 2003, which were specifically recommended by subject specialists at our university. The documents were prepared to be accessed as whole texts or as individual sections (sub-corpora). The units of analysis were tokens, types and families. Types are defined as single word forms; tokens as the number of occurrences of each type; and families as a collection of formally and semantically related word types.
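The three units of analysis can be illustrated with a short Python sketch. It follows the definitions given above, but the sample sentence, the tokenizer, and the small `family_of` mapping are hypothetical and stand in for a full word-family list.

```python
from collections import Counter
import re

# Hypothetical sample sentence, not taken from the corpus.
sample_text = "We analysed the culture and the analysis confirmed the cultures"
tokens = re.findall(r"[a-z]+", sample_text.lower())

# Types are distinct word forms; the counter stores the tokens of each type.
type_freq = Counter(tokens)
n_tokens = len(tokens)
n_types = len(type_freq)

# Hypothetical mapping from word types to the headword of their family.
family_of = {
    "analysed": "analyse",
    "analysis": "analyse",
    "culture": "culture",
    "cultures": "culture",
}
# Family frequency groups related types under a single headword.
family_freq = Counter(family_of[token] for token in tokens if token in family_of)

print(n_tokens, n_types)
print(family_freq)
```

Grouping by family is what allows forms such as 'culture' and 'cultures' to be counted as one lexical item rather than two.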

The computer software used for the analysis was WordSmith Tools (WST) (Scott, 2004). For the study, we first determined the frequency and distribution of word types and tokens in the corpus. Then, using the GSL and AWL as match lists, we identified the academic and general words present in the corpus and their coverage. The families of the most frequent words were identified in the AgroCorpus.

The items were sub-grouped into general words, academic words and "other tokens". The last group included mainly technical words, but also other items, such as formulas and proper names. These tokens were not further analyzed, as the focus of the study was on academic vocabulary. As for types, the total number used in the corpus was 23,682. Their distribution across sections of the agriculture research article was different, indicating lower variability in the Results section and higher in the Methods section. The sub-group of AWL word types revealed that, of the total of 3107 types in the AWL, only 1941 occurred in the AgroCorpus, which means that 1166 items of the AWL (37.50%) did not occur at all in the AgroCorpus. The individual sections of our corpus differed not only in the number of types that occurred above the mean, but also in the number of families, with more families occurring in the Introduction. The Results section had the lowest number of AWL families; that is, it was the section with the lowest variation. There were academic words from the AgroCorpus that were used with technical rather than academic meaning. The word 'culture' provides an example of a word from the AWL used with technical meaning in the field studied. This example adds further evidence to the point that disciplines use words with preferred meanings and collocational behaviour. The results also lend support to the argument that vocabulary should be taught considering the students' specific target context. The argument in favor of the use of a general academic word list may be valid in contexts where English is a second language, as in academic writing courses for international students in English-speaking countries.