Constructing Inter-Rater Reliability of EFL English Language Essays

This research investigates how inter-rater reliability is constructed in the rating of EFL essay writing tasks in college English courses in Mainland China.

In Mainland China, all contemporary high-stakes tests of English proficiency treat the essay writing task as an indispensable component and allocate a large proportion of the total score to it. The corresponding washback is visible in college classroom teaching, where essay writing tasks have become the most feasible, necessary, and cost-effective routine assignments.

Washback is a crucial concept in language teaching and testing. Alderson and Wall (1993, p. 115) observed that the idea that "… testing influences teaching is commonplace in the educational and applied linguistics literature." Hughes (2003) gave a clear and simple definition: washback is the effect that tests have on learning and teaching. Alderson and Wall (1993) also noted that the phenomenon is known as "backwash" in general education circles, whereas "washback" is the more popular and recognizable term in British applied linguistics and in professional language testing circles. More specifically, washback is also defined as an indispensable part of the impact of tests upon society. Although some researchers, such as Alderson and Wall (1993), questioned in what ways washback exists and functions, the concept is widely accepted by language testing researchers today and is considered a significant topic in language testing and assessment research.

Because the Ministry of Education of China has in recent years pursued a policy of expanding university enrollment, it is quite common for a college English classroom to hold around 70 students. Since Chinese students usually learn English as a foreign language in the classroom, most class time is allocated to teaching fundamentals such as grammar and to communicative activities such as question-and-answer exchanges. From the perspective of cost-effectiveness, therefore, essay writing tasks are regarded as the most feasible homework assignments for students in EFL classrooms.

Moreover, in university-level EFL teaching, curricular design and interpretation vary considerably among universities. For instance, in the selected research context there is no specialized course for English writing proficiency; writing is usually incorporated into text analysis and grammar teaching. Rating is usually conducted mainly on the basis of each rater's educational background and individual benchmarks.

It is this significance that arouses my strong interest in exploring the area from the perspective stated in the topic. Much academic research has certainly been done in this area, but no similar research has been conducted in the selected context. Moreover, I was personally involved in the selected context as a college English teacher. Being acquainted with the routine rating procedure there, I am strongly interested in what is happening with respect to inter-rater reliability.

Research questions

The general hypothesis is that although the prevalent teaching of essay writing in college English in Mainland China gives rise to a wide range of rating activities, the reliability of that rating is not strong, since the teachers in the selected research context have not agreed on a common rating principle or set of rubrics. This research attempts to analyze the construction of inter-rater reliability in the assessment of EFL essay writing tasks in college English classrooms in Mainland China, from both a theoretical and a practical view. The study will address the following research questions:

What influence do raters' individual characteristics have on inter-rater reliability? Can these characteristics be categorized?

Corresponding hypothesis:

There is a strong interference from raters' characteristics on inter-rater reliability and the characteristics can be categorized.

Of the holistic and analytic rating methods, which contributes more to constructing inter-rater reliability?

Corresponding hypothesis:

Holistic and analytic rating methods contribute to constructing inter-rater reliability in different ways.

Does the halo effect function as a psychological impediment to inter-rater reliability or as a motivation for it?

Corresponding hypothesis:

The positive or negative halo effect in rating activities can function either as a motivation or as an impediment.

Of these three research questions, the first is the primary one, and the researcher will devote most attention to it. The other two are regarded as extensions of the primary question; that is, research questions 2 and 3 will be incorporated as two main parts of question 1.

Background introduction and literature review

Being an essential productive language skill, writing proficiency in English is constantly highlighted to serve language acquisition or examination-oriented purposes.

In Mainland China, most job applications set a threshold requirement for English writing proficiency, and English is a compulsory course in the curriculum from primary school to postgraduate level. Under these circumstances, the demand for effective and reliable assessment of English writing in this context is stronger than ever, since the interaction between writing task achievement and task assessment can reflect and measure the effects of both language teaching and language learning.

According to Hamp-Lyons (1991, p. 8), the reliability of writing assessment mainly refers to "the extent to which it 'yields the same results on repeated trials'" (Carmines & Zeller, 1979, p. 11). Class-based essay writing tasks for EFL students are a typical case in which subjective language knowledge and skills are explored and demonstrated in classroom-based English teaching. However, students' efforts in writing would turn out to be wasted if the rating, which depends closely on the raters, lacked acceptable reliability. Admittedly, many factors are involved in rating classroom-based essay writing tasks, such as writers' and raters' individual characteristics, the design of grading rubrics, the macro- and micro-requirements of writing assessment, rating methods, and the interpretation of scores. Inter-rater reliability can thus be defined mainly as the extent to which raters with different characteristics and backgrounds agree (Penny et al., 2000).

From the perspective of the writers, that is, the students, essay writing requires not only reproducing the ideas, evidence, and arguments of other writers but also reorganizing them into a new form of the student's own design (Ballard & Clanchy, 1991, p. 29). Meanwhile, we cannot ignore intervening factors such as "a disjunction between the attitudes to knowledge held by the students and the assumptions about the appropriateness of different (culturally shaped) attitudes to knowledge held by the staff who are now assessing their academic work" (Ballard & Clanchy, 1991, p. 19). In writing, each writer's output is shaped by his or her ideology, experience, emotions, knowledge, and so on (Hamp-Lyons, 1991), which gives the corresponding language performance its individuality.

Based on the above, it is reasonable to claim that familiarity with students' topical knowledge and macro-educational background is a prerequisite for constructing inter-rater reliability in essay writing; alternatively, teachers and raters should shape students' background knowledge in teaching in order to prepare for language performance that can elicit unanimous agreement from raters. For example, raters' unfamiliarity with the varied rhetorical styles that writers develop in different cultural backgrounds may give rise to harsher assessment of language performance that is in fact proficient. There is therefore a strong need to analyze the main independent variable, the raters. Linacre (1989, pp. 48-49, 51) adopted the term "severity" to indicate both the overall severity of raters and the differences among raters in their interpretation of rating scale thresholds for specific items. Both the overall severity and the more specific effects are covered by the term "rater characteristics" suggested by McNamara and Adams (1991/1994, p. 3), which is widely referenced in current writing assessment research. The hypothesis of this study thus becomes clearer: raters' individual characteristics should be regarded as the most important factor in constructing reliable assessment of essay writing, and the evidence for this is mounting. Hake (1986) reported that essays composed mainly as narratives of personal experience were misgraded more frequently than expository essays. Purves (1992) hypothesized that the differences between correlations for various functional essay types may be caused by rater variables rather than student variables. Hamp-Lyons and Mathias (1994) suggested that raters tended to give high scores to students who chose the relatively difficult essay prompt, since they may unconsciously reward those students or may have lower expectations regarding that prompt. Weigle (1999, p. 146) summarized that "… in addition to factors within the prompts themselves, raters variables have an important influence on composition scores." Weigle (1999, p. 147) also identified two main factors influencing rating reliability in writing: "rater expectation" and "rater background". As shown above, the variables that threaten the construction of inter-rater reliability are mostly related to raters' characteristics.

Apart from the characteristics mentioned above, holistic and analytic rating methods are also taken into consideration. Holistic and analytic methods for assessing compositions have been widely accepted in second language testing and teaching (Canale, 1981; Carroll, 1980; Jacobs, Zinkgraf, Wormuth, Hartfiel, & Hughey, 1981). Hughes (2003) defined holistic rating as the assignment of a single score to a piece of writing according to the rater's overall impression of it. Holistic rating today differs from its ancestor, general impression marking, in that a scoring rubric is provided (Weigle, 2002). On the one hand, holistic rating is indisputably rapid and time-saving, which has made it popular for many years in classroom EFL teaching. On the other hand, many researchers doubt its reliability, since it is too general and too rapid to yield a trustworthy decision. Harris (1968) found that when a rater scored a student's 20-minute composition only once, the reliability coefficient was merely 0.25. In other research, however, the reliability coefficient appeared to be much higher when the composition was scored twice or more. Analytic rating, that is, scoring that requires a separate score for each of a number of aspects of a task, seems to have strong face validity and reliability, mainly because of its specific benchmarks and clear-cut reference scales. Nevertheless, Hughes (2003) listed two primary disadvantages. Firstly, analytic rating is extremely time-consuming. Secondly, he argued that being too immersed in detailed scoring may, to some extent, lead raters to ignore the overall performance of students' writing.
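The gain from scoring a composition more than once can be illustrated with the Spearman-Brown formula. This is only a sketch under assumptions: the studies cited above are not stated to have used this formula, and it presumes that the repeated ratings are parallel, that is, equally reliable and independent. The function name `spearman_brown` and its parameters are hypothetical; the only figure taken from the text is the single-rating coefficient of 0.25.

```python
def spearman_brown(single_rating_reliability: float, number_of_ratings: int) -> float:
    """Projected reliability of the average of several parallel ratings.

    Assumption: all ratings are equally reliable and independent, which
    real classroom rating may not satisfy.
    """
    r = single_rating_reliability
    k = number_of_ratings
    return (k * r) / (1 + (k - 1) * r)


# Starting point: the single-rating coefficient of 0.25 reported for Harris (1968).
for k in (1, 2, 3, 4):
    print(k, round(spearman_brown(0.25, k), 2))
```

Under these assumptions the projected coefficient rises with each additional rating, which is consistent in direction with the higher coefficients reported for compositions scored twice or more.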

Research context and participants:

The context of this study is the assessment of classroom-based English essay writing tasks at a university in the southwest of Mainland China whose students come from varied ethnic backgrounds. The research participants are ten EFL teachers of college English (5 experienced and 5 inexperienced). Experienced raters are those who have at least 10 years of teaching experience and have taken part in nationwide writing assessment work organized by the government. Inexperienced raters are those whose teaching experience is less than 10 years.

Research Methodology and Methods

4.1 Methodology

This study is designed to be conducted in a qualitative, ethnographic way. Ethnography is considered part of qualitative research mainly because "… its epistemology, that is, its origins and methods, is based in the epistemology of qualitative research" (Wiersma, 2000, p. 238). This does not mean that quantitative characteristics are excluded, especially where data are collected and displayed typologically, because the distinction between qualitative and quantitative research cannot be drawn in a simplistic and naïve way and many of their aspects are indistinguishable (Reichardt & Cook, 1979; Nunan, 1992). In this study, the study context, the university classroom I chose, is treated as a case, so the study has many case-study characteristics from the outset. As Merriam (1988, p. 16) suggested, "… the qualitative case study can be defined as an intensive, holistic description and analysis of a single entity, phenomenon, or social unit. Case studies are particularistic, descriptive, and heuristic and rely heavily on inductive reasoning in handling multiple data sources."

In principle, an integration of psychometric and ethnographic methods should be optimal, but Nunan (1992, p. 52) claimed that "In practice, however, integrated approaches seem to [be] non-existent". Ethnography is therefore placed at the top of this study's list of methods. First, the study incorporates an interpretative description and analysis of the conventional procedures, and their essential components, by which English as a Foreign Language (EFL) teachers in the selected research context rate students' English essay writing assignments. Direct observation of rating, assisted by a think-aloud protocol, is carried out in the raters' offices. This follows Wilson's (1982) naturalistic-ecological perspective, according to which behavior is best investigated in the natural contexts in which it occurs rather than in the experimental laboratory, and Wiersma's (2000) view that, since field research is one of the main characteristics of ethnographic research, the research should be conducted in the natural situation. Second, a semi-structured interview with a focus group is incorporated, because ethnographers should gain the "subjective perception and belief systems" of the research participants, including both researchers and subjects (Wilson, 1982). From a practical point of view, I attempt to conduct the study by "inductivism" rather than as "a deductive research" (Nunan, 1992, p. 13), since "… insights and generalizations emerge from close contact with the data rather than from a theory of language learning and use" (Nunan, 1992, p. 55). It is certainly not reasonable to expect high generalizability from a case study. The inductive character of this case study therefore lies in seeking explanations after the evidence has been collected and organized into a typology. Third, some proposals will be put forward for further constructing inter-rater reliability, which also involves data collection and interpretation.

4.2 Specific Methods

4.2.1 Direct observation assisted by Think-aloud protocol

Throughout the process, the researcher plans to gather relevant information mainly through direct observation and field notes, assisted by a think-aloud protocol with the raters.

Observation is the most direct way to discover "the activities that took place in that setting, the people who participated in those activities, and the meanings of what was observed from the perspective of those observed" (Patton, 1990, p. 202). The observation here is planned to be naturalistic, since the researcher will observe directly in the field. By collecting detailed descriptive information, the researcher can become acquainted with what is going on in the selected context, both overtly and covertly. Direct observation also gives the researcher a chance to notice information about the research target that is usually taken for granted. Additionally, more valuable information may emerge from the firsthand experience of being personally involved in the research context (Patton, 1990).

The think-aloud protocol, also called the verbal protocol, is adopted here to require raters to verbalize their complex cognitive activities of decision-making while they rate. It is widely used in research on the mental activities involved in a given task. Although the method has sometimes been criticized for its subjectivity and for providing inaccurate data that contribute nothing to the conclusions (e.g., Cohen, 1984), Smagorinsky (1989) claimed in a review of the literature that think-aloud protocols can offer very valuable data if the data are collected systematically and the analysis is supported by other forms of evidence. For this reason, the focus group interview described in the following part is also included among the research methods.

Specifically, the observation of raters' rating activities will be conducted one-to-one. After authorization by the relevant department of the university, the researcher will observe the rating activities of each of the ten raters (5 experienced and 5 inexperienced). Each observation will take place in the office where the rater carries out his or her routine rating, in order to prevent interference from a change of context. From the beginning to the end of each observation, the observer will not interfere or give any indication. The exact schedule will be negotiated with each subject individually. There is no time limit for rating the essays; the length of each observation depends on the time the rater invests in the rating. Before the observation, the participants will be informed that they are required to verbalize their important thoughts, especially those about decision-making, during rating. During the rating itself, the rater is expected to say aloud what is happening at the very moment he or she is about to make a decision, while the researcher observes the rater directly. The observation assisted by the think-aloud protocol will be audio-recorded and field notes will be taken at the same time, all with the permission of the research participants. Chinese will be used for the think-aloud protocol to avoid misunderstanding.

As for the essays to be rated, first, two essays of similar language proficiency will be selected from sophomores in the selected context. This step is mainly intended to investigate what routine procedures raters follow and what pre-determined benchmarks they hold. It also focuses on collecting information about the influence of holistic and analytic rating methods: all raters are required to rate the first essay only holistically and the second essay only analytically. The researcher will provide the rubrics for the analytic rating in advance.

Second, two essays at the IELTS 6-point level will be provided in order to check inter-rater reliability in a criterion-referenced way, with raters required to use both holistic and analytic methods under the think-aloud protocol. The rubrics will be taken from IELTS. This step is regarded as a supplementary check on the observation and the corresponding results described above. No tape recording is required here, but important and new information will be written down.

The observation and the corresponding field notes will initially focus on the following elements:

The routine procedure of rating activities.

Characteristics and essential requirements of each rating procedure recognized by the rater.

Rater's important comments on the target assignment.

Reasons for the marks given.

Based on the data collected above, a typological summary will be made to determine the extent to which the raters agree or differ, mainly with respect to the holistic and analytic rating methods.
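The extent of agreement mentioned above can also be summarized numerically. The following is a minimal sketch, not part of the study design as stated: it assumes each rater gives one numeric score per essay, and it measures agreement as the share of rater pairs whose scores differ by no more than a chosen tolerance. The function name `pairwise_agreement`, the `tolerance` parameter, and both score lists are hypothetical placeholders, not collected data and not a prediction of results.

```python
from itertools import combinations


def pairwise_agreement(scores, tolerance=0.0):
    """Share of rater pairs whose scores differ by at most `tolerance`.

    `scores` holds one score per rater for the same essay.
    tolerance=0 counts exact agreement only; tolerance=1 also counts
    scores that are one point apart.
    """
    pairs = list(combinations(scores, 2))
    if not pairs:
        raise ValueError("at least two raters are needed")
    agreeing = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return agreeing / len(pairs)


# Placeholder scores from ten raters (made up for illustration only).
holistic_scores = [6, 7, 5, 6, 8, 6, 5, 7, 6, 4]   # first essay, holistic rating
analytic_scores = [6, 6, 6, 7, 6, 6, 5, 6, 6, 6]   # second essay, analytic total

print("holistic exact agreement:", round(pairwise_agreement(holistic_scores), 2))
print("analytic exact agreement:", round(pairwise_agreement(analytic_scores), 2))
```

A simple agreement share is used here because it is easy to read alongside field notes; a chance-corrected coefficient would be a different technique and is not shown.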

4.2.2 Semi-structured Interview with focus group

Interviews are pervasive and preferred mainly because they require neither much technical paraphernalia nor professional skills to conduct (Denscombe, 1999). Before the interview, the interviewer (researcher) should inform the interviewee in general terms of the purpose of both the interview and the research, which reminds the interviewee that "the interviewee's words can be treated as 'on the record' and 'for the record'" (Denscombe, 1999). Such briefing can, to a large extent, keep the interviewee from speaking in a discursive or overly colloquial way, since his or her words are taken seriously in the research. In this research, the semi-structured interview is adopted primarily for two reasons. Firstly, the researcher has a list of issues on which relevant information must be obtained. Secondly, the interview is mainly intended to supply supplementary evidence for the direct observation and think-aloud protocol described above. Encouraging interviewees to explore the listed issues further is therefore also a main focus, since a semi-structured interview should "…let the interviewee develop ideas and speak more widely on the issues raised by the researcher" (Denscombe, 1999, p. 113). Finally, the semi-structured interview will be conducted as a focus group interview, simply because a focus group elicits and explores information in a more natural way. As Denscombe (1999, p. 115) asserted, "… focus groups can lead to insights that might not otherwise have come to light through the one-to-one conventional interview." Specifically, the focus group interview is conducted to provide information about the halo effect and about the holistic and analytic methods involved in the rating process, including which elements or characteristics can give rise to a positive or negative halo effect in rating. Field notes will be taken. Both English and Chinese will be used for more effective communication.

The semi-structured interview with the focus group will include the following questions:

Which rater characteristic do you think has the greatest impact upon inter-rater reliability?

In what way do you think the holistic and analytic rating methods can contribute to inter-rater reliability?

In your rating, when and in what context do you prefer to make your decision mainly in view of English language proficiency elements (professional aspect)?

In your rating, when and in what context do you prefer to make your decision mainly in view of other factors (non-professional aspect) such as mood, impression, etc.?

Are there occasions on which you tend to raise or lower your evaluation of a student's writing performance just because you happen to come across some particularly satisfying or dissatisfying characteristic in the essay (the halo effect)?

All the methodology and methods described above are designed to contribute to constructing the "internal validity" of the research and then to achieving relatively high "external validity" (Nunan, 1992, p. 15).

Data collection and analysis

The data collected by direct observation will first be categorized into different ranges and displayed typologically, for example as the frequency with which a certain variable interferes. The analysis will then focus on the main trends in the data in order to arrive at answers to the research questions. Because ethnographic research is a kind of contextualization, the data are to be collected and interpreted in the natural context in which the behavior occurs (Wiersma, 2000).

Initially, the data collected by both methods mentioned above are going to be divided into three categories: Professional Knowledge and Skills, Specific Operation of Rating Activities and Psychological Characteristics during the rating process.

The data for Professional Knowledge and Skills mainly refer to the professional knowledge and skills that are essential for conducting essay-rating activities, for instance, the benchmark the rater already holds for rating the lexical resource in the essay.

The data for Specific Operation of Rating Activities mainly refer to the specific rating procedures or steps adopted by the rater, for example, how often the rater rereads the essay when rating grammatical range and accuracy.

The data for Psychological Characteristics refer to those related to the interaction between the rater's psychological change and corresponding reasons during the rating process.
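The typological display of these three categories could be prepared as a simple frequency count. The sketch below assumes that each think-aloud segment has already been coded by hand with a rater group and one of the three categories named above; the segment list, the group labels, and the function name `tally_by_group` are hypothetical and stand in for real coded data.

```python
from collections import Counter

# The three categories named in the analysis plan.
CATEGORIES = (
    "Professional Knowledge and Skills",
    "Specific Operation of Rating Activities",
    "Psychological Characteristics",
)

# Hypothetical hand-coded segments: (rater group, category).
coded_segments = [
    ("experienced", "Professional Knowledge and Skills"),
    ("experienced", "Professional Knowledge and Skills"),
    ("experienced", "Specific Operation of Rating Activities"),
    ("inexperienced", "Psychological Characteristics"),
    ("inexperienced", "Specific Operation of Rating Activities"),
    ("inexperienced", "Psychological Characteristics"),
]


def tally_by_group(segments):
    """Count how often each category occurs within each rater group."""
    counts = {}
    for group, category in segments:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        counts.setdefault(group, Counter())[category] += 1
    return counts


tally = tally_by_group(coded_segments)
for group, counter in tally.items():
    for category in CATEGORIES:
        print(f"{group:15s} {category:42s} {counter[category]}")
```

Keeping the category names in one tuple means a miscoded segment raises an error instead of silently creating a fourth category in the table.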

The information collected through the focus group interview will also be organized typologically. Tables and relevant charts will be used to illustrate the data analysis.

Time line

Stages of research and corresponding timeline

- Preparation for the research; review of the relevant literature; initial design of the research proposal; revision and finalization of the research proposal: November 1 to December 10, 2010

- Conduct of the specific research, data collection and analysis: December 7, 2010 to January 30, 2011

  1. Direct observation with Think-aloud protocol: December 15 to 30, 2010

  2. Semi-structured interview with the focus group: January 3 to 15, 2011

- Draft of the final research paper; finalization of the research dissertation: January 16 to May 20, 2011

- Submission of the dissertation: May 31, 2011

Ethical consideration

All research activities will be conducted with the participants' permission, and no private information, such as names or titles, will be revealed, since disclosure could cause trouble for participants.

Anticipated Problems and limitations

Admittedly, this study has some limitations. Firstly, different universities adopt different textbooks of varying difficulty and multimodal methods of teaching EFL essay writing, which give rise to essay writing tasks with considerably different characteristics. Secondly, the samples of both the students' English essays to be rated and the raters are relatively small (ten raters are involved). Thirdly, the selected research context is a university mainly for students from various ethnic groups, the majority of whom are from impoverished rural areas. Owing to inadequate professional English teaching staff and facilities, the average level of English proficiency is relatively low, which should be considered a factor influencing raters' decision-making during rating. Because all research participants in the selected context are well informed about their students' previous educational background, the raters may tend to make decisions from the viewpoint of teaching students from ethnic groups. Given these limitations, high generalizability of the research results is very unlikely. Finally, even though the participants will be informed about how the think-aloud protocol is used before the formal research is conducted, some raters may not respond naturally, so their verbal reports may not reflect their genuine mental activities.