Reliability of Speaking Proficiency Tests

Info: 23624 words (94 pages) Dissertation
Published: 21st Dec 2021

Introduction

Testing is an important part of English teaching, not only because it can be a valuable source of information about the effectiveness of learning and teaching, but also because it can improve teaching and arouse students’ motivation to learn. Testing oral proficiency has become one of the most important issues in language testing since the role of speaking ability became more central in language teaching with the advent of communicative language teaching (Nakamura, 1993). However, assessing speaking is challenging (Luoma, 2004). Validity and reliability, as fundamental concerns and essential measurement qualities of the speaking test (Bachman, 1990; Bachman & Palmer, 1996; Alderson et al., 1995), have attracted widespread attention. The validation of speaking tests is therefore an important area of research in language testing.

Oral proficiency testing began in China only 15 years ago, and a few tests dominate the field. An increasing number of Chinese linguists are devoting their attention to analyzing the validity and reliability of these tests. With the widespread promotion of communicative language teaching (CLT), institutions have begun in recent years to introduce speaking tests into their English examinations. Publications that deal with speaking tests within institutions provide some qualitative assessments (Cai, 2002), but there is relatively little research literature on the reliability and validity of such measures within a university context (Wen, 2001).

The College English Department at Dalian Nationalities University (DLNU) has been selected as one of thirty-one institutions in the College English Reform Demonstration Project in the People’s Republic of China. In the College English (CE) course at DLNU, the speaking test is one of the four subtests of the final English examination. The examination uses two different formats. One is a semi-direct speaking test, in which examinees speak into microphones connected to computers and have their speech recorded for teachers to rate afterwards. The other is a face-to-face interview. The research in this paper aims to ascertain the degree of reliability and validity of these speaking tests. By analyzing the results of the research, teachers will become more aware of the validity and reliability of oral assessments, including how to improve them. As a language teacher, I will gain insight into the operation of language proficiency tests. To improve the reliability and validity of a particular test, I will also take other qualities of test usefulness, such as practicality and authenticity, into account when designing language proficiency tests.

Research questions

This study mainly addresses the questions of validity and reliability of the speaking test administered at DLNU. Both are comprehensive concepts that involve analysis of test tasks, administration, rating criteria, examinees’ and testers’ attitudes towards the test, the effect of the test on teaching, and teachers’ or learners’ attitudes towards learning for the test (Luoma, 2004). Therefore, the purpose of this study is to answer the following research questions:

Is the speaking test administered at DLNU a valid and reliable test? This question can involve the following two sub-questions:
1. To what extent is the speaking test administered at DLNU reliable?
2. To what extent is the speaking test administered at DLNU valid?
In what aspects and to what extent may the validity and reliability of the speaking test administered at DLNU be improved?

Literature Review

This chapter presents a theoretical framework covering the speaking construct, ways of testing speaking, the marking of speaking tests, and the reliability and validity of speaking tests. It also introduces the situation of speaking tests in China.

Analyzing Speaking And Speaking Test

The Nature Of Speaking

Speaking, as a social and situation-based activity, is an integral part of people’s daily lives (Luoma, 2004). Testing second language speaking is often claimed to be a much more difficult undertaking than testing other second language abilities, capacities, competencies or skills (Underhill, 1987). Assessment is difficult not only because speaking is fleeting, temporal and ephemeral, but also because of the comprehensibility of pronunciation, the special nature of spoken grammar and spoken vocabulary, and the interactive and social features of speaking (Luoma, 2004), as well as the “unpredictability and dynamic nature” of language itself (Brown, 2003). To have a clear understanding of what it means to be able to speak a language, we must understand that the nature and characteristics of the spoken language differ from those of the written form (Luoma, 2004; McCarthy & O’Keefe, 2004; Bygate, 2001) in grammar, syntax, lexis and discourse patterns.

Spoken English involves reduced grammatical elements arranged into formulaic chunks or utterances, with less complex sentences than written texts. Spoken English breaks the standard word order because the omitted information can be restored from the immediate context (McCarthy & O’Keefe, 2004; Luoma, 2004; Bygate, 2001; Fulcher, 2003). Spoken English contains frequent use of the vernacular, interrogatives, tails, adjacency pairs, fillers and question tags, which have been interpreted as dialogue facilitators (Luoma, 2004; Carter & McCarthy, 1995). Speech also contains a fair number of slips and errors, such as mispronounced words, mixed sounds and wrong words due to inattention, which are often pardoned by native speakers (Luoma, 2004). Conversations are also negotiable, unpredictable, and susceptible to the social and situational context in which they take place (Luoma, 2004).

The Importance Of Speaking Test

Testing oral proficiency has become one of the most important issues in language testing since the role of speaking ability became more central in language teaching with the advent of communicative language teaching (Nakamura, 1993). Of the four language skills (listening, speaking, reading and writing), listening and reading occur in the receptive mode, while speaking and writing exist in the productive mode. Understanding and absorption of received information are foundational, while expression and use of acquired information demonstrate improvement and provide a more advanced test of knowledge. Much of the current interest in oral testing arises because second language teaching is more than ever directed towards the speaking and listening skills (Underhill, 1987). Language teachers are engaged in “teaching a language through speaking” (Hughes, 2002:7). On the one hand, spoken language is the focus of classroom activity, although the teacher often has other aims as well: for instance, helping the student gain awareness of, or practice in, some aspect of linguistic knowledge (ibid). On the other hand, the speaking test, as a device for assessing learners’ language proficiency, also functions to motivate students and reinforce their language learning. This represents what Bachman (1991) has called an “interface” between second language acquisition (SLA) and language testing research.

However, assessing speaking is challenging “because there are many factors that influence our impression of how well someone can speak a language” (Luoma, 2004:1), and because of the unpredictable or impromptu nature of spoken interaction. The testing of speaking is difficult due to both practical obstacles and theoretical challenges. Much attention has been given to how to perfect the assessment system of oral English and how to improve its validity and reliability. The communicative nature of the testing environment also remains to be considered (Hughes, 2002).

The Construct Of Speaking

Introduction To Communicative Language Ability (CLA)

A clear and explicit definition of language ability is essential to language test development and use (Bachman, 1990). The theory on which a language test is based determines which kind of language ability the test can measure. This type of validity is called construct validity. According to Bachman (1990:84), CLA can be described as “consisting of both knowledge or competence and the capacity for implementing or executing that competence in appropriate, contextualized communicative language use”. CLA includes three components: language competence, strategic competence and psychophysiological mechanisms. The following framework (Figure 2.1) shows the components of communicative language ability in communicative language use (Bachman, 1990:85).

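[Figure 2.1: Components of communicative language ability in communicative language use (Bachman, 1990:85). Recoverable labels: Knowledge Structures (knowledge of the world); Language Competence (knowledge of language); Strategic Competence; Psychophysiological Mechanisms; Context of Situation.]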

This framework has been widely accepted in the field of language testing. Bachman (1990:84) proposes that “language competence” essentially refers to a set of specific knowledge components that are utilized in communication via language. It comprises organizational and pragmatic competence. The two areas of organizational knowledge that Bachman (1990) distinguishes are grammatical knowledge and textual knowledge. Grammatical knowledge comprises vocabulary, syntax, phonology and graphology, while textual knowledge comprises cohesion and rhetorical or conversational organization. Pragmatic competence shows how utterances or sentences and texts are related to the communicative goals of language users and to the features of the language-use setting. It includes illocutionary acts, or language functions, and sociolinguistic competence, or knowledge of the sociolinguistic conventions that govern appropriate language use in a particular culture and in varying situations in that culture (Bachman, 1987).

Strategic competence refers to mastery of verbal and nonverbal strategies for facilitating communication and implementing the components of language competence. It is demonstrated in contextualized communicative language use, for example in drawing on sociocultural knowledge and real-world knowledge and mapping these onto the maximally efficient use of existing language abilities.

Psychophysiological mechanisms refer to the visual and auditory skills used to gain access to the information in the administrator’s instructions. Among other things, they involve sound and light.

Fulcher’s Construct Definition

To know what to assess in a speaking test is a prime concern. Fulcher (1997b) points out that the construct of speaking proficiency is incomplete. Nevertheless, there have been various attempts to reflect the underlying construct of speaking ability and to develop theoretical frameworks for defining the speaking construct. Fulcher’s framework (figure 2.2) (Fulcher, 2003: 48) describes the speaking construct.

As Fulcher (2003) points out, there are many factors that could be included in the definition of the construct:

Phonology: the speaker must be able to articulate the words, have an understanding of the phonetic structure of the language at the level of the individual word, have an understanding of intonation, and create the physical sounds that carry meaning.

Fluency and accuracy: these concepts are associated with automaticity of performance and the impact on the ability of the listener to understand. Accuracy refers to the correct use of grammatical rules, structure and vocabulary in speech. Fluency has to do with the ‘normal’ speed of delivery to mobilise one’s language knowledge in the service of communication at relatively normal speed. The quality of speech needs to be judged in terms of the gravity of the errors made or the distance from the target forms or sounds.

Strategic competence: this is generally thought to refer to an ability to achieve one’s communicative purpose through the deployment of a range of coping strategies. Strategic competence includes both achievement strategies and avoidance strategies. Achievement strategies include overgeneralization or morphological creativity, in which learners transfer knowledge of the language system onto lexical items that they do not know, for example saying “buyed” instead of “bought”. They also include approximation (replacing an unknown word with one that is more general, or using exemplification), paraphrasing (using a synonym for the word needed), word coinage (inventing a new word for an unknown word), restructuring (using different words to communicate the same message), cooperative strategies (asking the listener for help), code switching (taking a word or phrase from a language shared with the listener in order to be understood) and non-linguistic strategies (using gestures or mime, or pointing to objects in the surroundings to help communicate). Avoidance or reduction strategies consist of formal avoidance (avoiding using part of the language system) and functional avoidance (avoiding topical conversation). Strategic competence also includes selecting communicative goals and planning and structuring oral production so as to fulfill them.

Textual knowledge: competent oral interaction involves some knowledge of how to manage and structure discourse, for example, through appropriate turn-taking, opening and closing strategies, maintaining coherence in one’s contributions and employing appropriate interactional routines such as adjacency pairs.

Pragmatic and sociolinguistic knowledge: effective communication requires appropriateness and the knowledge of the rules of speaking. A range of speech acts, politeness and indirectness can be used to avoid causing offence.

Ways Of Testing Speaking

Clark (1979) puts forward a theoretical basis for distinguishing three types of speaking tests: direct, semi-direct and indirect. Indirect tests belong to the “precommunicative” era in language testing, and test takers are not actually required to speak. They have been regarded as having the least validity and reliability, while the other two formats are more widely used (O’Loughlin, 2001). In this section, the characteristics, advantages and disadvantages of direct and semi-direct tests are presented.

The Oral Proficiency Interview Format

One of the earliest and most popular direct speaking test formats, and one that continues to exert a strong influence, is the oral proficiency interview (OPI), developed originally by the FSI (Foreign Service Institute) in the United States in the 1950s and later adopted by other government agencies. It is conducted with an individual test-taker by a trained interviewer, who assesses the candidate using a global band scale (O’Loughlin, 2001). It typically begins with a warm-up discussion of a few easy questions, such as getting to know each other or talking about the day’s events. The main interaction then contains the pre-planned tasks, such as describing or comparing pictures, narrating from a picture series, talking about a pre-announced or examiner-selected topic, or possibly a role-play task or a reverse interview in which the examinee asks questions of the interviewer (Luoma, 2004). An important example of this type of test is the speaking component of the International English Language Testing System (IELTS), which is adopted in 105 different countries around the world each year.

The Advantage Of An Interview Format

The oral interview was recognized as the most commonly used speaking test format. Fulcher (2003) suggests that it is partly because the questions used can be standardized, making comparison between test takers easier than when other task types are used. Using this method, the instructor can get a sense of the oral communicative competence of students and can overcome weakness of written exams, because the interview, unlike written exams, “is flexible in that the questions can be adapted to each examinee’s performance, and thus the testers have more controls over what happens in the interaction” (Luoma, 2004:35). It is also relatively easy to train raters and obtain high inter-rater reliability (Fulcher, 2003).

The Disadvantage Of An Interview Format

However, concern and skepticism exist about whether it is possible to test other competencies or knowledge because of the nature of the discourse that the interview produces (van Lier, 1989).

a. Issue of time

For the instructor, time management can be quite an issue. For instance, using a two-hour period for exams for 20 students means each student is allowed only six minutes for testing. This includes the time needed to enter the room and adjust to the setting. With such a time limit the student and instructor can hardly have any kind of normal real-world conversation.

b. Issue of asymmetrical relationship

The asymmetrical relationship between examiners and candidates elicits a form of inauthentic and limited socio-cultural context (van Lier, 1989; Savignon, 1985; Yoffe, 1997). Yoffe (1997) commented on the ACTFL (American Council on the Teaching of Foreign Languages) OPI that the tester and the test-taker are “clearly not in equal positions” (Yoffe, 1997).

The asymmetry is not specific to the OPI but is inherent in the notion of an ‘interview’ as an exchange wherein one person solicits information in order to arrive at a decision while the interlocutor produces what he or she perceives as most valued. The interviewee is, in most cases, acutely aware of the ramifications of the OPI rating and is, consequently, under a great deal of stress.

Van Lier (1989) also challenges the validity of the OPI in terms of the asymmetry between the participants, because “the candidate speaks as to a superior and is unwilling to take the initiative” (van Lier, 1989). Under this unequal relationship, features of the discourse such as turn-taking, topic nomination and development, and repair strategies are all substantially different from normal conversational exchanges (see van Lier, 1989).

c. Issue of interviewer variation

Given that the interviewer has considerable power over the examinee in an interview, concerns have been raised about the effect of the interlocutor (examiner) on the candidate’s oral performance. Different interviewers vary in their approaches and attitudes toward the interview. Brown (2003) warns of the danger that such variation poses to fairness. O’Sullivan (2000) conducted an empirical study which indicated that learners perform better when interviewed by a woman, regardless of the sex of the learner. Underhill (1987:31) expresses concern that the unscripted “flexibility… means that there will be a considerable divergence between what different learners say, which makes a test more difficult to assess with consistency and reliability.”

Testing Speaking In Pairs

There has been a shift toward a paired format, in which two assessors examine two candidates at a time. One assessor interacts with the two candidates and rates them on a global scale, while the other does not take part in the interaction and assesses them using an analytic scale. The paired oral test has been used as part of large-scale, international, standardized oral proficiency tests since the late 1980s (Ildikó, 2001). The Key English Test (KET), Preliminary English Test (PET), First Certificate in English (FCE) and Certificate in Advanced English (CAE) all make use of the paired format. In a typical test, the interaction begins with a warm-up, in which the examinees introduce themselves to the interlocutor, followed by two pair interaction tasks. The talk may begin with each candidate comparing two photographs, as in the Cambridge First Certificate (Luoma, 2004), continue with a two-way collaborative task between the two candidates based on more photographs, artwork or computer graphics, and end with a three-way discussion between the two examinees and the interlocutor about a general theme related to the earlier discussion.

The Advantages Of The Paired Interview Format

Many researchers claim that the paired format is preferable to OPI. The reasons are:

a. The changed role of the interviewer frees the instructors to pay closer attention to the production of each candidate than if they were participants themselves (Luoma, 2004).

b. The reduced asymmetry allows more varied interaction patterns, which elicit a broader sample of discourse and more turn-taking than was possible in the highly asymmetrical traditional interview (Taylor, 2000).

c. The task type based on pair-work will generate a positive washback effect on classroom teaching and learning (Ildiko, 2001). In the case of the instructor following Communicative Language Teaching (CLT) methodology, where pair work may take up a significant portion of a class, it would be appropriate to incorporate similar activities in the exam. In that way the exam itself is much better integrated into the fabric of the course. Students can be tested for performance related to activities done in class. There may also be benefits in regards to student motivation. If students are aware that they will be tested on activities similar to the ones done in class, they may have more incentive to be attentive and use class time effectively.

The Disadvantages Of The Paired Interview Format

There are, however, also concerns voiced regarding the paired format.

a. Mismatches between peer interactants

The most frequently raised criticisms against the paired speaking test relate to various forms of mismatches between peer interactants (Fulcher, 2003). Ildiko (2001) points out that when a candidate has to work with an incomprehensible or uncomprehending peer partner, it may negatively influence the candidate’s performance. As a consequence, in such cases it is quite impossible to make a valid assessment of candidates’ abilities.

b. Lack of familiarity between peer interactants

The extent to which this testing format actually reduces the level of anxiety of test-takers compared to other test formats remains doubtful (Fulcher, 2003). O’Sullivan (2002) suggests that the spontaneous support offered by a friend reduces anxiety and positively affects task performance under experimental conditions. However, the chances are quite high that the examinee will be paired with a stranger as his or her peer interactant. It is hard to imagine how strangers can carry on a naturally flowing conversation. Estrangement, misinterpretation and even breakdown may occur during their talk.

c. Lack of control of the discussion

Problems arise if the examiner loses control of the oral task (Luoma, 2004). When the instructions and task materials are not clear enough to facilitate the discussion, the examinees’ conversation may go astray. Luoma (2004) points out that testers often feel uncertain about how much responsibility they should give to the examinees. Furthermore, without elicitation by the examiner, examinees do not know what kind of performance will earn them good results. When one of the examinees has said too little, the examiner ought to monitor the interaction and step in to help when necessary.

Semi-Direct Speaking Tests

The term “semi-direct” is employed by Clark (1979:36) to describe those tests that are characterized “by means of tape recordings, printed test booklets, or other ‘non-human’ elicitation procedures, rather than through face-to-face conversation with a live interlocutor.” Appearing during the 1970s as an innovative adaptation of the traditional OPI, the semi-direct method normally follows the general structure of the OPI and makes an audio recording of the test taker’s performance, which is later rated by one or more trained assessors (Malone, 2000). Examples of the semi-direct type used in the U.S.A. are the simulated oral proficiency interview (SOPI) and the Test of Spoken English 2000 (TSE) (Ferguson, 2009). Examples in the U.K. include the Test in English for Educational Purposes (TEEP) and the Oxford-ARELS Examinations (O’Loughlin, 2001). Another mode of delivery is testing by telephone, as in the PhonePass test (which mainly consists of reading sentences aloud or repeating sentences), or even by video-conferencing (Ferguson, 2009).

The Advantages Of The Semi-Direct Test Type

First, the semi-direct test is more cost-efficient than direct tests, because many candidates can be tested simultaneously in large language laboratories, and the test can be administered by any teacher, language lab technician or aide while the candidates hear taped questions and have their responses recorded (Malone, 2000).

Second, the mode of testing is quite flexible. It provides a practical solution in situations where it is not possible to deliver a direct test (O’Loughlin, 2001), and it can be adapted to the desired level of examinee proficiency and to specific examinee age groups, backgrounds, and professions (Malone, 2000).

Third, semi-direct testing represents an attempt to standardize the assessment of speaking while retaining the communicative basis of the OPI (Shohamy, 1994). It offers the same quality of interview to all examinees, and all examinees respond to the same questions so as to remove the effect that the human interlocutor will have on the candidate (Malone, 2000). The uniformity of the elicitation procedure greatly increases the reliability of the test.

Some empirical studies (Stansfield, 1991) show high correlations (0.89-0.95) between the direct and semi-direct tests, indicating that the two formats can measure the same language abilities and that the SOPI can serve as an equivalent and surrogate of the OPI. However, there are also disadvantages.
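To make concrete what a correlation of this kind expresses, the following is a minimal sketch of a Pearson correlation between two sets of scores for the same examinees. It is offered only as an illustration: the function name `pearson_r` and the score lists are invented here and are not taken from Stansfield (1991) or from the DLNU data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical band scores for six examinees (invented for illustration only).
direct_scores = [3, 4, 2, 5, 4, 3]       # e.g. a face-to-face interview
semi_direct_scores = [3, 4, 3, 5, 4, 2]  # e.g. a tape-mediated test

r = pearson_r(direct_scores, semi_direct_scores)
print(f"r = {r:.2f}")
```

A coefficient close to 1 means that examinees are ranked in much the same way by both formats. It does not by itself show that the two formats elicit the same kind of language.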

The Disadvantages Of The Semi-Direct Test Type

First, the speaking task in a semi-direct oral test is less realistic and more artificial than in the OPI (Clark, 1979; Underhill, 1987). Examinees use artificial language to “respond to tape-recorded questions — situations the examinee is not likely to encounter in a real-life setting” (Clark, 1979:38). They may feel stressed while speaking to a microphone rather than to another person, especially if they are not accustomed to the laboratory setting (O’Loughlin, 2001).

Second, the communicative strategies and discourse elicited in semi-direct SOPIs are quite different from those found in typical face-to-face interaction, being more formal and less conversation-like (Shohamy, 1994). Candidates tend to use written-style language in tape-mediated tests, producing something closer to a report or narration, whereas in the OPI they focus more on interaction and on the delivery of meaning.

Third, there are often technical problems that can result in poor quality recordings or even no recording in the SOPI format (Underhill, 1987).

In conclusion, one cannot assume any equivalence between a face-to-face test and a semi-direct test (Shohamy, 1994). It may be that they are measuring different constructs, so the mode of test delivery should be chosen on the basis of test purpose, accuracy requirements, practicability and impartiality (Shohamy, 1994). Stansfield (1991) proposes that the OPI is more applicable to placement tests and curriculum evaluation tests, while the SOPI is more appropriate for large-scale tests requiring high reliability.

Marking Of Speaking Test

Marking and scoring are a challenge in assessing second language oral proficiency. Since only a few elements of the speaking skill can be scored objectively, human judgments play a major role in assessment. How to establish valid, reliable and effective marking criteria and scales, together with high-quality scoring instruments, has always been central to the performance testing of speaking (Luoma, 2004). It is important to have clear, explicit criteria to describe the performance, and equally important for raters to understand and apply these criteria so that performances can be scored consistently and reliably. For these reasons, rating and rating scales have been a central focus of research in the testing of speaking (Ferguson, 2009).
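As an illustration of what scoring “consistently” can mean in practice, the sketch below compares the band scores that two raters give to the same performances. It computes the proportion of exact agreement and Cohen’s kappa, which is one common chance-corrected agreement index. Kappa is introduced here as an assumption for illustration; it is not prescribed by the sources cited above, and the rater data and function names are invented.

```python
from collections import Counter


def exact_agreement(rater_a, rater_b):
    """Proportion of performances given the same band by both raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length lists of bands")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(rater_a)
    observed = exact_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same band
    # if each assigned bands independently at their own observed rates.
    expected = sum(counts_a[band] * counts_b[band] for band in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical bands (1-5) given by two raters to eight recorded performances.
rater_a = [3, 4, 2, 5, 4, 3, 2, 4]
rater_b = [3, 4, 3, 5, 4, 3, 2, 5]

print(f"exact agreement = {exact_agreement(rater_a, rater_b):.2f}")
print(f"kappa = {cohen_kappa(rater_a, rater_b):.2f}")
```

Kappa treats every disagreement as equally serious. For ordered band scales, an index that gives partial credit to adjacent bands may be more appropriate, which is a design choice left open in this sketch.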

Definition Of Rating Scales

A rating scale, also referred to as a “scoring rubric” or “proficiency scale”, is defined by Davies et al. as follows (see Fulcher, 2003):

consisting of a series of bands or levels to which descriptions are attached
providing an operational definition of the constructs to be measured in the test
requiring training for its effective operation

Holistic And Analytic Rating Scales

There are different types of rating scales used for scoring speech samples. One of the traditional and commonly used distinctions is between holistic and analytic rating scales. Holistic rating scales are also referred to as global rating scales. With these scales, the rater attempts to match the speech sample with a particular band whose descriptors specify a range of defining characteristics of speech at that level. A single score is given to each speech sample, either impressionistically or guided by a rating scale, to encapsulate all the features of the sample (Bachman & Palmer, 1996).

Analytic rating scales consist of separate scales for different aspects of speaking ability (e.g. grammar/vocabulary, pronunciation, fluency, interactional management). A score is given for each aspect (or dimension), and the resulting scores may be combined in a variety of ways to produce a single composite score. They include detailed guidance for raters and provide rich information on specific strengths and weaknesses in examinee performance (Fulcher, 2003). Analytic scales are particularly useful for diagnostic purposes and for providing a profile of competence in the different aspects of speaking ability (Ferguson, 2009). The type of scale selected for a particular test of speaking will depend upon the purpose of the test.
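One simple way in which analytic scores may be combined is a weighted average. The sketch below assumes a hypothetical five-band scale and invented weights. The criterion names follow the examples given above, but the weighting scheme is an assumption for illustration and not one proposed by Fulcher (2003) or used at DLNU.

```python
def composite_score(scores, weights=None):
    """Combine per-criterion analytic scores into one weighted-average score.

    scores  -- dict mapping criterion name to the band awarded
    weights -- dict mapping criterion name to its weight; equal weights if omitted
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    if weights is None:
        weights = {criterion: 1.0 for criterion in scores}
    if set(weights) != set(scores):
        raise ValueError("weights and scores must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[c] * weights[c] for c in scores) / total_weight


# Hypothetical profile for one examinee on a 1-5 band scale.
profile = {
    "grammar_vocabulary": 4,
    "pronunciation": 3,
    "fluency": 5,
    "interactional_management": 4,
}

# Equal weighting of all four criteria.
equal = composite_score(profile)

# An invented scheme that counts fluency twice as heavily as the other criteria.
fluency_weighted = composite_score(
    profile,
    {"grammar_vocabulary": 1, "pronunciation": 1, "fluency": 2, "interactional_management": 1},
)

print(f"equal weights: {equal:.2f}, fluency-weighted: {fluency_weighted:.2f}")
```

The same profile yields a different composite under each scheme, which is why the weighting should follow from the purpose of the test rather than from convenience.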

Validity And Reliability Of Speaking Test

Bachman And Palmer’s Theories On Test Usefulness

The primary purpose of a language test is to provide a measure that can be interpreted as an indicator of an individual’s language ability (Bachman, 1990; Bachman and Palmer, 1996). Bachman and Palmer (1996) propose that test usefulness comprises six test qualities: reliability, construct validity, authenticity, interactiveness, impact (washback) and practicality. Their notion of usefulness can be expressed as in Figure 2.3:

Usefulness = Reliability + Construct validity + Authenticity + Interactiveness + Impact + Practicality

These qualities are the main criteria used to evaluate a test. “Two of the qualities — reliability and validity — are critical for tests and are sometimes referred to as essential measurement qualities” (Bachman & Palmer, 1996:19), because they are the “major justification for using test scores as a basis for making inferences or decisions” (ibid). The definitions of types of validity and reliability will be presented in this section.

Defining Validity

The AERA (American Educational Research Association) defines validity as follows:

“Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. A variety of inferences may be made from scores produced by a given test, and there are many ways of accumulating evidence to support any particular inference. Validity, however, is a unitary concept. Although evidence may be accumulated in many ways, validity always refers to the degree to which that evidence supports the inferences that are made from the score. The inferences regarding specific uses of a test are validated, not the test itself.” (AERA et al., 1985: 9)

Messick stresses that “it is important to note that validity is a matter of degree, not all or none” (Messick, 1989:53). Validity lies in the test scores and in the inferences drawn from those scores. Validity is multifaceted, and different types of evidence are needed to support any claims for the validity of scores on a test (Bachman, 1990:89). The fact that there are many ways to establish validity leads to the topic of types of validity.

Types Of Validity

It should be pointed out that many writers distinguish different types of validity according to different test purposes. The major types discussed here are taken from Alderson et al's (1995) framework.

Construct validity: it is seen as the most basic, fundamental type of validity. Bachman and Palmer (1996:21) state that: “Construct validity pertains to the meaningfulness and appropriateness of the interpretations that we make on the basis of test scores.” That is, to justify the interpretation of a test score, we need to provide evidence that the test score reflects the area(s) of language ability that we want to measure and little else. A construct is the specific definition of an ability that provides the basis for a given test or test task. The degree of construct validity is determined by the relationship between the purpose of a test and the theory on which the test is based. Alderson et al (1995) suggest several ways to establish construct validity: correlating each subtest with other subtests, with the total test or with the total minus the subtest itself; multitrait-multimethod studies; and factor analysis.

Internal Validity

Internal validity relates to studies of “the perceived content of the test and its perceived effect” (Alderson et al, 1995:171). There are two kinds of internal validity:

Face Validity: it can be defined as the extent to which the test appeals to test takers and test users (Bachman & Palmer, 1996). Regarded as the most superficial form of validity, it pertains to the public acceptability of a test. It is often determined impressionistically. Questionnaires to or interviews with candidates or administrators include questions such as: Does it look fair and appropriate to the test-takers and to the public? Does the test appear to measure what it claims to measure? Do the test tasks look something like what you might do in a real world setting? (Henning, 1987; Ferguson, 2009).

Content Validity: it is “concerned with whether or not the content of the test is sufficiently representative and comprehensive for the test to be a valid measure of what it is supposed to measure” (Henning, 1987:94). It involves only the test, and not the performance of test takers. Its validation should be based on the analysis of the language being tested and on the particular course objectives. Alderson et al (1995) suggest several methods to validate content validity: comparing test content with specifications or the syllabus; questionnaires to and interviews with “experts” such as teachers, subject specialists and applied linguists; and having expert judges rate test items and texts according to a precise list of criteria.

External Validity

External validity relates to studies “comparing students' test scores to measures of their ability gleaned from outside the test” (Alderson et al, 1995:171). It comprises two types:

Concurrent validity: In essence, it involves the comparison of the test scores with some other measures for the same candidates taken at roughly the same time as the test (Alderson et al, 1995:177). This produces a correlation coefficient which suggests the extent to which the tests are measuring the same thing. Ways to validate concurrent validity include: correlating students' test scores with their scores on other tests; and correlating students' test scores with other measures of ability such as students' or teachers' ratings (ibid).
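As an illustration of the correlation coefficient mentioned above, the following is a minimal sketch in Python of a Pearson correlation between two sets of scores for the same candidates. The function name pearson_r and the score lists are hypothetical and are not taken from the DLNU data; the source does not state which correlation coefficient was used, so Pearson's r is assumed here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: speaking-test scores and teachers' ratings
# for the same six students (illustrative values only).
speaking_scores = [6, 7, 8, 5, 9, 7]
teacher_ratings = [5, 7, 8, 4, 9, 6]
r = pearson_r(speaking_scores, teacher_ratings)
print(round(r, 3))
```

A coefficient close to 1 would suggest that the two measures rank the candidates in much the same way, while a coefficient near 0 would suggest that they are measuring different things.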

Predictive Validity: It shows that the test can predict how successful the learners will be at using the language in the future (Underhill, 1987). Ways to validate predictive validity include: correlate students' test scores with their scores on tests taken some time later; correlate students' test scores with other measures of ability taken some time later, such as subject teachers' assessments, language teachers' assessment; and also students' scores with success in study or at work (Alderson et al, 1995).

Reliability

Defining Reliability

Reliability is defined as consistency and stability of measurement (Bachman & Palmer, 1996). “A reliable test score will be consistent across different characteristics of the testing situation” (Bachman & Palmer, 1996:19). In other words, an accurate and consistent measurement yields similar results over repeated tests involving the same subjects.

Types Of Reliability

Luoma (2004) presents three types of reliability particularly relevant for speaking tests.

Intra-rater reliability or internal consistency: it means that raters agree with themselves, over a period of a few days, about the ratings that they give. In other words, if a rater gives a performance the same rating on two different days, the test is said to have high intra-rater reliability.

Inter-rater reliability: it means that different raters rate performances similarly. They do not necessarily need to agree completely. However, well-defined criteria help raters agree, and frequent disagreements may indicate either that some raters are not able to apply the criteria consistently or that the criteria need to be defined better.

Parallel form reliability: examinees are asked to take two or more different forms of the test, and their scores are analyzed for consistency. If the scores are not consistent, the forms cannot be considered parallel, assuming of course that the raters are internally consistent.
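To make the idea of rater agreement concrete, the following is a minimal sketch in Python that computes the proportion of exact agreement and of agreement within one band between two sets of ratings. It can be applied to two raters (inter-rater) or to one rater on two occasions (intra-rater). The function name agreement_rates, the one-band tolerance and the ratings are assumptions made for illustration; they are not the procedure or data used at DLNU.

```python
def agreement_rates(ratings_a, ratings_b, tolerance=1):
    """Return (exact, within_tolerance) agreement proportions for paired ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    pairs = list(zip(ratings_a, ratings_b))
    # Exact agreement: both ratings are identical.
    exact = sum(a == b for a, b in pairs) / len(pairs)
    # Agreement within tolerance: ratings differ by at most `tolerance` bands
    # (this count includes the exact matches).
    within = sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)
    return exact, within


# Hypothetical band scores (1-5) given by two raters to the same eight performances.
rater_a = [4, 3, 5, 2, 4, 3, 5, 4]
rater_b = [4, 3, 4, 2, 5, 3, 3, 4]
exact, within_one = agreement_rates(rater_a, rater_b)
print(exact, within_one)
```

Reporting both figures reflects the point that raters do not need to agree completely: a low exact rate combined with a high within-one-band rate may point to criteria that need sharper definition rather than to inconsistent raters.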

Relationship Between Validity And Reliability

Bachman (1990: 161) points out: “The concerns of reliability and validity can thus be seen as leading to two complementary objectives in designing and developing tests: (1) to minimize the effects of measurement error, and (2) to maximize the effects of the language abilities we want to measure.”

As the two prime concerns of the qualities in the test, validity and reliability are interrelated concepts. Reliability is a prerequisite to validity in performance assessment in the sense that no test can achieve its intended purpose if the test results are unreliable. Similarly, inferences cannot be drawn from an invalid test regardless of its reliability.

Speaking Test In China

In this section, the importance of the English language and the development of speaking tests in China are discussed with a view to offering a better understanding of the background of the empirical research.

The Importance Of English Language In China

In China, English is one of the three core compulsory subjects, along with mathematics and Chinese, that are tested for students wishing to enter schools of a higher level (Cheng, 2008). English is an obligatory subject for all majors in Chinese universities and colleges, and students are required to pass the CET (College English Test) Band IV to obtain their bachelor's degrees. Apart from the academic requirements for English, better English skills and abilities also give preferential access to employment, selection and promotion.

Development Of Speaking Tests In China

Despite all the efforts of the Chinese Educational Ministry to promote English proficiency nationwide, the situation of English education cannot satisfy the needs of the social development (Cai, 2002). Students' pragmatic competence is weak compared to other language competences, especially their oral abilities (ibid). Many empirical studies have been conducted showing that spoken ability is the weakest of the four basic skills (Wen, 2001). The weakness of spoken language learning has been a leading problem for English education for a long time.

The development of speaking tests in China started in the 1990s, because language testing had previously focused on reading and writing abilities. The Cambridge Business English Certificate (BEC) was first introduced to China in 1993 with an oral component. The speaking sub-test in the Test for English Majors (TEM) began in 1994. The National Matriculation English Test Oral Subtest (NMETOS) was formally introduced in China in 1995 (see Cheng, 2008). The CET Spoken English Test (CET-SET) started in 1999. The Public English Testing System (PETS), with a speaking component, has been promoted since 1999.

There are various reasons why the development of speaking tests started late in China. The following are the major ones. On the one hand, objectively, there are many difficulties involved in the construction and administration of speaking assessment due to the discrepancies among colleges and universities, regions and areas in terms of teaching resources, students' level of English upon entering college, and the social needs they face. Unified, effective assessment instruments are lacking, and the measurement tools are hard to grasp and implement. Large student populations and limited testers and time make it impracticable to administer the speaking test.

On the other hand, subjectively, communicative assessment has received little attention. Hughes (1989) states that there is a great discrepancy between the predominance of the communicative approach and the accurate measurement of communication ability. With the widespread promotion of communicative language teaching (CLT) in EFL countries replacing the traditional grammar-centered, text-centered, and teacher-centered methods, English teachers have been trying to carry out CLT in their classrooms. However, communicative speaking assessment has not been seriously practiced in a way that reflects authentic interaction in test task design. So the matters concerning validity and reliability have been given little attention in oral assessments in China.

Introduction To CET-SET

CET is a large-scale standardized test administered nationwide by the National College English Testing Committee on behalf of the Higher Education Department of the Chinese Ministry of Education (CME) (Cheng, 2008). Its main aim is to measure the English proficiency of college and university undergraduate students in accordance with the National College English Teaching Syllabus (CME, 1999). As a norm-referenced test, CET nowadays “has nearly become the unifying criterion in judging the English level of non-English major students of university in educational field and even in the whole society” (Zhu, 2004). In most colleges and universities, the CET-4 certificate is one of the requirements for obtaining a Bachelor's degree. The CET has thus exerted a huge influence on English language teaching and learning at the tertiary level in China due to its high stakes (Cheng, 2008).

CET came into being in 1987, but its Spoken English Test (SET) component started in 1999. It is available to students who have passed the CET-4 with a score of 80% or above or the CET-6 with a score of 75% or above. Every speaking test is administered by an interviewer, who talks to the candidates, controls the direction and topic of the conversation and also rates their performance, along with an assessor, who listens to the candidates speaking and only makes an evaluative judgement on what he or she hears. Candidates are tested in groups of 3 or 4. Topics cover various fields based on examinees' outlooks on life and the world, such as ideal jobs, campus life, holidays, festival activities, TV programmes, education in China, the environment and humans, modern lifestyles, social communication, etc. The rating scale is listed in Appendix 1. The test procedure consists of three parts, as shown in Table 2.3.

Table 2.3 CET-SET procedure (National College English Syllabus for Non-English Majors, 1999).

Background Of Testing At DLNU

Introduction To DLNU And College English (CE) Teaching

Dalian Nationalities University (DLNU) offers engineering and applied sciences as major disciplines. As its name suggests, 65% of its students are from 55 different minority ethnic groups. The College English Department (CED), which offers courses to non-English major students, has been selected as one of 31 institutions of the College English Reform Demonstration Project due to its unremitting efforts and practice in the reform of CE teaching. It makes full use of modern computer and network technologies and has constructed a new model of CE teaching to help develop students' autonomous learning ability. In April 2004, its “National New Model of CE Teaching Research” entered the list of "Tenth Five-Year Plan" Research Projects of the National Education Science Organization.

College English Course And Syllabus

CE teaching and learning is an integral part of higher education in China. “College English Curriculum Requirements (For Trial Implementation)” (CECR) (CME, 2004) provides colleges and universities with the guidelines for non-English major students. DLNU, along with many other colleges and universities, takes the CECR as their CE teaching syllabus. According to CECR (CME, 2004), College English has knowledge and practical skills of the English language as its main components along with learning strategies and intercultural communication; it takes theories of foreign language teaching as its guide and incorporates different teaching models and approaches.

Teaching Requirement

English proficiency requirements are divided into three levels, namely, basic requirements, intermediate requirements, and higher requirements. The basic requirement is the minimum level that all the undergraduates of non-English majors must attain before graduation. The intermediate and higher requirements are respectively set for those who, having laid a good foundation of English, can afford time to learn more of the language. Institutions of higher learning should set their own objectives in the light of their specific circumstances, strive to create favorable conditions, and encourage students to adjust their objectives in line with their own performance and try to meet the intermediate or higher requirements (see the CECR (trial) in Appendix 2 for a detailed description of the three levels of requirements).

Speaking Tests At DLNU

In the College English course at DLNU, the speaking test is one of the four subtests of the final examination of English assessment. It has been administered since 2003, when the CED started its reform of CE teaching. Almost all freshmen are required to take the speaking test as part of their English final examinations at the end of the first and second semesters (a small proportion of students from special ethnic groups take Russian or Japanese as their foreign language). Their speaking test scores make up 10 percent of the final score for the English subject, which is one of the criteria for selecting scholarship recipients and is recorded in students' archives. The remaining 90% of the score comes from the other sub-tests: 60% from the computer-based test of listening and reading, 10% from writing and 20% from teacher evaluations. Teacher evaluations are based on students' attendance, daily performance in class and after-class assignments.
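The weighting described above can be written out as a short calculation. The following Python sketch assumes, for illustration only, that every component is reported on a 0-100 scale; the component names, the function final_score and the example scores are hypothetical, while the weights (10%, 60%, 10% and 20%) are those stated in the text.

```python
# Weights of the four sub-tests as stated in the text.
WEIGHTS = {
    "speaking": 0.10,
    "listening_reading": 0.60,   # computer-based test
    "writing": 0.10,
    "teacher_evaluation": 0.20,
}


def final_score(components):
    """Weighted final English score; each component is assumed to be on a 0-100 scale."""
    if set(components) != set(WEIGHTS):
        raise ValueError("components must match the four weighted sub-tests")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical student (illustrative values only).
student = {
    "speaking": 80,
    "listening_reading": 70,
    "writing": 75,
    "teacher_evaluation": 90,
}
print(round(final_score(student), 1))
```

Because the speaking test carries a weight of only 0.10, a difference of ten points on the speaking test changes the final score by one point under these assumptions.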

Relationship Between Teaching And Testing

Due to the promotion of CLT, communicative language abilities have been stressed and developed in class. CE teaching at DLNU has adopted the “2+2+2” model, which means students take 2 periods of CE classes in a regular classroom with a teacher present, 2 periods in computer labs with a teacher present, and another 2 periods in computer labs without a teacher. Students use two different course books in CE classes: Reading & Writing, and Viewing, Listening & Speaking. They are two series of the “New Horizon College English” textbooks published by Foreign Language Teaching and Research Press (FLTRP).

In the Reading & Writing class, teachers impart language knowledge partly through speaking activities, such as discussions, presentations, or debates revolving around the topics of the text. Topical, grammatical and textual knowledge are exercised through these activities. In the Viewing, Listening & Speaking class, colloquial language takes up a high proportion. By watching and imitating sample conversations in the videos, learners enhance their pragmatic knowledge. Dialogues between learners, and between the teacher and the learner, are conducted through the computer.

All the topics in the speaking test are chosen from the ones that have been practiced in class.

Purpose

Because the CET-SET is not available to most college students, as mentioned above, they may take less notice of the important role that speaking plays in foreign language acquisition. The two types of speaking tests at DLNU, especially the computer-aided test, make it practicable to have more than 3,000 students' oral abilities tested within one week. Setting up the speaking test on a school scale can bring many benefits to language learners. It makes it possible to adopt different effective test items that cater to the teaching syllabus, and it can function as a stimulus to promote students' practice of speaking English and as a check on teaching and learning outcomes. Its purpose has been stated clearly in the CECR (Trial):

to develop students' ability to use English in an all-round way, especially in listening and speaking, so that in their future work and social interactions they will be able to exchange information effectively through both spoken and written channels, and at the same time they will be able to enhance their ability to study independently and improve their cultural quality so as to meet the needs of China's social development and international exchanges. (CME, 2004)

Format

Two different formats are used at DLNU. One is a semi-direct speaking test, in which examinees talk into microphones connected to computers and have their speeches recorded for the teachers to rate afterwards. 94% of students (approximately 2,900) take the test in this form. The other is a face-to-face interview, in which one interviewer talks to one examinee at a time. Students from the Department of Chemical Engineering and Technology (Dep. of CEAT) are tested in this form. Because this faculty at DLNU has been awarded the “State Key Discipline” title, its students face higher English requirements in the NMET (a minimum of 110 out of 150 points), and the faculty is well-staffed and well-equipped. Its students' CE classes are small (20-30 students), while those of other departments are usually large (50-70 students). Besides CE classes, students are offered classes in English Video taught by foreign teachers, Translation, Extensive Reading, etc. Thus, these students' English classes are taught and tested intensively.

Construction Of The Test

The speaking test follows the Basic Requirement of CECR for speaking:

Students should be able to communicate in English in the course of learning, to conduct discussions on a given theme, and to talk about everyday topics with people from English-speaking countries. They should be able to give, after some preparation, short talks on familiar topics with clear articulation and basically correct pronunciation and intonation. They are expected to be able to use basic conversational strategies in dialogue. (CECR) (CME, 2004)

The structures of the computer-aided and interview oral proficiency tests are shown in Tables 3.1 and 3.2, and the detailed test contents are attached in Appendices 3 and 4:

Table 3.1 The structure and description of the computer-aided speaking test

In the computer-aided speaking test, test takers' recorded speeches are forwarded for marking to teachers who do not teach their class. In the interview test, the interviewers conduct both the interviewing and the marking.

Research Methodology

Subjects

The subjects of this research included 24 testers and 225 test takers, who were involved in the speaking tests administered from June 16 to June 20, 2009 at CED of School of Foreign Language and Culture in DLNU.

Testers

Note: In the above table, T=Tester; F=Female; M=Male; A. P. =Associate Professor; L=Lecturer; Year= Years they have taught English

Note: In the above table, T=Tester; F=Female; M=Male; A. P. =Associate Professor; L=Lecturer;

Test Takers

The test-taker subjects are freshmen who have studied for two semesters at DLNU. 112 students from Classes 1-4 of Grade 2008, majoring in Chemical Engineering and Technology (CEAT), received face-to-face interviews as their speaking test, and 113 students from Classes 1-4 of Grade 2008, majoring in International Economy and Trade (IET), received the computer-aided speaking test. These two departments have the same entrance requirement scores (110 out of 150 points at NMET), higher than those of other departments. Both departments offered extra English classes apart from CE. For example, the Department of IET offered a Business English class. Besides, some of their professional courses were taught in English. The students in the face-to-face testing groups and those in the computer-aided groups are therefore presumed to be comparable in English level.

Instruments

In this study, several instruments are combined in quadrangulation to collect various forms of data: testing materials, questionnaires, telephone interviews and test records.

Testing Materials

The necessary testing materials were collected, including College English Curriculum Requirements (trial), test guidelines and specifications, test paper, scoring criteria, test takers' scores and auditory records of the test.

Questionnaire

A questionnaire survey is an important method for understanding test-taker preferences and opinions (Fulcher, 2003; Alderson et al, 1995). The questionnaire in this empirical study aims at gathering the opinions of subjects, as well as their comments and suggestions. It was reviewed and corrected by an expert in the field of Language Testing before being sent to the subjects. There are four versions of the questionnaire: questionnaires to test takers of the two different formats and questionnaires to testers of the two different formats. Each questionnaire, along with the background information about the subject, can be divided into four parts (see Appendices 5-8): Part I, subjects' opinions about various aspects of validity and reliability; Part II, opinions about aspects of the current speaking test which should be improved and corresponding suggestions; Part III, opinions about the preferred test format; Part IV, the effect on English teaching and learning.

To ensure the questionnaire was designed soundly, one of the four versions was selected to test its validity and reliability. Factor analysis and internal consistency analysis were carried out using SPSS. The results are shown in Table 4.2:

In this table, the Kaiser-Meyer-Olkin Measure of Sampling Adequacy (KMO) value of .705 exceeded the recommended value of .6, and Bartlett's Test of Sphericity reached statistical significance, supporting the factorability of the correlation matrix. Furthermore, the questionnaire has good internal consistency, with a Cronbach's alpha coefficient of .747, which indicates a high degree of reliability.
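For readers unfamiliar with Cronbach's alpha, the following is a minimal Python sketch of the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. It is not the SPSS procedure used in this study; the function name cronbach_alpha and the response matrix are hypothetical and serve only to show how the coefficient is obtained.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    n_items = len(responses[0])
    if n_items < 2 or any(len(row) != n_items for row in responses):
        raise ValueError("every respondent must answer the same two or more items")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(n_items)]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all total scores are identical")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers of five respondents to four Likert items (1-4).
answers = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 3],
    [1, 2, 1, 2],
    [3, 3, 4, 4],
]
alpha = cronbach_alpha(answers)
print(round(alpha, 3))
```

The coefficient rises when respondents who score high on one item also tend to score high on the others, which is why it is read as a measure of internal consistency.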

Telephone Interview

Interviews with testers through telephone are complementary and supportive of the other resources. Questions to the testers cover the evaluation of usefulness of the test in the light of Bachman and Palmer's (1996) checklist for logical evaluation of test usefulness and their own perspectives on the reliability and validity of the tests, test operation and improvement, and the effects the test creates on teaching and learning.

Data Collection And Analysis

The test materials, students' test scores, and audio records were collected from the CED at DLNU. Then 235 copies of students' questionnaires, along with 24 copies of testers' questionnaires, were collected. Each set of data collected was analyzed qualitatively or quantitatively to answer the research questions thoroughly.

Data were then entered into Excel and the Statistical Package for the Social Sciences 16.0 (SPSS 16.0) for statistical analyses. Subjects' answers on the questionnaires were processed in SPSS for basic descriptive analysis and frequency analysis, and then correlation analysis was conducted to provide more information about the validity and reliability of the tests under investigation.

Result And Discussion

This chapter presents results of the data analyses including both qualitative data and quantitative data. Alderson et al (1995) emphasize that it is best to validate a test in as many ways as possible. Therefore, research data from various sources are provided as far as possible.

Theoretical Evaluation

Results And Discussion From Testing Materials

Evaluation Of Test Content

Alderson et al (1995:173) suggest that a common way to validate the content validity of a test is to analyze its content and compare that content with “its specification, a formal teaching syllabus or curriculum”. Luoma (2004) also suggests validating the test by relating the test task to the test purpose and test construct.

According to the CECR (CME, 2004), the construct is defined as the ability to “communicate in English in the course of learning, to conduct discussions on a given theme, to talk about everyday topics with people from English-speaking countries”, to “give, after some preparation, short talks on familiar topics with clear articulation and basically correct pronunciation and intonation”, and to “use basic conversational strategies in dialogue.” Thus, this construct involves communicative competence in the light of the theory of CLA. However, in the “sub-skills to be measured” section, grammatical accuracy, pronunciation, intonation, use of fairly accurate vocabulary and sentence patterns, etc., are given much weight. In the interview format of the test, communicative competence is stressed and tested, though not in much depth. The computer-aided test, in particular, has not attached enough importance to communicative competence.

The Curriculum Requirements have defined the topics as “everyday topics” and “familiar topics.” The topics tested at DLNU cover quite a wide range, such as pets, music, shopping, public speaking, safety, hobbies, crimes, values on money, drunk driving, movies, examinations, hair colouring, family relations, love and marriage, and so on. These are very familiar topics for students and occur in their daily lives. These topics are compatible with the teaching syllabus. Thus, the content validity is considered to be fairly high. However, one aspect that can be improved is that academic study is not included as a topic in the test. Public speaking can be regarded as a presentation skill, but academic topics and situations should feature more in the test content, as they are a primary aspect of examinees' daily life.

Evaluation On Scoring Criteria

The rating criteria also need to be evaluated to check whether they are coherent with the test purpose and the construct (ibid). The rating scale is defined as “the speech is complete and coherent in answering question, rich in content, with correct pronunciation and fluency and almost no grammatical errors.” The task demands and performance qualities are seen in terms of pronunciation, fluency, grammar, and coherence, so they are assessed in terms of linguistic criteria instead of communicative criteria. They are not quite coherent with the communicative purpose that the construct defines. Furthermore, the criteria are not defined concretely and precisely enough to make them easy to use.

Luoma (2004) recommends that the test administration and scoring processes be evaluated in terms of their consistency and their coherence with the construct definition. In addition to these, validation of the test includes examinee attitudes to and experiences with the test, the washback effect (the effect of the test on teaching), and teacher or learner attitudes towards learning and the test (ibid), which will be evaluated and discussed in the following sections through various instruments.

Results From Telephone Interview

Bachman and Palmer (1996:150) propose a checklist for logical evaluation of the usefulness of a given test. The questions listed in Table 5.1 are ones to indicate the degree to which the reliability and the validity have been satisfied in the speaking tests at DLNU. Answers are elicited through telephone interview.

These questions offer a very detailed elicitation for an in-depth investigation from a theoretical perspective. With the exceptions of questions 5, 8, 9, and 10, answers to these questions along with the corresponding explanations, demonstrate a high and positive result of the logical evaluation of the reliability and validity of the speaking test. From these explanations, we can safely draw a conclusion that at the theoretical level the speaking test at DLNU has a high degree of reliability and validity.

Empirical Evaluation

Every facet of validity and reliability, and the impact of the speaking test are included in the questionnaires.

Results From Questionnaires Of Students Of Interview Speaking Test

A Strongly disagree,

B Disagree,

C Agree,

D Completely agree.

In this table, the numbers of the objective questions are listed in the leftmost column. The data under Columns A, B, C and D are the percentages of respondents who chose each option. The “Mean” column shows the average choice number for each question, and the “Std. Deviation” column is a measure of the variability or dispersion of the responses. A low standard deviation indicates that the data points tend to be very close to the mean, whereas a high standard deviation indicates that the data are spread over a large range of values.
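The relationship between the percentage columns and the mean and standard deviation can be sketched in Python as follows. The sketch assumes that the options are coded A=1, B=2, C=3 and D=4, which the source does not state explicitly, and it uses a hypothetical distribution of answers rather than figures from the questionnaire. It computes the population standard deviation from percentages; SPSS reports the sample standard deviation, which is slightly larger for a group of this size.

```python
from math import sqrt

# Assumed numeric coding of the four options (not stated in the source).
CODES = {"A": 1, "B": 2, "C": 3, "D": 4}


def likert_mean_sd(percentages):
    """Mean and population standard deviation from the percentage choosing each option."""
    total = sum(percentages.values())
    if total <= 0:
        raise ValueError("percentages must sum to a positive value")
    # Weighted mean of the option codes.
    mean = sum(CODES[opt] * pct for opt, pct in percentages.items()) / total
    # Weighted mean of squared deviations from that mean.
    var = sum(pct * (CODES[opt] - mean) ** 2 for opt, pct in percentages.items()) / total
    return mean, sqrt(var)


# Hypothetical distribution of answers to one question (percent choosing each option).
example = {"A": 10.0, "B": 20.0, "C": 40.0, "D": 30.0}
mean, sd = likert_mean_sd(example)
print(round(mean, 2), round(sd, 3))
```

Under this coding, a mean above 2.5 indicates that agreement outweighs disagreement, and a small standard deviation indicates that most respondents chose the same or neighbouring options.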

In the following part, the result of each question is analyzed one by one with the statistical data.

Question 1. The test scores accurately estimate candidates' oral proficiency.

This question was used to investigate the face validity of the test. The data in this table show that 82.1% of students think the interview test can fairly and accurately test their oral proficiency. Only 16.1% of students responded negatively. The standard deviation for responses to this question is the lowest of all, only 0.456, which indicates that test-takers vary little in their responses to this question. In students' eyes, the test is quite valid.

Question 2. The interviewer can keep the friendly attitude all the time.

As one facet of the reliability of the test, this question aims at investigating whether the attitude of the interviewer has an effect on the examinees' performance. The cumulative percentage of positive responses from students amounts to 97.3%, and the mean score of 3.54 is the highest of all the questions. This indicates that variation among interviewers has quite a minor effect on test-takers. Thus, in this respect, the reliability of the test score is high.

Question 3. The topic answering in the first part of the speaking test is the most capable of testing the candidate's oral proficiency.

Question 4. The impromptu question and answer in the second part of the speaking test is the most capable of testing the candidate's oral proficiency.

Questions 3 and 4 investigate the face validity of the test from the perspective of test content. A majority of the students, 70.4%, expressed positive attitudes toward the first part of the test, while 28.6% of students expressed negative comments about it. A very high percentage, 90.1% of students, spoke of the second part in positive terms. The percentage of students who chose D (completely agree) for Question 4 (31.2%) is 21.4 percentage points higher than for Question 3 (9.8%). These results show that more students are decidedly satisfied with the impromptu question and answer.

In the later open question 16, a high proportion of students expressed a wish for more flexible and impromptu questions and activities to be included in speaking tests. A number of students found the format boring and monotonous. Many students believed that the topics tested should be more interesting, up-to-date (instead of cliché, as one student pointed out) and close to life.

Question 5. The time is sufficient to demonstrate one's oral language proficiency.

This question aims to check whether the time allocation is appropriate and sound. The results show that 60% of students feel it is fairly reasonable and 13% believe the time is quite reasonable, while 31.2% think it is less reasonable. Some students requested a longer time for the whole interview, because they felt it was not long enough to demonstrate their real oral proficiency.

Question 6. Instructions of the test are clear.

This question shows another facet of the reliability of the test. Up to 89.3% of students are satisfied with this aspect.

Question 7. The test is fair for all the candidates.

This question is another evaluation of the face validity of the test. A substantial majority, 78.6% of all students, agreed that the test topics in one test are of a similar level of difficulty, depth, and familiarity to the students. They also trust that the tester's attitude is unbiased toward every test-taker.

Question 8. I spent lots of time preparing for the oral test.

This question was used to investigate the backwash effect of the test on students. Very few, 4.5% of the students, reported not preparing for the test at all; 36.6% of the students did not prepare much; 41.1% of students spent time preparing for it; and 17.9% prepared a lot for the exam. The answers to this question show the highest standard deviation, which means students' attitudes towards preparation vary to a higher degree than their attitudes toward other facets of the test. The reasons are shown in the open question 17 --- “How do you prepare for the oral test?”:

a. A small number of students believe that the oral test should not include prepared speech because unprepared speech is more capable of demonstrating one's oral proficiency.

b. A high proportion of students admit that they search online for information concerning the topics, then organize them into written scripts and memorize them by heart.

c. A smaller percentage of students claim that they search for information and converse with classmates for practice.

The investigation of the impact that the test has on learners shows that the speaking test has some beneficial effects on English learning. However, many students have not grasped or used the better methods of preparing for the speaking test. Searching for information expands their topical knowledge, writing articles exercises their organizational knowledge, namely grammatical and textual knowledge, and conversing with classmates enhances their pragmatic knowledge. More conversation practice should be included in their preparation.

Question 9. Generally speaking, I think the oral test helps to develop my English oral proficiency.

This question investigates another aspect of the impact of the test on test-takers. Most students, 72.7%, were positive about the effects of the oral proficiency test, while 27.7% of students did not believe the test had a positive effect on their oral proficiency. Most students stated their awareness of the importance of spoken English and expressed their appreciation for the speaking test.

Question 10. Generally speaking, I think the oral test has a positive effect on English teaching and learning.

The impact of the test on English teaching and learning is another aspect of the test. The proportion of students who commented positively on the effects on teaching and learning, 84.8%, is higher than for previous questions, while only 15.2% of students regarded the effects negatively. Open question 18 is “Do you think the test is connected well to the teaching?” The answers show that most students realize the importance of oral practice in English acquisition and expect to have more chances to practice spoken English in classes. A number of students also believe the oral test is detached from English classes because the teaching materials contain a large amount of obscure vocabulary, which can hardly be of help with their spoken English.

Question 11. In my opinion, pronunciation and intonation is the most important factor in the oral test.

Question 12. In my opinion, vocabulary and sentence structure is the most important factor in the oral test.

Question 13. In my opinion, communicative skill is the most important factor in the oral test.

Question 14. In my opinion, accuracy is a more important factor in the oral test.

Question 15. In my opinion, fluency is a more important factor in the oral test.

Questions 11 through 15 aim at investigating students' understanding of the rating criteria. Comparing the five groups of data shows that communicative skill is believed to be the most important. Its mean, 3.36, is the highest among the five questions, and its standard deviation is comparatively low. Fluency ranks second, with a mean of 3.19. Pronunciation ranks third, with 77.7% of students viewing it positively and a mean of 3.01. Accuracy ranks fourth; 66.1% of students view it positively. Finally, vocabulary and sentence structure rank the lowest, with a mean of 2.61. A number of students state that they read aloud following the tape to improve their fluency and the accuracy of their pronunciation and intonation.

Question 19. Would you prefer to be tested

A. Through speaking to the computer. B. Through interview with teachers alone

C. Through paired or group interview

Most students prefer to be tested through an interview with the teacher alone. The proportion preferring an interview with the teacher alone is 68.8%, and the proportion preferring a group interview is 23.2%. A large percentage of students expressed positive feelings about the interview test. Individual testing makes them feel more relaxed and focused on the talk. Students believed they had more chances to speak compared with the paired test in the previous term. The stated reasons are:

a. They do not fear the embarrassment of talking with peers when they frequently stumble and pause;

b. They receive more interaction and feedback from the teacher so that they believe they will realize more benefits;

c. They believe the interview presents a good opportunity to talk to the teacher closely, leading to better mutual understanding with the teacher;

d. They think the interview provided a good chance to improve their personality, since it resembles the occasion of a job interview;

e. They believe the interview avoids the impact that the peers might create on the test result;

f. The teacher's smile, encouraging eyes, and eliciting language help to boost their confidence and inspire better task performance.

The stated reasons for the paired or group interview are:

a. Conversations with peers can inspire richer and deeper content from diverse perspectives;

b. The occasion is more authentic to real life;

c. Cooperation with peers can better relieve the anxiety and tension of the exam and create more commitment;

d. The paired or group interview encourages and motivates more practice and preparation with classmates for the test;

e. The method allows a good comparison with classmates during the test, which helps them improve their competence;

f. The method helps learners get their meaning across effectively, since some of them believe they can hardly be understood by others because of poor pronunciation or wrong expressions.

Results From Questionnaires Of Students Of Computer-Aided Speaking Test

Table 5.3 below shows the responses of students who took the computer-aided speaking test to the questions in the questionnaire. Students were offered 16 statements and asked to choose the letter that best matched their opinion. They were then asked to fill in the bracket in front of each sentence:

A. Strongly disagree; B. Disagree; C. Agree; D. Completely agree (see questionnaire in Appendix 6).

Table 5.3 Questionnaire results from the students of computer-aided speaking test

In the following part, the responses to each individual question are analyzed with the statistical data.

Question 1. The test scores accurately estimate candidates' oral proficiency.

This question is designed to investigate the face validity of the test. The data in this table show that 71.4% of students think the computer-aided test can fairly accurately test their oral proficiency, while 28.3% of students responded negatively. In students' eyes, the computer-aided test is also valid, but comparing its mean score of 2.75 with 2.84 for the interview speaking test, the computer-aided test shows lower face validity.

Question 2. The self-introduction in the first part of the oral test is the most capable of testing the candidate's oral proficiency.

Question 3. The second part --- Text Reading is the most capable of testing the candidate's oral proficiency.

Question 4. The third part --- topic answering is the most capable of testing the candidate's oral proficiency.

These three questions investigate the face validity of this test from the perspective of its content. Responses from students show that their attitudes towards the three parts of the test are quite different. Part III ranks first, Part II second and Part I last, with mean scores of 3.22, 2.64 and 2.41 respectively. The reason why 56.7% of students do not think self-introduction is a good match between the test task and their spoken ability is that the self-introduction is completely prepared and was already tested in the previous semester. Students believed that the Text Reading part promoted their practice of pronunciation and intonation, and it was viewed as more acceptable among test-takers.

Question 5. The time is sufficient to demonstrate one's oral language proficiency.

This question aims to check whether the time allocation is appropriate. The results show that 55.8% of students feel it is fairly reasonable and 20.4% feel it is quite reasonable, while 23.9% think of it as less than reasonable. This question receives better feedback from test-takers in this format than from those in the interview format. The reason is that in the computer-aided test, test-takers are the dominant subjects, so they feel more in control, while in the interview, the test-takers are the dominated subjects and therefore find the time allotment inflexible.

Question 6. Instructions of the test are clear.

This question shows another aspect of the reliability of the test. Up to 91.1% of students are satisfied with this aspect, and the mean score, 3.35, is the highest among all the questions, compared to a mean score of 3.19 in the interview test. Because all the instructions are delivered through the computer, test-takers in this format show greater satisfaction than those in interviews.

Question 7. The test is fair for all the candidates.

This question is another evaluation of the face validity of the test. Most students (75.7%) agree that the test topics and test environment are equally difficult for them. This opinion is echoed in the open questions. Because the raters do not have face-to-face contact with them, students believe that subjective elements are eliminated and therefore scoring reliability is higher. However, the mean score (2.91) is lower than that of the interview format (3.04). The reason is also explained in the open questions: test-takers believe that learners who are good at memorizing have an advantage over those who are not. They believe that the lack of interaction in the test leads to unfair scoring.

Question 8. I spent lots of time preparing for the oral test.

This question is designed to investigate the backwash effect of the test on students. A few students (6.2%) did not prepare for the test at all, 28.3% of students did not prepare much, 43.4% of students spent time preparing for it and 22.1% prepared a lot for it. The replies to this question also show the highest standard deviation. The mean score (2.81) is a bit higher than that of its counterpart in the interview test (2.72). The reason for the higher mean score is that, due to the lack of interaction, test-takers need to prepare more for the test task, while in the interview test there is less content that they can prepare.

Question 9. Generally speaking, I think the oral test helps to develop my English oral proficiency.

This question investigates another aspect of the impact of the test on test-takers. A significant minority of students, 35.4%, do not think the test had a positive effect on their oral proficiency (27.7% for the interview), while 64.6% of students are positive about the effect of the test on them (72.7% for the interview). Thus, the computer-aided test is perhaps less motivating than the interview test. This opinion is echoed in open question 20, which will be elaborated in the discussion of Question 19.

Question 10. Generally speaking, I think the oral test has a positive effect on English teaching and learning.

The effect of the test on English teaching and learning is another aspect of the impact of the test. The proportion of students who reported positive effects on teaching and learning, 85.8%, is higher than for the previous questions, while only 14.2% of students think of them negatively. The two mean scores are quite close: 3.11 for the computer-aided test and 3.13 for the interview test. Some learners in both test groups believe that spoken English is not well stressed and developed in English classes.

Question 11. In my opinion, pronunciation and intonation is the most important factor in the oral test.

Question 12. In my opinion, vocabulary and sentence structure is the most important factor in the oral test.

Question 13. In my opinion, communicative skill is the most important factor in the oral test.

Question 14. In my opinion, accuracy is a more important factor in the oral test.

Question 15. In my opinion, fluency is a more important factor in the oral test.

Questions 11 to 15 all aim at investigating students' understanding of the rating criteria. Comparing the five groups of data shows that pronunciation and intonation rank first as the most important factors, with the highest mean score of 3.35. Communicative skill ranks second, with a mean of 3.20. Fluency ranks third, with 54.9% of students viewing it positively and a mean score of 3.03. Vocabulary and sentence structure rank fourth. Finally, accuracy ranks the lowest, with a mean of 2.59. The difference from the interview test is presumed to arise because the different test tasks promote different priorities.

Question 19. Would you prefer to be tested

A. through speaking to the computer. B. through interview with the teacher

Clearly, students prefer to be tested through an interview with the teacher. The proportion preferring an interview with the teacher is 61.9%, and the proportion preferring the computer-aided test is 38.1%. The stated reasons for preferring an interview with the teacher are:

a. They will receive more interaction and feedback from the tester so they believe they can benefit more from the test;

b. They feel more motivated to talk when the listener is a human;

c. They think speaking to a human listener provides a good chance to strengthen their psychological resilience and their personality, since it resembles the occasion of a job interview;

d. While talking with a computer, staring at the flickering countdown timer on the screen makes them nervous;

e. When talking with a computer, if they forget the script or finish the speech before the time is up, they have to use a lot of vocal fillers to fill the silence. (This is consistent with Luoma's (2004:35) observation that an interview is “flexible in that the questions can be adapted to each examinee's performance”.)

f. Because they are tested with a large number of classmates simultaneously in one room, they feel pressure and sometimes their performances are affected by the noise of others' speech;

g. Interactive conversations push them to improve their listening abilities and to practice more with classmates while preparing for the test.

The stated reasons for preferring the computer-aided test are:

a. Facing the teacher makes the test-taker more nervous, while talking to a computer differs little from practice and therefore creates a sense of safety.

b. The computer-aided test is convenient and time-efficient.

c. Scoring reliability is enhanced because the teacher's subjective impression is eliminated.

d. Being able to listen to their own recordings gives them a better self-evaluation.

e. It helps to reduce the teacher's boredom.

Results From Teachers Questionnaire

In this table, the data under columns A, B, C and D are the percentage frequencies, which show how often each choice was selected. The “mean” column shows the average of the choices, and the “mode” column shows the choice that occurs most frequently.

Question 1. The test scores accurately estimate candidates' oral proficiency.

A high proportion of testers, up to 87.5%, believe the test scores are a good estimate of the candidates' oral proficiency. A small number of testers think there are occasions when some students can hardly do themselves justice because of anxiety and nervousness. Several testers point out that the test tasks are not difficult enough to differentiate the higher-level students from the fairly high-level ones.

Question 2. I have fully understood and grasped the scoring criteria to judge the candidate's performance justly.

Most testers, 87.7%, think they understand the marking scales, and 54.2% of them seem to be quite sure of it. The mean is 3.38, and the mode is 4, but the standard deviation is quite high, 0.842, which shows inconsistency among testers.

Question 3. I am able to assess each candidate in an unbiased and impartial way.

Most testers, 87.5%, think they can, yet 12.5% of testers think their judgements are influenced to a small degree by their impressions of candidates' usual performance.

Question 4. I think the self-introduction in the first part of the oral test is the most capable of testing the candidate's oral proficiency.

Half of the testers rated this item positively and half rated it negatively, but none of them completely agree that self-introduction is a task capable of testing candidates' oral proficiency. The standard deviation is 0.771, indicating some inconsistencies among them. Most testers think this task is inflexible and ineffective, but some think it can help low-level candidates to prepare for the exam.

Question 5. I think the question-answering of the oral test is the most capable of testing the candidate's oral proficiency.

All the testers think this test task is an appropriate one. The standard deviation is 0.495, which indicates the consistency of their opinions. Some teachers commented that the topics are closely related to the teaching materials but are not novel and stimulating enough.

Question 6. I think the time is reasonable and sufficient to demonstrate one's oral language proficiency.

The vast majority of testers, up to 87.5%, think the amount of time is reasonable and sufficient, but a number of teachers think time is a bit short to demonstrate learners' real spoken ability since more time would be required to get used to the test environment and

Question 7. I think the test is fair to all the candidates.

The replies to this question show fairly consistent agreement, with a standard deviation of 0.464. When asked whether the test score is consistent with learners' oral proficiency, 90% of testers think that test scores can reflect the test-takers' oral proficiency consistently.

Question 8. Generally speaking, I think the oral test has a positive effect on English teaching.

There is unanimous appreciation of the positive effect of the oral test on English teaching, with 62.5% of testers stating that they completely agree and 37.5% stating that they agree. This opinion is echoed in open question 11: Do you think the test is connected well to the teaching? How do you connect your teaching to the oral test? Many teachers present their teaching experience in support of the application of CLT and the adoption of speaking tests:

a. They organize group activities in class and outside class, such as drama, discussion, debate, role-play, text retelling, etc, and students are only allowed to speak English;

b. Topics are assigned to encourage students to search for information and present them in speech or in written form in order to expand their topical and organizational knowledge;

c. Basic sentence patterns, useful phrasal expressions, and frameworks of speech are offered prior to learners' presentation in class;

d. Students are expected to use correct pronunciation and intonation in class, and instructors encourage students to correct each other.

Compared with the teachers' responses on the relation between the speaking test and teaching, the students' degree of satisfaction is lower. The potential reasons are also stated by some teachers:

a. The major teaching textbooks provide difficult texts and a large vocabulary, hardly applicable to spoken practice;

b. The Viewing, Listening & Speaking textbook is more practical for applying CLT, but it is assigned to only one fourth of all the class hours;

c. High-level students always take an active part in activities, but others, especially low-level students, often feel ignored in view of limited class time and large class sizes (except in the Dep. of CEAT, classes from other majors have 50-70 students);

d. National CET has exerted its authority on every college and university and many of them have set the scores of CET as the criteria to assess the teaching quality. Under this pressure, grammatical knowledge is still the focus of teaching.

Question 9. Which aspects of speaking are more important for you in judging the candidates' performance? Please mark the following in order of importance or priority: 1= most important; 4= least important.

Pronunciation

Vocabulary richness

Sentence structure

Fluency

This is a ranking question that analyzes the perspectives of the testers in order to investigate inter-rater reliability. To make the work more quantifiable, the frequency of each option is calculated. Each facet's rank is converted to a score on a corresponding 4-point scale (1, 2, 3 and 4), on which “1” represents a rank of fourth and “4” represents a rank of first. For instance, if the facet is ranked 1, it is entered as “4”; if it is ranked 2, it is entered as “3”, and so on.

The results show that pronunciation and fluency are assigned the most importance, but their standard deviations are the highest among all the questions. Vocabulary richness is given less importance, and sentence structure is given the least consideration. A number of raters admit that they feel confused about their choices. How much weight to give to each part is quite problematic. This confusion reveals quite inconsistent judgement standards among raters.
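The recoding described above can be sketched in a few lines of Python. This is only an illustration: the rankings below are hypothetical and are not the study's data, and the names `rankings`, `rank_to_score` and `summarize` are introduced here for the example.

```python
from statistics import mean, stdev

# Hypothetical rankings from three raters (1 = most important, 4 = least important).
# These values are illustrative only and are not taken from the study.
rankings = [
    {"pronunciation": 1, "vocabulary": 3, "sentence_structure": 4, "fluency": 2},
    {"pronunciation": 2, "vocabulary": 3, "sentence_structure": 4, "fluency": 1},
    {"pronunciation": 4, "vocabulary": 2, "sentence_structure": 3, "fluency": 1},
]

NUM_RANKS = 4


def rank_to_score(rank):
    """Reverse-code a rank so that rank 1 becomes 4 and rank 4 becomes 1."""
    return NUM_RANKS + 1 - rank


def summarize(rankings):
    """Return {facet: (mean score, sample standard deviation)}."""
    summary = {}
    for facet in rankings[0]:
        scores = [rank_to_score(r[facet]) for r in rankings]
        summary[facet] = (mean(scores), stdev(scores))
    return summary


summary = summarize(rankings)
for facet, (m, sd) in summary.items():
    print(f"{facet}: mean={m:.2f}, sd={sd:.2f}")
```

A higher mean then indicates a facet that raters consider more important, and a higher standard deviation indicates more disagreement among raters about that facet.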

Question 13. Which form of oral test do you prefer?

A. Candidates talking to the computer. B. Candidates talking with the teacher

The vast majority of testers, 83.3%, prefer interviews with candidates because of the higher level of authenticity, better understanding of learners' levels, better flexibility of topic, and so on. Group talk between students is suggested because the individual interview is quite boring for them. A small number of teachers also state that it is hardly possible to interview several thousand students at the end of the semester.

Results From Statistical Data

Descriptive Analysis Of Spoken Scores

In the empirical study, basic descriptive analyses are conducted to investigate the validity and reliability of the speaking tests. All scores from the speaking test were entered into the database in order to obtain basic descriptive statistics. One hundred complete cases (called “valid” in column 1) are included in this analysis, but 12 with missing scores are excluded. The descriptive statistics are presented in Tables 5.5 and 5.6. Table 5.5 provides minimum and maximum scores, mean scores and standard deviation, along with Cronbach's Alpha.

As the diagram provided in Figure 5.1 illustrates, the scores are reasonably normally distributed, with most scores occurring in the center, tapering towards the extremes.

As was the case with the spoken scores, few students achieved the minimum or maximum scores with most students scoring in the midrange. The histogram presented in Figure 5.2 provides a visual representation of the frequency of scores.

In the above descriptive analysis of the spoken scores, the normal distributions of scores indicate that the test is a reliable one of moderate difficulty. The Cronbach coefficients of the two speaking tests are .689 and .597 respectively. Based on the large sample (N=100+), these coefficients are not very high but are acceptable. The interview-format speaking test shows slightly higher reliability than the computer-aided test.

Statistical Analysis Of Test Scores

Alderson et al (1995) suggest that one good method of analyzing the construct validity of any test is to correlate each subtest with the other subtests. Because raters give only a holistic score, scores for each section of the speaking test are not available. The relationship between each pair of subtests was investigated using the Pearson product-moment correlation coefficient. The internal correlation matrices are presented in Tables 5.9 and 5.10.

Table 5.9 The internal correlation matrix of interview speaking test

Pearson correlation coefficients take on values from -1 to +1. When based on a large sample (N=100+), very small correlations may be statistically significant (Jin, 1999). The above tables show that the correlation coefficients between the speaking score and the other three tests range from .392 to .502, and the levels of significance range from .003 to .057. The highest coefficients in the two tables are those of the speaking test with the writing test, .502 and .460 respectively, which means they are the most highly correlated. This result supports the argument (Bachman and Palmer, 1996) that speaking and writing skills are both productive modes, and that the exams tested the same modes of skills.

The conclusion can be safely drawn that there is a moderate, significant positive correlation between the speaking test and the writing, reading, and listening tests. In other words, the speaking test has acceptable construct validity (divergent validity). The speaking tests at DLNU have measured the construct (language competence) that they are supposed to and claim to measure.
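As a rough illustration of how such an internal correlation matrix is obtained, the following Python sketch computes Pearson product-moment coefficients between subtest scores. The study itself used SPSS; the scores below are hypothetical and the names `pearson_r`, `scores` and `matrix` are assumptions made for this example.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    covariance_sum = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    return covariance_sum / math.sqrt(ss_x * ss_y)


# Hypothetical subtest scores for six test-takers (illustrative, not study data).
scores = {
    "speaking": [12, 15, 9, 14, 11, 13],
    "writing": [11, 14, 10, 13, 10, 14],
    "reading": [30, 28, 25, 33, 27, 29],
    "listening": [20, 24, 18, 21, 22, 19],
}

# Correlate every subtest with every other subtest.
matrix = {a: {b: pearson_r(scores[a], scores[b]) for b in scores} for a in scores}
for other in ("writing", "reading", "listening"):
    print(f"speaking vs {other}: r = {matrix['speaking'][other]:.3f}")
```

The sketch returns only the coefficients. The significance levels reported in the tables would need a separate test, which is not shown here.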

Investigation Of Inter-Rater Reliability

To investigate marking reliability, inter-rater reliability is analyzed. Thirty-five audio recordings were sampled, and another rater was asked to mark the recordings after the first rating. The second marker was not aware of the score given by the first marker. The scores of each test-taker were then entered into the database. Table 5.11 compares the ratings of the two independent raters.

After the data are processed by SPSS, the Cronbach coefficient is calculated as .676. Given the small sample, this result indicates reliability that is not high but is acceptable.
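For readers who want to see what the coefficient measures, here is a minimal Python sketch of Cronbach's alpha that treats each rater as one "item". It is an illustration under stated assumptions rather than a reproduction of the SPSS procedure: the two score lists are hypothetical, and `cronbach_alpha`, `rater_1` and `rater_2` are names introduced for the example.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of score lists, one list per rater or item."""
    k = len(items)
    if k < 2:
        raise ValueError("need at least two raters or items")
    # Sample variance of each rater's scores.
    item_variances = [variance(scores) for scores in items]
    # Total score per recording, summed across raters.
    totals = [sum(case) for case in zip(*items)]
    return (k / (k - 1)) * (1 - sum(item_variances) / variance(totals))


# Hypothetical holistic scores from two raters for the same eight recordings
# (illustrative only, not the 35 recordings sampled in the study).
rater_1 = [12, 14, 9, 11, 13, 10, 15, 8]
rater_2 = [11, 15, 10, 10, 12, 12, 14, 9]

alpha = cronbach_alpha([rater_1, rater_2])
print(f"alpha = {alpha:.3f}")
```

The coefficient approaches 1 as the two raters' scores vary together more closely and falls as their ratings diverge.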

Summary Of The Validity And Reliability Evaluation

Recall that Research Question 1 was: a. To what extent is the speaking test in DLNU reliable? and b. To what extent is the speaking test in DLNU valid? Given the findings from all the test materials, questionnaires, and test records at DLNU, the speaking test has an acceptable level of reliability, in view of the test setting, tester attitudes, imprecise scoring criteria, degree of inter-rater marking consistency, etc. The test scores reflect the test-takers' oral proficiency consistently, which echoes Bachman and Palmer's (1996) statement. The test also demonstrates a moderate degree of validity, in light of positive feedback from students and teachers, significant correlation between different sub-test scores, and comparison with the teaching syllabus.

The second research question was “in what aspects and to what extent may the validity and reliability of the speaking test in DLNU be improved?” This research question will be answered in more detail in the next chapter.

Recommendations And Implications

The evaluation of reliability and validity, and the analysis of the effect that the speaking test has on students' learning, reveal that the scoring reliability, the test content, the test format, and the relation between the test and teaching and learning are acceptable but need to be improved.

The first major finding from the test score data, the teacher questionnaire and the tester interview is that the test scores can reflect the test-takers' oral proficiency consistently, but “subjectivity” still occurs to a certain degree in the scoring process. Firstly, in the theoretical evaluation of the previous chapter, the rating scale was not found to be completely coherent with the test construct. Then the analysis of the testers' questionnaires reveals disparity in their considerations when judging test-takers' performance (the standard deviations for pronunciation and fluency are 1.060 and 1.100 respectively). In the investigation of inter-rater reliability, the Cronbach coefficient is not very high (.676 based on 35 samples). There is also a discrepancy in students' understanding of the scoring criteria (the average standard deviation is .668).

The second finding concerns the students' attitudes towards the content of the speaking test, as well as the aspects in which and the extent to which the test may be improved. I am convinced by the data from the student questionnaires that the test content needs to be better chosen and organized, although a high proportion of test-takers show appreciation of the test. Of the students, 28.6% expressed negative attitudes towards the prepared speech in the interview test, 56.7% spoke of the self-introduction part in negative terms, and 41.6% thought the text reading was not very effective.

Third, the link between the speaking test and CE teaching and learning needs to be strengthened. Most students regard the test as a motivation to learn oral English instead of a compulsory and boring task to complete. Of the learners in the computer-aided test, 22.1% practice conversations with peers as preparation for the test, while the rest prepare by reciting the speech. There may be a mismatch between the teachers' and the students' views of the relation between the test and CE teaching. All the teachers think the speaking test has a positive effect on English teaching and learning, while 27.7% of students in the interview format and 35.4% in the computer-aided speaking test do not agree. The teachers' degree of satisfaction is clearly higher than that of the students.

The data also help the researcher realize that both test formats have some problems. In the interview test, 68.8% of test-takers like its format, although 32.2% of students would like to take other formats. In the computer-aided test, 38.1% of test-takers like its format, and 61.9% of them would like to be tested in the face-to-face format. Both the teachers and most students approve of pair or group conversation or role play.

Therefore, the author proposes several measures to improve the reliability and validity of the test.

Enhancing Scoring Reliability

There are some special procedures needed to enhance scoring reliability.

High-Quality Scoring Instruments

High-quality scoring instruments are indispensable to consistency of scoring. First, test developers need to design a more detailed rating scale and use more of a communicative approach in light of the test construct. Well-defined criteria help raters agree, as is mentioned in Chapter 2. The rating form for the test can be applied to ensure the consistency of rating procedures (Luoma, 2004). Second, both holistic and analytic criteria should be used during the scoring process and should complement each other to ensure better evaluation of performance. Third, recording of the assessed performance, especially in the interview test, should be encouraged as evidence for evaluating the reliability of the rating afterwards.

Rater Training

Raters should not only acquire a full understanding of the criteria but also practice rating by viewing taped performances or itemized live speaking tests (Luoma, 2004). They should then report their scores and discuss the reasons for the consensus score. In this way, they will fully grasp the benchmarks and the levels of the scale and so arrive at a common standard.
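The consistency that rater training aims at can be checked numerically once two raters have scored the same set of recorded performances. The following is a minimal sketch, not the procedure used in this study: the function names, the five-band scale and the scores are all hypothetical, and Pearson correlation together with exact and adjacent agreement rates is only one common way of summarizing inter-rater reliability.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a rater gives constant scores")
    return cov / (var_x * var_y) ** 0.5


def agreement_rates(x, y, tolerance=1):
    """Share of performances with identical scores, and with scores
    that differ by at most `tolerance` bands."""
    n = len(x)
    exact = sum(a == b for a, b in zip(x, y)) / n
    adjacent = sum(abs(a - b) <= tolerance for a, b in zip(x, y)) / n
    return exact, adjacent


# Hypothetical scores (bands 1 to 5) given by two raters
# to the same eight taped performances during a training session.
rater_a = [3, 4, 2, 5, 4, 3, 2, 4]
rater_b = [3, 4, 3, 5, 3, 3, 2, 5]

r = pearson(rater_a, rater_b)
exact, adjacent = agreement_rates(rater_a, rater_b)
print(f"inter-rater correlation: {r:.2f}")
print(f"exact agreement: {exact:.1%}, within one band: {adjacent:.1%}")
```

Performances on which the two raters differ by more than one band would be natural candidates for the consensus discussion described above.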

Examiner Training

Examiners should be trained and observed to ensure that they conduct the test with as little subjectivity as possible. The less examiner behavior varies, the more both inter- and intra-examiner reliability can be enhanced.

Improving The Speaking Test Format And Task

The students' expectations of a face-to-face test show that a communicative format is strongly preferred. However, given limited teacher resources and time, it is not practicable to adopt the interview test on a university-wide scale. Stansfield (1991) proposes that the SOPI is more appropriate for large-scale tests that require high reliability. Given these advantages of the semi-direct format, the computer-aided speaking test cannot be replaced by the interview. To meet students' interest in interactive dialogue, tasks testing pragmatic and communicative competence should be incorporated into the semi-direct format. “Non-human” elicitation procedures and paired work can be added to the computer-aided speaking test. Impromptu questions can add an element of unpredictability to the speaking test, so that test-takers feel more motivated to practice spoken English.

Test tasks need to be more authentic to improve the validity of the test. More diversified and motivating topics need to be given added weight when designing the test tasks. Self-introduction and prepared tasks can be eliminated. Various tasks, such as team discussion, debate, English drama, situational conversation and simulation speech, can be introduced into the face-to-face test. The content should not be formulaic; it should be regularly renewed, practical, and draw more on knowledge of intercultural communication. Non-verbal (visual) stimuli, such as pictures and cards, can also be considered, since these prompts can be more vivid and comprehensible to the test takers.

Testers in the oral test are expected to create a more relaxed atmosphere during the test to reduce the anxiety of the test takers. In an overly solemn atmosphere, test takers are too conscious of the test itself and concentrate less on communicating with their partners. This lack of concentration lowers the validity and reliability of the test.

Improving The Relation Between Speaking Test And CE Teaching And Learning

According to theories of teaching and learning, tests can have a strong impact on teaching and learning (Hughes, 2002). As introduced in Chapter 2, Hughes states that language teachers are engaged in “teaching a language through speaking” (Hughes, 2002:7). On the one hand, spoken language is the focus of classroom activity, although the teacher often has other aims: for instance, helping the student gain awareness of, or practice in, some aspect of linguistic knowledge (ibid.). On the other hand, the speaking test, as a device for assessing learners' language proficiency, also functions to motivate students and reinforce their language learning. Therefore, the speaking test at DLNU can be adapted to combine formative and summative assessment. Various tasks, such as team discussion, debate, English drama, situational conversation and simulation speech, can be assigned as assessment tasks. These tasks require learners to make constant and frequent efforts to practice spoken English. The teacher will keep track of their performance and progress in order to give formative assessment. At the end of the semester, a formal speaking test will be given as a summative assessment. Test-takers should be informed of the task type and scope, but not necessarily in as much detail as they are at present. After the test, reporting scores and giving detailed feedback will help to produce a positive washback effect on learners.

The Need For Pretesting

Even with trials, some unsuitable items still survive. Without pretesting and post hoc analysis, no institution can be absolutely sure that a test is reliable and valid (Fulcher, 1997a), especially a speaking test, which is full of “unpredictability and dynamic nature” (Brown, 2003). It is important for test developers to tailor the test to the test-takers, because the test-takers are the subjects of the test.
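One form that post hoc analysis of a pretest can take is an internal-consistency estimate across the test tasks. The sketch below computes Cronbach's alpha and assumes that each pilot test-taker receives a separate score for each task, which the holistic scoring used in this study does not provide; the function name and the pilot scores are hypothetical and serve only as an illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score table.

    `scores` holds one row per test-taker and one column per task.
    """
    if len(scores) < 2:
        raise ValueError("need at least two test-takers")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("every row needs the same number of tasks (at least 2)")
    # Sample variance of each task's scores across test-takers.
    task_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the test-takers' total scores.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return k / (k - 1) * (1 - sum(task_vars) / total_var)


# Hypothetical pilot data: five test-takers, three tasks, bands 1 to 5.
pilot = [
    [4, 3, 4],
    [2, 2, 3],
    [5, 4, 5],
    [3, 3, 3],
    [1, 2, 2],
]

alpha = cronbach_alpha(pilot)
print(f"Cronbach's alpha: {alpha:.2f}")
```

A task whose removal raises alpha noticeably would be a candidate for revision before the operational test.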

The above proposals and suggestions cover the main aspects of the reliability and validity of language testing that ought to be considered seriously. When seeking to improve the reliability and validity of a particular test, test developers must also take other qualities of test usefulness into account, such as practicality and authenticity.

Conclusion

Findings

In this empirical research, the author has evaluated the reliability and validity of the speaking test at DLNU from several perspectives: theoretical, empirical, descriptive and quantitative. Based on the results presented in Chapter 5, the following conclusions can be drawn.

First, the test scores, the tester and test-taker questionnaires, and the test materials indicate that the speaking test has an acceptable degree of reliability. The major problem lies in the lack of a clear rating scale based on the test construct. The test setting, test facilities and tester attitude are satisfactory enough to support reliability.

Second, the test also demonstrates a moderate degree of validity, in light of positive feedback from students and teachers, significant correlation between different sub-test scores, and comparison with the teaching syllabus. However, prepared speeches show lower face validity from students' perspectives towards the content of the speaking test. Students view topics as formulaic and believe they need to be reorganized.

At the same time, there may be a mismatch between the teachers' and the students' perceptions of the relation between the test and CE teaching. The link between the speaking test and CE teaching and learning needs to be strengthened. A fairly high proportion of students prepare for the speaking test with the sole aim of passing it. Teachers' satisfaction is clearly higher than students'.

The results also show that different test formats affect test takers' perceptions and help determine whether the test content is considered valid. There may be a relationship between test takers' perceptions and the test task. Additionally, the results show that perceptions of validity vary not only between groups taking different formats but also within the same group of subjects.

Implications

The implications of this research can be summarized in several points. First, for researchers in the field of language testing, the empirical information may serve as a useful reference, since institutions have rarely adopted speaking tests on a university-wide scale. The construction of the computer-aided and interview speaking tests at DLNU can offer both positive and negative examples for other institutions to refer to.

Second, the above practical suggestions have important implications for test developers. While developing or improving an oral test to measure test-takers' language proficiency, test developers can take these factors into consideration. If it is necessary and possible, they would develop more reliable and valid tests by designing the test tasks from those perspectives discussed above.

Third, for language teachers, the findings will help them gain a better understanding of learners' mentality and raise their awareness of the rationale for teaching and assessing communicatively in spite of the practical constraints of the EFL classroom context.

Limitations And Further Research

One limitation of the study is that it relies mainly on questionnaires to elicit subjects' perceptions. Follow-up interviews with the test takers, or observation of their performances, might offer deeper insights into their mental activities and the process by which they frame and carry out the tasks, complementing the insights gathered from the questionnaires.

Another limitation is the depth of the reliability analysis and the coverage of the validation. Because of the holistic scoring method, scores for each section of the speaking test are not available for individual test takers. Students' other test scores and teachers' rankings are not available either, which makes it difficult to assess concurrent and predictive validity. Therefore, more detailed statistical analyses of reliability and validity were not undertaken. Under current conditions, only basic and general reliability analysis and validation can be conducted.

Having recognized the significance and limitations of this research, the author sees the need for further theoretical discussion and empirical studies on reliability and validity in language testing. To better assess the reliability and validity of a particular test, the author will take other qualities of test usefulness into account, such as practicality and authenticity, in order to construct a more holistic and thorough theoretical framework. The author also needs to refine the research instruments, especially the quantitative methods, to continue the empirical research.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Alderson, J. C., Clapham, C. & Wall, D. (1995). Language Test Construction and Validation. Cambridge: CUP.

Bachman, L. & Clark, J. (1987). The Measurement of Foreign/Second Language Proficiency. The ANNALS of the American Academy of Political and Social Science, 490, 20-33.

Bachman, L. & Palmer, A. (1996). Language Testing in Practice. Oxford: OUP.

Bachman, L. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.

Bachman, L. (1991). What Does Language Testing Have to Offer? TESOL Quarterly, 25(4).

Brown, A. (2003). “Interviewer variation and the co-construction of speaking proficiency”. Language Testing, 20 (1), 1-25.

Bygate, M. (2001). Speaking. In Carter, R. & Nunan, D. (2001). The Cambridge Guide to Teaching English to Speakers of Other Languages. Cambridge: CUP.

Cai, J. (2002). Pressures College English Teaching confronted with. Foreign Language Teaching and Research (bimonthly), 34(3), 228-230.

Carter, R. & McCarthy, M. (1995). Grammar & the Spoken Language. Applied Linguistics, 16 (2), 141-155.

Cheng, L. (2008). The key to success: English language testing in China. Language Testing, 25(1), 15-37.

Clark, J.L.D. (1979). Direct vs. semi-direct tests of speaking ability. In E.J. Briere & F.B. Hinofotis (Eds.), "Concepts in language testing: Some recent studies"(pp.35-49). Washington, DC: TESOL.

Chinese Ministry of Education. (1999). National College English Syllabus for Non-English Majors. Shanghai: Shanghai Foreign Language Education Press.

Chinese Ministry of Education. (2004). College English Curriculum Requirements (For Trial Implementation). Beijing: Foreign Language Teaching and Research Press.

Ferguson, G. (2009). Language Testing class handouts.

Fulcher, G. (1997a). An English Language Placement Test: Issues in Reliability and Validity. Language Testing, 14(2), 113-138.

Fulcher, G. (1997b). ‘The Testing of Speaking in a Second Language.' in Clapham, C. and Corson, D. (eds) Encyclopedia of Language and Education Vol 7: Language Testing and Assessment. Amsterdam: Kluwer Academic Publishers.

Fulcher, G. (2003). Testing Second Language Speaking. London: Longman.

Henning, G. (1987). A Guide to Language Testing. Cambridge, Massachusetts: Newbury House.

Hughes, A. (1989). Testing for Language Teachers. Cambridge: Cambridge University Press.

Hughes, R. (2002). Teaching and Researching Speaking. London: Longman.

Ildikó, C. (2001). “Is testing speaking in pairs disadvantageous for students? A quantitative study of partner effects on oral test scores”. English Language Teaching, 9(1), 1-17.

Jin, X. (1999). Quantitative Data Analysis in Foreign Language Teaching Research. Wuhan: Huazhong University of Science and Technology Press.

Kim, H. S. (2003). L2 Language Assessment in the Korean Classroom. Asian EFL Journal, 11, 1-30.

Liu, R. & Han, B. (1991). Language Testing and Testing Methods. Beijing: FLTRP.

Luoma, S. (2004). Assessing Speaking. Cambridge: CUP.

Malone, M. (2000). Simulated Oral Proficiency Interviews Recent Developments. ERIC Clearinghouse on Languages and Linguistics, 12, 10-11.

Messick, S. (1989). Validity. In Linn, R. L. (ed.) Educational Measurement. New York: Macmillan.

McCarthy, M. & O'Keeffe. A. (2004). Research in Teaching of Speaking. Annual Review of Applied Linguistics, 24, 26-43.

O'Loughlin, K. (2001). The Equivalence of Direct and Semi-direct Speaking Tests. Cambridge: CUP.

O'Sullivan, B. (2000). Exploring gender and oral proficiency interview performance. Elsevier Science, 28, 373-386.

O'Sullivan, B. (2002). “Learner acquaintanceship and oral proficiency test pair-task performance”. Language Testing, 19, 277-295.

Savignon, S. (1985). “Evaluation of communicative competence: the ACTFL provisional proficiency guidelines”. The Modern Language Journal, 69:129-134.

Shohamy, E. (1994). The Validity of Direct Versus Semi-direct Oral Tests. Language Testing, 11(2), 99-123.

Stansfield, C. W. (1991). A comparative analysis of simulated and direct oral proficiency interviews. In S. Anivan (ed.). Current Developments in Language Testing. Singapore: RELC.

Taylor, L. (2000). Investigating the paired speaking format. UCLES Research Notes, 2, 14-15.

Underhill, N. (1987). Testing Spoken Language. Cambridge: CUP.

Van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils: oral proficiency interview as conversation. TESOL Quarterly, 23(3), 489-508.

Wen, Q. (2001). Evaluating Oral English Teaching from TEM-4. Foreign Language, 4, 24-28.

Yoffe, L. (1997). “An overview of the ACTFL proficiency interview: A test of speaking ability”. Testing & Evaluation SIG Newsletter, 1997(9):2-13.

Zhu, H. (2004). The Backwash of CET on College English Teaching from the Perspective of the Backwash of Language Tests. Journal of South-Central University for Nationalities (Humanities and Social Sciences), 24 (2), 5-12.

Appendix

Appendix 1: CET-SET rating scale (CME, 1999)

College English Curriculum Requirements (Excerpt)

(For Trial Implementation)

I. Character and Objective of College English

College English, an integral part of higher learning, is a required basic course for undergraduate students. As a systematic whole, College English has as its main components knowledge and practical skills of the English language, learning strategies and intercultural communication; it takes theories of foreign language teaching as its guide and incorporates different teaching models and approaches.

The objective of College English is to develop students' ability to use English in an all-round way, especially in listening and speaking, so that in their future work and social interactions they will be able to exchange information effectively through both spoken and written channels, and at the same time they will be able to enhance their ability to study independently and improve their cultural quality so as to meet the needs of China's social development and international exchanges.

II. Teaching Requirements

As China is a large country with conditions varying from region to region and from college to college, the teaching of College English should follow the principle of providing different guidance for different groups of students and instructing them in accordance with their aptitude so as to meet the specific needs of the individualized teaching.

The requirements for undergraduate College English teaching are set at three levels, i.e., basic requirements, intermediate requirements, and higher requirements. All non-English majors are required to attain to one of the three levels of requirements after studying and practicing English at school. The basic requirements, a goal that all college graduates must achieve, are meant for students who have or have not completed Band 7 of the Senior High School English Standards prior to entering college. Intermediate and higher requirements are respectively set for those who, having laid a good foundation of English, can afford time to learn more of the language, and have completed Bands 8 or 9 of the Senior High School English Standards upon entering college. The three levels of requirements, which incorporate knowledge and practical skills of the English language, learning strategies and intercultural communication, embody qualitatively and quantitatively the objective of College English teaching. The basic requirements are the minimum level that all non-English majors have to reach before graduation. Institutions of higher learning should set their own objectives in the light of their specific circumstances, strive to create favorable conditions, and encourage students to adjust their objectives in line with their own performance and try to meet the intermediate or higher requirements.

The Three Levels Of Requirements Are Set As Follows

Basic Requirements

1. Listening: Students should be able to follow classroom instructions, everyday conversations, and lectures on general topics conducted in English. They should, by and large, be able to understand Special English programs spoken at a speed of about 130 words per minute (wpm), grasping the main ideas and key points. They are expected to be able to employ basic listening strategies to facilitate comprehension.

2. Speaking: Students should be able to communicate in English in the course of learning, to conduct discussions on a given theme, and to talk about everyday topics with people from English-speaking countries. They should be able to give, after some preparation, short talks on familiar topics with clear articulation and basically correct pronunciation and intonation. They are expected to be able to use basic conversational strategies in dialogue.

3. Reading: Students should be able to read, in the main, English texts on general topics at a speed of 70 wpm. With longer yet less difficult texts, the reading speed should be at 100 wpm. They should be able to read, in the main, English newspapers and magazines published in China, grasping the main ideas, and understanding major facts and relevant details. They should be able to understand texts of practical styles commonly used at work and in life. They are expected to be able to employ effective reading strategies while reading.

4. Writing: Students should be able to complete writing tasks for general purposes, e.g., describing personal experiences, impressions, feelings, or some events, and to undertake practical writing. They should be able to write within 30 minutes a short composition of 120 words on a general topic or an outline. The composition should be basically complete in content, appropriate in diction and coherent in discourse. Students are expected to be able to have a command of basic writing strategies.

5. Translation: With the help of dictionaries, students should be able to translate essays on familiar topics from English into Chinese and vice versa. The speed of translation from English into Chinese should be 300 English words per hour whereas the speed of translation from Chinese into English should be 250 Chinese characters per hour. The translation should read smoothly. Students are expected to be able to use appropriate translation techniques.

6. Recommended Vocabulary: Students should acquire a total of 4,500 words and 700 phrases (including those that have been covered in high school English courses), among which 2,000 are active words (see Appendix III: Active Word List). Students should not only be able to comprehend the active words but be proficient in using them when expressing themselves in speaking or writing.

Intermediate Requirements

1. Listening: Students should be able to follow, in the main, talks and lectures by people from English-speaking countries, to understand longer English radio and TV programs produced in China on familiar topics spoken at a speed of around 150 wpm, grasping the main ideas, key points and relevant details. They should be able to understand, by and large, courses in their areas of specialty taught by foreign teachers in English.

2. Speaking: Students should be able to hold conversations in fairly fluent English with people from English-speaking countries, and to employ fairly well conversational strategies. They should, by and large, be able to express their personal opinions, feelings and views, and to state facts, events and reasons with clear articulation and basically correct pronunciation and intonation.

3. Reading: Students should, in the main, be able to read essays on general topics in newspapers and magazines published in English-speaking countries at a speed of 80 wpm. With longer texts for fast reading, the reading speed should be 120 wpm. Students should be able to skim or scan reading materials. When reading summary literature in their areas of specialty, students should be able to get a correct understanding of the main ideas, major facts and relevant details.

4. Writing: Students should be able to express personal views on general topics, compose English abstracts of theses in their own specialization, and write short English papers on topics of their specialty. They should be able to describe charts and graphs, and to complete within 30 minutes a short composition of 160 words. The composition should be complete in content, clear in organization and coherent in discourse.

5. Translation: With the help of dictionaries, students should be able to translate texts on familiar topics in newspapers and magazines published in English speaking countries, to translate on a selective basis articles of popular science relevant to their own specialty. The speed of translation from English into Chinese should be 350 English words per hour whereas the speed of translation from Chinese into English should be 300 Chinese characters per hour. The translation should read smoothly, convey the original meaning and be free from serious mistakes in understanding or expression.

6. Recommended Vocabulary: Students should acquire a total of 5,500 words and 1,200 phrases (including those that have been covered in high school English courses and the Basic Requirements), among which 2,500 are active words (including the active words that have been covered in the Basic Requirements). (see Appendix III: Active Word List)

Higher Requirements

1. Listening: Students should be able to understand longer dialogues and passages, and grasp the key points even when sentence structures are complicated and views are only implied. They should, by and large, be able to understand radio and TV programs produced in English-speaking countries. They should be able to understand lectures related to their areas of specialty and grasp the gist and main points.

2. Speaking: Students should be able to conduct dialogues or discussions with certain degree of fluency and accuracy on general or specialized topics, and to make concise summaries of extended texts or speeches in difficult language. They should be able to deliver papers at academic conferences and participate in discussions.

3. Reading: Students should be able to read rather difficult texts, and understand their meanings. With the help of dictionaries, they should be able to read original versions of English textbooks and articles in newspapers and magazines published in English-speaking countries, and to read literature related to their areas of specialty without much difficulty.

4. Writing: Students should be able to express their opinions freely on general topics with clear structure, rich content and good logic. They should be able to write brief reports and papers of their areas of specialty, and to write within 30 minutes expository or argumentative essays of 200 words on a given topic. The text has complete content, logical thinking, and clear expression of ideas.

5. Translating: With the help of dictionaries, students should be able to translate fairly difficult English texts on popular science, culture, and reviews in newspapers and magazines published in English-speaking countries into Chinese, and translate Chinese introductory texts on the conditions of China or Chinese culture into English. The speed of translation from English into Chinese should be 400 English words per hour whereas the speed of translation from Chinese into English should be 350 Chinese characters per hour. The translation should convey the idea with accuracy and smoothness and be basically free from mistakes and misinterpretation.

6. Recommended Vocabulary: Students should acquire a vocabulary of 6,500 words and 1,700 phrases, among which 2,500 are active words (including the active words that have been covered in the Basic Requirements and Intermediate Requirements).

In developing competence in listening, speaking, reading, writing and translation at the three levels mentioned above, colleges and universities should lay more stress on the cultivation and training of listening and speaking abilities. A good command of vocabulary, especially of active words, constitutes the basis for the improvement of students' ability to use English in an all-round way. Therefore, a teaching plan for this component should be specified in the College English syllabus of each school.

Moreover, colleges and universities should cover components of learning strategies and intercultural communication in their teaching so as to enhance students' abilities of independent learning and of communication.

Speaking Test For Computer-Aided Test (Excerpt)

Round 3

Part I Self-introduction

Part II Text Reading

Gail and I had no illusions about what the future held for us as a married, mixed couple in America. The continual source of our strength was our mutual trust and respect.

We wanted to avoid the mistake made by many couples of marrying for the wrong reasons, and only finding out ten, twenty, or thirty years later that they were incompatible, that they hardly took the time to know each other, that they overlooked serious personality conflicts in the expectation that marriage was an automatic way to make everything work out right.

Part III Topic for Oral Test

Should people buy things according to what ads say?

Round 4

Part I Self-introduction

Part II Text Reading

But when he asked her for a photo, she declined his request. She explained her objection: "If your feelings for me have any reality, any honest basis, what I look like won't matter. Suppose I'm beautiful. I'd always be bothered by the feeling that you loved me for my beauty, and that kind of love would disgust me. Suppose I'm plain. Then I'd always fear you were writing to me only because you were lonely and had no one else. Either way, I would forbid myself from loving you.

When you come to New York and you see me, then you can make your decision. Remember, both of us are free to stop or to go on after that—if that's what we choose ..."

Part III Topic for Oral Test

If a student is afraid of speaking in front of other people, what suggestions can you give?

Round 6

Part I Self-introduction

Part II Text Reading

For many people, the root of their stress is anger, and the trick is to find out where the anger is coming from. “Does the anger come from a feeling that everything must be perfect?” Eliot asks.

“That's very common in professional women. They feel they have to be all things to all people and do it all perfectly. They think, ‘I should, I must, I have to.' Good enough is never good enough. Perfectionists cannot delegate. They get angry that they have to carry it all, and they blow their tops. Then they feel guilty and they start the whole cycle over again.”

“Others are angry because they have no compass in life. And they give the same emphasis to a traffic jam that they give a family argument,” he says. “If you are angry for more than five minutes—if you stir the anger within you and let it build with no safety outlet—you have to find out where it's coming from.”

Part III Topic for Oral Test

Do you like singing karaoke? Why?

Topics for face-to-face interviews (excerpt)

1. Do you turn to your father/mother for help when you have problems? Why or why not?

2. Have you ever experienced an amusing case of coincidence? What is it?

3. What do you think is crucial for a happy marriage?

4. When are you under stress? What makes you feel stressed?

5. Why are Olympics so fascinating to people of all ages?

Appendix 5: Questionnaire for students who take interview speaking test

Dear Students:

The College English Department at Dalian Nationalities University has implemented an oral test for several years. I designed this questionnaire to collect data in order to improve the formats of the oral tests. Your answers to the questionnaire will be of great value to our work. There are no right or wrong answers. Please fill out the form completely to ensure validity. All the data collected will be kept confidential. Thank you for your cooperation.

Part 1: Personal information

1. Age: _________ 2. Name: ________ 3. Sex: ________ 4. Department and class: _____________

Part 2: Choose the letter that best represents your view and write it in the brackets in front of each sentence.

A. Strongly disagree B. Disagree C. Agree D. Completely agree

（）1. The test scores accurately estimate candidates' oral proficiency.

（）2. The interviewer can keep the friendly attitude all the time.

（）3. The topic answering in the first part of the speaking test is the most capable of testing the candidate's oral proficiency.

（）4. The impromptu question and answer in the second part of the speaking test is the most capable of testing the candidate's oral proficiency.

（）5. The time is sufficient to demonstrate one's oral language proficiency.

（）6. Instructions of the test are clear.

（）7. The test is fair for all the candidates.

（）8. I spent lots of time preparing for the oral test.

（）9. Generally speaking, I think the oral test helps to develop my English oral proficiency.

（）10. Generally speaking, I think the oral test has a positive effect on English teaching and learning.

（）11. In my opinion, pronunciation and intonation is the most important factor in the oral test.

（）12. In my opinion, vocabulary and sentence structure is the most important factor in the oral test.

（）13. In my opinion, communicative skill is the most important factor in the oral test.

（）14. In my opinion, accuracy is a more important factor in the oral test.

（）15. In my opinion, fluency is a more important factor in the oral test.

Part 3: Open questions:

16. Do you think the oral test can be improved? Please give details (such as test time allocation, facilities, format, contents, etc.).

17. How do you prepare for the oral test?

18. Do you think the test is connected well to the teaching?

19. Would you prefer to be tested

A. Through speaking to the computer B. Through interview with teachers alone

C. Through paired or group interview

20. Please give the reason for your preference:

Questionnaire For Students Who Take Computer-Aided Speaking Test

Dear Students:

Our College English Department in Dalian Nationalities University has implemented oral test for several years. We designed the questionnaire to collect data to improve the formats of oral tests. Your answers of the questionnaire will be of great value to our work. There is no right or wrong answers. Please fill out the form to ensure validity. All the data collected will be kept confidential. Thank you for your cooperation.

Part 1: Personal information

1. Age: _________ 2. Name: ________ 3. Sex: ________ 4. Department and class: _____________

Part 2: Choose the letter that best represents your opinion and write it in the brackets in front of each statement.

A. Strongly disagree B. Disagree C. Agree D. Completely agree

（）1. The test scores accurately estimate candidates' oral proficiency.

（）2. The self-introduction in the first part of the oral test is the most capable of testing the candidate's oral proficiency.

（）3. The second part, text reading, is the most capable of testing the candidate's oral proficiency.

（）4. The third part, topic answering, is the most capable of testing the candidate's oral proficiency.

（）5. The time is sufficient to demonstrate one's oral language proficiency.

（）6. Instructions of the test are clear.

（）7. The test is fair for all the candidates.

（）8. I spent lots of time preparing for the oral test.

（）9. Generally speaking, I think the oral test helps to develop my English oral proficiency.

（）10. Generally speaking, I think the oral test has a positive effect on English teaching and learning.

（）11. In my opinion, pronunciation and intonation is the most important factor in the oral test.

（）12. In my opinion, vocabulary and sentence structure is the most important factor in the oral test.

（）13. In my opinion, communicative skill is the most important factor in the oral test.

（）14. In my opinion, accuracy is a more important factor in the oral test.

（）15. In my opinion, fluency is a more important factor in the oral test.

Part 3: Open questions:

16. Do you think the oral test can be improved? Please give details (e.g., test time allocation, facilities, format, content).

17. How do you prepare for the oral test?

18. Do you think the test is connected well to the teaching?

19. Would you prefer to be tested _____________

A. Through talking to the computer. B. Through interview with teachers.

20. Please give the reason for your preference:

Questionnaire For Testers In Face-To-Face Speaking Test

Dear teachers:

Our College English Department at Dalian Nationalities University has implemented oral tests for several years. I designed this questionnaire to collect data for improving the formats of the oral tests. Your answers to the questionnaire will be of great value to my work. There are no right or wrong answers. Please fill out the form to ensure validity. All the data collected will be kept confidential. Thank you for your cooperation.

Part 1: Personal information

1. Age: _________ 2. Sex: ________ 3. Department and class you are teaching: _________________

4. Academic title: ________________ 5. Years that you have taught English: ______________

Part 2: Choose the letter that best represents your opinion and write it in the brackets in front of each statement.

A. Strongly disagree B. Disagree C. Agree D. Strongly agree

（）1. The test scores accurately estimate candidates' oral proficiency.

（）2. I have fully understood and grasped the scoring criteria to judge the candidate's performance justly.

（）3. I am able to assess each candidate in an unbiased and impartial way.

（）4. I think the topic answering in the first part of the oral test is the most capable of testing the candidate's oral proficiency.

（）5. I think the impromptu question and answer of the oral test is the most capable of testing the candidate's oral proficiency.

（）6. I think the time is reasonable and sufficient to demonstrate one's oral language proficiency.

（）7. I think the test is fair to all the candidates.

（）8. Generally speaking, I think the oral test has a positive effect on English teaching.

9. Which aspects of speaking are more important for you in judging the candidates' performance? Please mark the following in order of importance or priority: 1= most important; 4= least important.

Part 3: Open questions:

10. Do you think the oral test can be improved? Please give details (e.g., test time allocation, facilities, format, content).

11. Do you think the test is connected well to the teaching? How do you connect your teaching to the oral test? (What did you suggest your students do? How did they practice?)

12. Do you think your students' oral test results are compatible with their real oral proficiency? If not, is this because of problems with the test or with the students themselves?

13. Which form of oral test do you prefer?

A. Candidates talking to the computer. B. Candidates talking with the tester

14. Please give the reason for your preference:

Thank you for your cooperation!

Questionnaire For Testers In Computer-Aided Speaking Test

Dear teachers:

Our College English Department at Dalian Nationalities University has implemented oral tests for several years. I designed this questionnaire to collect data for improving the formats of the oral tests. Your answers to the questionnaire will be of great value to my work. There are no right or wrong answers. Please fill out the form to ensure validity. All the data collected will be kept confidential. Thank you for your cooperation.

Part 1: Personal information

1. Age: _________ 2. Sex: ________ 3. Department and class you are teaching: _________________

4. Academic title: ________________ 5. Years that you have taught English: ______________

Part 2: Choose the letter that best represents your opinion and write it in the brackets in front of each statement.

A. Strongly disagree B. Disagree C. Agree D. Strongly agree

（）1. The test scores accurately estimate candidates' oral proficiency.

（）2. I have fully understood and grasped the scoring criteria to judge the candidate's performance justly.

（）3. I am able to assess each candidate in an unbiased and impartial way.

（）4. I think the self-introduction in the first part of the oral test is the most capable of testing the candidate's oral proficiency.

（）5. I think the question-answering of the oral test is the most capable of testing the candidate's oral proficiency.

（）6. I think the time is reasonable and sufficient to demonstrate one's oral language proficiency.

（）7. I think the test is fair to all the candidates.

（）8. Generally speaking, I think the oral test has a positive effect on English teaching.

9. Which aspects of speaking are more important for you in judging the candidates' performance? Please mark the following in order of importance or priority: 1= most important; 4= least important.

Part 3: Open questions:

10. Do you think the oral test can be improved? Please give details (e.g., test time allocation, facilities, format, content).

11. Do you think the test is connected well to the teaching? How do you connect your teaching to the oral test? (What did you suggest your students do? How did they practice?)

12. Do you think your students' oral test results are compatible with their real oral proficiency? If not, is this because of problems with the test or with the students themselves?

13. Which form of oral test do you prefer?

A. Candidates talking to the computer. B. Candidates talking with the teacher

14. Please give the reason for your preference:

Thank you for your cooperation!

According to the results reported in Table 5.6, the frequency of scores at each level indicates that the smallest numbers of students earned the lowest score (60) and the highest score (100). The largest number of students achieved intermediate scores.

