The aim of this paper is to answer the above question and to explain how tests should be administered in order to maintain their validity and reliability. For this purpose I review the specialized literature on testing and sum up its main points, taking as reference the authors cited in the following sections. I also comment on the implications of testing in secondary schools. Testing requirements should be taken into account because tests are part of the evaluation of students' performance, and consequently a way of identifying what students know. What is more, testing is an acceptable way of raising short-term motivation to learn specific material, but if it is used badly there may be long-term negative effects.
Because tests are used as part of the evaluation of students, they are also part of the evaluation of teaching itself. The necessary adjustments have to be made in order to increase their effectiveness and to enable students to benefit more from them. Tests thus help teachers to evaluate the effectiveness of the syllabus as well as of the methods and materials they use. In this way they provide teachers with sufficient, useful information, which has a positive effect on both teaching and learning.
New approaches agree that teachers should focus on purposeful communicative activities and motivate students, so both teaching techniques and test contents have to be related to the objectives of the course. This means that a new methodology also brings new testing requirements. One can go even further: McNamara (2000: 8) differentiates between the test and the criterion, which he defines as the relevant communicative behaviour in the target situation. This differentiation opens a new direction for testing procedures within communicative language tests, which, according to McNamara (2000: 17), paid attention to the social roles candidates were likely to assume in real-world settings.
With regard to measuring communicative language ability, Shohamy (1995) explains that many of the arguments basically revolve around the question of the authenticity of the test and of the language samples used in it.
The following sections deal with the specific measures Hughes proposes to ensure a test's content validity, construct validity, face validity, reliability, and practicality, and include what subsequent authors have added to these elements. These measures are tools for respecting three main aspects of testing. First, the purpose of the test, that is, what teachers want to find out by means of it. Of course, this implies that the teacher or scorer has a deep knowledge of what language consists of and how it is used nowadays. These tools make teachers aware of the final aim of a test and consequently help them to make the changes needed to match what they want to know with the questions or elements in the test. Second, the level of the student, which can be defined or graded in different ways, for example as elementary, intermediate, or advanced. Third, the type of test, which depends on the approach to test construction. Hughes (1989: 69-72) discusses four types of test, on which I comment below:
- Proficiency tests. These do not correspond to the contents of a language course; they are designed to establish someone's ability in a language. Such a test sets a standard that someone has to meet to be considered proficient, and looks to possible future situations in which language use will be required. Examples are the First Certificate examination, the Trinity examinations, etc.
- Achievement tests. This type is closely related to a language course, and its purpose is to establish whether pupils, groups, or the courses themselves have reached the objectives set at the beginning. The author distinguishes two kinds: final achievement tests, administered at the end of the course, and progress achievement tests, administered during the course. This is the case in secondary schools, which follow a curriculum with objectives and contents that pupils have to cover. With regard to tests in secondary schools, McNamara (2000: 7) adds that learners may be encouraged to share in the responsibility for assessment, and be trained to evaluate their own capacities in a process known as self-assessment.
- Diagnostic tests. These are a tool for establishing students' strengths and weaknesses. They locate students' ability within different categories, for example speaking, writing, reading, and listening. A good diagnostic test should be extensive enough and should serve for self-instruction; an example is DIALANG.
- Placement tests. These serve as a means of placing students at the suitable stage of learning and of assigning them to different groups.

In my opinion, the choice of test type is in itself a measure for ensuring what is explained next, validity and reliability; that is why the type has to be chosen carefully.

Content validity
Hatch and Lazaraton (1991) say that content validity represents our judgement regarding how representative and comprehensive a test is. Bearing this in mind, together with what Hughes (1989: 22) writes on this issue, tests have content validity when they contain enough samples of what is to be tested. For example, a grammar test will not have content validity unless it has enough samples of the relevant grammatical devices. When a test is under construction, one has to specify the skills or structures that it is going to cover. This provides teachers with suitable criteria for selecting items for the test. So content validity requires enough samples that measure what they are intended to measure, and requires the test specifications to match the test content.
A test has construct validity if one can verify that it measures just the ability one is interested in. Hughes (1989: 26) explains that the word construct refers to any underlying ability which is hypothesised in a theory of language ability. It also means that if one skill involves a number of sub-skills, these also have to be verified. The author adds (Hughes, 1989: 27) that construct validation is a research activity, the means by which theories are confirmed, modified, or abandoned.
Tests not only have to be valid, they also have to look valid; this is why the author uses the term face validity, and says (Hughes, 1989: 27) that a test has it if it looks as if it measures what it is supposed to measure. An example of face validity would be a test of pronunciation ability that requires students to speak; otherwise it would lack face validity, and its construct validity would also be open to question. Pupils in secondary schools may not protest against what they consider unfair, but students or candidates in other contexts will not accept a test which lacks face validity. Consequently, teachers should check tests before administering them to students.
Teachers cannot expect students to get exactly the same scores on the same test when it is repeated the next day. Even when the circumstances seem identical, human beings do not perform the same way on every occasion. One knows and accepts that scores are not going to be identical, but similar scores have to be obtained when the test is administered again to the same students. So, the more similar the scores are, the more reliable the test is.

Intrinsic reliability: how to make tests more reliable

The following points draw on Hughes's (1989: 36-40) ways of achieving consistent performances from candidates:
- Get a sufficient number of responses. Having enough items makes a test more reliable, and items should be independent of each other. Each additional item represents a fresh start for students. It is important to make a test long enough to achieve reliability, but not so long that students become bored and fail to demonstrate their abilities. On test design, Brown and Hudson (2002: 54) state that tests should be made up of a sufficient number of observations, or bits of information.
- Clarify what you want. For example, students can be asked to write a composition about politics, which allows them too much freedom; instead, they can be required to discuss the following measures intended to improve Spanish politics:
  - Better instruction for politicians.
  - Geopolitical situation.
  - Predictions for the future.
- Avoid ambiguous items. Tests should not have items whose meaning is not clear enough, or to which there are other possible answers the teacher has not anticipated. If an item can be interpreted in different ways, it is not contributing to the reliability of the test. Colleagues can help us to establish whether an item is ambiguous or not.
- Instructions must be clear and explicit. It is not only the weakest students who misinterpret what they are asked to do; better students can also come up with an alternative interpretation. Spoken instructions should be read out from a written text in order to avoid misunderstanding.
- Tests must be perfectly legible. Sometimes tests are badly typed and have too much text in too small a space. Checking the test before administering it solves this problem.
- Train candidates in testing techniques. When students are familiar with the elements of the test, they are likely to perform better than they otherwise would.
- Conditions of administration must be uniform and non-distracting. The administration of tests should be the same for everyone in order to ensure uniformity. Timing must be specified, and the acoustic conditions must be good and the same for everyone.

Extrinsic reliability: scorer reliability

The following points concern the scorer's contribution to reliability:
- Use items that allow objective scoring. One option is an open-ended item for which part of the structure is provided, so that there is a single possible answer, which candidates produce themselves.
- Make comparisons between candidates as direct as possible. This point relates to clarifying what you want: scoring is more reliable when all the compositions are on the same topic.
- Tell candidates how they are going to be marked. Scorers should specify what they consider an acceptable answer and assign points for partially correct answers. Alderson, Clapham and Wall (1995) distinguish between holistic and analytic scoring. Given this dichotomy, one wonders what to consider and what to mark in secondary schools: am I looking for general understanding, or for a deeper understanding? These are questions scorers should ask themselves beforehand.
- Train scorers. When scoring is subjective, as in scoring a composition, some training is needed. It could be said that there are three types of scorers: lenient, severe, and inconsistent. Do scorers focus only on form, or also on candidates' ability to recognize social conventions in interaction? Do candidates match form to context? I believe scorers have to keep these questions in mind while administering and correcting a test.
- Choose samples of correct answers. The best responses serve as a means of measuring other responses, and should be chosen immediately after the administration of a test. In the case of compositions, Hughes (1989: 41) says that one can select archetypical representatives of different levels of ability.
- Avoid candidates' names and identify them by number. This is possible and very common in proficiency tests, which use numbers instead of names or photographs, making scoring more reliable. But what happens in secondary schools? Teachers there know their pupils well and, as scorers, may have expectations of them, so how can one avoid expectations about pupils one knows very well? The literature, in this respect, focuses more on cases where there is no direct and close relationship with the candidate. Identifying candidates by number rather than by name also avoids any gender or nationality discrimination.
- Two scores are better than one. When scoring is subjective, the ideal is to have two independent scorers and then compare results. McNamara (2000: 35) refers to ratings and raters as the judgements and those who make them; the agreement between two independent scorers is called inter-rater reliability. This gives both scorers different views of the testing activity to learn from.

Practicality

The literature reviewed for this essay agrees that tests have to be practical, in the sense that a test should be appropriate for the resources at hand, and should be easy and cheap to construct, administer, score, and interpret.
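One inexpensive way of checking the reliability discussed above is to compare two sets of scores, whether from two sittings of the same test or from two independent scorers, by means of a correlation coefficient. The following is a minimal sketch in Python, not a procedure taken from the authors cited: the function name `pearson` and all the marks are hypothetical, invented only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of deviations from the two means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("the scores in each list must vary")
    return cov / sqrt(var_x * var_y)


# Hypothetical marks (0-10) for the same five pupils on two sittings.
first_sitting = [6, 7, 4, 9, 5]
second_sitting = [7, 7, 5, 9, 4]
test_retest = pearson(first_sitting, second_sitting)

# Hypothetical marks given by two independent scorers to five compositions.
scorer_a = [5, 8, 6, 3, 7]
scorer_b = [6, 7, 6, 4, 8]
inter_rater = pearson(scorer_a, scorer_b)

print(f"test-retest coefficient: {test_retest:.2f}")
print(f"inter-rater coefficient: {inter_rater:.2f}")
```

A coefficient close to 1 suggests that the two sets of scores order the candidates in much the same way. It should be read with care: a scorer who is consistently more lenient than a colleague can still correlate perfectly with that colleague, so the coefficient does not replace a direct comparison of the marks themselves.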
Once one has considered these points in test construction and test scoring, there is another point to bear in mind and check: the so-called washback effect, which McNamara (2000: 72) explains as the effect of tests on teaching and learning, and which can be positive or negative. The washback effect can have a strong influence on pupils' lives in secondary schools, as well as on teachers' reputations, so it has to be considered by both pupils and teachers. A wider effect of testing is defined by McNamara (2000: 74) as test impact, which reaches beyond the classroom and affects candidates from many countries; an example is the TOEFL.
Computers nowadays have a great impact on testing and, as McNamara (2000: 79) explains, important national and international language tests, including TOEFL, are moving to computer-based testing (CBT). In this respect, Chapelle (2001) lists several new terms for testing by means of computers and applies the earlier literature, explained above, to new technologies. She (Chapelle, 2001: 38) begins by defining computer-assisted assessment as testing practices requiring a computer to assist in construction, delivery, response analysis and score reporting. The author deals with several questions on how computers can be used, and what they can develop, promote, etc.
Teachers have to be careful when making and administering tests, and have to consider what the literature on testing has been discussing over the last three decades; otherwise they could be reducing the accuracy, that is, the reliability and the validity, of tests. So the balance between validity and reliability has to be considered and ensured. The ideal test must be practical and accurate, and has to follow the principles of validity and reliability I have summarized above. Finally, I return to Hughes, who explains that tests are valid when they measure the abilities we are interested in, and reliable when one gets similar scores on one particular day and on the next, which involves choosing the appropriate content and techniques. So a test is reliable and valid when, once these techniques have been applied, it measures these abilities consistently as part of a wider objective: the evaluation of candidates' ability to interact in a social world.