Test Design and Evaluation of Class Observation
Test design and evaluation
This is an integrated test based on the curriculum of the institution and the guidance of the HI instructor. The assessment tool is summative in the sense that it “aims to measure what students grasped and typically occurs at the end of a course or a unit of instruction” (Brown, 2010:7). In this case, Cycles A and B from the Kaplan Higher Intermediate book were tested. It is also formal, as it “taps into a storehouse of skills and knowledge” (Brown, 2010:7). Thus, this assessment was a planned and systematic tool for assessing student achievement at the end of a specific timeframe. The grading is criterion-referenced and is “designed to give test takers feedback in form of grades” (Brown, 2010:8); it is not normed but rather reflects the actual knowledge of the student for that specific class/unit. The structure fully coincides with the curriculum and the skills/language objectives covered in the class, so this test is direct (in the sense that it directly tests what it is supposed to test).
The observation took place at Kaplan International, Los Angeles. The class observed was a Higher Intermediate integrated skills class. The students had been in this class for more than two weeks; it is a daily class, with new students arriving every week and some departing back to their countries or moving to another level after a level test (or an additional assessment test). The class is held every day (Monday to Friday) from 8:30 am to 11:45 am. I observed 2 days only, totaling 6½ hours. The teacher bases the class and the content on the KI (Kaplan International) global curriculum and the language and content objectives provided by the company. According to the curriculum, the class should include all major skills: reading, writing, listening, speaking, grammar, vocabulary, and pronunciation. There were instances when the teacher provided cultural and sociolinguistic information to the class. The application of the skills taught in class was as follows: for listening, the students mostly worked on main idea; reading focused on comprehension monitoring; vocabulary was taught using context clues and never with bilingual dictionaries or cellphones. Integration of listening and reading for writing purposes was also taught, alongside how to write supporting paragraphs and provide details in those paragraphs.
To be admitted to this class from another institution or language school, students have to have a 5.5 or higher on the IELTS test, approximately 170 on the Cambridge English scale, or a TOEFL score of 46 or higher.
The classes at Kaplan are almost always diverse in terms of languages spoken, culture, religion, and age. Upon acceptance, students are distributed throughout existing classes mostly based on their native language: every class must have an even distribution, so that there is no overwhelming majority of, say, Arabic speakers in one class and no Arabic speakers in another. This method ensures fairness towards students with other languages, as there is always uniformity in classroom language: it must be English.
For this class specifically, there were 2 speakers of Chinese, 2 of Japanese, 1 Korean, 1 French, 1 Brazilian Portuguese, 1 KSA Arabic, 1 Turkish, and 1 Azerbaijani Turkish (plus 2 Russian, 1 German, 1 Kazakh, 1 Polish, and 1 Armenian; these students were not assessed, as they came later and were not present when the observation was conducted). In total, there were 16 students, 10 of whom took the teacher-created test. As this roster shows, all of them were non-native speakers of English, ranging in age from 18 to 36.
Kaplan utilizes its own books, workbooks, assessment tools (KITE), and IWBs (Smart Board Smart Note slides). The book includes practice for all skills presented. There are other accompanying books and materials that the instructor may use. The class was also supplemented with printed handouts and internet-based interactive activities. I observed unit 2 from Cycle A (10.25.17) and unit 4 from Cycle B (11.03.17). To understand how the class works and what the strengths and weaknesses are for the students, I spent nearly an hour with the teacher almost every day after work (in the teachers’ room). She provided me with valuable information on how to construct the test, what materials to use, and what types of tasks to choose. She suggested that I stick to the class curriculum and create an integrated test to precisely assess what she covered in the class.
The assessment tool – specifications
As Carr states, any test must assess what it is supposed to assess: “we want to use a test or assessment for a particular reason, to do a certain job, not just because” (Carr, 2015:5). Keeping this in mind, I followed the class curriculum and designed a test that would assess what the students were taught during those 2 days of observation. All the choices were coordinated with the teacher of the HI class and the LING 568 professor at CSUN. To achieve maximum validity, fairness, and reliability, I decided to follow the guidelines of the class and create an integrated test that assessed some of the skills practiced in the class, namely: listening comprehension via note-taking and true/false items following the task; reading comprehension and vocabulary using a short text with a cloze task; and finally, paragraph writing by means of incorporating listening and reading information in a short answer task. Originally, grammar was also considered, as the passive causative was covered in the class, but upon revising the test and calculating the timeframe, this point was taken out. The idea of integrated testing is not to test one skill at a time, but rather to replicate real-world situations, where people do not only listen (they listen and speak) or practice vocabulary (they use vocabulary in context while reading, speaking, and/or listening). As Carr mentions, such a test “requires examinees to use multiple aspects of language ability, typically to perform more life-like tasks… Such tests more closely resemble real-life language use tasks, and thus, require more communicative language use” (Carr, 2015:320). This and other similar findings suggest that integrated tests are more useful and valid when testing a second language.
It took 1.5 hours to design the assessment tool. Most of the time was spent looking for the right text with an appropriate level of difficulty and vocabulary. The listening part (with its true/false questions) did not take any time, as I was able to find a relevant one in a matter of minutes. The test itself consists of 4 integrated tasks/parts:
Task 1 (Listen once to an interview about making our homes greener. Take notes) assesses listening. As this is a non-observable comprehension skill, a note-taking task was designed to make sure that the students are listening. As Brown explains in his book, “we must rely as much as possible on observable performance in our assessments of students” (Brown, 2010:159). This idea was the basis for making note-taking obligatory during the test. Note-taking proves that the student is in fact listening, as this invisible process becomes visible when there is at least something scribbled on the paper.
Task 2 (Are the sentences true (T) or false (F)?) asks the students to critically evaluate whether the statements provided in the task are true or false. This task assesses students’ listening ability and primes them for the later writing task 4. Research shows that writing T/F questions is hard, and these types of questions tend to be avoided by test item writers. McCoubrie states that items like this are “not only difficult to write well but, in order to avoid ambiguity, the writer is pushed to assessing the recall of an isolated fact. Such a format is therefore unfair as an otherwise competent student may fail if he/she has not memorized isolated facts” (McCoubrie, 2004:711). To test this hypothesis, T/F questions were deliberately chosen, as they pose a challenge and are difficult to write. Further, the results will show that such isolated items can be valid if they are written well.
Task 3 (Read the following 3 paragraphs about international environmental organizations. Fill in the blanks with the vocabulary provided) was originally planned to contradict the listening, but the HI teacher recommended choosing a text that would add extra information to the listening. This way, her Monday quiz would not be similar to this one in all the tasks. The vocabulary-in-context items determined whether students were able to “infer the meaning of new, unfamiliar words from context” (Carr, 2015:69).
Task 4 (How can international aid organizations raise awareness about being environmentally friendly? Write a short paragraph using the information from listening and reading to help you answer this question. Give a specific example to support your opinion) assesses three skills at once: listening comprehension, reading comprehension, and paragraph writing. For this task the students have to use the information from both the reading and the listening and write a short paragraph; the written outcome should be grounded in the source material rather than in personal opinion alone. The students were instructed to follow the guidelines and were provided with a short rubric for this task:
- Write a short paragraph using the information from the listening and the reading.
- Answer this question in full.
- Give a specific example (at least one) to support your opinion.
- Use correct spelling, grammar and vocabulary.
- Keep to standard paragraph structure. (topic-details-sum)
Thus, all choices were made for a specific reason, and the task types were created according to Bachman and Palmer’s language competence hypothesis. The idea that strategic competence is of great importance is not a new trend. After abandoning the UTH (unitary trait hypothesis), the supporters of strategic competence value integration of all skills, provided they are taught and tested in communicative settings for communicative purposes. Bachman and Palmer (1996) discuss components of language competence and put forward an idea that is somewhat similar to the UTH in terms of not testing each skill and other discrete language points separately. Rather, the benefit of this testing method/type is the use of more real-life tasks, which, as mentioned above, result in better understanding and processing of test items (Bachman and Palmer, 1996:70-75). All (if possible and applicable) 4 competences presented by Bachman and Palmer must be taken into account to better test the subjects and predict their future performance in real-life situations.
The grading system and the number of points were agreed upon with the HI teacher, as instructors at Kaplan are advised to give a holistic grade based on a 100-point scale for weekly quizzes. As there were 16 students in the class, 6 of whom were not present when unit 2 from Cycle A (10.25.17) and unit 4 from Cycle B (11.03.17) were covered, the instructor advised that 10 of the students be assessed with this assessment tool and her Monday quiz, each worth 50 points (100 points in total), while the other 6 took the Kaplan-devised weekly assessment worth 100 points. This decision was made to ensure the uniformity of the grade and fairness to those students who did not learn the material from unit 2 of Cycle A and unit 4 of Cycle B.
It took 5 minutes to grade tasks 1 to 3, and 15 minutes to grade task 4. Here is the detailed description of total possible points for this assessment tool:
- Task 1 – 1 question – 3 points
- Task 2 – 5 questions – 2 points each (highest possible for the task – 10 points)
- Task 3 – 6 questions – 2 points each (highest possible for the task – 12 points)
- Task 4 – 5 criteria – 5 points each (highest possible for the task – 25 points)
Total – 50 points
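The point scheme above can be sanity-checked with a short script. The per-task maxima come from the scheme itself; the individual student's scores used here are hypothetical, purely for illustration:

```python
# Maximum points per task, taken from the scheme above.
max_points = {"task1": 3, "task2": 10, "task3": 12, "task4": 25}

# A hypothetical student's scores (illustrative only, not real data).
scores = {"task1": 3, "task2": 8, "task3": 10, "task4": 20}

total_max = sum(max_points.values())  # 3 + 10 + 12 + 25 = 50
total = sum(scores.values())          # 41 for this hypothetical student
percent = 100 * total / total_max     # holistic percentage for the quiz

print(total_max, total, percent)      # 50 41 82.0
```

Since the tool counts for half of the 100-point Monday quiz grade, a percentage like this would simply be halved before being combined with the other 50-point component.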
Multilingual and multicultural classrooms are a challenge when it comes to grading. To ensure that the students, the teacher, and the test administrator are all on the same page, the grading system must be simple and comprehensible, yet objective and self-explanatory. Putting students’ accomplishments first, Katz (2014) assigns paramount importance to grading, grading policy, and communication between students and the teacher (in this case, the test designer): “such communication is important given that English learners may come from school systems with vastly different grading systems. These cultural expectations can impact how students interpret which aspects of their work or language performance will be counted among the many activities in a language classroom” (Katz, 2014:333). This suggests that the grading system must be clearly explained to the students before and after the test administration, that there must be no bias of any kind, and that policies on issues beyond the test proper (such as cheating and plagiarism) have to be explained. These criteria were set as a protocol and followed from design to grading for this assessment tool.
Administration of the test went very well, and the conditions were perfect. While observing, the HI teacher mentioned to the students that they were going to take a test devised by the observer. The fact that most of them were familiar with the observer helped ease the tension in the atmosphere, as some of the students had not been sure why an observer was there. On Monday (11.20.17), the students were advised to leave their cell phones in a designated area, put away their books and belongings, and sit one seat away from their partners. There were no windows in the class and no other distractors. There was no outside noise or people talking in the corridors, and the temperature and lighting were as usual. The students were relaxed, as it was Monday and they had had enough sleep. The only difference between this test and their Monday quiz was the seating arrangement.
The time allotted for this assessment was only 30 minutes; however, it took 35 minutes for all the students to complete all the parts of the assessment. Several students completed all the parts earlier than the rest, but overall, the time limit was not violated drastically.
There was a small deviation from the main administration plan: the HI teacher suggested that the students receive the note-taking sheet (NTS) before receiving the test questions. Originally, the plan was to give the three-page test (NTS included) all at once, but most likely this would have interfered with note-taking, as the students would have been distracted by the other pages with questions. Hence, after instructions on how long the listening was and how many people were talking, the NTS was given first. The students then received the other double-sided sheet containing all 4 tasks. Instructions were given before each task to make sure that students remembered what they had to do. There were no conditions that might have affected the reliability of this test.
Overall, students performed very well. Tasks 1 and 2 received the highest scores: task 1 was worth 3 points for note-taking, and all the students performed as required. There were 2 students (out of 10) who made 2 mistakes each on task 3, dropping the overall class percentage for that task to 93.3%. Task 4 was where many students had issues, with 78.8% completion of the task. To better illustrate the results, the graphs and test scores are presented here:
(Graph 1 and Graph 2: task score distributions; figures not reproduced here.)
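The 93.3% figure reported for task 3 can be reproduced from the numbers given above (10 students, 6 items worth 2 points each, and two students losing 4 points apiece):

```python
# Task 3: 6 items worth 2 points each, taken by 10 students.
students = 10
task3_max = 6 * 2                       # 12 points per student
points_possible = students * task3_max  # 120 points class-wide

# Two students made 2 mistakes each; every mistake costs 2 points.
points_lost = 2 * 2 * 2                 # 8 points lost in total
points_earned = points_possible - points_lost  # 112

class_percentage = round(100 * points_earned / points_possible, 1)
print(class_percentage)  # 93.3, matching the figure reported above
```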
This is what the class overall scores look like:
Compared to tasks 1, 2, and 3, task 4 did not yield perfect results. This is where the description turns into analysis.
As described above, the decision was to create an integrated, summative, formal, criterion-referenced, and direct test based on the curriculum of the institution and the guidance of the HI instructor. To some extent, this type of assessment can give the teacher a glimpse into what the students know and what they do not. In other words, this test has predictive value: it gives the teacher a chance to base her/his future lesson plans on its outcomes. These criteria were set as a protocol and followed from design to grading.
All the tasks were objective. However, the written task presented a minor problem: almost half of the students wrote their own opinion for the task, whereas they had to write an integrated response combining the listening and the reading. This suggests several possibilities: the test instructions were unclear, the students did not take the test seriously, or perhaps they were simply tired. To increase face validity, familiar formatting and task arrangements were used, and the questions were organized in ways that were familiar to the students. This gave students confidence and peace of mind during the test.
The test itself was very reliable: the questions were straightforward, the formatting was familiar, the students could not cheat, no one but the designer and the HI instructor saw the test beforehand, and the students were informed about the test many days before administration. The test also carried real weight, as it counted for half of the Monday quiz grade, so the students had to perform well and do their best to achieve better results. Moreover, Monday quizzes are tickets to early level tests for some students.
Reliability was also maintained when choosing the listening and reading portions. This must be ensured because the level of the test and the level of the students have to match; otherwise the results are not valid and do not reflect precise grades for the class. While creating this test, fairness was taken into account. The topics chosen were very general and had no cultural, religious, or ethnic bias. This is very important when working with an international student body, as not all students come from the same background. The topic of the book was perfectly suited for this purpose. In fact, all the topics in Kaplan-designed books are of a similar kind, as Kaplan deals mainly with international students.
However, the test was taken by only 10 students out of 16, so the results reflect the knowledge of only 62.5% of the class. Clearly, these 10 students are not the whole class, and hence the results do not reflect the achievements of all the students in it. Here are the stats for the class overall performance:
Apart from the fact that it cost some money and took some time to make and conduct, there was nothing challenging about the design of this assessment tool. The administration of the test went very smoothly, and there were no hurdles. The location and the room were provided by the institution for class purposes, and the instructor was paid by the institution as well. Designing and grading were not time-consuming. Technology was used on many occasions and, thankfully, it did not fail the students or their test. Instructions were given clearly and in a timely manner. This assessment tool was designed for this specific class for one-time use only; thus the design, the items, and the content went through meticulous scrutiny by both the test designer and the instructor.
Authenticity and washback
Authenticity and washback are connected: authenticity is determined by how true to real life a task is, and this in turn shapes the washback. In short, authenticity is very important because it shows what is useful for students and their everyday lives. Grant Wiggins, in his 1993 article “Assessment: Authenticity, context, and validity,” discusses how important authenticity is when it comes to context. His view of authenticity is contextual: what we appear to know in one context does not mean that we know the same thing in other contexts (Wiggins, 1993). The article also explains how authentic context helps students perform better during tests. What he suggests is that students have to be able to complete tasks based not on rote memorization but on logic and inference. For example, if a student can use a lexical unit in only one context, the student is not considered a competent user of this lexical unit. So, the presentation of the test, its content, and its purpose have to be authentic to impact students positively.
Based on the principles above, the test was designed by choosing maximally authentic material for the reading and listening. The writing task was generated from the reading and the listening, so there was little authentic about it. Looking at graph 4, the results support this: the authenticity of a task ensures a higher success rate. The artificially created task 4 was not “natural” enough to provide students with a satisfactory basis for performance.
Although the last task was dull and inauthentic, students completed it within the time frame. Compared to the other tasks, which were more interesting and authentic, task 4 was the weakest point of this assessment tool. Besides, the students never had a chance to discuss the results with the test designer; they talked to their teacher and got the results back. This looks more like negative washback, considering that the students were not encouraged and cheered by the test designer after the assessment was taken and graded. Feedback is important to ensure formative rather than merely summative delivery of scores to the students; it can raise their motivation and self-esteem when they prepare for similar tests.
Considering that this was a one-time-use test, the results and the process were very informative and instructive. The design process was a great pleasure, the administration was a different experience, and the scoring and the write-up gave a much deeper understanding of what was being done. This was not an average test, as it took more time and effort to make and analyze.
If there were a chance to change the tasks and the test layout, another writing task would be included: a short opinion paragraph, step-by-step instructions, or questions for a peer interview followed by a summary of the partner’s opinion. And if there were a chance, the entire class would be assessed instead of 62.5% of it. These “if”s make space for more research and questions for those who design tests, and they are food for thought.
- Bachman, L., and Palmer, A. S. (1996). Language testing in practice. New York: Oxford University Press.
- Brown, H. D., & Abeywickrama, P. (2010). Language Assessment: Principles and Classroom Practices. White Plains, NY: Pearson Longman.
- Carr, N. T. (2015). Designing and analyzing language tests. Oxford: Oxford University Press.
- Katz, A. (2014). Assessment in second language classrooms. In M. Celce-Murcia (Ed.), Teaching English as a Second or Foreign Language (pp. 320-337). National Geographic Learning.
- McCoubrie, P. (2004). Improving the fairness of multiple-choice questions: a literature review. Medical Teacher, 26(8), 709-712.
- Wiggins, G. (1993). Assessment: Authenticity, context, and validity. Phi Delta Kappan, 75(3), 200-208.
Student name____________________________ Class_____________________ Date __________
50 possible points
Task 1. Listen once to an interview about making our homes greener. Take notes. 3 possible points for taking notes. Use the paper provided.
Task 2. Are the sentences true (T) or false (F)? 10 possible points (2 points per item)
_____1. Using a better thermostat will reduce your energy bill by about 4%.
_____2. Modern thermostats cost about $400.
_____3. LED lights last longer than traditional light bulbs.
_____4. You will use less water when you replace old bulbs with LED lights.
_____5. Planting trees where Enrico suggests will help you keep your home cool in summer.
Task 3. Read the following 3 paragraphs about international environmental organizations. Fill in the blanks with the words provided. 12 possible points (2 points per item)
- The Sierra Club, founded in 1892, is one of the oldest conservation organizations in existence. With over 1.3 million members, this organization is one of the most effective and powerful at effecting changes in government and _______________America. Fighting for the _______________of land and forest, clean air and water, and a host of other issues, the Sierra Club is well-known and respected.
- The iconic panda logo has made the WWF instantly recognizable to many people around the world. With 5 million members internationally and over 1.2 million in the States, this 45-year-old wildlife defense organization is going strong. Strongly promoting an emphasis on science, the WWF works to preserve nature and its creatures. From the organization’s website: “We are committed to reversing the _________________ of our planet’s natural environment and to building a future in which human needs are met in harmony with nature. We recognize the critical relevance of human numbers, poverty and ______________ patterns to meeting these goals.”
- Greenpeace began in 1971 when a group of activists put themselves directly in _____________ in order to protest nuclear testing off the coast of Alaska. Believing that concerted action from ordinary people is the best way – according to their signature quote from Margaret Mead, the only way – the organization has helped to stop______________, nuclear testing, as well as leading efforts to protect Antarctica. Over 2.5 million members worldwide.
Task 4. How can international aid organizations raise awareness about being environmentally friendly? Write a short paragraph using the information from listening and reading to help you answer this question. Give a specific example to support your opinion. 25 possible points (5 points per criterion)
- Write a short paragraph using the information from the listening and the reading.
- Answer this question in full.
- Give a specific example (at least one) to support your opinion.
- Use correct spelling, grammar and vocabulary.
- Keep to standard paragraph structure. (topic-details-sum)
Appendix (Audio script; not included in the test – strictly for the teacher and test administrator)
Radio host: And now we have a story for you about how to make your home greener. You may not know this, but making your home greener will not only help protect the environment, but it can also help you save money and make you happier and healthier. Enrico Gioccalda, green homes builder, joins us now to give us a few tips. Thanks for joining us, Enrico.
Enrico Gioccalda: Thank you for having me.
Radio host: It’s our pleasure. So tell us what we all want to know. How can we save money and become happier by making our homes greener?
Enrico Gioccalda: Well, of course, there are many different things that you can do, but today, I’ll just focus on some of the most important ones. So, my first tip is to get a smart thermostat in your home.
Radio host: So a thermostat controls the heating and air conditioning in a house, but what exactly is a smart thermostat?
Enrico Gioccalda: A smart thermostat can be programmed to change the temperature at different times of the day. The advantage of this is that you won’t have to think about turning your thermostat down at night or off when you go out, it will just do all of that itself. We usually find that people pay around 4% less on their energy bills when they do this.
Radio host: But are these smart thermostats expensive?
Enrico Gioccalda: Not really. You can pick one up for about $150.
Radio host: Great, OK.
Enrico Gioccalda: My second tip would be to switch your old light bulbs for LED or compact fluorescent bulbs. This will help you save energy and money. You do have to spend money at first to buy all of the bulbs, but you’ll find that they use less energy, produce less heat, and last much longer than traditional light bulbs.
Radio host: OK, so whenever an old light burns out, replace it with one of these energy-efficient bulbs, right?
Enrico Gioccalda: Yes, exactly. You can also start using low-flow sinks, toilets, and showers in your bathroom. We call fixtures “low-flow” when they are specially designed to use less water than older models. You’ll see your water bills go down, but you won’t notice any difference in use.
Radio host: Low-flow fixtures for the bathroom, OK, sounds interesting. We’re almost out of time, but could you maybe just give us one more tip?
Enrico Gioccalda: Yes, of course. My last tip is something that might surprise you. Did you know that the things you do outside your home in your yard and garden can actually increase energy efficiency inside your home? One thing you can do is plant trees on the western and southern sides of your home because then the trees will provide shade and block solar energy in summer, and in winter, when the trees lose their leaves, more sunlight will be able to reach the windows and warm your home.
Radio host: Wow, that’s really smart. I never would have thought of that. Thank you so much for joining us, Enrico.