Assessment Examinations And Standard Setting English Language Essay



Assessment plays an important role in medical education because it is an effective tool for detecting the quality of students' training(1). The statement "assessment drives learning" captures this role: well-planned and well-implemented assessment has a strong steering effect on learning because it conveys what is important to learn and motivates students to learn it(2). Many people have argued that the curriculum should be the key driver of learning, while assessment should be designed to confirm that the learning outcomes have been achieved. An assessment tool must therefore be clear about its learning purpose and must be designed to drive educational intent and maximize learning(3).

Constructive alignment is an influential idea in which students construct meaning from relevant learning activities and teachers provide a learning environment that supports the planned activities(4). Constructive alignment makes the teaching system consistent when the curriculum, learning activities and assessment methods are all aligned with the intended learning outcomes(5). Moreover, assessment may reveal a learning outcome that was not expected but is recognized as important; such an outcome should be integrated into the intended learning outcomes as an emergent outcome(6). Our faculty has begun to apply constructive alignment to our teaching process. As a first step we design a course specification for every curriculum, and we meet regularly to solve the problems that we face while implementing the other steps.

Formative assessment promotes deeper learning because it provides students with feedback that helps them to recognize their strengths and weaknesses, which reinforces their internal motivation to learn and to improve their knowledge and skills(7). Summative assessment is the final assessment that determines the rank order of students and decides their grades(1). Wass et al(7) criticized superficial learning, which aims mainly at passing the examination, and emphasized the importance of feedback on students' assessment because it encourages reflection and deep learning. However, Epstein(8) showed that summative assessment affects learning even in the absence of feedback, as students study what they expect to be tested on. Although formative and summative assessment contrast starkly, both are necessary, and a distinction should be made between assessments that are suitable only for formative use and those that are sufficiently rigorous for summative use(7). Van der Vleuten and Schuwirth(9) emphasized that formative and summative assessment can be used with little difference between them, with the focus on developing comprehensive assessment that both encourages learning and supports the right decisions about learners.

I will focus on written assessment, as I am involved as an examiner in the written examination of the second part of the MSc in Radiology. According to Miller's pyramid, we use written assessment to assess the domain of cognition, either factual recall of knowledge ("knows") or knowledge application and problem solving ("knows how"). Our final summative written examination consists of two essay papers, each made up of four essay questions and lasting three hours (the first paper assesses the cardiovascular, respiratory, musculoskeletal and gastrointestinal systems, while the second paper assesses the genitourinary system, obstetrics and gynecology, pediatrics and the central nervous system), and a third paper of twenty multiple choice questions (MCQs) lasting one hour. For formative assessment we use short essays or MCQs in short quizzes: we first identify the students' training level so that the questions assess knowledge appropriate to their experience, and then we give them feedback on their performance to close the gap between their actual level and the reference level.


The essay question is an effective method for assessing cognitive skills, as it can assess students' ability to construct an answer and can measure their attitudes and opinions; it can also give students effective feedback on their learning(10,11). However, it has the disadvantages of being time-consuming to grade and of not covering a wide domain(8).

Newble and Cannon(11) stated that an essay is either an extended response question, which is useful for assessing higher cognitive skills (such as analysis, synthesis and problem solving), or a restricted response question, which is used for testing lower-level knowledge. Epstein(8) stated that a well-structured essay with a clear framework can eliminate cueing and demand more cognitive processing, with context-rich answers. We usually use extended response questions to assess different cognitive skills, but to improve essay assessment we must word the questions clearly, using verbs such as "describe", "criticize" and "compare" instead of "discuss". I find some poorly structured essay questions in our examination, for example "Discuss radiological imaging of breast mass", which should be changed to "Compare ultrasound and mammography for differentiating breast masses".

Essay reliability and validity

Van der Vleuten(12) stated five criteria for the utility of an assessment tool: "reliability, validity, educational impact, acceptability and cost effectiveness".

Reliability measures the consistency of an assessment test. It is often described as reliability per hour of testing time, because time is a limiting factor during examination; the essay therefore has low reliability, as it requires a long time to answer(13). Inter-case reliability measures the consistency of candidates' performance across different cases, so it detects the extent of context specificity of the assessment tool and ensures that candidates' performance is accurately ranked(7). Schuwirth and Van der Vleuten(14) stated that the inter-case correlation of different essays in a test is low because the number of essays that can be asked in a single test is limited.

Chase(15) stated that essay scoring is a complex process because it involves many variables: the essay content, the writer, the rater and variability among colleagues, together with the significant effect of their writing. The most important type of reliability for rater-based assessment is inter-rater reliability. Single inter-rater reliability (the correlation between the scores given by two raters) ranges from 0.3 to 0.8, depending on the essay topic, essay length, rater experience and level of rater training(16). Munro et al(17), however, stated that a single inter-rater reliability of 0.7 can be obtained regularly if there is continuous, extensive rater training. In line with these authors, to increase inter-rater reliability we already use double markers for essay questions, and the mean of their scores is taken as the final score. In the tutorial of 2/12/2010, I came to understand more about the value of inter-rater reliability and the importance of examiner training, so I think we must train our radiology staff in this field to improve inter-rater reliability.
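The double-marking procedure described above can be illustrated with a short sketch. It assumes that single inter-rater reliability is estimated as the Pearson correlation between the two markers' scores and that the final score is their simple mean; the function names and the marks are hypothetical and are not data from our examination.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of marks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2 scripts)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a rater gives identical marks")
    return cov / sqrt(var_x * var_y)


def final_scores(rater_a, rater_b):
    """Final mark per script: the mean of the two markers' scores."""
    return [(a + b) / 2 for a, b in zip(rater_a, rater_b)]


# Hypothetical marks given by two examiners to the same five essay scripts.
rater_a = [12, 15, 9, 18, 14]
rater_b = [11, 16, 10, 17, 12]

print("inter-rater correlation:", round(pearson_r(rater_a, rater_b), 2))
print("final scores:", final_scores(rater_a, rater_b))
```

A correlation well below the 0.7 mentioned above would suggest that the markers need a shared checklist or further training before their marks are averaged.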

Essays are a poorly objective test for assessing learning outcomes because assessment scores vary between examiners(18,19). Schuwirth and Van der Vleuten(14) emphasized that using one marker for each essay across all students is more reliable than using one marker for all the essays of the same student. Davis(18) stated that double marking of the same question is mandatory to reduce variation between markers. Norman et al(20) stated that structured marking of the essay may improve its reliability but may trivialize the process. Beattie and James(21) suggested using a checklist in marking essays to reduce subjectivity, as it provides the examiner with the key points of each item and their allocated marks. As mentioned before, our radiology department applies double marking to each question, but we do not use a checklist, and I think this makes our examination less reliable and less objective. We therefore have to use a checklist with specific marks for each part of the question.

Validity is the ability of an assessment method to measure what it purports to measure(19). Van der Vleuten(12) stated that validity is approached as a hypothesis: assessment data are collected at a certain time and then examined to see whether they support the validity hypothesis. Validity is closely related to reliability, as a valid method must be reliable(8). A valid method reflects what students have achieved of the intended learning objectives of the course, so increasing the number of test items is essential for a more valid test; the essay paper therefore has limited content validity(7). Brown(22) advises using a large number of short essays to improve reliability and validity and to reduce sampling error. Davis(18) countered that this makes marking more time-consuming. Van der Vleuten et al(23) stated that assessment methods should have content validity, which must be designed and mapped on a blueprint. Our faculty assessment centre has begun to apply a test blueprint to determine the main content of the test, which must have high content validity to cover our intended learning objectives. In accordance with the blueprint, we should therefore use eight to ten short essays in each essay paper instead of four long essays.

Modified Essay

Modified essay questions were initially produced by the Royal College of General Practitioners in London and are now widely used(11). Davis(18) stated the importance of using a context-rich scenario, which directs students to answer with precise data and makes the examination more realistic. Schuwirth and Van der Vleuten(14) showed that the written case-simulation essay appears to be more valid, as its questions focus on history taking, diagnosis, investigation and examination findings, which are closely related to real practice. Swanson et al(24) countered that these essays are not suitable for assessing problem solving. Newble and Cannon(11) showed that certain skills are needed to construct modified essay questions, so that a question does not give away the answer to a previous question and students are not penalized for an error in question construction. Schuwirth and Van der Vleuten(13) emphasized that considerable structure in an essay question is necessary, but that over-structuring may lead to only a limited increase in its reliability. As we use essays in both formative and summative assessment, we should use modified essays with context-rich scenarios and case-based questions instead of traditional essays. This applies especially to formative assessment, where the essay can be returned to students with its model answer for discussion and feedback during the tutorial, as this will motivate students and encourage their critical thinking and reflection. We must also be trained in constructing modified essay questions to avoid poor forms that may cause assessment error.

Schuwirth and Van der Vleuten(13) advise using essays only on the limited occasions when objective tests are not suitable. Objective written tests such as MCQs, short answer questions and matching exercises have the advantages of being economical, quick to score and highly reliable, and of evaluating the student over a large content area(25).


There are two major formats of MCQ: the true/false format and the single best answer format. The true/false format can cover a broad range of topics and is easily marked, but it mainly measures definitions and simple facts(26). Case and Swanson(27) explained why use of the true/false format has markedly declined: it is difficult to construct, it is mainly used to assess recall of isolated facts, and it cannot detect whether a student who correctly identifies a false statement knows the right answer. Another disadvantage of the true/false format is its high probability of guessing(28). To overcome guessing, negative marking was introduced, in which marks are deducted for wrong answers, but this may produce negative psychometric results(25). We sometimes use the true/false format instead of the single best answer format, but we do not apply negative marking, as we think it is stressful for students. I also have a bad memory of negative marking: as a second-year medical student I got 19/50 in a physiology MCQ test, which left me unwilling to take risks in MCQs. When I carefully read previous radiology examinations in the true/false format, I unfortunately found some ambiguous items, which may be a critical failure for such an examination. I therefore think we must limit this format to assessing definitions and identification of facts, and apply other types of objective test. This agrees with Schuwirth and Van der Vleuten(13), who stated that true/false questions are suitable only when the purpose of the question is to evaluate whether the student can determine the correctness of a hypothesis.
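The effect of negative marking on blind guessing can be illustrated with a small expected-value calculation. The sketch assumes one mark for a correct answer and a hypothetical penalty for a wrong answer; the function name and the penalty values are illustrative only and are not a marking scheme used in our faculty.

```python
def expected_guess_score(n_options, reward=1.0, penalty=0.0):
    """Expected mark for one item answered by a purely random guess."""
    if n_options < 2:
        raise ValueError("an item needs at least two options")
    p_correct = 1 / n_options
    return p_correct * reward - (1 - p_correct) * penalty


# True/false item (two options), without and with a one-mark penalty.
print(expected_guess_score(2))               # no negative marking
print(expected_guess_score(2, penalty=1.0))  # one mark deducted per wrong answer

# Single best answer item with four options and no penalty.
print(expected_guess_score(4))
```

Under these assumptions a blind guess on a true/false item earns half a mark on average without a penalty and nothing on average with a one-mark penalty, which is why the true/false format is the more exposed to guessing.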

MCQs can evaluate a broad range of learning outcomes within a short time and with limited human intervention; they also have a low guessing probability when the questions are free of ambiguity(29). In the tutorial of 16/12/2010 there was a debate about the effect of guessing on MCQ test scores, and I learned from the discussion an interesting concept: guessing does not change test reliability, because a competent student is also a good guesser.

Constructing good MCQ items requires a good grasp of the content, study of the objective of the assessment and a high-quality format for item writing(27). An MCQ consists of a stem and several options. The stem is a sentence or question and may be accompanied by diagrams or tables; the correct option is called the "keyed response" and the wrong options are called "distracters"(29).

MCQ reliability and validity

Collins(30) showed that MCQs have the disadvantage of testing recognition of knowledge rather than construction of an answer. McAleer(31) argued that MCQs are objective tests that do not give students the chance to provide additional information and do not allow the examiner to judge the quality of the student's answer. I agree with McAleer(31), as we use MCQs to assess knowledge and understanding across a broad range of learning objectives within a short time.

Reliability refers to the reproducibility of the assessment score and is expressed as a coefficient that ranges from 0 for no reliability to 1 for perfect reliability. MCQs are widely used because of their high reliability, which is attributed to their ability to assess a broad amount of knowledge by providing a large number of items that address areas of context specificity within a short time(7,30). Van der Vleuten and Schuwirth(9) showed that the predominant factor affecting reliability is the domain, as competence depends on context specificity. Downing(32) stated that internal consistency reliability is important for written examinations and is estimated by indices such as Cronbach's alpha or Kuder-Richardson formula 20, which rest on the test-retest concept but are computed from a single administration of the test; he also emphasized that MCQs have high internal consistency reliability, as the test score would be nearly the same if the examination were repeated at a later time. McCoubrie(25) disputed this and stated that the assumption that MCQs are reliable is weak, as they are reliable only because they provide a time-efficient test with wide sampling of topics. Van der Vleuten and Schuwirth(9) stated that the reliability of a one-hour MCQ test is 0.62, which increases to 0.93 for a four-hour test because more items are used. Wass et al(33) stated that for important high-stakes examinations a reliability of 0.8 or more is essential for pass-fail decisions, whereas lower reliability can be accepted for formative assessment. Our final MCQ examination contains twenty questions in one hour, so it has low reliability: the small number of items in a short time misses many objectives. I therefore think we have to increase the number of questions to cover more knowledge across contexts, and consequently increase the test time, to improve test reliability.
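As an illustration of how such a coefficient is obtained, the sketch below computes Kuder-Richardson formula 20 from a table of right/wrong answers and uses the Spearman-Brown formula to project the reliability of a lengthened test. It assumes items scored 0 or 1 and uses the population variance of total scores; the function names and the response table are hypothetical. The Spearman-Brown projection is a theoretical estimate, so empirically reported figures such as those cited above need not follow it exactly.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a list of student rows, each a list of 0/1 item scores."""
    n_students = len(responses)
    n_items = len(responses[0])
    if n_items < 2 or n_students < 2:
        raise ValueError("need at least two items and two students")
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)  # population variance of total scores (assumption)
    if total_var == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_students  # proportion correct
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


def spearman_brown(reliability, length_factor):
    """Projected reliability when the test is made length_factor times longer."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical answers of five students to four items (1 = correct).
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print("KR-20:", round(kr20(answers), 2))
print("projected reliability if doubled:", round(spearman_brown(kr20(answers), 2), 2))
```

The projection shows the general point of the paragraph: with comparable items, a longer test is expected to be more reliable than a twenty-item, one-hour paper.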

Face validity refers to the appearance of the test and whether it matches the educational purpose(7). An MCQ test with good face validity must be acceptable and readable, with clear content and well-structured items, and free of spelling and grammatical errors(29,30). Case and Swanson(27) stated that MCQs must be well structured so that they are simple and easily understood, with plausible distracters; negatives and imprecise words such as "never", "sometimes", "frequently" and "usually" should also be avoided, as they may confuse examinees(31). Lowe(34) stated that useful distracters should reflect a misconception among students about the right option, so writing many plausible distracters is a difficult and time-consuming part of MCQ construction. MCQ reliability also increases when implausible distracters are removed(35,36). Flaws in writing distracters, such as having more than one correct answer, using "all of the above" or "none of the above", or making the right option the longest one, should therefore be avoided(37). Although we choose MCQs from question banks or MCQ books, I unfortunately found many defects in our last MCQ examination: one question contained a double negative, other questions contained imprecise words such as "sometimes" and "always", and in some questions distracters could easily be eliminated. I think we must take care in choosing MCQ distracters, which should appear to students to be valid answers although they are incorrect, and we must avoid obviously incorrect or plain distracters. We therefore need training courses in MCQ preparation and in writing MCQ stems and distracters to avoid these flaws.

A criticism of MCQ validity is that MCQs measure factual knowledge and do not integrate skills, attitudes and communication(25). Downing and Yudkowsky(38) emphasized that knowledge is the single best domain for determining expertise, so MCQs are a valid method of assessing competence in cognitive knowledge. Collins(30) stated that MCQs have high content validity if they represent a wide sample of content that serves the learning outcomes. However, Shumway and Harden(1) criticized MCQs for assessing discrete, superficial knowledge rather than deep understanding, because they are designed to detect what students know or do not know.

Bloom's taxonomy of educational objectives is a hierarchy of knowledge across cognitive levels: "knowledge, comprehension, application, analysis, synthesis and evaluation"(39). Educators have simplified Bloom's taxonomy into three levels: recall of knowledge, comprehension and analysis, and problem solving(11). Case and Swanson(27) and McAleer(31) showed that well-structured MCQs can assess higher cognitive processes in the taxonomy rather than only recall of facts. Peitzman et al(40) countered that higher-order MCQs do not improve MCQ validity, although they appear more realistic and acceptable to students and examiners. Frederiksen(41) stated that it is difficult to construct context-rich MCQs, as item writers tend to avoid topics that cannot easily be asked. Following Case and Swanson(27) and McAleer(31), we always choose MCQs at different cognitive levels. When I reviewed our MCQ tests I found some questions that assess recall of knowledge (Q*) and others that assess problem solving (Q**) on the same topic, for example:

Q*: what is the effective measure which reduces radiation of CT chest?

A-120 mA

B-150 mA

C-200 mA

D-250 mA

Q**: what of the following will reduce dose of radiation for CT chest?

A-reducing mA from 250 to 150

B-reducing KVp from 160 to 120

C-reducing the pitch to be 1 instead of 2

D-reducing scanning time to be 1 instead of 2

Newble and Cannon(11) advise using a computerized optical mark reader to score and analyze MCQ tests, as the computer programme has the advantage of providing statistical data on the test, including the reliability coefficient, standard deviation and item analysis. In our examination we mark MCQs by hand using an answer sheet. Recently, however, our faculty acquired a new machine for marking MCQ tests, so we need to learn how to use its statistical data for test interpretation, as these may help us to improve our next examination.

A blueprint is a powerful tool for an integrated curriculum, as it ensures that all intended learning objectives are assessed(42). The members of our faculty assessment centre are working on this: they have given many orientation sessions on blueprint construction and its importance, and they have asked all departments to complete their blueprints. Until now, however, we have evaluated our examinations retrospectively against the intended learning objectives. Unfortunately, in some written examinations we found that the test items did not cover all topics of the curriculum and missed many intended learning objectives, and some MCQ examinations focused on certain systems rather than others, which may bias the examination results. I therefore think we urgently need a test blueprint that covers learning objectives and assessment methods, to identify the key topics that must be tested according to our learning objectives and to determine the number of questions according to their corresponding weight in the context. This agrees with Downing and Haladyna(43), who stated that a blueprint reduces two threats to validity: under-sampling of the curriculum and construction of irrelevant items.
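Determining the number of questions from topic weights can be sketched as a simple proportional allocation. The example assumes that each topic in the blueprint has a numeric weight and that leftover questions go to the topics with the largest fractional share (a largest-remainder rule chosen here for illustration); the weights, the topic list and the function name are hypothetical and are not our department's blueprint.

```python
from math import floor


def allocate_items(weights, total_items):
    """Split total_items across topics in proportion to their blueprint weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    exact = {topic: total_items * w / total_weight for topic, w in weights.items()}
    allocation = {topic: floor(share) for topic, share in exact.items()}
    leftover = total_items - sum(allocation.values())
    # Give the remaining questions to the topics with the largest fractional part.
    by_remainder = sorted(exact, key=lambda t: exact[t] - allocation[t], reverse=True)
    for topic in by_remainder[:leftover]:
        allocation[topic] += 1
    return allocation


# Hypothetical weights for the systems covered by the first paper.
weights = {
    "cardiovascular": 3,
    "respiratory": 3,
    "musculoskeletal": 2,
    "gastrointestinal": 2,
}
print(allocate_items(weights, 40))
```

An allocation of this kind makes under-sampling visible before the paper is written, because every topic with a non-zero weight receives an explicit share of the questions.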

Extended matching questions and short-answer questions

Schuwirth et al(44) explained that students can answer an MCQ correctly by recognizing the right answer even though they could not answer it in the absence of the options. Graber et al(45) explained the problematic effect of MCQ cueing, which may cause diagnostic errors, especially when diagnostic reasoning is assessed. Schuwirth and Van der Vleuten(14) advise using extended matching items and short-answer questions, as they can reduce the cueing effect.

Extended matching questions (EMQs) are a good authentic test, as they use real clinical scenarios that require sufficient clinical knowledge, and they can test a wide range of topics for knowledge application and problem-solving ability, such as diagnosis, investigation and management(46). Beullens et al(47) emphasized that EMQs can assess extended learning and minimize the recognition effect, in contrast to the memorization of facts needed to solve MCQs. McAleer(31) criticized EMQs as difficult to construct, with their many different items and long lists of suitable answers. However, Schuwirth and Van der Vleuten(13) advise using EMQs, as they are a reliable test that is quick to score. We have no experience with EMQs, but having learned their importance and their significant role in improving the reliability of written assessment, I think that before applying this format we need training in how to construct and practise these questions, to avoid poor presentation of some items.

Short answer questions are an important assessment tool because they are objectively scored: they require a clear set of answers and leave little room for guessing(3). McAleer(31) criticized them, stating that although short answer questions are easily constructed, they are used only to measure recall of information, as they cannot measure complex learning outcomes such as synthesis and analysis of information. Epstein(8), in contrast, stated that short answer questions can be used for summative and formative assessment, but that their reliability depends mainly on training students in how to answer them. We do not use short answer questions in our examination, but I think we can use them when we want to cover a broad area of content and be sure that students can supply an answer rather than choose it from several options.

Educational Impact

Consequential validity refers to the real impact of an assessment method on learning, and whether it appropriately drives students' learning(25). Wass et al(7) stated that consequential validity refers to the educational consequence of the test, in that it should produce the desired educational outcomes; students should study the subject rather than study for the test. Although consequential validity is important, it is ignored by many examiners(48). I think our written examination has a significant educational impact on how our students study: in my experience, students study what they need in order to pass rather than the whole integrated body of information. To improve this, we have to use different forms of written assessment that cover the important content of the curriculum, combined with continuous formative assessment and feedback to steer what our students study and how they learn. This agrees with Van der Vleuten(12), who stated that assessment can drive learning in four ways: through the assessment content, the assessment structure, the questions asked and the frequency of repeated examination.


Shumway and Harden(1) emphasized that the practicability of an assessment method depends on resources, the availability of expertise and their costs. Resource intensiveness is determined by the cost of constructing and marking the test items(44). Cost includes the start-up and continuing resources needed for test implementation(1). Essay questions appear easy to construct, but a specific answer key is needed, which makes preparation more time-consuming(18). MCQs seem easy to grade, especially by computer, but well-structured items take more time to construct(30). Shumway and Harden(1) stated that it is important to consider the relation between the cost of an assessment method and its benefit. Van der Vleuten(12) took a different view, considering investment in assessment methods to be an investment in the teaching and learning process. I think we must consider the criteria of each method and balance them against each other, as the outcome may change according to the assessment context. In agreement with Van der Vleuten(12), I also think we must use different assessment tools, especially for high-stakes summative examinations, to obtain more reliable and valid assessment.


A score gives the number of correct answers in an assessment, but it does not represent the quality of the student's performance(49). Norcini(50) stated that standard-setting is the process by which the pass mark of an examination is determined to distinguish competent from non-competent students, and it allows for variation according to the difficulty of the test.

There are two types of standard-setting: relative (norm-referenced) and absolute (criterion-referenced). In relative standard-setting a fixed number of students pass the examination irrespective of their level of competence, as the standard is related to peer performance and a fixed percentage of success(50). In our faculty we use relative standard-setting to select the students with the highest scores for admission to a postgraduate course when the number of places is fixed. In the tutorial of 9/12/2010, I gained new information from one peer, who advised using relative standard-setting in formative assessment to identify the lower achievers who need extra training. I also learned a significant concept from another peer, who criticized relative standard-setting because it can make students lazy and demotivated: they believe that they may pass by chance irrespective of their performance.

Absolute standard-setting is more suitable for tests of competence, as an accurate standard should be determined below which the candidate would not be fit for a particular purpose(7). Absolute standard-setting methods may be test-centered, like the Angoff method, or examinee-centered, like the contrast group method. In the Angoff method, the examiner evaluates every item to estimate hypothetically what the candidate would get on each item(51,52). Smee and Blackmore(53) stated that the modified Angoff method reduces the difficulties of the traditional Angoff method; for example, the difficulty of picturing hypothetical borderline candidates is reduced by supplying the examiners with the candidates' real test scores from a previous assessment. In the contrast group method, panelists decide the pass score by locating it on the score scale at the point that best fits the purpose of the examination(52,54). In our faculty we do not use any form of standard-setting, as we use 60% as a fixed pass/fail mark for all test types, but I think our assessment centre must adopt standard-setting for further improvement. I prefer the modified Angoff method, as it is widely used in medical assessment and is designed for multi-component assessment, so we can use it for many assessment types.
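The arithmetic behind an Angoff cut score can be sketched briefly. The example assumes the commonly described form of the method, in which each examiner estimates for every item the probability that a borderline candidate answers it correctly, the estimates are summed per examiner and the sums are averaged across examiners; it also assumes one mark per item. The panel data and the function name are hypothetical.

```python
def angoff_cut_score(judge_estimates):
    """Cut score in raw marks from per-item estimates, one list per judge."""
    if not judge_estimates:
        raise ValueError("at least one judge is required")
    n_items = len(judge_estimates[0])
    for estimates in judge_estimates:
        if len(estimates) != n_items:
            raise ValueError("every judge must rate every item")
        if any(not 0.0 <= p <= 1.0 for p in estimates):
            raise ValueError("estimates must be probabilities between 0 and 1")
    # Each judge's expected score for a borderline candidate, then the panel mean.
    per_judge_totals = [sum(estimates) for estimates in judge_estimates]
    return sum(per_judge_totals) / len(per_judge_totals)


# Hypothetical panel: three examiners rating a four-item test.
panel = [
    [0.50, 0.50, 0.75, 0.25],
    [0.50, 0.75, 0.75, 0.50],
    [0.75, 0.75, 0.75, 0.75],
]
cut = angoff_cut_score(panel)
print(f"cut score: {cut} of 4 marks ({cut / 4:.1%})")
```

Unlike a fixed 60% pass mark, a cut score derived in this way moves with the judged difficulty of the items in each paper.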

Norcini et al(50) stated that an absolute standard is applied either as a conjunctive or as a compensatory standard. Under a conjunctive standard the candidate must pass each item separately to pass the whole test, while under a compensatory standard the scoring allows the candidate to compensate for poor performance on one item with high performance on another. In our written assessment we accept a score of less than 60% on some questions if it is compensated by the scores on other questions and the total score reaches 60% or more. I now think we can use the conjunctive method for the essay paper, so that candidates must pass each essay question separately, as this will encourage them to study for every item.
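The difference between the two decision rules can be shown with a short sketch. It assumes a 60% pass mark applied either to the total (compensatory) or to every question separately (conjunctive); the function names and the candidate's marks are hypothetical.

```python
def passes_compensatory(scores, max_scores, pass_fraction=0.60):
    """Pass if the total mark reaches the pass fraction of the total available."""
    return sum(scores) >= pass_fraction * sum(max_scores)


def passes_conjunctive(scores, max_scores, pass_fraction=0.60):
    """Pass only if every question separately reaches the pass fraction."""
    return all(s >= pass_fraction * m for s, m in zip(scores, max_scores))


# Hypothetical candidate: four essay questions, each marked out of 25.
scores = [20, 18, 10, 16]
max_scores = [25, 25, 25, 25]

print("compensatory:", passes_compensatory(scores, max_scores))
print("conjunctive:", passes_conjunctive(scores, max_scores))
```

This candidate totals 64 of 100 and passes under the compensatory rule, but fails under the conjunctive rule because the third question scores 10 of 25.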

Item analysis

Construct validity means the ability of a test to differentiate between groups of candidates at different stages of learning in a certain domain(7,55). The construct validity of an MCQ test is evaluated by item analysis(30). Case and Swanson(27) stated that item analysis provides useful information about the quality of each item and of the test as a whole. Item analysis is valuable when it gives effective feedback to test writers, as this improves their skills in further test construction; it also helps in discarding poor items and detecting areas of content that may need more clarity(30). Item difficulty is determined from the proportion of students who answer each item correctly. Items are considered difficult if 50% of students or fewer answer them correctly and of low difficulty if 85% or more answer them correctly, while items of moderate difficulty, answered correctly by 60-80% of students, are the most discriminating(30). In the tutorial of 16/12/2010, many peers emphasized the value of including difficult items in the examination, as these encourage students towards excellence and to study more to gain marks. I think we must include a certain percentage of difficult items in each examination to drive our students' learning, and we must avoid items that are too easy or too difficult, as they cannot discriminate between students.
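The difficulty calculation can be sketched as follows. The example uses the thresholds quoted above (difficult at 50% correct or less, low difficulty at 85% or more) and, as a simplifying assumption, labels everything in between as moderate; the function names and the response data are hypothetical.

```python
def item_difficulty(item_responses):
    """Proportion of students who answered the item correctly (1 = correct)."""
    if not item_responses:
        raise ValueError("no responses recorded for this item")
    return sum(item_responses) / len(item_responses)


def classify_difficulty(p):
    """Label an item using the thresholds quoted in the text."""
    if p <= 0.50:
        return "difficult"
    if p >= 0.85:
        return "easy"
    return "moderate"


# Hypothetical answers of eight students to three items.
items = {
    "Q1": [1, 1, 1, 1, 1, 1, 1, 0],
    "Q2": [1, 1, 1, 0, 1, 1, 0, 0],
    "Q3": [1, 0, 0, 1, 0, 0, 0, 0],
}
for name, responses in items.items():
    p = item_difficulty(responses)
    print(name, p, classify_difficulty(p))
```

Note that a higher proportion correct means an easier item, so the "difficulty" value is really an index of easiness.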

Item discrimination is determined by the difference in the percentage of correct responses between two groups of students (the top third and the lower third). The discrimination index lies between -1 and +1, and an acceptable index is in the range of -0.5 to +0.5(27). A good item has a discrimination index closer to +1, as it can distinguish good students from poor ones; if poor students answer an item correctly more often than good students, the item discriminates negatively and should be excluded(30). Downing(32) emphasized that the items of an MCQ test represent a sample of all the questions that could be asked, so for a test with good internal consistency the test score should be an indicator of the student's score on any other set of relevant items. Although our faculty has an assessment centre, we do not apply item analysis to any examination. Before applying it, I think we need to orient our faculty members about the importance of item analysis and how to use its statistical data to detect the causes of low discrimination, discard poor questions and identify gaps in the curriculum.
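A minimal sketch of the discrimination calculation is given below. It assumes that students are ranked by total test score, that the top and lower groups each contain one third of the candidates (rounded down), and that the index is the proportion correct in the top group minus the proportion correct in the lower group; the function name and the data are hypothetical.

```python
def discrimination_index(total_scores, item_responses):
    """Top-third minus lower-third proportion correct for one item."""
    n = len(total_scores)
    if n != len(item_responses):
        raise ValueError("one item response is needed per student")
    group_size = n // 3
    if group_size == 0:
        raise ValueError("at least three students are needed")
    # Student indices ordered from highest to lowest total score.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    top = order[:group_size]
    bottom = order[-group_size:]
    p_top = sum(item_responses[i] for i in top) / group_size
    p_bottom = sum(item_responses[i] for i in bottom) / group_size
    return p_top - p_bottom


# Hypothetical total scores of six students and their answers to two items.
totals = [18, 17, 15, 12, 11, 9]
good_item = [1, 1, 1, 0, 1, 0]
bad_item = [0, 0, 1, 1, 1, 1]

print("good item:", discrimination_index(totals, good_item))
print("negatively discriminating item:", discrimination_index(totals, bad_item))
```

An item with a negative index, like the second one here, is answered correctly more often by the weaker students and is a candidate for exclusion or rewriting.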

Finally, we use written assessment to assess the major domain of cognition, from its low level of knowledge recall to its high level of knowledge application and problem solving. As mentioned before, however, I think our written assessment has limited reliability and validity, as we use a limited number of essay questions and a short MCQ test. We must therefore use more objective tests, namely well-structured MCQs, extended matching questions and short answer questions, together with more essay questions, especially modified essays with context-rich scenarios and case-based questions, accompanied by a marking checklist and item analysis. We must also determine the number of questions according to their corresponding weight in the context and according to the test blueprint, as these steps will facilitate sampling a broad range of relevant content and constructs from our learning objectives.