The experience of examining in an OSCE
I am a Specialty Trainee (ST6) in general adult psychiatry and I currently work as a clinical teaching fellow in mental health at St. George’s, University of London (SGUL). At SGUL, the main method used for assessing clinical competence of the students is the objective structured clinical examination (OSCE) and I am asked to examine OSCEs at SGUL on a regular basis. SGUL are currently in the process of revising the OSCEs for medical students in their clinical years; this revision is marked by a change to the use of global rating scales to assess OSCE candidates and I have selected this assessment experience as the topic of this assignment in order to consider the evidence for global rating scales.
The OSCE has long been recognised as “one of the most reliable and valid measures” of clinical competence available. Since they were first conceptualised in the 1970s, OSCEs have become a very common form of clinical examination in both undergraduate and postgraduate medical education. They were developed in order to help address the unreliability and lack of authenticity of the traditional assessments of clinical competence, namely the 'long case' and the 'short case'.
During the OSCE, the candidates pass through a number of independently scored stations. The candidate will be set a task in each station, which will often involve an interaction with a standardised patient who portrays a clinical scenario. Tasks can include physical examination, history-taking and explaining diagnoses and treatment options.
The experience of examining in an OSCE
I was recently asked to examine in a summative end of term OSCE for transition year (T-year) students at SGUL. Prior to the day of the examination, I received copies of the candidate instructions, simulated patient script, examiner instructions and examiner mark sheet, all of which I reviewed in advance, familiarising myself with the checklist mark sheet. On the day of the examination, I attended a thirty-minute examiner briefing, which covered how to use the mark sheets. As four identical concurrent OSCE circuits were running that afternoon, immediately prior to the examination start time, I met with the other examiners and simulated patients that would be examining and acting in the same station as me, in order to discuss any uncertainties in the mark sheet and to ensure as much consistency as possible.
Throughout the course of an afternoon, I examined approximately thirty students. Each student had ten minutes in which to discuss smoking with the simulated patient and to apply motivational interviewing techniques. The simulated patient was an actor with previous training and experience in medical student OSCEs. The candidates had one minute to read their instructions before entering the station. A one-minute warning bell rang at nine minutes.
The mark sheet consisted of a checklist of twenty-three items, a simulated patient mark and a global rating. Each checklist item was marked either 0 for poorly done or not done, 1 for skill completed adequately or 2 for does well. I completed the checklist marks as the candidate undertook the station. The simulated patient mark was based on how easy the candidate made it for them to talk and they could award 0, 1 or 2, which was added to the overall checklist score. The simulated patient and I devised a silent hand signal system for informing me of their mark after the student had left the station. Finally, I awarded an overall global rating on a 1 to 5 Likert scale, with 1 being a clear fail and 5 being outstanding. During the examiner briefing, it was explicit that the global rating should reflect how the examiner felt the student performed, irrespective of the checklist score. I used the one-minute reading time between candidates to check that I had completed the mark sheet and to award the global rating.
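The arithmetic of this mark sheet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names are hypothetical and do not come from the SGUL paperwork. It assumes, as described above, that the simulated patient mark is added to the checklist total and that the global rating is recorded separately.

```python
from dataclasses import dataclass
from typing import List

CHECKLIST_ITEMS = 23          # number of checklist items on the mark sheet
ITEM_MARKS = (0, 1, 2)        # 0 = poorly done or not done, 1 = adequate, 2 = does well
GLOBAL_RATINGS = range(1, 6)  # 1 = clear fail ... 5 = outstanding
MAX_SCORE = CHECKLIST_ITEMS * 2 + 2  # checklist maximum plus simulated patient maximum


@dataclass
class MarkSheet:
    """Hypothetical record of one candidate's marks at one station."""
    checklist: List[int]   # one mark (0, 1 or 2) per checklist item
    patient_mark: int      # simulated patient mark (0, 1 or 2)
    global_rating: int     # examiner's overall judgement (1 to 5)

    def validate(self) -> None:
        if len(self.checklist) != CHECKLIST_ITEMS:
            raise ValueError("expected one mark for each of the 23 checklist items")
        if any(mark not in ITEM_MARKS for mark in self.checklist):
            raise ValueError("checklist marks must be 0, 1 or 2")
        if self.patient_mark not in ITEM_MARKS:
            raise ValueError("simulated patient mark must be 0, 1 or 2")
        if self.global_rating not in GLOBAL_RATINGS:
            raise ValueError("global rating must be between 1 and 5")

    def total_score(self) -> int:
        """Checklist total plus simulated patient mark; the global rating is kept apart."""
        self.validate()
        return sum(self.checklist) + self.patient_mark


# Example: a candidate marked 'adequate' on every item, with a top simulated patient mark.
sheet = MarkSheet(checklist=[1] * CHECKLIST_ITEMS, patient_mark=2, global_rating=3)
print(sheet.total_score(), "out of", MAX_SCORE)  # prints: 25 out of 48
```

Keeping the global rating out of the total mirrors the briefing instruction that it should reflect the examiner's overall impression, irrespective of the checklist score.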
I awarded the majority of students a global rating of 3 (clear pass) with one candidate receiving a clear fail and one receiving an outstanding.
Critical reflection on the experience
I often feel anxious about examining in clinical examinations, especially those that are summative, as I am responsible for ensuring a fair and consistent examination. However, I felt that I was able to prepare well for the examination and familiarise myself with the scenario and mark sheet, as I received this information two days prior to the examination. I also received guidance from the responsible examiner on the level expected from the students, which helped to allay my anxiety about forgetting what stage the students are at and expecting the wrong standard. The station expected the candidates to use motivational interviewing, which is something that I am familiar with as a psychiatrist. Therefore, I had a good level of understanding of the station and what was expected of the students. However, I was surprised when I reviewed the checklist mark sheet, as I felt that, as an “expert”, I may not have asked all the questions that the students were required to ask in order to obtain the marks. I suspected that this was because, as an experienced clinician, I would need less information to reach the correct diagnostic conclusion. However, this left me wondering whether checklists are the best method for assessing experienced clinicians or, in the case of this OSCE, the better students.
As the same station was running simultaneously on several circuits, it was important to make sure that there was inter-rater reliability. Therefore, I was keen to discuss the station and mark sheet with the other examiners who were examining the same station in order to ensure consistency with the more subjective points on the mark sheet. However, there was very little time for this after the examiner briefing, as examiners were rushing to find the correct station and speak to their actor. This left me worried about the risk of subjectivity being introduced by individual examiners’ interpretations of the mark sheet. On a positive note, I did have some time to run through the station with the simulated patient before the examination, to ensure that they were clear how much information to give the candidates and how to indicate their marks. The simulated patient was familiar with the station, as she had already participated in a morning examination session; thus she was able to provide me with information about problems that had arisen with the station, in order to ensure consistency.
In terms of other aspects of the examination that went well, I remained quiet and non-intrusive during the examination, allowing the candidates to interact with the simulated patient uninterrupted.
Even though OSCEs are called objective, I have always wondered how objective they truly are. Even with checklist scoring, there is room for some degree of subjectivity, especially when deciding whether a student did something adequately or well. During this examination, I had difficulty allocating the global rating for each student and I was concerned that I may have been inconsistent with this, introducing further subjectivity into the examination. I was particularly concerned that the first few students were judged differently to later students, as I was still familiarising myself with the general standard of the students. When the Royal College of Psychiatrists (RCPsych) moved to using global ratings instead of checklist scores in their membership examinations, they removed the word “objective” from the title of the examination. I will address this in the key points.
On the other hand, as a psychiatrist, I am often asked to examine OSCE stations with a strong emphasis on communication skills, as described in this experience, and I do not feel that checklists necessarily reflect these skills. Students tend to fire off a list of rehearsed questions, in order to meet the checklist’s requirements within the limited time they have. This negatively impacts on rapport with the simulated patient. Like the RCPsych, SGUL is changing the format of the OSCEs for the more senior years of the undergraduate medicine courses to use global rating scales instead of checklist scores and I was interested to investigate the evidence for the advantages of global ratings over checklists.
Are global rating scales as reliable as checklist scores?
Do global rating scales have advantages over checklists for more experienced candidates?
Are global ratings a better method of assessing communication skills than checklists?
In 1990, psychologist George Miller proposed a framework for assessing clinical competence (see Figure 1). At the lowest level of the pyramid is knowledge (knows), followed by competence (knows how), performance (shows how), and action (does). OSCEs were introduced to assess the 'shows how' layer of Miller's pyramid.
Figure 1: Miller’s pyramid for assessing clinical competence (taken from Norcini, 2003) 
OSCE marking strategies
Historically, marking of the candidate’s performance in the OSCE has been undertaken by an examiner who ticks off items on a checklist as the student achieves them. In some cases, the total checklist score forms the mark awarded to the candidate.
The use of checklists is proposed to lessen subjectivity, as they make the examiners “recorders of behaviour rather than interpreters of behaviour”. However, in recent years, global ratings have increasingly been used in conjunction with or even instead of checklists. There are a number of reasons for this. Firstly, global ratings have been shown to have psychometric properties, including inter-station reliability, concurrent validity and construct validity, that are equal to or higher than those of checklists. Further, checklists do not reflect how clinicians solve problems in the clinical setting. Finally, binary checklists do not take into account components of clinical competence such as empathy, rapport and ethics.
Global ratings versus checklists: psychometric properties
Van der Vleuten and colleagues conducted two literature reviews of the psychometric properties of different examination scoring systems, including those used in OSCEs  . They made a distinction between objectivity and objectification, describing objectivity as the “goal of measurement, marked from subjective influences”  . The authors acknowledged that “subjective influence cannot completely be eliminated”  . Consequently, they defined objectification as the use of strategies to achieve objectivity and suggested that such strategies might include detailed checklists or yes/no criteria. The studies they reviewed consistently indicated that objectification does not result in “dramatic improvement” in reliability.
They concluded that methods considered to be more objective, including checklists, “do not inherently provide more reliable scores” and “may even provide unwanted outcomes, such as negative effects on study behaviour and triviality of the content being measured”. This conclusion was supported by the results of another study, which found higher reliabilities for subjective ratings than for objective checklists.
Regehr et al directly compared the reliability and validity of task-specific checklists and global rating scales in an OSCE  . They discovered that, compared with checklists, global ratings “showed equal or higher inter-station reliability, more accurate prediction of the training level of the … (candidate)”, indicating better construct validity, and “more accurate prediction of the quality of the final product”, indicating better concurrent validity. The results of the study also revealed that the combination of checklists with a global rating scale did not significantly improve the reliability or validity of the global rating alone.
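For readers unfamiliar with how inter-station reliability is quantified, the sketch below computes Cronbach's alpha across stations, treating each station as an "item". This is an assumption made for illustration: alpha is a commonly used internal-consistency coefficient, but the account above does not state which statistic Regehr et al used, and the function name and example scores are hypothetical.

```python
from statistics import pvariance
from typing import Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a candidates-by-stations table of scores.

    scores[c][s] is the score of candidate c on station s.
    """
    n_stations = len(scores[0])
    if n_stations < 2:
        raise ValueError("at least two stations are needed")
    if any(len(row) != n_stations for row in scores):
        raise ValueError("every candidate needs a score for every station")

    # Variance of the scores within each station, taken across candidates.
    station_variances = [
        pvariance([row[s] for row in scores]) for s in range(n_stations)
    ]
    # Variance of each candidate's total score across all stations.
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary, so alpha is undefined")

    return (n_stations / (n_stations - 1)) * (
        1 - sum(station_variances) / total_variance
    )


# Hypothetical global ratings (1 to 5) for four candidates over three stations.
ratings = [
    [2, 3, 2],
    [3, 3, 4],
    [4, 5, 4],
    [5, 4, 5],
]
print(round(cronbach_alpha(ratings), 2))
```

A value close to 1 indicates that candidates are ranked similarly from station to station, whereas a value near 0 indicates that performance on one station says little about performance on another.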
Cohen et al undertook a study to determine the validity and generalizability of global ratings of clinical competence made by expert examiners. They administered a thirty-station OSCE to seventy-two foreign-trained doctors who were applying to work in Ontario. For each candidate, the examiners completed a detailed checklist and two five-point global ratings. Their results revealed that “generalizability coefficients for both ratings were satisfactory and stable across cohorts”. There were significant and positive correlations between the global ratings and total test scores, demonstrating construct validity. This further supports the conclusion that global ratings are as reliable as, or even more reliable than, checklists.
Whilst studying the psychometric properties of global rating scales, Hodges et al  found that the student’s perception of how they are being evaluated can affect their behaviour during the examination. Students who believed that they were being assessed by checklists tended to use more closed questions in a focused interview style. However, those students that perceived that they were being marked on a global rating scale tended to use more open-ended questions and gave more attention to their interaction with the patient. This finding was supported by another study  , which also found that reliability of global ratings is further improved when the students anticipate evaluation by a global rating scale. The authors concluded, “not only student scores but also the psychometrics of the test may be affected by the students’ tendency to adapt their behaviours to the measures being used”.
Global ratings versus checklists: the effect of the level of expertise of the candidate
Dreyfus and Dreyfus  suggested that there are five stages of developing expertise: novice, advanced beginner, competence, proficiency and expertise. Each stage is characterised by a different type of problem-solving, for example the novice will collect large amounts of data in no particular order to use for problem-solving. At the other end of the spectrum, experts tend to gather specific data in a hierarchical order. However, experts have great difficulty in breaking down their thinking into the individual components and, therefore, struggle to return to the novice type of problem-solving.
This theory has been shown to apply to clinical practice through research investigations. For example, Leaper studied the behaviour of clinicians when interviewing patients, in particular what questions they asked and in what order. The study included doctors specialising in surgery, ranging from pre-registration house officer to consultant. Leaper found that the more junior doctors would apply the same set of questions to each patient, irrespective of whether they were relevant to that patient or not. By contrast, the senior doctors were more flexible in their use of questions and were able to yield more information with fewer questions.
This shows how, as clinicians develop expertise, they tend to move away from applying checklist style questions to each patient and towards complex, hierarchical problem-solving skills. Therefore, whilst the checklist marking used in OSCEs may be appropriate for novices, it penalises more experienced clinicians, who “integrate information as they gather it, in a way that they may not be able to articulate”. In order to test this theory, Hodges et al evaluated the effectiveness of OSCE checklists in measuring increasing levels of clinical competence. They asked forty-two doctors of three different grades to undertake an OSCE comprising two fifteen-minute stations. In each station, an examiner rated the candidate’s performance using a checklist and a global rating scale. Each station was interrupted after two minutes to ask the candidate for a diagnosis. Each candidate was again asked for a diagnosis at the end of the station. The results revealed significantly higher global ratings for experts than junior doctors but a decline in checklist scores with increasing levels of expertise. The consultant grade doctors scored significantly worse than both grades of junior doctors on the checklists. The accuracy of diagnoses increased between two and fifteen minutes for all three groups, with no significant differences between the groups. These results were consistent with a previous study, which found that senior doctors scored significantly better on OSCE global ratings than their junior counterparts, but not on checklists. This earlier study was primarily designed to examine the validity of a psychiatry OSCE for medical students. Thirty-three medical students and seventeen junior doctors completed an eight-station OSCE, during which examiners used both checklists and global ratings to assess the candidates.
Although it was not the primary aim of the study, the results suggested that checklists were not effective for evaluating the junior doctors, as they did not capture their higher level of expertise.
Global ratings versus checklists: assessment of communication skills
The OSCE has been shown to be an effective method for assessing communication and interpersonal skills. More recently, research has focused on whether global rating scales are a preferable method of marking communication skills in an OSCE.
Scheffer et al explored whether students’ communication skills could be reliably and validly assessed using a global rating scale within the framework of an OSCE. In this study, a Canadian instrument was translated into German and adapted to assess students’ communication skills during an end-of-term OSCE. Subjects were second and third year medical students at the reformed track of the Charité-Universitätsmedizin Berlin. Different groups of raters were trained to assess students’ communication skills using the global rating scale and the judgements of different groups of raters were compared to expert ratings as a defined gold standard. The examiners found it easier to distinguish between better students by using a combination of a checklist and a global rating scale. With only the checklist, examiners reported that students often earned the same score despite considerable differences in their communication skills.
Mazor et al assessed the correspondence between OSCE communication checklist scores and patients’ perceptions of communication effectiveness. Trained raters used a checklist to record the presence or absence of specific communication behaviours in one hundred encounters in a communication OSCE. Lay volunteers served as simulated patients and rated communication during each encounter. The results revealed very low correlations between the trained raters’ checklist scores and the simulated patients’ ratings, averaging about 0.25. The authors suggested that checklists are unable to capture the complex determinants of patient satisfaction with a clinician’s communication.
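To illustrate what a correlation of this kind measures, the sketch below computes a Pearson correlation coefficient between checklist scores and simulated patient ratings for a handful of encounters. The use of Pearson's coefficient is an assumption made for illustration, and the function name and data are hypothetical; they are not taken from Mazor et al.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of the products of deviations from the means (unnormalised covariance).
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list does not vary")
    return covariation / (spread_x * spread_y)


# Hypothetical data for six encounters in one case.
checklist_scores = [14, 18, 11, 16, 20, 13]   # trained rater's checklist totals
patient_ratings = [3, 3, 4, 2, 4, 3]          # simulated patient's rating, 1 to 5
print(round(pearson_r(checklist_scores, patient_ratings), 2))
```

A coefficient near 1 would mean that encounters with high checklist scores were also rated highly by the simulated patients, whereas a coefficient near 0 would mean that the checklist score says very little about how the patient experienced the communication.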
In a discussion paper, Newble concludes that “a balanced approach is probably best”, with checklists being more appropriate for assessing practical skills and global ratings more appropriate for process aspects, such as communication skills.
Analysis of literature and discussion
Are global rating scales as reliable as checklist scores?
Reliability refers to the consistency of a measure and is a proxy for objectivity. In my reflection I expressed concerns about whether global rating scales are more subjective in comparison to checklist scores and how this affected the reliability of the OSCE. In two thorough literature reviews, Van der Vleuten, Norman and De Graaff discussed and criticised this presumption  . They argued that checklists may focus on easily measured and trivial aspects of the clinical encounter, and that more subtle but critical factors in clinical performance may be neglected. They referred to such measurement as "objectified" rather than objective. My presumption was that objective or objectified measurement is superior to subjective measurement, such as global ratings, with respect to psychometric properties such as reliability. However, van der Vleuten et al reviewed the literature and concluded that "objectified methods do not inherently provide more reliable scores" and "may even provide unwanted outcomes, such as negative effects on study behaviour and triviality of content being measured"  .
All the literature that I reviewed supported the finding that global rating scales are at least as reliable as checklist scores  ,  ,  . In addition, studies show that reliability of global ratings is further improved when candidates are aware that the examination will be marked using global ratings  . Further, Regehr et al found that combining a checklist and global rating scale did not significantly improve the reliability of the global rating scale alone  . However, the results of this study are not necessarily generalisable for several reasons: the examination was only testing practical surgical skills; the research population was heterogeneous with the researchers recruiting candidates with a wide range of ability levels, whereas OSCEs are most commonly used to examine students at the same level of training; and the study only used “expert” examiners.
Research addressing this key question has other weaknesses. Many of the studies refer to global ratings that are allocated by the simulated patient rather than the examiner, which is not usually the case in the examinations at SGUL. Different schools use slightly different OSCE formats, so study results from one school or course may not be generalisable to all medical schools. At SGUL, examiners come from a variety of backgrounds and are not necessarily clinicians, whereas in some schools the standardised patient marks the candidate instead of an examiner.
There is very little research from the UK and much of the relevant literature is from the 1980s and 1990s with a paucity of recent research. This may reflect the stability of the background theory to the OSCE but it may be useful to repeat some of the previous research in light of changes to undergraduate medical curricula in the last twenty years.
The overwhelming evidence from the literature is that global rating scales are at least as reliable as checklist scores. Indeed, reliability of the examination can be improved through the use of global ratings, especially if the students are aware that this is how they will be assessed. Nonetheless, up-to-date literature regarding OSCEs is very sparse and there is a lack of good quality, large scale randomised controlled trials in the OSCE field in general. There is opportunity for more UK-based studies following the changes to undergraduate medical curricula over the past twenty years. The use of global rating scales should be a key focus of future research, in order to provide more support for the recent move of medical education institutions, including SGUL, to use global rating scales rather than checklists in OSCEs.
Do global rating scales have advantages over checklists for more experienced candidates?
Educational theory suggests that, as clinicians develop expertise, they tend to move away from applying checklist style questions to each patient and towards complex, hierarchical problem-solving skills. Therefore, whilst the checklist marking used in OSCEs may be appropriate for novices, the literature consistently shows a decline in checklist scores with increasing levels of expertise.
However, the studies do not necessarily imply that global ratings are a substantially better choice than checklists for capturing increasing levels of expertise in OSCEs. In the studies I reviewed, global ratings were only useful for discriminating between the most junior and most senior clinicians, not between different grades of junior doctors or between candidates at the same stage.
Although the results replicated those of previous studies, a study by Hodges et al in 1999 had a number of limitations, including a small number of candidates: only fourteen from each of the three grades of doctors. The study lacked reliability, as the researchers only used two stations. However, an earlier study by the same authors using eight stations yielded similar results. Another limitation was that both stations were psychiatry-specific. Although this is relevant to my experience described in this assignment, the results are not generalisable to other specialties. Further, the investigators interrupted the OSCE at two minutes in order to elicit the candidate’s working diagnosis. The candidates were aware that this was going to happen, which may have influenced their approach to the interview. Overall, the quality of the studies I reviewed was limited by the use of small sample sizes.
Given the limited amount of literature that addresses this question, it is difficult to arrive at a firm conclusion. However, the literature available does confirm my suspicion raised in the reflection that checklists may not be the best assessment tool for more experienced clinicians, which supports the move at SGUL to using global rating scales instead of checklists for the more senior years of undergraduate medical training. Further research is required to evaluate whether checklists fail to pick up differences between outstanding and average candidates who are at the same stage of training. Hodges et al also suggest additional research into the nature of questioning used by clinicians at different levels of training, with specific focus on “the types of questions asked, the sequence of questions, and the degree to which the questions reflect the formation of a diagnostic hypothesis”, in order to ensure that the most appropriate assessment tools are being employed at each stage of training.
Are global ratings a better method of assessing communication skills than checklists?
As discussed in the reflection, I often feel that checklists are not satisfactory for assessing a candidate’s communication skills. I am clearly not the only individual with these concerns, as some recent OSCE research has focused on the best marking strategy when assessing communication skills in the OSCE. Scheffer et al  found that checklists alone are not sufficient to distinguish between students’ communication skills. However, this study is not necessarily generalisable to the OSCE at SGUL, as it was conducted at a German medical school with a six-year problem-based learning curriculum, which is very distinct from the four and five year courses that SGUL offers.
The results of a study by Mazor et al suggested that checklists are unable to capture the complex determinants of patient satisfaction with a clinician’s communication. This study was limited by the relatively small number of encounters rated per case, which may explain the low to zero correlations between checklist score and patient perception for some cases. The authors acknowledge that “this small number of encounters per case reduced the power of the statistical tests of the correlations between the OSCE score and the patients’ perceptions of communication”.
As with other aspects of OSCE research, there are few studies examining this question. The studies available are not UK based and have other limitations. However, based on the limited evidence and my own experience of OSCEs, I agree with Newble’s conclusion that “a balanced approach is probably best”  with checklists being more appropriate for practical skills stations and global ratings more appropriate for communication skills stations. Future research could include videotaping OSCE stations in order to analyse intra-rater reliability and validity of the different marking strategies.
Proposals for future practice
I chose OSCEs as the focus of this assignment because I wanted to gain a better understanding of the background evidence for the change from using checklist scores to global rating scales at SGUL. By encouraging reflection and review of the literature, this assignment has allowed me to critically appraise the use of global rating scales in OSCEs and my approach to them. In my reflection I expressed concerns about global rating scales introducing subjectivity into the examination. However, I also suggested possible advantages of global rating scales over checklists, including better assessment of communication skills and of more experienced clinicians. Through review of the literature, I have been able to allay my concerns about the objectivity, reliability and validity of global rating scales. The literature also confirms my thoughts about the advantages of this form of assessment. Whilst I appreciate that global rating scales are by no means perfect, I am now a lot clearer about why they are used. I feel satisfied that I have a better knowledge of the advantages of global ratings, making me less anxious about using them.
This has been a particularly timely exercise, as it coincides with the introduction of global rating scales in OSCEs at SGUL. The knowledge that I have gained will be invaluable when helping students to prepare for their examinations. The students are used to being assessed by checklists and they will need to learn to adapt their behaviour to perform optimally when assessed by global rating scales. Until now, much OSCE preparation has focused on questions that may be included in the checklist. With the introduction of global rating scales, I will be advising students to give a lot more consideration to communication skills and their overall approach to the task in each station, rather than firing off a list of questions. A positive point is that the literature has shown that students are good at adapting their behaviour during the examination according to the method of evaluation being used.
In terms of other proposals for future practice, I need to ensure that I prepare thoroughly prior to each OSCE that I examine. It will be paramount that I read the station before the day of the OSCE and be clear in my mind what is expected of the candidates, as I will not be able to rely on the checklist on the day. Through review of the theory and evidence behind OSCE marking schemes, I realise that, as an examiner, I need to be clear about what standard is expected of the student in advance of the examination. In the experience I described in this assignment, I received such information a couple of days prior to the examination, which was useful for preparing as an examiner. This preparation may help to reduce my concerns about inconsistency that I described in my reflection, especially with the first few candidates that pass through my station.
As well as examining OSCEs, I am also occasionally asked to write OSCE stations for SGUL. Therefore, an additional benefit of reviewing and analysing the literature on global rating scales is that it will assist me when developing the new style global rating scale OSCEs.
The key message that I take away from this experience is that there is good evidence to support the use of global rating scales in OSCEs and in some instances the use of such rating scales instead of or as well as checklists can improve the psychometric properties of the examination. The literature suggests that global rating scales are better at identifying a more mature and experienced way of solving problems, which supports the change to this method of assessment in OSCEs in the more senior years of undergraduate medicine at SGUL. However, no assessment method is perfect and some research maintains that checklists are a preferable assessment method for practical tasks.