The evaluation of a curriculum of study has been an integral part of the teaching process for students of all ages and in all educational contexts since the 1960s (Cronbach, 1963). Whether the curriculum is a short in-service, a month-long training course, or a year-long subject, instructors usually evaluate their units of study in order to confirm their learning objectives, provide quality assurance and improve courses. At the higher education level, it is taken for granted that best practice dictates distribution of a standard evaluation survey (SES) to all participants at the end of a course of study. Most instructors hand out the SES, which queries participants about their levels of satisfaction, increases in skills and knowledge, and possible improvements for future courses. The overwhelming majority of instructors do not customize SESs to suit each course they teach, but rely on mandatory institutional forms. The enterprise agreements at many Australian universities and colleges require teachers to use the standard SES, and this policy does not allow any change in wording or addition of items to augment evaluation. It does not prevent individual teachers from using their own evaluation procedures, but it discourages such initiatives because the institution records only "official" SES scores.
Typically the SESs comprise a one-page survey of Likert-style questions that are supposedly relevant to the perceptions and experiences of all students across the institution. The questions cover selected aspects of courses and teaching and are necessarily general in order to cover a range of disciplines including medicine, science, law, engineering, business, arts, computer science, education, nursing, forensic science and environmental studies. Students range from first-year undergraduates through to postgraduate levels. At some universities students are asked to complete two separate SESs, one examining features of the course design and another examining aspects of teaching.
From this scenario, two questions arise: 1. Is the SES a useful method to evaluate student opinions on a range of teaching styles and curricula? 2. Can an institution make accurate comparisons between blended/traditional approaches, teachers, courses, and schools?
The Rise of the Course Evaluation
There is very little information about the earliest beginnings of course evaluations. The start of the 20th century saw the rise of scientific management in industry, and a flow-on effect into education. Individual personnel evaluations were encouraged in business and government, while the use of testing, marks and grade averages gradually gained popularity in most schools. At the higher education level in the USA, course evaluations began in the 1920s at the universities of Wisconsin and Harvard. A few other universities followed suit, but it was not until the 1960s that such surveys became widespread and were used by management for tenure and promotion decisions. The situation was rather different in Australia, however - anecdotally, the author's experience as a student at a large Sydney university was that course evaluation surveys were never used in the Arts Faculty in the 1970s or even the early 1980s. Compulsory course evaluation at many Australian universities and colleges has only been instituted in the last ten years. Before then, using an SES was a matter of choice for an instructor.
The use of online surveys for course evaluation has become increasingly commonplace since the beginning of the Web in the 1990s. Dommeyer et al. (2004) compared paper-based and online teaching evaluations and found a lower return rate online, which rose to match the paper-based rate when incentives were offered; the medium made no difference to mean item scores. Such findings may not be relevant now, given the popularity of the Web and social networks such as Facebook and MySpace, which has led to students being completely comfortable with using a computer for most tasks. As courses have moved online so have their accompanying evaluation surveys, with some research indicating little difference between the two media for evaluation purposes (Norris & Conn, 2005). In fact, some evidence points to relative advantages of online evaluation, with assured anonymity, ease of use, and greater accessibility valued over the limitations of traditional paper surveys (Nevo, McClean & Nevo, 2010).
An under-researched issue is course evaluation exhaustion, where students are over-surveyed at the end of semester (see Rogelberg & Stanton, 2007; Baruch & Holton, 2008). At some institutions, each student who voluntarily participates answers the same survey four times at the end of every semester. Over a three-year degree a student may answer the same survey 24 times. Given these frequencies, practice effects (from ticking the same boxes and writing the same comments) may yield unreliable results. An alternative to the official survey is the dedicated website that allows students to rate their instructors and courses and then presents the findings for public perusal. Such sites can include ratings going back several years. RateMyProfessors.com is one such site; it boasts ratings of over one million teachers and has been gaining in popularity, mainly with American students, for over a decade.
Brown, Baillie & Fraser (2009) compared the official SESs of one group of students with their website ratings on RateMyProfessors.com. They found a high correlation for most of the survey item content, but students tended to rate their instructors lower on the anonymous website. When questioned, students said that they found the website ratings to be fair, honest and representative of the instructor's abilities. An important variable, easiness of the course, was much more highly related to course quality on the website than in official evaluations of teaching.
Problems with Student Evaluation Surveys
Although their use is widespread, SESs have been the subject of research and controversy (see Kulik et al., 2001; Olivares, 2003). The main criticism of SESs is that they are susceptible to bias from "grading leniency" on the part of instructors (Eiszler, 2002). In an in-depth, longitudinal study, Langbein (2008) looked at student data from all courses over a four-year period at an American university and found that student grades had a significant positive effect on SES scores. Langbein concludes that if teacher salaries are linked to positive student evaluations, then teaching staff, management and students are involved in a game of deception, where SES scores and student grades are specious indications of teaching quality and authentic course evaluation is totally missing from the equation.
It should be noted that, just as salary rises may be a stimulus for awarding high grades, student choice of elective classes with small numbers and student-directed curricula can be equally responsible for higher SES scores. Such small classes may justifiably result in higher grades (and ratings) for very motivated students studying in model circumstances with specialist instructors. Thus, comparisons of large compulsory first-year courses with small elective third-year courses may be like comparing apples with oranges. Olivares (2003) even argues that the widespread use of SESs has placed academic control and management in the hands of students, resulting in consumer-based programs of instruction and ultimately leading to impoverished academic freedom and decreases in student learning.
Darby (2007) studied SESs from 23 classes of 25 teachers enrolled in a course on child abuse. She found that course evaluations do not necessarily measure course effectiveness but rather globally held student preferences about the nature of the course and its activities. It would follow that relatively interesting courses may be regarded more highly than relatively boring or challenging courses. Comparing two such courses on typical sets of measures will usually lead to the interesting (easy) course being awarded higher ratings.
Case Study: Communication Research
The author teaches a first-year course in Communication Research at a large metropolitan Australian university. The course is compulsory, attracting an enrolment of more than 300 students from Communication, Design, Education and Law majors. The course has been taught for the past six years using a blended approach that combines teacher-led instruction, group discussion, and online resources including Web-based modules, assignments and innovative Flash-based statistics tutorials, which teach the rudiments of the statistics program SPSS. The unit is challenging - students must devise a research project, submit ethics clearance, write a survey, analyse the results with SPSS and write a research report.
Typical SES feedback has always shown lower than average scores compared with other units taught within the school and the university. As such, the author has been asked to explain the low feedback in order to justify the future of the unit. Comments are mixed, ranging from "Great course" to "Learned nothing". Thus, the author has developed an alternative instrument to measure teaching and learning effectiveness for the unit, Communication Research, which extends the Student Feedback processes at the University of Western Sydney.
The extra evaluation was a classic pretest/posttest design with one exception - the pretest was distributed halfway through the course, making it a retrospective pretest. Retrospective pretests have been shown to be an appropriate method for assessing changes in skills and knowledge (Sullivan & Haley, 2009) but also yield significantly lower (and appropriate) scores compared with traditional pretests and posttests (Taylor, Russ-Eft & Taylor, 2009). Pratt, McGuigan & Katzev (2000) argue that traditional pretests suffer from several limiting factors, which can be overcome using the retrospective method: students must be physically present at both the beginning and the end of the course; at the beginning of a course, time may be better spent on content presentation than on evaluation; and student opinion may be masked by students overestimating their abilities before they understand their deficiencies - the "response shift" bias.
Response-shift bias can be detrimental to evaluation insofar as students who have undertaken a course may change their frame of reference as a result, becoming changed human beings. Comparing changed individuals with their former selves is not a valid comparison since attitudes, skills and knowledge have evolved at different rates. The retrospective pretest avoids the bias by dealing only with the current individuals, asking them to comment on their previous abilities.
The 30 items for the online surveys were constructed to test the course aims in a highly detailed manner. The surveys consisted of three demographic questions (Gender, Age and Degree), followed by 19 Likert scales related to perceived research skills and knowledge, then 8 scale questions related to features of the unit itself such as teaching, assessments, and course relevance. The skills/knowledge items were not vague, general items such as "The course met the objectives?" but comprised a detailed breakdown of course content and objectives. Items such as Library database skill level, Excel knowledge, Understanding of ANOVA, and Ability to write method section were employed. The questions were then phrased as 7-point bipolar scales in order to measure student opinions accurately.
The initial survey was placed online in Week 8 for the 303 enrolled students, and the final survey was placed online in the last week, Week 13 for the 272 students who were regular attendees of the course. The two surveys contained identical questions except the initial survey referred to students' perceived skills, abilities and opinions in Week 1, and the final survey referred to the same skills, abilities and opinions in Week 13.
Of the 303 enrolled students, 206 completed the initial survey and 171 completed the final survey. Females outnumbered males two to one. Most students were aged 18 to 24. After attrition, the total number of eligible students who completed the unit was 272. Thus, the return rate was 67% for the initial survey and 63% for the final survey.
The main aim of the pretest/posttest surveys was to measure the amount of shift in students' estimations of their knowledge and skills from the beginning to the end of the course. This is rarely, if ever, done with SES evaluations. Given typical problems with normality and homogeneity of the data, a Wilcoxon signed-rank test was performed on all of the scale items, comparing each item from the initial survey with the same item from the final survey. Table 1 shows the z scores and significance levels for the 19 skills/knowledge items. The majority of the items are very strongly significant at the 95% level of confidence. This is a compelling result showing that the learning aims of the course had been fulfilled. The one test item in Table 1 that changed least, but was still significant (Understanding the Mean, p = 0.05), has probably been covered previously in high school Mathematics classes.
Table 1. Results of Wilcoxon signed-rank test for Skills and Knowledge (alpha = 0.05)
Writing Literature Review
Understanding Scientific Report
Understanding Research Ethics
Understanding Newspaper reports
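The item-by-item comparison described above can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the analysis run for the study: the ratings are invented, the function names `average_ranks` and `wilcoxon_signed_rank` are hypothetical, and the sketch assumes that each student's initial and final responses can be matched, a step the text does not detail. The z score uses the normal approximation without a correction for ties.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_signed_rank(before, after):
    """Return (W+, W-, z) for paired ratings; zero differences are dropped."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w  # normal approximation, no tie correction
    return w_plus, w_minus, z

# Invented 7-point ratings for one item (e.g. perceived SPSS skill).
week_1 = [2, 3, 1, 4, 2, 3, 2, 1]   # retrospective pretest
week_13 = [5, 6, 4, 4, 6, 5, 3, 2]  # posttest
w_plus, w_minus, z = wilcoxon_signed_rank(week_1, week_13)
print(f"W+ = {w_plus}, W- = {w_minus}, z = {z:.2f}")
```

A large positive z indicates that ratings rose between the two surveys. With small samples or many ties, an exact or tie-corrected procedure from a statistics package would be more appropriate than this approximation.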
It is interesting to note that students' liking of statistics increased over the duration of the unit. This is probably due to their understanding of SPSS and other aspects of the unit also increasing. Table 2 also shows that students' liking of the course increased over the duration of the semester. However, they found the course consistently difficult and many could not see the relevance of research to their chosen careers, even though most were first-year undergraduates and had not selected their degree majors. The course relevance item was included because previous SES comments had questioned why Communication students needed to know the intricacies of research and report writing.
Table 2. Results of Wilcoxon signed-rank test for opinions of course (alpha = 0.05)
Ratings of course Item
Relevance of the course to career
Difficulty of course
Like or dislike course
In order to ascertain possible gender bias, a Mann-Whitney U test (alpha = 0.05) was performed on the two surveys using Gender as the grouping variable. On the initial survey, males rated themselves significantly higher on Database skills, Selecting articles, and Authoring surveys. These gender differences disappeared in the final survey, to be replaced by differences on Statistics Knowledge, Excel skills, SPSS skills, Writing result, Understanding ANOVA, Understanding T-test, and Writing the Discussion. It seems that males are more confident in terms of the statistics and report writing components of the course. A gender comparison of class grades does not confirm this apparent masculine confidence, with more than twice as many females as males scoring high grades.
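The gender comparison can be sketched in the same spirit. The code below is a minimal illustration with invented ratings, and the function name `mann_whitney_u` is hypothetical. It counts, for every male/female pair, which rating is higher (ties count as half), which is one standard way of obtaining the U statistic; the z score again uses the normal approximation without a tie correction.

```python
import math

def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b, z) for two independent groups of ratings."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0   # a outranks b
            elif a == b:
                u_a += 0.5   # a tie counts as half a win
    n_a, n_b = len(group_a), len(group_b)
    u_b = n_a * n_b - u_a
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u  # normal approximation, no tie correction
    return u_a, u_b, z

# Invented 7-point self-ratings on one item, grouped by gender.
males = [5, 6, 7]
females = [3, 4, 5, 4]
u_m, u_f, z = mann_whitney_u(males, females)
print(f"U(males) = {u_m}, U(females) = {u_f}, z = {z:.2f}")
```

A U value close to the product of the two group sizes means that nearly every member of that group out-rated the members of the other group; a value near half that product suggests no difference between the groups.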
Analysing the final survey alone, a series of one-way ANOVA tests (with alpha lowered to 0.01 due to normality violations) was performed with the four key predictors of Degree, Relevance of course to career, Difficulty of course, and Like/dislike the course. The degree that students selected as their major was found to be only partially useful in predicting evaluation responses; however, the remaining three items were all found to be strongly predictive of nearly all of the other evaluation items.
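As an illustration of how a one-way ANOVA treats one survey item as a grouping factor for another, the sketch below computes the F ratio from its definition. The data are invented, the grouping into low, medium and high perceived relevance is an assumption made only for the example, and the function name `one_way_anova_f` is hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Invented ratings of one evaluation item, grouped by how relevant
# students judged the course to be for their careers.
low_relevance = [1, 2, 3]
medium_relevance = [2, 3, 4]
high_relevance = [5, 6, 7]
f_ratio, df_b, df_w = one_way_anova_f(
    [low_relevance, medium_relevance, high_relevance])
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}")
```

The larger the F ratio, the more the group means differ relative to the spread within each group. Converting F to a significance level requires the F distribution, which a statistics package such as SPSS supplies.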
The students' degree or major was found to be a significant predictor of Excel Skills, Liked Flash tutorials, Relevance of course, Difficulty of course, Assessments and Teaching, with Education (N=2) and Journalism (N=40) students scoring the highest means. In fact, the Education students were the most appreciative overall of the range of students evaluated. Education students saw that the course was highly relevant to their careers. The most unappreciative students were the Design students (N=5) who did not see any relevance of research to their future roles in the design industry. In terms of difficulty, Law (N=11) and Journalism students rated the course more difficult than other students.
The item, Relevance of the course to career, was the strongest predictor of students' opinions, significantly affecting all of the other items in the survey. The higher the perceived course relevance, the higher the rating on each of the other evaluation items. Significance levels were predominantly 0.00, indicating a strong predictive effect of course relevance.
The Difficulty of the course was also a strong predictor of the other evaluation items (except Understood Newspaper Reports). The higher the perceived degree of difficulty, the lower the rating of the other evaluation items. Most of the significance levels were 0.00. An overlooked aspect of students finding a course difficult is the lack of student appreciation of what is learned. The maxim that we learn through our mistakes is probably true, but we never appreciate this learning while we are making mistakes; instead we suffer low self-esteem and hardship. Students had many problems to overcome during the course. It may be low self-esteem that is being reflected in the course evaluations.
The item, Like/Dislike the Course, was also a significant predictor for every other survey item (except Understood Newspaper Reports). The more a student disliked the course, the lower the rating on the other evaluation items; the higher the liking, the higher the rating of the other items. Most of the significance levels were 0.00, certainly depicting a strong relationship between disliking the course and low ratings on almost every other aspect. The present study agrees with other research such as Darby (2007), who found similar results with adult learners.
It is remarkable that the item, Understood Newspaper Reports, should be unaffected by whether students liked the course or found it difficult. The material presented in class was an analysis of how journalists sometimes accept scientific studies at face value alone. The critical nature of the classwork seemed to be valued by most students, who probably appreciated the applied nature of social science research. This part of the course may have especially appealed to journalism students, and students with a journalistic flair or interest, i.e. most students.
Pearson product-moment correlation coefficients were calculated for the three major predictors above, yielding Table 3 below.
Table 3. Pearson product-moment correlations for the three predictors
Difficulty of course
Relevance of course to career
Difficulty of course
r (Sig level)
Like or dislike course
r (Sig level)
Relevance of course to career
r (Sig level)
* Correlation is significant at the 0.01 level (2-tailed).
The three main predictors are highly correlated with each other, with Relevance of course to career strongly related to Like or dislike the course (r = .75). On logical grounds both Relevance of the course to career and Difficulty of the course would tend to be the main reasons why the course is disliked. This in turn is probably also responsible for poor course ratings on the SES.
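For readers unfamiliar with the statistic, a Pearson product-moment correlation between two predictor items can be computed as sketched below. The ratings are invented and do not reproduce the study's data or its reported r of .75; the function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Invented 7-point ratings from seven students on two items.
relevance_to_career = [1, 2, 3, 4, 5, 6, 7]
like_or_dislike = [2, 1, 4, 3, 6, 5, 7]
r = pearson_r(relevance_to_career, like_or_dislike)
print(f"r = {r:.2f}")
```

Values of r near +1 mean that students who rate one item highly tend to rate the other highly as well, which is the pattern reported for relevance and liking.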
The two main questions asked at the beginning of this paper were: 1. Is the SES a useful method to evaluate student opinions on a range of teaching styles and curricula? and 2. Can an institution make accurate comparisons between blended/traditional approaches, teachers, courses, and schools? The rest of the paper will discuss these two questions in light of the current results and other research findings.
Is the SES a useful method of evaluation?
The SES asked 13 questions related to content, relevance, learning design, assessment, feedback, guidelines, resources, flexibility, location, workload, fairness, skills, and overall experience. The only item where the course fared close to the School and university mean was fairness. For the past 4 years, every other mean was well below the School and university average. If promotion or salary were linked to these scores, the coordinator should be looking for a new job - the SES evaluation scores are not improving.
Typically, the SES reports show just one criterion - the mean of each of the evaluation items. This is usually the only reported statistic comparing courses and schools. A range of known statistical problems arises here. The mean should only be used where responses are shown to be normally distributed, but answers to survey items are rarely normal. The mean is also sensitive to extreme scores, and is misleading for distributions where half the students are positive and half are negative. Standard deviations of the individual questions are always given in scholarly research, but seldom in SES results. Thus, the mean is not a reliable indicator for survey items, with the median and the mode often used as substitutes in the literature. The SES process generates a great deal of data, which could be appropriately analysed if staff were given access to the raw data. However, management generally withholds the data for comparison purposes, circumventing any attempts at genuine educational research.
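The weakness of a mean reported on its own can be illustrated with invented data. In the sketch below, the three sets of 7-point ratings and the function name `summarise` are hypothetical; the code uses only Python's standard `statistics` module.

```python
from statistics import mean, median, multimode, pstdev

def summarise(scores):
    """Summary statistics that an SES report could show beside the mean."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "modes": multimode(scores),  # every most-frequent rating
        "sd": pstdev(scores),        # population standard deviation
    }

# Invented 7-point ratings for a single item.
polarised = [1] * 10 + [7] * 10  # half the class negative, half positive
consensus = [4] * 20             # everyone neutral
skewed = [6] * 8 + [1] * 2       # mostly positive, two extreme low scores

for label, scores in [("polarised", polarised),
                      ("consensus", consensus),
                      ("skewed", skewed)]:
    print(label, summarise(scores))
```

The polarised and consensus classes share a mean of 4, yet one has two modes (1 and 7) and a standard deviation of 3 while the other has no spread at all. In the skewed class two extreme ratings pull the mean down to 5 although the median remains 6.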
What lies behind students' low ratings? The wording of the SES questions is often generic (e.g. "What were the resources like?") and sometimes vague (e.g. "What was your opinion of the learning design?"). When confronted with such items students may tend to rely on their feelings towards the course - do they like the course, was it difficult, was it useful? Once students respond to vague or generic questions with emotionally driven answers, a pattern is set for the rest of the items. It is only by asking specific, unexpected questions that one can obtain a considered response. But all the SES items are general and predictable since they have been worded to apply to all students in all courses across the institution, and the SESs are given dozens of times over the students' usual three-year degree. Instructors could be permitted to add questions to SES forms, thus customising the survey for course levels, styles of teaching and objectives. This used to be the case at some universities ten years ago when SESs were not mandatory official documents.
What then is the SES measuring? The answer is probably course difficulty, relevance and whether the course is liked, or not. These related opinions of students, while certainly important, do not allow a true picture of the course effectiveness to emerge. The SES evaluation gives a spurious account of real student learning and instead presents a series of questions leading to a management sponsored "Top Ten" course popularity poll.
Can an institution make accurate comparisons?
The simple answer is no. Taking into account the varying course levels, disciplines, teaching styles, blendedness, motivation, course difficulty and size of student cohort, any across-the-board comparisons are destined to also measure these extraneous factors to a greater or lesser degree. A much fairer comparison would be to compare all first-year, or all second-year, or all third-year courses within the same school or department. Institution-wide comparisons are a scientific management activity of the kind that began with Taylor at the beginning of the 20th century. A hundred years after Taylor, we still have a situation where management is setting SES targets for programs, courses and schools, without any questioning of the validity or reliability of the methods of evaluation.
There are several possible alternatives for measuring course effectiveness more validly. All of these alternatives take more time and energy than simply distributing a generic survey.
Pretest/posttest evaluations necessitate careful wording of the surveys to ensure that curricula and objectives are fully evaluated. The course coordinators and management need to work together, especially if online surveys are to be implemented. Survey results need to be analysed fairly, equitably and beyond simple descriptive statistics. Such a process is certainly possible within most institutions, but it demands a much higher commitment to evaluation than is currently being shown by staff and management alike. Instead, we have generic testing of student likes and dislikes, plus a measure of course hardship and debatable relevance.
The Communication Research pretest/posttest evaluation showed that significant learning had occurred in all of the intended skills and knowledge areas. Teaching staff have concurred that students make enormous gains in their research writing and analytical abilities. However, these learning milestones are not recognized by students unless one asks the right questions. There appears to be a huge cost to making extensive (rather than modest) gains in learning. Students must progress from being ignorant, then bewildered and challenged, to finally mastering the course material. This process throws students in the deep end, where they feel unfamiliar and uncomfortable. While they eventually survive, their recollections of the process are not conducive to high SES ratings.
Wright (2006) suggests that students not be allowed to submit anonymous SESs. While anonymity protects students who give low SES scores against staff reprisals, it may also lead to little or no responsibility being taken for student responses. With no possibility of follow-up, students may not consider their responses carefully. They may not take appropriate amounts of time and may rush through the questions without valid reasons for their ratings. Evaluations may be based on a recent low grade, or a single negative class experience with a staff member or activity. Wright makes the point that SES anonymity shows that staff are deemed less trustworthy than students. Students could easily be asked to identify themselves (especially in online evaluations), with their identities hidden from teaching staff. This may ensure a more valid take-up of SESs.
A more recent intervention is the use of classroom response systems such as KeePad technology to survey student opinion and retention using handheld keypads and wireless technology, usually employed in mass lectures (Sawdon, 2009; Liao, Chen & Tai, 2009). While classroom systems have been applauded for their interactivity and their ability to maintain student interest and enhance retention, the technology could also be used to evaluate lectures and workshops immediately after they occur, rather than at the end of semester when student memories may have faded. The KeePad technology also has the inbuilt ability to identify students and their ratings.
Curriculum- and assessment-based evaluation has been a part of education for as long as SESs have existed. Informal student ratings, class visits, collegial opinions, student performance, syllabi assessment, long-term follow-up of students, and graduate surveys have all been used by instructors and management in the past (Greenwood & Ramagli, 1980). Any of these methods could provide extra support to SES evaluations, but most institutions have opted for the easy solution.
As we are all aware, doing primary research is a daunting task, but for a first-year undergraduate undertaking their first primary research project, it is like speaking another language. Primary research is challenging at first-year, second-year, third-year or postgraduate level if it has never been attempted before. The students of Communication Research produce a research report as their main assessment item. Students must perform at an adequate academic level, summarise the research literature, design a survey, analyse their results and write a report. The fact that students can submit these reports is a prodigious demonstration of the skills and understanding gained throughout the entire course. From a total novice base, the course produces researchers who have struggled with locating literature, statistics, report writing conventions, citation style, tables and graphs. Is this not what real learning is all about?