# Using Assessment To Raise Achievement Education Essay

## Introduction

Pupil assessment plays a central role in modern schooling systems, informing teaching and learning, and facilitating school leadership and governance. Few would question the value of ongoing formative assessment, but the efficacy of summative, 'high-stakes', snapshot testing is often called into question, because of the pressure it places on teachers and pupils, and because of questions about its reliability when there are incentives for schools to 'teach to the test'. The system of national curriculum testing in England is frequently criticised along these lines, often alongside arguments for a move towards the greater use of continuous teacher-based assessment (Brooks and Tough 2006).

However, teacher assessment and test-based assessment each have their own merits. Blind-marked tests may be 'objective' in that they are marked by examiners who do not know the candidates or their demographic characteristics. But these tests can only evaluate performance on a very limited range of questions on a specific day, can favour technique over underlying ability, and may favour particular cultural backgrounds or one gender. Conversely, teacher assessments are usually based on observation of ability on a wider range of tasks over a longer time horizon, but may be sensitive to personal and subjective preferences on the part of the teacher, or the specific relationship and interaction between pupil and teacher.

In particular, any non-blind assessment could be subject to some form of statistical discrimination or stereotyping, whereby individual judgements are made on the basis of what is expected of pupils of a similar type. A substantial literature in the fields of social psychology, education and economics has considered questions related to stereotyping and its implications for assessment (for example, see Wright and Taylor 2007, Steele and Aronson 1995, Gipps and Murphy 1994, Reeves et al 2001, Lavy 2004, Dee 2005a, 2005b, Ammermueller and Dolton 2006 and Ouazad 2007). This previous research indicates that teachers' assessments of pupils' academic abilities can differ from pupils' achievements in tests and exams in ways that are systematically related to ability, demographic and socioeconomic background. This finding is worrying, since it implies that some groups can be educationally disadvantaged simply by the type of assessment to which they are exposed. Indeed, the main motivation for being concerned about this assessment divergence is, presumably, that it may have real consequences for children's subsequent outcomes. However, as far as we know, no previous research has attempted to answer this rather important question.

The issue is important, even if teacher-based assessments are not used for high-stakes summative assessment, because teachers guide pupils in curriculum choices and exam entries and because pupils' (and their parents') motivation may respond in more subtle ways to views of their own ability. Consequently, any divergence between teacher perceptions and test-based measures of achievement along lines of gender, ethnicity and social class, could offer at least a partial explanation for attainment gaps and differences in higher education participation patterns between these pupil groups (e.g. DfES 2003, Conner et al 2003, 2004 for England).

## Methods

The first part of the analysis is based on simple linear regression models, estimated by ordinary least squares, in which we regress the difference between teacher-based and test-based assessments on a set of observable pupil and school characteristics. This specification is equivalent to that set out in Lavy (2004) for estimation of gender biases in assessment systems in Israel. It eliminates fixed pupil characteristics that have identical effects on both test and teacher-based scores, and highlights characteristics that have differential effects on these scores. In our case, we consider the assessments of pupils' English, science and maths ability at age 14 in England - referred to as the Key Stage 3 assessments.
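The logic of differencing can be illustrated with a minimal sketch. The data below are invented for illustration, and the helper `ols_simple` and the variable `in_group` are hypothetical names, not part of the analysis reported here. Each toy pupil has a fixed 'ability' that raises teacher and test scores equally, so it drops out of the gap, and regressing the gap on a single 0-1 characteristic recovers only the differential effect.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Invented toy data: 'ability' affects both scores identically.
ability = [21, 27, 33, 27, 21, 33]
in_group = [0, 0, 0, 1, 1, 1]  # hypothetical 0-1 pupil characteristic

# Assumed differential effects: +1 point on teacher scores, -1 on test scores.
teacher = [a + 1 * g for a, g in zip(ability, in_group)]
test = [a - 1 * g for a, g in zip(ability, in_group)]

# The dependent variable is the teacher-test gap; ability cancels out.
gap = [t - e for t, e in zip(teacher, test)]

intercept, slope = ols_simple(in_group, gap)
print(intercept, slope)
```

With a single 0-1 regressor the slope is simply the difference in mean gap between the two groups. The models in this essay include many characteristics at once, which requires multiple regression rather than this one-variable version.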

An important consideration is to what extent the teacher-test gap in scores varies across the distribution of prior achievement levels, or 'ability'. This question is important in its own right, because a systematic trend in the gap between high ability and low ability children could suggest some structural problems with the assessment system. It is also important because groups differ in terms of their average achievement, so it is easy to confuse a systematic divergence between test and teacher assessment for a particular group (low income, for example) with a systematic divergence along lines of average ability. Our approach to measuring prior achievement - in a way that is not biased in favour of teacher or test based assessment - is to use the average of teacher and test-based assessment scores from pupils' age-11 assessments (referred to as Key Stage 2 assessments), which children undertake at the end of primary school. By including the mean of teacher and test-based prior assessments as an explanatory variable in our models of the teacher-test gap in assessment, we can examine how the gap varies with both observable pupil characteristics and levels of prior achievement. Note that in this setting, the age-11 and age-14 assessments are made by different teachers and at different phases of education.

The second objective of this paper is to consider whether the gaps between scores produced by different assessment methods influence pupils' subsequent educational attainment and the decision to stay on after compulsory schooling age. Our approach to this task is to use least-squares regression models to estimate the relationship between the teacher-test gaps on a pupil's age-14 assessments and various academic outcomes in the next phase of pupils' academic careers. These outcomes relate to qualifications (GCSE/NVQs) taken at minimum school leaving age (age 16) and to the decision to stay on at school or participate in other forms of education in the age 16-18 period. We also consider the mix of subjects in which pupils sit exams at age 16, in order to explore whether disparities in assessment in particular subjects could discourage further study in maths, science or English. All these outcomes are important factors in the subsequent decision to participate in higher education, and the type of higher education undertaken.

## Data and Context

Compulsory education in state schools in England is organised into five "Key Stages". The Primary phase, from ages 4-11, spans the Foundation Stage to Key Stage 2. At the end of Key Stage 2, when pupils are 10-11, children leave the Primary phase and go on to Secondary school, where they progress through to Key Stage 3 at age 14, and to Key Stage 4 at age 16. At the end of each Key Stage prior to age 16, pupils are assessed on the basis of standard national tests, and at age 16 pupils sit GCSEs (academic) and/or NVQ (vocational) tests in a range of subjects. After compulsory schooling ends at age 16, pupils can continue their education in school or at college, or sometimes in the workplace, studying for academic and/or vocational qualifications. Those who gain suitable qualifications can, at age 18-19, enrol in Higher Education, usually at a university.

The UK's Department for Children, Schools and Families (DCSF) collects a variety of data on state-school pupils centrally, because the pupil assessment system is used to publish school performance tables and because information on pupil numbers and characteristics is necessary for administrative purposes - in particular to determine funding. A National Pupil Database has existed since 1996, holding information on each pupil's assessment record in the Key Stage Assessments throughout their school career. Assessments at Key Stages 2 and 3 (ages 11 and 14) include a test-based component and a teacher assessment component for three core curriculum areas: maths, science and English. As set out in the statutory information and guidance on Key Stage 3 assessments: "The tests give a standard snapshot of attainment in English, mathematics and science at the end of the key stage. Teacher assessment covers the full range and scope of the programmes of study. It takes into account evidence of achievement in a variety of contexts, including discussion and observation" (QCA 2004).

Importantly, the tests and teacher assessments are intended to measure ability, knowledge and skills along the same dimensions in the same subject areas. Since the teacher assessment is based on several measurements, we may expect the variance in teacher assessments to be lower than that in the Key Stage tests.

For each subject, the teacher assessments and tests award the pupil an achievement Level on a discrete scale ranging from Below Level 1 up to Level 5 at Key Stage 2, and up to Level 7 (8 in maths) at Key Stage 3. These Levels are converted into a points-based system, which assigns 6 points to each Level, and we work with these points in our empirical analysis.

In particular, our definition of the teacher-test assessment gap is the difference between the points awarded by the teacher in their assessment and the points awarded by examiners. Since 2002, a Pupil Level Annual Census (PLASC) has recorded information on each pupil's school, gender, age, ethnicity, language skills, any special educational needs or disabilities, entitlement to free school meals and various other pieces of information, including postcode of residence (a postcode typically covers 10-12 neighbouring addresses). PLASC is integrated with the pupil's assessment record (described above) in the National Pupil Database (NPD), giving a large and detailed dataset on pupils along with their test histories. Tracking of pupils continues after age 16 in an integrated database of age 16-18 education that is derived from PLASC, from a database called the Independent Learner Record, and from other sources.
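To make the points scale and the gap definition concrete, here is a minimal sketch. The function names are hypothetical. The offset of 3 points is an assumption inferred from the point values quoted elsewhere in this essay (21, 27 and 33 points for Levels 3, 4 and 5); the treatment of 'Below Level 1' is not described and is left out.

```python
POINTS_PER_LEVEL = 6  # stated in the text
OFFSET = 3            # assumption: inferred from Level 4 = 27 points

def level_to_points(level):
    """Convert a National Curriculum Level (1 or above) to points."""
    if level < 1:
        raise ValueError("'Below Level 1' is not covered by this sketch")
    return POINTS_PER_LEVEL * level + OFFSET

def teacher_test_gap(teacher_level, test_level):
    """Teacher points minus test points; negative means the teacher
    assessment is lower than the test result."""
    return level_to_points(teacher_level) - level_to_points(test_level)

print(level_to_points(4))      # Level 4 on the points scale
print(teacher_test_gap(4, 5))  # teacher one Level below the test
```

Under this convention, a one-Level disagreement between teacher and test always corresponds to a gap of 6 points, whatever the offset.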

From these sources we derive two extracts for use in our estimation. The first follows four cohorts of children from their Key Stage 2 assessment at age 11 to their Key Stage 3 assessment at age 14 in 2002-2005. The second follows the academic careers of three older cohorts of children from age 11 through to age 16 in 2002-2004, and then on to the point where they have made their post-age-16 educational choices. The first of these two extracts draws on pupil characteristics at age 14 as a basis for analysis of any systematic divergence between test and teacher based assessment. The second extract, recording pupil characteristics at age 16, allows us to explore whether past teacher-test assessment gaps (at age 14, Key Stage 3) influence subsequent education decisions and outcomes. Various other data sources can be merged in at school level, including institutional characteristics (from the DCSF). In both data extracts we exclude the 12% of pupils with recognised disabilities and learning difficulties who are registered as having Special Educational Needs, whether in Special schools or mainstream schools. We also focus solely on state Comprehensive schools, that is, schools that do not select pupils on the basis of academic ability, and we do not have data on pupils attending private schools. This large and complex combined data set provides us with information on around 1.4 million children aged 14 in 2002-2005, plus just over 1 million children aged 16 in 2002-2004, with those aged 14 in 2002 represented in both datasets.

## Results and Discussion

## Descriptive Statistics

Table 1 presents some simple descriptive statistics for the data set used in our analysis. As explained above, we have two core datasets, one based on cohorts of children aged 14 in 2002-2005 and another on cohorts aged 16 in 2002-2004. The first dataset, summarised in the top panel of Table 1, is used in our analysis of the associations between pupil characteristics and the gap between teacher and test assessment scores. The second dataset, used to analyse whether these assessment gaps affect subsequent outcomes, is summarised in the lower panel. The table presents means and standard deviations for the full sample, and for various subsamples.

The first three rows of the top panel give mean teacher and test scores in each subject, and the group differences in mean achievement can be seen by reading across the columns of the table. As is well known, Asian and black pupils, and pupils eligible for free meals, score below the population mean in all core subject areas; boys score below girls in English but slightly higher in maths and science. The bottom three rows of the top panel show the gap between teacher assessment points and the test-based points. A look down column (1) in the top panel shows that, on average over 2002-2005, the point scores based on teacher assessments were slightly lower than those based on tests, by up to one third of a point in mathematics and English. Looking across the columns provides insights into how these gaps vary according to our socioeconomic, ethnic and demographic groups of interest. Notable features are that ethnic minorities and those with English as an additional language score even lower on teacher assessments in English than the population as a whole, with a gap as high as 0.6 points for Asian pupils. On the other hand, the gap between teacher and test assessments in science and maths is generally larger and more negative for the population (i.e. teachers' assessments are lower than test assessments) than it is for the ethnic and socioeconomic subgroups. Boys seem to fare relatively badly in teacher assessments in maths and science and relatively well in English.

The bottom panel shows a range of age-16 and post-16 outcomes, again split by pupil subgroups. Pupils enter 9.8 exams on average at age 16, and whilst there is some variation across groups, the differences are not dramatic. There is a lot more variation across groups in terms of their relative position in the distribution of scores from these age-16 exams, and pupils entitled to free meals, black pupils and boys have relatively low attainment: the average pupil entitled to free meals is at the 37th percentile in the distribution of age-16 qualifications. On the other hand, Asian pupils and, interestingly, pupils with English as an additional language gain better qualifications than average. Post-16 participation rates follow a similar pattern, with high post-16 participation and staying-on rates for Asians and those with English as an additional language. A high proportion (85.4%) of black pupils participate in post-16 academic education, but only 30.7% do so in school. Boys score below girls in their GCSEs, and are less likely to continue in academic education, either in school or elsewhere. The subject shares do not differ widely between demographic groups, but there is considerable within-group variance.

These descriptive statistics reveal some interesting features in the data. The top panel in particular suggests that there are systematic differences between teacher and test-based assessments, and that these differences vary along ethnic, socioeconomic and gender lines. In Section 4.2 we extend this analysis using regression models to explore the separate contribution of each of these pupil characteristics, and to control for pupils' achievement levels.

The simple correlation patterns between the teacher-test points gap in various subjects and at different ages could also be informative, since we could expect quite strong correlations between gaps at age 11 and gaps at age 14 if the gaps were systematically related to observable and fixed pupil characteristics. As it turns out, the correlations between gaps at these different ages are tiny - a maximum of 0.019 - immediately suggesting that a child who is under-assessed by a teacher at age 11 is hardly more likely than any other child to be under-assessed at age 14.

However, there are indications that teacher perceptions are important: the correlation between teacher-test points gaps in different subjects is much stronger at age 11 (r ≤ 0.17) than it is at age 14 (r ≤ 0.08). An obvious explanation for this finding is that a pupil is assessed by the same teacher in all subjects at age 11, but usually by different teachers at age 14, suggesting that teacher behaviour has a role to play here, and that the gaps between teacher and test based scores are not purely random.

## Regression estimates of group divergence in teacher and test based assessments

Simple correlation patterns and descriptive statistics are, however, uninformative about the types of pupil characteristic that lead to divergence between test and teacher scores in these core subjects. For this analysis, we turn to the regression approach outlined in Section 2. Our main findings are succinctly summarised in Figure 1. Figure 1 displays the magnitude and statistical significance of the coefficients from regression models of the gap between teacher and test based assessment scores in core subjects assessed at age 13-14: English, mathematics and science. The estimation sample is based on 1.4 million Year 9 pupils (aged 13-14) in 2002-2005. The key explanatory variables for which we report the coefficients are pupil ethnic, demographic and socioeconomic characteristics, plus indicators of the mean teacher and test scores received by the pupil in the corresponding subject assessment at age 11. Our regression models also include year dummy variables (0-1 indicators), an 'unknown ethnic group' indicator, school type indicators, and the gap between teacher and test scores in the pupil's Key Stage 2 assessments at age 11.

The bar charts in Figure 1 indicate the relative bias in the different modes of assessment for each pupil group. To provide a sense of scale in relation to achievement levels, the graph shows the coefficients re-scaled in terms of standard deviations of the average teacher and test assessment score at age 14. Note that the regression specification implies that all the effects are measured relative to a baseline group of white girls with English as a first language, aged 13 years and 0 months in September at the beginning of the year, not entitled to free meals, and with a mean age-11 score of 27 (corresponding to expected achievement of Level 4 in both teacher and test assessments). For this baseline group, the gap between teacher and test based assessments is about 6% of one standard deviation of the variation in achievement points in English and maths, and effectively zero in science.

Solid shading represents coefficients that are statistically significant at the 1% level, whilst hatched shading represents insignificant coefficients. These statistical tests are based on standard errors that allow for general correlation of the regression unobservables within schools.

Note that we represent two regression specifications in each subject. The results of ordinary least squares estimation are shown by the left hand darker shaded bar in each pair of bars. The right hand bar represents the results when we take out differences between schools and allow for secondary school specific fixed effects (that is, we estimate the regressions using the deviations of the variables from the school specific means, which is equivalent to including secondary school dummy variables).
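The within-school transformation described above can be sketched as follows. The data are invented, and `slope`, `demean_within` and the variable names are hypothetical, chosen only for illustration. In this toy example, one school produces lower teacher-test gaps for everyone and also has more pupils from the group of interest, so the pooled coefficient overstates the difference between pupils within the same school.

```python
from collections import defaultdict

def slope(x, y):
    """Least-squares slope from regressing y on a single regressor x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

def demean_within(values, groups):
    """Subtract each group's mean from the values belonging to that group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

# Invented data: two schools with four pupils each.
school = ["A", "A", "A", "A", "B", "B", "B", "B"]
in_group = [0, 0, 0, 1, 0, 1, 1, 1]    # hypothetical 0-1 characteristic
gap = [1, 1, 1, 0, -1, -2, -2, -2]     # teacher points minus test points

pooled = slope(in_group, gap)          # ignores school membership
within = slope(demean_within(in_group, school),
               demean_within(gap, school))  # school fixed effects

print(pooled, within)
```

Here the pooled slope is twice the size of the within-school slope, because half of the raw difference comes from which school a pupil attends rather than from differences between pupils in the same school.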

Figure 1 can be interpreted by reading a minus sign on the coefficient as showing that the corresponding group tends to do relatively poorly in the teacher assessments and relatively well in the tests, referenced to the gap for the baseline group. The first obvious feature is that there is little difference between low income pupils (on free meals) and others in terms of the teacher-test gap in assessment points. The coefficients are small and statistically insignificant for all subjects. However, it is evident from the negative coefficients on the ethnic minority variables that ethnic minorities tend to have lower teacher assessments and higher test scores in English, when compared to white pupils (top panel of Figure 1). On the other hand, the ethnic differences are not particularly strong in mathematics and science. Pupils with English as an additional language do relatively poorly on teacher assessments in all subjects. One striking feature is that boys, compared to girls, do relatively well on teacher assessments in English, but relatively poorly in mathematics and science. The last two findings echo those in Reeves et al (2001) for age-11 assessments in 1998. Lastly, older children seem to be rated relatively better in teacher assessments than in tests, particularly in science.

All these gaps are fairly modest, at most around 5% of one standard deviation in terms of achievement levels at age 14. In contrast, there are some very substantial gaps with respect to levels of prior achievement: the last four pairs of bars in Figure 1 report the coefficients on dummy variables indicating points scored in age-11 assessments. The results on the relationship between the teacher-test gap and achievement levels are the most striking feature of this analysis: pupils scoring towards the bottom of the distribution do much better on the teacher assessments than the tests, relative to their peers at the top of the achievement distribution. As an example, the difference between the baseline group and pupils scoring 21 points (Level 3) or less on the test and teacher assessments at age 11 corresponds to around 16% of one standard deviation of the achievement levels at age 14. A comparable gap of the opposite sign can be observed for pupils scoring 33 points (Level 5) at age 11 in science.

Controlling for school-specific effects makes little difference to the coefficients on gender, age, prior achievement and free meal entitlement, but ethnic differences are far less marked. These changes imply that much, though not all, of the variation in teacher-test gaps across ethnic groups is linked to differences between schools (and hence also between teachers) rather than differences for pupils of different ethnicity in the same school (or taught by the same teacher). This result is important: the implication is that some schools generate wider gaps between teacher and test-based assessments than others, and ethnic minorities are more likely to be in schools that generate wide gaps. In particular, ethnic minorities do not score much lower on teacher assessments in English relative to tests, than do white pupils in the same school. The significant ethnic group effects we found initially in English are driven by comparison of minority pupils with white pupils in different schools taught by different teachers.

The overall impression given by these results is that teachers tend to be fairly conservative in their ratings of pupils relative to the tests, rating lower ability pupils above their test scores, and rating higher ability pupils below their test scores. This is borne out if we plot these estimated effects of demographic characteristics on assessment gaps against estimated effects of demographic characteristics on achievement, as shown in Figure 2. The figure plots the coefficients on the demographic variables reported in Figure 1 against coefficients from a regression of a pupil's mean teacher and test assessment points at age 14 on the same set of demographic characteristics. The data labels on the scatter plot show to which group each point corresponds. We also show two trend lines, fitted using a quadratic polynomial, both showing that there is a general tendency for groups with higher than average achievement to have more negative gaps between teacher and test assessments (the teacher scores are lower than the tests), whilst groups with lower than average achievement score relatively well on the teacher assessments. The relationship is particularly clear if we exclude the data points corresponding to pupils on free school meals: the resulting trend line is shown as a solid black line. Note that a pupil's entitlement or otherwise to free meals may be unobserved or imperfectly observed by teachers in secondary school, which may explain why there is no significant difference in the teacher-test assessment gap according to free meal entitlement, and why these data points are outliers in the general trend. This finding lends some support to the idea that it is observable group characteristics - ethnicity, language, gender, and prior achievement (on which teachers presumably have information) - that drive these divergent patterns in terms of teacher-test assessment gaps.

It is important to note, therefore, that these results on ethnic and gender differences are hardly consistent with a standard story of statistical discrimination, or gender or ethnic stereotyping arising from face-to-face assessments. In our case, face-to-face assessments favour groups with lower levels of achievement. Clearly, this is not a pattern we would expect to see if expected group achievement were being used to rank individual pupils.

We have also checked for interactions between pupil achievement groups and ethnic, socioeconomic and gender groups, but this analysis revealed few interesting patterns. We extended the analysis to consider whether the divergence between teacher and test assessments for an individual is related to the characteristics of the school group of which they are a member. In a few cases (free meal entitlement) the divergence seems more strongly related to group characteristics than to a pupil's own. There is also a strong tendency for black pupils to do relatively poorly in teacher assessments in English, and a strong tendency for white pupils to come out less well in teacher assessments if the school has a high proportion of black pupils. However, teacher assessment of black pupils is much more favourable relative to their test scores when the proportion of black pupils in the school is high, implying quite strong interactions between black status and the composition of the school group in determining the divergence between teacher and test based assessment in English. One possible explanation is that the teachers are drawn to make more favourable assessments of black pupils when a high proportion of the pupils in the school are black, although peer influence on the pupil could also be at work. The strongest influence on the gap between teacher and test based assessment is the level of ability or prior achievement, and the individual and group interactions in terms of achievement reveal some interesting patterns.

Whatever the level of prior achievement, teacher assessments become increasingly favourable relative to the tests when there is a higher proportion of similar-ability pupils in the school. In other words, teachers appear to grade relative to the school population.

Although it is difficult to gauge what precise mechanisms drive these findings, the results do highlight that teacher and test assessments in many cases diverge systematically according to individual characteristics and group composition. This finding is quite worrying, since it is not what would be expected from an unbiased assessment system. In particular, the discrepancy between teacher and test assessments at the top and bottom of the achievement distribution gives cause for concern. In the next section we go on to consider whether we should be especially concerned about divergence between teacher and test based assessment, in so far as it has an impact on future educational decisions and opportunities.

## Impacts on qualifications and post-compulsory education

Our central results concerning the influence of divergence in assessment on qualifications and subsequent outcomes are represented in Figure 3. Teacher assessments and test-based assessments are positively correlated (conditionally) with pupil age 16-plus outcomes, indicating that both assessments contain unique information about the pupil.

However, the best way to consider the specific impact of divergence between teacher and test assessments, and hence any influence on pupils arising from teacher perceptions, is to observe how pupil outcomes change as the gap between teacher and test scores widens, holding constant the average of the teacher and test-based assessments. Hence Figure 3 reports the coefficients from regressions of pupil age 16-plus outcomes on measures of the gap between teacher and test assessments in maths, science and English, at age 14 and 11.

We report on three different educational outcomes at age 16 and beyond: 1) a pupil's total number of GCSE/NVQ entries; 2) their percentile in the national distribution of GCSE/NVQ points (awarded on the basis of the number and grade of test results); and 3) whether the pupil is recorded as studying for any non-vocational post-16 qualification in the Independent Learner Record data set. All the results are presented for regression specifications that include controls for basic pupil characteristics (free meal entitlement, ethnicity, language, age and gender) plus dummy variables for prior achievement levels based on the sum of the teacher and test assessment point scores at age 14 and age 11. The specifications also allow for school-specific fixed effects, but the results are insensitive to the inclusion or otherwise of these fixed effects. As before, solid shading indicates coefficients that are statistically significant at the 1% level. The dataset contains around 1.1 million pupils, aged 15-16 in 2002-2004.

Again, to aid interpretation, we have standardised our coefficients in Figure 3 so that the height of the bar represents the association between a one standard deviation change in the teacher-test point gap, and the outcome measured in terms of standard deviations of the pupil distribution. Given this scaling, it is immediately clear that divergence between teacher and test assessment has very little impact on pupil age 16-plus outcomes, regardless of the fact that most of our coefficients are statistically significant.
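The exact computation behind this standardisation is not spelled out, but one common approach, assumed here, is to multiply the raw regression coefficient by the ratio of the standard deviation of the explanatory variable to that of the outcome. A minimal sketch with made-up numbers (the function name and all values are hypothetical):

```python
def standardise(raw_coef, sd_x, sd_y):
    """Convert a raw coefficient (outcome units per unit of x) into
    standard deviations of the outcome per standard deviation of x."""
    return raw_coef * sd_x / sd_y

# Made-up values, not taken from the essay's results:
raw_coef = 0.2   # outcome units per point of teacher-test gap
sd_gap = 3.0     # standard deviation of the gap, in points
sd_outcome = 6.0 # standard deviation of the outcome

beta = standardise(raw_coef, sd_gap, sd_outcome)
print(beta)
```

A standardised coefficient of this size would mean that a one standard deviation wider gap goes with an outcome about one tenth of a standard deviation higher, which makes coefficients comparable across outcomes measured in different units.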

Consider then the results for GCSE entries, represented by the first of the three bars in each group. The coefficients on the gap variables imply that, for all subjects except English, the number of GCSE entries increases with the favourability of the teacher assessments relative to the tests. This is what we might expect at age 14, since teacher expectations in secondary school could be directly influential in terms of the number of papers for which a pupil is entered. This possible direct linkage cannot, however, explain the association between the divergence in assessment in primary school at age 11 and the number of GCSE/NVQ entries. An alternative explanation is that positive teacher evaluations relative to test scores encourage pupils' academic ambitions through more subtle psychological channels. However, it needs to be re-emphasised that the effects are minute in terms of their magnitude. The scale of the coefficients implies that a one-Level (6 point) positive gap between teacher and test based assessment scores in every core subject at age 11 and age 14 is linked to an increase of about 0.07 in the expected number of GCSE/NVQ entries per pupil, that is, an increase equivalent to seven additional GCSE/NVQ entries for every 100 pupils being "over" evaluated by a full Level by teachers in every core subject at ages 11 and 14.

Although the findings on GCSE/NVQ entries might suggest that a more favourable teacher assessment engenders a positive academic attitude in pupils, this view is partially at odds with the findings on GCSE/NVQ attainment. These results are shown by the second bar in each group of three bars in Figure 3. Here we show that, whilst a positive teacher-test assessment gap at age 11 is linked to marginally higher performance overall in GCSE/NVQs, the opposite is true for divergence in assessment at age 14: at this age, it is a positive test-teacher gap that is associated with better GCSE/NVQ performance. One reading of these somewhat contradictory results is that, whilst the favourable teacher assessments at the end of primary school may encourage a positive pupil response, it is the pupil qualities that generate good test results at age 14 that are most closely linked to success in formal GCSE/NVQ exams at age 16. Whatever the explanation, the magnitudes are again very small: a one-Level excess in test based assessment over teacher assessment in all core subjects at age 14 is associated with an increase in GCSE/NVQ performance that is equivalent to a mere 1.2 percentiles of the pupil distribution of GCSE/NVQ point scores. This is mirrored by an almost identical effect of a one-Level excess of teacher assessments over test scores in all core subjects at age 11.

The findings on the association of assessment divergence with GCSE/NVQ scores are, broadly speaking, played out further in the results on the decision to stay on at school, or to pursue post-compulsory education more generally (the third bar in each group in Figure 3). A relatively good teacher assessment at age 11 is linked to higher probabilities of participation in post-compulsory education, but so too is a relatively good test performance at age 14. As before, the implied effects on the probability of post-school participation (and hence Higher Education participation in subsequent years) are very small indeed. According to these models, a pupil who received a full one-Level excess teacher assessment at age 11 in all core subjects has a 1.13 percentage point higher probability of staying on at school relative to another pupil in the same school receiving equal teacher and test assessments (an increase of 3.24% relative to the mean staying-on rate of 34.92%). Although this effect is not negligible, a divergence of assessment on this scale is far outside anything observed in the actual data.
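The conversion from a percentage point effect to a relative increase can be checked with a short calculation. The sketch below is purely illustrative: the function name `relative_increase` is our own and is not part of the original analysis; only the two input figures (1.13 percentage points and a mean staying-on rate of 34.92%) are taken from the text above.

```python
def relative_increase(pp_change: float, baseline_rate: float) -> float:
    """Express a percentage-point change as a percentage of a baseline rate.

    Both arguments are in percent units (e.g. 34.92 means 34.92%).
    The helper name is hypothetical, introduced only for this illustration.
    """
    return 100.0 * pp_change / baseline_rate


# Figures quoted in the text: effect of a one-Level excess teacher
# assessment at age 11, and the mean staying-on rate.
effect_pp = 1.13
mean_staying_on_rate = 34.92

rel = relative_increase(effect_pp, mean_staying_on_rate)
print(f"Relative increase: {rel:.2f}%")
```

The distinction matters for interpretation: the 1.13 figure is an absolute shift in the probability of staying on, whereas the 3.24% figure expresses that shift as a share of the baseline rate.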

We have also considered effects on the share of English-related subjects, maths-related subjects and science-related subjects taken at age 16, and whether or not a pupil is recorded staying on at school in Year 12, but there seem to be no strong influences on these outcomes either. There is no suggestion here of any very meaningful linkage between the divergence in assessment and the choice of subjects. In general it appears that doing relatively well in maths, science and English tests at ages 11 and 14 (relative to teacher assessments) is linked to a higher share of maths, science and English subjects in age 16 qualifications, but all the coefficients are so small that they are effectively zero, even when statistically significant.

We have also looked further to see if pupils who are "over-assessed" by teachers relative to tests experience more positive outcomes as the degree of over-assessment increases. We find occasional evidence of such non-linearities, but for the most part there are few significant differences of this type. We have also considered whether teacher-test assessment gaps have bigger influences on outcomes for low achieving pupils or for high achieving pupils, but the patterns for both high and low achievers are similar.

In summary, although we have found some statistically significant effects, the results in Figure 3 do not appear to tell a convincing story about divergence in teacher and test-based assessments having any real impact on qualifications or post-school participation decisions.

## Conclusion

Our empirical analysis finds evidence of systematic differences between test-based and teacher-based assessments in national curriculum assessment at secondary school in England, using data on the population of age-14 state school pupils from 2002-2005. The biggest differences are between pupil achievement groups, with higher achieving pupils more likely to be under-assessed by teachers relative to tests, and low achieving pupils more likely to be under-assessed by the tests relative to the teachers. There are also differences by gender: boys do relatively well in teacher assessments in English, but girls do relatively well in the teacher assessments in science and maths. The gender gap between test and teacher assessments is comparable in magnitude to the gender gap in the mean test and teacher scores, though of opposite sign. There are some smaller differences by ethnic group in English assessment. The reasons for these divergences between teacher and test-based assessment scores are not revealed by our analysis, but statistical discrimination or stereotyping seems an unlikely explanation since any upward 'bias' in teacher assessments relative to the tests works in favour of low-achieving groups.

It is of course unlikely that any two different assessment methods will give directly comparable measures of pupil achievement and skills for every pupil, especially when there are differences in the breadth of skills being assessed. However, mean differences across pupil groups do raise serious concerns about placing too much trust in any one form of assessment. Clearly, the current policy and pedagogical emphasis on the use of tests alone is problematic, as is any suggestion that the system should be shifted to very heavy reliance on teacher assessments.

Even so, we find little evidence that divergence between teacher assessment and actual test scores really matters much for pupil outcomes. Favourable teacher assessments are linked to marginally more GCSE/NVQ entries at age 16, suggesting a possible direct route by which teacher perceptions could influence subsequent pupil outcomes. However, the effects are very small in magnitude and we find no strong evidence here that discrepancies in assessment have any influence on qualifications or post-compulsory schooling decisions. Hence, it seems unlikely from this evidence that pupils are heavily influenced by teacher perceptions of their abilities or that teacher perceptions could be a major influence on post-16 or higher education participation rates.