When judging a student’s performance in speaking or writing we can make use of benchmarks: detailed, validated descriptions of the student performance expected at a particular CEFR level. Such benchmarks are often represented by samples of student work (writing) or videos (speaking).
Benchmarks can be produced in the following way. A coordinator selects a number of oral or written student sample performances intended to illustrate a specific CEFR level. For each sample a group of experts then judges and discusses whether the sample does indeed illustrate the level, and why it is not at the level above or below. After reconsideration the group votes on the level of each performance. In a final phase the individual group members rate the performances once again and compare their scores.
It must be emphasized here that benchmarking is a group process, rather than one expert showing and telling the other experts which performances best illustrate a performance at the desired CEFR level.
Suggestions for action
References
In the case of speaking and writing tasks this question may seem rhetorical. In real life, how we speak and write depends very much on the circumstances we are placed in. So if speaking and writing tasks are to be authentic, we cannot usually do without providing a context. For this reason, most speaking and writing tasks are placed in a context.
In theory it may be possible to ask a student to speak or write about a subject without providing a context. This is often done when students are to give their (personal) opinion about a subject, a phenomenon or an incident. Yet from a CEFR point of view this may be questionable: in our communication with others we need to think about who we are addressing and why. To simply give an opinion without thinking of the person we are addressing may be rather counterproductive: it may hurt other people’s feelings, or it may simply not be understood, or indeed be misunderstood. We must also realize that the aim of assessing speaking or writing in a foreign language is not to test a student’s ability to form a view or an opinion, but rather to test whether the student can express that opinion in the foreign language. In other words: we test whether the student can express a view or an opinion, but we do not assess the content of the message (e.g. facts, data, etc.).
In the case of reading and listening tests, we often find that students are instructed to read a text or listen to a passage without any context given (“Read the following text and answer the questions.”). It cannot be denied that in such cases we may be testing reading or listening. However, from a CEFR point of view we would also expect a context when testing reading and listening: we need to give a reason for reading the text or listening to the passage.
In testing spoken and written interaction it is advisable to provide the students with contexts that are realistic and link up with their age and their experience in life. We also need to remember that there are cultural differences between speakers and writers from different backgrounds. Students may or may not be comfortable in saying or writing certain things.
In testing spoken production it is also advisable to provide a context based on a printed text or on a visualisation (pictures, photos, etc.). This may help students in constructing a view or an opinion. To avoid testing the students’ opinions, imagination or cultural knowledge, rather than their ability to express them, it is advisable to provide key words or arguments for and against.
In tests of reading or listening, providing students with authentic texts placed in a realistic context, together with a purpose for reading or listening, may improve the validity of the test.
When providing contexts, bear in mind that some contexts are unsuitable: war, politics, racism (including cultural clichés and stereotyping), sex and sexism (including stereotyping), potentially distressing topics (e.g. death, terminal illness, severe family and social problems, natural disasters, and the objects of common phobias such as spiders and snakes, where the treatment might be distasteful), examinations, passing and failing, and drugs.
The most obvious answer to this question would be to apply all the steps of linking the new exam to the CEFR: familiarisation, specification, standardization, standard setting and validation. Some steps may be made easier if test specifications have been set out in a matrix: what is tested, how it is tested, the number of items, the item types, the types of texts, etc. This is a way to make sure that a test measures the same construct, with the same CEFR-related skills, as earlier versions.
What is ideally needed is to run a pre-test of the new exam with a representative selection of items from the earlier exam embedded (so-called anchor items). With the help of appropriate statistics it is then possible to set standards comparable to those of the earlier exam. If the earlier standards have been related to the CEFR, one may argue that the new exam is linked to the CEFR. In fact this is part of the validation phase of the linking process.
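The statistics involved can range from simple linear equating to full item response theory models. As a toy illustration only (the function name, data and cut score below are hypothetical, not taken from any real exam), a mean-equating adjustment based on shared anchor items might be sketched like this:

```python
# Toy sketch of mean equating via anchor items (all data hypothetical).
# Both exam administrations embed the same anchor items; the difference
# in mean anchor scores estimates how the two candidate groups differ,
# so the old cut score can be shifted accordingly for the new form.

def mean_equate(old_anchor_scores, new_anchor_scores, old_cut_score):
    """Adjust the earlier exam's cut score for use with the new exam."""
    old_mean = sum(old_anchor_scores) / len(old_anchor_scores)
    new_mean = sum(new_anchor_scores) / len(new_anchor_scores)
    shift = new_mean - old_mean  # positive: new group did better on anchors
    return old_cut_score + shift

# Hypothetical anchor-item scores from the two administrations:
old_anchors = [6, 7, 5, 8, 6, 7]
new_anchors = [7, 8, 6, 9, 7, 8]
print(mean_equate(old_anchors, new_anchors, old_cut_score=30))  # 31.0
```

Real equating studies would use larger samples and more robust models (e.g. IRT calibration); this sketch only shows the basic logic of comparing performance on common items.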
In general the validity and the reliability of a test will be enhanced when the test itself and/or its format (difficulty of tasks, type of tasks, length of the test, etc.) is adapted on the basis of data on student performance. The first and most important step in the linking process is to make sure that the test in question is valid (that it indeed tests what it claims to be testing) and reliable (that the testing is consistent). Ideally this is done through pretesting and the collection of data on the test. However, there are situations where such a procedure is not possible or too costly, such as in classroom-based testing.
If it is not possible to collect evidence that a test is valid and reliable through statistics, we can nevertheless make an attempt to link the test to the CEFR through specification. In fact specification is a phase in the linking process that always needs to be carried out.
The specification phase in the linking process helps to raise awareness among test providers of:
There are four steps to be taken in the specification phase:
References:
In theory this is possible; in practice it may be challenging. It also depends on the type of test and the skill that is being tested. A test that is (to be) linked to the CEFR is supposed to tap a representative number of descriptors at the desired CEFR level. For each descriptor at each level we would need a sufficient number of items to be able to give a valid judgement on whether the student can do what is described in the descriptor. In practice this may mean that tests of reading and listening would have to be longer than is feasible.
In the case of speaking tests, there are formats such as the Oral Proficiency Interview (OPI), in which a trained interlocutor moves from one level to another depending on the proficiency of the candidate. In such tests it would be possible to find out whether the student is able to function at more than one CEFR level. It must be stressed here that interlocutors need to be thoroughly trained in administering such a test. Generally speaking this would not be within the reach of untrained teachers.
There are also computer-based adaptive tests, in which students are presented with items at various levels, depending on the responses they give. In principle a more difficult task (possibly at a higher CEFR level) is presented each time the student gives a correct response. In this way testing time may be reduced considerably. Such tests, however, require numerous items organised in a calibrated item bank, and creating such a bank is costly and takes time.
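The selection rule described above can be sketched as a simple one-up/one-down procedure (a deliberately simplified, hypothetical rule; operational adaptive tests instead estimate ability with IRT models over a calibrated item bank):

```python
# Minimal sketch of a level-adaptive selection rule (hypothetical):
# after a correct response the next item is one CEFR level harder,
# after an incorrect response one level easier; the final position
# gives a rough indication of the student's level.

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def adaptive_session(responses, start_level="B1"):
    """responses: booleans (True = correct) in the order answered."""
    idx = LEVELS.index(start_level)
    for correct in responses:
        if correct:
            idx = min(idx + 1, len(LEVELS) - 1)  # present a harder item next
        else:
            idx = max(idx - 1, 0)                # present an easier item next
    return LEVELS[idx]

print(adaptive_session([True, True, False, True]))  # "C1"
```

Because each response narrows the search, far fewer items are needed per candidate than in a fixed-form test, which is where the reduction in testing time comes from.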
The CEFR is built on the idea that a person who can perform at a given CEFR level can also perform at the level(s) below the given level. A person at B1-level is supposed to be able to perform at levels A2 and A1 as well. However, this does not mean that we can simply assign levels A2 or A1 to the student when he or she has a low score on a B1-level test, for the reasons outlined above.
In many countries pass/fail scores in exams are laid down in law or described in the syllabus, without reference to the CEFR. Thus it is possible for students to pass an exam at a given CEFR level without attaining a score that would indicate that they have reached proficiency at the desired CEFR level.
For a pass/fail score that is related to the CEFR we need to carry out a standard-setting procedure (for the receptive skills) or a benchmarking procedure (for the productive skills). In these procedures a group of experts determines what minimum score is needed before we can claim that the students have reached the desired level. In the case of speaking or writing tests, experts can select performances that illustrate how students should perform in order to be graded at a specific CEFR level.
It is thus possible for a student to have a score on the exam that indicates two things: (1) the student has or has not passed the exam from a legal perspective and (2) the student has or has not reached the desired CEFR level.
It is often said in syllabuses that the exam is at a given CEFR level. However, if there has been no CEFR-related standard setting or benchmarking, the scores on that exam cannot be said to be related to the CEFR.
When grading a student’s performance in reading or listening we may need to set a performance standard. This is the boundary or cut score between two scores on a performance scale. A cut score of 30, for example, says that a score of 30 or more indicates a performance at a particular level (for example B1) while a lower score indicates that the student has not reached the desired level.
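The arithmetic of a cut score can be made concrete as follows (the cut score of 30 and the level B1 are the hypothetical values from the example above, not values from any real test):

```python
# Applying a cut score (hypothetical values): 30 points or more on this
# test indicates a B1-level performance; a lower score indicates that
# the student has not reached the desired level.

def classify(score, cut_score=30, level="B1"):
    """Map a raw test score to a level claim relative to one cut score."""
    return level if score >= cut_score else f"below {level}"

print(classify(32))  # "B1"
print(classify(28))  # "below B1"
```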
There are various ways to set standards, and it has been found that applying two or more methods may yield the best results. For all these methods a coordinator needs to gather student scores on a reading or listening test. A number of such methods are described in the linking Manual (see References below). In cases such as reading or listening tests, where numerical scores are given, experts estimate at what CEFR level a test taker can be expected to respond correctly to a set of items.
It must be emphasized here that standard setting is a group process, rather than one expert showing and telling the other experts which score is required to determine if a performance is at the desired CEFR level.
Although the CEFR acknowledges that linguistic competence is an important aspect of language competence, it may be difficult to link exam sections that test subskills such as grammar or vocabulary to the CEFR. It must be noted that the CEFR descriptors for linguistic competence are formulated in rather general terms and can be interpreted in many ways. For some languages (such as French and German) more detailed descriptors have been developed; however, these have not been scaled in the same way as the descriptors in the CEFR itself.
The problem is that exams often tend to focus on issues in vocabulary and grammar that learners with a specific first-language background find difficult when learning a particular foreign language. Such sections in an exam may focus on the structure of a language rather than on its communicative aspects; they do not necessarily focus on the grammatical constructions, and the accompanying vocabulary, that are typical of texts produced in various contexts at various levels; and they are not usually linked to specific linguistic CEFR descriptors at various levels.
From a formative point of view such a linguistic focus is understandable. However, in summative situations, if the curriculum and the syllabus claim that students should be able to function at specific CEFR levels at the end of secondary school, it is questionable whether an exam that is to be linked to the CEFR should contain (large) sections on linguistic competence.
It can be argued that when testing reading and listening we also test a student’s understanding of the structure and the vocabulary of a language. We may argue the same for tests of speaking and writing: if assessment criteria such as the use of vocabulary and grammatical structures are applied, then there would seem no need to discretely measure vocabulary and structures.
Many testing organisations and publishing companies claim that the tests they administer or publish are at a given CEFR level. The validity of such claims may be very important for test takers. On the basis of their results, they may be admitted to further education or hired for a job. There is also a need for institutes and employers to be able to depend on the validity of claims of links to the CEFR and to specific CEFR levels in particular.
It is obvious that without sufficient proof of the validity of claims of links to CEFR levels, such claims cannot be trusted. Ideally such proof should be included in the test materials. However, such test materials may refer to documents that are confidential and thus inaccessible to the general public. Yet some published information on linking should be available. Such information may contain evidence of various types:
It will not always be easy to find enough evidence for the validity of a test’s claimed links to CEFR levels. Very often the only validation of such claims is a specification of the content of the test in terms of the CEFR, as in low-stakes classroom-based tests. For some tests the resulting evidence may be sufficient. However, for high-stakes tests all the types of evidence mentioned above are needed for the links to a CEFR level to be called valid.
Texts that are taken from real life and that have a communicative function link up to the CEFR model of language use and would therefore be welcome in CEFR-based language tests. This is not to say that such texts cannot be edited for technical reasons (texts may be too long for inclusion, incidental words may cause undue problems of understanding at the intended CEFR-level). Such editing is permissible, both from a validity point of view and from a legal point of view, as long as certain rules of good practice are observed.
The selection of listening samples may be problematic for various reasons: authentic materials may be difficult or expensive to obtain, the sound quality may not be acceptable, or the cost of producing varied listening samples may be too high. Yet we should avoid testing listening by means of reading texts that are simply read aloud by one or two actors.
Some items are on occasion given more weight than others because they are thought to be more difficult. If an item involves a number of operations this is acceptable, provided the students know that the item is worth more points. In other cases it is hardly necessary to weight items: we will still be able to distinguish good students from less good students, because less good students will give more incorrect responses and thus gain fewer points.
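To make the scoring concrete, a weighted total can be computed as follows (the item ids and weights are hypothetical; the multi-operation item Q3 is worth two points, and the students would be told so):

```python
# Sketch of weighted scoring (hypothetical items): each item carries a
# weight reflecting the number of operations it requires, and a correct
# response earns that many points.

items = [
    {"id": "Q1", "weight": 1},  # single operation
    {"id": "Q2", "weight": 1},  # single operation
    {"id": "Q3", "weight": 2},  # two operations, announced to students
]

def total_score(responses):
    """responses: dict mapping item id to True/False (correct/incorrect)."""
    return sum(item["weight"] for item in items if responses[item["id"]])

print(total_score({"Q1": True, "Q2": False, "Q3": True}))  # 3
```

With all weights set to 1 this reduces to an ordinary raw score, which is the unweighted case the paragraph above argues is usually sufficient.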
From a CEFR point of view there is also an issue in weighting items. If one item is considered to be more difficult than another, it must be asked whether that item may be tapping a descriptor at a higher level. As is argued in another FAQ (Can we test at more than one CEFR level in one test?), it is advisable to create homogeneous tests aimed at one CEFR level only.
There is one more issue in weighting items. Item writers, or indeed the syllabus itself, may claim that certain items are more difficult than others. Without data on how these items actually perform, such claims cannot be substantiated.
Some language tests claim that they measure a student’s language skills at one or more CEFR levels. One must check what the validity of such a claim would be. We cannot simply “average out” performances in different skills. In real life most learners are better at one skill than another, certainly at the lower CEFR levels. Thus, in a test we may be able to average score points, but we cannot average CEFR levels. We may be able to say that the student is at B2 for reading and at A2 for writing. We cannot then say that the student is at B1 for reading and writing combined.
Table 1 in the CEFR (Common Reference Levels: global scale) is often misunderstood as describing what is to be expected from a language user at a given level across all language skills. The table should instead be interpreted as a description of what a person can do at that level in particular skills: he or she may function at the given level for reading and listening, but at another level for speaking and writing.
The Council of Europe has advocated the development of profiles, in which the student’s proficiencies in the various language skills are described. The European Language Portfolio has also adopted this approach.