Comparing Traditional and Performance-Based Assessment
Performance standards require performance-based assessment. The following excerpt is from a paper presented by Dr. Judith Liskin-Gasparro at the Symposium on Spanish Second Language Acquisition, held at the University of Texas at Austin in October 1997. It presents a clear description of performance-based assessment and contrasts it with traditional language testing. See the Assessment Page under Instructional Resources for other resources on this topic.
The language teaching profession in the United States is now having a love affair with a new kind of assessment, one that is variously called “authentic assessment,” “alternative assessment,” or “performance assessment.” These approaches are being hailed as the true path to educational reform. The thinking goes that with assessment that is performance-oriented, that aims to measure not only the correctness of a response but also the thought processes involved in arriving at it, and that encourages students to reflect on their own learning in both depth and breadth, instruction will be pushed into a more thoughtful, more reflective, richer mode as well. Teachers who teach to these kinds of alternative assessments will naturally teach in ways that emphasize reflection, critical thinking, and personal investment in one’s own learning. Surely this is a good thing.
Grant Wiggins (1989a, 1989b, 1990, 1992, 1993, 1994) has written extensively on authentic assessment and on the differences between traditional tests and the new assessment models. His discussion (Wiggins 1994) of the etymologies of the words “test” and “assessment” provides some interesting insights. The original testum was an earthenware pot that was used as a colander, to separate gold from the surrounding ore. The term was later extended to the notion of determining the worth of a product or of a person’s effort. The key notion here is that a test measures knowledge or ability after the fact, with the assumption that the product of learning will contain in itself all of the information that the evaluator needs to know about the learners and the quality of their thinking processes.
The root of the term “assessment” is assidere, which is also the root of the French asseoir, to seat or set. It was first used in the sense of setting the value of property to apportion a tax. Assessors traditionally make a site visit -- they inspect the property or the situation and its documents, they categorize its functions, they hear from the owner of the property, they evaluate it by setting it against already-existing standards, and so forth. The assessment requires time, as well as interaction between the assessor and the person or property being assessed, so that the congruence of perception with reality -- or, in our case, between underlying mental processes and surface observation -- can be verified. The idea here is that the product is not sufficient evidence of the quality of the thinking processes that produced it. Building on this distinction, Wiggins draws a series of contrasts between traditional tests and the new assessment models.
- First, authentic assessments are viewed as "direct" measures of student performance, since tasks are designed to incorporate the contexts, problems, and solution strategies that students would use in real life. Traditional standardized tests, in contrast, are seen as "indirect" measures, since test items are designed to "represent competence" by extracting knowledge and skills from their real-life contexts.
- Second, items on standardized instruments tend to test only one domain of knowledge or skill so as to avoid ambiguity for the test taker. Authentic assessment tasks are by design "ill-structured challenges" (Frederiksen 1984), since their goal is to help students prepare for the complex ambiguities of the “real” world.
- Third, authentic assessments focus on processes and rationales. There is no single correct answer; instead, students are led to craft polished, thorough, and justifiable responses, performances, and products. Traditional tests, on the other hand, are one-time measures that rely on a single correct response to each item; they offer no opportunity for demonstration of thought processes, revision, or interaction with the teacher. Because they usually require brief responses, which are often machine-scored, students construct their responses in only the most minimal way, often by simply plugging in a piece of knowledge. There is limited potential for traditional tests to measure higher-order thinking skills since, by definition, those skills involve analysis, interpretation, and multiple perspectives.
- Fourth, the new assessment models involve long-range projects, exhibits, and performances that are linked to the curriculum. Students are aware of how and on what knowledge and skills they are to be assessed. Assessment is conceived of as both an evaluative device and a learning activity. Traditional tests, in contrast, must be kept under lock and key so students do not have knowledge about or access to them ahead of time. Thus, traditional tests may seek to improve student performance in a general way via the washback effect -- students will study in a particular way in the hope that this will improve their test performance -- but there is virtually no way that students can “learn by doing” while taking a traditional test in the way that they learn while engaging in a performance-based assessment.
- Fifth, in the new assessment models, the teacher is an important collaborator in the creation of tasks, as well as in developing guidelines for scoring and interpretation. Teachers may write traditional tests for their own students and then be responsible for fitting the content and format of the test to the curriculum, but many large-scale tests are developed externally and do not involve the teachers whose students are being evaluated at all. In addition, little or no teacher judgment is required to decide whether a response on a traditional test is correct or incorrect. All of this promotes greater distance between teachers and traditional assessment activities in general and has historically made the study of assessment a pretty dry and unappealing topic in teacher education programs.
- Finally, there is the sticky area of validity and reliability, both of which are essential features of good assessment instruments. Validity has to do with the faithfulness of a test to its purpose; in other words, how well it measures what it actually purports to measure. Reliability refers to the consistency and precision of test scores; in other words, how closely the score an individual gets on a particular assessment measure reflects what could be considered his or her “true score.” Traditional tests can’t be beaten when it comes to reliability, not to mention efficiency. When responses are obviously right or wrong, there is little chance that the scores on a test will vary from one rater to another, or across two parallel versions of the same test. This means that traditional tests lend themselves to a wide range of statistical analyses and comparisons, because we can be fairly confident that the true score on a test is very close to the reported score.
The new assessments, on the other hand, are by design ill-structured, messy, open-ended, and complex -- and their designers want it that way. Because authentic assessments involve students constructing complex, open-ended responses, those who use them will have to struggle with issues of reliability; the brief numerical sketch below illustrates the contrast. Where authentic, performance-based assessments shine is validity. They reflect real-life tasks, as well as the multi-faceted character of curriculum and pedagogy, in ways that a one-shot evaluation cannot. To use an analogy, an authentic assessment is like a videotape of student learning, while a traditional test is more like a single snapshot.
Authentic assessments have been criticized for their subjectivity (largely the reliability issue), and it is certainly true that it is far more difficult to develop standards for evaluation and to apply them consistently across a group of portfolios, oral performances, or research projects than it is to do the same for an objective paper-and-pencil test. But the apparent objectivity of traditional tests hides a host of unanswered -- and often unasked -- questions: Who selected the domains of knowledge to be tested? On what basis? Why were the omitted domains left out? The biases that underlie the development and evaluation of alternative assessments are right there on the surface to be seen, critiqued, and, we hope, addressed and corrected, whereas the biases built into traditional tests usually go undetected because they are hidden beneath the surface-level meanings of the test items, which in isolation might seem just fine.
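To make the reliability contrast concrete, here is a minimal sketch in Python using entirely hypothetical scores (none of the numbers come from the paper). In classical test theory, the correlation between two parallel forms of a test is one standard estimate of reliability, and the correlation between two raters scoring the same performances plays the analogous role for rubric-scored, open-ended work:

```python
# Illustrative sketch (not from the original paper): two classical
# reliability estimates computed as Pearson correlations.
from statistics import correlation  # available in Python 3.10+

# Hypothetical scores for ten students on two parallel forms of an
# objectively scored test: the two forms rank students almost identically,
# so the reliability estimate is very high.
form_a = [78, 85, 62, 91, 70, 88, 55, 80, 67, 94]
form_b = [76, 87, 60, 90, 72, 86, 57, 79, 69, 93]

# Hypothetical scores from two raters applying a rubric to the same ten
# open-ended performances: the rankings diverge more, so the inter-rater
# reliability estimate is noticeably lower.
rater_1 = [78, 85, 62, 91, 70, 88, 55, 80, 67, 94]
rater_2 = [70, 90, 75, 82, 64, 92, 68, 71, 81, 85]

print(f"parallel-forms reliability estimate: {correlation(form_a, form_b):.2f}")
print(f"inter-rater reliability estimate:    {correlation(rater_1, rater_2):.2f}")
```

A high parallel-forms correlation is what licenses the claim above that a traditional test’s reported score sits very close to the “true score”; the lower inter-rater figure is the reliability problem that users of authentic assessments must manage, typically with shared rubrics and rater training.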
If we think about the kinds of foreign language assessments that could be classified as “authentic” or “performance-based” assessments, what would they be? If, in the courses you teach or have taken, students worked on a research project that proceeded in stages -- where they turned in drafts, had conferences with you, and documented their learning over time as part of the project in addition to the final product -- then that was an example of an authentic assessment. If a group of students wrote a skit, got feedback on drafts of the script, and then staged and performed it, that would be an authentic assessment. What I am talking about is a multi-staged project that involves iterative rounds of planning, researching, and producing language, culminating in a product or a performance.
