Comparing Traditional and Performance-Based Assessment
Performance standards require performance-based assessment. The
following excerpt is from a paper presented by Dr. Judith
Liskin-Gasparro at the Symposium on Spanish Second Language
Acquisition held at the University of Texas at Austin in October
1997. It presents a clear description of performance-based assessment
and contrasts it with traditional language testing. See the Assessment
Page under Instructional Resources for other resources on this topic.
The language teaching profession
in the United States is now having a love affair with a new kind
of assessment, one that is variously
called “authentic assessment,” “alternative assessment,” or “performance
assessment.” These approaches are being hailed as the true path to
educational reform. With assessment that is performance-oriented, the
thinking goes, with assessment that aims to measure not only the
correctness of a response but also the thought processes involved in
arriving at it, and that encourages students to reflect on their own
learning in both depth and breadth, instruction will be pushed into a
more thoughtful, more reflective, richer mode as well. Teachers who
teach to these kinds of alternative assessments
will naturally teach in ways that emphasize reflection, critical
thinking, and personal investment in one’s own learning.
Surely this is a good thing.
Grant Wiggins (1989a, 1989b, 1990,
1992, 1993, 1994) has written extensively on authentic assessment
and on the differences between
traditional tests and the new assessment models. His discussion
(Wiggins 1994) of the etymologies of the words “test” and “assessment” provides
some interesting insights. The original testum was an earthenware
pot that was used as a colander, to separate gold from the surrounding
ore. The term was later extended to the notion of determining the
worth of a product or of a person’s effort. The key notion
here is that a test measures knowledge or ability after the fact,
with the assumption that the product of learning will contain in
itself all of the information that the evaluator needs to know
about the learners and the quality of their thinking processes.
The root of the term “assessment” is assidere, which
is also the root of the French asseoir, to seat or set. It was
first used in the sense of setting the value of property to apportion
a tax. Assessors traditionally make a site visit -- they inspect
the property or the situation and its documents, they categorize
its functions, they hear from the owner of the property, they evaluate
it by setting it against already-existing standards, and so forth.
The assessment requires time, as well as interaction between the
assessor and the person or property being assessed, so that the
congruence of perception with reality or, in our case, the congruence
between underlying mental processes and surface observation, can
be verified. The idea here is that the product is not sufficient
evidence of the quality of the thinking processes that produced
it.
The differences between traditional tests and the new assessment
models fall into several areas:
- First, authentic assessments are viewed as "direct" measures
of student performance, since tasks are designed to incorporate
the contexts, problems, and solution strategies that students
would use in real life. Traditional standardized tests, in contrast,
are seen as "indirect" measures, since test items are
designed to "represent competence" by extracting knowledge
and skills from their real-life contexts.
- Second, items on standardized
instruments tend to test only one domain of knowledge or skill
so as to avoid ambiguity for the
test taker. Authentic assessment tasks are by design "ill-structured
challenges" (Frederiksen 1984), since their goal is to help
students prepare for the complex ambiguities of the “real” world.
- Third, authentic assessments focus on processes and rationales.
There is no single correct answer; instead, students are led
to craft polished, thorough, and justifiable responses, performances,
and products. Traditional tests, on the other hand, are one-time
measures that rely on a single correct response to each item;
they offer no opportunity for demonstration of thought processes, revision,
or interaction with the teacher. Because they usually require
brief responses, which are often machine-scored, students construct their
responses in only the most minimal way, and often by only plugging
in a piece of knowledge. There is limited potential for traditional
tests to measure higher-order thinking skills since, by definition,
those skills involve analysis, interpretation, and multiple perspectives.
- Fourth, the new assessment models involve long-range projects,
exhibits, and performances that are linked to the curriculum.
Students are aware of how and on what knowledge and skills they are to be
assessed. Assessment is conceived of as both an evaluative device
and a learning activity. Traditional tests, in contrast, must
be kept under lock and key so students do not have knowledge about
or access to them ahead of time. Traditional tests may therefore
seek to improve student performance in a general way via the washback
effect -- students will study in a particular way in the hope that
this will improve their test performance -- but there is virtually
no way that students can “learn by doing” while taking
a traditional test in the way that they learn while engaging
in a performance-based assessment.
- Fifth, in the new assessment models, the teacher is an important
collaborator in creating tasks, as well as in developing guidelines
for scoring and interpretation. Teachers may write traditional
tests for their own students and then be responsible for fitting
the content and format of the test to the curriculum, but many
large-scale tests are developed externally and do not in any way
involve the teachers whose students are being evaluated. In addition,
little or no teacher judgment is required to decide whether a
response on a traditional test is correct or incorrect. All of this promotes
greater distance between teachers and traditional assessment
activities in general and has historically made the study of assessment a
pretty dry and unappealing topic in teacher education programs.
- Finally, there is the sticky area of validity and reliability, both of
which are essential features of good assessment instruments.
Validity has to do with the faithfulness of a test to its purpose;
in other words, how well it actually measures what it purports
to measure. Reliability refers to the consistency and precision
of test scores; in other words, how closely the score an individual
gets on a particular assessment measure reflects what could be
considered his or her “true score.” Traditional tests
can’t be beaten when it comes to reliability, not to mention
efficiency. When responses are obviously right or wrong, there
is little chance that the scores on a test will vary from
one rater to another or between two parallel versions of the
same test taken by the same student. This means that traditional
tests lend themselves
to a wide range of statistical analyses and comparisons because
we can be fairly confident that the true score on a test is very
close to the reported score.
The new assessments, on the other hand, are by design ill-structured,
messy, open-ended, and complex, and their designers like it that way.
Because authentic assessments require students to construct complex,
open-ended responses, those who use them will have to struggle with
issues of reliability. Where authentic, performance-based assessments
shine is in validity. They reflect real-life tasks, as well as the
multi-faceted character of curriculum and pedagogy, in ways that a
one-shot evaluation cannot. To use an analogy, an authentic
assessment is like a videotape of student learning, while a
traditional test is more like a single snapshot.
Authentic assessments have been criticized for their subjectivity
(largely the reliability issue), and it is certainly far more
difficult to develop standards for evaluation and to apply them
consistently across a group of portfolios, oral performances, or
research projects than it is to do the same for an objective
paper-and-pencil test. But the apparent objectivity of traditional
tests hides a host of unanswered -- and often unasked -- questions:
Who selected the domains of knowledge to be tested? On what basis?
Why were the omitted domains left out? The biases that underlie the
development and evaluation of alternative assessments are right there
on the surface to be seen, critiqued and, we hope, addressed and
corrected, whereas the biases built into traditional tests usually go
undetected because they are hidden beneath the surface-level meanings
of the test items, which in isolation might seem just fine.
What kinds of foreign language assessments could be classified as
“authentic” or “performance-based”? If, in the courses you teach or
have taken, students have worked on a research project that had
stages, where they turned in drafts and had conferences with you, and
where the learning over time was documented as part of the project in
addition to the final product, then that was an example of an
authentic assessment. If a group of students wrote a skit, got
feedback on drafts of the script, staged it, and performed it, that
would be an authentic assessment. What I am talking about is a
multi-staged project that involves iterative rounds of planning,
researching, and producing language, culminating in a product or a
performance.