Pitfalls of Performance Assessment
Previous trials, experience in other industries, and theoretical
analysis suggest that simulation offers substantial opportunities to facilitate performance
assessment; however, a variety of issues must be resolved before performance assessment
becomes a robust capability.
- Technical versus nontechnical skills.
As indicated in previous sections, it is feasible (if difficult) to assess a technical
response to specific events and generic nontechnical behaviors. For which kinds
of assessments is it appropriate to measure only technical performance, only nontechnical
performance, or some combination of the two?
- Number of scenarios. How many different
scenarios are needed to achieve robust performance assessment of individuals in all
relevant aspects (technical and nontechnical) of patient care? Some emerging results
suggest that increasing the number of scenarios does more to improve the
reliability of ratings than increasing the number of raters does (a numeric sketch
after this list illustrates the trade-off).
- Rating individuals versus rating crews or teams.
Anesthetists work both as individuals and in crews and teams with other anesthetists
and with surgeons, nurses, technicians, and others. Should the performance of individuals
working alone be assessed? Should anesthetists be able to call for and use help
in solving problems? If so, can the individual still be rated while working as part
of a team?
- Aggregation of fluctuating performance. How can performance that fluctuates substantially
over time be aggregated into a single rating? This issue was recognized
by Gaba and colleagues as a major apparent source of inter-rater disagreement. It
is not addressed by the ANTS system. Which techniques will best address this issue,
especially as applied to scenarios that represent the actual complexity of clinical
practice?
- Criterion thresholds. What level of performance
should be set as criterion thresholds for different purposes? Can benchmarks of
performance be established by truly expert clinicians (recognizing that years of
experience or hierarchic rank is not a surrogate for expertise or skill)? Similarly,
how does the rating system deal with single actions or behaviors that were lethal
or harmful in the presence of otherwise good performance? If used for formative
assessment, a rating system should indicate the successes of the examinee as well
as the failures. However, if used for summative or high-stakes assessment, it may
be critical to ensure that the examinee who risks harming a simulated patient cannot
outscore another examinee whose overall performance is less strong, but who at least
did not endanger the patient. For example, not performing chest compressions in
a cardiac arrest situation would be such a "knockout" criterion; a small scoring
sketch after this list illustrates one way such a criterion could be applied.
- Appropriate statistical analysis of validity, inter-rater
reliability, and reproducibility of these assessments. A variety of statistical
tests and approaches have been used to evaluate these characteristics. The data
on performance show various levels of inter-rater variance and high interindividual
(and interteam) variability.[38] [64] [67] [68] [69] [97] [98] [99] [100]
As detailed in the paper by Gaba and associates,[64]
some inter-rater reliability statistics are
more stringent than others, especially in terms of the nature of the "by chance"
benchmark; a toy calculation after this list contrasts raw agreement with a
chance-corrected statistic. No firm consensus has been reached regarding which tests are most appropriate
to answer key questions about simulation-based performance assessment. Some of the
rating systems (including ANTS) have used less stringent tests of inter-rater reliability.
Generalizability theory[101] [102]
offers a set of statistical techniques to sort out the impact of scenario, subject,
rater, number of scenarios, and other facets on such assessments. This technique
also specifies how comparisons can be made against reference performance levels or
as relative comparisons between subjects without a fixed benchmark.
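The trade-off noted above between adding scenarios and adding raters, and the decision-study logic of generalizability theory, can be made concrete with a small numeric sketch. The variance components below are purely hypothetical values chosen for illustration (real values would have to come from a generalizability study of actual rating data); the formula is the standard projected relative generalizability coefficient for a fully crossed subject-by-scenario-by-rater design.

```python
# Hypothetical decision-study ("D-study") projection for a fully crossed
# subject x scenario x rater design. All variance components are assumed
# values for illustration only, not estimates from any published study.

def g_coefficient(n_scenarios, n_raters,
                  var_subject=0.50,          # true differences between subjects
                  var_subj_x_scenario=0.80,  # case specificity (often the largest facet)
                  var_subj_x_rater=0.10,     # rater-specific leniency toward particular subjects
                  var_residual=0.40):        # three-way interaction confounded with error
    """Projected relative generalizability (reliability) coefficient."""
    error = (var_subj_x_scenario / n_scenarios
             + var_subj_x_rater / n_raters
             + var_residual / (n_scenarios * n_raters))
    return var_subject / (var_subject + error)

# Because the assumed case-specificity variance dominates, doubling the number of
# scenarios raises the projected coefficient more than doubling the number of raters.
print(g_coefficient(n_scenarios=4, n_raters=2))   # baseline design
print(g_coefficient(n_scenarios=8, n_raters=2))   # more scenarios
print(g_coefficient(n_scenarios=4, n_raters=4))   # more raters
```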
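The point about "by chance" benchmarks can likewise be illustrated with a toy calculation. The ratings below are fabricated solely to show the arithmetic: when nearly all ratings fall into one category, raw percent agreement looks high even though Cohen's kappa, which subtracts the agreement expected by chance, can be low or even negative.

```python
# Toy illustration of a "by chance" benchmark: raw percent agreement versus
# Cohen's kappa for two raters scoring ten scenarios as pass (1) or fail (0).
# The ratings are fabricated for illustration only.

rater_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
rater_b = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Agreement expected by chance, given each rater's marginal pass rate.
p_a, p_b = sum(rater_a) / n, sum(rater_b) / n
expected = p_a * p_b + (1 - p_a) * (1 - p_b)

kappa = (observed - expected) / (1 - expected)
print(f"percent agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```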
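Finally, the "knockout" criterion mentioned under criterion thresholds could be enforced in a summative scoring scheme by capping the total score whenever a potentially lethal action or omission is observed. The sketch below uses invented item names, point values, and an arbitrary cap; it is one possible illustration, not a description of any published rating system.

```python
# Minimal sketch of a summative score with a "knockout" cap.
# Item names, point values, and the cap are invented for illustration only.

KNOCKOUT_EVENTS = {"no_chest_compressions_in_cardiac_arrest"}
KNOCKOUT_CAP = 40  # an examinee who endangers the patient cannot score above this

def summative_score(item_scores, observed_events):
    """item_scores: dict of item -> points; observed_events: set of flagged events."""
    total = sum(item_scores.values())
    if observed_events & KNOCKOUT_EVENTS:
        total = min(total, KNOCKOUT_CAP)
    return total

# An otherwise strong performance that omits chest compressions is capped below
# a weaker performance that at least did not endanger the patient.
strong_but_unsafe = summative_score(
    {"diagnosis": 30, "communication": 30, "drug_therapy": 25},
    {"no_chest_compressions_in_cardiac_arrest"})
weaker_but_safe = summative_score(
    {"diagnosis": 20, "communication": 15, "drug_therapy": 20}, set())
assert strong_but_unsafe < weaker_but_safe
```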
In summary, the results of the various simulation groups suggest
that although it should be possible to use simulation to enable performance assessment,
it will not be easy to develop a robust set of performance measures of anesthetists'
skill that are widely accepted,[103]
even if the
simulator is used as a tool to present standardized patient scenarios.[104]
Klemola and Norros recently published a new way of looking at
performance that involves anesthetists' "habits of action."[105]
These authors distinguish between "reactive habits" (conservative, self-contained,
reluctant to construct subjective evaluations) and "interpretative habits" (creative,
interactive, continuous integrative reasoning).
This work shows that many issues must be considered when discussing the best
methods of education and evaluation. Additional issues include defining and assessing
professional competence. A more consultant-based method was introduced by Greaves
and Grant, who compiled an inventory of characteristics of the anesthetist's practice.[106]
A current review is given by Epstein.[107]
Thus, although there are presumed advantages to using simulation
as a tool for performance assessment (known scenarios, errors can be allowed to occur
and play out, intensive recording/archiving of performance is possible), the anesthesia
community should be careful to introduce simulation-based performance evaluation
on a slow and measured basis.[108]
Performance
assessment will be a hot topic for discussion in the simulation and clinical anesthesiology
communities in the early 21st century. This controversy should not divert attention
from the major application of simulation, which is to improve clinical performance
through individual and team training in preventing and managing adverse clinical
events.