
Pitfalls of Performance Assessment

Previous trials, experience in other industries, and theoretical analysis suggest that simulation offers substantial opportunities to facilitate performance assessment; however, a variety of issues must be resolved before performance assessment becomes a robust capability.

  1. Technical versus nontechnical skills. As indicated in previous sections, it is feasible (if difficult) to assess both the technical response to specific events and generic nontechnical behaviors. For which kinds of assessments is it appropriate to measure only technical performance, only nontechnical performance, or some combination of the two?
  2. Number of scenarios. How many different scenarios are needed to achieve robust performance assessment of individuals in all relevant aspects (technical and nontechnical) of patient care? Some emerging results suggest that increasing the number of scenarios improves the reliability of ratings more than increasing the number of raters does (a point illustrated in the decision-study sketch after this list).
  3. Rating individuals versus rating crews or teams. Anesthetists work both as individuals and in crews and teams with other anesthetists and with surgeons, nurses, technicians, and others. Should the performance of individuals working alone be assessed? Should anesthetists be able to call for and use help in solving problems? If so, can one still rate the individual when working with a team?
  4. Aggregation of performance over time. How can performance that fluctuates substantially over time be aggregated into a single rating? This issue was recognized by Gaba and colleagues as a major apparent source of inter-rater disagreement. It is not addressed by the ANTS system. Which techniques will best address this issue, especially as applied to scenarios that represent the actual complexity of clinical practice?
  5. Criterion thresholds. What levels of performance should be set as criterion thresholds for different purposes? Can benchmarks of performance be established by truly expert clinicians (recognizing that years of experience or hierarchic rank is not a surrogate for expertise or skill)? Similarly, how does the rating system deal with single actions or behaviors that were lethal or harmful in the presence of otherwise good performance? If used for formative assessment, a rating system should indicate the successes of the examinee as well as the failures. However, if used for summative or high-stakes assessment, it may be critical to ensure that the examinee who risks harming a simulated patient cannot outscore another examinee whose overall performance is less strong but who at least did not endanger the patient. For example, not performing chest compressions in a cardiac arrest situation would be such a "knockout" criterion; a minimal sketch of a knockout-capped scoring rule appears after this list.
  6. Appropriate statistical analysis of validity, inter-rater reliability, and reproducibility of these assessments. A variety of statistical tests and approaches have been used to evaluate these characteristics. The data on performance show various levels of inter-rater variance and high interindividual (and interteam) variability.[38] [64] [67] [68] [69] [97] [98] [99] [100] As detailed in the paper by Gaba and associates,[64] some inter-rater reliability statistics are more stringent than others, especially in terms of the nature of the "by chance" benchmark. No firm consensus has been reached regarding which tests are most appropriate to answer key questions about simulation-based performance assessment. Some of the rating systems (including ANTS) have used less stringent tests of inter-rater reliability. Generalizability theory[101] [102] offers a set of statistical techniques to sort out the impact of scenario, subject, rater, number of scenarios, and other facets on such assessments; a small decision-study sketch based on this framework follows this list. This technique also specifies how comparisons can be made against reference performance levels or as relative comparisons between subjects without a fixed benchmark.
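
For issue 5, the following minimal Python sketch shows one way a knockout-capped scoring rule could work; the function name, item scores, and cap value are illustrative assumptions, not part of any published rating instrument.

    # Hypothetical "knockout"-capped aggregation of checklist item scores.
    # Names, scores, and the cap value are invented for illustration only.
    def score_scenario(item_scores, knockout_triggered, knockout_cap=0.0):
        """Average 0-1 checklist scores, capping the result if a knockout
        criterion (e.g., no chest compressions during cardiac arrest) fired."""
        raw = sum(item_scores) / len(item_scores)
        if knockout_triggered:
            # A harmful omission caps the score below any non-harmful performance.
            return min(raw, knockout_cap)
        return raw

    # Strong checklist performance with a knockout scores below a weaker,
    # but safe, performance.
    print(score_scenario([1.0, 0.9, 1.0, 0.8], knockout_triggered=True))   # 0.0
    print(score_scenario([0.6, 0.5, 0.7, 0.6], knockout_triggered=False))  # 0.6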

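Relating to issues 2 and 6, the sketch below projects a generalizability coefficient for a fully crossed subject x scenario x rater design. The variance components are assumed values chosen only for illustration (in practice they would be estimated from rating data); with a large subject-by-scenario component, doubling the number of scenarios raises the projected coefficient more than doubling the number of raters does.

    # Decision-study projection under generalizability theory for a fully
    # crossed subject x scenario x rater design. Variance components below
    # are assumed values, not estimates from any real data set.
    def g_coefficient(var_subject, var_sxscenario, var_sxrater, var_residual,
                      n_scenarios, n_raters):
        """Projected generalizability coefficient for relative decisions."""
        relative_error = (var_sxscenario / n_scenarios
                          + var_sxrater / n_raters
                          + var_residual / (n_scenarios * n_raters))
        return var_subject / (var_subject + relative_error)

    base = dict(var_subject=1.0, var_sxscenario=2.0, var_sxrater=0.2, var_residual=1.0)
    print(round(g_coefficient(**base, n_scenarios=4, n_raters=2), 2))  # 0.58
    print(round(g_coefficient(**base, n_scenarios=8, n_raters=2), 2))  # 0.71 (more scenarios)
    print(round(g_coefficient(**base, n_scenarios=4, n_raters=4), 2))  # 0.62 (more raters)
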
In summary, the results of the various simulation groups suggest that although it should be possible to use simulation to enable performance assessment, it will not be easy to develop a robust set of performance measures of anesthetists' skill that are widely accepted,[103] even if the simulator is used as a tool to present standardized patient scenarios.[104]

Klemola and Norros recently published a new way of looking at performance that involves anesthetists' "habits of action."[105] These authors distinguish between "reactive habits" (conservative, self-contained, reluctant to construct subjective evaluations) and "interpretative habits" (creative, interactive, continuous integrative reasoning).

This paper shows that many issues must be considered when discussing the best methods of education and evaluation. Additional issues include defining and assessing professional competence. A more consultant-based method was introduced by Greaves and Grant, who compiled an inventory of characteristics of the anesthetist's practice.[106] A current review is provided by Epstein.[107]

Thus, although there are presumed advantages to using simulation as a tool for performance assessment (scenarios are known in advance, errors can be allowed to occur and play out, and performance can be recorded and archived intensively), the anesthesia community should be careful to introduce simulation-based performance evaluation on a slow and measured basis.[108] Performance assessment will be a hot topic for discussion in the simulation and clinical anesthesiology communities in the early 21st century. This controversy should not divert attention from the major application of simulation, which is to improve clinical performance through individual and team training in preventing and managing adverse clinical events.
