The issue of consistent grader severity is an ongoing concern for all who score performance examinations. This study explored the consistency of common grader severity across three performance examination administrations. Each administration was analyzed using the multi-facet Rasch model, which produced calibrations of grader severity.
The data are from three annual administrations of a medical oral examination, labeled administrations A, B, and C. Between administrations, there were some common graders and some non-common graders. To be included in the study, a common grader had to rate candidates in at least two of the three administrations; some graders were common to all three. In this study, 115 common graders met this criterion. The examination also had standardized items and tasks that graders used to rate the candidates. The candidates for each of the three administrations were completely different; however, the examination process was the same.
Graders rate a random sample of the candidates who take the examination in a given administration. During each administration, each grader gives many ratings, which are used to calibrate his or her severity. Because each grader gives so many ratings, the calibrations of grader leniency or severity are very precise.
The items in this oral examination were carefully developed for consistency and content coverage. The skills being rated were well defined and the same across all administrations. The rating scale was well defined at each rating level. Graders were trained prior to the examination with regard to the content of the items and the examination procedures. Many of the graders had a great deal of experience with the examination process. The multi-facet model used for this analysis was:
$$\log_e\!\left(\frac{P_{nijkx}}{P_{nijk(x-1)}}\right) = B_n - D_i - C_j - H_k - F_x$$

where $B_n$ = ability of candidate $n$;
$D_i$ = difficulty of item $i$;
$C_j$ = severity of grader $j$;
$H_k$ = difficulty of task $k$; and
$F_x$ = the Rasch-Andrich threshold (step calibration) for category $x$.
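To make the model concrete, here is a minimal computational sketch in Python. It is not part of the original analysis (which was done with multi-facet Rasch software), and the facet values in the example are hypothetical. Given a candidate ability, item difficulty, grader severity, task difficulty, and the Rasch-Andrich thresholds, the function returns the probability of each rating category.

```python
import math

def category_probabilities(B_n, D_i, C_j, H_k, thresholds):
    """Probability of each rating category x = 0..m under the
    multi-facet Rasch model, given thresholds [F_1, ..., F_m]."""
    # The log-numerator for category x is the cumulative sum of
    # (B_n - D_i - C_j - H_k - F_h) for h = 1..x (0 for category 0).
    logits = [0.0]
    for F_x in thresholds:
        logits.append(logits[-1] + (B_n - D_i - C_j - H_k - F_x))
    denom = sum(math.exp(v) for v in logits)
    return [math.exp(v) / denom for v in logits]

# Hypothetical example: an average candidate on an average item and task,
# rated by a slightly severe grader (C_j = 0.5 logits) on a 0-3 scale.
probs = category_probabilities(B_n=0.0, D_i=0.0, C_j=0.5, H_k=0.0,
                               thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])  # probabilities sum to 1.0
```

Note that raising the grader severity $C_j$ shifts probability toward the lower rating categories, which is exactly the sense in which a severe grader depresses observed ratings for candidates of the same ability.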
Because the examination materials are so well standardized, differences in grader severity within examination administrations are most likely due to inherent differences in grader expectations and standards, which will probably not change substantially with training. Grader severity was calibrated using the multi-facet model for each of the three examination administrations. The center of each scale was anchored at 0.00 logits for all three administrations. Next, the grader severity calibrations were compared across administrations using z-scores and correlations for the common graders.
Using the grader severity estimates and their measurement errors, the standardized difference between grader severities across administrations was calculated using z-scores (Forsyth, Sarsangjan, and Gilmer, 1981). The formula used to obtain standardized differences for the grader severity calibrations is:
$$Z_j = \frac{C_{j1} - C_{j2}}{\sqrt{S_{j1}^2 + S_{j2}^2}}$$

where $C_{j1}$ and $C_{j2}$ are grader $j$'s severity estimates for the two administrations, and $S_{j1}$ and $S_{j2}$ are the estimated measurement errors associated with those severity estimates.
Correlations were also used to confirm the patterns of grader severity.
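As a sketch of this comparison (the arrays below are hypothetical stand-ins for the study's calibrations and standard errors, not its data), the standardized differences and the correlation between two administrations might be computed as follows:

```python
import math

def standardized_differences(sev1, sev2, se1, se2):
    """Z_j = (C_j1 - C_j2) / sqrt(S_j1^2 + S_j2^2) for each common grader."""
    return [(c1 - c2) / math.sqrt(s1 ** 2 + s2 ** 2)
            for c1, c2, s1, s2 in zip(sev1, sev2, se1, se2)]

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Hypothetical severity calibrations (logits) and standard errors
# for four common graders in administrations A and B.
sev_A, se_A = [-1.2, 0.3, 0.8, 1.5], [0.10, 0.12, 0.11, 0.13]
sev_B, se_B = [-1.1, 0.4, 0.6, 1.4], [0.11, 0.10, 0.12, 0.12]

z = standardized_differences(sev_A, sev_B, se_A, se_B)
flagged = [j for j, zj in enumerate(z) if abs(zj) >= 1.96]  # 95% criterion
print([round(zj, 2) for zj in z], flagged, round(pearson_r(sev_A, sev_B), 3))
```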
The calibrated severity estimates for the common graders ranged from -1.78 to 1.55 logits during administration A, from -2.07 to 1.50 logits during administration B, and from -1.96 to 1.52 logits during administration C. Within each examination administration, the severity estimates were significantly different among graders, as indicated by a chi-square test and the separation reliability. Grader severity differed significantly even after training and working within a carefully structured examination process.
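Separation reliability can be understood as the proportion of observed variance in the severity calibrations that is not attributable to measurement error. The following sketch shows one common way to compute it; the values are hypothetical, not the study's data.

```python
def separation_reliability(measures, errors):
    """Share of the observed variance in the calibrations that is 'true'
    variance rather than measurement-error variance."""
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / n
    error_var = sum(e ** 2 for e in errors) / n   # mean-square error
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var

# Hypothetical grader severity calibrations (logits) and standard errors.
severities = [-1.78, -0.90, -0.30, 0.20, 0.70, 1.55]
std_errors = [0.12, 0.10, 0.11, 0.10, 0.12, 0.13]
print(round(separation_reliability(severities, std_errors), 3))
```

A value near 1.0 indicates that the graders are reliably separated in severity, i.e., the spread among graders is real rather than measurement noise.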
An absolute z-score of 1.96 or greater indicates a statistically significant difference in grader severity across administrations at the 95% confidence level. Comparison of the grader severity estimates across administrations using the z-score analysis found that, of the 115 common graders, only one was significantly different in severity across administrations at that level. This grader was very lenient during administration A, but significantly more severe during administrations B and C.
The graders within an administration were significantly different from each other in severity; however, each grader was consistent with himself or herself within and across examination administrations. This suggests that severity is a grader characteristic that should be included in the analysis of performance examinations to improve validity and reliability. The multi-facet model provides the opportunity to incorporate this facet into the analysis of performance examinations and to better understand graders' rating patterns.
Mary E. Lunz
Measurement Research Associates, Inc.
www.measurementresearch.com
Forsyth, R., Sarsangjan, V., and Gilmer, J. (1981). Some empirical results related to the robustness of the Rasch model. Applied Psychological Measurement, 5, 175-186.
An Example of Grader Consistency using the Multi-Facet Model. Mary E. Lunz. Rasch Measurement Transactions, 2007, 21:2, pp. 1101-1102.