A Cautionary Tale about Item Equating with Fluctuating Samples

In many high-stakes testing scenarios, samples tend to be reasonably comparable with regard to demographic characteristics across administrations. As part of the initial quality control checks, most psychometricians will investigate various demographic characteristics to get a pulse on sample stability. Unfortunately, many psychometricians may be tempted to investigate only "visible" demographic variables, such as gender, ethnicity, and so on. Failing to investigate "invisible" demographic variables, such as whether the examinee is a first-time or repeat test-taker or has previously failed the examination, could lead to an enormous mistake when equating examinations. Consider the following example.

Suppose a data set is provided to a psychometrician for scoring. As part of the initial quality control checks, s/he learns that both the sample size and the visible demographic variables seem fairly comparable to previous administrations. On the surface the sample appears comparable to previous samples, so the psychometrician proceeds to investigate item quality and functioning. Preliminary item analyses suggest the items are sound and functioning properly. With this assurance, the psychometrician then begins developing item anchors for equating purposes. After several iterations of investigating displacement values and unanchoring items whose calibrations displace from those obtained in previous administrations, the psychometrician is satisfied with the remaining item calibrations and locks them down as anchors for the final scoring run.
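To make the anchoring step concrete, here is a minimal sketch of this kind of displacement screening (the function name, the dictionary inputs, and the 0.5-logit tolerance are illustrative assumptions on our part, not the procedure the psychometrician in the tale actually used):

```python
def screen_anchors(banked, current, tol=0.5):
    """Iteratively drop candidate anchors whose calibrations have drifted.

    banked  : dict of item id -> difficulty (logits) from prior administrations
    current : dict of item id -> difficulty (logits) freely estimated from
              the new data
    tol     : displacement, in logits, beyond which an item is removed from
              the anchor set (the 0.5 default is purely illustrative)
    """
    anchors = set(banked) & set(current)
    shift = 0.0
    while anchors:
        # Re-center the current calibrations on the surviving anchors so the
        # displacement reflects item drift rather than an overall scale shift.
        shift = sum(banked[i] - current[i] for i in anchors) / len(anchors)
        displaced = {i for i in anchors
                     if abs(current[i] + shift - banked[i]) > tol}
        if not displaced:
            break
        anchors -= displaced
    return anchors, shift
```

Nothing in such a loop can tell the analyst whether a large displacement reflects genuine item drift or a shift in the sample's composition; that distinction is exactly what goes wrong in the remainder of the tale.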

Once the data are scored, the results are reviewed and compared to historical trends. Diagnostic results (e.g., fit statistics, separation and reliability estimates) appear sound, but some notable differences in pass/fail statistics and mean scaled scores are evident. Concerned, the psychometrician revisits the scoring process by reviewing syntax and reproducing all relevant data files. The examination data are rescored and the same results are produced. Still suspicious, the psychometrician begins combing both the new data set and last year's data set to identify anyone who had previously taken the exam. A list of repeat examinees is pulled and their scores are compared across the two administrations. It turns out that virtually all of the repeat examinees performed worse on the new examination. How could this be? These examinees have had additional training, education, and time to prepare for the examination.
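The repeat-examinee check itself is easy to script. One possible sketch, assuming each administration's results sit in a flat file (the file and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical file and column names: each file holds one administration's
# results with columns examinee_id and scaled_score.
prev = pd.read_csv("scores_previous.csv")
curr = pd.read_csv("scores_current.csv")

# Inner join keeps only examinees who appear in both administrations.
repeats = curr.merge(prev, on="examinee_id", suffixes=("_curr", "_prev"))
repeats["change"] = repeats["scaled_score_curr"] - repeats["scaled_score_prev"]

print(f"{len(repeats)} repeat examinees identified")
print(f"{(repeats['change'] < 0).mean():.0%} scored lower than on their previous attempt")
```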

Upon closer inspection, the psychometrician is surprised to learn that a less obvious demographic characteristic had fluctuated among the examinees and caused this unusual scenario: a larger proportion of examinees were taking the examination because of a prior failure. This small, yet very important, artifact had a significant ripple effect on the quality of the final scores. The problem began when a less able sample interacted with the items and the psychometrician was deceived into thinking many of the existing calibrations were unstable. As a result, the psychometrician unanchored many item calibrations that should have been left alone. When the new scale was established, it shifted, and the resulting scores lost their meaning across administrations.

Although item equating under the Rasch framework is quite simple and straightforward, it still requires a great deal of careful attention. The scenario presented above illustrates how a significant problem may occur simply as a result of failing to investigate one key demographic characteristic of the sample. When equating, it is critical to consider all types of sample characteristics, especially those that pertain to previous performance. An inconsistency in these demographics can produce apparent item instability, which, in turn, can go unnoticed when examining displacement values and creating item anchors. It is for this reason that many psychometricians use only first-time examinee data when equating exams. In any case, all psychometricians who equate examinations under the Rasch framework would be wise to add to their list of quality control checks a comprehensive investigation of demographic characteristics, both before and after a scoring run is complete.
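As one possible screen for the "invisible" demographic discussed here, the proportion of repeat examinees can be compared across administrations before any anchoring decisions are made. The sketch below uses a simple two-proportion z test (the function name and inputs are ours, and a statistical test is only a rough screen, not a substitute for substantive review):

```python
from math import sqrt

def repeater_shift(repeat_prev, total_prev, repeat_curr, total_curr):
    """Compare the proportion of repeat examinees across two administrations.

    Returns the two proportions and a two-proportion z statistic; a large
    |z| flags a shift in sample composition that deserves a closer look
    before (and after) anchoring and scoring decisions are made.
    """
    p1, p2 = repeat_prev / total_prev, repeat_curr / total_curr
    pooled = (repeat_prev + repeat_curr) / (total_prev + total_curr)
    se = sqrt(pooled * (1 - pooled) * (1 / total_prev + 1 / total_curr))
    return p1, p2, (p2 - p1) / se
```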

Kenneth D. Royal, University of North Carolina at Chapel Hill
Mikaela M. Raddatz, American Board of Physical Medicine and Rehabilitation


Royal K. & Raddatz M. (2013) A Cautionary Tale about Item Equating with Fluctuating Samples. Rasch Measurement Transactions, 27:2 p. 1417


Response on the Rasch Listserv, Sept. 6, 2013

I read with interest the recent note in RMT 27:2 by Royal and Raddatz that contained a cautionary tale about equating test forms for certification and licensure exams. By the end of the note, I was troubled by what I feel is a common misunderstanding about the properties of Rasch measurement.

Their tale begins with a test administration and the investigation of item quality and functioning before attempting to equate the current form to a previously established standard.

It is widely known that the properties of item invariance that allow equating in the Rasch model hold if, and only if, the data fit the Rasch model. The fit of the data to the model should therefore be investigated in the initial stage of equating. The authors state, "Preliminary item analyses reveal the items appear to be sound and functioning." One assumes that the fit of the data to the model was confirmed in this process, though it is not explicitly stated. As the story continues, we find, in fact, that the equating solution does not hold across the various subgroups represented in the analysis, and the calibration sample is subsequently altered to produce a different and more logical equating solution.

This suggests that the estimates of item difficulty were not freed from the distributional properties of the sample. Hence, the data cannot fit a Rasch model. One would hope that it is not necessary to reach the very end of the equating process before discovering that the estimates of item difficulty are not invariant and that the link constant developed for equating is not acceptable.
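To put the invariance claim in symbols (the notation here is ours, added for clarity): the dichotomous Rasch model specifies

\[ P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}, \]

and, when the data fit the model, the item difficulty estimates \( \hat{\delta}_i \) are, apart from the choice of origin, independent of the ability distribution of the calibrating sample. Equating two administrations then reduces to a single link constant estimated from the set \( A \) of common (anchor) items, for example the mean difference

\[ c = \frac{1}{|A|} \sum_{i \in A} \left( \hat{\delta}_i^{\,\mathrm{bank}} - \hat{\delta}_i^{\,\mathrm{current}} \right). \]

If the difficulty estimates differ depending on whether first-time or repeat examinees supplied the responses, no single value of \( c \) can be acceptable.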

What, then, is the cause of the problem? Without independent confirmation, I would suggest that the fit statistics used in the preliminary analysis lacked the power to detect violations of this type of first-time vs. repeater invariance. This is easily corrected with the use of the between-group item fit statistic available in Winsteps. It will not solve the problem of lack of fit to the Rasch model, but it will let you know there is a problem before you get too far into the equating process. Developing an item bank that measures both types of examinees fairly is an entirely different issue, and one that should be addressed. The lack of item invariance across subgroups is a classic definition of item bias.
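One generic formulation of such a between-group fit statistic, a chi-square over group sums of residuals, is sketched below; this illustrates the general idea only and is not Winsteps' exact computation:

```python
import numpy as np

def between_group_fit(obs, exp, var, groups):
    """Chi-square-style between-group fit for a single item.

    obs    : observed 0/1 responses to the item, one per examinee
    exp    : model-expected probabilities for those examinees
    var    : model variances, p * (1 - p)
    groups : group label per examinee (e.g., "first-time" / "repeat")

    Returns (statistic, degrees of freedom). A large statistic relative to a
    chi-square with that many degrees of freedom suggests the item's
    difficulty depends on group membership.
    """
    obs, exp, var = (np.asarray(a, dtype=float) for a in (obs, exp, var))
    groups = np.asarray(groups)
    labels = np.unique(groups)
    stat = sum((obs[groups == g] - exp[groups == g]).sum() ** 2
               / var[groups == g].sum() for g in labels)
    return stat, len(labels) - 1
```

Applied with first-time vs. repeat status as the grouping variable, a large value for an item would signal subgroup-dependent difficulty before any anchoring decisions are made.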

Richard M. Smith, Editor
Journal of Applied Measurement



