A Cautionary Tale about Item Equating with Fluctuating Samples

In many high-stakes testing scenarios, samples tend to be reasonably comparable in demographic characteristics across administrations. As part of the initial quality control checks, most psychometricians will investigate various demographic characteristics to get a pulse on sample stability. Unfortunately, many psychometricians may be tempted to investigate only "visible" demographic variables, such as gender and ethnicity. Failing to investigate "invisible" demographic variables, such as whether the examinee is a first-time or repeat test-taker or has previously received a failing result, could lead to an enormous mistake when equating examinations. Consider the following example.

Suppose a data set is provided to a psychometrician for scoring. As part of the initial quality control checks, s/he learns that the sample size and the visible demographic variables all seem fairly comparable to those of previous administrations. On the surface the sample appears comparable to previous samples, so the psychometrician proceeds to investigate item quality and functioning. Preliminary item analyses reveal the items appear to be sound and functioning properly. With this assurance, the psychometrician begins developing item anchors for equating purposes. After several iterations of investigating displacement values and unanchoring item calibrations that displace from those obtained in previous administrations, the psychometrician is satisfied with the remaining item calibrations and locks them down as anchors for the final scoring run.
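The anchor-screening loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure any particular Rasch program follows: displacement is approximated here as the difference between a freely estimated item difficulty (after applying a mean-shift link constant) and its banked value, and the function name `screen_anchors`, the item labels, and the 0.5-logit cutoff are all hypothetical.

```python
from statistics import mean


def screen_anchors(bank, current, threshold=0.5):
    """Iteratively unanchor the item with the largest displacement.

    bank, current: dicts mapping item id -> difficulty in logits
        (banked values and free estimates from the new administration).
    threshold: hypothetical cutoff in logits, not a value from the article.
    Returns (list of retained anchor ids, link constant in logits).
    """
    anchors = sorted(set(bank) & set(current))
    if not anchors:
        raise ValueError("no common items to anchor on")
    while len(anchors) > 1:
        # Link constant: mean shift aligning current estimates with the bank.
        shift = mean(bank[i] - current[i] for i in anchors)
        # Approximate displacement of each anchor after applying the shift.
        disp = {i: current[i] + shift - bank[i] for i in anchors}
        worst = max(anchors, key=lambda i: abs(disp[i]))
        if abs(disp[worst]) <= threshold:
            break  # all remaining anchors are within tolerance
        anchors.remove(worst)  # unanchor and re-evaluate the rest
    shift = mean(bank[i] - current[i] for i in anchors)
    return anchors, shift


# Hypothetical calibrations: item D has drifted, A-C share a common shift.
bank = {"A": -1.0, "B": 0.0, "C": 1.0, "D": 0.5}
current = {"A": -1.2, "B": -0.2, "C": 0.8, "D": 1.5}
kept, link = screen_anchors(bank, current)
print(kept, round(link, 3))
```

Note that a loop like this only looks at the items. It has no way of knowing whether a large displacement reflects a genuinely unstable item or a change in who sat the examination.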

Once the data are scored, the results are reviewed and compared to historical trends. Diagnostic results (e.g., fit statistics, separation and reliability estimates) appear sound, but some notable differences in pass/fail statistics and mean scaled scores are evident. Concerned, the psychometrician revisits the scoring processes by reviewing syntax and reproducing all relevant data files. The examination data are rescored and the same results are produced. Still suspicious, the psychometrician begins combing both the new data set and last year's data set to identify anyone who had previously taken the exam. A list of repeat examinees is pulled and their scores are compared across both administrations of the examination. It turns out virtually all of the repeat examinees appear to have performed worse on the new examination. How could this be? These examinees have had additional training, education, and time to prepare for the examination.

Upon closer inspection, the psychometrician is surprised to learn that a less obvious demographic characteristic had fluctuated among the examinees and caused this unusual scenario. It turns out a larger proportion of examinees were taking the examination because of a prior failure. This small, yet very important, artifact had a significant ripple effect on the quality of the final scores. The problem began when a less able sample interacted with the items and the psychometrician was deceived into thinking many of the existing calibrations were unstable. As a result, the psychometrician unanchored many item calibrations that should have been left alone. Thus, when the new scale was established, it shifted, and the resulting scores lost their meaning across administrations.

Although item equating under the Rasch framework is quite simple and straightforward, it still requires a great deal of careful attention. The scenario presented above illustrates how a significant problem may occur simply as a result of failing to investigate one key demographic characteristic of the sample. When equating, it is critical to consider all types of sample characteristics, especially those that pertain to previous performance. An inconsistency in these demographics can result in item instability, which in turn can be misread when examining displacement values and creating item anchors. It is for this reason that many psychometricians use only first-time examinee data when equating exams. In any case, all psychometricians who equate examinations under the Rasch framework would be wise to add to their list of quality control checks a comprehensive investigation of demographic characteristics, both before and after a scoring run is complete.
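One way to make such a check routine is to compare the proportion of repeat examinees between administrations before any anchoring decisions are made. The sketch below uses a standard two-proportion z-test built from the Python standard library; the function name `repeater_shift` and the counts in the usage example are hypothetical and serve only as an illustration.

```python
from math import erf, sqrt


def repeater_shift(prev_repeat, prev_total, new_repeat, new_total):
    """Two-proportion z-test on the share of repeat examinees.

    Returns (previous proportion, new proportion, z, two-sided p-value).
    """
    p_prev = prev_repeat / prev_total
    p_new = new_repeat / new_total
    pooled = (prev_repeat + new_repeat) / (prev_total + new_total)
    se = sqrt(pooled * (1 - pooled) * (1 / prev_total + 1 / new_total))
    if se == 0:
        # Both samples are all repeaters or all first-timers: no difference.
        return p_prev, p_new, 0.0, 1.0
    z = (p_new - p_prev) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_prev, p_new, z, p_value


# Hypothetical counts: 100 of 1000 repeaters last year, 200 of 1000 this year.
p_prev, p_new, z, p_value = repeater_shift(100, 1000, 200, 1000)
print(f"previous={p_prev:.2f} new={p_new:.2f} z={z:.2f} p={p_value:.4f}")
```

A large shift in this proportion would be a signal to investigate further, for example by calibrating on first-time examinees only, before treating displaced items as unstable.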

Kenneth D. Royal, University of North Carolina at Chapel Hill
Mikaela M. Raddatz, American Board of Physical Medicine and Rehabilitation

Royal K. & Raddatz M. (2013) A Cautionary Tale about Item Equating with Fluctuating Samples. Rasch Measurement Transactions, 27:2 p. 1417

Response on the Rasch Listserv, Sept. 6, 2013

I read with interest the recent note in RMT 27:2 by Royal and Raddatz that contained a cautionary tale about equating test forms for certification and licensure exams. By the end of the note, I was troubled by what I feel is a common misunderstanding about the properties of Rasch measurement.

Their tale begins with a test administration and the investigation of item quality and functioning before attempting to equate the current form to a previously established standard.

It is widely known that the properties of item invariance that allow equating in the Rasch model hold if and only if the data fit the Rasch model. The fit of the data to the model should be investigated in the initial stage of equating. The authors state, "Preliminary item analyses reveal the items appear to be sound and functioning properly." One assumes that the fit of the data to the model was confirmed in this process, though it is not explicitly stated. As the story continues, we find, in fact, that the equating solution does not hold across the various subgroups represented in the analysis and the calibration sample is subsequently altered to produce a different and more logical equating solution.

This suggests that the estimates of item difficulty were not freed from the distributional properties of the sample. Hence, the data cannot fit a Rasch model. One would hope that it is not necessary to get to the very end of the equating process before discovering that the estimates of item difficulty are not invariant and the link constant developed for equating is not acceptable.

What then is the cause of the problem? Without independent confirmation, I would suggest that the fit statistics used in the preliminary analysis lacked the power to detect violations of this type of first-time vs. repeater invariance. This is easily corrected with the use of the between-group item fit statistic available in Winsteps. It will not solve the problem of lack of fit to the Rasch model, but it will let you know there is a problem before you get too far into the equating process. Developing an item bank that measures both types of examinees fairly is an entirely different issue, and one that should be addressed. The lack of item invariance across subgroups is a classic definition of item bias.
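The idea behind a between-group check can be illustrated with a simplified sketch. For one dichotomous item, the observed number of correct responses in each subgroup (for example, first-time vs. repeat examinees) is compared with the number expected under the Rasch model, and the squared standardized differences are summed. This is an illustration of the general approach only, under the assumption that person measures and the item difficulty are already estimated; it is not the computation Winsteps performs, and the function names and data are hypothetical.

```python
from math import exp


def rasch_p(theta, b):
    """Rasch probability of a correct response (ability theta, difficulty b)."""
    return 1.0 / (1.0 + exp(-(theta - b)))


def between_group_chisq(responses, abilities, groups, difficulty):
    """Sum over groups of (observed - expected)^2 / variance for one item.

    responses: 0/1 scores on the item, one per examinee.
    abilities: person measures in logits, in the same order.
    groups: group label per examinee (e.g. "first", "repeat").
    Large values suggest the item does not behave the same way across groups.
    """
    obs, expected, var = {}, {}, {}
    for x, theta, g in zip(responses, abilities, groups):
        p = rasch_p(theta, difficulty)
        obs[g] = obs.get(g, 0) + x
        expected[g] = expected.get(g, 0.0) + p
        var[g] = var.get(g, 0.0) + p * (1 - p)
    return sum((obs[g] - expected[g]) ** 2 / var[g] for g in obs)


# Hypothetical data: 20 examinees of equal ability on an item of difficulty 0.
abilities = [0.0] * 20
groups = ["first"] * 10 + ["repeat"] * 10
responses = [1, 0] * 5 + [0] * 10  # first-timers as expected, repeaters all miss
print(between_group_chisq(responses, abilities, groups, 0.0))
```

In practice the sum would be referred to a chi-square distribution or converted to a mean square, and the check would be run for every item rather than one.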

Richard M. Smith, Editor
Journal of Applied Measurement

The URL of this page is www.rasch.org/rmt/rmt272c.htm
