Early Detection of Item Miskey on a CAT: The Use of Multiple Indices

An item on an operational test that has been keyed incorrectly represents a threat to score validity. A miskeyed item can cause more able examinees to receive lower overall scores and less able examinees to receive higher overall scores, reducing the ability to discriminate clearly between examinees. This is particularly serious when test scores are used for classification, such as determining whether an examinee should be awarded a professional license or certificate. A procedure that can detect miskeyed items early in an examination cycle improves the integrity of a testing program by reducing the likelihood of misclassifying examinees.

No single statistical index is completely reliable for detecting a miskeyed item. In classical test statistics, the p-value and the point-biserial correlation have frequently been used to identify miskeyed items: a low p-value and a negative point biserial are often interpreted as indicating a miskey. While these outcomes can indicate a miskey, they are also associated with other item problems, such as ambiguity or multiple correct answers. For a fixed-form test these statistics may be sufficient; in a computerized adaptive test (CAT) environment, however, their usefulness is greatly diminished. A CAT presents items to an examinee based on an estimate of the examinee's ability, so the sample used to calculate operational p-values and point biserials differs from the reference group used to establish the original item parameters. Further, p-value and point-biserial estimates calculated from CAT responses are less stable because of the restricted range of the sample.

In item response theory (IRT) models, fit statistics are often used to identify problem items. Commonly, the weighted (infit) and unweighted (outfit) standardized mean-square statistics are used to identify items that do not meet the expectations of the measurement model. However, both infit and outfit are calculated from deviations from model expectations, so a restricted sample range affects these calculations as well, making them difficult to use for identifying miskeyed items.
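For reference, a minimal Python sketch of how the classical indices discussed above can be computed from a scored 0/1 response matrix; the function and variable names are illustrative, the point biserial shown is the uncorrected item-total correlation, and missing-data handling is omitted:

```python
import numpy as np

def classical_item_stats(responses):
    # responses: (n_examinees, n_items) array of 0/1 scores
    total = responses.sum(axis=1)          # raw total score for each examinee
    p_values = responses.mean(axis=0)      # proportion answering each item correctly
    point_biserials = np.array([
        np.corrcoef(responses[:, j], total)[0, 1]   # item-total correlation
        for j in range(responses.shape[1])
    ])
    return p_values, point_biserials
```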

There are three things to consider when identifying miskeys in an operational examination.

  1. Is there a statistic or combination of statistics that can identify miskeys?
  2. How large a sample is needed to make useful decisions?
  3. What is the rate of false positive and false negative identifications?

Ideally, a single statistic would provide all the information needed to identify a miskey, but a combination of statistics may prove more useful. Sample size matters because a method that works well with a smaller sample would enable earlier analysis during a testing cycle, reducing the length of time a miskeyed item remains in use. Finally, it is important to understand the false-positive and false-negative rates, since too many false positives require unnecessary manual inspection and too many false negatives would defeat the purpose of the procedure. Cut-off values can be established for each statistic such that any item falling above or below the established values is treated as a likely miskey candidate.
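As a hypothetical illustration of the third consideration, when the truly miskeyed items are known (as they are in a simulation study), the false-positive and false-negative rates of any flagging rule can be computed directly; this Python sketch is illustrative and not part of the study's code:

```python
def error_rates(flagged, truly_miskeyed, all_items):
    # flagged, truly_miskeyed: sets of item ids; all_items: every item administered
    normal = set(all_items) - set(truly_miskeyed)
    false_positive_rate = len(set(flagged) & normal) / len(normal)
    false_negative_rate = len(set(truly_miskeyed) - set(flagged)) / len(truly_miskeyed)
    return false_positive_rate, false_negative_rate
```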

To explore this idea, a simulation was conducted to examine whether readily available statistics, singly or in combination, could consistently identify miskeyed items. The statistics investigated were p-value, point-measure correlation, infit, outfit, displacement, and upper asymptote. The upper asymptote statistic is available in the Winsteps item analysis and represents a four-parameter IRT model (4-PL) estimate of carelessness or inadvertent selection of a wrong answer. The expectation is that this value should be close to 1 for normal items and much smaller for miskeyed items.
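For reference, one common parameterization of the 4-PL model gives the probability of a correct response as

\( P(X = 1 \mid \theta) = c + (d - c)\,\frac{\exp[a(\theta - b)]}{1 + \exp[a(\theta - b)]} \)

where b is the item difficulty, a the discrimination, c the lower asymptote (guessing), and d the upper asymptote. A normal item should yield an estimated d near 1; for a miskeyed item even the most able examinees are scored incorrect, so the estimated d is expected to be much smaller.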

A simulator program (Becker, 2012) was used to administer ten replications of a variable-length CAT examination, each replication with 1,200 examinees. Eight items from a pool of more than 1,400 items were selected to be miskeyed. The simulator used the candidate ability measure and the item difficulty to generate a probability of a correct response for each candidate-item interaction. A random number was then generated and, if that number was less than or equal to the probability, the candidate was scored as having answered the item correctly. When the candidate encountered a miskeyed item, however, the same condition resulted in the candidate being scored as having answered the item incorrectly. The resulting matrix of response strings was then analyzed with Winsteps, and the statistical indices described above were examined to assess their utility in identifying the miskeyed items.
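A minimal sketch of this response-generation step, assuming a Rasch (1-PL) model for the probability of success; the function and example values are illustrative and are not taken from the simulator (Becker, 2012) itself:

```python
import numpy as np

def simulate_response(theta, b, miskeyed, rng):
    # theta: candidate ability (logits); b: item difficulty (logits)
    p = 1.0 / (1.0 + np.exp(-(theta - b)))   # Rasch probability of a correct response
    correct = rng.random() <= p               # would the candidate answer correctly?
    if miskeyed:
        correct = not correct                 # a wrong key reverses the scoring (assumed symmetric)
    return int(correct)

rng = np.random.default_rng(2013)
print(simulate_response(theta=0.5, b=-0.2, miskeyed=False, rng=rng))  # normal item
print(simulate_response(theta=0.5, b=-0.2, miskeyed=True, rng=rng))   # miskeyed item
```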

The analysis identified three statistics that, used in combination, gave the cleanest separation between miskeyed and normal items: p-value, displacement, and upper asymptote. The most useful cut-off values were p-value <= 0.20, displacement >= 1.5, and upper asymptote <= 0.4. Because items can receive different N counts under the selection algorithm of the variable-length CAT, a further cut-off was established requiring a minimum exposure of 20. The ten replications, each containing eight miskeyed items, presented 80 cases in which a miskeyed item could be flagged. Using these criteria, miskeyed items were flagged in 68 of the 80 cases (85%); conversely, none of the 14,640 normal-item cases were flagged. Of the 12 cases in which a miskeyed item was not flagged, 7 involved the same item, the hardest item in the miskeyed set. Logically, hard items are going to be the most difficult to detect as miskeys.
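A sketch of how these cut-offs could be combined into a single screening rule; the item records and field names below are invented for illustration and are not Winsteps output labels:

```python
def flag_miskey(item):
    return (item["count"] >= 20                # minimum exposure
            and item["p_value"] <= 0.20        # unexpectedly low proportion correct
            and item["displacement"] >= 1.5    # performing much harder than its banked difficulty
            and item["upper_asymptote"] <= 0.4)

items = [
    {"id": "A1", "count": 143, "p_value": 0.12, "displacement": 2.1, "upper_asymptote": 0.22},
    {"id": "A2", "count": 310, "p_value": 0.78, "displacement": 0.1, "upper_asymptote": 0.97},
]
print([it["id"] for it in items if flag_miskey(it)])   # -> ['A1']
```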

Reference

Becker, K. (2012). Pearson CAT Simulator. Chicago, IL: Pearson VUE.

John A. Stahl and Gregory M. Applegate
Pearson VUE


Early Detection of Item Miskey on a CAT: The Use of Multiple Indices. John A. Stahl & Gregory M. Applegate … Rasch Measurement Transactions, 2013, 27:1 p. 1405-6



