AERA Rasch Measurement Abstracts, 1996

Many-Facet Rasch analysis for Student Ratings of University Instructors
(10.30) Eunlim Kim Chi
Student ratings of instructors, as generally collected at universities, reflect a special judgment situation in which the number of judges (anonymous students) is very large (usually more than 20), but each judge rates only one instructor. This requires the development of an effective method for analyzing such data. The present study investigates methods for constructing objective measures for instructors from student ratings. In addition, sub-dimensions of instructor evaluation are identified.

Identifying Rater Errors in the Analysis of Text using the Many-Faceted Rasch Model
(10.30) Judith A. Monsaas, Mona W. Matthews, George Engelhard, Jr.
The purpose of this paper is to examine the use of fit statistics for identifying transcripts that may need to be re-rated because of rater unreliability. The data used are "story retellings" of young K-2 children that have been audiotaped and transcribed. Two raters divided these stories into T-units and counted the number of words per T-unit. A T-unit (minimal terminable unit) includes one main clause with all its subordinate clauses. It is the smallest unit that could be considered a complete sentence. The average length of the T-unit is a measure of language complexity. Approximately 115 K-2 children were given this assessment at the beginning and the end of the school year. Two raters rated half of the transcripts each. They also rated about 40 common transcripts in order to examine interrater reliability. Fixed effect chi-square statistics showed a significant rater effect (chi^2=5.13, p=.02) with a reliability of rater separation of .61. According to the Facets manual, traditional interrater reliability statistics are "1 - separation reliability"; thus the traditional index is about .39. The scores for the 40 transcripts rated by both raters were correlated (r=.59). INFIT and OUTFIT statistics for each of the raters were greater than 1.5, suggesting that the two raters were using different criteria for rating the transcripts. The research question examined here is as follows: Can person fit statistics identify individual transcripts that need to be re-rated? Transcripts with large INFIT and OUTFIT statistics are identified. These transcripts are re-rated by a third independent rater. After re-rating, Facets is rerun to determine if reliability has improved.
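
The two simple quantities in this abstract, the average T-unit length and the "1 - separation reliability" index, can be sketched as follows. This is a minimal illustration only: the function names and the word counts are hypothetical, and only the .61 figure comes from the abstract.

```python
def mean_t_unit_length(word_counts):
    """Average number of words per T-unit for one transcript."""
    if not word_counts:
        raise ValueError("transcript has no T-units")
    return sum(word_counts) / len(word_counts)


def traditional_index(separation_reliability):
    """Traditional interrater index, taken as 1 - separation reliability
    (the relationship quoted in the abstract from the Facets manual)."""
    return 1.0 - separation_reliability


# Hypothetical transcript: words counted in each of four T-units.
example_counts = [5, 7, 4, 8]
print(mean_t_unit_length(example_counts))

# Rater separation reliability reported in the abstract.
print(round(traditional_index(0.61), 2))
```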

Exploring the Use of Facets Analysis in Formative Evaluation of a Portfolio Assessment Scoring System
(10.30) Susan T. Paulukonis, Joan I. Heller, Carol M. Myford
Design and development of an effective portfolio assessment system can be a time-consuming and costly process. This paper details how Facets analysis was used in the pilot stages of development of one such system to assist in and reduce the development effort. After initial design of a rating system, a pilot rating session was held. A Facets analysis was performed on the data, and the results suggested that specific aspects of the system needed redesign or refinement. Qualitative data were then analyzed to more closely examine those aspects of the system and suggest specific changes. Results of both Facets and qualitative analyses are presented.

Evaluating Group Difference in Unconnected Data with Facets
(10.30) Richard E. Smith, Roger E. Wilk
In many recent studies, data that lack common elements across one of the facets (unconnected data) have been analyzed with the many-facet Rasch model. The Facets program identifies such data as "loosely connected data". This study utilized simulated data to test the extent to which the generating group differences in loosely connected data can be recovered. Seven different approaches to the analysis are examined. Of these, only the two-step approaches recover a majority of the generating group parameters. However, scaling differences in the difficulty and group parameters remain a problem.

Many-Faceted Rasch Analysis of Children's Change in Self-Efficacy
(10.30) Weimo Zhu
The "ceiling effect" or "instrument decay" is often a serious threat in studying change of self-efficacy because an infinite range of ability/latent trait is forced into a finite range of possible scores on a measurement scale. This study, using a longitudinal design, illustrated how many-faceted Rasch analysis provides a useful and convenient means of transforming total observed scores from an ordinal scale into linear measures, of examining stability of the scale, and of determining children's change in psychomotor self-efficacy. These measures are compared with their change in physical fitness and psychomotor performance.
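
The ceiling effect described here can be illustrated with a plain log-odds transform of a raw score. This is only a sketch under a strong simplifying assumption: it is not the many-faceted Rasch estimation used in the study, and the 20-point scale and the scores are hypothetical.

```python
import math


def score_to_logit(score, max_score):
    """Log-odds of a raw score on a bounded scale (illustration only)."""
    if not 0 < score < max_score:
        raise ValueError("extreme scores have no finite logit")
    proportion = score / max_score
    return math.log(proportion / (1 - proportion))


# The same 2-point raw gain, once mid-scale and once near the ceiling.
mid_gain = score_to_logit(12, 20) - score_to_logit(10, 20)
top_gain = score_to_logit(18, 20) - score_to_logit(16, 20)
print(mid_gain, top_gain)
```

Under this transform, the gain near the ceiling is larger in logits than the same raw gain at mid-scale, which is the sense in which a finite score range compresses change at the extremes.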

Perfect Fit
(23.16) William L. Bashaw
"Perfect fit" is taught in two ways - by assessment of model-data similarity through fit statistics, and by match of target person populations to item difficulty distributions. These are incompatible if one accepts Guttman patterns as desirable. The two concepts are reconciled by considering situations with known parameters. Any value of fit can be interpreted in terms of person-item match. The problem of interpreting very low fit values is clarified. Expected values of INFIT and OUTFIT are calculated in three situations and the ranges discussed. Guttman patterns are discussed as impractical. Arguments apply to person and item fit.
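
For readers who want to see why a Guttman pattern yields very low fit values when parameters are known, the following sketch computes the conventional INFIT and OUTFIT mean-squares for one person on dichotomous items. The ability, the item difficulties and the response patterns are hypothetical, and this is not a reconstruction of the calculations in the paper.

```python
import math


def fit_mean_squares(responses, ability, difficulties):
    """Return (infit, outfit) mean-squares for one person, dichotomous Rasch model."""
    squared_residuals = []
    variances = []
    for x, d in zip(responses, difficulties):
        p = 1.0 / (1.0 + math.exp(-(ability - d)))  # model probability of success
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))             # model variance of the response
    # OUTFIT: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # INFIT: information-weighted mean-square.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


difficulties = [-2.0, -1.0, 1.0, 2.0]  # hypothetical, easiest to hardest
guttman = [1, 1, 0, 0]                 # succeeds on every easy item, fails every hard one
reversed_pattern = [0, 0, 1, 1]        # the opposite, highly unexpected

print(fit_mean_squares(guttman, 0.0, difficulties))
print(fit_mean_squares(reversed_pattern, 0.0, difficulties))
```

With these values the Guttman pattern gives mean-squares well below 1 (the responses are more predictable than the model expects), while the reversed pattern gives values well above 1.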

Judging Plans and Connectivity in a Facets Dataset
(23.16) John M. Linacre
Even in data that accord with Rasch model specifications there can be problems with measure estimability. Georg Rasch noticed that extreme (zero or perfect) scores must be excluded from the usual estimation procedure (1960, p.79). Andersen (1977) noticed inestimability in data that contain a Guttman transition, implying an infinitely wide gap in the latent continuum. Sparseness may cause data to decompose into disconnected subsets. Besides Guttman-like disjunctions, disconnection can be complete (no examinees, items or judges in common between subsets) or incomplete (all examinees respond to the same prompts but the New York judges rate the New York examinees, the California judges rate the California examinees, with no cross-over). A practical algorithm for establishing connectivity has been developed from Weeks & Williams (1964), but with many refinements to allow for sparse and disordered observations, and also anchored, centered and grouped facet elements.
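
The New York / California example can be made concrete with a small connectivity check. The sketch below assumes a simplified linking rule, chosen for illustration: two observations are linked when they differ in at most one facet element, and subsets are the resulting chains of linked observations. It is not the refined algorithm the abstract describes, and all element names are hypothetical.

```python
from itertools import combinations


def connected_subsets(observations):
    """Group observations (tuples of facet elements) into linked subsets."""
    parent = list(range(len(observations)))

    def find(i):
        # Follow parent pointers to the subset representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(observations)), 2):
        differing = sum(a != b for a, b in zip(observations[i], observations[j]))
        if differing <= 1:  # assumed linking rule: at most one facet differs
            parent[find(i)] = find(j)

    groups = {}
    for i, obs in enumerate(observations):
        groups.setdefault(find(i), []).append(obs)
    return list(groups.values())


# (judge, examinee, prompt): the prompts are shared, but judges are nested in regions.
nested = [
    ("NY_judge", "NY_exam1", "prompt1"),
    ("NY_judge", "NY_exam1", "prompt2"),
    ("NY_judge", "NY_exam2", "prompt1"),
    ("CA_judge", "CA_exam1", "prompt1"),
    ("CA_judge", "CA_exam1", "prompt2"),
    ("CA_judge", "CA_exam2", "prompt1"),
]
print(len(connected_subsets(nested)))

# One cross-over rating (a California judge rates a New York examinee) joins the subsets.
crossed = nested + [("CA_judge", "NY_exam1", "prompt1")]
print(len(connected_subsets(crossed)))
```

Note that a check which merely asks whether any element is shared would report the nested design as connected, because both regions use the same prompts; the one-facet-at-a-time rule is what exposes the judge-examinee confounding. The pairwise comparison is quadratic in the number of observations, which is acceptable for a sketch but not for large sparse datasets.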

Virtual Equating
(23.16) Stuart Luppescu
The goal is to estimate test item difficulties from the intrinsic properties of the items across tests, thus producing equated tests. These properties include the lexile difficulties of reading passages, the lexiles of the test items, and the classification of the items (e.g., draw conclusions, infer motives of character).

Evaluating the Accuracy of Judgments Obtained from Item Review Committees
(39.13) George Engelhard, Jr., Melodee Davis, Linda Hansche
The purpose of this study is to examine whether or not the judges on item review committees can accurately identify test items that exhibit a variety of flaws. An instrument with 75 items (47 with one or more flaws, and 28 with no known flaws) was constructed and administered to 39 judges who were operational members of an item review committee. After undergoing training, the 39 judges were asked to examine the 75 items and indicate whether or not each item exhibited cultural or technical flaws. There were 8 cultural flaw categories (e.g., Does the item unfairly favor males or females?) and 8 technical flaw categories (e.g., Is the item content inaccurate or factually incorrect?). The accuracy of the judges was defined in terms of the match between the judged classifications and the a priori classification of the items. The Facets model was used to analyze the data. The data suggest that there are statistically significant differences in judgmental accuracy, although the judges exhibited fairly high accuracy rates overall that ranged from 83% to 94%. The data also suggest that it is easier to identify some types of flaws than others; specifically, the judges were more accurate in identifying items with no flaws and less accurate in identifying items with both cultural and technical flaws. Suggestions for future research on judgmental accuracy and the implications of this study for identifying biased items are discussed.
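
The accuracy definition used here, the match between a judge's classifications and the a priori classifications, can be sketched as a simple proportion. The five-item data below are hypothetical and far smaller than the 75-item instrument; the study itself analyzed such matches with the Facets model rather than reporting raw proportions alone.

```python
def accuracy_rate(judged, a_priori):
    """Proportion of items where the judge's flag matches the a priori flag."""
    if len(judged) != len(a_priori):
        raise ValueError("classification lists must have the same length")
    matches = sum(j == a for j, a in zip(judged, a_priori))
    return matches / len(a_priori)


# True = item flagged as flawed, False = item judged to have no flaw.
a_priori_flags = [True, True, False, True, False]
one_judge_flags = [True, False, False, True, False]
print(accuracy_rate(one_judge_flags, a_priori_flags))
```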

Descriptive or Explanatory Measurement?
(39.13) John M. Linacre
Demographic decomposition of person ability parameters can be easily accomplished. Nevertheless, problems in selection, interpretation and communication arise. In this paper, measures constructed from responses of 24,944 adults to 169 items on the 1992 National Adult Literacy Survey are analyzed in several ways.

1. Descriptive measurement and explanatory analysis:
A person-item Rasch analysis is performed in which each adult is measured. These measures are summarized by demographic factor (unadjusted for other factors). (This is the typical report displayed in newspapers.) Each adult measure is then "decomposed" demographically according to gender, ethnicity, native language etc. The demographic effects are estimated using regression. (This is the typical report displayed in research findings.)

2. Explanatory measurement:
Each adult is "decomposed" demographically according to gender, ethnicity, native language, etc. and treated as a random representative of each demography. A many-facet Rasch analysis is performed in which each demographic mean effect is measured. No estimate of within-demographic-effect sample variance is obtained. This variance inflates measurement error, reduces test discrimination, and reduces the logit demographic effect size.

Are explanatory measurement reports an ideal to be aimed at or a misleading distraction? The audience is encouraged to consider this question by comparing plots depicting different answers to the question "what size is the demographic effect?"

The Construction of Meaning: Replicating Objectively Derived Criterion-Referenced Standards
(54.54) Gregory E. Stone
In this replication study, stabilities of meaning, measure and examinee pass rates resulting from objective (Rasch) criterion-referenced standards are examined. Data were collected from the same standard setting judges who participated in the initial project (Stone, 1995). Stability was assessed through both quantitative measurements and qualitative explorations. The results confirmed the stability of objective (Rasch) standards and further supported the notion that objective content-based standards, unlike Angoff performance-based standards, originate from well-developed, stable and legitimately meaningful constructs. The report further confirmed the effectiveness of the objective model's direct, content-based, judge decision-making process.

Models of Judgment and Rasch Measurement Theory
(54.58) George Engelhard, Jr.
The purpose of this paper is to describe four models of rater judgment and to examine the implications of these models for the elicitation and analysis of rater judgments within the context of educational assessment. The specific models that will be described were developed by Brunswik (1952), Hogarth (1987), Landy and Farr (1983), and Wherry (1952). These models were selected because of their strong influence on practice and their representativeness relative to other extant models of judgment. The use of modern Item Response Theory as represented by Rasch Measurement Theory to model rater judgments within the context of these four models is stressed throughout the paper. The models described in this paper can provide a strong theoretical basis for examining rater judgments and hold the potential for providing a coherent and systematic framework for improving educational assessment systems based on judgments.

Gender Differences in Performance on Multiple-Choice and Constructed Response Mathematics Items
(54.58) Mary Garner, George Engelhard, Jr.
The purpose of this study is to examine gender differences in performance on multiple-choice and constructed response items in mathematics, administered within the context of a high stakes, curriculum specific, high school graduation test. A secondary purpose is to demonstrate a method based on the many-faceted Rasch measurement model (Facets) for comparing performance on different item types and exploring differential item functioning (DIF). A random sample of approximately 4,000 eleventh graders was used for the analysis. The study will seek answers to the following questions: (1) Are there gender differences in mathematics performance? (2) Are gender differences linked to content areas within mathematics? (3) Are gender differences linked to cognitive levels of the tasks? (4) Do gender differences vary according to item format?


The Rasch Measurement SIG (AERA) thanks the Institute for Objective Measurement for inviting the publication of Rasch Measurement Transactions on the Institute's website, www.rasch.org.
