Many-Facet Rasch Analysis for Student Ratings of University
Instructors
(10.30) Eunlim Kim Chi
Student ratings of instructors, as generally collected at
universities, reflect a special judgment situation in which the
number of judges (anonymous students) is very large (usually more
than 20), but each judge rates only one instructor. This
requires the development of an effective method for analyzing
such data. The present study investigates methods for
constructing objective measures for instructors from student
ratings. In addition, sub-dimensions of instructor evaluation
are identified.
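For reference, the "objective measures" here come from the many-facet Rasch model (Linacre, 1989), which underlies most of the studies below. In one common rating-scale formulation, with instructor n rated by student judge j on item i in category k,

\log \left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = B_n - D_i - C_j - F_k

where B_n is the instructor's measure, D_i the difficulty of item i, C_j the severity of judge j, and F_k the difficulty of the step up from category k-1 to k. The facets vary from study to study, but the additive logit structure is the same.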
Identifying Rater Errors in the Analysis of Text using the Many-
Faceted Rasch Model
(10.30) Judith A. Monsaas, Mona W. Matthews, George Engelhard,
Jr.
The purpose of this paper is to examine the use of fit statistics
for identifying transcripts that may need to be re-rated because
of rater unreliability. The data used are "story retellings" of
young K-2 children that have been audiotaped and transcribed.
Two raters divided these stories into T-units and counted the
number of words per T-unit. A T-unit (minimal terminable unit)
includes one main clause with all its subordinate clauses. It is
the smallest unit that could be considered a complete sentence.
The average length of the T-unit is a measure of language
complexity. Approximately 115 K-2 children were given this
assessment at the beginning and the end of the school year. Two
raters rated half of the transcripts each. They also rated about
40 common transcripts in order to examine interrater reliability.
Fixed effect chi-square statistics showed a significant rater
effect (chi^2=5.13, p=.02) with a reliability of rater separation
of .61. According to the Facets manual, traditional interrater
reliability statistics are "1 - separation reliability"; thus the
traditional index is about .39. The scores for the 40
transcripts rated by both raters were correlated (r=.59). INFIT
and OUTFIT statistics for each of the raters were greater than
1.5, suggesting that the two raters were using different criteria
for rating the transcripts. The research question examined here
is as follows: Can person fit statistics identify individual
transcripts that need to be re-rated? Transcripts with large
INFIT and OUTFIT statistics are identified. These transcripts are
re-rated by a third independent rater. After re-rating, Facets
is rerun to determine if reliability has improved.
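The arithmetic behind the reported indices is straightforward. A minimal sketch, with invented transcripts and rater scores (the study's own data are not reproduced here):

import statistics

def mean_t_unit_length(t_units):
    # Average words per T-unit: the language-complexity score for one transcript.
    return statistics.mean(len(unit.split()) for unit in t_units)

# Facets reports a reliability of rater separation; the manual's
# "traditional" interrater index is its complement.
separation_reliability = 0.61
traditional_interrater = 1 - separation_reliability    # 0.39, as quoted above

# Pearson correlation of the two raters' scores on the common transcripts
# (these numbers are invented; the study reported r = .59 on about 40).
rater_a = [8.2, 6.5, 9.1, 7.0, 5.8]
rater_b = [7.9, 6.0, 8.4, 7.7, 6.1]
r = statistics.correlation(rater_a, rater_b)

print(mean_t_unit_length(["the dog ran home", "she left because it rained"]))
print(traditional_interrater, round(r, 2))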
Exploring the Use of Facets Analysis in Formative Evaluation of a
Portfolio Assessment Scoring System
(10.30) Susan T. Paulukonis, Joan I. Heller, Carol M. Myford
Design and development of an effective portfolio assessment
system can be a time-consuming and costly process. This paper
details how Facets analysis was used in the pilot stages of
development of one such system to assist in and reduce the
development effort. After initial design of a rating system, a
pilot rating session was held. A Facets analysis was performed
on the data, and the results suggested that specific aspects of
the system needed redesign or refinement. Qualitative data were
then analyzed to more closely examine those aspects of the system
and suggest specific changes. Results of both Facets and
qualitative analyses are presented.
Evaluating Group Difference in Unconnected Data with Facets
(10.30) Richard E. Smith, Roger E. Wilk
In many recent studies, data which lack common elements across one of the facets (unconnected data) have been analyzed with the many-facet Rasch model. The Facets program identifies such data as "loosely connected data". This study utilized simulated data to test the extent to which generating group differences in loosely connected data can be recovered. Seven different approaches to
the analysis are examined. Of these, only the two-step
approaches recover a majority of the generating group parameters.
However, scaling differences in the difficulty and group
parameters remain a problem.
Many-Faceted Rasch Analysis of Children's Change in Self-Efficacy
(10.30) Weimo Zhu
The "ceiling effect" or "instrument decay" is often a serious
threat in studying change of self-efficacy because an infinite
range of ability/latent trait is forced into a finite range of
possible scores on a measurement scale. This study, using a
longitudinal design, illustrated how many-faceted Rasch analysis
provides a useful and convenient means of transforming total
observed scores from an ordinal scale into linear measures, of
examining stability of the scale, and of determining children's
change in psychomotor self-efficacy. These measures are compared
with their change in physical fitness and psychomotor
performance.
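The ordinal-to-linear transformation involved can be sketched for the simplest case: dichotomous items with illustrative difficulties (not Zhu's instrument), converting a raw score to a logit measure by Newton-Raphson. The sketch also shows why scores at the scale's ceiling have no finite measure.

import math

def score_to_measure(raw_score, difficulties, tol=1e-6):
    # Solve sum of P_i(theta) = raw_score for theta (in logits).
    n = len(difficulties)
    if raw_score <= 0 or raw_score >= n:
        raise ValueError("extreme score: no finite measure (the ceiling effect)")
    theta = math.log(raw_score / (n - raw_score))        # crude starting value
    while True:
        p = [1 / (1 + math.exp(d - theta)) for d in difficulties]
        step = (raw_score - sum(p)) / sum(pi * (1 - pi) for pi in p)
        theta += step
        if abs(step) < tol:
            return theta

print(score_to_measure(7, [-2.0, -1.0, -0.5, 0.0, 0.2, 0.5, 0.8, 1.0, 1.5, 2.0]))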
Perfect Fit
(23.16) William L. Bashaw
"Perfect fit" is taught in two ways - by assessment of model-data
similarity through fit statistics, and by match of target person
populations to item difficulty distributions. These are
incompatible if one accepts Guttman patterns as desirable. The
two concepts are reconciled by considering situations with known
parameters. Any value of fit can be interpreted in terms of
person-item match. The problem of interpreting very low fit
values is clarified. Expected values of INFIT and OUTFIT are
calculated in three situations and the ranges discussed. Guttman
patterns are discussed as impractical. Arguments apply to person
and item fit.
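The two statistics at issue are the standard Rasch mean-squares. A sketch with made-up dichotomous data shows how a Guttman-perfect response string drives both well below their expected value of 1:

def fit_statistics(observed, expected):
    # observed: 0/1 responses; expected: Rasch model probabilities.
    sq_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    variances = [e * (1 - e) for e in expected]
    # OUTFIT: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_residuals, variances)) / len(observed)
    # INFIT: information-weighted, residuals pooled before dividing.
    infit = sum(sq_residuals) / sum(variances)
    return infit, outfit

# A Guttman pattern on items ordered by difficulty: overfits the model.
observed = [1, 1, 1, 1, 0, 0, 0]
expected = [0.95, 0.88, 0.75, 0.60, 0.40, 0.25, 0.10]
print(fit_statistics(observed, expected))   # both mean-squares well below 1.0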
Judging Plans and Connectivity in a Facets Dataset
(23.16) John M. Linacre
Even in data that accord with Rasch model specifications, there
can be problems with measure estimability. Georg Rasch noticed
that extreme (zero or perfect) scores must be excluded from the
usual estimation procedure (1960, p.79). Andersen (1977) noticed
inestimability in data that contain a Guttman transition,
implying an infinitely wide gap in the latent continuum.
Sparseness may cause data to decompose into disconnected subsets.
Besides Guttman-like disjunctions, disconnection can be complete
(no examinees, items or judges in common between subsets) or
incomplete (all examinees respond to the same prompts but the New
York judges rate the New York examinees, the California judges
rate the California examinees, with no cross-over). A practical
algorithm for establishing connectivity has been developed from
Weeks & Williams (1964), but with many refinements to allow for
sparse and disordered observations, and also anchored, centered
and grouped facet elements.
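The core of such a connectivity check can be sketched as a union-find over facet elements. This toy version detects complete disconnection only; it omits the refinements for Guttman gaps, sparseness, anchoring and grouping that the abstract mentions.

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]    # path compression
        x = parent[x]
    return x

def count_subsets(observations):
    # observations: tuples of facet-element labels linked by one rating.
    parent = {}
    for obs in observations:
        for el in obs:
            parent.setdefault(el, el)
        root = find(parent, obs[0])
        for el in obs[1:]:
            parent[find(parent, el)] = root
    return len({find(parent, el) for el in parent})

# Complete disconnection: no examinees, items or judges in common.
data = [("NY-judge", "NY-examinee", "NY-prompt"),
        ("CA-judge", "CA-examinee", "CA-prompt")]
print(count_subsets(data))   # 2 disjoint subsets: measures not comparable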
Virtual Equating
(23.16) Stuart Luppescu
The goal is to estimate test item difficulties from the intrinsic
properties of the items across tests, thus producing equated
tests. These properties include the Lexile difficulties of reading passages, the Lexiles of the test items, and the
classification of the items (e.g., draw conclusions, infer
motives of character).
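One way to realize this, sketched below with invented numbers rather than Luppescu's data, is to regress the calibrated Rasch difficulties of existing items on their intrinsic properties, then predict difficulties for the items of a new test from the same properties, placing both tests on one scale.

import numpy as np

# columns: passage Lexile, item Lexile, 1 if "infer motives" type else 0
X = np.array([[850, 900, 0],
              [700, 720, 1],
              [990, 1010, 0],
              [620, 640, 1]], dtype=float)
y = np.array([0.8, -0.2, 1.3, -0.6])          # calibrated difficulties (logits)

X1 = np.column_stack([np.ones(len(X)), X])    # add an intercept
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# A new, never-calibrated item's predicted difficulty on the same scale:
print(np.array([1.0, 760, 780, 1.0]) @ coef)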
Evaluating the Accuracy of Judgments Obtained from Item Review
Committees
(39.13) George Engelhard, Jr., Melodee Davis, Linda Hansche
The purpose of this study is to examine whether or not the judges
on item review committees can accurately identify test items that
exhibit a variety of flaws. An instrument with 75 items (47 with
one or more flaws, and 28 with no known flaws) was constructed
and administered to 39 judges who were operational members of an
item review committee. After undergoing training, the 39 judges
were asked to examine the 75 items and indicate whether or not
each item exhibited cultural or technical flaws. There were 8
cultural flaw categories (e.g., Does the item unfairly favor
males or females?) and 8 technical flaw categories (e.g., Is the
item content inaccurate or factually incorrect?). The accuracy
of the judges was defined in terms of the match between the
judged classifications and the a priori classification of the
items. The Facets model was used to analyze the data. The data
suggest that there are statistically significant differences in
judgmental accuracy, although the judges exhibited fairly high
accuracy rates overall that ranged from 83% to 94%. The data
also suggest that it is easier to identify some types of flaws
than others; specifically, the judges were more accurate in
identifying items with no flaws and less accurate in identifying
items with both cultural and technical flaws. Suggestions for
future research on judgmental accuracy and the implications of
this study for identifying biased items are discussed.
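The accuracy index as defined above is simply the proportion of matches. A sketch with one hypothetical judge's classifications:

def accuracy(judged, a_priori):
    # Proportion of items whose judged classification matches the a priori one.
    return sum(j == a for j, a in zip(judged, a_priori)) / len(a_priori)

a_priori = ["flawed"] * 47 + ["clean"] * 28             # the 75-item instrument
judged = a_priori[:40] + ["clean"] * 7 + a_priori[47:]  # 7 flawed items missed
print(f"{accuracy(judged, a_priori):.0%}")              # 91%, inside the 83%-94% range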
Descriptive or Explanatory Measurement?
(39.13) John M. Linacre
Demographic decomposition of person ability parameters can be
easily accomplished. Nevertheless, problems in selection,
interpretation and communication arise. In this paper, measures
constructed from responses of 24,944 adults to 169 items on the
1992 National Adult Literacy Survey are analyzed in several ways.
1. Descriptive measurement and explanatory analysis:
A person-item Rasch analysis is performed in which each adult is
measured. These measures are summarized by demographic factor
(unadjusted for other factors). (This is the typical report
displayed in newspapers.) Each adult measure is then
"decomposed" demographically according to gender, ethnicity,
native language etc. The demographic effects are estimated using
regression. (This is the typical report displayed in research
findings.)
2. Explanatory measurement:
Each adult is "decomposed" demographically according to gender,
ethnicity, native language, etc. and treated as a random
representative of each demography. A many-facet Rasch analysis
is performed in which each demographic mean effect is measured.
No estimate of within-demographic-effect sample variance is
obtained. This variance inflates measurement error, reduces test
discrimination, and reduces the logit demographic effect size.
Are explanatory measurement reports an ideal to be aimed at or a misleading distraction? The audience is encouraged to consider this question by comparing plots depicting different answers to the question "what size is the demographic effect?"
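The contrast in (1) between unadjusted group summaries and regression-adjusted effects can be illustrated with simulated person measures (not the NALS data):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
female = rng.integers(0, 2, n)
native = rng.integers(0, 2, n)
measure = 0.3 * female + 0.8 * native + rng.normal(0, 1, n)   # person logits

# Descriptive: unadjusted group difference (the "newspaper" report).
print(measure[female == 1].mean() - measure[female == 0].mean())

# Explanatory analysis: effects adjusted for each other via regression
# (the "research findings" report).
X = np.column_stack([np.ones(n), female, native])
coef, *_ = np.linalg.lstsq(X, measure, rcond=None)
print(coef[1], coef[2])    # adjusted gender and native-language effects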
The Construction of Meaning: Replicating Objectively Derived
Criterion-Referenced Standards
(54.54) Gregory E. Stone
In this replication study, stabilities of meaning, measure and
examinee pass rates resulting from objective (Rasch) criterion-
referenced standards are examined. Data were collected from the
same standard setting judges who participated in the initial
project (Stone, 1995). Stability was assessed through both
quantitative measurements and qualitative explorations. The
results confirmed the stability of objective (Rasch) standards
and further supported the notion that objective content-based
standards, unlike Angoff performance-based standards, originate
from well-developed, stable and legitimately meaningful
constructs. The report further confirmed the effectiveness of
the objective model's direct, content-based, judge decision-
making process.
Models of Judgment and Rasch Measurement Theory
(54.58) George Engelhard, Jr.
The purpose of this paper is to describe four models of rater
judgment and to examine the implications of these models for the
elicitation and analysis of rater judgments within the context of
educational assessment. The specific models that will be
described were developed by Brunswik (1952), Hogarth (1987),
Landy and Farr (1983), and Wherry (1952). These models were
selected because of their strong influence on practice and their
representativeness relative to other extant models of judgment.
The use of modern Item Response Theory as represented by Rasch
Measurement Theory to model rater judgments within the context of
these four models is stressed throughout the paper. The models
described in this paper can provide a strong theoretical basis
for examining rater judgments and hold the potential for
providing a coherent and systematic framework for improving
educational assessment systems based on judgments.
Gender Differences in Performance on Multiple-Choice and
Constructed Response Mathematics Items
(54.58) Mary Garner, George Engelhard, Jr.
The purpose of this study is to examine gender differences in
performance on multiple-choice and constructed response items in
mathematics, administered within the context of a high stakes,
curriculum specific, high school graduation test. A secondary
purpose is to demonstrate a method based on the many-faceted
Rasch measurement model (Facets) for comparing performance on
different item types and exploring differential item functioning
(DIF). A random sample of approximately 4,000 eleventh graders was used for the analysis. The study will seek answers to the
following questions: (1) Are there gender differences in
mathematics performance? (2) Are gender differences linked to
content areas within mathematics? (3) Are gender differences
linked to cognitive levels of the tasks? (4) Do gender
differences vary according to item format?
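For question (4), one common check (a separate-calibration comparison, not necessarily the authors' exact Facets procedure, and sketched here with invented estimates) flags DIF when an item's difficulty differs between gender groups by more than chance:

def dif_t(d_girls, se_girls, d_boys, se_boys):
    # Approximate t statistic for the difference between group calibrations (logits).
    return (d_girls - d_boys) / (se_girls ** 2 + se_boys ** 2) ** 0.5

print(dif_t(0.45, 0.08, 0.12, 0.07))   # about 3.1: the item would be flagged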
AERA abstracts. Rasch Measurement Transactions, 1996, 9:4 p.460