The Voice of Georg Rasch
David Andrich, Murdoch U.
From closely following the development of Rasch models for
measurement in the 1970s, I formed the impression that Rasch could
precipitate a Kuhnian-type revolution in social measurement. Such
revolutions are generally inspired by people with unusual
intellectual insights and personal qualities. I had the excellent
fortune of living and working closely with Georg Rasch in the mid
1970s over two periods which totalled twelve months. It seemed
important that the events that he saw as shaping his thinking
should be recorded. Accordingly, in 1979 I taped conversations
with Rasch over a series of evenings. These tapes provide
participants with The Voice of Georg Rasch.
The Nationwide On-Line Networking of Item Banks in
Sun-Geun Baek, Korean Educational Development Institute
An initial item bank of 13,000 items has been constructed.
Challenges addressed include maintaining accurate item
calibrations, adding new items to the bank, and managing
differential item functioning.
Making Sense Out of Survey Data
Betty Bergstrom, American Dietetic Association, Commission on
Dietetic Registration
Presents a survey designed to quantify the scope of practice from
entry-level to advanced level practice in the field of dietetics.
Data were analyzed with the Masters Partial Credit Model because of
a clear difference in the difficulty of steps across items. Using
demographic data and the computed expected mean scores and most
probable responses, a map of the practice of dietetics was
constructed. This allows dietitians to see the differences in
roles and tasks performed along the continuum of practice and also
to estimate the impact of acquiring an advanced level degree or
choosing a job within a particular setting.
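
For readers unfamiliar with the quantities named above, the following is a minimal sketch of how an expected score and a most probable response follow from the Partial Credit Model for a single item. It is an illustration only, not the author's analysis; the function names, the person measure and the step difficulties are hypothetical values chosen for the example.

```python
import math


def pcm_probabilities(theta, step_difficulties):
    """Category probabilities for one item under the Partial Credit Model.

    theta: person measure in logits.
    step_difficulties: step difficulties delta_1..delta_m in logits.
    Returns a list of m + 1 probabilities for scores 0..m.
    """
    # Cumulative sums of (theta - delta_k); score 0 has an empty sum of 0.
    exponents = [0.0]
    for delta in step_difficulties:
        exponents.append(exponents[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probabilities):
    """Model-expected score: sum of each score times its probability."""
    return sum(score * p for score, p in enumerate(probabilities))


def most_probable_response(probabilities):
    """Score category with the highest model probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Hypothetical person measure and step difficulties (logits).
probs = pcm_probabilities(theta=0.5, step_difficulties=[-1.0, 0.0, 1.5])
print(probs, expected_score(probs), most_probable_response(probs))
```

Evaluating these two functions over a range of person measures gives the kind of information from which a map of expected and most probable responses can be drawn.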
Using the Rasch Measurement Model in Grading Essay-Type
Items: A Way to Enhance both Economy and Objectivity of
Sunhee Chae, Korean Educational Development Institute
Using data from the national examination for public school
teachers, the Facets approach is used to measure judge
leniency and investigate the judging design.
Psychophysics of Stage: Task Complexity and Statistical
Michael L. Commons, Harvard Medical School; Francis A. Richards,
Edward J. Trudeau, Eric A. Goodheart, Harvard U.; Theo Dawson, UC
The assessment of developmental stage has been limited to measures
of performance rather than an analysis of task demands. To remedy
that lack, a behavior-analytic-compatible core-developmental notion
of hierarchical complexity of tasks has been introduced. The
hierarchical complexity of tasks and the stage of the response
chains that complete them have three unusual properties that are
discussed in the paper. The General Stage Model predicts that
performance of items within a hierarchically ordered task sequence
will have a set of corresponding Rasch-model-generated stage
The Stability of Rater Characteristics in Large Scale
Assessment Programs (also AERA 10.48)
Peter J. Congdon, Joy McQueen, Australian Council for
Educational Research
This study reports on the stability of rater severity over an
extended period of time and the stability of rater severity across
the different performance dimensions within the context of a large
scale assessment program. The study focuses on changes in rater
severity and rater fit over an eight day period.
Assessment of Stage Transitions in Performance on Commons'
Balance Beam Instrument
Theo Dawson, UC Berkeley; Eric Goodheart, Harvard U.; Karen
Draney, Mark Wilson, UC Berkeley; Michael Commons, Harvard Medical
We perform both Rasch and saltus analyses of performance on
Commons' Balance Beam instrument. Analysis with the Rasch model
clearly shows two hierarchical stages of development in the item
estimates, though four are predicted. To further examine stage
transitions, we use saltus analysis, which models latent group
effects as well as latent trait effects.
Raters and Single Prompt-to-Prompt Equating Using the
Facets Model in Writing Performance Assessment
Yi Du and William L. Brown, Minneapolis Public Schools
The study is based on an equating practice of the 1996 writing
assessment in Minneapolis Public Schools (MPS). In the writing
assessment, a common topic with three different prompts
representing narrative, persuasive, and informative writing, was
used to assess how well students can write. The random-groups
equating design and a Facets model were used to equate the
three non-overlapping prompts and rater severity in this study.
Scale scores were obtained by adjusting the student raw scores for
differences in prompts and raters using the Facets model.
The scale scores reflect the fact that some prompts are more
difficult than others and some raters are more severe than others. The
scale scores are, therefore, comparable and best describe the
students' overall achievement on this assessment. The study shows
that the Facets model is the best option for writing
performance assessment equating. Both prompt-to-prompt and rater
equating are necessary to gain comparability of students in writing
performance assessment.
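
As background, the adjustment for prompts and raters can be read off the usual statement of the many-facet Rasch model. The abstract does not give the exact specification used, so the form and symbols below are the standard textbook formulation and are an assumption here.

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
```

Here $P_{nijk}$ is the probability that student $n$ is rated in category $k$ on prompt $i$ by rater $j$, $B_n$ is the student's ability, $D_i$ the prompt difficulty, $C_j$ the rater severity, and $F_k$ the difficulty of the step from category $k-1$ to category $k$. Because $D_i$ and $C_j$ enter as separate terms, the student measure $B_n$ is estimated net of the particular prompt and rater encountered.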
Applications of a Multicomponent Rasch Model to Understanding
Lifespan Differences in Processing
Susan E. Embretson and Karen M. McCollam, U. of Kansas
The multicomponent latent trait model (MLTM; Embretson, 1980), a
multidimensional Rasch model which estimates underlying components,
was applied to aging spatial visualization data to estimate two
latent components hypothesized to decline with age: general control
processes and working memory capacity. Specific hypotheses about
age decline are tested with the MLTM estimates and discussed in
the context of prior studies.
Measuring the Accuracy of Self-Efficacy Judgments in
George Engelhard Jr., Frank Pajares, Emory U.
Accuracy is defined as the correspondence between the self-reported
confidence of students to solve a set of mathematics problems
(judgment of self-efficacy) and the actual achievement on these
problems (mathematics performance). The judgments examined in this
study were obtained from 297 eighth grade students. They were
asked to rate their confidence of success on 19 algebra problems
using a 6-point Likert scale ranging from "no confidence at all" to
"complete confidence". This research will add to our understanding
of the relationship between self-efficacy and performance.
Dimensionality and DIF on Multiple-Choice and Constructed
Response Mathematics Items
Mary Garner, Emory U.
The Multidimensional Random Coefficients Multinomial Logit model is
used to explore the influence of assumptions of dimensionality on
the measurement of differential item functioning. DIF indices
using a unidimensional model are compared to DIF indices that are
produced using a multidimensional model. Both between-item
multidimensionality and within-item multidimensionality models are
Rasch Measurement Theory, The Method of Paired Comparisons,
and Graph Theory
Mary Garner, George Engelhard, Jr., Emory U.
The purpose of this paper is to answer the following questions: (1)
What is the relationship between the method of paired comparisons
and Rasch measurement theory? (2) What is the relationship between
the method of paired comparisons and graph theory? (3) What can
graph theory contribute to our understanding of Rasch measurement
theory? Specifically, it is shown how the method of paired
comparisons can lead to the Rasch model, just as consideration of
the Rasch model can lead to a pairwise algorithm for estimating the
parameters of the Rasch model. Furthermore, both graph theory and
previously unexplored aspects of the method of paired comparisons
are used to increase understanding and utility of a pairwise
algorithm for estimating parameters of the Rasch model as presented
by Choppin (1985).
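
The link between paired comparisons and the Rasch model can be sketched briefly. Under the Rasch model, among persons who answer exactly one of items i and j correctly, the odds of succeeding on i rather than j depend only on the two item difficulties, so the person parameter cancels. The sketch below uses that fact in a simple log-odds pairwise estimate. It is written in the spirit of a pairwise approach and is not claimed to be Choppin's exact procedure; the function names and the example difficulties are hypothetical, and every pair of items is assumed to have non-zero counts in both directions.

```python
import math


def pairwise_difficulties(wins):
    """Estimate item difficulties (logits, centered at zero) from pair counts.

    wins[i][j] is the number of persons who answered item i correctly
    and item j incorrectly. All off-diagonal counts must be positive.
    """
    n = len(wins)
    estimates = []
    for i in range(n):
        # ln(wins[j][i] / wins[i][j]) estimates delta_i - delta_j.
        log_odds = [math.log(wins[j][i] / wins[i][j])
                    for j in range(n) if j != i]
        # Dividing by n (the j == i term is zero) imposes sum(delta) == 0.
        estimates.append(sum(log_odds) / n)
    return estimates


def expected_pair_counts(difficulties, persons_per_pair=1000.0):
    """Model-expected counts for each ordered pair, used only for the demo."""
    n = len(difficulties)
    counts = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                ei = math.exp(-difficulties[i])
                ej = math.exp(-difficulties[j])
                counts[i][j] = persons_per_pair * ei / (ei + ej)
    return counts


# Hypothetical difficulties: build expected counts, then recover them.
true_difficulties = [-1.0, 0.0, 1.0]
print(pairwise_difficulties(expected_pair_counts(true_difficulties)))
```

With observed rather than expected counts the recovered values would only approximate the generating difficulties, and pairs with a zero count need separate handling, which is where the structure of the comparison graph becomes relevant.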
Rasch Solutions to Curriculum Based Measurement
Peter MacMillan, U. of Northern British Columbia
A probe for reading consisted of a selected sample of
grade-appropriate text. One score was the number of words read
correctly in one minute, another was the number of words written in one
minute, and a third the number of those words spelled correctly.
This problem raised issues not solvable with conventional analysis.
Multifaceted Rasch analysis solved these problems and also enabled
calibrating the difficulty of the prompts and comparison of
performance across grades and schools.
Training and Rater Reliability in Performance Assessment
(also AERA 10.48)
Juliette Mendelovits, Australian Council for Educational
About 200 classroom teachers of Grade 3, 7 and 10 students were
trained to assess first language English speaking performances
using a training video tape. The teachers used common marking
schedules to assess taped sample performances of students from each
of the Grades. Their assessments were collected and analyzed
together with two sets of consensual "expert" ratings. These data
are the focus for a discussion of the effect of different types of
experience on the reliability and accuracy of rater judgment.
Gender Differences in Attitudes toward Computers
Judith A. Monsaas, North Georgia College; George Engelhard Jr.,
Emory U.
Responses to the Survey of Attitudes Toward Learning About and
Working with Computers instrument were analyzed with a
Facets model to investigate gender bias on individual
attitude items. Methodological and substantive issues are
Halo Effect in Personal Care Products
Thomas K. Rehfeldt, Helene Curtis Inc.
A Facets analysis of two sets of marketing research data on
shampoos will be used to show the "halo" effect or enhancement of
unrelated properties because of an especially favorable response
to the "halo" property. Specifically the effect of fragrance on
physical performance will be reported.
Using Maps to Produce Meaningful Evaluation Measures:
Evaluating Middle School Science Teacher Change in Assessment,
Collegial and Instructional Practices
Lily Roberts, University of California, Berkeley
Addresses a common challenge in program evaluation: to make the
evaluation meaningful to the stakeholders. An example using
criterion-referenced maps shows middle school science teacher
change on measures of assessment, collegial and instructional
practices. Maps generated using a Partial Credit model offer a
solution to the methodological challenge. Use of these maps
provides an interesting perspective on the issue of rhetoric
versus reality in assessment reform.
A Many-Faceted Rasch Analysis of Faculty Research Grant Peer
Randall E. Schumacker, U. of North Texas
The study examined the peer review process of faculty members'
research grant proposals in the presence of an incomplete rating
design and the potential for the leniency or severity of the raters
to impact the proposal's rating. The many-faceted Rasch model was
used to investigate and calibrate the reviewers' effect.
Rasch Analysis of Two "Concern about Falling" Assessments and
Applications to Planning Treatment Programs
Everett V. Smith Jr., U. of Oklahoma; Michelle M. Lusardi, U. of
This study demonstrates the limitations of the Falls Efficacy
Scale. These limitations were addressed using a simultaneous
calibration of the Falls Efficacy Scale and Mobility Efficacy Scale
items. A possible treatment program based on the simultaneous
calibration is presented. Implications for future research are
Fit in a CAT Environment
John A. Stahl, Computer Adaptive Technologies Inc.
The ineffectiveness of the familiar Rasch fit statistics in
the CAT environment has created a void. This paper is an initial
exploration into alternative ways of detecting unexpected patterns
of responses by people or to items. Investigated are binomial
distribution theory, regression analysis of ability estimates
across item administration, and Wald-Wolfowitz run tests.
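
To make the last of these concrete, here is a minimal sketch of the standard large-sample Wald-Wolfowitz runs test applied to a string of scored responses in order of administration. It illustrates the general technique only and is not the paper's procedure; the function name and the example response string are hypothetical.

```python
import math


def runs_test(responses):
    """Wald-Wolfowitz runs test on a sequence of 0/1 scored responses.

    Returns (runs, expected_runs, variance, z). A large negative z means
    fewer runs than expected (long streaks); a large positive z means
    more alternation than expected.
    """
    n1 = sum(1 for r in responses if r == 1)
    n2 = len(responses) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need at least one success and one failure")
    # A new run starts wherever adjacent responses differ.
    runs = 1 + sum(1 for a, b in zip(responses, responses[1:]) if a != b)
    n = n1 + n2
    expected_runs = 2.0 * n1 * n2 / n + 1.0
    variance = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    if variance == 0.0:
        raise ValueError("sequence too short for a runs test")
    z = (runs - expected_runs) / math.sqrt(variance)
    return runs, expected_runs, variance, z


# Hypothetical scored responses from one examinee, in administration order.
print(runs_test([1, 1, 0, 0, 1, 0]))
```

The normal approximation behind z is rough for short tests, so for brief adaptive administrations an exact runs distribution may be the safer reference.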
Using the Rasch models to study large scale Physics
Examinations in Australia
Andrew Stephanou, U. of Melbourne
This paper describes the use of the Rasch model to extract and
display information about the behavior of individual examination
questions and to study the performances of candidates in
end-of-high-school physics examinations in Australia. The
transformation of examination scores into interval measures is
accompanied by a qualitative analysis of each examination question
to identify the skills involved and to understand the factors
influencing levels of question difficulty.
Identifying information from distractors in multiple choice
items: a routine application of an IRT hypothesis
Irene Styles and David Andrich, Murdoch U.
The present methods for identifying information from distractors in
multiple choice items are relatively complicated. A simple
criterion for hypothesizing information in a distractor, which is
easily applied with modern software, is described and illustrated.
The criterion is that the response characteristic curve of a
distractor with information should show a single peaked form. If
the distractor's characteristic curve shows such a form, it is
rescored according to a format for ordered response categories. If
the simple logistic model of Rasch is applied to the dichotomously
scored responses, its extensions for ordered categories can be
applied readily within the same framework. Complementary evidence
is used to confirm or reject the hypothesis of information in a
Ordered Partition Model and Item Independency in Item
Response Theory
Nikolai Volodin, Australian Council for Educational
Discusses the use of Wilson's Ordered Partition Model to analyze
items with two correct responses and also dependent items.
From PPVT-R to PPVT-III: An Application of Rasch
Common-Person Equating
Jing-Jen Wang, American Guidance Service
Illustrates how a small sample of relevant subjects can be used to
successfully equate two forms of a test developed many years apart.
Once successfully equated, scale scores on the two test forms can
be meaningfully compared.
Dichotomization of a graded response scale: a step too
Diane Whalley, Alan Tennant, Audrey Bowen, U. of Leeds
The Wimbledon Scale assesses mood disturbance in patients with
neurological deficits. On data collected from 99 patients with
Traumatic Brain Injury, three different response category formats
were compared. Examination of item fit, local independence, item
bias, category adequacy and level of measurement showed the
original scoring proposed by the scale authors to be invalid.
Changes in teachers' perceptions of barriers to portfolio
Edward W. Wolfe, Educational Testing Service; Chris W. T. Chiu,
Michigan State U.
When measurements are made using Likert-type questionnaires over
time, one must disentangle changes in persons from changes in
items and scales. We describe procedures for doing so using Rasch
anchoring methods. The procedure is illustrated with teachers'
responses to a questionnaire concerning their perceptions of
barriers to portfolio assessment implementation.
Mathematically Demonstrated Hierarchical Complexity of Tasks
and Behavior Development Theory
Michael L. Commons, Harvard Medical School; Edward J. Trudeau,
Harvard U.
Mathematically proving the "hierarchical complexity of tasks" from
a well-defined system shows that stages exist independent of
development theory. To define this system, the General Stage Model
describes the axiomatic definition of simple tasks as logically
primitive elements of the system. Hierarchical task sequences are
formed out of linear orderings of the simple tasks based on their
mathematical relationships.
Changes in Rater Characteristics over a 12 Month
Peter Congdon, Australian Council for Educational
Compares the rating patterns of the same group of seven raters
across a 12 month interval and across ten different tasks from
within a marking period. The data come from an upper tertiary
writing assessment program. These results will contribute to the
knowledge of rater invariance and equating using raters and
calibrated rater banks.
Using Rasch Analysis to Examine the Development of
Socio-moral Concepts
Theo L. Dawson, U. of California at Berkeley
To trace the development of conceptions of good education, I performed
an analysis of the educational concepts present in 121 clinical
interviews. Over 600 concepts were identified. This poster traces
the process that was used to refine concept categories without
obscuring developmental differences. The role of Rasch modeling is
A Saltus Analysis of Developmental Data from the Laundry
Problem Task Series
Eric Goodheart & Michael L. Commons, Harvard U.; Theo L. Dawson
& Karen Draney, UC Berkeley
We perform a Saltus analysis of cross-sectional data from a sample
of participants who completed the Laundry Problem Task Series
(Commons, Miller, and Kuhn, 1982). The results of the analysis
will be discussed in terms of their general implications for
developmental research as well as the development of the Laundry
Problem instrument.
Tracking the 1996 NCAA Football Season: Divisions IA & IAA
Patrick B. Fisher, Golden Rule Measurement
The "experts" rate teams based on their won/loss percentage.
Rasch measurement takes into account each team's and its opponents'
entire record. Examples will be shown explaining how a team with
a losing record can be considered a top team and how a team with a
winning record can be among the lesser teams. I will show my track
record at predicting the outcome of games versus the point spread.
I will also show the track record of predicting the range of the
point spread when including the standard error of measurement
versus what actually happened. I will show my track record of
predictions for the season and the top 30 teams at various times
through the season.
Vertical Equating of Reading Comprehension Skills across
Grades 7 and 9
Tock-Keng Lim, Nanyang Technological U.; Ah-Keng Yap, Ngee Ann
Polytechnic, Singapore
Reading comprehension skills were equated across Grades 7 and 9
using two tests assessing the following skills: sequence of events,
story development, inferences, main ideas and reasoning. The
Grades 7 and 9 data sets, containing common linked items, were
first analyzed separately and then vertically equated using a
common item anchoring method and a common item concurrent
calibration method.
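
As an illustration of linking through common items, the sketch below places separately calibrated Grade 9 item difficulties on the Grade 7 scale by a single mean shift computed from the linked items. This mean-shift link is one simple form of common-item equating and is an assumption here: it is not necessarily the anchoring procedure the authors used, and concurrent calibration is not shown. The function name, item labels and difficulty values are hypothetical.

```python
def link_by_common_items(base, new, common_ids):
    """Shift the `new` calibration onto the `base` scale.

    base, new: dicts mapping item id to difficulty in logits, each from
    its own separate calibration. common_ids: items present in both.
    Returns (new difficulties on the base scale, shift applied).
    """
    if not common_ids:
        raise ValueError("at least one common item is required")
    base_mean = sum(base[i] for i in common_ids) / len(common_ids)
    new_mean = sum(new[i] for i in common_ids) / len(common_ids)
    # The constant that makes the common items agree on average.
    shift = base_mean - new_mean
    linked = {item: difficulty + shift for item, difficulty in new.items()}
    return linked, shift


# Hypothetical separate calibrations sharing two linked items.
grade7 = {"c1": -0.5, "c2": 0.5, "g7_only": 1.0}
grade9 = {"c1": -1.0, "c2": 0.0, "g9_only": 0.7}
linked, shift = link_by_common_items(grade7, grade9, ["c1", "c2"])
print(shift, linked)
```

Comparing each common item's shifted difficulty with its base value is a routine check on link quality, since items that drift markedly are candidates for removal from the link.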
Measurement implications when categories in graded responses
are not empirically distinguishable (also AERA 26.42)
Annette Mercer and David Andrich, Murdoch U.
Graded response formats in assessment imply a continuum which is
partitioned into contiguous intervals. Often, response categories
reflecting this partition are defined operationally. An important
issue is whether these operational definitions work as intended.
For example, it is possible that graders who are required to
classify performances into five categories can reliably classify
the performances into only three categories. Because they are
sensitive to the number and operation of response categories, the
Rasch models are particularly suited for investigating whether
or not categories do work as intended. In particular, and in
theory, categories should be combined, but only combined, if there
is no discrimination at the threshold between the two categories.
This paper describes a simulation study which examines implications
for parameter estimates and tests of fit of analyses when all
thresholds are assumed to discriminate equally but some in fact
have zero discriminations and therefore categories adjacent to them
should be combined.
Morality and Rule Comprehension: Catholic Children's
Cognitive Moral Development during Grade School
Michael A. Morabito
Children were presented with ten item sets, each containing three
social rule violations. Two of the three rules were from the same
general rule domain, the third from another. Children were asked
to identify the two like rule violations. Rasch analysis
identified the developmental sequence inherent in these rules and
the previously indicated age trends.
Rasch Analysis of Achievement in Mathematics
Ah-Keng Yap, Ngee Ann Polytechnic; Tock-Keng Lim, Nanyang
Technological U.
A group of O-level students was required to attend a Mathematics
Enrichment course preparing them for further study toward an
engineering diploma. Using an on-line test software package, each
student was given a unique set of objective questions for the
pre-test and post-test. Each student's pre-test and post-test
responses were treated as if given by two different persons. The
responses were then analyzed using the BIGSTEPS program. The results
showed that students improved on the post-test, indicating that the
Enrichment course improved students' proficiency in
Comparative Approaches to Understanding Rating Scales
Organizer: Alan Tennant, U. of Leeds
Contrasting Analytical Approaches to Rating Scales
Alan Tennant, Diane Whalley, Audrey Bowen, U. of Leeds
How do dichotomous, rating scale and latent class analyses of the
same data differ in the information they provide? Results of
collaborative work with German researchers are discussed.
Developmental Assessment
Organizer: Mark Wilson, U. of California at Berkeley
Assessment of Development through Stage Transitions in Moral
Theo Dawson and Mark Wilson, U. of California at
Investigates whether alternative analytical approaches can identify
the intermediate stages of moral and evaluative reasoning specified
in Kohlberg's theory. Previous analyses using the Partial Credit
model have not shown the expected clustering of subjects at the
hypothesized stages. This may be due to the smoothing effect of
that analytical technique.
Investigating "Strands" of the Australian English
Geoff Masters, Australian Council for Educational
Do different variables, "strands", within a Profile (content area)
cooperate to form one dimension? Results of analyses of different
strands are presented to illustrate the problem of working with the
dimensionality of a content area.
Mapping student development in middle school science with
embedded assessments
Mark Wilson and Karen Draney, U. of California at
An innovative middle school science curriculum has been developed
by the Science Education for Public Understanding Program (SEPUP).
At its center is an assessment system based on tasks which are
embedded in the curriculum. These tasks are structured around four
variables or core principles. As part of the SEPUP assessment
system, maps have been developed to represent student progress on
each of these developmental variables. The maps are graphical
representations of the variable, showing how it unfolds or evolves
over the year in terms of student performance on assessment tasks.
The maps are derived from empirical analyses of student data and
are based on ordering assessment tasks from relatively easy tasks
to more difficult and complex ones. Once constructed, maps can be
used to record and track student progress and to illustrate the
skills a student has mastered and those that the student is working
on. By placing students' performance on the continuum defined by
the map, teachers can demonstrate students' progress with respect
to the goals and expectations of the course. The maps are one tool
to provide feedback to the teacher on how students are progressing
and are also a source of information in providing feedback to
students on their own performances in the course. Creating
developmental maps of the SEPUP variables has presented conceptual,
technical and practical problems. We describe the steps needed to
create the maps and give examples of their uses in the classroom
and for communication with parents and teachers.
Developmental Assessment: Aligning Tasks with Top-down
Described Variables
Margaret Forster, Australian Council for Educational Research
The Australian National English (Language) Profile is an example of
a variable or "Strand" based primarily on expert opinion: a top-
down approach to the construction of variables. A range of tasks
was developed to address outcomes from the Profile. Rasch analysis
of responses to these tasks provides item calibrations which
contribute to a better understanding of the variable and provides
opportunities to enrich and revise the original definition.
Facets Applications
Organizer: John Michael Linacre, U. of Chicago
Winning with Facets
John M. Linacre, U. of Chicago
The benefits of the many-facet Rasch model for judge-mediated
examinations are outlined. Specifically discussed are linearity,
quality control and fairness. Flexibility of judging plan,
construct validation and communication of results are touched on.
Some comparisons are drawn with other analytical methods.
Individualizing Judge Performance Reports
Mary E. Lunz, American Society of Clinical Pathologists
The multi-facet Rasch model produces estimates of judge severity
and an interaction report which shows the expected and observed
performance of each judge on each project, student and/or task
graded. The interaction report can be extremely useful for
providing individualized feedback to each judge after a grading
session. The presentation describes the information on the
interaction report and one approach to transcribing that
information to a format that can be easily interpreted and
potentially useful to individual judges.
A Retrospective Method to Study Rater Stability across
Thomas R. O'Neill, American Society of Clinical
A retrospective method to study the stability of rater severity
across many administrations is described. When scores are intended
to be interchangeable across administrations, a linking strategy
must be used to express new raters' severity on the same scale as
the original raters. Using "common raters" to link together two
test administrations requires that the "common raters" maintain
the same degree of severity across administrations. How to
prospectively use the retrospective information is also discussed.
Foundations of Measurement
Organizer: David Andrich, Murdoch U.
In many instances in the advancement of knowledge, solutions are found to problems that have not been posed - it is the solutions that subsequently generate the problems, and it can turn out that these solutions have major significance to their field. These solutions come from unusual insights into the understanding of some area of study. There are examples of solutions arriving before the problems are articulated in Rasch measurement, including Rasch's own realization that in the elimination of parameters in his analysis of reading data, he had solved a problem that seemed never to have been posed. It is the solution to this problem that has created the distinctive impact of Rasch models for measurement. Some other examples of such problems within Rasch measurement are considered.
Theory, Observation, and Mathematical Thinking
William P. Fisher, Jr. Louisiana State U. Medical
To date, the fact that all observation is theory-laden has not
overtly affected much quantitative psychosocial research practice.
This session's other speakers' examples of ways in which available
solutions impose themselves on problem definition point toward some
ontological and social fundamentals of method. The extent to which
these fundamentals are incorporated in Rasch measurement theory and
practice is described, as is how they could be more fully
implemented there.
Health Care Outcome Measurement
Organizer: Allen W. Heinemann, Rehabilitation Institute of
Three presentations will focus on distinct aspects of health care
measurement. Two presentations will explore calibrations of
various instruments; the third will explore the application of
measurement processes with a single case. Dr. William Fisher will
describe his calibration of the widely used MOS SF-36 with a
locally-derived instrument of physical functioning. Dr. Rita Bode
will explore differences in raters' descriptions of patients'
functional status across impairment groups. Dr. Carl Granger will
describe a diagnostic challenge and application of rating scale
analysis that improves clinicians' understanding of a patient's
physical condition.
Common-Sample Equating of the MOS SF-36 and the LSU HSI
Physical Functioning Scales
William P. Fisher, Jr., Robert L. Eubanks, and Robert L.
Marier, Louisiana State U. Medical Center
This report from an ongoing study equates the physical functioning
subscales of the Medical Outcomes Study Short Form 36-item health
status measure (SF-36) and the Louisiana State University Health
Status Instruments (LSU HSI), and compares their respective
measurement properties, using data from a convenience sample of 103
patients visiting a public hospital general medicine clinic.
Is a change in a patient's pattern of response, based on
Rasch analysis, consistent with the clinical picture?
Carl V. Granger, State U. of New York, Buffalo
A 53 year old female employed as a hospital nurses' aide has been
experiencing chronic low back pain for 10 years. Intermittently
she has lost time from work. Her pattern of response to physical
functioning had been consistent with that of chronic low back pain.
On the most recent visit, her pain worsened and her pattern of
response was not compatible with that of chronic low back pain. Is
this shift in response consistent with the clinical picture?
Using Multi-Faceted Rating Scale Analysis to Evaluate
Functional Status during Medical Rehabilitation
Allen Heinemann and Rita Bode, Rehabilitation Institute of
Multi-faceted rating scale analysis provides a valuable approach to
describing severity of disability using functional status items
rated by team members from different disciplines. An example is
provided using the Rehabilitation Institute of Chicago's Functional
Assessment Scale. Results will focus on a calibration of patients,
items, the rating scale, raters, and impairment groups in a sample
of 7,000 patient records available from 1992 to 1996. We found
that 1) rating scale problems need to be resolved before further
analyses are conducted, 2) misfitting items may be reduced by
specifying multiple rating scales, and 3) subscales may be
important to distinguish.
Measuring Psychological Development
Organizer: Trevor Bond, James Cook U.
This symposium is an attempt to enhance communication between measurement theorists and investigators of the psychology of human development. Piagetian theories provide strong theoretical frameworks within which to address issues of interpretation and meaning. Each presented paper clarifies central issues in developmental theory by marrying empirical investigations with sensitive data analytical techniques. This symposium demonstrates how developmental psychologists are applying Rasch techniques with a view to obtaining insight into how to obtain yet greater benefit from Rasch methodology.
Comparing Décalage and Development with Cognitive
Developmental Tests (also IOMW9 Poster)
Trevor Bond, James Cook U.
Research into formal operational thought using the Rasch model
substantiates important aspects of the original theorizing of
Piaget. Common-person equating has been used across tasks to
estimate the relative difficulty of tasks and to estimate cognitive
development over one- and two-year intervals. Estimates of
cognitive development do not exceed 0.5 logits per year, but
difficulty differences (décalage) between tests of formal thought
are as large as 2 logits, confounding attempts to separate
development from décalage. This paper investigates the use of
Rasch methodology to solve this conundrum.
Can Qualitative Stage Characteristics be Revealed
Gino Coudé, Gérald Noelting, and Jean-Pierre Rousseau, U.
The correspondence between Piagetian stages of cognitive
development in three tasks and the hierarchy of item difficulties
were examined. The cognitive level of each task (Mixing juices,
Caskets task - text-form, Coded orthogonal views) had previously
been established qualitatively. Rasch analysis provided a helpful
representation of item difficulty levels as well as an assessment
of fit of items to the underlying cognitive trait. Results reveal
a high correspondence between qualitative and quantitative
analyses, but quantifying stage leaps remains to be fully
Developing conceptions of good education:
Untangling content and structure (also AERA 10.57)
Theo Dawson, U. of California at Berkeley
The conceptual content and stage of 121 interviews have been
analyzed. First, a detailed concept analysis yielded over 600
concepts. Second, the General Stage Scoring System, a method of
assessing stage that is not based on the particular content of
responses, was used to assess stage. The relationship of concept
usage and stage is examined with partial credit analysis.
A Theoretical and Empirical Investigation of Misfitting
Persons and Items Based on Cognitive Development (also AERA
Christine M. Fox and William M. Gray, U. of Toledo
Using a test which represents different forms of thought and
different logical operations, we explored 1) the rating scale
structure of 18 plausible scoring schemes; 2) person and item
misfits with both the original analysis and the new scoring
schemes. Interpretations emphasized the congruence between
empirical investigations and theory.
New Horizons in Standard Setting
Organizer: Gregory E. Stone, Dental Assisting National Board;
Discussant: William P. Fisher, Jr., Louisiana State U. Medical
Setting and Evaluating Performance Standards in High Stakes
Writing Assessments
George Engelhard Jr., Emory U.; Belita Gordon, U. of
Describes a set of procedures for setting a performance standard
(passing score) and evaluating the quality of the ratings obtained
from standard-setting judges. The Georgia High School Writing Test
illustrates the wide differences in judges' views of how well a
student should write to receive a high school diploma.
Setting Standards on Performance Examinations
Mary E. Lunz, American Society of Clinical Pathologists
Describes several criterion standard-setting approaches for
performance examinations: 1) Fair average using rating scale
definitions; 2) Construction of a synthetic minimally competent
candidate; 3) Project/case difficulty assessment for the minimally
competent candidate; 4) Global compared to analytic scoring.
Informing Mastery Level through Binomial Trials: A Refinement
of Objective Standard Setting
Gregory E. Stone, DANB; George Engelhard Jr., Emory U.
The logistic estimation model asks judges to directly define the
criterion through the selection of "essential" content, but the
mastery level itself is determined through philosophical discourse
and evaluative means. In this paper, a binomial trials technique
directs the process of mastery level selection within the logistic
estimation model, to produce a mastery level in a decidedly more
systematic and reasonable manner.
Graphical Guides for Communicating Rating Scale
Richard M. Smith, Rehabilitation Foundation Inc.
Developing and using graphical guides for communicating rating
scale analyses to medical rehabilitation professionals.
How to Work BIGSTEPS
John M. Linacre, U. of Chicago
Discusses recent enhancements to BIGSTEPS including residual
principal component analysis. Other features will be discussed in
accord with participants' interests.
How to Work Facets
John M. Linacre, U. of Chicago
Discusses recent enhancements to Facets including rating
scale information. Other features will be discussed in
accord with participants' interests.
Integrating QUEST with Item Creation, Storage and
Brian Doig, Australian Council for Educational Research
Demonstrates an "all-in-one" item management system that Margaret
Wu, Ray Adams and I have completed which integrates the
Quest Rasch analysis program with item creation, storage and
Measuring Psychological Development: Unresolved
William Gray, U. of Toledo; Mark Wilson, U. of California at
Reactions to the "Measuring Psychological Development" symposium
will lead off a discussion of unresolved issues in this area.
Rating Scale Measurement and ASTM Standards
William P. Fisher, Jr., Louisiana State U. Medical
The administrative simplification section of the Health Insurance
Portability and Accountability Act of 1996 (the Kennedy-Kassebaum
bill) mandates the adoption of uniform national standards for
health information. This presentation addresses the significance
and use of scale-free measurement for creating and maintaining
universal metrics in the context of the electronic health
Measuring Feedback-Seeking Modes: An Alternative to Composite
Scores (AERA 5.56)
Rita Bode, Rehabilitation Institute of Chicago
The objective of this study is to illustrate an alternative to the
use of traditional composite scores in creating scales from survey
items. The variable dealt with is mode of feedback-seeking.
Survey items concerning frequency of feedback-seeking from various
sources were Rasch-calibrated. The resulting item map was expanded
to include calibrations for each response in the rating scale for
each item. The expanded item map was then used to identify four
types of feedback-seeking by referencing the item descriptors on
the map. Using the Rasch procedure, it was possible to determine not
only the extent to which individual faculty sought feedback but
also which feedback-seeking modes they used and how frequently.
Assessing and Improving the Extraversion-Introversion Scale
of the Myers-Briggs Type Indicator by the Rasch measurement
(AERA 5.56)
Eunlim Kim Chi, Kyunghee U.
The Myers-Briggs Type Indicator (MBTI) is one of the most popularly
used personality inventories, but most psychometric studies on the
MBTI have been restricted to factor analyses. This paper assesses
the validity of the EI (extraversion-introversion) scale of the MBTI
by Rasch measurement and improves the scoring system. This
includes the fit analysis of the EI scale and the investigation of
unidimensionality of the scale. In addition, a scoring system
using a single pole, instead of the bi-poles, is suggested. The
data for the present study came from the Korean version of the MBTI,
which was administered to 235 Korean university students. The EI
scale contained 21 items. Also, the students' self-determined types
were collected to use as a criterion for assessing the validity
of the scale. The results identified the invalid items for the EI
scale and demonstrated that the EI scale can be more validly scored
by a single pole.
Primary, Concrete, Abstract, Formal, Systematic, and
Metasystematic Operations as Observed in a "Piagetian" Balance-Beam
Task Series (AERA 5.56)
Eric Andrew Goodheart, Harvard U.; Theo Linda Dawson, U. of
California at Berkeley; Michael Lamport Commons, Harvard Medical
We performed a Partial Credit analysis of cross-sectional
developmental data gathered from children and adults who were
presented a task series derived from Inhelder's and Piaget's
balance beam. This analysis creates a probabilistic model situating
both participants and items along a single hierarchically ordered
dimension. As the General Stage Model predicted, the items formed
a series of clusters along this dimension according to their order
of hierarchical complexity. The order of hierarchical complexity
predicted the item difficulty, which reflects the order in which
items are learned. Gappiness between items of differing stage will
be further analyzed using the Saltus model.
Advances in Partial Credit Models with Applications to
Performance Assessment (AERA 12.32)
Huynh Huynh, U. of South Carolina
Research results to be presented focus on: 1) partial credit models
and testlets; 2) equivalence between a partial credit item and a set
of independent binary items; 3) decomposition of a partial credit
item into independent binary and indecomposable trinary items; 4)
number of categories for a partial credit item; 5) location of
score categories of a partial credit item and applications to
criterion-referenced interpretation.
New Measures, New Methodology, New Psychometrics: A
Synthesis (AERA 18.44)
Everett Smith Jr., U. of Oklahoma; J. Rogers, Connecticut State
Education Dept.; M. Kulikowich, U. of Connecticut; T. Jetton, U. of
Demonstrates a G-study and a Facets analysis of some
performance-based data from 5 items, 3 judges, 2 testing times, and
44 students.
Quantifying Item Dependency by Fisher's Z (AERA
Linjun Shen, National Board of Osteopathic Medical
This study proposes Fisher's Z as the index to assess local item
dependency among clustered items. Fisher's Z is a transformation
of the Pearson correlation coefficient of standardized residuals
between Rasch model predicted scores and the observed scores for a
pair of items across all persons in an exam. This paper also
proposes using the distribution of Fisher's Z for stand-alone
items in the same exam as the criterion to determine the
significance level of dependency among clustered items. Compared
with Yen's Q3, Fisher's Z has three improvements. It normalizes
Yen's Q3, takes measurement error into account, and
establishes a practical significance level of dependency.
Therefore, Fisher's Z is a more sensitive and more practical index
than Yen's Q3 for identifying local item dependency.
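The computation described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the author's procedure: it assumes dichotomous items, treats person abilities and item difficulties as already estimated, and uses invented responses. The significance criterion based on stand-alone items is not implemented, and all function names are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Rasch model probability of a correct response (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def standardized_residuals(responses, abilities, difficulty):
    """(observed - expected) / sqrt(model variance), one value per person."""
    residuals = []
    for x, theta in zip(responses, abilities):
        p = rasch_probability(theta, difficulty)
        residuals.append((x - p) / math.sqrt(p * (1.0 - p)))
    return residuals


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(ss_u * ss_v)


def fisher_z(r):
    """Fisher's Z transformation: 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)


# Hypothetical estimates and responses for six persons on two items.
abilities = [-1.5, -0.5, 0.0, 0.5, 1.0, 2.0]
item1 = {"difficulty": 0.0, "responses": [0, 0, 1, 1, 1, 1]}
item2 = {"difficulty": 0.5, "responses": [0, 1, 0, 1, 1, 1]}

res1 = standardized_residuals(item1["responses"], abilities, item1["difficulty"])
res2 = standardized_residuals(item2["responses"], abilities, item2["difficulty"])
r = pearson(res1, res2)
z = fisher_z(r)
print(f"residual correlation r = {r:.3f}, Fisher's Z = {z:.3f}")
```

In practice the same Z would be computed for every pair of items, so that values for clustered pairs can be compared with the distribution observed for stand-alone pairs.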
Optimal Categorization of a Rating Scale: A Longitudinal
Study (AERA 26.42)
Weimo Zhu, Wayne State U.; Wynn Updyke, and Cheryl Lewandowski,
Indiana U.
To determine the stability of its optimal categorization, a 50-item
psychomotor self-efficacy scale was administered four times to a
total of 2,022 children from 15 Midwestern schools during a
four-year period. By combining adjacent categories in a
"collapsing" process, in which new categorizations are constructed,
the optimal categorization was determined by comparing indexes of
Person and Item Separation and fit statistics provided by the Rasch
analysis. It was found that the optimal categorization identified
by the Rasch analysis can be stable and generalized to subsequent
administrations, and the determination of an optimal categorization
should be empirically based.
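The "collapsing" step lends itself to a short Python sketch that enumerates every way of combining adjacent categories and recodes responses accordingly. The number of categories (five) and the sample responses are assumptions for illustration, since the abstract does not state them, and the function names are hypothetical. Fitting a Rasch model to each recoding and comparing separation and fit statistics is not shown.

```python
from itertools import product


def adjacent_collapsings(n_categories):
    """Yield every recoding that merges only adjacent categories.

    Each recoding is a tuple giving the new code for original
    categories 0 .. n_categories - 1. Between each neighbouring pair
    the boundary is either kept (new code rises by 1) or merged.
    """
    for boundaries in product((True, False), repeat=n_categories - 1):
        codes = [0]
        for keep in boundaries:
            codes.append(codes[-1] + (1 if keep else 0))
        yield tuple(codes)


def recode(responses, codes):
    """Apply one recoding to a list of original category responses."""
    return [codes[r] for r in responses]


# Assumed five-category scale (0-4); the all-merged scheme carries no
# information, so it is dropped from the candidates.
schemes = [s for s in adjacent_collapsings(5) if s[-1] > 0]

sample = [0, 1, 2, 3, 4, 2]  # hypothetical responses
for codes in schemes[:3]:
    print(codes, recode(sample, codes))
```

A scale with k categories has 2^(k-1) such recodings including the original, so an exhaustive comparison is feasible for typical rating scales.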
Physical Functioning Construct Congruence across Instruments:
Towards a Universal Metric (AERA 40.15)
William P. Fisher Jr., Louisiana State U. Medical Center
Begins examining the stability of a physical functional
independence construct across instruments and samples. This is not
a formal equating of instruments or samples, but indicates whether
such an effort would be likely to succeed. The equating method
employed is a pseudo-common item equating, in which the
calibrations from separate samples of similar, but not identical
items, from different instruments, are compared. More than 30
articles presenting Rasch analyses of physical functioning scales
were reviewed. The final average correlation is .90, with an
average of 7 pseudo-common items. Measures based on these
calibrations should be linearly transformable versions of the same
metric. The quantitative stability of different areas of physical
functional independence across instruments and samples suggests
that the development of a universal metric is a realizable
A Many-facet Rasch Model for Assessing Disease Class-specific
Diagnostic Abilities in Medical Students (AERA 40.15)
Frank J. Papa, R. E. Schumacker, Robert Stone, David Aldrich, U.
of North Texas
Medical diagnostic accuracy appears to be both disease
class-specific, and a function of a case presentation's
`typicality'. A many-faceted Rasch model and computer adaptive
testing were used to assess the level of case typicality
(diagnostic difficulty) at which medical students performed in
nine specific disease classes. Analysis indicated: 1) students
(n=100) differed significantly in diagnostic ability; 2) the
disease classes were significantly different; 3) the cases were
significantly different. These results suggest that it may be
possible to draw inferences regarding the robustness of a
clinician's diagnostic disease class concepts.
The Measurement of Concern for Falling and Social Cognitive
Applications to Planning Treatment Programs (AERA 40.15)
Everett V. Smith Jr., U. of Oklahoma; Michelle M. Lusardi, U. of
First, we demonstrate limitations of the Falls Efficacy Scale.
Second, we address these limitations using a simultaneous
calibration of the Falls Efficacy Scale and Mobility Efficacy Scale
items. Third, we discuss a possible treatment program based on the
simultaneous calibration and Social Cognitive Theory. Results
indicate that the Falls Efficacy Scale fails to assess the higher ends
of the self-efficacy continuum. Simultaneous calibration of items
improved this lack of scale definition. This provides a
theoretical framework for planning treatment programs.
Using Person Fit Statistics to Improve Prediction of
Functional Outcomes (AERA 40.15)
Richard M. Smith, Kevin J. Fuss, Rehabilitation Foundation Inc.,
Raymond E. Wright, SPSS
The prediction of the functional outcomes of the physical
rehabilitation process based on the initial status of the patient
has clear financial implications for the health care industry,
particularly in the era of managed care with the focus on reducing
the resources required to achieve a given level of functional
status. In recent years there has been a trend towards the
utilization of functional assessment instruments, such as the FIM(sm),
PECS and LORS, that employ the Rasch model to convert the results
of observations based on rating scale items to interval measures.
However, little attention has been placed on the fit of the
observed responses for each person to the requirements of the
psychometric model or to the effect of misfit in the item level
evaluations that are used in the prediction of functional status at
the conclusion of treatment. To examine this effect, initial and
discharge functional assessment scores on five different aspects of
functionality were analyzed for several treatment programs. The
results indicate modest increases in prediction of functional
outcomes when only fitting initial evaluations were used in the
analysis. However, the results were not consistent across
Development of an instrument to identify Attention Deficit
Disorder/Hyperactivity Disorder (AERA 50.16)
Everett Smith Jr., Brian Johnson, U. of Oklahoma
A validation study using multi-group common factor analysis and the
Rasch rating scale model to address the validity of the DSM-IV
criteria for ADD/HD in a college sample. We use the Rasch model to
supplement findings in the common factor analysis and to identify
subjects that may have ADD/HD. We also argue that the Rasch
methodology is more useful given what many counseling psychologists
want to do with scores from Likert-type assessments.
