IOMW9 - AERA Rasch Measurement Abstracts, 1997

IOMW9 Papers

The Voice of Georg Rasch
David Andrich, Murdoch U.
From closely following the development of Rasch models for measurement in the 1970s, I formed the impression that Rasch could precipitate a Kuhnian-type revolution in social measurement. Such revolutions are generally inspired by people with unusual intellectual insights and personal qualities. I had the excellent fortune of living and working closely with Georg Rasch in the mid 1970s over two periods which totalled twelve months. It seemed important that the events that he saw as shaping his thinking should be recorded. Accordingly, in 1979 I taped conversations with Rasch over a series of evenings. These tapes provide participants with The Voice of Georg Rasch.

The Nationwide On-Line Networking of Item Banks in Korea
Sun-Geun Baek, Korean Educational Development Institute
An initial item bank of 13,000 items has been constructed. Challenges addressed include maintaining accurate item calibrations, adding new items to the bank, and managing differential item functioning.

Making Sense Out of Survey Data
Betty Bergstrom, American Dietetic Association, Commission on Dietetic Registration
Presents a survey designed to quantify the scope of practice from entry-level to advanced-level practice in the field of dietetics. Data were analyzed with the Masters Partial Credit Model because of a clear difference in the difficulty of steps across items. Using demographic data and the computed expected mean scores and most probable responses, a map of the practice of dietetics was constructed. This allows dietitians to see the differences in roles and tasks performed along the continuum of practice and also to estimate the impact of acquiring an advanced-level degree or choosing a job within a particular setting.

Using the Rasch Measurement Model in Grading Essay-Type Items: A Way to Enhance both Economy and Objectivity of Evaluation
Sunhee Chae, Korean Educational Development Institute
Using data from the national examination for public school teachers, the Facets approach is used to measure judge leniency and investigate the judging design.

Psychophysics of Stage: Task Complexity and Statistical Models
Michael L. Commons, Harvard Medical School; Francis A. Richards, Edward J. Trudeau, Eric A. Goodheart, Harvard U.; Theo Dawson, UC Berkeley
The assessment of developmental stage has been limited to measures of performance rather than an analysis of task demands. To remedy that lack, a behavior-analytic-compatible core-developmental notion of hierarchical complexity of tasks has been introduced. The hierarchical complexity of tasks and the stage of the response chains that complete them have three unusual properties that are discussed in the paper. The General Stage Model predicts that performance of items within a hierarchically ordered task sequence will have a set of corresponding Rasch-model-generated stage measures.

The Stability of Rater Characteristics in Large Scale Assessment Programs (also AERA 10.48)
Peter J. Congdon, Joy McQueen, Australian Council for Educational Research
This study reports on the stability of rater severity over an extended period of time and the stability of rater severity across the different performance dimensions within the context of a large scale assessment program. The study focuses on changes in rater severity and rater fit over an eight day period.

Assessment of Stage Transitions in Performance on Commons' Balance Beam Instrument
Theo Dawson, UC Berkeley; Eric Goodheart, Harvard U.; Karen Draney, Mark Wilson, UC Berkeley; Michael Commons, Harvard Medical School
We perform both Rasch and saltus analyses of performance on Commons' Balance Beam instrument. Analysis with the Rasch model clearly shows two hierarchical stages of development in the item estimates, though four are predicted. To further examine stage transitions, we use saltus analysis, which models latent group effects as well as latent trait effects.

Raters and Single Prompt-to-Prompt Equating Using the Facets Model in Writing Performance Assessment
Yi Du and William L. Brown, Minneapolis Public Schools
The study is based on an equating practice of the 1996 writing assessment in Minneapolis Public Schools (MPS). In the writing assessment, a common topic with three different prompts, representing narrative, persuasive, and informative writing, was used to assess how well students can write. The random-groups equating design and a Facets model were used to equate the three non-overlapping prompts and rater severity in this study. Scale scores were obtained by adjusting the student raw scores for differences in prompts and raters using the Facets model. The scale scores reflect the fact that some prompts are more difficult than others and some raters are more severe than others. The scale scores are, therefore, comparable and best describe the students' overall achievement on this assessment. The study shows that the Facets model is the best option for writing performance assessment equating. Both prompt-to-prompt and rater equating are necessary to gain comparability of students in writing performance assessment.

Applications of a Multicomponent Rasch Model to Understanding Lifespan Differences in Processing
Susan E. Embretson and Karen M. McCollam, U. of Kansas
The multicomponent latent trait model (MLTM; Embretson, 1980), a multidimensional Rasch model which estimates underlying components, was applied to aging spatial visualization data to estimate two latent components hypothesized to decline with age: general control processes and working memory capacity. Specific hypotheses about age decline are tested with the MLTM estimates and discussed in the context of prior studies.

Measuring the Accuracy of Self-Efficacy Judgments in Mathematics
George Engelhard Jr., Frank Pajares, Emory U.
Accuracy is defined as the correspondence between the self-reported confidence of students to solve a set of mathematics problems (judgment of self-efficacy) and the actual achievement on these problems (mathematics performance). The judgments examined in this study were obtained from 297 eighth-grade students. They were asked to rate their confidence of success on 19 algebra problems using a 6-point Likert scale ranging from "no confidence at all" to "complete confidence". This research will add to our understanding of the relationship between self-efficacy and performance.

Dimensionality and DIF on Multiple-Choice and Constructed Response Mathematics Items
Mary Garner, Emory U.
The Multidimensional Random Coefficients Multinomial Logit model is used to explore the influence of assumptions of dimensionality on the measurement of differential item functioning. DIF indices using a unidimensional model are compared to DIF indices that are produced using a multidimensional model. Both between-item multidimensionality and within-item multidimensionality models are used.

Rasch Measurement Theory, The Method of Paired Comparisons, and Graph Theory
Mary Garner, George Engelhard, Jr., Emory U.
The purpose of this paper is to answer the following questions: (1) What is the relationship between the method of paired comparisons and Rasch measurement theory? (2) What is the relationship between the method of paired comparisons and graph theory? (3) What can graph theory contribute to our understanding of Rasch measurement theory? Specifically, it is shown how the method of paired comparisons can lead to the Rasch model, just as consideration of the Rasch model can lead to a pairwise algorithm for estimating the parameters of the Rasch model. Furthermore, both graph theory and previously unexplored aspects of the method of paired comparisons are used to increase understanding and utility of a pairwise algorithm for estimating parameters of the Rasch model as presented by Choppin (1985).
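
The link between paired comparisons and Rasch item estimation can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the algorithm presented in the paper or by Choppin (1985): it uses a simple log-odds averaging estimator, and the function name `pairwise_difficulties` and the data layout are hypothetical. For each pair of items, only persons who answered exactly one of the two correctly are counted; under the Rasch model the log of the ratio of those counts estimates the difference in item difficulty, independently of person ability.

```python
import math

def pairwise_difficulties(responses):
    """Estimate centered item difficulties (logits) from 0/1 responses.

    responses: list of per-person lists, one 0/1 entry per item.
    Assumes complete data and that every item pair has persons
    succeeding on each item while failing the other.
    """
    n_items = len(responses[0])
    # b[i][j] = number of persons with item i correct and item j incorrect
    b = [[0] * n_items for _ in range(n_items)]
    for person in responses:
        for i in range(n_items):
            for j in range(n_items):
                if person[i] == 1 and person[j] == 0:
                    b[i][j] += 1

    difficulties = []
    for i in range(n_items):
        total = 0.0
        for j in range(n_items):
            if i == j:
                continue
            if b[i][j] == 0 or b[j][i] == 0:
                raise ValueError("empty pairwise count; estimate undefined")
            # ln(b[j][i] / b[i][j]) estimates difficulty_i - difficulty_j
            total += math.log(b[j][i] / b[i][j])
        # Averaging over all items yields estimates that sum to zero
        difficulties.append(total / n_items)
    return difficulties
```

Seen through graph theory, the items are vertices and each non-empty pairwise count is a directed edge; this simple estimator needs every pair connected in both directions, which is why it raises an error on an empty count.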

Rasch Solutions to Curriculum Based Measurement Norming
Peter MacMillan, U. of Northern British Columbia
A probe for reading consisted of a selected sample of grade-appropriate text. One score was the number of words read correctly in one minute, another was number of words written in one minute, and a third the number of those words spelled correctly. This problem raised issues not solvable with conventional analysis. Multifaceted Rasch analysis solved these problems and also enabled calibrating the difficulty of the prompts and comparison of performance across grades and schools.

Training and Rater Reliability in Performance Assessment (also AERA 10.48)
Juliette Mendelovits, Australian Council for Educational Research
About 200 classroom teachers of Grade 3, 7 and 10 students were trained to assess first language English speaking performances using a training video tape. The teachers used common marking schedules to assess taped sample performances of students from each of the Grades. Their assessments were collected and analyzed together with two sets of consensual 'expert' ratings. These data are the focus for a discussion of the effect of different types of experience on the reliability and accuracy of rater judgment.

Gender Differences in Attitudes toward Computers
Judith A. Monsaas, North Georgia College; George Engelhard Jr., Emory U.
Responses to the Survey of Attitudes Toward Learning About and Working with Computers instrument were analyzed with a Facets model to investigate gender bias on individual attitude items. Methodological and substantive issues are discussed.

Halo Effect in Personal Care Products
Thomas K. Rehfeldt, Helene Curtis Inc.
A Facets analysis of two sets of marketing research data on shampoos will be used to show the "halo" effect, or enhancement of unrelated properties because of an especially favorable response to the "halo" property. Specifically, the effect of fragrance on physical performance will be reported.

Using Maps to Produce Meaningful Evaluation Measures: Evaluating Middle School Science Teacher Change in Assessment, Collegial and Instructional Practices
Lily Roberts, University of California, Berkeley
Addresses a common challenge in program evaluation: to make the evaluation meaningful to the stakeholders. An example using criterion-referenced maps shows middle school science teacher change on measures of assessment, collegial and instructional practices. Maps generated using a Partial Credit model offer a solution to the methodological challenge. Use of these maps provides an interesting perspective on the issue of rhetoric versus reality in assessment reform.

A Many-Faceted Rasch Analysis of Faculty Research Grant Peer Ratings
Randall E. Schumacker, U. of North Texas
The study examined the peer review process of faculty members' research grant proposals in the presence of an incomplete rating design and the potential for the leniency or severity of the raters to impact the proposal's rating. The many-faceted Rasch model was used to investigate and calibrate the reviewers' effect.

Rasch Analysis of Two "Concern about Falling" Assessments and Applications to Planning Treatment Programs
Everett V. Smith Jr., U. of Oklahoma; Michelle M. Lusardi, U. of Connecticut
This study demonstrates the limitations of the Falls Efficacy Scale. These limitations were addressed using a simultaneous calibration of the Falls Efficacy Scale and Mobility Efficacy Scale items. A possible treatment program based on the simultaneous calibration is presented. Implications for future research are discussed.

Fit in a CAT Environment
John A. Stahl, Computer Adaptive Technologies Inc.
The ineffectiveness of the familiar Rasch fit statistics in the CAT environment has created a void. This paper is an initial exploration into alternative ways of detecting unexpected patterns of responses by people or to items. Investigated are binomial distribution theory, regression analysis of ability estimates across item administration, and Wald-Wolfowitz runs tests.
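
Of the three approaches named, the Wald-Wolfowitz runs test is the easiest to sketch. The example below is a minimal illustration under stated assumptions, not the procedure used in the paper: it assumes the response string has been coded 0/1 (for instance incorrect/correct, in order of administration) and uses the usual normal approximation. The function name `runs_test` is hypothetical.

```python
import math

def runs_test(sequence):
    """Return (runs, expected_runs, z) for an ordered 0/1 sequence."""
    n1 = sum(1 for x in sequence if x == 1)
    n2 = len(sequence) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both outcomes must occur at least once")
    # A new run starts wherever adjacent outcomes differ
    runs = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
    n = n1 + n2
    expected = 2.0 * n1 * n2 / n + 1.0
    variance = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1))
    if variance <= 0:
        raise ValueError("sequence too short for the normal approximation")
    z = (runs - expected) / math.sqrt(variance)
    return runs, expected, z
```

A strongly negative z indicates fewer runs than chance would produce, such as a long streak of failures followed by a long streak of successes; a strongly positive z indicates unusually regular alternation.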

Using the Rasch models to study large scale Physics Examinations in Australia
Andrew Stephanou, U. of Melbourne
This paper describes the use of the Rasch model to extract and display information about the behavior of individual examination questions and to study the performances of candidates in end-of-high-school physics examinations in Australia. The transformation of examination scores into interval measures is accompanied by a qualitative analysis of each examination question to identify the skills involved and to understand the factors influencing levels of question difficulty.

Identifying information from distractors in multiple choice items: a routine application of an IRT hypothesis
Irene Styles and David Andrich, Murdoch U.
The present methods for identifying information from distractors in multiple choice items are relatively complicated. A simple criterion for hypothesizing information in a distractor, which is easily applied with modern software, is described and illustrated. The criterion is that the response characteristic curve of a distractor with information should show a single-peaked form. If the distractor's characteristic curve shows such a form, it is rescored according to a format for ordered response categories. If the simple logistic model of Rasch is applied to the dichotomously scored responses, its extensions for ordered categories can be applied readily within the same framework. Complementary evidence is used to confirm or reject the hypothesis of information in a distractor.

Ordered Partition Model and Item Independency in Item Response Theory
Nikolai Volodin, Australian Council for Educational Research
Discusses the use of Wilson's Ordered Partition Model to analyze items with two correct responses and also dependent items.

From PPVT-R to PPVT-III: An Application of Rasch Common-Person Equating
Jing-Jen Wang, American Guidance Service
Illustrates how a small sample of relevant subjects can be used to successfully equate two forms of a test developed many years apart. Once successfully equated, scale scores on the two test forms can be meaningfully compared.
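
The core arithmetic of common-person equating can be sketched as follows. This is a minimal illustration under assumptions not stated in the abstract: both forms have already been calibrated separately in logits, the same persons took both forms, and the two scales differ only by a constant shift (no check of unit size or person fit is made). The names `common_person_shift` and `to_new_scale` are hypothetical.

```python
from statistics import mean

def common_person_shift(measures_old, measures_new):
    """Equating constant from persons measured on both forms (logits).

    measures_old[k] and measures_new[k] belong to the same person.
    """
    if not measures_old or len(measures_old) != len(measures_new):
        raise ValueError("need paired measures for the same persons")
    return mean(measures_new) - mean(measures_old)

def to_new_scale(measure_old, shift):
    """Express an old-form measure on the new form's scale."""
    return measure_old + shift
```

In practice one would also plot the paired measures and confirm that they scatter around a line of unit slope before trusting a single shift constant.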

Dichotomization of a graded response scale: a step too far?
Diane Whalley, Alan Tennant, Audrey Bowen, U. of Leeds
The Wimbledon Scale assesses mood disturbance in patients with neurological deficits. On data collected from 99 patients with Traumatic Brain Injury, three different response category formats were compared. Examination of item fit, local independence, item bias, category adequacy and level of measurement showed the original scoring proposed by the scale authors to be invalid.

Changes in teachers' perceptions of barriers to portfolio implementation
Edward W. Wolfe, Educational Testing Service; Chris W. T. Chiu, Michigan State U.
When measurements are made using Likert-type questionnaires over time, one must disentangle changes in persons from changes in items and scales. We describe procedures for doing so using Rasch anchoring methods. The procedure is illustrated with teachers' responses to a questionnaire concerning their perceptions of barriers to portfolio assessment implementation.

IOMW9 Posters

Mathematically Demonstrated Hierarchical Complexity of Tasks and Behavior Development Theory
Michael L. Commons, Harvard Medical School; Edward J. Trudeau, Harvard U.
Mathematically proving the "hierarchical complexity of tasks" from a well-defined system shows that stages exist independent of development theory. To define this system, the General Stage Model describes the axiomatic definition of simple tasks as logically primitive elements of the system. Hierarchical task sequences are formed out of linear orderings of the simple tasks based on their mathematical relationships.

Changes in Rater Characteristics over a 12 Month Interval
Peter Congdon, Australian Council for Educational Research
Compares the rating patterns of the same group of seven raters across a 12 month interval and across ten different tasks from within a marking period. The data come from an upper tertiary writing assessment program. These results will contribute to the knowledge of rater invariance and equating using raters and calibrated rater banks.

Using Rasch Analysis to Examine the Development of Socio-moral Concepts
Theo L. Dawson, U. of California at Berkeley
To trace the development of conceptions of good education, I performed an analysis of the educational concepts present in 121 clinical interviews. Over 600 concepts were identified. This poster traces the process that was used to refine concept categories without obscuring developmental differences. The role of Rasch modeling is described.

A Saltus Analysis of Developmental Data from the Laundry Problem Task Series
Eric Goodheart & Michael L. Commons, Harvard U.; Theo L. Dawson & Karen Draney, UC Berkeley
We perform a Saltus analysis of cross-sectional data from a sample of participants who completed the Laundry Problem Task Series (Commons, Miller, and Kuhn, 1982). The results of the analysis will be discussed in terms of their general implications for developmental research as well as the development of the Laundry Problem instrument.

Tracking the 1996 NCAA Football Season: Divisions IA & IAA
Patrick B. Fisher, Golden Rule Measurement
The "experts" rate teams based on their won/loss percentage. Rasch measurement takes into account each team's and its opponents' entire record. Examples will be shown explaining how a team with a losing record can be considered a top team and how a team with a winning record can be among the lesser teams. I will show my track record at predicting the outcome of games versus the point spread. I will also show the track record of predicting the range of the point spread when including the standard error of measurement versus what actually happened. I will show my track record of predictions for the season and the top 30 teams at various times through the season.

Vertical Equating of Reading Comprehension Skills across Grades 7 and 9
Tock-Keng Lim, Nanyang Technological U.; Ah-Keng Yap, Ngee Ann Polytechnic, Singapore
Equating of reading comprehension skills across Grades 7 and 9, with two tests assessing the following skills: sequence of events, story development, inferences, main ideas and reasoning. The Grades 7 and 9 data sets, containing common linked items, were first analyzed separately and then vertically equated using a common item anchoring method and a common item concurrent calibration method.

Measurement implications when categories in graded responses are not empirically distinguishable (also AERA 26.42)
Annette Mercer and David Andrich, Murdoch U.
Graded response formats in assessment imply a continuum which is partitioned into contiguous intervals. Often, response categories reflecting this partition are defined operationally. An important issue is whether these operational definitions work as intended. For example, it is possible that graders who are required to classify performances into five categories can reliably classify the performances into only three categories. Because they are sensitive to the number and operation of response categories, the Rasch models are particularly suited for investigating whether or not categories do work as intended. In particular, and in theory, categories should be combined, but only combined, if there is no discrimination at the threshold between the two categories. This paper describes a simulation study which examines implications for parameter estimates and tests of fit of analyses when all thresholds are assumed to discriminate equally but some in fact have zero discrimination, so that the categories adjacent to them should be combined.
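
As background to the threshold argument, the sketch below computes category probabilities under the Rasch model for ordered categories in the case the abstract calls the analysis assumption, namely that every threshold discriminates equally. It is an illustration only, not the simulation design of the paper; the function name `category_probabilities` and its parameter names are hypothetical.

```python
import math

def category_probabilities(theta, delta, thresholds):
    """Probabilities of categories 0..m for one person-item encounter.

    theta: person location, delta: item location, thresholds: the m
    threshold locations relative to the item, all in logits.
    """
    # Category x has exponent sum_{k<=x} (theta - delta - tau_k);
    # category 0 has the empty sum, 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - delta - tau))
    # Subtract the maximum before exponentiating for numerical stability
    top = max(exponents)
    numerators = [math.exp(e - top) for e in exponents]
    total = sum(numerators)
    return [v / total for v in numerators]
```

With ordered thresholds each category is the most probable one somewhere along the continuum; inspecting these probabilities for estimated thresholds is one way to see whether a category is operating as intended.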

Morality and Rule Comprehension: Catholic Children's Cognitive Moral Development during Grade School
Michael A. Morabito
Children were presented with ten item sets, each containing three social rule violations. Two of the three rules were from the same general rule domain, the third from another. Children were asked to identify the two like rule violations. Rasch analysis identified the developmental sequence inherent in these rules and the previously indicated age trends.

Rasch Analysis of Achievement in Mathematics
Ah-Keng Yap, Ngee Ann Polytechnic; Tock-Keng Lim, Nanyang Technological U.
A group of O-level students is required to attend a Mathematics Enrichment course that prepares them for further study toward an engineering diploma. Using an on-line test software package, each student is given a unique set of objective questions for the pre-test and post-test. Each student's pre-test and post-test responses are treated as if they were given by two different persons. The responses are then analyzed using the BIGSTEPS program. The results show that students improved on the post-test, indicating that the Enrichment course improved students' proficiency in mathematics.

IOMW9 Symposia

Comparative Approaches to Understanding Rating Scales
Organizer: Alan Tennant, U. of Leeds

Contrasting Analytical Approaches to Rating Scales
Alan Tennant, Diane Whalley, Audrey Bowen, U. of Leeds
How do dichotomous, rating scale and latent class analyses of the same data differ in the information they provide? Results of collaborative work with German researchers are discussed.

Developmental Assessment
Organizer: Mark Wilson, U. of California at Berkeley

Assessment of Development through Stage Transitions in Moral Reasoning
Theo Dawson and Mark Wilson, U. of California at Berkeley
Investigates whether alternative analytical approaches can identify the intermediate stages of moral and evaluative reasoning specified in Kohlberg's theory. Previous analyses using the Partial Credit model have not shown the expected clustering of subjects at the hypothesized stages. This may be due to the smoothing effect of that analytical technique.

Investigating "Strands" of the Australian English Profile
Geoff Masters, Australian Council for Educational Research
Do different variables, "strands", within a Profile (content area) cooperate to form one dimension? Results of analyses of different strands are presented to illustrate the problem of working with the dimensionality of a content area.

Mapping student development in middle school science with embedded assessments
Mark Wilson and Karen Draney, U. of California at Berkeley
An innovative middle school science curriculum has been developed by the Science Education for Public Understanding Program (SEPUP). At its center is an assessment system based on tasks which are embedded in the curriculum. These tasks are structured around four variables or core principles. As part of the SEPUP assessment system, maps have been developed to represent student progress on each of these developmental variables. The maps are graphical representations of the variable, showing how it unfolds or evolves over the year in terms of student performance on assessment tasks. The maps are derived from empirical analyses of student data and are based on ordering assessment tasks from relatively easy tasks to more difficult and complex ones. Once constructed, maps can be used to record and track student progress and to illustrate the skills a student has mastered and those that the student is working on. By placing students' performance on the continuum defined by the map, teachers can demonstrate students' progress with respect to the goals and expectations of the course. The maps are one tool to provide feedback to the teacher on how students are progressing and are also a source of information in providing feedback to students on their own performances in the course. Creating developmental maps of the SEPUP variables has presented conceptual, technical and practical problems. We describe the steps needed to create the maps and give examples of their uses in the classroom and for communication with parents and teachers.

Developmental Assessment: Aligning Tasks with Top-down Described Variables
Margaret Forster, Australian Council for Educational Research
The Australian National English (Language) Profile is an example of a variable or "Strand" based primarily on expert opinion: a top-down approach to the construction of variables. A range of tasks was developed to address outcomes from the Profile. Rasch analysis of responses to these tasks provides item calibrations which contribute to a better understanding of the variable and provides opportunities to enrich and revise the original definition.

Facets Applications
Organizer: John Michael Linacre, U. of Chicago

Winning with Facets
John M. Linacre, U. of Chicago
The benefits of the many-facet Rasch model for judge-mediated examinations are outlined. Specifically discussed are linearity, quality control and fairness. Flexibility of judging plan, construct validation and communication of results are touched on. Some comparisons are drawn with other analytical methods.

Individualizing Judge Performance Reports
Mary E. Lunz, American Society of Clinical Pathologists
The multi-facet Rasch model produces estimates of judge severity and an interaction report which shows the expected and observed performance of each judge on each project, student and/or task graded. The interaction report can be extremely useful for providing individualized feedback to each judge after a grading session. The presentation describes the information on the interaction report and one approach to transcribing that information to a format that can be easily interpreted and potentially useful to individual judges.

A Retrospective Method to Study Rater Stability across Administrations
Thomas R. O'Neill, American Society of Clinical Pathologists
A retrospective method to study the stability of rater severity across many administrations is described. When scores are intended to be interchangeable across administrations, a linking strategy must be used to express new raters' severity on the same scale as the original raters. Using "common raters" to link together two test administrations requires that the "common raters" maintain the same degree of severity across administrations. How to prospectively use the retrospective information is also discussed.

Foundations of Measurement
Organizer: David Andrich, Murdoch U.

In many instances in the advancement of knowledge, solutions are found to problems that have not been posed - it is the solutions that subsequently generate the problems, and it can turn out that these solutions have major significance to their field. These solutions come from unusual insights into the understanding of some area of study. There are examples of solutions arriving before the problems are articulated in Rasch measurement, including Rasch's own realization that in the elimination of parameters in his analysis of reading data, he had solved a problem that seemed never to have been posed. It is the solution to this problem that has created the distinctive impact of Rasch models for measurement. Some other examples of such problems within Rasch measurement are considered.

Theory, Observation, and Mathematical Thinking
William P. Fisher, Jr., Louisiana State U. Medical Center
To date, the fact that all observation is theory-laden has not overtly affected much quantitative psychosocial research practice. This session's other speakers' examples of ways in which available solutions impose themselves on problem definition point toward some ontological and social fundamentals of method. The extent to which these fundamentals are incorporated in Rasch measurement theory and practice is described, as is how they could be more fully implemented there.

Health Care Outcome Measurement
Organizer: Allen W. Heinemann, Rehabilitation Institute of Chicago
Three presentations will focus on distinct aspects of health care measurement. Two presentations will explore calibrations of various instruments; the third will explore the application of measurement processes with a single case. Dr. William Fisher will describe his calibration of the widely used MOS SF-36 with a locally-derived instrument of physical functioning. Dr. Rita Bode will explore differences in raters' descriptions of patients' functional status across impairment groups. Dr. Carl Granger will describe a diagnostic challenge and application of rating scale analysis that improves clinicians' understanding of a patient's physical condition.

Common-Sample Equating of the MOS SF-36 and the LSU HSI Physical Functioning Scales
William P. Fisher, Jr., Robert L. Eubanks, and Robert L. Marier, Louisiana State U. Medical Center
This report from an ongoing study equates the physical functioning subscales of the Medical Outcomes Study Short Form 36-item health status measure (SF-36) and the Louisiana State University Health Status Instruments (LSU HSI), and compares their respective measurement properties, using data from a convenience sample of 103 patients visiting a public hospital general medicine clinic.

Is a change in a patient's pattern of response, based on Rasch analysis, consistent with the clinical picture?
Carl V. Granger, State U. of New York, Buffalo
A 53-year-old female employed as a hospital nurses' aide has been experiencing chronic low back pain for 10 years. Intermittently she has lost time from work. Her pattern of response to physical functioning had been consistent with that of chronic low back pain. On the most recent visit, her pain worsened and her pattern of response was not compatible with that of chronic low back pain. Is this shift in response consistent with the clinical picture?

Using Multi-Faceted Rating Scale Analysis to Evaluate Functional Status during Medical Rehabilitation
Allen Heinemann and Rita Bode, Rehabilitation Institute of Chicago
Multi-faceted rating scale analysis provides a valuable approach to describing severity of disability using functional status items rated by team members from different disciplines. An example is provided using the Rehabilitation Institute of Chicago's Functional Assessment Scale. Results will focus on a calibration of patients, items, the rating scale, raters, and impairment groups in a sample of 7,000 patient records available from 1992 to 1996. We found that 1) rating scale problems need to be resolved before further analyses are conducted, 2) misfitting items may be reduced by specifying multiple rating scales, and 3) subscales may be important to distinguish.

Measuring Psychological Development
Organizer: Trevor Bond, James Cook U.

This symposium is an attempt to enhance communication between measurement theorists and investigators of the psychology of human development. Piagetian theories provide strong theoretical frameworks within which to address issues of interpretation and meaning. Each presented paper clarifies central issues in developmental theory by marrying empirical investigations with sensitive data analytical techniques. This symposium demonstrates how developmental psychologists are applying Rasch techniques with a view to obtaining insight into how to obtain yet greater benefit from Rasch methodology.

Comparing Décalage and Development with Cognitive Developmental Tests (also IOMW9 Poster)
Trevor Bond, James Cook U.
Research into formal operational thought using the Rasch model substantiates important aspects of the original theorizing of Piaget. Common-person equating has been used across tasks to estimate the relative difficulty of tasks and to estimate cognitive development over one- and two-year intervals. Estimates of cognitive development do not exceed 0.5 logits per year, but difficulty differences (décalage) between tests of formal thought are as large as 2 logits, confounding attempts to separate development from décalage. This paper investigates the use of Rasch methodology to solve this conundrum.

Can Qualitative Stage Characteristics be Revealed Quantitatively?
Gino Coudé, Gérald Noelting, and Jean-Pierre Rousseau, U. Laval
The correspondence between Piagetian stages of cognitive development in three tasks and the hierarchy of item difficulties was examined. The cognitive level of each task (Mixing juices, Caskets task - text-form, Coded orthogonal views) had previously been established qualitatively. Rasch analysis provided a helpful representation of item difficulty levels as well as an assessment of fit of items to the underlying cognitive trait. Results reveal a high correspondence between qualitative and quantitative analyses, but quantifying stage leaps remains to be fully investigated.

Developing conceptions of good education: Untangling content and structure (also AERA 10.57)
Theo Dawson, U. of California at Berkeley
The conceptual content and stage of 121 interviews have been analyzed. First, a detailed concept analysis yielded over 600 concepts. Second, the General Stage Scoring System, a method of assessing stage that is not based on the particular content of responses, was used to assess stage. The relationship of concept usage and stage is examined with partial credit analysis.

A Theoretical and Empirical Investigation of Misfitting Persons and Items Based on Cognitive Development (also AERA 10.57)
Christine M. Fox and William M. Gray, U. of Toledo
Using a test which represents different forms of thought and different logical operations, we explored 1) the rating scale structure of 18 plausible scoring schemes; 2) person and item misfits with both the original analysis and the new scoring schemes. Interpretations emphasized the congruence between empirical investigations and theory.

New Horizons in Standard Setting
Organizer: Gregory E. Stone, Dental Assisting National Board; Discussant: William P. Fisher, Jr., Louisiana State U. Medical Ctr.

Setting and Evaluating Performance Standards in High Stakes Writing Assessments
George Engelhard Jr., Emory U.; Belita Gordon, U. of Georgia
Describes a set of procedures for setting a performance standard (passing score) and evaluating the quality of the ratings obtained from standard-setting judges. The Georgia High School Writing Test illustrates the wide differences in judges' views of how well a student should write to receive a high school diploma.

Setting Standards on Performance Examinations
Mary E. Lunz, American Society of Clinical Pathologists
Describes several criterion standard-setting approaches for performance examinations: 1) Fair average using rating scale definitions; 2) Construction of a synthetic minimally competent candidate; 3) Project/case difficulty assessment for the minimally competent candidate; 4) Global compared to analytic scoring.

Informing Mastery Level through Binomial Trials: A Refinement of Objective Standard Setting
Gregory E. Stone, DANB; George Engelhard Jr., Emory U.
The logistic estimation model asks judges to directly define the criterion through the selection of "essential" content, but the mastery level itself is determined through philosophical discourse and evaluative means. In this paper, a binomial trials technique directs the process of mastery level selection within the logistic estimation model, to produce a mastery level in a decidedly more systematic and reasonable manner.

IOMW9 Workshops

Graphical Guides for Communicating Rating Scale Assessments
Richard M. Smith, Rehabilitation Foundation Inc.
Developing and using graphical guides for communicating rating scale analyses to medical rehabilitation professionals.

How to Work BIGSTEPS
John M. Linacre, U. of Chicago
Discusses recent enhancements to BIGSTEPS, including residual principal component analysis. Other features will be discussed in accord with participants' interests.

How to Work Facets
John M. Linacre, U. of Chicago
Discusses recent enhancements to Facets, including rating scale information. Other features will be discussed in accord with participants' interests.

Integrating QUEST with Item Creation, Storage and Retrieval
Brian Doig, Australian Council for Educational Research
Demonstrates an "all-in-one" item management system that Margaret Wu, Ray Adams and I have completed which integrates the Quest Rasch analysis program with item creation, storage and retrieval.

Measuring Psychological Development: Unresolved Issues
William Gray, U. of Toledo; Mark Wilson, U. of California at Berkeley
Reactions to the "Measuring Psychological Development" symposium will lead off a discussion of unresolved issues in this area.

Rating Scale Measurement and ASTM Standards
William P. Fisher, Jr., Louisiana State U. Medical Center
The administrative simplification section of the Health Insurance Portability and Accountability Act of 1996 (the Kennedy-Kassebaum bill) mandates the adoption of uniform national standards for health information. This presentation addresses the significance and use of scale-free measurement for creating and maintaining universal metrics in the context of the electronic health record.

AERA Papers

Measuring Feedback-Seeking Modes: An Alternative to Composite Scores (AERA 5.56)
Rita Bode, Rehabilitation Institute of Chicago
The objective of this study is to illustrate an alternative to the use of traditional composite scores in creating scales from survey items. The variable dealt with is mode of feedback-seeking. Survey items concerning frequency of feedback-seeking from various sources were Rasch-calibrated. The resulting item map was expanded to include calibrations for each response in the rating scale for each item. The expanded item map was then used to identify four types of feedback-seeking by referencing the item descriptors on the map. Using the Rasch procedure, it was possible to determine not only the extent to which individual faculty sought feedback but also which feedback-seeking modes they used and how frequently.

Assessing and Improving the Extraversion-Introversion Scale of the Myers-Briggs Type Indicator by the Rasch measurement (AERA 5.56)
Eunlim Kim Chi, Kyunghee U.
The Myers-Briggs Type Indicator (MBTI) is one of the most widely used personality inventories, but most psychometric studies of the MBTI have been restricted to factor analyses. This paper uses Rasch measurement to assess the validity of the EI (extraversion-introversion) scale of the MBTI and to improve its scoring system. This includes a fit analysis of the EI scale and an investigation of its unidimensionality. In addition, a scoring system using a single pole, instead of two poles, is suggested. The data for the present study came from the Korean version of the MBTI, which was administered to 235 Korean university students. The EI scale contained 21 items. The students' self-determined types were also collected to serve as a criterion for assessing the validity of the scale. The results identified the invalid items in the EI scale and demonstrated that the EI scale can be scored more validly using a single pole.

Primary, Concrete, Abstract, Formal, Systematic, and Metasystematic Operations as Observed in a "Piagetian" Balance-Beam Task Series (AERA 5.56)
Eric Andrew Goodheart, Harvard U.; Theo Linda Dawson, U. of California at Berkeley; Michael Lamport Commons, Harvard Medical School
We performed a Partial Credit analysis of cross-sectional developmental data gathered from children and adults who were presented a task series derived from Inhelder's and Piaget's balance beam. This analysis creates a probabilistic model situating both participants and items along a single hierarchically ordered dimension. As the General Stage Model predicted, the items formed a series of clusters along this dimension according to their order of hierarchical complexity. The order of hierarchical complexity predicted the item difficulty, which reflects the order in which items are learned. Gappiness between items of differing stage will be further analyzed using the Saltus model.

Advances in Partial Credit Models with Applications to Performance Assessment (AERA 12.32)
Huynh Huynh, U. of South Carolina
Research results to be presented focus on: 1) Partial credit models and testlets; 2) Equivalence between a partial credit item and a set of independent binary items; 3) Decomposition of a partial credit item into independent binary and indecomposable trinary items; 4) Number of categories for a partial credit item; 5) Location of score categories of a partial credit item and applications to criterion-referenced interpretation.

New Measures, New Methodology, New Psychometrics: A Synthesis (AERA 18.44)
Everett Smith Jr., U. of Oklahoma; J. Rogers, Connecticut State Education Dept.; M. Kulikowich, U. of Connecticut; T. Jetton, U. of Utah
Demonstrates a G-study and a Facets analysis of some performance-based data from 5 items, 3 judges, 2 testing times, and 44 students.

Quantifying Item Dependency by Fisher's Z (AERA 26.42)
Linjun Shen, National Board of Osteopathic Medical Examiners
This study proposes Fisher's Z as the index to assess local item dependency among clustered items. Fisher's Z is a transformation of the Pearson correlation coefficient of the standardized residuals between Rasch model predicted scores and observed scores for a pair of items across all persons in an exam. This paper also proposes using the distribution of Fisher's Z for stand-alone items in the same exam as the criterion to determine the significance level of dependency among clustered items. Compared with Yen's Q3, Fisher's Z has three improvements: it normalizes Yen's Q3, takes measurement error into account, and establishes a practical significance level of dependency. Therefore, Fisher's Z is a more sensitive and more practical index than Yen's Q3 for identifying local item dependency.
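
As a rough illustration of the index described above (not necessarily the author's exact procedure), the following Python sketch computes Fisher's Z for one pair of items. It assumes dichotomous Rasch items and assumes that person measures and item difficulties have already been estimated; the abstract does not specify either point. All function names and data values are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def standardized_residuals(responses, abilities, difficulty):
    """(observed - expected) / sqrt(model variance), one value per person."""
    residuals = []
    for x, theta in zip(responses, abilities):
        p = rasch_probability(theta, difficulty)
        residuals.append((x - p) / math.sqrt(p * (1.0 - p)))
    return residuals


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r):
    """Fisher's Z transformation: 0.5 * ln((1 + r) / (1 - r)), defined for -1 < r < 1."""
    return math.atanh(r)


def item_pair_fisher_z(resp_i, resp_j, abilities, diff_i, diff_j):
    """Fisher's Z of the residual correlation for one pair of items."""
    z_i = standardized_residuals(resp_i, abilities, diff_i)
    z_j = standardized_residuals(resp_j, abilities, diff_j)
    return fisher_z(pearson_r(z_i, z_j))


# Hypothetical illustration: four persons, two items.
abilities = [-1.0, 0.0, 1.0, 2.0]   # assumed person measures (logits)
item_a = [0, 1, 1, 1]               # made-up scored responses
item_b = [0, 0, 1, 1]
print(item_pair_fisher_z(item_a, item_b, abilities, diff_i=0.0, diff_j=0.5))
```

In practice the same calculation would be repeated for every pair of stand-alone items to build the reference distribution against which clustered item pairs are judged.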

Optimal Categorization of a Rating Scale: A Longitudinal Study (AERA 26.42)
Weimo Zhu, Wayne State U.; Wynn Updyke and Cheryl Lewandowski, Indiana U.
To determine the stability of its optimal categorization, a 50-item psychomotor self-efficacy scale was administered four times to a total of 2,022 children from 15 Midwestern schools during a four-year period. By combining adjacent categories in a "collapsing" process, in which new categorizations are constructed, the optimal categorization was determined by comparing indexes of Person and Item Separation and fit statistics provided by the Rasch analysis. It was found that the optimal categorization identified by the Rasch analysis can be stable and generalized to subsequent administrations, and the determination of an optimal categorization should be empirically based.
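
The "collapsing" step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not give the number of rating categories, so the five-category scale here is hypothetical, as are the function names. The sketch enumerates the candidate recodings; estimating separation and fit for each candidate would still require a Rasch analysis program.

```python
from itertools import combinations


def adjacent_collapsings(n_categories):
    """Yield every recoding that merges only adjacent categories.

    Each recoding is a list in which position c holds the new code for
    original category c. The uncollapsed scale is included as the last case.
    """
    boundaries = range(1, n_categories)  # possible cut points between categories
    for n_cuts in range(1, n_categories):
        for cuts in combinations(boundaries, n_cuts):
            mapping = []
            new_code = 0
            for category in range(n_categories):
                if category in cuts:
                    new_code += 1  # start a new collapsed category at each cut
                mapping.append(new_code)
            yield mapping


def recode(responses, mapping):
    """Apply one collapsing to a list of original category codes."""
    return [mapping[r] for r in responses]


# Hypothetical five-category scale coded 0..4.
for mapping in adjacent_collapsings(5):
    print(mapping)
```

Each recoded data set would then be analyzed separately, and the resulting Person and Item Separation indexes and fit statistics compared across candidates.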

Physical Functioning Construct Congruence across Instruments: Towards a Universal Metric (AERA 40.15)
William P. Fisher Jr., Louisiana State U. Medical Center
Begins examining the stability of a physical functional independence construct across instruments and samples. This is not a formal equating of instruments or samples, but indicates whether such an effort would be likely to succeed. The equating method employed is a pseudo-common item equating, in which the calibrations from separate samples of similar, but not identical items, from different instruments, are compared. More than 30 articles presenting Rasch analyses of physical functioning scales were reviewed. The final average correlation is .90, with an average of 7 pseudo-common items. Measures based on these calibrations should be linearly transformable versions of the same metric. The quantitative stability of different areas of physical functional independence across instruments and samples suggests that the development of a universal metric is a realizable goal.

A Many-facet Rasch Model for Assessing Disease Class-specific Diagnostic Abilities in Medical Students (AERA 40.15)
Frank J. Papa, R. E. Schumacker, Robert Stone, David Aldrich, U. of North Texas
Medical diagnostic accuracy appears to be both disease class-specific and a function of a case presentation's 'typicality'. A many-faceted Rasch model and computer adaptive testing were used to assess the level of case typicality (diagnostic difficulty) at which medical students performed in nine specific disease classes. Analysis indicated: 1) students (n=100) differed significantly in diagnostic ability; 2) the disease classes were significantly different; 3) the cases were significantly different. These results suggest that it may be possible to draw inferences regarding the robustness of a clinician's diagnostic disease class concepts.

The Measurement of Concern for Falling and Social Cognitive Applications to Planning Treatment Programs (AERA 40.15)
Everett V. Smith Jr., U. of Oklahoma; Michelle M. Lusardi, U. of Connecticut
First, we demonstrate limitations of the Falls Efficacy Scale. Second, we address these limitations using a simultaneous calibration of the Falls Efficacy Scale and Mobility Efficacy Scale items. Third, we discuss a possible treatment program based on the simultaneous calibration and Social Cognitive Theory. Results indicate that the Falls Efficacy Scale fails to assess the higher end of the self-efficacy continuum. Simultaneous calibration of the items remedied this lack of scale definition. This provides a theoretical framework for planning treatment programs.

Using Person Fit Statistics to Improve Prediction of Functional Outcomes (AERA 40.15)
Richard M. Smith, Kevin J. Fuss, Rehabilitation Foundation Inc.; Raymond E. Wright, SPSS
The prediction of the functional outcomes of the physical rehabilitation process based on the initial status of the patient has clear financial implications for the health care industry, particularly in the era of managed care with the focus on reducing the resources required to achieve a given level of functional status. In recent years there has been a trend towards the utilization of functional assessment instruments, such as the FIM(sm), PECS and LORS, that employ the Rasch model to convert the results of observations based on rating scale items to interval measures. However, little attention has been paid to the fit of the observed responses for each person to the requirements of the psychometric model or to the effect of misfit in the item level evaluations that are used in the prediction of functional status at the conclusion of treatment. To examine this effect, initial and discharge functional assessment scores on five different aspects of functionality were analyzed for several treatment programs. The results indicate modest increases in prediction of functional outcomes when only fitting initial evaluations were used in the analysis. However, the results were not consistent across programs.

Development of an instrument to identify Attention Deficit Disorder/Hyperactivity Disorder (AERA 50.16)
Everett Smith Jr., Brian Johnson, U. of Oklahoma
A validation study using multi-group common factor analysis and the Rasch rating scale model to address the validity of the DSM-IV criteria for ADD/HD in a college sample. We use the Rasch model to supplement findings in the common factor analysis and to identify subjects that may have ADD/HD. We also argue that the Rasch methodology is more useful given what many counseling psychologists want to do with scores from Likert-type assessments.

