Journal of Applied Measurement
GUIDELINES FOR MANUSCRIPTS
Reprinted from Smith, R.M., Linacre, J.M., and Smith, Jr., E.V. (2003). Guidelines for Manuscripts. Journal of Applied Measurement, 4, 198-204.
Included in this editorial are guidelines for manuscripts submitted to the Journal of Applied Measurement that involve applications of Rasch measurement. These guidelines may also be of use to those attempting to publish Rasch measurement applications in other journals whose editors and reviewers may be less familiar with these methods.
Following the guidelines, we provide a list of references that may assist individuals in gaining an overview of some of the material discussed in the guidelines. The guidelines and the list of references are by no means exhaustive. If you feel an important reference has been left out or have a recommendation for the guidelines, please e-mail us your suggestions (Richard Smith, via www.jampress.org).
Finally, we consider this a work in progress and thank William Fisher and George Karabatsos for comments on an earlier version. We will attempt to incorporate ideas and references as we receive them. Please periodically visit the journal website at www.jampress.org for the most recent updates.
A. Describing the problem
1. Adequate references, including at least a reference to Rasch (1960) when appropriate.
2. Adequate theory, including an exact algebraic representation of the Rasch model(s) used and a citation for the primary developer(s); an example statement follows this list.
3. Adequate description of the measurement problem, including hypothesized definition of latent variable, identification of facets under investigation, description of rating scales or response formats.
4. Rationale for using Rasch measurement techniques. For example, this may include the preference for the unique properties that Rasch models embody, the goal of establishing generalized reference standard metrics, or empirical justification, such as a comparison of the generalizability of the estimated parameters obtained from competing models. Addressing the rationale for using Rasch measurement is particularly important when reviewers are more familiar with the philosophy behind Item Response Theory (IRT) or Classical True-Score Theory (CTT).
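For point 2, one conventional algebraic statement of the dichotomous Rasch model (Rasch, 1960) is

$$\ln\!\left(\frac{P_{ni}}{1 - P_{ni}}\right) = B_n - D_i,$$

where $P_{ni}$ is the probability that person $n$ succeeds on item $i$, $B_n$ is the ability of person $n$, and $D_i$ is the difficulty of item $i$. Polytomous and many-facet models should be stated with the same explicitness.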
B. Describing the analysis
1. Name and citation or adequate description of software or estimation methodology employed.
2. Provide a rationale for the choice of fit statistics and the criteria employed to indicate adequate fit to the model requirements. This should include some acknowledgment of the Type I error rate that the critical values imply. Note: The mean square is not a symmetric statistic. A value of 0.7 is further from 1.0 than is 1.3, so a 1.3/0.7 cutoff applies different Type I error rates to the upper and lower tails of the mean square distribution.
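A minimal numeric illustration of this asymmetry (the 1.3/0.7 values are simply those quoted above, not recommendations):

```python
import math

# On the multiplicative scale of the mean square, distance from the
# expected value of 1.0 is better judged on the log scale.
for ms in (0.7, 1.3):
    print(f"mean square {ms}: |log distance from 1.0| = {abs(math.log(ms)):.3f}")

# |log(0.7)| = 0.357 exceeds |log(1.3)| = 0.262, so 0.7 is the more
# extreme value.  A symmetric pair of cutoffs consists of reciprocals,
# e.g., 1.3 and 1/1.3 = 0.77, which imply comparable Type I error
# rates in the two tails.
```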
C. Reporting the analysis
1. Map of linear variable as defined by items.
2. Map of distribution of sample on linear variable.
3. Report on the functioning of rating scale(s) and on any procedures taken to improve measurement (e.g., category collapsing).
Note: It is extremely difficult to make decisions about the use of response categories in the rating scale or partial credit model if there are fewer than 30 persons in the sample or fewer than 10 observations in each category. You may want to reserve that task until your samples are larger. If the person distribution is skewed, you may need even larger samples, since one tail of the distribution will not be well populated. The same is true if the sample mean is offset from the mean of the item difficulties, because there will then be few observations in the extreme categories of the items opposite the concentration of persons.
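A minimal sketch of the frequency check this note implies (the ratings below are invented for illustration):

```python
from collections import Counter

# Invented ratings on a five-category (0-4) scale.
ratings = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4,
           2, 3, 2, 1, 3, 4, 2, 3, 3, 2, 1, 3]

counts = Counter(ratings)
for category in range(5):
    n = counts.get(category, 0)
    flag = "  <-- fewer than 10 observations" if n < 10 else ""
    print(f"category {category}: {n} observations{flag}")
```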
4. Investigation of secondary dimensions in items, persons, etc. using, for example, fit statistics and other analyses of the residuals.
Note: Having all of the point-biserial correlations greater than 0.30 does not, in the rating scale and partial credit models, lend much support to unidimensionality. The median point-biserial in rating scale or partial credit data is often well above 0.70; in that situation, a number of items in the 0.30 to 0.40 range would be a good sign of multidimensionality.
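A small sketch of that comparison, with invented correlations:

```python
import statistics

# Invented point-biserial correlations for six items.
ptbis = {"item1": 0.78, "item2": 0.74, "item3": 0.71,
         "item4": 0.69, "item5": 0.36, "item6": 0.33}

median_r = statistics.median(ptbis.values())
low = [item for item, r in ptbis.items() if 0.30 <= r <= 0.40]
print(f"median point-biserial: {median_r:.2f}")
print("items far below the median (possible second dimension):", low)
```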
5. Investigation of local idiosyncrasies in items, persons, etc.
Note: Fit statistics for small sample sizes are very unstable; one or two unusual responses can produce a large fit statistic. Count the number of item/person standardized residuals that are larger than 2.0. You might be surprised how few there are. Do you want to drop an item just because of a few unexpected responses?
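A small sketch of that count, using simulated residuals:

```python
import random

# Simulated person-by-item standardized residuals; in practice these
# come from the output of the Rasch analysis.
random.seed(1)
n_persons, n_items = 100, 20
residuals = [[random.gauss(0, 1) for _ in range(n_items)]
             for _ in range(n_persons)]

large = sum(1 for row in residuals for z in row if abs(z) > 2.0)
total = n_persons * n_items
print(f"{large} of {total} residuals exceed |2.0| "
      f"({100 * large / total:.1f}%; roughly 5% is expected by chance)")
```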
6. Report Rasch separation and reliabilities, not KR-20 or Alpha.
Note: Reliability was originally conceptualized as the ratio of the true variance to the observed variance. Since the true-score model offers no way to estimate the standard error of measurement (SEM), a variety of methods (e.g., KR-20, Alpha) were developed to estimate reliability without knowing the SEM. The Rasch model makes it possible to approach reliability as originally intended, rather than relying on a less-than-ideal substitute.
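Concretely, with $SD^2$ the observed variance of the estimated measures and $RMSE^2$ their average error variance, the standard formulation (see Wright and Masters, 1982) is

$$SD_{\text{true}}^2 = SD^2 - RMSE^2, \qquad G = \frac{SD_{\text{true}}}{RMSE}, \qquad R = \frac{SD_{\text{true}}^2}{SD^2} = \frac{G^2}{1 + G^2},$$

where $G$ is the separation index and $R$ the Rasch reliability.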
7. Report on applicable validity issues.
Note: This is of particular importance when attempting to convey the results of Rasch analysis to non-Rasch oriented readers. Attempts should be made to address the validity issues raised by Messick (1989, 1995), Cherryholmes (1988), and the Medical Outcomes Trust (1995). See Smith (2001) for one interpretation and Fisher (1994) for connecting qualitative mathematical criteria for meaningfulness with quantitative mathematical criteria.
8. Any special measurement concerns?
For example: Missing data: not administered or what? Folded data: how resolved? Nested data: how accommodated? Loosely connected facets: how were differences in local origins removed? Measurement vs. description facets: how disentangled?
9. For tests of statistical significance, in addition to the test statistics, degrees of freedom, and p-values, we encourage authors to report and interpret effect sizes and/or confidence intervals.
D. Style and Terminology
1. Use Score for Raw Score and Measure or Calibration for Rasch-constructed linear measures.
2. We do not encourage the use of Item Response Theory as a term for Rasch measurement.
3. Rescale from logits to user-oriented scaling (a sketch follows this list).
4. If appropriate, attempt to convey the results in graphical format.
5. Do not use inappropriate language when discussing reliability and validity (e.g., "the test is reliable and valid"). It is the measures that are reliable, and it is the inferences made from the item and person measures and fit information that are valid for specific purposes.
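For point 3, a minimal sketch of linear rescaling; the anchor values here (500 units at 0 logits, 100 units per logit) are arbitrary choices for illustration:

```python
# Linearly map logit measures onto a user-oriented metric.
def rescale(logit, origin=500.0, units_per_logit=100.0):
    """Map a logit measure onto a user-oriented scale."""
    return origin + units_per_logit * logit

for measure in (-2.0, 0.0, 1.5):
    print(f"{measure:+.1f} logits -> {rescale(measure):.0f} scale units")

# Standard errors rescale by the same factor (multiply by units_per_logit).
```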
E. Common Oversights
1. Do not take the mean and standard deviation of point-biserial correlations; these statistics are even more non-linear than raw scores. It is best to report the median and inter-quartile range, or to apply a Fisher z-transformation before calculating a mean, as sketched below.
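A minimal sketch of both summaries, with invented correlations:

```python
import math
import statistics

# Invented point-biserial correlations.
rs = [0.35, 0.48, 0.61, 0.72, 0.78]

# The median is a safe summary as-is.
print(f"median r: {statistics.median(rs):.3f}")

# Or transform to Fisher's z (z = atanh(r)), average, and back-transform.
zs = [math.atanh(r) for r in rs]
mean_r = math.tanh(statistics.mean(zs))
print(f"mean r via Fisher z: {mean_r:.3f}")
print(f"naive mean r: {statistics.mean(rs):.3f}")
```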
2. When comparing the results of several calibrations of the same data, do not use the item and person reliabilities as criteria for improvement. These indices suffer from the same floor and ceiling effects as their true-score counterparts and hence may not accurately reflect increases in reliability. If an increase in reliability is one of your criteria for improvement, use the item and person separation indices to compare the results of multiple calibrations, as these indices do not suffer from the same deficiencies.
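The ceiling effect is easy to see numerically, using R = G²/(1 + G²) from the note to C.6 above:

```python
# Reliability R saturates as separation G grows, so equal improvements
# in measurement compress into ever-smaller changes in R near its
# ceiling of 1.0.
for G in (1.0, 2.0, 3.0, 4.0, 5.0):
    R = G**2 / (1 + G**2)
    print(f"separation {G:.0f} -> reliability {R:.3f}")
```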
References
Cherryholmes, C. (1988). Construct validity and the discourses of research. American Journal of Education, 96, 421-457.
Medical Outcomes Trust Scientific Advisory Committee. (1995). Instrument review criteria. Medical Outcomes Trust Bulletin, 1-4.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.
Rasch Measurement Models
Adams, R. J., Wilson, M. R., and Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-24.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-574.
Andrich, D. (1988). Rasch models for measurement. Sage university paper series on quantitative measurement in the social sciences. Newbury Park, CA: Sage Publications.
Bond, T. G., and Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. London: Erlbaum.
Fischer, G. H., and Molenaar, I. W. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research (Expanded edition, 1980. Chicago: University of Chicago Press).
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B. D., and Mok, M. (2000). Rasch models overview. Journal of Applied Measurement, 1, 83-106.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Rationale for Using Rasch Models
Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69-81.
Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath, and S. H. Lovibond (Eds.), Mathematical and Theoretical Systems (pp. 7-16). North Holland: Elsevier Science Publishers.
Andrich, D. (1995). Distinctive and incompatible properties of two common classes of IRT models for graded responses. Applied Psychological Measurement, 19, 101-119.
Andrich, D. (2001, October). Controversy and the Rasch model: A characteristic of a scientific revolution. Paper presented at the meeting of the International Conference on Objective Measurement: Focus on Health Care, Chicago, IL.
Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch's paradigm: A reflection for the next generation. Journal of Applied Measurement, 3, 325-359.
Bond, T. G., and Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. London: Erlbaum.
Choppin, B. (1985). Lessons for psychometrics from thermometry. International Journal of Educational Research (formerly Evaluation in Education), 9, 9-12.
Fisher, W. P., Jr. (1993). Scale-free measurement revisited. Rasch Measurement Transactions, 7, 272-273. www.rasch.org/rmt/rmt71.htm.
Fisher, W. P., Jr. (1995). Opportunism, a first step to inevitability? Rasch Measurement Transactions, 9, 426. www.rasch.org/rmt/rmt92.htm.
Fisher, W. P., Jr. (1996). The Rasch alternative. Rasch Measurement Transactions, 9, 466-467. www.rasch.org/rmt/rmt94.htm.
Linacre, J. M. (1996). The Rasch model cannot be "disproved"! Rasch Measurement Transactions, 10, 512-514. www.rasch.org/rmt/rmt103.htm.
Perline, R., Wright, B. D., and Wainer, H. (1979). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3, 237-256.
Romanoski, J., and Douglas, G. (2002). Test scores, measurement, and the use of analysis of variance: An historical overview. Journal of Applied Measurement, 3, 232-242.
Smith, R. M. (1992). Applications of Rasch measurement. Chicago: MESA Press.
Wright, B. D. (1967). Sample-free test calibration and person measurement. In B. S. Bloom (Chair), Invitational Conference on Testing Problems (pp. 84-101). Princeton, NJ: Educational Testing Service.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116.
Wright, B. D., and Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70, 857-860. Available at www.rasch.org/memo44.htm.
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Estimation Methodology
Fischer, G. H., and Molenaar, I. W. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Linacre, J. M. (1999). Estimation methods for Rasch measures. Journal of Outcome Measurement, 3, 382-405.
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Assessing Dimensionality and Fit
Andersen, E. B. (1973). A goodness-of-fit test for the Rasch model. Psychometrika, 38, 123-140.
Bond, T. G., and Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. London: Erlbaum.
Engelhard, G., Jr. (1994). Examining rater errors in the assessment of written composition with a many-facet Rasch model. Journal of Educational Measurement, 31, 93-112.
Engelhard, G., Jr. (1996). Clarification to "Examining rater errors in the assessment of written composition with a many-facet Rasch model". Journal of Educational Measurement, 33, 115-116.
Fischer, G. H., and Molenaar, I. W. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.
Glas, C. A. W. (1988). The derivation of some tests for the Rasch model from the multinomial distribution. Psychometrika, 53, 525-546.
Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49, 223-245.
Linacre, J. M. (1992). Prioritizing misfit indicators. Rasch Measurement Transactions, 9, 422-423.
Linacre, J. M. (1998a). Structure in Rasch residuals: Why principal component analysis? Rasch Measurement Transactions, 12, 636.
Linacre, J. M. (1998b). Detecting multidimensionality: Which residual data-type works best? Journal of Outcome Measurement, 2, 266-283.
Linacre, J. M., and Wright, B. D. (1994). Chi-square fit statistics. Rasch Measurement Transactions, 8, 360-361.
Smith, E. V., Jr. (2002). Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. Journal of Applied Measurement, 3, 205-231.
Smith, R. M. (1991a). IPARM: Item and person analysis with the Rasch model. Chicago: MESA Press.
Smith, R. M. (1991b). The distributional properties of Rasch item fit statistics. Educational and Psychological Measurement, 51, 541-565.
Smith, R. M. (1996a). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3, 25-40.
Smith, R. M. (1996b). Polytomous mean square fit statistics. Rasch Measurement Transactions, 10, 516-517.
Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1, 199-218.
Smith, R. M., Schumacker, R. E., and Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2, 66-78.
Wright, B. D. (1991a). Diagnosing misfit. Rasch Measurement Transactions, 5, 156.
Wright, B. D. (1991b). Factor item analysis versus Rasch item analysis. Rasch Measurement Transactions, 5, 134-135.
Wright, B. D. (1996a). Comparing Rasch measurement and factor analysis. Structural Equation Modeling, 3, 3-24.
Wright, B. D. (1996b). Local dependence, correlation, and principal components. Rasch Measurement Transactions, 10, 509-511.
Wright, B. D., and Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Rating Scale Category Effectiveness
Andrich, D. (1996). Category ordering and their utility. Rasch Measurement Transactions, 9, 465-466.
Andrich, D. (1998). Thresholds, steps, and rating scale conceptualization. Rasch Measurement Transactions, 12, 648-649.
Linacre, J. M. (1991). Step disordering and Thurstone thresholds. Rasch Measurement Transactions, 5, 171.
Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3, 102-122.
Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3, 86-106.
Stone, M., and Wright, B. D. (1994). Maximizing rating scale information. Rasch Measurement Transactions, 8, 386.
Wright, B. D., and Linacre, J. M. (1992). Disordered steps? Rasch Measurement Transactions, 6, 225.
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Zhu, W., Updyke, W. F., and Lewandowski, C. (1997). Post-hoc Rasch analysis of optimal categorization of an ordered-response scale. Journal of Outcome Measurement, 1, 286-304.
Reliability and Validity
Fisher, W. P., Jr. (1994). The Rasch debate: Validity and revolution in educational measurement. In M. Wilson (Ed.), Objective measurement: Theory into practice, Vol. 2 (pp. 36-72). Norwood, NJ: Ablex Publishing Corporation.
Fisher, W. P., Jr. (1997). Is content validity valid? Rasch Measurement Transactions, 11, 548.
Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7, 283-284.
Linacre, J. M. (1995). Reliability and separation nomograms. Rasch Measurement Transactions, 9, 421.
Linacre, J. M. (1996). True-score reliability or Rasch statistical validity? Rasch Measurement Transactions, 9, 455-456.
Linacre, J. M. (1999). Relating Cronbach and Rasch reliabilities. Rasch Measurement Transactions, 13, 696.
Smith, E. V., Jr. (2001). Reliability of measures and validity of measure interpretation: A Rasch measurement perspective. Journal of Applied Measurement, 2, 281-311.
Wright, B. D. (1995). Which standard error? Rasch Measurement Transactions, 9, 436-437.
Wright, B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9, 472.
Wright, B. D. (1998). Interpreting reliabilities. Rasch Measurement Transactions, 11, 602.
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Metric Development and Score Reporting
Linacre, J. M. (1997). Instantaneous measurement and diagnosis. In R. M. Smith (Ed.), Physical Medicine and Rehabilitation State of the Art Reviews, Vol. 11: Outcome Measurement (pp. 315-324). Philadelphia: Hanley & Belfus, Inc.
Ludlow, L. H., and Haley, S. M. (1995). Rasch model logits: Interpretation, use, and transformations. Educational and Psychological Measurement, 55, 967-975.
Smith, E. V., Jr. (2000). Metric development and score reporting in Rasch measurement. Journal of Applied Measurement, 1, 303-326.
Smith, R. M. (1991). IPARM: Item and person analysis with the Rasch model. Chicago: MESA Press.
Smith, R. M. (1992). Applications of Rasch measurement. Chicago: MESA Press.
Smith, R. M. (1994). Person response maps for rating scales. Rasch Measurement Transactions, 8, 372-373.
Stanek, J., and Lopez, W. (1996). Explaining variables. Rasch Measurement Transactions, 10, 518-519.
Woodcock, R. W. (1999). What can Rasch-based scores convey about a person's test performance? In S. E. Embretson and S. L. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know. Mahwah, NJ: Erlbaum.
Wright, B. D., Mead, R. J., and Ludlow, L. H. (1980). Kidmap: Research memorandum number 29. Chicago: MESA Press.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Zhu, W. (1995). Communicating measurement. Rasch Measurement Transactions, 9, 437-438.