Journal of Applied Measurement

GUIDELINES FOR MANUSCRIPTS

Reprinted from Smith, R.M., Linacre, J.M., and Smith, Jr., E.V. (2003). Guidelines for Manuscripts. Journal of Applied Measurement, 4, 198-204.

Included in this editorial are guidelines for manuscripts submitted to the Journal of Applied Measurement that involve applications of Rasch measurement. These guidelines may also be of use to those attempting to publish Rasch measurement applications in other journals whose editors and reviewers may be less familiar with these methods.

Following the guidelines, we provide a list of references that may assist individuals in gaining an overview of some of the material discussed in the guidelines. Neither the guidelines nor the list of references is exhaustive. If you feel an important reference has been left out or have a recommendation for the guidelines, please e-mail your suggestions to Richard Smith via www.jampress.org.

Finally, we consider this a work in progress and thank William Fisher and George Karabatsos for comments on an earlier version. We will attempt to incorporate ideas and references as we receive them. Please periodically visit the journal website at www.jampress.org for the most recent updates.

A. Describing the Problem

1. Adequate references, including at least a reference to Rasch (1960) when appropriate.

2. Adequate theory, including at least an exact algebraic representation of the Rasch model(s) used and a citation for the primary developer(s); see the example equation at the end of this section.

3. Adequate description of the measurement problem, including the hypothesized definition of the latent variable, identification of the facets under investigation, and a description of the rating scales or response formats.

4. Rationale for using Rasch measurement techniques. For example, this may include a preference for the unique properties that Rasch models embody, the goal of establishing generalized reference-standard metrics, or empirical justification obtained by, for example, comparing the generalizability of the estimated parameters from competing models. Addressing the rationale for using Rasch measurement is particularly important when reviewers are more familiar with the philosophy behind Item Response Theory or Classical True-Score Theory (CTT).
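For example, the dichotomous Rasch model (Rasch, 1960), the simplest member of the family, may be written as

    P(x_{ni} = 1) = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}

where \beta_n is the measure of person n and \delta_i is the calibration of item i. Polytomous and many-facet models should be written out analogously, with citations to their primary developers (e.g., Andrich, 1978; Masters, 1982; Linacre, 1989).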

B. Describing the Analysis

1. Name and citation, or adequate description, of the software or estimation methodology employed.

2. Provide a rationale for the choice of fit statistics and the criteria employed to indicate adequate fit to the model requirements. This should include some acknowledgment of the Type I error rate that the critical values imply. Note: The mean square is not a symmetric statistic; a value of 0.7 is further from 1.0 than is 1.3. Using a 1.3/0.7 cutoff for mean squares therefore applies different Type I error rates to the upper and lower tails of the mean square distribution (see the sketch below).
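As a quick illustration of that asymmetry, the following Python sketch (illustrative, not prescriptive) compares candidate cutoffs on the log scale, where mean squares are approximately symmetric about 1.0:

    import math

    # Mean squares are multiplicative around 1.0, so compare cutoffs on a log scale.
    upper = 1.3
    print(f"log-distance of {upper} from 1.0: {abs(math.log(upper)):.3f}")  # ~0.262
    print(f"log-distance of 0.7 from 1.0: {abs(math.log(0.7)):.3f}")        # ~0.357, further
    print(f"log-symmetric lower cutoff for {upper}: {1 / upper:.2f}")       # ~0.77, not 0.7

A lower cutoff of 1/1.3 (about 0.77) implies roughly the same Type I error rate as the 1.3 upper cutoff; 0.7 implies a more stringent one.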

C. Reporting the Analysis

1. Map of the linear variable as defined by the items.

2. Map of the distribution of the sample on the linear variable.

3. Report on the functioning of the rating scale(s) and on any procedures taken to improve measurement (e.g., collapsing categories).

Note: It is extremely difficult to make decisions about the use of response categories in the rating scale or partial credit model if there are fewer than 30 persons in the sample or fewer than 10 observations in each category; you may want to defer that task until your samples are somewhat larger. If the person distribution is skewed, even larger samples may be needed, since one tail of the distribution will not be well populated. The same is true if the sample mean is offset from the mean of the item difficulties: there will be few observations in the extreme categories of the items opposite the concentration of persons. (A category-count sketch follows this list.)

4. Investigation of secondary dimensions in items, persons, etc. using, for example, fit statistics and other analyses of the residuals.

Note: All of the point-biserial correlations being greater than 0.30 in the rating scale and partial credit models does not lend much support to the concept of unidimensionality. The median point-biserial in rating scale or partial credit data is often well above 0.70; in that situation, a number of items in the 0.30 to 0.40 range would be a good sign of multidimensionality.

5. Investigation of local idiosyncrasies in items, persons, etc.

Note: Fit statistics for small sample sizes are very unstable; one or two unusual responses can produce a large fit statistic. Count the number of person-by-item standardized residuals larger than 2.0 (see the residual-count sketch following this list); you might be surprised how few there are. Do you want to drop an item because of a few unexpected responses?

6. Report Rasch separation and reliabilities, not KR-20 or alpha.

Note: Reliability was originally conceptualized as the ratio of true variance to observed variance. Because the true-score model provided no method of estimating the standard error of measurement (SEM), a variety of coefficients (e.g., KR-20, alpha) were developed to estimate reliability without knowing the SEM. The Rasch model makes it possible to approach reliability as originally intended, rather than through a less-than-ideal substitute (see the separation sketch following this list).

7. Report on applicable validity issues.

Note: This is of particular importance when attempting to convey the results of a Rasch analysis to non-Rasch-oriented readers. Attempts should be made to address the validity issues raised by Messick (1989, 1995), Cherryholmes (1988), and the Medical Outcomes Trust (1995). See Smith (2001) for one interpretation, and Fisher (1994) for a connection between qualitative mathematical criteria for meaningfulness and quantitative mathematical criteria.

8. Any special measurement concerns?

For example: Missing data: were items not administered, or were responses missing for another reason, and how was this handled? Folded data: how were they resolved? Nested data: how were they accommodated? Loosely connected facets: how were differences in local origins removed? Measurement vs. description facets: how were they disentangled?

9. For tests of statistical significance, in addition to the test statistics, degrees of freedom, and p-values, we encourage authors to report and interpret effect sizes and/or confidence intervals.
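Following up on the note to item 3, here is a minimal Python sketch of a category-count check; the ratings matrix is simulated purely for illustration, and in practice you would tabulate your own data:

    import numpy as np

    # Hypothetical ratings: 25 persons by 10 items, categories 0-4 (simulated).
    rng = np.random.default_rng(0)
    ratings = rng.integers(0, 5, size=(25, 10))

    # Count the observations in each category and flag sparse ones.
    categories, counts = np.unique(ratings, return_counts=True)
    for category, n in zip(categories, counts):
        flag = "  <-- fewer than 10 observations" if n < 10 else ""
        print(f"category {category}: {n} observations{flag}")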
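For the note to item 5, a sketch of counting large standardized residuals; the matrix is simulated here, whereas in practice you would use the person-by-item standardized residuals reported by your Rasch software:

    import numpy as np

    # Simulated person-by-item standardized residuals (stand-in for software output).
    rng = np.random.default_rng(1)
    z = rng.standard_normal((100, 20))

    # Under good fit, roughly 5% of |z| values exceed 2.0 by chance alone.
    n_large = int(np.sum(np.abs(z) > 2.0))
    print(f"{n_large} of {z.size} standardized residuals exceed |2.0| "
          f"({100 * n_large / z.size:.1f}%)")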
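For item 6, a sketch of how person separation and reliability follow directly from the measures and their standard errors; the numbers are illustrative, and the formulas are the standard Rasch ones (e.g., Wright and Masters, 1982):

    import numpy as np

    # Person measures (logits) and their standard errors -- illustrative values.
    measures = np.array([-1.2, -0.4, 0.1, 0.6, 1.3, 2.0, -0.8, 0.9])
    se = np.array([0.35, 0.30, 0.29, 0.30, 0.33, 0.40, 0.31, 0.32])

    obs_var = measures.var(ddof=1)   # observed variance of the measures
    mse = np.mean(se ** 2)           # mean-square measurement error
    true_var = obs_var - mse         # "true" variance, as originally conceived

    separation = np.sqrt(true_var / mse)   # separation index G
    reliability = true_var / obs_var       # equals G**2 / (1 + G**2)
    print(f"separation G = {separation:.2f}, reliability = {reliability:.2f}")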

D. Style and Terminology

1. Use "score" for raw scores and "measure" or "calibration" for Rasch-constructed linear measures.

2. We do not encourage the use of "Item Response Theory" as a term for Rasch measurement.

3. Rescale from logits to a user-oriented scaling (a rescaling sketch follows this list).

4. If appropriate, attempt to convey the results in graphical format.

5. Do not use inappropriate language when discussing reliability and validity (e.g., "the test is reliable and valid"). It is the measures that are reliable, and it is the inferences made from the item and person measures and fit information that are valid for specific purposes.
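For item 3, a minimal sketch of a linear, user-oriented rescaling; the slope and intercept are arbitrary illustrations chosen to avoid negative values and decimals, not a standard:

    def rescale(logit, slope=10.0, intercept=50.0):
        """Linearly transform logits; interval properties are preserved."""
        return slope * logit + intercept

    # Map a typical -5 to +5 logit range onto a 0-100 reporting scale.
    for logit in (-1.5, 0.0, 2.3):
        print(f"{logit:+.1f} logits -> {rescale(logit):.0f} scale units")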

E. Common Oversights

1. Do not take the mean and standard deviation of point-biserial correlations; correlations are even more non-linear than raw scores. It is best to report the median and inter-quartile range, or to apply a Fisher z-transformation before calculating a mean (see the sketch following this list).

2. When comparing the results of several calibrations of the same data, do not use the item and person reliabilities as criteria for improvement. These indices suffer from the same floor and ceiling effects as their true-score counterparts and hence may not accurately reflect increases in reliability. If an increase in reliability is one of your criteria for improvement, use the item and person separation indices to compare the results of multiple calibrations, as these indices do not suffer from the same deficiencies.
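For item 1, a sketch of the recommended summaries of point-biserial correlations; the correlation values are illustrative only:

    import numpy as np

    # Illustrative point-biserial correlations for a set of items.
    r = np.array([0.28, 0.41, 0.55, 0.62, 0.70, 0.74, 0.78])

    # Preferred: median and inter-quartile range.
    iqr = np.percentile(r, 75) - np.percentile(r, 25)
    print(f"median = {np.median(r):.2f}, IQR = {iqr:.2f}")

    # If a mean is required: average on the Fisher z scale, then back-transform.
    mean_r = np.tanh(np.mean(np.arctanh(r)))
    print(f"Fisher-z mean = {mean_r:.2f} (naive mean = {np.mean(r):.2f})")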

References

Cherryholmes, C. (1988). Construct validity and the discourses of research. American Journal of Education, 96, 421-457.

Medical Outcomes Trust Scientific Advisory Committee. (1995). Instrument review criteria. Medical Outcomes Trust Bulletin, 1-4.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.

Rasch Measurement Models

Adams, R. J., Wilson, M. R., and Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-24.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-574.

Andrich, D. (1988). Rasch models for measurement. Sage University Paper series on Quantitative Applications in the Social Sciences. Newbury Park, CA: Sage Publications.

Bond, T. G., and Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. London: Erlbaum.

Fischer, G. H., and Molenaar, I. W. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research (Expanded edition, 1980. Chicago: University of Chicago Press).

Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B. D., and Mok, M. (2000). Rasch models overview. Journal of Applied Measurement, 1, 83-106.

Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.

Rationale for Using Rasch Models

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69-81.

Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath, and S. H. Lovibond (Eds.), Mathematical and theoretical systems (pp. 7-16). North Holland: Elsevier Science Publishers.

Andrich, D. (1995). Distinctive and incompatible properties of two common classes of IRT models for graded responses. Applied Psychological Measurement, 19, 101-119.

Andrich, D. (2001, October). Controversy and the Rasch model: A characteristic of a scientific revolution. Paper presented at the meeting of the International Conference on Objective Measurement: Focus on Health Care, Chicago, IL.

Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch's paradigm: A reflection for the next generation. Journal of Applied Measurement, 3, 325-359.

Bond, T. G., and Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. London: Erlbaum.

Choppin, B. (1985). Lessons for psychometrics from thermometry. International Journal of Educational Research (formerly Evaluation in Education), 9, 9-12.

Fisher, W. P., Jr. (1993). Scale-free measurement revisited. Rasch Measurement Transactions, 7, 272-273. Available at www.rasch.org/rmt/rmt71.htm.

Fisher, W. P., Jr. (1995). Opportunism, a first step to inevitability? Rasch Measurement Transactions, 9, 426. Available at www.rasch.org/rmt/rmt92.htm.

Fisher, W. P., Jr. (1996). The Rasch alternative. Rasch Measurement Transactions, 9, 466-467. Available at www.rasch.org/rmt/rmt94.htm.

Linacre, J. M. (1996). The Rasch model cannot be "disproved"! Rasch Measurement Transactions, 10, 512-514. Available at www.rasch.org/rmt/rmt103.htm.

Perline, R., Wright, B. D., and Wainer, H. (1979). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3, 237-256.

Romanoski, J., and Douglas, G. (2002). Test scores, measurement, and the use of analysis of variance: An historical overview. Journal of Applied Measurement, 3, 232-242.

Smith, R. M. (1992). Applications of Rasch measurement. Chicago: MESA Press.

Wright, B. D. (1967). Sample-free test calibration and person measurement. In B. S. Bloom (Chair), Invitational Conference on Testing Problems (pp. 84-101). Princeton, NJ: Educational Testing Service. Available at www.rasch.org/memo1.htm.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116. Available at www.rasch.org/memo42.

Wright, B. D., and Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70, 857-860. Available at www.rasch.org/memo44.htm.

Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.

Estimation Methodology

Fischer, G. H., and Molenaar, I. W. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.

Linacre, J. M. (1999). Estimation methods for Rasch measures. Journal of Outcome Measurement, 3, 382-405.

Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.

Assessing Dimensionality and Fit

Andersen, E. B. (1973). A goodness-of-fit test for the Rasch model. Psychometrika, 38, 123-140.

Bond, T. G., and Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. London: Erlbaum.

Engelhard, Jr., G. (1994). Examining rater errors in the assessment of written composition with a many-facet Rasch model. Journal of Educational Measurement, 31, 93-112.

Engelhard, Jr., G. (1996). Clarification to "Examining rater errors in the assessment of written composition with a many-facet Rasch model." Journal of Educational Measurement, 33, 115-116.

Fischer, G. H., and Molenaar, I. W. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Glas, C. A. W. (1988). The derivation of some tests for the Rasch model from the multinomial distribution. Psychometrika, 53, 525-546.

Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49, 223-245.

Linacre, J. M. (1992). Prioritizing misfit indicators. Rasch Measurement Transactions, 9, 422-423.

Linacre, J. M. (1998a). Structure in Rasch residuals: Why principal component analysis? Rasch Measurement Transactions, 12, 636.

Linacre, J. M. (1998b). Detecting multidimensionality: Which residual data-type works best? Journal of Outcome Measurement, 2, 266-283.

Linacre, J. M., and Wright, B. D. (1994). Chi-square fit statistics. Rasch Measurement Transactions, 8, 360-361.

Smith, Jr., E. V. (2002). Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. Journal of Applied Measurement, 3, 205-231.

Smith, R. M. (1991a). IPARM: Item and person analysis with the Rasch model. Chicago: MESA Press.

Smith, R. M. (1991b). The distributional properties of Rasch item fit statistics. Educational and Psychological Measurement, 51, 541-565.

Smith, R. M. (1996a). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3, 25-40.

Smith, R. M. (1996b). Polytomous mean square fit statistics. Rasch Measurement Transactions, 10, 516-517.

Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1, 199-218.

Smith, R. M., Schumacker, R. E., and Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2, 66-78.

Wright, B. D. (1991a). Diagnosing misfit. Rasch Measurement Transactions, 5, 156.

Wright, B. D. (1991b). Factor item analysis versus Rasch item analysis. Rasch Measurement Transactions, 5, 134-135.

Wright, B. D. (1996a). Comparing Rasch measurement and factor analysis. Structural Equation Modeling, 3, 3-24.

Wright, B. D. (1996b). Local dependence, correlation, and principal components. Rasch Measurement Transactions, 10, 509-511.

Wright, B. D., and Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.

Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.

Rating Scale Category Effectiveness

Andrich, D. (1996). Category ordering and their utility. Rasch Measurement Transactions, 9, 465-466.

Andrich, D. (1998). Thresholds, steps, and rating scale conceptualization. Rasch Measurement Transactions, 12, 648-649.

Linacre, J. M. (1991). Step disordering and Thurstone thresholds. Rasch Measurement Transactions, 5, 171.

Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3, 102-122.

Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3, 86-106.

Stone, M., and Wright, B. D. (1994). Maximizing rating scale information. Rasch Measurement Transactions, 8, 386.

Wright, B. D., and Linacre, J. M. (1992). Disordered steps? Rasch Measurement Transactions, 6, 225.

Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.

Zhu, W., Updyke, W. F., and Lewandowski, C. (1997). Post-hoc Rasch analysis of optimal categorization of an ordered-response scale. Journal of Outcome Measurement, 1, 286-304.

Reliability and Validity

Fisher, Jr., W. P. (1994). The Rasch debate: Validity and revolution in educational measurement. In M. Wilson (Ed.), Objective measurement: Theory into practice, Vol. 2 (pp. 36-72). Norwood, NJ: Ablex Publishing Corporation.

Fisher, Jr., W. P. (1997). Is content validity valid? Rasch Measurement Transactions, 11, 548.

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7, 283-284.

Linacre, J. M. (1995). Reliability and separation nomograms. Rasch Measurement Transactions, 9, 421.

Linacre, J. M. (1996). True-score reliability or Rasch statistical validity? Rasch Measurement Transactions, 9, 455-456.

Linacre, J. M. (1999). Relating Cronbach and Rasch reliabilities. Rasch Measurement Transactions, 13, 696.

Smith, Jr., E. V. (2001). Reliability of measures and validity of measure interpretation: A Rasch measurement perspective. Journal of Applied Measurement, 2, 281-311.

Wright, B. D. (1995). Which standard error? Rasch Measurement Transactions, 9, 436-437.

Wright, B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9, 472.

Wright, B. D. (1998). Interpreting reliabilities. Rasch Measurement Transactions, 11, 602.

Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.

Metric Development and Score Reporting

Linacre, J. M. (1997). Instantaneous measurement and diagnosis. In R. M. Smith (Ed.), Physical medicine and rehabilitation state of the art reviews, Vol. 11: Outcome measurement (pp. 315-324). Philadelphia: Hanley & Belfus, Inc.

Ludlow, L. H., and Haley, S. M. (1995). Rasch model logits: Interpretation, use, and transformations. Educational and Psychological Measurement, 55, 967-975.

Smith, Jr., E. V. (2000). Metric development and score reporting in Rasch measurement. Journal of Applied Measurement, 1, 303-326.

Smith, R. M. (1991). IPARM: Item and person analysis with the Rasch model. Chicago: MESA Press.

Smith, R. M. (1992). Applications of Rasch measurement. Chicago: MESA Press.

Smith, R. M. (1994). Person response maps for rating scales. Rasch Measurement Transactions, 8, 372-373.

Stanek, J., and Lopez, W. (1996). Explaining variables. Rasch Measurement Transactions, 10, 518-519.

Woodcock, R. W. (1999). What can Rasch-based scores convey about a person's test performance? In S. E. Embretson and S. L. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know. Mahwah, NJ: Erlbaum.

Wright, B. D., Mead, R. J., and Ludlow, L. H. (1980). KIDMAP (Research Memorandum No. 29). Chicago: MESA Press.

Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.

Zhu, W. (1995). Communicating measurement. Rasch Measurement Transactions, 9, 437-438.
