A Facets Model for Judgmental Scoring

Abstract: An extension to the Rasch model for fundamental measurement is proposed in which there is parameterization not only for examinee ability and item difficulty but also for judge severity. Several variants of this model are discussed. Its use and characteristics are explained by an application of the model to an empirical testing situation.

Key-words: Rater, Rasch, Judging, Latent Trait, IRT

Authors:
John M. Linacre, University of Chicago
Benjamin D. Wright, University of Chicago
Mary E. Lunz, American Society of Clinical Pathologists.
MESA Memo 61, written in July 1990 and accepted for a special issue of Applied Measurement in Education, but not published due to lack of space.

I. Introduction:

This century has seen efforts to remove subjectivity from the measurement of examinee ability, aptitude or knowledge. There are still many areas, however, in which performance ratings depend on assessments made by judges. Artistic skills, essay writing and science projects are but a few of the many such areas in education. In measuring professional performance, "By far the most widely used performance measurement techniques are judgmental ones" (Landy and Farr, 1983 p.57).

Four factors dominate the rating given to an examinee's performance: the ability of the examinee, the difficulty of the task performed, the severity of the judge and the way in which the judge applies the rating scale. In a diving competition, each diver performs a series of dives, the test "items", and is rated on each dive by several judges, who probably differ in severity as well as in their application of the rating scale. The aim is to determine the most skilled diver, regardless of which dives are actually performed, and which judges happen to rate them.

When there are several judges, it would be ideal if all judges gave exactly the same rating to a particular performance. Then each performance need only be evaluated by one such ideal judge and all ratings would be directly comparable. Practically speaking, minor random differences in the ratings of the same performance by different judges might be acceptable. However, even this level of agreement is hard to obtain. "In a study designed to see how 'free from error' ratings could be under relatively ideal conditions, Borman selected expert raters, used a carefully designed instrument, and tested it by using videotapes of behavior enacted to present clearly different levels of performance; he succeeded in getting an agreement among the raters of above .80, and yet concluded that 'ratings are far from perfect'" (Gruenfeld, 1981 p.12).

Since differences between judge ratings are usually non-trivial, it becomes necessary to determine how judges differ, and how these differences can be accounted for, and hence controlled, in a measurement model.

II. Expanding the partial credit model to include judging.

In "Probabilistic Models for Some Intelligence and Attainment Tests" (1960/1980), Georg Rasch writes that "we shall try to define simultaneously the meaning of two concepts: degree of ability and degree of difficulty", and presents a model to enable this. Rasch's initial model has been expanded to allow for partial credit items by including a set of parameters which describe the partial credit steps associated with each item (Wright & Masters 1982, Masters 1982).

With the inclusion of judges in the measurement process, it is useful to define simultaneously not only the ability of the examinee, and the difficulty of the test item, but also the severity of the judge. This is accomplished by expanding the partial credit model to include parameters describing each judge's method of applying the rating scale. Because this involves the additional facet of judges, beyond the facets of examinees and items, this expansion is called the "facets model", and the computer program which performs the analysis is called FACETS.

III. The facets model.

Here is a facets model for a rating scale that is the same for all judges on all items.

loge (Pnijk/Pnij(k-1)) = Bn - Di - Cj - Fk (1)

Where
Pnijk is the probability of examinee n being awarded, on item i by judge j, a rating of category k
Pnij(k-1) is the probability of examinee n being awarded, on item i by judge j, a rating of category k-1
Bn is the ABILITY of examinee n, where n = 1,N
Di is the DIFFICULTY of item i, where i=1,L
Cj is the SEVERITY of judge j, where j=1,J
Fk is the HEIGHT of step k on a partial credit scale of K+1 categories, labelled 0,1,...,K in ascending order of perceived quality, where k = 1,K.

In this model, each test item is characterized by a difficulty, Di, each examinee by an ability, Bn, and each judge by a level of severity, Cj. The loge odds formulation of (1) places these parameters on a common scale of loge odds units ("logits").

Judges apply a rating scale to the performance of each examinee on each item. Each successive category represents a further step of discernibly better performance on the underlying variable being judged. The step term Fk has only one subscript, which defines the rating scale to have the same structure for every item and judge. This "common step" model always applies when every observation is a dichotomy, because then K=1 and F0=F1=0.
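
As an illustration of how equation (1) assigns probabilities to the rating categories, the following sketch (in Python, with made-up parameter values rather than estimates from any real data) computes the probability of each category for one examinee-item-judge encounter, taking F0 = 0 by convention:

    import math

    def category_probabilities(B_n, D_i, C_j, F):
        # F = [F_1, ..., F_K] are the step heights; F_0 is taken as 0.
        # Returns [P_0, ..., P_K], the probabilities of each rating category.
        logits = [0.0]                      # category 0: no steps taken
        for F_k in F:
            logits.append(logits[-1] + (B_n - D_i - C_j - F_k))
        expos = [math.exp(x) for x in logits]
        total = sum(expos)
        return [e / total for e in expos]

    # Illustrative values only (not estimates from the examination analyzed below):
    probs = category_probabilities(B_n=1.0, D_i=0.2, C_j=0.5, F=[-0.8, 0.0, 0.8])
    print([round(p, 3) for p in probs])     # four probabilities, summing to 1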

This model allows for the estimation of differences in severity between judges, and thus eliminates this kind of judge "bias" from the calibration of items and the measurement of examinees.

Its FACETS implementation does not require that every examinee be rated by every judge on every item. It is only necessary that the observations be designed to create a network through which every parameter can be linked to every other parameter, directly or indirectly, by connecting observations (Wright & Stone 1979 p. 98-106). This network enables all measures and calibrations estimated from the observations to be placed on one common scale.

A tempting way to organize a panel of judges might be for one judge to rate all the performances on one item, while another judge rates all the performances on another item. But this judging plan provides no way to discern whether a mean difference in ratings between the two items was because one item was harder, or because one judge was more severe. This confounding can be overcome by rotating judges among items so that, although the performance of an examinee on any particular item is rated once by only one judge, the total performance of each examinee is rated by two or more judges. Further, as part of the rotation design, each item is rated by several judges during the course of the examination scoring process.
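
A minimal way to verify the linkage requirement is to treat examinees, items and judges as nodes of a graph, with an edge for every rating, and check that the graph is connected. The sketch below (Python, hypothetical data) does this; note that connectedness is necessary but not by itself sufficient, since the one-judge-per-item plan just described yields a connected network in which item difficulty and judge severity remain confounded.

    from collections import defaultdict

    def is_connected(observations):
        # observations: list of (examinee, item, judge) tuples, one per rating
        graph, nodes = defaultdict(set), set()
        for n, i, j in observations:
            e, t, g = ("examinee", n), ("item", i), ("judge", j)
            nodes.update([e, t, g])
            for a in (e, t, g):
                graph[a].update({e, t, g} - {a})
        if not nodes:
            return True
        seen, stack = set(), [next(iter(nodes))]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node] - seen)
        return seen == nodes

    # Hypothetical plan: judge 1 rates everyone on item 1, judge 2 everyone on item 2.
    print(is_connected([(1, 1, 1), (2, 1, 1), (1, 2, 2), (2, 2, 2)]))   # True, yet confounded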

IV. Two other FACETS formulations.

a) Here is a facets model for a rating scale that holds across items but differs among judges.

loge (Pnijk/Pnij(k-1)) = Bn - Di - Cj - Fjk (2)

Where
Pnijk is the probability of examinee n being awarded, on item i by judge j, a rating of category k
Pnij(k-1) is the probability of examinee n being awarded, on item i by judge j, a rating of category k-1
Bn is the ABILITY of examinee n, where n = 1,N
Di is the DIFFICULTY of item i, where i=1,L
Cj is the SEVERITY of judge j, where j=1,J
Fjk is the HEIGHT of step k on a partial credit scale of K+1 categories, labelled 0,1,...,K in ascending order of perceived quality as applied by judge j, where k=1,K.

In this model, the heights of the steps between adjacent rating categories vary among judges. The two-subscript term Fjk is the height of the step from the lower category k-1 up to the next higher category k as used by judge j.

b) Here is a facets model for a rating scale that differs for each item/judge combination.

loge (Pnijk/Pnij(k-1)) = Bn - Di - Cj - Fijk (3)

Where:
Fijk is the HEIGHT of step k on item i as rated by judge j.

The complex term Fijk allows each judge to have a different way of using the rating categories on each item. This model is useful for examinations in which items differ in step structure so that judges who differ in their judging styles can also differ in the way they use different rating categories.
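
The price of the more flexible formulations is the number of step parameters that must be estimated. Using the dimensions of the examination analyzed below (15 items, 15 judges, 4 rating categories and hence K = 3 steps), a quick count:

    L_items, J_judges, K_steps = 15, 15, 3      # 4 categories => 3 steps

    print(K_steps)                              # model (1), one common scale:          3
    print(J_judges * K_steps)                   # model (2), one scale per judge:      45
    print(L_items * J_judges * K_steps)         # model (3), one per item-judge pair: 675

The heavier parameterizations demand correspondingly more ratings per judge, and per item-judge combination, before the step estimates become stable.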

V. The measurement properties of the facets model.

Test results obtained through the ratings of judges are descriptions of a local interaction between examinees, items and judges. Such results remain ambiguous with respect to inference, unless they can be transformed into measures with general meaning. It is essential for inference that the measures estimated from the test results be independent of the particular sample of examinees and items comprising the test situation. This requirement is especially apparent when examinees do not face identical testing situations. In circumstances where examinees are rated by different judges, respond to different sets of test items, or perform different demonstrations of competence, measures must be independent of the particular local judging situation to be meaningful at all. The construction of this independence is always necessary when the intention is to compare examinees on a common scale.

This generalizability of measurement estimates is called objectivity (Rasch 1968). Objectivity of examinee measures is modelled to exist when the same measures for examinees are obtained regardless of which sample of items, from the universe of relevant items, and which panel of judges, from the universe of relevant judges, were used in the test.

The facets model can be derived directly from the requirement of objectivity in the same manner as other Rasch models (Linacre 1987), and consequently also satisfies the mathematical requirements for fundamental measurement. Counts of steps taken are sufficient statistics for each parameter (Fisher, 1922), and the parameters for each facet can be estimated independently of estimates of the other facets. The measures of the examinees are thus "test-freed" and "judge-freed" (Wright 1968, 1977). Complete objectivity may not be obtainable for steps of the rating scale, however, when the definition of the scale depends on the observed structure of a particular testing situation (Wright & Douglas 1986).

VI. Estimating the parameters of the facets model.

The FACETS estimation equations are derived by Linacre (1987) in a manner similar to those obtained for the partial credit model by Wright & Masters (1982 p.86), using unconditional (joint) maximum likelihood estimation (JMLE; Fisher 1922). These equations yield parameter estimates, based on the sufficient statistics, and asymptotic standard errors for the ability of each examinee, the difficulty of each item, the severity of each judge, and the additional level of performance represented by each step on the partial credit scale. Mean-square fit statistics (Wright & Masters, 1982 p.100) are also obtained.
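
The memo does not reproduce the estimation equations, but the following sketch indicates the general character of an unconditional (JMLE) iteration for model (1): each parameter is adjusted by its score residual divided by the modelled variance until observed and expected scores agree. It is a simplified illustration only (the step heights F are taken as known, extreme scores are not handled, and no convergence test is shown), not the FACETS program's actual algorithm.

    import math
    from collections import defaultdict

    def probs(b, d, c, F):
        # Category probabilities 0..K under model (1); F = [F_1, ..., F_K].
        logits, cum = [0.0], 0.0
        for f in F:
            cum += b - d - c - f
            logits.append(cum)
        es = [math.exp(x) for x in logits]
        s = sum(es)
        return [e / s for e in es]

    def jmle(data, N, L, J, F, iters=50):
        # data: list of (examinee n, item i, judge j, rating x)
        B, D, C = [0.0] * N, [0.0] * L, [0.0] * J
        for _ in range(iters):
            res, var = defaultdict(float), defaultdict(float)
            for n, i, j, x in data:
                p = probs(B[n], D[i], C[j], F)
                e = sum(k * pk for k, pk in enumerate(p))               # expected rating
                w = sum(k * k * pk for k, pk in enumerate(p)) - e * e   # modelled variance
                for key in (("B", n), ("D", i), ("C", j)):
                    res[key] += x - e
                    var[key] += w
            B = [b + res[("B", n)] / max(var[("B", n)], 1e-9) for n, b in enumerate(B)]
            D = [d - res[("D", i)] / max(var[("D", i)], 1e-9) for i, d in enumerate(D)]
            C = [c - res[("C", j)] / max(var[("C", j)], 1e-9) for j, c in enumerate(C)]
            # centre items and judges to fix the origin of the scale
            mean_D = sum(D) / L
            D = [d - mean_D for d in D]
            mean_C = sum(C) / J
            C = [c - mean_C for c in C]
        return B, D, C     # a standard error is approximately 1/sqrt(var) at convergence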

VII. An application of FACETS.

The facets model was applied to performance ratings obtained by an examination board which certifies histotechnologists. 287 examinees were rated on 15 items by a panel of 15 judges. The ratings were on a 4-category partial credit scale labelled from 0 to 3, in which 0 means "poor/unacceptable", 1 means "below average", 2 means "average", and 3 means "above average" performance, as defined during the thorough training which the judges received.

Each examinee's performance on each item was rated only once. However, the 15 items were divided into 3 groups of 5 (1-5, 6-10, 11-15) so that each group of 5 items for each examinee could be rated by a different judge. Thus each examinee was rated by three judges, over the 15 items. Judges rotated through the groups of 5 items, so that each judge rated all 15 items over the course of the scoring session. The rotation was also designed so that the combinations of three judges per examinee varied over examinees. This provided a network of connections which linked all judges, items and examinees into one common measurement system, while enabling the separate estimation of the parameters of each facet.
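
A rotation of the kind described can be generated mechanically. The sketch below is a hypothetical illustration (with a smaller panel than the actual fifteen judges) of assigning each examinee's three item blocks to three different judges while varying the judge combinations across examinees:

    def rotation_plan(n_examinees, judges, n_blocks=3):
        # Assign each examinee's item blocks (e.g. items 1-5, 6-10, 11-15)
        # to n_blocks different judges, shifting the starting judge so that
        # the combination of judges varies from examinee to examinee.
        plan = {}
        for e in range(n_examinees):
            start = e % len(judges)
            plan[e] = [judges[(start + b) % len(judges)] for b in range(n_blocks)]
        return plan

    print(rotation_plan(6, list("ABCDE")))
    # {0: ['A','B','C'], 1: ['B','C','D'], 2: ['C','D','E'], 3: ['D','E','A'], ...}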

Two aspects of judge behavior were examined. First, the extent to which judges differed in severity. Second, the extent to which each judge had his own way of using the rating scale, and how this affected his awarding of credit.

First, judges were calibrated under the assumption that they all applied the rating scale in the same way, but that each judge represented a different degree of severity. This is the facets model given in equation (1), in which rating scale steps are represented by Fk. Judge severity was calibrated at the logit value where the probability of awarding category "2" equalled that of awarding category "3" (rather than at equal probability of awarding category "0" or category "3") because these judges awarded far more "2" or "3" ratings than "0" or "1". This prevented perturbations in the infrequent awarding of 0 ratings from disturbing the estimation of judge severity. The resulting estimates are in Table 1. (The counts of ratings are in the last line of Table 3.)

                      Count of   Count of  Sum of   Severity          Mn Sq
   Judge   Label      Examinees  Ratings   Ratings  in Logits  S.E.    Fit
   ------------------------------------------------------------------------
     A  (most severe)     68       340       803       .51     .08     .85
     B                    58       290       668       .35     .08    1.03
     C                    69       345       830       .23     .08     .89
     D                    43       215       533       .11     .10     .80
     E                    45       225       560       .08     .10     .87
     F                    79       395       982       .06     .08     .98
     G                    48       240       591       .00     .10     .78
     H                    47       235       594      -.01     .10     .91
     I                    52       260       652      -.02     .10     .96
     J                    32       160       409      -.05     .13    1.01
     K                    50       250       646      -.07     .10    1.05
     L                    58       290       728      -.18     .09    1.23
     M                   106       530      1382      -.33     .07    1.33
     N                    48       240       635      -.33     .12    1.22
     O  (most lenient)    58       290       767      -.35     .10    1.27
   ------------------------------------------------------------------------
        Mean:             57.4     287.0     718.7     .00     .09    1.01
    Standard Deviation:   17.2      86.1     221.8     .24     .02     .17

    Table 1. COMMON Scale Judge Calibrations at step
             from category 2 to category 3.

The "Count" column in Table 1 shows that these judges rated different number of examinees over the course of the examination, e.g. Judge M rated 106 examinees, while Judge J rated only 32. (Since each judge rated an examinee on 5 of the examinee's 15 items, the count of ratings is five times the count of examinees.) "Sum of Ratings" is the grand total of the ratings given by each judge. "Severity in logits" is the calibration of each judge according to the facets model (1),(and the choice of reference point at the transition from "2" to "3"). "Severity in logits" is accompanied by its modelled asymptotic standard error. Finally, a mean-square fit statistic is reported. Values greater than one indicate more variance in the ratings than was modelled. Values less than one indicate more dependence in the ratings than was modelled.

In the literature, judges are often used as though they were interchangeable. Each judge is thought to be "equivalent" to an "ideal" judge but for some small error variance. Were this the case, judge severities would be homogeneous.

But a chi-square test for homogeneity among these 15 judges (116 with 14 d.f.) is significant well beyond the .01 level. The hypothesis that these judges are interchangeable approximations of some ideal is unsupportable.
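
The memo does not spell out the homogeneity test, but a plausible reconstruction weights each severity in Table 1 by the inverse of its squared standard error and sums the squared deviations from the weighted mean; with the tabled values this reproduces a chi-square of about 116 on 14 degrees of freedom:

    severity = [.51, .35, .23, .11, .08, .06, .00, -.01, -.02, -.05, -.07, -.18, -.33, -.33, -.35]
    se       = [.08, .08, .08, .10, .10, .08, .10, .10, .10, .13, .10, .09, .07, .12, .10]

    weights = [1 / s ** 2 for s in se]
    mean    = sum(w * c for w, c in zip(weights, severity)) / sum(weights)
    chi_sq  = sum(w * (c - mean) ** 2 for w, c in zip(weights, severity))
    print(round(chi_sq))        # about 116, with 15 - 1 = 14 degrees of freedom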

               ++-----------+-----------+-----------+-----------+-----------++
            45 +                                                             +
               |                                                             |
   Examinee    |                                                  1114231    |
   Score       |                                                             |
               |                                         12123521            |
               |                                                             |
               |                                   1 244342                  |
               |                                                             |
               |                                 1596231                     |
               |                                                             |
            40 +                             1259544                         +
               |                                                             |
               |                            358552                           |
               |                                                             |
               |                        1 584555                             |
               |                                                             |
               |                        3353131                              |
               |                                                             |
               |                       232672                                |
               |                                                             |
            35 +                      532121                                 +
               |                                                             |
               |                    2322122                                  |
               |                                                             |
               |                 X  2223                                     |
               |                                                             |
               |                   23221                                     |
               |                                                             |
               |                1 11 1                                       |
               |                                                             |
            30 +                 111 1                                       +
               |                                                             |
               |                 21                                          |
               |                              Note: curvilinearity and       |
               |                1  1                obtuseness of relation   |
               |                                                             |
               |              12                                             |
               |                                                             |
               |            1    1                                           |
               |                                                             |
            25 +          Y 2 1  W                                           +
               |                                                             |
               |           1                                                 |
               |                                                             |
               |             1                                               |
               |                                                             |
               |                                                             |
               |                                                             |
               |                                                             |
               |                                                             |
            20 +        1                                                    +
               ++-----------+-----------+-----------+-----------+-----------++
               -1           0           1           2           3  logits   4
                                    Examinee measure

               Figure 1.  Examinee score vs measure, on the COMMON judge scale.
                          (W, X, Y discussed in text)

The effect of this variation in judge severity can be demonstrated by comparing each objective examinee measure with its judge-dependent raw score. These are plotted in Figure 1. The ordinate is the raw score for each examinee, the sum of the 15 ratings each received, which has a possible range of zero to 45 points.

The abscissa is the logit measure estimated for each examinee from the ratings each received, but adjusted for variation in judge severity by the facets measurement model. The horizontal spread of measures corresponding to each raw score shows the degree to which different levels of judge severity disturb the meaning of a raw score. Similarly, the vertical spread in raw scores corresponding to each logit measure shows the range of raw scores that an examinee of any given ability might receive depending on the combination of judges who rated him.

As can be seen, one examinee (W), who scored 25, is estimated to have greater ability (0.43 logits) than another examinee (X), who scored 33, with a measure of 0.41 logits. Raw scores are biased against the examinee who scored 25, and would be unfair were the pass-fail criterion 30 points and only raw scores considered. The bias in the raw scores is entirely due to the particular combinations of severe and lenient judges that rated these examinees.

The bias in raw scoring is also brought out by a comparison of examinee W with examinee Y, who also scored 25, but measured only -0.19. The raw scores of W and Y are identical, but the measured difference in their ability is 0.43 + 0.19 = 0.62 logits. Since the measures of W and Y have standard errors of 0.27 logits, a statistical test for their difference (t = 1.62, p = 0.1) may be significant enough to alarm an examining board concerned with making fair and defensible pass-fail decisions.
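
The arithmetic of that comparison can be reproduced directly from the two measures and their standard errors:

    import math
    from statistics import NormalDist

    diff    = 0.43 - (-0.19)                      # W's measure minus Y's measure
    se_diff = math.sqrt(0.27 ** 2 + 0.27 ** 2)    # standard error of the difference
    t       = diff / se_diff
    p       = 2 * (1 - NormalDist().cdf(t))       # two-sided, treating t as approximately normal
    print(round(t, 2), round(p, 2))               # 1.62 and about 0.10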

The introduction into the measurement model of parameters calibrating and hence adjusting for the severity of judges enables the obvious inequities due to variance in judge severity to be removed from examinee measures.

So far, we have allowed each judge to have his own level of severity, but have acted as though each uses the rating scale in the same way. Experience suggests that each judge, though thoroughly trained and experienced, applies the rating scale in a slightly different, though self-consistent, manner.

Consequently, we will now model each judge to have his own personal way of using the rating scale. This corresponds to an analysis based on model equation (2), in which the step structure is represented by Fjk. Table 2 shows the judge severity estimates when each judge is calibrated with his personal rating scale. Again severity was calibrated at the logit value where the probability of awarding category "2" equalled that of awarding category "3".

                      Count of   Count of  Sum of   Severity          Mn Sq
   Judge              Examinees  Ratings   Ratings  in Logits  S.E.    Fit
   ------------------------------------------------------------------------
     A                    68       340       803       .90     .08    1.01
     B                    58       290       668       .41     .08    1.06
     C                    69       345       830       .16     .08     .91
     D                    43       215       533       .22     .10     .81
     E                    45       225       560       .37     .11    1.02
     F                    79       395       982       .09     .08    1.00
     G                    48       240       591       .41     .11    1.01
     H                    47       235       594       .19     .11    1.06
     I                    52       260       652      -.16     .09     .93
     J                    32       160       409       .06     .13    1.13
     K                    50       250       646       .02     .11    1.09
     L                    58       290       728      -.48     .09    1.08
     M                   106       530      1382      -.67     .07    1.04
     N                    48       240       635      -.75     .10     .90
     O                    58       290       767      -.82     .10    1.05
   ------------------------------------------------------------------------
        Mean:             57.4     287.0     718.7     .00     .10    1.01
    Standard Deviation:   17.2      86.1     221.8     .47     .02     .08

      Table 2. PERSONAL Scale Judge Calibrations

As would be expected, giving each judge his own rating scale has lessened the degree of unexpected behavior. The fit statistics are closer to their expected value of one when judges are modelled with personal scales. Comparing Tables 1 and 2, this is most clearly noticeable for the more lenient judges (L, M, N, O) and the more severe judges (A, E).

                            Categories
            0                 1                 2                 3
  Judge     Used  Rel.        Used  Rel.        Used  Rel.        Used  Rel.
       Count  %     % | Count   %     % | Count   %     % | Count   %     % |
  ---------------------------------------------------------------------------
   A    14    4   135 |   18    3    88 |  139   11   145 |  169    7    81 |
   B    15    5   170 |   27    5   156 |  103    9   128 |  145    7    82 |
   C    12    3   114 |   36    5   173 |   97    7   100 |  200    8    94 |
   D     9    4   138 |    8    2    62 |   69    8   113 |  129    8    96 |
   E     5    2    73 |    8    2    59 |   84   10   130 |  128    8    91 |
   F    11    3    92 |   27    3   113 |  116    8   104 |  241    8    98 |
   G     3    1    41 |   12    3    82 |   96   10   139 |  129    7    86 |
   H     3    1    42 |   13    3    91 |   76    8   113 |  143    8    97 |
   I     7    3    88 |   20    4   127 |   67    7    91 |  166    9   102 |
   J     1    1    21 |   10    3   102 |   48    8   104 |  101    8   100 |
   K     4    2    53 |   10    2    66 |   72    7   100 |  164    9   104 |
   L     9    3   102 |   24    4   137 |   67    6    82 |  190    9   105 |
   M    23    4   143 |   18    2    57 |  103    5    68 |  386   10   115 |
   N     9    4   123 |    9    2    62 |   40    4    59 |  182   10   119 |
   O     6    2    68 |   20    3   114 |   45    4    55 |  219   10   118 |
  ---------------------------------------------------------------------------
  All  131    3   100 |  260    3   100 | 1222    7   100 | 2692    8   100 |

           Table 3.  Use frequency of rating scale categories.

In this examination, each judge rated a more or less random sample of examinees, and so an inspection of how many ratings each judge awarded in each category provides an explanation for the change in fit statistics when judges are calibrated on their personal scales. Table 3 gives the percentage of ratings, "Used %", given in each category by each judge.

These percents show how judges differ in the way they used the rating scale. The "Rel. %" columns show how much each judge used each category relative to the use of that category by all the judges. The more severe judges (A through H) used relatively more ratings of "2" than were expected from the common scale. When calibrated on the common scale, these judges had less dispersion, more central tendency, in their ratings than was expected, and so their ratings were less stochastic than expected, resulting in mean-square fits of less than one. On the other hand, the more lenient judges (I through O) awarded relatively more extreme ratings of "3". When modelled on the common scale, their fit statistics were greater than one, showing more dispersion in their ratings than was modelled. Nevertheless, the patterns of responses in Table 3 show considerable similarity in the way that these judges viewed the rating scale, once the variation in their severity levels is accounted for.

In fact, this panel of judges is so well trained that none of their fit statistics is unacceptable. Of the residual variance obtained when modelling all judges to be identical, xx% is explained by allowing each judge his own severity but using a common scale, and only a further xx% by modelling each judge to have his own scale.

               ++-----------+-----------+-----------+-----------+-----------++
             4 +                                                             +
               |                                                       11    |
               |                                                      11     |
  Examinee     |                                                      11     |
  Measure      |                                                     1       |
  on           |                                                    13       |
  PERSONAL     |                                                             |
  Judge        |                                              111  1         |
  Scales       |                                              11             |
             3 +                                              1   1          +
               |                                             12              |
               |                                          2 12               |
               |   Note: collinearity and               1   1                |
               |         acuity of relation              3 1                 |
               |                                       1212                  |
               |                                       4                     |
               |                                     26  1                   |
               |                                   3311                      |
             2 +                                  4931                       +
               |                                2442                         |
               |                                941                          |
               |                              696                            |
               |                             693                             |
               |                            78                               |
               |                          493                                |
               |                         291                                 |
               |                       196                                   |
             1 +                      58                                     +
               |                     562                                     |
               |                    54                                       |
               |                 X142                                        |
               |                1211                                         |
               |                1W1                                          |
               |              12                                             |
               |            1                                                |
               |          Y 1 1                                              |
             0 +           111                                               +
               |        1                                                    |
               ++-----------+-----------+-----------+-----------+-----------++
               -1           0           1           2           3   logits  4
                          Examinee measure on COMMON Judge scale
               Figure 2.  Examinee measures on COMMON Judge scale
                          vs PERSONAL Judge scales.

In Figure 2, the measures obtained for each examinee when the judges are regarded as using a common scale are plotted against those obtained when each judge is allowed his personal scale. The examinee points are located close to the identity line. This is a visual representation of the fact that giving the judges their own scales has had very little effect on the ordering of examinees by ability. The Spearman rank order correlation of the two examinee measures is 0.998, indicating that almost no examinee's pass-fail decision would be affected by choice of models.

In contrast, the Spearman correlation between raw scores and common-scale measures is 0.976. This suggests that modelling a personal rating scale for each judge, over and above adjusting for judge severity, need not result in a meaningful difference so far as examinee measures are concerned.

Allowing each judge his own rating scale weakens inference because it lessens the generality of the measures obtained. Were a new judge included, it would be necessary to estimate not only his level of severity but also his own personal manner of using the rating scale.

In Tables 1 and 2, the judge severity calibrations for common and personal scales are in statistically equivalent order. The personal scale calibrations, however, have twice the range of the common scale calibrations. Modelling a common scale has forced judges to seem more alike in severity. Fortunately Figure 2 shows that the effect on examinee measures of this compression of differences in judge severity is immaterial. But, for the study of judging and judge training, specifying each judge to have his own scale brings out noteworthy features of judge behavior.

VIII. Conclusion.

The facets model is an extension of the partial credit model, designed for examinations which include subjective judgments. Its development enables the benefits of "sample-free", "test-free", and "judge-free" measurement to be realized in this hitherto intractable area. The use of the facets model yields greater freedom from judge bias and greater generalizability of the resulting examinee measures than has previously been available. The practicality of the facets model in allowing simple, convenient judging designs has proved of benefit to those for whom rapid, efficient judging is a priority. Further, the diagnostic information it provides is of use in judge training.

In the examination that was analyzed, it is clear that judges differ significantly in their severity, and that it is necessary to model this difference when determining examinee measures. Some evidence was also found to suggest that, even after allowing for this difference in severities, judges use the categories of the rating scale differently. However, for well-trained judges, the consequent differences in examinee measures do not appear to be large enough to merit modelling a separate rating scale for each judge. Modelling one common scale was satisfactory for practical purposes.

IX. Bibliography.

Borman, W.C. Exploring Upper Limits of Reliability and Validity in Job Performance Ratings. J. Applied Psychology 1978 63:2 134-144

Fisher R.A. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 1922, 222, 309-368.

Gruenfeld E.F. Performance Appraisal: Promise and Peril. Ithaca, New York: Cornell University Press 1981

Landy F.J., Farr J.L. The Measurement of Work Performance. New York: Academic Press 1983

Linacre J.M. An extension of the Rasch model to multi-faceted situations. Chicago: University of Chicago, Department of Education 1987

Masters, G.N. A Rasch Model for partial credit scoring. Psychometrika 1982 47:149-174.

Rasch G. Probabilistic Models for Some Intelligence and Attainment Tests. (Copenhagen 1960) Chicago: University of Chicago Press 1980.

Rasch G. A mathematical theory of objectivity and its consequences for model construction. In "Report from the European Meeting on Statistics, Econometrics and Management Sciences", Amsterdam 1968.

Wright B.D. Sample-free test calibration and person measurement. In "Proceedings of the 1967 Invitational Conference on Testing Problems". Princeton, N.J.: Educational Testing Services 1968.

Wright B.D. Solving measurement problems with the Rasch model. Journal of Educational Measurement, 1977, 14, 97-116.

Wright B.D., Douglas G.A. The Rating Scale Model for Objective Measurement. MESA Research Memorandum No. 35. Chicago: University of Chicago 1986.

Wright B.D., Masters G.N. Rating Scale Analysis: Rasch Measurement. Chicago: MESA Press 1982.

Wright B.D., Stone M.H. Best Test Design: Rasch Measurement. Chicago: MESA Press 1979.

