Traditional tests of educational skills, including the skills of reading, have severe limitations. Unlike other familiar measuring instruments (the ruler and the thermometer are good examples), educational tests are not calibrated in a common unit. A score of five on one test does not mean the same as a score of five on another. Seventy-five per cent does not always represent a 'satisfactory' level of performance. The fact that Jack scored nine out of ten on this test and Jill scored seven out of ten on that one tells us nothing about the relative attainments of Jack and Jill. In general we lack the information needed to draw meaningful comparisons from scores on two different tests even where the tests are supposed to be measuring the same thing. A test is like a thermometer with the calibration in degrees erased and replaced by some arbitrary and uneven series of marks. Such a thermometer may be useful for determining which of two cups of coffee is the hotter, but readings from it cannot readily be integrated with measurements made with properly calibrated thermometers.
Educational tests differ with regard to
(a) their content (what the test is supposed to measure)
(b) their difficulty level (how hard it is to score on the test)
(c) their discriminating power (how finely they can distinguish between different levels of performance)
The first of these is fundamental. We want tests that will give us information about different types of educational attainment. We want geography tests to be quite different and separate from arithmetic tests and chemistry tests. Within the reading field we may want separate tests of inferential comprehension and rate of comprehension. Indeed much effort goes into the construction and refinement of tests of different traits. The second and third types of differences however are generally unhelpful. We introduce these differences because of the need to measure people of widely differing levels of attainment, and because the uses to which measurements are put also vary. The differences however do not aid interpretation. They make it difficult to compare scores achieved on different tests even when these tests are measuring precisely the same thing.
The development of 'norms' for a standardized test is an attempt to get over this hurdle. They lead to the translation of a raw score on the test on to some other scale (e.g. reading age, reading quotient or a percentile) which is common to a set of tests, and thus enables comparisons to be made within the set. The problem with this is that tests are usually normed only with regard to a small number of separate 'populations' (often only one). Norms developed for a random sample of nine-year-old school children in southern England will not be appropriate for interpreting the performance of a recent immigrant from Asia, a mentally handicapped fifteen-year-old, or even a nine-year-old who lives in Glasgow. Further, even when we are satisfied that the norms available are suitable, we are still restricted to the use of a comparatively small set of tests, each of which must be given in its entirety, under standard conditions, if scores for which the norms are valid are to result. There are circumstances in which a fully calibrated measurement system would be useful. When monitoring national standards of attainment there are several good reasons why it would be preferable to use a large number of test questions scattered among different forms of the test. To make sense of the results we need to be able to relate the tests one to another. Similarly, to measure the rate of growth of reading ability in an individual child we need to accumulate test results over an extended period. As it is clearly undesirable to confront the child with the same questions over and over again, different tests should be used, and so again we need to be able to compare and interpret the results. Lastly, there are occasions when it is impracticable, or at least inconvenient, to give a standardized test in its entirety. A teacher may want to select and administer only those parts of a test that bear on a particular problem with which he is concerned. 
Norms, however, are available only for scores on complete tests and so the advantages of standardization are lost.
An item bank is a large collection of test questions organized and catalogued like the books in a library. The idea is that the test user can select test items as required to make up a particular test. Since normally one would think in terms of item banks with several thousand items, the number of possible tests which could be composed from such a bank is astronomical. The great advantage of this system is its flexibility. Tests can be long or short, hard or easy at will. Such an approach is useful only if we can solve the basic measurement problem outlined above; that is, if we can find a way of interpreting, meaningfully and consistently, the scores or score profiles that result from each one of the whole range of possible tests.
As already noted, tests vary according to their content, their level of difficulty and their discriminating power. To interpret a score we must know what a test is measuring, and this remains true whether or not the test is composed of questions drawn from an item bank. Many have pointed out the importance in testing of the underlying models and theories of the reading process. Materials stored in an item bank will be described and catalogued in relation to explicit models or theories (and to specific teaching objectives where appropriate). This will give the composer of a test a large degree of control over what is being measured. Variations in test difficulty and discriminating power (principally a function of test length) are handled by calibrating the individual items in terms of their characteristics and storing this information as well in the bank.
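The cataloguing described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the record fields (item_id, trait, difficulty), the function name compose_test and the rule of spreading the chosen items evenly across a difficulty range are all hypothetical, and a real bank would carry far more descriptive information for each item.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str       # hypothetical catalogue number
    trait: str         # what the item is meant to measure
    difficulty: float  # calibrated difficulty on the bank's common scale


def compose_test(bank, trait, low, high, length):
    """Pick up to `length` items of one trait with difficulty in [low, high].

    The items are spread evenly over the sorted candidates so that the
    test covers the whole of the requested difficulty range.
    """
    candidates = sorted(
        (item for item in bank
         if item.trait == trait and low <= item.difficulty <= high),
        key=lambda item: item.difficulty,
    )
    if len(candidates) <= length:
        return candidates
    step = (len(candidates) - 1) / (length - 1) if length > 1 else 0
    return [candidates[round(i * step)] for i in range(length)]


if __name__ == "__main__":
    # Invented calibrations, for illustration only.
    bank = [Item(f"V{n}", "vocabulary", d)
            for n, d in enumerate([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])]
    bank += [Item("I0", "inference", -0.3), Item("I1", "inference", 0.8)]
    for item in compose_test(bank, "vocabulary", -1.0, 1.0, 3):
        print(item.item_id, item.difficulty)
```

Because the selection works only from the stored trait and difficulty, the same routine could serve for a short easy test or a long hard one simply by changing the arguments.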
Two familiar characteristics of test questions point the way to the development of a suitable calibration system. The first is that test questions vary in difficulty, and that this variation is quite consistent. Within one topic it is reasonable to say that a particular test question is easier or more difficult than another one - without having to specify the pupil concerned. Furthermore, if we agree that item A is harder than item B and item B is harder than item C then it follows that item A is itself harder than item C. The internal consistency of such relationships is the basis for a logical model. Secondly, and notwithstanding the implications of this model, we are familiar with the uncertainty that attends any test-taking exercise - we cannot predict with absolute accuracy which pupils will get exactly which questions right. Somebody always seems to get an easy question wrong and a hard one right. This does not contradict our notion of the relative difficulty of the items, which is based on the generality of responses (actual or expected) from a large number of people, but it does suggest that any model of measurement based on test responses must involve probabilities rather than any exact deterministic formula. The Rasch model is one algebraic formulation of such a probabilistic model, and it is the simplest. It can be written as follows: the probability of an individual answering a particular question right (or of scoring at least a specifiable proportion of the marks on a question) is given by
p = x / (1+x)
where x is a function of the difference between a parameter representing the person's ability and the parameter representing the difficulty of the question. The mathematical procedures of deriving the difficulties of the items, the relative abilities of the persons, and the weight to be accorded to each possible score on a test, are known as Rasch-calibration or Rasch-scaling. Technical and non-technical discussions of these procedures can be found in the references cited at the end of this paper.
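As an illustration of the formula, the short sketch below assumes the conventional choice x = exp(ability - difficulty), with both parameters expressed in the same units (logits). The text above says only that x is a function of the difference, so the exponential form and the function name rasch_probability are assumptions made for the example.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer, p = x / (1 + x).

    Assumes x = exp(ability - difficulty), the usual Rasch form.
    """
    x = math.exp(ability - difficulty)
    return x / (1.0 + x)


if __name__ == "__main__":
    # One person (ability 1.0) meeting items of increasing difficulty.
    for difficulty in (-1.0, 0.0, 1.0, 2.0):
        p = rasch_probability(1.0, difficulty)
        print(f"difficulty {difficulty:+.1f}: p = {p:.3f}")
```

When ability and difficulty are equal, x is 1 and the probability is one half; the probability rises as the person's ability exceeds the difficulty of the question and falls as the question becomes harder.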
The essential outcome is that it is possible to calculate for any set of items, measuring a common trait and drawn from an item bank, and any set of responses to these items, a scaled ability score that is interpretable with respect to the entire bank and not just those questions included in the test. Since all other sets of items measuring the same trait will lead back to the same scale, however short or long, hard or easy the particular test, we may think of this scale as being a common standard attainment scale for that trait.
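A minimal sketch of how such a scaled score might be computed is given below. It assumes that the item difficulties have already been calibrated on the bank's scale and that x = exp(ability - difficulty). The function name estimate_ability is hypothetical, and the maximum-likelihood Newton-Raphson iteration shown is one common approach rather than necessarily the procedure used in the work cited here.

```python
import math


def estimate_ability(responses, difficulties, tol=1e-6, max_iter=100):
    """Maximum-likelihood ability from 0/1 responses to calibrated items."""
    raw = sum(responses)
    n = len(responses)
    if raw == 0 or raw == n:
        raise ValueError("zero and perfect scores have no finite estimate")
    # Starting value: mean item difficulty plus the log-odds of the raw score.
    ability = sum(difficulties) / n + math.log(raw / (n - raw))
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(d - ability)) for d in difficulties]
        expected = sum(probs)                            # expected raw score
        information = sum(p * (1.0 - p) for p in probs)  # slope of expected score
        step = (raw - expected) / information
        ability += step
        if abs(step) < tol:
            break
    return ability


if __name__ == "__main__":
    # Two different tests, with invented calibrations, drawn from one bank.
    easy_test = [-2.0, -1.5, -1.0, -0.5]
    hard_test = [0.5, 1.0, 1.5, 2.0]
    print(estimate_ability([1, 1, 1, 0], easy_test))
    print(estimate_ability([1, 0, 0, 0], hard_test))
```

Both estimates are expressed on the scale of the bank, so a result from the easy test can be compared directly with one from the hard test. Under this model only the raw score and the difficulties of the items attempted enter the estimate.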
Item banking in practice
There is still considerable discussion as to just how many different dimensions of attainment an item bank can economically assess, and how much measurement information needs to be catalogued along with each test item in the bank. Clearly one needs to know what a particular test question is supposed to measure, and how this fits in with alternative theoretical models. Statistical information on the scaled difficulty of the item is also essential to ensure the general interpretability of results. Some items will be found to have biases (e.g. a particular item might be persistently easier for boys than for girls, another easier for those who have studied ITA than those who have not), and it might be thought useful to have such information available to bank customers. It can also be argued that as far as possible the bank should be composed of items without obvious bias, although to some extent this will depend on the use to which the bank is to be put.
What will a Rasch-scaled item bank look like? There are at least two answers to this question since an item bank for use by teachers within schools will differ quite considerably in its structure from the sort of item bank being developed for the monitoring of standards on a national scale. This should not obscure the fact that both types of bank have the same logical base, and it will be possible to relate the results obtained from one to those obtained from the other.
For monitoring purposes a bank will need to have a very large number of test questions covering broad content areas and wide-ranging levels of difficulty. The collection of items is likely to be kept in some administrative center, and not published in full, although a certain amount of documentation explaining the structure of the bank, and the interpretation of the measurements produced by it, will need to be made accessible to a large number of people. The advantages of a bank in monitoring procedures are several. First of all it is possible to use, on any one occasion, a very large number of different test questions by building many different forms of the test. While an individual pupil may be required to attempt no more than twenty or thirty questions, data can be gathered on several hundred different questions, and this will give a much more detailed picture of performance. Further, the large number of questions involved greatly reduces the problem of test security. No teacher is likely to teach and no pupil to learn all of the several thousand test questions that comprise the bank (it has been remarked that any pupil who does master all the material in the bank deserves to get high marks). Where there are so many alternatives no single question from the bank acquires a disproportionate share of importance. This means that some of the questions can be used on several occasions without unduly disrupting the system and this makes it relatively easy to relate results from one year to the next with some precision and certainty. These techniques are the basis for monitoring changes in performance over time, a topic that is taken up in more detail elsewhere.
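The reuse of questions from one occasion to the next can be sketched as follows. The example assumes that each year's items have been calibrated separately, so that the two sets of difficulties differ by an arbitrary shift of origin, and it estimates that shift from the items common to both years. The function names and the item labels are hypothetical, and averaging the differences over the common items is one simple linking rule, not necessarily the one used in any actual monitoring programme.

```python
def linking_shift(year1, year2):
    """Mean difference in difficulty over the items used in both years."""
    common = year1.keys() & year2.keys()
    if not common:
        raise ValueError("the two forms share no items")
    return sum(year1[i] - year2[i] for i in common) / len(common)


def to_year1_scale(value, shift):
    """Re-express a year-2 difficulty or ability on the year-1 scale."""
    return value + shift


if __name__ == "__main__":
    # Invented calibrations: items b and c were used in both years.
    year1 = {"a": 0.0, "b": 1.0, "c": -1.0}
    year2 = {"b": 0.5, "c": -1.5, "d": 2.0}
    shift = linking_shift(year1, year2)
    print("shift:", shift)
    print("item d on the year-1 scale:", to_year1_scale(year2["d"], shift))
```

Once the shift is known, any ability measured in the second year can be moved on to the first year's scale in the same way, which is what allows results from successive years to be compared.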
The item bank designed for teacher-use is likely to be different. There seem to be two main options: one is that individual teachers might be offered a complete assessment service by some central agency. In this, teachers could specify the types of tests they needed, either one or more on any particular occasion, and the agency would construct suitable tests with material drawn from the item bank. The agency sends the tests to the teacher who uses them and returns the scripts to the central agency for analysis and interpretation. This approach is currently being pioneered in South Australia and Tasmania. In Britain, the National Foundation for Educational Research (NFER) has recently decided to put a considerable proportion of its resources into the development of this type of assessment service, although it will be several years before it will be generally available. The alternative is that the item bank may be published in the form of a book or pamphlet which contains all of the test questions together with details about what each measures, a guide as to how to construct tests and some keys and conversion tables which permit the individual teacher (who does not have easy access to computers or calculators) to scale and interpret the results. We have the model for such an item bank in the book compiled by Purushothaman (1975) on secondary school mathematics. A smaller example of this approach, but one which measures knowledge of vocabulary has been published in the United States (Woodcock 1974). Whichever of these patterns is adopted the bank will have to be structured to serve several alternative purposes which the teachers may demand. The bank should be capable of providing detailed diagnostic information about the attainment or otherwise of a particular skill, or the mastery of a particular topic. 
It must therefore contain a large and varied selection of items catalogued accurately and in sufficient detail to permit the teacher to construct narrow but valid diagnostic tests. On other occasions the teacher may require a short but general test in order to gain a quick idea of the level and diversity of achievement within a class. Alternatively, the requirement may be for a longer comprehensive test to be used as an 'end of year' examination. The materials stored in the bank and their pattern of organization must be adequate to cope with all of these.
How do item banking systems solve the measurement problems posed at the beginning of this paper? We make use of the fact that a Rasch scaling analysis yields the same scale for people and for items. All the test questions measuring a particular trait can be lined up along a scale with their relative positions and spacings being determined by the difficulty of the question. The attainment of a person who attempts all or some of these items has a value which corresponds to a point somewhere on this same scale. Items which come to the left of a person's position are easier items on which he would probably be successful. Items which lie to the right are relatively harder. Items occupying scale positions very close to the attainment of the person are those on which that person has about an even chance of being successful. The reason for developing and presenting scales in this fashion is that they provide essentially a 'criterion referenced' interpretation of test performance. We can interpret the score on a test, any test, in terms of which skills have probably been mastered and which have not, which problems can probably be solved and which are still too difficult, and these statements do not relate only to those questions or items included in the test, but to the full range of material in the item bank. Norms are unhelpful when the person being tested does not belong to the norm group. What we need for the mentally handicapped fifteen-year-old or the nine-year-old Glaswegian is a statement of what he or she can do, not a comparison with the performance of typical nine-year-olds in Sussex. 
The great advantage of an item bank is that test results are not only related to the performance of some other group of individuals who have been exposed to the items, but also can be described directly in terms of mastery of the test material itself - and since this test material is available to the teacher, he or she can judge its relevance and hence the relevance of the assessment that is being made. Here we refer back to the other example given at the beginning, that of an Asian immigrant with language problems. If you want to test his competence in English language you will almost certainly prefer to know what he can and cannot do rather than to compare him with a normative group of English school children from which he would almost certainly have been excluded.
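The criterion-referenced reading of a scaled score can be illustrated with a short sketch. It assumes item difficulties calibrated on the bank's scale and the usual Rasch form for the probability of success. The half-unit band used to decide which items count as "very close" to the person, the function names and the item labels are all arbitrary choices made for the example.

```python
import math


def success_probability(ability, difficulty):
    """Rasch probability of success, assuming x = exp(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))


def interpret(ability, bank, band=0.5):
    """Group bank items by their position relative to a person's ability."""
    report = {
        "probably mastered": [],
        "about an even chance": [],
        "probably too hard": [],
    }
    for label, difficulty in sorted(bank.items(), key=lambda pair: pair[1]):
        if difficulty < ability - band:
            key = "probably mastered"       # item lies to the left of the person
        elif difficulty > ability + band:
            key = "probably too hard"       # item lies to the right
        else:
            key = "about an even chance"    # item lies close to the person
        p = round(success_probability(ability, difficulty), 2)
        report[key].append((label, p))
    return report


if __name__ == "__main__":
    # Invented items and calibrations, for illustration only.
    bank = {"sight words": -2.0, "simple sentences": -0.8,
            "short passage": 0.1, "inference from text": 1.9}
    for heading, items in interpret(0.0, bank).items():
        print(heading, items)
```

The report covers every item in the bank, whether or not it appeared in the test the person actually sat, which is the sense in which the interpretation extends to the full range of material.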
Item banking and the future
Theory is much further advanced than practice when it comes to item banking, but it seems certain that these new techniques will make a major impact on the educational scene within the next few years. They offer important advantages over traditional types of standardized tests. The development of a common scale and unit of, for example, reading attainment, means that it will be much easier to chart the progress of individual pupils over time, and also to compare the results obtained by a teacher from an individual pupil with the figures that emerge from national or regional monitoring exercises. The technology behind this revolution is slightly more complex than that which underlies the standardized test, but there is good reason to hope that once teachers become familiar with the new approach they will find it easier to use.
It is important to realize however that item banking is not the final solution to all the problems posed by educational assessment. No item bank can be better than the material that is put into it, and users of assessment materials will continue to carry responsibility for ensuring that their tests are fair, appropriate, reliable and valid. A well-planned and well-documented item bank will go a long way towards helping meet these criteria, but the user will still need to exercise care. A very real danger is that since the establishment of an item bank is an expensive and time-consuming business, there will be a tendency to keep it in service for too long a time, allowing assessment procedures to stagnate. Of course an item bank should be a living thing with test materials being added and the classification system updated as new developments occur either in our understanding of the subject matter or in teaching practices. Such ongoing work requires money and this will only be justified if the customers of the bank are using the materials in an intelligent and appropriate way. It is too early to say how successfully banks will be able to change and adapt to new educational fashions.
Choppin, B. H. (1978) Item Banking and the Monitoring of Achievement. Research in Progress Series, I. National Foundation for Educational Research (NFER), The Mere, Upton Park, Slough, Berks.
Purushothaman, M. (1975) Secondary Mathematics Item Bank. Slough: NFER.
Rentz, R. R. and Bashaw, W. L. (1977) The national reference scale for reading: an application of the Rasch model. Journal of Educational Measurement, 14, 161-79.
Wood, R. and Skurnik, L. (1969) Item Banking. Slough: NFER.
Woodcock, R. W. (1974) Woodcock Reading Mastery Test. Circle Pines, Minnesota: American Guidance Service.
Wood and Skurnik (1969) provide a general introduction to the subject of item banking together with a description of the early experimental work. This material is already becoming somewhat dated and for a more modern account of the application of item banking to national monitoring policies the reader should look at Choppin (1978).
Two examples of item banks for teacher-use are recommended. Purushothaman (1975) presents a full-fledged mathematics item bank for secondary schools in the form of a small book which includes a semi-technical account of the Rasch scaling procedures. For item banking materials in reading it is necessary at the moment to look outside the United Kingdom. Woodcock (1974) offers a range of vocabulary test materials scaled in the Rasch manner, while Rentz and Bashaw (1977) describe their work of scaling and equating a large number of existing American tests of reading.
Bruce Choppin, 1979
MESA Research Memorandum Number 49
MESA PSYCHOMETRIC LABORATORY
This appeared in
Choppin B. (1979) Testing the questions - the Rasch model and item banking. Chapter 5 in M. St.J. Raggett, C. Tutt, P. Raggett (Eds.) Assessment and Testing of Reading: Problems and Practices. London: Ward Lock Educational.
For details of the pioneering work of Bruce H. Choppin (1940-1983), read
Choppin, B. (1985) Bruce Choppin on Measurement and Education. D.L. McArthur and B.D. Wright (Eds.) Evaluation in Education (now International Journal of Educational Research) 9:1.
Linacre J.M. (1995) Bruce Choppin, visionary. Rasch Measurement Transactions 8:4 p. 394. Also in J.M. Linacre (Ed.) Rasch Measurement Transactions, Part 2. Chicago: MESA Press. 1996.
Our current URL is www.rasch.org
The URL of this page is www.rasch.org/memo49.htm