Be Fair With Achievement Testing

This message concerns unfair testing consequences and suggests how to correct them. It also tries to place in perspective the mission offered educational measurement professionals today. The testing situation is presented as a professional liability for American education. Too many meaningless results from unsuitable tests are reported using numbers that look like valid scores, even though other testing procedures correcting the measurement problems presently being encountered are available. An answer is recommended here for consideration as a guiding theme for district testing programs.

Every youngster is expected, by law, to start learning "Readin, Ritin, and Rithmetic" when old enough - and a few of us professionals have undertaken the responsibility of administering testing programs for these students and their teachers. For what purposes?

When students have progressed to the point where giving them a group test is reasonable, they are again captives of the situation, and we define the situation. What specifications for test assignment and test administration do we recommend? When we place a number beside a student's name to indicate achievement, it is a matter of personal and professional responsibility. What technology will we, as professionals, accept as fair enough for this purpose?

As servants in the educational enterprise, we must follow policies and acknowledge constraints not of our choosing. We have jobs to keep and families to support and that is important. Professionally, we are disappointed by some of the school board directives and administrative priorities with which we have to live. Especially galling is the political blindness directing some of the constraints imposed. But we can get a sense of direction from an answer Edward Esser gave when a new faculty member in his school asked him for help with a personal dilemma.

The teacher was torn between wanting to use the teaching methods considered best for helping individuals learn, and managing a class of thirty students with diverse needs and abilities in the way an insensitive administration required. Mr. Esser recognized the personal nature of this question and answered, "the only thing I know to say is if someone up there looks down to see what you are doing, try to be pointing in the right direction". As evident in the answer, professional educators are fortunate that the "personal" and the "professional" coalesce to guide the behavior of those engaged in influencing young lives.

Whatever different paths have led us into professional responsibilities, there is one direction in which we are all pointing. Whatever the student's readiness is for learning, whatever the social status or parental support, etc., a youngster deserves a fair chance to succeed.

Students' reported scores are commonly based on commercially published tests that are used for all the students in one or more elementary school grades. Let's put to one side the problem of item mismatch with local instructional programs when nationally marketed tests are used. Surely giving students who are a full grade apart in achievement the same test cannot be fair to all of them. In order to avoid a ceiling too low for the high students, because most of the questions are too easy to let them show how well they have learned, it is necessary to include items that are difficult for the top group. Many more items must be hard enough to challenge students in the middle range. This presents an unsolvable measurement problem with respect to the low students, who have only a small part of the test to show what they can do. Under these conditions, students at the low end of the distribution are proffered a large number of unacceptable items, irrelevant to measuring their achievement. Too many of the questions in such tests are far beyond the low students. Responding to such questions is an unpleasant task for the low students that tells us nothing useful about what they know or can do.

Even though a student has worked hard and successfully in school, such a test is often incapable of showing this student's excellence of performance. Is this an acceptable or unacceptable testing practice?

Those whose marching orders are to concentrate on estimates of group performance by grade level grouping are concerned about the distribution of scores for all students in those groups, especially the means, sometimes the standard deviations and occasionally the shape of the distribution curve.

If a school board demands an average score for a grade in a given year, the statistician needs to know that all students at both ends of the continuum were included. This is necessary even if it means that all students are administered the same test. When teachers decide to protect some of their very low students from what is seen as a counterproductive activity, e.g., by sending them home on test day, they invalidate group means reported to the school board and media. This is a clear case of humanism versus necessary statistics, but it is a contest with no winners.
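The effect of such exclusions on a reported mean can be shown with a little arithmetic. The sketch below uses made-up raw scores for a hypothetical class of ten; the numbers and the function name are assumptions for illustration only, not data from any district.

```python
from statistics import mean

# Hypothetical raw scores for a class of ten (illustrative only).
scores = [12, 15, 22, 25, 27, 28, 30, 31, 34, 36]

def mean_excluding_lowest(values, n_excluded):
    """Mean of the scores that remain when the n_excluded lowest scorers are not tested."""
    kept = sorted(values)[n_excluded:]
    return mean(kept)

full_mean = mean(scores)                           # every student tested
reported_mean = mean_excluding_lowest(scores, 2)   # two lowest sent home

print(f"all students tested: {full_mean:.2f}")
print(f"two lowest absent:   {reported_mean:.2f}")
```

In this invented class the reported mean rises simply because two students were absent, not because anyone learned more.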

Statisticians and State Departments of Education are aware that students at the extremes of the ability distribution are not meaningfully tested and so receive scores of doubtful reliability and validity. Nevertheless, these must be treated as valid for reporting purposes. This unsatisfactory situation is caused by the selection of unsuitable testing procedures.

A psychometrician knows that a low raw score based on insufficient numbers of reasoned responses, especially when it is clouded by chance marking of more difficult items, has little measurement value. When the student has answered only a third or less of the items correctly, it takes a lot of ingenious rationalization to defend the resulting "number correct" as indicating a student's achievement level.
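How easily chance marking alone can produce a score near one third correct can be sketched with a binomial calculation. The test length (40 items) and the number of answer choices (four) below are assumptions chosen for illustration; neither figure comes from the article.

```python
from math import ceil, comb

def prob_at_least(k, n, p):
    """Binomial probability of k or more successes in n independent trials."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

n_items = 40        # hypothetical test length
p_guess = 1 / 4     # assumes four answer choices per item

expected_by_chance = n_items * p_guess   # mean score from blind marking
one_third = ceil(n_items / 3)            # smallest whole score of at least n/3
p_third = prob_at_least(one_third, n_items, p_guess)

print(f"expected score from guessing alone: {expected_by_chance:.0f} of {n_items}")
print(f"chance of {one_third} or more by guessing: {p_third:.3f}")
```

Under these assumptions, blind guessing already yields a quarter of the items on average, so a score of a third or less is hard to tell apart from chance.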

Psychometricians know that students act in an individualistic manner, making personality tests out of mistargeted achievement tests, but there is no way of knowing if or how much the score has been influenced in this way. Responses of the low students, particularly in the last part of the test, can be expected to include despair, resentment, defeat, a sense of unfairness, a desire to cheat, and disorganized thinking or memory, to mention a few inhibitors. Perhaps even, on rare occasions, a challenge to greater effort.

A high score, from marking almost all items correct, is an indication the student wasn't given a chance to demonstrate higher levels of performance. Neither extreme score has enough information about a student's ability or deficiencies to be accepted by a psychometrician interested in helping that student.
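The loss of information in off-target testing can be made concrete under the dichotomous Rasch model, where each item contributes p(1 - p) to the test information and the model standard error of a measure is the reciprocal square root of that total. The item difficulties and the student's ability below are hypothetical values in logits, chosen only to contrast a targeted test with one that is far too hard.

```python
from math import exp, sqrt

def rasch_p(theta, b):
    """Rasch probability of success for ability theta on an item of difficulty b (logits)."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def standard_error(theta, difficulties):
    """Model standard error: 1 / sqrt(sum of p * (1 - p) over the items)."""
    info = sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in difficulties)
    return 1.0 / sqrt(info)

# Two hypothetical 20-item tests (difficulties in logits).
targeted = [-2.0 + 0.1 * i for i in range(20)]   # centered near the student
too_hard = [1.0 + 0.1 * i for i in range(20)]    # centered well above the student

low_student = -1.0   # hypothetical ability of a low-achieving student
se_targeted = standard_error(low_student, targeted)
se_too_hard = standard_error(low_student, too_hard)

print(f"SE on targeted test: {se_targeted:.2f} logits")
print(f"SE on too-hard test: {se_too_hard:.2f} logits")
```

The same number of items yields a much larger standard error when the items sit far from the student's level, which is the sense in which an extreme score carries little information.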

The psychometrician's point of view shows personal concern for the individual and how fairly the student has been treated in the testing situation. The psychometrician also recognizes that group results are especially faulty when a number of extreme scores are included in computing averages, variances, etc.

Taking a test is a personal experience for a student. Teachers, especially, are concerned with the effect testing has on their students. When all are taking the same long range test, students are required to compete with all other members of the class. Although competition is often an invigorating challenge for those who think they can win, it can be devastating for those of meager talent. Even something to be feared.

Items that are beyond the student's capability to understand and that request information the student never had, or cannot remember, are discouraging. They are defeating when encountered early in the test and become even more distasteful as the student stumbles through to the end.

In the day to day classroom experiences, a student who is in a quandary concerning a learning activity is helped by the teacher. But in a test situation, the low student is abandoned to struggle with uncertainty all alone. Test results provided for the teacher under these circumstances should be considered of dubious value. At best, they indicate that the student might be "low" without giving any factual indication of how well the student is performing. These instances of unsuitable testing impact the professional teacher in several ways:

A. Individualizing instruction for students (who learn less rapidly or have more difficulties to overcome than the mid-range class members) requires that the teacher focus on individual improvement rather than comparative achievement level. The teacher works with such students throughout the year, encouraging their learning efforts and providing positive reinforcement. But at testing time, a year's work to establish some self-confidence and feeling of accomplishment in the low student can be seriously jeopardized by a broad-range test that should never have been assigned to that student.

B. Teachers, in the line of duty, are adept at getting students to take on difficult tasks at times. They can impress a class with the importance of achievement tests and encourage students to do their best. But how, then, does the teacher de-emphasize the relevance of an unsuitable test to the low students? Especially when it is so far beyond a low student's level that it really is meaningless as measurement. Don't the low ones "count" with the rest of the class? No wonder that some teachers have little faith in standardized test results.

C. Teachers and other educators know that learning is developmental. The maxim that you have only one chance to make a first impression is a serious reminder that the first test a student recognizes as a "regular" test, hopefully no earlier than grade three, had better not be an academic and emotional defeat.

Early testing must not establish a negative association for tests and so start a self-fulfilling chain reaction which can easily last throughout the elementary grades. The reasoning is obvious. Learning theory suggests that students will quickly learn what to expect about test taking from their experiences with tests. When testing time comes around, negative memories of previous testing periods loom large in the anticipatory set a student brings to the testing situation. Because test taking is personal and seen as important, negative memories of past testing trigger a mental set influencing the student sometimes to cooperate with the test and give it a try, but at other times to fight the test and self-destruct.

D. Testing has an objective aura because the results come out in numbers, so test results can count heavily in the reinforcement system of school life. They go beyond the concept of reward or punishment because taking a test is a personal experience for the student. Success and failure are personal. They are generated within oneself. Rewards and punishments given by other people can be recognized by students as superficial, but their private feelings of failure are personal evaluations that cannot be rejected. Failures associated with testing can be generalized to lower a student's notion of self-worth in other aspects of schooling. They can be expected to trigger a disenchantment with school work as well as with testing. Results of unsuitable testing only make the teacher's efforts to promote learning more difficult.

E. What questions do thoughtful professionals ask when confronted with a case of student mistreatment through unsuitable testing? Do they wonder if it is ethical to mentally abuse a few students to compute averages and percentiles for the whole class? Must they quietly accept orders from the school board or State Department to promote an unfair system which they recognize as educationally counterproductive? Can professionals do anything more than "point in the right direction"?

Acceptable testing requires making sure that each student takes a test with a range of item difficulty that will make it challenging, but not defeating. Is this simple, straightforward solution fair enough? (Remember that the problem of curriculum mismatch with standardized tests is held in abeyance at this time.)

The path that avoids abusing low students with tests leads the concerned professional into a different world of measuring. A world where personal consideration for students opens up new possibilities for knowing an individual student better, for evaluating a defined group's performance when taught by a new educational program, and for letting students and parents know just how the student is progressing.

What is obviously an improvement for the students, however, requires professional educators to change to a new kind of achievement test with new reports, with additional information, and different explanations to parents.

Is it reasonable to ask that of educators? I would submit that this is a professional opportunity well within the capability of a college graduate. Learning the nature of this new kind of test information and how to best use it can become a primary teaching responsibility.

It is common in our society for an important change away from the traditional approach to be greeted with a certain hysteria in those areas affected. This is natural, but many solid demonstrations of scaling achievement tests with Rasch methodology will cause it to prevail.

Since a shift from old to new testing practices changes the situation for students, parents, and professionals, it must be done as a district project. Such a shift is the beginning of a natural development into a district information system. Let's begin by considering the things that must be done. Convincing teachers and administrators that a complete testing program will produce the desired results has to be a matter of their finding out for themselves. There are two ways to go: (1) have 3 or 4 schools in a district give it a try in all grades, possibly in only one basic skill, or (2) double test for two or three years and retain the most useful testing program.

Let us address the problem of transforming the raw scores from whatever test was prescribed for the student onto a continuous and common curriculum scale. This means that a test score for a third grade student will be located on the same scale as that for eighth grade students. This is similar to measuring a third grade student with the same feet and inches scale as that used for taller upper grade students.

Compatibility between different tests at different levels of the curriculum requires that all items from which the tests are made are connected to each other on the curriculum scale. In one successful application, the relative calibrations (measures) assigned to each item have been established through Rasch technology. They have been recorded as RIT (Rasch units = 0.1 logits) levels of the Northwest Evaluation Association (NWEA) basic skills item banks. Thus construction of different tests for different abilities, "levels tests", using the NWEA banks avoids unsuitable testing of both low and high achieving students. It also shifts the measuring characteristics of a test from ranking each student against other students to placing them in progress through a curriculum. The NWEA scale establishes excellence of performance along a substantive curriculum scale.
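How raw scores from different tests can land on one scale may be sketched as follows, assuming a dichotomous Rasch model and item difficulties already calibrated on a common scale. This is a generic maximum-likelihood sketch, not NWEA's procedure: the item difficulties, the function names, and the scale origin are hypothetical, and only the unit size (0.1 logit per unit) is taken from the text.

```python
from math import exp, log

def rasch_p(theta, b):
    """Rasch probability of success for ability theta on an item of difficulty b (logits)."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def measure_from_raw(raw, difficulties, tol=1e-6, max_iter=100):
    """Maximum-likelihood ability (logits) for a raw score on items of known difficulty."""
    n = len(difficulties)
    if raw <= 0 or raw >= n:
        # Zero and perfect scores have no finite estimate.
        raise ValueError("extreme score: no finite measure")
    theta = sum(difficulties) / n + log(raw / (n - raw))   # starting value
    for _ in range(max_iter):
        ps = [rasch_p(theta, b) for b in difficulties]
        info = sum(p * (1.0 - p) for p in ps)
        step = (raw - sum(ps)) / info          # Newton-Raphson step
        step = max(-1.0, min(1.0, step))       # damp large steps
        theta += step
        if abs(step) < tol:
            break
    return theta

def to_scale_units(theta_logits, units_per_logit=10.0, origin=0.0):
    """Rescale logits so one unit is 0.1 logit; the origin here is an arbitrary assumption."""
    return origin + units_per_logit * theta_logits

# Hypothetical difficulties (logits) for a lower and a higher level test.
low_level = [-2.5, -2.0, -1.5, -1.0, -0.5, 0.0]
high_level = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]

third_grader = measure_from_raw(3, low_level)
eighth_grader = measure_from_raw(3, high_level)
print(to_scale_units(third_grader), to_scale_units(eighth_grader))
```

Both students answered three of six items correctly, yet their measures differ because the items they faced sit at different places on the common scale. The error raised for zero and perfect scores reflects the fact that an extreme score does not locate a student on the scale.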

It would be giving the wrong impression to suggest that an on-going testing system is expected to predict the appropriate short test for every student at each testing period. Experience has shown that incorrect short tests can be expected to be predicted at a rate of 3% or less when past test performance over several testing periods is used. However, because of the nature of the scale, all students having too low or too high a number of items correct can be given a lower or higher level test, respectively, at the same time as make-up tests are administered to students who were absent from the regular testing session. Students at all grades can be tested in any of the three subjects currently supported by the NWEA bank at one administration session if that is convenient.
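A retest rule of this kind might be sketched as below. The cut-off fractions, the numbering of levels, and the function name are all hypothetical; the article does not say what counts as too low or too high a score.

```python
def next_assignment(raw, n_items, level,
                    low_frac=0.30, high_frac=0.90,
                    min_level=1, max_level=8):
    """Return the level to administer at make-up time.

    Returns the same level when the score is usable. The thresholds
    low_frac and high_frac are illustrative assumptions, not published values.
    """
    fraction = raw / n_items
    if fraction < low_frac:
        return max(min_level, level - 1)   # too few correct: step down a level
    if fraction > high_frac:
        return min(max_level, level + 1)   # too many correct: step up a level
    return level                           # score is on target: keep the result

print(next_assignment(5, 40, level=3))    # low score on level 3
print(next_assignment(38, 40, level=3))   # high score on level 3
print(next_assignment(20, 40, level=3))   # on-target score
```

Because every level is calibrated to the same scale, the measure from the retest is directly comparable with measures from any other level.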

It would also be giving the wrong impression to suggest that levels testing systems are simple to manage or automatically produce valid information as if by magic. It is not claimed that a system can be devised for other than basic skills tests. It has been established, however, that levels testing can almost entirely avoid unsuitable test assignments.

The flexibility characterizing test development when working from the NWEA item banks also allows the matching of test with curriculum. It also supports district control over when and whom to test. Its overriding importance for professionals in education, however, is the way it alleviates mental child abuse in the captive audiences of a school district.

Who says that the levels tests are fair enough for students? Concerned professionals do. As deeply involved in the testing situation as the students, the responsible professionals want to derive a personal satisfaction from "pointing in the right direction". These professionals, who have watched and sympathized with students hurting under tests they never should have been given, now perceive the levels tests as fair enough to warrant a thorough-going change in their testing methodology. Statisticians, psychometricians, and educators can all agree with assigning students the particular levels test that matches their achievement levels, because whatever the test chosen, the measures all fall on the same linear curriculum variable.

It so happens that the time is exactly right for change, partly because of the criticisms that State Departments are experiencing due to the recent remarkable report showing the "Lake Wobegon" effect, that none of the 50 states are "below average" (whatever that means).

Statisticians, psychometricians, educators: not only are traditional test practices unfair and inhumane, they are also silly. Now is the time to be fair with achievement testing.

Be fair with achievement testing. Ingebo G. … Rasch Measurement Transactions, 1989, 3:3 p.66
