Educational Research and Rasch Measurement

The unification of behavioral acts that indicate the existence of knowledge with stimuli taken from a variable whose nature is knowledge has rewritten the book on curriculum research and educational program evaluation. The Rasch model effectively brings level of student achievement together with curriculum attainment items to provide criterion variables not available for earlier research in basic skills instruction. It is obligatory that we professionals recognize this breakthrough and adjust our approaches to evaluation and research accordingly.

How does that help evaluators? This new way of implementing the time-honored theory that a student with greater knowledge has a greater probability of answering a valid question correctly than a student with lesser knowledge is a quantum leap in educational achievement information. It is definitely not just an extension or refinement of traditional population-referenced testing.
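
The theory stated above can be sketched in a few lines of code. This is a minimal illustration of the dichotomous Rasch model in its usual logistic form, not anything taken from the article itself; the function name, the parameter names, and the example values (in logits) are assumptions chosen for illustration.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the dichotomous Rasch model.

    Both arguments are on the same logit scale; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values: one item of difficulty 0.0 and three students.
p_stronger = rasch_probability(ability=1.0, difficulty=0.0)
p_weaker = rasch_probability(ability=-1.0, difficulty=0.0)
p_matched = rasch_probability(ability=0.0, difficulty=0.0)

print(f"stronger student: {p_stronger:.3f}")
print(f"weaker student:   {p_weaker:.3f}")
print(f"matched student:  {p_matched:.3f}")
```

On the same item the student with more knowledge gets the higher probability of success, and a student whose ability equals the item's difficulty has an even chance.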

It is a leap because freeing the scale that achievement level and curriculum level have in common from various student populations effectively shifts the focus of determining the relationships among the test items to their actual arrangement in the curriculum. Because of the sequential nature of basic skills instruction, the curriculum tasks line up in a practical approximation of a continuous variable (in Rasch parlance, a defensible latent trait). As dichotomous items that are actually curriculum tasks are lined up and given values with respect to each other, these calibrations (values) are on an equal interval scale generated by the confluence of knowledge and position in the curriculum.
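
One way to see why item values can be set "with respect to each other" without reference to a particular student population is to write the model in log-odds form. The notation below is an assumption (the article uses no symbols): $\theta_n$ is the achievement level of student $n$ and $b_i$ is the calibration of item $i$, both in logits.

```latex
\begin{align*}
P(X_{ni} = 1) &= \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)} \\[4pt]
\ln\frac{P(X_{ni} = 1)}{1 - P(X_{ni} = 1)} &= \theta_n - b_i \\[4pt]
\ln\frac{P(X_{ni} = 1)}{1 - P(X_{ni} = 1)}
  - \ln\frac{P(X_{nj} = 1)}{1 - P(X_{nj} = 1)}
  &= (\theta_n - b_i) - (\theta_n - b_j) = b_j - b_i
\end{align*}
```

The student term cancels in the last line, so under the model the distance between two items is the same whichever student, or group of students, is used to estimate it.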

Rasch measurement gives us a chance to dig our spikes in on the same equal interval track used for centuries by researchers and engineers in the hard sciences. The opportunity is here. Let's run as far as we can with it. One thing we can do is to avoid accepting research and evaluation models that are obviously not defensible simply because they seem to be the only game in town. There is a new approach now: we can include in program evaluation designs an almost automatic monitoring component for basic skills.

It will be a challenge to get a critical mass of educational researchers and evaluators to make equal interval measurement a way of life in our profession. Many are too set in beliefs about the way learning is organized and measured to benefit fully from this measurement breakthrough. Others are too politically, financially, or personally committed to give the equal interval data based research a good college try. But this is another indication of difference from the past.

We have learned to expect that any substantial shift in life style first elicits panic accompanied by ineffectual and often regressive behavior. In the case of the entrenched traditional measurement, note the paralysis exhibited by the multi-organizational blue ribbon committee responsible for revising the latest Standards for Educational and Psychological Testing, the bible of testing. This one will go down in the history of measurement as a low point in the conflict of professional versus political. There were enough requests to include well researched information about the Rasch technology and its use that ignoring it had to be deliberate.

It was a good question that a remarkably insightful psychometrist, Marjorie Johnson, once asked her boss, Dr. Victor Doherty: "Who really evaluates the evaluators?" If we can stand the daylight we should look through evaluators' eyes at our branch of the profession over the years. With a new and powerful measurement tool we can then plan how to foster respectability as evaluators and researchers. If you, as an evaluator, are concerned about the seeming willingness to bend the rules of research for the sake of filing an evaluation report, or if you wonder about one researcher's failure to criticize another's publish-or-perish production, charge it to the lack of quality measurement required to design respectable evaluations or research. Like trying to scratch a match on a used bar of soap, you can't do much educational evaluation with only normative data scaled on the population being studied.

With this unsettling opening let's go ahead and get it all out on the table. Another established weakness in our professional community is the lack of honest discussion. What do we mean by that? In meetings of groups such as an international conference of combustion engineers collectively considering a presentation, the research design, the accuracy of the data, the inferences by the researcher, etc. are all open for questioning with no holds barred. They learn through challenges and answers.

In a staged show-and-tell setting where it wasn't counterproductive we might consider the source and condone a publish-or-perish ploy with its glaring design error, knowing that this artifact of the instructor evaluation system is important to some professional careers. But subverting the worthy purpose of research presentations through an organizational habit of not asking the hard questions is an untenable position for a group whose reason for existing is to traffic in information. True, the "real world" of impossible requests, often for political reasons, will continue and the new evaluation tools can't substitute for reasonable questioning. Maintaining tenure is important, but among ourselves as researchers we can and must select the topics for enlightened and enlightening debate.

Another serious concern is the tendency to espouse any new statistical operation uncritically. One of the most obvious examples found in so-called "research" reports is a mis-sampled "meta-analysis" of something or other. This impressive example of elegance in use of computers is one of the most glaring examples of ignoring a fundamental statistical procedure, representative sampling, taught to early beginners as basic to intelligent inference. These report writers also ignore another tenet, namely that the burden of proof is on the experimenter. Doesn't research always require following certain conventions of data management to be worthy of that designation? Why then do otherwise knowledgeable audiences sit fat, dumb, and happy while this kind of presentation progresses to its run-out-of-time conclusion? Who evaluates the evaluators? Not evaluators.

Here is another problem that we have adapted to. Lack of measures focused on the kids themselves has let us slide a layer of statistical assumptions between students and the derived scores we used to peg them in an unknown population norm. This distancing from students as individuals has led toward mythology and away from measurement. A glaring example of opting for convenience over knowledge of kids and concern for them is the statistical practice of matrix sampling. The question was first put to the national assessment administration the second year it operated and at various times since then: Just what differences are there in scores of a student population when they are taking a test that gives them and their teacher results as opposed to a dry run with nothing in it for them? Differences in reactions are expected in a group where students are not personally involved in the task. The sad fact is that the matrix sampling design has been used for years without those using it taking responsibility for experimentally determining how it affects information. Statistical mythology isn't good enough where students and teachers are concerned.

Resorting to presentations like those on errant meta-analysis and condoning designs like matrix sampling is more likely to develop when organization continuance, maintenance, and increased membership supersede attention to the organizational purpose. However, continuance and purpose need not be contradictory when there is reason to expect attainment of the purpose in question. Note the engineers' hard questioning and demand for data-based results. They don't need to get angry or mean and are not accused of being impolite when they ask a question that blows a whole presentation. It's what should be done in searching out the best data we can when there is reason to have confidence in the information.

How did the wimp factor grow in the last half a century? Years ago it was stimulating to observe the strenuous give and take in a conference of Colorado psychologists. In those days the aim of Psychology was to predict and control human behavior (scientifically). Although lacking hard information, the discussions in that era were direct and impersonal. A little earlier Educators had collaborated in designing what is commonly referred to as "the eight year study" with recognition of the time frame required to get data worth serious consideration. Early on there was recognition of and respect for the purposes of the research steps taken by physical scientists.

1. Thinking carefully and critically about a relationship between two ideas or phenomena with all the accumulated information available added to one's personal experience.

2. Devising ways to gather more relevant and/or specific data about this relationship.

3. Formulating a theory about the relationship that is consistent with the data at hand and stating it in specific terms.

4. Stating a go or no-go experiment (a brittle hypothesis that will break or survive, rather than simply stating that more investigation is needed to support or deny the hypothesis).

5. Replicating any successful test of a hypothesis and varying experimental conditions to stress the theory with more rigorous hypotheses.

But wait a minute! How were we going to do that? We couldn't without adequate measures. In those days of population referenced achievement data and inadequate derived scores the lack of measuring instruments of suitable quality defeated attempts to make defensible evaluations of educational programs or to test theories with adequate research projects. What happened over the years of frustrated attempts to construct measurement instruments was a movement toward accepting whatever information was retrievable and massaging it statistically to produce a report.

The practice of settling for what was "next best" increased because of two pressures. One was the mounting need for evaluating because of the growing requests from political entities like school boards for objectivity in decision making information. This pressure increased when the Feds demanded thousands of evaluations for Title I funding, and there were nowhere near enough evaluators to do them professionally.

Another pressure was the popularity of statistical efforts to bolster weak designs, an unwarranted substitution for design integrity. Examples of statistical futility like analysis of covariance, factor analysis, and the infamous NCE scores captured the fancy of many influential people and became educationally "holy" almost as soon as they were concocted. These substitutes for design integrity were accepted with mixed beliefs, running from a nagging doubt in spite of authoritative blessings to "whole-hearted" disciple status.

The federal government got into the act, supporting a nationwide talent project involving an attempt at stratified random sampling. Another massive data processing exercise several years later produced a widely read and quoted report based on a self-selected two-thirds of the design sample. Probably the most extensive and misleading misrepresentations palmed off as representative are the publishers' population norms advertised as national norms.

Title I financed programs in local districts were required to submit double-barreled evaluations of achievement and self-concept gains. This political maneuver failed because adequate data was not available and because so many people were suddenly given the title of evaluator to meet the Feds' demand for reports. Consequently the reputation of evaluators suffered as meaningless Title I reports from districts were accepted by state department personnel. But that is largely a symptom of the restrictive conditions that plagued our profession before Rasch measurement.

Testing in the basic skills area took on a whole new meaning when it was proven that Rasch calibration could be used to develop curriculum based item banks with equal interval values for each item. Thanks to his tenacity and understanding of measurement, Dr. Benjamin Wright guided American educators through a period when many were being made aware of the difference between population-free measurement and group norms.

Evaluators now have available measures that can provide longitudinal data on an equal interval scale for all school students, a capability that does not exist with traditional standardized tests. Visualize a bar graph showing the yearly profits of a company over six years. These values are pegged on an equal interval scale and, because they are, they can be directly compared with each other. More than that, evaluators can compare growth between different time periods. And more than that, amounts of growth between time periods can be compared directly. The analogy could be continued to include other companies, but let's return to basic skills.

When equal interval measures of achievement are used, growth measures can give schools operating at different levels a fair chance to exhibit growth in relationship to other schools, and to compare instructional programs' effectiveness through comparison of differences between gains when a designated group is taught with a new program. This design gladdens the heart of an evaluator because it solves the two previously unsolvable problems of regression to the mean and the selection of control groups.
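
The comparison of gains described above might be sketched as follows. Every number and name here is hypothetical and invented for illustration: the scores stand in for Rasch-scaled measures (logits) for the same students in fall and spring, and nothing about the two schools comes from the article.

```python
from statistics import mean

# Hypothetical Rasch-scaled scores (logits) for the same students,
# listed in the same order for fall and spring.
fall = {
    "school_a": [-1.2, -0.8, -1.0],  # starts low on the scale
    "school_b": [0.9, 1.1, 1.0],     # starts high on the scale
}
spring = {
    "school_a": [-0.5, -0.1, -0.3],
    "school_b": [1.4, 1.6, 1.5],
}


def mean_gain(before: list[float], after: list[float]) -> float:
    """Average per-student gain; meaningful only on an equal interval scale."""
    return mean(post - pre for pre, post in zip(before, after))


gains = {school: mean_gain(fall[school], spring[school]) for school in fall}

# On an equal interval scale the gains themselves can be subtracted,
# whatever the level at which each school started.
difference_in_gains = gains["school_a"] - gains["school_b"]

for school, gain in gains.items():
    print(f"{school}: mean gain {gain:.2f} logits")
print(f"difference in gains: {difference_in_gains:.2f} logits")
```

In this made-up case the lower-scoring school shows the larger gain, a result that a comparison of end-of-year status alone would hide.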

What about other programs initiated into the classrooms? How does basic skill data help evaluate a new science adoption or a school board demand to add law-related education to the curriculum? It gives an almost automatic monitoring of what evaluators are responsible for anyway, that is, how the new curriculum arrangement impacts basic skills learning. An evaluation of a new instructional program should answer the question, "Does it facilitate or inhibit arithmetic, reading, or language usage achievement at different levels of student ability?"

When we consider the many benefits from enlightened use of Rasch measurement, it seems that one of the professional obligations that comes with it is to lean in the direction of the more carefully designed research that equal interval measures make possible, and question more closely each other's evaluation and research efforts with the intent to learn the strengths and limits of Rasch measurement as it is applied to the complex of interactions in the educational enterprise.

The gift of measurement of this quality simply must serve to upgrade our educational research and evaluation, and lead to better education for us all.

Educational research and Rasch measurement. Ingebo G. … Rasch Measurement Transactions, 1989, 3:1 p.43-46
