The History and Influence of Evaluation in Society
Early Forms of Formal Evaluation
Some evaluator-humorists have mused that formal evaluation was probably at work in determining which evasion skills taught in Sabertooth Avoidance 101 had the
greatest survival value. Scriven (1991c) apparently was not speaking tongue-in-cheek when suggesting that formal evaluation of crafts may reach back to the evaluation of early stone-chippers’ products, and he was obviously serious in asserting that it can be traced back to samurai sword evaluation.
In the public sector, formal evaluation was evident as early as 2000 B.C., when Chinese officials conducted civil service examinations to measure the proficiency of applicants for government positions. And in education, Socrates used verbally mediated evaluations as part of the learning process. But centuries passed before formal evaluations began to compete with religious and political beliefs as the driving force behind social and educational decisions.
Some commentators see the ascendancy of natural science in the seventeenth century as a necessary precursor to the premium that later came to be placed on direct observation. Occasional tabulations of mortality, health, and populations developed into a fledgling tradition of empirical social research that grew until “In 1797, Encyclopedia Britannica could speak of statistics—‘state-istics,’ as it were—as a ‘word lately introduced to express a view or survey of any kingdom, county, or parish’” (Cronbach et al., 1980, p. 24).
But quantitative surveys were not the only precursor to modern social research in the 1700s. Rossi and Freeman (1985) give an example of an early British sea captain who divided his crew into a “treatment group” that was forced to consume limes, and a “control group” that consumed the sailors’ normal diet. Not only did the experiment show that “consuming limes could avert scurvy,” but “British seamen eventually were forced to consume citrus fruits—this is the derivation of the label ‘limeys,’ which is still sometimes applied to the English” (pp. 20–21).
Program Evaluation: 1800–1940
During the 1800s, dissatisfaction with educational and social programs in Great Britain generated reform movements in which government-appointed royal commissions heard testimony and used other less formal methods to evaluate the respective institutions. This led to still-existing systems of external inspectorates for schools in England and much of Europe. Today, however, those systems use many of the modern concepts of evaluation; for example, recognition of the role of values and criteria in making judgments and the importance of context. Inspectorates visit schools to make judgments concerning quality and to provide feedback for improvement. Judgments may be made about the quality of the school as a whole or the quality of teachers, subjects, or themes (see Standaert, 2000).
In the United States, educational evaluation during the 1800s took a slightly different bent, influenced by Horace Mann’s comprehensive, annual, empirical reports on Massachusetts education in the 1840s and by the Boston School Committee’s use of printed tests in several subjects in 1845 and 1846—the first instance of wide-scale assessment of student achievement serving as the basis for school comparisons. These two developments in Massachusetts were the first attempts to measure student achievement objectively in order to assess the quality of a large school system. They set a precedent
seen today in the standards-based education movement’s use of test scores from students as the primary means for judging the effectiveness of schools.
Later, during the late 1800s, liberal reformer Joseph Rice conducted one of the first comparative studies in education designed to provide information on the quality of instructional methods. His goal was to document his claims that school time was used inefficiently. To do so, he compared a large number of schools that varied in the amount of time spent on spelling drills and then examined the students’ spelling ability. He found negligible differences in spelling performance between schools, even though students spent as much as 100 minutes a week on spelling instruction in one school and as little as 10 minutes a week in another. He used these data to flog educators into seeing the need to scrutinize their practices empirically.
The late 1800s also saw the beginning of efforts to accredit U.S. universities and secondary schools, although that movement did not really become a potent force for evaluating educational institutions until several strong regional accrediting associations were established in the 1930s. The early 1900s saw another example of accreditation (broadly defined) in Flexner’s (1910) evaluation—backed by the American Medical Association and the Carnegie Foundation—of the 155 medical schools then operating in the United States and Canada. Although his judgments were based only on one-day site visits to each school by himself and one colleague, Flexner argued that inferior training was immediately obvious: “A stroll through the laboratories disclosed the presence or absence of apparatus, museum specimens, library and students; and a whiff told the inside story regarding the manner in which anatomy was cultivated” (Flexner, 1960, p. 79). Flexner was not deterred by lawsuits or death threats from what the medical schools viewed as his “pitiless exposure” of their medical training practices. He delivered his evaluation findings in scathing terms. For example, he called Chicago’s fifteen medical schools “the plague spot of the country in respect to medical education” (p. 84). Soon “schools collapsed to the right and left, usually without a murmur” (p. 87). No one was ever left to wonder whether Flexner’s reports were evaluative.
Other areas of public interest were also subjected to evaluation in the early 1900s; Cronbach and his colleagues (1980) cite surveys of slum conditions, management and efficiency studies in the schools, and investigations of local government corruption as examples. Rossi, Freeman, and Lipsey (1998) note that evaluation first emerged in the field of public health, which was concerned with infectious diseases in urban areas, and in education, where the focus was on literacy and occupational training.
Also in the early 1900s, the educational testing movement began to gain momentum as measurement technology made rapid advances under E. L. Thorndike and his students. By 1918, objective testing was flourishing, pervading the military and private industry as well as all levels of education. The 1920s saw the rapid emergence of norm-referenced tests developed for use in measuring individual performance levels. By the mid-1930s, more than half of the states had some form of statewide testing, and standardized, norm-referenced testing, including achievement tests and personality and interest profiles, became a huge commercial enterprise.
During this period, educators regarded measurement and evaluation as nearly synonymous, with the latter usually thought of as summarizing student test
performance and assigning grades. Although the broader concept of evaluation, as we know it today, was still embryonic, useful measurement tools for the evaluator were proliferating rapidly, even though very few meaningful, formally published evaluations of school programs or curricula would appear for another 20 years. One notable exception was the ambitious, landmark Eight Year Study (Smith & Tyler, 1942) that set a new standard for educational evaluation with its sophisticated methodology and its linkage of outcome measures to desired learning outcomes. Tyler’s work, in this and subsequent studies (e.g., Tyler, 1950), also planted the seeds of standards-based testing as a viable alternative to norm-referenced testing. (We will return in Chapter 6 to the profound impact that Tyler and those who followed in his tradition have had on program evaluation, especially in education.)
Meanwhile, foundations for evaluation were being laid in fields beyond education, including human services and the private sector. In the early decades of the 1900s, Frederick Taylor’s scientific management movement influenced many. His focus was on systemization and efficiency—discovering the most efficient way to perform a task and then training all staff to perform it that way. The “efficiency experts” who emerged in industry soon permeated the business community and, as Cronbach et al. (1980) noted, “business executives sitting on the governing boards of social services pressed for greater efficiency in those services” (p. 27). Some cities and social agencies began to develop internal research units, and social scientists began to trickle into government service, where they started conducting applied social research in specific areas of public health, housing needs, and work productivity. However, these ancestral social research “precursors to evaluation” were small, isolated activities that exerted little overall impact on the daily lives of the citizenry or the decisions of the government agencies that served them.
Then came the Great Depression and the sudden proliferation of government services and agencies as President Roosevelt’s New Deal programs were implemented to salvage the U.S. economy. This was the first major growth in the federal government in the 1900s, and its impact was profound. Federal agencies were established to oversee new national programs in welfare, public works, labor management, urban development, health, education, and numerous other human service areas, and increasing numbers of social scientists went to work in these agencies. Applied social research opportunities abounded, and soon social science academics began to join with their agency-based colleagues to study a wide variety of variables relating to these programs. While some scientists called for explicit evaluation of these new social programs (e.g., Stephan, 1935), most pursued applied research at the intersection of their agency’s needs and their personal interests. Thus, sociologists pursued questions that were of interest to both the agency and the discipline of sociology, though the questions often emerged from sociology itself. The same trend occurred with economists, political scientists, and other academics who came to conduct research on federal programs. Their projects were considered to be “field research” and provided opportunities to address important questions within their discipline in the field. (See the interview with Michael Patton in the “Suggested Readings” at the end of this chapter for an example. In it, Patton discusses how his dissertation was initially planned as field research in sociology but led him into the field of evaluation.)
Program Evaluation: 1940–1964
Applied social research expanded during World War II as researchers investigated government programs intended to help military personnel in areas such as reducing their vulnerability to propaganda, increasing morale, and improving the training and job placement of soldiers. In the following decade, studies were directed at new programs in job training, housing, family planning, and community development. As in the past, such studies often focused on particular facets of the program in which the researchers happened to be most interested. As these programs increased in scope and scale, however, social scientists began to focus their studies more directly on entire programs rather than on the parts of them they found personally intriguing.
With this broader focus came more frequent references to their work as “evaluation research” (social research methods applied to improve a particular program).1 If we are liberal in stretching the definition of evaluation to cover most types of data collection in health and human service programs, we can safely say evaluation flourished in those areas in the 1950s and early 1960s. Rossi et al. (1998) state that it was commonplace during that period to see social scientists “engaged in evaluations of delinquency-prevention programs, felon-rehabilitation projects, psychotherapeutic and psychopharmacological treatments, public housing programs, and community organization activities” (p. 23). Such work also spread to other countries and continents. Many countries in Central America and Africa were the sites of evaluations examining health and nutrition, family planning, and rural community development. Most such studies drew on existing social research methods and did not extend the conceptual or methodological boundaries of evaluation beyond those already established for behavioral and social research. Such efforts would come later.
Developments in educational program evaluation between 1940 and 1965 were unfolding in a somewhat different pattern. The 1940s generally saw a period of consolidation of earlier evaluation developments. School personnel devoted their energies to improving standardized testing, quasi-experimental design, accreditation, and school surveys. The 1950s and early 1960s also saw considerable efforts to enhance the Tylerian approach by teaching educators how to state objectives in explicit, measurable terms and by providing taxonomies of possible educational objectives in the cognitive domain (Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956) and the affective domain (Krathwohl, Bloom, & Masia, 1964).
In 1957, the Soviets’ successful launch of Sputnik I sent tremors through the U.S. establishment that were quickly amplified into calls for more effective teaching of math and science to American students. The reaction was immediate. Passage of the National Defense Education Act (NDEA) of 1958 poured millions of dollars into massive, new curriculum development projects, especially in mathematics and science. Although only a few such projects were funded, their size and perceived importance led policymakers to fund evaluations of most of them.
The resulting studies revealed the conceptual and methodological impoverishment of evaluation in that era. Inadequate designs and irrelevant reports were only some of the problems. Most of the studies depended on imported behavioral and social science research concepts and techniques that were fine for research but not very suitable for evaluation of school programs.

1We do not use this term in the remainder of the book because we think it blurs the useful distinction between research and evaluation that we outlined in the previous chapter.
Theoretical work related directly to evaluation (as opposed to research) did not exist, and it quickly became apparent that the best theoretical and methodological thinking from social and behavioral research failed to provide guidance on how to carry out many aspects of evaluation. Therefore, educational scientists and practitioners were left to glean what they could from applied social, behavioral, and educational research. Their gleanings were so meager that Cronbach (1963) penned a seminal article criticizing past evaluations and calling for new directions. Although his recommendations had little immediate impact, they did catch the attention of other education scholars, helping to spark a greatly expanded conception of evaluation that would emerge in the next decade.