EXPERIMENTAL ESTIMATES OF EDUCATION PRODUCTION FUNCTIONS*

Alan B. Krueger

This paper analyzes data on 11,600 students and their teachers who were randomly assigned to different size classes from kindergarten through third grade. Statistical methods are used to adjust for nonrandom attrition and transitions between classes. The main conclusions are (1) on average, performance on standardized tests increases by four percentile points the first year students attend small classes; (2) the test score advantage of students in small classes expands by about one percentile point per year in subsequent years; (3) teacher aides and measured teacher characteristics have little effect; (4) class size has a larger effect for minority students and those on free lunch; (5) Hawthorne effects were unlikely.

I. Introduction

The large literature on the effect of school resources on student achievement generally finds ambiguous, conflicting, and weak results. Even quantitative summaries of the literature tend to reach conflicting conclusions. For example, based on the fact that most estimates of the effect of school inputs on student achievement are statistically insignificant, Hanushek [1986] concludes, "There appears to be no strong or systematic relationship between school expenditures and student performance." By contrast, Hedges et al. [1994] conduct a meta-analysis of (a subset of) the studies enumerated by Hanushek and conclude, "the data are more consistent with a pattern that includes at least some positive relation between dollars spent on education and output, than with a pattern of no effects or negative effects."
* I thank Helen Bain, a founder and principal director of Project STAR, for providing me with the data used in this study, Jayne Zaharias, DeWayne Fulton, and Van Cain for answering several questions regarding the data, and Jessica Baraka, Aaron Saiger, and Diane Whitmore for providing outstanding research assistance. The STAR data have been collected and maintained by the Center of Excellence for Research in Basic Skills at Tennessee State University. The STAR data are available from www.telalink.net/~heros. Helpful comments on my research were provided by Charles Achilles, Jessica Baraka, Ronald Ehrenberg, William Evans, Jeremy Finn, John Folger, Victor Fuchs, Joseph Hotz, Lawrence Katz, Cecilia Rouse, James P. Smith, two referees, and seminar participants at the Milken Institute, Massachusetts Institute of Technology, National Bureau of Economic Research, Princeton University, Vanderbilt University, University of California at Los Angeles, the Kennedy School (Harvard University), the London School of Economics, Stockholm University, the Econometric Society, World Bank, and Society of Labor Economists. Financial support was provided by the National Institute of Childhood Health and Development. © 1999 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 1999.

Much of the uncertainty in the literature derives from the fact that the appropriate specification (including the functional form, level of aggregation, relevant control variables, and identification) of the "education production function" is uncertain.1 Some specifications do consistently yield significant effects, however.
Notably, estimates that use cross-state variation in school resources typically find positive effects of school resources, whereas studies that use within-state data are more likely to find insignificant or wrong-signed estimates (see Hanushek [1996]).2 Many of these specification issues arise because of the possibility of omitted variables, either at the student, class, school, or state level. Moreover, functional form issues are driven in part by concern for omitted variables, as researchers often specify education production functions in terms of test-score changes to difference out omitted characteristics that might be correlated with school resources (although such differencing could introduce greater problems if the omitted characteristics affect the trajectory of student performance). A classical experiment, in which students are randomly assigned to classes with different resources, would help overcome many of these specification issues and provide guidance for observational studies.

1. There is also debate over what should be the appropriate measure of school outputs (see Card and Krueger [1996]). Whereas education researchers tend to analyze standardized test scores, economists tend to focus on students' educational attainment and subsequent earnings.
2. Hanushek attributes this difference to omitted state-level variables that bias the state-level studies, although it is possible that endogenous resource decisions within states (e.g., assignment of weaker students to smaller classes as required by compensatory education) bias the within-state micro-data estimates, and that the interstate estimates are unbiased.

This paper provides an econometric analysis of the only large-scale randomized experiment on class size ever conducted in the United States, the Tennessee Student/Teacher Achievement Ratio experiment, known as Project STAR. Project STAR was a longitudinal study in which kindergarten students and their teachers were randomly assigned to one of three groups beginning in the 1985-1986 school year: small classes (13-17 students per teacher), regular-size classes (22-25 students), and regular/aide classes (22-25 students) which also included a full-time teacher's aide. After their initial assignment, the design called for students to remain in the same class type for four years. Some 6000-7000 students were involved in the project each year. Over all four years, the sample included 11,600 students from 80 schools. Each school was required to have at least one of each class-size type, and random assignment took place within schools. The students were given a battery of standardized tests at the end of each school year. In a review article Mosteller [1995] described Project STAR as "a controlled experiment which is one of the most important educational investigations ever carried out and illustrates the kind and magnitude of research needed in the field of education to strengthen schools."

The STAR data have been examined extensively by an internal team of researchers. This analysis has found that students in small classes tended to perform better than students in larger classes, while students in classes with a teacher aide typically did not perform differently than students in regular-size classes without an aide (see Word et al. [1990], Finn and Achilles [1990], and Folger and Breda [1989]). Past research primarily consists of comparisons of means between the assignment groups, and analysis of variance at the class level. In this research, little attention has been paid to potential threats to the validity of the experiment or to the longitudinal structure of the data.

As in any experiment, there were deviations from the ideal experimental design in the actual implementation of Project STAR.
First, students in regular-size classes were randomly assigned again between classes with and without full-time aides at the beginning of first grade, while students in small classes continued on in small classes, often with the same set of classmates.3 Re-randomization was done to placate parents of children in regular classes who complained about their children's initial assignment. Because analysis of data for kindergartners did not indicate a significant effect of a teacher aide on achievement in regular-size classes, it was felt that this procedure would create few problems. But if the constancy of one's classmates influences achievement, then the experimental comparison after kindergarten is compromised by the re-randomization.

A second limitation of the experiment is that approximately 10 percent of students switched between small and regular classes between grades, primarily because of behavioral problems or parental complaints. These nonrandom transitions could also compromise the experimental results. Furthermore, because some students and their families naturally relocate during the school year, actual class size varied more than intended in small classes (11 to 20) and in regular classes (15 to 30). Finally, as in most longitudinal studies of schooling, sample attrition was common: half of students who were present in kindergarten were missing in at least one subsequent year. And some students may have nonrandomly switched to another public school or enrolled in private school upon learning their class-type assignments. These limitations of the experiment have not been adequately addressed in previous work.

3. If a school had more than one small class, students could be moved between small classes.

This paper has three related goals. First, to probe the sensitivity of the experimental estimates to flaws in the experimental design.
Second, to use the experiment to identify an appropriate specification of the education production function to estimate with nonexperimental data. Third, to use the experimental results to interpret estimates from the literature based on observational data. The conclusion makes a rough attempt to compare the benefits and costs of reducing class size from 22 to 15 students.

II. Background on Project STAR and Data

A. Design and Implementation

Project STAR was funded by the Tennessee legislature, at a total cost of approximately $12 million over four years.4 The Tennessee legislature required that the study include students in inner-city, suburban, urban, and rural schools. The research was designed and carried out by a team of researchers at Tennessee State University, Memphis State University, the University of Tennessee, and Vanderbilt University. To be eligible to participate in the experiment, a public school was required to sign up for four years and be large enough to accommodate at least three classes per grade, so within each school students could be assigned to a small class (13-17), regular class (22-25 students), or regular plus a full-time aide class.5 The statewide pupil-teacher ratio in kindergarten in 1985-1986 was 22.3, so students assigned to regular classes fared about as well as the average student in the state [Word et al. 1990]. Schools with more than 67 students per grade had more than three classes. One limitation of the comparison between regular and regular/aide classes is that in grades 1-3 each regular class had the services of a part-time aide 25-33 percent of the time on average, so the variability in aide services was restricted.6

4. This section draws heavily from Word et al. [1990] and Folger [1989].
5. Participating schools had an average per-pupil expenditure in 1986-1987 of $2724, compared with the statewide average of $2561.

The cohort of students who entered kindergarten in the 1985-1986 school year participated in the experiment through third grade. Any student who entered a participating school in a relevant grade was added to the experiment, and participating students who repeated a grade, skipped a grade, or left the school exited the sample. Entering students were randomly assigned to one of the three types of classes (small, regular, or regular/aide) in the summer before they began kindergarten.7 Students were typically notified of their initial class assignment very close to the beginning of the school year. Students in regular classes and in regular/aide classes were randomly reassigned between these two types of classes at the end of kindergarten, while students initially in small classes continued on in small classes. Notice, however, that results from the kindergarten year are uncontaminated by this feature of the experiment.

Because kindergarten attendance was not mandatory in Tennessee at the time of the study, many new students entered the program in first grade. Additionally, students were added to the sample over time because they repeated a grade or because their families moved to a school zone that included a participating school. In all, some 2200 new students entered the project in first grade and were randomly assigned to the three types of classes. Another 1600 and 1200 new students entered the experiment in the second and third grades, respectively. Newly entering students were randomly assigned to class types, although the uneven availability of slots in small and regular classes often led to an unbalanced allocation of new students across class types. A total of 11,600 children were involved in the experiment over all four years. After third grade, the experiment ended, and all students were assigned to regular-size classes.
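The assignment procedure that footnote 7 describes (an alphabetized enrollment list, a centrally prepared rule that assigns every kth student to a class type, and a school-specific random starting point) can be illustrated with a minimal Python sketch. The function name `assign_class_types`, the simple rotation through the three class types, and the example roster are assumptions made for illustration; the actual STAR algorithms were tailored to each school's enrollment and are not reproduced in this paper.

```python
import random

CLASS_TYPES = ("small", "regular", "regular/aide")

def assign_class_types(enrollment, class_types=CLASS_TYPES, seed=None):
    """Assign students to class types by systematic rotation (illustrative only).

    Assumption: students are taken in alphabetical order and the class types
    are rotated, so every k-th student (k = number of class types) lands in
    the same type. The random starting point shifts the rotation.
    """
    rng = random.Random(seed)
    roster = sorted(enrollment)                # alphabetized enrollment list
    k = len(class_types)
    start = rng.randrange(k)                   # school-specific random start
    return {name: class_types[(start + pos) % k]
            for pos, name in enumerate(roster)}

# Hypothetical roster, not STAR data.
example = assign_class_types(
    ["Dana", "Avery", "Casey", "Blake", "Emery", "Finley"], seed=0
)
```

An equal rotation is the simplest case. A rule used in practice would presumably need unequal shares, since small classes hold fewer students than regular ones; the sketch ignores this.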
Although data have been collected on students through ninth grade, the present study only has access to data covering grades K-3. Data were collected on students each fall and spring during the experiment. Class type is based on the class attended in the fall. All students who attended a STAR class in either the fall or spring are included in the database.

6. The reason that regular classes often had a teacher aide is that the ethic underlying the study was that students in the control group (i.e., regular classes) would not be prevented from receiving resources that they ordinarily would receive.
7. The procedure for randomly assigning students was as follows. Each school prepared an alphabetized enrollment list. Algorithms were centrally prepared which assigned every kth student to a class type; the algorithm was tailored to the number of enrolled students. A random starting point was used by each school to apply the algorithm. The schools were audited to ensure that they followed procedures for random assignment.

Unfortunately, the STAR data set does not contain students' original class type assignments resulting from the randomization procedure; only the class types that students actually were enrolled in each year are available. It is possible that some students were switched from their randomly assigned class to another class before school started or early in the fall. To determine the frequency of such switches, we obtained and (double) entered data on the initial random assignments from the actual enrollment sheets that were compiled in the summer prior to the start of kindergarten for 1581 students from 18 participating STAR schools.8 It turns out that only 0.3 percent of students in the experiment were not enrolled in the class type to which they were randomly assigned in kindergarten. Moreover, only one student in this sample who was assigned a regular or regular/aide class enrolled in a small class.
Consequently, in the analysis below, we will refer to the class type in which students are enrolled during the first year they enter the experiment as their initial random assignment.

8. I thank Jayne Zaharias for providing the enrollment sheets. The sample I analyze excludes twins; schools were allowed to assign twins to the same class if that was the school's ordinary practice.

A limitation of the experiment is that baseline test score information on the students is not available, so one cannot examine whether the treatment and control groups "looked similar" on this measure before the experiment began. Nonetheless, if the students were successfully randomly assigned between class types, one would expect those assigned to small- and regular-size classes to look similar along other measurable dimensions at base line. Tables I and II provide some evidence on the differences among students assigned to the three class types. Table I disaggregates the data into waves, based upon the grade the students entered the program, because this was the first time the students were randomly assigned to a class type. The sample consists of all students who were enrolled in a STAR class when the fall or spring data were collected. Sample means by class type for several variables are presented. As one would expect, students assigned to small classes had fewer students in their class than those in regular classes, on average. There are small differences in the fraction of students on free lunch, the racial mix, and the average age of students in classes of different size, although some of these differences are statistically significant (see rows 1-4).9 Because random assignment was only valid within schools, these differences suggest the importance of controlling for school effects as is done in Table II.

9. To be precise, the fraction on free lunch actually measures the fraction who receive free or reduced-price lunch.

TABLE I
Comparison of Mean Characteristics of Treatments and Controls: Unadjusted Data

Variable                                Small   Regular   Regular/Aide   Joint p-value (a)

A. Students who entered STAR in kindergarten (b)
1. Free lunch (c)                       .47     .48       .50            .09
2. White/Asian                          .68     .67       .66            .26
3. Age in 1985                          5.44    5.43      5.42           .32
4. Attrition rate (d)                   .49     .52       .53            .02
5. Class size in kindergarten           15.1    22.4      22.8           .00
6. Percentile score in kindergarten     54.7    49.9      50.0           .00

B. Students who entered STAR in first grade
1. Free lunch                           .59     .62       .61            .52
2. White/Asian                          .62     .56       .64            .00
3. Age in 1985                          5.78    5.86      5.88           .03
4. Attrition rate                       .53     .51       .47            .07
5. Class size in first grade            15.9    22.7      23.5           .00
6. Percentile score in first grade      49.2    42.6      47.7           .00

C. Students who entered STAR in second grade
1. Free lunch                           .66     .63       .66            .60
2. White/Asian                          .53     .54       .44            .00
3. Age in 1985                          5.94    6.00      6.03           .66
4. Attrition rate                       .37     .34       .35            .58
5. Class size in third grade            15.5    23.7      23.6           .01
6. Percentile score in second grade     46.4    45.3      41.7           .01

D. Students who entered STAR in third grade
1. Free lunch                           .60     .64       .69            .04
2. White/Asian                          .66     .57       .55            .00
3. Age in 1985                          5.95    5.92      5.99           .39
4. Attrition rate                       NA      NA        NA             NA
5. Class size in third grade            16.0    24.1      24.4           .01
6. Percentile score in third grade      47.6    44.2      41.3           .01

a. p-value is for F-test of equality of all three groups.
b. Sample size in panel A ranges from 6299 to 6324, in panel B ranges from 2240 to 2314, in panel C ranges from 1585 to 1679, and in panel D ranges from 1202 to 1283.
c. Free lunch pertains to the fraction receiving a free lunch in the first year they are observed in the sample (i.e., in kindergarten for panel A; in first grade in panel B; etc.). Percentile score pertains to the average percentile score on the three Stanford Achievement Tests the students took in the first year they are observed in the sample.
d. Attrition rate is the fraction that ever exits the sample prior to completing third grade, even if they return to the sample in a subsequent year. Attrition rate is unavailable in third grade.

Table II presents p-values for joint F-tests of the differences among small, regular, and regular/aide classes for the variables presented in Table I. Unlike results reported in Table I, these p-values are conditional on school effects. None of the three background variables displays a statistically significant association with class-type assignment at the 10 percent level, which suggests that random assignment produced relatively similar groups in each class size, on average. As an overall test of random assignment, I regressed a dummy variable indicating assignment to a small class on the three background measures in rows 1-3 and school dummies. For each wave, the student characteristics had no more than a chance association with class-type assignment. Furthermore, if the same regression model is estimated for a sample that pools all four entering waves of students together, the three student characteristics are still insignificantly related to assignment to a small class (p-value = .58). Within schools, there is no apparent evidence that initial assignment to class types was correlated with student characteristics.

TABLE II
P-Values for Tests of Within-School Differences among Small, Regular, and Regular/Aide Classes

                          Grade entered STAR program
Variable                  K      1      2      3
1. Free lunch             .46    .29    .58    .18
2. White/Asian            .66    .28    .15    .21
3. Age                    .38    .12    .48    .40
4. Attrition rate         .01    .07    .58    NA
5. Actual class size      .00    .00    .00    .00
6. Percentile score       .00    .00    .46    .00

Each p-value is for an F-test of the null hypothesis that assignment to a small, regular, or regular/aide class has no effect on the outcome variable in that grade, conditional on school of attendance. All rows except 4 pertain to the first grade in which the student entered the STAR program. The attrition rate in row 4 measures whether the student ever left the sample after initially being observed.

To check whether teacher assignment was independent of observed teacher characteristics, I regressed each of three teacher characteristics (experience, race, or education) on dummies indicating the class type the teachers were assigned to and school dummies, and then performed an F-test of the hypothesis that the class-type dummies jointly had no effect. These regressions were calculated for each of the four grade levels, so there was a total of twelve regressions. In each case, the p-value for the class-type dummies exceeded .05.10 These results are as one would expect with random assignment of teachers to the different class types.

There was a high rate of attrition from the project. Only half the students who entered the project in kindergarten were present for all grades K-3. For the kindergarten cohort, students in small classes were three-four percentage points more likely to stay in the sample than those in regular-size classes. This pattern was reversed among those who entered in first grade, however. Attrition could occur for several reasons, including students moving to another school, students repeating a grade, and students being advanced a grade. Although I lack data on retention rates for the early grades, Word et al. [1990] report that over the four years of the project, 19.8 percent of students in small classes were retained, while 27.4 percent of students in regular classes were retained. This is consistent with the lower attrition rate of students in small classes. Some of the analysis that follows makes a crude attempt to adjust for possible nonrandom attrition.

It is virtually impossible to prescribe the exact number of students in a class: families move in and out of a school district during the course of a year; students become sick; and varying numbers of students are enrolled in schools.
As a result, in some cases actual class size deviated from the intended ranges. Table III reports the frequency distribution of class size for first graders, by assignment to small, regular, or regular/aide classes. Although students assigned to small classes clearly were more likely to attend classes with fewer students, there was considerable variability in class size within each class-type assignment, and some overlap between the distributions.

10. In two cases the p-value was less than .10. Third grade teachers assigned to small classes were less likely to have a master's degree or higher than were teachers assigned regular-size classes, and first grade teachers in small classes had two more years of experience than those in regular-size classes (although less experience than those in regular/aide classes).

TABLE III
Distribution of Children across Actual Class Sizes by Random Assignment Group in First Grade

Actual class size       Assignment group in first grade
in first grade          Small   Regular   Aide
12                      24      0         0
13                      182     0         0
14                      252     0         0
15                      465     0         0
16                      256     16        0
17                      561     17        0
18                      108     36        0
19                      57      76        57
20                      20      200       120
21                      0       378       378
22                      0       594       329
23                      0       437       460
24                      0       384       264
25                      0       175       225
26                      0       130       234
27                      0       54        108
28                      0       28        56
29                      0       29        58
30                      0       30        30
Average class size      15.7    22.7      23.4

Actual class size was determined by counting the number of students in the data set with the same class identification.

It is also virtually impossible to prevent some students from switching between class types over time. Table IV shows a transition matrix between class types for students who continued from K-1, 1-2, and 2-3 grades. If students remained in their same class type over time, all the off-diagonal elements would be zero. The re-randomization of students in regular classes in first grade is apparent in panel A. But in second and third grades, when students were supposed to remain in their same type of class, 9-11 percent of students switched class-size types.
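As a check on the 9-11 percent figure, the share of students who changed class type can be computed from the off-diagonal cells of panels B and C of Table IV. The sketch below is illustrative Python; the function name `switch_rate` and the nested-list layout are assumptions, while the counts are those reported in Table IV.

```python
def switch_rate(matrix):
    """Share of students off the main diagonal of a square transition matrix."""
    total = sum(sum(row) for row in matrix)
    stayed = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - stayed) / total

# Rows: class type in the earlier grade; columns: class type in the later grade.
# Order of both: small, regular, regular/aide (counts from Table IV).
first_to_second = [[1435, 23, 24], [152, 1498, 202], [40, 115, 1560]]
second_to_third = [[1564, 37, 35], [167, 1485, 152], [40, 76, 1857]]

rate_1_to_2 = switch_rate(first_to_second)   # 556 / 5049, about 11.0 percent
rate_2_to_3 = switch_rate(second_to_third)   # 507 / 5413, about 9.4 percent
```

Counting every off-diagonal cell, including moves between regular and regular/aide classes, is what reproduces the range quoted in the text.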
Students were moved between class types because of behavioral problems or, in some cases, parental complaints. Obviously, if the movement between class types was associated with student characteristics (e.g., students with stronger academic backgrounds more likely to move into small classes), these transitions would bias a simple comparison of outcomes across class types. To address this potential problem, and the variability of class size for a given type of assignment, in some of the analysis that follows initial random assignment is used as an instrumental variable for actual class size.

TABLE IV
Transitions between Class-Size in Adjacent Grades: Number of Students in Each Type of Class

A. Kindergarten to first grade
                    First grade
Kindergarten        Small   Regular   Reg/aide   All
Small               1292    60        48         1400
Regular             126     737       663        1526
Aide                122     761       706        1589
All                 1540    1558      1417       4515

B. First grade to second grade
                    Second grade
First grade         Small   Regular   Reg/aide   All
Small               1435    23        24         1482
Regular             152     1498      202        1852
Aide                40      115       1560       1715
All                 1627    1636      1786       5049

C. Second grade to third grade
                    Third grade
Second grade        Small   Regular   Reg/aide   All
Small               1564    37        35         1636
Regular             167     1485      152        1804
Aide                40      76        1857       1973
All                 1771    1598      2044       5413

B. Data and Standardized Tests

Students were tested at the end of March or beginning of April of each year. The tests consisted of the Stanford Achievement Test (SAT), which measured achievement in reading, word recognition, and math in grades K-3, and the Tennessee Basic Skills First (BSF) test, which measured achievement in reading and math in grades 1-3. The tests were tailored to each grade level. Because there are no natural units for the test results, I scaled the test scores into percentile ranks.
Specifically, in each grade level the regular and regular/aide students were pooled together, and students were assigned percentile scores based on their raw test scores, ranging from 0 (lowest score) to 100 (highest score). A separate percentile distribution was generated for each subject test (e.g., Math-SAT, Reading-SAT, Word-SAT, etc.). For each test I then determined where in the distribution of the regular-class students every student in the small classes would fall, and the students in the small classes were assigned these percentile scores. Finally, to summarize overall achievement, the average of the three SAT percentile rankings was calculated.11 If the performance of students in the small classes was distributed in the same way as performance of students in the regular classes, the average percentile score for students in the small classes would be 50.

An examination of the correlations among the tests indicates that the strongest correlations typically are between tests of the same subject matter; for example, in second grade the SAT and BSF reading tests have a correlation of .80. Tests of the same subjects tend to have a higher correlation from one grade to the next than tests of different subjects. The SAT and BSF tests are also highly correlated with each other: the correlation between the average SAT percentile and average BSF percentile is .79 in first grade and .85 in second grade.

For most of the subsequent analysis, the SAT exam is the primary focus of study because this test has been used on a national level for a long period of time. The main findings are similar for the BSF test, however. The average of the three SAT exams by class type is presented in the last row of Table I.
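The percentile scaling described above can be sketched in a few lines of Python. The function names `percentile_rank` and `average_percentile`, the use of the share of reference scores strictly below a student's score, and the toy scores are assumptions made for illustration; the paper does not state how ties were handled. Averaging over whichever subtests are available follows the rule given in footnote 11.

```python
from bisect import bisect_left

def percentile_rank(score, reference_sorted):
    """Percent of reference-group scores strictly below `score`."""
    return 100.0 * bisect_left(reference_sorted, score) / len(reference_sorted)

def average_percentile(scores, references):
    """Average a student's percentile ranks over the subtests that are present.

    `scores` maps subtest name -> raw score (or None if missing);
    `references` maps subtest name -> sorted raw scores of the pooled
    regular and regular/aide students in the same grade.
    """
    ranks = [percentile_rank(s, references[name])
             for name, s in scores.items() if s is not None]
    return sum(ranks) / len(ranks) if ranks else None

# Toy reference distributions (hypothetical raw scores, not STAR data).
references = {
    "math": [400, 420, 440, 460],
    "reading": [380, 410, 430, 470],
    "word": [390, 405, 425, 455],
}
small_class_student = {"math": 450, "reading": 415, "word": None}
summary = average_percentile(small_class_student, references)
```

Because every student is ranked against the same regular-class distribution, a small-class average above 50 indicates that small-class students score higher than the regular-class benchmark.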
Figure I displays the kernel density of the average test score distributions for students in small and regular classes at each grade level.12 In all grades, the average student in small classes performed better on this summary test measure than did those in regular or regular/aide classes. There does not seem to be a very strong or consistent effect of the teacher aide, however. The rest of the paper probes the robustness of these findings.

11. Formally, denote the cumulative distribution of scores on test $j$ (denoted $T^j$) of students in the regular and regular/aide classes as $F_R(T^j) = \text{prob}[T^j_{Ri} < T^j] = y^j$. For each student $i$ in a small class, we then calculated $F_R(T^j_{Si}) = y^j_{Si}$. Naturally, the distribution of $y^j$ for students in regular classes follows a uniform distribution. We then calculated the average of the three (or two for BSF) percentile rankings for each student. If one subtest score was missing, we took the average of the two percentiles that were available; and if two were missing, we used the percentile score corresponding to the only available test.
12. Note that because we have averaged over three percentile scores, the distributions are not uniform for students assigned to regular classes.

[Figure I. Distribution of Test Percentile Scores by Class Size and Grade; horizontal axes: Stanford Achievement Test Percentile, 0 to 100]

Observe also that the average test score of students in all class types tends to be lower for those who entered the experiment in higher grades. This pattern is likely to reflect the fact that kindergarten was optional and higher-achieving students were more likely to attend kindergarten, as well as the tendency of lower-achieving students to be retained and disproportionately added to the sample at higher grade levels. Because of this feature of the data, I control for the grade in which the student entered Project STAR in some of the analysis below.
The Appendix presents means for several variables that are available in the data set.

III. Statistical Models

To see the advantage of a randomized experiment in estimating the effect of school resources on student achievement, consider the following general model:

(1) $Y_{ij} = aS_{ij} + bF_{ij} + \varepsilon_{ij}$,

where $Y_{ij}$ is the achievement level of student i in school j,