Which of the Following Approaches to State Testing Works for U.S. Schools?

When and Where to Use Sampling

Sampling approaches make sense when policymakers are trying to get a broad understanding of trends and patterns. In the business world, the Bureau of Labor Statistics surveys a sample of individuals and employers each month to get a reasonably accurate picture of labor market conditions. Similarly, the National Assessment of Educational Progress (NAEP) tests a sample of students at regular intervals to understand achievement levels in each state.

The results of these surveys inform policymakers and provide clues about where to begin looking for problems and solutions. However, the labor-market surveys aren’t precise enough to be useful to individual employees or employers, let alone to researchers trying to do causal research. If an employer wanted to understand trends within their own company, they would need to look at the size of their own workforce and turnover rates among their own employees. In education, we have a derogatory term (“misNAEPery”) for policymakers who merely eyeball the NAEP trends and try to argue for or against certain policy changes.

More-detailed use cases require more-detailed data. As parents of school-age children, we want to know how our kids are doing. And, while we generally trust teachers and principals (one of us is a former principal), we still appreciate seeing how our own kids are doing on objective, standardized tests. We want that common benchmark. If states switched to a sampling approach, in which only some kids were tested each year, the parents of untested students would miss out on receiving objective, comparable, and individualized results.

Policymakers also need detailed data on student-level performance. Research on student performance in Florida and North Carolina found that both schools and districts have a meaningful influence on student learning. That was especially true during the pandemic, when researchers found that the specific school a student attended accounted for about three-quarters of the widening gap between low- and high-achieving students in math and about one-third of the gap in reading.

Sampling would make it much harder to evaluate the performance of schools and districts, especially for discrete student groups. Olson and Toch downplay this problem, but, because of sample-size issues, it simply wouldn’t be possible to look at school-level results for different student subgroups.

For a concrete example, imagine an elementary school with eight Black students in each of grades 3, 4, 5, and 6. To determine whether this school should be held accountable for a given student group, a state would combine performance results across the grades and then check whether the group met a minimum sample size. According to a recent analysis from the Education Commission of the States, most states apply a minimum subgroup size of 10 to 20 students, with some as high as 30. With a total of 32 Black students, this school would clear even the highest of those thresholds, if only just, and it would be held accountable for those students' performance.

But if the state tested only a sample of students, the number of Black students tested in this hypothetical school would likely fall below the threshold. The sample sizes start to get very small very quickly. When one of us (Chad Aldeman) ran a sampling model for Washington, D.C., he found that about half of the city’s elementary schools would not be held accountable for low-income or Black students, less than 10 percent of schools would be responsible for Hispanic students or English language learners, and not a single elementary school would be accountable for the progress of students with disabilities.
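To make that arithmetic concrete, here is a minimal sketch in Python of the same hypothetical school: it computes the chance that a subgroup of 32 students, tested under a sampling design, would fall short of a reporting threshold. The sampling rates and the specific thresholds are illustrative assumptions for this example, not a model of any actual state's rules or of the D.C. analysis described above.

from math import comb

# Illustrative sketch of the n-size arithmetic above: a school with 32 Black
# students across grades 3-6, where each student is independently tested with
# some probability under a sampling design. Rates and thresholds are
# assumptions for demonstration only.

TOTAL_STUDENTS = 32        # 8 Black students in each of grades 3, 4, 5, and 6
THRESHOLDS = (10, 20, 30)  # range of minimum subgroup sizes cited above

def prob_below_threshold(n: int, rate: float, threshold: int) -> float:
    """Chance that fewer than `threshold` of `n` students end up tested,
    assuming each student is independently sampled with probability `rate`."""
    return sum(comb(n, k) * rate**k * (1 - rate) ** (n - k) for k in range(threshold))

print(f"{'sampling rate':>13} | " + " | ".join(f"P(tested < {t})" for t in THRESHOLDS))
for rate in (0.25, 0.50, 0.75):
    cells = " | ".join(
        f"{prob_below_threshold(TOTAL_STUDENTS, rate, t):>14.1%}" for t in THRESHOLDS
    )
    print(f"{rate:>13.0%} | {cells}")

Even in this simplified setup, a subgroup that comfortably meets the threshold when every student is tested can routinely miss it once only a fraction of students are sampled, which is the pattern the D.C. modeling found at scale.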

The same math applies to school districts as well. Across the country, there are almost 9,000 school districts that serve between 100 and 1,000 students each. Collectively, those smaller districts educate more than four million students, but shifting to a sampling approach wouldn’t tell us much about the performance of those students.

Note that it would be technically possible to “over-sample” student groups or students in small schools or districts, but that would defeat the purpose of sampling in the first place. It would also mean that the testing burden would fall disproportionately on the traditionally underserved student groups that policymakers are the most concerned about.

But perhaps the biggest drawback with the sampling approach is that it might accomplish neither its political nor its technical goals. Opponents of “high-stakes testing” often worry more about the perceived stakes than the tests themselves. Standardized tests are frequently scapegoated for school closures or teacher layoffs, but real sanctions resulting from them are few and far between. The truth is that the threat of accountability has always been greater than any actual consequences, and that’s even truer today.

Moreover, the purported goal behind sampling is to reduce the amount of time kids spend taking tests, potentially freeing up more time for classroom instruction. This is a worthy aim, but the federally required state tests are not the main problem here. In fact, these exams account for only a tiny fraction of the time typically devoted to assessments each year. The real culprits are the layers upon layers of other tests adopted by states and local districts. There are potential solutions, such as testing audits to reduce redundancy. But we're not holding our breath for Congress to develop some sort of maximum-testing rule, so it would behoove individual states and districts to determine which of their tests deliver the greatest value.

Simply put, in our view, a sampling approach would have significant downsides without tangible benefits. Rather than backing away from the principle of testing all kids, we think there’s room for innovation on what those tests look like and how states use them.
