Stratified Learning for Reducing Training Set Size

Peter Hastings, Simon Hughes, Dylan Blaum, Patricia Wallace, M. Anne Britt

Educational standards put a renewed focus on strengthening students’ abilities to construct scientific explanations and engage in scientific arguments. Because evaluating student explanatory writing is extremely time-intensive, we are developing techniques that automatically analyze the causal structure of student essays so that effective feedback can be provided. These techniques rely on a substantial training corpus of annotated essays. One of our long-term goals is to make it easier to establish this approach in new subject domains, so we are keenly interested in how much training data is enough to support it. This paper describes our analysis of that question and examines one mechanism for reducing the data requirement, which uses student scores on a related multiple-choice test.

P. Hastings: The assessment project described in this article is funded, in part, by the Institute of Education Sciences, U.S. Department of Education (Grant R305F100007). The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

The final publication is available at Springer via https://doi.org/10.1007/978-3-319-39583-8_39.