A Benchmark for Automatic Evaluation of K–12 Science Instructional Materials
SciEval is the first benchmark dataset for Automatic Instructional Materials Evaluation (AIME) — a generative AI task where large language models evaluate K–12 science instructional materials by producing quality scores and evidence-based justifications aligned with the EQuIP rubric.
The dataset consists of NGSS-aligned science lessons sourced from OpenSciEd, annotated by trained science education researchers through a multi-round process with structured adjudication. Each lesson is evaluated across 13 criteria spanning three-dimensional learning design, instructional supports, and student progress monitoring.
| Column | Description |
|---|---|
| ID | Unique identifier (e.g., Course_lesson_N_Criterion) |
| File | Source PDF filename |
| Criterion | EQuIP rubric criterion code |
| Score | Rating: 0 (N/A), 1 (Inadequate), 2 (Adequate), 3 (Extensive) |
| Evidence | Detailed justification with page references |
| Pos_Evidence | Positive evidence examples |
| Neg_Evidence | Negative evidence / gaps |
| Advice | Reviewer recommendations |
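The schema above can be sketched with a plain Python dict. This is a minimal illustration (the example row values are invented, not taken from the dataset); the `parse_id` helper is a hypothetical convenience for splitting the composite ID into its lesson and criterion parts, assuming the `Course_lesson_N_Criterion` pattern shown above.

```python
# One annotation record, mirroring the column schema above.
# All field values here are illustrative placeholders.
row = {
    "ID": "Bodies_Work_lesson_1_I_A",
    "File": "Bodies_Work_lesson_1.pdf",
    "Criterion": "I_A",
    "Score": 2,
    "Evidence": "Learning is driven by making sense of phenomena...",
    "Pos_Evidence": "Anchoring phenomenon introduced early in the lesson.",
    "Neg_Evidence": "Limited student questioning in later activities.",
    "Advice": "Add opportunities for students to pose their own questions.",
}

SCORE_LABELS = {0: "N/A", 1: "Inadequate", 2: "Adequate", 3: "Extensive"}

def parse_id(row_id: str) -> tuple[str, str]:
    """Split an ID like 'Bodies_Work_lesson_1_I_A' into (lesson, criterion),
    assuming the Course_lesson_N_Criterion naming pattern."""
    course, _, rest = row_id.partition("_lesson_")
    number, _, criterion = rest.partition("_")
    return f"{course}_lesson_{number}", criterion

lesson, criterion = parse_id(row["ID"])
label = SCORE_LABELS[row["Score"]]
```

For the example record, `parse_id` yields `("Bodies_Work_lesson_1", "I_A")` and the score 2 maps to the label "Adequate".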
| Property | Value |
|---|---|
| Instructional units | 32 |
| Total pages | 4,499 |
| Avg. pages per lesson | 16.5 |
| Avg. words per lesson | ~5,908 |
| Score 0 (N/A) | 41.8% |
| Score 3 (Extensive) | 54.2% of non-N/A (active) ratings |
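To make the two distribution rows concrete: the N/A percentage is taken over all ratings, while the Extensive percentage is taken only over the active (non-N/A) ratings. A minimal sketch with an invented score list (not the real dataset counts):

```python
from collections import Counter

# Illustrative ratings only; the real dataset has 13 ratings per lesson.
scores = [0, 0, 3, 3, 3, 2, 1, 3, 0, 2]

counts = Counter(scores)
total = len(scores)
active = total - counts[0]  # ratings of 1-3 only

pct_na = 100 * counts[0] / total            # N/A share of ALL ratings
pct_extensive = 100 * counts[3] / active    # Extensive share of ACTIVE ratings
```

With this toy list, 3 of 10 ratings are N/A (30%), and 4 of the 7 active ratings are Extensive (about 57%), mirroring how the 41.8% and 54.2% figures in the table are defined.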
| ID | File | Criterion | Score | Evidence |
|---|---|---|---|---|
| Bodies_Work_lesson_1_I_A | Bodies_Work_lesson_1.pdf | I_A | 2 | Examples of evidence that learning is driven by making sense of phenomena i... |
| Bodies_Work_lesson_1_I_B_1 | Bodies_Work_lesson_1.pdf | I_B_1 | 3 | Asking Questions and Defining Problems: Ask questions that arise from caref... |
| Cancer_lesson_1_I_B_3 | Cancer_lesson_1.pdf | I_B_3 | 2 | Systems and System Models: Models (e.g., physical, mathematical, computer mo... |
| Cancer_lesson_1_II_A | Cancer_lesson_1.pdf | II_A | 3 | Students experience the phenomenon, problems, and investigative phenomenon as... |
| Ocean_Plastic_lesson_1_II_C | Ocean_Plastic_lesson_1.pdf | II_C | 3 | All science information is accurate and grade level appropriate based on the ... |
| Hail_rain_lesson_1_II_B | Hail_rain_lesson_1.pdf | II_B | 2 | Throughout the unit, students are provided with a large number of opportuniti... |
| Hail_rain_lesson_1_III_D | Hail_rain_lesson_1.pdf | III_D | 3 | Students are provided with the appropriate background needed for completing... |
| Ocean_Plastic_lesson_1_III_E | Ocean_Plastic_lesson_1.pdf | III_E | 3 | In Lesson 1 students are involved in learning about each of the CCC and ask q... |
Download the SciEval dataset to begin your research.
Each annotation file contains 13 rows (one per EQuIP criterion) with scores, evidence, and reviewer justifications. The PDFs and annotations are split into 218 training and 55 test documents via stratified sampling.
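A stratified split like the one above can be sketched in pure Python: group documents by a stratum label and hold out a fixed fraction of each group. This is a hedged illustration only; the paper's exact stratification variables and test fraction are assumptions here (the function name `stratified_split` and the ~20% test fraction are hypothetical).

```python
import random
from collections import defaultdict

def stratified_split(doc_ids, strata, test_frac=0.2, seed=0):
    """Split document IDs into train/test lists, holding out roughly
    test_frac of each stratum (e.g., the course a lesson belongs to)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for doc, stratum in zip(doc_ids, strata):
        groups[stratum].append(doc)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        n_test = max(1, round(test_frac * len(members)))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Hypothetical usage with placeholder document IDs and course labels.
docs = [f"doc_{i}" for i in range(10)]
courses = ["Bodies_Work"] * 5 + ["Cancer"] * 5
train, test = stratified_split(docs, courses)
```

Because the split is computed per group, each course contributes proportionally to both sides, which is what keeps the 218/55 train/test criterion distributions comparable.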
If you use SciEval in your research, please cite our paper:
```bibtex
@inproceedings{li2026scieval,
  title     = {SciEval: A Benchmark for Automatic Evaluation of
               K-12 Science Instructional Materials},
  author    = {Li, Zhaohui and He, Peng and Chen, Zhiyuan and
               Liu, Honglu and Wang, Zeyuan and Li, Tingting and
               Xiong, Jinjun},
  booktitle = {Proceedings of the 27th International Conference on
               Artificial Intelligence in Education (AIED)},
  year      = {2026}
}
```