INDEPENDENT RESEARCH · UIB MATHEMATICS

Can AI be trusted to grade student mathematics assignments?

Yes, with the caveats this page documents. In autumn 2025, independent researchers at the Mathematical Institute and the STEM Education Research Centre at UiB (Kasper Troøyen, Therese Saltskår, and Sehoya Cotner) tested AI grading in two introductory mathematics courses, MAT101 and MAT111, using Lectora as the AI tool.

The headline findings across both studies: on first-semester mathematics, AI grading comes close to the human-on-human agreement band (MAT101 R² = 0.68, just below the 0.70–0.85 typical between two careful math graders; pass/fail agreement 87.7% at the 3.0/6 boundary). In the MAT111 controlled-group design, where students did not see the comparison, ratings of AI feedback did not differ significantly from ratings of human-only feedback (p = 0.45).

The variance trade-off is real but bounded: MAT111 inter-rater ICC dropped from 0.87 to 0.61 on the early prototype, before any grader-calibration training. The picture is now strong enough that AI grading tools for introductory mathematics (Lectora alongside peers like Gradescope, STACK, and Möbius Assessment) can be deployed at scale with high confidence, provided each course gets its own validation and an explicit grader-calibration setup. The page below reports every number from the published UiB study, including the parts where the AI underperformed, and explains how the evidence cuts.
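For readers who want to see what the two MAT101 agreement figures measure, here is a minimal Python sketch. It is not the study's analysis code. It assumes that a pass means a grade at or above the 3.0 boundary on the 6-point scale, and that R² is the squared Pearson correlation between human and AI grades; the study may define either differently. The function names and the toy grades are hypothetical and purely illustrative.

```python
def pass_fail_agreement(human, ai, boundary=3.0):
    """Fraction of submissions where both graders fall on the same side of the boundary.

    Assumption: a grade >= boundary counts as a pass.
    """
    if len(human) != len(ai) or not human:
        raise ValueError("need two non-empty grade lists of equal length")
    same_side = sum((h >= boundary) == (a >= boundary) for h, a in zip(human, ai))
    return same_side / len(human)


def r_squared(human, ai):
    """Squared Pearson correlation between two grade lists.

    Assumption: this is what "R²" refers to; a regression R² with a
    fixed slope and intercept would be computed differently.
    """
    if len(human) != len(ai) or len(human) < 2:
        raise ValueError("need two grade lists of equal length >= 2")
    n = len(human)
    mean_h = sum(human) / n
    mean_a = sum(ai) / n
    cov = sum((h - mean_h) * (a - mean_a) for h, a in zip(human, ai))
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_a = sum((a - mean_a) ** 2 for a in ai)
    if var_h == 0 or var_a == 0:
        raise ValueError("grades must vary for the correlation to be defined")
    return (cov * cov) / (var_h * var_a)


# Hypothetical toy grades on a 0-6 scale (not data from the study).
human_grades = [1.0, 2.5, 3.0, 4.5, 5.0, 6.0]
ai_grades = [1.5, 3.0, 2.5, 4.0, 5.5, 6.0]

agreement = pass_fail_agreement(human_grades, ai_grades)
r2 = r_squared(human_grades, ai_grades)
print(f"pass/fail agreement: {agreement:.1%}, R^2: {r2:.2f}")
```

The two metrics answer different questions. Pass/fail agreement only checks whether both graders put a submission on the same side of one threshold, so it can be high even when individual grades differ by a point or more; R² tracks how closely the grades move together across the whole scale.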