VALIDATION
How has Lectora been validated for grading and feedback?
Lectora's validation does not rest on a single one-shot study. The tool has been built in close collaboration with three academic partners over multiple years, with both grading and feedback validated against real course teachers' judgment in each programme.
At the Faculty of Medicine, UiB, the workflow has now been run across twelve final clinical exam sittings (889 candidate-exam pairs and roughly 43,700 item-pair comparisons in total). These include the published MED12 sitting, where Lectora's draft agreed with the course teacher at R² = 0.81, measurably closer than the R² = 0.64 agreement between two careful human graders on the same exam.
At the Mathematical Institute at UiB, independent researchers at UiB MatNat and the STEM Education Research Centre ran two studies in autumn 2025 testing AI grading on MAT101 (n = 1,051 paired AI/professor scores, R² = 0.68, pass/fail agreement 87.7%) and MAT111 (n = 80 paired gradings, where AI assistance dropped the inter-rater intraclass correlation coefficient, ICC, from 0.87 to 0.61 on the early prototype). The full findings, including the parts where AI underperformed, are reported in the UiB MatNat case study as a deliberate trust signal for institutional buyers.
At NHH Finance (Norway's leading business school), the workflow has been piloted to test whether it can reliably support pass/fail screening on large multi-page hand-ins in financial asset management.
The shared test across all three partnerships is the same: does the AI draft sit close enough to the course teacher's judgment that they can review it instead of grading the cohort from scratch?
What does R² mean in the context of grading?
R² (the coefficient of determination) measures how closely two sets of scores track each other. R² = 1.0 means one set of scores predicts the other perfectly; R² = 0.0 means one scorer's marks tell you nothing about the other's. For grading, R² in the 0.6–0.9 range is the practical territory: even careful, calibrated human graders rarely exceed R² = 0.85 against each other on long-form exams, and R² in the 0.5–0.7 range is common when two independent humans grade the same paper without coordination.
The reason human R² isn't higher: rubrics leave room for interpretation. Two careful graders looking at the same student answer can reasonably disagree about whether the differential diagnosis was complete or merely adequate; about whether a regression interpretation was correct or partially correct; about whether a math proof's gap is minor or load-bearing. Those disagreements compound across thousands of sub-questions and the overall R² lands wherever the rubric's ambiguity allows.
What R² tells you about an AI grader is whether its decisions sit inside the normal range of human disagreement or outside it. R² = 0.81 against a single grader sits comfortably inside that range. Lectora's drafts are in the same band of variability you'd see between two careful humans, and on the MED12 sitting they sat closer to the course teacher than the second human grader did.
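To make the statistic concrete, here is a minimal sketch of how an R² between two graders could be computed. It assumes R² is taken as the squared Pearson correlation between the two score lists; the text does not say which definition the published figures use. The function name and the score lists are hypothetical illustrations, not Lectora data.

```python
def r_squared(scores_a, scores_b):
    """Squared Pearson correlation between two graders' scores (assumed definition)."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two equal-length score lists with at least 2 entries")
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("each grader's scores must vary")
    return cov * cov / (var_a * var_b)


# Hypothetical sub-question scores from three graders on the same five answers.
teacher = [2, 4, 6, 8, 10]
stricter_by_one = [1, 3, 5, 7, 9]   # same ranking, always one point lower
second_grader = [4, 2, 8, 6, 10]    # broadly similar, with local disagreements

print(r_squared(teacher, stricter_by_one))  # 1.0
print(r_squared(teacher, second_grader))    # about 0.64
```

Under this definition a grader who is uniformly one point stricter still scores R² = 1.0, so the statistic describes how well two graders track each other rather than whether they award identical marks.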
What evidence is there that Lectora is accurate?
Evidence comes from three ongoing academic partnerships, each running its own validation against the course teacher's prior grading. The published anchor is the MED12 sitting at the Faculty of Medicine, UiB. From there, the same methodology has been carried into the rest of the faculty's final clinical exams, into handwritten math at the Mathematical Institute at UiB, and into long-form analytical answers at NHH Finance. Each partnership has run for multiple years, on real exams the course teachers had already graded.
The MED12 sitting itself covered 895 candidates' answers from a single six-hour exam — the twelfth-semester clinical exam at UiB — producing roughly 36,000 individual scored sub-questions. Two independent human graders had already scored every paper in the original exam, which gave the baseline R² = 0.64 between two humans grading the same exam in parallel. Lectora was then run blind on the same papers, scoring against the same rubric with no access to either human grader's scores. The agreement between Lectora's draft scores and the course teacher's reference scores is what produced the R² = 0.81 figure. The dataset is published; the comparison is apples-to-apples; the headline finding holds on the full 36,000-score dataset.
Math and finance pilots run the same validation loop against each partner's own rubrics. Per-partnership R² numbers aren't published yet — those validations are run privately against the course teacher's prior grading and shared with the course coordinator. The point of citing MED12 publicly is to anchor the methodology: a real cohort the teacher has already graded, scored blind by Lectora, compared item-by-item. Every partnership uses that same loop.
What's the inter-rater agreement among human graders?
R² = 0.64 on the MED12 dataset, and that's typical for long-form medical exams. For comparison: peer-reviewed studies on essay grading report R² values between 0.50 and 0.75 depending on the rubric and the grader pool. Mathematical proofs land in the 0.55–0.70 range when graded by two independent humans; legal-style analyses tend to be lower; multiple-choice runs nearly perfect (which is why it's so widely used in summative assessment, regardless of pedagogical merit).
This is the part of grading that nobody talks about: even careful, professional human graders disagree with each other, and the disagreement is not a bug — it reflects the genuine ambiguity in rubric application. The student who got an 82 from one grader might have got a 77 from another. Both are defensible. The student doesn't see the second grader.
A grading system that agrees with the course teacher at R² = 0.81 is, in effect, agreeing more closely than a second human reviewer typically would. That's the substance of every partnership validation we've run: not "AI is right and humans are wrong," but "AI's draft is well inside the envelope of what a careful human grader would have written, and on the published sitting was closer to the course teacher than a second human typically is."
Is Lectora more consistent than human graders?
On the MED12 dataset, yes, measurably: R² = 0.81 versus 0.64. But "more consistent than humans" is the wrong frame. The right frame is: Lectora's draft is close enough to the course teacher's judgment that the educator can review it in a fraction of the time a full first-pass grading round would take.
The student gets a grade that was reviewed by their teacher. The teacher gets a draft that's already calibrated to their rubric. The system's accuracy floor is the educator's judgment, not Lectora's. That's why we ship the published R² as evidence of how good the draft is, not as a claim that the draft replaces review.
For institutions evaluating Lectora, the published R² answers a specific question: is the draft worth my time to review, or am I starting from scratch? At R² = 0.81, you're starting from a draft that already lands close to where you'd have landed — which is the whole point. Each new partnership re-runs the same comparison against its own course teacher's prior grading before going into production.
What does this mean for workload? Targeted manual scoring at the pass/fail boundary
R² = 0.81 against the course teacher buys something specific: a draft you can review at the pass/fail boundary instead of grading the whole cohort from scratch. The workflow is called targeted manual scoring, and it has now been run across twelve final clinical exam sittings at the UiB Faculty of Medicine — 889 candidate-exam pairs and roughly 43,700 item-pair comparisons in total. The pattern is the same on every exam.
Lectora drafts every candidate. Course staff then manually score a small stratified calibration set, typically fourteen candidates: eight from the low end of the AI-drafted distribution, four around the pass/fail boundary, two from the top. Lectora fits a live regression between its draft and the examiner scores on those fourteen, computes a 99% prediction interval per remaining candidate, and flags the candidates whose lower prediction-interval bound crosses the pass/fail threshold. That flagged set (typically five to fifteen papers on an 80-candidate exam) gets manual scoring next. Everyone else relies on the calibrated draft. An optional full-scoring benchmark can be run on the remaining cohort to sign off the fitted regression line.
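The flagging step described above can be sketched as follows. This is an illustration under stated assumptions, not Lectora's implementation: it assumes an ordinary least squares line fitted on the calibration set, the textbook prediction interval for a single new observation, and that "crosses the threshold" means the lower bound falls below it (which also flags candidates predicted to fail). All names, the threshold of 50 and the scores are hypothetical.

```python
from math import sqrt

# Two-sided 99% Student-t critical value for 12 degrees of freedom
# (14 calibration candidates minus 2 fitted parameters). Tabulated value;
# a different calibration size needs a different constant.
T_CRIT_99_DF12 = 3.055


def fit_calibration(ai_scores, examiner_scores):
    """Ordinary least squares fit of examiner score on AI draft score."""
    n = len(ai_scores)
    if n != len(examiner_scores) or n < 3:
        raise ValueError("need at least 3 paired calibration scores")
    mean_x = sum(ai_scores) / n
    mean_y = sum(examiner_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in ai_scores)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(ai_scores, examiner_scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2
              for x, y in zip(ai_scores, examiner_scores))
    resid_sd = sqrt(rss / (n - 2))  # residual standard deviation
    return slope, intercept, resid_sd, mean_x, sxx, n


def flag_risk_block(calib_ai, calib_examiner, remaining_ai, threshold,
                    t_crit=T_CRIT_99_DF12):
    """Return ids whose lower prediction-interval bound is below the threshold."""
    slope, intercept, resid_sd, mean_x, sxx, n = fit_calibration(
        calib_ai, calib_examiner)
    flagged = []
    for cand_id, ai in remaining_ai.items():
        predicted = intercept + slope * ai
        half_width = t_crit * resid_sd * sqrt(
            1 + 1 / n + (ai - mean_x) ** 2 / sxx)
        if predicted - half_width < threshold:
            flagged.append(cand_id)
    return flagged


# Hypothetical calibration set: 8 low, 4 near the boundary, 2 from the top.
calib_ai = [20, 25, 30, 35, 38, 40, 42, 45, 48, 50, 52, 55, 80, 90]
calib_examiner = [22, 23, 32, 33, 40, 38, 44, 43, 50, 48, 54, 53, 82, 88]
remaining = {"A": 30, "B": 52, "C": 65, "D": 85}  # AI draft scores

flagged = flag_risk_block(calib_ai, calib_examiner, remaining, threshold=50)
print(flagged)  # ['A', 'B']
```

In this toy run, candidate B's draft says "pass" but the interval reaches below 50, so B is reviewed; C and D sit far enough above the threshold to rely on the calibrated draft.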
Across the twelve sittings the workflow has been run on so far, the aggregate saving is 70%: 12,694 examiner-task pairs manually scored against 41,733 in a full-cohort pass. Per-exam reduction varies from ~40% (sittings with dense risk clusters around the pass/fail threshold, where many candidates need manual review) to ~84% (larger sittings with sparse risk). The principal levers are cohort size and how densely candidates cluster around the threshold. Smaller cohorts save proportionally less because the fourteen-candidate calibration is a bigger fraction of them. Tightening the prediction interval shrinks the risk block, but at the cost of less margin for the candidates who skip manual review; widening it sends more candidates into review.
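The arithmetic behind these figures is simple enough to check by hand. The sketch below recomputes the aggregate saving from the two counts quoted above and shows why cohort size matters. The `max_saving` helper is a hypothetical illustration that assumes every candidate has the same number of tasks and that only the calibration set is scored manually (an empty risk block), so it gives an upper bound rather than an observed result.

```python
def saving(manual_pairs, full_pairs):
    """Fraction of examiner-task pairs avoided relative to a full-cohort pass."""
    return 1 - manual_pairs / full_pairs


def max_saving(cohort_size, calibration_size=14):
    """Best-case saving when only the calibration set is scored manually."""
    return 1 - calibration_size / cohort_size


# Aggregate figures quoted in the text above.
aggregate = saving(12_694, 41_733)
print(round(aggregate, 3))       # 0.696, reported as 70%

# Cohort sizes below are hypothetical, chosen to show the trend.
print(round(max_saving(30), 3))  # 0.533
print(round(max_saving(78), 3))  # 0.821
```

A 30-candidate sitting cannot save much more than half the workload even with no risk block at all, because the fourteen calibration candidates are already almost half the cohort.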
The integrity argument is the prediction interval, not the point estimate. The interval widens for candidates whose AI score sits in sparse regions of the calibration data and narrows for candidates in well-sampled regions. A candidate whose AI draft says "pass" with a tight interval far above the threshold gets no manual review. A candidate whose draft says "pass" with a wide interval that reaches down across the threshold does. The decision boundary is human; the rest of the curve is calibrated draft. That's the workload claim, and it's auditable per exam.
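For readers who want the formula, the textbook prediction interval for a simple linear regression is written out below. This is an assumption about the general form, not a statement of Lectora's exact model. Here x_0 is a candidate's AI draft score, n the number of calibration candidates, s the residual standard deviation of the calibration fit, and t the Student-t critical value for a 99% two-sided interval.

```latex
\hat{y}_0 \;\pm\; t_{0.995,\,n-2}\; s\,
\sqrt{\,1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,}
```

In this form the interval widens as x_0 moves away from the mean of the calibration scores. One plausible reading of the stratified design is that sampling heavily at the low end and around the boundary keeps the interval tight where pass/fail decisions are made.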
TARGETED MANUAL SCORING
Exam 01 · 78 candidates · 48 short-answer tasks
Step 4: Manually score the risk block
Workload saved: 73.1% (2,736 of 3,744 examiner-task pairs avoided)
Manual scoring: 26.9%
Risk block: 7
Calibration R²: 0.882
Full-cohort R²: 0.833
Manual time @ 3 min/task: 50.4 h
Full-pass time: 187.2 h
[Figure: prediction interval]
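The Exam 01 panel figures can be reconciled with a few lines of arithmetic. The sketch below assumes the manually scored candidates are the fourteen-candidate calibration set from the workflow description plus the risk block of 7 shown in the panel. That split is an inference that happens to reproduce the panel's numbers; the panel itself does not state it. The function name is hypothetical.

```python
def exam_workload(candidates, tasks, calibration, risk_block, minutes_per_task):
    """Recompute workload figures for one sitting from its basic counts."""
    full_pairs = candidates * tasks
    manual_pairs = (calibration + risk_block) * tasks
    avoided_pairs = full_pairs - manual_pairs
    return {
        "full_pairs": full_pairs,
        "avoided_pairs": avoided_pairs,
        "saved_pct": round(100 * avoided_pairs / full_pairs, 1),
        "manual_hours": manual_pairs * minutes_per_task / 60,
        "full_hours": full_pairs * minutes_per_task / 60,
    }


# 78 candidates, 48 tasks, 3 min/task and risk block 7 come from the panel;
# calibration=14 is the typical size given in the workflow description.
exam_01 = exam_workload(candidates=78, tasks=48, calibration=14,
                        risk_block=7, minutes_per_task=3)
print(exam_01)
# full_pairs 3744, avoided_pairs 2736, saved_pct 73.1,
# manual_hours 50.4, full_hours 187.2
```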
20+ courses in production and verified, across medicine, mathematics, finance and more.
UiB Medicine targeted-scoring pilot: 12 sittings · 889 candidates · ~43,700 AI-vs-examiner item-pair comparisons via the targeted manual scoring workflow.