Question 1

How accurate is Lectora's AI grading?

Accepted Answer

On the published MED12 sitting, Lectora agreed with the course teacher at R² = 0.81 across 895 candidates and ~36,000 sub-question scores. For comparison, two independent human graders on the same exam agreed at R² = 0.64. So Lectora's draft tended to sit closer to the course teacher than a second human grader typically would. The same loop has since been run across twelve final clinical exam sittings at UiB Medicine and is in pilot at the Mathematical Institute and NHH Finance.

Question 2

What is R² and why does it matter for AI grading?

Accepted Answer

R² measures how closely two sets of scores agree: it is the share of the variation in one set of scores that the other set accounts for. R² = 1.0 is perfect agreement; R² = 0.6–0.7 is what two careful humans usually achieve on long-form exams; R² = 0.81 (Lectora's number) sits above that typical human-on-human range. It matters because it tells you whether the AI's draft is close enough to a careful grader's judgment to be worth reviewing instead of starting from scratch.
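As a rough illustration of what that number summarizes, here is a minimal Python sketch. It assumes R² means the squared Pearson correlation between two graders' scores; the FAQ does not state which definition the validation used, so treat the formula as an assumption. The function name and the scores are made up for illustration and are not Lectora data.

```python
def r_squared(scores_a, scores_b):
    """Squared Pearson correlation between two equal-length score lists.

    Assumption: this is one common reading of "R²" for grader agreement;
    the published validation may define it differently.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    # Co-variation of the two graders, and each grader's own variation.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("each grader's scores must vary")
    return cov ** 2 / (var_a * var_b)


# Hypothetical sub-question scores from a teacher and an AI draft.
teacher = [2, 4, 5, 3, 6, 1]
draft = [2, 5, 5, 3, 5, 2]
print(round(r_squared(teacher, draft), 3))
```

Under this definition a constant offset between two graders (one always a point stricter) does not lower R², so it is usually read alongside an error measure such as mean absolute error.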

Question 3

How does Lectora compare to two independent human graders?

Accepted Answer

On the validated MED12 dataset, Lectora's agreement with the course teacher (R² = 0.81) was higher than the agreement between the two human graders who originally scored the exam (R² = 0.64). That's the headline finding: Lectora's draft agreed with the course teacher more closely than the two humans agreed with each other.

Question 4

Was Lectora validated on real exams?

Accepted Answer

Yes. The published anchor is the MED12 final exam at the University of Bergen — a real six-hour clinical exam, 895 real candidates, ~36,000 real sub-question scores. Not a synthetic test set, not a small sample, not a friendly subset. The same methodology runs against real cohorts in every new partnership before deployment.

Question 5

How many candidates were in the published MED12 validation?

Accepted Answer

895, across a single sitting of the MED12 final exam.

Question 6

Has Lectora been validated for medicine specifically?

Accepted Answer

Yes. The MED12 validation was run on a medical exam, and the targeted manual scoring workflow has since been run across eleven additional final clinical exam sittings at UiB Medicine on the same calibration loop. See the UiB Medicine pilot case study for the per-exam breakdown. For other clinical exams the same validation methodology applies; we run new validations as institutions onboard with their own rubrics and exam corpora.

Question 7

Has Lectora been validated for handwritten math?

Accepted Answer

Independently, yes. Researchers at the Mathematical Institute and STEM Education Research Centre at UiB ran two studies in autumn 2025 testing AI grading using Lectora as the tool. The MAT101 milestone study (n = 1,051 paired AI/professor scores from 356 students across milestone checks 0–4 on a 0–6 scale) found AI-professor agreement of R² = 0.68 (MAE 0.64 pt, 87.7% pass/fail agreement at the 3.0/6 boundary). The MAT111 study (20 papers graded four times across 11 graders on a 0–17 scale) found that AI assistance lowered inter-rater ICC from 0.87 to 0.61 on the early prototype, meaning graders agreed with each other less, with grader-attributable variance rising from 3.7% to 26%. Student perception was also tested under two contrasting designs: in MAT101's side-by-side design (each student saw all three feedback sources together) AI feedback was rated lowest (4.38/7 vs 5.72/7 for instructor), but in MAT111's single-condition design (each student received either Human+AI or Human-only, blind to which) there was no significant usefulness difference (p = 0.45). That is consistent with the MAT101 gap being at least partly a side-by-side artefact. Full findings, including where AI underperformed and the methodological caveats, are in the UiB MatNat case study.
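For readers unfamiliar with the two secondary MAT101 metrics, the sketch below shows how mean absolute error (MAE) and pass/fail agreement can be computed from paired scores. It is an illustration only: the function names are hypothetical, the scores are invented, and treating a score at or above the 3.0 boundary as a pass is an assumption, since the FAQ does not say whether the boundary is inclusive.

```python
def mean_absolute_error(ai_scores, prof_scores):
    """Average absolute gap, in points, between paired AI and professor scores."""
    if len(ai_scores) != len(prof_scores) or not ai_scores:
        raise ValueError("need two non-empty, equal-length score lists")
    return sum(abs(a - p) for a, p in zip(ai_scores, prof_scores)) / len(ai_scores)


def pass_fail_agreement(ai_scores, prof_scores, boundary=3.0):
    """Share of pairs where both scores land on the same side of the boundary.

    Assumption: a score >= boundary counts as a pass.
    """
    if len(ai_scores) != len(prof_scores) or not ai_scores:
        raise ValueError("need two non-empty, equal-length score lists")
    same_side = sum(
        (a >= boundary) == (p >= boundary) for a, p in zip(ai_scores, prof_scores)
    )
    return same_side / len(ai_scores)


# Invented scores on a 0-6 scale, not data from the MAT101 study.
ai = [2.5, 3.0, 4.0, 5.5]
prof = [3.0, 3.0, 3.5, 6.0]
print(mean_absolute_error(ai, prof))   # 0.375
print(pass_fail_agreement(ai, prof))   # 0.75
```

The first pair shows why both metrics are reported: the scores differ by only half a point, yet they fall on opposite sides of the boundary.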

Question 8

Has Lectora been validated for finance exams?

Accepted Answer

Finance is a separate, ongoing pilot at NHH Finance — Norway's leading business school — where Lectora's workflow has been tested for pass/fail screening on large multi-page hand-ins in financial asset management. The validation pattern is the same as for medicine: a real cohort the course teacher has already graded, scored blind by Lectora, compared item-by-item. Per-cohort R² numbers aren't published yet; the per-pilot summary is shared with the course coordinator.

Question 9

Which institutions has Lectora been validated with?

Accepted Answer

Three academic partners. The Faculty of Medicine at UiB: twelve final clinical exam sittings, including the published MED12 validation (R² = 0.81 vs 0.64 between two human graders). The Mathematical Institute at UiB: independent studies by UiB MatNat and the STEM Education Research Centre in autumn 2025 on MAT101 (R² = 0.68 AI vs professor agreement) and MAT111 (AI assistance lowered inter-rater ICC from 0.87 to 0.61); the full findings, including where AI underperformed, are in the UiB MatNat case study. NHH Finance: long-form analytical hand-ins, pass/fail screening pilot. Each partnership runs its own validation against the course teacher's prior grading before deployment.

Question 10

How much manual grading does Lectora actually save?

Accepted Answer

Across the twelve sittings the workflow has been run on at UiB Medicine, the aggregate is 12,694 examiner-task pairs scored manually out of 41,733 across the cohort, a 69.6% workload reduction. Per-exam saving ranges from ~40% on sittings where many candidates cluster around the pass/fail threshold to ~84% on larger sittings with sparse risk. Instead of a full-cohort manual pass, examiners score a 14-candidate stratified calibration set plus the small block of boundary-risk candidates that Lectora's live regression flags; everyone else relies on the calibrated draft. Smaller cohorts save proportionally less because the 14-candidate calibration set is a bigger fraction of them.
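The arithmetic behind these figures can be checked with a short Python sketch. The aggregate counts come from the answer above; the function names and the two cohort sizes (50 and 200) are hypothetical, chosen only to show why the fixed 14-candidate calibration set weighs more heavily on small cohorts.

```python
def workload_reduction(manual_pairs, total_pairs):
    """Fraction of examiner-task pairs that did not need manual scoring."""
    return 1 - manual_pairs / total_pairs


def calibration_share(cohort_size, calibration_size=14):
    """Fraction of a cohort taken up by the fixed-size calibration set."""
    return calibration_size / cohort_size


# Aggregate figures quoted for the twelve UiB Medicine sittings.
print(f"{workload_reduction(12_694, 41_733):.1%}")  # 69.6%

# Hypothetical cohort sizes, for illustration only.
for cohort in (50, 200):
    print(cohort, f"{calibration_share(cohort):.1%}")  # 50 -> 28.0%, 200 -> 7.0%
```

The calibration set alone is a floor on manual work: in the hypothetical 50-candidate cohort at least 28% of candidates are scored by hand before any boundary-risk candidates are added.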

Question 11

How many exams has Lectora been run across at UiB Medicine?

Accepted Answer

Twelve final clinical exam sittings to date — 889 candidate-exam pairs and roughly 43,700 item-pair comparisons across the cohort. The published validation R² of 0.81 vs 0.64 comes from one of those sittings (the MED12 final exam, 895 candidates); the other eleven sittings use the same targeted manual scoring workflow. See the UiB Medicine case study for the workflow walkthrough; an anonymized public demo of the per-exam analysis is also available on request.
