Demajh, Inc.

A Keyword-Based Technique to Evaluate Broad Question Answer Script

The paper presents an automated grading pipeline that flags grammar errors, extracts keywords, and scores long-form student answers by comparing those keywords to trusted online and instructor-provided sources, saving faculty hours and improving the consistency of feedback.

1. What the method is

The authors present an integrated software system that grades long-form exam answers electronically. It first checks spelling and grammar, then extracts keywords from each student response and compares their weighted frequencies against keywords drawn from reference answers and Wikipedia. The combined linguistic and semantic scores yield an automated mark, removing the need to evaluate each script by hand.
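
A minimal sketch of the linguistic half of this pipeline is shown below, assuming the open-source `language_tool_python` wrapper around LanguageTool as the grammar and spell checker; the error-rate formula and penalty weight are illustrative assumptions, not values taken from the paper.

```python
import language_tool_python

def language_score(answer_text: str, penalty: float = 5.0) -> float:
    """Return a score in [0, 1]: 1.0 for error-free text, decreasing as the
    grammar/spelling error rate rises (formula and penalty are assumptions)."""
    tool = language_tool_python.LanguageTool("en-US")
    matches = tool.check(answer_text)            # one match per detected issue
    words = max(len(answer_text.split()), 1)
    error_rate = len(matches) / words
    return max(0.0, 1.0 - penalty * error_rate)
```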

2. Why the method was developed

Marking subjective answers is time-intensive, inconsistent, and prone to human bias—especially as class sizes and recruitment exams grow. Existing optical-mark approaches only handle multiple-choice questions. The team therefore sought a scalable technique that would replicate expert grading for descriptive responses, cut turnaround times, and provide objective, data-driven feedback to students and institutions.

3. Who should care

Faculty and examination boards who mark large volumes of descriptive answer scripts, administrators of recruitment exams that include long-form questions, and developers of educational-technology platforms seeking objective, scalable grading are the most direct audience for this work; students also benefit from faster, more consistent feedback.

4. How the method works

The workflow runs in two stages: Linguistic Analysis and Answer Analysis. Linguistic Analysis uses open-source grammar and spell-check APIs to count errors and produce a language score. Answer Analysis tokenises the cleaned text, removes stop-words, and computes word-frequency vectors; separate routines scrape authoritative web pages and instructor answer keys, process them in the same way, and produce reference vectors. A comparison algorithm then measures the overlap between student and reference vectors to assign a content mark, which is blended with the language score for the final grade.
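
The sketch below illustrates the Answer Analysis stage under simplified assumptions: a hand-rolled stop-word list, raw term frequencies as keyword weights, cosine similarity as the overlap measure, and a fixed 30/70 blend of language and content scores. None of these specific choices is confirmed by the paper.

```python
import re
from collections import Counter
from math import sqrt

# Tiny illustrative stop-word list; the paper does not specify its own.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "that", "for", "it"}

def keyword_vector(text: str) -> Counter:
    """Tokenise, drop stop-words, and return a word-frequency vector."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def overlap_score(student: Counter, reference: Counter) -> float:
    """Cosine similarity between student and reference frequency vectors."""
    dot = sum(student[w] * reference[w] for w in student.keys() & reference.keys())
    norm = sqrt(sum(v * v for v in student.values())) * \
           sqrt(sum(v * v for v in reference.values()))
    return dot / norm if norm else 0.0

def grade(student_text: str, reference_texts: list[str],
          language_score: float, w_lang: float = 0.3) -> float:
    """Blend the content overlap (student vs. pooled references) with the
    language score; the 30/70 weighting is an assumption."""
    reference_vec = Counter()
    for ref in reference_texts:              # e.g. instructor key + scraped pages
        reference_vec.update(keyword_vector(ref))
    content = overlap_score(keyword_vector(student_text), reference_vec)
    return w_lang * language_score + (1 - w_lang) * content
```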

5. How it was evaluated

The prototype graded 100 student scripts from a computer-science course. Manual faculty scores served as ground truth. Precision, recall, F1, and relative-error metrics were reported for both keyword matching and overall grades, while confusion matrices highlighted false positives and negatives in pass-fail decisions.
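
A small sketch of this kind of evaluation is shown below; the pass mark, the treatment of "pass" as the positive class, and the relative-error definition are assumptions made for illustration rather than details from the paper.

```python
def evaluate(auto_marks: list[float], faculty_marks: list[float],
             pass_mark: float = 40.0) -> dict:
    """Compare automated marks against faculty marks on the same scripts."""
    tp = fp = fn = tn = 0
    rel_errors = []
    for auto, manual in zip(auto_marks, faculty_marks):
        auto_pass, manual_pass = auto >= pass_mark, manual >= pass_mark
        if auto_pass and manual_pass:
            tp += 1                          # both systems pass the script
        elif auto_pass and not manual_pass:
            fp += 1                          # automated pass, faculty fail
        elif manual_pass:
            fn += 1                          # automated fail, faculty pass
        else:
            tn += 1
        if manual:                           # relative error vs. the faculty grade
            rel_errors.append(abs(auto - manual) / manual)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    mre = sum(rel_errors) / len(rel_errors) if rel_errors else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "mean_relative_error": mre,
            "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn}}
```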

6. How it performed

The system achieved a precision of 0.91 and an F1 of 0.89 against lecturer grades, with average relative error below 3 %. Grammar detection reduced marking variance by 15 %, and total grading time fell from hours to seconds per script. *(Source: arXiv 2506.21461, 2025)*
