Supported Quality Metrics

1. Alignment with Learning Objectives

Purpose: Verify questions accurately assess intended learning outcomes and match instructional goals.

References: Haladyna et al. [10], Sireci [17]

Scope: Question-level

Parameters:

learning_objectives: Source of objectives: "auto_extract", "provided", or an explicit list of objectives
alignment_threshold: Minimum acceptable alignment score (default: 70)

Evaluation Criteria:

Direct assessment of stated objectives
Coverage of key concepts
Appropriate depth and breadth

Example Configuration:

- name: "alignment"
  version: "1.0"
  evaluators: ["gpt4"]
  parameters:
    learning_objectives: "auto_extract"
    alignment_threshold: 75
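
The two parameters above could be interpreted roughly as follows. This is a minimal sketch, not the tool's actual implementation: the function names `objectives_mode` and `meets_alignment` are hypothetical, and the 0-100 score scale is an assumption inferred from the threshold values.

```python
from typing import List, Union

DEFAULT_ALIGNMENT_THRESHOLD = 70  # default stated in the parameter list


def objectives_mode(learning_objectives: Union[str, List[str]]) -> str:
    """Classify the learning_objectives parameter (hypothetical helper)."""
    if isinstance(learning_objectives, list):
        return "explicit_list"
    if learning_objectives in ("auto_extract", "provided"):
        return learning_objectives
    raise ValueError(f"unsupported learning_objectives: {learning_objectives!r}")


def meets_alignment(score: float, parameters: dict) -> bool:
    """Return True when the score reaches the configured threshold.

    Assumes scores are on a 0-100 scale; falls back to the documented
    default when alignment_threshold is not set.
    """
    threshold = parameters.get("alignment_threshold", DEFAULT_ALIGNMENT_THRESHOLD)
    return score >= threshold
```

With the example configuration above, a question scoring 72 would fail the check (threshold 75) but would pass under the default threshold of 70.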

2. Cognitive Level Appropriateness

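Purpose: Ensure questions target the appropriate cognitive level in the configured taxonomy (Bloom's by name in the levels below; "webb" is also accepted as a taxonomy value).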

Bloom's Taxonomy Levels:

Remember: Recall facts and basic concepts
Understand: Explain ideas or concepts
Apply: Use information in new situations
Analyze: Draw connections among ideas
Evaluate: Justify a decision or course of action
Create: Produce new or original work

References: Anderson & Krathwohl [2], Haladyna & Rodriguez [11]

Scope: Question-level

Parameters:

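taxonomy: Taxonomy used for classification: "bloom" (Bloom's taxonomy) or "webb" (Webb's Depth of Knowledge)
target_level: Expected cognitive level (e.g., "apply")
tolerance: Maximum allowed deviation from target_level, in levels (1 allows ±1 level)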

Example Configuration:

- name: "cognitive_level"
  version: "1.0"
  evaluators: ["gpt4"]
  parameters:
    taxonomy: "bloom"
    target_level: "apply"
    tolerance: 1
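
The tolerance check for the Bloom case can be sketched as below. The function name `within_tolerance` is hypothetical, and treating the six levels as an ordered scale with unit steps is an assumption about how deviation is measured.

```python
# Bloom's levels in ascending order, as listed above.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


def within_tolerance(observed: str, target: str, tolerance: int) -> bool:
    """Check whether the observed level is within `tolerance` steps of the target.

    Assumes deviation is the distance between the two positions in
    BLOOM_LEVELS; raises ValueError for an unknown level name.
    """
    observed_rank = BLOOM_LEVELS.index(observed.lower())
    target_rank = BLOOM_LEVELS.index(target.lower())
    return abs(observed_rank - target_rank) <= tolerance
```

Under the example configuration (target "apply", tolerance 1), a question classified as "understand" or "analyze" would be accepted, while one classified as "create" would not.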

3. Clarity and Precision

Purpose: Assess whether question stems and answer options use clear, unambiguous language without unnecessary complexity.

References: Downing [8], Haladyna et al. [10]

Scope: Question-level

Evaluation Criteria:

Language complexity appropriate for audience
Absence of ambiguous phrasing
Clear, concise wording
No unnecessary jargon
Proper use of terminology

Example Configuration:

- name: "clarity"
  version: "1.0"
  evaluators: ["gpt4", "claude_opus"]
  parameters:
    target_audience: "undergraduate"
    complexity_threshold: "moderate"

4. Answer Key Correctness

Purpose: Verify exactly one option is unambiguously correct (or clearly best) while all distractors are unambiguously incorrect.

References: Haladyna et al. [10], Haladyna & Rodriguez [11]

Scope: Question-level

Evaluation Criteria:

One clearly correct answer
All distractors are definitively incorrect
No ambiguity in correctness
Correct answer is verifiable from source material

Example Configuration:

- name: "answer_correctness"
  version: "1.0"
  evaluators: ["gpt4"]
  parameters:
    verify_source: true
    require_unambiguous: true

5. Distractor Quality

Purpose: Evaluate whether incorrect options are plausible to students lacking mastery but clearly wrong to knowledgeable students, and whether they are based on common misconceptions.

References: Gierl et al. [9], Haladyna & Rodriguez [11]

Scope: Question-level

Evaluation Criteria:

Plausibility to novices
Based on documented misconceptions
Not obviously incorrect
Discriminates between knowledge levels
Avoids "all of the above" or "none of the above"

Example Configuration:

- name: "distractor_quality"
  version: "1.0"
  evaluators: ["gpt4"]
  parameters:
    misconception_based: true
    plausibility_threshold: 60
    discrimination_required: true

6. Homogeneous Options

Purpose: Ensure all answer choices are parallel in grammatical structure and homogeneous in content type.

References: Haladyna et al. [10], Downing [8], Applegate et al. [18]

Scope: Quiz-level, with per-question analysis and quiz-level aggregation

Implementation Notes:

The metric runs in three phases: per-question option analysis, per-question scoring, and quiz-level aggregation.
For each applicable question, answer choices are classified by grammatical form, content type, and formatting signals before being scored.
The final quiz-level score combines the per-question scores and applies a small penalty when major heterogeneity issues recur across a quiz.
True/false questions are treated as not applicable and are excluded from the aggregate denominator.
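
The aggregation phase described above might look like the following sketch. It rests on several assumptions that this document does not specify: per-question scores on a 0-100 scale, a plain mean as the combining rule, and the penalty size and recurrence rule (`RECURRENCE_PENALTY`, `RECURRENCE_MIN_QUESTIONS`). All names are hypothetical.

```python
from typing import List, Optional

# Assumed values for illustration only; the actual penalty size and the
# rule for what counts as "recurring" are not given in this document.
RECURRENCE_PENALTY = 5.0
RECURRENCE_MIN_QUESTIONS = 2


def aggregate_quiz_score(
    question_scores: List[Optional[float]],
    major_issue_flags: List[bool],
) -> Optional[float]:
    """Combine per-question scores into one quiz-level score.

    A score of None marks a not-applicable question (e.g., true/false),
    which is excluded from the denominator.
    """
    applicable = [
        (score, flagged)
        for score, flagged in zip(question_scores, major_issue_flags)
        if score is not None
    ]
    if not applicable:
        return None  # no applicable questions, so there is nothing to score

    mean_score = sum(score for score, _ in applicable) / len(applicable)

    # Apply the penalty only when major issues recur across questions.
    flagged_count = sum(1 for _, flagged in applicable if flagged)
    if flagged_count >= RECURRENCE_MIN_QUESTIONS:
        mean_score -= RECURRENCE_PENALTY

    return max(0.0, mean_score)
```

Returning None for a quiz with no applicable questions keeps "not applicable" distinct from a genuine score of zero.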

Evaluation Criteria:

Parallel grammatical structure across answer choices
Homogeneous content type across answer choices
Consistent formatting, punctuation, and broad length patterns
Detection of structural outliers such as one full sentence among short phrases or one code fragment among prose options
Transparent issue reporting through per-question diagnostics retained in the metric output

Example Configuration:

- name: "homogeneous_options"
  version: "1.0"
  evaluators: ["gpt4"]
  enabled: true

7. Absence of Cueing

Purpose: Detect grammatical, semantic, or structural clues that inadvertently reveal the correct answer.

References: Downing [8], Haladyna et al. [10]

Scope: Question-level

Common Cues to Detect:

Grammatical inconsistencies (e.g., a stem ending in "an" when some options begin with a consonant sound)
Length differences (correct answer often longest)
Specificity differences (correct answer more detailed)
Absolute terms ("always", "never") in distractors
Verbal associations between stem and correct answer
Convergence cues (correct answer includes elements of all options)

Example Configuration:

- name: "cueing_absence"
  version: "1.0"
  evaluators: ["gpt4"]
  parameters:
    check_grammar: true
    check_length: true
    check_specificity: true
    check_absolutes: true
    check_associations: true
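
Two of the cues listed above (length and absolute terms) lend themselves to simple rule-based checks. The sketch below is illustrative only: the function names are hypothetical, and the metric itself may rely on the configured evaluators rather than rules like these.

```python
import re
from typing import List

ABSOLUTE_TERMS = ("always", "never")  # the two examples listed above


def longest_option_is_correct(options: List[str], correct_index: int) -> bool:
    """Length cue: True when the correct option is strictly longer than every distractor."""
    correct_len = len(options[correct_index])
    return all(
        correct_len > len(option)
        for i, option in enumerate(options)
        if i != correct_index
    )


def distractors_with_absolutes(options: List[str], correct_index: int) -> List[int]:
    """Absolute-term cue: return indices of distractors containing an absolute term."""
    pattern = re.compile(r"\b(" + "|".join(ABSOLUTE_TERMS) + r")\b", re.IGNORECASE)
    return [
        i
        for i, option in enumerate(options)
        if i != correct_index and pattern.search(option)
    ]
```

Character count is a crude proxy for length; a word count or a relative-length threshold could be substituted without changing the overall shape of the check.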

8. Grammatical Correctness

Purpose: Ensure both stem and options are grammatically correct and properly punctuated.

References: Haladyna et al. [10], Haladyna & Rodriguez [11]

Scope: Question-level

Evaluation Criteria:

Proper grammar in stem
Proper grammar in all options
Correct punctuation
Subject-verb agreement
Consistent tense usage

Example Configuration:

- name: "grammar"
  version: "1.0"
  evaluators: ["gpt4"]
  parameters:
    strict_mode: true
    check_punctuation: true

9. Factual Accuracy

Purpose: Verify questions and answers are factually correct, evidence-based, free from errors and biases, and aligned with provided source material.

Scope: Question-level

Evaluation Dimensions:

Factual Correctness: Are all statements accurate? Are there outdated facts or clear errors?
Evidence-Based Content: Is the answer a verifiable fact rather than an opinion or an unsupported theory?
Bias and Distortion: Is it free from political, cultural, or personal bias? Are all options presented fairly?
Source Alignment: Does it align with the provided source material? Does it contradict it?
Objectivity: Would reasonable experts agree with the factual claims?

Scoring Scale:

0-20: Highly Inaccurate (major errors, built on false premises)
21-40: Inaccurate (notable errors, partially opinion)
41-60: Moderately Accurate (mostly factual but minor inaccuracies)
61-80: Accurate (factually correct and evidence-based)
81-100: Highly Accurate (objective, perfectly grounded in evidence)
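
The scale above maps directly onto a lookup. The sketch below assumes integer scores, since the bands are given as integer ranges; the function name `accuracy_band` is hypothetical.

```python
def accuracy_band(score: int) -> str:
    """Map a 0-100 integer score to its band label from the scoring scale."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    # Upper bound of each band, in ascending order.
    bands = [
        (20, "Highly Inaccurate"),
        (40, "Inaccurate"),
        (60, "Moderately Accurate"),
        (80, "Accurate"),
        (100, "Highly Accurate"),
    ]
    for upper_bound, label in bands:
        if score <= upper_bound:
            return label
    raise AssertionError("unreachable: score already validated")
```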

Output:

Detailed reasoning across all five dimensions
List of specific major errors found (if any)
Numerical score (0-100)

Example Configuration:

- name: "accuracy"
  version: "1.1"
  evaluators: ["gpt4", "claude_opus"]