Complete Usage Guide
Environment Variables
The .env file stores your API credentials:
# Azure OpenAI (for enterprise deployments)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_API_VERSION=2024-02-15-preview
# OpenAI (direct API)
OPENAI_API_KEY=sk-your-key-here
# Anthropic Claude
ANTHROPIC_API_KEY=sk-ant-your-key-here
# Ollama models (optional, recommended with provider: "ollama")
OLLAMA_ENDPOINT=http://localhost:11434
OLLAMA_API_KEY=not-required
# Generic OpenAI-compatible fallback
CUSTOM_LLM_ENDPOINT=http://localhost:11434/v1
CUSTOM_LLM_API_KEY=optional-key
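In client code, these variables are typically read with os.getenv, with the provider-specific pair tried before the generic fallback. A minimal sketch for illustration only; the runner's actual resolution logic may differ, and the default endpoint here is an assumption:

```python
import os


def resolve_local_endpoint() -> str:
    """Ollama-specific variables take precedence; the generic
    CUSTOM_LLM_* pair is the fallback."""
    return (
        os.getenv("OLLAMA_ENDPOINT")
        or os.getenv("CUSTOM_LLM_ENDPOINT")
        or "http://localhost:11434"  # assumed default, for illustration
    )
```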
Ollama Setup (Local Endpoint)
To use Ollama as a local evaluator backend:
- Start the local server with ollama serve.
- Use http://localhost:11434 as your endpoint (or your configured port).
Recommended .env values:
OLLAMA_ENDPOINT=http://localhost:11434
OLLAMA_API_KEY=not-required
# Optional fallback aliases
CUSTOM_LLM_ENDPOINT=http://localhost:11434/v1
CUSTOM_LLM_API_KEY=not-required
Sanity checks:
# Ollama model list
ollama list
# Ollama tags API
curl http://localhost:11434/api/tags
Notes:
- If multiple local evaluators use different model IDs, Ollama loads them as needed.
- ollama resolves env vars in this order: OLLAMA_ENDPOINT/OLLAMA_API_KEY, then CUSTOM_LLM_ENDPOINT/CUSTOM_LLM_API_KEY.
- The runner performs fail-early validation for ollama: endpoint configured, server reachable, and configured model names present in /api/tags.
- If required models are missing, preflight attempts automatic pulls via /api/pull before aborting.
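The preflight checks can be sketched with the standard library alone. This is an illustration of the idea, not the runner's actual code; the helper names are hypothetical, and model-name matching is simplified (Ollama tags such as :latest are compared literally):

```python
import json
import urllib.request


def list_local_models(endpoint: str) -> set:
    """Return the model names reported by the Ollama tags API."""
    with urllib.request.urlopen(f"{endpoint}/api/tags", timeout=5) as resp:
        data = json.load(resp)
    return {m["name"] for m in data.get("models", [])}


def missing_models(required, available) -> list:
    """Configured models the server lacks (candidates for /api/pull)."""
    return [name for name in required if name not in available]


# Example (requires a running server):
# available = list_local_models("http://localhost:11434")
# to_pull = missing_models(["qwen2.5:7b-instruct"], available)
```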
Benchmark Configuration
Create YAML files in the config/ directory to define benchmark runs.
Basic Configuration
benchmark:
name: "basic-evaluation"
version: "1.0.0"
runs: 3
evaluators:
gpt4:
provider: "openai"
model: "gpt-4"
temperature: 0.0
max_tokens: 500
metrics:
- name: "alignment"
version: "1.0"
evaluators: ["gpt4"]
parameters:
learning_objectives: "auto_extract"
- name: "distractor_quality"
version: "1.0"
evaluators: ["gpt4"]
parameters:
misconception_based: true
inputs:
quiz_directory: "data/quizzes"
source_directory: "data/inputs"
outputs:
results_directory: "data/results"
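Once parsed (for example with yaml.safe_load), a configuration like the one above can be cross-checked so that every metric references a defined evaluator. A minimal sketch, assuming the parsed config exposes evaluators and metrics as plain dict keys; check_metric_evaluators is a hypothetical helper, not part of the runner:

```python
def check_metric_evaluators(config: dict) -> list:
    """Return error messages for metrics that reference undefined evaluators."""
    known = set(config.get("evaluators", {}))
    errors = []
    for metric in config.get("metrics", []):
        unknown = set(metric.get("evaluators", [])) - known
        if unknown:
            errors.append(f"{metric['name']}: unknown evaluators {sorted(unknown)}")
    return errors


# With PyYAML: config = yaml.safe_load(open("config/benchmark_example.yaml"))
cfg = {
    "evaluators": {"gpt4": {"provider": "openai"}},
    "metrics": [{"name": "alignment", "evaluators": ["gpt4", "claude"]}],
}
print(check_metric_evaluators(cfg))  # one error, for the undefined "claude"
```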
Advanced Configuration
For comparing multiple models:
benchmark:
name: "comprehensive-comparison"
version: "2.0.0"
runs: 5 # More runs for better statistics
evaluators:
# Multiple models for comparison
gpt4:
provider: "openai"
model: "gpt-4"
temperature: 0.0
max_tokens: 500
gpt35:
provider: "openai"
model: "gpt-3.5-turbo"
temperature: 0.0
max_tokens: 500
claude_opus:
provider: "anthropic"
model: "claude-3-opus-20240229"
temperature: 0.0
max_tokens: 500
metrics:
# Test alignment with multiple evaluators
- name: "alignment"
version: "1.0"
evaluators: ["gpt4", "gpt35", "claude_opus"]
parameters:
learning_objectives: "auto_extract"
enabled: true
# Test cognitive level with best model
- name: "cognitive_level"
version: "1.0"
evaluators: ["gpt4"]
parameters:
taxonomy: "bloom"
target_level: "apply"
enabled: true
# Test clarity across models
- name: "clarity"
version: "1.0"
evaluators: ["gpt4", "claude_opus"]
enabled: true
# Test distractor quality
- name: "distractor_quality"
version: "1.0"
evaluators: ["gpt4"]
parameters:
misconception_based: true
enabled: true
inputs:
quiz_directory: "data/quizzes"
source_directory: "data/inputs"
outputs:
results_directory: "data/results"
Hybrid Azure + Ollama Configuration
Use larger hosted models for expensive metrics and local models for lower-cost checks:
benchmark:
name: "hybrid-azure-ollama"
version: "1.0.0"
runs: 3
evaluators:
azure_gpt4:
provider: "azure_openai"
model: "gpt-4"
temperature: 0.0
max_tokens: 700
ollama_fast:
provider: "ollama"
model: "qwen2.5:7b-instruct"
base_url: "http://localhost:11434"
temperature: 0.0
max_tokens: 300
ollama_reasoning:
provider: "ollama"
model: "llama3.1:8b-instruct"
base_url: "http://localhost:11434"
temperature: 0.0
max_tokens: 500
metrics:
- name: "difficulty"
version: "1.0"
evaluators: ["ollama_fast"]
parameters:
rubric: "bloom_taxonomy"
target_audience: "undergraduate"
enabled: true
- name: "grammatical_correctness"
version: "1.0"
evaluators: ["ollama_fast"]
enabled: true
- name: "clarity"
version: "1.0"
evaluators: ["ollama_reasoning", "azure_gpt4"]
enabled: true
- name: "coverage"
version: "1.1"
evaluators: ["azure_gpt4"]
parameters:
granularity: "balanced"
enabled: true
Notes:
- ollama is the recommended provider for Ollama endpoints. openai_compatible remains available for generic OpenAI-compatible backends (vLLM, local proxies, etc.).
- You can define multiple local evaluators with different model values and route metrics accordingly.
- If all local evaluators use the same Ollama instance, keep base_url identical.
Data Preparation
Quiz JSON Format
Quizzes must be in JSON format with this schema:
{
"quiz_id": "unique_identifier",
"title": "Human-readable title",
"source_material": "filename.md",
"learning_objectives": [
"Students will be able to...",
"Students will understand..."
],
"questions": [
{
"question_id": "unique_question_id",
"question_type": "single_choice | multiple_choice | true_false",
"question_text": "The question text",
"options": ["Option 1", "Option 2", "..."],
"correct_answer": "For SC/TF" || ["For", "MC"],
"source_reference": "Optional reference to source",
"bloom_level": "Optional: remember|understand|apply|analyze|evaluate|create",
"metadata": {}
}
],
"metadata": {
"optional": "fields"
},
"created_at": "2024-01-15T10:00:00"
}
Question Types
Single Choice (one correct answer):
{
"question_type": "single_choice",
"options": ["A", "B", "C", "D"],
"correct_answer": "B"
}
Multiple Choice (multiple correct answers):
{
"question_type": "multiple_choice",
"options": ["A", "B", "C", "D"],
"correct_answer": ["B", "D"]
}
True/False:
{
"question_type": "true_false",
"options": ["True", "False"],
"correct_answer": "True"
}
Source Material Format
Source materials should be Markdown files in data/inputs/:
# Main Topic
## Section 1: Introduction
Content about the topic...
### Subsection 1.1
More detailed content...
## Section 2: Advanced Topics
Advanced content...
Tips:
- Use clear section headers
- Include all content that questions might reference
- Keep formatting simple (avoid complex tables/diagrams)
- Use the same structure as your educational materials
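To check which sections your questions can reference, the headers of a source file can be listed quickly. A small sketch (illustrative only, not part of the toolchain):

```python
import re


def section_titles(markdown: str) -> list:
    """Collect #, ##, and ### headers from a Markdown source file."""
    return re.findall(r"^#{1,3}\s+(.*)$", markdown, flags=re.MULTILINE)


doc = "# Main Topic\n## Section 1: Introduction\nContent...\n### Subsection 1.1\n"
print(section_titles(doc))  # ['Main Topic', 'Section 1: Introduction', 'Subsection 1.1']
```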
Running Benchmarks
Basic Execution
python main.py --config config/benchmark_example.yaml
Command-Line Options
# Use custom .env file
python main.py --config config/my_benchmark.yaml --env .env.production
# Skip aggregation (save only raw results)
python main.py --config config/my_benchmark.yaml --no-aggregate
# Custom run bundle name
python main.py --config config/my_benchmark.yaml --output-prefix experiment_001
# Stream verbose debug logs in terminal
python main.py --config config/my_benchmark.yaml --debug
What Happens During Execution
1. Initialization
   - Loads configuration from YAML
   - Loads environment variables
   - Initializes LLM providers
   - Registers metrics
2. Data Loading
   - Loads all quizzes from quiz_directory
   - Loads corresponding source materials
   - Validates data schemas
3. Evaluation Loop
   - For each run (1 to runs)
     - For each quiz
       - For each metric
         - For each evaluator
           - Generates prompt
           - Calls LLM
           - Parses response
           - Records result
4. Aggregation (if enabled)
   - Groups results by metric and evaluator
   - Calculates statistics (mean, median, std dev, min, max)
   - Generates summary report
5. Output
   - Creates a run bundle directory in outputs.results_directory
   - Saves results.json, aggregated.json (unless --no-aggregate), summary.txt, metadata.json
   - Saves a full execution log in run.log
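The evaluation loop described above amounts to four nested iterations. A simplified sketch; run_benchmark is a hypothetical name, and the real runner would generate a prompt, call the LLM, and parse the response where this version only records which combination was evaluated:

```python
def run_benchmark(runs, quizzes, metrics):
    """Sketch of the nested evaluation loop (no real LLM calls)."""
    results = []
    for run_number in range(1, runs + 1):
        for quiz in quizzes:
            for metric in metrics:
                for evaluator in metric["evaluators"]:
                    # Prompt generation, LLM call, and response parsing
                    # would happen here in the real runner.
                    results.append({
                        "run_number": run_number,
                        "quiz_id": quiz["quiz_id"],
                        "metric_name": metric["name"],
                        "evaluator": evaluator,
                    })
    return results
```

The total number of evaluations is therefore runs x quizzes x metrics x evaluators-per-metric, which is worth keeping in mind for API costs.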
Progress Monitoring
- Default terminal output shows concise run progress (INFO level).
- Full detailed logs are always written to each run bundle (run.log).
- Use --debug if you want full verbose logs live in the terminal.
Understanding Results
Output Files
After execution, you'll find one run bundle directory in data/results/ per benchmark run:
- results.json: Raw results from all evaluations
- aggregated.json: Statistical aggregations (if aggregation enabled)
- summary.txt: Human-readable report
- metadata.json: Run metadata (config, timestamps, run id)
- run.log: Full detailed execution log
Reading the Summary
======================================================================
BENCHMARK RESULTS SUMMARY
======================================================================
Configuration: example-benchmark
Version: 1.0.0
Total Runs: 3
Quizzes Evaluated: 1
ALIGNMENT WITH LEARNING OBJECTIVES
----------------------------------------------------------------------
Evaluator: gpt-4
Mean: 82.50 # Average alignment score
Median: 83.00 # Middle value (robust to outliers)
Std Dev: 4.12 # Consistency (lower = more consistent)
Min: 76.00 # Lowest score observed
Max: 88.00 # Highest score observed
N: 12 # Total number of evaluations
Evaluator: gpt-3.5-turbo
Mean: 79.25
Median: 80.00
Std Dev: 5.67
Min: 71.00
Max: 86.00
N: 12
DISTRACTOR QUALITY
----------------------------------------------------------------------
Evaluator: gpt-4
Mean: 71.33
Median: 72.00
Std Dev: 6.89
Min: 61.00
Max: 81.00
N: 12
Interpreting Results
Mean vs Median
- Mean: Average score, sensitive to outliers
- Median: Middle value, robust to outliers
- Large difference → check for outliers in raw data
Standard Deviation
- Low (< 5): Consistent evaluations
- Medium (5-10): Some variation (acceptable)
- High (> 10): High variance, may need more runs or prompt refinement
Comparing Evaluators
- Similar scores → models agree on metric
- Different scores → models have different perspectives or capabilities
- Check std dev to assess reliability
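The statistics in the summary can be reproduced from raw scores with the standard library. A sketch with made-up example scores; note it uses the sample standard deviation (statistics.stdev), whereas the aggregator's choice of sample vs. population std dev is not specified here:

```python
import statistics

scores = [76.0, 80.0, 83.0, 84.0, 86.0, 88.0]  # one metric/evaluator pair

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "std_dev": statistics.stdev(scores),  # sample standard deviation
    "min": min(scores),
    "max": max(scores),
    "n": len(scores),
}

# Heuristics from the interpretation guidelines above:
high_variance = summary["std_dev"] > 10          # more runs / prompt refinement
outlier_hint = abs(summary["mean"] - summary["median"]) > 5  # check raw data
```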
Raw Results Structure
{
"benchmark_id": "unique-run-id",
"benchmark_version": "1.0.0",
"config_hash": "abc123...",
"quiz_id": "quiz_example_001",
"run_number": 1,
"metrics": [
{
"metric_name": "alignment",
"metric_version": "1.0",
"score": 82.0,
"evaluator_model": "gpt-4",
"quiz_id": "quiz_example_001",
"question_id": "q1",
"parameters": {
"learning_objectives": "auto_extract"
},
"evaluated_at": "2024-01-15T10:30:00",
"raw_response": "LLM's actual response..."
}
],
"started_at": "2024-01-15T10:29:00",
"completed_at": "2024-01-15T10:35:00"
}
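Since each entry in metrics is flat, the per-metric, per-evaluator aggregation can be reproduced directly from results.json. A minimal sketch; aggregate is a hypothetical helper, shown here averaging scores only:

```python
import statistics
from collections import defaultdict


def aggregate(metric_results):
    """Group flat metric results by (metric, evaluator) and average the scores."""
    groups = defaultdict(list)
    for r in metric_results:
        groups[(r["metric_name"], r["evaluator_model"])].append(r["score"])
    return {key: statistics.mean(vals) for key, vals in groups.items()}


# With a real bundle: rows = json.load(open("results.json"))["metrics"]
rows = [
    {"metric_name": "alignment", "evaluator_model": "gpt-4", "score": 82.0},
    {"metric_name": "alignment", "evaluator_model": "gpt-4", "score": 84.0},
]
print(aggregate(rows))  # {('alignment', 'gpt-4'): 83.0}
```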