Quick Start
Get your quiz benchmark running in 5 minutes!
Installation
Prerequisites
- Python 3.13 or higher
- pip package manager
- API keys for at least one LLM provider
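You can confirm the interpreter and package manager versions up front (on some systems the commands are python3/pip3):
python --version   # expect 3.13 or higher
pip --version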
Setup Steps
# Clone repository
git clone <repository-url>
cd paper-al-quiz-generation-benchmark
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
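As a quick sanity check that dependencies installed cleanly, you can ask the entry point for its usage text; this assumes main.py exposes a standard --help flag, which the --config and --debug options used later suggest:
python main.py --help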
Configuration
Step 1: Set Up Environment Variables
# Copy template
cp config/.env.example .env
# Edit with your API keys (at minimum, one provider)
nano .env
Example .env file:
# OpenAI (direct API)
OPENAI_API_KEY=sk-your-key-here
# Or Azure OpenAI (for enterprise deployments)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_API_VERSION=2024-02-15-preview
# Or Anthropic Claude
ANTHROPIC_API_KEY=sk-ant-your-key-here
# Or Ollama (recommended with provider: "ollama")
OLLAMA_ENDPOINT=http://localhost:11434
OLLAMA_API_KEY=not-required
# Fallback alias used by openai_compatible
CUSTOM_LLM_ENDPOINT=http://localhost:11434/v1
CUSTOM_LLM_API_KEY=not-required
Security Note: Never commit .env to version control. It's already listed in .gitignore.
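If you want to confirm that your keys are visible to Python before running anything, a minimal sketch like the one below works, assuming the project loads its environment with python-dotenv; check_env.py is a hypothetical helper, not a file in the repository:
# check_env.py - hypothetical helper for verifying provider keys
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed via requirements.txt

load_dotenv()  # reads .env from the current directory

# Report which provider variables are visible to the process
for var in ("OPENAI_API_KEY", "AZURE_OPENAI_API_KEY", "ANTHROPIC_API_KEY", "OLLAMA_ENDPOINT"):
    print(f"{var}: {'set' if os.getenv(var) else 'missing'}")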
Step 2: Set Up Ollama (Optional, for local models)
If you want to run local evaluators with provider: "ollama" (recommended), configure Ollama:
- Start the Ollama server (ollama serve).
- Confirm the endpoint is reachable at http://localhost:11434.
Quick check:
ollama list
curl http://localhost:11434/api/tags
If these commands return model entries, the benchmark can reach Ollama using base_url: "http://localhost:11434".
Any configured model that is missing locally is pulled automatically during benchmark preflight.
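If you prefer to pull a model ahead of time rather than waiting for preflight, you can do so manually; the model name below is the one used in the example configuration:
ollama pull qwen2.5:7b-instruct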
Step 3: Configure Benchmark Settings
The repository includes an example configuration (config/benchmark_example.yaml). To run a hybrid setup (Azure + Ollama), use:
benchmark:
  name: "hybrid-benchmark"
  version: "1.0.0"
  runs: 1

evaluators:
  azure_gpt4:
    provider: "azure_openai"
    model: "gpt-4"  # Azure deployment name
    temperature: 0.0
    max_tokens: 500
  ollama_fast:
    provider: "ollama"
    model: "qwen2.5:7b-instruct"
    base_url: "http://localhost:11434"
    temperature: 0.0
    max_tokens: 300

metrics:
  - name: "difficulty"
    version: "1.0"
    evaluators: ["ollama_fast"]
    parameters:
      rubric: "bloom_taxonomy"
      target_audience: "undergraduate"
  - name: "clarity"
    version: "1.0"
    evaluators: ["ollama_fast", "azure_gpt4"]
  - name: "coverage"
    version: "1.1"
    evaluators: ["azure_gpt4"]
    parameters:
      granularity: "balanced"

inputs:
  quiz_directory: "data/quizzes"
  source_directory: "data/inputs"

outputs:
  results_directory: "data/results"
To switch local models per metric, define multiple ollama evaluators (different model values) and assign them in each metric's evaluators list.
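As a sketch, assuming the same configuration schema as the example above, two local evaluators with different models could be split across metrics like this (the second model name is purely illustrative):
evaluators:
  ollama_fast:
    provider: "ollama"
    model: "qwen2.5:7b-instruct"
    base_url: "http://localhost:11434"
  ollama_large:
    provider: "ollama"
    model: "llama3.1:8b"  # illustrative; use any model available to your Ollama install
    base_url: "http://localhost:11434"
metrics:
  - name: "difficulty"
    version: "1.0"
    evaluators: ["ollama_fast"]
  - name: "clarity"
    version: "1.0"
    evaluators: ["ollama_large"]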
Running Your First Benchmark
The repository includes an example quiz and source material. Just run:
python main.py --config config/benchmark_example.yaml
Viewing Results
Each run creates a readable bundle directory in data/results/:
# List recent bundles
ls -lt data/results
# Inspect one bundle
cat data/results/<run-bundle>/summary.txt
cat data/results/<run-bundle>/results.json
cat data/results/<run-bundle>/aggregated.json
cat data/results/<run-bundle>/run.log
Use --debug to stream verbose logs live in the terminal:
python main.py --config config/benchmark_example.yaml --debug
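If you prefer to inspect a bundle programmatically, a minimal sketch with the standard library is enough; inspect_results.py is a hypothetical helper, the bundle name is a placeholder, and the exact JSON fields depend on your configuration:
# inspect_results.py - hypothetical helper for browsing a result bundle
import json
from pathlib import Path

bundle = Path("data/results") / "<run-bundle>"  # replace with an actual bundle directory

for name in ("results.json", "aggregated.json"):
    data = json.loads((bundle / name).read_text())
    print(f"--- {name} ---")
    print(json.dumps(data, indent=2)[:500])  # first 500 characters; adjust as needed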
What's Happening?
- Loading: Reads data/quizzes/example_quiz.json and data/inputs/python_intro.md
- Evaluating: Each configured metric runs with each configured evaluator
- Repeating: The benchmark runs multiple times (configured via runs in YAML)
- Aggregating: Statistics (mean, median, std dev) are calculated across runs
- Reporting: Results are saved as JSON and human-readable text
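As a rough illustration of the aggregation step, the reported statistics are the standard ones computed over per-run scores; a minimal sketch with made-up scores (real values come from your runs):
# aggregation_sketch.py - illustrates the statistics computed across runs
from statistics import mean, median, stdev

scores = [0.72, 0.75, 0.70]  # hypothetical per-run scores for one metric/evaluator pair

print("mean:", mean(scores))
print("median:", median(scores))
print("std dev:", stdev(scores) if len(scores) > 1 else 0.0)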