Testing
This project uses pytest for automated tests.
Run Tests
pytest
Run Tests With Coverage
pytest --cov=src --cov-report=term-missing
Writing Tests
Key Principles
- No external LLM calls: tests must never hit OpenAI/Anthropic/Azure or any networked model.
- Deterministic: avoid randomness and time-based assertions; tests should be stable across runs.
- Behavior > quality: assert functional behavior (counts, outputs, validation), not model quality.
Mocking LLMs
Use the test mock provider from tests/conftest.py:
MockLLMProviderimplementsLLMProvider.generateand returns deterministic numeric strings.mock_llm_providerfixture registers it viaLLMProviderFactoryfor tests.
When writing tests that involve evaluation runs:
- Include the
mock_llm_providerfixture. - Register metrics via the
registered_metricsfixture. - Use
sample_config/sample_quizfixtures where possible.
Test Structure
- Unit tests: models, metrics, registry, config loader (no IO, no runner).
- Component tests: runner orchestration with mocked LLMs.
- IO/analysis tests: file round-trips and aggregation/reporting with fixed inputs.
What to Test
- Validation: input schema errors, config validation, parameter validation.
- Control flow: skip behavior for missing metrics/evaluators, run counts, metric counts.
- Parsing: metric response parsing success/failure cases.
- Aggregation/reporting: stats computed correctly for fixed scores.
What Not to Test
- Provider implementations (OpenAI/Anthropic/Azure): skip to avoid accidental network usage.
- LLM output quality: do not assert on semantic content beyond numeric parsing.