Testing

This project uses pytest for automated tests.

Run Tests

pytest

pytest --cov=src --cov-report=term-missing

No external LLM calls: tests must never hit OpenAI/Anthropic/Azure or any networked model.
Deterministic: avoid randomness and time-based assertions; tests should be stable across runs.
Behavior > quality: assert functional behavior (counts, outputs, validation), not model quality.

Use the test mock provider from tests/conftest.py:

MockLLMProvider implements LLMProvider.generate and returns deterministic numeric strings.
mock_llm_provider fixture registers it via LLMProviderFactory for tests.

When writing tests that involve evaluation runs:

Unit tests: models, metrics, registry, config loader (no IO, no runner).
Component tests: runner orchestration with mocked LLMs.
IO/analysis tests: file round-trips and aggregation/reporting with fixed inputs.

Validation: input schema errors, config validation, parameter validation.
Control flow: skip behavior for missing metrics/evaluators, run counts, metric counts.
Parsing: metric response parsing success/failure cases.
Aggregation/reporting: stats computed correctly for fixed scores.

Provider implementations (OpenAI/Anthropic/Azure): skip to avoid accidental network usage.
LLM output quality: do not assert on semantic content beyond numeric parsing.