AI Code Review Pipeline
Hephaestus runs an AI-powered code review pipeline that evaluates merge requests against configurable software engineering practices. When a student opens a non-draft MR (or uses /hephaestus review), the system detects relevant practices, runs an LLM agent inside a sandboxed container, parses the structured output, and posts findings as MR comments and inline diff notes.
Pipeline overview
Key components
| Component | Class | Responsibility |
|---|---|---|
| Detection gate | PracticeReviewDetectionGate | 8-check gate: draft skip, workspace resolution, agent config, practice matching, runForAllUsers bypass, assignee presence, Keycloak health, assignee role |
| Review handler | PullRequestReviewHandler | Context assembly (diff, metadata, practices), diff summary computation, post-execution delivery orchestration |
| Result parser | PracticeDetectionResultParser | Parses agent JSON, validates and normalizes slugs, deduplicates by practice (highest confidence wins). Never throws -- failures go to discarded list |
| Delivery composer | DeliveryComposer | Inline-first rendering: inlinable findings (with file locations) become compact MR summary entries + full diff notes; non-inlinable findings get full detail in MR summary |
| Diff validator | DiffHunkValidator | Validates diff note line positions against actual diff hunks. Snaps invalid positions to nearest valid line (TreeSet.floor/ceiling) |
| Feedback service | FeedbackDeliveryService | Posts MR summary comment and diff notes to the git provider. Suppresses delivery for closed, merged, draft, or opted-out PRs |
| Bot command | BotCommandProcessor | Listens for /hephaestus review comments (via Spring @TransactionalEventListener) to retrigger reviews |
| Job executor | AgentJobExecutor | NATS pull consumer: claims jobs with SKIP LOCKED, dispatches to sandbox executor pool, persists results in micro-transactions |
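The result parser's slug normalization and deduplication can be illustrated with a short Python sketch. This is not the project's Java implementation: the function names are hypothetical, the normalization is assumed to be lower-casing plus underscore-to-hyphen replacement, and dropping the lower-confidence duplicate silently (rather than recording it as discarded) is an assumption.

```python
from typing import Any


def normalize_slug(raw: str) -> str:
    """Lower-case the slug and replace underscores with hyphens."""
    return raw.lower().replace("_", "-")


def dedupe_by_practice(findings: list[Any]) -> tuple[list[dict[str, Any]], list[Any]]:
    """Return (kept, discarded). Malformed entries are collected, never raised."""
    best: dict[str, dict[str, Any]] = {}
    discarded: list[Any] = []
    for finding in findings:
        if not isinstance(finding, dict):
            discarded.append(finding)
            continue
        slug = finding.get("practiceSlug")
        confidence = finding.get("confidence")
        # bool is a subclass of int in Python, so reject it explicitly.
        if (
            not isinstance(slug, str)
            or isinstance(confidence, bool)
            or not isinstance(confidence, (int, float))
        ):
            discarded.append(finding)
            continue
        key = normalize_slug(slug)
        current = best.get(key)
        # Highest confidence wins; on a tie the first finding seen is kept (assumption).
        if current is None or confidence > current["confidence"]:
            best[key] = {**finding, "practiceSlug": key}
    return list(best.values()), discarded
```

Returning the discarded entries alongside the kept ones lets the caller log or inspect malformed agent output without interrupting delivery.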
Agent architecture
A single backend powers practice detection: the Pi practice agent, wired via PracticePiAdapter on top of the shared PiRuntimeFactory. It uses a single-pass architecture — one agent evaluates all relevant practices for the diff and persists structured findings via custom Pi tools (report_finding, set_review_summary).
The runner (.run-pi.mjs) drives the Pi SDK in-process: initial analysis with a soft-timeout steering message and a hard-timeout abort, followed by a format retry if the persisted output is incomplete. A top-level watchdog hard-exits the process at AGENT_BUDGET_MS + 30s so the orchestrator always observes a terminal state.
Workspace layout
Every agent container gets this file structure (agent-specific files noted):
/workspace/
  repo/                              # Git repository (read-only bind mount)
  context/target/
    metadata.json                    # PR title, body, author, branches, stats
    comments.json                    # Latest 500 review comments
    diff.patch                       # Unified diff with [L<n>] annotations
    diff_stat.txt                    # Changed files summary
    diff_summary.md                  # Per-file diff chunks with index table
    contributor_history.json         # Prior findings for this author (optional)
  .practices/
    index.json                       # [{slug, name, category}]
    {slug}.md                        # Per-practice criteria (generated from DB)
    all-criteria.md                  # All criteria bundled (reduces tool calls)
  .precompute/practices/{slug}.ts    # Precompute scripts (from DB, if present)
  .precompute-out/                   # Precompute output (summary.md + per-practice JSON)
  task.json                          # TaskEnvelope<PracticeReviewTask>: prompt, jobId, workspaceId
  .pi/                               # Pi SDK agent dir ($PI_CODING_AGENT_DIR)
    AGENTS.md                        # Pi orchestrator instructions
    settings.json                    # Pi SDK configuration (provider, model, compaction)
    extensions/                      # Custom provider extensions (auto-discovered)
  .run-pi.mjs                        # Runner script
  .analysis/practices/.gitkeep       # Directory for intermediate findings
  .output/                           # Agent writes final results here
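To show how the practice files in this layout fit together, here is a minimal Python sketch that reads index.json and attaches each practice's criteria text. It is only an illustration of the file relationships, not how the Pi agent actually loads them. The function name load_practices is hypothetical, and the directory is passed in explicitly so the sketch does not depend on where .practices/ is mounted.

```python
import json
from pathlib import Path


def load_practices(practices_dir: Path) -> list[dict]:
    """Read index.json and attach each practice's criteria text, if its file exists."""
    index = json.loads((practices_dir / "index.json").read_text(encoding="utf-8"))
    practices = []
    for entry in index:
        # Each index entry is expected to look like {slug, name, category}.
        criteria_file = practices_dir / f"{entry['slug']}.md"
        criteria = (
            criteria_file.read_text(encoding="utf-8")
            if criteria_file.is_file()
            else None
        )
        practices.append({**entry, "criteria": criteria})
    return practices
```

Reading all-criteria.md instead would replace the per-slug file reads with a single one, which is the reason the layout notes give for bundling it.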
Output schema
The agent returns a JSON object with a findings array:
{
  "findings": [
    {
      "practiceSlug": "hardcoded-secrets",
      "title": "API key exposed in source",
      "verdict": "NEGATIVE",
      "severity": "CRITICAL",
      "confidence": 0.95,
      "evidence": {
        "locations": [{ "path": "Config.swift", "startLine": 9, "endLine": 9 }],
        "snippets": ["private let apiToken = \"ghp_abc123\""]
      },
      "reasoning": "Hardcoded credential on +line...",
      "guidance": "Delete the line and use environment variables...",
      "suggestedDiffNotes": [
        {
          "filePath": "Config.swift",
          "startLine": 9,
          "endLine": 9,
          "body": "Delete this credential..."
        }
      ]
    }
  ]
}
Verdicts: POSITIVE (good practice), NEGATIVE (violation), NOT_APPLICABLE (practice irrelevant to this diff).
Severities: CRITICAL, MAJOR, MINOR, INFO -- defined per practice in the criteria files.
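A field-level check against these enumerations might look like the following Python sketch. The function name validation_errors is hypothetical; treating severity as optional and confidence as a number in [0, 1] are assumptions that the schema description does not state explicitly.

```python
# Tuples rather than sets, so unhashable values (e.g. a list) compare safely.
VERDICTS = ("POSITIVE", "NEGATIVE", "NOT_APPLICABLE")
SEVERITIES = ("CRITICAL", "MAJOR", "MINOR", "INFO")


def validation_errors(finding: dict) -> list[str]:
    """Return a list of problems with one finding; an empty list means it passed."""
    errors = []
    slug = finding.get("practiceSlug")
    if not isinstance(slug, str) or not slug:
        errors.append("practiceSlug must be a non-empty string")
    if finding.get("verdict") not in VERDICTS:
        errors.append("verdict must be one of " + ", ".join(VERDICTS))
    # Assumption: severity may be absent, but if present it must be a known value.
    if "severity" in finding and finding["severity"] not in SEVERITIES:
        errors.append("severity must be one of " + ", ".join(SEVERITIES))
    confidence = finding.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    # Assumption: confidence is a probability-like score between 0 and 1.
    if not is_number or not 0 <= confidence <= 1:
        errors.append("confidence must be a number between 0 and 1")
    return errors
```

Returning the problems as data instead of raising keeps one malformed finding from invalidating the rest of the agent's output.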
Practices
Practices are stored in the database (practice table, criteria column). At runtime, the handler generates .practices/{slug}.md files from the DB criteria and injects them into the agent workspace. Each practice defines:
- What to look for
- Severity classification rules
- False-positive exclusions
The current deployment uses 13 practices (12 software engineering + hardcoded-secrets). Practices are fully configurable per workspace and can be added or modified without code changes.
Delivery pipeline
After the agent returns findings, the server runs a 6-step delivery pipeline in PullRequestReviewHandler.deliver():
- Parse -- PracticeDetectionResultParser validates all fields, normalizes slugs (toLowerCase + replace _ with -), deduplicates by practice (highest confidence wins), and collects suggestedDiffNotes from NEGATIVE findings. Malformed entries are captured in a discarded list (never throws).
- Filter by diff scope -- filterByDiffScope removes findings whose evidence locations don't intersect the actual diff. Prevents hallucinated findings about unchanged code.
- Persist -- Validated findings are saved as PracticeFinding entities in the database.
- Compose -- DeliveryComposer partitions findings into:
  - Inlinable (have file locations, not in internal paths like context/target/, practice not in NON_INLINABLE_PRACTICES) -- compact list in MR summary, full detail in diff notes
  - Non-inlinable (mr-description-quality, commit-discipline, or no file location) -- full detail in MR summary
  - When all findings are positive, composes a short approval comment naming top positive practices
- Validate positions -- DiffHunkValidator parses the unified diff to extract valid new-side line numbers per file. Invalid positions are snapped to the nearest valid line (TreeSet.floor/ceiling).
- Post -- FeedbackDeliveryService checks suppression conditions (PR closed, merged, draft, or author opted out) and, if not suppressed, posts the MR summary comment (with an HTML marker <!-- hephaestus:practice-review:{jobId} --> for identification) and inline diff notes to the git provider's API. On re-runs, DiffNotePoster first deletes old diff notes bearing the <!-- hephaestus-diff-note --> marker to prevent accumulation.
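The position-validation step can be sketched in Python as follows. This is an illustration rather than the DiffHunkValidator code: it uses a sorted list with bisect in place of Java's TreeSet.floor/ceiling, it works on a plain unified diff (without the [L<n>] annotations), and it assumes that both added and context lines count as valid new-side positions and that ties snap to the lower line. The function names are hypothetical.

```python
import re
from bisect import bisect_left
from typing import Optional

# Groups: old-side count, new-side start, new-side count (counts default to 1).
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def valid_new_lines(diff_text: str) -> dict[str, list[int]]:
    """Map each file to the sorted new-side line numbers that appear in its hunks."""
    valid: dict[str, list[int]] = {}
    path: Optional[str] = None
    old_left = new_left = new_line = 0
    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:  # inside a hunk body
            if line.startswith("+"):
                if path is not None:
                    valid[path].append(new_line)
                new_line += 1
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif not line.startswith("\\"):  # context line, not "\ No newline..."
                if path is not None:
                    valid[path].append(new_line)
                new_line += 1
                new_left -= 1
                old_left -= 1
            continue
        if line.startswith("+++ "):
            target = line[4:].split("\t")[0].strip()
            if target == "/dev/null":  # deleted file: no new-side lines
                path = None
            else:
                path = target[2:] if target.startswith("b/") else target
                valid.setdefault(path, [])
            continue
        match = HUNK_HEADER.match(line)
        if match:
            old_left = int(match.group(1) or 1)
            new_line = int(match.group(2))
            new_left = int(match.group(3) or 1)
    return valid


def snap(valid: list[int], line: int) -> Optional[int]:
    """Return the valid line nearest to `line`, or None if the file has none."""
    if not valid:
        return None
    i = bisect_left(valid, line)
    ceiling = valid[i] if i < len(valid) else None  # smallest value >= line
    floor = valid[i - 1] if i > 0 else None         # largest value < line
    if ceiling == line:
        return line
    if floor is None:
        return ceiling
    if ceiling is None:
        return floor
    return floor if line - floor <= ceiling - line else ceiling
```

Tracking the remaining line counts from each hunk header, rather than matching prefixes alone, avoids mistaking a removed line that begins with "-- " for a file header.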
Bot command
Students can type /hephaestus review in an MR comment to retrigger a review. The flow:
- GitLabNoteMessageHandler detects the command prefix and publishes a BotCommandReceivedEvent
- BotCommandProcessor listens asynchronously, validates the PR state, evaluates the detection gate, and submits a new review job
This uses Spring's event system to avoid a module dependency cycle between gitprovider and agent.
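The decoupling idea can be shown with a small, framework-free Python sketch. It is not Spring code: the EventBus class, the handle_note function, and the event's fields are hypothetical, and delivery here is synchronous, whereas the real listener runs asynchronously. The point is only that the publishing side never imports the consuming side.

```python
from dataclasses import dataclass
from typing import Callable

COMMAND_PREFIX = "/hephaestus review"


@dataclass(frozen=True)
class BotCommandReceivedEvent:
    # Hypothetical fields; the real event's payload is not described here.
    pull_request_id: int
    author: str


class EventBus:
    """Minimal publish/subscribe hub standing in for the framework's event system."""

    def __init__(self) -> None:
        self._listeners: list[Callable[[BotCommandReceivedEvent], None]] = []

    def subscribe(self, listener: Callable[[BotCommandReceivedEvent], None]) -> None:
        self._listeners.append(listener)

    def publish(self, event: BotCommandReceivedEvent) -> None:
        for listener in self._listeners:
            listener(event)


def handle_note(bus: EventBus, pull_request_id: int, author: str, body: str) -> bool:
    """Git-provider side: publish an event if the comment starts with the command."""
    if not body.strip().startswith(COMMAND_PREFIX):
        return False
    bus.publish(BotCommandReceivedEvent(pull_request_id, author))
    return True
```

The agent side would register its own listener with subscribe, so both modules depend only on the event type and the bus, not on each other.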
Database schema
Key tables for code review:
| Table | Key Columns | Purpose |
|---|---|---|
| agent_config | name, agent_type, model_name, model_version, enabled, llm_api_key (encrypted), llm_provider, credential_mode, timeout_seconds, max_concurrent_jobs, allow_internet | LLM backend configuration per workspace |
| agent_job | status, idempotency_key, job_token (encrypted), config_snapshot (JSONB), delivery_status, llm_* usage columns | Job lifecycle: QUEUED → RUNNING → COMPLETED/FAILED. Tracks container ID, exit code, LLM cost |
| practice | slug, name, description (TEXT, NOT NULL), category, criteria (TEXT), trigger_events (JSONB), precompute_script (TEXT), is_active, workspace_id | Practice definitions. Unique constraint on (workspace_id, slug) |
| practice_finding | idempotency_key, title, verdict, severity, confidence, evidence (JSONB), reasoning, guidance, agent_job_id, practice_id, target_type, target_id, contributor_id, detected_at | Individual findings per target (PR) per practice |
Configuration
Application properties
hephaestus:
  agent:
    image:
      reference: ghcr.io/ls1intum/hephaestus/agent-pi:latest # dev; prod overrides via docker/agent-image-pin.env
      pull-policy: IF_NOT_PRESENT
      # require-digest: true # set only in application-prod.yml
    nats:
      enabled: true # Enable agent job processing
      server: nats://localhost:4222
  sandbox:
    llm-proxy-port: 38080 # Must match server port
    docker-host: unix:///var/run/docker.sock
  git:
    enabled: true
    storage-path: /tmp/hephaestus-git-repos
Production pins the agent image via docker/agent-image-pin.env; see Agent image digests.
Dev trigger
For development, enable the REST endpoint to manually trigger reviews:
hephaestus:
  dev:
    trigger-enabled: true
Then trigger with:
curl -X POST "http://localhost:${SERVER_PORT}/api/dev/trigger-review?prId=123&workspaceId=1"
The port must match your SERVER_PORT environment variable (default: 38080 in .env).
Adding a new practice
- Insert a row in the practice table with all required fields: slug, name, description, workspace_id, and trigger_events (JSONB array of event names like ["PULL_REQUEST_OPENED", "PULL_REQUEST_UPDATED"])
- Set the criteria column with the evaluation criteria text (Markdown). If omitted, the description column is used as fallback
- Set category for grouping (e.g., security, reliability, design, process)
- No code changes needed -- the handler generates {slug}.md from the DB criteria and the agent reads practices dynamically from index.json
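As a rough illustration of these steps, the Python sketch below inserts a practice row and resolves the criteria fallback. It uses an in-memory sqlite3 database as a stand-in for the real one, so the column types, the is_active default, the helper names, and the sample practice are all assumptions. The actual schema is owned by the application, and trigger_events is JSONB there rather than JSON text.

```python
import json
import sqlite3
from typing import Optional


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a simplified stand-in for the practice table (types are assumptions)."""
    conn.execute(
        """
        CREATE TABLE practice (
            id INTEGER PRIMARY KEY,
            workspace_id INTEGER NOT NULL,
            slug TEXT NOT NULL,
            name TEXT NOT NULL,
            description TEXT NOT NULL,
            category TEXT,
            criteria TEXT,
            trigger_events TEXT NOT NULL,  -- JSON text here; JSONB in the real table
            is_active INTEGER NOT NULL DEFAULT 1,
            UNIQUE (workspace_id, slug)
        )
        """
    )


def add_practice(
    conn: sqlite3.Connection,
    workspace_id: int,
    slug: str,
    name: str,
    description: str,
    trigger_events: list[str],
    category: Optional[str] = None,
    criteria: Optional[str] = None,
) -> None:
    """Insert one practice; raises sqlite3.IntegrityError on a duplicate slug."""
    conn.execute(
        "INSERT INTO practice (workspace_id, slug, name, description, category,"
        " criteria, trigger_events) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (workspace_id, slug, name, description, category, criteria,
         json.dumps(trigger_events)),
    )


def effective_criteria(conn: sqlite3.Connection, workspace_id: int, slug: str) -> str:
    """Return criteria, falling back to description when criteria is NULL."""
    row = conn.execute(
        "SELECT COALESCE(criteria, description) FROM practice"
        " WHERE workspace_id = ? AND slug = ?",
        (workspace_id, slug),
    ).fetchone()
    return row[0]
```

The UNIQUE constraint mirrors the one on (workspace_id, slug), so the same slug can be defined independently in different workspaces.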
Extending to new languages
The Pi orchestrator instructions (pi-orchestrator.md, mounted at .pi/AGENTS.md) are language-agnostic. Language-specific guidance lives in the practice criteria. To support a new language:
- Write new practice criteria targeting the language's patterns and insert them in the practice table
- The agent orchestrator file is language-agnostic -- no changes needed unless the language requires special analysis strategies
- Practice criteria (in DB) may benefit from language-specific examples; precompute scripts (also in DB) often need per-language regex tweaks