AtlasML Deployment Guide

This guide covers deployment workflows, CI/CD integration, and production deployment strategies for AtlasML.

Deployment Overview

AtlasML uses a containerized deployment model with GitHub Actions for CI/CD:

GitHub Actions Workflows

1. Build and Push Docker Image

File: .github/workflows/atlas_build-and-push-docker.yml

This workflow automatically builds and pushes Docker images when code changes.

Trigger Conditions

on:
  push:
    paths:
      - 'atlas/**'
      - '.github/workflows/atlas_build-and-push-docker.yml'
  release:
    types:
      - created

Triggers on:

Any push to atlas/ directory
Workflow file changes
New GitHub release

What It Does

Builds Docker image using /atlas/AtlasMl/Dockerfile
Tags image with:
- Branch name (e.g., main, feature-branch)
- PR number (e.g., pr-123)
- Release version (e.g., v1.2.0)
Pushes to GitHub Container Registry (ghcr.io)

Usage

Automatic (on push):

# Make changes
git add atlas/AtlasMl/
git commit -m "Update AtlasML"
git push

# GitHub Actions automatically builds and pushes image
# Result: ghcr.io/ls1intum/edutelligence/atlasml:main

Manual trigger:

Go to GitHub Actions tab
Select "AtlasML - Build Docker Images"
Click "Run workflow"
Select branch
Click "Run workflow"

2. Deploy to Test Environment

File: .github/workflows/atlas_deploy-test.yml

Manual deployment workflow for test/staging environments.

Workflow Configuration

name: AtlasML - Deploy to Test 1

on:
  workflow_dispatch:
    inputs:
      image-tag:
        type: string
        description: 'Image tag to deploy (default: pr-<number> if PR exists, latest for default branch)'
      deploy-atlasml:
        type: boolean
        default: true
        description: (Re-)deploys AtlasML.

Required GitHub Configuration

The workflow uses the organization's standard naming conventions for secrets and variables.

Environment Variables (in GitHub Environment Atlas - Test 1):

VM_HOST - Target server hostname/IP
VM_USERNAME - SSH username

Environment Secrets (in GitHub Environment Atlas - Test 1):

VM_SSH_PRIVATE_KEY - SSH private key for server access
All application secrets (e.g., WEAVIATE_HOST, WEAVIATE_API_KEY, OPENAI_API_KEY, etc.)

tip

The reusable deployment workflow automatically creates the .env file from all secrets configured in the GitHub environment, excluding infrastructure secrets like VM_SSH_PRIVATE_KEY.

Steps

Step 1: Provision Environment

Sets up Traefik directories and SSL certificate file on the remote server:

- name: Setup Traefik on remote host
  uses: appleboy/ssh-action@v1.2.4
  with:
    host: ${{ vars.VM_HOST }}
    username: ${{ vars.VM_USERNAME }}
    key: ${{ secrets.VM_SSH_PRIVATE_KEY }}
    script: |
      sudo mkdir -p /opt/atlasml/traefik
      sudo touch /opt/atlasml/traefik/acme.json
      sudo chmod 600 /opt/atlasml/traefik/acme.json

note

The .env file is automatically created by the reusable deployment workflow using secrets and variables configured in the GitHub environment. You don't need to manually write environment variables.

Step 2: Deploy AtlasML

Uses the organization's reusable workflow to deploy:

- name: Deploy AtlasML
  uses: ls1intum/.github/.github/workflows/deploy-docker-compose.yml@main
  with:
    environment: 'Atlas - Test 1'
    docker-compose-file: './atlas/docker-compose.prod.yml'
    main-image-name: ls1intum/edutelligence/atlasml
    image-tag: ${{ inputs.image-tag }}
    deployment-base-path: '/opt/atlasml'
  secrets: inherit

How to Deploy

Go to GitHub Actions
- Navigate to repository → Actions tab
Select Workflow
- Click "AtlasML - Deploy to Test 1"
Run Workflow
- Click "Run workflow" button
- Select branch (usually main)
- Enter image tag:
  - Branch name: main
  - PR number: pr-123
  - Version: v1.2.0
- Click "Run workflow"
Monitor Progress
- Watch workflow execution in real-time
- Check logs for any errors

Example:

Image tag: main
Deploy AtlasML: ✓ (checked)

Deployment Strategies

Strategy 1: Continuous Deployment (Dev/Test)

For development and test environments, deploy automatically on every push.

# Add to .github/workflows/atlas_deploy-test.yml
on:
  push:
    branches:
      - develop
    paths:
      - 'atlas/**'

Flow:

Code Push → Auto Build → Auto Deploy to Test

Strategy 2: Manual Deployment (Staging)

For staging, require manual approval.

jobs:
  deploy:
    environment:
      name: staging
      # Requires manual approval

Flow:

Code Push → Auto Build → Manual Approval → Deploy to Staging

Strategy 3: Release-Based Deployment (Production)

For production, only deploy tagged releases.

on:
  release:
    types:
      - published

Flow:

Create Release → Auto Build → Manual Deploy with Version Tag → Production

Production Deployment Process

Step-by-Step Production Deployment

1. Prepare Release

# Ensure main branch is stable
git checkout main
git pull origin main

# Run tests locally
cd atlas/AtlasMl
poetry run pytest

# Check for linting issues
poetry run ruff check .
poetry run black --check .

2. Create GitHub Release

# Create and push tag
git tag -a v1.2.0 -m "Release version 1.2.0"
git push origin v1.2.0

Or via GitHub UI:

Go to Releases → Draft a new release
Choose tag: Create new tag v1.2.0
Title: AtlasML v1.2.0
Description: Release notes
Click "Publish release"

3. Wait for Image Build

GitHub Actions automatically:

Builds Docker image
Tags as v1.2.0
Pushes to ghcr.io/ls1intum/edutelligence/atlasml:v1.2.0

Monitor at: Actions → "AtlasML - Build Docker Images"

4. Deploy to Production

Go to Actions → "AtlasML - Deploy to Production"
Click "Run workflow"
Enter image tag: v1.2.0
Click "Run workflow"
Monitor deployment logs

5. Verify Deployment

# SSH to production server
ssh production-server

# Check container status
docker ps | grep atlasml

# Check health
curl http://localhost/api/v1/health

# Check logs
docker logs atlasml --tail 50

# Test endpoint
curl -X POST http://localhost/api/v1/competency/suggest \
  -H "Authorization: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"description":"Python programming","course_id":1}'

6. Monitor Post-Deployment

Check Sentry for errors (if configured)
Monitor logs for first hour
Verify Artemis integration working
Check response times

Rollback Procedure

If deployment fails or issues are discovered:

Quick Rollback

# SSH to server
ssh production-server

cd /opt/atlasml

# Set previous version in .env
echo "IMAGE_TAG=v1.1.0" > .env.new
cat .env >> .env.new
mv .env.new .env

# Pull and restart
docker-compose -f docker-compose.prod.yml pull
docker-compose -f docker-compose.prod.yml up -d

# Verify
docker logs atlasml
curl http://localhost/api/v1/health

Rollback via GitHub Actions

Go to Actions → "AtlasML - Deploy to Production"
Click "Run workflow"
Enter previous image tag: v1.1.0
Click "Run workflow"

Zero-Downtime Deployment

For high-availability setups:

Using Load Balancer

Process:

Deploy new version to Instance 2
Wait for health check to pass
Load balancer routes traffic to Instance 2
Update Instance 1
Both instances now running new version

Using Docker Compose

# docker-compose.prod.yml
services:
  atlasml:
    image: ghcr.io/ls1intum/edutelligence/atlasml:${IMAGE_TAG}
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first

Deploy:

docker-compose -f docker-compose.prod.yml up -d --no-deps --build atlasml

This updates containers one at a time, ensuring at least one is always running.

Environment-Specific Configurations

Development

# .env.dev
ENV=development
ATLAS_API_KEYS=dev-key
WEAVIATE_HOST=localhost
OPENAI_API_KEY=dev-key

Staging

# .env.staging
ENV=staging
ATLAS_API_KEYS=staging-key-1,staging-key-2
WEAVIATE_HOST=weaviate-staging.internal
OPENAI_API_KEY=staging-key
SENTRY_DSN=https://...@sentry.../staging

Production

# .env.production
ENV=production
ATLAS_API_KEYS=prod-key-1,prod-key-2,prod-key-3
WEAVIATE_HOST=weaviate.internal
OPENAI_API_KEY=prod-key
SENTRY_DSN=https://...@sentry.../production

Automated Deployment with CI/CD

Full CI/CD Pipeline

# .github/workflows/atlas_ci_cd.yml
name: AtlasML CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.13'
      - name: Install dependencies
        run: |
          cd atlas/AtlasMl
          pip install poetry
          poetry install
      - name: Run tests
        run: |
          cd atlas/AtlasMl
          poetry run pytest
      - name: Lint
        run: |
          cd atlas/AtlasMl
          poetry run ruff check .

  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build and push Docker image
        # Build image

  deploy-test:
    needs: build
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to test
        # Deploy to test environment

  deploy-prod:
    needs: build
    if: github.event_name == 'release'
    runs-on: ubuntu-latest
    environment:
      name: production
    steps:
      - name: Deploy to production
        # Deploy to production

Deployment Checklist

Pre-Deployment

During Deployment

Post-Deployment

Deployment Troubleshooting

Issue: Image Pull Fails

Error: Error response from daemon: pull access denied

Solution:

# Login to GHCR
echo $GITHUB_TOKEN | docker login ghcr.io -u USERNAME --password-stdin

# Or ensure image is public

Issue: Container Starts But Unhealthy

Check:

# View health check logs
docker inspect atlasml | grep -A 10 Health

# Test endpoint manually
docker exec atlasml curl http://localhost:8000/api/v1/health

# Check application logs
docker logs atlasml

Common causes:

Weaviate connection failed
Missing environment variables
OpenAI API key invalid

Issue: Old Version Still Running

Solution:

# Force pull new image
docker-compose -f docker-compose.prod.yml pull

# Remove old container
docker-compose -f docker-compose.prod.yml down

# Start with new image
docker-compose -f docker-compose.prod.yml up -d

# Verify version (check logs for startup message)
docker logs atlasml | grep "Starting"

Monitoring Deployments

Deployment Metrics

Track these metrics during and after deployment:

Response Time: Should remain stable
Error Rate: Should not increase
Memory Usage: Check for memory leaks
CPU Usage: Should be within normal range

Using Sentry

If Sentry is configured, monitor:

# Check Sentry dashboard at
https://sentry.io/organizations/your-org/issues/

# Filter by:
- Environment: production
- Time: Last hour
- Release: v1.2.0

Using Docker Stats

# Monitor resource usage
docker stats atlasml

# Output:
# CONTAINER  CPU %  MEM USAGE/LIMIT    MEM %
# atlasml    2.5%   256MB/2GB          12.8%

Next Steps

Configuration: Manage environment variables and secrets
Monitoring: Set up comprehensive monitoring
Troubleshooting: Resolve production issues

Resources

GitHub Actions Documentation: https://docs.github.com/en/actions
Docker Compose: https://docs.docker.com/compose/
GHCR: https://docs.github.com/en/packages
SSH Actions: https://github.com/appleboy/ssh-action

Deployment Overview​

GitHub Actions Workflows​

1. Build and Push Docker Image​

Trigger Conditions​

What It Does​

Usage​

2. Deploy to Test Environment​

Workflow Configuration​

Required GitHub Configuration​

Steps​

How to Deploy​

Deployment Strategies​

Strategy 1: Continuous Deployment (Dev/Test)​

Strategy 2: Manual Deployment (Staging)​

Strategy 3: Release-Based Deployment (Production)​

Production Deployment Process​

Step-by-Step Production Deployment​

1. Prepare Release​

2. Create GitHub Release​

3. Wait for Image Build​

4. Deploy to Production​

5. Verify Deployment​

6. Monitor Post-Deployment​

Rollback Procedure​

Quick Rollback​

Rollback via GitHub Actions​

Zero-Downtime Deployment​

Using Load Balancer​

Using Docker Compose​

Environment-Specific Configurations​

Development​

Staging​

Production​

Automated Deployment with CI/CD​

Full CI/CD Pipeline​

Deployment Checklist​

Pre-Deployment​

During Deployment​

Post-Deployment​

Deployment Troubleshooting​

Issue: Image Pull Fails​

Issue: Container Starts But Unhealthy​

Issue: Old Version Still Running​

Monitoring Deployments​

Deployment Metrics​

Using Sentry​

Using Docker Stats​

Next Steps​

Resources​

Deployment Overview

GitHub Actions Workflows

1. Build and Push Docker Image

Trigger Conditions

What It Does

Usage

2. Deploy to Test Environment

Workflow Configuration

Required GitHub Configuration

Steps

How to Deploy

Deployment Strategies

Strategy 1: Continuous Deployment (Dev/Test)

Strategy 2: Manual Deployment (Staging)

Strategy 3: Release-Based Deployment (Production)

Production Deployment Process

Step-by-Step Production Deployment

1. Prepare Release

2. Create GitHub Release

3. Wait for Image Build

4. Deploy to Production

5. Verify Deployment

6. Monitor Post-Deployment

Rollback Procedure

Quick Rollback

Rollback via GitHub Actions

Zero-Downtime Deployment

Using Load Balancer

Using Docker Compose

Environment-Specific Configurations

Development

Staging

Production

Automated Deployment with CI/CD

Full CI/CD Pipeline

Deployment Checklist

Pre-Deployment

During Deployment

Post-Deployment

Deployment Troubleshooting

Issue: Image Pull Fails

Issue: Container Starts But Unhealthy

Issue: Old Version Still Running

Monitoring Deployments

Deployment Metrics

Using Sentry

Using Docker Stats

Next Steps

Resources