AtlasML Monitoring Guide
This guide covers how to monitor AtlasML in production, including health checks, logging, metrics, error tracking, and alerting.
Monitoring Overview
Effective monitoring ensures AtlasML runs reliably in production. The sections below cover health checks, logging, resource monitoring, error tracking with Sentry, performance monitoring, and alerting.
Health Checks
Built-in Health Endpoint
AtlasML provides a health check endpoint:
# Check health
curl http://localhost/api/v1/health
# Expected response
[]
# HTTP 200 = healthy
# Non-200 = unhealthy
What it checks:
- Application is running
- Can respond to requests
- Basic connectivity
Does NOT check:
- Weaviate connectivity
- OpenAI API availability
- Database integrity
Docker Health Check
Configured in docker-compose.prod.yml:
healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; import sys; sys.exit(0 if urllib.request.urlopen('http://localhost:8000/api/v1/health').getcode() == 200 else 1)"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 30s
The health check uses Python's urllib instead of curl because curl is not installed in the slim Python Docker image. Using the interpreter that is already in the image keeps the check self-contained and avoids installing extra packages just for the health probe.
Parameters:
- interval: Check every 30 seconds
- timeout: Fail if no response within 10 seconds
- retries: Mark unhealthy after 3 consecutive failures
- start_period: Grace period during container startup
Check status:
# View health status
docker ps
# Output:
# CONTAINER IMAGE STATUS
# atlasml ... Up 5 minutes (healthy)
# If unhealthy:
# atlasml ... Up 5 minutes (unhealthy)
View health check logs:
docker inspect atlasml | jq '.[0].State.Health'
Output:
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2025-01-15T10:00:00Z",
      "End": "2025-01-15T10:00:01Z",
      "ExitCode": 0,
      "Output": "[]"
    }
  ]
}
Custom Health Checks
For more comprehensive checks, add custom endpoints:
Check Weaviate connectivity:
curl http://localhost/api/v1/health/weaviate
Check OpenAI API:
curl http://localhost/api/v1/health/openai
These endpoints do not exist by default; see the Developer Guide for how to add them.
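As a rough sketch of what such an endpoint could look like, the check below proxies Weaviate's standard readiness probe (/v1/.well-known/ready). The router prefix, Weaviate address, and response shape are illustrative assumptions, not AtlasML's actual code:
# health_weaviate.py -- hypothetical extra health endpoint (adjust paths and URLs)
import urllib.request

from fastapi import APIRouter, Response

router = APIRouter(prefix="/api/v1/health")

# Assumed Weaviate address; match the host/port used by your deployment
WEAVIATE_READY_URL = "http://weaviate:8080/v1/.well-known/ready"

@router.get("/weaviate")
def weaviate_health(response: Response):
    """Return 200 if Weaviate answers its readiness probe, 503 otherwise."""
    try:
        with urllib.request.urlopen(WEAVIATE_READY_URL, timeout=5) as resp:
            healthy = resp.getcode() == 200
    except OSError:
        healthy = False
    if not healthy:
        response.status_code = 503
    return {"weaviate": "healthy" if healthy else "unhealthy"}
Register the router on the FastAPI app (app.include_router(router)), and an external monitor can then call it just like the built-in endpoint.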
Logging
Application Logs
View logs:
# Real-time logs
docker logs -f atlasml
# Last 100 lines
docker logs --tail 100 atlasml
# Since specific time
docker logs --since "2025-01-15T10:00:00" atlasml
# With timestamps
docker logs -f --timestamps atlasml
Example log output:
2025-01-15 10:30:45 INFO: Started server process [1]
2025-01-15 10:30:45 INFO: Waiting for application startup.
2025-01-15 10:30:45 INFO: Application startup complete.
2025-01-15 10:30:45 INFO: Uvicorn running on http://0.0.0.0:8000
2025-01-15 10:31:12 INFO: POST /api/v1/competency/suggest HTTP/1.1 200
2025-01-15 10:31:15 INFO: GET /api/v1/health HTTP/1.1 200
Log Levels
AtlasML uses standard Python logging levels:
| Level | When to Use |
|---|---|
| DEBUG | Detailed debugging info |
| INFO | General informational messages |
| WARNING | Warning messages (non-critical issues) |
| ERROR | Error messages (failures) |
| CRITICAL | Critical failures (service down) |
Filter by level:
# Only errors
docker logs atlasml 2>&1 | grep ERROR
# Only warnings and errors
docker logs atlasml 2>&1 | grep -E "(WARNING|ERROR)"
Log Rotation
Configured in docker-compose.prod.yml:
logging:
  driver: 'json-file'
  options:
    max-size: '50m'
    max-file: '5'
Settings:
- max-size: 50MB per log file
- max-file: Keep 5 files
- Total max: 250MB of logs
Benefits:
- Prevents disk space issues
- Automatic cleanup
- Maintains history
Check log files:
# Find log files
sudo ls -lh /var/lib/docker/containers/$(docker ps -q --no-trunc -f name=atlasml)/*.log
# Total size
sudo du -sh /var/lib/docker/containers/$(docker ps -q --no-trunc -f name=atlasml)
Centralized Logging
For production, send logs to a centralized system:
Option 1: Syslog
# docker-compose.prod.yml
logging:
  driver: syslog
  options:
    syslog-address: "tcp://syslog.company.com:514"
    tag: "atlasml"
Option 2: Fluentd
logging:
  driver: fluentd
  options:
    fluentd-address: "fluentd.company.com:24224"
    tag: "atlasml"
Option 3: ELK Stack (Elasticsearch, Logstash, Kibana)
logging:
  driver: gelf
  options:
    gelf-address: "udp://logstash.company.com:12201"
    tag: "atlasml"
Option 4: CloudWatch (AWS)
logging:
  driver: awslogs
  options:
    awslogs-region: "us-east-1"
    awslogs-group: "atlasml"
    awslogs-stream: "production"
Resource Monitoring
Container Resource Usage
Real-time stats:
docker stats atlasml
# Output:
# CONTAINER CPU % MEM USAGE/LIMIT MEM % NET I/O BLOCK I/O
# atlasml 2.5% 256MB/2GB 12.8% 1.2MB/500KB 10MB/5MB
Monitor continuously:
# Stream stats continuously (streaming is the default; --no-stream=false makes it explicit)
docker stats atlasml --no-stream=false
# Format output
docker stats atlasml --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}"
Set Resource Limits
Prevent resource exhaustion:
# docker-compose.prod.yml
services:
  atlasml:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 512M
Limits:
- cpus: Max 2 CPU cores
- memory: Max 2GB RAM
Reservations:
- cpus: Guaranteed 1 CPU core
- memory: Guaranteed 512MB RAM
Monitor against limits:
docker stats atlasml --no-stream
# If MEM % approaches 100%, increase limit or optimize code
Disk Usage
Check container disk usage:
docker system df
# Output:
# TYPE TOTAL ACTIVE SIZE RECLAIMABLE
# Images 10 2 5.2GB 3.1GB (59%)
# Containers 5 2 100MB 50MB (50%)
# Local Volumes 3 2 2GB 500MB (25%)
Check Weaviate data volume:
docker volume inspect weaviate-data | jq '.[0].Mountpoint'
sudo du -sh $(docker volume inspect weaviate-data | jq -r '.[0].Mountpoint')
Clean up unused resources:
# Remove unused images
docker image prune -a
# Remove unused volumes (CAREFUL!)
docker volume prune
# Full cleanup
docker system prune -a --volumes
Error Tracking with Sentry
Setup Sentry
- Create account: https://sentry.io
- Create project: Select "Python" → "FastAPI"
- Get DSN: Copy from project settings
Configure in .env:
SENTRY_DSN=https://YOUR_SENTRY_KEY@YOUR_ORG.ingest.sentry.io/YOUR_PROJECT_ID
ENV=production
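AtlasML reads these variables at startup. If you are wiring Sentry into your own build, initialization with the official SDK typically looks like the sketch below; the sample rate and the guard on a missing DSN are assumptions, not AtlasML's exact startup code:
# Sentry initialization sketch -- run before the FastAPI app starts serving
import os

import sentry_sdk

dsn = os.getenv("SENTRY_DSN")
if dsn:  # only enable Sentry when a DSN is configured
    sentry_sdk.init(
        dsn=dsn,
        environment=os.getenv("ENV", "production"),
        traces_sample_rate=0.1,  # assumed sampling rate; tune for your traffic
    )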
What Sentry Captures
- Unhandled exceptions
- API endpoint errors
- Performance metrics
- User context (IP, endpoint, request data)
- Stack traces
- Breadcrumbs (events leading to error)
Example error in Sentry:
WeaviateConnectionError: Could not connect to Weaviate at localhost:8085
Stack trace:
File "atlasml/routers/competency.py", line 45, in suggest_competencies
embeddings = weaviate_client.get_embeddings()
File "atlasml/clients/weaviate.py", line 120, in get_embeddings
raise WeaviateConnectionError(...)
Environment: production
Release: v1.2.0
User IP: 192.168.1.100
Endpoint: POST /api/v1/competency/suggest
Monitoring Sentry
Access dashboard:
https://sentry.io/organizations/your-org/issues/
Filter by:
- Environment (production, staging, development)
- Release version (v1.2.0)
- Time range (last hour, day, week)
- Error type
- Endpoint
Key metrics:
- Error frequency: Errors per hour
- Affected users: Number of unique IPs/users
- First seen: When error first occurred
- Last seen: Most recent occurrence
Alert Rules
Set up alerts for critical errors:
Example: Alert on 10+ errors in 5 minutes
- Go to Sentry → Alerts → Create Alert Rule
- Conditions:
  - When: 10 events occur
  - In: 5 minutes
  - For project: atlasml
  - Environment: production
- Actions:
  - Send email to: ops@company.com
  - Send Slack notification to: #alerts
Performance Monitoring
Response Time Monitoring
Manual testing:
# Measure response time
time curl -X POST http://localhost/api/v1/competency/suggest \
-H "Authorization: your-key" \
-H "Content-Type: application/json" \
-d '{"description":"test","course_id":1}'
# Output:
# real 0m0.342s ← Total time
Load testing with Apache Bench:
# Install
sudo apt-get install apache2-utils
# Run 100 requests, 10 concurrent
ab -n 100 -c 10 \
-H "Authorization: your-key" \
-H "Content-Type: application/json" \
-p request.json \
http://localhost/api/v1/competency/suggest
# Results:
# Requests per second: 29.12 [#/sec]
# Time per request: 34.343 [ms] (mean)
# Time per request: 3.434 [ms] (mean, across all concurrent)
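If you want rough latency percentiles without setting up Prometheus yet, a short script can sample the endpoint directly. The URL, header, and payload below mirror the curl example above and are assumptions about your deployment; this is an illustration, not a tool shipped with AtlasML:
# latency_sample.py -- sample response times and print rough percentiles
import json
import statistics
import time
import urllib.request

URL = "http://localhost/api/v1/competency/suggest"  # adjust to your deployment
PAYLOAD = json.dumps({"description": "test", "course_id": 1}).encode()
HEADERS = {"Authorization": "your-key", "Content-Type": "application/json"}

samples = []
for _ in range(50):  # number of sample requests
    req = urllib.request.Request(URL, data=PAYLOAD, headers=HEADERS)
    start = time.perf_counter()
    urllib.request.urlopen(req, timeout=30).read()
    samples.append(time.perf_counter() - start)

samples.sort()
print(f"p50: {statistics.median(samples) * 1000:.1f} ms")
print(f"p95: {samples[int(0.95 * (len(samples) - 1))] * 1000:.1f} ms")
print(f"max: {samples[-1] * 1000:.1f} ms")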
Using Prometheus and Grafana
For advanced metrics, integrate Prometheus:
1. Add Prometheus Exporter
# Install prometheus-client
poetry add prometheus-client
Add to app.py:
from prometheus_client import make_asgi_app
# Add Prometheus metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
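make_asgi_app only exposes whatever is registered with the Prometheus client (out of the box, mostly process-level metrics). To get request rate, latency, and error counts, you also need to record them; one way is a small middleware with a histogram and a counter, sketched below. The metric names and labels are illustrative, not part of AtlasML, and per-path labels can inflate cardinality on APIs with many dynamic routes:
# Request instrumentation sketch -- add next to the /metrics mount in app.py
# (assumes the FastAPI `app` object from the snippet above is in scope)
import time

from prometheus_client import Counter, Histogram

REQUEST_LATENCY = Histogram(
    "atlasml_request_latency_seconds",  # illustrative metric name
    "Request latency in seconds",
    ["method", "path"],
)
REQUEST_ERRORS = Counter(
    "atlasml_request_errors_total",
    "Responses with status code >= 500",
    ["method", "path"],
)

@app.middleware("http")
async def record_metrics(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_LATENCY.labels(request.method, request.url.path).observe(
        time.perf_counter() - start
    )
    if response.status_code >= 500:
        REQUEST_ERRORS.labels(request.method, request.url.path).inc()
    return response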
2. Configure Prometheus
# prometheus.yml
scrape_configs:
  - job_name: 'atlasml'
    static_configs:
      - targets: ['atlasml:8000']
    metrics_path: '/metrics'
    scrape_interval: 15s
3. Run Prometheus
# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - shared-network
4. Set Up Grafana
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
    networks:
      - shared-network
Access: http://localhost:3000 (admin/admin)
Add Prometheus data source:
- URL: http://prometheus:9090
Create dashboard with metrics:
- Request rate
- Response time (p50, p95, p99)
- Error rate
- CPU/Memory usage
- Weaviate query time
Alerting
Docker Health-Based Alerts
Script to check and alert:
#!/bin/bash
# check-health.sh
CONTAINER="atlasml"
STATUS=$(docker inspect -f '{{.State.Health.Status}}' $CONTAINER 2>/dev/null)
if [ "$STATUS" != "healthy" ]; then
echo "⚠️ ALERT: AtlasML is $STATUS"
# Send alert (examples below)
fi
Send email alert:
echo "AtlasML is unhealthy!" | mail -s "AtlasML Alert" ops@company.com
Send Slack alert:
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H 'Content-Type: application/json' \
-d '{"text":"⚠️ AtlasML is unhealthy!"}'
Run as cron job:
# Check every 5 minutes
*/5 * * * * /opt/atlasml/check-health.sh
Log-Based Alerts
Alert on specific errors:
#!/bin/bash
# alert-on-errors.sh
ERRORS=$(docker logs atlasml --since 5m 2>&1 | grep -c "ERROR")
if [ $ERRORS -gt 10 ]; then
echo "⚠️ $ERRORS errors in last 5 minutes!" | \
mail -s "AtlasML Errors" ops@company.com
fi
Resource-Based Alerts
Alert on high memory usage:
#!/bin/bash
# alert-memory.sh
MEM_PERCENT=$(docker stats atlasml --no-stream --format "{{.MemPerc}}" | sed 's/%//')
if (( $(echo "$MEM_PERCENT > 80" | bc -l) )); then
echo "⚠️ Memory usage at ${MEM_PERCENT}%!" | \
mail -s "AtlasML Memory Alert" ops@company.com
fi
Monitoring Dashboard Example
Simple Monitoring Script
#!/bin/bash
# monitor-atlasml.sh
echo "=== AtlasML Monitoring Dashboard ==="
echo ""
# Health status
echo "Health Status:"
HEALTH=$(docker inspect -f '{{.State.Health.Status}}' atlasml 2>/dev/null || echo "unknown")
echo " Status: $HEALTH"
echo ""
# Resource usage
echo "Resource Usage:"
docker stats atlasml --no-stream --format " CPU: {{.CPUPerc}}\n Memory: {{.MemUsage}}\n Network: {{.NetIO}}"
echo ""
# Recent errors
echo "Recent Errors (last 5):"
docker logs atlasml --since 1h 2>&1 | grep ERROR | tail -5
echo ""
# Request rate (approximate)
echo "Request Rate (last minute):"
REQUESTS=$(docker logs atlasml --since 1m 2>&1 | grep -cE "POST|GET")
echo " $REQUESTS requests"
echo ""
# Uptime
echo "Uptime:"
docker inspect -f '{{.State.StartedAt}}' atlasml
echo ""
Run:
chmod +x monitor-atlasml.sh
./monitor-atlasml.sh
Output:
=== AtlasML Monitoring Dashboard ===
Health Status:
Status: healthy
Resource Usage:
CPU: 2.5%
Memory: 256MB / 2GB
Network: 1.2MB / 500KB
Recent Errors (last 5):
(no errors found)
Request Rate (last minute):
15 requests
Uptime:
2025-01-15T10:00:00Z
Monitoring Checklist
Daily
- Check health status
- Review error logs (last 24h)
- Check resource usage trends
- Verify Sentry dashboard (if configured)
Weekly
- Review disk usage
- Check log file sizes
- Review Weaviate data volume growth
- Analyze response time trends
- Review alert frequency
Monthly
- Update monitoring scripts
- Review and adjust alert thresholds
- Audit Sentry issues
- Plan capacity upgrades if needed
- Review and rotate API keys
Troubleshooting Monitoring Issues
Health Check Always Failing
Check:
# Test endpoint manually
curl http://localhost:8000/api/v1/health
# If works, check Docker health config
docker inspect atlasml | jq '.[0].Config.Healthcheck'
No Logs Appearing
Check:
# Verify logging driver
docker inspect atlasml | jq '.[0].HostConfig.LogConfig'
# Check log file exists
sudo ls -l /var/lib/docker/containers/$(docker ps -q --no-trunc -f name=atlasml)/*.log
Sentry Not Receiving Errors
Check:
# Verify SENTRY_DSN set
docker exec atlasml printenv SENTRY_DSN
# Test Sentry connection
docker exec atlasml python -c "from sentry_sdk import capture_message; capture_message('test')"
Next Steps
- Installation: Set up monitoring during installation
- Deployment: Deploy with monitoring enabled
- Configuration: Configure Sentry and logging
- Troubleshooting: Use monitoring to debug issues
Resources
- Docker Logging: https://docs.docker.com/config/containers/logging/
- Sentry Documentation: https://docs.sentry.io/
- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Docker Health Checks: https://docs.docker.com/engine/reference/builder/#healthcheck