
AtlasML Troubleshooting

This guide covers common production issues encountered by AtlasML administrators and how to resolve them.


Service Health Issues

Container Won't Start

Symptom:

docker ps
# atlasml is not listed

Diagnosis:

# Check logs
docker logs atlasml

# Check exit code
docker inspect atlasml | jq '.[0].State.ExitCode'

# Check when the container last started
docker inspect atlasml | jq '.[0].State.StartedAt'

Common Causes & Solutions:

1. Missing Environment Variables

Error in logs:

KeyError: 'ATLAS_API_KEYS'

Solution:

# Check .env file exists
ls -la /opt/atlasml/.env

# Verify required variables are assigned
grep -E "^(ATLAS_API_KEYS|WEAVIATE_HOST|WEAVIATE_PORT)=" /opt/atlasml/.env

# If missing, add them
nano /opt/atlasml/.env
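
The manual check above can also be scripted. The following is a minimal sketch, assuming the three variables named in the grep are the full required set for your deployment; missing_env_vars is a hypothetical helper name, not part of AtlasML.

```shell
# Print the name of every required variable that has no non-empty
# assignment in the given env file; return 1 if any are missing.
missing_env_vars() {
    env_file=$1
    shift
    missing=0
    for name in "$@"; do
        # Match "NAME=<at least one character>" at the start of a line.
        if ! grep -Eq "^${name}=.+" "$env_file"; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}

# Example (not run here):
# missing_env_vars /opt/atlasml/.env ATLAS_API_KEYS WEAVIATE_HOST WEAVIATE_PORT
```

An empty assignment such as WEAVIATE_HOST= is reported as missing, since it would likely fail at startup in the same way.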

2. Weaviate Not Running

Error in logs:

WeaviateConnectionError: Could not connect to Weaviate

Solution:

# Check if Weaviate is accessible
curl -H "Authorization: Bearer YOUR_WEAVIATE_API_KEY" https://your-weaviate-domain.com/v1/.well-known/ready
# A ready instance answers with HTTP 200 (add -i to curl to see the status line)

# If Weaviate is not accessible, check the centralized Weaviate service
# (Weaviate runs on a separate server - see /weaviate directory)

# Restart AtlasML
docker-compose -f docker-compose.prod.yml restart atlasml
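
If Weaviate needs a moment to come back, it may help to wait for the readiness endpoint before restarting AtlasML. The sketch below is a generic retry helper; the name retry and the attempt count and delay in the example are assumptions to adjust for your environment.

```shell
# Run a command until it succeeds, up to max_attempts times,
# sleeping `delay` seconds between attempts. Returns 1 if every
# attempt fails.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (not run here): wait up to ~30s for Weaviate, then restart AtlasML.
# retry 10 3 curl -fsS -o /dev/null \
#     -H "Authorization: Bearer YOUR_WEAVIATE_API_KEY" \
#     https://your-weaviate-domain.com/v1/.well-known/ready \
#   && docker-compose -f docker-compose.prod.yml restart atlasml
```

The -f flag makes curl exit non-zero on HTTP errors, which is what lets the retry loop detect a not-ready response.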

3. Port Already in Use

Error in logs:

ERROR: bind: address already in use: 0.0.0.0:80

Solution:

# Find process using port 80
sudo lsof -i :80
# or
sudo netstat -tulpn | grep :80

# Stop conflicting service
sudo systemctl stop nginx # or apache2

# Or change the AtlasML port in the compose file
# Edit docker-compose.prod.yml:
#   ports:
#     - '8080:8000' # Use 8080 instead of 80

4. Image Pull Failed

Error:

Error response from daemon: pull access denied for ghcr.io/ls1intum/edutelligence/atlasml

Solution:

# Login to GitHub Container Registry
echo "$GITHUB_TOKEN" | docker login ghcr.io -u USERNAME --password-stdin

# Pull image manually
docker pull ghcr.io/ls1intum/edutelligence/atlasml:main

# Restart
docker-compose -f docker-compose.prod.yml up -d

Container Starts But Unhealthy

Symptom:

docker ps
# STATUS: Up 2 minutes (unhealthy)

Diagnosis:

# Check health status
docker inspect atlasml | jq '.[0].State.Health'

# View health check logs
docker inspect atlasml | jq '.[0].State.Health.Log'

# Test health endpoint manually
curl http://localhost/api/v1/health

Solutions:

1. Health Check Timeout

If the health check takes longer than 10s, increase the timeout:

# docker-compose.prod.yml
healthcheck:
  timeout: 30s # Increase from 10s

2. Application Not Ready

If the application is slow to start, increase start_period:

healthcheck:
  start_period: 30s # Increase from 10s

3. Weaviate Connectivity

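Test from the container (the single quotes make the variables expand inside the container, not on the host):

docker exec atlasml sh -c 'curl http://${WEAVIATE_HOST}:${WEAVIATE_PORT}/v1/.well-known/ready'

If this fails, check network connectivity and Weaviate status.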


Connection Issues

Weaviate Connection Failed

Symptom:

WeaviateConnectionError: Could not connect to Weaviate at https://your-weaviate-domain.com

Diagnosis:

# 1. Check if Weaviate is accessible
curl -H "Authorization: Bearer YOUR_WEAVIATE_API_KEY" https://your-weaviate-domain.com/v1/.well-known/ready

# 2. Check Weaviate service status on Weaviate server
# SSH to the Weaviate server and check:
docker ps | grep weaviate
docker logs weaviate

# 3. Test from AtlasML container (single quotes: variables expand inside the container)
docker exec atlasml sh -c 'curl -H "Authorization: Bearer ${WEAVIATE_API_KEY}" "${WEAVIATE_HOST}/v1/.well-known/ready"'

# 4. Check network
docker network inspect shared-network

Solutions:

If Weaviate Not Accessible

# Check DNS resolution
nslookup your-weaviate-domain.com

# Check if Weaviate server is reachable
ping your-weaviate-domain.com

# Verify Weaviate API key is correct in .env
cat /opt/atlasml/.env | grep WEAVIATE_API_KEY

# Restart AtlasML with updated configuration
docker-compose -f docker-compose.prod.yml restart

If Weaviate Server Down

SSH to the Weaviate server and check the service:

# Check Weaviate status
cd /path/to/edutelligence/weaviate
docker-compose ps

# View Weaviate logs
docker-compose logs weaviate

# Restart if needed
docker-compose restart weaviate

# Verify it's accessible
curl -H "Authorization: Bearer YOUR_API_KEY" https://your-weaviate-domain.com/v1/.well-known/ready

If Host Resolution Issue

# Check WEAVIATE_HOST value
docker exec atlasml printenv WEAVIATE_HOST

# If using service name, ensure on same Docker network
# If using localhost from container, use host.docker.internal

# Update .env
WEAVIATE_HOST=host.docker.internal

OpenAI API Connection Failed

Symptom:

OpenAI API Error: Authentication failed

Diagnosis:

# 1. Check API key is set
docker exec atlasml printenv OPENAI_API_KEY

# 2. Test API directly (OPENAI_API_URL already includes https://)
curl "${OPENAI_API_URL}/openai/deployments" \
  -H "api-key: ${OPENAI_API_KEY}"

Solutions:

Invalid API Key

# Verify key in Azure Portal
# Azure Portal → Azure OpenAI → Keys and Endpoint

# Update .env
OPENAI_API_KEY=correct-key-from-azure

# Restart
docker-compose -f docker-compose.prod.yml restart atlasml

Wrong URL

# Verify endpoint in Azure Portal
# Should be: https://{resource-name}.openai.azure.com

# Update .env
OPENAI_API_URL=https://correct-resource.openai.azure.com

# Restart
docker-compose -f docker-compose.prod.yml restart atlasml

Network/Firewall Block

# Test connectivity from server
curl -I https://your-resource.openai.azure.com

# If blocked, configure firewall to allow HTTPS to *.openai.azure.com

API Errors

401 Unauthorized

Symptom:

curl http://localhost/api/v1/competency/suggest
# {"detail":"Invalid API key"}

Diagnosis:

# 1. Check API keys configured
docker exec atlasml printenv ATLAS_API_KEYS

# 2. Verify format in .env (must be comma-separated)
grep '^ATLAS_API_KEYS=' /opt/atlasml/.env
# Should be: ATLAS_API_KEYS=key1,key2 (comma-separated, no brackets)

Solutions:

Missing Authorization Header

# ❌ Bad - No header
curl http://localhost/api/v1/competency/suggest

# ✅ Good - With Authorization header
curl -H "Authorization: your-api-key" http://localhost/api/v1/competency/suggest

Wrong Key Format in .env

# ❌ Bad - Incorrect formats
ATLAS_API_KEYS=["key1","key2"] # JSON array (not supported)
ATLAS_API_KEYS=[key1,key2] # Brackets
ATLAS_API_KEYS="key1, key2" # Spaces around commas

# ✅ Good - Comma-separated
ATLAS_API_KEYS=key1,key2

# Fix and restart
docker-compose -f docker-compose.prod.yml restart atlasml
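
The format rules above can be checked before restarting. This is a rough sketch, assuming keys never legitimately contain brackets, quotes, or spaces; valid_key_list is a hypothetical helper, and AtlasML's own parser may accept or reject more than this does.

```shell
# Return 0 if the value looks like a plain comma-separated key list,
# 1 if it contains brackets, quotes, spaces, or empty entries.
valid_key_list() {
    case $1 in
        '')              return 1 ;;  # nothing configured
        *'['* | *']'*)   return 1 ;;  # JSON array or bracketed list
        *'"'* | *"'"*)   return 1 ;;  # quoted entries
        *' '*)           return 1 ;;  # spaces around commas
        ,* | *, | *,,*)  return 1 ;;  # empty entry
    esac
    return 0
}

# Example (not run here):
# value=$(grep '^ATLAS_API_KEYS=' /opt/atlasml/.env | cut -d= -f2-)
# valid_key_list "$value" || echo "ATLAS_API_KEYS has an unsupported format" >&2
```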

Key Mismatch

# Verify Artemis is using correct key
# Check Artemis configuration: application-prod.yml
# atlas.atlasml.api-key should match one of ATLAS_API_KEYS

# Update either AtlasML or Artemis to match

422 Unprocessable Entity

Symptom:

{
  "detail": [
    {
      "loc": ["body", "course_id"],
      "msg": "field required",
      "type": "value_error.missing"
    }
  ]
}

Cause: Request body doesn't match expected schema

Solution:

# View API documentation in a browser
# http://localhost/docs

# Fix request to include all required fields
curl -X POST http://localhost/api/v1/competency/suggest \
  -H "Authorization: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Python programming",
    "course_id": 1
  }'

500 Internal Server Error

Symptom:

{"detail":"Internal server error"}

Diagnosis:

# Check application logs
docker logs atlasml --tail 100

# Look for stack traces
docker logs atlasml 2>&1 | grep -A 20 "ERROR"

# Check Sentry (if configured)
# Visit Sentry dashboard for detailed error info

Common Causes:

Database Error

# Check Weaviate connectivity (single quotes: variables expand inside the container)
docker exec atlasml sh -c 'curl http://${WEAVIATE_HOST}:${WEAVIATE_PORT}/v1/.well-known/ready'

# Check Weaviate logs (run on the Weaviate server)
docker logs weaviate

OpenAI API Error

# Check OpenAI quota
# Azure Portal → Azure OpenAI → Usage

# Check rate limits
# If exceeded, wait or upgrade plan

Memory Issue

# Check memory usage
docker stats atlasml

# If near limit, increase memory

Performance Issues

Slow Response Times

Symptom: Requests take >5 seconds

Diagnosis:

# Measure response time
time curl -X POST http://localhost/api/v1/competency/suggest \
-H "Authorization: test" \
-H "Content-Type: application/json" \
-d '{"description":"test","course_id":1}'

# Check resource usage
docker stats atlasml

Solutions:

High CPU Usage

# Check CPU
docker stats atlasml --no-stream

# If consistently >80%, scale up:
# Option 1: Increase CPU limit
# docker-compose.prod.yml:
#   deploy:
#     resources:
#       limits:
#         cpus: '4.0' # Increase from 2.0

# Option 2: Scale horizontally (multiple instances)
docker-compose -f docker-compose.prod.yml up -d --scale atlasml=3

High Memory Usage

# Check memory
docker stats atlasml --no-stream

# If near limit, increase memory
# docker-compose.prod.yml:
#   deploy:
#     resources:
#       limits:
#         memory: 4G # Increase from 2G

Large Weaviate Collection

# Inspect collections (the schema endpoint lists collections and their properties, not object counts)
docker exec weaviate curl http://localhost:8080/v1/schema

# If very large (>100k objects), consider:
# 1. Archiving old data
# 2. Filtering queries by course_id
# 3. Optimizing Weaviate configuration

OpenAI API Latency

# Benchmark OpenAI API
time curl "${OPENAI_API_URL}/openai/deployments/..." \
  -H "api-key: ${OPENAI_API_KEY}"

# If slow, check:
# 1. API region (use closest region)
# 2. Rate limiting
# 3. Consider local embeddings for non-critical queries

High Memory Usage

Symptom:

docker stats atlasml
# MEM USAGE: 1.8GB / 2GB (90%)

Diagnosis:

# Monitor over time (streams live updates; press Ctrl+C to stop)
docker stats atlasml

# Check for memory leaks: if usage keeps growing under steady load,
# note the value now and compare after 1 hour

Solutions:

Increase Memory Limit

# docker-compose.prod.yml
deploy:
  resources:
    limits:
      memory: 4G # Increase from 2G

Optimize Application

# Check if caching too much data
# Review recent code changes
# Look for memory leaks in logs

# Restart to clear memory (temporary fix)
docker-compose -f docker-compose.prod.yml restart atlasml

Add Memory Monitoring

# Set up alert when memory >80%
# See Monitoring Guide for details
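
Until proper alerting is in place, a cron-friendly check can stand in. The sketch below compares a percentage string, as printed by docker stats with the MemPerc format field, against a threshold; above_threshold is a hypothetical helper name and the 80 in the example is the alert level suggested above.

```shell
# Return 0 if a percentage string such as "90.12%" is strictly above
# the numeric threshold, 1 otherwise.
above_threshold() {
    value=${1%\%}   # strip the trailing percent sign
    awk -v v="$value" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Example (not run here):
# usage=$(docker stats atlasml --no-stream --format '{{.MemPerc}}')
# if above_threshold "$usage" 80; then
#     echo "atlasml memory usage at $usage" >&2
# fi
```

The comparison is delegated to awk because plain shell arithmetic cannot handle the fractional values that docker stats reports.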

Deployment Issues

New Version Not Deploying

Symptom: Container running but still old version

Diagnosis:

# Check image tag
docker inspect atlasml | jq '.[0].Config.Image'

# Check when the container was created
docker inspect atlasml | jq '.[0].Created'

# Check available images
docker images | grep atlasml

Solutions:

Force Pull New Image

# Pull latest
docker-compose -f docker-compose.prod.yml pull

# Stop and remove container
docker-compose -f docker-compose.prod.yml down

# Start with new image
docker-compose -f docker-compose.prod.yml up -d

# Verify new version
docker logs atlasml | grep "Started"

Clear Image Cache

# Remove old images
docker rmi ghcr.io/ls1intum/edutelligence/atlasml:old-tag

# Pull specific version
IMAGE_TAG=v1.2.0 docker-compose -f docker-compose.prod.yml pull

# Restart
IMAGE_TAG=v1.2.0 docker-compose -f docker-compose.prod.yml up -d

Deployment Rollback Needed

Scenario: New version has critical bug

Quick Rollback:

cd /opt/atlasml

# Set previous version (drop any existing IMAGE_TAG line so only one remains)
echo "IMAGE_TAG=v1.1.0" > .env.temp
grep -v '^IMAGE_TAG=' .env >> .env.temp
mv .env.temp .env

# Pull and restart
docker-compose -f docker-compose.prod.yml pull
docker-compose -f docker-compose.prod.yml up -d

# Verify
docker logs atlasml
curl http://localhost/api/v1/health
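
Editing .env by hand during an incident is error-prone, so the tag switch can be wrapped in a small function. This is a sketch under the assumption that the compose file reads IMAGE_TAG from .env, as the rollback steps above imply; set_env_var is a hypothetical helper name.

```shell
# Write NAME=VALUE into an env file, replacing any existing NAME= line
# so the file never ends up with two conflicting assignments.
set_env_var() {
    env_file=$1
    name=$2
    value=$3
    tmp_file=$(mktemp)
    # Keep every line except existing assignments of NAME
    # (grep exits 1 when nothing is left, which is fine here).
    grep -v "^${name}=" "$env_file" > "$tmp_file" || true
    printf '%s=%s\n' "$name" "$value" >> "$tmp_file"
    # Overwrite in place to preserve the original file's permissions.
    cat "$tmp_file" > "$env_file"
    rm -f "$tmp_file"
}

# Example (not run here):
# set_env_var /opt/atlasml/.env IMAGE_TAG v1.1.0
```

Running it twice with different tags leaves exactly one IMAGE_TAG line, so repeated rollbacks do not accumulate stale entries.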

Data Issues

Weaviate Data Lost

Symptom: All competencies missing after restart

Diagnosis:

# Check volume exists
docker volume ls | grep weaviate

# Inspect volume
docker volume inspect weaviate-data

# Check mount point
docker inspect weaviate | jq '.[0].Mounts'

Solutions:

Volume Not Mounted

# compose.weaviate.yaml
# Ensure volume is configured:
services:
  weaviate:
    volumes:
      - weaviate-data:/var/lib/weaviate # Mount volume

volumes:
  weaviate-data: # Declare volume

Volume Deleted

# Check if backup exists
ls -lh /path/to/backups/weaviate-*.tar.gz

# Restore from backup
docker-compose -f compose.weaviate.yaml down
sudo tar -xzf weaviate-backup-20250115.tar.gz -C /
docker-compose -f compose.weaviate.yaml up -d

Create Regular Backups

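#!/bin/bash
# Backup script (save as /opt/atlasml/backup-weaviate.sh)
# Weaviate is stopped so the volume is archived in a consistent state
docker-compose -f compose.weaviate.yaml stop
sudo tar -czf "weaviate-backup-$(date +%Y%m%d).tar.gz" \
  /var/lib/docker/volumes/weaviate-data
docker-compose -f compose.weaviate.yaml start

# Run daily at 02:00 via cron (crontab entry, not part of the script)
0 2 * * * /opt/atlasml/backup-weaviate.sh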
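
Daily archives accumulate, so some retention step is usually needed. The sketch below lists the archives that fall outside the newest N, relying on the YYYYMMDD stamp in the file name; old_backups is a hypothetical helper and the retention count is an assumption, not a documented AtlasML policy.

```shell
# Print the backup archives in a directory that fall outside the newest
# `keep` files. Names embed YYYYMMDD, so a plain sort is chronological.
old_backups() {
    backup_dir=$1
    keep=$2
    for archive in "$backup_dir"/weaviate-backup-*.tar.gz; do
        # Skip the literal pattern when no archive matches.
        if [ -e "$archive" ]; then
            printf '%s\n' "$archive"
        fi
    done | sort -r | tail -n +"$((keep + 1))"
}

# Example (not run here): keep the 7 newest archives, delete the rest.
# old_backups /path/to/backups 7 | while IFS= read -r f; do rm -- "$f"; done
```

Listing first and deleting in a separate step makes it easy to review what would be removed before wiring this into cron.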

Network Issues

AtlasML Not Reachable from Artemis

Symptom: Artemis cannot connect to AtlasML

Diagnosis:

# Test from Artemis server
curl http://atlasml-server:80/api/v1/health

# Check firewall
sudo ufw status

Solutions:

Firewall Blocking

# Allow from Artemis server
sudo ufw allow from ARTEMIS_IP to any port 80

# Or allow all (less secure)
sudo ufw allow 80/tcp

# Verify
sudo ufw status numbered

Wrong Hostname/IP in Artemis

# application-prod.yml in Artemis
atlas:
  atlasml:
    base-url: https://correct-hostname-or-ip # Fix this
    api-key: your-api-key

DNS Issue

# Test DNS resolution from Artemis
nslookup atlasml-server.company.com

# If resolution fails, add an entry to /etc/hosts
echo "192.168.1.100 atlasml-server" | sudo tee -a /etc/hosts

SSL/TLS Certificate Issues

Symptom: HTTPS connection fails

Diagnosis:

# Test SSL
curl -v https://atlasml.company.com/api/v1/health

# Check certificate
openssl s_client -connect atlasml.company.com:443 -servername atlasml.company.com

Solutions:

Certificate Expired

# Renew Let's Encrypt certificate
sudo certbot renew

# Reload Nginx
sudo systemctl reload nginx
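
Expiry can be caught before it causes an outage. The following sketch wraps the openssl x509 -checkend option, which takes a number of seconds; cert_valid_for is a hypothetical helper name and the file path in the example is an assumption.

```shell
# Return 0 if the PEM certificate stays valid for at least `days` more
# days, non-zero if it expires sooner (or cannot be read).
cert_valid_for() {
    cert_file=$1
    days=$2
    openssl x509 -in "$cert_file" -noout -checkend "$((days * 86400))" > /dev/null
}

# Example (not run here): fetch the served certificate, then check it.
# openssl s_client -connect atlasml.company.com:443 \
#     -servername atlasml.company.com < /dev/null 2> /dev/null \
#   | openssl x509 > /tmp/atlasml.pem
# cert_valid_for /tmp/atlasml.pem 14 || echo "certificate expires within 14 days" >&2
```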

Self-Signed Certificate

# If using self-signed, Artemis must trust it
# Copy cert to Artemis server
scp /etc/ssl/certs/atlasml.crt artemis-server:/usr/local/share/ca-certificates/
ssh artemis-server 'sudo update-ca-certificates'

Troubleshooting Checklist

When encountering an issue:

  1. Check service health

    docker ps
    docker logs atlasml --tail 50
  2. Test connectivity

    curl http://localhost/api/v1/health
    curl http://localhost:8085/v1/.well-known/ready
  3. Verify configuration

    docker exec atlasml env | grep -E "(WEAVIATE|OPENAI|ATLAS)"
  4. Check resources

    docker stats atlasml --no-stream
    df -h
  5. Review logs

    docker logs atlasml 2>&1 | grep ERROR
    docker logs weaviate 2>&1 | grep ERROR
  6. Check Sentry (if configured)

    • Visit Sentry dashboard
    • Filter by environment and time

Getting Help

Gather Information

Before reporting issues, collect:

# System info
uname -a
docker --version
docker-compose --version

# Container status
docker ps -a | grep atlasml

# Recent logs
docker logs atlasml --tail 100 > atlasml-logs.txt

# Configuration (mask secrets!)
docker exec atlasml env | grep -E "(WEAVIATE|ATLAS|ENV)" | sed 's/=.*/=***/' > config.txt

# Resource usage
docker stats atlasml --no-stream > stats.txt
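
The sed expression above hides every value, including hosts and ports that are often useful in a report. As an alternative sketch, the filter below masks only variables whose names look secret; the name patterns (KEY, TOKEN, SECRET, PASSWORD) are an assumption, so review the output before sharing it. mask_secrets is a hypothetical helper name.

```shell
# Read NAME=VALUE lines on stdin and hide the value only when NAME
# contains KEY, TOKEN, SECRET, or PASSWORD.
mask_secrets() {
    sed -E 's/^([^=]*(KEY|TOKEN|SECRET|PASSWORD)[^=]*)=.*/\1=***/'
}

# Example (not run here):
# docker exec atlasml env | grep -E "(WEAVIATE|ATLAS|ENV)" | mask_secrets > config.txt
```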

Report Issue

Include in report:

  1. Description: What were you trying to do?
  2. Expected: What should happen?
  3. Actual: What actually happened?
  4. Environment:
    • OS and version
    • Docker version
    • AtlasML version/image tag
  5. Logs: Relevant error messages
  6. Steps to reproduce: How to recreate the issue

Common Error Messages

Error                     | Cause                   | Solution
WeaviateConnectionError   | Weaviate not running    | Start Weaviate
401 Unauthorized          | Invalid/missing API key | Check Authorization header
422 Unprocessable Entity  | Invalid request body    | Check request schema
500 Internal Server Error | Server-side error       | Check logs
Connection refused        | Service not running     | Start service
Address already in use    | Port conflict           | Change port or stop other service
No such container         | Container not running   | Start container
pull access denied        | Image not accessible    | Check credentials
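
For scripted checks, the HTTP rows of this table can be turned into a lookup. This is a sketch only; status_hint is a hypothetical helper, and the 000 case relies on curl printing 000 for http_code when no response was received.

```shell
# Map an HTTP status code (as printed by curl's %{http_code}) to the
# first thing to check, following the table above.
status_hint() {
    case $1 in
        2??) echo "OK" ;;
        401) echo "Invalid/missing API key: check Authorization header" ;;
        422) echo "Invalid request body: check request schema" ;;
        500) echo "Server-side error: check logs" ;;
        000) echo "No response: check that the service is running" ;;
        *)   echo "Unexpected status $1: check logs" ;;
    esac
}

# Example (not run here):
# code=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "Authorization: your-api-key" http://localhost/api/v1/health)
# status_hint "$code"
```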
